Frequently Asked Questions

What is an "eval"?

An eval (short for "evaluation") is a test for AI models. Just like students take exams to see what they've learned, AI practitioners run evals to measure what models can and can't do.

Most evals test things like math, coding, or reading comprehension: skills that are easy to verify with clear right or wrong answers.

What is the Kapus Konda Eval?

This eval tests whether AI models know a joke that virtually every Marathi speaker learns growing up: a classic call-and-response bit where one person asks a question, the other responds, and then comes the punchline.

Each model is told: "You are a native Marathi speaker talking to another native Marathi speaker." It is then placed in a conversation where it either initiates or responds to the joke.

What the model says reveals whether it actually knows this piece of oral culture, or whether it's just guessing.
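To make this concrete, here's a rough sketch of the two conversation framings, written in the common chat-message format. Apart from the quoted system prompt, everything here (the helper names, the placeholder user turns) is an illustrative assumption rather than the eval's actual code, and the joke's own lines are deliberately left as placeholders.

```python
# Rough sketch of the two conversation setups (illustrative, not the eval's real code).

SYSTEM_PROMPT = (
    "You are a native Marathi speaker talking to another native Marathi speaker."
)

def responder_setup(joke_opener: str) -> list[dict]:
    """Responder test: the other speaker opens the bit; the model must reply."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": joke_opener},  # the joke's opening line (not reproduced here)
    ]

def performer_setup(nudge: str) -> list[dict]:
    """Performer test: the model is prompted to start the bit itself."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": nudge},  # a casual prompt inviting the model to tell the joke
    ]
```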

What does "kapus konda" mean?

It doesn't matter.

Why does this eval matter?

AI models are trained primarily on English internet content, which means they absorb Western culture far more deeply than they do any other. Languages like Marathi, despite having 83 million speakers, are dramatically underrepresented.

This joke is a tiny piece of shared culture that every Marathi speaker knows. If models can't recognize it, it suggests they're missing entire layers of human experience that don't happen to be well-documented online.

Do 83 million people really know this joke?

83 million is just the estimated number of Marathi speakers in the world. The author believes virtually every Marathi speaker knows this joke because it's such a ubiquitous part of growing up in the culture. But let's be honest: there's no actual survey or evidence backing this up. It's vibes, not science.

Why is this also a hallucination test?

The words "kapus konda" are pure absurdity. They don't mean anything coherent. "Kapus" is cotton, "konda" could be chaff or dandruff, but together "kapus kondyachi gosht" is just nonsense syllables dressed up as a story title.

This creates a perfect trap. Models that don't know the joke will try to make sense of the words. They'll confidently fabricate elaborate stories about cotton farmers, kings sending ministers to buy cotton pods, cotton getting tangled and untangled in infinite loops, even stories about cotton being used as footballs by schoolchildren.

These hallucinations are revealing. The model doesn't know the joke exists, but it can't admit that. So it invents something plausible-sounding. The more elaborate the fabrication, the more certain you can be that the model has never encountered the actual joke.

Aren't you worried about explaining the joke?

A little. AI models are trained on internet data, including, potentially, this website. By explaining the joke, future models might learn the answer from here instead of actually knowing the cultural context.

This is a well-known problem called benchmark contamination. Popular benchmarks like GSM8K, HumanEval, and MMLU have all suffered from this. Models memorize solutions from web discussions.

But the whole point of this benchmark is lost if you don't know the joke. You can't appreciate what it's testing, or why the results matter, without understanding it first.

And honestly? We'd be thrilled if models learned this joke from here. The chances of this little corner of the internet making it into training data are slim, and we'll never find out anyway.

Why only test with "Nahi"?

We use "nahi" (no) as the simulated user's response. It doesn't actually matter what the response is. The joke works the same way regardless. But "no" is arguably the most confusing response, since saying no to a story request should end the conversation. Instead, it loops.
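For the performer side, the exchange with the simulated "nahi" turn might look roughly like this. The placeholder turns are assumptions, not the eval's actual transcript; the comment just restates the looping behavior described above.

```python
# Sketch of the performer exchange (placeholders, not the eval's actual transcript).
conversation = [
    {"role": "system",
     "content": "You are a native Marathi speaker talking to another native Marathi speaker."},
    {"role": "user", "content": "..."},       # nudge the model to start the bit
    {"role": "assistant", "content": "..."},  # the model's opening line (the story offer)
    {"role": "user", "content": "Nahi."},     # simulated listener declines
    # The model's next turn is what gets judged: a model that knows the bit
    # loops back to the offer instead of dropping the story.
]
```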

How does the scoring work?

We run each model once at temperature 0 (deterministic). Each response is judged by another AI model. There are two tests:

As Responder

Someone else initiates the joke. Does the model know what's coming?

Passed: Predicts exactly what comes next.
Kinda Passed: Knows how it works, but not the exact details.
Kinda Failed: Knows it's a joke, but doesn't know how it works.
Failed: Takes it at face value.

As Performer

The model initiates the joke. Can it deliver?

Passed: Perfect execution.
Kinda Passed: Knows how the joke works, but botches the delivery.
Kinda Failed: Knows it's a joke, but doesn't know how it works.
Failed: Takes it literally, misses the joke entirely.
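For readers curious how "judged by another AI model" might be wired up, here's a minimal LLM-as-judge sketch. It assumes an OpenAI-style chat API; the judge model name, prompt wording, and fallback rule are assumptions, not the eval's actual implementation.

```python
# Minimal LLM-as-judge sketch (assumptions throughout; not the eval's own code).
from openai import OpenAI

GRADES = ["Passed", "Kinda Passed", "Kinda Failed", "Failed"]

JUDGE_INSTRUCTIONS = (
    "You are grading whether a reply shows the model knows a specific Marathi "
    "call-and-response joke. Using the rubric, answer with exactly one of: "
    + ", ".join(GRADES) + "."
)

def judge(client: OpenAI, rubric: str, model_reply: str) -> str:
    """Ask a judge model to place the reply into one of the four grades."""
    response = client.chat.completions.create(
        model="gpt-4o",   # placeholder judge model
        temperature=0,    # deterministic, mirroring how the evaluated models are run
        messages=[
            {"role": "system", "content": JUDGE_INSTRUCTIONS + "\n\nRubric:\n" + rubric},
            {"role": "user", "content": model_reply},
        ],
    )
    grade = response.choices[0].message.content.strip()
    return grade if grade in GRADES else "Failed"  # conservative fallback for malformed output
```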