The AI models passed too: we put eleven models through the Dutch final exams

Doga Aras & Bink Hol
June 12, 2026

Yesterday, just under 200,000 Dutch students got their exam results. We gave the same exams to eleven AI models and graded them with the official answer keys. Nearly all of them pass. Which is exactly why the report card is so instructive.

[Image: a classroom of friendly AI robots sitting the Dutch final exams]

Yesterday, phones rang all across the Netherlands. Today, school bags hang from flagpoles, the Dutch way of announcing to the street that someone in this house just passed their finals. We were curious how a rather different kind of candidate would fare, so we sat eleven AI models down for the same exams: from workhorses a few years old to a model released only this week. Same questions, same official answer keys, same grading scale. The verdict: ten of the eleven pass, and one candidate has to come back for a resit. And still, that is the least interesting thing about it.

Ask how good an AI model is and you will be served percentages from foreign leaderboards, for benchmarks with cryptic names like MMLU and SWE-bench. Impressive, but what does 91 percent on an American multiple-choice test tell you about whether a model can handle your Dutch customer queries? A school exam has what those leaderboards lack: context you can feel. Everyone knows what a 7 in Dutch means, and everyone remembers walking into the exam hall. So we gave each model this year's HAVO exams (the Dutch senior secondary track), seven subjects wide, and graded them the way a teacher would: question by question, with the official answer key beside us.

How we did it

Seven HAVO subjects: Dutch, English, mathematics, physics, economics, history and geography. Eleven models from five AI developers (Anthropic, OpenAI, Google, Mistral and Alibaba's Qwen), ranging from an estimated 8 billion to around ten trillion parameters: the AI equivalent of brain cells, where a model stores its knowledge. Each model received the exam text question by question and was graded on the Dutch scale of 1 to 10, where 5.5 is a pass. Web search was switched off: no cheating, no looking things up. With search enabled every grade improves, but it blurs exactly what we wanted to measure: what a model knows and understands on its own.
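How raw exam points turn into a grade is not spelled out above. Here is a minimal sketch in Python, assuming the linear conversion commonly used for Dutch central exams: grade = 9 × points / maximum points + N, where N is a per-exam correction term. The function names and the default N of 1.0 are assumptions for illustration, not the pipeline used for this test.

```python
PASS_MARK = 5.5  # stated in the article: 5.5 is a pass


def exam_grade(points: float, max_points: float, n_term: float = 1.0) -> float:
    """Convert raw exam points to a grade on the Dutch 1-10 scale.

    Assumption: linear conversion 9 * points / max_points + n_term,
    clamped to the 1-10 range. n_term = 1.0 is an illustrative default.
    """
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    if not 0 <= points <= max_points:
        raise ValueError("points must lie between 0 and max_points")
    raw = 9.0 * points / max_points + n_term
    clamped = min(10.0, max(1.0, raw))
    # Python's built-in rounding; official tie-breaking rules may differ.
    return round(clamped, 1)


def is_pass(grade: float) -> bool:
    """A grade passes when it reaches the pass mark."""
    return grade >= PASS_MARK
```

Under these assumptions, a candidate who earns exactly half the available points lands on the pass mark: `exam_grade(40, 80)` gives 5.5.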

The report card

The full grade list is below, per model and per subject. Click around before you read on: the patterns the rest of this piece is about are all visible in there.

The report card: 11 AI models, 7 subjects

Dutch HAVO final exams, graded question by question against the official answer keys.

School year 2025/2026
[Interactive grade table]
Legend: 8.0 and up · 6.5 to 8.0 · 5.5 to 6.5 · failing grade
Pass mark is 5.5. Below each model name: its size in parameters (~ marks an estimate).

The hardest subject is Dutch

First, the question every family party opens with: who is top of the class? That would be Claude Fable 5, averaging an 8.9 with no subject below a 7. The kind of student whose parent-teacher meeting is over in two minutes. But far more interesting than the winner is the subject the entire class stumbled on. Not mathematics, not physics. Dutch: a meagre 6.5 on average, the lowest grade of all seven subjects. Even Fable 5 got no further than a 7.2.

That is stranger than it looks, because these models write flawless Dutch without breaking a sweat. The thing is, the Dutch exam does not test language. It tests reading: what does the author actually believe, which argument is left unspoken, why does this particular word sit in this particular place? That is reading between the lines, and the lines are Dutch ones, shaped by opinion pages, columns and literary conventions that barely feature in international training data. Compare economics, the easiest subject of all at nearly a 9 on average. There the answer space is small: a price ceiling does what a price ceiling does, in any language. AI does not trip over difficulty, it trips over ambiguity. And that is precisely the kind of insight no international leaderboard will ever give you.

Bigger is not always better

The second lesson comes courtesy of Google, which unwittingly ran the nicest experiment in the class by enrolling two family members at once. Gemini Small, the compact, inexpensive model, averaged an 8.1. Gemini Pro, many times larger and more expensive, got stuck at 7.1. On mathematics the gap was downright painful: the small one scored an 8.9, the big one a 5.1, the lowest maths grade in the entire class. More parameters do not automatically buy better answers. What a model was trained for matters more than how big it is, and a newer small model can simply overtake an older large one.

The reverse exists too, and that report card is the most dangerous one in the pile. GPT-4.1 Mini scored a 9.5 in economics, very nearly the top mark in the class, and then failed English with a 4.6 and Dutch with a 4.9. Rounded, that is two fives in two core subjects, and Dutch exam rules allow only one. So the one model in the class that genuinely fails is a language model that fails the languages. Anyone who only sees the economics grade thinks they are dealing with a prodigy. And the real problem is not even the failing grades, it is the confidence: the model answers a comprehension question with exactly the same certainty as an economics question. It does not know what it does not know. If you deploy AI on tasks where interpretation matters, that is a more important warning than any average.
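The "two fives" reasoning can be written down as a small check. This is a sketch under stated assumptions, not the full Dutch pass/fail regulation: it only covers the core-subject rule described above, it assumes the core subjects are Dutch, English and mathematics, and it assumes a rounded grade below 5 in a core subject is also a fail. The function names are hypothetical.

```python
import math

# Assumption: the core subjects for this rule.
CORE_SUBJECTS = {"Dutch", "English", "mathematics"}


def round_half_up(grade: float) -> int:
    """Round a grade to a whole number, with .5 rounding up."""
    return math.floor(grade + 0.5)


def passes_core_rule(grades: dict[str, float]) -> bool:
    """Check only the core-subject rule: at most one rounded 5, nothing lower.

    Subjects outside CORE_SUBJECTS, or missing from `grades`, are ignored.
    """
    rounded = [round_half_up(g) for s, g in grades.items() if s in CORE_SUBJECTS]
    if any(g < 5 for g in rounded):
        return False
    fives = sum(1 for g in rounded if g == 5)
    return fives <= 1
```

Fed the two grades quoted for GPT-4.1 Mini, `passes_core_rule({"English": 4.6, "Dutch": 4.9})` returns `False`: both round to a 5. A single 5.1 in mathematics, as with Gemini Pro, does not trip this rule on its own.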

The bill

Then the conversation no school ever needs to have: what did that diploma cost? Between the cheapest and the most expensive model in this class sits a price gap of roughly a hundred and fifty times. The grades, meanwhile, run from 6.4 to 8.9. Paying a hundred and fifty times more does not buy a hundred and fifty times better answer; it does not even guarantee a full extra point. The chart below plots each model's average grade against its cost per million tokens, the snippets of text an AI computes in; a million tokens comes down to roughly a thousand pages. Two dots light up: Claude Fable 5, top of the class and also the priciest, and Qwen 3.5, which scores an 8.1 for less than a dollar per million tokens. The latter is, not coincidentally, the model Localign builds its assistants on.

Interactive chart

Paying more barely buys a higher grade

Average grade set against what using each model costs.

[Interactive chart: average grade plotted against cost]
Legend: Best value · Top of the class
Cost in dollars per million tokens, the snippets of text an AI computes in. "Blended" means the average of the price for reading (what the model takes in) and the price for writing (what it returns).
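The cost arithmetic behind the chart is simple enough to sketch. The code below assumes "blended" is a plain, unweighted average of the input and output price, and it reuses the article's rough figure of a thousand pages per million tokens. The function names are hypothetical and no real model prices are included.

```python
PAGES_PER_MILLION_TOKENS = 1000  # rough figure from the article


def blended_price(input_price: float, output_price: float) -> float:
    """Average the reading (input) and writing (output) price per million tokens.

    Assumption: a plain unweighted mean of the two prices.
    """
    return (input_price + output_price) / 2


def cost_per_page(price_per_million: float) -> float:
    """Approximate cost of one page of text at a given price per million tokens."""
    return price_per_million / PAGES_PER_MILLION_TOKENS


def price_ratio(cheaper: float, pricier: float) -> float:
    """How many times more expensive the pricier model is."""
    if cheaper <= 0:
        raise ValueError("cheaper price must be positive")
    return pricier / cheaper
```

With made-up prices of 1 dollar for reading and 3 dollars for writing, `blended_price(1.0, 3.0)` gives 2.0 dollars per million tokens, or about a fifth of a cent per page.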

So what does all that money actually buy? Mostly certainty. Fable 5 is not so much more brilliant than the rest as it is flatter, in the good sense: no slip-ups anywhere, no subject below a 7, not even on the treacherous Dutch exam. The mid-field matches those grades on a good day, but carries weak subjects you have to plan around. At the top end you are not paying for the average. You are paying for the lowest grade staying high.

Pick per task, test in your own context

What do you take from this if you run an organisation rather than an exam class? Three things. One: the question has long stopped being whether AI can do it, because apart from one resit candidate the whole class passes, the cheapest model included. Two: the differences are in where a model slips, and no foreign leaderboard will tell you that; for that you need to test in your own language and your own context. Three: for factual work with a bounded answer space, a small, cheap model is often plenty, while interpretation in Dutch calls for the best there is, or for a human reading along. Which is exactly why Localign does not run on a single model but on a mix: a strong, affordable base where possible, top of the class where it counts.

Fair is fair

An exam is not an office. And "passed" comes with a wink: we only measured the central written exams, while a real final grade also weighs in the school exams. Grading was done by one fixed AI model using the official answer keys: consistent, but not a certified teacher. Figures and graphs were provided to the models as text descriptions, not as images. And a grade says nothing about tone, reliability or safety in production. This test tells you where models differ from each other, not which model solves your problem. That last part remains testing work.

So the flags can go up, for ten of the eleven AI models too. At home, that settles it: a pass is a pass, and then you celebrate. Choosing an AI model works the other way around: there, "passed" is where it starts, and asking to see the grade list is exactly the right move.

About the authors

Doga Aras
AI Engineer

AI engineer at Localign and builder of the exam pipeline: from ingesting the exam booklets to automated grading against the official answer keys.

An average tells me very little. Show me where a model slips and I'll tell you what it's worth.

Bink Hol
AI work student

AI work student at Localign. Ran the exams, combed through the answers of eleven models and knows the grade list by heart by now.

The biggest AI is not always the smartest one.
