A couple of weeks ago in Lisbon, I went to a friend's birthday dinner. Across from me sat someone who had recently started working for the Portuguese government, focusing on modernization and technology. It's not every day that I talk to someone who works for the Portuguese government in an area similar to mine, so I was very curious. I asked about the AMÁLIA project: the 5.5 million euro project to create a new LLM specifically designed for Portuguese.
The first beta release of AMÁLIA is scheduled for the first quarter of 2025, but I've heard little news about it. Still, I asked: "Are you pre-training it? Or are you fine-tuning something that is already out there?" "No, we're training from scratch," the person told me. "Really, what's the point of that?" I asked, but did not get a clear reply. The question stuck in my mind, though. What's the goal? To show Portugal is capable of pre-training a model from scratch? To build a model that is specifically really good at Portuguese? 1
The problem: Shooting in the dark
Let's assume the goal is to build a model that is really good at Portuguese, and that the Portuguese state is really not interested in "showing we are capable of training models" (we have bigger problems). If we assume that, the first thing I would do is measure how good existing language models actually are at Portuguese.
Portuguese is a popular language: it's the 8th most spoken language in the world. But it comes in flavors. The Portuguese spoken in Brazil (82% of native speakers) is quite different from the Portuguese spoken in Portugal (4% of speakers). A common complaint from people in Portugal is that the models often reply in "Brazilian" Portuguese. The difference is not only about pronunciation: a lot of the vocabulary, verbs, and conjugations are different.
Let's take Llama 3. It was trained on 8% multilingual data. If we assume 2% of that was Portuguese, and that 5% of that was European Portuguese, then only 0.008% of the data Llama saw was European Portuguese. That's not a lot. But underrepresentation is a common issue.
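To make the back-of-the-envelope arithmetic explicit (the 2% and 5% shares are my assumptions, not published numbers):

multilingual = 0.08   # share of Llama 3's training data that is multilingual
portuguese = 0.02     # assumed share of the multilingual data that is Portuguese
european_pt = 0.05    # assumed share of that Portuguese that is European Portuguese

share = multilingual * portuguese * european_pt
print(f"{share:.5%}")  # 0.00800% of all training data is European Portuguese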
European Portuguese on EuroEval
Denmark has a similar problem, and I knew that Dan Nielsen at the Alexandra Institute had worked on something called ScandEval, where he evaluated the performance of language models across different Scandinavian languages. I was surprised to see that the project had evolved into something more general: EuroEval.
EuroEval is similar to ScandEval, but for all European languages. Want to guess which one was missing? Yes, Portuguese.
Over a couple of weeks, we put together2 a collection of datasets to evaluate the performance of language models in European Portuguese. We also did some extra work to ensure that the data is exclusively Portuguese from Portugal.
Here are the datasets we put together:
- Sentiment Classification (SST2-PT): Part of the work from the ExtraGLUE project. A sentiment analysis dataset built using machine translation (DeepL).
- Named Entity Recognition (HAREM): Part of the work from the HAREM project. We filtered for entries where the origin is PT to create an NER dataset.
- Linguistic Acceptability (ScaLA-pt): Based on the Portuguese-Bosque treebank, filtered to entries from CETEMPúblico. Created by corrupting grammatically correct sentences.
- Reading Comprehension (BoolQ-PT): Also part of the ExtraGLUE work. Adapted by taking the original passage, question, and yes/no options, and turning it into a Q/A style question where the model answers yes or no.
- Knowledge (MMLU-pt): Based on this paper, which already included entries specifically for Portuguese from Portugal.
- Common-sense Reasoning (GoldenSwag-pt): High-quality filtered samples from the HellaSwag dataset, also machine-translated with DeepL into European Portuguese.
- Summarization (Publico): Filtered the CCNews corpus for entries where the URL matched Público, then turned it into a summarization dataset by extracting the first two sentences of each article as the summary (a common trick); a rough sketch of this transformation follows the list.
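To make the Público transformation concrete, here is a minimal sketch of the idea in Python. The field names and the naive sentence splitting are illustrative assumptions, not the exact code behind the dataset:

import re

def to_summarization_example(article: dict) -> dict | None:
    """Turn a CC-News entry into a (text, summary) pair for Público articles."""
    # Keep only articles whose URL points at Público.
    if "publico.pt" not in article.get("url", ""):
        return None

    # Naive sentence split on ., ! or ? followed by whitespace (good enough for a sketch).
    sentences = re.split(r"(?<=[.!?])\s+", article["text"].strip())
    if len(sentences) < 3:
        return None  # too short to split into a summary and a body

    return {
        "text": " ".join(sentences[2:]),         # the rest of the article
        "target_text": " ".join(sentences[:2]),  # first two sentences as the summary
    }

Using the leading sentences as the reference summary works reasonably well for news, where the first paragraph usually condenses the story.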
If you want to look at some samples, I published the datasets on HuggingFace. You can also read the extensive descriptions in the EuroEval docs.
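If you want to poke at the data programmatically, something like this should work with the Hugging Face datasets library. The repository id below is a placeholder; swap in the actual ids of the published datasets:

from datasets import load_dataset

# Placeholder id: replace it with the real dataset id from the Hub.
dataset = load_dataset("EuroEval/boolq-pt", split="train")

print(dataset[0])        # inspect a single example
print(dataset.features)  # column names and types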
Running benchmarks for European Portuguese
While Dan is working on the general leaderboard, I ran some benchmarks on my own, which you can see at the top of this blog post or at this link. I selected some models I was curious about across three "categories": Large, Small, and things I can run on my laptop without the fan coming on.
If you're curious about a particular model that I (or Dan) didn't benchmark, you can also run the evaluation yourself (assuming you have uv installed):
$ uvx --with euroeval euroeval \
--model ollama_chat/smollm2:135m \
--task sentiment-classification \
--language pt
Set the --model flag to any model that LiteLLM supports.
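If you prefer Python over the CLI, EuroEval also exposes a Benchmarker class. The sketch below just mirrors the CLI invocation above; the exact argument names are my assumption, so check the EuroEval docs before relying on them:

from euroeval import Benchmarker

# Argument names assumed to mirror the CLI flags (--task, --language, --model).
benchmarker = Benchmarker(task="sentiment-classification", language="pt")
results = benchmarker.benchmark(model="ollama_chat/smollm2:135m")
print(results)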
Building these benchmarks made one thing clear: European Portuguese is an unpopular language, and Brazilian Portuguese is simply a much more popular variant. Still, there is value in building benchmarks focused on the European variant. And tracking down and building these datasets was a lot of fun; I expect to do some more work in this area.
And I don't know if investing 5.5M euro in developing a Portuguese language model is a good idea. But there's one thing I do know: whenever that model comes out, we are now in a much better position to say whether it hit the mark. Or whether it didn't.
Note: the benchmarks above are preliminary. For the official ones, keep an eye on the EuroEval Leaderboards.
- Other countries like Italy have done this; see Minerva. They also spent 5M, but from EU funds as far as I know. ↩
- Thanks Dan for all the help in making this happen! ↩