NorEval is a multi-task Norwegian language understanding and generation evaluation benchmark.
- 10.04.2025: 📕 Our pre-print is available on arXiv.
- 09.04.2025: 🚀 We release NorEval, including our annotation guidelines and novel datasets (NorRewrite-Instruct & NorSummarize-Instruct).
Overview of the NorEval design. 😼 denotes datasets used in NorBench, NLEBench, ScandEval, and SEB, 🚀 represents datasets that have not been used in the existing Norwegian benchmarks, and 😎 denotes our novel datasets introduced as part of NorEval. EN=English; BM=Norwegian Bokmål; NN=Norwegian Nynorsk.
🇳🇴 NorEval combines 19 existing peer-reviewed datasets with five datasets created from scratch (NCB, NorRewrite-Instruct, NorSummarize-Instruct for Norwegian Bokmål, and NorIdiom for Norwegian Bokmål and Nynorsk). NorEval covers nine diverse task categories, including sentiment analysis, Norwegian language knowledge, Norwegian-specific & world knowledge, machine reading comprehension, commonsense reasoning, machine translation, text summarization, instruction following, and truthfulness. Our main evaluation principles are:
- 🌐 Linguistic diversity: support for both of the official written standards of Norwegian: Bokmål and Nynorsk (the minority variant).
- 📊 Task diversity: coverage of tasks that remain least addressed for Norwegian. In particular, only three out of 24 NorEval datasets are included in the existing Norwegian benchmarks to date: NorBench, NLEBench, ScandEval, and SEB.
- 🧠 Data quality: focus only on peer-reviewed, human-created datasets to ensure reliable evaluation in the context of the Norwegian language, culture, and values.
- 📏 Prompt sensitivity: evaluation across 100+ human-written prompts to account for prompt sensitivity.
- 👩🏻‍🔬 Standardized evaluation: integration of NorEval into LM Evaluation Harness for flexible and reproducible evaluation.
We group our datasets into text classification, sentence ranking, sentence completion, multiple-choice question answering, generative question answering, and sequence-to-sequence generation tasks. We describe our tasks below and refer the reader to our paper for more details.
| Name | Bokmål | Nynorsk | k-shot | Task type | Task category |
|---|---|---|---|---|---|
| NoReC Sentence | `norec_sentence` | ❌ | ✅ | Text classification | Sentiment analysis |
| NoReC Document | `norec_document` | ❌ | ✅ | Text classification | Sentiment analysis |
| NCB | `ncb` | ❌ | ❌ | Sentence ranking | Norwegian language knowledge |
| NorIdiom | `noridiom_nob` | `noridiom_nno` | ❌ | Sentence completion | Norwegian language knowledge |
| Belebele | `norbelebele` | ❌ | ❌ | Multiple-choice question answering | Machine reading comprehension |
| NRK-Quiz-QA | `nrk_quiz_qa_nob` | `nrk_quiz_qa_nno` | ❌ | Multiple-choice question answering | Norwegian-specific & world knowledge |
| NorOpenBookQA | `noropenbookqa_nob` | `noropenbookqa_nno` | ✅ | Multiple-choice question answering | Norwegian-specific & world knowledge |
| NorCommonsenseQA | `norcommonsenseqa_nob` | `norcommonsenseqa_nno` | ❌ | Multiple-choice question answering | Commonsense reasoning |
| NorTruthfulQA Multiple choice | `nortruthfulqa_mc_nob` | `nortruthfulqa_mc_nno` | ❌ | Multiple-choice question answering | Truthfulness |
| NorQuAD | `norquad` | ❌ | ✅ | Generative question answering | Machine reading comprehension |
| NorTruthfulQA Generation | `nortruthfulqa_gen_nob` | `nortruthfulqa_gen_nno` | ❌ | Generative question answering | Truthfulness |
| ASK-GEC | `ask_gec` | ❌ | ✅ | Sequence-to-sequence generation | Norwegian language knowledge |
| NorSumm | `norsumm_nob` | `norsumm_nno` | ✅ | Sequence-to-sequence generation | Text summarization |
| Tatoeba (English → Bokmål/Nynorsk) | `tatoeba_eng_nob` | `tatoeba_eng_nno` | ✅ | Sequence-to-sequence generation | Machine translation |
| Tatoeba (Bokmål/Nynorsk → English) | `tatoeba_nob_eng` | `tatoeba_nno_eng` | ✅ | Sequence-to-sequence generation | Machine translation |
| NorRewrite-Instruct | `norrewrite_instruct` | ❌ | ❌ | Sequence-to-sequence generation | Instruction following |
| NorSummarize-Instruct | `norsummarize_instruct` | ❌ | ❌ | Sequence-to-sequence generation | Instruction following |
Table description
- Name: the dataset name, with a HuggingFace link.
- Bokmål: the LM Evaluation Harness task name for the Norwegian Bokmål dataset.
- Nynorsk: the LM Evaluation Harness task name for the Norwegian Nynorsk dataset, if available.
- k-shot: the support for k-shot evaluation regimes with k > 0. We follow the original datasets' design and focus mainly on zero-shot evaluation by default.
  - ✅ means that the user can run the evaluation in both zero-shot and k-shot regimes.
  - ❌ denotes that only the zero-shot evaluation regime is available, due to the lack of a training or validation set to sample the demonstration examples from. Technically, k-shot evaluation on the test set is possible using sampling without replacement, given that the model is not proprietary and not accessed via an API; see the sketch after this list.
- Task type: the task type.
- Task category: the task category.
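For illustration, here is a minimal sketch (not part of NorEval) of what sampling demonstrations without replacement from a test set could look like; the function name and data layout are hypothetical:

```python
# Hypothetical helper: sample k demonstrations for test example `idx` without
# replacement, excluding the example itself so it never appears in its own prompt.
import random

def sample_demonstrations(test_set, idx, k, seed=42):
    rng = random.Random(seed)
    pool = [i for i in range(len(test_set)) if i != idx]
    return [test_set[i] for i in rng.sample(pool, k)]
```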
Install LM Evaluation Harness and clone our repository.
Note: NorEval can currently be used only locally. We are in the process of integrating our benchmark into LM Evaluation Harness.
```bash
pip install --quiet https://github.com/EleutherAI/lm-evaluation-harness/archive/refs/tags/v0.4.8.tar.gz # the most recent version at the time of writing
git clone https://github.com/ltgoslo/noreval.git
```
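To check that the NorEval tasks are visible to the framework, you can list the registered tasks (assuming the standard `--tasks list` behavior of LM Evaluation Harness v0.4.x):

```bash
lm_eval --tasks list --include_path ./noreval/ | grep -i nor
```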
More detailed guidelines on how to use LM Evaluation Harness can be found here.
The high-level framework usage requires the following arguments:

- `--model`: the model type (e.g., `hf` and `vllm`). Please refer to the documentation on using vLLM.
- `--model_args`: the model arguments, such as the HuggingFace model name (e.g., `pretrained=norallm/normistral-7b-warm`).
- `--tasks`: the name(s) of the evaluation tasks (e.g., `norcommonsenseqa_nob`).
- `--include_path`: a path to custom configuration files in the `.yaml` format (in our case, `noreval`). This adds the NorEval tasks to the framework's task registry as available tasks.
- `--log_samples`: saves the model inputs and outputs in the directory specified via the `--output` argument.
- `--output`: a path where the high-level results will be saved. If `--log_samples` is provided, both model predictions and results are saved in the specified directory.
- `--write_out`: a complementary flag that prints out the format of the prompts and outputs.
- `--show_config`: a complementary flag that prints out the configuration file.
- `--batch_size`: the batch size. `"auto"` automatically selects the largest batch size that fits in memory, speeding up evaluation.
- `--num_fewshot`: the number of demonstrations used in the model input. The default value is `0`; the user can adjust this parameter based on the support for k-shot regimes (refer to 🗃️ Tasks).
- `--predict_only`: skips computing the performance metrics and only saves the predictions. Should be used together with `--log_samples`.
Example 1: Zero-shot evaluation on NorQuAD across five prompts.
```bash
lm_eval \
    --model hf \
    --model_args pretrained=norallm/normistral-7b-warm \
    --tasks norquad \
    --include_path ./noreval/ \
    --output results/norquad/0-shot/ \
    --log_samples \
    --show_config \
    --write_out \
    --batch_size auto \
    --num_fewshot 0
```
Example 2: One-shot evaluation on NorQuAD across five prompts.
```bash
lm_eval \
    --model hf \
    --model_args pretrained=norallm/normistral-7b-warm \
    --tasks norquad \
    --include_path ./noreval/ \
    --output results/norquad/1-shot/ \
    --log_samples \
    --show_config \
    --write_out \
    --batch_size auto \
    --num_fewshot 1
```
Example 3: Zero-shot evaluation on NorQuAD using one prompt of interest.
All prompts are numbered from `0` to `6`, and the corresponding configuration files for all supported prompts can be found in the task directories.
```bash
lm_eval \
    --model hf \
    --model_args pretrained=norallm/normistral-7b-warm \
    --tasks norquad_p0 \
    --include_path ./noreval/ \
    --output results/norquad_p0/0-shot/ \
    --log_samples \
    --show_config \
    --write_out \
    --batch_size auto \
    --num_fewshot 0
```
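To evaluate on a subset of prompts rather than all or one, the per-prompt task names can be passed as a comma-separated list (standard LM Evaluation Harness behavior; `norquad_p1` and the output path here are illustrative, assuming the same `_p{i}` naming as above):

```bash
lm_eval \
    --model hf \
    --model_args pretrained=norallm/normistral-7b-warm \
    --tasks norquad_p0,norquad_p1 \
    --include_path ./noreval/ \
    --output results/norquad_p0_p1/0-shot/ \
    --batch_size auto \
    --num_fewshot 0
```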
Example 4: Zero-shot evaluation on task groups.
Consider an example of conducting an evaluation on a task category of interest, e.g., Norwegian-specific & world knowledge. LM Evaluation Harness allows grouping tasks as shown below; please find more details here.
Step 1: Create a configuration file
Create a configuration file containing the name of the group and the corresponding tasks, and save it in the `noreval` folder.
```yaml
group: norwegian_specific_and_world_knowledge_tasks_nob
task:
  - nrk_quiz_qa_nob
  - noropenbookqa_nob
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
```
Step 2: Run the evaluation
Here, we specify the name of our created group via `--tasks`:
```bash
lm_eval \
    --model hf \
    --model_args pretrained=norallm/normistral-7b-warm \
    --tasks norwegian_specific_and_world_knowledge_tasks_nob \
    --include_path ./noreval/ \
    --output results/norwegian_specific_and_world_knowledge_tasks_nob/0-shot/ \
    --log_samples \
    --show_config \
    --write_out \
    --batch_size auto \
    --num_fewshot 0
```
Example 5: Zero-shot evaluation on ASK-GEC, which requires computation of the performance metric using a separate script.
Here, we use the `--predict_only` argument and compute the performance metrics as described below.
Step 1: Generate the predictions
```bash
lm_eval \
    --model hf \
    --model_args pretrained=AI-Sweden-Models/Llama-3-8B \
    --tasks ask_gec \
    --include_path ./noreval/ \
    --output results/ask_gec/0-shot/ \
    --log_samples \
    --show_config \
    --write_out \
    --predict_only \
    --batch_size auto \
    --num_fewshot 0
```
Step 2: Evaluate the predictions with ERRANT
- Please refer to the installation instructions here.
- Run the following:
```bash
python3 ask_gec/errant.py --fpath results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/samples_ask_gec_p0_2025-01-28T01-08-13.454441.jsonl --out_fdir results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/
```
- The results will be saved as `results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/samples_ask_gec_p0_2025-01-28T01-08-13.454441_errant.json`.
Comment: Running BERTScore.
The optimal support of BERTScore in LM Evaluation Harness remains an open issue. We follow the proposed workaround for NorSumm but compute BERTScore for the other sequence-to-sequence generation tasks offline, after running the evaluation with the `--predict_only` argument.
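A minimal sketch of such an offline computation, assuming the `bert_score` package and the field layout of the harness's `samples_*.jsonl` files (`target` for the reference, `filtered_resps` for the model output); adjust the path, field names, and encoder to your own setup:

```python
# Offline BERTScore over a samples_*.jsonl file produced with --log_samples and
# --predict_only. The field names ("target", "filtered_resps") follow the LM
# Evaluation Harness log format; verify them against your own files.
import json

from bert_score import score  # pip install bert-score

samples_path = "path/to/samples_<task>_<timestamp>.jsonl"  # produced by --log_samples

references, candidates = [], []
with open(samples_path) as f:
    for line in f:
        sample = json.loads(line)
        references.append(sample["target"])
        candidates.append(sample["filtered_resps"][0])

# A multilingual encoder is used here since Norwegian is not covered by the
# default English checkpoint; any suitable model_type can be substituted.
P, R, F1 = score(candidates, references, model_type="bert-base-multilingual-cased")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```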
```bibtex
@article{mikhailov2025noreval,
  title={NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark},
  author={Mikhailov, Vladislav and Enstad, Tita and Samuel, David and Farseth{\aa}s, Hans Christian and Kutuzov, Andrey and Velldal, Erik and {\O}vrelid, Lilja},
  journal={arXiv preprint arXiv:2504.07749},
  year={2025}
}
```