🇳🇴 NorEval

NorEval is a multi-task Norwegian language understanding and generation evaluation benchmark.

😎 Overview

[Figure: Overview of the NorEval design. 😼 denotes datasets used in NorBench, NLEBench, ScandEval, and SEB; 🚀 marks datasets that have not been used in the existing Norwegian benchmarks; 😎 denotes our novel datasets introduced as part of NorEval. EN = English; BM = Norwegian Bokmål; NN = Norwegian Nynorsk.]

🇳🇴 NorEval combines 19 existing peer-reviewed datasets with five datasets created from scratch (NCB, NorRewrite-Instruct, NorSummarize-Instruct for Norwegian Bokmål, and NorIdiom for Norwegian Bokmål and Nynorsk). NorEval covers nine diverse task categories, including sentiment analysis, Norwegian language knowledge, Norwegian-specific & world knowledge, machine reading comprehension, commonsense reasoning, machine translation, text summarization, instruction following, and truthfulness. Our main evaluation principles are:

  • 🌐 Linguistic diversity: support for both of the official written standards of Norwegian: Bokmål and Nynorsk (the minority variant).
  • 📊 Task diversity: coverage of tasks that remain underexplored for Norwegian. In particular, only three of the 24 NorEval datasets appear in the existing Norwegian benchmarks NorBench, NLEBench, ScandEval, and SEB.
  • 🧠 Data quality: focus on only peer-reviewed human-created datasets to ensure reliable evaluation in the context of the Norwegian language, culture, and values.
  • 📏 Prompt sensitivity: evaluation across 100+ human-written prompts to account for prompt sensitivity.
  • 👩🏻‍🔬 Standardized evaluation: integration of NorEval into LM Evaluation Harness for flexible and reproducible evaluation.

🗃️ Tasks

We group our datasets into text classification, sentence ranking, sentence completion, multiple-choice question answering, generative question answering, and sequence-to-sequence generation tasks. We describe our tasks below and refer the reader to our paper for further details.

| Name | Bokmål | Nynorsk | k-shot | Task type | Task category |
|------|--------|---------|--------|-----------|---------------|
| NoReC Sentence | norec_sentence | | | Text classification | Sentiment analysis |
| NoReC Document | norec_document | | | Text classification | Sentiment analysis |
| NCB | ncb | | | Sentence ranking | Norwegian language knowledge |
| NorIdiom | noridiom_nob | noridiom_nno | | Sentence completion | Norwegian language knowledge |
| Belebele | norbelebele | | | Multiple-choice question answering | Machine reading comprehension |
| NRK-Quiz-QA | nrk_quiz_qa_nob | nrk_quiz_qa_nno | | Multiple-choice question answering | Norwegian-specific & world knowledge |
| NorOpenBookQA | noropenbookqa_nob | noropenbookqa_nno | | Multiple-choice question answering | Norwegian-specific & world knowledge |
| NorCommonsenseQA | norcommonsenseqa_nob | norcommonsenseqa_nno | | Multiple-choice question answering | Commonsense reasoning |
| NorTruthfulQA Multiple choice | nortruthfulqa_mc_nob | nortruthfulqa_mc_nno | | Multiple-choice question answering | Truthfulness |
| NorQuAD | norquad | | | Generative question answering | Machine reading comprehension |
| NorTruthfulQA Generation | nortruthfulqa_gen_nob | nortruthfulqa_gen_nno | | Generative question answering | Truthfulness |
| ASK-GEC | ask_gec | | | Sequence-to-sequence generation | Norwegian language knowledge |
| NorSumm | norsumm_nob | norsumm_nno | | Sequence-to-sequence generation | Text summarization |
| Tatoeba (English → Bokmål/Nynorsk) | tatoeba_eng_nob | tatoeba_eng_nno | | Sequence-to-sequence generation | Machine translation |
| Tatoeba (Bokmål/Nynorsk → English) | tatoeba_nob_eng | tatoeba_nno_eng | | Sequence-to-sequence generation | Machine translation |
| NorRewrite-Instruct | norrewrite_instruct | | | Sequence-to-sequence generation | Instruction following |
| NorSummarize-Instruct | norsummarize_instruct | | | Sequence-to-sequence generation | Instruction following |
Table description
  • Name: a dataset name with a HuggingFace link.
  • Bokmål: the LM Evaluation Harness task name for the Norwegian Bokmål dataset.
  • Nynorsk: the LM Evaluation Harness task name for the Norwegian Nynorsk dataset, if available.
  • k-shot: the support for k-shot evaluation regimes with k > 0. Following the original datasets' design, we focus on zero-shot evaluation by default.
    • ✅ means that the evaluation can be run in both zero-shot and k-shot regimes.
    • ❌ means that only zero-shot evaluation is available because there is no training or validation set to sample demonstration examples from. Technically, k-shot evaluation on the test set is possible using sampling without replacement, provided the model is not proprietary and not accessed via an API.
  • Task type: the task type.
  • Task category: the task category.
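
The test-set workaround mentioned above (sampling demonstrations without replacement from the test split itself, excluding the instance under evaluation) can be sketched as follows. This helper and the field names are illustrative, not part of NorEval or the harness:

```python
import random

def sample_demonstrations(test_set, current_idx, k, seed=42):
    """Pick k demonstration examples from the test set itself,
    excluding the instance being evaluated (sampling without replacement)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    pool = [ex for i, ex in enumerate(test_set) if i != current_idx]
    return rng.sample(pool, k)

# Illustrative usage with a toy "test set":
test_set = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
demos = sample_demonstrations(test_set, current_idx=0, k=3)
assert len(demos) == 3
assert all(ex["question"] != "q0" for ex in demos)  # current instance excluded
```

Remember that this leaks test-set examples into the prompt, which is why it is unsuitable for proprietary or API-gated models whose training data cannot be controlled.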

👨🏻‍💻 Installation and Usage

Install LM Evaluation Harness and clone our repository.

Note: NorEval can currently be used only locally. We are in the process of integrating our benchmark into LM Evaluation Harness.

pip install --quiet https://github.com/EleutherAI/lm-evaluation-harness/archive/refs/tags/v0.4.8.tar.gz # the latest release at the time of writing
git clone https://github.com/ltgoslo/noreval.git

More detailed guidelines on how to use LM Evaluation Harness can be found in its documentation.

The high-level framework usage requires the following arguments:

  • --model: the model type (e.g., hf or vllm). Please refer to the documentation on using vLLM.
  • --model_args: the model arguments, such as the HuggingFace model name (e.g., pretrained=norallm/normistral-7b-warm).
  • --tasks: the name(s) of the evaluation tasks (e.g., norcommonsenseqa_nob).
  • --include_path: the path to custom configuration files in the .yaml format (in our case, noreval). This adds the NorEval tasks to the framework's task registry.
  • --log_samples: saves the model inputs and outputs in the directory specified with --output.
  • --output: the path where high-level results will be saved. If --log_samples is provided, both model predictions and results are saved in the specified directory.
  • --write_out: a complementary option that prints out the format of the prompts and outputs.
  • --show_config: a complementary option that prints out the configuration file.
  • --batch_size: the batch size. auto automatically selects the largest batch size that fits in memory, speeding up evaluation.
  • --num_fewshot: the number of demonstrations included in the model input. The default is 0; adjust it based on the task's support for k-shot regimes (refer to 🗃️ Tasks).
  • --predict_only: saves the predictions without computing the performance metrics. Should be used together with --log_samples.

Examples

Example 1: Zero-shot evaluation on NorQuAD across five prompts.
lm_eval \
  --model hf \
  --model_args pretrained=norallm/normistral-7b-warm \
  --tasks norquad \
  --include_path ./noreval/ \
  --output results/norquad/0-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 0
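
After a run like this, the high-level scores land in a JSON file under the --output directory. A small sketch for loading them programmatically; the results_*.json naming and the top-level "results" key match recent harness versions, but verify against your own output files:

```python
import glob
import json
import os

def load_results(output_dir):
    """Load the most recent lm-eval results file found under an output directory.
    Assumes the harness's results_*.json naming convention (possibly in a
    model-specific subdirectory); adjust the pattern if your version differs."""
    pattern = os.path.join(output_dir, "**", "results_*.json")
    paths = sorted(glob.glob(pattern, recursive=True))
    if not paths:
        raise FileNotFoundError(f"no results_*.json under {output_dir}")
    with open(paths[-1]) as f:
        # "results" maps each task name to its metric values
        return json.load(f)["results"]
```

For example, `load_results("results/norquad/0-shot/")` would return a dictionary keyed by task name, one entry per evaluated prompt variant.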
Example 2: One-shot evaluation on NorQuAD across five prompts.
lm_eval \
  --model hf \
  --model_args pretrained=norallm/normistral-7b-warm \
  --tasks norquad \
  --include_path ./noreval/ \
  --output results/norquad/1-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 1
Example 3: Zero-shot evaluation on NorQuAD using one prompt of interest.

All prompts are numbered from 0 to 6, and the corresponding configuration files for all supported prompts can be found in the task directories.

lm_eval \
  --model hf \
  --model_args pretrained=norallm/normistral-7b-warm \
  --tasks norquad_p0 \
  --include_path ./noreval/ \
  --output results/norquad_p0/0-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 0
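
Since the prompt-specific tasks follow a simple `<task>_p<i>` naming pattern, a comma-separated list covering several prompts can be generated and passed to --tasks in one run (LM Evaluation Harness accepts comma-separated task names). A minimal sketch, assuming NorQuAD exposes prompts 0 through 6:

```python
# Build a comma-separated task list covering the NorQuAD prompt variants,
# suitable for the harness's --tasks argument.
tasks = ",".join(f"norquad_p{i}" for i in range(7))
print(tasks)
# norquad_p0,norquad_p1,norquad_p2,norquad_p3,norquad_p4,norquad_p5,norquad_p6
```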
Example 4: Zero-shot evaluation on task groups.

Consider an example of conducting an evaluation on a task category of interest, e.g., Norwegian-specific & world knowledge. LM Evaluation Harness supports grouping tasks as shown below; more details can be found in its documentation.

Step 1: Create a configuration file

Create a configuration file containing the name of the group and corresponding tasks and save it in the noreval folder.

group: norwegian_specific_and_world_knowledge_tasks_nob
task:
  - nrk_quiz_qa_nob
  - noropenbookqa_nob
aggregate_metric_list:
  - metric: acc
    weight_by_size: True

Step 2: Run the evaluation

Here, we are specifying the name of our created group as tasks:

lm_eval \
  --model hf \
  --model_args pretrained=norallm/normistral-7b-warm \
  --tasks norwegian_specific_and_world_knowledge_tasks_nob \
  --include_path ./noreval/ \
  --output results/norwegian_specific_and_world_knowledge_tasks_nob/0-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 0
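
For intuition, weight_by_size: True in the group configuration above makes the aggregate metric a size-weighted mean of the per-task scores, so larger datasets contribute proportionally more. A minimal sketch of that arithmetic, with made-up accuracies and dataset sizes:

```python
def weighted_accuracy(task_results):
    """Size-weighted aggregate: each task's accuracy contributes in
    proportion to its number of examples (weight_by_size: True)."""
    total = sum(n for _, n in task_results.values())
    return sum(acc * n for acc, n in task_results.values()) / total

# Illustrative (made-up) per-task results: (accuracy, dataset size)
results = {
    "nrk_quiz_qa_nob": (0.60, 500),
    "noropenbookqa_nob": (0.70, 1500),
}
print(round(weighted_accuracy(results), 3))  # 0.675, not the plain mean 0.65
```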
Example 5: Zero-shot evaluation on ASK-GEC, which requires computation of the performance metric using a separate script.

Here, we use the --predict_only argument and compute the performance metrics as described below.

Step 1: Generate the predictions

lm_eval \
  --model hf \
  --model_args pretrained=AI-Sweden-Models/Llama-3-8B \
  --tasks ask_gec \
  --include_path ./noreval/ \
  --output results/ask_gec/0-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --predict_only \
  --batch_size auto \
  --num_fewshot 0

Step 2: Evaluate the predictions with ERRANT

  • Please refer to the installation instructions here.
  • Run the following:
    python3 ask_gec/errant.py --fpath results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/samples_ask_gec_p0_2025-01-28T01-08-13.454441.jsonl --out_fdir results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/
  • The results will be saved as results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/samples_ask_gec_p0_2025-01-28T01-08-13.454441_errant.json
Comment: Running BERTScore.

Optimal support for BERTScore in LM Evaluation Harness remains an open issue. We follow the proposed workaround for NorSumm, but compute BERTScore for the other sequence-to-sequence generation tasks offline after running the evaluation with the --predict_only argument.
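
A sketch of one way to score such offline predictions with BERTScore, assuming the bert-score package is installed and that the harness's samples_*.jsonl rows carry the model response under "resps" and the reference under "target" (verify these field names against your own files before relying on this):

```python
import json

def load_pairs(samples_path):
    """Collect (prediction, reference) pairs from an lm-eval samples_*.jsonl file.
    The "resps" / "target" field names are an assumption about the harness's
    sample logs; check them against your own output."""
    preds, refs = [], []
    with open(samples_path) as f:
        for line in f:
            row = json.loads(line)
            preds.append(row["resps"][0][0])  # first response of the first repeat
            refs.append(row["target"])
    return preds, refs

def bertscore_f1(preds, refs, lang="no"):
    """Mean BERTScore F1 over all pairs. Requires `pip install bert-score`;
    downloads a scoring model on first use."""
    from bert_score import score
    _, _, f1 = score(preds, refs, lang=lang)
    return f1.mean().item()
```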

📝 Cite Us

@article{mikhailov2025noreval,
  title={NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark},
  author={Mikhailov, Vladislav and Enstad, Tita and Samuel, David and Farseth{\aa}s, Hans Christian and Kutuzov, Andrey and Velldal, Erik and {\O}vrelid, Lilja},
  journal={arXiv preprint arXiv:2504.07749},
  year={2025}
}
