This is the code used by the MaiNLP team in the VarDial 2025 shared task on dialectal Norwegian slot and intent detection. It is described in the paper "Add Noise, Tasks, or Layers? A Comparison of Methods to Improve Dialectal Norwegian Slot and Intent Detection" (Verena Blaschke*, Felicia Körner*, Barbara Plank), to be published at VarDial @ COLING 2025.
Clone with the --recursive flag to download the submodule contents:
git clone [email protected]:mainlp/NorSID.git --recursive
Our experiments are described in more detail in the respective folders:
experiments_baselines
: Compare mDeBERTa-v3, ScandiBERT and NorBERT-v3 when trained on the English training data, its machine-translated counterpart, or 90% of the (predominantly dialectal) development set. This folder also contains some general scripts/notes (making MaChAmp compatible with NorBERT, preparing the 90:10 split of the dev data).
experiments_noise
: Inject character-level noise into the MT'ed Norwegian training data and fine-tune the language models on the noised data.
experiments_auxtasks
: Train ScandiBERT on another NLP task in (standard or dialectal) Norwegian (POS tagging, dependency parsing, NER, dialect identification), either before training on the English SID training set or simultaneously.
experiments_layer_swap
: Train models on different datasets and swap out some of their layers (or reset some of their layers).
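To give a rough idea of what character-level noise injection can look like, here is a minimal Python sketch. It is not the code from experiments_noise: the function name add_char_noise, the noise_level parameter, the alphabet, and the choice of insert/delete/replace edits are all assumptions made for illustration. See the folder itself for the actual procedure and settings.

```python
import random
import string


def add_char_noise(tokens, noise_level=0.2,
                   alphabet=string.ascii_lowercase + "æøå", seed=None):
    """Return a copy of `tokens` in which roughly `noise_level` of the
    tokens receive one random character edit (hypothetical scheme)."""
    rng = random.Random(seed)
    noised = []
    for token in tokens:
        # Leave empty tokens alone and skip most tokens at random.
        if not token or rng.random() >= noise_level:
            noised.append(token)
            continue
        op = rng.choice(("insert", "delete", "replace"))
        pos = rng.randrange(len(token))
        if op == "insert":
            token = token[:pos] + rng.choice(alphabet) + token[pos:]
        elif op == "delete" and len(token) > 1:
            token = token[:pos] + token[pos + 1:]
        else:
            # Replacement (also used instead of deleting a one-character
            # token); it may occasionally pick the same character again.
            token = token[:pos] + rng.choice(alphabet) + token[pos + 1:]
        noised.append(token)
    return noised


if __name__ == "__main__":
    print(add_char_noise(["sett", "på", "en", "alarm"], noise_level=0.5, seed=0))
```

The sketch edits tokens in place and never adds or removes a token, so token-level slot labels stay aligned with the noised sentence.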
Data in submodule folder:
- xSID (regular training/dev/test data for English + other languages)
- NoMusic (shared task dev/test data; MT'ed Norwegian training data)
- UD Nynorsk LIA (treebank with dialectal transcriptions)
- NorNE (named entity annotations for Norwegian)
Data to be additionally downloaded:
- Nordic Dialect Corpus (CC BY-SA 4.0): Download the phonetic & orthographic transcriptions with informant codes, unzip them, and add the folders ndc_phon_with_informant_codes and ndc_with_informant_codes to the data directory.
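A quick way to confirm that the Nordic Dialect Corpus folders ended up in the right place is a small check like the one below. This is only a sketch and not part of the repository: the helper name missing_ndc_folders is hypothetical, and it assumes the script is run from the repository root so that the relative path data resolves to the data directory.

```python
from pathlib import Path

# Folder names as given in the download instructions.
NDC_FOLDERS = ("ndc_phon_with_informant_codes", "ndc_with_informant_codes")


def missing_ndc_folders(data_dir="data"):
    """Return the expected NDC folder names that are not directories in `data_dir`."""
    data_path = Path(data_dir)
    return [name for name in NDC_FOLDERS if not (data_path / name).is_dir()]


if __name__ == "__main__":
    missing = missing_ndc_folders()
    if missing:
        print("Missing NDC folders in data/:", ", ".join(missing))
    else:
        print("All NDC folders found.")
```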