Enriching Social Science Research via Survey Item Linking (SIL)

This repository is the official implementation of Enriching Social Science Research via Survey Item Linking (2024).

Figure: The pipeline for Survey Item Linking.

Requirements

To install requirements, use either poetry or pip:

poetry install
poetry install --only data_s44k # if you want to reproduce the S44k dataset
poetry install --only data_gsim # if you want to reproduce the GSIM dataset
pip install -r requirements/requirements.txt
pip install -r requirements/data_s44k.txt # if you want to reproduce the S44k dataset
pip install -r requirements/data_gsim.txt # if you want to reproduce the GSIM dataset

Important

After installing the requirements, the supplementary datasets (GSIM, LLM-Gen, and S44k) can be re-created by following the instructions for each. The exception is SILD, which should be downloaded from here and placed into the /data/sild/ directory. SILD is archived on Zenodo.
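Before running any experiments, it can help to confirm that SILD was placed where the code expects it. The following is a minimal sketch, not part of the repository: the helper name `sild_is_in_place` is hypothetical, and it assumes that `/data/sild/` is relative to the repository root. Because the files inside SILD are not listed here, it only checks that the directory exists and is not empty.

```python
from pathlib import Path


def sild_is_in_place(repo_root: str = ".") -> bool:
    """Return True if <repo_root>/data/sild exists and holds at least one entry.

    Hypothetical helper: assumes /data/sild/ is relative to the repository root.
    """
    sild_dir = Path(repo_root) / "data" / "sild"
    # The file names inside SILD are not documented in this README,
    # so only the directory itself is checked.
    return sild_dir.is_dir() and any(sild_dir.iterdir())


if __name__ == "__main__":
    if sild_is_in_place():
        print("SILD found in data/sild/")
    else:
        print("SILD is missing: download it and place it into data/sild/")
```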

Experiments

To run the experiments in the paper, run the following commands:

bash ./experiments/md/pretrain.slurm  # continue pretraining PLMs on S44k
bash ./experiments/md/train_linear.sh  # train linear classifiers on SILD
bash ./experiments/md/train_linear_da.sh  # train linear classifiers using data augmentation
bash ./experiments/md/train_plms.sh  # fine-tune PLMs on SILD
bash ./experiments/md/train_plms_da.sh  # fine-tune PLMs using data augmentation
bash ./experiments/md/train_knn.sh  # train kNN on SILD
bash ./experiments/md/eval_rac.sh  # combine the best PLM w/ the best kNN
bash ./experiments/md/eval_icl.sh  # evaluate In-Context Learning
bash ./experiments/ed/eval_bm25.sh  # evaluate BM25
bash ./experiments/ed/eval_plms.sh  # evaluate PLMs (including sentence transformers)
bash ./experiments/ed/train_sosse.sh  # train SoSSE models by fine-tuning sentence-transformers on GSIM and LLM-Gen
bash ./experiments/ed/eval_sosse.sh  # evaluate SoSSE models
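To run the whole sequence unattended, the commands above can be wrapped in a small driver. This is a sketch under stated assumptions rather than a script shipped with the repository: the function name `run_experiments` is hypothetical, it assumes the scripts are meant to run in the listed order from the repository root, and it assumes each one can be launched with plain `bash` as shown above. By default it performs a dry run and only returns the commands.

```python
import subprocess

# Script paths copied from the command list above, in the same order.
EXPERIMENT_SCRIPTS = [
    "./experiments/md/pretrain.slurm",
    "./experiments/md/train_linear.sh",
    "./experiments/md/train_linear_da.sh",
    "./experiments/md/train_plms.sh",
    "./experiments/md/train_plms_da.sh",
    "./experiments/md/train_knn.sh",
    "./experiments/md/eval_rac.sh",
    "./experiments/md/eval_icl.sh",
    "./experiments/ed/eval_bm25.sh",
    "./experiments/ed/eval_plms.sh",
    "./experiments/ed/train_sosse.sh",
    "./experiments/ed/eval_sosse.sh",
]


def run_experiments(scripts=EXPERIMENT_SCRIPTS, dry_run=True):
    """Build one `bash <script>` command per script and optionally run them.

    With dry_run=True the commands are only returned, not executed.
    Otherwise they run sequentially and stop at the first failure.
    """
    commands = [["bash", script] for script in scripts]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Print the planned commands without executing anything.
    for command in run_experiments(dry_run=True):
        print(" ".join(command))
```

Calling `run_experiments(dry_run=False)` executes the scripts one after another. Stopping at the first failure avoids evaluating models whose training step did not finish.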

Models

Important

The models will be uploaded to HuggingFace Hub soon!

You can download multilingual pretrained models for the social science domain:

  • SSOAR-XLM-R-base is pre-trained on S44k using masked language modeling (MLM), a batch size of 8, and a sequence length of 512 tokens.

You can download multilingual fine-tuned models for MD on SILD, which used a batch size of 32 and a sequence length of 64, here:

You can download multilingual fine-tuned models for ED on LLM-Gen, which used a batch size of 1024 and a sequence length of 512, here:

Results

Our models achieve the following performance:

| Model name | F1-binary (English) | F1-binary (German) | F1-binary (Total) |
|---|---|---|---|
| XLM-R-base-SILD | 58.5% | 53.9% | 57.1% |
| SSOAR-XLM-R-base-SILD | 60.7% | 61.8% | 61.0% |
| XLM-R-large-SILD | 61.4% | 65.1% | 62.6% |

| Model name | MAP@10 (English) | MAP@10 (German) |
|---|---|---|
| mE5-base (baseline) | 57.9% | 65.6% |
| SoSSE-mE5-base | 63.2% | 68.1% |
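For readers unfamiliar with the two metrics reported above, the sketch below shows one common way to compute them. It is an illustration, not the repository's evaluation code. It reads F1-binary as the F1 score of the positive class (label 1), and it normalises average precision by min(k, number of relevant items), which is one of several conventions in use. The function names and the toy inputs are hypothetical and unrelated to the reported numbers.

```python
def f1_binary(y_true, y_pred):
    """F1 score of the positive class (label 1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_precision_at_k(relevant, ranked, k=10):
    """AP@k for one query.

    Assumes `ranked` has no duplicates and normalises by
    min(k, number of relevant items); other conventions exist.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant))


def map_at_k(queries, k=10):
    """Mean of AP@k over an iterable of (relevant, ranked) pairs."""
    scores = [average_precision_at_k(rel, ranked, k) for rel, ranked in queries]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy data for illustration only.
    print(f1_binary([1, 1, 0, 0], [1, 0, 1, 0]))
    print(map_at_k([({"a", "c"}, ["a", "b", "c"]), ({"x"}, ["y", "x"])]))
```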

Licensing Information

Dataset licensing can be found under the respective directories (SILD, GSIM, LLM-Gen, or S44k). This work (including the models and the annotations) is licensed under CC BY 4.0.