This repository is the official implementation of *Enriching Social Science Research via Survey Item Linking* (2024).
To install requirements, use either poetry or pip:
poetry install
poetry install --only data_s44k # if you want to reproduce the S44k dataset
poetry install --only data_gsim # if you want to reproduce the GSIM dataset
pip install -r requirements/requirements.txt
pip install -r requirements/data_s44k.txt # if you want to reproduce the S44k dataset
pip install -r requirements/data_gsim.txt # if you want to reproduce the GSIM dataset
Important
After installing the requirements, the supplementary datasets can be re-created by following the instructions for each dataset (GSIM, LLM-Gen, or S44k). The exception is SILD, which is archived on Zenodo: download it from here and place it into the /data/sild/ directory.
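If you prefer to script the SILD download, a minimal Python sketch might look like the following (the Zenodo record URL and archive name below are placeholders; use the actual link above):

```python
import io
import urllib.request
import zipfile
from pathlib import Path

# Placeholder URL: substitute the actual Zenodo record linked above.
SILD_URL = "https://zenodo.org/record/<record-id>/files/sild.zip"

# Unpack the archive into the data/sild/ directory.
dest = Path("data/sild")
dest.mkdir(parents=True, exist_ok=True)
with urllib.request.urlopen(SILD_URL) as resp:
    zipfile.ZipFile(io.BytesIO(resp.read())).extractall(dest)
```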
To run the experiments in the paper, run the following commands:
bash ./experiments/md/pretrain.slurm # continue pretraining PLMs on S44k
bash ./experiments/md/train_linear.sh # train linear classifiers on SILD
bash ./experiments/md/train_linear_da.sh # train linear classifiers using data augmentation
bash ./experiments/md/train_plms.sh # fine-tune PLMs on SILD
bash ./experiments/md/train_plms_da.sh # fine-tune PLMs using data augmentation
bash ./experiments/md/train_knn.sh # train kNN on SILD
bash ./experiments/md/eval_rac.sh # combine the best PLM with the best kNN
bash ./experiments/md/eval_icl.sh # evaluate In-Context Learning
bash ./experiments/ed/eval_bm25.sh # evaluate BM25 (see the sketch after this list)
bash ./experiments/ed/eval_plms.sh # evaluate PLMs (including sentence transformers)
bash ./experiments/ed/train_sosse.sh # train SoSSE models by fine-tuning sentence-transformers on GSIM and LLM-Gen
bash ./experiments/ed/eval_sosse.sh # evaluate SoSSE models
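As a point of reference, the BM25 baseline ranks candidate survey items by lexical overlap with a mention. Here is a minimal sketch using the rank_bm25 package (an assumption for illustration; eval_bm25.sh may use a different implementation and corpus):

```python
from rank_bm25 import BM25Okapi

# Toy item bank; the real experiments index the full set of survey items.
items = [
    "How satisfied are you with your life as a whole?",
    "How often do you read a daily newspaper?",
]
bm25 = BM25Okapi([item.lower().split() for item in items])

# Rank the items for a mention extracted from a publication.
mention = "satisfaction with life as a whole"
scores = bm25.get_scores(mention.lower().split())
print(max(zip(scores, items)))  # the best-matching survey item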
Important
The models will be uploaded to HuggingFace Hub soon!
You can download multilingual pretrained models for the social science domain:
- SSOAR-XLM-R-base is pre-trained on S44k using masked language modeling (MLM), a batch size of 8, and a sequence length of 512 tokens.
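Once released, the checkpoint should load like any XLM-R model. A minimal fill-mask sketch, using the public xlm-roberta-base checkpoint as a stand-in until the final Hub ID is announced:

```python
from transformers import pipeline

# Swap in the SSOAR-XLM-R-base Hub ID once the model is uploaded.
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# XLM-R's mask token is <mask>.
for pred in fill_mask("Respondents reported their <mask> with democracy."):
    print(pred["token_str"], round(pred["score"], 3))
```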
You can download multilingual models fine-tuned for MD on SILD (batch size 32, sequence length 64) here:
- XLM-R-base-SILD is fine-tuned on SILD using ... .
- XLM-R-large-SILD is fine-tuned on SILD using ... .
- SSOAR-XLM-R-base-SILD is pre-trained on S44k and then fine-tuned on SILD using ... .
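A minimal inference sketch for these MD classifiers, assuming a binary sequence-classification head; the Hub ID below is a placeholder until the models are uploaded:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "<org>/SSOAR-XLM-R-base-SILD"  # placeholder Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# The fine-tuning setup used a sequence length of 64 tokens.
inputs = tokenizer(
    "We measure life satisfaction with a single survey item.",
    truncation=True,
    max_length=64,
    return_tensors="pt",
)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # probability that the sentence mentions a survey variable
```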
You can download multilingual models fine-tuned for ED on LLM-Gen (batch size 1024, sequence length 512) here:
- SoSSE-mE5-base is fine-tuned on LLM-Gen using ... .
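A minimal retrieval sketch with sentence-transformers; the base checkpoint intfloat/multilingual-e5-base serves as a stand-in until SoSSE-mE5-base is published, and the "query:"/"passage:" prefixes follow standard E5 usage:

```python
from sentence_transformers import SentenceTransformer, util

# Swap in the SoSSE-mE5-base Hub ID once the model is uploaded.
model = SentenceTransformer("intfloat/multilingual-e5-base")

mentions = ["query: satisfaction with democracy"]
items = [
    "passage: How satisfied are you with the way democracy works in your country?",
    "passage: How often do you use the internet?",
]
scores = util.cos_sim(model.encode(mentions), model.encode(items))
print(scores)  # higher cosine similarity = better match
```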
Our models achieve the following performance. MD on SILD (F1-binary):
| Model name | F1-binary (English) | F1-binary (German) | F1-binary (Total) |
|---|---|---|---|
| XLM-R-base-SILD | 58.5% | 53.9% | 57.1% |
| SSOAR-XLM-R-base-SILD | 60.7% | 61.8% | 61.0% |
| XLM-R-large-SILD | 61.4% | 65.1% | 62.6% |
ED (MAP@10):

| Model name | MAP@10 (English) | MAP@10 (German) |
|---|---|---|
| mE5-base (baseline) | 57.9% | 65.6% |
| SoSSE-mE5-base | 63.2% | 68.1% |
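For clarity, MAP@10 is the mean over queries of the average precision computed on the top-10 ranked items. A minimal sketch of the standard definition (not necessarily the paper's exact evaluation script):

```python
def average_precision_at_k(ranked_relevance: list, k: int = 10) -> float:
    """Average precision over the top-k ranked items (True = relevant)."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance[:k], start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalize by the number of relevant items in the judged list, capped at k.
    return precision_sum / max(min(sum(ranked_relevance), k), 1)

# Two queries, each with relevance judgements for its top-10 ranked items.
queries = [
    [True, False, True] + [False] * 7,
    [False, True] + [False] * 8,
]
map_at_10 = sum(average_precision_at_k(q) for q in queries) / len(queries)
print(f"MAP@10 = {map_at_10:.3f}")  # 0.667 for this toy example
```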
Dataset licensing information can be found in the respective dataset directories (SILD, GSIM, LLM-Gen, and S44k). This work (including the models and the annotations) is licensed under CC BY 4.0.