Enriching Social Science Research via Survey Item Linking (SIL)

This repository is the official implementation of Enriching Social Science Research via Survey Item Linking (2024).

A figure showing the pipeline for Survey Item Linking

Requirements

To install requirements, use either poetry or pip:

poetry install
poetry install --only data_s44k # if you want to reproduce the S44k dataset
poetry install --only data_gsim # if you want to reproduce the GSIM dataset
pip install -r requirements/requirements.txt
pip install -r requirements/data_s44k.txt # if you want to reproduce the S44k dataset
pip install -r requirements/data_gsim.txt # if you want to reproduce the GSIM dataset

Important

After installing the requirements, the supplementary datasets (GSIM, LLM-Gen, and S44k) can be re-created by following the instructions for each. SILD is the exception: it is archived on Zenodo and should be downloaded and placed into the /data/sild/ directory.

Experiments

To run the experiments in the paper, run the following commands:

bash ./experiments/md/pretrain.slurm  # continue pretraining PLMs on S44k
bash ./experiments/md/train_linear.sh  # train linear classifiers on SILD
bash ./experiments/md/train_linear_da.sh  # train linear classifiers using data augmentation
bash ./experiments/md/train_plms.sh  # fine-tune PLMs on SILD
bash ./experiments/md/train_plms_da.sh  # fine-tune PLMs using data augmentation
bash ./experiments/md/train_knn.sh  # train kNN on SILD
bash ./experiments/md/eval_rac.sh  # combine the best PLM w/ the best kNN
bash ./experiments/md/eval_icl.sh  # evaluate In-Context Learning
bash ./experiments/ed/eval_bm25.sh  # evaluate BM25
bash ./experiments/ed/eval_plms.sh  # evaluate PLMs (including sentence transformers)
bash ./experiments/ed/train_sosse.sh  # train SoSSE models by fine-tuning sentence-transformers on GSIM and LLM-Gen
bash ./experiments/ed/eval_sosse.sh  # evaluate SoSSE models
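
The commands above can also be driven from a small script. The following is a minimal sketch, not part of the repository: it assumes the scripts are meant to run in the order listed, from the repository root, and the names `MD_SCRIPTS`, `ED_SCRIPTS`, and `run_all` are hypothetical. It defaults to a dry run that only prints the commands.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence

# Script paths copied from the README, in the order they are listed there.
# Running them in this order is an assumption, not something the README states.
MD_SCRIPTS = [
    "experiments/md/pretrain.slurm",
    "experiments/md/train_linear.sh",
    "experiments/md/train_linear_da.sh",
    "experiments/md/train_plms.sh",
    "experiments/md/train_plms_da.sh",
    "experiments/md/train_knn.sh",
    "experiments/md/eval_rac.sh",
    "experiments/md/eval_icl.sh",
]
ED_SCRIPTS = [
    "experiments/ed/eval_bm25.sh",
    "experiments/ed/eval_plms.sh",
    "experiments/ed/train_sosse.sh",
    "experiments/ed/eval_sosse.sh",
]


def run_all(
    scripts: Sequence[str],
    root: Path = Path("."),
    dry_run: bool = True,
) -> List[List[str]]:
    """Run each script with bash, stopping at the first failure.

    With dry_run=True the commands are only printed, never executed.
    Returns the list of commands that were built.
    """
    commands = [["bash", str(root / script)] for script in scripts]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    run_all(MD_SCRIPTS + ED_SCRIPTS)
```

Pass `dry_run=False` to actually execute the scripts. Stopping at the first failure is a deliberate choice here, since later steps (for example `eval_sosse.sh` after `train_sosse.sh`) presumably depend on earlier ones.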

Models

Important

The models will be uploaded to the Hugging Face Hub soon!

You can download multilingual pretrained models for the social science domain:

  • SSOAR-XLM-R-base is pre-trained on S44k using masked language modeling (MLM), a batch size of 8, and a sequence length of 512 tokens.

You can download multilingual fine-tuned models for MD on SILD, which used a batch size of 32 and a sequence length of 64, here:

You can download multilingual fine-tuned models for ED on LLM-Gen, which used a batch size of 1024 and a sequence length of 512, here:

Results

Our models achieve the following performance:

| Model name | F1-binary (English) | F1-binary (German) | F1-binary (Total) |
|---|---|---|---|
| XLM-R-base-SILD | 58.5% | 53.9% | 57.1% |
| SSOAR-XLM-R-base-SILD | 60.7% | 61.8% | 61.0% |
| XLM-R-large-SILD | 61.4% | 65.1% | 62.6% |

| Model name | MAP@10 (English) | MAP@10 (German) |
|---|---|---|
| mE5-base (baseline) | 57.9% | 65.6% |
| SoSSE-mE5-base | 63.2% | 68.1% |
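
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how F1-binary and MAP@10 are commonly computed. It is an illustration under stated assumptions, not the repository's evaluation code: the function names are hypothetical, F1-binary is taken to be the F1 of the positive class, and AP@k is normalised by min(number of relevant items, k), which is one of several conventions in use. Rankings are assumed to contain no duplicate items.

```python
from typing import List, Sequence, Set


def binary_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 score of the positive class (label 1) for binary labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_precision_at_k(
    ranked: Sequence[str], relevant: Set[str], k: int = 10
) -> float:
    """AP@k for a single query.

    Sums precision at every rank (up to k) where a relevant item appears,
    then divides by min(len(relevant), k). This normalisation is an assumption.
    """
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k)


def map_at_k(
    rankings: Sequence[Sequence[str]], relevants: Sequence[Set[str]], k: int = 10
) -> float:
    """Mean of AP@k over all queries."""
    scores: List[float] = [
        average_precision_at_k(ranked, relevant, k)
        for ranked, relevant in zip(rankings, relevants)
    ]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy labels and rankings, invented purely for illustration.
    print(binary_f1([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
    print(map_at_k([["a", "b", "c", "d"], ["x", "y"]], [{"a", "c"}, {"y"}]))
```

In the toy example, the first query has relevant items at ranks 1 and 3 and the second at rank 2, so the ranking-based score rewards placing the correct items near the top rather than merely including them.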

Licensing Information

Dataset licensing can be found under the respective directories (SILD, GSIM, LLM-Gen, or S44k). This work (including the models and the annotations) is licensed under CC BY 4.0.
