This repository is the official implementation of *Enriching Social Science Research via Survey Item Linking* (2024).
To install requirements, use either poetry or pip:
poetry install
poetry install --only data_s44k # if you want to reproduce the S44k dataset
poetry install --only data_gsim # if you want to reproduce the GSIM dataset
pip install -r requirements/requirements.txt
pip install -r requirements/data_s44k.txt # if you want to reproduce the S44k dataset
pip install -r requirements/data_gsim.txt # if you want to reproduce the GSIM dataset
Important
After installing the requirements, the supplementary datasets can be re-created by following the instructions for each dataset (GSIM, LLM-Gen, or S44k). The exception is SILD, which is archived on Zenodo: download it from here and place it into the /data/sild/ directory.
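If you prefer to script the SILD download, a minimal Python sketch might look like the following (the Zenodo record URL and archive name below are placeholders; use the actual link above):

```python
import io
import urllib.request
import zipfile
from pathlib import Path

# Placeholder URL: substitute the actual Zenodo record linked above.
SILD_URL = "https://zenodo.org/record/<record-id>/files/sild.zip"

# Unpack the archive into the data/sild/ directory.
dest = Path("data/sild")
dest.mkdir(parents=True, exist_ok=True)
with urllib.request.urlopen(SILD_URL) as resp:
    zipfile.ZipFile(io.BytesIO(resp.read())).extractall(dest)
```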
To run the experiments in the paper, run the following commands:
bash ./experiments/md/pretrain.slurm # continue pretraining PLMs on S44k
bash ./experiments/md/train_linear.sh # train linear classifiers on SILD
bash ./experiments/md/train_linear_da.sh # train linear classifiers using data augmentation
bash ./experiments/md/train_plms.sh # fine-tune PLMs on SILD
bash ./experiments/md/train_plms_da.sh # fine-tune PLMs using data augmentation
bash ./experiments/md/train_knn.sh # train kNN on SILD
bash ./experiments/md/eval_rac.sh # combine the best PLM with the best kNN
bash ./experiments/md/eval_icl.sh # evaluate In-Context Learning
bash ./experiments/ed/eval_bm25.sh # evaluate BM25 (see the sketch after this list)
bash ./experiments/ed/eval_plms.sh # evaluate PLMs (including sentence transformers)
bash ./experiments/ed/train_sosse.sh # train SoSSE models by fine-tuning sentence-transformers on GSIM and LLM-Gen
bash ./experiments/ed/eval_sosse.sh # evaluate SoSSE models
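As a point of reference, the BM25 baseline ranks candidate survey items by lexical overlap with a mention. Here is a minimal sketch using the rank_bm25 package (an assumption for illustration; eval_bm25.sh may use a different implementation and corpus):

```python
from rank_bm25 import BM25Okapi

# Toy item bank; the real experiments index the full set of survey items.
items = [
    "How satisfied are you with your life as a whole?",
    "How often do you read a daily newspaper?",
]
bm25 = BM25Okapi([item.lower().split() for item in items])

# Rank the items for a mention extracted from a publication.
mention = "satisfaction with life as a whole"
scores = bm25.get_scores(mention.lower().split())
print(max(zip(scores, items)))  # the best-matching survey item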
Important
The models will be uploaded to HuggingFace Hub soon!
You can download multilingual pretrained models for the social science domain:
- SSOAR-XLM-R-base is pre-trained on S44k using masked language modeling (MLM), a batch size of 8, and a sequence length of 512 tokens.
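Once released, the checkpoint should load like any XLM-R model. A minimal fill-mask sketch, using the public xlm-roberta-base checkpoint as a stand-in until the final Hub ID is announced:

```python
from transformers import pipeline

# Swap in the SSOAR-XLM-R-base Hub ID once the model is uploaded.
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# XLM-R's mask token is <mask>.
for pred in fill_mask("Respondents reported their <mask> with democracy."):
    print(pred["token_str"], round(pred["score"], 3))
```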
You can download multilingual models fine-tuned for MD on SILD (batch size 32, sequence length 64) here:
- XLM-R-base-SILD is fine-tuned on SILD using ... .
- XLM-R-large-SILD is fine-tuned on SILD using ... .
- SSOAR-XLM-R-base-SILD is pre-trained on S44k and then fine-tuned on SILD using ... .
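A minimal inference sketch for these MD classifiers, assuming a binary sequence-classification head; the Hub ID below is a placeholder until the models are uploaded:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "<org>/SSOAR-XLM-R-base-SILD"  # placeholder Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# The fine-tuning setup used a sequence length of 64 tokens.
inputs = tokenizer(
    "We measure life satisfaction with a single survey item.",
    truncation=True,
    max_length=64,
    return_tensors="pt",
)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # probability that the sentence mentions a survey variable
```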
You can download multilingual models fine-tuned for ED on LLM-Gen (batch size 1024, sequence length 512) here:
- SoSSE-mE5-base is fine-tuned on LLM-Gen using ... .
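A minimal retrieval sketch with sentence-transformers; the base checkpoint intfloat/multilingual-e5-base serves as a stand-in until SoSSE-mE5-base is published, and the "query:"/"passage:" prefixes follow standard E5 usage:

```python
from sentence_transformers import SentenceTransformer, util

# Swap in the SoSSE-mE5-base Hub ID once the model is uploaded.
model = SentenceTransformer("intfloat/multilingual-e5-base")

mentions = ["query: satisfaction with democracy"]
items = [
    "passage: How satisfied are you with the way democracy works in your country?",
    "passage: How often do you use the internet?",
]
scores = util.cos_sim(model.encode(mentions), model.encode(items))
print(scores)  # higher cosine similarity = better match
```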
Our models achieve the following performance. MD on SILD (F1-binary):
| Model name | F1-binary (English) | F1-binary (German) | F1-binary (Total) |
|---|---|---|---|
| XLM-R-base-SILD | 58.5% | 53.9% | 57.1% |
| SSOAR-XLM-R-base-SILD | 60.7% | 61.8% | 61.0% |
| XLM-R-large-SILD | 61.4% | 65.1% | 62.6% |
ED (MAP@10):

| Model name | MAP@10 (English) | MAP@10 (German) |
|---|---|---|
| mE5-base (baseline) | 57.9% | 65.6% |
| SoSSE-mE5-base | 63.2% | 68.1% |
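For clarity, MAP@10 is the mean over queries of the average precision computed on the top-10 ranked items. A minimal sketch of the standard definition (not necessarily the paper's exact evaluation script):

```python
def average_precision_at_k(ranked_relevance: list, k: int = 10) -> float:
    """Average precision over the top-k ranked items (True = relevant)."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance[:k], start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalize by the number of relevant items in the judged list, capped at k.
    return precision_sum / max(min(sum(ranked_relevance), k), 1)

# Two queries, each with relevance judgements for its top-10 ranked items.
queries = [
    [True, False, True] + [False] * 7,
    [False, True] + [False] * 8,
]
map_at_10 = sum(average_precision_at_k(q) for q in queries) / len(queries)
print(f"MAP@10 = {map_at_10:.3f}")  # 0.667 for this toy example
```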
Dataset licensing information can be found in the respective dataset directories (SILD, GSIM, LLM-Gen, and S44k). This work (including the models and the annotations) is licensed under CC BY 4.0.