Language Proficiency Classification

Introduction

The accurate assessment of second language (L2) proficiency has always been important in education and hiring. Traditionally, this has been done through standardized tests, such as TOEFL and IELTS, and human evaluations. However, these are often costly, time-consuming and highly subjective.

While BERT-based models show high performance on the standard EFCamDat benchmark for L2 English writing proficiency assessment, this project identifies critical dataset flaws, namely topic bias and class imbalance, that cause models to learn topic classification as a shortcut rather than to assess true linguistic competence.

We evaluate models on a cleaner, class-balanced EFCamDat test set whose topics are held out from the train set. On this fairer benchmark, we find that a simple RNN trained on POS tags outperforms a fine-tuned BERT. Additionally, this project explores ways to shore up BERT's weaknesses on the language proficiency classification task, including training on balanced classes and complementing it with a set of classical linguistic features.

Existing Work & Limitations

Most existing research on L2 English writing proficiency bases its proficiency levels on the Common European Framework of Reference for Languages (CEFR), a widely adopted standard that describes language proficiency on a six-level scale from A1 to C2, and builds on the EFCamDat dataset.

The EFCamDat dataset is one of the largest and most frequently cited publicly available datasets for L2 English research. It contains over 1.18 million texts submitted to Education First (EF)'s online English school by 174,000+ L2 English learners of various nationalities. These texts are responses to writing tasks at the end of each course, covering topics like “writing a resume” and “giving budgeting advice”. Based on their initial placement test results or through successful course progression, learners were assigned to proficiency levels on a 16-level scale, which can be mapped to CEFR levels.
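For reference, a commonly cited mapping from EF's 16 internal levels to CEFR bands (per Geertzen et al., 2014) looks roughly like the sketch below; the exact mapping used in this project should be confirmed against the EFCamDat documentation.

```python
# Illustrative only: a commonly cited EF-level-to-CEFR mapping; confirm against
# the official EFCamDat documentation before relying on it.
EF_TO_CEFR = {
    **{lvl: "A1" for lvl in range(1, 4)},    # EF levels 1-3
    **{lvl: "A2" for lvl in range(4, 7)},    # EF levels 4-6
    **{lvl: "B1" for lvl in range(7, 10)},   # EF levels 7-9
    **{lvl: "B2" for lvl in range(10, 13)},  # EF levels 10-12
    **{lvl: "C1" for lvl in range(13, 16)},  # EF levels 13-15
    16: "C2",                                # EF level 16
}

def to_cefr(ef_level: int) -> str:
    """Map an EF level (1-16) to its CEFR band."""
    return EF_TO_CEFR[ef_level]
```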

State-of-the-art (SOTA) research, such as Sánchez et al. (2024), has studied the performance of transformer-based architectures like BERT, achieving remarkable F1 scores above 0.9 on the EFCamDat dataset.

However, this project identifies major challenges with the EFCamDat dataset that call into question the usefulness of BERT-based models on advanced proficiency levels and out-of-domain writing topics. In our investigation, vanilla BERT-based models perform extremely well on EFCamDat levels because of the dataset's heavy bias towards lower proficiency levels and the one-to-one correlation between proficiency level and writing topic. After adjusting for the class imbalance and topic bias, our baseline BERT model performs poorly on the task, attaining a macro F1 score of only about 0.4.
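Macro F1 is the right lens here because it averages per-class F1 scores, so a rare level like C1 counts as much as a frequent one like A1. A toy example with scikit-learn (assuming it is installed; the project's evaluation code may differ):

```python
from sklearn.metrics import f1_score

# Toy example: the model nails the majority class (A1) but misses the rare
# class (C1), so macro F1 drops even though overall accuracy looks fine.
y_true = ["A1", "A1", "A1", "A1", "B1", "C1"]
y_pred = ["A1", "A1", "A1", "A1", "B1", "B1"]
print(f1_score(y_true, y_pred, average="macro"))  # averages F1 over A1, B1, C1
```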

Addressing Limitations

To address the problem of class imbalance, we carved out a test set (BalSeen) that is evenly distributed by class, in contrast to the original stratified test set (StratSeen), which preserves the inherent class imbalance across CEFR levels.

To address the concern that BERT-based models are learning topic classification rather than language proficiency classification, we intentionally held out 50 writing topics to create a third test set (BalUnseen) with equally balanced classes and unseen topics. All texts from these topics were excluded from the train set, serving as the most robust measure of the model's ability to generalize across proficiency levels and topics.
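A minimal sketch of how such a split could be constructed with pandas is shown below; the column names (text, cefr, topic), random seed, and file paths are assumptions for illustration, not the project's actual implementation (see split_dataset.ipynb).

```python
import pandas as pd

# Assumed column names (text, cefr, topic); the actual notebooks may differ.
df = pd.read_csv("data/texts_clean.csv")
rng = 42

# 1. Hold out 50 topics entirely, so none of their texts appear in training.
unseen_topics = df["topic"].drop_duplicates().sample(50, random_state=rng)
unseen_pool = df[df["topic"].isin(unseen_topics)]
seen_pool = df[~df["topic"].isin(unseen_topics)]

# 2. BalUnseen: 2,000 texts per CEFR level, drawn only from held-out topics.
bal_unseen = (unseen_pool.groupby("cefr", group_keys=False)
                         .apply(lambda g: g.sample(2000, random_state=rng)))

# 3. BalSeen: 2,000 texts per CEFR level from the remaining (seen) topics.
bal_seen = (seen_pool.groupby("cefr", group_keys=False)
                     .apply(lambda g: g.sample(2000, random_state=rng)))

# 4. Training data: seen-topic texts not used in the BalSeen test set.
#    (StratSeen would be sampled analogously, stratified by the original distribution.)
train = seen_pool.drop(bal_seen.index)
```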

We used Shatz's (2020) cleaner, refined version of the EFCamDat dataset, which contains 406,000+ texts. It excludes problematic texts (duplicates, ultra-short texts, non-English texts, and word-count outliers) and narrows the corpus down to texts from the top 11 nationalities with accurate topic data.

The following table summarizes the number of texts for each CEFR level and unique topics for each test set. Note that C2 is omitted due to lack of representation in the EFCamDat dataset.

| Level  | StratSeen | BalSeen   | BalUnseen   |
|--------|-----------|-----------|-------------|
| A1     | 4,807     | 2,000     | 2,000       |
| A2     | 3,442     | 2,000     | 2,000       |
| B1     | 1,209     | 2,000     | 2,000       |
| B2     | 434       | 2,000     | 2,000       |
| C1     | 108       | 2,000     | 2,000       |
| Total  | 10,000    | 10,000    | 10,000      |
| Topics | 69 (Seen) | 70 (Seen) | 50 (Unseen) |

Our Experiments & Results

We experimented with:

  • Fine-tuning BERT on both the original class distribution and balanced classes
  • Varying the training data size (1K vs 4K samples)
  • Training a GRU-based RNN on POS tags (POS-RNN)
  • Training a fused model of BERT and POS-RNN (BERT-POS-RNN); a minimal architecture sketch of the POS-RNN and this fused model follows this list
  • Adding linguistic features to BERT and POS-RNN
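One plausible way to structure the POS-RNN and the fused BERT-POS-RNN in PyTorch is sketched below. The layer sizes, the bert-base-uncased checkpoint, and the late-fusion design (concatenating BERT's pooled output with the GRU's final hidden states) are assumptions for illustration; the training notebooks define the actual architectures.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class PosRnn(nn.Module):
    """GRU over sequences of POS-tag IDs (the POS-RNN baseline)."""
    def __init__(self, n_tags: int, n_classes: int = 5, emb_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(n_tags, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, pos_ids):                 # pos_ids: (batch, seq_len)
        _, h = self.gru(self.emb(pos_ids))      # h: (2, batch, hidden)
        return self.head(torch.cat([h[0], h[1]], dim=-1))

class BertPosRnnFusion(nn.Module):
    """Late fusion: concatenate BERT's pooled output with the POS-RNN state."""
    def __init__(self, n_tags: int, n_classes: int = 5, hidden: int = 128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.pos_emb = nn.Embedding(n_tags, 32, padding_idx=0)
        self.gru = nn.GRU(32, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(self.bert.config.hidden_size + 2 * hidden, n_classes)

    def forward(self, input_ids, attention_mask, pos_ids):
        text_vec = self.bert(input_ids, attention_mask=attention_mask).pooler_output
        _, h = self.gru(self.pos_emb(pos_ids))
        pos_vec = torch.cat([h[0], h[1]], dim=-1)
        return self.head(torch.cat([text_vec, pos_vec], dim=-1))
```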

The following 18 linguistic features were explored (a short computation sketch follows the table):

| Category             | Feature                       |
|----------------------|-------------------------------|
| Syntactic complexity | Average sentence length       |
|                      | Subordination ratio           |
|                      | Average dependency distance   |
|                      | Clause complexity ratio       |
|                      | Verb phrase complexity        |
|                      | Modal verb ratio              |
|                      | Noun ratio                    |
|                      | Verb ratio                    |
| Lexical diversity    | Type-token ratio              |
|                      | POS entropy                   |
|                      | POS diversity                 |
|                      | Rare word ratio               |
|                      | Average word length           |
|                      | Lexical sophistication index  |
|                      | Information density           |
| Discourse coherence  | Connective density            |
|                      | Coherence device density      |
| Accuracy             | Number of spelling errors     |
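As a rough illustration of how a handful of these features could be computed (assuming spaCy and its en_core_web_sm model are available; the project's actual feature extraction code may use different tools and definitions):

```python
import math
from collections import Counter

import spacy  # assumes: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def linguistic_features(text: str) -> dict:
    """Compute a few of the features listed above; the others follow the same pattern."""
    doc = nlp(text)
    words = [t for t in doc if t.is_alpha]
    sents = list(doc.sents)
    pos_counts = Counter(t.pos_ for t in words)
    total = sum(pos_counts.values()) or 1

    return {
        # Syntactic complexity
        "avg_sentence_length": len(words) / max(len(sents), 1),
        # Lexical diversity
        "type_token_ratio": len({t.lower_ for t in words}) / max(len(words), 1),
        "avg_word_length": sum(len(t.text) for t in words) / max(len(words), 1),
        "pos_diversity": len(pos_counts),
        "pos_entropy": -sum((c / total) * math.log2(c / total) for c in pos_counts.values()),
    }

print(linguistic_features("I am writing to apply for the position. I have worked in sales for two years."))
```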

The following table summarizes the results of our experiments on the three test sets. The best scores for each test set are highlighted in bold.

| Model                            | Training Data              | StratSeen (Macro F1) | BalSeen (Macro F1) | BalUnseen (Macro F1) |
|----------------------------------|----------------------------|----------------------|--------------------|----------------------|
| BERT                             | 1K, original distribution  | 0.730                | 0.660              | 0.404                |
| BERT                             | 4K, original distribution  | 0.941                | 0.914              | 0.400                |
| BERT                             | 1K, balanced classes       | 0.959                | 0.972              | 0.514                |
| BERT                             | 4K, balanced classes       | **0.979**            | **0.987**          | 0.475                |
| BERT with linguistic features    | 1K, balanced classes       | 0.924                | 0.957              | 0.580                |
| POS-RNN                          | 1K, balanced classes       | 0.686                | 0.769              | 0.584                |
| POS-RNN                          | 4K, balanced classes       | 0.794                | 0.860              | **0.624**            |
| POS-RNN with linguistic features | 4K, balanced classes       | 0.874                | 0.860              | 0.604                |
| BERT-POS-RNN                     | 1K, balanced classes       | 0.933                | 0.968              | 0.568                |

Key Findings

BERT learns topic classification, not language proficiency

Vanilla BERT significantly underperforms on unseen topics, achieving only 0.4-0.5 macro F1 compared to 0.9+ on seen topics.

Balanced classes improve BERT generalization

Training on balanced classes consistently improves BERT's performance across all test sets.

Linguistic features help BERT focus on language proficiency

Adding linguistic features improved BERT's performance on unseen topics from 0.514 to 0.580 macro F1.

More training data hurts BERT's generalization

Increasing training size from 1K to 4K samples worsened BERT's performance on unseen topics (0.514 → 0.475 macro F1).

Simple POS-RNN outperforms BERT

A basic RNN trained on POS tags achieves 0.624 macro F1 on unseen topics versus BERT's best of 0.580.

Fusion shows mixed results

The BERT-POS-RNN combination underperforms standalone POS-RNN, with linguistic signals being "drowned out" by BERT's topic bias.

Replicating the Results

Setting Up the Environment

Install uv.

Then, run the following commands:

uv sync
source .venv/bin/activate

Either run the notebooks directly in Visual Studio Code or use Jupyter Lab:

uv run jupyter lab 

Accessing the Dataset

You will first need to request official access to the EFCamDat dataset from the EFCamDat website. Note that an academic affiliation and access to Google Drive are required to use the corpus.

Once you have access, locate the specific dataset Final database (main prompts).xlsx (under Cleaned_Subcorpus (Shatz, 2020)) and place it in the data/ directory of this project.

Doing Exploratory Data Analysis (EDA)

Run the notebook explore_efcamdat.ipynb. Additionally, analyze_ling_feats.ipynb does a correlation analysis of the linguistic features on the train dataset to identify the redundant ones.

Preprocessing the Dataset

Run the notebooks in the following order:

  1. clean_dataset.ipynb cleans the original dataset and exports the cleaned data to data/texts_clean.csv
  2. split_dataset.ipynb reads from data/texts_clean.csv and splits it into train and test datasets:
  • data/texts_train.csv as the train data
  • data/texts_test.csv as the test data that is class-balanced with unseen topics (BalUnseen)
  • data/texts_test_seen.csv as the test data that is class-balanced with seen topics (BalSeen)
  • data/texts_test_stratified.csv as the test data that is stratified with seen topics (StratSeen)

Training & Evaluating the Models

Run any of the following notebooks (a minimal, illustrative fine-tuning sketch follows this list):

  • train_bert.ipynb
  • train_bert-4k.ipynb
  • train_bert-even.ipynb
  • train_bert-even-4k.ipynb
  • train_rnn-even.ipynb
  • train_rnn-even-4k.ipynb
  • train_bert-even-rnn.ipynb
  • train_rnn-even-4k-ling_feat.ipynb
  • train_bert-even-ling_feat.ipynb
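For orientation, the sketch below shows a minimal BERT fine-tuning setup using the Hugging Face Trainer. The column names, hyperparameters, and choice of bert-base-uncased are assumptions; the notebooks above contain the actual training configurations.

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["A1", "A2", "B1", "B2", "C1"]           # C2 is absent from the dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

def load(path):
    df = pd.read_csv(path)                         # assumed columns: text, cefr
    df["label"] = df["cefr"].map(LABELS.index)     # CEFR band -> integer class id
    ds = Dataset.from_pandas(df[["text", "label"]])
    return ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512,
                                      padding="max_length"), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=load("data/texts_train.csv"),
    eval_dataset=load("data/texts_test.csv"),      # BalUnseen split
)
trainer.train()
print(trainer.evaluate())
```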

Credits & Disclaimers

Dataset Attribution

This project uses the EF-Cambridge Open Language Database (EFCamDat), specifically the cleaned subcorpus created by Shatz (2020). We acknowledge and thank the EF Research Lab, Cambridge TAL and the EF Education First Group of Companies.

The EFCamDat dataset is subject to separate licensing terms and usage restrictions. This repository's MIT license applies only to the source code and analysis scripts.

Neither Cambridge TAL nor EF Education First Group of Companies bear any responsibility for the analysis or interpretation presented in this repository.

References

Geertzen, J., Alexopoulou, T., & Korhonen, A. (2014). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCamDat). In R.T. Millar, K.I. Martin, C.M. Eddington, A. Henery, N.M. Miguel, & A. Tseng (Eds.), Selected proceedings of the 2012 Second Language Research Forum (pp. 240–254). Somerville, MA: Cascadilla Proceedings Project.

Huang, Y., Geertzen, J., Baker, R., Korhonen, A., & Alexopoulou, T. (2017). The EF Cambridge Open Language Database (EFCAMDAT): Information for users (pp. 1–18). Retrieved from https://ef-lab.mmll.cam.ac.uk/EFCAMDAT.html

Sánchez, R. M., Alfter, D., Dobnik, S., Szawerna, M., & Volodina, E. (2024, October). Jingle BERT, Jingle BERT, Frozen All the Way: Freezing Layers to Identify CEFR Levels of Second Language Learners Using BERT. In Swedish Language Technology Conference and NLP4CALL (pp. 137–152).

Shatz, I. (2020). Refining and modifying the EFCAMDAT: Lessons from creating a new corpus from an existing large-scale English learner language database. International Journal of Learner Corpus Research, 6(2), 220–236. doi:10.1075/ijlcr.20009.sha
