- Introduction
- Existing Work & Limitations
- Addressing Limitations
- Our Experiments & Results
- Key Findings
- Replicating the Results
- Credits & Disclaimers
The accurate assessment of second language (L2) proficiency has always been important in education and hiring. Traditionally, this has been done through standardized tests, such as TOEFL and IELTS, and human evaluations. However, these methods are often costly, time-consuming, and highly subjective.
While the standard EFCamDat benchmark shows high performance for BERT-based models on L2 English writing proficiency assessment, this project identifies critical dataset flaws, namely topic bias and class imbalance, that cause models to learn topic classification as a shortcut rather than assess true linguistic competence.
We evaluate models on a cleaner, class-balanced EFCamDat test set whose topics are held out from the training set. On this fairer benchmark, we find that a simple RNN trained on POS tags outperforms a fine-tuned BERT. Additionally, this project explores ways to shore up BERT’s weakness on the language proficiency classification task, including training on balanced classes and complementing it with a set of classical linguistic features.
Most existing research on L2 English writing proficiency bases its proficiency levels on the Common European Framework of Reference for Languages (CEFR), a widely adopted standard that describes language proficiency on a six-level scale from A1 to C2, and draws on the EFCamDat dataset.
The EFCamDat dataset is one of the largest and most frequently cited publicly available datasets for L2 English research. It contains over 1.18 million texts submitted to Education First (EF)’s online English school by 174,000+ L2 English learners of various nationalities. These texts are writing tasks completed at the end of each course, covering topics like “writing a resume” and “giving budgeting advice”. Learners were assigned to proficiency levels on a 16-level scale, based on their initial placement test results or successful course progression, and these levels can be mapped to CEFR levels.
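For context, a minimal sketch of that mapping, assuming the commonly cited alignment of EF levels 1–3 → A1, 4–6 → A2, 7–9 → B1, 10–12 → B2, 13–15 → C1, and 16 → C2 (the exact mapping used in our notebooks should be checked before relying on this):

```python
# Hedged sketch: map EFCamDat's 16-level EF scale to CEFR bands.
# The 1-3 / 4-6 / ... grouping is an assumption based on the commonly cited
# EFCamDat documentation, not necessarily the mapping used in this repo.
def ef_to_cefr(ef_level: int) -> str:
    bands = ["A1", "A2", "B1", "B2", "C1", "C2"]
    return bands[min((ef_level - 1) // 3, 5)]

assert ef_to_cefr(1) == "A1" and ef_to_cefr(12) == "B2" and ef_to_cefr(16) == "C2"
```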
State-of-the-art (SOTA) studies have evaluated transformer-based architectures like BERT on this task and reported remarkable F1 scores above 0.9 on the EFCamDat dataset, such as the research by Sánchez et al. (2024).
However, this project identifies major challenges with the EFCamDat dataset that call into question the usefulness of BERT-based models on advanced proficiency levels and out-of-domain writing topics. In our investigation, vanilla BERT-based models perform extremely well on EFCamDat levels because of the huge bias towards lower proficiency levels and the one-to-one correlation between proficiency level and writing topic. After adjusting for the class imbalance and topic bias, our baseline BERT model performs poorly on the task, attaining a macro F1 score of only about 0.4.
To address the problem of class imbalance, we carved out a test set (BalSeen) that is evenly distributed by class, contrasting with the original, stratified test set that preserves the latent class imbalance across CEFR levels (StratSeen).
To address the concern that BERT-based models are learning topic classification rather than language proficiency classification, we intentionally held out 50 writing topics to create a third test set (BalUnseen) with equally balanced classes and unseen topics. All texts from these topics were excluded from the train set, serving as the most robust measure of the model's ability to generalize across proficiency levels and topics.
We used Shatz’s (2020) refined version of the EFCamDat dataset, which contains 406,000+ texts. It excludes problematic texts (duplicates, ultra-short texts, non-English texts, and word-count outliers) and narrows the data down to texts from the top 11 nationalities with accurate topic data.
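For concreteness, here is a minimal pandas/scikit-learn sketch of how the three test sets could be carved out. The column names `cefr` and `topic` and the exact sampling calls are our own assumptions for illustration; `split_dataset.ipynb` (described later) contains the actual split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical column names ("cefr", "topic") for illustration only.
df = pd.read_csv("data/texts_clean.csv")

# BalUnseen: hold out 50 topics entirely so their texts never reach training.
unseen_topics = df["topic"].drop_duplicates().sample(50, random_state=42)
unseen_pool = df[df["topic"].isin(unseen_topics)]
seen_pool = df[~df["topic"].isin(unseen_topics)]
bal_unseen = unseen_pool.groupby("cefr").sample(2000, random_state=42)

# BalSeen: class-balanced (2,000 texts per level) but drawn from seen topics.
bal_seen = seen_pool.groupby("cefr").sample(2000, random_state=42)

# StratSeen: 10,000 texts that preserve the natural class imbalance.
rest = seen_pool.drop(index=bal_seen.index)
_, strat_seen = train_test_split(rest, test_size=10_000,
                                 stratify=rest["cefr"], random_state=42)

# Training data: seen-topic texts not reserved for any test set.
train = rest.drop(index=strat_seen.index)
```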
The following table summarizes the number of texts for each CEFR level and unique topics for each test set. Note that C2 is omitted due to lack of representation in the EFCamDat dataset.
| Level | StratSeen | BalSeen | BalUnseen |
|---|---|---|---|
| A1 | 4,807 | 2,000 | 2,000 |
| A2 | 3,442 | 2,000 | 2,000 |
| B1 | 1,209 | 2,000 | 2,000 |
| B2 | 434 | 2,000 | 2,000 |
| C1 | 108 | 2,000 | 2,000 |
| Total | 10,000 | 10,000 | 10,000 |
| Topics | 69 (Seen) | 70 (Seen) | 50 (Unseen) |
We experimented with:
- Fine-tuning BERT on both the original class distribution and balanced classes
- Varying the training data size (1K vs 4K samples)
- Training a GRU-based RNN on POS tags (POS-RNN)
- Training a fused model of BERT and POS-RNN (BERT-POS-RNN)
- Adding linguistic features to BERT and POS-RNN
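The POS-RNN listed above is intentionally lightweight: each text is reduced to its sequence of part-of-speech tags, so the model never sees topical vocabulary at all. Below is a minimal PyTorch sketch of such a model; the embedding size, hidden size, and bidirectionality are our own illustrative choices and do not necessarily match `train_rnn-even.ipynb`.

```python
import torch
import torch.nn as nn

class POSRNN(nn.Module):
    """GRU over part-of-speech tag IDs; never sees the words themselves."""
    def __init__(self, num_tags: int, num_classes: int = 5,
                 emb_dim: int = 32, hidden_dim: int = 128):
        super().__init__()
        # Assumes tag ID 0 is reserved for padding.
        self.embed = nn.Embedding(num_tags, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)  # A1..C1

    def forward(self, tag_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(tag_ids)             # (batch, seq_len, emb_dim)
        _, h = self.gru(x)                  # h: (2, batch, hidden_dim)
        h = torch.cat([h[0], h[1]], dim=-1) # concat forward/backward final states
        return self.classifier(h)           # logits over CEFR levels
```

For the fused BERT-POS-RNN, a natural design is to concatenate BERT's pooled output with the GRU's final hidden state before the classification layer; see `train_bert-even-rnn.ipynb` for the fusion actually used.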
The following 18 linguistic features were explored:
| Category | Feature |
|---|---|
| Syntactic complexity | Average sentence length |
| Subordination ratio | |
| Average dependency distance | |
| Clause complexity ratio | |
| Verb phrase complexity | |
| Modal verb ratio | |
| Noun ratio | |
| Verb ratio | |
| Lexical diversity | Type token ratio |
| POS entropy | |
| POS diversity | |
| Rare word ratio | |
| Average word length | |
| Lexical sophistication index | |
| Information density | |
| Discourse coherence | Connective density |
| Coherence device density | |
| Accuracy | Number of spelling errors |
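To make the feature set concrete, here is a hedged sketch of how a handful of these features (average sentence length, type-token ratio, average word length, POS entropy, noun/verb ratios) can be computed. spaCy and the `en_core_web_sm` model are assumptions for illustration and may differ from the toolchain used in the notebooks.

```python
import math
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

def basic_features(text: str) -> dict:
    """Illustrative subset of the 18 features, computed from a spaCy parse."""
    doc = nlp(text)
    tokens = [t for t in doc if not t.is_space and not t.is_punct]
    sents = list(doc.sents)
    words = [t.text.lower() for t in tokens]
    pos_counts = Counter(t.pos_ for t in tokens)
    total = sum(pos_counts.values()) or 1

    return {
        "avg_sentence_length": len(tokens) / max(len(sents), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "pos_entropy": -sum((c / total) * math.log2(c / total)
                            for c in pos_counts.values()),
        "noun_ratio": pos_counts["NOUN"] / total,
        "verb_ratio": pos_counts["VERB"] / total,
    }
```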
The following table summarizes the results of our experiments on the three test sets. The best scores for each test set are highlighted in bold.
| Model | Training Data | StratSeen (Macro F1) | BalSeen (Macro F1) | BalUnseen (Macro F1) |
|---|---|---|---|---|
| BERT | 1K, original distribution | 0.730 | 0.660 | 0.404 |
| BERT | 4K, original distribution | 0.941 | 0.914 | 0.400 |
| BERT | 1K, balanced classes | 0.959 | 0.972 | 0.514 |
| BERT | 4K, balanced classes | **0.979** | **0.987** | 0.475 |
| BERT with linguistic features | 1K, balanced classes | 0.924 | 0.957 | 0.580 |
| POS-RNN | 1K, balanced classes | 0.686 | 0.769 | 0.584 |
| POS-RNN | 4K, balanced classes | 0.794 | 0.860 | **0.624** |
| POS-RNN with linguistic features | 4K, balanced classes | 0.874 | 0.860 | 0.604 |
| BERT-POS-RNN | 1K, balanced classes | 0.933 | 0.968 | 0.568 |
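For reference, macro F1 averages the per-class F1 scores with equal weight, which is why the heavy skew toward A1/A2 matters: a model can look strong overall while failing the rare C1 class. A minimal computation with scikit-learn (labels below are illustrative):

```python
from sklearn.metrics import f1_score

# y_true / y_pred are CEFR labels for a test set (illustrative values only).
y_true = ["A1", "A1", "A2", "B1", "B2", "C1"]
y_pred = ["A1", "A1", "A2", "B1", "B1", "B1"]

print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1
```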
**BERT learns topic classification, not language proficiency.**
Vanilla BERT significantly underperforms on unseen topics, achieving only 0.4–0.5 macro F1 compared to 0.9+ on seen topics.

**Balanced classes improve BERT's generalization.**
Training on balanced classes consistently improves BERT's performance across all test sets.

**Linguistic features help BERT focus on language proficiency.**
Adding linguistic features improved BERT's performance on unseen topics from 0.514 to 0.580 macro F1.

**More training data hurts BERT's generalization.**
Increasing the training size from 1K to 4K samples worsened BERT's performance on unseen topics (0.514 → 0.475 macro F1).

**A simple POS-RNN outperforms BERT.**
A basic RNN trained on POS tags achieves 0.624 macro F1 on unseen topics versus BERT's best of 0.580.

**Fusion shows mixed results.**
The BERT-POS-RNN combination underperforms the standalone POS-RNN, with the linguistic signal “drowned out” by BERT's topic bias.
Install uv.
Then, run the following commands:
```
uv sync
source .venv/bin/activate
```

Either run the notebooks directly in Visual Studio Code or use Jupyter Lab:

```
uv run jupyter lab
```

You will first need to request official access to the EFCamDat dataset from the EFCamDat website. Note that an academic affiliation and access to Google Drive are necessary to use the corpus.
Once you have access, locate the specific dataset `Final database (main prompts).xlsx` (under `Cleaned_Subcorpus (Shatz, 2020)`) and place it in the `data/` directory of this project.
Run the notebook `explore_efcamdat.ipynb`. Additionally, `analyze_ling_feats.ipynb` performs a correlation analysis of the linguistic features on the train dataset to identify redundant ones.
Run the notebooks in the following order:
1. `clean_dataset.ipynb` cleans the original dataset and exports the clean data to `data/texts_clean.csv`
2. `split_dataset.ipynb` reads from `data/texts_clean.csv` and splits it into train and test datasets:
   - `data/texts_train.csv` as the train data
   - `data/texts_test.csv` as the test data that is class-balanced with unseen topics (BalUnseen)
   - `data/texts_test_seen.csv` as the test data that is class-balanced with seen topics (BalSeen)
   - `data/texts_test_stratified.csv` as the test data that is stratified with seen topics (StratSeen)
Run any of the following notebooks:
- `train_bert.ipynb`
- `train_bert-4k.ipynb`
- `train_bert-even.ipynb`
- `train_bert-even-4k.ipynb`
- `train_rnn-even.ipynb`
- `train_rnn-even-4k.ipynb`
- `train_bert-even-rnn.ipynb`
- `train_rnn-even-4k-ling_feat.ipynb`
- `train_bert-even-ling_feat.ipynb`
This project uses the EF-Cambridge Open Language Database (EFCamDat), specifically the cleaned subcorpus created by Shatz (2020). We acknowledge and thank the EF Research Lab, Cambridge TAL and the EF Education First Group of Companies.
The EFCamDat dataset is subject to separate licensing terms and usage restrictions. This repository's MIT license applies only to the source code and analysis scripts.
Neither Cambridge TAL nor EF Education First Group of Companies bear any responsibility for the analysis or interpretation presented in this repository.
Geertzen, J., Alexopoulou, T., & Korhonen, A. (2014). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCamDat). In R.T. Millar, K.I. Martin, C.M. Eddington, A. Henery, N.M. Miguel, & A. Tseng (Eds.), Selected proceedings of the 2012 Second Language Research Forum (pp. 240–254). Somerville, MA: Cascadilla Proceedings Project.
Huang, Y., Geertzen, J., Baker, R., Korhonen, A., & Alexopoulou, T. (2017). The EF Cambridge Open Language Database (EFCAMDAT): Information for users (pp. 1–18). Retrieved from https://ef-lab.mmll.cam.ac.uk/EFCAMDAT.html
Sánchez, R. M., Alfter, D., Dobnik, S., Szawerna, M., & Volodina, E. (2024, October). Jingle BERT, Jingle BERT, Frozen All the Way: Freezing Layers to Identify CEFR Levels of Second Language Learners Using BERT. In Swedish Language Technology Conference and NLP4CALL (pp. 137–152).
Shatz, I. (2020). Refining and modifying the EFCAMDAT: Lessons from creating a new corpus from an existing large-scale English learner language database. International Journal of Learner Corpus Research, 6(2), 220–236. doi:10.1075/ijlcr.20009.sha