- Introduction
- Existing Work & Limitations
- Addressing Limitations
- Our Experiments & Results
- Key Findings
- Replicating the Results
- Credits & Disclaimers
The accurate assessment of second language (L2) proficiency has always been important in education and hiring. Traditionally, this has been done through standardized tests, such as TOEFL and IELTS, and human evaluations. However, these methods are often costly, time-consuming, and highly subjective.
While the standard EFCamDat benchmark shows high performance for BERT-based models on L2 English writing proficiency assessment, this project identifies critical dataset flaws, namely topic bias and class imbalance, that cause models to learn topic classification as a shortcut rather than assess true linguistic competence.
We evaluate models on a cleaner, class-balanced EFCamDat test set whose topics are held out from the training set. On this fairer benchmark, we find that a simple RNN trained on POS tags outperforms a fine-tuned BERT. Additionally, this project explores ways to shore up BERT’s weakness on the language proficiency classification task, including training on balanced classes and complementing it with a set of classical linguistic features.
Most existing research on L2 English writing proficiency bases its proficiency levels on the Common European Framework of Reference for Languages (CEFR), a widely adopted standard that describes language proficiency on a six-level scale from A1 to C2, and draws on the EFCamDat dataset.
The EFCamDat dataset is one of the largest and most frequently cited publicly available datasets for L2 English research. It contains over 1.18 million texts submitted to Education First (EF)’s online English school by 174,000+ L2 English learners of various nationalities. These texts are writing tasks completed at the end of each course, covering topics like “writing a resume” and “giving budgeting advice”. Learners were assigned to proficiency levels on a 16-level scale, based on their initial placement test results or successful course progression, and these levels can be mapped to CEFR levels.
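For context, a minimal sketch of that mapping, assuming the commonly cited alignment of EF levels 1–3 → A1, 4–6 → A2, 7–9 → B1, 10–12 → B2, 13–15 → C1, and 16 → C2 (the exact mapping used in our notebooks should be checked before relying on this):

```python
# Hedged sketch: map EFCamDat's 16-level EF scale to CEFR bands.
# The 1-3 / 4-6 / ... grouping is an assumption based on the commonly cited
# EFCamDat documentation, not necessarily the mapping used in this repo.
def ef_to_cefr(ef_level: int) -> str:
    bands = ["A1", "A2", "B1", "B2", "C1", "C2"]
    return bands[min((ef_level - 1) // 3, 5)]

assert ef_to_cefr(1) == "A1" and ef_to_cefr(12) == "B2" and ef_to_cefr(16) == "C2"
```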
State-of-the-art (SOTA) studies have evaluated transformer-based architectures like BERT on this task and reported remarkable F1 scores above 0.9 on the EFCamDat dataset, such as the research by Sánchez et al. (2024).
However, this project identifies major challenges with the EFCamDat dataset that call into question the usefulness of BERT-based models on advanced proficiency levels and out-of-domain writing topics. In our investigation, vanilla BERT-based models perform extremely well on EFCamDat levels because of the huge bias towards lower proficiency levels and the one-to-one correlation between proficiency level and writing topic. After adjusting for the class imbalance and topic bias, our baseline BERT model performs poorly on the task, attaining a macro F1 score of only about 0.4.
To address the problem of class imbalance, we carved out a test set (BalSeen) that is evenly distributed by class, contrasting with the original, stratified test set that preserves the latent class imbalance across CEFR levels (StratSeen).
To address the concern that BERT-based models are learning topic classification rather than language proficiency classification, we intentionally held out 50 writing topics to create a third test set (BalUnseen) with equally balanced classes and unseen topics. All texts from these topics were excluded from the train set, serving as the most robust measure of the model's ability to generalize across proficiency levels and topics.
We used Shatz’s (2020) refined version of the EFCamDat dataset, which contains 406,000+ texts. It excludes problematic texts (duplicates, ultra-short texts, non-English texts, and word-count outliers) and narrows the data down to texts from the top 11 nationalities with accurate topic data.
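For concreteness, here is a minimal pandas/scikit-learn sketch of how the three test sets could be carved out. The column names `cefr` and `topic` and the exact sampling calls are our own assumptions for illustration; `split_dataset.ipynb` (described later) contains the actual split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical column names ("cefr", "topic") for illustration only.
df = pd.read_csv("data/texts_clean.csv")

# BalUnseen: hold out 50 topics entirely so their texts never reach training.
unseen_topics = df["topic"].drop_duplicates().sample(50, random_state=42)
unseen_pool = df[df["topic"].isin(unseen_topics)]
seen_pool = df[~df["topic"].isin(unseen_topics)]
bal_unseen = unseen_pool.groupby("cefr").sample(2000, random_state=42)

# BalSeen: class-balanced (2,000 texts per level) but drawn from seen topics.
bal_seen = seen_pool.groupby("cefr").sample(2000, random_state=42)

# StratSeen: 10,000 texts that preserve the natural class imbalance.
rest = seen_pool.drop(index=bal_seen.index)
_, strat_seen = train_test_split(rest, test_size=10_000,
                                 stratify=rest["cefr"], random_state=42)

# Training data: seen-topic texts not reserved for any test set.
train = rest.drop(index=strat_seen.index)
```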
The following table summarizes the number of texts for each CEFR level and unique topics for each test set. Note that C2 is omitted due to lack of representation in the EFCamDat dataset.
| Level | StratSeen | BalSeen | BalUnseen |
|---|---|---|---|
| A1 | 4,807 | 2,000 | 2,000 |
| A2 | 3,442 | 2,000 | 2,000 |
| B1 | 1,209 | 2,000 | 2,000 |
| B2 | 434 | 2,000 | 2,000 |
| C1 | 108 | 2,000 | 2,000 |
| Total | 10,000 | 10,000 | 10,000 |
| Topics | 69 (Seen) | 70 (Seen) | 50 (Unseen) |
We experimented with:
- Fine-tuning BERT on both the original class distribution and balanced classes
- Varying the training data size (1K vs 4K samples)
- Training a GRU-based RNN on POS tags (POS-RNN)
- Training a fused model of BERT and POS-RNN (BERT-POS-RNN)
- Adding linguistic features to BERT and POS-RNN
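The POS-RNN listed above is intentionally lightweight: each text is reduced to its sequence of part-of-speech tags, so the model never sees topical vocabulary at all. Below is a minimal PyTorch sketch of such a model; the embedding size, hidden size, and bidirectionality are our own illustrative choices and do not necessarily match `train_rnn-even.ipynb`.

```python
import torch
import torch.nn as nn

class POSRNN(nn.Module):
    """GRU over part-of-speech tag IDs; never sees the words themselves."""
    def __init__(self, num_tags: int, num_classes: int = 5,
                 emb_dim: int = 32, hidden_dim: int = 128):
        super().__init__()
        # Assumes tag ID 0 is reserved for padding.
        self.embed = nn.Embedding(num_tags, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)  # A1..C1

    def forward(self, tag_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(tag_ids)             # (batch, seq_len, emb_dim)
        _, h = self.gru(x)                  # h: (2, batch, hidden_dim)
        h = torch.cat([h[0], h[1]], dim=-1) # concat forward/backward final states
        return self.classifier(h)           # logits over CEFR levels
```

For the fused BERT-POS-RNN, a natural design is to concatenate BERT's pooled output with the GRU's final hidden state before the classification layer; see `train_bert-even-rnn.ipynb` for the fusion actually used.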
The following 18 linguistic features were explored:
| Category | Feature |
|---|---|
| Syntactic complexity | Average sentence length |
| Subordination ratio | |
| Average dependency distance | |
| Clause complexity ratio | |
| Verb phrase complexity | |
| Modal verb ratio | |
| Noun ratio | |
| Verb ratio | |
| Lexical diversity | Type token ratio |
| POS entropy | |
| POS diversity | |
| Rare word ratio | |
| Average word length | |
| Lexical sophistication index | |
| Information density | |
| Discourse coherence | Connective density |
| Coherence device density | |
| Accuracy | Number of spelling errors |
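To make the feature set concrete, here is a hedged sketch of how a handful of these features (average sentence length, type-token ratio, average word length, POS entropy, noun/verb ratios) can be computed. spaCy and the `en_core_web_sm` model are assumptions for illustration and may differ from the toolchain used in the notebooks.

```python
import math
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

def basic_features(text: str) -> dict:
    """Illustrative subset of the 18 features, computed from a spaCy parse."""
    doc = nlp(text)
    tokens = [t for t in doc if not t.is_space and not t.is_punct]
    sents = list(doc.sents)
    words = [t.text.lower() for t in tokens]
    pos_counts = Counter(t.pos_ for t in tokens)
    total = sum(pos_counts.values()) or 1

    return {
        "avg_sentence_length": len(tokens) / max(len(sents), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "pos_entropy": -sum((c / total) * math.log2(c / total)
                            for c in pos_counts.values()),
        "noun_ratio": pos_counts["NOUN"] / total,
        "verb_ratio": pos_counts["VERB"] / total,
    }
```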
The following table summarizes the results of our experiments on the three test sets. The best scores for each test set are highlighted in bold.
| Model | Training Data | StratSeen (Macro F1) | BalSeen (Macro F1) | BalUnseen (Macro F1) |
|---|---|---|---|---|
| BERT | 1K, original distribution | 0.730 | 0.660 | 0.404 |
| BERT | 4K, original distribution | 0.941 | 0.914 | 0.400 |
| BERT | 1K, balanced classes | 0.959 | 0.972 | 0.514 |
| BERT | 4K, balanced classes | **0.979** | **0.987** | 0.475 |
| BERT with linguistic features | 1K, balanced classes | 0.924 | 0.957 | 0.580 |
| POS-RNN | 1K, balanced classes | 0.686 | 0.769 | 0.584 |
| POS-RNN | 4K, balanced classes | 0.794 | 0.860 | **0.624** |
| POS-RNN with linguistic features | 4K, balanced classes | 0.874 | 0.860 | 0.604 |
| BERT-POS-RNN | 1K, balanced classes | 0.933 | 0.968 | 0.568 |
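For reference, macro F1 averages the per-class F1 scores with equal weight, which is why the heavy skew toward A1/A2 matters: a model can look strong overall while failing the rare C1 class. A minimal computation with scikit-learn (labels below are illustrative):

```python
from sklearn.metrics import f1_score

# y_true / y_pred are CEFR labels for a test set (illustrative values only).
y_true = ["A1", "A1", "A2", "B1", "B2", "C1"]
y_pred = ["A1", "A1", "A2", "B1", "B1", "B1"]

print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1
```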
**BERT learns topic classification, not language proficiency.**
Vanilla BERT significantly underperforms on unseen topics, achieving only 0.4–0.5 macro F1 compared to 0.9+ on seen topics.

**Balanced classes improve BERT's generalization.**
Training on balanced classes consistently improves BERT's performance across all test sets.

**Linguistic features help BERT focus on language proficiency.**
Adding linguistic features improved BERT's performance on unseen topics from 0.514 to 0.580 macro F1.

**More training data hurts BERT's generalization.**
Increasing the training size from 1K to 4K samples worsened BERT's performance on unseen topics (0.514 → 0.475 macro F1).

**A simple POS-RNN outperforms BERT.**
A basic RNN trained on POS tags achieves 0.624 macro F1 on unseen topics versus BERT's best of 0.580.

**Fusion shows mixed results.**
The BERT-POS-RNN combination underperforms the standalone POS-RNN, with the linguistic signal “drowned out” by BERT's topic bias.
Install uv.
Then, run the following commands:
```
uv sync
source .venv/bin/activate
```

Either run the notebooks directly in Visual Studio Code or use Jupyter Lab:

```
uv run jupyter lab
```

You will first need to request official access to the EFCamDat dataset from the EFCamDat website. Note that an academic affiliation and access to Google Drive are necessary to use the corpus.
Once you have access, locate the specific dataset `Final database (main prompts).xlsx` (under `Cleaned_Subcorpus (Shatz, 2020)`) and place it in the `data/` directory of this project.
Run the notebook `explore_efcamdat.ipynb`. Additionally, `analyze_ling_feats.ipynb` performs a correlation analysis of the linguistic features on the train dataset to identify redundant ones.
Run the notebooks in the following order:
1. `clean_dataset.ipynb` cleans the original dataset and exports the clean data to `data/texts_clean.csv`
2. `split_dataset.ipynb` reads from `data/texts_clean.csv` and splits it into train and test datasets:
   - `data/texts_train.csv` as the train data
   - `data/texts_test.csv` as the test data that is class-balanced with unseen topics (BalUnseen)
   - `data/texts_test_seen.csv` as the test data that is class-balanced with seen topics (BalSeen)
   - `data/texts_test_stratified.csv` as the test data that is stratified with seen topics (StratSeen)
Run any of the following notebooks:
- `train_bert.ipynb`
- `train_bert-4k.ipynb`
- `train_bert-even.ipynb`
- `train_bert-even-4k.ipynb`
- `train_rnn-even.ipynb`
- `train_rnn-even-4k.ipynb`
- `train_bert-even-rnn.ipynb`
- `train_rnn-even-4k-ling_feat.ipynb`
- `train_bert-even-ling_feat.ipynb`
This project uses the EF-Cambridge Open Language Database (EFCamDat), specifically the cleaned subcorpus created by Shatz (2020). We acknowledge and thank the EF Research Lab, Cambridge TAL and the EF Education First Group of Companies.
The EFCamDat dataset is subject to separate licensing terms and usage restrictions. This repository's MIT license applies only to the source code and analysis scripts.
Neither Cambridge TAL nor EF Education First Group of Companies bear any responsibility for the analysis or interpretation presented in this repository.
Geertzen, J., Alexopoulou, T., & Korhonen, A. (2014). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCamDat). In R.T. Millar, K.I. Martin, C.M. Eddington, A. Henery, N.M. Miguel, & A. Tseng (Eds.), Selected proceedings of the 2012 Second Language Research Forum (pp. 240–254). Somerville, MA: Cascadilla Proceedings Project.
Huang, Y., Geertzen, J., Baker, R., Korhonen, A., & Alexopoulou, T. (2017). The EF Cambridge Open Language Database (EFCAMDAT): Information for users (pp. 1–18). Retrieved from https://ef-lab.mmll.cam.ac.uk/EFCAMDAT.html
Sánchez, R. M., Alfter, D., Dobnik, S., Szawerna, M., & Volodina, E. (2024, October). Jingle BERT, Jingle BERT, Frozen All the Way: Freezing Layers to Identify CEFR Levels of Second Language Learners Using BERT. In Swedish Language Technology Conference and NLP4CALL (pp. 137–152).
Shatz, I. (2020). Refining and modifying the EFCAMDAT: Lessons from creating a new corpus from an existing large-scale English learner language database. International Journal of Learner Corpus Research, 6(2), 220–236. doi:10.1075/ijlcr.20009.sha