Multilabel classification using BERT transformers returns low accuracy #13792
CodeCheetah started this conversation in Help: Best practices
Replies: 1 comment
-
Well, my guess is that a corpus of size 2000 is too small for 15 labels. As a rule of thumb, you need roughly 500-1000 instances per label for healthy classification. In your case there are only 2000/15 ≈ 133 instances per label, which is too few. Another issue is that your dataset is synthetic, so its underlying distribution may not be similar to real-world data. For this project I'd:
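For instance, a quick count along these lines (a minimal sketch; it assumes each record keeps its labels as a list of strings under a "labels" key, which may not match your format) usually makes the imbalance obvious:

```python
# Rough check of instances per label (sketch; assumes each record stores
# its labels as a list of strings under a "labels" key).
from collections import Counter

def per_label_counts(records):
    counts = Counter()
    for record in records:
        for label in record["labels"]:
            counts[label] += 1
    return counts

# Toy records for illustration; with ~2000 records and 15 labels you would
# expect many labels to fall well below the 500-1000 range.
records = [
    {"text": "…", "labels": ["billing", "refund"]},
    {"text": "…", "labels": ["billing"]},
]
for label, n in per_label_counts(records).most_common():
    print(label, n)
```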
Best of luck!
-
I'm currently training transformer-based models (bert-base-uncased, xlm-roberta-base, and roberta-base) on synthetic data of around 2000 records generated by GPT-3.5, with 15 labels. Training works fine with the out-of-the-box config as well as with several hyperparameter changes (learning rate, dropout, batch size).
I can evaluate the model on a 30% held-out split of the synthetic data and that looks fine, but whenever I bring in real data for classification I get really low scores, below 25% accuracy overall.
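For context, the real-data check I'm doing is roughly the following (a simplified sketch of the kind of evaluation I mean; the model path, example texts, and the 0.5 threshold are placeholders, not my actual values):

```python
# Sketch of scoring the trained pipeline on hand-labelled real examples
# (model path, texts and the 0.5 threshold are placeholders).
import spacy

nlp = spacy.load("training/model-best")

real_examples = [
    ("example real-world text …", {"billing", "refund"}),
    ("another real-world text …", {"shipping"}),
]

correct = 0
for text, gold_labels in real_examples:
    doc = nlp(text)
    predicted = {label for label, score in doc.cats.items() if score >= 0.5}
    correct += predicted == gold_labels  # exact-match accuracy per example
print(f"Exact-match accuracy: {correct / len(real_examples):.2%}")
```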
Any ideas on how/what to improve?
Should I try a different textcat architecture, e.g. CNN or TextCatBOW, or anything else?
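To make the second question concrete, this is the kind of architecture swap I have in mind, e.g. a bag-of-words textcat added in code (just a sketch with illustrative parameter values, not my current setup):

```python
# Sketch of adding a multilabel textcat with the TextCatBOW architecture
# (parameter values are illustrative defaults, not a tuned configuration).
import spacy

nlp = spacy.blank("en")
textcat = nlp.add_pipe(
    "textcat_multilabel",
    config={
        "threshold": 0.5,
        "model": {
            "@architectures": "spacy.TextCatBOW.v2",
            "exclusive_classes": False,  # multilabel: labels are not mutually exclusive
            "ngram_size": 1,
            "no_output_layer": False,
        },
    },
)
for label in ["label_1", "label_2"]:  # placeholder label names
    textcat.add_label(label)
```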
My config: