-
Very interested in your thoughts @lapp0
-
Good visualization, and glad you're checking against these metrics! My immediate hunch was the validation-set contamination fixed in #5, but there wasn't any test contamination and the test set performs well. A few thoughts:
Can I get some more details on how these metrics are produced in general? It's not clear to me how any of them are measured.
-
One thing that stands out is that you use a sequence length of 1024 instead of 2048. Another is that before #5 we didn't actually ignore pad tokens, which
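For reference, here's a minimal sketch of what ignoring pad tokens in the loss looks like in PyTorch (the actual change in #5 may differ; the `pad_token_id` argument and the `(batch, seq, vocab)` logits layout are assumptions):

```python
import torch.nn.functional as F

def lm_loss(logits, labels, pad_token_id):
    """Cross-entropy over the vocabulary, skipping pad positions.

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len).
    Pad positions are set to -100, which F.cross_entropy ignores
    by default (ignore_index=-100).
    """
    labels = labels.masked_fill(labels == pad_token_id, -100)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```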
-
I think the issue with performance on the tasks referenced in your table is the lack of finetuning. Your tasks expect a separate classifier head, not the pretraining language-modeling head. My work in kbert is analogous in that I'm finetuning on sequence classification: I've converted kbert's trainer to a class, with a separate subclass and model for finetuning on MNLI. I achieved 85.64% accuracy on MNLI, training for ~24 hours and finetuning for ~6 minutes on a 4x4090. The result is better than BERT, but beating ModernBERT is still a WIP. Most of your protein understanding tasks should be tunable through the referenced sequence classification model (rough sketch below). Let me know if you have any questions.
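For illustration only, a minimal version of that kind of finetuning setup (not kbert's actual trainer; the encoder interface, CLS-token pooling, and dropout value here are assumptions):

```python
import torch.nn as nn

class EncoderForSequenceClassification(nn.Module):
    """Hypothetical wrapper: pretrained encoder + fresh classifier head."""

    def __init__(self, encoder, hidden_size, num_labels, dropout=0.1):
        super().__init__()
        self.encoder = encoder            # pretrained backbone (assumed interface)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)  # trained from scratch

    def forward(self, input_ids, attention_mask):
        # Assumed: encoder returns hidden states of shape (batch, seq_len, hidden).
        hidden = self.encoder(input_ids, attention_mask)
        # CLS-style pooling: take the first token as the sequence summary;
        # mean pooling over non-pad tokens is a common alternative.
        pooled = hidden[:, 0]
        return self.classifier(self.dropout(pooled))
```

For MNLI, `num_labels=3`; for the protein tasks it would be whatever each benchmark defines.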
-
A common metric of success for pLMs is how well vector embeddings (usually mean-pooled, if looking at just pretrained models) correlate with valuable downstream tasks. While the current models exhibit excellent language-modeling loss and sequence reconstruction (after very few steps), they underperform when the pooled vector embeddings are probed.
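For concreteness, this is the kind of mean pooling meant here (a sketch assuming a Hugging Face-style `attention_mask`; not our exact evaluation code):

```python
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real (non-pad) positions.

    hidden_states: (batch, seq_len, hidden)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    Returns one fixed-size embedding per sequence: (batch, hidden).
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (B, T, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)                     # (B, 1)
    return summed / counts
```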
For example:
We've even observed validation and test losses around 2.1 during different training runs (which would suggest parity with ESM2-650M!).
Yet, when probing:

Here, performance on each dataset is zeroed against a baseline of embeddings that are randomly generated vectors.
Our speedrun model is worse on average than a regular transformer with random weights. This may seem surprising, but it is a common phenomenon with protein language models: sequence similarity is a strong feature for related function, and even random transformers embed similar inputs similarly. Regardless, this is a big problem for downstream applications.
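To make the probing setup concrete, here's a sketch of the kind of evaluation described above: a linear probe on frozen mean-pooled embeddings, with the score zeroed against randomly generated vectors of the same shape (the actual probe, metric, and datasets we use may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_accuracy(X_train, y_train, X_test, y_test):
    """Fit a linear probe on frozen embeddings and report test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

def zeroed_score(emb_train, emb_test, y_train, y_test, seed=0):
    """Probe accuracy of model embeddings minus a random-vector baseline."""
    rng = np.random.default_rng(seed)
    rand_train = rng.normal(size=emb_train.shape)
    rand_test = rng.normal(size=emb_test.shape)
    model_score = probe_accuracy(emb_train, y_train, emb_test, y_test)
    random_score = probe_accuracy(rand_train, y_train, rand_test, y_test)
    return model_score - random_score
```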
Open to any and all suggestions to address this during our speedrun training process.