-
Very interested in your thoughts @lapp0
-
Good visualization, and glad you're checking against these metrics! My immediate hunch was the validation-set contamination fixed in #5, but there wasn't any test contamination and the test set performs well. A few thoughts:
Can I get some more details on how these metrics are produced in general? It's not clear to me how any of them are measured.
-
One thing that stands out is that you use a sequence length of 1024 instead of 2048. Another is that before #5 we didn't actually ignore pad tokens, which
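For reference, here's a minimal sketch of what ignoring pad tokens in the loss looks like in PyTorch (the actual change in #5 may differ; the `pad_token_id` argument and the `(batch, seq, vocab)` logits layout are assumptions):

```python
import torch.nn.functional as F

def lm_loss(logits, labels, pad_token_id):
    """Cross-entropy over the vocabulary, skipping pad positions.

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len).
    Pad positions are set to -100, which F.cross_entropy ignores
    by default (ignore_index=-100).
    """
    labels = labels.masked_fill(labels == pad_token_id, -100)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```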
-
I think the issue with performance on the tasks referenced in your table is the lack of finetuning. Your tasks expect a separate classifier head, not the pretraining language-modeling head. My work in kbert is analogous in that I'm finetuning on sequence classification: I've converted kbert's trainer to a class, with a separate subclass and model for finetuning on MNLI. I achieved 85.64% accuracy on MNLI, training for ~24 hours and finetuning for ~6 minutes on a 4x4090. The result is better than BERT, but beating ModernBERT is still a WIP. Most of your protein understanding tasks should be tunable through the referenced sequence classification model (rough sketch below). Let me know if you have any questions.
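For illustration only, a minimal version of that kind of finetuning setup (not kbert's actual trainer; the encoder interface, CLS-token pooling, and dropout value here are assumptions):

```python
import torch.nn as nn

class EncoderForSequenceClassification(nn.Module):
    """Hypothetical wrapper: pretrained encoder + fresh classifier head."""

    def __init__(self, encoder, hidden_size, num_labels, dropout=0.1):
        super().__init__()
        self.encoder = encoder            # pretrained backbone (assumed interface)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)  # trained from scratch

    def forward(self, input_ids, attention_mask):
        # Assumed: encoder returns hidden states of shape (batch, seq_len, hidden).
        hidden = self.encoder(input_ids, attention_mask)
        # CLS-style pooling: take the first token as the sequence summary;
        # mean pooling over non-pad tokens is a common alternative.
        pooled = hidden[:, 0]
        return self.classifier(self.dropout(pooled))
```

For MNLI, `num_labels=3`; for the protein tasks it would be whatever each benchmark defines.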
-
A common metric of success for pLMs is how well vector embeddings (usually mean-pooled, if looking at just pretrained models) correlate with valuable downstream tasks. While the current models exhibit excellent language-modeling loss and sequence reconstruction (after very few steps), they underperform when the pooled vector embeddings are probed.
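For concreteness, this is the kind of mean pooling meant here (a sketch assuming a Hugging Face-style `attention_mask`; not our exact evaluation code):

```python
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real (non-pad) positions.

    hidden_states: (batch, seq_len, hidden)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    Returns one fixed-size embedding per sequence: (batch, hidden).
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (B, T, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)                     # (B, 1)
    return summed / counts
```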
For example:
We've even observed validation and test losses around 2.1 during different training runs (which would suggest parity with ESM2-650M!).
Yet, when probing:

Here, performance on each dataset is zeroed against a baseline of embeddings that are randomly generated vectors.
Our speedrun model is worse on average than a regular transformer with random weights. This may seem surprising, but it is a common phenomenon with protein language models: sequence similarity is a strong feature for related function, and even random transformers embed similar inputs similarly. Regardless, this is a big problem for downstream applications.
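To make the probing setup concrete, here's a sketch of the kind of evaluation described above: a linear probe on frozen mean-pooled embeddings, with the score zeroed against randomly generated vectors of the same shape (the actual probe, metric, and datasets we use may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_accuracy(X_train, y_train, X_test, y_test):
    """Fit a linear probe on frozen embeddings and report test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

def zeroed_score(emb_train, emb_test, y_train, y_test, seed=0):
    """Probe accuracy of model embeddings minus a random-vector baseline."""
    rng = np.random.default_rng(seed)
    rand_train = rng.normal(size=emb_train.shape)
    rand_test = rng.normal(size=emb_test.shape)
    model_score = probe_accuracy(emb_train, y_train, emb_test, y_test)
    random_score = probe_accuracy(rand_train, y_train, rand_test, y_test)
    return model_score - random_score
```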
Open to any and all suggestions to address this during our speedrun training process.