
Fix loading of legacy HuggingFace BERT checkpoints. #10631

Open
drivanov wants to merge 12 commits into pyg-team:master from drivanov:tokenizer

Conversation

drivanov (Contributor) commented Mar 6, 2026

Some legacy HuggingFace checkpoints, such as prajjwal1/bert-tiny, do not contain a config.json with the model_type field required by recent versions of Transformers.
As a result, loading them through the Auto* classes (AutoModelForSequenceClassification for the model, AutoTokenizer for the tokenizer) fails, e.g.:

  File "/workspace/examples/llm/glem.py", line 461, in <module>
    main(args)
  File "/workspace/examples/llm/glem.py", line 88, in main
    tag_dataset = TAGDataset(root, dataset, hf_model,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/datasets/tag_dataset.py", line 89, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 773, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 1721, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 1910, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_tokenizers.py", line 341, in __init__
    raise ValueError(
ValueError: Couldn't instantiate the backend tokenizer from one of: 
(1) a `tokenizers` library serialization file, 
(2) a slow tokenizer instance to convert or 
(3) an equivalent slow tokenizer class to instantiate and convert. 
You need to have sentencepiece or tiktoken installed to convert a slow tokenizer to a fast one.
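
The failure above comes from the fast-tokenizer conversion path. As an illustration only (a hedged sketch, not necessarily the change made in this PR), the slow BertTokenizer can still load such checkpoints because it only needs vocab.txt:

```python
from transformers import AutoTokenizer, BertTokenizer

def load_tokenizer(tokenizer_name: str):
    """Load a tokenizer, falling back to the slow BERT tokenizer for
    legacy checkpoints that cannot be converted to a fast tokenizer."""
    try:
        return AutoTokenizer.from_pretrained(tokenizer_name)
    except ValueError:
        # Legacy checkpoints such as prajjwal1/bert-tiny ship only vocab.txt;
        # the slow BertTokenizer does not need sentencepiece/tiktoken.
        return BertTokenizer.from_pretrained(tokenizer_name)
```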

This PR adds a small fallback for such models by directly using BertForSequenceClassification, while keeping the default AutoModelForSequenceClassification path for all other models.
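
A minimal sketch of that fallback, under the assumption that it lives in the GLEM model-loading code (the exact diff may differ):

```python
from transformers import (
    AutoModelForSequenceClassification,
    BertForSequenceClassification,
)

def load_classifier(model_name: str, num_labels: int):
    """Prefer the generic Auto* path; fall back to the BERT-specific class
    for legacy checkpoints whose config.json lacks `model_type`."""
    try:
        return AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels)
    except (KeyError, ValueError):
        return BertForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels)
```

All non-legacy models still go through the AutoModelForSequenceClassification branch, so existing behavior is unchanged.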

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.53%. Comparing base (c211214) to head (f3e8f94).
⚠️ Report is 173 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| torch_geometric/llm/models/glem.py | 0.00% | 6 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10631      +/-   ##
==========================================
- Coverage   86.11%   85.53%   -0.58%     
==========================================
  Files         496      510      +14     
  Lines       33655    35983    +2328     
==========================================
+ Hits        28981    30779    +1798     
- Misses       4674     5204     +530     

☔ View full report in Codecov by Sentry.