
Fix loading of legacy HuggingFace BERT checkpoints. #10631

Open
drivanov wants to merge 12 commits into pyg-team:master from drivanov:tokenizer

Conversation

drivanov (Contributor) commented Mar 6, 2026

Some legacy HuggingFace checkpoints, such as prajjwal1/bert-tiny, do not contain a config.json with the model_type field required by recent versions of Transformers.
As a result, loading them through the Auto* classes (AutoModelForSequenceClassification for the model, AutoTokenizer for the tokenizer) fails, e.g.:

  File "/workspace/examples/llm/glem.py", line 461, in <module>
    main(args)
  File "/workspace/examples/llm/glem.py", line 88, in main
    tag_dataset = TAGDataset(root, dataset, hf_model,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/datasets/tag_dataset.py", line 89, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 773, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 1721, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 1910, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_tokenizers.py", line 341, in __init__
    raise ValueError(
ValueError: Couldn't instantiate the backend tokenizer from one of: 
(1) a `tokenizers` library serialization file, 
(2) a slow tokenizer instance to convert or 
(3) an equivalent slow tokenizer class to instantiate and convert. 
You need to have sentencepiece or tiktoken installed to convert a slow tokenizer to a fast one.
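
The failure above comes from the fast-tokenizer conversion path. As an illustration only (a hedged sketch, not necessarily the change made in this PR), the slow BertTokenizer can still load such checkpoints because it only needs vocab.txt:

```python
from transformers import AutoTokenizer, BertTokenizer

def load_tokenizer(tokenizer_name: str):
    """Load a tokenizer, falling back to the slow BERT tokenizer for
    legacy checkpoints that cannot be converted to a fast tokenizer."""
    try:
        return AutoTokenizer.from_pretrained(tokenizer_name)
    except ValueError:
        # Legacy checkpoints such as prajjwal1/bert-tiny ship only vocab.txt;
        # the slow BertTokenizer does not need sentencepiece/tiktoken.
        return BertTokenizer.from_pretrained(tokenizer_name)
```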

This PR adds a small fallback for such models by directly using BertForSequenceClassification, while keeping the default AutoModelForSequenceClassification path for all other models.
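
A minimal sketch of that fallback, under the assumption that it lives in the GLEM model-loading code (the exact diff may differ):

```python
from transformers import (
    AutoModelForSequenceClassification,
    BertForSequenceClassification,
)

def load_classifier(model_name: str, num_labels: int):
    """Prefer the generic Auto* path; fall back to the BERT-specific class
    for legacy checkpoints whose config.json lacks `model_type`."""
    try:
        return AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels)
    except (KeyError, ValueError):
        return BertForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels)
```

All non-legacy models still go through the AutoModelForSequenceClassification branch, so existing behavior is unchanged.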

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.53%. Comparing base (c211214) to head (f3e8f94).
⚠️ Report is 173 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| torch_geometric/llm/models/glem.py | 0.00% | 6 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10631      +/-   ##
==========================================
- Coverage   86.11%   85.53%   -0.58%     
==========================================
  Files         496      510      +14     
  Lines       33655    35983    +2328     
==========================================
+ Hits        28981    30779    +1798     
- Misses       4674     5204     +530     

☔ View full report in Codecov by Sentry.