Add TypedDicts for complex return dicts: RougeScore, TokenizationStat family, CorefResult #1368

Merged
bact merged 5 commits into dev from
copilot/add-typed-dicts-for-complex-dictionaries
Mar 25, 2026

Conversation

Contributor

Copilot AI commented Mar 24, 2026

What do these changes do

Adds TypedDicts to several public APIs that previously returned opaque dict[str, Any] or dict[str, tuple] types, making keys and value types explicit, IDE-completable, and statically checkable.

New TypedDicts in pythainlp.benchmarks:

  • RougeScore — per-ROUGE-type result with precision, recall, fmeasure fields. Breaking change: rouge_score() return type changes from dict[str, tuple[float, float, float]] to dict[str, RougeScore].
  • CharLevelStat, WordLevelStat, GlobalStat, TokenizationStat — typed wrapper for the nested dict from compute_stats(). The global-level key is "global_" (trailing underscore avoids the Python reserved word). Class-based TypedDict syntax throughout.
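
As a rough illustration of the RougeScore change, the sketch below defines a stand-in TypedDict with the three fields named in this PR and shows named-field access replacing tuple indexing. The `sample_scores` data, the `"rouge1"` key, the tuple order, and the `to_legacy_tuple` helper are assumptions made for the example; none of them are taken from PyThaiNLP itself.

```python
from typing import TypedDict


class RougeScore(TypedDict):
    """Stand-in mirroring the fields listed in this PR (not imported from pythainlp)."""

    precision: float
    recall: float
    fmeasure: float


def to_legacy_tuple(score: RougeScore) -> tuple[float, float, float]:
    """Hypothetical shim for callers that still expect a 3-tuple.

    The (precision, recall, fmeasure) order is an assumption.
    """
    return (score["precision"], score["recall"], score["fmeasure"])


# Made-up numbers; the "rouge1" key is an assumption for illustration.
sample_scores: dict[str, RougeScore] = {
    "rouge1": {"precision": 0.5, "recall": 0.25, "fmeasure": 0.4},
}

# Old style would index by position; new style reads a named field.
fmeasure = sample_scores["rouge1"]["fmeasure"]
legacy = to_legacy_tuple(sample_scores["rouge1"])
```

A shim like this may be useful only during migration; code written against the new return type can read the named fields directly.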

New TypedDicts in pythainlp.coref:

  • CorefResult (renamed from internal CorefResultDict) — now exported; coreference_resolution() return type tightened from list[dict[str, Any]] to list[CorefResult].

Naming conventions:

  • Singular form throughout (CharLevelStat, not CharLevelStats) — these are single result records, consistent with existing BleuScore/RougeScore and NLTK's ChunkScore convention.
  • US spelling used consistently: correctly_tokenized_words, tokenization_indicators, _find_words_correctly_tokenized.

What was wrong

Several public functions returned opaque dict types whose structure was only discoverable by reading source code. Additionally, the TokenizationStats TypedDict required the functional TypedDict(...) form because "global" is a Python reserved keyword and cannot be declared as a field in class-based syntax, which made the type awkward to define and read. Field names used British spelling (tokenised, tokenisation), inconsistent with the rest of the codebase.
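
To make the keyword problem concrete, here is a minimal sketch contrasting the two TypedDict declaration forms. The type names `OldStyleStat` and `NewStyleStat` and the `dict[str, Any]` value type are placeholders chosen for this example, not the definitions used in the library.

```python
import keyword
from typing import Any, TypedDict

# "global" is a reserved keyword, so it cannot be written as a class
# attribute name; only the functional form can declare it as a key.
OldStyleStat = TypedDict("OldStyleStat", {"global": dict[str, Any]})


class NewStyleStat(TypedDict):
    # The trailing underscore makes the key a valid identifier,
    # so the ordinary class-based form works.
    global_: dict[str, Any]


old: OldStyleStat = {"global": {}}
new: NewStyleStat = {"global_": {}}

is_keyword_before = keyword.iskeyword("global")
is_keyword_after = keyword.iskeyword("global_")
```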

How this fixes it

  • Introduces four TypedDicts (CharLevelStat, WordLevelStat, GlobalStat, TokenizationStat) with the global-level key renamed to "global_", allowing the standard class-based TypedDict syntax.
  • Renames all fields and the internal helper function to use US spelling: correctly_tokenized_words, tokenization_indicators, _find_words_correctly_tokenized.
  • All usages updated across word_tokenization.py, cli/benchmark.py, tests, and CHANGELOG migration notes.
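
The nested shape can be sketched roughly as follows. Only the class names, the `global_` key, and the two US-spelled field names come from this PR; the `word_level` key, the value types, and the `summarize` helper are assumptions for illustration, and CharLevelStat is omitted because its fields are not listed here.

```python
from typing import TypedDict


class WordLevelStat(TypedDict):
    correctly_tokenized_words: int  # value type is an assumption


class GlobalStat(TypedDict):
    tokenization_indicators: str  # value type is an assumption


class TokenizationStat(TypedDict):
    word_level: WordLevelStat  # key name assumed for illustration
    global_: GlobalStat  # renamed from "global" in this PR


def summarize(stat: TokenizationStat) -> str:
    """Hypothetical consumer showing statically checkable key access."""
    words = stat["word_level"]["correctly_tokenized_words"]
    indicators = stat["global_"]["tokenization_indicators"]
    return f"{words} correct words; indicators={indicators}"


# Made-up values standing in for a compute_stats() result.
example: TokenizationStat = {
    "word_level": {"correctly_tokenized_words": 3},
    "global_": {"tokenization_indicators": "101"},
}
summary = summarize(example)
```

With types like these, a static checker can flag a leftover `stat["global"]` or a British-spelled field name at analysis time rather than at runtime.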

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit tests

…port from public APIs

- metrics.py: add RougeScore TypedDict (precision/recall/fmeasure);
  rouge_score() now returns dict[str, RougeScore] instead of
  dict[str, tuple[float, float, float]] (breaking change)
- word_tokenization.py: add CharLevelStats, WordLevelStats, GlobalStats,
  TokenizationStats TypedDicts; update compute_stats() return type;
  _flatten_result() uses @overload to accept both TokenizationStats and
  generic Mapping without a cast
- benchmarks/__init__.py: export RougeScore, CharLevelStats, WordLevelStats,
  GlobalStats, TokenizationStats
- _fastcoref.py: rename CorefResultDict -> CorefResult; alias the
  fastcoref.modeling.CorefResult import as FastCorefResult to avoid clash
- coref/__init__.py: export CorefResult
- coref/core.py: update return type and fallback return to use CorefResult
- tag/named_entity.py: improve EntitySpan docstring
- tests/extra/testx_benchmarks.py: update rouge tests for named-field access;
  add test_rouge_score_return_type and test_compute_stats_return_type
- CHANGELOG.md: add migration notes for all TypedDicts
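
The @overload approach mentioned for _flatten_result() can be sketched as below. This is not the library's implementation: `flatten_result`, the `Inner` and `Stats` types, and the `":"` separator are hypothetical. The motivation, as far as the typing rules go, is that a TypedDict is assignable to `Mapping[str, object]` but not to a mapping with a narrower value type, so a single generic signature would force callers to cast.

```python
from collections.abc import Mapping
from typing import Any, TypedDict, overload


class Inner(TypedDict):
    count: int


class Stats(TypedDict):
    global_: Inner


@overload
def flatten_result(result: Stats, sep: str = ":") -> dict[str, Any]: ...
@overload
def flatten_result(
    result: Mapping[str, Mapping[str, Any]], sep: str = ":"
) -> dict[str, Any]: ...
def flatten_result(result, sep=":"):
    """Flatten a two-level mapping into 'outer<sep>inner' keys."""
    flat: dict[str, Any] = {}
    for outer_key, inner in result.items():
        for inner_key, value in inner.items():
            flat[f"{outer_key}{sep}{inner_key}"] = value
    return flat


typed_input: Stats = {"global_": {"count": 3}}
flat_typed = flatten_result(typed_input)
flat_plain = flatten_result({"a": {"x": 1, "y": 2}}, sep=".")
```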

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Agent-Logs-Url: https://github.com/PyThaiNLP/pythainlp/sessions/de0a9df1-25ba-4e89-a047-7ee0664753c1
Copilot AI changed the title [WIP] Add TypedDicts for complex dictionaries with docstrings Add TypedDicts for complex return dicts: RougeScore, TokenizationStats family, CorefResult Mar 24, 2026
Copilot AI requested a review from bact March 24, 2026 12:51
Copilot finished work on behalf of bact March 24, 2026 12:51
- CharLevelStats → CharLevelStat (singular, matches BleuScore/RougeScore)
- WordLevelStats → WordLevelStat
- GlobalStats → GlobalStat
- TokenizationStats → TokenizationStat; converted from functional TypedDict
  form to class form now that global_ is a valid identifier
- "global" key → "global_" in compute_stats() return value
- Update __init__.py __all__ and imports
- Update tests: import names, assertIn("global_"), typed annotations
- Update CHANGELOG migration notes

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Agent-Logs-Url: https://github.com/PyThaiNLP/pythainlp/sessions/d0fa4ca1-06a6-48cc-b353-4df24554e700
Copilot AI force-pushed the copilot/add-typed-dicts-for-complex-dictionaries branch from 61f0c8f to a1863d4 Compare March 24, 2026 12:57
Copilot finished work on behalf of bact March 24, 2026 12:58
Member

@bact bact left a comment

@copilot Use US-spelling for names

- WordLevelStat field: correctly_tokenised_words → correctly_tokenized_words
- GlobalStat field: tokenisation_indicators → tokenization_indicators
- Private helper: _find_words_correctly_tokenised → _find_words_correctly_tokenized
- All call sites in compute_stats(), cli/benchmark.py, and tests updated
- CHANGELOG migration example updated

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Agent-Logs-Url: https://github.com/PyThaiNLP/pythainlp/sessions/e8eff276-0f77-4d6f-9d4e-d9787e6f414b
Contributor Author

Copilot AI commented Mar 24, 2026

@copilot Use US-spelling for names

Done in 8b8e174. Changed tokenised→tokenized, tokenisation→tokenization, and _find_words_correctly_tokenised→_find_words_correctly_tokenized across word_tokenization.py, cli/benchmark.py, tests, and the CHANGELOG example.

Copilot AI changed the title Add TypedDicts for complex return dicts: RougeScore, TokenizationStats family, CorefResult Add TypedDicts for complex return dicts: RougeScore, TokenizationStat family, CorefResult Mar 24, 2026
Copilot AI requested a review from bact March 24, 2026 13:12
Copilot finished work on behalf of bact March 24, 2026 13:13
@bact bact marked this pull request as ready for review March 24, 2026 13:15
@bact bact added the labels benchmark (benchmarking tools in the library) and refactoring (a technical improvement which does not add any new features or change existing features) Mar 24, 2026
@bact bact added this to PyThaiNLP Mar 24, 2026
@bact bact added this to the 5.3.3 milestone Mar 24, 2026
Updated CHANGELOG to reflect breaking changes and new TypedDicts in pythainlp.benchmarks and pythainlp.coref. Added migration notes for users to transition to the new TypedDict structures.
@bact bact merged commit a4296cb into dev Mar 25, 2026
24 checks passed
@bact bact deleted the copilot/add-typed-dicts-for-complex-dictionaries branch March 25, 2026 16:03
Labels

benchmark (benchmarking tools in the library), refactoring (a technical improvement which does not add any new features or change existing features)

Projects

Status: Done

2 participants