feat(pelagios): tune Pleiades matcher against the live corpus #75

Merged
Eddy1919 merged 1 commit into main from feat/pleiades-matcher-tuning on Jun 20, 2026

@Eddy1919

Owner

Follow-up to the Pelagios place axis: ran propose_pleiades_links over all 838 distinct findspots / 5,932 inscriptions pulled from the live API and tuned the matcher.

Changes

  • Stem-prefix index in propose_links (prefix_len, default 3). A full difflib pass over the ~11k-place gazetteer doesn't finish; the indexed path runs in ~2s. prefix_len=0 restores the old full scan.
  • Stopwords: cum/et + museum/collection scaffolding. Recovers ~70 inscriptions that scored just under threshold (Clusium cum agro, Clusii in museo publico → Clusium 1.0).
  • Default threshold 0.84 → 0.90. Matches recovered below 0.90 are mostly wrong (Clusino GA. → lake Clusinus, Parisiis → Parsiana, in fronte DA.).
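
To illustrate the prefix-index idea from the first bullet, here is a minimal sketch. It is not the code in propose_links: the function names `build_prefix_index` and `best_match` are hypothetical, and it keys on the raw lowercased name rather than on whatever stemming the real matcher applies. It only shows how bucketing by the first `prefix_len` characters shrinks the set of difflib comparisons, and how `prefix_len=0` falls back to comparing against every name.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_prefix_index(names, prefix_len=3):
    """Group gazetteer names by their first `prefix_len` lowercase characters."""
    index = defaultdict(list)
    for name in names:
        index[name.lower()[:prefix_len]].append(name)
    return index


def best_match(query, names, index=None, prefix_len=3, threshold=0.90):
    """Return (name, score) for the best candidate at or above threshold, else None."""
    q = query.lower()
    if prefix_len == 0 or index is None:
        # Full scan: every gazetteer name is a candidate.
        candidates = names
    else:
        # Indexed path: only names sharing the query's prefix are compared.
        candidates = index.get(q[:prefix_len], [])
    best = None
    for name in candidates:
        score = SequenceMatcher(None, q, name.lower()).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (name, score)
    return best


# Toy gazetteer for illustration only.
gazetteer = ["Clusium", "Clusinus", "Parsiana", "Roma"]
index = build_prefix_index(gazetteer, prefix_len=3)
indexed_hit = best_match("Clusim", gazetteer, index)
full_scan_hit = best_match("Clusim", gazetteer, prefix_len=0)
```

The trade-off in a scheme like this is that a candidate whose first characters differ from the query is never compared, so the indexed path and the full scan can disagree on such inputs even when they agree on the rest.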

Coverage sweep

| threshold | findspots | inscriptions |
|---|---|---|
| 1.00 (exact) | 57 | 1,174 (53%) |
| 0.90 (default) | 75 | 1,280 (58%) |
| 0.84 (old) | 90 | 1,328 (60%) |
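
A sweep like the one above can be tabulated with a few lines of code. The sketch below is an assumption about the shape of the computation, not the script used for this PR: `coverage_at` and `sweep` are hypothetical names, the input is a list of `(findspot, inscription_count, best_score)` tuples, and the share is taken over the total inscriptions in that input (the denominator behind the percentages in the table is not stated in the PR).

```python
def coverage_at(scored, threshold):
    """Count findspots and inscriptions whose best score meets the threshold."""
    matched = [(count, score) for _, count, score in scored if score >= threshold]
    return len(matched), sum(count for count, _ in matched)


def sweep(scored, thresholds=(1.00, 0.90, 0.84)):
    """Return one (threshold, findspots, inscriptions, share) row per threshold."""
    total = sum(count for _, count, _ in scored)
    rows = []
    for threshold in thresholds:
        findspots, inscriptions = coverage_at(scored, threshold)
        share = inscriptions / total if total else 0.0
        rows.append((threshold, findspots, inscriptions, share))
    return rows


# Placeholder data: labels, counts, and scores are invented for illustration.
toy_scores = [("a", 10, 1.0), ("b", 2, 0.92), ("c", 3, 0.85), ("d", 5, 0.5)]
for threshold, findspots, inscriptions, share in sweep(toy_scores):
    print(f"{threshold:.2f}  {findspots:>3}  {inscriptions:>5}  ({share:.0%})")
```

Coverage alone says nothing about precision, which is why the sub-0.90 rows still need the manual review described in this PR.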

Open precision gaps (documented, not threshold-fixable)

Place-type disambiguation (prefer settlements over lakes/rivers) and non-findspot string filtering (catalogue sigla, pure museum provenance).

31 gazetteer tests pass (+4). docs/PELAGIOS.md updated with the sweep.

🤖 Generated with Claude Code

Ran propose_pleiades_links over all 838 distinct findspots (5,932
inscriptions) pulled from the live API. Findings, now folded in:

- Stem-prefix index in propose_links (prefix_len, default 3). A full
  difflib pass over the ~11k-place gazetteer does not finish; the indexed
  path runs in ~2s. prefix_len=0 forces the old full comparison.
- Stopwords: add `cum`/`et` and museum/collection scaffolding (museo,
  publico, collezione, …). Recovers ~70 inscriptions that scored just under
  threshold ("Clusium cum agro", "Clusii in museo publico" → Clusium 1.0).
- Default threshold 0.84 → 0.90. Sweep showed sub-0.90 recall is mostly
  wrong (Clusino GA.→lake Clusinus, Parisiis→Parsiana, "in fronte DA.").
  At 0.90: 75 findspots / 1,280 inscriptions (58%); exact-stem alone covers
  53%. docs/PELAGIOS.md records the sweep + the open precision gaps
  (place-type disambiguation, non-findspot string filtering).
- Tests: cum phrase, museum scaffolding, prefix-index vs full-scan agreement.
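
The stopword step described above can be pictured as a token filter applied before scoring. This is a sketch under assumptions: `content_tokens` is a hypothetical helper, the set contains only the words named in this message (the elided entries behind "…" are not filled in), and how the real matcher then stems or selects among the surviving tokens is not shown here.

```python
import re

# Hypothetical stopword set: only the words spelled out in the commit message.
STOPWORDS = frozenset({"cum", "et", "museo", "publico", "collezione"})


def content_tokens(findspot, stopwords=STOPWORDS):
    """Lowercase the string, split it into alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[^\W\d_]+", findspot.lower())
    return [token for token in tokens if token not in stopwords]


print(content_tokens("Clusium cum agro"))         # ['clusium', 'agro']
print(content_tokens("Clusii in museo publico"))  # ['clusii', 'in']
```

With the scaffolding words gone, fewer non-place tokens remain to dilute the similarity score for the place-name token.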

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Eddy1919 merged commit 1ef18d9 into main on Jun 20, 2026
3 of 4 checks passed
@Eddy1919 deleted the feat/pleiades-matcher-tuning branch on June 20, 2026 at 11:37