SMPTE journal-article backfill — ~18k NLM articles into the per-doc registry#1172

Open
SteveLLamb wants to merge 14 commits into main from feature/smpte-backfill-journals

SteveLLamb (Member) commented May 22, 2026

Summary

Backfills the SMPTE journal-article corpus — ~18k articles from the HIGHWIRE
vendor XML in _source/SMPTE/ — into the per-doc registry (#1108). These are
the "Gap" bucket from sourceInventory.smpte.md: SMPTE journal papers the
registry has never carried.

References are not extracted in this pass — the <back> reference list is
a deferred follow-up.

Tooling (commit 1)

  • inventorySource.smpte.js — refactored to read the per-doc registry via
    loadAllDocs() instead of the retired monolithic documents.json.
  • readNlmArticleXml (added to extractSourceMetadata.js) — parser for the
    NLM Journal Archiving & Interchange DTD. The HIGHWIRE corpus turned out to be
    NLM per-article XML, not the Allen Press journal_metadata "issue XML" the
    schemaMap anticipated — hence the parser/provenance tag smpte-journal-article-nlm@v1.
  • extractSmpteJournalArticles.js — one-time runner: walks _source/SMPTE/HIGHWIRE,
    builds one per-doc JSON per article, dedups against the registry, AJV-validates
    each doc, writes into src/main/data/docs/smpte/journal-article/{year}/.
    Dry-run by default; pass --apply to write and --limit N for resumable chunked
    runs. No --offset is needed: targets are recomputed on each run, so docs
    already written are dropped by the registry dedup.
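
A minimal sketch of how the dry-run / --apply / --limit behaviour described above could work. The helper names (parseArgs, selectTargets) and the sample doc IDs are illustrative assumptions, not the actual internals of extractSmpteJournalArticles.js.

```javascript
// Hypothetical helpers: the real runner's internals are not shown in this PR.
function parseArgs(argv) {
  const opts = { apply: false, limit: Infinity }; // dry-run unless --apply is passed
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === '--apply') {
      opts.apply = true;
    } else if (argv[i] === '--limit') {
      const n = Number.parseInt(argv[++i], 10);
      if (!Number.isInteger(n) || n <= 0) {
        throw new Error('--limit expects a positive integer');
      }
      opts.limit = n;
    }
  }
  return opts;
}

// Targets are recomputed from scratch on every run: anything already in the
// registry is filtered out, so the next run resumes where the last one stopped.
function selectTargets(sourceIds, registryIds, limit) {
  return sourceIds.filter((id) => !registryIds.has(id)).slice(0, limit);
}

// Simulated two-chunk run with made-up IDs.
const registry = new Set(['10.5594-J00001']);
const source = ['10.5594-J00001', '10.5594-J00002', '10.5594-J00003', '10.5594-J00004'];
const opts = parseArgs(['--apply', '--limit', '2']);

const firstChunk = selectTargets(source, registry, opts.limit);
if (opts.apply) firstChunk.forEach((id) => registry.add(id)); // stand-in for the write
const secondChunk = selectTargets(source, registry, opts.limit);

console.log(firstChunk, secondChunk);
```

Because the pending set is derived from the registry each time, an interrupted chunk can simply be rerun with the same flags.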

Data (subsequent commits)

~18k new per-doc JSONs under src/main/data/docs/smpte/journal-article/{year}/,
landed in chunked --apply runs. Each doc carries $meta provenance per field
(smpte-journal-article-nlm@v1).

Full-corpus dry-run

Unique articles in source    18,166  (1916–2010)
Already in registry              81
New articles imported        18,085
Schema-invalid                    0
Parse errors                      0

Verification

  • node src/main/scripts/extras/extractSmpteJournalArticles.js (dry-run) — clean.
  • npm run canonicalize && npm run validate after each --apply chunk.
  • Doc shape matches existing rich journal docs (e.g. 10.5594-J08011.json).

Out of scope (follow-ups)

Closes #1173

Refactor inventorySource.smpte.js to read the per-doc registry (#1108)
via loadAllDocs() instead of the retired monolithic documents.json.

Add readNlmArticleXml to extractSourceMetadata.js — a parser for the
NLM Journal Archiving DTD, the format of the ~18k HIGHWIRE SMPTE
journal-article XML deliveries.

Add extractSmpteJournalArticles.js: a one-time runner that parses the
HIGHWIRE NLM corpus and writes one per-doc JSON per article into
src/main/data/docs/smpte/journal-article/{year}/. Dry-run by default,
--apply to write, --limit N for resumable chunked runs. References are
not extracted (deferred pass).

Full-corpus dry-run: 18,085 new articles (1916-2010), all schema-valid,
0 parse errors.

prz3-unit (Bot, Contributor) commented May 22, 2026

Review link

MSRBot.io Build Preview
This link updates on new commits to this PR.

prz3-unit Bot added a commit that referenced this pull request May 22, 2026
prz3-unit Bot added a commit that referenced this pull request May 22, 2026
Docs whose articleType is listed in site.json noPageArticleTypes
(obituary, other, news, calendar, announcement, correction, addendum,
reprint) get no detail page, reference-tree page, sitemap entry, or
search-index row. They remain fully present in the API JSON and
registry — the gate suppresses generated pages, not data.

- New shared lib/pageGate.js drives the decision from site config.
- build.js skips gated docs at the per-doc, reftree, and sitemap emit
  loops, and removes stale page directories left by pre-gate builds.
- build.search-index.js drops gated docs from search-index/facets.
- docId.hbs surfaces articleType on the doc page, below Doc Type.
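
A rough sketch of the page-gate decision described in this commit, assuming site.json exposes noPageArticleTypes as a plain array of strings. The function name makePageGate and the sample docs are hypothetical; lib/pageGate.js may be structured differently.

```javascript
// Hypothetical shape of the gate; the real lib/pageGate.js is not shown here.
function makePageGate(siteConfig) {
  const gatedTypes = new Set(siteConfig.noPageArticleTypes || []);
  // Returns true when a doc should get no generated pages.
  return (doc) => gatedTypes.has(doc.articleType);
}

const site = {
  noPageArticleTypes: [
    'obituary', 'other', 'news', 'calendar',
    'announcement', 'correction', 'addendum', 'reprint',
  ],
};
const isGated = makePageGate(site);

// Made-up docs for illustration.
const docs = [
  { docId: 'example-1', articleType: 'obituary' },
  { docId: 'example-2' }, // no articleType: never gated
];

const pageDocs = docs.filter((doc) => !isGated(doc)); // detail page, reftree, sitemap, search
const apiDocs = docs;                                 // the API and registry keep everything

console.log(pageDocs.length, apiDocs.length);
```

Keeping the decision in one shared predicate lets build.js and build.search-index.js apply the same rule without duplicating the type list.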

extractSmpteJournalArticles.js now imports both NLM corpora, selected
by --corpus journal|conference|both (default both):
  smptej/ 10.5594-J*  -> docType "Journal Article"
  smptem/ 10.5594-M*  -> docType "Conference Paper"  (~1,502 docs)
Same readNlmArticleXml parser and pipeline; per-corpus $meta version
(smpte-conference-paper-nlm@v1 / smpte-journal-article-nlm@v1).

Correct 10.5594-M00395 — an existing conference paper mistyped as
"Journal Article" — to "Conference Paper". canonicalize re-homes it
to the conference-paper shard; its curated references/resolvedHref
are preserved.

The "Conference Paper" docType enum + titleLabelDocTypes /
nonLineageDocTypes additions landed in earlier commits.
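
One way the --corpus selection and per-corpus routing above might be expressed. The directory names, ID prefixes, docType values and $meta versions come from this commit message; the CORPORA table and function names are assumptions for illustration.

```javascript
// Values taken from the commit message; the table layout itself is hypothetical.
const CORPORA = {
  journal: {
    dir: 'smptej',
    idPrefix: '10.5594-J',
    docType: 'Journal Article',
    metaVersion: 'smpte-journal-article-nlm@v1',
  },
  conference: {
    dir: 'smptem',
    idPrefix: '10.5594-M',
    docType: 'Conference Paper',
    metaVersion: 'smpte-conference-paper-nlm@v1',
  },
};

// Resolve the --corpus flag (journal | conference | both) to corpus configs.
function corporaFor(flag = 'both') {
  if (flag === 'both') return Object.values(CORPORA);
  if (!Object.prototype.hasOwnProperty.call(CORPORA, flag)) {
    throw new Error(`unknown --corpus value: ${flag}`);
  }
  return [CORPORA[flag]];
}

// Pick the corpus config for a doc ID by its prefix, or null if none matches.
function classify(docId) {
  return Object.values(CORPORA).find((c) => docId.startsWith(c.idPrefix)) || null;
}

console.log(classify('10.5594-M00395').docType);
console.log(corporaFor('journal').map((c) => c.dir));
```

Under this mapping an M-prefixed ID such as 10.5594-M00395 resolves to "Conference Paper", which is consistent with the correction described above.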
prz3-unit Bot added a commit that referenced this pull request May 22, 2026
….1.0)

buildStats now emits documents.journalArticles { total, articleTypes,
byArticleType } — a count of Journal Article docs grouped by articleType,
sorted descending. Backward-compatible new field, so stats apiVersion
bumps 1.0.0 -> 1.1.0.
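
A sketch of the grouping this commit describes. It assumes articleTypes is the number of distinct types and byArticleType is an array of { articleType, count } rows; neither shape is confirmed by the commit message, and the 'unspecified' bucket is a hypothetical fallback.

```javascript
// Hypothetical reimplementation; the real buildStats output may differ in shape.
function journalArticleStats(docs) {
  const counts = new Map();
  let total = 0;
  for (const doc of docs) {
    if (doc.docType !== 'Journal Article') continue;
    total += 1;
    const key = doc.articleType || 'unspecified'; // assumed fallback bucket
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  const byArticleType = [...counts.entries()]
    .map(([articleType, count]) => ({ articleType, count }))
    // Descending by count; ties broken by name so the order is stable.
    .sort((a, b) => b.count - a.count || a.articleType.localeCompare(b.articleType));
  return { total, articleTypes: byArticleType.length, byArticleType };
}

// Made-up docs for illustration.
const stats = journalArticleStats([
  { docType: 'Journal Article', articleType: 'news' },
  { docType: 'Journal Article', articleType: 'obituary' },
  { docType: 'Journal Article', articleType: 'news' },
  { docType: 'Conference Paper', articleType: 'news' }, // ignored: not a Journal Article
]);

console.log(JSON.stringify(stats));
```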
prz3-unit Bot added a commit that referenced this pull request May 22, 2026
The full-bundle shape grew past GitHub's 100 MB per-file limit on the
gh-pages branch (119.67 MB at PR #1172). /api/documents.json now emits
a lightweight index of all docs — { docId, publisher, docType, docLabel,
docTitle, articleType?, path } — and links each row to /api/doc/{docId}.json
for the full record including $meta provenance. Drops the bundle from
~120 MB to ~7 MB and stays small as the corpus grows.

- build.js: replace full-bundle write with index payload; bump
  documents apiVersion 1.0.0 -> 2.0.0. Per-doc shards untouched.
- api.hbs: update the endpoint table to describe the new shape.
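
A minimal sketch of the index-row projection described above. The field list comes from the commit message; treating path as the /api/doc/{docId}.json link is an assumption, and toIndexRow is a hypothetical name rather than the function in build.js.

```javascript
// Hypothetical projection from a full doc to a lightweight index row.
function toIndexRow(doc) {
  const row = {
    docId: doc.docId,
    publisher: doc.publisher,
    docType: doc.docType,
    docLabel: doc.docLabel,
    docTitle: doc.docTitle,
  };
  if (doc.articleType) row.articleType = doc.articleType; // optional field
  row.path = `/api/doc/${doc.docId}.json`; // assumed meaning of "path"
  return row;
}

// Made-up full record; $meta and other heavy fields are left out of the index.
const fullDoc = {
  docId: 'example-doc',
  publisher: 'SMPTE',
  docType: 'Journal Article',
  docLabel: 'Example label',
  docTitle: 'Example title',
  articleType: 'news',
  abstract: 'Long text that stays in the per-doc shard.',
  $meta: { docTitle: { source: 'smpte-journal-article-nlm@v1' } },
};

const indexRow = toIndexRow(fullDoc);
console.log(indexRow);
```

Because each row carries only a handful of short fields, the index grows roughly linearly with document count rather than with the size of each record.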

SteveLLamb (Member, Author) commented:

Tracks and closes #1173 — the /api/documents.json file crossed GitHub's 100 MB per-file limit on the gh-pages deploy. Fixed here by converting that endpoint to a lightweight index (apiVersion 2.0.0); full records with $meta remain at /api/doc/{docId}.json.

prz3-unit Bot added a commit that referenced this pull request May 22, 2026
prz3-unit Bot added a commit that referenced this pull request May 22, 2026
extractSmpteJournalIssues.js imports SMPTE journal docs from the Allen
Press / APTARA journal_metadata XML format (IEEE content_delivery 1.6
schema) via the existing readIssueMetadataXml parser. Different DTD
from the HIGHWIRE NLM per-article XML the sibling extractor handles —
this format groups multiple articles under one issue header, so each
article inherits journalSuite + issue fields from its parent block.

Two modes:
  --coverage-only   create new docs not already in the registry
                    (~4,691 docs against the current corpus)
  --crossfill-only  enrich overlap docs by adding only missing fields
                    (universal journalAcronym + publisherLocation.country
                    that NLM lacked, plus ~1,430 abstracts NLM missed)
Default does both. Cross-fill rule is universal: only-add-missing,
never overwrite — protects hand-curated and prior-parsed values.
Cross-filled docs are re-validated against the schema after mutation.

Provenance: smpte-journal-issue-xml@v1 / smpte-conference-issue-xml@v1.
References and keywords explicitly skipped — APTARA major_topic/
minor_topic are section labels (already covered by articleType).
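
The only-add-missing cross-fill rule could look roughly like this. The sketch handles top-level fields only (nested fields such as publisherLocation.country would need a recursive variant), and the $meta entry shape and the function names are assumptions, not the extractor's actual code.

```javascript
// A value counts as missing if it is absent, null, an empty string or an empty array.
function isMissing(value) {
  return value === undefined || value === null || value === '' ||
    (Array.isArray(value) && value.length === 0);
}

// Hypothetical cross-fill: copy a field only when the target lacks it.
// Mutates target and returns the list of fields that were added.
function crossFill(target, incoming, metaVersion) {
  const added = [];
  for (const [key, value] of Object.entries(incoming)) {
    if (key === '$meta' || isMissing(value)) continue;
    if (!isMissing(target[key])) continue; // never overwrite existing values
    target[key] = value;
    target.$meta = target.$meta || {};
    target.$meta[key] = { source: metaVersion }; // assumed provenance shape
    added.push(key);
  }
  return added;
}

// Made-up overlap doc: the title is already present, the abstract is not.
const existing = { docId: 'example-doc', docTitle: 'Curated title' };
const fromIssueXml = {
  docTitle: 'Title from issue XML',
  abstract: 'Abstract NLM missed.',
  journalAcronym: 'EXAMPLE',
};

const addedFields = crossFill(existing, fromIssueXml, 'smpte-journal-issue-xml@v1');
console.log(addedFields, existing.docTitle);
```

Returning the list of added fields makes it straightforward to re-validate only the docs that were actually mutated.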
Development

Successfully merging this pull request may close these issues.

/api/documents.json exceeds GitHub's 100 MB file limit on gh-pages deploy
