SMPTE journal-article backfill — ~18k NLM articles into the per-doc registry by SteveLLamb · Pull Request #1172 · PrZ3r/MSRBot.io

SteveLLamb · 2026-05-22T19:06:39Z

Summary

Backfills the SMPTE journal-article corpus — ~18k articles from the HIGHWIRE
vendor XML in _source/SMPTE/ — into the per-doc registry (#1108). These are
the "Gap" bucket from sourceInventory.smpte.md: SMPTE journal papers the
registry has never carried.

References are not extracted in this pass — the <back> reference list is
a deferred follow-up.

Tooling (commit 1)

- inventorySource.smpte.js: refactored to read the per-doc registry via
  loadAllDocs() instead of the retired monolithic documents.json.
- readNlmArticleXml (added to extractSourceMetadata.js): parser for the
  NLM Journal Archiving & Interchange DTD. The HIGHWIRE corpus turned out to be
  NLM per-article XML, not the Allen Press journal_metadata "issue XML" the
  schemaMap anticipated, hence the parser/provenance tag smpte-journal-article-nlm@v1.
- extractSmpteJournalArticles.js: one-time runner that walks _source/SMPTE/HIGHWIRE,
  builds one per-doc JSON per article, dedups against the registry, AJV-validates
  each doc, and writes into src/main/data/docs/smpte/journal-article/{year}/.
  Dry-run by default; --apply to write; --limit N for resumable chunked runs
  (no --offset, because targets are recomputed each run).
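
The dry-run / --apply / --limit behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in extractSmpteJournalArticles.js: parseArgs, selectTargets, run, and the writeDoc callback are hypothetical names, and the real runner also parses XML and AJV-validates each doc.

```javascript
// Hypothetical sketch: a dry-run-by-default runner with --apply and --limit N.
function parseArgs(argv) {
  const opts = { apply: false, limit: Infinity };
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === '--apply') opts.apply = true;
    else if (argv[i] === '--limit') opts.limit = Number(argv[++i]);
  }
  return opts;
}

// Targets are recomputed on every run: anything already in the registry is
// skipped, so the next chunk resumes where the previous --apply stopped.
// That is why no --offset flag is needed.
function selectTargets(sourceDocs, registryIds, limit) {
  const seen = new Set(registryIds);
  const targets = [];
  for (const doc of sourceDocs) {
    if (targets.length >= limit) break;
    if (seen.has(doc.docId)) continue;
    seen.add(doc.docId); // also dedups repeats within the source itself
    targets.push(doc);
  }
  return targets;
}

function run(argv, sourceDocs, registryIds, writeDoc) {
  const { apply, limit } = parseArgs(argv);
  const targets = selectTargets(sourceDocs, registryIds, limit);
  if (apply) targets.forEach((doc) => writeDoc(doc)); // dry-run writes nothing
  return { mode: apply ? 'apply' : 'dry-run', count: targets.length };
}
```

Because the registry itself acts as the cursor, a chunk that is interrupted or re-run does not need to remember how far it got.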

Data (subsequent commits)

~18k new per-doc JSONs under src/main/data/docs/smpte/journal-article/{year}/,
landed in chunked --apply runs. Each doc carries $meta provenance per field
(smpte-journal-article-nlm@v1).

Full-corpus dry-run


Unique articles in source	18,166 (1916–2010)
Already in registry	81
New articles imported	18,085
Schema-invalid	0
Parse errors	0

Verification

node src/main/scripts/extras/extractSmpteJournalArticles.js (dry-run) — clean.
npm run canonicalize && npm run validate after each --apply chunk.
Doc shape matches existing rich journal docs (e.g. 10.5594-J08011.json).

Out of scope (follow-ups)

Journal-article references extraction.
~221 delta-docs reconciliation.
refMap.json J. SMPE → 10.5594-J* resolution patterns.
[FEATURE] Supplemental archive repo for scrubbed _source/; deterministic source path for re-extraction #1171 rawSource envelope.

Closes #1173

Refactor inventorySource.smpte.js to read the per-doc registry (#1108) via loadAllDocs() instead of the retired monolithic documents.json.

Add readNlmArticleXml to extractSourceMetadata.js: a parser for the NLM Journal Archiving DTD, the format of the ~18k HIGHWIRE SMPTE journal-article XML deliveries.

Add extractSmpteJournalArticles.js: a one-time runner that parses the HIGHWIRE NLM corpus and writes one per-doc JSON per article into src/main/data/docs/smpte/journal-article/{year}/. Dry-run by default, --apply to write, --limit N for resumable chunked runs. References are not extracted (deferred pass).

Full-corpus dry-run: 18,085 new articles (1916-2010), all schema-valid, 0 parse errors.

prz3-unit · 2026-05-22T19:08:50Z

Review link

MSRBot.io Build Preview
This link updates on new commits to this PR.

Docs whose articleType is listed in site.json noPageArticleTypes (obituary, other, news, calendar, announcement, correction, addendum, reprint) get no detail page, reference-tree page, sitemap entry, or search-index row. They remain fully present in the API JSON and registry: the gate suppresses generated pages, not data.

- New shared lib/pageGate.js drives the decision from site config.
- build.js skips gated docs at the per-doc, reftree, and sitemap emit loops, and removes stale page directories left by pre-gate builds.
- build.search-index.js drops gated docs from search-index/facets.
- docId.hbs surfaces articleType on the doc page, below Doc Type.
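
A minimal sketch of such a gate, assuming the site config exposes noPageArticleTypes as a plain array. The function names isPageGated and docsWithPages are hypothetical and are not taken from lib/pageGate.js.

```javascript
// Hypothetical sketch of a config-driven page gate.
// The list mirrors the articleTypes named in the commit message.
const site = {
  noPageArticleTypes: [
    'obituary', 'other', 'news', 'calendar',
    'announcement', 'correction', 'addendum', 'reprint',
  ],
};

// True when a doc should get no generated page. Docs without an articleType
// are never gated in this sketch (an assumption).
function isPageGated(doc, siteConfig) {
  const gated = new Set(siteConfig.noPageArticleTypes || []);
  return doc.articleType != null && gated.has(doc.articleType);
}

// Emit loops (detail page, reftree, sitemap, search index) would iterate over
// this filtered view, while the API JSON keeps using the full doc list.
function docsWithPages(docs, siteConfig) {
  return docs.filter((doc) => !isPageGated(doc, siteConfig));
}
```

Keeping the decision in one shared function means every emit loop applies the same rule, and changing the gated list is a config edit rather than a code change.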

extractSmpteJournalArticles.js now imports both NLM corpora, selected by --corpus journal|conference|both (default both):

smptej/ 10.5594-J* -> docType "Journal Article"
smptem/ 10.5594-M* -> docType "Conference Paper" (~1,502 docs)

Same readNlmArticleXml parser and pipeline; per-corpus $meta version (smpte-conference-paper-nlm@v1 / smpte-journal-article-nlm@v1).

Correct 10.5594-M00395, an existing conference paper mistyped as "Journal Article", to "Conference Paper". canonicalize re-homes it to the conference-paper shard; its curated references/resolvedHref are preserved.

The "Conference Paper" docType enum + titleLabelDocTypes / nonLineageDocTypes additions landed in earlier commits.
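
The corpus selection can be pictured as a small lookup table. The sketch below is an assumption about structure only: CORPORA, resolveCorpora, and classify are hypothetical names, while the directory names, docId prefixes, docTypes, and $meta versions are the ones given in the commit message.

```javascript
// Hypothetical sketch: mapping --corpus values to per-corpus settings.
const CORPORA = {
  journal: {
    dir: 'smptej',
    docIdPrefix: '10.5594-J',
    docType: 'Journal Article',
    metaVersion: 'smpte-journal-article-nlm@v1',
  },
  conference: {
    dir: 'smptem',
    docIdPrefix: '10.5594-M',
    docType: 'Conference Paper',
    metaVersion: 'smpte-conference-paper-nlm@v1',
  },
};

// Resolve the flag value to the list of corpora to process (default: both).
function resolveCorpora(flag = 'both') {
  if (flag === 'both') return [CORPORA.journal, CORPORA.conference];
  if (!Object.prototype.hasOwnProperty.call(CORPORA, flag)) {
    throw new Error(`Unknown --corpus value: ${flag}`);
  }
  return [CORPORA[flag]];
}

// Look up the corpus a docId belongs to, or null if neither prefix matches.
function classify(docId) {
  return Object.values(CORPORA).find((c) => docId.startsWith(c.docIdPrefix)) || null;
}
```

Under this mapping, a check like classify('10.5594-M00395').docType returns "Conference Paper", which is the kind of test that would flag a doc mistyped as "Journal Article".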

….1.0) buildStats now emits documents.journalArticles { total, articleTypes, byArticleType } — a count of Journal Article docs grouped by articleType, sorted descending. Backward-compatible new field, so stats apiVersion bumps 1.0.0 -> 1.1.0.
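
As an illustration of the grouping this describes, the sketch below counts Journal Article docs by articleType and sorts descending. It is not the buildStats implementation. Three things are assumptions: that articleTypes is the number of distinct types, that docs without an articleType land in an "unspecified" bucket, and that ties are broken alphabetically.

```javascript
// Hypothetical sketch: a Journal Article breakdown by articleType.
function journalArticleStats(docs) {
  const counts = new Map();
  let total = 0;
  for (const doc of docs) {
    if (doc.docType !== 'Journal Article') continue;
    total += 1;
    const type = doc.articleType || 'unspecified'; // assumed fallback bucket
    counts.set(type, (counts.get(type) || 0) + 1);
  }
  const byArticleType = [...counts.entries()]
    .map(([articleType, count]) => ({ articleType, count }))
    // Descending by count; alphabetical tie-break keeps output deterministic.
    .sort((a, b) => b.count - a.count || a.articleType.localeCompare(b.articleType));
  return { total, articleTypes: byArticleType.length, byArticleType };
}
```

Adding a field like this leaves existing consumers of the stats payload unaffected, which matches the minor version bump.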

The full-bundle shape grew past GitHub's 100 MB per-file limit on the gh-pages branch (119.67 MB at PR #1172). /api/documents.json now emits a lightweight index of all docs, { docId, publisher, docType, docLabel, docTitle, articleType?, path }, and links each row to /api/doc/{docId}.json for the full record including $meta provenance. Drops the bundle from ~120 MB to ~7 MB and stays small as the corpus grows.

- build.js: replace full-bundle write with index payload; bump documents apiVersion 1.0.0 -> 2.0.0. Per-doc shards untouched.
- api.hbs: update the endpoint table to describe the new shape.
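
A rough sketch of projecting a full doc down to an index row with the fields listed above. The function name toIndexRow is hypothetical, and the exact value stored in path is an assumption based on the /api/doc/{docId}.json pattern.

```javascript
// Hypothetical sketch: reduce a full per-doc record to a lightweight index row.
function toIndexRow(doc) {
  const row = {
    docId: doc.docId,
    publisher: doc.publisher,
    docType: doc.docType,
    docLabel: doc.docLabel,
    docTitle: doc.docTitle,
  };
  // articleType is optional, so it is only emitted when present.
  if (doc.articleType) row.articleType = doc.articleType;
  // Assumed path format; the full record (with $meta) lives at this location.
  row.path = `/api/doc/${doc.docId}.json`;
  return row;
}

// The index is the projection applied to every doc in the registry.
function buildIndex(docs) {
  return docs.map(toIndexRow);
}
```

Heavy fields such as references and $meta never enter the index, so its size grows with the number of docs rather than with the depth of each record.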

SteveLLamb · 2026-05-22T22:48:04Z

Tracks and closes #1173 — the /api/documents.json file crossed GitHub's 100 MB per-file limit on the gh-pages deploy. Fixed here by converting that endpoint to a lightweight index (apiVersion 2.0.0); full records with $meta remain at /api/doc/{docId}.json.

extractSmpteJournalIssues.js imports SMPTE journal docs from the Allen Press / APTARA journal_metadata XML format (IEEE content_delivery 1.6 schema) via the existing readIssueMetadataXml parser. Different DTD from the HIGHWIRE NLM per-article XML the sibling extractor handles: this format groups multiple articles under one issue header, so each article inherits journalSuite + issue fields from its parent block.

Two modes:

--coverage-only   create new docs not already in the registry (~4,691 docs against the current corpus)
--crossfill-only  enrich overlap docs by adding only missing fields (universal journalAcronym + publisherLocation.country that NLM lacked, plus ~1,430 abstracts NLM missed)

Default does both. Cross-fill rule is universal: only-add-missing, never overwrite. This protects hand-curated and prior-parsed values. Cross-filled docs are re-validated against the schema after mutation.

Provenance: smpte-journal-issue-xml@v1 / smpte-conference-issue-xml@v1.

References and keywords explicitly skipped: APTARA major_topic/minor_topic are section labels (already covered by articleType).
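
The only-add-missing rule can be sketched as below. This is a simplified illustration, not the extractor's code: isMissing and crossFill are hypothetical names, the definition of "missing" is an assumption, and only top-level fields are handled (a nested field such as publisherLocation.country would need a path-aware variant).

```javascript
// Hypothetical sketch of an only-add-missing cross-fill.
function isMissing(value) {
  return (
    value === undefined ||
    value === null ||
    value === '' ||
    (Array.isArray(value) && value.length === 0)
  );
}

// Copies a field from `incoming` only when `existing` lacks it.
// Existing values are never overwritten, and the input doc is not mutated.
function crossFill(existing, incoming, fields) {
  const merged = { ...existing };
  const added = [];
  for (const field of fields) {
    if (isMissing(merged[field]) && !isMissing(incoming[field])) {
      merged[field] = incoming[field];
      added.push(field);
    }
  }
  return { merged, added };
}
```

Returning the list of added fields gives the caller what it needs to record per-field provenance and to decide whether the doc must be re-validated.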

prz3-unit Bot added a commit that referenced this pull request May 22, 2026

deploy preview for PR #1172 (PrZ3 Unit)

5ebc06e

SteveLLamb added 2 commits May 22, 2026 12:13

batch 1

b6e8f87

Fix build: load documents sub-registry from per-doc files (#1108)

757e36f

prz3-unit Bot added a commit that referenced this pull request May 22, 2026

deploy preview for PR #1172 (PrZ3 Unit)

6a6a5c1

SteveLLamb added 3 commits May 22, 2026 13:01

batch 2

4720f7c

prz3-unit Bot added a commit that referenced this pull request May 22, 2026

deploy preview for PR #1172 (PrZ3 Unit)

dec7fe0

prz3-unit Bot added a commit that referenced this pull request May 22, 2026

deploy preview for PR #1172 (PrZ3 Unit)

b510b0f

SteveLLamb added 3 commits May 22, 2026 14:40

batch 3

ebfbd01

batch 3 continued

7cc8327

SteveLLamb added 2 commits May 22, 2026 15:48

Update CHANGELOG for articleType gate, stats breakdown, and API index

3fa2a3b

batch 4

8ab9c2f

prz3-unit Bot added a commit that referenced this pull request May 22, 2026

deploy preview for PR #1172 (PrZ3 Unit)

2198573

prz3-unit Bot added a commit that referenced this pull request May 22, 2026

deploy preview for PR #1172 (PrZ3 Unit)

c5c5da4

SteveLLamb added 2 commits May 22, 2026 16:55

batch 5

2dc0094

SteveLLamb wants to merge 14 commits into main from feature/smpte-backfill-journals
