SMPTE journal-article backfill — ~18k NLM articles into the per-doc registry#1172
Open
SteveLLamb wants to merge 14 commits into
Refactor inventorySource.smpte.js to read the per-doc registry (#1108) via loadAllDocs() instead of the retired monolithic documents.json.

Add readNlmArticleXml to extractSourceMetadata.js: a parser for the NLM Journal Archiving DTD, the format of the ~18k HIGHWIRE SMPTE journal-article XML deliveries.

Add extractSmpteJournalArticles.js: a one-time runner that parses the HIGHWIRE NLM corpus and writes one per-doc JSON per article into src/main/data/docs/smpte/journal-article/{year}/. Dry-run by default, --apply to write, --limit N for resumable chunked runs.

References are not extracted (deferred pass).

Full-corpus dry-run: 18,085 new articles (1916-2010), all schema-valid, 0 parse errors.
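The dry-run default and the resumable --limit N behavior can be illustrated with a minimal sketch. It assumes that resumability comes from skipping docs already present in the registry (the PR summary mentions dedup against the registry and the absence of an --offset flag). The names parseArgs and selectTargets and the sample docIds are hypothetical, not taken from extractSmpteJournalArticles.js.

```javascript
// Hypothetical sketch: parseArgs and selectTargets are illustrative names,
// not functions taken from extractSmpteJournalArticles.js.

// Dry-run unless --apply is given; --limit N caps the chunk size.
function parseArgs(argv) {
  const opts = { apply: false, limit: Infinity };
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === "--apply") {
      opts.apply = true;
    } else if (argv[i] === "--limit") {
      const n = Number.parseInt(argv[++i], 10);
      if (!Number.isInteger(n) || n < 1) {
        throw new Error("--limit expects a positive integer");
      }
      opts.limit = n;
    }
  }
  return opts;
}

// Targets are recomputed on every run: anything already in the registry is
// skipped, so repeating the same --limit N command works through the corpus
// chunk by chunk without an --offset flag.
function selectTargets(corpusDocIds, registryDocIds, limit) {
  const known = new Set(registryDocIds);
  const pending = corpusDocIds.filter((id) => !known.has(id));
  return pending.slice(0, limit);
}

// Sample docIds are made up for the example.
const corpus = ["10.5594-J00001", "10.5594-J00002", "10.5594-J00003"];
const opts = parseArgs(["--apply", "--limit", "2"]);
const firstRun = selectTargets(corpus, [], opts.limit);
const secondRun = selectTargets(corpus, firstRun, opts.limit);
console.log(opts.apply ? "apply" : "dry-run", firstRun, secondRun);
```

In this sketch the second run picks up exactly where the first stopped, because the first run's output has become part of the registry by then.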
Docs whose articleType is listed in site.json noPageArticleTypes (obituary, other, news, calendar, announcement, correction, addendum, reprint) get no detail page, reference-tree page, sitemap entry, or search-index row. They remain fully present in the API JSON and registry: the gate suppresses generated pages, not data.

- New shared lib/pageGate.js drives the decision from site config.
- build.js skips gated docs at the per-doc, reftree, and sitemap emit loops, and removes stale page directories left by pre-gate builds.
- build.search-index.js drops gated docs from search-index/facets.
- docId.hbs surfaces articleType on the doc page, below Doc Type.
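A config-driven gate of this kind might look roughly like the sketch below. It is not the actual lib/pageGate.js API: makePageGate, the sample docIds, and the "research-article" value are assumptions made for illustration. Only the noPageArticleTypes list comes from the commit message.

```javascript
// Hypothetical sketch of a config-driven page gate; function and variable
// names are illustrative, not the actual lib/pageGate.js API.

// Subset of site.json relevant here, with the types listed in the commit.
const siteConfig = {
  noPageArticleTypes: [
    "obituary", "other", "news", "calendar",
    "announcement", "correction", "addendum", "reprint",
  ],
};

// Build a predicate that is true when a doc should get generated pages.
function makePageGate(config) {
  const gated = new Set(config.noPageArticleTypes || []);
  return (doc) => !gated.has(doc.articleType);
}

const hasPage = makePageGate(siteConfig);

// Sample docs are made up; "research-article" is an assumed articleType value.
const docs = [
  { docId: "example-paper", articleType: "research-article" },
  { docId: "example-obituary", articleType: "obituary" },
  { docId: "example-standard" }, // no articleType: never gated
];

// Gated docs are skipped at page-emit time but stay in the data set.
const pageDocs = docs.filter(hasPage);
console.log(pageDocs.map((d) => d.docId));
```

Keeping the decision in one predicate lets the per-doc, reftree, sitemap, and search-index steps share the same rule instead of each re-reading the config.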
extractSmpteJournalArticles.js now imports both NLM corpora, selected by --corpus journal|conference|both (default both):

  smptej/ 10.5594-J* -> docType "Journal Article"
  smptem/ 10.5594-M* -> docType "Conference Paper" (~1,502 docs)

Same readNlmArticleXml parser and pipeline; per-corpus $meta version (smpte-conference-paper-nlm@v1 / smpte-journal-article-nlm@v1).

Correct 10.5594-M00395 (an existing conference paper mistyped as "Journal Article") to "Conference Paper". canonicalize re-homes it to the conference-paper shard; its curated references/resolvedHref are preserved.

The "Conference Paper" docType enum + titleLabelDocTypes / nonLineageDocTypes additions landed in earlier commits.
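The corpus selection described above can be sketched as a small lookup table. The directory names, DOI prefixes, docTypes, and $meta versions come from the commit message; the table layout and the names CORPORA, resolveCorpora, and docTypeFor are hypothetical.

```javascript
// Hypothetical sketch: the table layout and function names are illustrative;
// the values themselves are the ones listed in the commit message.
const CORPORA = new Map([
  ["journal", {
    dir: "smptej",
    doiPrefix: "10.5594-J",
    docType: "Journal Article",
    metaVersion: "smpte-journal-article-nlm@v1",
  }],
  ["conference", {
    dir: "smptem",
    doiPrefix: "10.5594-M",
    docType: "Conference Paper",
    metaVersion: "smpte-conference-paper-nlm@v1",
  }],
]);

// Resolve --corpus journal|conference|both (default both) to corpus entries.
function resolveCorpora(selection = "both") {
  if (selection === "both") return [...CORPORA.values()];
  const corpus = CORPORA.get(selection);
  if (!corpus) throw new Error(`Unknown --corpus value: ${selection}`);
  return [corpus];
}

// Derive the docType from a docId's prefix; null when no corpus matches.
function docTypeFor(docId) {
  for (const corpus of CORPORA.values()) {
    if (docId.startsWith(corpus.doiPrefix)) return corpus.docType;
  }
  return null;
}

console.log(resolveCorpora("journal").map((c) => c.dir));
console.log(docTypeFor("10.5594-M00395"));
```

Under this mapping, 10.5594-M00395 resolves to "Conference Paper", which matches the correction made in this commit.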
….1.0)
buildStats now emits documents.journalArticles { total, articleTypes,
byArticleType } — a count of Journal Article docs grouped by articleType,
sorted descending. Backward-compatible new field, so stats apiVersion
bumps 1.0.0 -> 1.1.0.
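
A minimal sketch of how such a grouped count could be computed follows. The exact field shapes are assumptions: here articleTypes is taken to be the number of distinct types and byArticleType a list of { articleType, count } rows, and docs without an articleType are bucketed as "unspecified". The function name summarizeJournalArticles is hypothetical and not taken from buildStats.

```javascript
// Hypothetical sketch: field shapes and the "unspecified" bucket are
// assumptions, not the confirmed output of buildStats.
function summarizeJournalArticles(docs) {
  const counts = new Map();
  let total = 0;
  for (const doc of docs) {
    if (doc.docType !== "Journal Article") continue;
    total += 1;
    const type = doc.articleType ?? "unspecified";
    counts.set(type, (counts.get(type) || 0) + 1);
  }
  // Sort by count descending; ties fall back to name order for stable output.
  const byArticleType = [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([articleType, count]) => ({ articleType, count }));
  return { total, articleTypes: byArticleType.length, byArticleType };
}

// Sample docs are made up for the example.
const journalStats = summarizeJournalArticles([
  { docType: "Journal Article", articleType: "research-article" },
  { docType: "Journal Article", articleType: "obituary" },
  { docType: "Journal Article", articleType: "research-article" },
  { docType: "Standard" },
]);
console.log(JSON.stringify(journalStats));
```

Because the field is purely additive, existing consumers of the stats payload keep working, which is consistent with the minor version bump.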
The full-bundle shape grew past GitHub's 100 MB per-file limit on the gh-pages branch (119.67 MB at PR #1172). /api/documents.json now emits a lightweight index of all docs, { docId, publisher, docType, docLabel, docTitle, articleType?, path }, and links each row to /api/doc/{docId}.json for the full record including $meta provenance. Drops the bundle from ~120 MB to ~7 MB and stays small as the corpus grows.

- build.js: replace full-bundle write with index payload; bump documents apiVersion 1.0.0 -> 2.0.0. Per-doc shards untouched.
- api.hbs: update the endpoint table to describe the new shape.
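The index-row projection could be sketched as below. The field list and the /api/doc/{docId}.json path pattern come from the commit message; the function name toIndexRow and the sample record are hypothetical, and the real build.js may assemble the payload differently.

```javascript
// Hypothetical sketch: toIndexRow is an illustrative name, not a function
// taken from build.js.
function toIndexRow(doc) {
  const row = {
    docId: doc.docId,
    publisher: doc.publisher,
    docType: doc.docType,
    docLabel: doc.docLabel,
    docTitle: doc.docTitle,
  };
  // articleType is optional, so it is only emitted when present.
  if (doc.articleType) row.articleType = doc.articleType;
  // Each row links to the per-doc shard holding the full record and $meta.
  row.path = `/api/doc/${doc.docId}.json`;
  return row;
}

// Sample record is made up; the heavy fields are dropped from the row.
const fullDoc = {
  docId: "example-doc",
  publisher: "SMPTE",
  docType: "Journal Article",
  docLabel: "Example label",
  docTitle: "Example title",
  abstract: "Long text that stays in the per-doc shard only.",
  $meta: { note: "provenance stays in the per-doc shard" },
};
const indexRow = toIndexRow(fullDoc);
console.log(JSON.stringify(indexRow));
```

Dropping field-level provenance and long text from every row is what keeps the index small; clients that need the full record follow path.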
Tracks and closes #1173 — the |
extractSmpteJournalIssues.js imports SMPTE journal docs from the Allen
Press / APTARA journal_metadata XML format (IEEE content_delivery 1.6
schema) via the existing readIssueMetadataXml parser. Different DTD
from the HIGHWIRE NLM per-article XML the sibling extractor handles —
this format groups multiple articles under one issue header, so each
article inherits journalSuite + issue fields from its parent block.
Two modes:
--coverage-only create new docs not already in the registry
(~4,691 docs against the current corpus)
--crossfill-only enrich overlap docs by adding only missing fields
(universal journalAcronym + publisherLocation.country
that NLM lacked, plus ~1,430 abstracts NLM missed)
Default does both. Cross-fill rule is universal: only-add-missing,
never overwrite — protects hand-curated and prior-parsed values.
Cross-filled docs are re-validated against the schema after mutation.
Provenance: smpte-journal-issue-xml@v1 / smpte-conference-issue-xml@v1.
References and keywords explicitly skipped — APTARA major_topic/
minor_topic are section labels (already covered by articleType).
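
The only-add-missing rule can be sketched as a merge that never overwrites an existing value. What counts as "missing" (undefined, null, empty string, empty array) and the recursion into nested objects such as publisherLocation are assumptions here. The names isMissing, isPlainObject, and crossFill and the sample values are hypothetical, not taken from extractSmpteJournalIssues.js.

```javascript
// Hypothetical sketch of an only-add-missing merge; names and the definition
// of "missing" are assumptions, not the extractor's actual code.
function isMissing(value) {
  return value === undefined || value === null || value === "" ||
    (Array.isArray(value) && value.length === 0);
}

function isPlainObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

// Returns a new doc plus the list of field paths that were filled in.
function crossFill(existing, incoming) {
  const doc = { ...existing };
  const added = [];
  for (const [key, value] of Object.entries(incoming)) {
    if (isMissing(value)) continue;
    if (isMissing(doc[key])) {
      doc[key] = value; // field absent so far: add it
      added.push(key);
    } else if (isPlainObject(doc[key]) && isPlainObject(value)) {
      const nested = crossFill(doc[key], value); // fill gaps inside objects
      doc[key] = nested.doc;
      added.push(...nested.added.map((k) => `${key}.${k}`));
    }
    // Otherwise the existing value wins: never overwrite.
  }
  return { doc, added };
}

// Sample values are made up for the example.
const filled = crossFill(
  { docTitle: "Curated title", publisherLocation: { city: "Example City" } },
  {
    docTitle: "Vendor title",
    journalAcronym: "EXAMPLE",
    publisherLocation: { country: "US" },
  }
);
console.log(filled.doc.docTitle, filled.added);
```

Returning the list of added paths makes it straightforward to re-validate only the docs that actually changed and to record provenance for the new fields.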
Summary
Backfills the SMPTE journal-article corpus (~18k articles from the HIGHWIRE vendor XML in _source/SMPTE/) into the per-doc registry (#1108). These are the "Gap" bucket from sourceInventory.smpte.md: SMPTE journal papers the registry has never carried.
References are not extracted in this pass; the <back> reference list is a deferred follow-up.

Tooling (commit 1)
- inventorySource.smpte.js: refactored to read the per-doc registry via loadAllDocs() instead of the retired monolithic documents.json.
- readNlmArticleXml (added to extractSourceMetadata.js): parser for the NLM Journal Archiving & Interchange DTD. The HIGHWIRE corpus turned out to be NLM per-article XML, not the Allen Press journal_metadata "issue XML" the schemaMap anticipated, hence the parser/provenance tag smpte-journal-article-nlm@v1.
- extractSmpteJournalArticles.js: one-time runner that walks _source/SMPTE/HIGHWIRE, builds one per-doc JSON per article, dedups against the registry, AJV-validates each doc, and writes into src/main/data/docs/smpte/journal-article/{year}/. Dry-run by default; --apply to write; --limit N for resumable chunked runs (no --offset: targets recompute each run).

Data (subsequent commits)
~18k new per-doc JSONs under src/main/data/docs/smpte/journal-article/{year}/, landed in chunked --apply runs. Each doc carries $meta provenance per field (smpte-journal-article-nlm@v1).

Full-corpus dry-run

Verification
- node src/main/scripts/extras/extractSmpteJournalArticles.js (dry-run): clean.
- npm run canonicalize && npm run validate after each --apply chunk.
10.5594-J08011.json).

Out of scope (follow-ups)
- refMap.json J. SMPE → 10.5594-J* resolution patterns.
- _source/; deterministic source path for re-extraction #1171
- rawSource envelope.

Closes #1173