fix: reject partial-zero classification dates (prof-5qy) #42

Merged
ckrough merged 1 commit into main from issue/prof-5qy on May 29, 2026

ckrough commented May 29, 2026

Classification dates are either the "00000000" no-date sentinel or a real YYYYMMDD calendar date with ASCII digits, year >= 1, month 01-12, and a day valid for that month with leap years honored. Partial-zero components ("20240900" day 00, "20240015" month 00, "00000901" year 0000), impossible dates (e.g. "20240230"), and non-ASCII digit characters collapse to the sentinel.
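The rule above can be sketched as a pair of pure functions. The two function names come from this PR description. The bodies, the `NO_DATE_SENTINEL` constant name, and the choice to treat any non-string input as invalid are assumptions made for illustration, not the merged code.

```python
import calendar

# Assumed constant name; the PR only specifies the literal value.
NO_DATE_SENTINEL = "00000000"

# Explicit ASCII set: str.isdigit() and int() both accept non-ASCII digits
# such as fullwidth or Arabic-Indic numerals, which the rule rejects.
_ASCII_DIGITS = frozenset("0123456789")


def is_valid_classification_date(value: object) -> bool:
    """Return True for the sentinel or a real YYYYMMDD calendar date."""
    if not isinstance(value, str) or len(value) != 8:
        return False
    if not all(ch in _ASCII_DIGITS for ch in value):
        return False
    if value == NO_DATE_SENTINEL:
        return True
    year, month, day = int(value[:4]), int(value[4:6]), int(value[6:])
    if year < 1 or not 1 <= month <= 12:
        return False
    # monthrange returns (weekday of the 1st, days in month), leap-year aware.
    return 1 <= day <= calendar.monthrange(year, month)[1]


def normalize_classification_date(value: object) -> str:
    """Return the value unchanged when valid, otherwise the sentinel."""
    if is_valid_classification_date(value):
        return value
    return NO_DATE_SENTINEL


if __name__ == "__main__":
    for sample in ("20240229", "20230229", "20240900", "20240015", "00000901"):
        print(sample, "->", normalize_classification_date(sample))
```

Under these assumptions, "20240229" passes through unchanged because 2024 is a leap year, while "20230229" and each of the partial-zero examples collapse to the sentinel.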

drover.dates exposes is_valid_classification_date() and normalize_classification_date(). The model boundary (RawClassification.date, ClassificationResult.date) carries a non-raising mode="before" field_validator that normalizes the LLM-supplied date before any downstream consumer (naming policy, tag actions, eval comparison, JSON export, on-disk reloads) reads it. NARA naming delegates to the shared normalizer. GroundTruthEntry.date raises on bad values; _load_ground_truth catches the resulting ValidationError and logs the offending line. The synthetic-sample generator validates against the same rule. The classification prompt instructs the model to emit the sentinel rather than zero-fill a single component.
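The split between the lenient model boundary and the strict ground-truth loader might look like the following sketch, which assumes Pydantic v2. The class names, helper names, and `mode="before"` come from the description above. Everything else is hypothetical: the models are reduced to the `date` field, the helper bodies are compact stand-ins, `load_entries` is an illustrative substitute for `_load_ground_truth`, and JSON-lines input is assumed.

```python
import calendar
import logging

from pydantic import BaseModel, ValidationError, field_validator

logger = logging.getLogger(__name__)
NO_DATE_SENTINEL = "00000000"  # assumed constant name


def is_valid_classification_date(value: object) -> bool:
    """Compact stand-in for the drover.dates helper (body is an assumption)."""
    if not (isinstance(value, str) and len(value) == 8
            and value.isascii() and value.isdigit()):
        return False
    if value == NO_DATE_SENTINEL:
        return True
    year, month, day = int(value[:4]), int(value[4:6]), int(value[6:])
    return (year >= 1 and 1 <= month <= 12
            and 1 <= day <= calendar.monthrange(year, month)[1])


def normalize_classification_date(value: object) -> str:
    return value if is_valid_classification_date(value) else NO_DATE_SENTINEL


class RawClassification(BaseModel):
    date: str  # other fields omitted

    # Non-raising: runs before type validation and always yields a usable date.
    @field_validator("date", mode="before")
    @classmethod
    def _normalize_date(cls, value: object) -> str:
        return normalize_classification_date(value)


class GroundTruthEntry(BaseModel):
    date: str  # other fields omitted

    # Raising: Pydantic wraps the ValueError in a ValidationError.
    @field_validator("date")
    @classmethod
    def _require_valid_date(cls, value: str) -> str:
        if not is_valid_classification_date(value):
            raise ValueError(f"invalid classification date: {value!r}")
        return value


def load_entries(lines):
    """Hypothetical loader: keep valid entries, log and skip the rest."""
    entries = []
    for number, line in enumerate(lines, start=1):
        try:
            entries.append(GroundTruthEntry.model_validate_json(line))
        except ValidationError as exc:
            logger.warning("skipping ground-truth line %d: %r (%s)", number, line, exc)
    return entries
```

The asymmetry is the point of the design as described: LLM output is untrusted and gets repaired silently so downstream consumers never see a partial-zero date, whereas ground-truth data is curated, so a bad date there signals a data error worth surfacing in the logs.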

ckrough merged commit bce6498 into main on May 29, 2026
5 checks passed
ckrough deleted the issue/prof-5qy branch on May 29, 2026 at 11:49