
ci: scheduled taxonomy-drift tripwire — alert when any one Area exceeds ~30% of open issues #1067

@isaacschepp

Context / Problem

The Area taxonomy is a controlled vocabulary of 6 buckets, single-sourced in .github/issue-areas.yml:

cli, spec, go-glx, import-export, ui, tooling

We have two CI gates that keep this vocabulary internally consistent:

Neither watches the one thing that actually tells us the taxonomy has stopped being useful: distribution skew. A controlled vocabulary that is never re-audited rots, and the tooling bucket already proves it: it has grown to 133/247 = ~54% of open issues with no rename, split, or addition since the taxonomy was introduced (#240, 2026-03-28). When a single bucket holds half the tracker it stops discriminating, and filtering by it is little better than not filtering at all.

The recent consolidation (#915 single-sourced the Area list but deliberately kept all 6 buckets) did not address this; it makes the list easier to edit, not self-monitoring. Even after a future split of tooling, the largest successor bucket (likely infra) can silently regrow into a new catch-all unless a tripwire watches it. This is a textbook concept-drift guard: it is cheap and periodic, and it converts taxonomy maintenance from reactive to scheduled.

Proposal / Recommendation

Add a lightweight scheduled check that, for each Area label, computes its share of open issues and opens/updates a single tracking issue when any Area exceeds a threshold (~30%), prompting a re-split review.

Two acceptable implementations — pick the lighter one the maintainers prefer:

  1. Scheduled workflow (.github/workflows/taxonomy-drift.yml), on: schedule (e.g. weekly cron) plus workflow_dispatch. For each Area in .github/issue-areas.yml, run the per-label issueCount GraphQL search query (validated during this review):

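    # Only $q is declared: GraphQL validation rejects declared-but-unused
    # variables, and the repo is already named inside the search string.
    query($q:String!) {
      search(type: ISSUE, query: $q) { issueCount }
    }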

    with $q = "repo:genealogix/glx is:issue is:open label:<area>", plus one unfiltered is:open count for the denominator. If count/total > 0.30 for any Area, open (or update, to avoid dupes) an issue tagged tooling listing the offending buckets and their percentages.

  2. Checklist item in the existing triage cadence (lighter, no new workflow surface): a documented step that runs the same query manually each triage cycle and files a re-split issue when the threshold trips. Pair this with the documented audit rule.
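Either implementation reduces to the same arithmetic: one count per Area, one unfiltered count, and a comparison against the threshold. The following is a minimal sketch of that core, not the final implementation. The names `AREA_SHARE_THRESHOLD`, `build_search_query`, `build_payload`, and `areas_over_threshold` are hypothetical, and the `cli` count in the usage note is an invented placeholder; only the tooling figure (133 of 247) comes from this issue.

```python
from typing import Dict, List, Tuple

# Single tunable constant (hypothetical name), per the implementation notes.
AREA_SHARE_THRESHOLD = 0.30

# Declares only $q; the repository is named inside the search string itself.
COUNT_QUERY = "query($q:String!) { search(type: ISSUE, query: $q) { issueCount } }"


def build_search_query(repo: str, area: str = "") -> str:
    """Return the search string for one Area, or the unfiltered denominator."""
    base = f"repo:{repo} is:issue is:open"
    return f"{base} label:{area}" if area else base


def build_payload(repo: str, area: str = "") -> Dict[str, object]:
    """Build the JSON body for one GraphQL request (sending it is left out)."""
    return {"query": COUNT_QUERY, "variables": {"q": build_search_query(repo, area)}}


def areas_over_threshold(
    counts: Dict[str, int],
    total: int,
    threshold: float = AREA_SHARE_THRESHOLD,
) -> List[Tuple[str, float]]:
    """Return (area, share) pairs whose share of open issues exceeds threshold."""
    if total <= 0:
        # An empty tracker has no skew to report and would divide by zero.
        return []
    offenders = [
        (area, count / total)
        for area, count in counts.items()
        if count / total > threshold
    ]
    # Largest share first, so the worst offender leads the tracking issue.
    return sorted(offenders, key=lambda pair: pair[1], reverse=True)
```

With the figure quoted above, `areas_over_threshold({"tooling": 133, "cli": 20}, 247)` flags only tooling, at a share of 133/247. Keeping the arithmetic separate from the network call means the same function can back the scheduled workflow or a manual triage run.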

Implementation notes:

  • Read the Area list from .github/issue-areas.yml so the check stays in lock-step with the canonical source (same pattern the labeler and templates-drift check already use — Python + stdlib + PyYAML, no yq).
  • SHA-pin / patch-pin all uses: actions per .github/CLAUDE.md (@vX.Y.Z, never @vN); permissions: minimal (issues: write only if the workflow opens the tracking issue, else contents: read).
  • If creating the tracking issue from CI, never interpolate untrusted issue fields into run: — stage values via env: (per .github/CLAUDE.md).
  • Make the threshold a single named constant so it's easy to tune.
  • De-dupe: search for an existing open tracking issue (by a fixed title prefix or a dedicated marker) and update it rather than opening a new one every run.
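The de-dupe note can be sketched as a small decision function: look for an open issue carrying a fixed title prefix, then update it if found or create one otherwise. This is an illustration under assumptions, not the agreed design. The prefix `[taxonomy-drift]`, the function names, and the shape of `open_issues` (dicts with `number` and `title` keys) are all hypothetical.

```python
from typing import Iterable, List, Mapping, Optional, Tuple

# Hypothetical fixed title prefix used to recognise the tracking issue.
TRACKING_TITLE_PREFIX = "[taxonomy-drift]"


def find_tracking_issue(open_issues: Iterable[Mapping[str, object]]) -> Optional[int]:
    """Return the number of the first open issue with the tracking prefix."""
    for issue in open_issues:
        if str(issue.get("title", "")).startswith(TRACKING_TITLE_PREFIX):
            return int(issue["number"])
    return None


def render_body(offenders: List[Tuple[str, float]], threshold: float) -> str:
    """Render the issue body: one line per offending Area with its percentage."""
    lines = [f"Areas above {threshold:.0%} of open issues:"]
    lines += [f"- {area}: {share:.1%}" for area, share in offenders]
    return "\n".join(lines)


def plan_action(
    open_issues: Iterable[Mapping[str, object]],
    offenders: List[Tuple[str, float]],
) -> Tuple[str, Optional[int]]:
    """Decide whether to do nothing, update the existing issue, or create one."""
    if not offenders:
        return ("none", None)
    existing = find_tracking_issue(open_issues)
    if existing is not None:
        return ("update", existing)
    return ("create", None)
```

Because the body is built from Area names read from the canonical YAML and from computed numbers, no user-authored issue field needs to reach a `run:` step, which is in line with the env-staging note above.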

Acceptance criteria

  • A scheduled mechanism (workflow on: schedule + workflow_dispatch, OR a documented triage-checklist step) computes each Area's share of open issues using the per-label issueCount GraphQL search query.
  • Area list is read from .github/issue-areas.yml, not hard-coded.
  • When any Area exceeds ~30% of open issues, the mechanism opens or updates a single tracking issue (typed Infrastructure, label tooling) naming the offending bucket(s) and their percentages, prompting a re-split review.
  • Threshold is a single tunable constant.
  • If implemented as a workflow: all uses: actions are patch-/SHA-pinned per .github/CLAUDE.md; permissions: are least-privilege; no untrusted input is interpolated into run:.
  • The audit rule (re-audit the taxonomy when a bucket dominates) is documented alongside the check so the intent survives.

Notes / scope

  • This is preventive tooling, not a bug — hence Infrastructure and P3. It does not block the tooling re-split itself; it ensures the next catch-all can't silently regrow.
  • If the team decides a dedicated taxonomy/meta label is warranted for these tracking issues, that's a separate request — only the existing tooling label is applied here, since inventing labels is out of scope.

Relates to


Part of a focused review of .github/issue-areas.yml and its labeling machinery; taxonomy hub: #1062.

Metadata

    Labels

    github_actions: Pull requests that update GitHub Actions code
    tooling: Infrastructure, workflow, and developer tools