FAC 143 feat: surface per-stage submission exclusions (counts + reasons + persisted IDs) in analysis pipeline #375

Description

@y4nder

Problem

Submissions are filtered out at multiple stages of the analysis pipeline, but only the sentiment gate currently reports counts. The preprocessing drop (submissions where cleanedComment IS NULL) is silent, making it impossible to tell why a scope's sentiment stage staged fewer items than its commentCount.

Example — Department CCS pipeline 67ce11d3-ac52-4df6-aebb-e93347eb3609:

Stage                  | Count | Delta | Currently surfaced?
Submissions in scope   | 859   |       | yes (coverage.submissionCount)
With raw comment       | 852   | −7    | yes (coverage.commentCount)
Staged for sentiment   | 788   | −64   | no (silently dropped)
Passed sentiment gate  | 741   | −47   | yes (count only, no reasons/IDs)

64 submissions disappear between commentCount and the sentiment dispatch with no diagnostic (71 when counted from submissionCount, which adds the 7 submissions with no raw comment). 47 more are excluded at the gate with only a count. Operators and reviewers cannot audit what was filtered or why.
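
The arithmetic behind these figures can be checked with a short sketch. The counts are copied from the table above; the helper name `stageDeltas` is made up for illustration and is not part of the codebase.

```typescript
// Stage counts copied from the example pipeline in this issue.
interface StageCount {
  stage: string;
  count: number;
}

const stageCounts: StageCount[] = [
  { stage: "Submissions in scope", count: 859 },
  { stage: "With raw comment", count: 852 },
  { stage: "Staged for sentiment", count: 788 },
  { stage: "Passed sentiment gate", count: 741 },
];

// Delta of each stage relative to the stage before it (the first stage has none).
function stageDeltas(counts: StageCount[]): number[] {
  return counts.slice(1).map((c, i) => c.count - counts[i].count);
}

const deltas = stageDeltas(stageCounts);

// Total preprocessing drop measured from submissionCount:
// 7 submissions without a raw comment plus 64 rejected during cleaning.
const preprocessingDrop = stageCounts[0].count - stageCounts[2].count;

console.log(deltas, preprocessingDrop);
```

Measured this way, the preprocessing drop lines up with the `preprocessing.total` value used in the status payload example further down.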

Goal

Make every submission exclusion in the pipeline auditable: count + reason + persisted submission ID, surfaced both in the pipeline status response and via a dedicated endpoint.

Proposed scope

Data model

New entity PipelineSubmissionExclusion (src/entities/):

  • pipeline: ManyToOne<AnalysisPipeline>
  • submission: ManyToOne<QuestionnaireSubmission>
  • stage: enum — PREPROCESSING | SENTIMENT_GATE (extensible for future stages)
  • reason: enum — see reason codes below
  • metadata: jsonb — stage-specific context (e.g. { wordCount, label })
  • Indexed on (pipeline, stage) for status-payload aggregation
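
As a rough illustration of the row shape, here is a plain-TypeScript sketch with the ORM decorators left out and the two relations reduced to ID strings. The names `PipelineSubmissionExclusionRow`, `REASONS_BY_STAGE`, and `makeExclusionRow` are hypothetical, and the per-stage reason check is an assumption rather than something this ticket requires. The reason values come from the reason-code lists in this issue.

```typescript
type ExclusionStage = "PREPROCESSING" | "SENTIMENT_GATE";

type ExclusionReason =
  | "RAW_COMMENT_NULL"
  | "EMPTY_INPUT"
  | "EXCEL_ARTIFACT"
  | "EMPTY_AFTER_STRIP"
  | "KEYBOARD_MASH"
  | "UNDER_3_WORDS"
  | "POSITIVE_SHORT_COMMENT";

// Hypothetical flattened view of one exclusion row (relations shown as IDs).
interface PipelineSubmissionExclusionRow {
  pipelineId: string;
  submissionId: string;
  stage: ExclusionStage;
  reason: ExclusionReason;
  metadata: Record<string, unknown>;
  createdAt: Date;
}

// Assumed pairing of reasons to stages, following the reason-code lists.
const REASONS_BY_STAGE: Record<ExclusionStage, ReadonlyArray<ExclusionReason>> = {
  PREPROCESSING: [
    "RAW_COMMENT_NULL",
    "EMPTY_INPUT",
    "EXCEL_ARTIFACT",
    "EMPTY_AFTER_STRIP",
    "KEYBOARD_MASH",
    "UNDER_3_WORDS",
  ],
  SENTIMENT_GATE: ["POSITIVE_SHORT_COMMENT"],
};

// Builds a row and rejects stage/reason combinations that do not belong together.
function makeExclusionRow(
  pipelineId: string,
  submissionId: string,
  stage: ExclusionStage,
  reason: ExclusionReason,
  metadata: Record<string, unknown> = {},
): PipelineSubmissionExclusionRow {
  if (REASONS_BY_STAGE[stage].indexOf(reason) === -1) {
    throw new Error(`Reason ${reason} is not valid for stage ${stage}`);
  }
  return { pipelineId, submissionId, stage, reason, metadata, createdAt: new Date() };
}

// "submission-1" is a placeholder ID, not real data.
const sampleRow = makeExclusionRow(
  "67ce11d3-ac52-4df6-aebb-e93347eb3609",
  "submission-1",
  "SENTIMENT_GATE",
  "POSITIVE_SHORT_COMMENT",
  { wordCount: 3, label: "positive" },
);
console.log(sampleRow.stage, sampleRow.reason);
```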

Reason codes

Preprocessing (derived by re-running cleanText on in-scope submissions whose cleanedComment IS NULL):

  • RAW_COMMENT_NULL — qualitativeComment was null/undefined
  • EMPTY_INPUT — whitespace-only after trim
  • EXCEL_ARTIFACT — matched #NAME?, #VALUE!, etc.
  • EMPTY_AFTER_STRIP — empty after URL/emoji/laughter stripping
  • KEYBOARD_MASH — failed vowel-ratio gibberish check
  • UNDER_3_WORDS — fewer than 3 tokens after cleaning
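
One way a reason-returning cleaner could map inputs to these codes is sketched below. This is not the contents of clean-text.ts: the patterns, the 0.2 vowel-ratio threshold, the check order, and the name `cleanTextSketch` are all assumptions chosen for illustration, and emoji stripping is omitted to keep the example short.

```typescript
type PreprocessingRejectionReason =
  | "RAW_COMMENT_NULL"
  | "EMPTY_INPUT"
  | "EXCEL_ARTIFACT"
  | "EMPTY_AFTER_STRIP"
  | "KEYBOARD_MASH"
  | "UNDER_3_WORDS";

interface CleanTextResult {
  text: string | null;
  reason?: PreprocessingRejectionReason;
}

// Assumed patterns and thresholds; the real rules live in clean-text.ts.
const EXCEL_ERRORS = ["#NAME?", "#VALUE!", "#REF!", "#DIV/0!", "#N/A"];
const URL_PATTERN = /https?:\/\/\S+/gi;
const LAUGHTER_PATTERN = /\b(?:(?:ha){2,}h?|(?:he){2,}|lo+l)\b/gi;
const MIN_VOWEL_RATIO = 0.2;
const MIN_WORDS = 3;

function reject(reason: PreprocessingRejectionReason): CleanTextResult {
  return { text: null, reason };
}

function cleanTextSketch(raw: string | null | undefined): CleanTextResult {
  if (raw === null || raw === undefined) return reject("RAW_COMMENT_NULL");

  const trimmed = raw.trim();
  if (trimmed === "") return reject("EMPTY_INPUT");
  if (EXCEL_ERRORS.indexOf(trimmed.toUpperCase()) !== -1) return reject("EXCEL_ARTIFACT");

  // Strip URLs and laughter tokens, then collapse whitespace.
  const stripped = trimmed
    .replace(URL_PATTERN, " ")
    .replace(LAUGHTER_PATTERN, " ")
    .replace(/\s+/g, " ")
    .trim();
  if (stripped === "") return reject("EMPTY_AFTER_STRIP");

  // Crude gibberish check: share of vowels among ASCII letters.
  const letters = stripped.replace(/[^a-z]/gi, "");
  const vowels = letters.replace(/[^aeiou]/gi, "");
  if (letters.length > 0 && vowels.length / letters.length < MIN_VOWEL_RATIO) {
    return reject("KEYBOARD_MASH");
  }

  if (stripped.split(" ").length < MIN_WORDS) return reject("UNDER_3_WORDS");
  return { text: stripped };
}

console.log(cleanTextSketch("Great teacher"));
console.log(cleanTextSketch("The lectures were clear"));
```

Because each rejection path returns its own code, the same function could serve both the submission-time call site and the pipeline-time re-run discussed under the open design decisions.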

Sentiment gate:

  • POSITIVE_SHORT_COMMENT — positive sentiment with < POSITIVE_MIN_WORD_COUNT words (metadata: { wordCount, label })
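
The gate rule above might be expressed roughly as follows. The label set, the whitespace-based word count, and the function name `evaluateSentimentGate` are assumptions; the threshold is passed in as a parameter because the actual value of POSITIVE_MIN_WORD_COUNT is not stated in this issue.

```typescript
// Assumed label set; the real sentiment labels may differ.
type SentimentLabel = "positive" | "neutral" | "negative";

interface GateExclusion {
  reason: "POSITIVE_SHORT_COMMENT";
  metadata: { wordCount: number; label: SentimentLabel };
}

// Counts whitespace-separated tokens; an empty string has zero words.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Returns an exclusion descriptor when the comment is gated, otherwise null.
function evaluateSentimentGate(
  cleanedComment: string,
  label: SentimentLabel,
  positiveMinWordCount: number,
): GateExclusion | null {
  const wordCount = countWords(cleanedComment);
  if (label === "positive" && wordCount < positiveMinWordCount) {
    return { reason: "POSITIVE_SHORT_COMMENT", metadata: { wordCount, label } };
  }
  return null;
}

// Placeholder threshold for illustration only, not the configured value.
const EXAMPLE_MIN_WORDS = 10;
const gatedExample = evaluateSentimentGate("Very good teacher", "positive", EXAMPLE_MIN_WORDS);
const keptExample = evaluateSentimentGate("Very good teacher", "negative", EXAMPLE_MIN_WORDS);
console.log(gatedExample, keptExample);
```

Returning the descriptor rather than a boolean keeps the metadata next to the decision, so the caller can persist it directly on the exclusion row.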

Code changes

  1. cleanText refactor (src/modules/questionnaires/utils/clean-text.ts) — return { text: string | null, reason?: PreprocessingRejectionReason } instead of string | null. Update single call site in questionnaire.service.ts.
  2. dispatchSentiment (pipeline-orchestrator.service.ts:~1669) — before the cleanedComment IS NOT NULL filter, persist one PipelineSubmissionExclusion row per in-scope submission that would be dropped, classified by re-running cleanText on qualitativeComment.
  3. OnSentimentComplete gate (pipeline-orchestrator.service.ts:403-483) — persist one exclusion row per gated submission instead of only incrementing counters. Keep sentimentGateIncluded/sentimentGateExcluded columns as-is (aggregates).
  4. Status DTO (pipeline-status.dto.ts) — add a filtering block:
    filtering: {
      preprocessing: {
        total: 71,
        reasons: { UNDER_3_WORDS: 52, EMPTY_AFTER_STRIP: 12, RAW_COMMENT_NULL: 7 }
      },
      sentimentGate: {
        total: 47,
        reasons: { POSITIVE_SHORT_COMMENT: 47 }
      }
    }
    
  5. New endpoint — GET /analysis/pipelines/:id/exclusions?stage=&reason=&page=&pageSize= returning paginated { submissionId, stage, reason, metadata, createdAt }. Guard matches existing pipeline-status guard.
  6. Migration — create pipeline_submission_exclusion table with FK to analysis_pipeline and questionnaire_submission, indexed on (pipeline_id, stage).
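
To make items 4 and 5 concrete, the sketch below folds exclusion rows into the filtering block and slices a page of results in memory. In practice both would more likely be database queries (a GROUP BY on stage and reason, and LIMIT/OFFSET). The names `buildFilteringBlock` and `pageOf`, and the 1-based page numbering, are assumptions. The sample rows are synthetic and only reproduce the counts from the example payload.

```typescript
interface ExclusionSummaryRow {
  stage: "PREPROCESSING" | "SENTIMENT_GATE";
  reason: string;
}

interface StageFiltering {
  total: number;
  reasons: Record<string, number>;
}

interface FilteringBlock {
  preprocessing: StageFiltering;
  sentimentGate: StageFiltering;
}

// Groups rows by stage and counts each reason within its stage.
function buildFilteringBlock(rows: ExclusionSummaryRow[]): FilteringBlock {
  const block: FilteringBlock = {
    preprocessing: { total: 0, reasons: {} },
    sentimentGate: { total: 0, reasons: {} },
  };
  for (const row of rows) {
    const bucket = row.stage === "PREPROCESSING" ? block.preprocessing : block.sentimentGate;
    bucket.total += 1;
    bucket.reasons[row.reason] = (bucket.reasons[row.reason] ?? 0) + 1;
  }
  return block;
}

// Returns one page of items; page numbers are assumed to start at 1.
function pageOf<T>(items: T[], page: number, pageSize: number): { items: T[]; total: number } {
  const start = (page - 1) * pageSize;
  return { items: items.slice(start, start + pageSize), total: items.length };
}

// Synthetic rows matching the counts in the example payload.
function repeatRow(row: ExclusionSummaryRow, n: number): ExclusionSummaryRow[] {
  const out: ExclusionSummaryRow[] = [];
  for (let i = 0; i < n; i++) out.push({ ...row });
  return out;
}

const exampleRows: ExclusionSummaryRow[] = [
  ...repeatRow({ stage: "PREPROCESSING", reason: "UNDER_3_WORDS" }, 52),
  ...repeatRow({ stage: "PREPROCESSING", reason: "EMPTY_AFTER_STRIP" }, 12),
  ...repeatRow({ stage: "PREPROCESSING", reason: "RAW_COMMENT_NULL" }, 7),
  ...repeatRow({ stage: "SENTIMENT_GATE", reason: "POSITIVE_SHORT_COMMENT" }, 47),
];

const filtering = buildFilteringBlock(exampleRows);
const secondPage = pageOf(exampleRows, 2, 50);
console.log(JSON.stringify(filtering), secondPage.items.length, secondPage.total);
```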

Open design decisions (for refinement)

  1. Re-run cleanText at pipeline time vs. persist rejection reason on QuestionnaireSubmission at submission time. Default: pipeline-time re-run (no submission schema change, historicity preserved across rule changes). Alternative: add cleanedCommentRejectionReason column on QuestionnaireSubmission (cleaner long-term, avoids duplicated work).
  2. Endpoint payload shape — IDs + reasons + metadata only, or also include qualitativeComment / cleanedComment text inline for debugging? Text inline is heavier but removes a second round-trip when auditing.
  3. Cascading on pipeline soft-delete — should exclusion rows soft-delete with their parent pipeline, or persist independently for audit retention?
  4. Backfill strategy — apply only to new pipelines (simple) vs. one-off backfill job that re-evaluates historical pipelines (more complete audit, more work).
  5. Frontend integration — do we want the filtering block visible in the pipeline detail UI, an expandable drawer, or only via the dedicated endpoint for super-admins?
  6. Reason taxonomy extensibility — how do we evolve reason codes without breaking consumers (e.g. when adding language-detection filtering later)?

Out of scope

  • Tuning the preprocessing rules themselves (e.g. lowering the 3-word threshold). This ticket only makes filtering visible; rule changes would be a separate decision informed by the data this exposes.
  • Embeddings-stage or topic-modeling-stage exclusions. The current pipeline doesn't filter at those stages; if future stages add filters, the schema extends naturally via the stage enum.

References

  • src/modules/analysis/services/pipeline-orchestrator.service.ts:1669-1672 — current sentiment dispatch filter
  • src/modules/analysis/services/pipeline-orchestrator.service.ts:403-483 — current sentiment gate logic
  • src/modules/analysis/constants.ts:1-6 — SENTIMENT_GATE config
  • src/modules/questionnaires/utils/clean-text.ts — preprocessing rules
  • src/entities/analysis-pipeline.entity.ts:70-74 — existing gate aggregate columns
  • src/modules/analysis/dto/pipeline-status.dto.ts — status payload schema

Investigated in session: https://claude.ai/code/session_01PJUfc4RqZ4mFazhP8fSULt
