Problem
Submissions are filtered out at multiple stages of the analysis pipeline, but only the sentiment gate currently reports counts. The preprocessing drop (submissions where cleanedComment IS NULL) is silent, making it impossible to tell why a scope's sentiment stage staged fewer items than its commentCount.
Example — Department CCS pipeline 67ce11d3-ac52-4df6-aebb-e93347eb3609:
| Stage | Count | Delta | Currently surfaced? |
|---|---|---|---|
| Submissions in scope | 859 | n/a | yes (coverage.submissionCount) |
| With raw comment | 852 | −7 | yes (coverage.commentCount) |
| Staged for sentiment | 788 | −64 | no (silently dropped) |
| Passed sentiment gate | 741 | −47 | yes (count only, no reasons/IDs) |
64 submissions disappear between commentCount and the sentiment dispatch with no diagnostic; counting the 7 that have no raw comment, 71 in-scope submissions never reach the sentiment stage. 47 more are excluded at the gate with only a count. Operators and reviewers cannot audit what was filtered or why.
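The stage deltas can be checked mechanically. The following is a minimal sketch that only recomputes the numbers in the table; the StageCount type and stageDrops helper are hypothetical names introduced for illustration and do not exist in the codebase.

```typescript
// Stage counts copied from the example table for pipeline
// 67ce11d3-ac52-4df6-aebb-e93347eb3609.
interface StageCount {
  stage: string;
  count: number;
}

const funnel: StageCount[] = [
  { stage: "Submissions in scope", count: 859 },
  { stage: "With raw comment", count: 852 },
  { stage: "Staged for sentiment", count: 788 },
  { stage: "Passed sentiment gate", count: 741 },
];

// Hypothetical helper: number of submissions lost on entry to each stage
// after the first one.
function stageDrops(stages: StageCount[]): number[] {
  return stages.slice(1).map((s, i) => stages[i].count - s.count);
}

const drops = stageDrops(funnel);

// Submissions that never reach the sentiment stage: those with no raw
// comment plus those whose cleaned comment is null.
const droppedBeforeSentiment = drops[0] + drops[1];

console.log(drops, droppedBeforeSentiment);
```

The first two drops (7 and 64) sum to 71, and the third (47) is the gate exclusion count.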
Goal
Make every submission exclusion in the pipeline auditable: count + reason + persisted submission ID, surfaced both in the pipeline status response and via a dedicated endpoint.
Proposed scope
Data model
New entity PipelineSubmissionExclusion (src/entities/):
- pipeline: ManyToOne<AnalysisPipeline>
- submission: ManyToOne<QuestionnaireSubmission>
- stage: enum, PREPROCESSING | SENTIMENT_GATE (extensible for future stages)
- reason: enum, see reason codes below
- metadata: jsonb, stage-specific context (e.g. { wordCount, label })
- Indexed on (pipeline, stage) for status-payload aggregation
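As a rough illustration of the row shape, here is a sketch in plain TypeScript types rather than ORM decorators, since the ticket does not show the project's entity conventions. The type names, the use of string IDs in place of relations, and the describeExclusion helper are assumptions; the field list and the createdAt column follow the ticket.

```typescript
// Reason codes as listed in this ticket.
type PreprocessingReason =
  | "RAW_COMMENT_NULL"
  | "EMPTY_INPUT"
  | "EXCEL_ARTIFACT"
  | "EMPTY_AFTER_STRIP"
  | "KEYBOARD_MASH"
  | "UNDER_3_WORDS";

type SentimentGateReason = "POSITIVE_SHORT_COMMENT";

// ASSUMPTION: relations are flattened to string IDs for this sketch; the real
// entity would hold ManyToOne references to the pipeline and the submission.
type PipelineSubmissionExclusionRow = {
  pipelineId: string;
  submissionId: string;
  metadata: Record<string, unknown>; // stage-specific context (jsonb)
  createdAt: Date;
} & (
  // Pairing stage and reason in a union lets the compiler reject a
  // preprocessing reason recorded against the sentiment gate, and vice versa.
  | { stage: "PREPROCESSING"; reason: PreprocessingReason }
  | { stage: "SENTIMENT_GATE"; reason: SentimentGateReason }
);

// Hypothetical helper used only to show how a row is read back.
function describeExclusion(row: PipelineSubmissionExclusionRow): string {
  return `${row.stage}/${row.reason}`;
}

const example: PipelineSubmissionExclusionRow = {
  pipelineId: "67ce11d3-ac52-4df6-aebb-e93347eb3609",
  submissionId: "hypothetical-submission-id",
  stage: "SENTIMENT_GATE",
  reason: "POSITIVE_SHORT_COMMENT",
  metadata: { wordCount: 2, label: "positive" },
  createdAt: new Date(),
};

console.log(describeExclusion(example));
```

Tying reason to stage at the type level is one option; a looser pair of independent enums would make it easier to add stages later, at the cost of allowing mismatched pairs.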
Reason codes
Preprocessing (derived by re-running cleanText on in-scope submissions whose cleanedComment IS NULL):
- RAW_COMMENT_NULL: qualitativeComment was null/undefined
- EMPTY_INPUT: whitespace-only after trim
- EXCEL_ARTIFACT: matched #NAME?, #VALUE!, etc.
- EMPTY_AFTER_STRIP: empty after URL/emoji/laughter stripping
- KEYBOARD_MASH: failed vowel-ratio gibberish check
- UNDER_3_WORDS: fewer than 3 tokens after cleaning

Sentiment gate:
- POSITIVE_SHORT_COMMENT: positive sentiment with < POSITIVE_MIN_WORD_COUNT words (metadata: { wordCount, label })
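The preprocessing codes map naturally onto an early-return classifier that yields either cleaned text or a rejection reason. The sketch below is not the contents of clean-text.ts: every regular expression, the 0.2 vowel-ratio threshold, and the order of the checks are assumptions chosen for illustration. Only the reason codes and the 3-word minimum come from this ticket.

```typescript
type PreprocessingRejectionReason =
  | "RAW_COMMENT_NULL"
  | "EMPTY_INPUT"
  | "EXCEL_ARTIFACT"
  | "EMPTY_AFTER_STRIP"
  | "KEYBOARD_MASH"
  | "UNDER_3_WORDS";

interface CleanTextResult {
  text: string | null;
  reason?: PreprocessingRejectionReason;
}

// ASSUMPTION: illustrative patterns and threshold, not the real rules.
const EXCEL_ARTIFACT = /^#(NAME\?|VALUE!|REF!|DIV\/0!|N\/A|NULL!|NUM!)$/i;
const URL_PATTERN = /https?:\/\/\S+/gi;
const EMOJI_PATTERN = /[\uD83C-\uD83E][\uDC00-\uDFFF]|[\u2600-\u27BF]/g;
const LAUGHTER_PATTERN = /\b(?:(?:ha|he){2,}h?|lo+l)\b/gi;
const MIN_VOWEL_RATIO = 0.2;
const MIN_WORDS = 3;

function cleanText(raw: string | null | undefined): CleanTextResult {
  if (raw === null || raw === undefined) {
    return { text: null, reason: "RAW_COMMENT_NULL" };
  }
  const trimmed = raw.trim();
  if (trimmed.length === 0) {
    return { text: null, reason: "EMPTY_INPUT" };
  }
  if (EXCEL_ARTIFACT.test(trimmed)) {
    return { text: null, reason: "EXCEL_ARTIFACT" };
  }
  // Strip URLs, emoji, and laughter, then collapse the leftover whitespace.
  const stripped = trimmed
    .replace(URL_PATTERN, " ")
    .replace(EMOJI_PATTERN, " ")
    .replace(LAUGHTER_PATTERN, " ")
    .replace(/\s+/g, " ")
    .trim();
  if (stripped.length === 0) {
    return { text: null, reason: "EMPTY_AFTER_STRIP" };
  }
  // Gibberish check: share of vowels among ASCII letters.
  const letters = stripped.replace(/[^a-z]/gi, "");
  const vowels = letters.replace(/[^aeiou]/gi, "");
  if (letters.length > 0 && vowels.length / letters.length < MIN_VOWEL_RATIO) {
    return { text: null, reason: "KEYBOARD_MASH" };
  }
  if (stripped.split(" ").length < MIN_WORDS) {
    return { text: null, reason: "UNDER_3_WORDS" };
  }
  return { text: stripped };
}

console.log(cleanText("Good teacher"), cleanText("The lectures were clear"));
```

Because the checks return on the first failure, each dropped submission gets exactly one reason, which keeps the per-reason counts summing to the stage total.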
Code changes
- cleanText refactor (src/modules/questionnaires/utils/clean-text.ts): return { text: string | null, reason?: PreprocessingRejectionReason } instead of string | null. Update the single call site in questionnaire.service.ts.
- dispatchSentiment (pipeline-orchestrator.service.ts:~1669): before the cleanedComment IS NOT NULL filter, persist one PipelineSubmissionExclusion row per in-scope submission that would be dropped, classified by re-running cleanText on qualitativeComment.
- OnSentimentComplete gate (pipeline-orchestrator.service.ts:403-483): persist one exclusion row per gated submission instead of only incrementing counters. Keep sentimentGateIncluded/sentimentGateExcluded columns as-is (aggregates).
- Status DTO (pipeline-status.dto.ts): add a filtering block:
    filtering: {
      preprocessing: {
        total: 71,
        reasons: { UNDER_3_WORDS: 52, EMPTY_AFTER_STRIP: 12, RAW_COMMENT_NULL: 7 }
      },
      sentimentGate: {
        total: 47,
        reasons: { POSITIVE_SHORT_COMMENT: 47 }
      }
    }
- New endpoint: GET /analysis/pipelines/:id/exclusions?stage=&reason=&page=&pageSize= returning paginated { submissionId, stage, reason, metadata, createdAt }. Guard matches existing pipeline-status guard.
- Migration: create pipeline_submission_exclusion table with FKs to analysis_pipeline and questionnaire_submission, indexed on (pipeline_id, stage).
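To make the status block and the endpoint semantics concrete, here is a small in-memory sketch of both. In the real service these would presumably be database queries (a grouped count and a filtered, paginated select); the function names, the ExclusionRow shape, and the 1-based page convention are assumptions.

```typescript
interface ExclusionRow {
  submissionId: string;
  stage: "PREPROCESSING" | "SENTIMENT_GATE";
  reason: string;
}

interface StageSummary {
  total: number;
  reasons: Record<string, number>;
}

interface FilteringBlock {
  preprocessing: StageSummary;
  sentimentGate: StageSummary;
}

// Hypothetical aggregation: count rows per stage and per reason.
function summarizeExclusions(rows: ExclusionRow[]): FilteringBlock {
  const block: FilteringBlock = {
    preprocessing: { total: 0, reasons: {} },
    sentimentGate: { total: 0, reasons: {} },
  };
  for (const row of rows) {
    const summary =
      row.stage === "PREPROCESSING" ? block.preprocessing : block.sentimentGate;
    summary.total += 1;
    summary.reasons[row.reason] = (summary.reasons[row.reason] ?? 0) + 1;
  }
  return block;
}

interface ExclusionQuery {
  stage?: ExclusionRow["stage"];
  reason?: string;
  page: number; // ASSUMPTION: 1-based
  pageSize: number;
}

// Hypothetical endpoint logic: optional stage/reason filters, then one page.
function pageExclusions(
  rows: ExclusionRow[],
  query: ExclusionQuery,
): { items: ExclusionRow[]; total: number } {
  const matching = rows.filter(
    (r) =>
      (query.stage === undefined || r.stage === query.stage) &&
      (query.reason === undefined || r.reason === query.reason),
  );
  const start = (query.page - 1) * query.pageSize;
  return {
    items: matching.slice(start, start + query.pageSize),
    total: matching.length,
  };
}
```

Returning the unpaged total alongside the items lets a client cross-check the endpoint against the counts in the status payload.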
Open design decisions (for refinement)
- Re-run cleanText at pipeline time vs. persist rejection reason on QuestionnaireSubmission at submission time. Default: pipeline-time re-run (no submission schema change, historicity preserved across rule changes). Alternative: add cleanedCommentRejectionReason column on QuestionnaireSubmission (cleaner long-term, avoids duplicated work).
- Endpoint payload shape: IDs + reasons + metadata only, or also include qualitativeComment / cleanedComment text inline for debugging? Text inline is heavier but removes a second round-trip when auditing.
- Cascading on pipeline soft-delete — should exclusion rows soft-delete with their parent pipeline, or persist independently for audit retention?
- Backfill strategy — apply only to new pipelines (simple) vs. one-off backfill job that re-evaluates historical pipelines (more complete audit, more work).
- Frontend integration: do we want the filtering block visible in the pipeline detail UI, an expandable drawer, or only via the dedicated endpoint for super-admins?
- Reason taxonomy extensibility — how do we evolve reason codes without breaking consumers (e.g. when adding language-detection filtering later)?
Out of scope
- Tuning the preprocessing rules themselves (e.g. lowering the 3-word threshold). This ticket only makes filtering visible; rule changes would be a separate decision informed by the data this exposes.
- Embeddings-stage or topic-modeling-stage exclusions. The current pipeline doesn't filter at those stages; if future stages add filters, the schema extends naturally via the stage enum.
References
src/modules/analysis/services/pipeline-orchestrator.service.ts:1669-1672 — current sentiment dispatch filter
src/modules/analysis/services/pipeline-orchestrator.service.ts:403-483 — current sentiment gate logic
src/modules/analysis/constants.ts:1-6 — SENTIMENT_GATE config
src/modules/questionnaires/utils/clean-text.ts — preprocessing rules
src/entities/analysis-pipeline.entity.ts:70-74 — existing gate aggregate columns
src/modules/analysis/dto/pipeline-status.dto.ts — status payload schema
Investigated in session: https://claude.ai/code/session_01PJUfc4RqZ4mFazhP8fSULt