FAC 143 feat: surface per-stage submission exclusions (counts + reasons + persisted IDs) in analysis pipeline

## Problem

Submissions are filtered out at multiple stages of the analysis pipeline, but only the sentiment gate currently reports counts. The preprocessing drop (submissions where `cleanedComment IS NULL`) is silent, making it impossible to tell why fewer items were staged for sentiment than the scope's `commentCount`.

**Example** — Department CCS pipeline `67ce11d3-ac52-4df6-aebb-e93347eb3609`:

| Stage | Count | Delta | Currently surfaced? |
|---|---|---|---|
| Submissions in scope | 859 | — | yes (`coverage.submissionCount`) |
| With raw comment | 852 | −7 | yes (`coverage.commentCount`) |
| Staged for sentiment | 788 | −64 | **no — silently dropped** |
| Passed sentiment gate | 741 | −47 | yes (count only, no reasons/IDs) |

64 submissions disappear between `commentCount` and the sentiment dispatch with no diagnostic; counting the 7 that have no raw comment, 71 of the 859 in-scope submissions never reach sentiment. 47 more are excluded at the gate with only a count. Operators and reviewers cannot audit what was filtered or why.

## Goal

Make every submission exclusion in the pipeline auditable: **count + reason + persisted submission ID**, surfaced both in the pipeline status response and via a dedicated endpoint.

## Proposed scope

### Data model

New entity `PipelineSubmissionExclusion` (`src/entities/`):

- `pipeline: ManyToOne<AnalysisPipeline>`
- `submission: ManyToOne<QuestionnaireSubmission>`
- `stage: enum` — `PREPROCESSING` | `SENTIMENT_GATE` (extensible for future stages)
- `reason: enum` — see reason codes below
- `metadata: jsonb` — stage-specific context (e.g. `{ wordCount, label }`)
- Indexed on `(pipeline, stage)` for status-payload aggregation
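
For discussion, the fields above could be sketched as plain TypeScript types, leaving ORM decorators out. The names `PipelineSubmissionExclusionRow`, `STAGE_OF_REASON`, and `makeExclusion` are hypothetical and only illustrate the shape; the real entity would use the project's ORM relations rather than raw ID strings.

```typescript
// ORM-agnostic sketch of one exclusion row. Decorators are omitted on purpose.
type ExclusionStage = "PREPROCESSING" | "SENTIMENT_GATE";

type ExclusionReason =
  | "RAW_COMMENT_NULL"
  | "EMPTY_INPUT"
  | "EXCEL_ARTIFACT"
  | "EMPTY_AFTER_STRIP"
  | "KEYBOARD_MASH"
  | "UNDER_3_WORDS"
  | "POSITIVE_SHORT_COMMENT";

interface PipelineSubmissionExclusionRow {
  pipelineId: string; // stands in for ManyToOne<AnalysisPipeline>
  submissionId: string; // stands in for ManyToOne<QuestionnaireSubmission>
  stage: ExclusionStage;
  reason: ExclusionReason;
  metadata: Record<string, unknown>; // stored as jsonb
  createdAt: Date;
}

// Hypothetical lookup tying each reason code to the stage that emits it.
const STAGE_OF_REASON: Record<ExclusionReason, ExclusionStage> = {
  RAW_COMMENT_NULL: "PREPROCESSING",
  EMPTY_INPUT: "PREPROCESSING",
  EXCEL_ARTIFACT: "PREPROCESSING",
  EMPTY_AFTER_STRIP: "PREPROCESSING",
  KEYBOARD_MASH: "PREPROCESSING",
  UNDER_3_WORDS: "PREPROCESSING",
  POSITIVE_SHORT_COMMENT: "SENTIMENT_GATE",
};

// Hypothetical factory: derives the stage from the reason so the two
// columns cannot disagree.
function makeExclusion(
  pipelineId: string,
  submissionId: string,
  reason: ExclusionReason,
  metadata: Record<string, unknown> = {},
): PipelineSubmissionExclusionRow {
  return {
    pipelineId,
    submissionId,
    stage: STAGE_OF_REASON[reason],
    reason,
    metadata,
    createdAt: new Date(),
  };
}
```

Deriving `stage` from `reason` is one possible design choice; it keeps callers from persisting a mismatched pair, at the cost of requiring every reason to belong to exactly one stage.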

### Reason codes

**Preprocessing** (derived by re-running `cleanText` on in-scope submissions whose `cleanedComment IS NULL`):
- `RAW_COMMENT_NULL` — `qualitativeComment` was null/undefined
- `EMPTY_INPUT` — whitespace-only after trim
- `EXCEL_ARTIFACT` — matched `#NAME?`, `#VALUE!`, etc.
- `EMPTY_AFTER_STRIP` — empty after URL/emoji/laughter stripping
- `KEYBOARD_MASH` — failed vowel-ratio gibberish check
- `UNDER_3_WORDS` — fewer than 3 tokens after cleaning
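
The preprocessing codes above map naturally onto an ordered chain of checks. The following is a minimal sketch of what a reason-returning `cleanText` might look like, assuming the checks run in the order listed. Every regex and threshold here (`EXCEL_ERROR`, `LAUGHTER`, `MIN_VOWEL_RATIO`, and so on) is an illustrative guess, and `cleanTextSketch` is a hypothetical name; the real rules live in `clean-text.ts`.

```typescript
type PreprocessingRejectionReason =
  | "RAW_COMMENT_NULL"
  | "EMPTY_INPUT"
  | "EXCEL_ARTIFACT"
  | "EMPTY_AFTER_STRIP"
  | "KEYBOARD_MASH"
  | "UNDER_3_WORDS";

interface CleanTextResult {
  text: string | null;
  reason?: PreprocessingRejectionReason;
}

// Assumed patterns and thresholds, not the project's actual rules.
const EXCEL_ERROR = /^#(NAME\?|VALUE!|REF!|DIV\/0!|N\/A|NULL!|NUM!)$/i;
const URL_PATTERN = /https?:\/\/\S+/gi;
// Rough emoji approximation: misc symbols plus common surrogate pairs.
const EMOJI_PATTERN = /[\u2600-\u27BF]|[\uD83C-\uD83E][\uDC00-\uDFFF]/g;
const LAUGHTER = /\b(?:(?:ha){2,}h?|(?:he){2,}|lo+l)\b/gi;
const MIN_VOWEL_RATIO = 0.2;
const MIN_WORDS = 3;

function reject(reason: PreprocessingRejectionReason): CleanTextResult {
  return { text: null, reason };
}

function cleanTextSketch(raw: string | null | undefined): CleanTextResult {
  if (raw === null || raw === undefined) return reject("RAW_COMMENT_NULL");

  const trimmed = raw.trim();
  if (trimmed === "") return reject("EMPTY_INPUT");
  if (EXCEL_ERROR.test(trimmed)) return reject("EXCEL_ARTIFACT");

  // Strip URLs, emoji, and laughter tokens, then collapse whitespace.
  const stripped = trimmed
    .replace(URL_PATTERN, " ")
    .replace(EMOJI_PATTERN, " ")
    .replace(LAUGHTER, " ")
    .replace(/\s+/g, " ")
    .trim();
  if (stripped === "") return reject("EMPTY_AFTER_STRIP");

  // Gibberish check: share of vowels among ASCII letters.
  const letters = stripped.replace(/[^a-z]/gi, "");
  const vowels = letters.replace(/[^aeiou]/gi, "");
  if (letters.length > 0 && vowels.length / letters.length < MIN_VOWEL_RATIO) {
    return reject("KEYBOARD_MASH");
  }

  if (stripped.split(" ").length < MIN_WORDS) return reject("UNDER_3_WORDS");
  return { text: stripped };
}
```

Because the first failing check wins, the order of the checks determines which code a submission receives when it would fail more than one rule.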

**Sentiment gate**:
- `POSITIVE_SHORT_COMMENT` — positive sentiment with `< POSITIVE_MIN_WORD_COUNT` words (metadata: `{ wordCount, label }`)
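
The gate rule can be sketched as a pure function that returns the exclusion payload instead of only bumping a counter. The threshold value, the `SentimentLabel` set, and the function name `gateSubmission` are assumptions for illustration; the real threshold comes from the `SENTIMENT_GATE` config.

```typescript
// Assumed value for illustration only.
const POSITIVE_MIN_WORD_COUNT = 10;

// Assumed label set; the actual sentiment labels may differ.
type SentimentLabel = "positive" | "neutral" | "negative";

interface GateExclusion {
  reason: "POSITIVE_SHORT_COMMENT";
  metadata: { wordCount: number; label: SentimentLabel };
}

// Returns the exclusion to persist, or null when the submission passes.
function gateSubmission(
  cleanedComment: string,
  label: SentimentLabel,
): GateExclusion | null {
  const wordCount = cleanedComment.trim().split(/\s+/).filter(Boolean).length;
  if (label === "positive" && wordCount < POSITIVE_MIN_WORD_COUNT) {
    return { reason: "POSITIVE_SHORT_COMMENT", metadata: { wordCount, label } };
  }
  return null;
}
```

Returning the metadata alongside the reason means the caller can write the exclusion row and increment the existing aggregate counters from the same value.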

### Code changes

1. **`cleanText` refactor** (`src/modules/questionnaires/utils/clean-text.ts`) — return `{ text: string | null, reason?: PreprocessingRejectionReason }` instead of `string | null`. Update single call site in `questionnaire.service.ts`.
2. **`dispatchSentiment`** (`pipeline-orchestrator.service.ts:~1669`) — before the `cleanedComment IS NOT NULL` filter, persist one `PipelineSubmissionExclusion` row per in-scope submission that would be dropped, classified by re-running `cleanText` on `qualitativeComment`.
3. **`OnSentimentComplete` gate** (`pipeline-orchestrator.service.ts:403-483`) — persist one exclusion row per gated submission instead of only incrementing counters. Keep `sentimentGateIncluded`/`sentimentGateExcluded` columns as-is (aggregates).
4. **Status DTO** (`pipeline-status.dto.ts`) — add a `filtering` block:
   ```
   filtering: {
     preprocessing: {
       total: 71,
       reasons: { UNDER_3_WORDS: 52, EMPTY_AFTER_STRIP: 12, RAW_COMMENT_NULL: 7 }
     },
     sentimentGate: {
       total: 47,
       reasons: { POSITIVE_SHORT_COMMENT: 47 }
     }
   }
   ```
5. **New endpoint** — `GET /analysis/pipelines/:id/exclusions?stage=&reason=&page=&pageSize=` returning paginated `{ submissionId, stage, reason, metadata, createdAt }`. Guard matches existing pipeline-status guard.
6. **Migration** — create `pipeline_submission_exclusion` table with FK to `analysis_pipeline` and `questionnaire_submission`, indexed on `(pipeline_id, stage)`.
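
To make items 4 and 5 concrete, here is a minimal in-memory sketch of how persisted rows could roll up into the `filtering` block and how the endpoint's `stage`, `reason`, `page`, and `pageSize` parameters might be applied. The names `buildFilteringBlock` and `pageExclusions`, the default page size, and the page-size cap are hypothetical. In practice the aggregation would more likely be a grouped query using the `(pipeline_id, stage)` index rather than a loop over loaded rows.

```typescript
type Stage = "PREPROCESSING" | "SENTIMENT_GATE";

interface ExclusionRow {
  submissionId: string;
  stage: Stage;
  reason: string;
}

interface StageSummary {
  total: number;
  reasons: Record<string, number>;
}

interface FilteringBlock {
  preprocessing: StageSummary;
  sentimentGate: StageSummary;
}

// Rolls exclusion rows up into per-stage totals and per-reason counts.
function buildFilteringBlock(rows: ExclusionRow[]): FilteringBlock {
  const block: FilteringBlock = {
    preprocessing: { total: 0, reasons: {} },
    sentimentGate: { total: 0, reasons: {} },
  };
  for (const row of rows) {
    const summary =
      row.stage === "PREPROCESSING" ? block.preprocessing : block.sentimentGate;
    summary.total += 1;
    summary.reasons[row.reason] = (summary.reasons[row.reason] ?? 0) + 1;
  }
  return block;
}

interface ExclusionQuery {
  stage?: Stage;
  reason?: string;
  page?: number;
  pageSize?: number;
}

// Applies the optional filters, then slices out one 1-indexed page.
// Default page size 20 and cap 100 are assumed values.
function pageExclusions(rows: ExclusionRow[], query: ExclusionQuery) {
  const page = Math.max(1, query.page ?? 1);
  const pageSize = Math.min(100, Math.max(1, query.pageSize ?? 20));
  const matched = rows.filter(
    (row) =>
      (query.stage === undefined || row.stage === query.stage) &&
      (query.reason === undefined || row.reason === query.reason),
  );
  const start = (page - 1) * pageSize;
  return {
    total: matched.length,
    page,
    pageSize,
    items: matched.slice(start, start + pageSize),
  };
}
```

Fed the row counts from the example pipeline, `buildFilteringBlock` would reproduce the totals shown in the `filtering` block above (71 for preprocessing, 47 for the sentiment gate).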

## Open design decisions (for refinement)

1. **Re-run `cleanText` at pipeline time** vs. **persist rejection reason on `QuestionnaireSubmission` at submission time**. Default: pipeline-time re-run (no submission schema change, historicity preserved across rule changes). Alternative: add `cleanedCommentRejectionReason` column on `QuestionnaireSubmission` (cleaner long-term, avoids duplicated work).
2. **Endpoint payload shape** — IDs + reasons + metadata only, or also include `qualitativeComment` / `cleanedComment` text inline for debugging? Text inline is heavier but removes a second round-trip when auditing.
3. **Cascading on pipeline soft-delete** — should exclusion rows soft-delete with their parent pipeline, or persist independently for audit retention?
4. **Backfill strategy** — apply only to new pipelines (simple) vs. one-off backfill job that re-evaluates historical pipelines (more complete audit, more work).
5. **Frontend integration** — do we want the `filtering` block visible in the pipeline detail UI, an expandable drawer, or only via the dedicated endpoint for super-admins?
6. **Reason taxonomy extensibility** — how do we evolve reason codes without breaking consumers (e.g. when adding language-detection filtering later)?

## Out of scope

- Tuning the preprocessing rules themselves (e.g. lowering the 3-word threshold). This ticket only makes filtering visible; rule changes would be a separate decision informed by the data this exposes.
- Embeddings-stage or topic-modeling-stage exclusions. The current pipeline doesn't filter at those stages; if future stages add filters, the schema extends naturally via the `stage` enum.

## References

- `src/modules/analysis/services/pipeline-orchestrator.service.ts:1669-1672` — current sentiment dispatch filter
- `src/modules/analysis/services/pipeline-orchestrator.service.ts:403-483` — current sentiment gate logic
- `src/modules/analysis/constants.ts:1-6` — `SENTIMENT_GATE` config
- `src/modules/questionnaires/utils/clean-text.ts` — preprocessing rules
- `src/entities/analysis-pipeline.entity.ts:70-74` — existing gate aggregate columns
- `src/modules/analysis/dto/pipeline-status.dto.ts` — status payload schema

Investigated in session: https://claude.ai/code/session_01PJUfc4RqZ4mFazhP8fSULt
