Proposal: DataQualityAgent for automated data health checks #1731

@ghostiee-11

Description

Summary

While testing Lumen AI agents and reviewing the current agent architecture, I noticed there is no agent that handles automated data quality analysis. Users asking questions like "Are there missing values?", "Show me outliers", or "What's the data quality?" would currently get routed to SQLAgent, which relies on the LLM to write the correct SQL. This can be unreliable and inconsistent.

I'd like to propose a dedicated DataQualityAgent that provides structured, reliable data health checks.

Current Gap

Lumen AI currently has 11 agents:

| Agent | Purpose |
| --- | --- |
| ChatAgent | General conversation |
| SQLAgent | SQL query generation |
| hvPlotAgent | hvPlot visualizations |
| VegaLiteAgent | Vega-Lite visualizations |
| DeckGLAgent | Map/geographic visualizations |
| AnalysisAgent | Custom domain-specific analyses |
| ValidationAgent | Validates if user query was answered |
| TableListAgent | Lists available tables |
| DocumentListAgent | Lists available documents |
| DocumentSummarizerAgent | Summarizes documents |
| DbtslAgent | dbt Semantic Layer queries |

None of these handle structured data quality analysis.

Evidence: Gap Analysis Script

I wrote a script that scans every agent's purpose and conditions fields for data quality keywords (missing, null, outlier, duplicate, quality, health, profiling):
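The core of that scan is roughly the following sketch. The agent metadata here is typed in by hand for illustration; the actual script reads the `purpose`/`conditions` attributes off the Lumen agent classes:

```python
# Keywords that signal a data quality responsibility.
KEYWORDS = {"missing", "null", "outlier", "duplicate", "quality", "health", "profiling"}

def scan_agent(name: str, purpose: str, conditions: str = "") -> list[str]:
    """Return the sorted data quality keywords found in an agent's metadata."""
    text = f"{purpose} {conditions}".lower()
    return sorted(k for k in KEYWORDS if k in text)

# Illustrative check against ValidationAgent's documented purpose:
matches = scan_agent(
    "ValidationAgent",
    "Validates whether executed plans fully answered the user's "
    "original query. Identifies missing elements, assesses completeness.",
)
print(matches)  # ['missing']
```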

```
1. AGENT PURPOSE/CONDITIONS SCAN
----------------------------------------
  SQLAgent                       no data quality keywords
  AnalysisAgent                  no data quality keywords
  ChatAgent                      no data quality keywords
  TableListAgent                 no data quality keywords
  ValidationAgent                MATCHES ['missing', 'quality']
  hvPlotAgent                    no data quality keywords
  VegaLiteAgent                  no data quality keywords
  DeckGLAgent                    no data quality keywords
  DocumentListAgent              no data quality keywords
  DocumentSummarizerAgent        no data quality keywords

  Note: ValidationAgent matches "missing" and "quality" but
  its purpose is validating query completeness, NOT data quality.
```

ValidationAgent's actual purpose is: "Validates whether executed plans fully answered the user's original query. Identifies missing elements, assesses completeness." - this is about query completeness, not data health.

Evidence: SQLAgent Prompt Actively Excludes Nulls

The SQLAgent's main.jinja2 prompt contains:

```
NULL handling: Exclude unless specified
```

This instructs the LLM to filter nulls OUT of results rather than report them. Asking "are there missing values?" would therefore work against the prompt's own guidance.

Evidence: No Statistical Methods in Any Agent Prompt

I searched all agent prompts for statistical keywords (IQR, z-score, percentile, outlier, standard deviation). No agent prompt contains guidance on statistical analysis methods, so outlier detection queries would rely entirely on LLM improvisation.
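That search amounts to the following sketch (the prompt directory layout is an assumption for illustration, not Lumen's actual package structure):

```python
from pathlib import Path

# Statistical method keywords searched for across prompt templates.
STAT_KEYWORDS = ["iqr", "z-score", "percentile", "outlier", "standard deviation"]

def scan_prompts(prompt_dir: str) -> dict[str, list[str]]:
    """Map each .jinja2 prompt file to the statistical keywords it mentions."""
    hits: dict[str, list[str]] = {}
    for path in Path(prompt_dir).rglob("*.jinja2"):
        text = path.read_text(encoding="utf-8").lower()
        found = [k for k in STAT_KEYWORDS if k in text]
        if found:
            hits[str(path)] = found
    return hits
```

Running this over the Lumen agent prompt templates returns an empty mapping.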

Query Routing Analysis

| Query | Likely Routed To | Problem |
| --- | --- | --- |
| "Are there missing values in my data?" | SQLAgent (best guess) | No explicit match, LLM-dependent SQL |
| "Show me outliers in the revenue column" | SQLAgent or ChatAgent (ambiguous) | No statistical method guidance |
| "Check data quality of this dataset" | No agent matches | Falls through to ChatAgent |
| "How many duplicate rows are there?" | SQLAgent (best guess) | No explicit match |
| "Give me a health report" | No agent matches | Falls through to ChatAgent |
| "What percentage of values are null?" | SQLAgent (best guess) | Fights against "Exclude nulls" prompt |

Proposed DataQualityAgent

What it would cover

  • Missing values: Count and percentage of nulls per column
  • Duplicates: Duplicate row detection and count
  • Data types: Column types, mixed-type detection
  • Cardinality: Unique value counts per column (high/low cardinality flags)
  • Basic statistics: min, max, mean, std for numeric columns
  • Outliers: IQR-based outlier detection for numeric columns
  • Overall health score: A simple summary metric
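To make the scope concrete, here is a rough pandas sketch of the core checks (function name and dictionary keys are placeholders, not a proposed API; the health score is omitted since its weighting would need discussion):

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Rough sketch of the proposed data health checks."""
    report = {
        "shape": df.shape,
        # Missing values: percentage of nulls per column
        "missing_pct": (df.isna().mean() * 100).round(1).to_dict(),
        # Duplicates: full-row duplicate count
        "duplicate_rows": int(df.duplicated().sum()),
        # Outliers: IQR-based detection per numeric column
        "outliers": {},
    }
    for col in df.select_dtypes("number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        report["outliers"][col] = int(outside.sum())
    return report

# Illustrative run on a tiny frame:
df = pd.DataFrame({
    "revenue": [1.0, 2.0, 3.0, 4.0, 100.0],
    "region": ["east", "west", None, "east", "west"],
})
print(quality_report(df))
```

Everything here runs directly on the DataFrame, which is what makes the agent deterministic: no SQL generation is involved.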

Where it fits in the architecture

  • Extends BaseLumenAgent (needs pipeline access, no view generation)
  • Uses the existing pipeline.data DataFrame directly, no LLM-generated code needed
  • Returns a structured Pydantic model that can be rendered as a Panel component
  • Conditions: triggers on "data quality", "health check", "missing values", "outliers", "duplicates"
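A minimal sketch of the structured result, assuming Pydantic v2 (all model and field names here are illustrative, not an existing Lumen schema):

```python
from pydantic import BaseModel

class ColumnMissing(BaseModel):
    column: str
    missing_pct: float

class DataQualityReport(BaseModel):
    # Field names are illustrative; the real schema would follow review.
    table: str
    n_rows: int
    n_cols: int
    missing: list[ColumnMissing]   # per-column null percentages
    duplicate_rows: int
    outliers: dict[str, int]       # column -> IQR outlier count
    health_score: int              # 0-100 summary metric

report = DataQualityReport(
    table="sales_data",
    n_rows=5_000,
    n_cols=12,
    missing=[ColumnMissing(column="revenue", missing_pct=2.1)],
    duplicate_rows=15,
    outliers={"revenue": 23},
    health_score=87,
)
print(report.model_dump())
```

A typed model like this keeps the agent's output machine-checkable while remaining trivial to render as either a Panel component or a markdown summary.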

Example interaction

```
User: "Check the quality of this data"

DataQualityAgent:
  Data Health Report for "sales_data"
  - 5,000 rows x 12 columns
  - Missing values: 3 columns affected (revenue: 2.1%, region: 0.5%, date: 0.0%)
  - Duplicates: 15 duplicate rows found
  - Outliers: 23 outliers in "revenue" column (IQR method)
  - Health Score: 87/100
```

Questions for Maintainers

  1. Would this be a welcome addition to the agent ecosystem?
  2. Should it extend BaseLumenAgent or would a custom Analysis subclass be preferred?
  3. Any preferences on the output format (structured Panel component vs. markdown summary)?

Happy to implement this if there's interest.
