-
-
Notifications
You must be signed in to change notification settings - Fork 27
Proposal: DataQualityAgent for automated data health checks #1731
Description
Summary
While testing Lumen AI agents and reviewing the current agent architecture, I noticed there is no agent that handles automated data quality analysis. Users asking questions like "Are there missing values?", "Show me outliers", or "What's the data quality?" would currently get routed to SQLAgent, which relies on the LLM to write the correct SQL. This can be unreliable and inconsistent.
I'd like to propose a dedicated DataQualityAgent that provides structured, reliable data health checks.
Current Gap
Lumen AI currently has 11 agents:
| Agent | Purpose |
|---|---|
| ChatAgent | General conversation |
| SQLAgent | SQL query generation |
| hvPlotAgent | hvPlot visualizations |
| VegaLiteAgent | Vega-Lite visualizations |
| DeckGLAgent | Map/geographic visualizations |
| AnalysisAgent | Custom domain-specific analyses |
| ValidationAgent | Validates if user query was answered |
| TableListAgent | Lists available tables |
| DocumentListAgent | Lists available documents |
| DocumentSummarizerAgent | Summarizes documents |
| DbtslAgent | dbt Semantic Layer queries |
None of these handle structured data quality analysis.
Evidence: Gap Analysis Script
I wrote a script that scans all agent purpose and conditions fields for data quality keywords (missing, null, outlier, duplicate, quality, health, profiling):
1. AGENT PURPOSE/CONDITIONS SCAN
----------------------------------------
SQLAgent no data quality keywords
AnalysisAgent no data quality keywords
ChatAgent no data quality keywords
TableListAgent no data quality keywords
ValidationAgent MATCHES ['missing', 'quality']
hvPlotAgent no data quality keywords
VegaLiteAgent no data quality keywords
DeckGLAgent no data quality keywords
DocumentListAgent no data quality keywords
DocumentSummarizerAgent no data quality keywords
Note: ValidationAgent matches "missing" and "quality" but
its purpose is validating query completeness, NOT data quality.
ValidationAgent's actual purpose is: "Validates whether executed plans fully answered the user's original query. Identifies missing elements, assesses completeness." - this is about query completeness, not data health.
Evidence: SQLAgent Prompt Actively Excludes Nulls
The SQLAgent's main.jinja2 prompt contains:
NULL handling: Exclude unless specified
This instructs the LLM to filter OUT nulls from results rather than report them. So asking "are there missing values?" would fight against the prompt's own guidance.
Evidence: No Statistical Methods in Any Agent Prompt
Searched all agent prompts for statistical keywords (IQR, z-score, percentile, outlier, standard deviation). No agent prompt contains guidance for statistical analysis methods. Outlier detection queries would rely entirely on LLM improvisation.
Query Routing Analysis
| Query | Likely Routed To | Problem |
|---|---|---|
| "Are there missing values in my data?" | SQLAgent (best guess) | No explicit match, LLM-dependent SQL |
| "Show me outliers in the revenue column" | SQLAgent or ChatAgent (ambiguous) | No statistical method guidance |
| "Check data quality of this dataset" | No agent matches | Falls through to ChatAgent |
| "How many duplicate rows are there?" | SQLAgent (best guess) | No explicit match |
| "Give me a health report" | No agent matches | Falls through to ChatAgent |
| "What percentage of values are null?" | SQLAgent (best guess) | Fights against "Exclude nulls" prompt |
Proposed DataQualityAgent
What it would cover
- Missing values: Count and percentage of nulls per column
- Duplicates: Duplicate row detection and count
- Data types: Column types, mixed-type detection
- Cardinality: Unique value counts per column (high/low cardinality flags)
- Basic statistics: min, max, mean, std for numeric columns
- Outliers: IQR-based outlier detection for numeric columns
- Overall health score: A simple summary metric
Where it fits in the architecture
- Extends
BaseLumenAgent(needs pipeline access, no view generation) - Uses the existing
pipeline.dataDataFrame directly, no LLM-generated code needed - Returns a structured Pydantic model that can be rendered as a Panel component
- Conditions: triggers on "data quality", "health check", "missing values", "outliers", "duplicates"
Example interaction
User: "Check the quality of this data"
DataQualityAgent:
Data Health Report for "sales_data"
- 5,000 rows x 12 columns
- Missing values: 3 columns affected (revenue: 2.1%, region: 0.5%, date: 0.0%)
- Duplicates: 15 duplicate rows found
- Outliers: 23 outliers in "revenue" column (IQR method)
- Health Score: 87/100
Questions for Maintainers
- Would this be a welcome addition to the agent ecosystem?
- Should it extend
BaseLumenAgentor would a customAnalysissubclass be preferred? - Any preferences on the output format (structured Panel component vs. markdown summary)?
Happy to implement this if there's interest.