Proposal: DataQualityAgent for automated data health checks #1731

@ghostiee-11

Description

Summary

While testing Lumen AI agents and reviewing the current agent architecture, I noticed there is no agent that handles automated data quality analysis. Users asking questions like "Are there missing values?", "Show me outliers", or "What's the data quality?" would currently get routed to SQLAgent, which relies on the LLM to write the correct SQL. This can be unreliable and inconsistent.

I'd like to propose a dedicated DataQualityAgent that provides structured, reliable data health checks.

Current Gap

Lumen AI currently has 11 agents:

| Agent | Purpose |
| --- | --- |
| ChatAgent | General conversation |
| SQLAgent | SQL query generation |
| hvPlotAgent | hvPlot visualizations |
| VegaLiteAgent | Vega-Lite visualizations |
| DeckGLAgent | Map/geographic visualizations |
| AnalysisAgent | Custom domain-specific analyses |
| ValidationAgent | Validates if user query was answered |
| TableListAgent | Lists available tables |
| DocumentListAgent | Lists available documents |
| DocumentSummarizerAgent | Summarizes documents |
| DbtslAgent | dbt Semantic Layer queries |

None of these handle structured data quality analysis.

Evidence: Gap Analysis Script

I wrote a script that scans every agent's purpose and conditions fields for data quality keywords (missing, null, outlier, duplicate, quality, health, profiling):
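The core of that scan is roughly the following sketch. The agent metadata here is typed in by hand for illustration; the actual script reads the `purpose`/`conditions` attributes off the Lumen agent classes:

```python
# Keywords that signal a data quality responsibility.
KEYWORDS = {"missing", "null", "outlier", "duplicate", "quality", "health", "profiling"}

def scan_agent(name: str, purpose: str, conditions: str = "") -> list[str]:
    """Return the sorted data quality keywords found in an agent's metadata."""
    text = f"{purpose} {conditions}".lower()
    return sorted(k for k in KEYWORDS if k in text)

# Illustrative check against ValidationAgent's documented purpose:
matches = scan_agent(
    "ValidationAgent",
    "Validates whether executed plans fully answered the user's "
    "original query. Identifies missing elements, assesses completeness.",
)
print(matches)  # ['missing']
```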

```
1. AGENT PURPOSE/CONDITIONS SCAN
----------------------------------------
  SQLAgent                       no data quality keywords
  AnalysisAgent                  no data quality keywords
  ChatAgent                      no data quality keywords
  TableListAgent                 no data quality keywords
  ValidationAgent                MATCHES ['missing', 'quality']
  hvPlotAgent                    no data quality keywords
  VegaLiteAgent                  no data quality keywords
  DeckGLAgent                    no data quality keywords
  DocumentListAgent              no data quality keywords
  DocumentSummarizerAgent        no data quality keywords

  Note: ValidationAgent matches "missing" and "quality" but
  its purpose is validating query completeness, NOT data quality.
```

ValidationAgent's actual purpose is: "Validates whether executed plans fully answered the user's original query. Identifies missing elements, assesses completeness." - this is about query completeness, not data health.

Evidence: SQLAgent Prompt Actively Excludes Nulls

The SQLAgent's main.jinja2 prompt contains:

```
NULL handling: Exclude unless specified
```

This instructs the LLM to filter nulls OUT of results rather than report them. Asking "are there missing values?" would therefore work against the prompt's own guidance.

Evidence: No Statistical Methods in Any Agent Prompt

I searched all agent prompts for statistical keywords (IQR, z-score, percentile, outlier, standard deviation). No agent prompt contains guidance on statistical analysis methods, so outlier detection queries would rely entirely on LLM improvisation.
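That search amounts to the following sketch (the prompt directory layout is an assumption for illustration, not Lumen's actual package structure):

```python
from pathlib import Path

# Statistical method keywords searched for across prompt templates.
STAT_KEYWORDS = ["iqr", "z-score", "percentile", "outlier", "standard deviation"]

def scan_prompts(prompt_dir: str) -> dict[str, list[str]]:
    """Map each .jinja2 prompt file to the statistical keywords it mentions."""
    hits: dict[str, list[str]] = {}
    for path in Path(prompt_dir).rglob("*.jinja2"):
        text = path.read_text(encoding="utf-8").lower()
        found = [k for k in STAT_KEYWORDS if k in text]
        if found:
            hits[str(path)] = found
    return hits
```

Running this over the Lumen agent prompt templates returns an empty mapping.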

Query Routing Analysis

| Query | Likely Routed To | Problem |
| --- | --- | --- |
| "Are there missing values in my data?" | SQLAgent (best guess) | No explicit match, LLM-dependent SQL |
| "Show me outliers in the revenue column" | SQLAgent or ChatAgent (ambiguous) | No statistical method guidance |
| "Check data quality of this dataset" | No agent matches | Falls through to ChatAgent |
| "How many duplicate rows are there?" | SQLAgent (best guess) | No explicit match |
| "Give me a health report" | No agent matches | Falls through to ChatAgent |
| "What percentage of values are null?" | SQLAgent (best guess) | Fights against "Exclude nulls" prompt |

Proposed DataQualityAgent

What it would cover

  • Missing values: Count and percentage of nulls per column
  • Duplicates: Duplicate row detection and count
  • Data types: Column types, mixed-type detection
  • Cardinality: Unique value counts per column (high/low cardinality flags)
  • Basic statistics: min, max, mean, std for numeric columns
  • Outliers: IQR-based outlier detection for numeric columns
  • Overall health score: A simple summary metric
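To make the scope concrete, here is a rough pandas sketch of the core checks (function name and dictionary keys are placeholders, not a proposed API; the health score is omitted since its weighting would need discussion):

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Rough sketch of the proposed data health checks."""
    report = {
        "shape": df.shape,
        # Missing values: percentage of nulls per column
        "missing_pct": (df.isna().mean() * 100).round(1).to_dict(),
        # Duplicates: full-row duplicate count
        "duplicate_rows": int(df.duplicated().sum()),
        # Outliers: IQR-based detection per numeric column
        "outliers": {},
    }
    for col in df.select_dtypes("number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        report["outliers"][col] = int(outside.sum())
    return report

# Illustrative run on a tiny frame:
df = pd.DataFrame({
    "revenue": [1.0, 2.0, 3.0, 4.0, 100.0],
    "region": ["east", "west", None, "east", "west"],
})
print(quality_report(df))
```

Everything here runs directly on the DataFrame, which is what makes the agent deterministic: no SQL generation is involved.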

Where it fits in the architecture

  • Extends BaseLumenAgent (needs pipeline access, no view generation)
  • Uses the existing pipeline.data DataFrame directly, no LLM-generated code needed
  • Returns a structured Pydantic model that can be rendered as a Panel component
  • Conditions: triggers on "data quality", "health check", "missing values", "outliers", "duplicates"
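A minimal sketch of the structured result, assuming Pydantic v2 (all model and field names here are illustrative, not an existing Lumen schema):

```python
from pydantic import BaseModel

class ColumnMissing(BaseModel):
    column: str
    missing_pct: float

class DataQualityReport(BaseModel):
    # Field names are illustrative; the real schema would follow review.
    table: str
    n_rows: int
    n_cols: int
    missing: list[ColumnMissing]   # per-column null percentages
    duplicate_rows: int
    outliers: dict[str, int]       # column -> IQR outlier count
    health_score: int              # 0-100 summary metric

report = DataQualityReport(
    table="sales_data",
    n_rows=5_000,
    n_cols=12,
    missing=[ColumnMissing(column="revenue", missing_pct=2.1)],
    duplicate_rows=15,
    outliers={"revenue": 23},
    health_score=87,
)
print(report.model_dump())
```

A typed model like this keeps the agent's output machine-checkable while remaining trivial to render as either a Panel component or a markdown summary.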

Example interaction

```
User: "Check the quality of this data"

DataQualityAgent:
  Data Health Report for "sales_data"
  - 5,000 rows x 12 columns
  - Missing values: 3 columns affected (revenue: 2.1%, region: 0.5%, date: 0.0%)
  - Duplicates: 15 duplicate rows found
  - Outliers: 23 outliers in "revenue" column (IQR method)
  - Health Score: 87/100
```

Questions for Maintainers

  1. Would this be a welcome addition to the agent ecosystem?
  2. Should it extend BaseLumenAgent or would a custom Analysis subclass be preferred?
  3. Any preferences on the output format (structured Panel component vs. markdown summary)?

Happy to implement this if there's interest.
