MLSanity

Sanity-check your dataset before training your model.

MLSanity is an open-source dataset sanity-checking toolkit for image classification folders and tabular CSV files. Run one command to get a colorized terminal summary, optional JSON and HTML reports, and a simple health score, so you catch data issues before you waste GPU time.

Current release: v0.2.0 (MVP)

Why MLSanity?

Without MLSanity	With MLSanity
Spot-check a few files by hand	Scan every image file or table row the loaders see
Duplicates and leakage hide in large folders	Exact and near-duplicate checks, cross-split leakage
“Looks fine” in a notebook	Structured checks + scores + exportable reports
Hard to share findings with a team	JSON / HTML artifacts you can attach to PRs or tickets

Features at a glance

Capability	v0.2
Image classification folders (split-based or flat classes)	Yes
Tabular CSV with `--target` (optional `--split-column`)	Yes
Corruption, duplicates, near-duplicates, imbalance	Yes
Schema + tabular duplicates + leakage	Yes
Terminal (Rich), JSON, HTML reports	Yes
Train/val leakage when splits exist	Yes
Web dashboard, plugins, auto-fix	Not in v0.2

How it works

flowchart LR
    subgraph inputs["Your data"]
        IMG["Image folders"]
        CSV["CSV + target"]
    end

    subgraph pipeline["MLSanity"]
        L["Loaders"]
        C["Checks"]
        S["Health score"]
        R["Reports"]
    end

    IMG --> L
    CSV --> L
    L --> C
    C --> S
    S --> R

    R --> T["Terminal"]
    R --> J["JSON"]
    R --> H["HTML"]

Pipeline in words: loaders turn paths into Sample objects → checks produce CheckResults → scoring derives a 0–100 health score and status band → reporting prints to the terminal and optionally writes JSON / HTML.

Requirements

Python 3.11+

Install

git clone https://github.com/<your-username>/MLSanity.git
cd MLSanity
python -m pip install -e .

Quick usage

Image classification

mlsanity doctor /path/to/dataset --type image

Tabular CSV

mlsanity doctor /path/to/data.csv --type tabular --target target_column

Optional split column (enables cross-split leakage checks):

mlsanity doctor /path/to/data.csv --type tabular --target target_column --split-column split

Export reports

mlsanity doctor /path/to/data --type image \
  --json report.json \
  --html report.html

The CLI uses Rich for tables, panels, and colored status in the terminal.

Version

mlsanity version

Supported layouts

Images — split folders

dataset/
  train/
    cat/
    dog/
  val/
    cat/
    dog/

Images — flat class folders

dataset/
  cat/
  dog/

What v0.2 checks

Area	Check	What it does
Images	`corruption`	Zero-byte files; unreadable / invalid images (Pillow)
Images	`duplicates`	Exact duplicates via SHA-256 of file bytes
Images	`near_duplicates`	pHash + Hamming distance grouping
Images	`imbalance`	Class counts, %, imbalance ratio
Images	`leakage`	Same file hash in more than one split
Images	`leakage_near`	Near-duplicate pairs across splits
Images	`label_hints`	Heuristic hints for likely label issues (not definitive)
Tabular	`schema`	Missing values, empty columns, constant columns
Tabular	`duplicates`	Exact duplicate rows; conflicting labels on same features
Tabular	`imbalance`	Class counts, %, and imbalance ratio for the target column
Tabular	`leakage`	Same feature row under more than one split
Tabular	`label_hints`	Heuristic hints for likely label issues (not definitive)
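
To make the `duplicates` row for images concrete, here is a minimal sketch of exact-duplicate detection by hashing file bytes with SHA-256. The helper names `sha256_of_file` and `find_exact_duplicates` are made up for this example; MLSanity's own implementation may be organized differently.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(paths: list[Path]) -> list[list[Path]]:
    """Group paths whose bytes are identical; return only groups of two or more."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for p in paths:
        groups[sha256_of_file(p)].append(p)
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

Byte-level hashing only catches files that are identical bit for bit. Resized or re-encoded copies need a perceptual approach, which is what the `near_duplicates` check covers.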

If there are no splits (e.g. flat image layout), cross-split leakage checks return OK with a short “skipped” explanation.

Health score & status bands

The score starts at 100, subtracts a penalty for each check that returns a warning or error, and is then clamped to 0–100.

Check (when not OK)	Approx. penalty
`corruption`	−20
`leakage`	−25
`leakage_near`	−15
`duplicates` (warning / error)	−10 / −15
`near_duplicates`	−10
`imbalance`	−15
`schema`	−10 / −15
`label_hints`	−5

Score	`overall_status`
90–100	`healthy`
70–89	`acceptable`
40–69	`needs_attention`
0–39	`critical`
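
The two tables above can be combined into a small worked sketch. The penalty values are the approximate ones listed in the table; the function names, the input shape (a mapping from check name to status), and the reading of the `schema` penalty as warning / error are assumptions for this example, so the real scoring code may differ.

```python
# Flat penalties applied whenever the check is not OK (values from the table).
PENALTIES = {
    "corruption": 20,
    "leakage": 25,
    "leakage_near": 15,
    "near_duplicates": 10,
    "imbalance": 15,
    "label_hints": 5,
}

# Checks with two penalty levels. Treating "schema" as warning / error
# mirrors "duplicates" and is an assumption.
SEVERITY_PENALTIES = {
    "duplicates": {"warning": 10, "error": 15},
    "schema": {"warning": 10, "error": 15},
}


def health_score(statuses: dict[str, str]) -> int:
    """Start at 100, subtract a penalty per non-OK check, clamp to 0-100."""
    score = 100
    for name, status in statuses.items():
        if status == "ok":
            continue
        if name in SEVERITY_PENALTIES:
            score -= SEVERITY_PENALTIES[name][status]
        else:
            score -= PENALTIES.get(name, 0)
    return max(0, min(100, score))


def overall_status(score: int) -> str:
    """Map a 0-100 score to the status bands in the table."""
    if score >= 90:
        return "healthy"
    if score >= 70:
        return "acceptable"
    if score >= 40:
        return "needs_attention"
    return "critical"


example = {"corruption": "error", "duplicates": "warning", "imbalance": "ok"}
score = health_score(example)   # 100 - 20 - 10 = 70
band = overall_status(score)    # "acceptable"
```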

Report outputs compared

Output	Best for
Terminal	Fast feedback; colored summary in your shell
JSON	Scripts, CI, dashboards, custom tooling
HTML	Sharing with teammates; opening in a browser

v0.2 additions (beyond v0.1)

Suspicious-label hints: label_hints check (heuristic hints, status warning).
Dataset comparison mode: mlsanity compare OLD_PATH NEW_PATH ... with terminal + JSON + HTML compare reports.
CI-friendly quality gates: --min-score and --fail-on warning|error (non-zero exit code + pass/fail fields in JSON).
Tabular formats: tabular loader supports CSV, TSV, and Parquet (.parquet requires a Parquet engine like pyarrow).
Richer report visuals: HTML report now shows class distribution and split distribution charts, plus “what to fix first”.

Project layout

mlsanity/
  cli.py                 # Typer CLI (Rich output)
  engine.py              # Orchestrates loaders, checks, scoring
  types.py               # Sample, CheckResult, Report
  loaders/               # image_loader, tabular_loader
  checks/                # corruption, duplicates, near_duplicates, imbalance, schema, leakage
  reporting/             # terminal, json_report, html_report, scoring, templates/
examples/                # sample CSV + notes
tests/                   # pytest
logo.png                 # Logo for README (keep at repo root next to README.md)

Examples

See examples/README.md and examples/sample_tabular.csv.

Tests

python -m pip install pytest
python -m pytest tests/ -v

Roadmap (not v0.2)

Web dashboard / plugins / auto-fix UI
More formats and deeper diagnostics (tracked in project issues once published)

License

MIT — see LICENSE.
