A Streamlit app that helps you clean CSVs, infer column semantics, merge multiple files, and generate visualizations — with optional help from an LLM. It’s designed to be practical and auditable: you always see the code an LLM proposes, you can accept/reject steps, and your actions are logged to SQLite.
TL;DR

- **Tab 1 – CSV Cleaner**: choose built‑in steps from `table_steps.json`, get LLM suggestions, optionally apply LLM‑generated Python code (shown before execution), and download the cleaned CSV.
- **Tab 2 – Metadata Inspector**: upload/merge up to 5 CSVs, infer column types, preview media/links, and auto‑generate plots (with or without LLM help).
- **Advanced**: interactive decision‑tree UI for guided cleaning; optional custom LLM API; session/file/event audit logging with timestamped CSVs.
- **Guided CSV Cleaning**
  - Configurable steps defined in `table_steps.json` (sliders, selects, and text inputs rendered dynamically)
  - LLM proposes 5 non‑redundant cleaning suggestions (excluding already‑selected steps)
  - Any natural‑language instruction → executable Python code that mutates `df` in place (code is displayed before execution for review)
- **Metadata Inference & Usability**
  - Infers semantic types such as Categorical, Text, Numerical, Datetime, GPS Coordinates, Email, Phone, Currency, Percentage, Color Code, Image/Video/Document/General URL, Identifier/ID, Null‑heavy, and Constant/Low Variance
  - Context‑aware visualizations:
    - Categorical → top‑k bar chart
    - Text → word cloud
    - Numerical → box plot + correlation heatmap
    - Datetime → time‑series line plot
    - GPS → map from `lat,lon`
    - Color codes → swatches
    - Email/Phone → frequency bars
    - URLs → previews (images, videos) + webpage summarization via LLM for general links
- **Multi‑CSV Merge UI**
  - Upload up to 5 CSVs, configure pairwise joins (keys + type), then merge with one click
- **Interactive Cleaning Graph**
  - AGraph‑based decision tree per column; click leaf nodes to apply LLM‑generated cleaning for the chosen path; executed code is surfaced and actions are tracked
- **Exploration via D‑Tale**
  - One‑click link to open D‑Tale and explore the current DataFrame
- **Audit Logging**
  - Sessions, files, and events logged to SQLite via `DB/log_to_db.py`
  - Uploaded/merged CSVs saved to an audit folder with timestamp + session id
- **Optional Custom LLM API**
  - Query your own model endpoint (e.g., DeepSeek Coder) via `LLM/config.py`
```
my_data_cleaning_app/
├─ app.py                   # Streamlit UI (tabs, LLM helpers, decision tree, logging)
├─ pipeline_logic.py        # Executes selected cleaning steps on df
├─ metadata_inference.py    # Column type inference + LLM‑assisted helpers
├─ cleaningDecisionTree.py  # AGraph/PyVis decision tree + click‑to‑clean
├─ table_steps.json         # Declarative config driving the Cleaner UI
├─ DB/
│  ├─ log_to_db.py          # log_session, log_file, log_event (SQLite)
│  └─ auditCSVFiles/        # audit folder for saved CSVs (created at runtime)
├─ LLM/
│  └─ config.py             # your custom LLM API endpoint config
├─ .streamlit/              # Streamlit settings
├─ .devcontainer/           # VS Code Dev Container setup
├─ requirements.txt         # Python dependencies
└─ README.md                # (this file)
```
- **Config‑Driven UI** — `table_steps.json` defines sections, step names, descriptions, and typed options. `app.py` renders controls automatically and builds a `steps` list.
- **Pipeline Execution** — `pipeline_logic.run_pipeline(df, steps)` executes the selected steps in order. If an LLM instruction was accepted, its generated code is appended as a step and executed safely within the pipeline wrapper.
- **LLM Helpers**
  - `call_llm()` uses the Together API with the model `moonshotai/Kimi-K2-Instruct` to:
    - generate cleaning suggestions (`fetch_llm_suggestions`)
    - translate a natural‑language instruction → raw Python code (`get_cleaning_code_from_llm`)
  - A separate custom LLM API block posts to `LLM.config.API_URL`.
- **Metadata Inference & Visuals** — `metadata_inference.analyze_dataframe(df)` infers types and suggests basic visualizations; URL columns can be summarized via an LLM.
- **Decision Tree** — `cleaningDecisionTree.render_agraph_tree()` builds a compact action tree (with a maximum branching/leaf count). Clicking a leaf triggers `custom_cleaning_via_llm()` with a contextual instruction; code and results are shown.
- **Auditability** — all key actions are logged via `log_session`, `log_file`, and `log_event`. CSVs are saved to an audit directory with timestamp and session id.
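The pipeline contract can be sketched as follows — a minimal, hypothetical stand‑in for `pipeline_logic.run_pipeline` (the `drop_duplicates` step name and the `llm_code` handling here are illustrative, not the repo's actual step names):

```python
import pandas as pd

def run_pipeline(df: pd.DataFrame, steps: list) -> pd.DataFrame:
    """Apply the selected cleaning steps to df in order (illustrative sketch)."""
    for step in steps:
        if step["name"] == "drop_duplicates":
            df = df.drop_duplicates()
        elif step["name"] == "llm_code":
            # LLM-generated code is expected to mutate (or rebind) `df`;
            # the real app shows this code for review before executing it.
            ns = {"df": df, "pd": pd}
            exec(step["code"], ns)
            df = ns["df"]
    return df

raw = pd.DataFrame({"city": ["Oslo", "Oslo", "Bergen"]})
steps = [{"name": "drop_duplicates"},
         {"name": "llm_code", "code": "df['city_upper'] = df['city'].str.upper()"}]
cleaned = run_pipeline(raw, steps)
```

Appending the accepted LLM instruction as a regular step keeps one execution path for built‑in and generated cleaning alike.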
```bash
git clone https://github.com/iamvisheshsrivastava/my_data_cleaning_app.git
cd my_data_cleaning_app

# (recommended) Python 3.10+ virtual env
python -m venv .venv
# Windows
. .venv/Scripts/activate
# macOS/Linux
# source .venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt
```
**Together API** (used by `call_llm`)

1. Create an account and get an API key.
2. Set the environment variable before running Streamlit:

```bash
# Windows PowerShell
$env:TOGETHER_API_KEY = "YOUR_KEY"

# macOS/Linux
export TOGETHER_API_KEY="YOUR_KEY"
```
**Custom LLM endpoint** (used by the bottom "Custom Trained LLM via API" section)

Create/adjust `LLM/config.py` (already present in the repo). Example:

```python
# LLM/config.py
API_URL = "http://localhost:9000/generate"  # your FastAPI/Flask inference endpoint
# If you need headers/auth, modify app.py where requests.post is called.
```

The app posts `{ "prompt": "..." }` to `API_URL` and expects `{ "response": "..." }` in return. SSL verification is disabled in that call by default (`verify=False`).
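For reference, here is a minimal stdlib sketch of an endpoint honoring that contract — an echo handler stands in for a real model, and the path/port are whatever your `API_URL` points at:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class GenerateHandler(BaseHTTPRequestHandler):
    """Accepts POST {"prompt": "..."} and returns {"response": "..."}."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        # A real server would call the model here; we just echo the prompt.
        body = json.dumps({"response": f"echo: {payload.get('prompt', '')}"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the console quiet
        pass

# To serve: HTTPServer(("localhost", 9000), GenerateHandler).serve_forever()
```

Any framework works as long as the request/response JSON shapes match what `app.py` sends and expects.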
In `app.py`, the audit directory defaults to a Windows path:

```python
AUDIT_DIR = r"C:\\Users\\sriva\\Desktop\\AICUFLow\\my_data_cleaning_app\\DB\\auditCSVFiles"
```

Change this to a portable relative path if you're on macOS/Linux:

```python
from pathlib import Path

AUDIT_DIR = Path("DB/auditCSVFiles")
AUDIT_DIR.mkdir(parents=True, exist_ok=True)
```
```bash
streamlit run app.py
```
1. **Upload CSV** → preview top rows.
2. **Select processing steps** (forms are generated from `table_steps.json`).
3. **Get Smart LLM Suggestions** → returns 5 new, non‑redundant suggestions.
4. **(Optional) Custom instruction** → enter natural language.
5. **Run Cleaning Pipeline**
   - If a custom/selected LLM instruction exists, the app will:
     - call the LLM to produce raw Python code for `df`
     - show the code (for your review)
     - append it as a step and run the full pipeline
6. **Preview & Download** the cleaned CSV.
1. **Upload up to 5 CSVs** (or a single CSV). If multiple, configure joins (left/right keys + join type) and merge.
2. **Run Inference** → get a table with inferred types.
3. **Explore**
   - D‑Tale link to inspect data interactively.
   - Visualizations by type (bar / word cloud / box + heatmap / line / map / swatch / etc.).
   - General URL columns: choose a link, add an instruction (e.g., "summarize key points"), and the app fetches the page and asks the LLM to summarize it.
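The summarize flow boils down to "strip the page to visible text, then hand that text to the LLM together with the instruction". A stdlib sketch of the prompt‑building half (the real app uses `requests` + BeautifulSoup; the function name and prompt wording here are hypothetical):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>/<style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def build_summary_prompt(html: str, instruction: str, max_chars: int = 4000) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    # Collapse whitespace and cap length so the prompt fits the model's context
    text = " ".join(" ".join(parser.chunks).split())[:max_chars]
    return f"{instruction}\n\nPage content:\n{text}"
```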
- Click **Show Interactive Graph** → pick a column → generate the action tree.
- Click a leaf node to apply that cleaning action sequence via the LLM.
- The executed code is shown; the resulting DataFrame updates in place; repeated clicks on the same leaf are ignored.
- Free‑form prompt UI that posts to `LLM.config.API_URL`.
- The response is shown in a text area with timing info.
A minimal example to illustrate the shape (your file may be richer):

```json
{
  "processing": {
    "missing_values": [
      {
        "name": "Drop Nulls",
        "description": "Drop rows with too many missing values",
        "options": [
          { "name": "threshold", "data_type": "float", "value": 0.5 }
        ]
      }
    ],
    "encoding": [
      {
        "name": "One-Hot Encode",
        "description": "Encode low-cardinality categoricals",
        "options": [
          { "name": "max_unique", "data_type": "int", "value": 20 }
        ]
      }
    ]
  }
}
```
Each `option` supports a `data_type` (`int`, `float`, `str`, `select`) and an optional `options` list (for dropdowns). The UI renders the appropriate widgets and collects parameters into the `steps` list for `pipeline_logic.run_pipeline`.
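Collecting parameters from that config might look like the following sketch (the function and dict shapes are assumptions for illustration, not the repo's actual code):

```python
def collect_steps(config: dict, selected: dict) -> list:
    """Build the steps list from the table_steps.json config plus user selections.

    `selected` maps a chosen step name to any option overrides the user set;
    unset options fall back to the defaults in the config.
    """
    steps = []
    for section_steps in config.get("processing", {}).values():
        for step in section_steps:
            if step["name"] not in selected:
                continue
            overrides = selected[step["name"]]
            params = {opt["name"]: overrides.get(opt["name"], opt.get("value"))
                      for opt in step.get("options", [])}
            steps.append({"name": step["name"], "params": params})
    return steps

config = {"processing": {"missing_values": [
    {"name": "Drop Nulls",
     "options": [{"name": "threshold", "data_type": "float", "value": 0.5}]}]}}
steps = collect_steps(config, {"Drop Nulls": {"threshold": 0.8}})
```

Keeping defaults in the JSON and overrides in the UI lets new steps ship without touching the widget code.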
- **Session**: a unique `session_id` is created (`uuid4`).
- **CSV Save**: after upload/merge, the app writes to the audit folder: `uploaded_<UTC_YYYYMMDD-HHMMSS>_<session8>.csv`
- **DB Logging**: `log_session(session_id)`, `log_file(session_id, filename, path)`, and `log_event(session_id, event_type, event_detail)` write to SQLite (`audit.db`).
Note: the exact SQLite schema is defined in `DB/log_to_db.py`. Typical events include `file_upload`, `inference_triggered`, `column_visualized`, `custom_viz_success/error`, `custom_cleaning_success/error`, `agraph_tree_generated`, `agraph_node_cleaning_success/error`, and `feedback`.
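The event‑logging call can be pictured like this (the table layout is a guess for illustration — the authoritative schema lives in `DB/log_to_db.py`):

```python
import sqlite3
import uuid
from datetime import datetime, timezone

def log_event(conn, session_id, event_type, event_detail):
    """Append one timestamped audit event (illustrative schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS events ("
                 "session_id TEXT, event_type TEXT, event_detail TEXT, ts TEXT)")
    conn.execute("INSERT INTO events VALUES (?, ?, ?, ?)",
                 (session_id, event_type, event_detail,
                  datetime.now(timezone.utc).isoformat()))
    conn.commit()

conn = sqlite3.connect(":memory:")  # the app writes to a file such as audit.db
session_id = str(uuid.uuid4())
log_event(conn, session_id, "file_upload", "sales.csv")
```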
- **Review before execution**: LLM‑generated code is shown in the UI; execute it only if you trust it.
- **Network requests**: general URL analysis fetches webpages; avoid unknown or untrusted domains.
- **Secrets**: keep API keys in environment variables (don't commit them). The Together client reads `TOGETHER_API_KEY` from the environment.
- **SSL**: the custom LLM request uses `verify=False` by default; enable verification for production.
- Python 3.10+
- See `requirements.txt` for the full list (notably `streamlit`, `pandas`, `plotly`, `matplotlib`, `seaborn`, `wordcloud`, `beautifulsoup4`, `dtale`, `streamlit-agraph`, `pyvis`, `together`, and `requests`).
- **VS Code Dev Container**: open the repo in VS Code → "Reopen in Container" to develop in a preconfigured environment (see `.devcontainer`).
- **Styling/UX**: Streamlit components, Plotly charts, Matplotlib/Seaborn for custom visuals, AGraph for the interactive tree, and D‑Tale for data exploration.
- **Windows paths**: the default audit path is Windows‑specific; switch to `pathlib.Path` for portability as shown above.
- **D‑Tale link not opening**: ensure your browser can reach the host/port D‑Tale binds to; check firewall and proxy settings; try opening the printed URL directly.
- **LLM suggestions/code empty or erroring**: confirm `TOGETHER_API_KEY` is set and the Together service is reachable; retry with a simpler instruction.
- **Custom LLM API errors**: ensure your server at `LLM.config.API_URL` is running and returns `{ "response": "..." }` JSON.
- **Large CSVs**: if memory is tight, run with a smaller sample or increase system RAM; consider chunked processing in future extensions.
- **Visualization errors**: some plots assume valid numeric/datetime parsing; ensure columns are cast correctly or adjust instructions accordingly.
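For the large‑CSV case, chunked processing could be sketched like this (a possible future extension, not something the app does today; `dropna` stands in for the real cleaning steps):

```python
import io
import pandas as pd

def clean_in_chunks(csv_source, chunksize=100_000):
    """Clean a large CSV piece by piece instead of loading it all at once."""
    cleaned = []
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        cleaned.append(chunk.dropna())  # stand-in for the real cleaning steps
    return pd.concat(cleaned, ignore_index=True)

sample = io.StringIO("a,b\n1,2\n,4\n5,6\n")
result = clean_in_chunks(sample, chunksize=2)
```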
- More robust, non‑LLM type inference heuristics
- Built‑in CSV join diagnostics and key suggestions
- Executable cleaning playback (export steps as a Python script)
- Switch to portable audit paths by default; add env‑configurable audit dir
- Optional sandboxing for LLM‑generated code
- Multi‑page layout (Cleaner / Inspector / Recipes / Logs)
Issues and PRs are welcome! Please include a clear description, steps to reproduce, and screenshots/logs where helpful.
- Streamlit, pandas, Plotly, Matplotlib, Seaborn, WordCloud, BeautifulSoup
- D‑Tale, streamlit‑agraph, PyVis
- Together (LLM API)
**Vishesh Srivastava**
Portfolio: visheshsrivastava.com · GitHub: github.com/iamvisheshsrivastava · LinkedIn: linkedin.com/in/iamvisheshsrivastava