Project B: Dashboards Are Dead
This project explores whether a traditional analytics dashboard can be replaced by a conversational experience powered by an LLM. Instead of requiring users to navigate dashboards with limited information, the project processes Overture Maps' monthly analytics data into an LLM-readable text file so users can ask natural-language questions about the data. The goal is to understand which data-formatting, prompt-engineering, and context-optimization choices make these answers reliable and useful, especially under real constraints like limited context windows and the risk of hallucination.
SQL Tooling + Metrics QA Agent

A tool-calling LLM agent that answers plain-English questions about Overture-style metrics by generating guardrailed SQL, executing it in DuckDB, and returning a grounded natural-language response with citations to the executed queries/results.

What this project does

- Cleans raw metrics CSVs → Parquet
- Builds a DuckDB database with convenient views:
  - row_counts
  - row_counts_accurate
  - theme_column_summary_stats
  - theme_column_release_to_release_comparisons
  - changelog_stats
- Provides a SQL tool that only allows SELECT / CTE queries
- Runs an agent loop: question → SQL tool calls → final grounded answer
- Includes an evaluation harness for hallucination/grounding checks on a test set in sql_tooling/test/questions.txt
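The read-only guardrail can be sketched as a simple allow-list check applied before any query reaches DuckDB. This is a minimal illustration, not the project's actual sql_tool.py; the function name and keyword list are assumptions:

```python
import re

# Keywords that indicate a state-mutating statement (illustrative, not exhaustive)
FORBIDDEN = {"insert", "update", "delete", "drop", "create",
             "alter", "attach", "copy", "pragma"}

def is_safe_sql(query: str) -> bool:
    """Allow only SELECT or WITH (CTE) statements; reject anything that
    could mutate the database."""
    # Strip line comments so a leading '--' comment can't defeat the prefix check
    text = re.sub(r"--[^\n]*", " ", query).strip().lower()
    if not text.startswith(("select", "with")):
        return False
    # Reject the query if any forbidden keyword appears as a standalone token
    tokens = set(re.findall(r"[a-z_]+", text))
    return not tokens & FORBIDDEN
```

A token-based check like this is deliberately conservative: a SELECT that merely mentions a forbidden word as a bare identifier would also be rejected, which is usually the right trade-off for an agent-facing tool.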
Repository layout

LLM-Data-Insights/
  data/
    cleaned/
    metrics/
    clean_metrics.py
  sql_tooling/
    test/
      evaluate.py
      agent_test.py
      questions.txt
    init_duckdb.py
    sql_tool.py
    run_agent.py
    sql_context.md
- Configure environment variables: create a .env file that exports OPEN_API_KEY
- Add the metrics folder to data/. From the data/ folder, run python clean_metrics.py to produce the cleaned/ folder
- From the root directory, run python sql_tooling/init_duckdb.py to create the DuckDB database
- Run python sql_tooling/run_agent.py
What is cleaned

We:
- Read metrics CSVs from release/theme/type folders
- Normalize filename/path metadata into stable columns (e.g., release_id, theme, type)
- Ensure consistent column presence/order across files
- Output Parquet partitions for fast querying
- Generate:
  - a manifest (what was ingested)
  - an ingest report (row counts, missing columns, anomalies)
  - a schema report (observed columns + types)
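The path-metadata normalization step above can be sketched like this. The exact folder layout and field names are assumptions based on the release/theme/type convention described, not the actual clean_metrics.py:

```python
from pathlib import Path

def path_metadata(csv_path: str) -> dict:
    """Derive stable columns from a metrics CSV path, assuming files
    are laid out as <release_id>/<theme>/<type>/<file>.csv."""
    parts = Path(csv_path).parts
    # Take the last four path components: release, theme, type, filename
    release_id, theme, type_, filename = parts[-4:]
    return {
        "release_id": release_id,
        "theme": theme,
        "type": type_,
        "source_file": filename,
    }
```

Columns derived this way stay stable even if the CSVs themselves omit or reorder fields, which is what makes cross-release comparisons possible later.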
Running the agent (interactive)

From the repo root:

python3 -m sql_tooling.run_agent

You’ll be prompted:

Ask a metrics question:

Example questions:
- “What is the total number of address records?”
- “What percentage of address records include postcode/ZIP values?”
- “Which datasets contribute the most address records?”
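The agent loop (question → SQL tool calls → final grounded answer) can be sketched with the model call and SQL executor injected as plain callables. The message protocol here is hypothetical and simplified, not the project's actual run_agent.py or the OpenAI tool-calling API:

```python
def run_agent(question, call_llm, run_sql, max_steps=5):
    """Minimal agent loop. `call_llm` returns either {"sql": "..."}
    to request a query or {"answer": "..."} to finish (assumed protocol)."""
    transcript = [{"role": "user", "content": question}]
    executed = []  # (sql, rows) pairs the final answer can cite
    for _ in range(max_steps):
        reply = call_llm(transcript)
        if "sql" in reply:
            # Execute the requested query and feed results back to the model
            rows = run_sql(reply["sql"])
            executed.append((reply["sql"], rows))
            transcript.append({"role": "tool", "content": str(rows)})
        else:
            return reply["answer"], executed
    return "No answer within step budget.", executed
```

Returning the executed (sql, rows) pairs alongside the answer is what makes citation and downstream grounding checks possible.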
Run python3 -m sql_tooling.test.evaluate
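A grounding check of the kind the evaluation harness performs can be approximated by verifying that every number cited in the model's answer appears in the executed query results. This is a simplified sketch; the actual logic in evaluate.py may differ:

```python
import re

def numbers_grounded(answer: str, result_rows) -> bool:
    """Return True iff every numeric value cited in the answer appears
    somewhere in the executed SQL results (simple hallucination check)."""
    grounded = {str(value) for row in result_rows for value in row}
    # Drop thousands separators so "1,234" matches the raw value 1234
    cited = re.findall(r"\d+(?:\.\d+)?", answer.replace(",", ""))
    return all(number in grounded for number in cited)
```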