
# LLM-Data-Insights

Project B: Dashboards Are Dead

Contributor: Irene He

## Description

This project explores whether a traditional analytics dashboard can be replaced by a conversational experience powered by an LLM. Instead of requiring users to navigate dashboards with limited information, the project processes Overture Maps' monthly analytics data into an LLM-readable text file so users can ask the LLM natural-language questions about the data. The goal is to understand which data formatting, prompt engineering, and context optimization choices make these answers reliable and useful, especially under real constraints like limited context windows and the risk of hallucinations.

## SQL Tooling + Metrics QA Agent

A tool-calling LLM agent that answers plain-English questions about Overture-style metrics by generating guardrailed SQL, executing it in DuckDB, and returning a grounded natural-language response with citations to the executed queries and results.

What this project does:

- Cleans raw metrics CSVs → Parquet
- Builds a DuckDB database with convenient views:
  - row_counts
  - row_counts_accurate
  - theme_column_summary_stats
  - theme_column_release_to_release_comparisons
  - changelog_stats
- Provides a SQL tool that only allows SELECT / CTE queries
- Runs an agent loop: question → SQL tool calls → final grounded answer
- Includes an evaluation harness for hallucination/grounding checks on a test set in sql_tooling/test/questions.txt
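The SELECT/CTE-only restriction above can be sketched as a small guard in front of the database connection. This is a minimal illustration, not the repo's actual `sql_tool.py`; the function names are hypothetical, and the keyword blocklist is deliberately conservative (it would also reject a SELECT whose text merely contains a word like "update").

```python
import re

# A query must start with SELECT or WITH (for CTEs)...
_ALLOWED = re.compile(r"^\s*(select|with)\b", re.IGNORECASE)
# ...and must not contain any statement keyword that could mutate state.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|copy|pragma)\b",
    re.IGNORECASE,
)


def is_readonly(sql: str) -> bool:
    """Return True if sql looks like a read-only SELECT / CTE query."""
    return bool(_ALLOWED.match(sql)) and not _FORBIDDEN.search(sql)


def run_readonly_sql(con, sql: str):
    """Execute sql on a DB-API connection only if the guard passes."""
    if not is_readonly(sql):
        raise ValueError("Only SELECT / CTE queries are allowed")
    return con.execute(sql).fetchall()
```

The same guard works unchanged against a `duckdb.connect()` connection, since DuckDB's Python API follows the same `execute(...).fetchall()` shape.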

## Repo layout

```
LLM-Data-Insights/
  data/
    cleaned/
    metrics/
    clean_metrics.py
  sql_tooling/
    test/
      evaluate.py
      agent_test.py
      questions.txt
    init_duckdb.py
    sql_tool.py
    run_agent.py
    sql_context.md
```

## Setup

1. Configure environment variables: create a .env file that exports OPEN_API_KEY.
2. Add a metrics folder under data/. From the data/ folder, run `python clean_metrics.py` to produce the cleaned/ folder.
3. From the repo root, run `python sql_tooling/init_duckdb.py` to create the DuckDB database.
4. Run `python sql_tooling/run_agent.py`.

## Data pipeline: CSV → Parquet → DuckDB

The cleaning step:

- Reads metrics CSVs from release/theme/type folders
- Normalizes filename/path metadata into stable columns (e.g., release_id, theme, type)
- Ensures consistent column presence and order across files
- Outputs Parquet partitions for fast querying

It also generates:

- a manifest (what was ingested)
- an ingest report (row counts, missing columns, anomalies)
- a schema report (observed columns and types)
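The path-metadata normalization step can be sketched as follows. This is illustrative only: the folder naming and the helper name `path_metadata` are assumptions, not taken from `clean_metrics.py`, and the sketch assumes files are laid out as `.../<release_id>/<theme>/<type>/<file>.csv`.

```python
from pathlib import Path


def path_metadata(csv_path: str) -> dict:
    """Derive stable release_id/theme/type columns from a file's path.

    Assumes the layout .../<release_id>/<theme>/<type>/<file>.csv,
    so the three folders directly above the file carry the metadata.
    """
    parts = Path(csv_path).parts
    release_id, theme, type_ = parts[-4], parts[-3], parts[-2]
    return {"release_id": release_id, "theme": theme, "type": type_}


def align_columns(rows: list[dict], columns: list[str]) -> list[dict]:
    """Force every row to the same column set and order,
    filling columns missing from a given file with None."""
    return [{c: row.get(c) for c in columns} for row in rows]
```

Rows aligned this way can then be written to Parquet partitions and queried from DuckDB.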

## Running the agent (interactive)

From the repo root:

`python3 -m sql_tooling.run_agent`

You'll be prompted:

`Ask a metrics question:`

Example questions:

- "What is the total number of address records?"
- "What percentage of address records include postcode/ZIP values?"
- "Which datasets contribute the most address records?"

## Evaluation

Run `python3 -m sql_tooling.test.evaluate`
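One simple grounding heuristic such a harness can apply is checking that every number the LLM cites actually appears in the executed SQL results. This is a hypothetical sketch of that idea, not the metric implemented in `evaluate.py`:

```python
import re


def numbers_in(text: str) -> set[str]:
    """Extract all numeric literals (ints and decimals) from text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def is_grounded(answer: str, sql_results: str) -> bool:
    """An answer is 'grounded' (under this heuristic) if every number
    it cites also appears in the query results it claims to cite."""
    return numbers_in(answer) <= numbers_in(sql_results)
```

A flagged answer is one whose numbers cannot be traced back to any executed query, which is a cheap but useful proxy for hallucination on metrics questions.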
