Home
Just my notes so as not to clutter the main README.
There are so many ads and articles now for AI/LLM-powered products that I figured were mostly bullshit, so I had a conversation with ChatGPT on the subject. We went back and forth for maybe an hour on how to design an app that evaluates marketing claims, and arrived at the basics of what you see here. That was a few months ago. I didn't follow it up with an implementation, as it was just a thought experiment. Then Codex 5.3 came out, so I thought I'd see whether I could produce an app without writing any code myself, and figured the claims evaluator would be a good test. I went back to ChatGPT, told it my plan and had it write the prompts for Codex. All I did was set up a new repo on GitHub and pull it into VSCode. I also installed the Codex extension in VSCode, so I was working in the IDE rather than from the Codex app.
Here is my first request to ChatGPT: "Can you review this whole conversation and generate a suitable input for me to pass to codex (in vscode) to carry out a full build of the app."
ChatGPT
You are Codex inside VS Code. Build a complete, working MVP web app for an “LLM Claim Evaluator” as specified below. Work in this repo. Do not ask questions; make reasonable defaults. Produce a clean, runnable project with README and minimal dependencies. Follow a “read files first, then implement” workflow and do not improvise structure once files exist.
# Product summary
A tool that takes a vendor/speaker claim about LLM-powered business applications and places it on a “Reality Map” (Maturity vs Impact), returning:
- category: Mature & Proven | Scaling & Strategic | Frontier & Experimental | Mirage Zone
- maturity_score (0–10), impact_score (0–10)
- red_flags, cautions, questions_to_ask
- evidence list (initially: user-provided links + local seed evidence; later: ingestion)
- plot coords (maturity, impact)
Also provide a landing page explaining the concept + a Resources page linking related tools.
# Hard requirements
1) Monorepo layout: /apps/api (FastAPI) and /apps/web (Vite + React + TS).
2) Provide Docker Compose for Postgres + pgvector; app should run locally without external services.
3) Implement a fully working end-to-end flow:
- Web UI: paste claim → choose industry → optional links → Analyze → get plotted point + verdict
- API endpoint /analyze_claim that returns the JSON contract
4) ...
etc
I gave that to Codex and off it went, building the app in a couple of minutes. Not quite in its current form, but with a lot of the basics.
Me to ChatGPT
It's off and running.
ChatGPT
Out of curiosity:
Did it follow the architecture cleanly (monorepo split, scoring module separated, seed evidence, etc.), or did it start improvising?
Which, if you think about it, is pretty incredible.
Me:
It seemed pretty clean to me. Would you like to see the folder structure or the code?
Then some back and forth on code and screenshots till -
ChatGPT:
This is solid and matches the spec closely.
What’s correct
Category boundaries are exactly as specified (>=7, >=5, >=3, else). ✅
Maturity rubric matches: 3*independent + 2*named + 2*production + recency - penalty, with caps and clamp. ✅
Impact rubric matches: base by industry + plausible bonus − governance penalty for healthcare/legal when “no human review” implied. ✅
CORS is correct for local dev on 5173. ✅
Startup DB init creates pgvector extension, creates tables, seeds. ✅
Endpoints present: /health, /evidence/search, /analyze_claim. ✅
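For anyone who wants to see the shape of that rubric, here is a minimal Python sketch. Only the weights (3, 2, 2), the thresholds (7, 5, 3), the 0 to 10 range and the four category names come from the notes above. The `Signals` fields, the cap of 2 per signal, the recency bonus of 1 and the idea that the thresholds are applied to the maturity score are assumptions for illustration, not the code Codex generated.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    # Hypothetical signal counts; the real collect_signals() output may differ.
    independent: int = 0   # independent sources backing the claim
    named: int = 0         # named customers or deployments
    production: int = 0    # evidence of production use
    recent: bool = False   # at least one item inside the recency window
    penalty: float = 0.0   # deduction for vague or unsupported wording


def clamp(value: float, low: float = 0.0, high: float = 10.0) -> float:
    """Keep a score inside the 0 to 10 range."""
    return max(low, min(high, value))


def maturity_score(s: Signals, cap: int = 2) -> float:
    # Weights 3/2/2 come from the rubric; cap=2 and the +1 recency bonus are assumed.
    raw = (
        3 * min(s.independent, cap)
        + 2 * min(s.named, cap)
        + 2 * min(s.production, cap)
        + (1 if s.recent else 0)
        - s.penalty
    )
    return clamp(raw)


def category(score: float) -> str:
    # Thresholds come from the review; which score they apply to is an assumption.
    if score >= 7:
        return "Mature & Proven"
    if score >= 5:
        return "Scaling & Strategic"
    if score >= 3:
        return "Frontier & Experimental"
    return "Mirage Zone"
```

With these assumed values, one independent source, one named customer, one production reference and a recent item add up to 8, which lands in "Mature & Proven"; a claim with no signals at all scores 0 and falls into "Mirage Zone".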
Two small improvements worth making
Recency comparison assumes published_at is a date
Your collect_signals() compares published >= recency_cutoff. That’s fine if published_at is a Python date. If it’s sometimes a string (e.g., coming from API payload or seed data), this will throw or silently misbehave.
Fix: normalize in one place, e.g. accept ISO strings and parse to date (or ensure schemas guarantee date).
evidence/search response should be typed
Right now it returns dict with results. That’s fine, but you’ll get better stability if you add a response model (so the web UI and tests don’t drift).
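The first of those two suggestions, normalising published_at in one place, could look roughly like the sketch below. The helper names `to_date` and `is_recent` and the 365-day window are hypothetical; only the comparison against `recency_cutoff` comes from the review.

```python
from datetime import date, datetime, timedelta
from typing import Union


def to_date(value: Union[str, date, datetime]) -> date:
    """Coerce an ISO string, a datetime or a date into a plain date."""
    if isinstance(value, datetime):  # check datetime first: it subclasses date
        return value.date()
    if isinstance(value, date):
        return value
    # Accepts "2026-01-31" and also "2026-01-31T09:30:00" by keeping the date part.
    # Anything that is not ISO-formatted raises ValueError instead of misbehaving silently.
    return date.fromisoformat(value[:10])


def is_recent(published_at: Union[str, date, datetime], today: date,
              window_days: int = 365) -> bool:
    # window_days is an assumed default, not a value taken from the app.
    recency_cutoff = today - timedelta(days=window_days)
    return to_date(published_at) >= recency_cutoff
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check deterministic and easy to test.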
And so on, back and forth between me, ChatGPT and Codex. Day 1 I probably spent about 4 hours on it, Day 2 around 3 hours, Day 3 a couple more. Day 4 (as I write) I haven't made any functional changes; I just cleaned up the README and info pages, added the UI to GitHub Pages and wrote this page (sort of). I've probably spent more time fiddling with markdown here than on doing anything technical in the app itself.
So I'm going to leave it there for now. There's plenty more that could be done with this: adjusting score weights, improving evidence handling, and so on. But as it stands I think it has potential.
As apps go, this isn't sexy or world-changing, but it demonstrates to me that I don't have to grind code, or even write tests, any more. A useful move up the abstraction ladder.
Unsurprisingly, I let ChatGPT write the rest of this page. It is based on our conversation though, not LLM guesswork.
Core principles behind ClaimCheck:
ClaimCheck is intentionally rule-based and explicit. The goal is not to “decide truth” using another model, but to apply a transparent scoring rubric to vendor claims and surface signals that a human reviewer can interrogate.
The core principles are:
- Determinism over model magic. The scoring logic is explicit, inspectable, and reproducible. The system does not generate verdicts from opaque prompts; it applies defined weights and heuristics.
- Claim-level analysis. Evaluation happens at the level of individual claims and subclaims, not broad company reputation.
- Structured scepticism. Absence of evidence is treated as a signal. Vague language is penalised. High-impact claims without production context trigger scrutiny.
- Human-in-the-loop by design. The tool produces questions, cautions, and signals, not final judgments.
- MVP discipline. Build the simplest working system first, then iterate with real-world examples.
- This is not a fact-checker. It is a structured lens for examining delivery maturity and business impact claims.
Trade-offs made:
- Rules-based scoring instead of LLM evaluation.
  This sacrifices flexibility and nuance in favour of clarity, auditability, and predictable behaviour.
- Keyword matching instead of semantic search.
  Faster to implement and easier to reason about, but prone to false positives from generic terms.
- Manual evidence seeding instead of web ingestion.
  Keeps the system simple and offline-capable, but limits real-world verification depth.
- Harsh by default.
  Vague or unsupported claims are scored conservatively. This biases toward scepticism rather than optimism.
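To make the keyword-matching trade-off concrete, here is a small sketch of how generic terms produce false positives. It is not the app's matcher: the tokeniser, the `GENERIC_TERMS` list and the example sentences are all made up for illustration.

```python
import re

# Hypothetical list of generic terms; the app does not define one yet.
GENERIC_TERMS = frozenset({"ai", "platform", "solution", "llm"})


def keywords(text: str) -> set:
    """Lowercase the text and return its distinct alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(claim: str, evidence: str, ignore=frozenset()) -> set:
    """Tokens shared by claim and evidence, minus any ignored terms."""
    return (keywords(claim) & keywords(evidence)) - ignore


claim = "Our AI platform cuts contract review time by 80%"
unrelated = "A new AI platform for scheduling dental appointments"
related = "Law firm reports faster contract review after pilot"

print(keyword_overlap(claim, unrelated))                 # generic terms only
print(keyword_overlap(claim, unrelated, GENERIC_TERMS))  # empty once filtered
print(keyword_overlap(claim, related, GENERIC_TERMS))    # the substantive match
```

Without filtering, the unrelated evidence matches the claim on two words ("ai" and "platform"), exactly as many as the genuinely related evidence does on "contract" and "review". Ignoring generic terms removes the false positive while leaving the real match intact.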
Key technical choices:
- Monorepo: FastAPI + Vite/React (TypeScript)
  - Why: Clear separation between API logic and UI while keeping the project cohesive. FastAPI provides fast iteration and explicit typing. Vite + React offers a lightweight, modern frontend with minimal overhead.
  - Alternatives considered: A single Flask app with server-rendered templates (simpler, but less modular). Next.js full-stack (more integrated, but heavier than necessary for MVP).
- Deterministic Scoring Module
  - Why: The scoring logic lives in explicit Python modules (scoring.py, heuristics.py) rather than inside prompts. This makes behaviour inspectable and testable.
  - Trade-offs: Enables transparency and debugging. Constrains nuance compared to ML-based scoring.
- Subclaim Decomposition
  - Why: Vendor marketing copy often bundles multiple assertions. Breaking into subclaims allows the system to score the weakest element and surface the most suspicious component.
  - Trade-offs: Simple sentence splitting is imperfect. A more sophisticated NLP pipeline would improve granularity but add complexity.
- Seed Evidence + User Links
  - Why: Evidence is treated as structured input, not scraped dynamically. This keeps the MVP deterministic and avoids crawling complexity.
  - Trade-offs: No automatic ingestion. Real-world evaluation requires manual curation.
- Reality Map (Maturity vs Impact)
  - Why: A two-axis visual makes trade-offs obvious. High impact + low maturity is risky. High maturity + low impact is incremental. The quadrant framing forces clarity.
  - Alternatives considered: Single composite score (simpler but hides trade-offs). Radar chart (more expressive but harder to interpret quickly).
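The subclaim decomposition idea can be sketched in a few lines. This is an illustration under assumptions, not the app's implementation: the regex splitter, the `VAGUE_TERMS` list and `toy_score` are stand-ins for whatever scoring.py and heuristics.py actually do.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical vague terms used by the stand-in scorer below.
VAGUE_TERMS = {"revolutionary", "seamless", "guaranteed"}


def split_subclaims(claim: str) -> List[str]:
    """Naive split: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", claim.strip())
    return [p for p in parts if p]


def toy_score(sentence: str) -> float:
    """Stand-in scorer: start at 5 and subtract 2 for each vague term."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return 5.0 - 2.0 * sum(w in VAGUE_TERMS for w in words)


def weakest_subclaim(claim: str, score: Callable[[str], float]) -> Tuple[str, float]:
    """Score every subclaim and return the lowest-scoring one."""
    scored = [(s, score(s)) for s in split_subclaims(claim)]
    if not scored:
        raise ValueError("claim contains no sentences")
    return min(scored, key=lambda pair: pair[1])


copy = ("We deployed the assistant to 40 support agents in 2025. "
        "It delivers revolutionary, guaranteed ROI.")
print(weakest_subclaim(copy, toy_score))
```

Here the second sentence is surfaced as the weakest element because it carries both vague terms. The splitter also shows why sentence splitting is imperfect: an abbreviation such as "e.g." followed by a space is treated as a sentence boundary.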
Things I'm considering:
- Structured testimonial ingestion (separate field for named customers, quotes, deployment context)
- Evidence weighting based on source type (independent study > vendor blog > testimonial)
- Improved keyword logic (ignore generic terms, add stopword filtering)
- Automatic detection of quantitative claims (%, multipliers, ROI language)
- Batch export and comparative reporting for procurement reviews
- Optional semantic search or embedding-based evidence matching
- Adjustable scoring weights for different industries
- Generalise to claims of any kind, not just LLM-specific ones
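As a rough idea of what the quantitative-claim detection in that list might involve, here is a regex-based sketch. Nothing like this exists in the app yet; the pattern names, the patterns themselves and the ROI vocabulary are assumptions, and real marketing copy would need a broader set.

```python
import re

# Hypothetical patterns for the three kinds of quantitative language listed above.
QUANT_PATTERNS = {
    "percentage": re.compile(r"\b\d+(?:\.\d+)?\s?%"),
    "multiplier": re.compile(r"\b\d+(?:\.\d+)?\s?x\b", re.IGNORECASE),
    "roi_language": re.compile(r"\b(?:roi|return on investment|payback)\b", re.IGNORECASE),
}


def find_quantitative_claims(text: str) -> dict:
    """Return the matched snippets per pattern, omitting patterns with no hits."""
    hits = {}
    for label, pattern in QUANT_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    return hits


print(find_quantitative_claims("Cuts costs by 40% and delivers 10x ROI in six months"))
print(find_quantitative_claims("We help teams move faster"))
```

The first call picks up "40%", "10x" and "ROI"; the second returns an empty dict. Flagged snippets like these would be a natural trigger for extra questions_to_ask rather than a score change on their own.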
Things I'm explicitly NOT doing (for now):
- ❌ Full web crawling and automated verification
  This quickly becomes a fact-checking engine rather than a structured rubric tool.
- ❌ Black-box LLM verdict generation
  That would undermine transparency and reproducibility.
- ❌ Turning this into a “fraud detector”
  The tool evaluates signal quality and maturity; it does not accuse vendors of deception.
- ❌ Over-optimising the UI
  The focus remains on evaluation logic, not marketing polish.
Last updated: 14 Feb 2026