feat: add 2026 provider models + fix model-name cost-matching bugs #1
Merged
New models (registry + provider mappings + benchmark tiers + docs), all validated against live provider APIs and benchmarked via `eval new`:
- Gemini 3.5 Flash, Gemini 3.1 Flash-Lite (GA)
- Claude Opus 4.7, Claude Sonnet 4.6
- GPT-5.4-mini, GPT-5.4-nano

Cost/pricing bug fixes (silent-money class, with regression tests):
- registry.get_cost_estimate: longest-match-first + dot/hyphen normalize. Was first-substring-match: under-estimated gpt-5.4-pro snapshots to the cheaper gpt-5.4 price, letting budget pre-flight approve expensive jobs.
- gemini provider _calculate_cost: longest-match-first (Flash-Lite was billed at the Flash rate, ~5x overcharge).
- grok provider: register hyphenated grok-4-20 forms in mappings + pricing. A routed registry name (grok-4-20-reasoning) went unmapped -> wrong API id + ~11x cost undercharge.
- api/app.py: estimate job cost from the registry instead of a "mini in name" heuristic that mis-estimated nano/flash-lite/deep-research.
- benchmark: report this-run cost instead of the merged-history total; fix dotted grok-4.3 tier-list keys that dropped results from routing.

Model discovery tooling:
- `deepr providers models`: diff live provider model lists against the registry, scoped to newer versions of families already in use, with paste-ready ModelCapability stubs.
- discover_models.py: load .env, fix Windows cp1252 unicode crash, and canonical (dot/hyphen + date-snapshot) matching to kill false positives.
- `deepr eval` preflight warns when relevant new models are missing.

Tests: 4783 passed, coverage 81.96% (py3.13).
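The first-substring-match versus longest-match-first difference described in the commit message can be sketched as follows. This is a minimal illustration, not the code in `registry.get_cost_estimate`: the price table, the default value, the function names, and the snapshot suffix `2026-03-01` are all hypothetical.

```python
# Hypothetical per-query prices, for illustration only (not Deepr's real table).
PRICES = {
    "gpt-5.4": 0.10,
    "gpt-5.4-pro": 2.00,
    "gemini-2.5-flash": 0.05,
    "gemini-2.5-flash-lite": 0.01,
}


def normalize(name: str) -> str:
    """Lower-case the name and treat dots and hyphens as the same separator."""
    return name.lower().replace(".", "-")


def first_match_estimate(model: str, prices=PRICES, default: float = 1.0) -> float:
    """The buggy pattern: whichever key is iterated first and matches wins."""
    for key, price in prices.items():
        if key in model:
            return price
    return default


def longest_match_estimate(model: str, prices=PRICES, default: float = 1.0) -> float:
    """Try the most specific (longest) key first, comparing normalized names."""
    norm = normalize(model)
    for key in sorted(prices, key=len, reverse=True):
        if normalize(key) in norm:
            return prices[key]
    return default


if __name__ == "__main__":
    snapshot = "gpt-5.4-pro-2026-03-01"  # hypothetical snapshot name
    print(first_match_estimate(snapshot))    # matches "gpt-5.4" first
    print(longest_match_estimate(snapshot))  # matches "gpt-5.4-pro"
```

With this table, the first-match version prices the pro snapshot at the cheaper `gpt-5.4` rate because that key happens to be iterated first, while the longest-match version is independent of dictionary order and also accepts the hyphenated spelling `gpt-5-4-pro`.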
Pull request overview
This PR expands Deepr’s model registry and tooling for 2026-era provider models, and hardens cost estimation/matching logic to prevent silent mispricing (especially around snapshot/variant model names and dot-vs-hyphen differences). It also adds model-discovery UX (CLI + script improvements) and updates benchmarking/reporting to better reflect actual run cost.
Changes:
- Add new 2026 models across OpenAI/Gemini/Anthropic and update benchmark tier lists + docs.
- Fix cost-estimate/pricing matching to prefer most-specific (longest) matches and normalize dot/hyphen variants; add regression tests.
- Add/extend model discovery tooling (`deepr providers models`, `discover_models.py`) and add a benchmark preflight warning for newer provider models.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `tests/unit/test_providers/test_registry.py` | Adds regression coverage for `get_cost_estimate()` specificity + normalization + tiered pricing. |
| `tests/unit/test_providers/test_grok_provider.py` | Adds regression tests ensuring hyphenated Grok registry names map/price correctly. |
| `scripts/discover_models.py` | Adds `.env` loading, canonical name matching (date/dot/hyphen), relevance filtering, JSON shape updates, and stub emission. |
| `scripts/benchmark_models.py` | Updates tier model lists, adds best-effort “newer models available” preflight warning, and corrects reported cost to “this run” only. |
| `ROADMAP.md` | Updates roadmap notes/checklists to reflect model discovery and May 2026 status. |
| `docs/MODELS.md` | Updates model guide with new models and discovery command guidance. |
| `deepr/providers/registry.py` | Adds new model capability entries and fixes `get_cost_estimate()` matching/normalization logic. |
| `deepr/providers/grok_provider.py` | Adds hyphenated registry forms to mappings/pricing to avoid unmapped routing + mispricing. |
| `deepr/providers/gemini_provider.py` | Adds new Gemini models and fixes pricing-key matching to prefer longest key first. |
| `deepr/cli/commands/providers.py` | Adds `deepr providers models` CLI command that shells out to discovery script. |
| `deepr/api/app.py` | Replaces a name-based cost heuristic with registry-based estimation for API job submission responses. |
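The canonical name matching (date/dot/hyphen) listed for `scripts/discover_models.py` could look roughly like the sketch below. It is an assumption-laden illustration, not the script's code: the function names and the two date-suffix formats (`-YYYY-MM-DD` and `-YYYYMMDD`) are hypothetical, as are the example model ids other than those named in this PR.

```python
import re

# Assumed snapshot suffix formats: "-2026-03-17" or "-20260317" at the end of an id.
_DATE_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def canonical(model_id: str) -> str:
    """Collapse dot/hyphen variants and strip a trailing date snapshot."""
    name = model_id.lower().replace(".", "-")
    return _DATE_SUFFIX.sub("", name)


def new_models(discovered: list[str], registry: list[str]) -> list[str]:
    """Return discovered ids whose canonical form is not already registered."""
    known = {canonical(m) for m in registry}
    return [m for m in discovered if canonical(m) not in known]


if __name__ == "__main__":
    registry = ["gpt-5-4-mini", "grok-4-20-reasoning"]
    discovered = ["gpt-5.4-mini-2026-03-17", "grok-4.20-reasoning", "gpt-5.5"]
    print(new_models(discovered, registry))
```

Comparing canonical forms rather than raw ids is what removes the false positives: a dated snapshot or a dotted spelling of a model already in the registry no longer shows up as new.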
    for canon_key, dm in discovered_by_canon.items():
        rm = registry_by_canon.get(canon_key)
        if rm is not None:
            key = f"{rm.provider}/{rm.model}"
Comment on lines +781 to +786

    reg = dm.load_registry()
    discovered = dm.discover_via_api()
    if not discovered:
        return
    report = dm.compare_registry(reg, discovered)
    relevant, _ = dm.classify_new_models(report["new_models"], reg)
Comment on lines +670 to 676

    # Calculate cost estimate from the registry (source of truth). A prior
    # name heuristic ("mini" -> $0.5 else $5.0) wildly misestimated nano /
    # flash-lite (over) and deep-research (under) models.
    from deepr.providers.registry import get_cost_estimate

    avg_cost = get_cost_estimate(model)
    estimated_cost = {
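To make the mis-estimation named in that code comment concrete, here is a small sketch comparing the `"mini" -> $0.5 else $5.0` heuristic with a table lookup. The registry prices and the `example-deep-research` model name are hypothetical, and the exact-key lookup is a simplification of whatever `get_cost_estimate` actually does.

```python
def heuristic_estimate(model: str) -> float:
    """The replaced pattern, as the comment describes it: "mini" -> $0.5 else $5.0."""
    return 0.5 if "mini" in model else 5.0


# Hypothetical registry prices, for illustration only.
REGISTRY_COST = {
    "gpt-5.4-nano": 0.02,
    "gpt-5.4-mini": 0.40,
    "example-deep-research": 20.0,  # hypothetical model name
}


def registry_estimate(model: str, default: float = 5.0) -> float:
    """Simplified stand-in for a registry lookup (exact key match only)."""
    return REGISTRY_COST.get(model, default)


if __name__ == "__main__":
    for name in REGISTRY_COST:
        print(f"{name}: heuristic={heuristic_estimate(name)} "
              f"registry={registry_estimate(name)}")
```

Under these assumed prices, the heuristic over-estimates the nano model and under-estimates the deep-research one, which matches the direction of error described in the comment. A plain substring test also has a second hazard: `"mini" in "gemini-3.1-flash-lite"` is true in Python, so any Gemini name would take the "mini" branch.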
Comment on lines +769 to +775

    return (
        f' "{m.provider}/{m.model_id}": ModelCapability(\n'
        f' provider="{m.provider}",\n'
        f' model="{m.model_id}",\n'
        f" cost_per_query=0.0, # TODO: estimate per-query cost\n"
        f" latency_ms=2000, # TODO: measure\n"
        f" context_window={cw if cw else 'TODO'},\n"
Summary
Adds the latest 2026 provider models and fixes a recurring class of silent model-name cost-matching bugs found while wiring them in.
New models
Added to the registry (plus provider mappings, benchmark tiers, docs), each validated against the live provider API and benchmarked via `deepr eval new`:
- Gemini 3.5 Flash, Gemini 3.1 Flash-Lite (GA)
- Claude Opus 4.7, Claude Sonnet 4.6
- GPT-5.4-mini, GPT-5.4-nano
Cost/pricing bug fixes (with regression tests)
- `registry.get_cost_estimate`: was first-substring-match (order-dependent). A `gpt-5.4-pro-<snapshot>` resolved to the cheaper `gpt-5.4`, an under-estimate that lets budget pre-flight approve an expensive job. Now longest-match-first plus dot/hyphen normalized (mirrors `get_token_pricing`).
- Gemini provider `_calculate_cost`: `gemini-2.5-flash-lite` was billed at the `gemini-2.5-flash` rate (~5x overcharge). Fixed with longest-match-first.
- Grok provider: the registry uses hyphenated `grok-4-20-*` names, but mappings/pricing only had the dotted API form, so a routed name went unmapped (wrong API id plus ~11x undercharge). Added both forms.
- `api/app.py`: replaced a `"mini" in model` cost heuristic (mis-estimated nano/flash-lite over, deep-research under) with `get_cost_estimate()`.
- Benchmark: reports this-run cost instead of the merged-history total; fixed dotted `grok-4.3` tier-list keys that dropped results from routing.
Model discovery tooling
- `deepr providers models`: diffs live provider model lists against the registry, scoped by default to newer versions of families already in use, with paste-ready registry stubs.
- `discover_models.py`: loads `.env`, fixes a Windows cp1252 unicode crash, and uses canonical (dot/hyphen plus date-snapshot) matching to eliminate false positives.
- `deepr eval` preflight warns when relevant new models are missing.
Testing
- 4783 passed, 1 skipped, coverage 81.96% (Python 3.13, the CI command)
- `ruff check deepr/` and `ruff format --check deepr/` clean
- Regression tests: `TestCostEstimateMatching`, `TestGrokHyphenatedRegistryForms`
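For the Grok fix (registering both the dotted and hyphenated forms), one possible way to keep the two spellings in sync is to derive the hyphenated aliases from the dotted entries. This is only a sketch of the idea; the PR itself says both forms were added to the mappings and pricing, and the table contents, prices, and helper name below are hypothetical.

```python
# Hypothetical tables keyed by the dotted API form (values are illustrative).
API_IDS = {"grok-4.20-reasoning": "grok-4.20-reasoning"}
PRICING = {"grok-4.20-reasoning": (3.0, 15.0)}  # assumed (input, output) prices


def with_hyphenated_aliases(table: dict) -> dict:
    """Return a copy of the table that also answers to hyphenated names."""
    out = dict(table)
    for name, value in table.items():
        # setdefault keeps any explicit hyphenated entry that already exists.
        out.setdefault(name.replace(".", "-"), value)
    return out


API_IDS_ALL = with_hyphenated_aliases(API_IDS)
PRICING_ALL = with_hyphenated_aliases(PRICING)

if __name__ == "__main__":
    routed = "grok-4-20-reasoning"  # registry-style name from the PR description
    print(API_IDS_ALL[routed], PRICING_ALL[routed])
```

A regression test in the spirit of `TestGrokHyphenatedRegistryForms` would then assert that the routed hyphenated name resolves to the same API id and the same price as its dotted counterpart.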