feat: add 2026 provider models + fix model-name cost-matching bugs #1

Merged
blisspixel merged 1 commit into main from add-2026-models-cost-fixes on May 23, 2026
@blisspixel blisspixel commented May 23, 2026

Summary

Adds the latest 2026 provider models and fixes a recurring class of silent model-name cost-matching bugs found while wiring them in.

New models

Added to the registry (plus provider mappings, benchmark tiers, docs), each validated against the live provider API and benchmarked via deepr eval new:

  • Gemini 3.5 Flash, Gemini 3.1 Flash-Lite (GA)
  • Claude Opus 4.7, Claude Sonnet 4.6
  • GPT-5.4-mini, GPT-5.4-nano

Cost/pricing bug fixes (with regression tests)

  • registry.get_cost_estimate: was first-substring-match (order-dependent). A gpt-5.4-pro-<snapshot> resolved to the cheaper gpt-5.4, an under-estimate that lets budget pre-flight approve an expensive job. Now longest-match-first plus dot/hyphen normalized (mirrors get_token_pricing).
  • Gemini _calculate_cost: gemini-2.5-flash-lite was billed at the gemini-2.5-flash rate (~5x overcharge). Fixed with longest-match-first.
  • Grok provider: registry uses hyphenated grok-4-20-* but mappings/pricing only had the dotted API form, so a routed name went unmapped (wrong API id plus ~11x undercharge). Added both forms.
  • api/app.py: replaced a "mini" in model name cost heuristic, which over-estimated nano/flash-lite models and under-estimated deep-research models, with get_cost_estimate().
  • Benchmark: report the actual run cost instead of the merged-history total; fixed dotted grok-4.3 tier-list keys that dropped results from routing.
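
The matching fixes above can be illustrated with a minimal sketch. This is not the code from deepr/providers/registry.py or the Gemini provider: the `PRICES` table, its dollar values, the snapshot date, and the helper names `normalize`, `resolve_first_match`, and `resolve_longest_match` are all hypothetical, chosen only to show why first-substring matching depends on key order and how longest-match-first with dot/hyphen normalization avoids that.

```python
from typing import Optional

# Hypothetical per-query prices, for illustration only.
PRICES = {
    "gpt-5.4": 0.10,
    "gpt-5.4-pro": 2.00,
    "gemini-2.5-flash": 0.02,
    "gemini-2.5-flash-lite": 0.004,
}


def normalize(name: str) -> str:
    """Lowercase the name and treat dots and hyphens as equivalent."""
    return name.lower().replace(".", "-")


def resolve_first_match(model: str) -> Optional[str]:
    """Order-dependent variant: return the first key found inside the name."""
    for key in PRICES:
        if key in model:
            return key
    return None


def resolve_longest_match(model: str) -> Optional[str]:
    """Try the most specific (longest) key first, on normalized names."""
    target = normalize(model)
    for key in sorted(PRICES, key=len, reverse=True):
        if normalize(key) in target:
            return key
    return None


# A made-up snapshot name: the first-match variant picks the cheaper family.
snapshot = "gpt-5.4-pro-2026-03-01"
print(resolve_first_match(snapshot), resolve_longest_match(snapshot))
```

With this table, the first-match variant resolves the snapshot to `gpt-5.4` and `gemini-2.5-flash-lite` to `gemini-2.5-flash`, while the longest-match variant returns the more specific key in both cases and also accepts the hyphenated spelling `gpt-5-4-pro`.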

Model discovery tooling

  • deepr providers models: diffs live provider model lists against the registry, scoped by default to newer versions of families already in use, with paste-ready registry stubs.
  • discover_models.py: loads .env, fixes a Windows cp1252 unicode crash, and uses canonical (dot/hyphen plus date-snapshot) matching to eliminate false positives.
  • deepr eval preflight warns when relevant new models are missing.
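
The canonical matching used for discovery can be sketched roughly as follows. This is an assumption-laden illustration, not the logic in discover_models.py: the `canonical` and `new_models` helpers, the snapshot formats handled by the regular expression, and the `example-model-2` id are hypothetical.

```python
import re

# Trailing date snapshot such as -20260115 or -2026-01-15 (assumed formats).
_SNAPSHOT = re.compile(r"-(\d{8}|\d{4}-\d{2}-\d{2})$")


def canonical(model_id: str) -> str:
    """Lowercase, unify dots with hyphens, and drop a trailing date snapshot."""
    name = model_id.lower().replace(".", "-")
    return _SNAPSHOT.sub("", name)


def new_models(discovered: list[str], registry: list[str]) -> list[str]:
    """Return discovered ids whose canonical form is absent from the registry."""
    known = {canonical(m) for m in registry}
    return [m for m in discovered if canonical(m) not in known]


# Hypothetical inputs: only the genuinely unknown id should be reported.
live = ["grok-4.20-reasoning", "gpt-5.4-mini-2026-03-01", "example-model-2"]
registered = ["grok-4-20-reasoning", "gpt-5.4-mini"]
print(new_models(live, registered))
```

Comparing canonical forms rather than raw ids is what keeps a dotted API name or a dated snapshot of an already registered model from being reported as new.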

Testing

  • 4783 passed, 1 skipped, coverage 81.96% (Python 3.13, the CI command)
  • ruff check deepr/ and ruff format --check deepr/ clean
  • New regression tests: TestCostEstimateMatching, TestGrokHyphenatedRegistryForms

New models (registry + provider mappings + benchmark tiers + docs),
all validated against live provider APIs and benchmarked via `eval new`:
- Gemini 3.5 Flash, Gemini 3.1 Flash-Lite (GA)
- Claude Opus 4.7, Claude Sonnet 4.6
- GPT-5.4-mini, GPT-5.4-nano

Cost/pricing bug fixes (silent-money class, with regression tests):
- registry.get_cost_estimate: longest-match-first + dot/hyphen normalize.
  Was first-substring-match: under-estimated gpt-5.4-pro snapshots to the
  cheaper gpt-5.4 price, letting budget pre-flight approve expensive jobs.
- gemini provider _calculate_cost: longest-match-first (Flash-Lite was
  billed at the Flash rate, ~5x overcharge).
- grok provider: register hyphenated grok-4-20 forms in mappings + pricing.
  A routed registry name (grok-4-20-reasoning) went unmapped -> wrong API
  id + ~11x cost undercharge.
- api/app.py: estimate job cost from the registry instead of a
  "mini in name" heuristic that mis-estimated nano/flash-lite/deep-research.
- benchmark: report this-run cost instead of the merged-history total;
  fix dotted grok-4.3 tier-list keys that dropped results from routing.

Model discovery tooling:
- `deepr providers models`: diff live provider model lists against the
  registry, scoped to newer versions of families already in use, with
  paste-ready ModelCapability stubs.
- discover_models.py: load .env, fix Windows cp1252 unicode crash, and
  canonical (dot/hyphen + date-snapshot) matching to kill false positives.
- `deepr eval` preflight warns when relevant new models are missing.

Tests: 4783 passed, coverage 81.96% (py3.13).
Copilot AI review requested due to automatic review settings May 23, 2026 00:55

Copilot AI left a comment

Pull request overview

This PR expands Deepr’s model registry and tooling for 2026-era provider models, and hardens cost estimation/matching logic to prevent silent mispricing (especially around snapshot/variant model names and dot-vs-hyphen differences). It also adds model-discovery UX (CLI + script improvements) and updates benchmarking/reporting to better reflect actual run cost.

Changes:

  • Add new 2026 models across OpenAI/Gemini/Anthropic and update benchmark tier lists + docs.
  • Fix cost-estimate/pricing matching to prefer most-specific (longest) matches and normalize dot/hyphen variants; add regression tests.
  • Add/extend model discovery tooling (deepr providers models, discover_models.py) and add a benchmark preflight warning for newer provider models.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/unit/test_providers/test_registry.py | Adds regression coverage for get_cost_estimate() specificity + normalization + tiered pricing. |
| tests/unit/test_providers/test_grok_provider.py | Adds regression tests ensuring hyphenated Grok registry names map/price correctly. |
| scripts/discover_models.py | Adds .env loading, canonical name matching (date/dot/hyphen), relevance filtering, JSON shape updates, and stub emission. |
| scripts/benchmark_models.py | Updates tier model lists, adds best-effort “newer models available” preflight warning, and corrects reported cost to “this run” only. |
| ROADMAP.md | Updates roadmap notes/checklists to reflect model discovery and May 2026 status. |
| docs/MODELS.md | Updates model guide with new models and discovery command guidance. |
| deepr/providers/registry.py | Adds new model capability entries and fixes get_cost_estimate() matching/normalization logic. |
| deepr/providers/grok_provider.py | Adds hyphenated registry forms to mappings/pricing to avoid unmapped routing + mispricing. |
| deepr/providers/gemini_provider.py | Adds new Gemini models and fixes pricing-key matching to prefer longest key first. |
| deepr/cli/commands/providers.py | Adds deepr providers models CLI command that shells out to discovery script. |
| deepr/api/app.py | Replaces a name-based cost heuristic with registry-based estimation for API job submission responses. |

for canon_key, dm in discovered_by_canon.items():
    rm = registry_by_canon.get(canon_key)
    if rm is not None:
        key = f"{rm.provider}/{rm.model}"
Comment on lines +781 to +786
reg = dm.load_registry()
discovered = dm.discover_via_api()
if not discovered:
    return
report = dm.compare_registry(reg, discovered)
relevant, _ = dm.classify_new_models(report["new_models"], reg)
Comment thread deepr/api/app.py
Comment on lines +670 to 676
# Calculate cost estimate from the registry (source of truth). A prior
# name heuristic ("mini" -> $0.5 else $5.0) wildly misestimated nano /
# flash-lite (over) and deep-research (under) models.
from deepr.providers.registry import get_cost_estimate

avg_cost = get_cost_estimate(model)
estimated_cost = {
Comment on lines +769 to +775
return (
    f' "{m.provider}/{m.model_id}": ModelCapability(\n'
    f' provider="{m.provider}",\n'
    f' model="{m.model_id}",\n'
    f" cost_per_query=0.0, # TODO: estimate per-query cost\n"
    f" latency_ms=2000, # TODO: measure\n"
    f" context_window={cw if cw else 'TODO'},\n"
@blisspixel blisspixel merged commit defb8a4 into main May 23, 2026
4 checks passed
@blisspixel blisspixel deleted the add-2026-models-cost-fixes branch May 23, 2026 01:02