feat(llm): add shared model and run registry #26
Open
houtanb wants to merge 1 commit into main from llm-model-runs
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,138 @@ | ||
| # Repository Instructions | ||
|
|
||
| ## Shared LLM Registry | ||
|
|
||
| This package targets Python 3.14. Black is configured with | ||
| `target-version = ["py314"]`; do not broaden `requires-python` without first | ||
| checking that formatted code remains valid for the older target. | ||
|
|
||
| ## Local Development Setup | ||
|
|
||
| Use Python 3.14 for local development: | ||
|
|
||
| ```bash | ||
| python3.14 -m venv .venv | ||
| source .venv/bin/activate | ||
| python -m pip install --upgrade pip | ||
| python -m pip install -r requirements.txt | ||
| ``` | ||
|
|
||
| `requirements.txt` delegates to `.[dev]`; it installs this package and the dev | ||
| tools from `pyproject.toml` without editable mode. | ||
|
|
||
| When another repo needs local utils changes during development, use that repo's | ||
| virtual environment and install utils explicitly in editable mode, for example: | ||
|
|
||
| ```bash | ||
| python -m pip install -e ../utils | ||
| ``` | ||
|
|
||
| Do not add local relative paths to another repo's requirements files. Those | ||
| files should use the deployed git pin when ready to deploy. | ||
|
|
||
| The shared LLM registry has two layers: | ||
|
|
||
| - `utils.llm.model_registry.MODELS` contains canonical provider-callable base models. | ||
| - `utils.llm.model_runs.MODEL_RUNS` contains exact benchmarkable model-plus-options runs. | ||
|
|
||
| Benchmarks should choose from `MODEL_RUNS` by `model_run_key`; forecast files should store that exact key. | ||
|
|
||
| When adding a base model: | ||
|
|
||
| - Add provider/lab registry entries first only if the provider or lab is missing. | ||
| - Look up the model in Models.dev. Prefer a `ModelsDevReference` when Models.dev | ||
| has the provider/model entry. | ||
| - In Models.dev source paths, `provider_id` is the folder under `providers/`, | ||
| and `model_id` is the TOML filename stem under `models/`, for example | ||
| `providers/anthropic/models/claude-opus-4-8.toml` maps to `anthropic` / | ||
| `claude-opus-4-8`. | ||
| - The checked-in Models.dev snapshot is not a catalog; it contains only | ||
| registry-referenced models and only `id`, `name`, and `release_date`. | ||
| - Use exact Models.dev `provider_id`/`model_id` values. If a reference is wrong, | ||
| refreshing the snapshot should fail and suggest nearby Models.dev entries. | ||
| - Use `manual_release_date` when the model is missing from Models.dev, when the | ||
| Models.dev entry lacks a usable full release date, or for deliberate | ||
| historical/manual entries. | ||
| - Put the model in the provider-specific list in `utils/llm/model_registry.py` (`OPENAI_MODELS`, `TOGETHER_MODELS`, `ANTHROPIC_MODELS`, `XAI_MODELS`, or `GOOGLE_MODELS`). | ||
| - Insert the model where `(release_date, model_key)` stays ascending within its | ||
| provider-specific list. | ||
| - Use `provider_model_id` for the exact string sent to the provider API. It may differ from `model_key`, especially for routed providers like Together. | ||
| - Set `active=False` only when a provider route should remain in registry history | ||
| but should be excluded from current live-callable benchmark runs. | ||
| - Do not add duplicate `model_key`s. `MODELS = create_models_list(...)` validates uniqueness. | ||
|
|
||
| After changing `ModelsDevReference` values, refresh the Models.dev snapshot from the utils repo: | ||
| ```bash | ||
| python - <<'PY' | ||
| from scripts.refresh_models_dev_metadata import write_models_dev_snapshot | ||
|
|
||
| write_models_dev_snapshot() | ||
| PY | ||
| ``` | ||
|
|
||
| When adding a model run: | ||
|
|
||
| - Add it to `utils/llm/model_runs.py` with | ||
| `_model_run(model_run_key=..., model_key=..., options=...)`. | ||
| - Write `model_run_key` explicitly as the stable benchmark identifier. Do not | ||
| rely on implicit generation from model/options. | ||
| - Put every runtime call option in the `ModelRun` declaration; do not add hidden defaults elsewhere. | ||
| - Use exact provider option names and values as they are passed to `get_response`. | ||
| - If an option affects performance and should appear in filenames/forecast keys, add or update a naming rule in `NAME_COMPONENT_RULES`. | ||
| - If an option is intentionally name-neutral, add it to `NAME_NEUTRAL_OPTION_PATHS`. | ||
| - Unknown option paths should fail loudly rather than silently producing ambiguous model-run keys. | ||
| - `build_model_run_key(...)` is a suggested-key helper for consistency checks and | ||
| new naming rules; the declared `model_run_key` remains the durable identity. | ||
| - Do not add duplicate `model_run_key`s. `MODEL_RUNS = create_model_runs_list(...)` validates uniqueness. | ||
| - `MODEL_RUNS` is the historical registry. `ACTIVE_MODEL_RUNS` is derived from | ||
| it by dropping runs whose base `Model` has `active=False`. | ||
| - Add unit tests for new naming behavior, registry inclusion, and routed provider options when relevant. | ||
|
|
||
| ## Artificial Analysis Model Runs | ||
|
|
||
| When adding an Artificial Analysis-backed model run: | ||
|
|
||
| - Use the checked-in Artificial Analysis snapshot as the source for the stable AA model ID and displayed AA name. | ||
| - Refresh the snapshot from the AA endpoint; do not hand-edit individual AA models into the JSON file. | ||
| - The official AA API key is `API_KEY_ARTIFICIAL_ANALYSIS` in GCP Secret Manager. | ||
| - Do not hard-code an AA display name in a `ModelRun`; set `artificial_analysis_id` and let the run read the display name from the snapshot. | ||
| - Do not add an `artificial_analysis_model` flag. A non-null `artificial_analysis_id` is the marker that a run is AA-backed. | ||
| - Add or update the canonical base `Model` only if the provider-callable model is missing from `utils.llm.model_registry`. | ||
| - Add the callable model-plus-options declaration to | ||
| `ARTIFICIAL_ANALYSIS_MODEL_RUN_DECLARATIONS` in | ||
| `utils/llm/artificial_analysis_model_runs.py`. Every declaration there is | ||
| automatically included in `utils.llm.model_runs.MODEL_RUNS`; do not add the | ||
| same AA run manually to `MODEL_RUNS`. | ||
| - Use the exact provider option names that are passed at runtime. Token suffixes in model-run keys must reflect the actual token cap option used for the call. | ||
|
|
||
| Artificial Analysis token caps should be encoded in the run options this way: | ||
|
|
||
| - Non-reasoning models: use `16_384` output tokens, adjusted downward if the model has a smaller context window or a lower maximum output-token cap. | ||
| - Reasoning models: use the maximum output tokens allowed by the model creator for that reasoning configuration. | ||
| - If the correct cap is not clear from provider/model documentation or the AA metadata, stop and confirm rather than guessing. | ||
|
|
||
| After adding an AA model run: | ||
|
|
||
| - Add or update unit tests that prove the AA ID resolves from the snapshot and that `display_name` matches the AA leaderboard name. | ||
| - Add or update shared registry coverage tests for the new selectable model-run key. | ||
| - Run the focused model-run and AA metadata tests, then run the full lint/test suite before committing. | ||
|
|
||
| ## Validation | ||
|
|
||
| - Run `make lint` before committing. It runs `isort .`, `black .`, `flake8 .`, | ||
| and `pydocstyle .`. | ||
| - Run `make test` before committing code changes. Use `PYTEST_ARGS=...` for a | ||
| focused test pass while iterating. | ||
| - Run `make test-integration` or `make test-integration-parallel` only when the | ||
| relevant provider/GCP credentials are available. | ||
|
|
||
| ## Live Model-Run Smoke Tests | ||
|
|
||
| Integration tests that hit real LLM APIs require provider API keys. | ||
|
|
||
| - `tests/conftest.py` loads `.env`, then `configure_api_keys(from_gcp=True)` when pytest is run with `--integration`. | ||
| - `configure_api_keys(from_gcp=True)` reads provider keys from GCP Secret Manager using the secret names in `utils/helpers/constants.py`. | ||
| - The standard LLM secret names are `API_KEY_OPENAI`, `API_KEY_ANTHROPIC`, `API_KEY_GEMINI`, `API_KEY_XAI`, and `API_KEY_TOGETHERAI`. | ||
| - To test a specific shared model run, set `LLM_MODEL_RUN_KEYS` to one or more comma-separated `model_run_key`s and run `pytest --integration tests/integration/llm/test_model_runs.py`. | ||
| - The model-run integration test calls `model_run.get_response`, so it uses the run's declared provider route, provider model ID, and options. | ||
| - For a newly added model run, prefer running its exact smoke test before assuming the provider accepts the declared options. |
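
The registry rules in the instructions above (unique keys, ascending `(release_date, model_key)` order, and active runs derived from the base model's `active` flag) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Model` and `ModelRun` dataclasses, `check_models`, `active_runs`, and every example key are hypothetical stand-ins, and the real `create_models_list(...)` and `ACTIVE_MODEL_RUNS` may be implemented differently.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Model:
    """Hypothetical stand-in for a canonical base-model record."""

    model_key: str
    provider_model_id: str  # exact string sent to the provider API
    release_date: date
    active: bool = True


@dataclass(frozen=True)
class ModelRun:
    """Hypothetical stand-in for an exact model-plus-options run."""

    model_run_key: str
    model: Model


def check_models(models: list[Model]) -> list[Model]:
    """Reject duplicate keys and out-of-order entries (illustrative only)."""
    keys = [m.model_key for m in models]
    duplicates = sorted({k for k in keys if keys.count(k) > 1})
    if duplicates:
        raise ValueError(f"duplicate model_key(s): {duplicates}")
    order = [(m.release_date, m.model_key) for m in models]
    if order != sorted(order):
        raise ValueError("models must be ascending by (release_date, model_key)")
    return list(models)


def active_runs(runs: list[ModelRun]) -> list[ModelRun]:
    """Keep only runs whose base model is still active."""
    return [r for r in runs if r.model.active]


# Made-up entries: an inactive historical route and a current one.
older = Model("example-old", "vendor/example-old", date(2024, 1, 15), active=False)
newer = Model("example-new", "vendor/example-new", date(2025, 3, 1))
models = check_models([older, newer])
runs = [
    ModelRun("example-old-default", older),
    ModelRun("example-new-default", newer),
]
print([r.model_run_key for r in active_runs(runs)])
```

In this sketch the inactive model stays in the full list of runs, so historical forecast keys still resolve, while only `example-new-default` survives the active filter.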
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| @AGENTS.md |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,42 @@ | ||
| # Third-Party Notices | ||
|
|
||
| This repository includes normalized metadata derived from third-party sources. | ||
|
|
||
| ## Models.dev | ||
|
|
||
| The checked-in Models.dev snapshot is derived from https://models.dev/api.json | ||
| and the upstream repository https://github.com/anomalyco/models.dev | ||
|
|
||
| Models.dev is licensed under the MIT License: | ||
|
|
||
| ```text | ||
| MIT License | ||
|
|
||
| Copyright (c) 2025 models.dev | ||
|
|
||
| Permission is hereby granted, free of charge, to any person obtaining a copy | ||
| of this software and associated documentation files (the "Software"), to deal | ||
| in the Software without restriction, including without limitation the rights | ||
| to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
| copies of the Software, and to permit persons to whom the Software is | ||
| furnished to do so, subject to the following conditions: | ||
|
|
||
| The above copyright notice and this permission notice shall be included in all | ||
| copies or substantial portions of the Software. | ||
|
|
||
| THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
| IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
| FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
| AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
| LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
| OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
| SOFTWARE. | ||
| ``` | ||
|
|
||
| ## Artificial Analysis | ||
|
|
||
| The checked-in Artificial Analysis snapshot is derived from the Artificial | ||
| Analysis free API and is minimized to the stable model IDs and display names | ||
| used by this package. | ||
|
|
||
| Attribution: Artificial Analysis, https://artificialanalysis.ai/. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,15 +1 @@ | ||
| google-genai==1.73.1 | ||
| anthropic==0.97.0 | ||
| together==2.11.0 | ||
| openai==2.33.0 | ||
| google-cloud-secret-manager>=2.20.0 | ||
| google-cloud-storage>=2.14.0 | ||
| python-dotenv>=1.0.0 | ||
| isort | ||
| black | ||
| flake8 | ||
| flake8-bugbear | ||
| pydocstyle | ||
| pytest | ||
| pytest-cov | ||
| pytest-xdist | ||
| .[dev] |
Requiring Python 3.14 may force every downstream consumer (forecastbench-sim, or anything else that imports utils) onto 3.14, which seems worth a line in the PR description. If the registry doesn't need 3.14 features, a lower floor would leave utils consumers free to choose their own Python version.