Commit 38ec8d9

feat: cut over metadata history to snapshot store (#158)
* feat: add llmfit sidecar enrichment for hf metadata
* feat: cut over metadata history to snapshot store
* fix: satisfy dialyzer for snapshot tasks
* fix: clarify initial history seed failures
1 parent 19af028 commit 38ec8d9

51 files changed: +220234 -18599 lines changed

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -26,7 +26,7 @@ Brief description of changes.
2626
- [ ] All new and existing tests pass
2727
- [ ] My commits follow conventional commit format
2828
- [ ] I have **NOT** edited `CHANGELOG.md` (it is auto-generated by git_ops)
29-
- [ ] I have **NOT** directly edited files in `priv/llm_db/providers/` (they are generated by `mix llm_db.build`)
29+
- [ ] I have **NOT** directly edited generated snapshot/history artifacts unless this PR intentionally regenerates them (`priv/llm_db/snapshot.json`, `priv/llm_db/history/**`)
3030

3131
## Related Issues
3232

.github/workflows/README.md

Lines changed: 24 additions & 43 deletions
@@ -24,43 +24,38 @@ Runs on every push and pull request to ensure code quality.
2424
- OTP versions: 25, 26
2525
- Excludes: Elixir 1.14 with OTP 26 (compatibility)
2626

27-
### 2. Refresh Upstream Metadata (`build-metadata.yml`)
27+
### 2. Publish Snapshot Catalog (`build-metadata.yml`)
2828

29-
Automatically pulls latest LLM model metadata from upstream sources and creates PRs for review.
29+
Automatically pulls latest upstream metadata, publishes a content-addressed snapshot,
30+
and rebuilds the published history bundle from the snapshot store.
3031

3132
**Triggers:**
3233
- **Schedule**: Every Monday at 00:00 UTC
3334
- **Manual**: Via workflow_dispatch in GitHub Actions UI
3435

3536
**Jobs:**
3637
1. Pull latest metadata using `mix llm_db.pull`
37-
2. Regenerate metadata using `mix llm_db.build`
38-
3. Reset `metadata-update` from `origin/main`
39-
4. If non-history metadata changes are detected:
40-
- Commit metadata artifacts first
41-
- Run `mix llm_db.history.sync --to HEAD`
42-
- Commit history artifacts second
43-
- Create or update the PR with `gh pr create` / `gh pr edit`
38+
2. Publish the current canonical snapshot using `mix llm_db.snapshot.publish`
39+
3. Rebuild and publish `history.tar.gz` from the published snapshot chain using `mix llm_db.history.rebuild --publish`
40+
4. Validate the packaged snapshot with `mix llm_db.build --check --install`
41+
5. Run the test suite against the resulting packaged snapshot
4442

4543
**Output:**
46-
- Pull request with metadata changes for human review
47-
- Summary includes provider/model statistics and changed files
48-
- Metadata update PRs now contain two commits
49-
- Metadata update PRs must be merged with a merge commit, not squash-merged or rebase-merged
44+
- Updated GitHub Releases snapshot assets
45+
- Updated `catalog-index` assets: `latest.json`, `snapshot-index.json`, `history.tar.gz`, and `history-meta.json`
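
The job sequence listed above can be sketched as a small dry-run script. This is an illustrative sketch, not the workflow's actual step definitions: the `run` helper, the `publish_snapshot_catalog` function, and the `DRY_RUN` variable are hypothetical names, and only the `mix` commands themselves come from the job list.

```shell
#!/bin/sh
set -eu

# DRY_RUN=1 (the default in this sketch) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Mirrors the five jobs in the order the README lists them.
publish_snapshot_catalog() {
  run mix llm_db.pull
  run mix llm_db.snapshot.publish
  run mix llm_db.history.rebuild --publish
  run mix llm_db.build --check --install
  run mix test
}

publish_snapshot_catalog
```

Running with `DRY_RUN=0` would execute the commands for real, which presumably needs a checkout with dependencies installed and a `GH_TOKEN` that is allowed to write releases.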
5046

51-
### 3. Publish Release (`publish-release.yml`)
47+
### 3. Publish Release (`release.yml`)
5248

53-
Automatically publishes new Hex.pm releases when metadata updates are merged.
49+
Automatically publishes new Hex.pm releases from the latest published snapshot.
5450

5551
**Triggers:**
5652
- Push to `main` branch
57-
- Only when `priv/llm_db/snapshot.json` changes
58-
- Only from metadata update merges
53+
- Release workflow fetches the latest published snapshot and packages it into `priv/llm_db/snapshot.json`
5954

6055
**Jobs:**
61-
1. Verify trigger is from metadata update merge
56+
1. Fetch the latest published snapshot into `priv/llm_db/snapshot.json`
6257
2. Prepare release using `mix llm_db.release prepare`
63-
- Determines version from snapshot timestamp (YYYY.MM.DD format)
58+
- Determines version from the packaged snapshot timestamp (YYYY.MM.DD format)
6459
- Updates `mix.exs` version
6560
3. Run tests to ensure quality
6661
4. Build Hex package
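
The version rule in step 2 (a YYYY.MM.DD version taken from the packaged snapshot timestamp) can be illustrated with a short shell sketch. It assumes the timestamp is an ISO 8601 string; `version_from_timestamp` is a hypothetical helper, not part of `mix llm_db.release prepare`, and whether the real task keeps zero padding in the month and day is not shown in this diff.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: turn an ISO 8601 timestamp into a YYYY.MM.DD string.
version_from_timestamp() {
  ts="$1"
  date_part="${ts%%T*}"   # drop the time component, if any
  case "$date_part" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "unexpected timestamp: $ts" >&2; return 1 ;;
  esac
  # Replace the date separators with dots.
  printf '%s\n' "$date_part" | sed 's/-/./g'
}

version_from_timestamp "2025-11-14T06:30:00Z"   # prints 2025.11.14
```

Rejecting anything that does not look like a date keeps a malformed timestamp from silently becoming a package version.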
@@ -123,40 +118,26 @@ Cron examples:
123118
- `'0 0 * * 0'` - Weekly on Sunday
124119
- `'0 0 1 * *'` - Monthly on the 1st
125120
126-
#### PR Reviewers
127-
128-
To auto-assign reviewers to metadata update PRs, modify the PR creation step in `build-metadata.yml`:
129-
130-
```bash
131-
gh pr create \
132-
--base main \
133-
--head "metadata-update" \
134-
--title "Update model metadata - $(date +%Y-%m-%d)" \
135-
--body-file "$SUMMARY_FILE" \
136-
--label "metadata-update" \
137-
--label "automated" \
138-
--reviewer "username1,username2" # Add this line
139-
```
140-
141121
## Manual Operations
142122
143-
### Manually Trigger Metadata Update
123+
### Manually Trigger Snapshot Publish
144124
145125
1. Go to Actions tab in GitHub
146-
2. Select "Refresh Upstream Metadata" workflow
126+
2. Select "Publish Snapshot Catalog" workflow
147127
3. Click "Run workflow"
148128
4. Select branch (usually `main`)
149129
5. Click "Run workflow"
150130

151131
### Manually Create a Release
152132

153-
Releases are automatically triggered when metadata updates merge to main. To manually release:
133+
Releases package the latest published snapshot. To manually release:
154134

155-
1. Ensure snapshot is updated: `mix llm_db.pull`
156-
2. Prepare release: `mix llm_db.release prepare`
157-
3. Review version in `mix.exs`
158-
4. Commit and push to main
159-
5. Workflow will detect snapshot change and publish
135+
1. Ensure the latest snapshot has been published: `mix llm_db.snapshot.publish`
136+
2. Rebuild the published history bundle if needed: `mix llm_db.history.rebuild --publish`
137+
3. Prepare release: `mix llm_db.release prepare`
138+
4. Review version in `mix.exs`
139+
5. Commit and push to main
140+
6. Workflow will fetch the latest published snapshot and publish the package
160141

161142
## Workflow Scripts
162143

@@ -166,8 +147,8 @@ Helper scripts in `.github/workflows/scripts/`:
166147

167148
Generates PR description for metadata updates with:
168149
- Provider and model counts
150+
- Snapshot publication details
169151
- Generated timestamp
170-
- File diff statistics
171152
- Review checklist
172153

173154
### `generate_release_notes.sh`
Lines changed: 16 additions & 50 deletions
@@ -1,4 +1,4 @@
1-
name: Refresh Upstream Metadata
1+
name: Publish Snapshot Catalog
22

33
on:
44
schedule:
@@ -9,13 +9,15 @@ permissions:
99
contents: write
1010

1111
jobs:
12-
update-metadata:
13-
name: Pull and update LLM model metadata
12+
publish-snapshot:
13+
name: Pull upstream metadata and publish snapshot
1414
runs-on: ubuntu-latest
1515
timeout-minutes: 30
1616
concurrency:
1717
group: llm-metadata
1818
cancel-in-progress: false
19+
env:
20+
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
1921

2022
steps:
2123
- name: Checkout main
@@ -64,62 +66,26 @@ jobs:
6466
set -euo pipefail
6567
echo "Pulling latest LLM model metadata..."
6668
mix llm_db.pull
67-
mix llm_db.build
69+
mix llm_db.snapshot.publish
6870
69-
- name: Commit metadata artifacts
70-
id: metadata_commit
71+
- name: Install packaged snapshot for validation
7172
run: |
7273
set -euo pipefail
74+
mix llm_db.build --install
7375
74-
git add -A -- priv/llm_db lib/llm_db/generated
75-
git restore --staged -- priv/llm_db/history
76-
77-
if git diff --cached --quiet; then
78-
echo "created=false" >> "$GITHUB_OUTPUT"
79-
echo "No non-history metadata changes detected."
80-
exit 0
81-
fi
82-
83-
git commit -m "chore: refresh model metadata"
84-
echo "created=true" >> "$GITHUB_OUTPUT"
85-
86-
- name: Sync history to committed metadata
87-
run: |
88-
set -euo pipefail
89-
mix llm_db.history.sync --to HEAD
90-
91-
- name: Commit synced history
92-
id: history_commit
76+
- name: Rebuild and publish history bundle
9377
run: |
9478
set -euo pipefail
79+
mix llm_db.history.rebuild --publish
9580
96-
git add -A -- priv/llm_db/history
97-
98-
if git diff --cached --quiet; then
99-
echo "created=false" >> "$GITHUB_OUTPUT"
100-
echo "No history changes detected."
101-
exit 0
102-
fi
103-
104-
git commit -m "chore: sync model history"
105-
echo "created=true" >> "$GITHUB_OUTPUT"
106-
107-
- name: Validate refreshed catalog
108-
if: steps.metadata_commit.outputs.created == 'true' || steps.history_commit.outputs.created == 'true'
81+
- name: Validate published catalog
10982
run: |
11083
set -euo pipefail
111-
mix llm_db.build --check
112-
mix llm_db.history.check
84+
mix llm_db.build --check --install
85+
mix llm_db.history.check --allow-missing
11386
mix test
11487
115-
- name: Push refreshed catalog to main
116-
if: steps.metadata_commit.outputs.created == 'true' || steps.history_commit.outputs.created == 'true'
117-
run: |
118-
set -euo pipefail
119-
git push origin HEAD:main
120-
121-
- name: Report no changes
122-
if: steps.metadata_commit.outputs.created != 'true' && steps.history_commit.outputs.created != 'true'
88+
- name: Report success
12389
run: |
124-
echo "No metadata or history changes detected."
125-
echo "### No Changes Detected" >> "$GITHUB_STEP_SUMMARY"
90+
echo "### Snapshot Publish Complete" >> "$GITHUB_STEP_SUMMARY"
91+
echo "- Snapshot release and catalog index updated" >> "$GITHUB_STEP_SUMMARY"
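
The workflow README describes the published snapshot as content-addressed. As a rough illustration of that idea only, the sketch below names a file by the SHA-256 of its bytes. The hash choice, the `snapshot_id` helper, and the sample JSON are assumptions; this diff does not show how `mix llm_db.snapshot.publish` actually derives snapshot identifiers. It also assumes `sha256sum` from GNU coreutils is available.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: identify a snapshot file by the SHA-256 of its contents.
snapshot_id() {
  sha256sum "$1" | cut -d ' ' -f 1
}

tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT

# Two files with identical bytes and one with different bytes.
printf '{"providers":[]}' > "$tmp/a.json"
printf '{"providers":[]}' > "$tmp/b.json"
printf '{"providers":["x"]}' > "$tmp/c.json"

echo "a: $(snapshot_id "$tmp/a.json")"
echo "b: $(snapshot_id "$tmp/b.json")"
echo "c: $(snapshot_id "$tmp/c.json")"
```

Identical bytes yield the same identifier and any change yields a different one, which is what lets a history bundle be rebuilt from an immutable chain of snapshots.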

.github/workflows/ci.yml

Lines changed: 2 additions & 2 deletions
@@ -95,6 +95,6 @@ jobs:
9595
otp-version: "27"
9696
elixir-version: "1.18"
9797
- run: mix deps.get
98-
- run: mix llm_db.build --check
98+
- run: mix llm_db.build --check --install
9999
- name: Check history drift
100-
run: mix llm_db.history.check
100+
run: mix llm_db.history.check --allow-missing

.github/workflows/release.yml

Lines changed: 6 additions & 0 deletions
@@ -66,6 +66,11 @@ jobs:
6666
mix deps.get
6767
MIX_ENV=dev mix deps.compile git_ops
6868
69+
- name: Fetch latest published snapshot
70+
run: |
71+
set -euo pipefail
72+
mix llm_db.snapshot.fetch --ref latest --install
73+
6974
- name: Run tests
7075
if: ${{ inputs.skip_tests != true }}
7176
run: |
@@ -125,6 +130,7 @@ jobs:
125130
fi
126131
echo "- Package: llm_db" >> "$GITHUB_STEP_SUMMARY"
127132
echo "- Version: ${{ steps.version.outputs.version }}" >> "$GITHUB_STEP_SUMMARY"
133+
echo "- Snapshot: priv/llm_db/snapshot.json" >> "$GITHUB_STEP_SUMMARY"
128134
129135
- name: Create GitHub release
130136
if: ${{ inputs.dry_run != true }}

AGENTS.md

Lines changed: 9 additions & 5 deletions
@@ -10,10 +10,14 @@
1010
- **Format code**: `mix format`
1111
- **Compile**: `mix compile`
1212
- **Quality check**: `mix quality` (format, compile warnings, dialyzer, credo)
13-
- **Update model data**: `mix llm_db.pull` (fetches from configured remote sources and regenerates snapshot)
14-
- **Backfill model history (one-time)**: `mix llm_db.history.backfill --force` (generates `priv/llm_db/history/events/*.ndjson` from git snapshot history)
15-
- **Sync model history (incremental)**: `mix llm_db.history.sync`
16-
- **Check history drift**: `mix llm_db.history.check`
13+
- **Build local snapshot artifact**: `mix llm_db.build --install`
14+
- **Pull upstream metadata**: `mix llm_db.pull`
15+
- **Publish snapshot to GitHub Releases**: `mix llm_db.snapshot.publish`
16+
- **Fetch published snapshot locally**: `mix llm_db.snapshot.fetch --ref latest --install`
17+
- **Migrate reachable Git history (one-time)**: `mix llm_db.history.migrate_git --force`
18+
- **Rebuild/publish history bundle**: `mix llm_db.history.rebuild --publish`
19+
- **Sync published history bundle**: `mix llm_db.history.sync`
20+
- **Check published history drift**: `mix llm_db.history.check --allow-missing`
1721
- **Dependencies**: `mix deps.get`
1822
- **Release**: `mix llm_db.version && mix git_ops.release && git push && git push --tags` (bumps to date-based version, updates CHANGELOG, tags, and pushes)
1923

@@ -61,7 +65,7 @@ config :llm_db,
6165
- **Type**: Elixir library providing fast, persistent_term-backed LLM model metadata catalog
6266
- **Core modules**: `LLMDB` (main API), `LLMDB.Engine` (ETL pipeline), `LLMDB.Store` (persistent_term storage)
6367
- **Data structures**: `LLMDB.Provider`, `LLMDB.Model` with Zoi validation schemas in `lib/llm_db/schema/`
64-
- **Storage**: O(1) lock-free queries via `:persistent_term`, snapshot in `priv/llm_db/snapshot.json`
68+
- **Storage**: O(1) lock-free queries via `:persistent_term`, packaged catalog snapshot in `priv/llm_db/snapshot.json`
6569
- **ETL Pipeline**: Ingest → Normalize → Validate → Merge → Enrich → Filter → Index + Publish (7 stages)
6670
- **Startup**: Catalog automatically loads on application start via `LLMDB.Application` (no manual `load()` needed in IEx or runtime)
6771

README.md

Lines changed: 13 additions & 9 deletions
@@ -253,23 +253,27 @@ Snapshot is shipped with the library. To rebuild with fresh data:
253253
# Fetch upstream data (optional)
254254
mix llm_db.pull
255255

256-
# Run ETL and write snapshot.json
256+
# Build canonical snapshot artifacts
257257
mix llm_db.build
258+
259+
# Install the packaged snapshot for local runtime/package validation
260+
mix llm_db.build --install
258261
```
259262

260-
To generate historical change events from committed snapshot history (initial setup):
263+
To migrate legacy Git-tracked metadata history into the snapshot store (one-time maintainer task):
261264

262265
```bash
263-
mix llm_db.history.backfill --force
266+
mix llm_db.history.migrate_git
264267
```
265268

266-
This writes append-only NDJSON history artifacts under `priv/llm_db/history/`:
267-
`priv/llm_db/history/events/YYYY.ndjson`,
268-
`priv/llm_db/history/snapshots.ndjson`, and `priv/llm_db/history/meta.json`.
269+
This writes snapshot-based history artifacts under `priv/llm_db/history/` and
270+
materializes immutable historical snapshots under `_build/llm_db/snapshot_store/snapshots/`.
269271

270-
For daily/incremental maintenance:
272+
For daily publication and local history maintenance:
271273

272274
```bash
275+
mix llm_db.snapshot.publish
276+
mix llm_db.history.rebuild --publish
273277
mix llm_db.history.sync
274278
mix llm_db.history.check
275279
```
@@ -286,8 +290,8 @@ add optional lineage overrides in `priv/llm_db/history/lineage_overrides.json`:
286290
}
287291
```
288292

289-
History artifacts are intended for Git/path dependencies and local repo usage.
290-
Hex packages do not guarantee inclusion of `priv/llm_db/history/**`.
293+
History artifacts remain optional local/published data.
294+
Hex packages still only ship `priv/llm_db/snapshot.json`.
291295

292296
See the [Sources & Engine](guides/sources-and-engine.md) guide for details.
293297

config/config.exs

Lines changed: 3 additions & 1 deletion
@@ -17,12 +17,14 @@ config :llm_db,
1717
# Cache directory for remote sources
1818
models_dev_cache_dir: "priv/llm_db/upstream",
1919
openrouter_cache_dir: "priv/llm_db/upstream",
20+
llmfit_cache_dir: "priv/llm_db/upstream",
2021
upstream_cache_dir: "priv/llm_db/upstream",
2122
openai_cache_dir: "priv/llm_db/remote",
2223
anthropic_cache_dir: "priv/llm_db/remote",
2324
google_cache_dir: "priv/llm_db/remote",
2425
xai_cache_dir: "priv/llm_db/remote",
2526
zenmux_cache_dir: "priv/llm_db/remote",
27+
llmfit_enrichment: true,
2628
azure_foundry_cache_dir: "priv/llm_db/remote"
2729

2830
if Mix.env() == :dev do
@@ -46,7 +48,7 @@ if Mix.env() == :dev do
4648
pre_commit: [
4749
tasks: [
4850
{:mix_task, :format, ["--check-formatted"]},
49-
{:cmd, "mix llm_db.build --check"}
51+
{:cmd, "mix llm_db.build --check --install"}
5052
]
5153
],
5254
pre_push: [

config/test.exs

Lines changed: 4 additions & 0 deletions
@@ -4,7 +4,11 @@ import Config
44
config :llm_db,
55
compile_embed: false,
66
integrity_policy: :warn,
7+
skip_packaged_load: true,
8+
snapshot_path: "priv/llm_db/snapshot.json",
79
# Use test-specific cache directory to avoid polluting production cache
810
models_dev_cache_dir: "tmp/test/upstream",
911
openrouter_cache_dir: "tmp/test/upstream",
12+
llmfit_cache_dir: "tmp/test/upstream",
13+
llmfit_enrichment: true,
1014
upstream_cache_dir: "tmp/test/upstream"
