-
Notifications
You must be signed in to change notification settings - Fork 261
Add Nemotron Custom Policy Generator skill #237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
svenchilton
wants to merge
1
commit into
main
Choose a base branch
from
schilton-publish-nemotron-custom-policy-generator-skill
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| # BENCHMARK.md — nemotron-policy-generator | ||
|
|
||
| <!-- | ||
| SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| SPDX-License-Identifier: CC-BY-4.0 | ||
| --> | ||
|
|
||
| Evaluation report for `nemotron-policy-generator` v1.0.0. Per the NVIDIA Skills Publishing Onboarding Guide, every published skill ships a BENCHMARK.md showing the with-skill vs without-skill comparison so the uplift is visible. | ||
|
|
||
| > **Status: placeholder.** Tables below are scaffolding. Populate via `nv-aces run … --output BENCHMARK.md` (see `evals/EVAL.md`) or fill in manually after team-owned evaluation. Either way, this file lands in the skill directory and surfaces in the catalog. | ||
|
|
||
| ## Harness configuration | ||
|
|
||
| | Field | Value | | ||
| |---|---| | ||
| | Skill version | 1.0.0 | | ||
| | Eval dataset | `evals/evals.json` (8 cases: 4 positive, 4 negative) | | ||
| | Agent harness A | Claude Code — `claude-sonnet-4-5` (TBD) | | ||
| | Agent harness B | Codex — `gpt-5` (TBD) | | ||
| | Evaluator framework | NV-ACES (Skill Execution, Skill Efficiency, Accuracy, Goal Accuracy, Behavior Check) | | ||
| | Date evaluated | _TBD_ | | ||
| | Evaluator(s) | Shyamala Prayaga \<sprayaga@nvidia.com\> | | ||
|
|
||
| ## Results — Claude Code (sonnet) | ||
|
|
||
| ### With skill installed | ||
|
|
||
| | Case ID | Skill Execution | Skill Efficiency | Accuracy | Goal Accuracy | Behavior Check | Tokens | Wall-clock (s) | | ||
| |---|---|---|---|---|---|---|---| | ||
| | pos-001-rough-keywords | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | | ||
| | pos-002-multimodal-byo | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | | ||
| | pos-003-extend-existing | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | | ||
| | pos-004-eval-rubric | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | | ||
| | neg-001-eval-existing | n/a (silent expected) | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | | ||
| | neg-002-legal-advice | n/a (silent expected) | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | | ||
| | neg-003-benchmark-test | n/a (silent expected) | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | | ||
| | neg-004-unrelated-llm | n/a (silent expected) | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | | ||
| | **Aggregate** | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | | ||
|
|
||
| ### Without skill (baseline) | ||
|
|
||
| | Case ID | Accuracy | Goal Accuracy | Behavior Check | Tokens | Wall-clock (s) | | ||
| |---|---|---|---|---|---| | ||
| | pos-001-rough-keywords | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | | ||
| | pos-002-multimodal-byo | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | | ||
| | pos-003-extend-existing | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | | ||
| | pos-004-eval-rubric | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | | ||
| | **Aggregate** | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ | | ||
|
|
||
| ### Uplift (with-skill minus without-skill) | ||
|
|
||
| | Metric | Δ | | ||
| |---|---| | ||
| | Accuracy | _TBD_ | | ||
| | Goal Accuracy | _TBD_ | | ||
| | Behavior Check | _TBD_ | | ||
| | Tokens (lower is better) | _TBD_ | | ||
| | Wall-clock (lower is better) | _TBD_ | | ||
|
|
||
| ## Results — Codex (gpt-5) | ||
|
|
||
| Same table shape as above, populated when Codex eval run completes. | ||
|
|
||
| ## Five-dimension rollup (NV-BASE) | ||
|
|
||
| | Dimension | Score | Notes | | ||
| |---|---|---| | ||
| | Security | _TBD_ | NV-BASE scan result; secrets, PII, prompt injection, spec compliance | | ||
| | Correctness | _TBD_ | accuracy + goal_accuracy weighted | | ||
| | Discoverability | _TBD_ | trigger precision on negative cases; trigger recall on positive cases | | ||
| | Effectiveness | _TBD_ | uplift vs without-skill baseline | | ||
| | Efficiency | _TBD_ | token + wall-clock vs baseline | | ||
|
|
||
| ## Known limitations and follow-ups | ||
|
|
||
| - Today's dataset focuses on the four most common policy-generation shapes. Multi-skill composition (this skill + a downstream eval skill) is out of scope until JFP's stack-level evaluation framework lands. | ||
| - BENCHMARK.md does not detect SKILL.md drift against the Nemotron Content Safety V2 taxonomy. Per the publishing guide, drift is team-owned: refresh this skill (and re-run evals) on every Nemotron-Content-Safety / Nemotron-3-Content-Safety model release. | ||
| - Trigger evals on `neg-004-unrelated-llm` should be re-run whenever a new Nemotron-adjacent skill ships to the catalog, in case the distractor distribution shifts. |
126 changes: 126 additions & 0 deletions
126
skills/nemotron-custom-policy-generator/PUBLISHING_AUDIT.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,126 @@ | ||
| # Publishing Audit — nemotron-policy-generator | ||
|
|
||
| Audited against [NVIDIA Agent Skills — Publishing Onboarding Guide](https://docs.google.com/document/d/1SNFRQCv0_p3DC2a_IWIf0cB3c49tSE1d/edit) on 2026-05-27. Target: external publication on [github.com/nvidia/skills](https://github.com/nvidia/skills). | ||
|
|
||
| ## Executive verdict | ||
|
|
||
| The skill is **structurally close to publish-ready**. Naming, length, frontmatter quality, and the "when NOT to trigger" boundary all already comply. The gaps are mechanical and now scaffolded in this folder: a license header inside `SKILL.md`, an `evals/` directory, a `BENCHMARK.md`, and a `SKILLCARD.yaml` template. The remaining work is process — OSRB sign-off, moving the skill into a real NVIDIA-owned GitHub repo at `skills/<name>/` at the repo root, and running `nv-base validate --external` plus a full `nv-aces` eval pass before the `components.d/<slug>.yml` PR. | ||
|
|
||
| ## What already complies | ||
|
|
||
| | Guide requirement | Status | Evidence | | ||
| |---|---|---| | ||
| | Skill name is lowercase, hyphenated, CLI-safe, ≤64 chars, product-scoped | Pass | `nemotron-policy-generator` is product-prefixed and under length | | ||
| | Directory name matches `name:` field in frontmatter | Pass | Both equal `nemotron-policy-generator` | | ||
| | `SKILL.md` under 500 lines | Pass | ~262 lines after license header insertion | | ||
| | `description` frontmatter is specific, not categorical | Pass | Names exact models (NCS-Reasoning-4B, Nemotron-3), exact deployment patterns, exact output formats | | ||
| | `description` includes trigger keywords a user would actually say | Pass | "BYO-policy", "custom safety taxonomy", "eval rubric", "labeling rubric", "guardrail config", "moderation policy" | | ||
| | `description` states when NOT to trigger (implicitly via scope) | Pass | "Do not activate this skill when" block names the three failure modes (evaluate / test / legal-advice) | | ||
| | No `alwaysApply` or `globs` in frontmatter | Pass | Neither key present | | ||
| | References live under `references/` | Pass | `aegis_taxonomy.md`, `policy_patterns.md` | | ||
| | Assets live under `assets/` | Pass | Templates + HTML GUI present | | ||
| | License chosen from allowed set (Apache 2.0 / CC-BY 4.0 / dual) | Pass | Frontmatter declares `Apache-2.0 AND CC-BY-4.0` — appropriate dual license given the skill mixes prose (CC-BY) and the HTML GUI / templates (Apache) | | ||
| | Audience is "use the product" (not contributor tooling) | Pass | The skill helps customers use Nemotron content-safety models, not internal CI/code-style | | ||
|
|
||
| ## What was missing (now fixed in this folder) | ||
|
|
||
| ### 1. License header below frontmatter and H1 — FIXED | ||
|
|
||
| The publishing guide says explicitly: "Copyright / license headers go below the YAML frontmatter and the initial H1 heading, not above. Headers above the frontmatter interfere with agent parsing." The previous `SKILL.md` had no copyright block. An HTML-comment SPDX header has been inserted directly after the `# Nemotron Custom Policy Generator` H1, before `## When to Use This Skill`. | ||
|
|
||
| ### 2. `evals/evals.json` — FIXED (scaffolded) | ||
|
|
||
| The guide requires every published skill to ship an `evals/` directory next to `SKILL.md` containing `evals.json`. The dataset shipped here has 8 cases (4 positive, 4 negative) covering: | ||
|
|
||
| - Rough-keywords-only input → clean V2 map | ||
| - Multimodal + multilingual BYO with custom categories → Nemotron-3 emit block | ||
| - Extending an existing policy → version bump + diff summary | ||
| - Labeling-rubric primary use case → binary severity branch | ||
| - Distractors for the three "Do not activate" failure modes plus one wholly unrelated LLM task | ||
|
|
||
| Each positive case has `expected_skill`, `expected_script`, `ground_truth`, and an ordered `expected_behavior` list that NV-ACES can score per the guide's evaluator spec. Negative cases set `expected_skill: null`. | ||
|
|
||
| ### 3. `evals/EVAL.md` — FIXED (scaffolded) | ||
|
|
||
| Developer-facing how-to for running the eval set on both Claude Code and Codex harnesses, with the acceptance bar that gates publication (trigger precision = 1.0 on negative cases is the hard rule). | ||
|
|
||
| ### 4. `BENCHMARK.md` — FIXED (scaffolded) | ||
|
|
||
| Per the guide: "Every published skill ships a BENCHMARK.md summarizing how it was evaluated… with BOTH with-skill and without-skill tables so the uplift is visible." The scaffolded report has the harness configuration block, both with-skill and without-skill result tables, a five-dimension NV-BASE rollup, and a known-limitations block. Cells are `_TBD_` and populate either via `nv-aces run … --output BENCHMARK.md` or manually after evaluation. | ||
|
|
||
| ### 5. `SKILLCARD.yaml` — FIXED (scaffolded) | ||
|
|
||
| The NVCARPS pipeline auto-generates the bulk of this from the skill repo + pipeline outputs (identity, provenance, scan results, evaluation metrics). The template here pre-fills the team-owned fields (identity, ownership, license, compatibility, intended use, behavioral boundaries) and marks auto-populated fields with `_auto_` placeholders so the pipeline can fill them in. Per Michael Boone's NVCARPS pipeline note in the guide, this lands May 18 IST 2026 in NVCARPS CI. | ||
|
|
||
| ## What still needs to happen (process work, not file changes) | ||
|
|
||
| These are upstream of the publication pipeline and the audit can't fix them — they require human / NVIDIA-process action: | ||
|
|
||
| ### 1. Move to a real GitHub repo at `skills/<name>/` | ||
|
|
||
| Today the skill lives under `.claude/skills/nemotron-policy-generator/` in this Cowork session. The guide is explicit: the canonical release-facing path is `<repo>/skills/<skill-name>/SKILL.md` at the repo root. `nv-base validate --external` (the catalog publish gate) enforces this. Path migration is required before submitting `components.d/<slug>.yml`. | ||
|
|
||
| If you want to keep `.claude/skills/` for local Cowork use too, the guide permits adding it as a symlink alias to the canonical `skills/` path — but the canonical path is the source of truth. | ||
|
|
||
| ### 2. OSRB IP-review clearance | ||
|
|
||
| Per Step 3 of the guide, the IP Review Process must complete with all six answers affirmative before the skill is eligible for publication. Since this introduces a new external skill repo, OSRB will need to clear: (a) the new repo, (b) the dual Apache-2.0 + CC-BY-4.0 license selection, and (c) any third-party dependencies (the HTML GUI in `assets/nemotron_policy_generator.html` — confirm it has no bundled third-party JS without OSRB clearance). | ||
|
|
||
| OSRB is the longest lead-time item in publication. Start now if you haven't. | ||
|
|
||
| ### 3. NV-BASE local scan | ||
|
|
||
| Run `nv-base validate --external` locally before pushing. The publishing guide flags this twice: "Skills become public the moment you push to your product repo — before the pre-catalog scan fires at sync time. Run NV-BASE locally first to catch issues before they land in your public git history." | ||
|
|
||
| ### 4. NVCARPS CI run + signing | ||
|
|
||
| Comment `/nvskills-ci` on the eventual PR to trigger the NVCARPS pipeline. The pipeline produces `skill.oms.sig`, fills in the `_auto_` fields of `SKILLCARD.yaml`, and produces the BENCHMARK.md from your evals dataset. Verify the signature commit `Attach NVSkills validation signatures` lands before merge. | ||
|
|
||
| ### 5. `components.d/<slug>.yml` PR to NVIDIA/skills | ||
|
|
||
| One file, ~6 lines: | ||
|
|
||
| ```yaml | ||
| - name: Nemotron Policy Generator | ||
| repo: NVIDIA/<your-repo> | ||
| description: >- | ||
| Generates BYO content-safety policies for NVIDIA Nemotron content-safety | ||
| guardrails (Reasoning-4B today; Nemotron-3 at Computex 2026). | ||
| skills: | ||
| - path: /skills/ | ||
| catalog_dir: nemotron-policy-generator | ||
| ``` | ||
|
|
||
| Use `git commit -s` (DCO sign-off is enforced). | ||
|
|
||
| ## Quibbles and small wins | ||
|
|
||
| These aren't blockers, just polish: | ||
|
|
||
| ### Frontmatter is richer than the spec minimum | ||
|
|
||
| The current frontmatter includes `title`, `license`, `compatibility`, and a full `metadata` block (`author`, `team`, `tags`, `languages`, `frameworks`, `domain`). The publishing guide's minimum required frontmatter is just `name` + `description`. The extras don't violate the agentskills.io spec, but if `nv-base validate --external` flags any of them as unrecognized, drop them into the `metadata:` block (which the spec treats as opaque) rather than top-level keys. The `metadata:` block here already nests author/team/tags/languages/frameworks/domain correctly — but `title`, `license`, and `compatibility` are at the top level and would be safer nested inside `metadata:`. | ||
|
|
||
| ### "Aegis" naming carry-over | ||
|
|
||
| The reference file is `references/aegis_taxonomy.md` but the workflow has migrated to Nemotron Content Safety V2. The file's *content* is up-to-date, but the *filename* still references the older Aegis name. Consider renaming to `references/v2_taxonomy.md` and updating the two references in `SKILL.md` — pure hygiene, not a publishing blocker, but it removes a stale signal. | ||
|
|
||
| ### `references/` and `assets/` token budget | ||
|
|
||
| Both folders are under the SKILL.md 500-line gate that the guide tracks at the SKILL.md level, but NVCARPS also scans for token budget overruns across the skill directory. The HTML GUI in `assets/` is 44 KB — fine — but worth confirming it doesn't contain inline bundled libraries (jQuery, Tailwind CDN copy) that might trip the dependency audit. Spot-check before pushing. | ||
|
|
||
| ## File index (what's in this audit deliverable) | ||
|
|
||
| - `SKILL.md` — corrected, with SPDX license header inserted below the H1 (line 50 area) | ||
| - `PUBLISHING_AUDIT.md` — this document | ||
| - `BENCHMARK.md` — placeholder report with both result tables and the five-dimension rollup | ||
| - `SKILLCARD.yaml` — team-owned fields filled; auto fields marked `_auto_` | ||
| - `evals/evals.json` — 8-case dataset (4 positive, 4 negative) | ||
| - `evals/EVAL.md` — how to run + acceptance bar | ||
| - `assets/`, `references/` — unchanged from your working copy | ||
|
|
||
| ## Sources | ||
|
|
||
| - [Skills_Publishing_Onboarding_Guide.docx](https://docs.google.com/document/d/1SNFRQCv0_p3DC2a_IWIf0cB3c49tSE1d/edit) | ||
| - [github.com/nvidia/skills](https://github.com/nvidia/skills) (catalog) | ||
| - [agentskills.io/specification](https://agentskills.io/specification) (spec) | ||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this file?