Context
We just shipped the initial set of skills in #1235. They work, but there's room to make them significantly more effective.
In the port blog, the SkillsBench paper was suggested; it benchmarked 86 real-world tasks across 7 agent-model combos (7,308 trajectories). Some of its findings are directly relevant to how we structure our skills going forward.
Key findings worth acting on
1. Skill length has a sweet spot
| Length | Impact |
| --- | --- |
| Compact | +17.1pp |
| Detailed | +18.8pp |
| Comprehensive | -2.9pp (hurts!) |
Exhaustive docs actually degrade performance — they eat context without adding actionable guidance. Two of our skills (`solidity-security` at 533 lines, `defi-protocol-templates` at 443 lines) fall into the "comprehensive" bucket.
2. 2-3 skills per task is the sweet spot
| Count | Impact |
| --- | --- |
| 1 skill | +17.8pp |
| 2-3 skills | +18.6pp (peak) |
| 4+ skills | +5.9pp (big drop) |
Loading too many skills creates conflicting guidance. Right now AGENTS.md lists all 6 and the agent could try to load 4+ for a complex task.
3. Executable resources matter
Skills with companion scripts/templates outperform markdown-only skills. Currently all our skills are standalone SKILL.md files with no runnable code alongside them.
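As a sketch of what "executable resources" could look like for us (file names below are hypothetical), each skill would become a folder instead of a lone SKILL.md:

```text
solidity-security/
├── SKILL.md                 # compact guidance, not the full 533 lines
└── templates/
    ├── SafeWithdraw.sol     # working pattern the agent adapts
    └── check-reentrancy.md  # short verification checklist
```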
4. Smaller models + good skills > bigger models without
Haiku 4.5 with curated skills (27.7%) beat Opus 4.5 without skills (22.0%). Well-designed skills are worth more than a model tier upgrade — which is great news for making SE-2 accessible to folks using cheaper models.
What we could try next
These are ideas for a next iteration, not hard commitments. Would be good to test these and see what actually moves the needle:
- Split the large skills — break `solidity-security` into focused pieces (reentrancy, access control, gas optimization) and `defi-protocol-templates` into individual protocol skills (staking, amm, governance, flash-loans). Target ~150-250 lines each.
- Add template files alongside SKILL.md — working .sol contracts, deploy scripts, maybe starter .tsx pages. Give the agent something concrete to adapt rather than generating from scratch.
- Add composition guidance in AGENTS.md — suggest which 2-3 skills pair well for common tasks so the agent doesn't try to load everything.
- Add verification checklists — short "did you actually do this right" section at the end of each skill (compiles? deploy tags correct? using right hook names?).
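For the composition guidance, something as small as a pairing section in AGENTS.md might do. A sketch, assuming the large skills get split as proposed above — the pairings themselves are untested guesses:

```markdown
## Which skills to combine

Load at most 2-3 skills per task:

- Staking dapp → staking + reentrancy
- AMM / DEX → amm + reentrancy
- Governance dapp → governance + access-control
```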
How to validate
The tricky part is measuring whether these changes actually help. Some ideas:
- Pick 3-4 common build prompts ("build an NFT marketplace", "build a staking dapp", etc.)
- Run them with current skills vs improved skills
- Compare output quality (compiles? deploys? frontend works?)
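Once those runs exist, tallying the checks is straightforward. A minimal sketch — the prompt names come from above, but the pass/fail data is made up purely for illustration:

```python
# Hypothetical harness: compare per-prompt check results for the current
# skill set vs the improved one. Real data would come from actually running
# each build prompt and recording whether it compiles, deploys, etc.

PROMPTS = ["build an NFT marketplace", "build a staking dapp"]
CHECKS = ["compiles", "deploys", "frontend works"]

def pass_rate(results: dict[str, dict[str, bool]]) -> float:
    """Fraction of (prompt, check) pairs that passed."""
    total = sum(len(checks) for checks in results.values())
    passed = sum(ok for checks in results.values() for ok in checks.values())
    return passed / total if total else 0.0

# Illustrative (fabricated) results for the two skill sets.
baseline = {
    "build an NFT marketplace": {"compiles": True, "deploys": True, "frontend works": False},
    "build a staking dapp": {"compiles": True, "deploys": False, "frontend works": False},
}
improved = {
    "build an NFT marketplace": {"compiles": True, "deploys": True, "frontend works": True},
    "build a staking dapp": {"compiles": True, "deploys": True, "frontend works": False},
}

print(f"baseline: {pass_rate(baseline):.0%}, improved: {pass_rate(improved):.0%}")
```

Even a crude pass-rate like this would tell us whether a skills change moved the needle on the same fixed prompts.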
No rush on this — just capturing the research so we can iterate on the skills systematically.
References