
Improve skill architecture based on SkillsBench findings #1236

@technophile-04

Description


Context

We just shipped the initial set of skills in #1235. They work, but there's room to make them significantly more effective.

In the port blog, the SkillsBench paper was suggested. It benchmarked 86 real-world tasks across 7 agent-model combos (7,308 trajectories), and some of the findings are directly relevant to how we structure our skills going forward.

Key findings worth acting on

1. Skill length has a sweet spot

| Length | Impact |
| --- | --- |
| Compact | +17.1pp |
| Detailed | +18.8pp |
| Comprehensive | -2.9pp (hurts!) |

Exhaustive docs actually degrade performance — they eat context without adding actionable guidance. Two of our skills (solidity-security at 533 lines, defi-protocol-templates at 443 lines) fall into the "comprehensive" bucket.
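A cheap guardrail would be a CI check that flags skills drifting into the comprehensive bucket. A minimal sketch, with bucket boundaries taken from the ~150-250 line target discussed below rather than the paper's exact cutoffs:

```typescript
// Classify a skill file by line count. The bucket boundaries here are
// illustrative guesses based on the SkillsBench buckets, not official cutoffs.
type LengthBucket = "compact" | "detailed" | "comprehensive";

function classifyByLines(lines: number): LengthBucket {
  if (lines <= 150) return "compact";
  if (lines <= 250) return "detailed";
  return "comprehensive"; // the bucket the paper found to hurt performance
}

// Our two oversized skills both land in the comprehensive bucket:
const oversized = { "solidity-security": 533, "defi-protocol-templates": 443 };
for (const [name, lines] of Object.entries(oversized)) {
  if (classifyByLines(lines) === "comprehensive") {
    console.log(`${name}: ${lines} lines -> consider splitting`);
  }
}
```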

2. 2-3 skills per task is the sweet spot

| Count | Impact |
| --- | --- |
| 1 skill | +17.8pp |
| 2-3 skills | +18.6pp (peak) |
| 4+ skills | +5.9pp (big drop) |

Loading too many skills creates conflicting guidance. Right now AGENTS.md lists all 6 skills, and the agent could easily try to load 4+ of them for a complex task.

3. Executable resources matter

Skills with companion scripts/templates outperform markdown-only skills. Currently all our skills are standalone SKILL.md files with no runnable code alongside them.
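For example, a split-out staking skill might ship as a small bundle rather than a lone markdown file. All file and folder names below are hypothetical:

```
skills/staking/
├── SKILL.md                      # compact guidance (~150-250 lines)
├── templates/
│   └── Staker.sol                # working contract to adapt
└── scripts/
    └── 00_deploy_staker.ts       # deploy script with tags already set
```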

4. Smaller models + good skills > bigger models without

Haiku 4.5 with curated skills (27.7%) beat Opus 4.5 without skills (22.0%). Well-designed skills are worth more than a model tier upgrade — which is great news for making SE-2 accessible to folks using cheaper models.

What we could try next

These are ideas for a next iteration, not hard commitments. Would be good to test these and see what actually moves the needle:

  • Split the large skills — break solidity-security into focused pieces (reentrancy, access control, gas optimization) and defi-protocol-templates into individual protocol skills (staking, amm, governance, flash-loans). Target ~150-250 lines each.
  • Add template files alongside SKILL.md — working .sol contracts, deploy scripts, maybe starter .tsx pages. Give the agent something concrete to adapt rather than generating from scratch.
  • Add composition guidance in AGENTS.md — suggest which 2-3 skills pair well for common tasks so the agent doesn't try to load everything.
  • Add verification checklists — short "did you actually do this right" section at the end of each skill (compiles? deploy tags correct? using right hook names?).
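To make the composition idea concrete, here is a sketch of how a skill picker could cap loading at the 2-3 skill sweet spot by ranking on keyword overlap with the task. The selection logic and the `frontend-hooks` skill name are assumptions for illustration; only `solidity-security` and `defi-protocol-templates` are real skills from our set:

```typescript
// Hypothetical skill picker: rank skills by keyword overlap with the task
// prompt and load at most three, per the SkillsBench 2-3 skill sweet spot.
interface Skill {
  name: string;
  keywords: string[];
}

function pickSkills(task: string, available: Skill[], max = 3): string[] {
  const words = new Set(task.toLowerCase().split(/\W+/));
  return available
    .map(s => ({ s, score: s.keywords.filter(k => words.has(k)).length }))
    .filter(x => x.score > 0) // drop irrelevant skills entirely
    .sort((a, b) => b.score - a.score)
    .slice(0, max)
    .map(x => x.s.name);
}

// Example with placeholder keyword lists:
const catalog: Skill[] = [
  { name: "solidity-security", keywords: ["contract", "security", "solidity"] },
  { name: "defi-protocol-templates", keywords: ["staking", "amm", "governance"] },
  { name: "frontend-hooks", keywords: ["frontend", "hook", "page"] },
];
// Loads 2 skills: frontend-hooks first (2 matches), then
// defi-protocol-templates (1 match); solidity-security is dropped.
console.log(pickSkills("build a staking dapp with a frontend page", catalog));
```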

How to validate

The tricky part is measuring whether these changes actually help. Some ideas:

  • Pick 3-4 common build prompts ("build an NFT marketplace", "build a staking dapp", etc.)
  • Run them with current skills vs improved skills
  • Compare output quality (compiles? deploys? frontend works?)
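The comparison step could tally those three checks into a single pass rate per skill set. A minimal scoring sketch, with entirely made-up outcomes standing in for real runs:

```typescript
// Tally validation runs: each prompt's output is checked for compile /
// deploy / frontend success, and the two skill sets are compared by the
// fraction of checks passed across all prompts.
interface RunResult {
  compiles: boolean;
  deploys: boolean;
  frontendWorks: boolean;
}

function passRate(results: RunResult[]): number {
  const checks = results.flatMap(r => [r.compiles, r.deploys, r.frontendWorks]);
  return checks.filter(Boolean).length / checks.length;
}

// Hypothetical outcomes for two prompts per skill set:
const currentSkills: RunResult[] = [
  { compiles: true, deploys: false, frontendWorks: false },
  { compiles: true, deploys: true, frontendWorks: false },
];
const improvedSkills: RunResult[] = [
  { compiles: true, deploys: true, frontendWorks: true },
  { compiles: true, deploys: true, frontendWorks: false },
];
// 3/6 checks pass for current, 5/6 for improved.
console.log(`current: ${passRate(currentSkills)}, improved: ${passRate(improvedSkills)}`);
```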

No rush on this — just capturing the research so we can iterate on the skills systematically.

