
Improve skill architecture based on SkillsBench findings #1236

@technophile-04

Description


Context

We just shipped the initial set of skills in #1235. They work, but there's room to make them significantly more effective.

In the port blog, the SkillsBench paper was suggested. It benchmarked 86 real-world tasks across 7 agent-model combos (7,308 trajectories), and some of the findings are directly relevant to how we structure our skills going forward.

Key findings worth acting on

1. Skill length has a sweet spot

| Length | Impact |
| --- | --- |
| Compact | +17.1pp |
| Detailed | +18.8pp |
| Comprehensive | -2.9pp (hurts!) |

Exhaustive docs actually degrade performance — they eat context without adding actionable guidance. Two of our skills (solidity-security at 533 lines, defi-protocol-templates at 443 lines) fall into the "comprehensive" bucket.
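A cheap guardrail would be a CI check that flags skills drifting into the comprehensive bucket. A minimal sketch, with bucket boundaries taken from the ~150-250 line target discussed below rather than the paper's exact cutoffs:

```typescript
// Classify a skill file by line count. The bucket boundaries here are
// illustrative guesses based on the SkillsBench buckets, not official cutoffs.
type LengthBucket = "compact" | "detailed" | "comprehensive";

function classifyByLines(lines: number): LengthBucket {
  if (lines <= 150) return "compact";
  if (lines <= 250) return "detailed";
  return "comprehensive"; // the bucket the paper found to hurt performance
}

// Our two oversized skills both land in the comprehensive bucket:
const oversized = { "solidity-security": 533, "defi-protocol-templates": 443 };
for (const [name, lines] of Object.entries(oversized)) {
  if (classifyByLines(lines) === "comprehensive") {
    console.log(`${name}: ${lines} lines -> consider splitting`);
  }
}
```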

2. 2-3 skills per task is the sweet spot

| Count | Impact |
| --- | --- |
| 1 skill | +17.8pp |
| 2-3 skills | +18.6pp (peak) |
| 4+ skills | +5.9pp (big drop) |

Loading too many skills creates conflicting guidance. Right now AGENTS.md lists all 6 skills, and the agent could easily try to load 4+ of them for a complex task.

3. Executable resources matter

Skills with companion scripts/templates outperform markdown-only skills. Currently all our skills are standalone SKILL.md files with no runnable code alongside them.
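For example, a split-out staking skill might ship as a small bundle rather than a lone markdown file. All file and folder names below are hypothetical:

```
skills/staking/
├── SKILL.md                      # compact guidance (~150-250 lines)
├── templates/
│   └── Staker.sol                # working contract to adapt
└── scripts/
    └── 00_deploy_staker.ts       # deploy script with tags already set
```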

4. Smaller models + good skills > bigger models without

Haiku 4.5 with curated skills (27.7%) beat Opus 4.5 without skills (22.0%). Well-designed skills are worth more than a model tier upgrade — which is great news for making SE-2 accessible to folks using cheaper models.

What we could try next

These are ideas for a next iteration, not hard commitments. Would be good to test these and see what actually moves the needle:

  • Split the large skills — break solidity-security into focused pieces (reentrancy, access control, gas optimization) and defi-protocol-templates into individual protocol skills (staking, amm, governance, flash-loans). Target ~150-250 lines each.
  • Add template files alongside SKILL.md — working .sol contracts, deploy scripts, maybe starter .tsx pages. Give the agent something concrete to adapt rather than generating from scratch.
  • Add composition guidance in AGENTS.md — suggest which 2-3 skills pair well for common tasks so the agent doesn't try to load everything.
  • Add verification checklists — short "did you actually do this right" section at the end of each skill (compiles? deploy tags correct? using right hook names?).
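To make the composition idea concrete, here is a sketch of how a skill picker could cap loading at the 2-3 skill sweet spot by ranking on keyword overlap with the task. The selection logic and the `frontend-hooks` skill name are assumptions for illustration; only `solidity-security` and `defi-protocol-templates` are real skills from our set:

```typescript
// Hypothetical skill picker: rank skills by keyword overlap with the task
// prompt and load at most three, per the SkillsBench 2-3 skill sweet spot.
interface Skill {
  name: string;
  keywords: string[];
}

function pickSkills(task: string, available: Skill[], max = 3): string[] {
  const words = new Set(task.toLowerCase().split(/\W+/));
  return available
    .map(s => ({ s, score: s.keywords.filter(k => words.has(k)).length }))
    .filter(x => x.score > 0) // drop irrelevant skills entirely
    .sort((a, b) => b.score - a.score)
    .slice(0, max)
    .map(x => x.s.name);
}

// Example with placeholder keyword lists:
const catalog: Skill[] = [
  { name: "solidity-security", keywords: ["contract", "security", "solidity"] },
  { name: "defi-protocol-templates", keywords: ["staking", "amm", "governance"] },
  { name: "frontend-hooks", keywords: ["frontend", "hook", "page"] },
];
// Loads 2 skills: frontend-hooks first (2 matches), then
// defi-protocol-templates (1 match); solidity-security is dropped.
console.log(pickSkills("build a staking dapp with a frontend page", catalog));
```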

How to validate

The tricky part is measuring whether these changes actually help. Some ideas:

  • Pick 3-4 common build prompts ("build an NFT marketplace", "build a staking dapp", etc.)
  • Run them with current skills vs improved skills
  • Compare output quality (compiles? deploys? frontend works?)
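The comparison step could tally those three checks into a single pass rate per skill set. A minimal scoring sketch, with entirely made-up outcomes standing in for real runs:

```typescript
// Tally validation runs: each prompt's output is checked for compile /
// deploy / frontend success, and the two skill sets are compared by the
// fraction of checks passed across all prompts.
interface RunResult {
  compiles: boolean;
  deploys: boolean;
  frontendWorks: boolean;
}

function passRate(results: RunResult[]): number {
  const checks = results.flatMap(r => [r.compiles, r.deploys, r.frontendWorks]);
  return checks.filter(Boolean).length / checks.length;
}

// Hypothetical outcomes for two prompts per skill set:
const currentSkills: RunResult[] = [
  { compiles: true, deploys: false, frontendWorks: false },
  { compiles: true, deploys: true, frontendWorks: false },
];
const improvedSkills: RunResult[] = [
  { compiles: true, deploys: true, frontendWorks: true },
  { compiles: true, deploys: true, frontendWorks: false },
];
// 3/6 checks pass for current, 5/6 for improved.
console.log(`current: ${passRate(currentSkills)}, improved: ${passRate(improvedSkills)}`);
```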

No rush on this — just capturing the research so we can iterate on the skills systematically.

