Merged
Changes from 8 commits
4 changes: 2 additions & 2 deletions pkg/create/templates.go
@@ -87,8 +87,8 @@ var Templates = map[string]TemplateInfo{
Languages: []string{LanguageTypeScript, LanguagePython},
},
TemplateYutoriComputerUse: {
Name: "Yutori n1 Computer Use",
Description: "Implements a Yutori n1 computer use agent",
Name: "Yutori n1.5 Computer Use",
Description: "Implements a Yutori n1.5 computer use agent",
Languages: []string{LanguageTypeScript, LanguagePython},
},
TemplateTzafonComputerUse: {
37 changes: 25 additions & 12 deletions pkg/templates/python/yutori/README.md
@@ -1,8 +1,10 @@
# Kernel Python Sample App - Yutori n1 Computer Use
# Kernel Python Sample App - Yutori n1.5 Computer Use

This is a Kernel application that implements a prompt loop using Yutori's n1 computer use model with Kernel's Computer Controls API.
This Kernel app implements a prompt loop using Yutori's Navigator n1.5 with Kernel's Computer Controls API.

[n1](https://yutori.com/blog/introducing-navigator) is Yutori's pixels-to-actions LLM that predicts browser actions from screenshots.
[Navigator n1.5](https://yutori.com/blog/introducing-n1-5) is Yutori's pixels-to-actions LLM that predicts browser actions from screenshots. Yutori reports **94.5% on Online-Mind2Web** and **88.0% on Navi-Bench v2** at **$1.50 per 1M input tokens**, which by those reported figures places it among the most accurate and cost-effective computer-use models.

This template runs n1.5 in **computer-use-only mode**. n1.5 also supports a hybrid vision + DOM/JavaScript path (page-state extraction, custom JS, structured JSON output) for multi-field forms and bulk data extraction, but those tools are intentionally disabled here — see [Disabled tools](#disabled-tools).

## Setup

@@ -55,35 +57,46 @@ kernel invoke python-yutori-cua cua-task --payload '{"query": "Enter https://exa

## Viewport Configuration

Yutori n1 recommends a **1280×800 (WXGA, 16:10)** viewport for best grounding accuracy.
Yutori n1.5 recommends a **1280×800 (WXGA, 16:10)** viewport for best grounding accuracy.

> **Note:** n1 outputs coordinates in a 1000×1000 relative space, which are automatically scaled to the actual viewport dimensions.
> **Note:** n1.5 outputs coordinates in a 1000×1000 relative space, which are automatically scaled to the actual viewport dimensions.
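
The scaling described in the note can be sketched as follows. This is a minimal illustration, not the template's actual `_scale_coordinates`: the function name `scale_point`, the rounding, and the clamping to the viewport bounds are all assumptions.

```python
def scale_point(
    x: float,
    y: float,
    viewport_width: int = 1280,
    viewport_height: int = 800,
) -> tuple[int, int]:
    """Map a point from the 1000x1000 relative space to viewport pixels."""
    px = round(x * viewport_width / 1000)
    py = round(y * viewport_height / 1000)
    # Assumption: clamp so a relative value of exactly 1000 stays on-screen.
    px = min(max(px, 0), viewport_width - 1)
    py = min(max(py, 0), viewport_height - 1)
    return px, py


center = scale_point(500, 500)  # middle of the relative space
```

With the recommended 1280×800 viewport, the relative center maps to the pixel center of the page; other viewports only change the two multipliers.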

See [Kernel Viewport Documentation](https://www.kernel.sh/docs/browsers/viewport) for all supported configurations.

## Screenshots

Screenshots are automatically converted to WebP format for better compression across multi-step trajectories, as recommended by Yutori.

## n1-latest Supported Actions
## n1.5-latest Supported Actions

This template uses the `browser_tools_core-20260403` tool set — coordinate-based browser actions that operate on screenshots only.

| Action | Description |
|--------|-------------|
| `left_click` | Left mouse click at coordinates |
| `double_click` | Double-click at coordinates |
| `triple_click` | Triple-click at coordinates |
| `left_click` | Left mouse click at coordinates (supports `modifier`) |
| `double_click` | Double-click at coordinates (supports `modifier`) |
| `triple_click` | Triple-click at coordinates (supports `modifier`) |
| `middle_click` | Middle mouse click at coordinates |
| `right_click` | Right mouse click at coordinates |
| `mouse_move` | Move mouse to coordinates without clicking |
| `mouse_down` | Press the left mouse button at coordinates |
| `mouse_up` | Release the left mouse button at coordinates |
| `scroll` | Scroll page in a direction |
| `type` | Type text into focused element |
| `key_press` | Send keyboard input |
| `hover` | Move mouse without clicking |
| `key_press` | Send a single key or key combination |
| `hold_key` | Hold a key for a duration |
| `drag` | Click-and-drag operation |
| `wait` | Pause for UI to update |
| `refresh` | Reload current page |
| `go_back` | Navigate back in history |
| `go_forward` | Navigate forward in history |
| `goto_url` | Navigate to a URL |
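
Each action in the table reaches the loop as a tool call in the assistant message: the function name is the action, and the JSON-encoded arguments carry its parameters. Below is a minimal sketch of turning one such call into the action dict that the loop builds. The helper name `tool_call_to_action` and the sample argument values (including the `"shift"` modifier) are illustrative assumptions, not taken from Yutori's documentation.

```python
import json
from typing import Any


def tool_call_to_action(name: str, arguments: str) -> dict[str, Any]:
    """Build an action dict from a tool call's function name and JSON arguments."""
    # An empty arguments string is treated as "no parameters" (assumption).
    args = json.loads(arguments) if arguments else {}
    if not isinstance(args, dict):
        raise ValueError(f"expected a JSON object for {name!r} arguments")
    # Same shape as the loop's `{"action_type": action_name, **args}`.
    return {"action_type": name, **args}


# Hypothetical payload for illustration only.
action = tool_call_to_action(
    "left_click", '{"coordinates": [500, 250], "modifier": "shift"}'
)
```

Malformed JSON raises `json.JSONDecodeError` (a `ValueError` subclass) here, so a caller can catch `ValueError` and report the problem back to the model rather than executing anything.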

### Disabled tools

The DOM/Playwright-based "expanded" tools (`extract_elements`, `find`, `set_element_value`, `execute_js`) are intentionally disabled via the `disable_tools` request parameter — this template runs computer-use only and does not expose a Playwright page to the model.
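
A minimal sketch of how that exclusion could be assembled for a request. The `tool_set` and `disable_tools` field names and values come from this template's `loop.py`; the helper `build_extra_body` and its de-duplication step are hypothetical, added only for illustration.

```python
from typing import Any, Optional

TOOL_SET = "browser_tools_core-20260403"
DISABLED_TOOLS = ["extract_elements", "find", "set_element_value", "execute_js"]


def build_extra_body(
    tool_set: str = TOOL_SET,
    disabled: Optional[list[str]] = None,
) -> dict[str, Any]:
    """Return the n1.5-specific fields to send alongside the standard request."""
    tools = DISABLED_TOOLS if disabled is None else disabled
    # De-duplicate while preserving order (assumption: duplicates add nothing).
    return {"tool_set": tool_set, "disable_tools": list(dict.fromkeys(tools))}


extra_body = build_extra_body()
# In the loop this dict is passed as:
#   client.chat.completions.create(..., extra_body=extra_body)
```

Keeping both fields in one place makes it harder to switch tool sets without also revisiting which tools must stay disabled.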

## Resources

- [Yutori n1 API Documentation](https://docs.yutori.com/reference/n1)
- [Yutori n1.5 API Documentation](https://docs.yutori.com/reference/n1-5)
- [Kernel Documentation](https://www.kernel.sh/docs/quickstart)
122 changes: 113 additions & 9 deletions pkg/templates/python/yutori/loop.py
@@ -1,28 +1,42 @@
"""
Yutori n1 Sampling Loop
Yutori n1.5 Sampling Loop

Implements the agent loop for Yutori's n1-latest computer use model.
n1-latest uses an OpenAI-compatible API with tool_calls:
Implements the agent loop for Yutori's n1.5-latest computer use model.
n1.5-latest uses an OpenAI-compatible API with tool_calls:
- Actions are returned via tool_calls in the assistant message
- Tool results use role: "tool" with matching tool_call_id
- The model stops by returning content without tool_calls
- Coordinates are returned in 1000x1000 space and need scaling

@see https://docs.yutori.com/reference/n1
@see https://docs.yutori.com/reference/n1-5
"""

import copy
import json
from typing import Any, Optional

from kernel import Kernel
from openai import OpenAI

from tools import ComputerTool, N1Action, ToolResult
from tools import ComputerTool, N15Action, ToolResult

# Tools that require a Playwright page / DOM access. The default core tool set
# already excludes them, but we also list them in `disable_tools` so the
# exclusion is explicit and survives if the default ever changes.
DISABLED_TOOLS = ["extract_elements", "find", "set_element_value", "execute_js"]
TOOL_SET = "browser_tools_core-20260403"

# Screenshot-trimming defaults mirror Yutori's reference loop:
# https://github.com/yutori-ai/yutori-sdk-python/blob/main/yutori/navigator/payload.py
# Trimming is size-triggered — we only drop old screenshots when the payload
# exceeds MAX_REQUEST_BYTES, and we always keep at least KEEP_RECENT_SCREENSHOTS.
MAX_REQUEST_BYTES = 9_500_000
KEEP_RECENT_SCREENSHOTS = 6


async def sampling_loop(
*,
model: str = "n1-latest",
model: str = "n1.5-latest",
task: str,
api_key: str,
kernel: Kernel,
@@ -63,12 +77,23 @@ async def sampling_loop(
iteration += 1
print(f"\n=== Iteration {iteration} ===")

request_messages, dropped = _trimmed_for_request(conversation_messages)
if dropped:
print(f"Trimmed {dropped} old screenshot(s) to fit request size limit")

try:
response = client.chat.completions.create(
model=model,
messages=conversation_messages,
messages=request_messages,
max_completion_tokens=max_completion_tokens,
temperature=0.3,
# n1.5-specific knobs go in extra_body.
# tool_set selects the core (coordinate-based) tools.
# disable_tools is a defense-in-depth exclusion of DOM/Playwright tools.
extra_body={
"tool_set": TOOL_SET,
"disable_tools": DISABLED_TOOLS,
},
)
except Exception as api_error:
print(f"API call failed: {api_error}")
@@ -108,7 +133,7 @@ async def sampling_loop(
})
continue

action: N1Action = {"action_type": action_name, **args}
action: N15Action = {"action_type": action_name, **args}
print(f"Executing action: {action_name}", args)

scaled_action = _scale_coordinates(action, viewport_width, viewport_height)
@@ -155,7 +180,86 @@ async def sampling_loop(
}


def _scale_coordinates(action: N1Action, viewport_width: int, viewport_height: int) -> N1Action:
def _trimmed_for_request(
    messages: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], int]:
    """Return a deep-copied messages list with old screenshots stripped to fit MAX_REQUEST_BYTES.

    The most recent KEEP_RECENT_SCREENSHOTS screenshots are protected. The full
    `messages` list is preserved unchanged for the caller's return value.
    """
    trimmed = copy.deepcopy(messages)
    size = _estimate_size(trimmed)
    if size <= MAX_REQUEST_BYTES:
        return trimmed, 0

    image_indices = [i for i, m in enumerate(trimmed) if _message_has_image(m)]
    if not image_indices:
        return trimmed, 0

    protected = set(image_indices[-max(1, KEEP_RECENT_SCREENSHOTS):])
    removed = 0

    for idx in image_indices:
        if size <= MAX_REQUEST_BYTES:
            break
        if idx in protected:
            continue
        if _strip_one_image(trimmed[idx]):
            removed += 1
            size = _estimate_size(trimmed)

    # If still over, strip from the protected window too — but always keep the latest.
    if size > MAX_REQUEST_BYTES:
        last_idx = image_indices[-1]
        for idx in image_indices:
            if size <= MAX_REQUEST_BYTES:
                break
            if idx == last_idx:
                continue
            if _strip_one_image(trimmed[idx]):
                removed += 1
                size = _estimate_size(trimmed)

    return trimmed, removed


def _estimate_size(messages: list[dict[str, Any]]) -> int:
    return len(json.dumps(messages, separators=(",", ":"), ensure_ascii=False).encode("utf-8"))


def _message_has_image(msg: dict[str, Any]) -> bool:
    content = msg.get("content")
    if not isinstance(content, list):
        return False
    return any(isinstance(p, dict) and p.get("type") == "image_url" for p in content)


def _strip_one_image(msg: dict[str, Any]) -> bool:
    content = msg.get("content")
    if not isinstance(content, list):
        return False

    removed = False
    new_content: list[dict[str, Any]] = []
    for part in content:
        if not removed and isinstance(part, dict) and part.get("type") == "image_url":
            removed = True
            continue
        new_content.append(part)

    if not removed:
        return False

    has_text = any(isinstance(p, dict) and p.get("type") == "text" for p in new_content)
    if not has_text:
        new_content.append({"type": "text", "text": "Screenshot omitted to stay under request size limit."})

    msg["content"] = new_content
    return True


def _scale_coordinates(action: N15Action, viewport_width: int, viewport_height: int) -> N15Action:
scaled = dict(action)

if "coordinates" in scaled and scaled["coordinates"]:
4 changes: 2 additions & 2 deletions pkg/templates/python/yutori/main.py
@@ -30,7 +30,7 @@ async def cua_task(
payload: QueryInput,
) -> QueryOutput:
"""
Process a user query using Yutori n1 Computer Use with Kernel's browser automation.
Process a user query using Yutori n1.5 Computer Use with Kernel's browser automation.

Args:
ctx: Kernel context containing invocation information
@@ -58,7 +58,7 @@ async def cua_task(
print("Kernel browser live view url:", session.live_view_url)

loop_result = await sampling_loop(
model="n1-latest",
model="n1.5-latest",
task=payload["query"],
api_key=str(api_key),
kernel=session.kernel,
6 changes: 3 additions & 3 deletions pkg/templates/python/yutori/tools/__init__.py
@@ -1,11 +1,11 @@
"""Yutori n1 Computer Tools."""
"""Yutori n1.5 Computer Tools."""

from .base import ToolError, ToolResult
from .computer import ComputerTool, N1Action
from .computer import ComputerTool, N15Action

__all__ = [
"ToolError",
"ToolResult",
"ComputerTool",
"N1Action",
"N15Action",
]