diff --git a/CONTRIBUTING.claude.md b/CONTRIBUTING.claude.md new file mode 100644 index 000000000..6ab95f539 --- /dev/null +++ b/CONTRIBUTING.claude.md @@ -0,0 +1,332 @@ +# Chromie Contribution Guidelines + +This file contains guidelines for Chromie (AI coding assistant) when contributing to Stagehand. + +## Core Principle: Test-First Bug Fixing + +**Every bug fix MUST include a failing test that proves the bug exists.** + +### Workflow + +1. **Analyze the bug** - Understand the root cause +2. **Write a failing test** - Create a test that fails with the current code +3. **Verify the test fails** - Run `pnpm test` to confirm +4. **Implement the fix** - Write minimal code to fix the bug +5. **Verify the test passes** - Run `pnpm test` to confirm +6. **Create PR** - Submit with both test and fix + +### Why Test-First? + +- **Proves understanding** - The test demonstrates you understand the bug +- **Prevents regression** - The test catches if the bug returns +- **Documents behavior** - The test explains expected behavior +- **Validates the fix** - Green tests prove the fix works + +--- + +## Test Location Guide + +| Bug Type | Test Location | Test Framework | +|----------|---------------|----------------| +| Core handler bugs | `packages/core/lib/v3/tests/*.spec.ts` | Playwright Test | +| Extract failures | `packages/core/lib/v3/tests/` + `packages/evals/tasks/extract_*.ts` | Playwright + Evals | +| Act failures | `packages/core/lib/v3/tests/` + `packages/evals/tasks/` | Playwright + Evals | +| Agent bugs | `packages/core/lib/v3/tests/agent-*.spec.ts` | Playwright Test | +| Shadow DOM issues | `packages/core/lib/v3/tests/shadow-iframe.spec.ts` | Playwright Test | +| iframe issues | `packages/core/lib/v3/tests/frame-*.spec.ts` | Playwright Test | + +### When to Write Playwright Tests vs Evals + +**Playwright Tests** (`packages/core/lib/v3/tests/`): +- Unit/integration tests for specific functionality +- Fast, deterministic tests +- Can use mock servers or static HTML +- Run with `pnpm test` + +**Evals** (`packages/evals/tasks/`): +- End-to-end tests against real websites +- Test real-world scenarios +- May be flaky due to site changes +- Run with `pnpm evals` + +--- + +## Test Template + +### Playwright Test + +```typescript +// packages/core/lib/v3/tests/bug-description.spec.ts +import { test, expect } from "@playwright/test"; +import { Stagehand } from "@browserbasehq/stagehand"; + +test.describe("Bug: [Description]", () => { + let stagehand: Stagehand; + + test.beforeEach(async () => { + stagehand = new Stagehand({ env: "LOCAL", verbose: 0 }); + await stagehand.init(); + }); + + test.afterEach(async () => { + await stagehand.close(); + }); + + test("should [expected behavior]", async () => { + const page = stagehand.context.pages()[0]; + + // Setup: Navigate to test page or mock scenario + await page.goto("https://example.com"); + + // Action: Perform the operation that was broken + const result = await stagehand.extract("extract the title"); + + // Assert: Verify expected behavior + expect(result.extraction).toBe("Expected Title"); + }); +}); +``` + +### Eval Task + +```typescript +// packages/evals/tasks/bug_description.ts +import { EvalFunction } from "../types/evals"; + +export const bug_description: EvalFunction = async ({ + debugUrl, + sessionUrl, + v3, + logger, +}) => { + try { + const page = v3.context.pages()[0]; + await page.goto("https://example.com"); + + // Perform the operation that was broken + const result = await v3.extract("extract the data"); + + logger.log({ + message: 
"Extraction result", + level: 1, + auxiliary: { result: { value: result, type: "object" } }, + }); + + // Assert expected behavior + return { + _success: result.extraction === "expected value", + result, + debugUrl, + sessionUrl, + logs: logger.getLogs(), + }; + } catch (error) { + return { + _success: false, + error: JSON.parse(JSON.stringify(error, null, 2)), + debugUrl, + sessionUrl, + logs: logger.getLogs(), + }; + } finally { + await v3.close(); + } +}; +``` + +--- + +## Common Bug Patterns + +### Selector/Element Not Found + +**Symptoms**: `XPathResolutionError`, `StagehandElementNotFoundError` + +**Test approach**: +```typescript +test("should handle dynamic content loading", async () => { + // Navigate to page with dynamic content + // Wait for content to load + // Perform action + // Assert success +}); +``` + +### Shadow DOM Issues + +**Symptoms**: `StagehandShadowRootMissingError`, elements inside shadow DOM not found + +**Test approach**: +```typescript +test("should interact with shadow DOM elements", async () => { + // Navigate to page with shadow DOM + // Use deepLocator or appropriate method + // Assert element is found and actionable +}); +``` + +### Timeout Issues + +**Symptoms**: `ActTimeoutError`, `ExtractTimeoutError` + +**Test approach**: +```typescript +test("should complete within timeout", async () => { + // Set explicit timeout + // Perform action that was timing out + // Assert completes successfully +}); +``` + +### LLM Response Parsing + +**Symptoms**: `ZodSchemaValidationError`, malformed extraction + +**Test approach**: +```typescript +test("should extract data matching schema", async () => { + // Define schema + // Extract with schema + // Assert structure matches +}); +``` + +--- + +## PR Requirements + +### Title Format + +``` +fix(component): brief description of fix + +Examples: +fix(actHandler): handle shadow DOM elements in nested iframes +fix(extractHandler): parse URLs correctly when schema uses z.string().url() +fix(agent): prevent infinite loop when task is already complete +``` + +### PR Body Template + +```markdown +## Problem + +[Describe the bug - what was happening?] + +## Root Cause + +[What caused the bug?] + +## Solution + +[How does this fix address the root cause?] 
+ +## Test Plan + +- [ ] Added failing test that reproduces the bug +- [ ] Verified test fails before fix +- [ ] Verified test passes after fix +- [ ] Ran full test suite: `pnpm test` +- [ ] (If applicable) Ran relevant evals: `pnpm evals [category]` + +## Related Issues + +Fixes #[issue-number] +``` + +--- + +## Code Style + +### Follow Existing Patterns + +- Look at surrounding code for style guidance +- Use existing utilities (don't reinvent) +- Follow handler pattern for new functionality +- Keep changes minimal and focused + +### Error Handling + +- Throw specific error types from `types/public/sdkErrors.ts` +- Include helpful error messages +- Log at appropriate verbosity levels + +### TypeScript + +- Use strict types (no `any` unless necessary) +- Export types from `types/public/` for public API +- Keep internal types in `types/private/` + +--- + +## Running Tests + +### Before Submitting PR + +```bash +# Build the project +pnpm build + +# Run all tests +pnpm test + +# Run specific test file +pnpm test packages/core/lib/v3/tests/my-test.spec.ts + +# Run e2e tests locally +pnpm e2e:local + +# Run relevant evals (if applicable) +pnpm evals [task-name] +``` + +### Test Environment + +- **Local testing**: Uses local Chrome via `chrome-launcher` +- **Browserbase testing**: Uses remote browser sessions +- Default to local for faster iteration + +--- + +## Common Mistakes to Avoid + +1. **Don't skip the failing test** - Every fix needs a test +2. **Don't modify unrelated code** - Keep changes focused +3. **Don't add unnecessary abstractions** - Simpler is better +4. **Don't forget to close Stagehand** - Always use `finally { await v3.close() }` +5. **Don't hardcode timeouts** - Use configurable values +6. **Don't ignore TypeScript errors** - Fix them properly +7. **Don't add console.log** - Use the logger system + +--- + +## Key Files for Common Fixes + +| Issue Type | Key Files | +|------------|-----------| +| Action execution | `handlers/actHandler.ts`, `understudy/page.ts` | +| Data extraction | `handlers/extractHandler.ts`, `utils.ts` | +| Element selection | `understudy/a11y/snapshot.ts`, `dom/` | +| Shadow DOM | `understudy/page.ts`, `dom/` | +| Agent behavior | `handlers/v3AgentHandler.ts`, `agent/tools/` | +| Timeouts | `handlers/handlerUtils/timeoutGuard.ts` | +| LLM inference | `llm/LLMClient.ts`, `inference.ts` | +| Error handling | `types/public/sdkErrors.ts` | + +--- + +## Escalation Context + +When receiving an escalation from Slack, extract: + +1. **Bug description** - What's not working? +2. **Reproduction steps** - How to trigger the bug? +3. **Error message** - Exact error text if available +4. **Environment** - Local or Browserbase? Which model? +5. **Code snippet** - User's code if provided + +Use this context to: +1. Write a focused test that reproduces the issue +2. Identify the root cause in the codebase +3. Implement a minimal fix +4. Verify the fix addresses the original report diff --git a/packages/core/CLAUDE.md b/packages/core/CLAUDE.md new file mode 100644 index 000000000..f7a770ffd --- /dev/null +++ b/packages/core/CLAUDE.md @@ -0,0 +1,410 @@ +# Stagehand Core Package + +This is the main Stagehand SDK (`@browserbasehq/stagehand`). It contains the V3 implementation of the browser automation framework. 
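+
+A minimal end-to-end sketch, using only calls documented in this file (the URL and model name are placeholders):
+
+```typescript
+import { Stagehand } from "@browserbasehq/stagehand";
+import { z } from "zod";
+
+// Launch local Chrome, run one structured extraction, then clean up.
+const stagehand = new Stagehand({ env: "LOCAL", model: "openai/gpt-4.1-mini" });
+await stagehand.init();
+
+try {
+  const page = stagehand.context.pages()[0];
+  await page.goto("https://example.com");
+
+  // The result is validated against the Zod schema before it is returned.
+  const data = await stagehand.extract(
+    "get the page title",
+    z.object({ title: z.string() }),
+  );
+  // data.title is now a plain string.
+} finally {
+  await stagehand.close();
+}
+```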
+ +## Directory Structure + +``` +packages/core/ +├── lib/v3/ # V3 implementation +│ ├── v3.ts # Main orchestrator (exported as Stagehand) +│ ├── index.ts # Public exports +│ ├── handlers/ # API handlers +│ │ ├── actHandler.ts # Action execution +│ │ ├── extractHandler.ts # Data extraction +│ │ ├── observeHandler.ts # Action planning +│ │ ├── v3AgentHandler.ts # Tools-based agent +│ │ ├── v3CuaAgentHandler.ts # Computer Use Agent +│ │ └── handlerUtils/ # Shared handler utilities +│ ├── understudy/ # CDP browser abstraction +│ │ ├── context.ts # V3Context - manages browser context +│ │ ├── page.ts # Page abstraction +│ │ ├── frame.ts # Frame abstraction +│ │ └── a11y/ # Accessibility tree utilities +│ │ └── snapshot.ts # captureHybridSnapshot() +│ ├── llm/ # LLM abstraction +│ │ ├── LLMClient.ts # Base LLM interface +│ │ └── LLMProvider.ts # Provider factory (57+ models) +│ ├── launch/ # Browser launch +│ │ ├── local.ts # Local Chrome (chrome-launcher) +│ │ └── browserbase.ts # Browserbase sessions +│ ├── dom/ # DOM scripts +│ │ └── *.ts # Scripts injected into pages +│ ├── agent/ # Agent components +│ │ ├── tools/ # Built-in agent tools +│ │ └── utils/ # Agent utilities +│ ├── types/ # TypeScript types +│ │ ├── public/ # Exported types +│ │ │ ├── methods.ts # act/extract/observe types +│ │ │ ├── sdkErrors.ts # Error classes +│ │ │ ├── model.ts # Model types +│ │ │ └── agent.ts # Agent types +│ │ └── private/ # Internal types +│ ├── cache/ # Caching utilities +│ │ ├── ActCache.ts # Action result caching +│ │ └── AgentCache.ts # Agent state caching +│ ├── mcp/ # Model Context Protocol +│ └── tests/ # Playwright tests +└── examples/ # Usage examples +``` + +--- + +## Core Class: V3 (Stagehand) + +**File**: `lib/v3/v3.ts` + +The `V3` class is the main entry point, exported as `Stagehand`. It orchestrates: + +1. **Browser lifecycle**: Launch/connect to browser, create context +2. **Handler delegation**: Route API calls to appropriate handlers +3. **LLM management**: Resolve model clients per-call or globally +4. **Metrics tracking**: Token usage, inference time + +### Initialization Flow + +```typescript +const stagehand = new Stagehand({ env: "LOCAL", model: "openai/gpt-4.1-mini" }); +await stagehand.init(); +``` + +**What happens in `init()`:** + +1. Load environment variables (`.env`) +2. Launch browser: + - `env: "LOCAL"`: `launchLocalChrome()` via chrome-launcher + - `env: "BROWSERBASE"`: `createBrowserbaseSession()` via SDK +3. Connect to CDP WebSocket (15s timeout) +4. Create `V3Context` from CDP connection +5. Initialize handlers: `ActHandler`, `ExtractHandler`, `ObserveHandler` +6. Wait for first page to load + +### Key Properties + +```typescript +stagehand.context; // V3Context - browser context management +stagehand.llmClient; // LLMClient - current LLM client +stagehand.browserbaseSessionId; // Session ID (if using Browserbase) +``` + +--- + +## Handler Pattern + +Each core API has a dedicated handler class. Handlers are stateless and receive all dependencies via constructor. + +### Common Handler Structure + +```typescript +export class FooHandler { + private readonly llmClient: LLMClient; + private readonly resolveLlmClient: (model?: ModelConfiguration) => LLMClient; + private readonly onMetrics?: (...) => void; + + constructor(llmClient, defaultModel, clientOptions, resolveLlmClient, ...) { + // Store dependencies + } + + async foo(params: FooHandlerParams): Promise { + // 1. Capture snapshot + // 2. Send to LLM + // 3. Process response + // 4. 
Execute action (if applicable) + // 5. Return result + } +} +``` + +### ActHandler (`handlers/actHandler.ts`) + +Executes single atomic actions on the page. + +**Flow:** + +1. Capture hybrid snapshot (accessibility tree + element mappings) +2. Build prompt with instruction + DOM elements +3. Send to LLM via `actInference()` +4. LLM returns: `{ elementId, method, arguments }` +5. Map elementId to XPath selector +6. Execute action via `performUnderstudyMethod()` +7. Wait for DOM/network quiet + +**Self-healing (when enabled):** + +- If action fails, retake screenshot +- Re-prompt LLM with error context +- Retry up to 3 times + +**Key methods:** + +- `act(params)`: Main action execution +- `takeDeterministicAction(page, action)`: Direct Action → execute (skip LLM) + +### ExtractHandler (`handlers/extractHandler.ts`) + +Extracts structured data from pages using Zod schemas. + +**Flow:** + +1. If no instruction: return raw page text (accessibility tree) +2. Transform schema (convert `z.string().url()` to numeric IDs) +3. Capture hybrid snapshot +4. Send to LLM via `runExtract()` +5. LLM returns structured data matching schema +6. Inject real URLs back into numeric ID placeholders +7. Validate against Zod schema + +**URL handling:** + +- `z.string().url()` fields are replaced with `z.number()` before LLM call +- LLM returns numeric IDs referencing elements in DOM +- IDs are mapped back to actual URLs after extraction +- This prevents URL hallucination + +### ObserveHandler (`handlers/observeHandler.ts`) + +Plans actions without executing them. Returns candidate actions. + +**Flow:** + +1. Capture hybrid snapshot +2. If instruction provided: find matching elements +3. If no instruction: return all interactive elements +4. Build Action objects with XPath selectors +5. Return Action[] for user to choose from + +**Use case:** Observe + Act pattern - plan once, execute later + +### V3AgentHandler (`handlers/v3AgentHandler.ts`) + +Multi-step autonomous execution using AI SDK tools. + +**Tools available to agent:** + +- `act`: Execute single action +- `extract`: Extract data +- `observe`: Plan actions +- `screenshot`: Capture page screenshot +- `goto`: Navigate to URL +- `scroll`: Scroll page +- `wait`: Wait for time/condition +- `close`: Close page + +**Flow:** + +1. Create AI SDK messages with system prompt +2. Loop until max_steps or task complete: + - Call LLM with current state + - Execute tool calls + - Append results to messages +3. Return final result with message, actions, reasoning + +### V3CuaAgentHandler (`handlers/v3CuaAgentHandler.ts`) + +Computer Use Agent for Claude Sonnet 4 or Gemini 2.5 computer-use models. + +**Difference from V3AgentHandler:** + +- Direct browser control without Stagehand tool wrapping +- Uses native computer-use capabilities of the model +- Enabled via `agent({ cua: true })` + +--- + +## CDP Abstraction (understudy/) + +The `understudy` directory contains the Chrome DevTools Protocol abstraction layer. + +### V3Context (`understudy/context.ts`) + +Manages the browser context and page lifecycle. + +**Responsibilities:** + +- Own root CDP connection (`CdpConnection`) +- Manage `Page` objects (one per browser tab) +- Handle Target events (new tabs, closes) +- Track frame topology and OOPIF adoption + +**Key methods:** + +```typescript +context.pages(); // Get all Page objects +context.newPage(); // Create new page/tab +context.activePage; // Get current active page +``` + +### Page (`understudy/page.ts`) + +Abstraction over a browser tab. 
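+
+For orientation, a short usage sketch combining the calls listed below (that `evaluate` resolves to the callback's return value is an assumption, mirroring the Playwright-style convention):
+
+```typescript
+// Navigate the first tab and read its title via in-page JavaScript.
+const page = stagehand.context.pages()[0];
+await page.goto("https://example.com");
+const title = await page.evaluate(() => document.title);
+```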
+ +**Key methods:** + +```typescript +page.goto(url); // Navigate +page.screenshot(); // Capture screenshot +page.evaluate(fn); // Run JS in page context +page.locator(selector); // Get element locator +page.deepLocator(xpath); // XPath across shadow DOM/iframes +``` + +### Snapshots (`understudy/a11y/snapshot.ts`) + +**`captureHybridSnapshot()`:** + +- Captures accessibility tree +- Maps element IDs to XPath selectors +- Used by all handlers for LLM context + +--- + +## LLM Abstraction (llm/) + +### LLMClient (`llm/LLMClient.ts`) + +Base interface for LLM clients. + +```typescript +interface LLMClient { + createChatCompletion(params): Promise; + // Model-specific implementations +} +``` + +### LLMProvider (`llm/LLMProvider.ts`) + +Factory for creating LLM clients by model name. + +**Supported providers:** + +- OpenAI: `openai/gpt-4.1`, `openai/gpt-4.1-mini`, etc. +- Anthropic: `anthropic/claude-sonnet-4`, `anthropic/claude-haiku-4-5` +- Google: `google/gemini-2.0-flash`, `google/gemini-2.5-*` +- Others: Together, Groq, Cerebras, Mistral, xAI, Perplexity, Ollama + +--- + +## Error Classes (`types/public/sdkErrors.ts`) + +All errors extend `StagehandError`. + +| Error | When Thrown | +| --------------------------------- | ---------------------------------- | +| `StagehandNotInitializedError` | Calling methods before `init()` | +| `MissingEnvironmentVariableError` | Missing API keys | +| `ConnectionTimeoutError` | Can't connect to Chrome (15s) | +| `ActTimeoutError` | `act()` exceeds timeout | +| `ExtractTimeoutError` | `extract()` exceeds timeout | +| `ObserveTimeoutError` | `observe()` exceeds timeout | +| `XPathResolutionError` | Selector doesn't match element | +| `StagehandElementNotFoundError` | Element not in DOM | +| `StagehandShadowRootMissingError` | Shadow DOM pierce failed | +| `ZodSchemaValidationError` | LLM output doesn't match schema | +| `AgentAbortError` | Agent execution cancelled | +| `CuaModelRequiredError` | Using CUA without compatible model | + +--- + +## Testing + +### Test Location + +Tests are in `lib/v3/tests/` using Playwright Test. + +### Test Configurations + +- `v3.playwright.config.ts`: Default (parallel, 90s timeout) +- `v3.local.playwright.config.ts`: Local Chrome testing +- `v3.bb.playwright.config.ts`: Browserbase testing + +### Running Tests + +```bash +# From repo root +pnpm test # Default config +pnpm e2e:local # Local Chrome +pnpm e2e:bb # Browserbase + +# From packages/core +pnpm test +``` + +### Writing Tests + +```typescript +import { test, expect } from "@playwright/test"; +import { Stagehand } from "@browserbasehq/stagehand"; + +test("should extract data", async () => { + const stagehand = new Stagehand({ env: "LOCAL" }); + await stagehand.init(); + + const page = stagehand.context.pages()[0]; + await page.goto("https://example.com"); + + const data = await stagehand.extract( + "get the title", + z.object({ + title: z.string(), + }), + ); + + expect(data.title).toBeDefined(); + await stagehand.close(); +}); +``` + +--- + +## Key Patterns + +### Snapshot-Based AI + +All AI operations use accessibility tree snapshots, not live DOM queries. This ensures determinism and avoids race conditions. + +### Element ID → XPath Mapping + +1. `captureHybridSnapshot()` assigns numeric IDs to elements +2. LLM references elements by ID +3. 
IDs are mapped back to XPath for execution + +### Per-Call Model Override + +Any method can override the default model: + +```typescript +await stagehand.act("click button", { model: "anthropic/claude-sonnet-4" }); +``` + +### Metrics Tracking + +All handlers report: + +- `promptTokens`, `completionTokens`, `reasoningTokens` +- `cachedInputTokens`, `inferenceTimeMs` + +--- + +## Common Modifications + +### Adding a New Handler Method + +1. Add type to `types/public/methods.ts` +2. Create handler class in `handlers/` +3. Add handler to V3 constructor in `v3.ts` +4. Expose method on V3 class +5. Export from `index.ts` +6. Add tests in `tests/` + +### Adding a New LLM Provider + +1. Add provider client in `llm/` +2. Register in `LLMProvider.ts` +3. Add model names to `AvailableModel` type +4. Test with existing evals + +### Adding a New Agent Tool + +1. Add tool definition in `agent/tools/` +2. Register in `v3AgentHandler.ts` tools array +3. Add tests for new tool behavior diff --git a/packages/evals/CLAUDE.md b/packages/evals/CLAUDE.md new file mode 100644 index 000000000..15dd57a85 --- /dev/null +++ b/packages/evals/CLAUDE.md @@ -0,0 +1,461 @@ +# Stagehand Evals Package + +This package contains the evaluation suite for Stagehand. It provides a framework for running automated tests against live websites to measure Stagehand's capabilities. + +## Quick Start + +```bash +# Run all evals +pnpm evals + +# Run specific task +pnpm evals extract_repo_name + +# Run by category +pnpm evals observe_* + +# Run with options +pnpm evals --env=browserbase --trials=5 +``` + +## Directory Structure + +``` +packages/evals/ +├── tasks/ # Individual eval task files (126 tasks) +│ ├── agent/ # Agent-specific tasks (30+) +│ ├── extract_*.ts # Extract tasks +│ ├── observe_*.ts # Observe tasks +├── suites/ # External benchmark suites +│ ├── gaia.ts # GAIA benchmark +│ ├── webvoyager.ts # WebVoyager benchmark +│ └── onlineMind2Web.ts # OnlineMind2Web benchmark +├── types/ +│ └── evals.ts # Type definitions +├── evals.config.json # Task registry and configuration +├── run.ts # CLI entry point +├── index.eval.ts # Braintrust eval orchestrator +├── taskConfig.ts # Model and task configuration +├── scoring.ts # Scoring functions +├── logger.ts # EvalLogger utility +├── initV3.ts # Stagehand initialization +└── summary.ts # Result aggregation +``` + +--- + +## Writing Eval Tasks + +### Basic Task Structure + +Create a new file in `tasks/`: + +```typescript +// tasks/my_new_task.ts +import { EvalFunction } from "../types/evals"; + +export const my_new_task: EvalFunction = async ({ + debugUrl, + sessionUrl, + v3, + logger, +}) => { + try { + const page = v3.context.pages()[0]; + await page.goto("https://example.com"); + + // Perform actions + await v3.act("click the button"); + + // Extract data + const { extraction } = await v3.extract("get the result"); + + // Log intermediate results + logger.log({ + message: "Extracted result", + level: 1, + auxiliary: { + result: { value: extraction, type: "object" }, + }, + }); + + // Return success/failure + return { + _success: extraction === "expected value", + extraction, + debugUrl, + sessionUrl, + logs: logger.getLogs(), + }; + } catch (error) { + return { + _success: false, + error: JSON.parse(JSON.stringify(error, null, 2)), + debugUrl, + sessionUrl, + logs: logger.getLogs(), + }; + } finally { + await v3.close(); + } +}; +``` + +### Register the Task + +Add to `evals.config.json`: + +```json +{ + "tasks": [ + { + "name": "my_new_task", + "categories": ["extract"] + } + ] +} 
+``` + +### EvalFunction Signature + +```typescript +type EvalFunction = (taskInput: { + v3: V3; // Stagehand instance + v3Agent?: AgentInstance; // Agent instance (for agent tasks) + logger: EvalLogger; // Logging utility + debugUrl: string; // Debug URL for session + sessionUrl: string; // Browserbase session URL + modelName: AvailableModel; // Current model being tested + input: EvalInput; // Task input with params +}) => Promise<{ + _success: boolean; // Pass/fail + logs: LogLine[]; // Captured logs + debugUrl: string; // Debug URL + sessionUrl: string; // Session URL + error?: unknown; // Error if failed +}>; +``` + +--- + +## Task Categories + +| Category | Description | Example Tasks | +| --------------------------- | ------------------------------ | ----------------------------------------------- | +| `act` | Single action execution | `amazon_add_to_cart`, `dropdown`, `login` | +| `extract` | Data extraction | `extract_repo_name`, `extract_github_stars` | +| `observe` | Action planning | `observe_github`, `observe_amazon_add_to_cart` | +| `combination` | Multi-step workflows | `arxiv`, `allrecipes`, `peeler_complex` | +| `agent` | Agent-based tasks | `agent/google_flights`, `agent/sf_library_card` | +| `targeted_extract` | Extract from specific selector | `extract_recipe`, `extract_hamilton_weather` | +| `regression` | Regression tests | `wichita`, `heal_simple_google_search` | +| `experimental` | Experimental features | `apple`, `costar` | +| `llm_clients` | LLM provider tests | `hn_aisdk`, `hn_langchain` | +| `external_agent_benchmarks` | External benchmarks | `agent/gaia`, `agent/webvoyager` | + +--- + +## Running Evals + +### Command Line Options + +```bash +pnpm evals [task-name] [options] +``` + +| Option | Description | Default | +| -------------------------------------- | ------------------------- | -------------- | +| `--env=local\|browserbase` | Environment | `local` | +| `--trials=N` | Number of trials per eval | `3` | +| `--concurrency=N` | Max parallel sessions | `10` | +| `--provider=openai\|anthropic\|google` | Model provider filter | all | +| `--model=MODEL_NAME` | Specific model | default models | +| `--api=true\|false` | Use API mode | `false` | +| `--max_k=N` | Limit number of evals | unlimited | + +### Examples + +```bash +# Run extract tasks locally +pnpm evals extract_* + +# Run with Browserbase +pnpm evals amazon_add_to_cart --env=browserbase + +# Run 5 trials with specific model +pnpm evals observe_github --trials=5 --model=anthropic/claude-sonnet-4 + +# Run agent tasks with high concurrency +pnpm evals agent/* --concurrency=20 +``` + +### Environment Variables + +```bash +# Required for Browserbase +BROWSERBASE_API_KEY= +BROWSERBASE_PROJECT_ID= + +# LLM API Keys +OPENAI_API_KEY= +ANTHROPIC_API_KEY= +GOOGLE_GENERATIVE_AI_API_KEY= + +# Optional +BRAINTRUST_API_KEY= # For result aggregation +EVAL_ENV=local # Override default env +EVAL_TRIAL_COUNT=3 # Override trials +EVAL_MAX_CONCURRENCY=10 # Override concurrency +``` + +--- + +## Agent Tasks + +Agent tasks test multi-step autonomous execution. 
+ +### Agent Task Structure + +```typescript +// tasks/agent/my_agent_task.ts +import { EvalFunction } from "../../types/evals"; +import { V3Evaluator } from "@browserbasehq/stagehand"; + +export const my_agent_task: EvalFunction = async ({ + debugUrl, + sessionUrl, + v3, + v3Agent, + logger, +}) => { + try { + const page = v3.context.pages()[0]; + await page.goto("https://example.com"); + + // Execute agent task + const result = await v3Agent.execute({ + instruction: "Search for the latest news and summarize", + maxSteps: 20, + }); + + // Use V3Evaluator for LLM-based evaluation + const evaluator = new V3Evaluator(v3); + const { evaluation, reasoning } = await evaluator.ask({ + question: "Did the agent successfully complete the task?", + answer: result.message, + screenshot: true, + }); + + logger.log({ + message: "Agent evaluation", + level: 1, + auxiliary: { + evaluation: { value: evaluation, type: "string" }, + reasoning: { value: reasoning, type: "string" }, + }, + }); + + return { + _success: evaluation === "YES", + result: result.message, + debugUrl, + sessionUrl, + logs: logger.getLogs(), + }; + } catch (error) { + return { + _success: false, + error: JSON.parse(JSON.stringify(error, null, 2)), + debugUrl, + sessionUrl, + logs: logger.getLogs(), + }; + } finally { + await v3.close(); + } +}; +``` + +--- + +## V3Evaluator + +Use `V3Evaluator` for LLM-based pass/fail evaluation. + +```typescript +import { V3Evaluator } from "@browserbasehq/stagehand"; + +const evaluator = new V3Evaluator(v3); + +// Simple YES/NO evaluation +const { evaluation, reasoning } = await evaluator.ask({ + question: "Does the page show the search results?", + answer: "Page shows 10 search results", + screenshot: true, // Include current screenshot +}); + +// evaluation: "YES" | "NO" +// reasoning: "The screenshot shows..." +``` + +--- + +## External Benchmarks + +The `suites/` directory contains integrations with external benchmarks: + +### GAIA + +General AI Assistant benchmark for complex reasoning tasks. + +```bash +pnpm evals agent/gaia --trials=1 +``` + +### WebVoyager + +Web navigation and task completion benchmark. + +```bash +pnpm evals agent/webvoyager --trials=1 +``` + +### WebBench + +Real-world web automation across live sites. + +### OSWorld + +Chrome browser automation tasks. + +### OnlineMind2Web + +Real-world web interaction tasks. + +--- + +## Scoring + +Scoring functions in `scoring.ts`: + +```typescript +// Exact match: 1 for success, 0 for failure +export function exactMatch(result: { _success: boolean }): number { + return result._success ? 1 : 0; +} + +// Error match: Score based on error occurrence +export function errorMatch(result: { error?: unknown }): number { + return result.error ? 
0 : 1; +} +``` + +--- + +## Results + +### Output Format + +Results are written to `eval-summary.json`: + +```json +{ + "experimentName": "extract_browserbase_20251026035649", + "passed": [ + { + "eval": "extract_repo_name", + "model": "openai/gpt-4.1-mini", + "categories": ["extract"] + } + ], + "failed": [ + { + "eval": "extract_github_stars", + "model": "google/gemini-2.0-flash", + "categories": ["extract"], + "error": "Extraction mismatch" + } + ], + "summary": { + "total": 10, + "passed": 8, + "failed": 2, + "success_rate": 0.8 + } +} +``` + +--- + +## Default Models + +From `taskConfig.ts`: + +**Standard evals:** + +- `google/gemini-2.0-flash` +- `openai/gpt-4.1-mini` +- `anthropic/claude-haiku-4-5` + +**Agent evals:** + +- `anthropic/claude-sonnet-4-20250514` + +**CUA (Computer Use Agent) evals:** + +- `openai/computer-use-preview-2025-03-11` +- `google/gemini-2.5-computer-use-preview-10-2025` +- `anthropic/claude-sonnet-4-20250514` + +--- + +## Adding New Evals + +### 1. Create Task File + +```typescript +// tasks/my_task.ts +import { EvalFunction } from "../types/evals"; + +export const my_task: EvalFunction = async ({ + v3, + logger, + debugUrl, + sessionUrl, +}) => { + // Implementation +}; +``` + +### 2. Register in Config + +```json +// evals.config.json +{ + "tasks": [{ "name": "my_task", "categories": ["extract"] }] +} +``` + +### 3. Run and Verify + +```bash +pnpm evals my_task --trials=1 +``` + +### 4. Check Results + +```bash +cat eval-summary.json | jq '.passed[] | select(.eval == "my_task")' +``` + +--- + +## Best Practices + +1. **Always close Stagehand** - Use `finally` block with `await v3.close()` +2. **Log intermediate results** - Use `logger.log()` for debugging +3. **Handle errors gracefully** - Catch and return error details +4. **Use specific assertions** - Prefer exact match over fuzzy matching +5. **Test locally first** - Run with `--env=local` before Browserbase +6. **Keep tasks focused** - One clear objective per task +7. **Use V3Evaluator for complex checks** - When exact match isn't possible
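+
+To make practice 4 concrete, a hedged sketch of the difference inside a task's return value (the expected string `"facebook/react"` is a hypothetical target):
+
+```typescript
+// Specific assertion: fails loudly as soon as the extraction drifts.
+const expected = "facebook/react";
+const strictPass = extraction === expected;
+
+// Fuzzy check, shown only for contrast: it can pass on partial or wrong extractions.
+const fuzzyPass = typeof extraction === "string" && extraction.includes("react");
+
+return {
+  _success: strictPass,
+  extraction,
+  debugUrl,
+  sessionUrl,
+  logs: logger.getLogs(),
+};
+```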