diff --git a/CONTRIBUTING.claude.md b/CONTRIBUTING.claude.md new file mode 100644 index 000000000..6ab95f539 --- /dev/null +++ b/CONTRIBUTING.claude.md @@ -0,0 +1,332 @@ +# Chromie Contribution Guidelines + +This file contains guidelines for Chromie (AI coding assistant) when contributing to Stagehand. + +## Core Principle: Test-First Bug Fixing + +**Every bug fix MUST include a failing test that proves the bug exists.** + +### Workflow + +1. **Analyze the bug** - Understand the root cause +2. **Write a failing test** - Create a test that fails with the current code +3. **Verify the test fails** - Run `pnpm test` to confirm +4. **Implement the fix** - Write minimal code to fix the bug +5. **Verify the test passes** - Run `pnpm test` to confirm +6. **Create PR** - Submit with both test and fix + +### Why Test-First? + +- **Proves understanding** - The test demonstrates you understand the bug +- **Prevents regression** - The test catches if the bug returns +- **Documents behavior** - The test explains expected behavior +- **Validates the fix** - Green tests prove the fix works + +--- + +## Test Location Guide + +| Bug Type | Test Location | Test Framework | +|----------|---------------|----------------| +| Core handler bugs | `packages/core/lib/v3/tests/*.spec.ts` | Playwright Test | +| Extract failures | `packages/core/lib/v3/tests/` + `packages/evals/tasks/extract_*.ts` | Playwright + Evals | +| Act failures | `packages/core/lib/v3/tests/` + `packages/evals/tasks/` | Playwright + Evals | +| Agent bugs | `packages/core/lib/v3/tests/agent-*.spec.ts` | Playwright Test | +| Shadow DOM issues | `packages/core/lib/v3/tests/shadow-iframe.spec.ts` | Playwright Test | +| iframe issues | `packages/core/lib/v3/tests/frame-*.spec.ts` | Playwright Test | + +### When to Write Playwright Tests vs Evals + +**Playwright Tests** (`packages/core/lib/v3/tests/`): +- Unit/integration tests for specific functionality +- Fast, deterministic tests +- Can use mock servers or static HTML +- Run with `pnpm test` + +**Evals** (`packages/evals/tasks/`): +- End-to-end tests against real websites +- Test real-world scenarios +- May be flaky due to site changes +- Run with `pnpm evals` + +--- + +## Test Template + +### Playwright Test + +```typescript +// packages/core/lib/v3/tests/bug-description.spec.ts +import { test, expect } from "@playwright/test"; +import { Stagehand } from "@browserbasehq/stagehand"; + +test.describe("Bug: [Description]", () => { + let stagehand: Stagehand; + + test.beforeEach(async () => { + stagehand = new Stagehand({ env: "LOCAL", verbose: 0 }); + await stagehand.init(); + }); + + test.afterEach(async () => { + await stagehand.close(); + }); + + test("should [expected behavior]", async () => { + const page = stagehand.context.pages()[0]; + + // Setup: Navigate to test page or mock scenario + await page.goto("https://example.com"); + + // Action: Perform the operation that was broken + const result = await stagehand.extract("extract the title"); + + // Assert: Verify expected behavior + expect(result.extraction).toBe("Expected Title"); + }); +}); +``` + +### Eval Task + +```typescript +// packages/evals/tasks/bug_description.ts +import { EvalFunction } from "../types/evals"; + +export const bug_description: EvalFunction = async ({ + debugUrl, + sessionUrl, + v3, + logger, +}) => { + try { + const page = v3.context.pages()[0]; + await page.goto("https://example.com"); + + // Perform the operation that was broken + const result = await v3.extract("extract the data"); + + logger.log({ + message: 
"Extraction result", + level: 1, + auxiliary: { result: { value: result, type: "object" } }, + }); + + // Assert expected behavior + return { + _success: result.extraction === "expected value", + result, + debugUrl, + sessionUrl, + logs: logger.getLogs(), + }; + } catch (error) { + return { + _success: false, + error: JSON.parse(JSON.stringify(error, null, 2)), + debugUrl, + sessionUrl, + logs: logger.getLogs(), + }; + } finally { + await v3.close(); + } +}; +``` + +--- + +## Common Bug Patterns + +### Selector/Element Not Found + +**Symptoms**: `XPathResolutionError`, `StagehandElementNotFoundError` + +**Test approach**: +```typescript +test("should handle dynamic content loading", async () => { + // Navigate to page with dynamic content + // Wait for content to load + // Perform action + // Assert success +}); +``` + +### Shadow DOM Issues + +**Symptoms**: `StagehandShadowRootMissingError`, elements inside shadow DOM not found + +**Test approach**: +```typescript +test("should interact with shadow DOM elements", async () => { + // Navigate to page with shadow DOM + // Use deepLocator or appropriate method + // Assert element is found and actionable +}); +``` + +### Timeout Issues + +**Symptoms**: `ActTimeoutError`, `ExtractTimeoutError` + +**Test approach**: +```typescript +test("should complete within timeout", async () => { + // Set explicit timeout + // Perform action that was timing out + // Assert completes successfully +}); +``` + +### LLM Response Parsing + +**Symptoms**: `ZodSchemaValidationError`, malformed extraction + +**Test approach**: +```typescript +test("should extract data matching schema", async () => { + // Define schema + // Extract with schema + // Assert structure matches +}); +``` + +--- + +## PR Requirements + +### Title Format + +``` +fix(component): brief description of fix + +Examples: +fix(actHandler): handle shadow DOM elements in nested iframes +fix(extractHandler): parse URLs correctly when schema uses z.string().url() +fix(agent): prevent infinite loop when task is already complete +``` + +### PR Body Template + +```markdown +## Problem + +[Describe the bug - what was happening?] + +## Root Cause + +[What caused the bug?] + +## Solution + +[How does this fix address the root cause?] 
+ +## Test Plan + +- [ ] Added failing test that reproduces the bug +- [ ] Verified test fails before fix +- [ ] Verified test passes after fix +- [ ] Ran full test suite: `pnpm test` +- [ ] (If applicable) Ran relevant evals: `pnpm evals [category]` + +## Related Issues + +Fixes #[issue-number] +``` + +--- + +## Code Style + +### Follow Existing Patterns + +- Look at surrounding code for style guidance +- Use existing utilities (don't reinvent) +- Follow handler pattern for new functionality +- Keep changes minimal and focused + +### Error Handling + +- Throw specific error types from `types/public/sdkErrors.ts` +- Include helpful error messages +- Log at appropriate verbosity levels + +### TypeScript + +- Use strict types (no `any` unless necessary) +- Export types from `types/public/` for public API +- Keep internal types in `types/private/` + +--- + +## Running Tests + +### Before Submitting PR + +```bash +# Build the project +pnpm build + +# Run all tests +pnpm test + +# Run specific test file +pnpm test packages/core/lib/v3/tests/my-test.spec.ts + +# Run e2e tests locally +pnpm e2e:local + +# Run relevant evals (if applicable) +pnpm evals [task-name] +``` + +### Test Environment + +- **Local testing**: Uses local Chrome via `chrome-launcher` +- **Browserbase testing**: Uses remote browser sessions +- Default to local for faster iteration + +--- + +## Common Mistakes to Avoid + +1. **Don't skip the failing test** - Every fix needs a test +2. **Don't modify unrelated code** - Keep changes focused +3. **Don't add unnecessary abstractions** - Simpler is better +4. **Don't forget to close Stagehand** - Always use `finally { await v3.close() }` +5. **Don't hardcode timeouts** - Use configurable values +6. **Don't ignore TypeScript errors** - Fix them properly +7. **Don't add console.log** - Use the logger system + +--- + +## Key Files for Common Fixes + +| Issue Type | Key Files | +|------------|-----------| +| Action execution | `handlers/actHandler.ts`, `understudy/page.ts` | +| Data extraction | `handlers/extractHandler.ts`, `utils.ts` | +| Element selection | `understudy/a11y/snapshot.ts`, `dom/` | +| Shadow DOM | `understudy/page.ts`, `dom/` | +| Agent behavior | `handlers/v3AgentHandler.ts`, `agent/tools/` | +| Timeouts | `handlers/handlerUtils/timeoutGuard.ts` | +| LLM inference | `llm/LLMClient.ts`, `inference.ts` | +| Error handling | `types/public/sdkErrors.ts` | + +--- + +## Escalation Context + +When receiving an escalation from Slack, extract: + +1. **Bug description** - What's not working? +2. **Reproduction steps** - How to trigger the bug? +3. **Error message** - Exact error text if available +4. **Environment** - Local or Browserbase? Which model? +5. **Code snippet** - User's code if provided + +Use this context to: +1. Write a focused test that reproduces the issue +2. Identify the root cause in the codebase +3. Implement a minimal fix +4. Verify the fix addresses the original report diff --git a/packages/core/CLAUDE.md b/packages/core/CLAUDE.md new file mode 100644 index 000000000..f7a770ffd --- /dev/null +++ b/packages/core/CLAUDE.md @@ -0,0 +1,410 @@ +# Stagehand Core Package + +This is the main Stagehand SDK (`@browserbasehq/stagehand`). It contains the V3 implementation of the browser automation framework. 
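+
+A minimal end-to-end sketch, using only calls documented in this file (the URL and model name are placeholders):
+
+```typescript
+import { Stagehand } from "@browserbasehq/stagehand";
+import { z } from "zod";
+
+// Launch local Chrome, run one structured extraction, then clean up.
+const stagehand = new Stagehand({ env: "LOCAL", model: "openai/gpt-4.1-mini" });
+await stagehand.init();
+
+try {
+  const page = stagehand.context.pages()[0];
+  await page.goto("https://example.com");
+
+  // The result is validated against the Zod schema before it is returned.
+  const data = await stagehand.extract(
+    "get the page title",
+    z.object({ title: z.string() }),
+  );
+  // data.title is now a plain string.
+} finally {
+  await stagehand.close();
+}
+```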
+ +## Directory Structure + +``` +packages/core/ +├── lib/v3/ # V3 implementation +│ ├── v3.ts # Main orchestrator (exported as Stagehand) +│ ├── index.ts # Public exports +│ ├── handlers/ # API handlers +│ │ ├── actHandler.ts # Action execution +│ │ ├── extractHandler.ts # Data extraction +│ │ ├── observeHandler.ts # Action planning +│ │ ├── v3AgentHandler.ts # Tools-based agent +│ │ ├── v3CuaAgentHandler.ts # Computer Use Agent +│ │ └── handlerUtils/ # Shared handler utilities +│ ├── understudy/ # CDP browser abstraction +│ │ ├── context.ts # V3Context - manages browser context +│ │ ├── page.ts # Page abstraction +│ │ ├── frame.ts # Frame abstraction +│ │ └── a11y/ # Accessibility tree utilities +│ │ └── snapshot.ts # captureHybridSnapshot() +│ ├── llm/ # LLM abstraction +│ │ ├── LLMClient.ts # Base LLM interface +│ │ └── LLMProvider.ts # Provider factory (57+ models) +│ ├── launch/ # Browser launch +│ │ ├── local.ts # Local Chrome (chrome-launcher) +│ │ └── browserbase.ts # Browserbase sessions +│ ├── dom/ # DOM scripts +│ │ └── *.ts # Scripts injected into pages +│ ├── agent/ # Agent components +│ │ ├── tools/ # Built-in agent tools +│ │ └── utils/ # Agent utilities +│ ├── types/ # TypeScript types +│ │ ├── public/ # Exported types +│ │ │ ├── methods.ts # act/extract/observe types +│ │ │ ├── sdkErrors.ts # Error classes +│ │ │ ├── model.ts # Model types +│ │ │ └── agent.ts # Agent types +│ │ └── private/ # Internal types +│ ├── cache/ # Caching utilities +│ │ ├── ActCache.ts # Action result caching +│ │ └── AgentCache.ts # Agent state caching +│ ├── mcp/ # Model Context Protocol +│ └── tests/ # Playwright tests +└── examples/ # Usage examples +``` + +--- + +## Core Class: V3 (Stagehand) + +**File**: `lib/v3/v3.ts` + +The `V3` class is the main entry point, exported as `Stagehand`. It orchestrates: + +1. **Browser lifecycle**: Launch/connect to browser, create context +2. **Handler delegation**: Route API calls to appropriate handlers +3. **LLM management**: Resolve model clients per-call or globally +4. **Metrics tracking**: Token usage, inference time + +### Initialization Flow + +```typescript +const stagehand = new Stagehand({ env: "LOCAL", model: "openai/gpt-4.1-mini" }); +await stagehand.init(); +``` + +**What happens in `init()`:** + +1. Load environment variables (`.env`) +2. Launch browser: + - `env: "LOCAL"`: `launchLocalChrome()` via chrome-launcher + - `env: "BROWSERBASE"`: `createBrowserbaseSession()` via SDK +3. Connect to CDP WebSocket (15s timeout) +4. Create `V3Context` from CDP connection +5. Initialize handlers: `ActHandler`, `ExtractHandler`, `ObserveHandler` +6. Wait for first page to load + +### Key Properties + +```typescript +stagehand.context; // V3Context - browser context management +stagehand.llmClient; // LLMClient - current LLM client +stagehand.browserbaseSessionId; // Session ID (if using Browserbase) +``` + +--- + +## Handler Pattern + +Each core API has a dedicated handler class. Handlers are stateless and receive all dependencies via constructor. + +### Common Handler Structure + +```typescript +export class FooHandler { + private readonly llmClient: LLMClient; + private readonly resolveLlmClient: (model?: ModelConfiguration) => LLMClient; + private readonly onMetrics?: (...) => void; + + constructor(llmClient, defaultModel, clientOptions, resolveLlmClient, ...) { + // Store dependencies + } + + async foo(params: FooHandlerParams): Promise { + // 1. Capture snapshot + // 2. Send to LLM + // 3. Process response + // 4. 
Execute action (if applicable) + // 5. Return result + } +} +``` + +### ActHandler (`handlers/actHandler.ts`) + +Executes single atomic actions on the page. + +**Flow:** + +1. Capture hybrid snapshot (accessibility tree + element mappings) +2. Build prompt with instruction + DOM elements +3. Send to LLM via `actInference()` +4. LLM returns: `{ elementId, method, arguments }` +5. Map elementId to XPath selector +6. Execute action via `performUnderstudyMethod()` +7. Wait for DOM/network quiet + +**Self-healing (when enabled):** + +- If action fails, retake screenshot +- Re-prompt LLM with error context +- Retry up to 3 times + +**Key methods:** + +- `act(params)`: Main action execution +- `takeDeterministicAction(page, action)`: Direct Action → execute (skip LLM) + +### ExtractHandler (`handlers/extractHandler.ts`) + +Extracts structured data from pages using Zod schemas. + +**Flow:** + +1. If no instruction: return raw page text (accessibility tree) +2. Transform schema (convert `z.string().url()` to numeric IDs) +3. Capture hybrid snapshot +4. Send to LLM via `runExtract()` +5. LLM returns structured data matching schema +6. Inject real URLs back into numeric ID placeholders +7. Validate against Zod schema + +**URL handling:** + +- `z.string().url()` fields are replaced with `z.number()` before LLM call +- LLM returns numeric IDs referencing elements in DOM +- IDs are mapped back to actual URLs after extraction +- This prevents URL hallucination + +### ObserveHandler (`handlers/observeHandler.ts`) + +Plans actions without executing them. Returns candidate actions. + +**Flow:** + +1. Capture hybrid snapshot +2. If instruction provided: find matching elements +3. If no instruction: return all interactive elements +4. Build Action objects with XPath selectors +5. Return Action[] for user to choose from + +**Use case:** Observe + Act pattern - plan once, execute later + +### V3AgentHandler (`handlers/v3AgentHandler.ts`) + +Multi-step autonomous execution using AI SDK tools. + +**Tools available to agent:** + +- `act`: Execute single action +- `extract`: Extract data +- `observe`: Plan actions +- `screenshot`: Capture page screenshot +- `goto`: Navigate to URL +- `scroll`: Scroll page +- `wait`: Wait for time/condition +- `close`: Close page + +**Flow:** + +1. Create AI SDK messages with system prompt +2. Loop until max_steps or task complete: + - Call LLM with current state + - Execute tool calls + - Append results to messages +3. Return final result with message, actions, reasoning + +### V3CuaAgentHandler (`handlers/v3CuaAgentHandler.ts`) + +Computer Use Agent for Claude Sonnet 4 or Gemini 2.5 computer-use models. + +**Difference from V3AgentHandler:** + +- Direct browser control without Stagehand tool wrapping +- Uses native computer-use capabilities of the model +- Enabled via `agent({ cua: true })` + +--- + +## CDP Abstraction (understudy/) + +The `understudy` directory contains the Chrome DevTools Protocol abstraction layer. + +### V3Context (`understudy/context.ts`) + +Manages the browser context and page lifecycle. + +**Responsibilities:** + +- Own root CDP connection (`CdpConnection`) +- Manage `Page` objects (one per browser tab) +- Handle Target events (new tabs, closes) +- Track frame topology and OOPIF adoption + +**Key methods:** + +```typescript +context.pages(); // Get all Page objects +context.newPage(); // Create new page/tab +context.activePage; // Get current active page +``` + +### Page (`understudy/page.ts`) + +Abstraction over a browser tab. 
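+
+For orientation, a short usage sketch combining the calls listed below (that `evaluate` resolves to the callback's return value is an assumption, mirroring the Playwright-style convention):
+
+```typescript
+// Navigate the first tab and read its title via in-page JavaScript.
+const page = stagehand.context.pages()[0];
+await page.goto("https://example.com");
+const title = await page.evaluate(() => document.title);
+```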
+ +**Key methods:** + +```typescript +page.goto(url); // Navigate +page.screenshot(); // Capture screenshot +page.evaluate(fn); // Run JS in page context +page.locator(selector); // Get element locator +page.deepLocator(xpath); // XPath across shadow DOM/iframes +``` + +### Snapshots (`understudy/a11y/snapshot.ts`) + +**`captureHybridSnapshot()`:** + +- Captures accessibility tree +- Maps element IDs to XPath selectors +- Used by all handlers for LLM context + +--- + +## LLM Abstraction (llm/) + +### LLMClient (`llm/LLMClient.ts`) + +Base interface for LLM clients. + +```typescript +interface LLMClient { + createChatCompletion(params): Promise; + // Model-specific implementations +} +``` + +### LLMProvider (`llm/LLMProvider.ts`) + +Factory for creating LLM clients by model name. + +**Supported providers:** + +- OpenAI: `openai/gpt-4.1`, `openai/gpt-4.1-mini`, etc. +- Anthropic: `anthropic/claude-sonnet-4`, `anthropic/claude-haiku-4-5` +- Google: `google/gemini-2.0-flash`, `google/gemini-2.5-*` +- Others: Together, Groq, Cerebras, Mistral, xAI, Perplexity, Ollama + +--- + +## Error Classes (`types/public/sdkErrors.ts`) + +All errors extend `StagehandError`. + +| Error | When Thrown | +| --------------------------------- | ---------------------------------- | +| `StagehandNotInitializedError` | Calling methods before `init()` | +| `MissingEnvironmentVariableError` | Missing API keys | +| `ConnectionTimeoutError` | Can't connect to Chrome (15s) | +| `ActTimeoutError` | `act()` exceeds timeout | +| `ExtractTimeoutError` | `extract()` exceeds timeout | +| `ObserveTimeoutError` | `observe()` exceeds timeout | +| `XPathResolutionError` | Selector doesn't match element | +| `StagehandElementNotFoundError` | Element not in DOM | +| `StagehandShadowRootMissingError` | Shadow DOM pierce failed | +| `ZodSchemaValidationError` | LLM output doesn't match schema | +| `AgentAbortError` | Agent execution cancelled | +| `CuaModelRequiredError` | Using CUA without compatible model | + +--- + +## Testing + +### Test Location + +Tests are in `lib/v3/tests/` using Playwright Test. + +### Test Configurations + +- `v3.playwright.config.ts`: Default (parallel, 90s timeout) +- `v3.local.playwright.config.ts`: Local Chrome testing +- `v3.bb.playwright.config.ts`: Browserbase testing + +### Running Tests + +```bash +# From repo root +pnpm test # Default config +pnpm e2e:local # Local Chrome +pnpm e2e:bb # Browserbase + +# From packages/core +pnpm test +``` + +### Writing Tests + +```typescript +import { test, expect } from "@playwright/test"; +import { Stagehand } from "@browserbasehq/stagehand"; + +test("should extract data", async () => { + const stagehand = new Stagehand({ env: "LOCAL" }); + await stagehand.init(); + + const page = stagehand.context.pages()[0]; + await page.goto("https://example.com"); + + const data = await stagehand.extract( + "get the title", + z.object({ + title: z.string(), + }), + ); + + expect(data.title).toBeDefined(); + await stagehand.close(); +}); +``` + +--- + +## Key Patterns + +### Snapshot-Based AI + +All AI operations use accessibility tree snapshots, not live DOM queries. This ensures determinism and avoids race conditions. + +### Element ID → XPath Mapping + +1. `captureHybridSnapshot()` assigns numeric IDs to elements +2. LLM references elements by ID +3. 
IDs are mapped back to XPath for execution + +### Per-Call Model Override + +Any method can override the default model: + +```typescript +await stagehand.act("click button", { model: "anthropic/claude-sonnet-4" }); +``` + +### Metrics Tracking + +All handlers report: + +- `promptTokens`, `completionTokens`, `reasoningTokens` +- `cachedInputTokens`, `inferenceTimeMs` + +--- + +## Common Modifications + +### Adding a New Handler Method + +1. Add type to `types/public/methods.ts` +2. Create handler class in `handlers/` +3. Add handler to V3 constructor in `v3.ts` +4. Expose method on V3 class +5. Export from `index.ts` +6. Add tests in `tests/` + +### Adding a New LLM Provider + +1. Add provider client in `llm/` +2. Register in `LLMProvider.ts` +3. Add model names to `AvailableModel` type +4. Test with existing evals + +### Adding a New Agent Tool + +1. Add tool definition in `agent/tools/` +2. Register in `v3AgentHandler.ts` tools array +3. Add tests for new tool behavior diff --git a/packages/evals/CLAUDE.md b/packages/evals/CLAUDE.md new file mode 100644 index 000000000..15dd57a85 --- /dev/null +++ b/packages/evals/CLAUDE.md @@ -0,0 +1,461 @@ +# Stagehand Evals Package + +This package contains the evaluation suite for Stagehand. It provides a framework for running automated tests against live websites to measure Stagehand's capabilities. + +## Quick Start + +```bash +# Run all evals +pnpm evals + +# Run specific task +pnpm evals extract_repo_name + +# Run by category +pnpm evals observe_* + +# Run with options +pnpm evals --env=browserbase --trials=5 +``` + +## Directory Structure + +``` +packages/evals/ +├── tasks/ # Individual eval task files (126 tasks) +│ ├── agent/ # Agent-specific tasks (30+) +│ ├── extract_*.ts # Extract tasks +│ ├── observe_*.ts # Observe tasks +├── suites/ # External benchmark suites +│ ├── gaia.ts # GAIA benchmark +│ ├── webvoyager.ts # WebVoyager benchmark +│ └── onlineMind2Web.ts # OnlineMind2Web benchmark +├── types/ +│ └── evals.ts # Type definitions +├── evals.config.json # Task registry and configuration +├── run.ts # CLI entry point +├── index.eval.ts # Braintrust eval orchestrator +├── taskConfig.ts # Model and task configuration +├── scoring.ts # Scoring functions +├── logger.ts # EvalLogger utility +├── initV3.ts # Stagehand initialization +└── summary.ts # Result aggregation +``` + +--- + +## Writing Eval Tasks + +### Basic Task Structure + +Create a new file in `tasks/`: + +```typescript +// tasks/my_new_task.ts +import { EvalFunction } from "../types/evals"; + +export const my_new_task: EvalFunction = async ({ + debugUrl, + sessionUrl, + v3, + logger, +}) => { + try { + const page = v3.context.pages()[0]; + await page.goto("https://example.com"); + + // Perform actions + await v3.act("click the button"); + + // Extract data + const { extraction } = await v3.extract("get the result"); + + // Log intermediate results + logger.log({ + message: "Extracted result", + level: 1, + auxiliary: { + result: { value: extraction, type: "object" }, + }, + }); + + // Return success/failure + return { + _success: extraction === "expected value", + extraction, + debugUrl, + sessionUrl, + logs: logger.getLogs(), + }; + } catch (error) { + return { + _success: false, + error: JSON.parse(JSON.stringify(error, null, 2)), + debugUrl, + sessionUrl, + logs: logger.getLogs(), + }; + } finally { + await v3.close(); + } +}; +``` + +### Register the Task + +Add to `evals.config.json`: + +```json +{ + "tasks": [ + { + "name": "my_new_task", + "categories": ["extract"] + } + ] +} 
+``` + +### EvalFunction Signature + +```typescript +type EvalFunction = (taskInput: { + v3: V3; // Stagehand instance + v3Agent?: AgentInstance; // Agent instance (for agent tasks) + logger: EvalLogger; // Logging utility + debugUrl: string; // Debug URL for session + sessionUrl: string; // Browserbase session URL + modelName: AvailableModel; // Current model being tested + input: EvalInput; // Task input with params +}) => Promise<{ + _success: boolean; // Pass/fail + logs: LogLine[]; // Captured logs + debugUrl: string; // Debug URL + sessionUrl: string; // Session URL + error?: unknown; // Error if failed +}>; +``` + +--- + +## Task Categories + +| Category | Description | Example Tasks | +| --------------------------- | ------------------------------ | ----------------------------------------------- | +| `act` | Single action execution | `amazon_add_to_cart`, `dropdown`, `login` | +| `extract` | Data extraction | `extract_repo_name`, `extract_github_stars` | +| `observe` | Action planning | `observe_github`, `observe_amazon_add_to_cart` | +| `combination` | Multi-step workflows | `arxiv`, `allrecipes`, `peeler_complex` | +| `agent` | Agent-based tasks | `agent/google_flights`, `agent/sf_library_card` | +| `targeted_extract` | Extract from specific selector | `extract_recipe`, `extract_hamilton_weather` | +| `regression` | Regression tests | `wichita`, `heal_simple_google_search` | +| `experimental` | Experimental features | `apple`, `costar` | +| `llm_clients` | LLM provider tests | `hn_aisdk`, `hn_langchain` | +| `external_agent_benchmarks` | External benchmarks | `agent/gaia`, `agent/webvoyager` | + +--- + +## Running Evals + +### Command Line Options + +```bash +pnpm evals [task-name] [options] +``` + +| Option | Description | Default | +| -------------------------------------- | ------------------------- | -------------- | +| `--env=local\|browserbase` | Environment | `local` | +| `--trials=N` | Number of trials per eval | `3` | +| `--concurrency=N` | Max parallel sessions | `10` | +| `--provider=openai\|anthropic\|google` | Model provider filter | all | +| `--model=MODEL_NAME` | Specific model | default models | +| `--api=true\|false` | Use API mode | `false` | +| `--max_k=N` | Limit number of evals | unlimited | + +### Examples + +```bash +# Run extract tasks locally +pnpm evals extract_* + +# Run with Browserbase +pnpm evals amazon_add_to_cart --env=browserbase + +# Run 5 trials with specific model +pnpm evals observe_github --trials=5 --model=anthropic/claude-sonnet-4 + +# Run agent tasks with high concurrency +pnpm evals agent/* --concurrency=20 +``` + +### Environment Variables + +```bash +# Required for Browserbase +BROWSERBASE_API_KEY= +BROWSERBASE_PROJECT_ID= + +# LLM API Keys +OPENAI_API_KEY= +ANTHROPIC_API_KEY= +GOOGLE_GENERATIVE_AI_API_KEY= + +# Optional +BRAINTRUST_API_KEY= # For result aggregation +EVAL_ENV=local # Override default env +EVAL_TRIAL_COUNT=3 # Override trials +EVAL_MAX_CONCURRENCY=10 # Override concurrency +``` + +--- + +## Agent Tasks + +Agent tasks test multi-step autonomous execution. 
+ +### Agent Task Structure + +```typescript +// tasks/agent/my_agent_task.ts +import { EvalFunction } from "../../types/evals"; +import { V3Evaluator } from "@browserbasehq/stagehand"; + +export const my_agent_task: EvalFunction = async ({ + debugUrl, + sessionUrl, + v3, + v3Agent, + logger, +}) => { + try { + const page = v3.context.pages()[0]; + await page.goto("https://example.com"); + + // Execute agent task + const result = await v3Agent.execute({ + instruction: "Search for the latest news and summarize", + maxSteps: 20, + }); + + // Use V3Evaluator for LLM-based evaluation + const evaluator = new V3Evaluator(v3); + const { evaluation, reasoning } = await evaluator.ask({ + question: "Did the agent successfully complete the task?", + answer: result.message, + screenshot: true, + }); + + logger.log({ + message: "Agent evaluation", + level: 1, + auxiliary: { + evaluation: { value: evaluation, type: "string" }, + reasoning: { value: reasoning, type: "string" }, + }, + }); + + return { + _success: evaluation === "YES", + result: result.message, + debugUrl, + sessionUrl, + logs: logger.getLogs(), + }; + } catch (error) { + return { + _success: false, + error: JSON.parse(JSON.stringify(error, null, 2)), + debugUrl, + sessionUrl, + logs: logger.getLogs(), + }; + } finally { + await v3.close(); + } +}; +``` + +--- + +## V3Evaluator + +Use `V3Evaluator` for LLM-based pass/fail evaluation. + +```typescript +import { V3Evaluator } from "@browserbasehq/stagehand"; + +const evaluator = new V3Evaluator(v3); + +// Simple YES/NO evaluation +const { evaluation, reasoning } = await evaluator.ask({ + question: "Does the page show the search results?", + answer: "Page shows 10 search results", + screenshot: true, // Include current screenshot +}); + +// evaluation: "YES" | "NO" +// reasoning: "The screenshot shows..." +``` + +--- + +## External Benchmarks + +The `suites/` directory contains integrations with external benchmarks: + +### GAIA + +General AI Assistant benchmark for complex reasoning tasks. + +```bash +pnpm evals agent/gaia --trials=1 +``` + +### WebVoyager + +Web navigation and task completion benchmark. + +```bash +pnpm evals agent/webvoyager --trials=1 +``` + +### WebBench + +Real-world web automation across live sites. + +### OSWorld + +Chrome browser automation tasks. + +### OnlineMind2Web + +Real-world web interaction tasks. + +--- + +## Scoring + +Scoring functions in `scoring.ts`: + +```typescript +// Exact match: 1 for success, 0 for failure +export function exactMatch(result: { _success: boolean }): number { + return result._success ? 1 : 0; +} + +// Error match: Score based on error occurrence +export function errorMatch(result: { error?: unknown }): number { + return result.error ? 
0 : 1; +} +``` + +--- + +## Results + +### Output Format + +Results are written to `eval-summary.json`: + +```json +{ + "experimentName": "extract_browserbase_20251026035649", + "passed": [ + { + "eval": "extract_repo_name", + "model": "openai/gpt-4.1-mini", + "categories": ["extract"] + } + ], + "failed": [ + { + "eval": "extract_github_stars", + "model": "google/gemini-2.0-flash", + "categories": ["extract"], + "error": "Extraction mismatch" + } + ], + "summary": { + "total": 10, + "passed": 8, + "failed": 2, + "success_rate": 0.8 + } +} +``` + +--- + +## Default Models + +From `taskConfig.ts`: + +**Standard evals:** + +- `google/gemini-2.0-flash` +- `openai/gpt-4.1-mini` +- `anthropic/claude-haiku-4-5` + +**Agent evals:** + +- `anthropic/claude-sonnet-4-20250514` + +**CUA (Computer Use Agent) evals:** + +- `openai/computer-use-preview-2025-03-11` +- `google/gemini-2.5-computer-use-preview-10-2025` +- `anthropic/claude-sonnet-4-20250514` + +--- + +## Adding New Evals + +### 1. Create Task File + +```typescript +// tasks/my_task.ts +import { EvalFunction } from "../types/evals"; + +export const my_task: EvalFunction = async ({ + v3, + logger, + debugUrl, + sessionUrl, +}) => { + // Implementation +}; +``` + +### 2. Register in Config + +```json +// evals.config.json +{ + "tasks": [{ "name": "my_task", "categories": ["extract"] }] +} +``` + +### 3. Run and Verify + +```bash +pnpm evals my_task --trials=1 +``` + +### 4. Check Results + +```bash +cat eval-summary.json | jq '.passed[] | select(.eval == "my_task")' +``` + +--- + +## Best Practices + +1. **Always close Stagehand** - Use `finally` block with `await v3.close()` +2. **Log intermediate results** - Use `logger.log()` for debugging +3. **Handle errors gracefully** - Catch and return error details +4. **Use specific assertions** - Prefer exact match over fuzzy matching +5. **Test locally first** - Run with `--env=local` before Browserbase +6. **Keep tasks focused** - One clear objective per task +7. **Use V3Evaluator for complex checks** - When exact match isn't possible
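+
+To make practice 4 concrete, a hedged sketch of the difference inside a task's return value (the expected string `"facebook/react"` is a hypothetical target):
+
+```typescript
+// Specific assertion: fails loudly as soon as the extraction drifts.
+const expected = "facebook/react";
+const strictPass = extraction === expected;
+
+// Fuzzy check, shown only for contrast: it can pass on partial or wrong extractions.
+const fuzzyPass = typeof extraction === "string" && extraction.includes("react");
+
+return {
+  _success: strictPass,
+  extraction,
+  debugUrl,
+  sessionUrl,
+  logs: logger.getLogs(),
+};
+```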