From 65f5897f98ec8d25bfba2792eecbb3e9c1d06252 Mon Sep 17 00:00:00 2001
From: Kevin Liu
Date: Wed, 26 Feb 2025 10:26:48 -0800
Subject: [PATCH] [nanoeval] update readme (#45)

---
 project/nanoeval/README.md | 61 +++++++++++++++++++++++++++-----------
 1 file changed, 43 insertions(+), 18 deletions(-)

diff --git a/project/nanoeval/README.md b/project/nanoeval/README.md
index 3d129ca..3127d58 100644
--- a/project/nanoeval/README.md
+++ b/project/nanoeval/README.md
@@ -1,6 +1,17 @@
 # nanoeval

-Simple, ergonomic, and high performance evals.
+Simple, ergonomic, and high-performance evals. We use it at OpenAI as part of our infrastructure to run Preparedness evaluations.
+
+# Installation
+
+```bash
+# Using https://github.com/astral-sh/uv (recommended)
+uv add "git+https://github.com/openai/SWELancer-Benchmark#egg=nanoeval&subdirectory=project/nanoeval"
+# Using pip
+pip install "git+https://github.com/openai/SWELancer-Benchmark#egg=nanoeval&subdirectory=project/nanoeval"
+```
+
+nanoeval is pre-release software and may have breaking changes, so we recommend pinning your installation to a specific commit. The uv command above does this for you.

 # Principles

@@ -13,8 +24,8 @@
 - `Eval` - A [chz](https://github.com/openai/chz) class. Enumerates a set of tasks, and (typically) uses a "Solver" to solve them and then records the results. Can be configured in code or on the CLI using a chz entrypoint.
 - `EvalSpec` - An eval to run and runtime characteristics of how to run it (i.e. concurrency, recording, other administrivia)
-- `Task` - A separable, scoreable unit of work.
-- `Solver` - A strategy (usually involving sampling a model) to go from a task to a result that can be scored. For example, there may be different ways to prompt a model to answer a multiple-choice question (i.e. looking at logits, using consensus, etc)
+- `Task` - A single scoreable unit of work.
+- `Solver` - A strategy (usually involving sampling a model) to go from a task to a result that can be scored. For example, there may be different ways to prompt a model to answer a multiple-choice question (e.g. looking at logits, few-shot prompting, etc.)

 # Running your first eval

@@ -33,9 +44,15 @@ The executors can operate in two modes:

 1. **In-process:** The executor is just an async task running in the same process as the main eval script. The default.
 2. **Multiprocessing:** Starts a pool of executor processes that all poll the db. Use this via `spec.runner.experimental_use_multiprocessing=True`.

-## The monitor
+## Performance
+
+nanoeval has been tested at up to ~5,000 concurrent rollouts; it can likely go higher.
+
+For the highest performance, use multiprocessing with as many processes as your system's memory and core count allow. See `RunnerArgs` for documentation.
+
+## Monitoring

-Nanoeval has a tiny built-in monitor to track ongoing evals. It's a streamlit that visualizes the state of the internal run state database. This can be helpful to diagnose hangs on specific tasks. To use it:
+nanoeval has a tiny built-in monitor to track ongoing evals. It's a Streamlit app that visualizes the state of the internal run-state database, which can be helpful for diagnosing hangs on specific tasks. To use it:

 ```bash
 # either set spec.runner.use_monitor=True OR run this command:
 python3 -m nanoeval.bin.mon
 ```

@@ -44,25 +61,28 @@

 ## Resumption

-Because nanoeval uses a persistent database to track the state of individual tasks in a run, this means you can restart an in-progress eval if it crashes. To do this:
+Because nanoeval uses a persistent database to track the state of individual tasks in a run, you can restart an in-progress eval if it crashes. (In-progress rollouts will be restarted from scratch, but completed rollouts will be saved.) To do this:

 ```bash
-python3 -m nanoeval.extras.resume db_name=
+# Restarts the eval in a new process
+python3 -m nanoeval.extras.resume run_set_id=...
 ```

-The `db_name` is typically autogenerated and looks something like `-`. You can list all your databases with:
+You can list all run sets (databases) with the following command:

 ```bash
-ls -lh ~/Library/Application Support/nanoeval/run_state/
+ls -lh "$(python3 -c "from nanoeval.fs_paths import database_dir; print(database_dir())")"
 ```

-# **Writing your first eval**
+The run set ID for each database is simply the filename, without the `.db*` suffix.
+
+# Writing your first eval

 An eval is just a `chz` class that defines `get_name()`, `get_tasks()`, `evaluate()` and `get_summary()`. Start with `gpqa_simple.py`; copy it and modify it to suit your needs. If necessary, drop down to the base `nanoeval.Eval` class instead of using `MCQEval`.
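+
+For orientation, here is a minimal sketch of such a class. The four method names come from the paragraph above, but the signatures, types, and `chz` usage shown here are illustrative assumptions; treat `gpqa_simple.py` as the source of truth.
+
+```python
+import chz
+
+from nanoeval import Eval  # the base class mentioned above
+
+
+@chz.chz
+class ArithmeticEval(Eval):
+    """Hypothetical toy eval that scores hardcoded arithmetic answers."""
+
+    def get_name(self) -> str:
+        return "arithmetic_smoke_test"
+
+    async def get_tasks(self):
+        # Each task is a single scoreable unit of work.
+        return [{"question": "1 + 1", "answer": "2"}]
+
+    async def evaluate(self, task):
+        # A real eval would usually invoke a Solver / sample a model here.
+        model_output = "2"  # pretend the model answered "2"
+        return {"correct": model_output == task["answer"]}
+
+    def get_summary(self, results):
+        # Aggregate per-task results into a single summary dict.
+        return {"accuracy": sum(r["correct"] for r in results) / len(results)}
+```
+
+Because the class is a `chz` class, its fields can then be configured in code or on the CLI via a chz entrypoint, as noted in the Concepts section above.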
 
 The following sections describe common use case needs and how to achieve them.

-## **Public API**
+## Public API

 You may import code from any `nanoeval.*` package that does not start with an underscore. Functions and classes that start with an underscore are considered private.

@@ -90,18 +110,23 @@ class MCQEval(Eval[MCQTask, Answer]):

 # Debugging

-Is your big eval not working? Check here.
-
-## Killing old executors
+## Kill dangling executors

-Sometimes, if you ctrl-c the main job, executors don’t have time to exit. A quick fix:
+nanoeval uses `multiprocessing` to execute rollouts in parallel. Sometimes, if you ctrl-c the main job, the multiprocessing executors don’t have time to exit. A quick fix:

 ```bash
 pkill -f multiprocessing.spawn
 ```

-## Observability
-
-### py-spy/aiomonitor
+## Debugging stuck runs

 `py-spy` is an excellent tool to figure out where processes are stuck if progress isn’t happening. You can check the monitor to find the PIDs of all the executors and py-spy them one by one. The executors also run `aiomonitor`, so you can connect to them via `python3 -m aiomonitor.cli ...` to inspect async tasks.
+
+## Diagnosing main thread stalls
+
+nanoeval relies heavily on Python asyncio for concurrency within each executor process, so blocking the main thread harms performance and causes main-thread stalls. A common footgun is making a synchronous LLM or HTTP call, which can stall the main thread for dozens of seconds.
+
+Tracking down blocking calls can be annoying, so nanoeval comes with some built-in features to diagnose them:
+
+1. Blocking synchronous calls will trigger a stacktrace dump to a temporary directory. You can see the dumps by running `open "$(python3 -c "from nanoeval.fs_paths import stacktrace_root_dir; print(stacktrace_root_dir())")"`.
+2. Blocking synchronous calls will also trigger a console warning.
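+
+To avoid the footgun in the first place, run unavoidable synchronous calls off the event loop. The sketch below uses plain `asyncio.to_thread` from the standard library (not a nanoeval API); `slow_blocking_call` is a hypothetical stand-in for a synchronous LLM or HTTP client:
+
+```python
+import asyncio
+import time
+
+
+def slow_blocking_call() -> str:
+    # Hypothetical stand-in for a synchronous LLM/HTTP call.
+    time.sleep(30)
+    return "response"
+
+
+async def solve_task() -> str:
+    # Calling slow_blocking_call() directly here would freeze the event
+    # loop for 30 seconds, stalling every other rollout in this process.
+    # Running it in a worker thread keeps the loop responsive:
+    return await asyncio.to_thread(slow_blocking_call)
+```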