From 65f5897f98ec8d25bfba2792eecbb3e9c1d06252 Mon Sep 17 00:00:00 2001
From: Kevin Liu
Date: Wed, 26 Feb 2025 10:26:48 -0800
Subject: [PATCH] [nanoeval] update readme (#45)

---
 project/nanoeval/README.md | 61 +++++++++++++++++++++++++++-----------
 1 file changed, 43 insertions(+), 18 deletions(-)

diff --git a/project/nanoeval/README.md b/project/nanoeval/README.md
index 3d129ca..3127d58 100644
--- a/project/nanoeval/README.md
+++ b/project/nanoeval/README.md
@@ -1,6 +1,17 @@
 # nanoeval

-Simple, ergonomic, and high performance evals.
+Simple, ergonomic, and high-performance evals. We use it at OpenAI as part of our infrastructure to run Preparedness evaluations.
+
+# Installation
+
+```bash
+# Using https://github.com/astral-sh/uv (recommended)
+uv add "git+https://github.com/openai/SWELancer-Benchmark#egg=nanoeval&subdirectory=project/nanoeval"
+# Using pip
+pip install "git+https://github.com/openai/SWELancer-Benchmark#egg=nanoeval&subdirectory=project/nanoeval"
+```
+
+nanoeval is pre-release software and may have breaking changes, so we recommend pinning your installation to a specific commit. The uv command above does this for you.

 # Principles

@@ -13,8 +24,8 @@
 - `Eval` - A [chz](https://github.com/openai/chz) class. Enumerates a set of tasks, and (typically) uses a "Solver" to solve them and then records the results. Can be configured in code or on the CLI using a chz entrypoint.
 - `EvalSpec` - An eval to run and runtime characteristics of how to run it (i.e. concurrency, recording, other administrivia)
-- `Task` - A separable, scoreable unit of work.
-- `Solver` - A strategy (usually involving sampling a model) to go from a task to a result that can be scored. For example, there may be different ways to prompt a model to answer a multiple-choice question (i.e. looking at logits, using consensus, etc)
+- `Task` - A single scoreable unit of work.
+- `Solver` - A strategy (usually involving sampling a model) to go from a task to a result that can be scored. For example, there may be different ways to prompt a model to answer a multiple-choice question (e.g. looking at logits, few-shot prompting, etc.)

 # Running your first eval

@@ -33,9 +44,15 @@ The executors can operate in two modes:

 1. **In-process:** The executor is just an async task running in the same process as the main eval script. The default.
 2. **Multiprocessing:** Starts a pool of executor processes that all poll the db. Use this via `spec.runner.experimental_use_multiprocessing=True`.

-## The monitor
+## Performance
+
+nanoeval has been tested at up to ~5,000 concurrent rollouts; it can likely go higher.
+
+For the highest performance, use multiprocessing with as many processes as your system's memory and core count allow. See `RunnerArgs` for documentation.
+
+## Monitoring

-Nanoeval has a tiny built-in monitor to track ongoing evals. It's a streamlit that visualizes the state of the internal run state database. This can be helpful to diagnose hangs on specific tasks. To use it:
+nanoeval has a tiny built-in monitor to track ongoing evals. It's a Streamlit app that visualizes the state of the internal run-state database, which can be helpful for diagnosing hangs on specific tasks. To use it:

 ```bash
 # either set spec.runner.use_monitor=True OR run this command:
 python3 -m nanoeval.bin.mon
 ```

@@ -44,25 +61,28 @@

 ## Resumption

-Because nanoeval uses a persistent database to track the state of individual tasks in a run, this means you can restart an in-progress eval if it crashes. To do this:
+Because nanoeval uses a persistent database to track the state of individual tasks in a run, you can restart an in-progress eval if it crashes. (In-progress rollouts will be restarted from scratch, but completed rollouts will be saved.) To do this:

 ```bash
-python3 -m nanoeval.extras.resume db_name=
+# Restarts the eval in a new process
+python3 -m nanoeval.extras.resume run_set_id=...
 ```

-The `db_name` is typically autogenerated and looks something like `-`. You can list all your databases with:
+You can list all run sets (databases) with the following command:

 ```bash
-ls -lh ~/Library/Application Support/nanoeval/run_state/
+ls -lh "$(python3 -c "from nanoeval.fs_paths import database_dir; print(database_dir())")"
 ```

-# **Writing your first eval**
+The run set ID for each database is simply the filename, without the `.db*` suffix.
+
+# Writing your first eval

 An eval is just a `chz` class that defines `get_name()`, `get_tasks()`, `evaluate()` and `get_summary()`. Start with `gpqa_simple.py`; copy it and modify it to suit your needs. If necessary, drop down to the base `nanoeval.Eval` class instead of using `MCQEval`.
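+
+For orientation, here is a minimal sketch of such a class. The four method names come from the paragraph above, but the signatures, types, and `chz` usage shown here are illustrative assumptions; treat `gpqa_simple.py` as the source of truth.
+
+```python
+import chz
+
+from nanoeval import Eval  # the base class mentioned above
+
+
+@chz.chz
+class ArithmeticEval(Eval):
+    """Hypothetical toy eval that scores hardcoded arithmetic answers."""
+
+    def get_name(self) -> str:
+        return "arithmetic_smoke_test"
+
+    async def get_tasks(self):
+        # Each task is a single scoreable unit of work.
+        return [{"question": "1 + 1", "answer": "2"}]
+
+    async def evaluate(self, task):
+        # A real eval would usually invoke a Solver / sample a model here.
+        model_output = "2"  # pretend the model answered "2"
+        return {"correct": model_output == task["answer"]}
+
+    def get_summary(self, results):
+        # Aggregate per-task results into a single summary dict.
+        return {"accuracy": sum(r["correct"] for r in results) / len(results)}
+```
+
+Because the class is a `chz` class, its fields can then be configured in code or on the CLI via a chz entrypoint, as noted in the Concepts section above.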
 
 The following sections describe common use case needs and how to achieve them.

-## **Public API**
+## Public API

 You may import code from any `nanoeval.*` package that does not start with an underscore. Functions and classes that start with an underscore are considered private.

@@ -90,18 +110,23 @@ class MCQEval(Eval[MCQTask, Answer]):

 # Debugging

-Is your big eval not working? Check here.
-
-## Killing old executors
+## Kill dangling executors

-Sometimes, if you ctrl-c the main job, executors don’t have time to exit. A quick fix:
+nanoeval uses `multiprocessing` to execute rollouts in parallel. Sometimes, if you ctrl-c the main job, the multiprocessing executors don’t have time to exit. A quick fix:

 ```bash
 pkill -f multiprocessing.spawn
 ```

-## Observability
-
-### py-spy/aiomonitor
+## Debugging stuck runs

 `py-spy` is an excellent tool to figure out where processes are stuck if progress isn’t happening. You can check the monitor to find the PIDs of all the executors and py-spy them one by one. The executors also run `aiomonitor`, so you can connect to them via `python3 -m aiomonitor.cli ...` to inspect async tasks.
+
+## Diagnosing main thread stalls
+
+nanoeval relies heavily on Python asyncio for concurrency within each executor process, so blocking the main thread harms performance and causes main-thread stalls. A common footgun is making a synchronous LLM or HTTP call, which can stall the main thread for dozens of seconds.
+
+Tracking down blocking calls can be annoying, so nanoeval comes with some built-in features to diagnose them:
+
+1. Blocking synchronous calls will trigger a stacktrace dump to a temporary directory. You can see the dumps by running `open "$(python3 -c "from nanoeval.fs_paths import stacktrace_root_dir; print(stacktrace_root_dir())")"`.
+2. Blocking synchronous calls will also trigger a console warning.
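+
+To avoid the footgun in the first place, run unavoidable synchronous calls off the event loop. The sketch below uses plain `asyncio.to_thread` from the standard library (not a nanoeval API); `slow_blocking_call` is a hypothetical stand-in for a synchronous LLM or HTTP client:
+
+```python
+import asyncio
+import time
+
+
+def slow_blocking_call() -> str:
+    # Hypothetical stand-in for a synchronous LLM/HTTP call.
+    time.sleep(30)
+    return "response"
+
+
+async def solve_task() -> str:
+    # Calling slow_blocking_call() directly here would freeze the event
+    # loop for 30 seconds, stalling every other rollout in this process.
+    # Running it in a worker thread keeps the loop responsive:
+    return await asyncio.to_thread(slow_blocking_call)
+```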