A benchmark for evaluating AI spatial reasoning through Minecraft-style voxel construction.
Models are given a natural-language prompt and must produce raw 3D coordinates as JSON. In tool mode, models call voxel.exec (minimal primitives: block, box, line) to generate large builds beyond token-only JSON limits. MineBench visualizes the output and ranks models via head-to-head voting with a confidence-aware Glicko-style system (public ordering by conservative score).
Most LLM benchmarks test text and raw accuracy. MineBench instead tests whether a model can reason about 3D space. Given a prompt like "a medieval castle with four towers", the model must mentally construct the geometry, pick materials, and output thousands of precise block coordinates. No vision model or diffusion: just math and spatial logic.
As it turns out, this kind of spatial reasoning correlates strongly with a model's raw general intelligence. Anecdotally, the MineBench leaderboard tracks the same hierarchy most people observe in real-world usage: the strongest reasoning models stand out clearly when asked to produce visual builds.
Unlike most benchmarks, MineBench gives an easy way to visually judge (at least one aspect of) a model's raw intelligence. The ranking system also highlights which models are clearly 'bench-maxed' (i.e. a model with amazing benchmark scores on paper that clearly lacks in real-world usage).
- Arena — blind head-to-head comparisons of pre-generated builds with confidence-aware ranking
- Sandbox — compare existing builds or generate new ones live with your own API keys
- Local Lab — copy the benchmark prompt, run it in any model, paste the JSON back to render
- Leaderboard — live rankings with win/loss/draw stats across all models
- Full docs index: docs/README.md
- Ranking math and matchmaking walkthrough: docs/arena-ranking-system.md
- Ranking policy: docs/arena-ranking-validity-policy-v2.md
- Voxel tool and raw-output pipeline: docs/voxel-exec-raw-output.md
MineBench currently benchmarks models from OpenAI, Anthropic, Google, Moonshot, DeepSeek, xAI, Z.AI, Qwen, Meta, and any model available through OpenRouter.
Contributions are welcome! See CONTRIBUTING.md for how to add new models, submit benchmark prompts, improve the UI, or fix bugs.
Running MineBench is expensive: model inference, storage, and hosting costs add up quickly as the benchmark grows.
If MineBench is useful to you and you want to help keep updates and new model runs coming, you can support it here:
Texture pack: Faithful (see faithful-32x-1.21.11/LICENSE.txt)
This path lets you run the full app and compare existing builds from uploads/ without generating new ones.
- Node.js 18+
- pnpm
- Docker (for local Postgres)
```bash
pnpm install
cp .env.example .env
pnpm dev:setup
```

`pnpm dev:setup` will:
- ensure `.env` exists
- build the texture atlas
- reset local Docker Postgres volume
- run Prisma migrations
- start Next.js dev server on `http://localhost:3000`
In a second terminal:
```bash
pnpm prompt --import
```

Then open:

- `http://localhost:3000/` (Arena)
- `http://localhost:3000/sandbox` (Benchmark Compare works immediately)
- `http://localhost:3000/leaderboard`
If you do not want to reset the DB volume each time:
```bash
pnpm db:up
pnpm prisma:migrate
pnpm dev
```

To generate fresh builds in `/sandbox` -> Live Generate:
- Open `http://localhost:3000/sandbox`
- Switch to `Live Generate`
- Enter either:
  - an `OpenRouter` key (recommended), or
  - provider-specific keys (OpenAI/Anthropic/Gemini/Moonshot/DeepSeek)
- Pick 2 models and click `Generate`
Notes:
- Keys entered in Sandbox are stored in browser `localStorage` and sent only with that request.
- In production, `/api/generate` requires request keys unless `MINEBENCH_ALLOW_SERVER_KEYS=1`.
Copy `.env.example` to `.env` and set what you need:
- `DATABASE_URL` (required): pooled/runtime Postgres URL
- `DIRECT_URL` (required): direct Postgres URL for Prisma migrations
- `ADMIN_TOKEN` (required for `/api/admin/*`)
- `CRON_SECRET` (recommended if using Vercel Cron for `/api/admin/rank-snapshots/capture`)
- `SUPABASE_URL` + `SUPABASE_SERVICE_ROLE_KEY` (required for large build upload/download via Supabase Storage)
- `SUPABASE_STORAGE_BUCKET` (optional, default `builds`)
- `SUPABASE_STORAGE_PREFIX` (optional, default `imports`)
Provider keys (at least one enables generation):

- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`
- `GOOGLE_AI_API_KEY`
- `MOONSHOT_API_KEY`
- `DEEPSEEK_API_KEY`
- `OPENROUTER_API_KEY`
Optional/advanced:

- `MINEBENCH_ALLOW_SERVER_KEYS=1` (production opt-in for server env keys in `/api/generate`)
- `ANTHROPIC_OPUS_4_6_EFFORT=low|medium|high|max`
- `ANTHROPIC_SONNET_4_6_EFFORT=low|medium|high|max` (runtime falls back automatically if the provider rejects `max`)
- `ANTHROPIC_STREAM_RESPONSES=1`
- `OPENAI_STREAM_RESPONSES=1`
- `OPENAI_USE_BACKGROUND_MODE=1` (recommended for long-running Responses jobs, especially GPT-5.2 Pro)
- `OPENAI_BACKGROUND_POLL_MS=2000` (poll interval for background mode)
- `OPENAI_GPT5_PRO_TIMEOUT_MS=7200000` and `OPENAI_REQUEST_TIMEOUT_MS` (optional timeout overrides)
- `ANTHROPIC_ENABLE_1M_CONTEXT_BETA=1`
- `ANTHROPIC_THINKING_BUDGET` (legacy/manual thinking models)
- `OPENROUTER_BASE_URL`, `MOONSHOT_BASE_URL`, `DEEPSEEK_BASE_URL`
- `AI_DEBUG=1` (logs raw model output on failures)
- `MINEBENCH_TOOL_OUTPUT_DIR`, `MINEBENCH_TOOL_TIMEOUT_MS`, `MINEBENCH_TOOL_MAX_*` (advanced `voxel.exec` controls)
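A minimal `.env` for local development might look like the following (all values are placeholders; `.env.example` has the actual defaults):

```bash
# required
DATABASE_URL="postgresql://postgres:postgres@localhost:5432/minebench" # placeholder; match your local Postgres
DIRECT_URL="postgresql://postgres:postgres@localhost:5432/minebench"
ADMIN_TOKEN="change-me"

# at least one provider key enables generation
OPENROUTER_API_KEY="sk-or-..."
```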
- Grid size: `256`
- Palette: `simple`
- Mode: `precise`
- Arena matchups are sampled from pre-seeded builds only.
- Matchups are selected by a lane scheduler:
  - coverage: 40%
  - contender: 30%
  - uncertainty: 20%
  - exploration: 10%
- Prompt/model eligibility is based on arena-ready builds (`gridSize=256`, `palette=simple`, `mode=precise`) with at least two enabled models per prompt.
- New models are prioritized for calibration exposure via low-coverage + high-uncertainty + low-`shownCount` weighting, but there is no hard equal-vote-count guarantee.
- A session cookie (`mb_session`) is used so each session can vote once per matchup.
- Vote options: `A`, `B`, `TIE`, `BOTH_BAD`.
- Rating updates:
  - `A`/`B`/`TIE`: Glicko-style pair update (rating, RD, volatility)
  - public leaderboard order uses the conservative score `rating - 2*RD`
  - `BOTH_BAD`: updates `bothBadCount` only (quality-floor signal); does not mutate pairwise skill rating
For formulas and worked examples, see docs/arena-ranking-system.md.
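As a minimal sketch of the conservative ordering (illustrative TypeScript, not the actual `lib/arena/` code; field names are assumptions):

```ts
interface ModelRating {
  key: string;    // model identifier
  rating: number; // Glicko-style skill estimate
  rd: number;     // rating deviation (uncertainty)
}

// Penalizing a rating by two deviations keeps new, barely-calibrated
// models from topping the public leaderboard on a lucky streak.
const conservativeScore = (m: ModelRating): number => m.rating - 2 * m.rd;

const leaderboardOrder = (models: ModelRating[]): ModelRating[] =>
  [...models].sort((a, b) => conservativeScore(b) - conservativeScore(a));
```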
Middleware rate limits non-admin API routes to 18 requests / 10 seconds per IP + path.
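For illustration only (this is not the actual middleware), a fixed-window limiter keyed by IP + path could look like:

```ts
const WINDOW_MS = 10_000; // 10-second window
const LIMIT = 18;         // max requests per window

const windows = new Map<string, { start: number; count: number }>();

// Returns true if the request identified by ip + path may proceed.
function allowRequest(ip: string, path: string, now = Date.now()): boolean {
  const key = `${ip}:${path}`;
  const w = windows.get(key);
  if (!w || now - w.start >= WINDOW_MS) {
    windows.set(key, { start: now, count: 1 });
    return true;
  }
  w.count += 1;
  return w.count <= LIMIT;
}
```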
Default behavior: all runtime model generations use `voxel.exec` tool mode.
- Models emit a tool-call envelope (`tool: voxel.exec` + JS code).
- MineBench executes that code server-side and converts it into the final voxel build JSON.
- The artifact we render/store is always build JSON in `version`/`boxes`/`lines`/`blocks` format.
Raw tool-call example:
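A sketch of what such an envelope can look like (illustrative only; the exact envelope fields and primitive signatures are specified in docs/voxel-exec-raw-output.md):

```json
{
  "tool": "voxel.exec",
  "code": "box(10, 0, 10, 20, 6, 20, 'stone'); line(15, 7, 15, 15, 18, 15, 'oak_log'); block(15, 19, 15, 'glowstone');"
}
```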
Full pipeline docs (raw output -> executed tool -> validated final build JSON): docs/voxel-exec-raw-output.md
Note: non-tool generation exists only as an explicit fallback/dev path (for example `pnpm batch:generate --notools`).
Models produce JSON in this schema:
```json
{
  "version": "1.0",
  "boxes": [
    { "x1": 10, "y1": 0, "z1": 10, "x2": 20, "y2": 6, "z2": 20, "type": "stone" }
  ],
  "lines": [
    { "from": { "x": 15, "y": 7, "z": 15 }, "to": { "x": 15, "y": 18, "z": 15 }, "type": "oak_log" }
  ],
  "blocks": [
    { "x": 15, "y": 19, "z": 15, "type": "glowstone" }
  ]
}
```

Validation pipeline:
- expands `boxes` and `lines`
- normalizes/drops invalid block types
- drops out-of-bounds coordinates
- deduplicates final blocks
- enforces max block limits
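A rough sketch of the expansion and dedup steps above, assuming inclusive box bounds (illustrative TypeScript; the real pipeline lives in `lib/voxel/`):

```ts
interface Box { x1: number; y1: number; z1: number; x2: number; y2: number; z2: number; type: string }
interface Block { x: number; y: number; z: number; type: string }

// Expand a box into individual blocks (inclusive bounds assumed).
function expandBox(b: Box): Block[] {
  const out: Block[] = [];
  for (let x = Math.min(b.x1, b.x2); x <= Math.max(b.x1, b.x2); x++)
    for (let y = Math.min(b.y1, b.y2); y <= Math.max(b.y1, b.y2); y++)
      for (let z = Math.min(b.z1, b.z2); z <= Math.max(b.z1, b.z2); z++)
        out.push({ x, y, z, type: b.type });
  return out;
}

// Drop out-of-bounds blocks and deduplicate by coordinate (last write wins).
function finalize(blocks: Block[], gridSize: number): Block[] {
  const seen = new Map<string, Block>();
  for (const b of blocks) {
    const inBounds = [b.x, b.y, b.z].every((v) => v >= 0 && v < gridSize);
    if (inBounds) seen.set(`${b.x},${b.y},${b.z}`, b);
  }
  return [...seen.values()];
}
```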
Current generation constraints:
- grid sizes: `64`, `256`, `512`
- minimum blocks: `200` / `500` / `800`
- max blocks: `196,608` / `2,000,000` / `4,000,000`
- minimum structural span checks for width/depth and height (to reject tiny builds)
Block palettes: `simple` and `advanced` are defined in `lib/blocks/palettes.json`.
```bash
pnpm prompt                       # inspect detected prompt folders/builds
pnpm prompt --import              # import into local DB
pnpm prompt --import --overwrite
```

Create a new prompt folder scaffold:

```bash
pnpm prompt --init --prompt arcade --text "A classic arcade cabinet with ..."
```

Set `ADMIN_TOKEN` in `.env`, restart the dev server, then:
```bash
# status
curl -sS "http://localhost:3000/api/admin/status" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# prompts + model catalog only (no generation)
curl -sS -X POST "http://localhost:3000/api/admin/seed?generateBuilds=0" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# dry run
curl -sS -X POST "http://localhost:3000/api/admin/seed?dryRun=1" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# generate missing builds in batches (repeat until done=true)
curl -sS -X POST "http://localhost:3000/api/admin/seed?batchSize=2" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# capture current leaderboard rank snapshot (hourly cadence recommended)
curl -sS -X POST "http://localhost:3000/api/admin/rank-snapshots/capture" \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

At least one provider key must be configured (`OPENROUTER_API_KEY` or a direct provider key) for generation to run.
Use this to import JSON from ChatGPT web or other tools:
curl -sS -X POST "http://localhost:3000/api/admin/import-build?modelKey=openai_gpt_5_2_pro&promptText=$(node -p 'encodeURIComponent(process.argv[1])' 'A medieval stone castle')&overwrite=1" \
-H "Authorization: Bearer $ADMIN_TOKEN" \
--data-binary "@uploads/castle/castle-gpt-5-2-pro.json"For large payloads in production (50MB+), use the batch uploader with Supabase Storage enabled:
pnpm batch:generate --upload --prompt castle --model gpt-5-2-proThe script uploads *.json.gz directly to Supabase Storage and then calls /api/admin/import-build with a small storage reference payload.
Reference prompt template: docs/chatgpt-web-voxel-prompt.md
`POST /api/generate`
- body: `{ prompt, gridSize, palette, modelKeys, providerKeys? }`
- response: `application/x-ndjson` stream (`hello`, `start`, `retry`, `delta`, `result`, `error`, `ping`)
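An illustrative stream, one JSON object per line (the event names come from the list above; the payload fields shown are assumptions, not the documented shape):

```
{"type":"hello"}
{"type":"start","model":"<modelKey>"}
{"type":"delta","model":"<modelKey>","text":"..."}
{"type":"result","model":"<modelKey>","build":{"version":"1.0","boxes":[],"lines":[],"blocks":[]}}
```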
`GET /api/arena/matchup?promptId=<optional>`

`POST /api/arena/vote`
- body: `{ matchupId, choice }`
- `choice`: `A | B | TIE | BOTH_BAD`

`GET /api/arena/prompts`

`GET /api/sandbox/benchmark?promptId=&modelA=&modelB=`

`GET /api/leaderboard`
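For example, a vote could be cast from the command line like this (note that voting requires the `mb_session` cookie, so a bare curl may be rejected):

```bash
curl -sS -X POST "http://localhost:3000/api/arena/vote" \
  -H "Content-Type: application/json" \
  -d '{"matchupId":"<matchup-id>","choice":"A"}'
```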
`GET /api/admin/status`

`POST /api/admin/seed?dryRun=1&generateBuilds=0&batchSize=2`

`GET|POST /api/admin/rank-snapshots/capture?at=<optional-iso-timestamp>`

`POST /api/admin/import-build?modelKey=...&promptId=...|promptText=...&gridSize=256&palette=simple&mode=precise&overwrite=1`
- body can be either:
  - raw voxel JSON (legacy)
  - storage envelope: `{ "storage": { "bucket": "...", "path": "...", "encoding": "gzip", ... } }`
- `pnpm dev:setup`: full local bootstrap (resets DB, migrates, runs dev server)
- `pnpm dev`: start Next.js dev server
- `pnpm build` / `pnpm start`: production build and serve
- `pnpm lint`: ESLint
- `pnpm db:up` / `pnpm db:down` / `pnpm db:reset`
- `pnpm prisma:migrate` / `pnpm prisma:dev` / `pnpm prisma:generate`
- `pnpm atlas`: rebuild texture atlas
- `pnpm prompt`: inspect/import prompt build files from `uploads/`
- `pnpm batch:generate`: batch-generate and/or upload build files
- `pnpm elo:reset --yes [--keep-history]`: reset arena rating/leaderboard stats (legacy script name)
```bash
# status only
pnpm batch:generate

# generate missing files
pnpm batch:generate --generate

# generate without voxel.exec tool mode
pnpm batch:generate --generate --notools

# upload existing files to production
pnpm batch:generate --upload

# generate + upload with prompt/model filters
pnpm batch:generate --generate --upload --prompt castle --model sonnet

# all options
pnpm batch:generate --help
```

Build files are written under `uploads/<prompt-slug>/`.
When `SUPABASE_URL` and `SUPABASE_SERVICE_ROLE_KEY` are set, `--upload` uses direct Supabase Storage upload + finalize import, which is the recommended path for 100MB-scale builds.
- `pnpm lint` for static checks
- no automated test suite is configured yet
Prisma models: `Model`, `Prompt`, `Build`, `Matchup`, `Vote`
Prisma creates quoted PascalCase table names in Postgres. When querying manually, use quoted identifiers, for example:
```sql
select count(*) from public."Prompt";
```

```
app/          Next.js App Router pages and API routes
components/   UI and voxel viewer components
lib/ai/       generation pipeline and provider adapters
lib/arena/    matchup sampling and rating logic
lib/blocks/   palette and texture atlas mapping
lib/voxel/    voxel types, validation, mesh helpers
prisma/       schema and migrations
scripts/      setup, import, generation, maintenance utilities
uploads/      local build JSON files and prompt folders
```
- `No seeded prompts found` on Arena:
  - Run `pnpm prompt --import` or use `/api/admin/seed`.
- `Missing ADMIN_TOKEN` / `Invalid token` on admin endpoints:
  - Set `ADMIN_TOKEN` in `.env` and send `Authorization: Bearer $ADMIN_TOKEN`.
- `/api/generate` returns a no-key error in production:
  - Send `providerKeys` from the client or set `MINEBENCH_ALLOW_SERVER_KEYS=1`.
- Large upload fails and the script falls back to direct API body upload:
  - Set `SUPABASE_URL`, `SUPABASE_SERVICE_ROLE_KEY`, and (optionally) `SUPABASE_STORAGE_BUCKET`.
- DB connection errors:
  - Ensure Docker is running and `DATABASE_URL`/`DIRECT_URL` are valid, then run `pnpm db:up`.
- Missing/broken block textures:
  - Run `pnpm atlas` to rebuild `public/textures/atlas.png`.
- Works well with Vercel + Supabase Postgres.
- Recommended:
  - `DATABASE_URL`: Supabase pooler URL (`pgbouncer=true`)
  - `DIRECT_URL`: Supabase direct URL (for Prisma migrations)
  - `SUPABASE_URL`: your Supabase project URL
  - `SUPABASE_SERVICE_ROLE_KEY`: server-only key (do not expose to the client)
  - `SUPABASE_STORAGE_BUCKET`: private bucket for build payload objects (default `builds`)
- Hourly rank snapshots for movement markers:
  - `vercel.json` schedules `/api/admin/rank-snapshots/capture` every hour
  - set `CRON_SECRET` in Vercel, and keep `ADMIN_TOKEN` for manual/admin calls
- Create a private bucket (for example `builds`) in Supabase Storage.
- In Vercel project env vars, add:
  - `SUPABASE_URL`
  - `SUPABASE_SERVICE_ROLE_KEY`
  - `SUPABASE_STORAGE_BUCKET` (optional if `builds`)
  - `SUPABASE_STORAGE_PREFIX` (optional if `imports`)
- Ensure `ADMIN_TOKEN` is set in Vercel (still required by `/api/admin/*`).
- Deploy the app, then run your existing upload command flow (`pnpm batch:generate --upload ...`).
Notes:
- `SUPABASE_SERVICE_ROLE_KEY` must stay server-side only.
- Build APIs now support records stored as inline JSON or storage pointers.
- Textures: Faithful pack (`faithful-32x-1.21.11/`)
- License: see `faithful-32x-1.21.11/LICENSE.txt`
[Disclaimer: all documentation (including this README) and the frontend are almost entirely AI-created]


