docs: plan DuckDB/Node backend for buckaroo-js-core (#930) #935
v1 scope: SUMMARIZE stats + windowed sort/paging over an IModel-over-IPC adapter, injected DuckSource, no buckaroo-js-core change. Search, histograms, and quick commands are designed-for fast-follows behind one effective-query seam.

Blocked on #933 (unified DF transport): its decodeDFData + parquet_b64 envelope let the infinite path take a single JSON message, removing the two-frame requirement that would otherwise force a JS-core change. DECIMAL precision loss split out to #934.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c4a9353eb2
| `min` / `max` | `min` / `max` | |
| `distinct_count` | `approx_unique` | |
| `mean` / `std` | `avg` / `std` | |
| `null_count` | `count × null_percentage` (derived) | |
Derive null_count from the percentage scale
DuckDB SUMMARIZE exposes null_percentage as a percentage, so this formula would overstate nulls by 100x if implemented literally: with 1,000 rows and 25% nulls, count × null_percentage gives 25,000 instead of 250. Since Buckaroo's null_count is an absolute integer count, the plan should either divide the parsed percentage by 100 or compute nulls directly.
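A minimal sketch of the two fixes suggested above, with hypothetical helper names that are not taken from the plan. It assumes `null_percentage` arrives on a 0 to 100 scale, either as a number or as a string such as `"25.0%"` (which of these you get depends on the DuckDB version and client, so treat the parsing as an assumption). The second helper builds the direct query, which avoids the rounding already baked into the percentage.

```typescript
// Hypothetical helpers for illustration; none of these names come from the plan.
type Percentage = number | string;

/** Parse a percentage-scale value such as 25, "25.00", or "25.0%". */
function parsePercentage(raw: Percentage): number {
  const text = String(raw).trim().replace(/%$/, "");
  const value = Number(text);
  if (text === "" || !Number.isFinite(value) || value < 0 || value > 100) {
    throw new RangeError(`not a percentage in [0, 100]: ${String(raw)}`);
  }
  return value;
}

/** Convert (row count, null percentage) into an absolute integer null count. */
function nullCountFromPercentage(rowCount: number, nullPercentage: Percentage): number {
  // Divide by 100 because the input is a percentage, not a fraction.
  return Math.round((rowCount * parsePercentage(nullPercentage)) / 100);
}

/** Exact alternative: COUNT(col) skips NULLs, so the difference is the null count. */
function exactNullCountSql(table: string, column: string): string {
  const quote = (id: string): string => `"${id.replace(/"/g, '""')}"`;
  return `SELECT COUNT(*) - COUNT(${quote(column)}) AS null_count FROM ${quote(table)}`;
}
```

With the reviewer's numbers, `nullCountFromPercentage(1000, 25)` yields 250 rather than 25,000. Because the percentage is itself rounded, the derived count can be off by a few rows on large tables; the direct query costs one extra scan but is exact.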
📦 TestPyPI package published

`pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.15.1.dev28046370246`

or with uv:

`uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.15.1.dev28046370246`

MCP server for Claude Code

`claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.15.1.dev28046370246" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table`

📖 Docs preview
🎨 Storybook preview
Summary
Design doc for #930: a first-class DuckDB-backed buckaroo backend that runs in a pure Node/Electron host with no Python kernel, so the JS-core viewer (`DFViewerInfinite` / `SmartRowCache` / `IDatasource`) renders the same behind DuckDB as behind pandas/polars. `IModel` is the transport seam and stays untouched.

The motivating consumer is an Electron app (`@duckdb/node-api`, no Python) whose author wants the full notebook experience: search, infinite scroll, summary stats, histograms.

Blocked on #933
This waits on #933 (unified DF transport). That PR's `decodeDFData` + `parquet_b64` envelope let the infinite path accept a single JSON message with inline base64 parquet, removing the two-frame requirement (`BuckarooWidgetInfinite.tsx:122` reads `buffers[0]` unconditionally today). With #933 the IPC `IModel` adapter is a plain round-trip and there is zero buckaroo-js-core change; without it, #930 would need a JS-core patch. Don't start the row-transport work until #933 lands.

v1 scope
- `SUMMARIZE` → `SDType` → wide `{col}__{stat}` parquet for pinned summary rows.
- Windowed sort/paging over `infinite_request`.
- `viewer` mode, read-only (no autoclean/post-processing/search/quick-commands).
- `COPY … TO tmpfile (FORMAT PARQUET)` for rows: the only serialization path that writes no type-coercion code (this `@duckdb/node-api` has no Arrow and no in-memory parquet; coercion is where fidelity bugs live).
- `a,b,c` rename + synthesized `index` (kills `index`-named / dotted / duplicate-column landmines that user SQL hits more than DataFrames do).
- Injected `DuckSource` connection: required for catalog correctness, not just hygiene.

Fast-follows (designed-for, not built)
Histograms/quantiles; search (as a `search_`-prefixed filtered stat set: the broader plan, which DuckDB is the natural first backend to implement since xorq-slowness was the only reason it didn't exist); quick commands (per-command SQL translation behind one effective-query seam).

Also
- DECIMAL precision loss split out to #934 (DECIMAL → DOUBLE; likely affects all backends).

Doc: `docs/plans/930-duckdb-node-backend.md`. Plan only, no code.

🤖 Generated with Claude Code
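As an illustration of the `SUMMARIZE` → wide `{col}__{stat}` step in the v1 scope, the sketch below pivots one-row-per-column `SUMMARIZE` output into a single flat record. The `SummarizeRow` field types, the `STAT_READERS` table, and `toWideStats` are assumptions made for this example, not names from the plan; the `SDType` step and the parquet write are omitted, and `null_count` uses the percentage-scale correction from the review.

```typescript
// Assumed subset of one SUMMARIZE result row; the field types are an assumption.
interface SummarizeRow {
  column_name: string;
  min: string | null;
  max: string | null;
  approx_unique: number;
  avg: number | null;
  std: number | null;
  count: number;
  null_percentage: number; // percentage scale, 0..100
}

type StatValue = string | number | null;

// Buckaroo stat name -> how to read it from one SUMMARIZE row.
const STAT_READERS: Record<string, (row: SummarizeRow) => StatValue> = {
  min: (r) => r.min,
  max: (r) => r.max,
  distinct_count: (r) => r.approx_unique,
  mean: (r) => r.avg,
  std: (r) => r.std,
  null_count: (r) => Math.round((r.count * r.null_percentage) / 100),
};

/** Flatten SUMMARIZE rows into one record whose keys are `{col}__{stat}`. */
function toWideStats(rows: SummarizeRow[]): Record<string, StatValue> {
  const wide: Record<string, StatValue> = {};
  for (const row of rows) {
    for (const [stat, read] of Object.entries(STAT_READERS)) {
      wide[`${row.column_name}__${stat}`] = read(row);
    }
  }
  return wide;
}
```

Keeping the mapping in one table means a later stat (for example a quantile) would be a one-line addition rather than a change to the pivot loop.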