docs: plan DuckDB/Node backend for buckaroo-js-core (#930) #935

Open
paddymul wants to merge 1 commit into main from docs/930-duckdb-node-backend-plan

@paddymul

Summary

Design doc for #930 — a first-class DuckDB-backed buckaroo backend that runs in a pure Node/Electron host with no Python kernel, so the JS-core viewer (DFViewerInfinite/SmartRowCache/IDatasource) renders the same behind DuckDB as behind pandas/polars. IModel is the transport seam and stays untouched.

Motivating consumer is an Electron app (@duckdb/node-api, no Python) whose author wants the full notebook experience — search, infinite scroll, summary stats, histograms.

Blocked on #933

This waits on #933 (unified DF transport). That PR's decodeDFData + parquet_b64 envelope let the infinite path accept a single JSON message with inline base64 parquet, removing the two-frame requirement (BuckarooWidgetInfinite.tsx:122 reads buffers[0] unconditionally today). With #933 the IPC IModel adapter is a plain round-trip and there is zero buckaroo-js-core change; without it, #930 would need a JS-core patch. Don't start the row-transport work until #933 lands.
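
To make the single-message idea concrete, here is a minimal sketch of wrapping parquet bytes as base64 inside one JSON-serializable message and recovering them on the other side. The `ParquetB64Envelope` shape, its field names, and both helper functions are hypothetical illustrations; the actual wire format and the real `decodeDFData` signature are defined by #933 and may differ. `Buffer` assumes a Node/Electron host.

```typescript
// Hypothetical envelope shape: field names are assumptions, not the #933 wire format.
interface ParquetB64Envelope {
  format: "parquet_b64";
  data: string; // base64-encoded parquet bytes
}

// Host side (Node/Electron main process): wrap parquet bytes so the whole
// response fits in one JSON message, with no separate binary frame.
function encodeEnvelope(parquetBytes: Uint8Array): ParquetB64Envelope {
  return { format: "parquet_b64", data: Buffer.from(parquetBytes).toString("base64") };
}

// Viewer side: recover the raw bytes from the JSON message.
function decodeEnvelope(env: ParquetB64Envelope): Uint8Array {
  return new Uint8Array(Buffer.from(env.data, "base64"));
}

// Round-trip through JSON, as an IPC channel that only carries JSON would.
const magic = new Uint8Array([0x50, 0x41, 0x52, 0x31]); // "PAR1", the parquet magic bytes
const wire = JSON.stringify(encodeEnvelope(magic));
const roundTripped = decodeEnvelope(JSON.parse(wire) as ParquetB64Envelope);
```

The point of the sketch is only that nothing in the message depends on a second frame, which is what would make the IPC adapter a plain request/response round-trip.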

v1 scope

  • SUMMARIZE → SDType → wide {col}__{stat} parquet for pinned summary rows.
  • Windowed rows + sort + paging over infinite_request.
  • viewer mode, read-only (no autoclean/post-processing/search/quick-commands).
  • COPY … TO tmpfile (FORMAT PARQUET) for rows: the only serialization path that requires no hand-written type-coercion code (@duckdb/node-api has no Arrow and no in-memory parquet, and coercion is where fidelity bugs live).
  • Faithful a,b,c rename + synthesized index (kills index-named / dotted / duplicate-column landmines that user SQL hits more than DataFrames do).
  • Injected DuckSource connection — required for catalog correctness, not just hygiene.
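
The row-path items above (positional rename, synthesized index, windowed sort/paging, COPY to a temp parquet file) can be sketched as a single SQL builder. This is an illustration under stated assumptions, not the planned implementation: `WindowRequest` and its field names are hypothetical (the real infinite_request payload is not shown here), the spreadsheet-style `shortName` scheme is assumed, and `row_number() OVER ()` relies on scan order, which a real backend would need to pin down. The function only builds a string, so it runs without a DuckDB connection.

```typescript
// Assumed naming scheme: 0 -> "a", 25 -> "z", 26 -> "aa", ...
function shortName(i: number): string {
  let n = i;
  let out = "";
  do {
    out = String.fromCharCode(97 + (n % 26)) + out;
    n = Math.floor(n / 26) - 1;
  } while (n >= 0);
  return out;
}

function quoteIdent(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Hypothetical request shape; `end` is exclusive.
interface WindowRequest {
  start: number;
  end: number;
  sortColumn?: string; // a renamed column such as "b"
  ascending?: boolean;
}

function buildWindowCopy(
  userSql: string,
  columnCount: number,
  req: WindowRequest,
  tmpPath: string
): string {
  if (!Number.isInteger(req.start) || !Number.isInteger(req.end) || req.start < 0 || req.end < req.start) {
    throw new Error("invalid window");
  }
  const renamed: string[] = [];
  for (let i = 0; i < columnCount; i++) renamed.push(shortName(i));
  // Only accept sort columns we generated, so no user text reaches ORDER BY.
  if (req.sortColumn !== undefined && renamed.indexOf(req.sortColumn) === -1) {
    throw new Error("unknown sort column: " + req.sortColumn);
  }
  // Positional aliases on the subquery rename columns without referencing the
  // original names, so dotted, duplicate, or "index"-named columns never appear.
  const aliases = renamed.map(quoteIdent).join(", ");
  const inner = `SELECT row_number() OVER () - 1 AS "index", * FROM (${userSql}) AS src(${aliases})`;
  const orderBy = req.sortColumn === undefined
    ? ""
    : ` ORDER BY ${quoteIdent(req.sortColumn)} ${req.ascending === false ? "DESC" : "ASC"}`;
  const limit = req.end - req.start;
  const path = tmpPath.replace(/'/g, "''");
  return `COPY (SELECT * FROM (${inner}) AS renamed${orderBy} LIMIT ${limit} OFFSET ${req.start}) TO '${path}' (FORMAT PARQUET)`;
}

const exampleSql = buildWindowCopy(
  'SELECT 1 AS "x.y", 2 AS "x.y"',
  2,
  { start: 0, end: 100, sortColumn: "b", ascending: false },
  "/tmp/window.parquet"
);
```

Because the statement ends in COPY … TO … (FORMAT PARQUET), DuckDB itself writes the parquet file and the host only has to read the bytes back, which is the property the bullet about type coercion is after.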

Fast-follows (designed-for, not built)

Histograms/quantiles; search, implemented as a search_-prefixed filtered stat set (this is the broader plan, and DuckDB is the natural first backend for it, since xorq slowness was the only reason it didn't exist yet); quick commands (per-command SQL translation behind one effective-query seam).
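
One way to picture the "effective-query seam" is a single function that every consumer (row windows, summary stats) calls to get the SQL it should run, so that search and later quick commands only have to be translated in one place. Everything in this sketch is hypothetical: the `Op` type, the `contains`-based predicate, and the `prefixStats` helper are illustrations, not the design in the plan document.

```typescript
// Hypothetical operation type; quick commands would add further variants.
type Op = { kind: "search"; term: string };

function quoteIdent(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

function sqlString(s: string): string {
  return "'" + s.replace(/'/g, "''") + "'";
}

// The seam: rows and stats both run against whatever this returns.
function effectiveQuery(baseSql: string, columns: string[], ops: Op[]): string {
  let sql = baseSql;
  for (const op of ops) {
    if (op.kind === "search") {
      // Case-insensitive substring match on any column, cast to text.
      const preds = columns.map(
        (c) => `contains(lower(CAST(${quoteIdent(c)} AS VARCHAR)), lower(${sqlString(op.term)}))`
      );
      const where = preds.length > 0 ? preds.join(" OR ") : "FALSE";
      sql = `SELECT * FROM (${sql}) AS q WHERE ${where}`;
    }
  }
  return sql;
}

// Stats computed over the filtered query are stored under a prefix so they can
// sit next to the unfiltered stats for the same column.
function prefixStats(stats: Record<string, unknown>, prefix: string): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(stats)) out[prefix + key] = stats[key];
  return out;
}

const filteredSql = effectiveQuery("SELECT * FROM t", ["a"], [{ kind: "search", term: "o'k" }]);
const searchStats = prefixStats({ null_count: 3 }, "search_");
```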

Also

Doc: docs/plans/930-duckdb-node-backend.md. Plan only, no code.

🤖 Generated with Claude Code

v1 scope: SUMMARIZE stats + windowed sort/paging over an IModel-over-IPC
adapter, injected DuckSource, no buckaroo-js-core change. Search, histograms,
and quick commands are designed-for fast-follows behind one effective-query
seam.

Blocked on #933 (unified DF transport): its decodeDFData + parquet_b64 envelope
let the infinite path take a single JSON message, removing the two-frame
requirement that would otherwise force a JS-core change.

DECIMAL precision loss split out to #934.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c4a9353eb2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

| `min` / `max` | `min` / `max` |
| `distinct_count` | `approx_unique` |
| `mean` / `std` | `avg` / `std` |
| `null_count` | `count × null_percentage` (derived) |

P2: Derive null_count from the percentage scale

DuckDB SUMMARIZE exposes null_percentage as a percentage, so this formula would overstate nulls by 100x if implemented literally: with 1,000 rows and 25% nulls, count × null_percentage gives 25,000 instead of 250. Since Buckaroo's null_count is an absolute integer count, the plan should either divide the parsed percentage by 100 or compute nulls directly.
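
A minimal sketch of the two options named above, assuming the SUMMARIZE row has already been parsed into plain numbers (how @duckdb/node-api surfaces the DECIMAL null_percentage value is not covered here). The `SummarizeRow` interface and both helper names are hypothetical.

```typescript
// Assumed row shape: only the SUMMARIZE columns this sketch needs, already
// converted to JS numbers.
interface SummarizeRow {
  column_name: string;
  count: number;
  null_percentage: number; // on a 0-100 scale
}

// Option 1: derive from the percentage, dividing by 100. Since the percentage
// is reported with limited precision, this is approximate on large tables.
function nullCountFromSummarize(row: SummarizeRow): number {
  return Math.round((row.count * row.null_percentage) / 100);
}

// Option 2: compute nulls directly; count(col) skips NULLs, so the difference
// from count(*) is exact.
function exactNullCountSql(sourceSql: string, column: string): string {
  const ident = '"' + column.replace(/"/g, '""') + '"';
  return `SELECT count(*) - count(${ident}) AS null_count FROM (${sourceSql}) AS src`;
}

const derived = nullCountFromSummarize({ column_name: "a", count: 1000, null_percentage: 25 });
const directSql = exactNullCountSql("SELECT * FROM t", "a");
```

With the 1,000-row, 25% example from the comment, the first helper yields 250 rather than 25,000.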

@github-actions

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.15.1.dev28046370246

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.15.1.dev28046370246

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.15.1.dev28046370246" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview
