feat: dataset onboard improvements (failed rows, unique keys, discovered datasets, bulk mode, qualifiedName fix) by adkinsty · Pull Request #10 · sodadata/soda-cli

adkinsty · 2026-04-27T18:34:47Z

Summary

A coherent arc of sodacli dataset onboard improvements driven by an n_able POC. Five commits, each self-contained:

9745b34 — onboard discovered-but-not-yet-onboarded datasets. dataset onboard <id> previously failed with "dataset not found" when given a discovered ID. Falls back to looking up the discovered dataset across datasources and promoting it on the fly.
4c8518e — --unique-keys on dataset diagnostics for failed-rows collection (preserves untouched fields by seeding from current state).
8f58637 — optional failed rows collection in dataset onboard. New --collect-failed-rows / --no-collect-failed-rows / --unique-keys flags + interactive prompt. --unique-keys alone implies --collect-failed-rows.
d8fb4e2 — fix: avoid double datasource prefix in qualifiedName for promote-on-the-fly path. Bug introduced by 9745b34 (the first commit in this PR). DiscoveredDataset.QualifiedName from /api/v1/discoveredDatasets is already prefixed (sf_nable/SODA_CE/...), so prepending the datasource name doubled it (sf_nable/sf_nable/...) and contract generation failed with "datasets not found". The fix re-fetches via GetDataset() after promotion to get the unprefixed shape.
edf753f — bulk mode via repeatable --dataset flag. Onboard N datasets in one command. Promotion is batched per datasource into one OnboardDiscoveredDatasets call each; copilot generation is batched into one GenerateContract operation; skeleton loops per-dataset (API doesn't accept a list). Bulk mode requires non-interactive flags and rejects --collect-failed-rows (unique keys are dataset-specific).

Test plan

sodacli dataset onboard <discovered-id> --contracts copilot against an n_able Snowflake dataset → no doubled prefix, contract operation accepted (HTTP 202)
sodacli dataset onboard <id1> --dataset <id2> --no-monitoring --no-profiling --contracts none → batched promotion works
sodacli dataset onboard <id1> --dataset <id2> (no flags) → errors with bulk-mode requirement
sodacli dataset onboard <id1> --dataset <id2> ... --collect-failed-rows --unique-keys id → rejected with clear message
sodacli dataset onboard <id> --collect-failed-rows --unique-keys id,email → enables failed rows
sodacli dataset onboard <id> --collect-failed-rows (no keys) → errors clearly
sodacli dataset diagnostics <id> --unique-keys id → updates keys without resetting enabled
Build clean on this branch

Previously `dataset onboard <id>` failed with "dataset not found" when given the ID of a dataset that was discovered but not yet onboarded, even though `dataset list` shows those IDs. Fall back to looking up discovered datasets across datasources and promote on the fly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
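The fallback described in this commit message can be sketched in Go roughly as follows. This is a minimal illustration, not the PR's actual code: the `API` interface, its method signatures, `ErrNotFound`, `resolveDataset`, and `fakeAPI` are all hypothetical names introduced here. Only `GetDataset` and `OnboardDiscoveredDatasets` are named in the PR, and their real signatures may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical sentinel standing in for "dataset not found".
var ErrNotFound = errors.New("dataset not found")

// Dataset and DiscoveredDataset are trimmed-down, assumed shapes.
type Dataset struct{ ID string }
type DiscoveredDataset struct{ ID string }

// API is an assumed, simplified client interface; real signatures may differ.
type API interface {
	GetDataset(id string) (*Dataset, error)
	ListDatasourceIDs() ([]string, error)
	ListDiscoveredDatasets(datasourceID string) ([]DiscoveredDataset, error)
	OnboardDiscoveredDatasets(datasourceID string, ids []string) error
}

// resolveDataset returns the onboarded dataset, promoting a discovered one
// on the fly when the direct lookup reports "not found".
func resolveDataset(api API, id string) (*Dataset, error) {
	ds, err := api.GetDataset(id)
	if err == nil {
		return ds, nil
	}
	if !errors.Is(err, ErrNotFound) {
		return nil, err // unrelated failure: do not mask it
	}
	sources, err := api.ListDatasourceIDs()
	if err != nil {
		return nil, err
	}
	for _, src := range sources {
		discovered, err := api.ListDiscoveredDatasets(src)
		if err != nil {
			return nil, err
		}
		for _, d := range discovered {
			if d.ID != id {
				continue
			}
			if err := api.OnboardDiscoveredDatasets(src, []string{id}); err != nil {
				return nil, fmt.Errorf("promote %s: %w", id, err)
			}
			return api.GetDataset(id) // re-fetch the onboarded shape
		}
	}
	return nil, ErrNotFound
}

// fakeAPI is an in-memory stand-in used only to exercise the sketch.
type fakeAPI struct {
	onboarded  map[string]bool
	discovered map[string][]DiscoveredDataset // keyed by datasource ID
}

func (f *fakeAPI) GetDataset(id string) (*Dataset, error) {
	if f.onboarded[id] {
		return &Dataset{ID: id}, nil
	}
	return nil, ErrNotFound
}

func (f *fakeAPI) ListDatasourceIDs() ([]string, error) {
	ids := make([]string, 0, len(f.discovered))
	for id := range f.discovered {
		ids = append(ids, id)
	}
	return ids, nil
}

func (f *fakeAPI) ListDiscoveredDatasets(src string) ([]DiscoveredDataset, error) {
	return f.discovered[src], nil
}

func (f *fakeAPI) OnboardDiscoveredDatasets(src string, ids []string) error {
	for _, id := range ids {
		f.onboarded[id] = true
	}
	return nil
}

func main() {
	api := &fakeAPI{
		onboarded:  map[string]bool{},
		discovered: map[string][]DiscoveredDataset{"src-1": {{ID: "abc"}}},
	}
	ds, err := resolveDataset(api, "abc")
	fmt.Println(ds.ID, err)
}
```

The sketch only falls back on a "not found" error so that unrelated failures (auth, network) still surface unchanged; whether the real command makes that distinction is not stated in the PR.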

Soda Cloud requires uniqueKeyColumnNames to be set for failed rows collection to actually work. Expose it via a new --unique-keys flag on `dataset diagnostics` and include it in the displayed config. Uses read-modify-write on failedRowsConfiguration so partial updates (e.g. setting keys without touching enabled) don't clobber untouched fields; the API replaces the whole section otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
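The read-modify-write pattern mentioned here can be illustrated with a small Go sketch. The struct shapes and the `applyPatch` helper are assumptions made for illustration; only the names `failedRowsConfiguration`, `uniqueKeyColumnNames`, and `enabled` come from the commit message. The idea is to seed the outgoing section from the current state and overwrite only the fields the user actually supplied.

```go
package main

import "fmt"

// FailedRowsConfig is an assumed shape for the failedRowsConfiguration section.
type FailedRowsConfig struct {
	Enabled              bool
	UniqueKeyColumnNames []string
}

// FailedRowsPatch is hypothetical: a nil field means "leave untouched".
type FailedRowsPatch struct {
	Enabled    *bool
	UniqueKeys []string
}

// applyPatch seeds the result from the current state and overwrites only the
// fields present in the patch, so sending the whole section back to an API
// that replaces it wholesale does not reset anything by accident.
func applyPatch(current FailedRowsConfig, p FailedRowsPatch) FailedRowsConfig {
	next := current
	if p.Enabled != nil {
		next.Enabled = *p.Enabled
	}
	if p.UniqueKeys != nil {
		// Copy so the caller's slice is not aliased.
		next.UniqueKeyColumnNames = append([]string(nil), p.UniqueKeys...)
	}
	return next
}

func main() {
	current := FailedRowsConfig{Enabled: true} // fetched from the API in practice
	next := applyPatch(current, FailedRowsPatch{UniqueKeys: []string{"id"}})
	fmt.Printf("%+v\n", next) // prints {Enabled:true UniqueKeyColumnNames:[id]}
}
```

Using a pointer for `Enabled` is one way to distinguish "not supplied" from "explicitly false"; the PR does not say how the real code draws that line.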

Add an optional step to `dataset onboard` that enables failed rows collection with user-specified unique key columns. Covered by new --collect-failed-rows / --no-collect-failed-rows and --unique-keys flags, and by a new interactive prompt when no flags are given. --unique-keys alone implies --collect-failed-rows. Enabling failed rows also enables scan results collection (required by the API). Datasource-level diagnostics must be set up first; if it isn't, the step warns and continues rather than aborting the onboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
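The flag rules above (keys alone imply collection, collection without keys is an error, no flags means prompt) can be sketched as a small decision function. `failedRowsPlan`, `splitKeys`, and `planFailedRows` are hypothetical names, and the error messages are illustrative. The two conflict cases marked as assumptions are not described in the PR and are included only to make the sketch total.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// failedRowsPlan is a hypothetical result type for this sketch.
type failedRowsPlan struct {
	Enable bool     // turn failed rows collection on
	Prompt bool     // no flags given: ask interactively
	Keys   []string // unique key column names
}

// splitKeys parses a comma-separated list, trimming spaces and dropping blanks.
func splitKeys(raw string) []string {
	var keys []string
	for _, k := range strings.Split(raw, ",") {
		if k = strings.TrimSpace(k); k != "" {
			keys = append(keys, k)
		}
	}
	return keys
}

// planFailedRows resolves the three flags into one decision.
func planFailedRows(collect, noCollect bool, rawKeys string) (failedRowsPlan, error) {
	keys := splitKeys(rawKeys)
	switch {
	case collect && noCollect:
		// Assumption: contradictory flags are rejected.
		return failedRowsPlan{}, errors.New("--collect-failed-rows and --no-collect-failed-rows conflict")
	case noCollect && len(keys) > 0:
		// Assumption: keys make no sense when collection is declined.
		return failedRowsPlan{}, errors.New("--unique-keys conflicts with --no-collect-failed-rows")
	case noCollect:
		return failedRowsPlan{}, nil // explicitly skipped
	case len(keys) > 0:
		// --unique-keys alone implies --collect-failed-rows.
		return failedRowsPlan{Enable: true, Keys: keys}, nil
	case collect:
		return failedRowsPlan{}, errors.New("--collect-failed-rows requires --unique-keys")
	default:
		return failedRowsPlan{Prompt: true}, nil
	}
}

func main() {
	plan, err := planFailedRows(false, false, "id,email")
	fmt.Printf("%+v %v\n", plan, err)
}
```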

…he-fly path

DiscoveredDataset.QualifiedName from /api/v1/discoveredDatasets already contains the datasource prefix (e.g. "sf_nable/SODA_CE/SCHEMA/TABLE"), while Dataset.QualifiedName from /api/v1/datasets/{id} does not (e.g. "SODA_CE.SCHEMA.TABLE"). The promote-on-the-fly path was using foundDiscovered.QualifiedName and prepending the datasource name, which produced a doubled prefix like "sf_nable/sf_nable/SODA_CE/SCHEMA/TABLE" and caused contract copilot/skeleton to fail with "datasets not found". Re-fetch via GetDataset(datasetID) after promotion so qualifiedName is built from the same Dataset shape used by the already-onboarded path.
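The doubled prefix is easy to reproduce in isolation. The Go sketch below uses two hypothetical helpers: `naivePrefix` mirrors the buggy unconditional prepend, and `ensurePrefix` is an idempotent alternative shown only for contrast. The PR itself fixes the bug differently, by re-fetching the dataset so that only one QualifiedName shape ever reaches the prefixing code; the "/" separator here is taken from the example strings in the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// naivePrefix prepends the datasource name unconditionally, which is what
// goes wrong when the input is already prefixed.
func naivePrefix(datasource, qualifiedName string) string {
	return datasource + "/" + qualifiedName
}

// ensurePrefix is a hypothetical defensive variant: it prepends the
// datasource name only when it is not already there. This is NOT what the
// PR does; it is shown to make the failure mode concrete.
func ensurePrefix(datasource, qualifiedName string) string {
	prefix := datasource + "/"
	if strings.HasPrefix(qualifiedName, prefix) {
		return qualifiedName
	}
	return prefix + qualifiedName
}

func main() {
	discovered := "sf_nable/SODA_CE/SCHEMA/TABLE" // already prefixed

	fmt.Println(naivePrefix("sf_nable", discovered))  // sf_nable/sf_nable/SODA_CE/SCHEMA/TABLE
	fmt.Println(ensurePrefix("sf_nable", discovered)) // sf_nable/SODA_CE/SCHEMA/TABLE
}
```

Re-fetching, as the commit does, avoids having to guess from the string whether a prefix is present, which would be ambiguous if a database or schema happened to share the datasource's name.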

Adds bulk onboarding for multiple datasets in one command. Matches the existing soda-cli convention for bulk inputs (repeatable StringArray flag, single positional kept for the common single-dataset case).

    sodacli dataset onboard <id>    # single (interactive)
    sodacli dataset onboard <id> --dataset <id2> --dataset <id3> \
        --monitoring --no-profiling --contracts copilot    # bulk

Behavior:
- Bulk mode requires --monitoring/--no-monitoring, --profiling/--no-profiling, and --contracts (no interactive form for N datasets).
- --collect-failed-rows / --unique-keys are rejected in bulk mode since unique keys are dataset-specific.
- Promotion of discovered datasets is batched per datasource into a single OnboardDiscoveredDatasets call each.
- Copilot generation is batched into one GenerateContract operation across all qualifiedNames (single poll, fetch each contract after completion).
- Skeleton generation loops per-dataset (the API doesn't accept a list).
- Each per-dataset failure is logged as a warning; the rest continue.

Adds runContractCreateCopilotBulk helper alongside the existing single-dataset runContractCreateCopilot: same shape, polls one operation, fetches N contracts.
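Two of the bulk-mode rules lend themselves to a short Go sketch: validating that the required flags were given, and grouping discovered datasets so each datasource gets one promotion call. All type and function names here (`discoveredRef`, `batch`, `bulkFlags`, `batchByDatasource`, `validateBulk`) are hypothetical, and the error strings are illustrative rather than the CLI's actual messages.

```go
package main

import (
	"errors"
	"fmt"
)

// discoveredRef pairs a discovered dataset with its datasource (assumed shape).
type discoveredRef struct {
	DatasetID    string
	DatasourceID string
}

// batch is the input for one promotion call.
type batch struct {
	DatasourceID string
	DatasetIDs   []string
}

// batchByDatasource groups dataset IDs per datasource, preserving the order
// in which datasources and datasets were first seen.
func batchByDatasource(refs []discoveredRef) []batch {
	index := map[string]int{} // datasource ID -> position in out
	var out []batch
	for _, r := range refs {
		i, ok := index[r.DatasourceID]
		if !ok {
			i = len(out)
			index[r.DatasourceID] = i
			out = append(out, batch{DatasourceID: r.DatasourceID})
		}
		out[i].DatasetIDs = append(out[i].DatasetIDs, r.DatasetID)
	}
	return out
}

// bulkFlags records which flags the user supplied (hypothetical struct).
type bulkFlags struct {
	MonitoringSet     bool // --monitoring or --no-monitoring given
	ProfilingSet      bool // --profiling or --no-profiling given
	Contracts         string
	CollectFailedRows bool
	UniqueKeys        string
}

// validateBulk enforces the bulk-mode rules listed in the commit message.
func validateBulk(f bulkFlags) error {
	if !f.MonitoringSet || !f.ProfilingSet || f.Contracts == "" {
		return errors.New("bulk mode requires --monitoring/--no-monitoring, --profiling/--no-profiling, and --contracts")
	}
	if f.CollectFailedRows || f.UniqueKeys != "" {
		return errors.New("--collect-failed-rows / --unique-keys are not supported in bulk mode")
	}
	return nil
}

func main() {
	refs := []discoveredRef{
		{DatasetID: "a", DatasourceID: "sf"},
		{DatasetID: "b", DatasourceID: "pg"},
		{DatasetID: "c", DatasourceID: "sf"},
	}
	for _, b := range batchByDatasource(refs) {
		fmt.Printf("%s: %v\n", b.DatasourceID, b.DatasetIDs)
	}
	fmt.Println(validateBulk(bulkFlags{MonitoringSet: true, ProfilingSet: true, Contracts: "none"}))
}
```

If the CLI is built on pflag, `FlagSet.Changed` is one way to populate fields like `MonitoringSet`, since it reports whether a flag was explicitly passed rather than left at its default.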

adkinsty and others added 5 commits April 24, 2026 13:12

adkinsty changed the title ~~feat: dataset onboarding improvements (failed rows, unique keys, discovered datasets)~~ feat: dataset onboard improvements (failed rows, unique keys, discovered datasets, bulk mode, qualifiedName fix) May 3, 2026

adkinsty wants to merge 5 commits into main from feat-dataset-onboard-improvements
