feat: dataset onboard improvements (failed rows, unique keys, discovered datasets, bulk mode, qualifiedName fix) #10

Open
adkinsty wants to merge 5 commits into main from feat-dataset-onboard-improvements

adkinsty commented Apr 27, 2026

Summary

A coherent arc of sodacli dataset onboard improvements driven by an n_able POC. Five commits, each self-contained:

  1. 9745b34: onboard discovered-but-not-yet-onboarded datasets. dataset onboard <id> previously failed with "dataset not found" when given a discovered ID. It now falls back to looking up the discovered dataset across datasources and promoting it on the fly.
  2. 4c8518e: --unique-keys on dataset diagnostics for failed-rows collection. Untouched fields are preserved by seeding the update from the current state.
  3. 8f58637: optional failed rows collection in dataset onboard. New --collect-failed-rows / --no-collect-failed-rows / --unique-keys flags plus an interactive prompt. --unique-keys alone implies --collect-failed-rows.
  4. d8fb4e2 (fix): avoid double datasource prefix in qualifiedName for the promote-on-the-fly path. Bug introduced by commit 1. DiscoveredDataset.QualifiedName from /api/v1/discoveredDatasets is already prefixed (sf_nable/SODA_CE/...), so prepending the datasource name doubled it (sf_nable/sf_nable/...) and contract generation failed with "datasets not found". The fix re-fetches via GetDataset() after promotion to get the unprefixed shape.
  5. edf753f: bulk mode via repeatable --dataset flag. Onboard N datasets in one command. Promotion is batched per datasource into one OnboardDiscoveredDatasets call each; copilot generation is batched into one GenerateContract operation; skeleton generation loops per dataset (the API doesn't accept a list). Bulk mode requires non-interactive flags and rejects --collect-failed-rows (unique keys are dataset-specific).
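
For reviewers, a minimal Go sketch of the double-prefix bug from commit 4. The helper names `naivePrefix` and `prefixOnce` are hypothetical and do not appear in the PR. The PR's actual fix is to re-fetch via GetDataset(); the guard shown here is an alternative illustration of why the doubled prefix appears, not the implemented change.

```go
package main

import (
	"fmt"
	"strings"
)

// naivePrefix reproduces the buggy behavior: it always prepends the
// datasource name, even when the name is already prefixed.
func naivePrefix(datasource, qualifiedName string) string {
	return datasource + "/" + qualifiedName
}

// prefixOnce (hypothetical guard) prepends the datasource name only when
// the qualified name does not already start with it.
func prefixOnce(datasource, qualifiedName string) string {
	prefix := datasource + "/"
	if strings.HasPrefix(qualifiedName, prefix) {
		return qualifiedName
	}
	return prefix + qualifiedName
}

func main() {
	// Shape returned for a discovered dataset, per the commit message.
	discovered := "sf_nable/SODA_CE/SCHEMA/TABLE"
	fmt.Println(naivePrefix("sf_nable", discovered))
	fmt.Println(prefixOnce("sf_nable", discovered))
}
```

Re-fetching the dataset, as the PR does, avoids string inspection entirely because both code paths then build qualifiedName from the same Dataset shape.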

Test plan

  • sodacli dataset onboard <discovered-id> --contracts copilot against an n_able Snowflake dataset → no doubled prefix, contract operation accepted (HTTP 202)
  • sodacli dataset onboard <id1> --dataset <id2> --no-monitoring --no-profiling --contracts none → batched promotion works
  • sodacli dataset onboard <id1> --dataset <id2> (no flags) → errors with bulk-mode requirement
  • sodacli dataset onboard <id1> --dataset <id2> ... --collect-failed-rows --unique-keys id → rejected with clear message
  • sodacli dataset onboard <id> --collect-failed-rows --unique-keys id,email → enables failed rows
  • sodacli dataset onboard <id> --collect-failed-rows (no keys) → errors clearly
  • sodacli dataset diagnostics <id> --unique-keys id → updates keys without resetting enabled
  • Build clean on this branch
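
The flag rules exercised by the test plan above can be summarized in a short Go sketch. The `onboardFlags` struct, the `validate` function, and the error strings are assumptions made for illustration; the real sodacli code and messages may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// onboardFlags is a hypothetical stand-in for the parsed CLI flags.
type onboardFlags struct {
	datasetIDs        []string // positional ID plus every --dataset value
	monitoringSet     bool     // --monitoring or --no-monitoring was passed
	profilingSet      bool     // --profiling or --no-profiling was passed
	contracts         string   // empty when --contracts was not passed
	collectFailedRows bool
	uniqueKeys        []string
}

// validate applies the rules described in the PR summary.
func validate(f *onboardFlags) error {
	// --unique-keys alone implies --collect-failed-rows.
	if len(f.uniqueKeys) > 0 {
		f.collectFailedRows = true
	}
	if len(f.datasetIDs) > 1 { // bulk mode
		if f.collectFailedRows {
			return errors.New("bulk mode rejects --collect-failed-rows/--unique-keys: unique keys are dataset-specific")
		}
		if !f.monitoringSet || !f.profilingSet || f.contracts == "" {
			return errors.New("bulk mode requires monitoring, profiling, and contracts flags")
		}
		return nil
	}
	if f.collectFailedRows && len(f.uniqueKeys) == 0 {
		return errors.New("--collect-failed-rows requires --unique-keys")
	}
	return nil
}

func main() {
	bulk := &onboardFlags{datasetIDs: []string{"id1", "id2"}}
	fmt.Println(validate(bulk))

	single := &onboardFlags{datasetIDs: []string{"id1"}, uniqueKeys: []string{"id", "email"}}
	fmt.Println(validate(single), single.collectFailedRows)
}
```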

adkinsty and others added 5 commits April 24, 2026 13:12
Previously `dataset onboard <id>` failed with "dataset not found" when
given the ID of a dataset that was discovered but not yet onboarded,
even though `dataset list` shows those IDs. Fall back to looking up
discovered datasets across datasources and promote on the fly.
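
A rough Go sketch of the fallback order, assuming an in-memory `catalog` type in place of the real API client. All names here (`catalog`, `resolve`, `errNotFound`) are hypothetical and only illustrate the lookup sequence.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound mirrors the "dataset not found" failure described above.
var errNotFound = errors.New("dataset not found")

// catalog is a hypothetical in-memory stand-in for the Soda Cloud API.
type catalog struct {
	onboarded  map[string]bool            // onboarded dataset IDs
	discovered map[string]map[string]bool // datasource name -> discovered dataset IDs
}

// resolve checks onboarded datasets first and only then scans the
// discovered datasets of every datasource.
func (c catalog) resolve(id string) (datasource string, needsPromotion bool, err error) {
	if c.onboarded[id] {
		return "", false, nil
	}
	for ds, ids := range c.discovered {
		if ids[id] {
			return ds, true, nil
		}
	}
	return "", false, errNotFound
}

func main() {
	c := catalog{
		onboarded:  map[string]bool{"ds-1": true},
		discovered: map[string]map[string]bool{"sf_nable": {"disc-7": true}},
	}
	for _, id := range []string{"ds-1", "disc-7", "missing"} {
		ds, promote, err := c.resolve(id)
		fmt.Printf("%s: datasource=%q promote=%v err=%v\n", id, ds, promote, err)
	}
}
```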

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Soda Cloud requires uniqueKeyColumnNames to be set for failed rows
collection to actually work. Expose it via a new --unique-keys flag
on `dataset diagnostics` and include it in the displayed config.

Uses read-modify-write on failedRowsConfiguration so partial updates
(e.g. setting keys without touching enabled) don't clobber untouched
fields — the API replaces the whole section otherwise.
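
The read-modify-write idea can be sketched in Go as follows. The struct shapes and the `applyPatch` helper are assumptions for illustration; only the names failedRowsConfiguration, uniqueKeyColumnNames, and enabled come from this commit message.

```go
package main

import "fmt"

// failedRowsConfig is a hypothetical model of the failedRowsConfiguration section.
type failedRowsConfig struct {
	Enabled              bool
	UniqueKeyColumnNames []string
}

// failedRowsPatch holds only the fields the user asked to change.
// A nil field means "leave the current value untouched".
type failedRowsPatch struct {
	Enabled              *bool
	UniqueKeyColumnNames []string
}

// applyPatch seeds the result from the current state and overwrites only
// the fields present in the patch, so the full section can be sent back
// to an API that replaces it wholesale.
func applyPatch(current failedRowsConfig, p failedRowsPatch) failedRowsConfig {
	next := current
	if p.Enabled != nil {
		next.Enabled = *p.Enabled
	}
	if p.UniqueKeyColumnNames != nil {
		next.UniqueKeyColumnNames = p.UniqueKeyColumnNames
	}
	return next
}

func main() {
	current := failedRowsConfig{Enabled: true}
	updated := applyPatch(current, failedRowsPatch{UniqueKeyColumnNames: []string{"id"}})
	fmt.Printf("%+v\n", updated)
}
```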

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add an optional step to `dataset onboard` that enables failed rows
collection with user-specified unique key columns. Covered by new
--collect-failed-rows / --no-collect-failed-rows and --unique-keys
flags, and by a new interactive prompt when no flags are given.

--unique-keys alone implies --collect-failed-rows. Enabling failed
rows also enables scan results collection (required by the API).
Datasource-level diagnostics must be set up first — if it isn't,
the step warns and continues rather than aborting the onboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…he-fly path

DiscoveredDataset.QualifiedName from /api/v1/discoveredDatasets already
contains the datasource prefix (e.g. "sf_nable/SODA_CE/SCHEMA/TABLE"),
while Dataset.QualifiedName from /api/v1/datasets/{id} does not (e.g.
"SODA_CE.SCHEMA.TABLE"). The promote-on-the-fly path was using
foundDiscovered.QualifiedName and prepending the datasource name, which
produced a doubled prefix like "sf_nable/sf_nable/SODA_CE/SCHEMA/TABLE"
and caused contract copilot/skeleton to fail with "datasets not found".

Re-fetch via GetDataset(datasetID) after promotion so qualifiedName is
built from the same Dataset shape used by the already-onboarded path.

Adds bulk onboarding for multiple datasets in one command. Matches the
existing soda-cli convention for bulk inputs (repeatable StringArray
flag, single positional kept for the common single-dataset case).

  sodacli dataset onboard <id>                                  # single (interactive)
  sodacli dataset onboard <id> --dataset <id2> --dataset <id3> \
      --monitoring --no-profiling --contracts copilot           # bulk

Behavior:
  - Bulk mode requires --monitoring/--no-monitoring, --profiling/--no-profiling,
    and --contracts (no interactive form for N datasets).
  - --collect-failed-rows / --unique-keys are rejected in bulk mode since
    unique keys are dataset-specific.
  - Promotion of discovered datasets is batched per datasource into a single
    OnboardDiscoveredDatasets call each.
  - Copilot generation is batched into one GenerateContract operation across
    all qualifiedNames (single poll, fetch each contract after completion).
  - Skeleton generation loops per-dataset (the API doesn't accept a list).
  - Each per-dataset failure is logged as a warning; the rest continue.

Adds runContractCreateCopilotBulk helper alongside the existing single-dataset
runContractCreateCopilot — same shape, polls one operation, fetches N contracts.
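
A small Go sketch of the per-datasource batching described above, assuming a hypothetical `discoveredRef` type and `groupByDatasource` helper. Each resulting group would correspond to one OnboardDiscoveredDatasets call; the call itself is not shown because its signature is not given here.

```go
package main

import (
	"fmt"
	"sort"
)

// discoveredRef is a hypothetical pairing of a discovered dataset ID
// with the datasource it belongs to.
type discoveredRef struct {
	ID         string
	Datasource string
}

// groupByDatasource collects dataset IDs per datasource, preserving the
// input order within each group.
func groupByDatasource(refs []discoveredRef) map[string][]string {
	groups := make(map[string][]string)
	for _, r := range refs {
		groups[r.Datasource] = append(groups[r.Datasource], r.ID)
	}
	return groups
}

func main() {
	groups := groupByDatasource([]discoveredRef{
		{ID: "a", Datasource: "sf_nable"},
		{ID: "b", Datasource: "pg_main"},
		{ID: "c", Datasource: "sf_nable"},
	})
	// Sort datasource names so the output order is stable.
	names := make([]string, 0, len(groups))
	for name := range groups {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("one promotion call for %s with %d dataset(s)\n", name, len(groups[name]))
	}
}
```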
adkinsty changed the title from "feat: dataset onboarding improvements (failed rows, unique keys, discovered datasets)" to "feat: dataset onboard improvements (failed rows, unique keys, discovered datasets, bulk mode, qualifiedName fix)" on May 3, 2026