Previously `dataset onboard <id>` failed with "dataset not found" when given the ID of a dataset that was discovered but not yet onboarded, even though `dataset list` shows those IDs. Fall back to looking up discovered datasets across datasources and promote on the fly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Soda Cloud requires uniqueKeyColumnNames to be set for failed rows collection to actually work. Expose it via a new --unique-keys flag on `dataset diagnostics` and include it in the displayed config. Uses read-modify-write on failedRowsConfiguration so partial updates (e.g. setting keys without touching enabled) don't clobber untouched fields; the API replaces the whole section otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
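
The read-modify-write step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `FailedRowsConfig` and `FailedRowsPatch` types and the `mergeFailedRows` helper are hypothetical names, and the real failedRowsConfiguration section may carry more fields than the two shown.

```go
package main

import "fmt"

// FailedRowsConfig is a hypothetical stand-in for the failedRowsConfiguration
// section returned by the API. The real section may contain other fields.
type FailedRowsConfig struct {
	Enabled              bool
	UniqueKeyColumnNames []string
}

// FailedRowsPatch holds only what the user set on the command line.
// A nil pointer or nil slice means "leave this field unchanged".
type FailedRowsPatch struct {
	Enabled              *bool
	UniqueKeyColumnNames []string
}

// mergeFailedRows seeds the result from the current state and overlays the
// patch, so the full section can be sent back without losing untouched fields.
func mergeFailedRows(current FailedRowsConfig, patch FailedRowsPatch) FailedRowsConfig {
	merged := current
	// Copy the slice so the result does not alias the caller's data.
	merged.UniqueKeyColumnNames = append([]string(nil), current.UniqueKeyColumnNames...)
	if patch.Enabled != nil {
		merged.Enabled = *patch.Enabled
	}
	if patch.UniqueKeyColumnNames != nil {
		merged.UniqueKeyColumnNames = append([]string(nil), patch.UniqueKeyColumnNames...)
	}
	return merged
}

func main() {
	// Current state as it might be read from the API (assumed values).
	current := FailedRowsConfig{Enabled: true}
	// The user passed only --unique-keys id,email.
	merged := mergeFailedRows(current, FailedRowsPatch{UniqueKeyColumnNames: []string{"id", "email"}})
	fmt.Printf("%+v\n", merged)
}
```

Because the merged value starts as a copy of the current state, setting only the keys leaves `Enabled` as it was, which is the property the commit message calls out.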
Add an optional step to `dataset onboard` that enables failed rows collection with user-specified unique key columns. Covered by new --collect-failed-rows / --no-collect-failed-rows and --unique-keys flags, and by a new interactive prompt when no flags are given. --unique-keys alone implies --collect-failed-rows. Enabling failed rows also enables scan results collection (required by the API). Datasource-level diagnostics must be set up first; if it isn't, the step warns and continues rather than aborting the onboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
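
The flag interplay above could be resolved along these lines. This is a sketch under assumptions: `resolveFailedRows` is a hypothetical helper, the error texts are invented, and the handling of `--no-collect-failed-rows` combined with `--unique-keys` is a guess that the source does not specify. The interactive prompt is out of scope here.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveFailedRows decides whether failed rows collection should be enabled
// on the flag-driven path. collect is nil when neither --collect-failed-rows
// nor --no-collect-failed-rows was given.
func resolveFailedRows(collect *bool, uniqueKeys []string) (bool, error) {
	if collect == nil {
		// --unique-keys alone implies --collect-failed-rows. With no flags
		// at all, the caller would fall through to the interactive prompt.
		return len(uniqueKeys) > 0, nil
	}
	if !*collect {
		if len(uniqueKeys) > 0 {
			// Assumption: the source does not say how this combination is handled.
			return false, errors.New("--unique-keys cannot be combined with --no-collect-failed-rows")
		}
		return false, nil
	}
	if len(uniqueKeys) == 0 {
		// Unique keys are required for failed rows collection to work.
		return false, errors.New("--collect-failed-rows requires --unique-keys")
	}
	return true, nil
}

func main() {
	// Only --unique-keys id,email was passed.
	enable, err := resolveFailedRows(nil, []string{"id", "email"})
	fmt.Println(enable, err)
}
```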
…he-fly path
DiscoveredDataset.QualifiedName from /api/v1/discoveredDatasets already
contains the datasource prefix (e.g. "sf_nable/SODA_CE/SCHEMA/TABLE"),
while Dataset.QualifiedName from /api/v1/datasets/{id} does not (e.g.
"SODA_CE.SCHEMA.TABLE"). The promote-on-the-fly path was using
foundDiscovered.QualifiedName and prepending the datasource name, which
produced a doubled prefix like "sf_nable/sf_nable/SODA_CE/SCHEMA/TABLE"
and caused contract copilot/skeleton to fail with "datasets not found".
Re-fetch via GetDataset(datasetID) after promotion so qualifiedName is
built from the same Dataset shape used by the already-onboarded path.
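
To make the doubled prefix concrete, here is a small sketch using the example values from this message. `prependDatasource` is a hypothetical helper; the source only says the datasource name was prepended, so joining with "/" is an assumption, and the second printed value is not claimed to be the exact string the real code builds.

```go
package main

import "fmt"

// prependDatasource mimics the step described above by joining the datasource
// name and a qualified name with "/". The real join logic is not shown in the
// source, so this is for illustration only.
func prependDatasource(datasource, qualifiedName string) string {
	return datasource + "/" + qualifiedName
}

func main() {
	const datasource = "sf_nable"
	discoveredQN := "sf_nable/SODA_CE/SCHEMA/TABLE" // DiscoveredDataset.QualifiedName (already prefixed)
	datasetQN := "SODA_CE.SCHEMA.TABLE"             // Dataset.QualifiedName (no prefix)

	// Buggy path: the prefix is applied to a name that already has it.
	fmt.Println(prependDatasource(datasource, discoveredQN))
	// Fixed path: the name comes from the re-fetched Dataset, so the prefix appears once.
	fmt.Println(prependDatasource(datasource, datasetQN))
}
```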
Adds bulk onboarding for multiple datasets in one command. Matches the
existing soda-cli convention for bulk inputs (repeatable StringArray
flag, single positional kept for the common single-dataset case).

    sodacli dataset onboard <id>                                  # single (interactive)
    sodacli dataset onboard <id> --dataset <id2> --dataset <id3> \
      --monitoring --no-profiling --contracts copilot             # bulk

Behavior:
- Bulk mode requires --monitoring/--no-monitoring, --profiling/--no-profiling,
  and --contracts (no interactive form for N datasets).
- --collect-failed-rows / --unique-keys are rejected in bulk mode since
  unique keys are dataset-specific.
- Promotion of discovered datasets is batched per datasource into a single
  OnboardDiscoveredDatasets call each.
- Copilot generation is batched into one GenerateContract operation across
  all qualifiedNames (single poll, fetch each contract after completion).
- Skeleton generation loops per-dataset (the API doesn't accept a list).
- Each per-dataset failure is logged as a warning; the rest continue.
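
The first two rules in the list above amount to an up-front validation step, which might look like the sketch below. The `onboardFlags` struct, the `validateBulk` helper, and the error texts are hypothetical and only illustrate the rules as stated here.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// onboardFlags is a hypothetical snapshot of which flags the user passed.
type onboardFlags struct {
	MonitoringSet        bool     // --monitoring or --no-monitoring was given
	ProfilingSet         bool     // --profiling or --no-profiling was given
	Contracts            string   // value of --contracts, "" if not given
	CollectFailedRowsSet bool     // --collect-failed-rows was given
	UniqueKeys           []string // values of --unique-keys
}

// validateBulk enforces the bulk-mode rules when more than one dataset is
// requested. Single-dataset runs are left to the interactive flow.
func validateBulk(datasetCount int, f onboardFlags) error {
	if datasetCount <= 1 {
		return nil
	}
	var missing []string
	if !f.MonitoringSet {
		missing = append(missing, "--monitoring/--no-monitoring")
	}
	if !f.ProfilingSet {
		missing = append(missing, "--profiling/--no-profiling")
	}
	if f.Contracts == "" {
		missing = append(missing, "--contracts")
	}
	if len(missing) > 0 {
		return fmt.Errorf("bulk mode requires %s", strings.Join(missing, ", "))
	}
	if f.CollectFailedRowsSet || len(f.UniqueKeys) > 0 {
		return errors.New("--collect-failed-rows and --unique-keys are not supported in bulk mode: unique keys are dataset-specific")
	}
	return nil
}

func main() {
	// Three datasets, but --contracts was not given.
	err := validateBulk(3, onboardFlags{MonitoringSet: true, ProfilingSet: true})
	fmt.Println(err)
}
```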
Adds runContractCreateCopilotBulk helper alongside the existing single-dataset
runContractCreateCopilot — same shape, polls one operation, fetches N contracts.
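
The per-datasource batching of promotions mentioned in the behavior list can be sketched as a simple grouping step. The `DiscoveredDataset` fields and `groupByDatasource` below are hypothetical; the real struct and the OnboardDiscoveredDatasets signature are not shown in this message, so the API call is only indicated by a comment.

```go
package main

import (
	"fmt"
	"sort"
)

// DiscoveredDataset is a hypothetical, trimmed-down shape carrying only the
// fields this sketch needs.
type DiscoveredDataset struct {
	ID           string
	DatasourceID string
}

// groupByDatasource collects discovered dataset IDs per datasource so that
// each datasource needs only one promotion call.
func groupByDatasource(discovered []DiscoveredDataset) map[string][]string {
	groups := make(map[string][]string)
	for _, d := range discovered {
		groups[d.DatasourceID] = append(groups[d.DatasourceID], d.ID)
	}
	return groups
}

func main() {
	groups := groupByDatasource([]DiscoveredDataset{
		{ID: "d1", DatasourceID: "ds-a"},
		{ID: "d2", DatasourceID: "ds-b"},
		{ID: "d3", DatasourceID: "ds-a"},
	})

	// Sort datasource IDs so the calls happen in a stable order.
	datasourceIDs := make([]string, 0, len(groups))
	for id := range groups {
		datasourceIDs = append(datasourceIDs, id)
	}
	sort.Strings(datasourceIDs)

	for _, id := range datasourceIDs {
		// A single OnboardDiscoveredDatasets call per datasource would go here.
		// On error, the real command logs a warning and continues.
		fmt.Printf("datasource %s: promote %v\n", id, groups[id])
	}
}
```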
Summary
A coherent arc of `sodacli dataset onboard` improvements driven by an n_able POC. Five commits, each self-contained:

- 9745b34: onboard discovered-but-not-yet-onboarded datasets. `dataset onboard <id>` previously failed with "dataset not found" when given a discovered ID. Falls back to looking up the discovered dataset across datasources and promoting it on the fly.
- 4c8518e: `--unique-keys` on `dataset diagnostics` for failed-rows collection (preserves untouched fields by seeding from current state).
- 8f58637: optional failed rows collection in `dataset onboard`. New `--collect-failed-rows` / `--no-collect-failed-rows` / `--unique-keys` flags + interactive prompt. `--unique-keys` alone implies `--collect-failed-rows`.
- d8fb4e2: fix: avoid double datasource prefix in qualifiedName for promote-on-the-fly path. Bug introduced by commit 1. `DiscoveredDataset.QualifiedName` from `/api/v1/discoveredDatasets` is already prefixed (`sf_nable/SODA_CE/...`), so prepending datasource name doubled it (`sf_nable/sf_nable/...`) and contract generation failed with "datasets not found". Re-fetch via `GetDataset()` after promotion to get the unprefixed shape.
- edf753f: bulk mode via repeatable `--dataset` flag. Onboard N datasets in one command. Promotion is batched per datasource into one `OnboardDiscoveredDatasets` call each; copilot generation is batched into one `GenerateContract` operation; skeleton loops per-dataset (API doesn't accept a list). Bulk mode requires non-interactive flags and rejects `--collect-failed-rows` (unique keys are dataset-specific).

Test plan

- `sodacli dataset onboard <discovered-id> --contracts copilot` against an n_able Snowflake dataset → no doubled prefix, contract operation accepted (HTTP 202)
- `sodacli dataset onboard <id1> --dataset <id2> --no-monitoring --no-profiling --contracts none` → batched promotion works
- `sodacli dataset onboard <id1> --dataset <id2>` (no flags) → errors with bulk-mode requirement
- `sodacli dataset onboard <id1> --dataset <id2> ... --collect-failed-rows --unique-keys id` → rejected with clear message
- `sodacli dataset onboard <id> --collect-failed-rows --unique-keys id,email` → enables failed rows
- `sodacli dataset onboard <id> --collect-failed-rows` (no keys) → errors clearly
- `sodacli dataset diagnostics <id> --unique-keys id` → updates keys without resetting `enabled`