Skip to content

Conversation

@anuunchin
Copy link
Contributor

@anuunchin anuunchin commented Nov 21, 2025

This PR attempts to solve the issue where setting primary_key to an empty string in apply_hints does not actually remove it from schema.

First attempted to be resolved in #3280.

TLDR: This solution uses the x-extractor field to pass empty value hints around.

Resolves #3210

@cloudflare-workers-and-pages
Copy link

cloudflare-workers-and-pages bot commented Nov 21, 2025

Deploying with  Cloudflare Workers  Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status Name Latest Commit Preview URL Updated (UTC)
✅ Deployment successful!
View logs
docs cb94884 Commit Preview URL

Branch Preview URL
Nov 21 2025, 12:43 PM

@anuunchin anuunchin force-pushed the fix/3210-empty-keys branch 2 times, most recently from b5a7d10 to 6c3f35a Compare November 21, 2025 12:29
@anuunchin anuunchin marked this pull request as ready for review November 21, 2025 12:29
@anuunchin anuunchin requested a review from rudolfix November 21, 2025 12:34
Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can handle everything in diff_table if we mark certain hints as compound and implement missing behaviors. Here's how hints are evolved:

  • we have resource definition (columns, primary keys), apply hints, and hints manipulation directly in code. this always happens from scratch in each pipeline run. user can do here whatever user wants. no changes are needed
  • extract starts: here we must deal with schema from previous run or with imported schema
  • this happens just before extract: source.schema.update_schema(source_schema): we use newly computed source schema to update pipeline schema (table by table)
  • tables diffs are calculated: during that process a diff is created which is a partial table with new and modified columns
  • the diff is applied to table in pipeline schema and columns are merged with merge_columns.
  • table updates happen a few times more (dynamic tables in the extract, schema evolution in the normalize) but the logic is the same

the root of the problem is in diff_table: it just blindly merges column hints additively. we must handle compound hints here. we assume that partial table has complete information on each hint. so for example:

  • if there a column that sets a primary key, we must reset those keys on all columns of pipeline schema table (table_a) and add them to diff.
  • if there a column without primary key but in the table_a the key is set we know that primary key hint is modified and we need to remove it from table_a fully
    tldr ;> a block replaces existing block, even if new block is empty (but intersection of column of old blocks with columns in the diff cannot be empty)

a few notes:

  1. preserve the column order (merge_columns does it, it will be sufficient)
  2. merge_columns won't need columns_partial - this should be a flag when generating diff (see docstrings)
  3. we are implementing if columns_partialis False, hints likeprimary_keyandmerge_keyare dropped fromcolumns_aand replaced fromcolumns_b``
  4. merge_column merge default is always True. you can remove that flag and remove all "false" usage (let's see what happens)

this is not that complicated overall. and the hints merge code is in better shape than I remembered...



TColumnPropMergeType = Literal[
"replacable",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just mark certain hints as compound

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

primary_key not removed with apply_hints(primary_key="") (same issue with merge_key)

3 participants