@alkaline-0
Contributor

Description

  • Fix SCD2 sticky boundary behavior: when write_disposition.strategy="scd2" and boundary_timestamp is omitted, always set x-boundary-timestamp to the current load package’s created_at. This ensures each run uses the effective time of that run rather than persisting the previous boundary.
  • Also updated the SCD2 tests to validate this case.

Related Issues

Fixes #3190

@alkaline-0 alkaline-0 marked this pull request as ready for review November 14, 2025 19:50
@alkaline-0 alkaline-0 self-assigned this Nov 14, 2025
@alkaline-0 alkaline-0 force-pushed the fix/3190-value-of-boundary-value-remains-fixed branch from 29292e6 to 7e708f3 Compare November 14, 2025 20:15
@alkaline-0 alkaline-0 marked this pull request as draft November 14, 2025 20:20
@alkaline-0 alkaline-0 force-pushed the fix/3190-value-of-boundary-value-remains-fixed branch 2 times, most recently from 3e6dc17 to 760fe85 Compare November 14, 2025 20:37
@alkaline-0 alkaline-0 marked this pull request as ready for review November 15, 2025 03:39
@alkaline-0 alkaline-0 requested review from rudolfix and sh-rp November 15, 2025 03:39
Collaborator

@rudolfix rudolfix left a comment

Please check the review comments. Schema hints are additive: schema fragments may come from many places (import schema, explicit code, resource hints, apply hints, etc.), so the fact that a hint is not present is not enough to remove it from the schema.
That's why the user needs to pass None here. If that does not work, we have a real problem though...
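
To illustrate the additive behavior described above, here is a minimal sketch in plain Python. It does not use dlt's actual schema-merging code: `merge_hints` is a hypothetical helper, and the assumption that an explicit None drops a key is taken from the reviewer's description, not from the library source.

```python
from typing import Any, Dict


def merge_hints(stored: Dict[str, Any], incoming: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical additive merge: absent keys are kept, an explicit None drops the key."""
    merged = dict(stored)
    for key, value in incoming.items():
        if value is None:
            # Explicit None is treated as a request to remove the hint.
            merged.pop(key, None)
        else:
            merged[key] = value
    return merged


# Run 1 sets an explicit boundary.
first_run = merge_hints({}, {"x-boundary-timestamp": "2025-11-01T00:00:00+00:00"})

# Run 2 simply omits the hint: the old boundary stays in the schema ("sticky").
omitted = merge_hints(first_run, {})

# Run 2 passes None instead: the hint is dropped.
reset = merge_hints(first_run, {"x-boundary-timestamp": None})
```

Under these assumptions, omitting the hint leaves the earlier boundary in place, while passing None removes it.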

if merge_strategy == "scd2":
    md_dict = cast(TScd2StrategyDict, md_dict)
    if "boundary_timestamp" in md_dict:
        boundary = md_dict.get("boundary_timestamp")
Collaborator

The root problem here IMO is that the user is not correctly resetting the boundary value (or resetting does not work). We already do what you implemented below when we generate SQL:

boundary_ts = ensure_pendulum_datetime_utc(
    root_table.get("x-boundary-timestamp", current_load_package()["state"]["created_at"])  # type: ignore[arg-type]
)

and IMO the problem was that the user skipped this hint on the next run instead of setting the boundary to None. Let's do an experiment first; see below.
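
The fallback quoted above can be sketched with the standard library only. This is an illustration, not dlt's code: `resolve_boundary` is a hypothetical name, and `datetime.fromisoformat` stands in for `ensure_pendulum_datetime_utc`. It assumes timezone-aware timestamps.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def resolve_boundary(root_table: Dict[str, Any], package_created_at: datetime) -> datetime:
    """Use the stored hint if the key is present, else the load package's creation time."""
    value = root_table.get("x-boundary-timestamp", package_created_at)
    if isinstance(value, str):
        # Stand-in for ensure_pendulum_datetime_utc: parse an ISO 8601 string.
        value = datetime.fromisoformat(value)
    return value.astimezone(timezone.utc)


created_at = datetime(2025, 11, 18, 12, 0, tzinfo=timezone.utc)

# Hint dropped from the table schema: the package timestamp is used.
from_package = resolve_boundary({}, created_at)

# Hint still present from an earlier run: the old boundary wins.
from_hint = resolve_boundary(
    {"x-boundary-timestamp": "2025-11-01T00:00:00+00:00"}, created_at
)
```

Note that `dict.get` only falls back when the key is absent, which is why the hint has to be dropped from the schema for the package timestamp to take effect.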

}
)
with mock.patch(
"dlt.common.storages.load_package.precise_time",
Collaborator

Let's do an experiment here. Please stash the changes in hints.py so we work with the original code, and modify the test to pass "boundary_timestamp": None.

This test should pass (we drop the hint, and the code in sql_jobs.py sets the package timestamp). If the test passes, we just need to amend our scd2 docs on how to reset the boundary.
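
The test fragment above patches a clock function so the package timestamp is deterministic. A minimal, self-contained sketch of that technique follows. It patches `time.time` rather than dlt's `precise_time`, and `new_package_state` is a hypothetical stand-in for whatever creates the load package state.

```python
import time
from unittest import mock

FROZEN = 1_700_000_000.0  # arbitrary fixed epoch seconds, chosen for the example


def new_package_state() -> dict:
    """Hypothetical stand-in: record the creation time of a load package."""
    return {"created_at": time.time()}


# Inside the patch, every call to time.time() returns the frozen value,
# so a test can assert on the exact timestamp a run would use.
with mock.patch("time.time", return_value=FROZEN):
    frozen_state = new_package_state()

# Outside the patch, the real clock is used again.
live_state = new_package_state()
```

With the clock frozen, a test can compare the validity columns of a run against a known value instead of the wall clock.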

@alkaline-0 alkaline-0 removed the request for review from sh-rp November 18, 2025 12:06
…p using the current load package's creation time when none is provided. Modified corresponding test to ensure correct behavior in the pipeline.
@alkaline-0 alkaline-0 force-pushed the fix/3190-value-of-boundary-value-remains-fixed branch 4 times, most recently from 8706139 to 0adf780 Compare November 18, 2025 15:33
…ests for new behavior, including resetting to current load time.
@alkaline-0 alkaline-0 force-pushed the fix/3190-value-of-boundary-value-remains-fixed branch from 0adf780 to 7892550 Compare November 18, 2025 15:37
Collaborator

@rudolfix rudolfix left a comment

The code could be a little cleaner. Docs and tests LGTM, thanks for that!

Please take the sqlglot patch commit so the common tests pass. We can't merge without them.

f"could not parse `{ts}` value `{wd[ts]}`" # type: ignore[literal-required]
)

art = wd.get("active_record_timestamp")
Collaborator

The previous loop was IMO more elegant; there is too much duplicated code here.

boundary_ts = ensure_pendulum_datetime_utc(
    root_table.get("x-boundary-timestamp", current_load_package()["state"]["created_at"])  # type: ignore[arg-type]
)
created_at = current_load_package()["state"]["created_at"]
Collaborator

The code is correct, but you do not need to take "created_at" from the package unconditionally. Do it in the else branch of the ternary operator below: else current_load_package()["state"]["created_at"]

Same thing in sql_jobs.
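
The difference between the two styles can be sketched as follows. The function names are hypothetical and `load_created_at` is a stand-in for `current_load_package()["state"]["created_at"]`; the PR's actual code is not reproduced here.

```python
from typing import Any, Dict

calls = {"count": 0}  # counts how often the package state is read


def load_created_at() -> str:
    """Stand-in for reading created_at from the current load package."""
    calls["count"] += 1
    return "2025-11-18T15:37:00+00:00"


def boundary_eager(root_table: Dict[str, Any]) -> str:
    # Reads the package state even when a boundary hint is present.
    created_at = load_created_at()
    boundary = root_table.get("x-boundary-timestamp")
    return boundary if boundary is not None else created_at


def boundary_lazy(root_table: Dict[str, Any]) -> str:
    # Reads the package state only in the else branch of the ternary.
    boundary = root_table.get("x-boundary-timestamp")
    return boundary if boundary is not None else load_created_at()


table = {"x-boundary-timestamp": "2025-11-01T00:00:00+00:00"}

boundary_eager(table)
eager_calls = calls["count"]

calls["count"] = 0
boundary_lazy(table)
lazy_calls = calls["count"]
```

Both variants return the same boundary; the lazy form just avoids touching the package state when an explicit boundary is already set.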

Successfully merging this pull request may close these issues.

The value of _dlt_valid_from and _dlt_valid_to keeps being set to a former boundary_value

3 participants