-
Notifications
You must be signed in to change notification settings - Fork 394
Description
Summary
Currently, dlt.Relation is coupled to pipeline.run() in a way that those two are not equivalent
@dlt.transformation
def foo(dataset: dlt.Dataset):
yield dataset.table("bar")
# retrieve relation from transformation then execute and get data
result_rel: dlt.Relation = list(foo(dataset))[0]
result_rel.df()
# execute transformation via pipeline.run() then get data
pipeline.run([foo])
result2_rel = pipeline.dataset().table("foo")
result2_rel.df()This behavior is inconsistent with @dlt.resource where doing list(resource) will gather records in-memory and add the _dlt_id and _dlt_load_id fields.
This happens because dlt.Pipeline.run() goes through Extract, Normalize, Load. The Normalize step adds the _dlt_id and _dlt_load_id columns.
Motivation
We want to be able to define @dlt.transformation interactively without requiring the heavy dlt.Pipeline.run() constructs on each execution.
The column _dlt_id exists on all tables produced by dlt by default and is useful to uniquely identify records. This is useful for transformation when:
- identifying rows for data quality checks
- incremental strategies
- smart joins
Intended behavior
list(transformation)[0]should include the_dlt_idand_dlt_load_idjust likelist(resource)does- Currently, transformation outputs are all treated like root tables; they have their own
_dlt_idand_dlt_load_idwith no parent.
- This could change in the future (e.g., we generate dimensions and facts), but is good for now
- Executing
dlt.Relationvia.df(),.fetchall(), etc. shouldn't add_dlt_idnor_dlt_load_id. Relations are a more generic mechanism than transformations - Methods and features related to
dlt.Datasetshouldn't mutate queries to include / generate_dlt_idnor_dlt_load_id
Potential solution
The internals of @dlt.transformation (e.g., make_transformation_function) produce a dlt.Relation. The code should be changed such that dlt.Relation generates the _dlt_id and _dlt_load_id instead of this being done in the Normalization of pipeline.run()
question: can I have a list of things that happen a Normalization for @dlt.transformation?
Metadata
Metadata
Assignees
Labels
Type
Projects
Status