Skip to content

feat(relation): clarify dlt.Relation lifecycle #3206

@zilto

Description

@zilto

Summary

Currently, dlt.Relation is coupled to pipeline.run() in a way that those two are not equivalent

@dlt.transformation
def foo(dataset: dlt.Dataset):
  yield dataset.table("bar")

# retrieve relation from transformation then execute and get data
result_rel: dlt.Relation = list(foo(dataset))[0]
result_rel.df()

# execute transformation via pipeline.run() then get data
pipeline.run([foo])
result2_rel = pipeline.dataset().table("foo")
result2_rel.df()

This behavior is inconsistent with @dlt.resource where doing list(resource) will gather records in-memory and add the _dlt_id and _dlt_load_id fields.

This happens because dlt.Pipeline.run() goes through Extract, Normalize, Load. The Normalize step adds the _dlt_id and _dlt_load_id columns.

Motivation

We want to be able to define @dlt.transformation interactively without requiring the heavy dlt.Pipeline.run() constructs on each execution.

The column _dlt_id exists on all tables produced by dlt by default and is useful to uniquely identify records. This is useful for transformation when:

  • identifying rows for data quality checks
  • incremental strategies
  • smart joins

Intended behavior

  1. list(transformation)[0] should include the _dlt_id and _dlt_load_id just like list(resource) does
  2. Currently, transformation outputs are all treated like root tables; they have their own _dlt_id and _dlt_load_id with no parent.
  • This could change in the future (e.g., we generate dimensions and facts), but is good for now
  1. Executing dlt.Relation via .df(), .fetchall(), etc. shouldn't add _dlt_id nor _dlt_load_id. Relations are a more generic mechanism than transformations
  2. Methods and features related to dlt.Dataset shouldn't mutate queries to include / generate _dlt_id nor _dlt_load_id

Potential solution

The internals of @dlt.transformation (e.g., make_transformation_function) produce a dlt.Relation. The code should be changed such that dlt.Relation generates the _dlt_id and _dlt_load_id instead of this being done in the Normalization of pipeline.run()

question: can I have a list of things that happen a Normalization for @dlt.transformation?

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions