@@ -659,6 +659,11 @@ runtime. The sharding behavior depends on the runners.
You must use `triggering_frequency` to specify a triggering frequency for
initiating load jobs. Be careful about setting the frequency such that your
pipeline doesn't exceed the BigQuery load job [quota limit](https://cloud.google.com/bigquery/quotas#load_jobs).

> **Note:** When using file load-based BigQuery writes with dynamic destinations and a non-zero
> `triggering_frequency`, temporary tables may be created repeatedly, and the loaded data may not
> be finalized into the destination tables. This is a known limitation (see BEAM-9917).
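
For illustration only (not part of the diff): a write that combines the ingredients the note warns about might be configured as in the sketch below. The project, dataset, schema, and routing callable are hypothetical.

```python
import apache_beam as beam

# Illustrative sketch: FILE_LOADS, a callable `table` (dynamic destinations),
# and a non-zero triggering_frequency -- the combination the note describes.
# The project, dataset, and schema names are placeholders.
write = beam.io.WriteToBigQuery(
    table=lambda row: 'my-project:my_dataset.%s_events' % row['type'],
    schema='type:STRING,payload:STRING',
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
    triggering_frequency=300,  # seconds between load-job triggers
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
)
```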
**Contributor** (on lines +663 to +665):
It looks like the goal of this issue (#20242) is to resolve the root cause, not to document the issue/bug?

**Author** replied:

Thanks for the clarification — that makes sense.

I added the documentation note to clarify the current behavior while
investigating the issue, but I understand that the primary goal is to
address the root cause.

As a next step, I can add a failing unit test that captures the current
behavior with dynamic destinations and triggering_frequency, or help
with investigation into the finalization logic. Please let me know which
direction would be preferred.

**Contributor** @mohamedawnallah (Dec 15, 2025):

If we can reproduce this issue locally, we are halfway to a resolution. A reproducibility experiment could go along these lines:

  • Locate `BigQueryBatchFileLoads` in the codebase (e.g. with a keyword search in the IDE)
  • Once we know where it lives, see how it has been tested, e.g. with a single table vs. multiple tables
  • If there are tests for multiple tables, check whether residual temporary tables are left behind (as mentioned in the issue)
  • If there are no tests for multiple tables, or integration-testing is not feasible, test against a free-tier GCP project

Once we can reproduce the issue, we can trace the relevant code paths and adjust them deliberately, with accompanying tests, so that the issue doesn't resurface as a regression.
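
For concreteness, a minimal streaming reproduction against a real (e.g. free-tier) GCP project might look like the sketch below. The project, topic, and dataset names are placeholders, and the pipeline shape is an assumption, not code from this PR.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder resources; substitute a real project, topic, and dataset.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    rows = (
        p
        | beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic')
        | beam.Map(lambda msg: {'user': msg.decode('utf-8')})
    )
    # Dynamic destinations: route each element to a per-user table. With
    # FILE_LOADS and a non-zero triggering_frequency, observe whether
    # temporary tables accumulate while destination tables stay empty.
    _ = rows | beam.io.WriteToBigQuery(
        table=lambda row: 'my-project:my_dataset.events_%s' % row['user'],
        schema='user:STRING',
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        triggering_frequency=60,  # seconds
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
```

After the pipeline has run through a few triggering intervals, listing the tables in the dataset should show whether temporary tables keep accumulating while the destination tables remain unfinalized, as the issue describes.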

**Contributor** @mohamedawnallah (Dec 15, 2025):

> As a next step, I can add a failing unit test that captures the current
> behavior with dynamic destinations and `triggering_frequency`, or help
> with investigation into the finalization logic. Please let me know which
> direction would be preferred.

It would be great if we could have a reproducible test that captures the bug first; then we can iterate on the solution.


{{< /paragraph >}}

{{< paragraph class="language-py" >}}