-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[Bug]: BigQueryEnrichmentHandler batch mode drops earlier requests when multiple requests share the same key #38035
Description
What happened?
There appears to be a bug in:
sdks/python/apache_beam/transforms/enrichment_handlers/bigquery.py
In batch mode, requests are collected into a map keyed by the enrichment key:
requests_map[self.create_row_key(req)] = reqIf more than one request in the same batch has the same enrichment key, the newer request overwrites the earlier one in requests_map.
As a result, duplicate-key requests in the same batch are not all preserved and matched correctly.
What did you expect to happen?
When multiple requests in the same batch share the same enrichment key, all of them should be retained and handled correctly.
What is the actual behavior?
Only the latest request for a given key is stored in requests_map, so earlier requests with the same key are lost.
Example
Given batched requests like:
Row(id=1, ...)Row(id=1, ...)
both requests produce the same enrichment key. The second request overwrites the first in requests_map, which can cause one of the requests to be dropped or treated as unmatched incorrectly.
Proposed fix
Store a collection of requests per key instead of a single request, for example a list or queue, and then consume one request per matching response key during result processing.
Additional context
This seems to affect the batch path in BigQueryEnrichmentHandler.__call__, not the single-request path.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Infrastructure
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner