[Bug]: BigQueryEnrichmentHandler batch mode drops earlier requests when multiple requests share the same key #38035

@prabhnoor0212

Description

What happened?

There appears to be a bug in:

sdks/python/apache_beam/transforms/enrichment_handlers/bigquery.py

In batch mode, requests are collected into a map keyed by the enrichment key:

requests_map[self.create_row_key(req)] = req

If more than one request in the same batch has the same enrichment key, the newer request overwrites the earlier one in requests_map.

As a result, when a batch contains duplicate-key requests, not all of them are preserved, and the overwritten ones cannot be matched to a response.

What did you expect to happen?

When multiple requests in the same batch share the same enrichment key, all of them should be retained and handled correctly.

What is the actual behavior?

Only the latest request for a given key is stored in requests_map, so earlier requests with the same key are lost.

Example

Given batched requests like:

  • Row(id=1, ...)
  • Row(id=1, ...)

both requests produce the same enrichment key. The second request overwrites the first in requests_map, so the first request can be dropped or incorrectly treated as unmatched.
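The overwrite can be reproduced outside Beam with a plain dict. This is only a sketch: `Row` here is a namedtuple standing in for `beam.Row`, and `create_row_key` is a hypothetical stand-in that keys on `id`, not the handler's actual method.

```python
from collections import namedtuple

# Stand-in for beam.Row (assumption: the real handler receives beam.Row objects).
Row = namedtuple("Row", ["id", "name"])


def create_row_key(row):
    # Hypothetical key function: builds the key from the `id` field only.
    return (("id", row.id),)


batch = [Row(id=1, name="first"), Row(id=1, name="second")]

requests_map = {}
for req in batch:
    # Same assignment pattern as quoted in the report: one slot per key.
    requests_map[create_row_key(req)] = req

# Only the last request for the shared key is still in the map.
surviving = list(requests_map.values())
print(len(batch), "requests in,", len(surviving), "kept:", surviving)
```

Two requests go in, but the map holds a single entry for the shared key, so the first request is no longer reachable when responses are matched.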

Proposed fix

Store a collection of requests per key instead of a single request, for example a list or queue, and then consume one request per matching response key during result processing.
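A minimal sketch of that idea, assuming a FIFO queue per key. The names `Row`, `Resp`, `key_of`, and `match_responses` are illustrative stand-ins and are not part of the Beam handler.

```python
from collections import defaultdict, deque, namedtuple

# Illustrative stand-ins for the request rows and the BigQuery response rows.
Row = namedtuple("Row", ["id", "name"])
Resp = namedtuple("Resp", ["id", "value"])


def key_of(obj):
    # Hypothetical key function shared by requests and responses.
    return (("id", obj.id),)


def match_responses(batch, responses):
    # Keep every request: one queue of pending requests per enrichment key.
    pending = defaultdict(deque)
    for req in batch:
        pending[key_of(req)].append(req)

    matched = []
    for resp in responses:
        queue = pending.get(key_of(resp))
        if queue:
            # Consume one pending request per response with that key.
            matched.append((queue.popleft(), resp))

    # Requests still queued received no response.
    unmatched = [req for queue in pending.values() for req in queue]
    return matched, unmatched


batch = [Row(id=1, name="first"), Row(id=1, name="second")]
responses = [Resp(id=1, value="x"), Resp(id=1, value="y")]
matched, unmatched = match_responses(batch, responses)
```

With two responses for the shared key, both requests are matched in arrival order. If the query returns only one row for a key that several requests share, the remaining requests stay in `unmatched` under this scheme; fanning a single response out to every queued request would be an alternative in that case.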

Additional context

This seems to affect the batch path in BigQueryEnrichmentHandler.__call__, not the single-request path.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
