Member

@lucasbru lucasbru commented Oct 23, 2025

This change enhances the offset commit validation logic in streams
groups to validate against per-partition assignment epochs. When a
member attempts to commit offsets with an older member epoch, the logic
now validates that the epoch is not older than the assignment epoch for
each individual partition being committed.

The implementation adds a new createAssignmentEpochValidator method
that creates partition-level validators, checking each partition against
its assignment epoch from either assigned tasks or tasks pending
revocation.
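
As a rough illustration of the per-partition check described above, here is a minimal, self-contained sketch. It is not the code from this PR: the class `AssignmentEpochSketch`, the `PartitionValidator` interface, the `StaleEpochException` stand-in, and the plain-map representation of epochs (subtopology ID to partition to assignment epoch) are all assumptions made for the example. Rejecting partitions that are in neither map is likewise an assumption.

```java
import java.util.Map;

class AssignmentEpochSketch {

    /** Hypothetical stand-in for Kafka's StaleMemberEpochException. */
    static class StaleEpochException extends RuntimeException {
        StaleEpochException(String message) {
            super(message);
        }
    }

    /** Hypothetical per-partition check, invoked once per committed partition. */
    @FunctionalInterface
    interface PartitionValidator {
        void validate(String topicName, int partitionId);
    }

    /**
     * Builds a validator for a commit that arrived with receivedMemberEpoch.
     * Epoch maps are keyed by subtopology ID, then by partition (assumed layout).
     */
    static PartitionValidator createValidator(
            Map<String, String> sourceTopicToSubtopology,
            Map<String, Map<Integer, Integer>> assignedEpochs,
            Map<String, Map<Integer, Integer>> pendingRevocationEpochs,
            int receivedMemberEpoch) {
        return (topicName, partitionId) -> {
            String subtopologyId = sourceTopicToSubtopology.get(topicName);
            if (subtopologyId == null) {
                throw new StaleEpochException("Topic " + topicName + " is not in the topology.");
            }
            // Look in the assigned tasks first, then in the tasks pending revocation.
            Integer assignmentEpoch =
                    assignedEpochs.getOrDefault(subtopologyId, Map.of()).get(partitionId);
            if (assignmentEpoch == null) {
                assignmentEpoch =
                        pendingRevocationEpochs.getOrDefault(subtopologyId, Map.of()).get(partitionId);
            }
            if (assignmentEpoch == null) {
                // Assumption: a partition the member does not own is fenced as well.
                throw new StaleEpochException("Partition " + topicName + "-" + partitionId
                        + " is not assigned to the member.");
            }
            if (receivedMemberEpoch < assignmentEpoch) {
                throw new StaleEpochException("Received epoch " + receivedMemberEpoch
                        + " is older than assignment epoch " + assignmentEpoch
                        + " of " + topicName + "-" + partitionId + ".");
            }
        };
    }

    public static void main(String[] args) {
        PartitionValidator validator = createValidator(
                Map.of("input", "0"),
                Map.of("0", Map.of(0, 5, 1, 8)),
                Map.of(),
                7);
        validator.validate("input", 0); // 7 >= 5, accepted
        try {
            validator.validate("input", 1); // 7 < 8, rejected
        } catch (StaleEpochException e) {
            System.out.println("Fenced: " + e.getMessage());
        }
    }
}
```

The validator is built once per commit request and then applied to each partition, so a request carrying an older epoch can still succeed for partitions whose assignment epoch it does not predate.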

We extend the SmokeTestDriverIntegrationTest to detect if we have
processed more records than needed, which, in this restricted scenario,
should only happen when offset commits are failing.

We re-enable the previously flaky test in EosIntegrationTest, which
failed due to previously failing offset commits.

Both tests have been run 100x in their streams protocol variation to
validate that they are not flaky anymore.

Reviewers: David Jacot, Matthias J. Sax

@github-actions github-actions bot added the streams, build, and group-coordinator labels Oct 23, 2025
return (topicName, topicId, partitionId) -> {
    final StreamsGroupTopologyValue.Subtopology subtopology = streamsTopology.sourceTopicMap().get(topicName);
    if (subtopology == null) {
        throw new StaleMemberEpochException("Topic " + topicName + " is not in the topology.");
Member Author

This case is actually impossible right now because we do not allow updating the topology yet. But I think this would be the correct behavior once we allow changing the topology: We are trying to commit for a subtopology that does not exist anymore, so we should fence the member.

) {
    // Retrieve topology once for all partitions - not per partition!
    final StreamsTopology streamsTopology = topology.get().orElseThrow(() ->
        new StaleMemberEpochException("Topology is not available for offset commit validation."));
Member Author

We do not allow removing the topology, so I think this is almost impossible. We'd have to recreate a group with the same name and get the same member ID back to reach this point. If that ever happened, I think fencing the member would be okay.

@lucasbru lucasbru requested a review from Copilot October 23, 2025 15:08
Contributor

Copilot AI left a comment

Pull Request Overview

This PR enhances offset commit validation in streams groups by implementing per-partition epoch validation. Instead of just checking the member epoch, the system now validates that each partition being committed has an assignment epoch that is not newer than the commit request's epoch, enabling proper fencing of zombie commit requests.

Key changes:

  • Introduction of TasksTupleWithEpochs to track assignment epochs per partition for active tasks
  • Addition of createAssignmentEpochValidator method for partition-level validation
  • Extension of SmokeTestDriverIntegrationTest to detect excessive record processing
  • Re-enabling of previously flaky EosIntegrationTest

Reviewed Changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| TasksTupleWithEpochs.java | New class storing active tasks with per-partition assignment epochs |
| TasksTuple.java | Updated to work with TasksTupleWithEpochs for containment checks |
| StreamsGroup.java | Implements per-partition epoch validation via createAssignmentEpochValidator |
| StreamsGroupMember.java | Changed to use TasksTupleWithEpochs for assigned tasks |
| CurrentAssignmentBuilder.java | Updated to preserve and assign epochs when building assignments |
| StreamsCoordinatorRecordHelpers.java | Serialization support for assignment epochs |
| StreamsGroupCurrentMemberAssignmentValue.json | Added AssignmentEpochs field to schema |
| SmokeTestClient.java | Tracks total data records processed |
| SmokeTestDriverIntegrationTest.java | Validates no excessive record reprocessing occurs |
| EosIntegrationTest.java | Removed @Flaky annotation from test |
| Various test files | Updated to work with epoch-aware task assignments |

@lucasbru lucasbru changed the title from "KAFKA-19779: Add per-partition epoch validation to streams groups [5/N]" to "KAFKA-19779: Add per-partition epoch validation to streams groups [4/N]" Oct 23, 2025
group.updateMember(new StreamsGroupMember.Builder("member-1")
    .setMemberEpoch(5)
    .setAssignedTasks(new TasksTupleWithEpochs(
        Map.of("0", Map.of(0, 2, 1, 2)),
Member

Seems we only test epoch equality? Should we use a different epoch for each partition to extend test coverage?

Member Author

Good point. I made them different

    .build());

CommitPartitionValidator validator = group.validateOffsetCommit(
    "member-1", "", 2, false, ApiKeys.OFFSET_COMMIT.latestVersion());
Member

If we set the member epoch to 5 above, should we also pass in 5 here for the commit?

Member

Well, with the relaxed check, it just needs to be smaller, I guess, but larger than or equal to the assignment epoch.

Member Author

Yes, exactly.
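
To make the relaxed check discussed in this thread concrete, here is a small sketch using the values from the test under review (member epoch 5, assignment epoch 2). The class and method names are hypothetical, and rejecting a commit epoch that is newer than the member epoch is an assumption, not something stated in this thread.

```java
class CommitEpochDecision {

    /**
     * Returns true if a commit for one partition would pass the relaxed check:
     * the received epoch may be older than the member epoch, as long as it is
     * not older than the partition's assignment epoch.
     */
    static boolean partitionCommitAllowed(int memberEpoch, int receivedEpoch, int assignmentEpoch) {
        if (receivedEpoch > memberEpoch) {
            // Assumption: an epoch from the future is never accepted.
            return false;
        }
        return receivedEpoch >= assignmentEpoch;
    }

    public static void main(String[] args) {
        // Member epoch 5, partition assigned at epoch 2.
        System.out.println(partitionCommitAllowed(5, 2, 2)); // commit epoch equals assignment epoch
        System.out.println(partitionCommitAllowed(5, 1, 2)); // commit epoch older than assignment epoch
    }
}
```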

    "member-1", "", 1, false, ApiKeys.OFFSET_COMMIT.latestVersion());

// Partition 0 assigned with epoch 2, received epoch 1 should throw
assertThrows(StaleMemberEpochException.class, () ->
Member

Seems this check is somewhat redundant with testValidateOffsetCommitWithOlderEpoch from above?

Member Author

You are right. This is not really what we are testing in this test. Removed this assertion.

        Map.of(), Map.of()))
    .build());

// Commit with member epoch 3 should fail (3 < assignment epochs 5 and 8)
Member

We should even fail for epoch 7 (< 8), right?

Member Author

Good point. It's a better test if only one partition fails the check.

}

@Test
public void testStreamsGroupOffsetCommitWithOlderMemberEpochValidAssignments() {
Member

Wondering to what extent this test and testStreamsGroupOffsetCommitWithAssignmentEpochValid above overlap? In both cases, we expect success, which means member epoch > every assignment epoch.

What increase in test coverage do we get by having both tests?

Member Author

Yes. Removed this test.

Member

@mjsax mjsax left a comment

Overall, this looks good to me.

Member Author

lucasbru commented Nov 5, 2025

@mjsax Comments addressed and PR rebased.

lucasbru and others added 3 commits November 6, 2025 11:04
This commit enhances the offset commit validation logic in streams groups
to validate against per-partition assignment epochs. When a member attempts to
commit offsets with an older member epoch, the logic now validates that the epoch
is not older than the assignment epoch for each individual partition being committed.

The implementation adds a new `createAssignmentEpochValidator` method that creates
partition-level validators, checking each partition against its assignment epoch from
either assigned tasks or tasks pending revocation.

We extend the SmokeTestDriverIntegrationTest to detect if we have
processed more records than needed, which, in this restricted scenario,
should only happen when offset commits are failing.

We re-enable the previously flaky test in EosIntegrationTest, which
failed due to previously failing offset commits.

Both tests have been run 100x in their streams protocol variation to
validate that they are not flaky anymore.
@lucasbru lucasbru marked this pull request as ready for review November 6, 2025 10:06
…onTest.shouldNotViolateEosIfOneTaskFails` as flaky (apache#20733)"

This reverts commit 8dfaedd.
Member

@dajac dajac left a comment

lgtm

@lucasbru lucasbru merged commit d5d9892 into apache:trunk Nov 6, 2025
23 checks passed
3 participants