Skip to content

Fail partition processor on scheduler errors#4952

Open
muhamadazmy wants to merge 1 commit into
mainfrom
pr4952
Open

Fail partition processor on scheduler errors#4952
muhamadazmy wants to merge 1 commit into
mainfrom
pr4952

Conversation

@muhamadazmy

Copy link
Copy Markdown
Contributor

Summary:
Make sure that partition processor fails when
facing scheduler errors

Summary:
Make sure that partition processor fails when
facing scheduler errors
@muhamadazmy muhamadazmy requested a review from tillrohrmann June 17, 2026 09:56

@tillrohrmann tillrohrmann left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this PR @muhamadazmy. Can we ensure that the PP fails if there is a scheduler error?

}
}
let result = scheduler.schedule_next(vqueue_metas).await;
Some((ActionEffect::Scheduler(result), scheduler))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of forwarding the error and logging it at a different place, can we let run fail if we encounter a scheduler error? Then the pp should fail as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PP still fails on applying the action effects! I don't only log the error.

I was aiming first on failing run() directly, but it wasn't possible cleanly without too much changes and some allocations as well.

Instead, I emitted the scheduler decision result, then fail while applying the action. Which still fails the PP

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the reasons i didn't just poll the scheduler in the select! block is cancellation safety, since all_streams is created on the stack on each call to run() racing with the scheduler will cause loss of already polled effects.

All other branches in the select block return with error this is why it's okay now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed the part wrt to propagating the error. This should be fine.

What I don't fully understand is the cancellation safety that we get with the stream_select! vs polling the scheduler directly in the select! statement. Could you help me understand?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mistaken, the ready_chunk returns immediately if it collected some items and remaining streams returns pending. so there is no fear of loss

@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown

Test Results

  8 files  ±0    8 suites  ±0   4m 34s ⏱️ -7s
 60 tests ±0   60 ✅ ±0  0 💤 ±0  0 ❌ ±0 
267 runs  ±0  267 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit a8736c0. ± Comparison against base commit 59163fa.

♻️ This comment has been updated with latest results.

@muhamadazmy

Copy link
Copy Markdown
Contributor Author

@tillrohrmann please let me know if we can merge this for v1.7 or not?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants