
Conversation

@albertlockett
Member

Opening this as a draft for now. There are some additional test cases and cleanup I want to do, but I wanted to first ensure my handling of the Pdata Contexts was correct.


Closes: #1784

Now that we have route_to in OPL, in combination with if/else, this can create a scenario where we split the batch.

logs |
if (severity_text == "ERROR") {
  route_to "out_port1"
}
// implicit collect everything that didn't go in "if" branch

A pipeline like this would emit two batches:

  1. "ERROR" logs on the processor's "out_port1"
  2. all other logs on the processor's default out port

If the batch has subscribers, when we process a pdata we must keep the context for the inbound batch and create new contexts for the outbound batches. When all the outbound batches have been Ack/Nack'd, we must then Ack/Nack the inbound context.

This PR adds a Contexts type for juggling the inbound/outbound contexts and updates the transform processor to manage contexts and Ack/Nack correctly.

@albertlockett albertlockett requested a review from a team as a code owner January 15, 2026 21:27
@github-actions github-actions bot added the "rust" label (Pull requests that update Rust code) Jan 15, 2026
@codecov

codecov bot commented Jan 15, 2026

Codecov Report

❌ Patch coverage is 96.44269% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.79%. Comparing base (5f651fd) to head (0610f2f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1798      +/-   ##
==========================================
+ Coverage   84.73%   84.79%   +0.06%     
==========================================
  Files         499      501       +2     
  Lines      146596   147333     +737     
==========================================
+ Hits       124222   124936     +714     
- Misses      21840    21863      +23     
  Partials      534      534              
Components Coverage Δ
otap-dataflow 86.26% <96.44%> (+0.08%) ⬆️
query_abstraction 80.61% <ø> (ø)
query_engine 90.52% <ø> (ø)
syslog_cef_receivers ∅ <ø> (∅)
otel-arrow-go 53.50% <ø> (ø)
quiver 90.66% <ø> (ø)

@albertlockett
Member Author

@jmacd could I ask you for a review on this :)? I was using the batch_processor Ack/Nack implementation as inspiration.

One thing in particular I'd like to ensure is correct is the behaviour of juggling the contexts.

Basically, when we find we've split the batch (a rough sketch follows the two lists below):

  1. For the inbound batch, insert an entry into the inbounds slot map in Contexts that holds the number of outbound batches plus the original inbound context.
  2. For each outbound batch, insert an entry into the outbounds slot map in Contexts whose value points back at the inbounds key.
  3. Subscribe the outbound context using call data derived from the outbounds slot map key.

Then when we receive an Ack/Nack:

  1. Look up the inbound key in the outbounds slot map using the key from the call data, and clear that entry from the outbounds slot map.
  2. Decrement the count of outbounds in the inbounds slot map. If the count reaches zero, Ack/Nack the original inbound context.
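
To make that concrete, here's a rough, self-contained sketch of the bookkeeping I have in mind. It uses the slotmap crate and a placeholder Context type instead of the repo's own slot map and pdata Context; the struct names mirror the PR, but the fields, the register_split helper, and the signatures are illustrative rather than the actual API:

use slotmap::{DefaultKey as Key, SlotMap};

/// Placeholder for the real pdata Context type.
struct Context;

struct Inbound {
    context: Context,             // original inbound context, Ack/Nack'd at the end
    num_outbound: usize,          // outbound batches still awaiting Ack/Nack
    error_reason: Option<String>, // recorded if any outbound batch was Nack'd
}

struct Outbound {
    inbound_key: Key, // points back at the inbound entry
}

struct Contexts {
    inbound: SlotMap<Key, Inbound>,
    outbound: SlotMap<Key, Outbound>,
}

impl Contexts {
    /// On split: register the inbound once, then one outbound entry per
    /// emitted batch. The returned keys become the call data used to
    /// subscribe each outbound context.
    fn register_split(&mut self, context: Context, num_outbound: usize) -> Vec<Key> {
        let inbound_key = self.inbound.insert(Inbound {
            context,
            num_outbound,
            error_reason: None,
        });
        (0..num_outbound)
            .map(|_| self.outbound.insert(Outbound { inbound_key }))
            .collect()
    }

    /// On Ack/Nack of an outbound batch: free its entry, decrement the
    /// inbound's counter, and once the last outbound has reported, return
    /// the original inbound context (plus any recorded failure reason) so
    /// the caller can Ack/Nack it upstream.
    fn clear_outbound(&mut self, outbound_key: Key) -> Option<(Context, Option<String>)> {
        let inbound_key = self.outbound.remove(outbound_key)?.inbound_key;
        let remaining = {
            let inbound = self.inbound.get_mut(inbound_key)?;
            inbound.num_outbound -= 1;
            inbound.num_outbound
        };
        if remaining == 0 {
            self.inbound
                .remove(inbound_key)
                .map(|i| (i.context, i.error_reason))
        } else {
            None
        }
    }
}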

I'm also curious whether, when testing, this is the correct way to set up test Ack messages using the outbound contexts to simulate a downstream component having Ack/Nack'd the batch:

// now we'll Ack the outbound messages and ensure that we eventually emit an ack
// for the inbound message
let call_data = outbound_context1.current_calldata().unwrap();
let mut ack1 = AckMsg::new(OtapPdata::new(
    outbound_context1,
    OtapPayload::empty(SignalType::Logs),
));
ack1.calldata = call_data;
let call_data = outbound_context2.current_calldata().unwrap();
let mut ack2 = AckMsg::new(OtapPdata::new(
    outbound_context2,
    OtapPayload::empty(SignalType::Logs),
));
ack2.calldata = call_data;
let call_data = outbound_context3.current_calldata().unwrap();
let mut ack3 = AckMsg::new(OtapPdata::new(
    outbound_context3,
    OtapPayload::empty(SignalType::Logs),
));
ack3.calldata = call_data;

I realize I'm asking you to reverse-engineer a lot of code here, so I'm happy to walk through it on Teams if that's easier :)

Contributor

@jmacd jmacd left a comment


Looks good. Looks like maybe the new code could be applied to the batch_processor in a future PR.

@jmacd
Contributor

jmacd commented Jan 16, 2026

@albertlockett I think you want to use a call to Context::next_ack, for the test section you quoted. This does what the effect handler would have done when the recipient responded with an Ack. See how batch_processor.rs tests use Context::next_ack, basically.

@albertlockett
Member Author

@albertlockett I think you want to use a call to Context::next_ack, for the test section you quoted. This does what the effect handler would have done when the recipient responded with an Ack. See how batch_processor.rs tests use Context::next_ack, basically.

Thanks @jmacd, that worked! Made this change in c031e62

@albertlockett
Member Author

Looks good. Looks like maybe the new code could be applied to the batch_processor in a future PR.

Yeah, I think we could reuse the Contexts type with a little bit of refactoring. The gap with the current implementation is that it doesn't expect an outbound batch to be associated with more than one inbound batch (because currently we only split batches, we don't combine them). I imagine that change will eventually be needed because we'd want OPL to support batching, so once that is in place we could reuse this in the batch_processor.
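
For illustration, and continuing the hypothetical sketch from earlier in the thread, supporting the combine case would roughly mean the outbound entry tracking several inbound keys instead of one:

// Hypothetical extension for combining batches: an outbound batch may carry
// data from several inbound batches, so it must point back at all of them.
struct Outbound {
    inbound_keys: Vec<Key>, // today this is a single inbound_key
}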

}

pub fn set_failed(&mut self, outbound_key: Key, error_reason: String) {
    if let Some(inbound) = self.inbound.get_mut(outbound_key) {
Member


Should we first look up the outbound to get the inbound_key, as done in clear_outbound (line 115)?

Member Author


Yes, you're right @lalitb

Fixed in 4ab041d and added some tests that exercise this code path
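
In terms of the hypothetical sketch earlier in the thread, the corrected lookup goes through the outbound entry first, roughly like this (the error_reason field is my assumption about how the failure gets recorded):

impl Contexts {
    /// Record a failure against the inbound batch that the given outbound
    /// batch belongs to, resolving the inbound via the outbound key first
    /// (mirroring clear_outbound).
    pub fn set_failed(&mut self, outbound_key: Key, error_reason: String) {
        // look up the outbound entry to find which inbound it points at
        if let Some(inbound_key) = self.outbound.get(outbound_key).map(|o| o.inbound_key) {
            if let Some(inbound) = self.inbound.get_mut(inbound_key) {
                inbound.error_reason = Some(error_reason);
            }
        }
    }
}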

/// Ack/NAck'd
pub fn clear_outbound(&mut self, outbound_key: Key) -> Option<(Context, Option<String>)> {
    let inbound_key = {
        let outbound = self.outbound.get(outbound_key)?;
Member

@lalitb lalitb Jan 17, 2026


Maybe I am missing something, but it seems we get the slot but never remove it from self.outbound?

Member Author


Thanks @lalitb, that's a good catch. This should call take, not get. Fixed in 79788a0
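
A tiny standalone illustration of the difference, using the slotmap crate from my hypothetical sketch above (the analogue of take there is remove); get only borrows the entry, so the slot would never be freed:

use slotmap::SlotMap;

fn main() {
    let mut outbound: SlotMap<_, u32> = SlotMap::new();
    let key = outbound.insert(7);

    let _borrowed = outbound.get(key); // entry is still present afterwards
    assert_eq!(outbound.len(), 1);

    let _owned = outbound.remove(key); // entry is gone; the slot can be reused
    assert_eq!(outbound.len(), 0);
}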

// insert outbound
let outbound = Outbound { inbound_key };
self.outbound
    .allocate(|| (outbound, ()))
Member


If allocate fails, we return early but have already incremented num_outbound at line 88, which would leave the inbound context stuck.

Member Author


Thanks @lalitb. Fixed, and added tests that would catch this bug in 1d3e9a3
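
As a toy, self-contained illustration of the ordering fix (hypothetical types, not the repo's slot map): with a bounded pool whose allocation can genuinely fail, the counter should only be bumped once the slot exists, so a failed allocation leaves it untouched:

// Toy bounded pool where allocation can fail once every slot is taken.
struct BoundedPool {
    slots: Vec<Option<u64>>, // fixed capacity; None marks a free slot
}

impl BoundedPool {
    fn allocate(&mut self, value: u64) -> Option<usize> {
        let idx = self.slots.iter().position(Option::is_none)?;
        self.slots[idx] = Some(value);
        Some(idx)
    }
}

// Allocate first, then count the outstanding outbound; if allocation fails
// we return early and `pending` is never incremented.
fn register_outbound(pool: &mut BoundedPool, pending: &mut usize, value: u64) -> Option<usize> {
    let key = pool.allocate(value)?;
    *pending += 1;
    Some(key)
}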

@albertlockett albertlockett force-pushed the albert/route-to-ack-nack branch from 0bfc412 to 4ab041d Compare January 19, 2026 19:29
@albertlockett albertlockett marked this pull request as draft January 19, 2026 21:50
@albertlockett
Member Author

Converting to draft as there's still some work to do to clean up handle_exec_result. I'd like to make the error handling more consistent and add better handling/coverage for the case where the pipeline routed some batches, but the batch destined for the default out port was not returned because the pipeline failed to execute to completion.

@albertlockett albertlockett marked this pull request as ready for review January 20, 2026 14:14
@albertlockett albertlockett changed the title from "[WIP] Ack/Nack for routing in transform_processor" to "Ack/Nack for routing in transform_processor" Jan 20, 2026