Short-read optimization is wrong for O_DIRECT pipes #7051

Open
throwable-one opened this issue Dec 29, 2024 · 1 comment
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-net Module: tokio/net

Comments

@throwable-one

Version

tokio v1.42.0

Platform

Linux UNIT-2619 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 GNU/Linux

(but I have reproduced this on several different Linux systems)

Description
The problem is covered here:
https://users.rust-lang.org/t/tokio-process-freezes-with-packet-pipes-on-linux-when-buffer-is-too-big/123103
Here is a copy

Linux pipes have a so-called "packet mode"; see pipe(2).
TL;DR: when O_DIRECT is set on the pipe, each write is one packet (no larger than PIPE_BUF, 4096 bytes).

Each read returns one "packet"; if the buffer is too small, the remaining bytes of that packet are discarded.
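For illustration (this is not part of the original report), packet mode is enabled by setting O_DIRECT on a pipe fd with fcntl(2). A minimal sketch, assuming the libc crate as a dependency:

use std::os::fd::AsRawFd;

// Switch an existing pipe fd into "packet mode" by OR-ing O_DIRECT into its
// status flags -- the same call dd makes in the strace further below.
fn set_packet_mode(fd: &impl AsRawFd) -> std::io::Result<()> {
    let raw = fd.as_raw_fd();
    let flags = unsafe { libc::fcntl(raw, libc::F_GETFL) };
    if flags < 0 {
        return Err(std::io::Error::last_os_error());
    }
    if unsafe { libc::fcntl(raw, libc::F_SETFL, flags | libc::O_DIRECT) } < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}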

Here is a small program that runs dd(1) in "packet" mode.

use std::process::Stdio;
use tokio::io::AsyncReadExt;
use tokio::process::Command;

const READ_BLOCK_SIZE: usize = 65536;
const BYTES_TO_WRITE: usize = 65536 * 2;

#[tokio::main]
async fn main() {
    let process = Command::new("/bin/dd")
        .arg("if=/dev/zero")
        // important: sets `fcntl` F_SETFL O_DIRECT
        // enables so-called "packet mode", see `pipe(2)` `O_DIRECT` option
        .arg("oflag=direct")
        .arg(format!("bs={}", BYTES_TO_WRITE))
        .arg("count=1")
        .stdout(Stdio::piped())
        .spawn()
        .unwrap();


    let mut stdout = process.stdout.unwrap();
    let mut buffer = [0u8; READ_BLOCK_SIZE];
    let mut bytes_read = 0;
    loop {
        let i = stdout.read(&mut buffer).await.unwrap();
        println!("I read {}", i);
        bytes_read += i;
        if i == 0 {
            break;
        }
    }
    if bytes_read != BYTES_TO_WRITE {
        panic!("Wrong number of bytes read: {bytes_read}");
    }
}

...and it gets stuck. Here is a strace:

// dd enables packet mode
[pid 20030] fcntl(1, F_SETFL, O_WRONLY|O_DIRECT) = 0

// reads and writes zeros
[pid 20030] read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
[pid 20030] write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072 <unfinished ...>

// futex awakes
[pid 20017] <... epoll_wait resumed>[{events=EPOLLIN, data={u32=3533706496, u64=94346785330432}}], 1024, -1) = 1
[pid 20017] futex(0x55ced29ecd70, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 20013] <... futex resumed>)        = 0
[pid 20017] epoll_wait(3,  <unfinished ...>

// Tokio tries to read 64K, but reads only 4K (due to packet mode)
[pid 20013] read(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 4096
[pid 20013] write(1, "I read 4096\n", 12I read 4096
) = 12
[pid 20013] futex(0x55ced29ecd70, FUTEX_WAIT_BITSET_PRIVATE, 1, NULL, FUTEX_BITSET_MATCH_ANY
// everything is frozen here forever

Now, let's try the blocking API:

-use tokio::process::Command;
+use std::process::Command;

and remove the .await from the read call.

It works: it reads 4096-byte blocks until the end (just as pipe(2) suggests).
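Spelled out in full, the blocking variant looks roughly like this (the same program as above, with std::process and std::io::Read in place of the Tokio types):

use std::io::Read;
use std::process::{Command, Stdio};

const READ_BLOCK_SIZE: usize = 65536;
const BYTES_TO_WRITE: usize = 65536 * 2;

fn main() {
    let process = Command::new("/bin/dd")
        .arg("if=/dev/zero")
        // sets fcntl F_SETFL O_DIRECT on stdout, i.e. "packet mode"
        .arg("oflag=direct")
        .arg(format!("bs={}", BYTES_TO_WRITE))
        .arg("count=1")
        .stdout(Stdio::piped())
        .spawn()
        .unwrap();

    let mut stdout = process.stdout.unwrap();
    let mut buffer = [0u8; READ_BLOCK_SIZE];
    let mut bytes_read = 0;
    loop {
        // Blocking read: each call returns one packet (at most 4096 bytes).
        let i = stdout.read(&mut buffer).unwrap();
        println!("I read {}", i);
        bytes_read += i;
        if i == 0 {
            break;
        }
    }
    if bytes_read != BYTES_TO_WRITE {
        panic!("Wrong number of bytes read: {bytes_read}");
    }
}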

Workaround: setting the buffer size to 4096 helps. It seems that Tokio waits for more data (to fill the buffer), but a single packet from a "packet" pipe can never be larger than 4096 bytes.
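One way to apply this workaround in the reproducer above, as a sketch (the 4096 here is PIPE_BUF on Linux):

use tokio::io::AsyncReadExt;

// Workaround sketch: read through a PIPE_BUF-sized buffer, so a full 4096-byte
// packet always fills the whole buffer and never looks like a short read.
async fn read_to_end_in_packets<R>(mut reader: R) -> std::io::Result<usize>
where
    R: tokio::io::AsyncRead + Unpin,
{
    let mut buffer = [0u8; 4096]; // PIPE_BUF
    let mut total = 0;
    loop {
        let n = reader.read(&mut buffer).await?;
        if n == 0 {
            return Ok(total);
        }
        total += n;
    }
}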

@throwable-one throwable-one added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Dec 29, 2024
@Darksonn Darksonn added the M-net Module: tokio/net label Dec 29, 2024
@Darksonn Darksonn changed the title tokio::process freezes with "packet pipes" on Linux when buffer is too big Short-read optimization is wrong for O_DIRECT pipes Dec 29, 2024
@Darksonn
Contributor

Thanks for reporting this. This is due to Tokio's short-read optimization:

// When mio is using the epoll or kqueue selector, reading a partially full
// buffer is sufficient to show that the socket buffer has been drained.
//
// This optimization does not work for level-triggered selectors such as
// windows or when poll is used.
//
// Read more:
// https://github.com/tokio-rs/tokio/issues/5866
#[cfg(all(
    not(mio_unsupported_force_poll_poll),
    any(
        // epoll
        target_os = "android",
        target_os = "illumos",
        target_os = "linux",
        target_os = "redox",
        // kqueue
        target_os = "dragonfly",
        target_os = "freebsd",
        target_os = "ios",
        target_os = "macos",
        target_os = "netbsd",
        target_os = "openbsd",
        target_os = "tvos",
        target_os = "visionos",
        target_os = "watchos",
    )
))]
if 0 < n && n < len {
    self.registration.clear_readiness(evt);
}

Normally, a read that is shorter than the buffer size indicates that Tokio should wait for readiness before attempting to read again. This is incorrect for O_DIRECT pipes.

@Noah-Kennedy Thoughts on what we should do here? Since the flag can be changed on an existing pipe, I'm not sure that we can just cache the flag ...?
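For context, whether O_DIRECT is currently set on a fd can be queried with fcntl(F_GETFL); a hypothetical sketch (not Tokio code) of what a per-read check could look like, using the libc crate:

use std::os::fd::RawFd;

// Returns true if the pipe fd currently has O_DIRECT (packet mode) set.
// Because the flag can be toggled on an existing pipe at any time, this would
// have to be checked per read rather than cached at registration time.
fn is_packet_mode(fd: RawFd) -> std::io::Result<bool> {
    let flags = unsafe { libc::fcntl(fd, libc::F_GETFL) };
    if flags < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(flags & libc::O_DIRECT != 0)
}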
