writing an I/O-related library that doesn't care about whether it's used in a sync or async context #49

Open · BurntSushi opened this issue Mar 18, 2021 · 9 comments
Labels: good first issue, help wanted, status-quo-story-ideas

Comments

@BurntSushi (Member) commented Mar 18, 2021

  • Character: Barbara (almost)
  • Brief summary: Barbara tries to write a library that parses a particular format in a streaming fashion and works with any implementation of std::io::Read or std::io::Write, and she wants it to work in an async context with minimal fuss.
  • Key points or morals (if known):
    • Library author may not know much (or anything) about Async Rust.
    • There is nothing inherently "sync" or "async" about the actual details of the format. By virtue of working on things like the Read/Write traits, it works on streams. (Where "stream" is used colloquially here.)
    • Depending on specific async runtimes with specific async versions of the Read/Write traits seems inelegant and doesn't seem to scale with respect to maintenance effort.
    • Async runtimes may have adapters for working with Read/Write impls, but these adapters may come with additional costs that one might not want to pay.
    • Making such code work regardless of whether it's used in a sync or async context should not require major effort or significant code duplication.

One example here is the flate2 crate. To solve this problem, it takes an optional dependency on a specific async runtime, tokio, and provides impls of AsyncRead and AsyncWrite that are specific to that runtime.
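
To make the trade-off concrete, here is a minimal, hypothetical sketch of that feature-gated pattern (the Decoder type, the async-io feature name, and the no-op passthrough body are all invented, and it targets the futures 0.3 AsyncRead trait rather than tokio's, so it is not flate2's real code):

use std::io::{self, Read};

// Hypothetical sketch of the feature-gated pattern; not flate2's actual code.
// The wrapped transformation (a passthrough standing in for real decompression)
// is neither sync nor async by itself.
pub struct Decoder<R> {
    inner: R,
}

impl<R: Read> Read for Decoder<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Real code would decompress here; the passthrough keeps the sketch short.
        self.inner.read(buf)
    }
}

// Only compiled when a (hypothetical) `async-io` Cargo feature is enabled, and
// tied to one particular ecosystem's async traits (futures 0.3 here).
#[cfg(feature = "async-io")]
impl<R: futures::io::AsyncRead + Unpin> futures::io::AsyncRead for Decoder<R> {
    fn poll_read(
        mut self: std::pin::Pin<&mut Self>,
        cx: &mut std::task::Context<'_>,
        buf: &mut [u8],
    ) -> std::task::Poll<io::Result<usize>> {
        // Essentially the same logic again, duplicated for the async trait.
        std::pin::Pin::new(&mut self.inner).poll_read(cx, buf)
    }
}

Every additional runtime means another feature flag and another near-identical impl block to maintain, which is exactly the scaling concern listed above.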

Another example is the csv crate, where the problem has come up a few times.

Its author (me) has not wanted to wade into these waters because of the aforementioned problems. As a result, folks are maintaining a csv-async fork to make it work in an async context. This doesn't seem ideal.

This is somewhat related to #45. For example, AsyncRead/AsyncWrite traits that are shared across all async runtimes might do a lot to fix this problem. But I'm not sure. Fundamentally, this, to me, is about writing I/O adapters that don't care about whether they're used in a sync or async context with minimal fuss, rather than just about trying to abstract over all async runtimes.

Apologies in advance if I've filled out this issue incorrectly. I tried to follow the others, but maybe I got a bit too specific! Happy to update it as appropriate. Overall, I think this is a really wonderful approach to gather feedback. I'm completely blown away by this!

Also, above, I said this was "almost" Barbara, because this story doesn't actually require the programmer to write or even care about async code at all.

@BurntSushi added the good first issue, help wanted, and status-quo-story-ideas labels Mar 18, 2021
@nikomatsakis (Contributor)

Hmm, so this is mostly(?) a question of "sync-async" bridging, I think? I was going to open a similar-ish (only vaguely) issue based on some conversations I had with @Mark-Simulacrum, though in that case the gist is more like "I want to write a client that uses libraries that are implemented in async with minimal fuss, and right now it's painful".

@nikomatsakis (Contributor)

I opened #54 to cover my conversation with @Mark-Simulacrum

@BurntSushi (Member, Author)

@nikomatsakis Yeah, I think "sync-async" bridging is a very general way to put it, although there may be many specific cases of it with this being one of them. (Not quite sure how to categorize everything, I leave that difficult task to y'all. :-))

@nikomatsakis (Contributor)

Some elements of this blog post feel related:

https://kevinhoffman.medium.com/rust-async-and-the-terrible-horrible-no-good-very-bad-day-348ebc836274

Lesson II — It is possible for a single dependency of your crate to be so tightly coupled to a future polling runtime that it effectively makes that runtime mandatory for all consumers.

@matklad (Member) commented Jun 13, 2021

Here’s an approach to this from Python’s ecosystem: https://sans-io.readthedocs.io/

@BurntSushi (Member, Author)

@matklad In my example above, that's what csv-core is. The problem comes when you want to build some nice abstractions that do the I/O themselves, e.g. csv::Reader.
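
For anyone unfamiliar with the pattern, here is a minimal, hypothetical sketch of such an I/O-free core (invented API, far simpler than csv-core's real one): the state machine only ever sees byte slices, and the caller, sync or async, is the one doing the reads.

// Hypothetical sans-io sketch; csv-core's real API is richer than this.
// The parser owns no reader and performs no I/O: the caller fetches bytes
// however it likes (blocking read, .await, mmap, ...) and feeds them in.
pub enum ParseResult {
    // A complete record ended within the consumed input.
    Record,
    // More input is needed before a record can be emitted.
    NeedMoreInput,
}

pub struct CoreParser {
    in_quotes: bool,
}

impl CoreParser {
    pub fn new() -> Self {
        CoreParser { in_quotes: false }
    }

    // Consume some input, returning what happened and how many bytes were used.
    pub fn feed(&mut self, input: &[u8]) -> (ParseResult, usize) {
        for (i, &b) in input.iter().enumerate() {
            match b {
                b'"' => self.in_quotes = !self.in_quotes,
                b'\n' if !self.in_quotes => return (ParseResult::Record, i + 1),
                _ => {}
            }
        }
        (ParseResult::NeedMoreInput, input.len())
    }
}

The core itself is trivially reusable from both worlds; the pain point is that the ergonomic wrapper that owns an R: Read (csv::Reader) and a hypothetical twin that owns an R: AsyncRead still have to be written and maintained separately.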

@jorgecarleitao commented Aug 23, 2021

I have recently implemented an async version of parquet here using only futures:: (i.e. runtime-agnostic), and my understanding is that, ultimately, implementations have to offer two APIs:

pub fn read_*<R: Read + Seek>(reader: &mut R) -> Result<Foo>;
pub async fn read_*_async<R: AsyncRead + AsyncSeek + Send + Unpin>(reader: &mut R) -> Result<Foo>;

and that this change propagates to any code that performs I/O (it does not propagate to code that performs CPU-intensive tasks such as deserialization), basically by adding async and .await.

They represent two different compilation paths: one that calls .read_*()? and the other that calls .read_*().await?.

It all boils down to the fact that we never know what the next byte will do to the state machine. In the case of CSV, the key item is the unescaped end of line, which triggers a whole panoply of events. The reader must feed bytes through the state machine, byte by byte, until the state is "new row". An equivalent way of framing the problem is that it is not possible to tell how many bytes we should read for the state machine to reach its "clean" state (finished reading a whole line).

This framing of the problem also emerges when reading protobuf, flatbuffers, and thrift, because certain nested types are stored as <number of bytes as i32 little endian><bytes> and are read as let n = reader.read_i32() followed by reader.read_bytes(n). It is not possible to know how much to read prior to calling .read_i32(), which means we need to offer the ability to .await the reading of <bytes>. This requires the whole thing to be async because, even with buffering, there is a chance that the buffer is not large enough to cover <number of bytes as i32 little endian><bytes>, in which case the buffer itself needs to call .read_*(remaining bytes).await to cover <bytes>.
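
To make that duplication concrete, here is a hedged sketch of the resulting sync/async pair for the <i32 length><bytes> framing just described (the function names are invented and the async side uses the futures 0.3 traits):

use std::io::Read;
use futures::io::{AsyncRead, AsyncReadExt};

// Illustrative sketch; the function names are invented. Sync path: read one
// length-prefixed frame (<i32 little endian length><bytes>).
pub fn read_frame<R: Read>(reader: &mut R) -> std::io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    reader.read_exact(&mut len_buf)?;
    let len = i32::from_le_bytes(len_buf) as usize; // real code would validate this
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}

// Async path: byte-for-byte the same logic, with `async`, `.await`,
// and a different set of traits.
pub async fn read_frame_async<R: AsyncRead + Unpin>(reader: &mut R) -> std::io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    reader.read_exact(&mut len_buf).await?;
    let len = i32::from_le_bytes(len_buf) as usize; // real code would validate this
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload).await?;
    Ok(payload)
}

The two bodies are identical apart from async, .await, and the trait bounds, which is the duplication being described.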

The alternative to this is to offer a sync API and reuse it via spawn_blocking. The problem with this is that it forces the library to depend on a concrete runtime engine, which falls into Lesson II of the blog post quoted above.

I see two potential ways to improve the situation:

  • design a mechanism to reuse sync code in async code by generating the corresponding async code whenever certain traits are used, e.g. some macro that automagically maps Read to AsyncRead so that all calls on the trait gain .await and the function is automatically converted to async (including every usage of the trait in dependent functions).
  • design a mechanism to "spawn blocking" without a concrete runtime engine, e.g. a trait describing the runtime that is passed to a function, together with a generic spawn_blocking<T: RuntimeState> that can be called from that function, or something like that (I do not know the details of what spawn_blocking needs in order to be generic); see the sketch after this list.

@anguslees

I know we don't need a list of examples, but serde seems worth a special mention here. It's a complex, popular, amazing library that, at the bottom, just writes to and reads from a stream. Currently, we have to fork a separate version of the entire serde ecosystem, with all the functions in a different colour, just so we can deserialise from an async source. (i.e. it's not just the I/O-related library; it's everything that calls that I/O-related library, and sometimes this involves crates from multiple third-party authors.)

@zeenix (Contributor) commented Dec 31, 2021

@anguslees that's a very good point. We should also keep in mind that serde has been unmaintained for almost 2 years now.
