Commit 80515a0

Merge remote-tracking branch 'aturon/io-string-handling'
2 parents c55b770 + 27bd6ad commit 80515a0

1 file changed

+272
-3
lines changed

text/0517-io-os-reform.md

Lines changed: 272 additions & 3 deletions
@@ -43,7 +43,10 @@ follow-up PRs against this RFC.
4343
* [Platform-specific opt-in]
4444
* [Proposed organization]
4545
* [Revising `Reader` and `Writer`] (stub)
46-
* [String handling] (stub)
46+
* [String handling]
47+
* [Key observations]
48+
* [The design: `os_str`]
49+
* [The future]
4750
* [Deadlines] (stub)
4851
* [Splitting streams and cancellation] (stub)
4952
* [Modules]
@@ -452,7 +455,224 @@ counts, arguments to `main`, and so on).
452455
## String handling
453456
[String handling]: #string-handling
454457

455-
> To be added in a follow-up PR.
458+
The fundamental problem with Rust's full embrace of UTF-8 strings is that not
459+
all strings taken or returned by system APIs are Unicode, let alone UTF-8
460+
encoded.
461+
462+
In the past, `std` has assumed that all strings are *either* in some form of
463+
Unicode (Windows), *or* are simply `u8` sequences (Unix). Unfortunately, this is
464+
wrong, and the situation is more subtle:
465+
466+
* Unix platforms do indeed work with arbitrary `u8` sequences (without interior
467+
nulls) and today's platforms usually interpret them as UTF-8 when displayed.
468+
469+
* Windows, however, works with *arbitrary `u16` sequences* that are roughly
470+
interpreted as UTF-16, but may not actually be valid UTF-16 -- an "encoding"
471+
often called UCS-2; see http://justsolve.archiveteam.org/wiki/UCS-2 for a bit
472+
more detail.
473+
474+
What this means is that all of Rust's platforms go beyond Unicode, but they do
475+
so in different and incompatible ways.
476+
477+
The current solution of providing both `str` and `[u8]` versions of
478+
APIs is therefore problematic for multiple reasons. For one, **the
479+
`[u8]` versions are not actually cross-platform** -- even today, they
480+
panic on Windows when given non-UTF-8 data, a platform-specific
481+
behavior. But they are also incomplete, because on Windows you should
482+
be able to work directly with UCS-2 data.
483+
484+
### Key observations
485+
[Key observations]: #key-observations
486+
487+
Fortunately, there is a solution that fits well with Rust's UTF-8 strings *and*
488+
offers the possibility of platform-specific APIs.
489+
490+
**Observation 1**: it is possible to re-encode UCS-2 data in a way that is also
491+
compatible with UTF-8. This is the
492+
[WTF-8 encoding format](http://simonsapin.github.io/wtf-8/) proposed by Simon
493+
Sapin. This encoding has some remarkable properties:
494+
495+
* Valid UTF-8 data is valid WTF-8 data. When decoded to UCS-2, the result is
496+
exactly what would be produced by going straight from UTF-8 to UTF-16. In
497+
other words, making up some methods:
498+
499+
```rust
500+
my_utf8_data.to_wtf8().to_ucs2().as_u16_slice() == my_utf8_data.to_utf16().as_u16_slice()
501+
```
502+
503+
* Valid UTF-16 data re-encoded as WTF-8 produces the corresponding UTF-8 data:
504+
505+
```rust
506+
my_utf16_data.to_wtf8().as_bytes() == my_utf16_data.to_utf8().as_bytes()
507+
```
508+
509+
These two properties mean that, when working with Unicode data, the WTF-8
510+
encoding is highly compatible with both UTF-8 *and* UTF-16. In particular, the
511+
conversion from a Rust string to a WTF-8 string is a no-op, and the conversion
512+
in the other direction is just a validation.
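
The Unicode side of these properties can be checked with the UTF-16 helpers in today's standard library. The sketch below is only an illustration: it does not implement WTF-8, the helper names (`to_utf16`, `from_utf16_strict`, `from_utf16_lossy`) are made up for this example, and it assumes the current `str::encode_utf16` and `String::from_utf16` methods rather than the made-up methods above. It shows that well-formed data round-trips exactly, while a lone surrogate (legal in a Windows `u16` sequence) is rejected by strict UTF-16 decoding.

```rust
/// Encode a Rust (UTF-8) string as UTF-16 code units.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Strictly decode UTF-16; returns `None` for ill-formed input
/// such as an unpaired surrogate.
fn from_utf16_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Lossily decode UTF-16, replacing ill-formed parts with U+FFFD.
fn from_utf16_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

For example, `from_utf16_strict(&to_utf16("héllo"))` gives back the original text, whereas `from_utf16_strict(&[0x0066, 0xD800])` yields `None`. That second case is the data WTF-8 is designed to carry without loss.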
513+
514+
**Observation 2**: all platforms can *consume* Unicode data (suitably
515+
re-encoded), and it's also possible to validate the data they produce as
516+
Unicode and extract it.
517+
518+
**Observation 3**: the non-Unicode spaces on various platforms are deeply
519+
incompatible: there is no standard way to port non-Unicode data from one to
520+
another. Therefore, the only cross-platform APIs are those that work entirely
521+
with Unicode.
522+
523+
### The design: `os_str`
524+
[The design: `os_str`]: #the-design-os_str
525+
526+
The observations above lead to a somewhat radical new treatment of strings,
527+
first proposed in the
528+
[Path Reform RFC](https://github.com/rust-lang/rfcs/pull/474). This RFC proposes
529+
to introduce new string and string slice types that (opaquely) represent
530+
*platform-sensitive strings*, housed in the `std::os_str` module.
531+
532+
The `OsString` type is analogous to `String`, and `OsStr` is analogous to `str`.
533+
Their backing implementation is platform-dependent, but they offer a
534+
cross-platform API:
535+
536+
```rust
537+
pub mod os_str {
538+
/// Owned OS strings
539+
struct OsString {
540+
inner: imp::Buf
541+
}
542+
/// Slices into OS strings
543+
struct OsStr {
544+
inner: imp::Slice
545+
}
546+
547+
// Platform-specific implementation details:
548+
#[cfg(unix)]
549+
mod imp {
550+
type Buf = Vec<u8>;
551+
type Slice = [u8];
552+
...
553+
}
554+
555+
#[cfg(windows)]
556+
mod imp {
557+
type Buf = Wtf8Buf; // See https://github.com/SimonSapin/rust-wtf8
558+
type Slice = Wtf8;
559+
...
560+
}
561+
562+
impl OsString {
563+
pub fn from_string(String) -> OsString;
564+
pub fn from_str(&str) -> OsString;
565+
pub fn as_slice(&self) -> &OsStr;
566+
pub fn into_string(Self) -> Result<String, OsString>;
567+
pub fn into_string_lossy(Self) -> String;
568+
569+
// and ultimately other functionality typically found on vectors,
570+
// but CRUCIALLY NOT as_bytes
571+
}
572+
573+
impl Deref<OsStr> for OsString { ... }
574+
575+
impl OsStr {
576+
pub fn from_str(value: &str) -> &OsStr;
577+
pub fn as_str(&self) -> Option<&str>;
578+
pub fn to_string_lossy(&self) -> CowString;
579+
580+
// and ultimately other functionality typically found on slices,
581+
// but CRUCIALLY NOT as_bytes
582+
}
583+
584+
trait IntoOsString {
585+
fn into_os_str_buf(self) -> OsString;
586+
}
587+
588+
impl IntoOsString for OsString { ... }
589+
impl<'a> IntoOsString for &'a OsStr { ... }
590+
591+
...
592+
}
593+
```
594+
595+
These APIs make OS strings appear roughly as opaque vectors (you
596+
cannot see the byte representation directly), and can always be
597+
produced starting from Unicode data. They make it possible to collapse
598+
functions like `getenv` and `getenv_as_bytes` into a single function
599+
that produces an OS string, allowing the client to decide how (or
600+
whether) to extract Unicode data. It will be possible to do things
601+
like concatenate OS strings without ever going through Unicode.
602+
603+
It will also likely be possible to do things like search for Unicode
604+
substrings. The exact details of the API are left open and are likely
605+
to grow over time.
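
As a rough sketch of how client code might use such an API, the example below relies on the `OsString`/`OsStr` types as they exist in today's `std::ffi`, whose method names differ from the ones proposed above (for instance `to_str` rather than `as_str`). The wrapper function names (`lookup`, `strict`, `lossy`, `concat`) are hypothetical and exist only for illustration.

```rust
use std::borrow::Cow;
use std::env;
use std::ffi::{OsStr, OsString};

/// Look up an environment variable without assuming it is Unicode.
fn lookup(name: &str) -> Option<OsString> {
    env::var_os(name)
}

/// Strict extraction: succeeds only if the value is valid Unicode.
fn strict(value: &OsStr) -> Option<&str> {
    value.to_str()
}

/// Lossy extraction: non-Unicode parts become U+FFFD.
fn lossy(value: &OsStr) -> Cow<'_, str> {
    value.to_string_lossy()
}

/// Concatenate two OS strings without going through Unicode.
fn concat(a: &OsStr, b: &OsStr) -> OsString {
    let mut out = a.to_os_string();
    out.push(b);
    out
}
```

The single `lookup` function plays the role of both `getenv` and `getenv_as_bytes`: the caller chooses `strict` or `lossy` afterwards, or keeps the value opaque.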
606+
607+
In addition to APIs like the above, there will also be
608+
platform-specific ways of viewing or constructing OS strings that
609+
reveal more about the space of possible values:
610+
611+
```rust
612+
pub mod os {
613+
#[cfg(unix)]
614+
pub mod unix {
615+
trait OsStringExt {
616+
fn from_vec(Vec<u8>) -> Self;
617+
fn into_vec(Self) -> Vec<u8>;
618+
}
619+
620+
impl OsStringExt for os_str::OsString { ... }
621+
622+
trait OsStrExt {
623+
fn as_byte_slice(&self) -> &[u8];
624+
fn from_byte_slice(&[u8]) -> &Self;
625+
}
626+
627+
impl OsStrExt for os_str::OsStr { ... }
628+
629+
...
630+
}
631+
632+
#[cfg(windows)]
633+
pub mod windows {
634+
// The following extension traits provide a UCS-2 view of OS strings
635+
636+
trait OsStringExt {
637+
fn from_wide_slice(&[u16]) -> Self;
638+
}
639+
640+
impl OsStringExt for os_str::OsString { ... }
641+
642+
trait OsStrExt {
643+
fn to_wide_vec(&self) -> Vec<u16>;
644+
}
645+
646+
impl OsStrExt for os_str::OsStr { ... }
647+
648+
...
649+
}
650+
651+
...
652+
}
653+
```
654+
655+
By placing these APIs under `os`, using them requires a clear *opt in*
656+
to platform-specific functionality.
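
To illustrate the opt-in, the sketch below builds a non-Unicode OS string on each platform using the extension traits as they exist in today's standard library (`std::os::unix::ffi::OsStringExt` and `std::os::windows::ffi::OsStringExt`). Their module paths and method names differ somewhat from the proposal above. The function name `non_unicode_os_string` is hypothetical.

```rust
use std::ffi::OsString;

/// On Unix, any byte vector is accepted; 0x80 alone is not valid UTF-8.
#[cfg(unix)]
fn non_unicode_os_string() -> OsString {
    use std::os::unix::ffi::OsStringExt; // explicit platform opt-in
    OsString::from_vec(vec![b'f', b'o', 0x80])
}

/// On Windows, any u16 sequence is accepted; 0xD800 is a lone surrogate.
#[cfg(windows)]
fn non_unicode_os_string() -> OsString {
    use std::os::windows::ffi::OsStringExt; // explicit platform opt-in
    OsString::from_wide(&[0x0066, 0x006F, 0xD800])
}
```

On either platform the resulting value cannot be converted to `&str` strictly, but the cross-platform lossy conversion still works.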
657+
658+
### The future
659+
[The future]: #the-future
660+
661+
Introducing an additional string type is a bit daunting, since many
662+
existing APIs take and consume only standard Rust strings. Today's
663+
solution demands that strings coming from the OS be assumed or turned
664+
into Unicode, and the proposed API continues to allow that (with more
665+
explicit and finer-grained control).
666+
667+
In the long run, however, robust applications are likely to work
668+
opaquely with OS strings far beyond the boundary to the system to
669+
avoid data loss and ensure maximal compatibility. If this situation
670+
becomes common, it should be possible to introduce an abstraction over
671+
various string types and generalize most functions that work with
672+
`String`/`str` to instead work generically. This RFC does *not*
673+
propose taking any such steps now -- but it's important that we *can*
674+
do so later if Rust's standard strings turn out to not be sufficient
675+
and OS strings become commonplace.
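
One possible shape for such generic code, shown purely as a sketch and not as part of this proposal, is a function bounded by a conversion trait. The example assumes the `AsRef<OsStr>` implementations found in today's standard library; the function name `describe` is hypothetical.

```rust
use std::ffi::OsStr;

/// Accepts `&str`, `String`, `OsString`, `&OsStr`, and similar types,
/// and reports whether the value is valid Unicode.
fn describe<S: AsRef<OsStr>>(value: S) -> String {
    let os: &OsStr = value.as_ref();
    match os.to_str() {
        Some(text) => format!("unicode: {}", text),
        None => format!("non-unicode (lossy): {}", os.to_string_lossy()),
    }
}
```

Callers holding ordinary Rust strings pay nothing extra, while callers holding OS strings avoid a lossy conversion at the call site.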
456676

457677
## Deadlines
458678
[Deadlines]: #deadlines
@@ -547,4 +767,53 @@ principles or visions) are outside the scope of this RFC.
547767
# Unresolved questions
548768
[Unresolved questions]: #unresolved-questions
549769

550-
> To be expanded in a follow-up PR.
770+
> To be expanded in follow-up PRs.
771+
772+
## Wide string representation
773+
774+
(Text from @SimonSapin)
775+
776+
Rather than WTF-8, `OsStr` and `OsString` on Windows could use
777+
potentially-ill-formed UTF-16 (a.k.a. "wide" strings), with a
778+
different cost trade-off.
779+
780+
Upside:
781+
* No conversion between `OsStr` / `OsString` and OS calls.
782+
783+
Downsides:
784+
* More expensive conversions between `OsStr` / `OsString` and `str` / `String`.
785+
* These conversions have inconsistent performance characteristics between platforms. (Need to allocate on Windows, but not on Unix.)
786+
* Some of them return `Cow`, which has some ergonomic hit.
787+
788+
The API (only parts that differ) could look like:
789+
790+
```rust
791+
pub mod os_str {
792+
#[cfg(windows)]
793+
mod imp {
794+
type Buf = Vec<u16>;
795+
type Slice = [u16];
796+
...
797+
}
798+
799+
impl OsStr {
800+
pub fn from_str(&str) -> Cow<OsString, OsStr>;
801+
pub fn to_string(&self) -> Option<CowString>;
802+
pub fn to_string_lossy(&self) -> CowString;
803+
}
804+
805+
#[cfg(windows)]
806+
pub mod windows {
807+
trait OsStringExt {
808+
fn from_wide_slice(&[u16]) -> Self;
809+
fn from_wide_vec(Vec<u16>) -> Self;
810+
fn into_wide_vec(self) -> Vec<u16>;
811+
}
812+
813+
trait OsStrExt {
814+
fn from_wide_slice(&[u16]) -> Self;
815+
fn as_wide_slice(&self) -> &[u16];
816+
}
817+
}
818+
}
819+
```
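
The cost trade-off can be made concrete with a small, hypothetical stand-in for a wide-backed OS string. `WideOsString` below is not a standard library type and is not the implementation proposed here. It is a minimal sketch showing that, with a `Vec<u16>` backing, conversion from `&str` must allocate and re-encode, and conversion back to `String` must validate and allocate.

```rust
/// Hypothetical wide-backed OS string (illustration only).
struct WideOsString {
    units: Vec<u16>,
}

impl WideOsString {
    /// Re-encodes UTF-8 as UTF-16: always allocates.
    fn from_str(s: &str) -> WideOsString {
        WideOsString { units: s.encode_utf16().collect() }
    }

    /// Accepts arbitrary, possibly ill-formed, u16 data.
    fn from_wide_slice(units: &[u16]) -> WideOsString {
        WideOsString { units: units.to_vec() }
    }

    /// Free view of the backing data, ready for OS calls.
    fn as_wide_slice(&self) -> &[u16] {
        &self.units
    }

    /// Validates and re-encodes as UTF-8: allocates on success.
    fn to_string(&self) -> Option<String> {
        String::from_utf16(&self.units).ok()
    }

    /// Like `to_string`, but replaces ill-formed parts with U+FFFD.
    fn to_string_lossy(&self) -> String {
        String::from_utf16_lossy(&self.units)
    }
}
```

Only `from_wide_slice` and `as_wide_slice` avoid re-encoding here, which mirrors the upside listed above; every conversion to or from Rust strings does real work.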
