@@ -43,7 +43,10 @@ follow-up PRs against this RFC.
* [Platform-specific opt-in]
* [Proposed organization]
* [Revising `Reader` and `Writer`] (stub)
- * [String handling] (stub)
+ * [String handling]
+ * [Key observations]
+ * [The design: `os_str`]
+ * [The future]
* [Deadlines] (stub)
* [Splitting streams and cancellation] (stub)
* [Modules]
@@ -452,7 +455,224 @@ counts, arguments to `main`, and so on).
## String handling
[String handling]: #string-handling

- > To be added in a follow-up PR.
+ The fundamental problem with Rust's full embrace of UTF-8 strings is that not
+ all strings taken or returned by system APIs are Unicode, let alone UTF-8
+ encoded.
+
+ In the past, `std` has assumed that all strings are *either* in some form of
+ Unicode (Windows), *or* are simply `u8` sequences (Unix). Unfortunately, this is
+ wrong, and the situation is more subtle:
+
+ * Unix platforms do indeed work with arbitrary `u8` sequences (without interior
+   nulls) and today's platforms usually interpret them as UTF-8 when displayed.
+
+ * Windows, however, works with *arbitrary `u16` sequences* that are roughly
+   interpreted as UTF-16, but may not actually be valid UTF-16 -- an "encoding"
+   often called UCS-2; see http://justsolve.archiveteam.org/wiki/UCS-2 for a bit
+   more detail.
+
+ What this means is that all of Rust's platforms go beyond Unicode, but they do
+ so in different and incompatible ways.
+
+ The current solution of providing both `str` and `[u8]` versions of
+ APIs is therefore problematic for multiple reasons. For one, **the
+ `[u8]` versions are not actually cross-platform** -- even today, they
+ panic on Windows when given non-UTF-8 data, a platform-specific
+ behavior. But they are also incomplete, because on Windows you should
+ be able to work directly with UCS-2 data.
+
+ ### Key observations
+ [Key observations]: #key-observations
+
+ Fortunately, there is a solution that fits well with Rust's UTF-8 strings *and*
+ offers the possibility of platform-specific APIs.
+
+ **Observation 1**: it is possible to re-encode UCS-2 data in a way that is also
+ compatible with UTF-8. This is the
+ [WTF-8 encoding format](http://simonsapin.github.io/wtf-8/) proposed by Simon
+ Sapin. This encoding has some remarkable properties:
+
+ * Valid UTF-8 data is valid WTF-8 data. When decoded to UCS-2, the result is
+   exactly what would be produced by going straight from UTF-8 to UTF-16. In
+   other words, making up some methods:
+
+   ```rust
+   my_utf8_data.to_wtf8().to_ucs2().as_u16_slice() == my_utf8_data.to_utf16().as_u16_slice()
+   ```
+
+ * Valid UTF-16 data re-encoded as WTF-8 produces the corresponding UTF-8 data:
+
+   ```rust
+   my_utf16_data.to_wtf8().as_bytes() == my_utf16_data.to_utf8().as_bytes()
+   ```
+
+ These two properties mean that, when working with Unicode data, the WTF-8
+ encoding is highly compatible with both UTF-8 *and* UTF-16. In particular, the
+ conversion from a Rust string to a WTF-8 string is a no-op, and the conversion
+ in the other direction is just a validation.
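As a rough illustration of Observation 1, the sketch below re-encodes potentially ill-formed UTF-16 as WTF-8 using only the standard library. The helper names `encode_wtf8` and `push_generalized_utf8` are hypothetical; they are not part of this RFC or of the rust-wtf8 crate. The code assumes the rule described in the WTF-8 document: surrogate pairs are combined into one code point, and unpaired surrogates are written out as three-byte sequences.

```rust
/// Encode potentially ill-formed UTF-16 as WTF-8 (illustrative sketch only).
fn encode_wtf8(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < units.len() {
        let unit = u32::from(units[i]);
        let mut code_point = unit;
        i += 1;
        // A lead surrogate immediately followed by a trail surrogate forms
        // a single supplementary code point, exactly as in UTF-16 decoding.
        if (0xD800..=0xDBFF).contains(&unit) && i < units.len() {
            let next = u32::from(units[i]);
            if (0xDC00..=0xDFFF).contains(&next) {
                code_point = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
                i += 1;
            }
        }
        // Anything else, including an unpaired surrogate, is encoded as-is.
        push_generalized_utf8(code_point, &mut out);
    }
    out
}

/// UTF-8 style encoding that, unlike UTF-8, does not reject surrogates.
fn push_generalized_utf8(cp: u32, out: &mut Vec<u8>) {
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}

fn main() {
    // Well-formed UTF-16 re-encodes to exactly the UTF-8 bytes.
    let text = "h\u{e9}llo \u{1F980}";
    let units: Vec<u16> = text.encode_utf16().collect();
    assert_eq!(encode_wtf8(&units), text.as_bytes().to_vec());

    // An unpaired lead surrogate still gets a byte representation,
    // but that representation is not valid UTF-8.
    let lone = encode_wtf8(&[0xD800]);
    println!("{:02X?}", lone);
    assert!(std::str::from_utf8(&lone).is_err());
}
```

For well-formed input the output is byte-for-byte UTF-8, which is why going from a Rust string to WTF-8 needs no work; going back only requires checking that no encoded surrogate is present.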
+
+ **Observation 2**: all platforms can *consume* Unicode data (suitably
+ re-encoded), and it's also possible to validate the data they produce as
+ Unicode and extract it.
+
+ **Observation 3**: the non-Unicode spaces on various platforms are deeply
+ incompatible: there is no standard way to port non-Unicode data from one to
+ another. Therefore, the only cross-platform APIs are those that work entirely
+ with Unicode.
+
+ ### The design: `os_str`
+ [The design: `os_str`]: #the-design-os_str
+
+ The observations above lead to a somewhat radical new treatment of strings,
+ first proposed in the
+ [Path Reform RFC](https://github.com/rust-lang/rfcs/pull/474). This RFC proposes
+ to introduce new string and string slice types that (opaquely) represent
+ *platform-sensitive strings*, housed in the `std::os_str` module.
+
+ The `OsString` type is analogous to `String`, and `OsStr` is analogous to `str`.
+ Their backing implementation is platform-dependent, but they offer a
+ cross-platform API:
+
+ ```rust
+ pub mod os_str {
+     /// Owned OS strings
+     struct OsString {
+         inner: imp::Buf
+     }
+     /// Slices into OS strings
+     struct OsStr {
+         inner: imp::Slice
+     }
+
+     // Platform-specific implementation details:
+     #[cfg(unix)]
+     mod imp {
+         type Buf = Vec<u8>;
+         type Slice = [u8];
+         ...
+     }
+
+     #[cfg(windows)]
+     mod imp {
+         type Buf = Wtf8Buf; // See https://github.com/SimonSapin/rust-wtf8
+         type Slice = Wtf8;
+         ...
+     }
+
+     impl OsString {
+         pub fn from_string(String) -> OsString;
+         pub fn from_str(&str) -> OsString;
+         pub fn as_slice(&self) -> &OsStr;
+         pub fn into_string(Self) -> Result<String, OsString>;
+         pub fn into_string_lossy(Self) -> String;
+
+         // and ultimately other functionality typically found on vectors,
+         // but CRUCIALLY NOT as_bytes
+     }
+
+     impl Deref<OsStr> for OsString { ... }
+
+     impl OsStr {
+         pub fn from_str(value: &str) -> &OsStr;
+         pub fn as_str(&self) -> Option<&str>;
+         pub fn to_string_lossy(&self) -> CowString;
+
+         // and ultimately other functionality typically found on slices,
+         // but CRUCIALLY NOT as_bytes
+     }
+
+     trait IntoOsString {
+         fn into_os_str_buf(self) -> OsString;
+     }
+
+     impl IntoOsString for OsString { ... }
+     impl<'a> IntoOsString for &'a OsStr { ... }
+
+     ...
+ }
+ ```
+
+ These APIs make OS strings appear roughly as opaque vectors (you
+ cannot see the byte representation directly) that can always be
+ produced starting from Unicode data. They make it possible to collapse
+ functions like `getenv` and `getenv_as_bytes` into a single function
+ that produces an OS string, allowing the client to decide how (or
+ whether) to extract Unicode data. It will be possible to do things
+ like concatenate OS strings without ever going through Unicode.
+
+ It will also likely be possible to do things like search for Unicode
+ substrings. The exact details of the API are left open and are likely
+ to grow over time.
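A minimal sketch of how client code might use such a type. It is written against `std::ffi::OsString` and `std::env::var_os` as they exist in today's standard library rather than the `std::os_str` module proposed here, so some names differ (for example `to_str` instead of `as_str`). The helpers `var_with_suffix` and `describe` are hypothetical and exist only for this illustration.

```rust
use std::env;
use std::ffi::{OsStr, OsString};

/// Read an environment variable and append a suffix without ever
/// decoding the value to Unicode (hypothetical helper).
fn var_with_suffix(name: &str, suffix: &OsStr) -> Option<OsString> {
    let mut value = env::var_os(name)?;
    value.push(suffix);
    Some(value)
}

/// The caller decides how, or whether, to extract Unicode data.
fn describe(value: &OsStr) -> String {
    match value.to_str() {
        Some(s) => format!("unicode: {}", s),
        None => format!("not unicode (lossy): {}", value.to_string_lossy()),
    }
}

fn main() {
    match var_with_suffix("HOME", OsStr::new("/bin")) {
        Some(v) => println!("{}", describe(&v)),
        None => println!("HOME is not set"),
    }

    // Owned conversion: the original value is handed back on failure,
    // so nothing is lost if the data turns out not to be Unicode.
    let owned = OsString::from("caf\u{e9}");
    match owned.into_string() {
        Ok(s) => println!("got a String: {}", s),
        Err(original) => println!("kept as OS string: {:?}", original),
    }
}
```

A single lookup function returning an OS string covers both the "give me Unicode" and the "give me whatever the OS has" cases, which is the collapse of `getenv` and `getenv_as_bytes` described above.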
+
+ In addition to APIs like the above, there will also be
+ platform-specific ways of viewing or constructing OS strings that
+ reveal more about the space of possible values:
+
+ ```rust
+ pub mod os {
+     #[cfg(unix)]
+     pub mod unix {
+         trait OsStringExt {
+             fn from_vec(Vec<u8>) -> Self;
+             fn into_vec(Self) -> Vec<u8>;
+         }
+
+         impl OsStringExt for os_str::OsString { ... }
+
+         trait OsStrExt {
+             fn as_byte_slice(&self) -> &[u8];
+             fn from_byte_slice(&[u8]) -> &Self;
+         }
+
+         impl OsStrExt for os_str::OsStr { ... }
+
+         ...
+     }
+
+     #[cfg(windows)]
+     pub mod windows {
+         // The following extension traits provide a UCS-2 view of OS strings
+
+         trait OsStringExt {
+             fn from_wide_slice(&[u16]) -> Self;
+         }
+
+         impl OsStringExt for os_str::OsString { ... }
+
+         trait OsStrExt {
+             fn to_wide_vec(&self) -> Vec<u16>;
+         }
+
+         impl OsStrExt for os_str::OsStr { ... }
+
+         ...
+     }
+
+     ...
+ }
+ ```
+
+ Because these APIs live under `os`, using them requires a clear *opt in*
+ to platform-specific functionality.
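The opt-in can be sketched with the extension traits that today's standard library provides under `std::os::unix::ffi` and `std::os::windows::ffi`. These are not the exact names proposed above (for instance, current `std` has `as_bytes` and `encode_wide` where this RFC writes `as_byte_slice` and `to_wide_vec`), and `non_unicode_example` and `raw_units` are hypothetical helpers. The sketch assumes a Unix or Windows target.

```rust
use std::ffi::{OsStr, OsString};

#[cfg(unix)]
fn non_unicode_example() -> OsString {
    // The import is the explicit opt-in to Unix-only functionality.
    use std::os::unix::ffi::OsStringExt;
    // 0xFF never occurs in UTF-8, so this value is not Unicode.
    OsString::from_vec(vec![b'f', b'o', 0xFF])
}

#[cfg(windows)]
fn non_unicode_example() -> OsString {
    // The import is the explicit opt-in to Windows-only functionality.
    use std::os::windows::ffi::OsStringExt;
    // 0xD800 is an unpaired lead surrogate, so this is not valid UTF-16.
    OsString::from_wide(&[0x66, 0x6F, 0xD800])
}

/// Platform-specific view of the underlying units, widened to `u32`
/// so that both platforms share one signature.
#[cfg(unix)]
fn raw_units(s: &OsStr) -> Vec<u32> {
    use std::os::unix::ffi::OsStrExt;
    s.as_bytes().iter().map(|&b| u32::from(b)).collect()
}

#[cfg(windows)]
fn raw_units(s: &OsStr) -> Vec<u32> {
    use std::os::windows::ffi::OsStrExt;
    s.encode_wide().map(u32::from).collect()
}

fn main() {
    let s = non_unicode_example();
    // The cross-platform API reports that the value is not Unicode...
    println!("valid Unicode: {}", s.to_str().is_some());
    println!("lossy view: {}", s.to_string_lossy());
    // ...while the platform-specific view exposes the raw units.
    println!("raw units: {:?}", raw_units(&s));
}
```

Code that never imports one of the `ffi` extension traits cannot observe the platform representation, so a plain search for those imports shows where a program depends on platform-specific string behavior.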
+
+ ### The future
+ [The future]: #the-future
+
+ Introducing an additional string type is a bit daunting, since many
+ existing APIs take and consume only standard Rust strings. Today's
+ solution demands that strings coming from the OS be assumed or turned
+ into Unicode, and the proposed API continues to allow that (with more
+ explicit and finer-grained control).
+
+ In the long run, however, robust applications are likely to work
+ opaquely with OS strings far beyond the boundary to the system to
+ avoid data loss and ensure maximal compatibility. If this situation
+ becomes common, it should be possible to introduce an abstraction over
+ various string types and generalize most functions that work with
+ `String`/`str` to instead work generically. This RFC does *not*
+ propose taking any such steps now -- but it's important that we *can*
+ do so later if Rust's standard strings turn out to not be sufficient
+ and OS strings become commonplace.

## Deadlines
[Deadlines]: #deadlines
@@ -547,4 +767,53 @@ principles or visions) are outside the scope of this RFC.
# Unresolved questions
[Unresolved questions]: #unresolved-questions

- > To be expanded in a follow-up PR.
+ > To be expanded in follow-up PRs.
+
+ ## Wide string representation
+
+ (Text from @SimonSapin)
+
+ Rather than WTF-8, `OsStr` and `OsString` on Windows could use
+ potentially-ill-formed UTF-16 (a.k.a. "wide" strings), with a
+ different cost trade-off.
+
+ Upside:
+ * No conversion between `OsStr` / `OsString` and OS calls.
+
+ Downsides:
+ * More expensive conversions between `OsStr` / `OsString` and `str` / `String`.
+ * These conversions have inconsistent performance characteristics between platforms. (Need to allocate on Windows, but not on Unix.)
+ * Some of them return `Cow`, which has an ergonomic cost.
+
+ The API (only parts that differ) could look like:
+
+ ```rust
+ pub mod os_str {
+     #[cfg(windows)]
+     mod imp {
+         type Buf = Vec<u16>;
+         type Slice = [u16];
+         ...
+     }
+
+     impl OsStr {
+         pub fn from_str(&str) -> Cow<OsString, OsStr>;
+         pub fn to_string(&self) -> Option<CowString>;
+         pub fn to_string_lossy(&self) -> CowString;
+     }
+
+     #[cfg(windows)]
+     pub mod windows {
+         trait OsStringExt {
+             fn from_wide_slice(&[u16]) -> Self;
+             fn from_wide_vec(Vec<u16>) -> Self;
+             fn into_wide_vec(self) -> Vec<u16>;
+         }
+
+         trait OsStrExt {
+             fn from_wide_slice(&[u16]) -> Self;
+             fn as_wide_slice(&self) -> &[u16];
+         }
+     }
+ }
+ ```
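To make the cost trade-off concrete, here is a small sketch of a UTF-16-backed string. `WideString` is a hypothetical stand-in, not a type from this RFC or from `std`; it only uses `str::encode_utf16`, `String::from_utf16`, and `String::from_utf16_lossy` from the standard library, and it always returns owned values rather than the `Cow` types suggested above.

```rust
/// Hypothetical UTF-16-backed OS string, for illustration only.
struct WideString(Vec<u16>);

impl WideString {
    /// Always allocates: UTF-8 has to be transcoded into 16-bit units.
    fn from_str(s: &str) -> WideString {
        WideString(s.encode_utf16().collect())
    }

    /// Copies the units unchanged, even if they are ill-formed UTF-16.
    fn from_wide(units: &[u16]) -> WideString {
        WideString(units.to_vec())
    }

    /// Free: this is the representation that wide OS calls expect.
    fn as_wide(&self) -> &[u16] {
        &self.0
    }

    /// Validates and transcodes in one step, allocating a new `String`.
    fn to_string(&self) -> Option<String> {
        String::from_utf16(&self.0).ok()
    }

    /// Replaces unpaired surrogates with U+FFFD.
    fn to_string_lossy(&self) -> String {
        String::from_utf16_lossy(&self.0)
    }
}

fn main() {
    let ok = WideString::from_str("hi");
    println!("{:?} -> {:?}", ok.as_wide(), ok.to_string());

    // An unpaired lead surrogate cannot be turned into a `String`.
    let bad = WideString::from_wide(&[0x68, 0xD800]);
    println!("{:?} -> {:?}", bad.as_wide(), bad.to_string());
    println!("lossy: {}", bad.to_string_lossy());
}
```

Every conversion to or from `str` walks and copies the data here, whereas with a WTF-8 backing the `str` to OS string direction is free and the reverse is only a validation pass; the price of WTF-8 is paid instead at the boundary with wide OS calls.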