diff --git a/text/0000-char-uax-31.md b/text/0000-char-uax-31.md
new file mode 100644
index 00000000000..1ab9bb32999
--- /dev/null
+++ b/text/0000-char-uax-31.md
@@ -0,0 +1,107 @@
+- Feature Name: `char_uax_31`
+- Start Date: 2019-04-24
+- RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000)
+- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000)
+
+# Summary
+[summary]: #summary
+
+Add functions to the standard library for testing a `char` against the [UAX #31](https://unicode.org/reports/tr31/) ("Unicode Annex 31")
+properties `Pattern_White_Space`, `Pattern_Syntax`, `XID_Start`, `ID_Nonstart`, and `XID_Continue` (the XID ones are already in the standard
+library, but are unstable; this RFC proposes to stabilize them).
+
+# Motivation
+[motivation]: #motivation
+
+As a systems language, Rust is heavily used for parsing.
+As a progressive, forward-thinking language that accepts anyone,
+Rust supports Unicode and makes the definitive string types UTF-8.
+At the intersection of these needs sits *UAX #31: Unicode Identifier and Pattern Syntax* ("Annex 31"),
+a standardized set of code point categories for defining computer language syntax.
+
+Annex 31 is being used in production Rust code already.
+Rust's own compiler already has functions to check against Annex 31 code point categories in the lexer,
+[but not everyone who works on the compiler knows about them](https://internals.rust-lang.org/t/do-we-need-unicode-whitespace/9876),
+and since they're not in the standard library,
+not everyone who works on Rust-related tooling has access to them.
+I'm not asserting that putting these in libstd would've avoided that bug,
+but if they were in the standard library,
+it would resolve the questions about whether third-party tooling can be expected to support the full range of Unicode whitespace.
+
+[Other languages](https://rosettacode.org/wiki/Unicode_variable_names#C), such as C# and Elixir, also follow Annex 31.
+Other common grammars, even ones that aren't actually for programming languages, can also be found or defined in Annex 31,
+such as hashtags and XML.
+
+It's also pretty clear what the "right" API is for this,
+since `is_whitespace` and `is_ascii_whitespace` already set the precedent here,
+so there's little need to experiment with API design.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Some languages, such as Rust and C#, use Unicode Annex 31 to define their syntax.
+In addition to the functions for checking "ASCII white space" and "Unicode white space,"
+the Annex 31 checks are also exposed as methods on the `char` type.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## `fn char::is_id_nonstart(self) -> bool`
+
+Check if `self` is a member of Unicode Annex 31's `ID_Nonstart` code point category.
+This function is defined as `self.is_xid_continue() && !self.is_xid_start()`.
+
+## `fn char::is_pattern_syntax(self) -> bool`
+
+Check if `self` is a member of Unicode Annex 31's `Pattern_Syntax` code point category.
+
+## `fn char::is_pattern_white_space(self) -> bool`
+
+Check if `self` is a member of Unicode Annex 31's `Pattern_White_Space` code point category.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+The big problem, which has always made designing the text APIs hard,
+is that it's not clear how much of Unicode we want to include in libstd.
+The standard library certainly doesn't want a hashtag parser, even though Annex 31 describes one in section 6,
+and libstd certainly doesn't want a character shaping algorithm,
+even though Unicode places plenty of requirements on that process, too.
+
+Another problem is that a lot of languages aren't defined in terms of Annex 31 anyway,
+like Swift and HTML, which simply spell out the set of allowed code points themselves,
+so this isn't necessarily useful to all language implementers.
+
+The other big drawback is that Unicode changes, so keeping the standard library synced with it represents a
+backwards-compatibility hazard. `is_whitespace` already has this problem, but the set of Unicode whitespace changes less
+frequently than XID does, so the behavior of these functions would be expected to change more often.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+The design was chosen to line up with how character classification is already being done (like `is_whitespace`).
+The alternative, of providing a more generic classification API,
+seems to have enough room for debate that it would be better served in crates that provide purpose-built frameworks.
+In particular, this proposal is made for the benefit of parsers, not text layout engines.
+Those will still need to use things like `rust-unic`.
+
+# Prior art
+[prior-art]: #prior-art
+
+There's already a crate that mostly provides this API, [unicode-xid](https://lib.rs/crates/unicode-xid),
+but it's actually less comprehensive than this proposal (it only provides XID_Start and XID_Continue).
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+- What about ID_Start and ID_Continue? They're deprecated by the Unicode Consortium, but probably still useful for parsing some languages.
+- `is_pattern_white_space`, like UAX #31 spells it? Or `is_pattern_whitespace`, for consistency with the rest of libstd?
+
+# Future possibilities
+[future-possibilities]: #future-possibilities
+
+What does [Mosh](https://mosh.org/) need to know for its UTF-8 handling?
+Anything that's necessary to implement a correct UTF-8 enabled VT100 state machine seems applicable to Rust,
+since that state machine is separate from the text shaping itself, but still has to know things like combining marks,
+and what's necessary there is probably necessary for other, similar state machines like HTML and PDF,
+where you have to pick out weird combining-mark corner cases.
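
For reviewers who want to try the proposed `is_pattern_white_space` check from the reference-level explanation before anything lands in libstd, here is a minimal sketch written as an extension trait. The trait name `PatternWhiteSpace` and the helper `skip_pattern_white_space` are hypothetical and not part of this RFC, which proposes an inherent method on `char` instead. The code point list is the `Pattern_White_Space` set as I understand it from the Unicode property data (eleven code points), so it should be checked against the Unicode Character Database before being relied on.

```rust
/// Hypothetical extension trait standing in for the inherent method
/// `char::is_pattern_white_space` proposed by this RFC.
trait PatternWhiteSpace {
    fn is_pattern_white_space(self) -> bool;
}

impl PatternWhiteSpace for char {
    fn is_pattern_white_space(self) -> bool {
        // Assumed contents of Pattern_White_Space: U+0009..=U+000D,
        // U+0020, U+0085, U+200E, U+200F, U+2028, U+2029.
        matches!(
            self,
            '\u{0009}'..='\u{000D}'
                | '\u{0020}'
                | '\u{0085}'
                | '\u{200E}'
                | '\u{200F}'
                | '\u{2028}'
                | '\u{2029}'
        )
    }
}

/// Hypothetical lexer helper: drops leading Pattern_White_Space from `src`.
fn skip_pattern_white_space(src: &str) -> &str {
    src.trim_start_matches(|c: char| c.is_pattern_white_space())
}
```

Note that this set is not the same as what `char::is_whitespace` accepts: U+00A0 NO-BREAK SPACE satisfies `is_whitespace` but is not in the set above, while U+200E LEFT-TO-RIGHT MARK is in the set above but does not satisfy `is_whitespace`. That difference is why a lexer following Annex 31 cannot simply reuse `is_whitespace`.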