Refactor lexer to treat all input characters as UTF-8 #2307
CohenArthur merged 1 commit into Rust-GCC:master
// Check if the input source is valid as utf-8 and copy all characters to
// `chars`.
void init ()
{
Modified InputSource to check that the input string is valid UTF-8 and to push the decoded characters into its buffer (a field) immediately after an instance of the class is created (i.e. this method is a post-constructor).
This way, we do not have to decode each Unicode character more than once.
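To illustrate the decode-once idea, here is a minimal sketch of validating a byte string as UTF-8 and copying its code points into a buffer in a single pass. The function name `decode_utf8` and its signature are hypothetical and are not taken from `rust-lex.h`; the actual `InputSource::init` in this PR may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not the InputSource API): decode `src` into `chars`.
// Returns false if `src` is not well-formed UTF-8.
bool
decode_utf8 (const std::string &src, std::vector<uint32_t> &chars)
{
  // Smallest code point that legitimately needs a sequence of each length.
  static const uint32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};

  chars.clear ();
  std::size_t i = 0;
  while (i < src.size ())
    {
      unsigned char b0 = static_cast<unsigned char> (src[i]);
      uint32_t cp;
      std::size_t len;
      if (b0 < 0x80)
	{ cp = b0; len = 1; }
      else if ((b0 & 0xE0) == 0xC0)
	{ cp = b0 & 0x1F; len = 2; }
      else if ((b0 & 0xF0) == 0xE0)
	{ cp = b0 & 0x0F; len = 3; }
      else if ((b0 & 0xF8) == 0xF0)
	{ cp = b0 & 0x07; len = 4; }
      else
	return false; // stray continuation byte or invalid lead byte

      if (i + len > src.size ())
	return false; // sequence is truncated

      for (std::size_t k = 1; k < len; k++)
	{
	  unsigned char b = static_cast<unsigned char> (src[i + k]);
	  if ((b & 0xC0) != 0x80)
	    return false; // expected a continuation byte
	  cp = (cp << 6) | (b & 0x3F);
	}

      // Reject overlong encodings, surrogates and out-of-range values.
      if (cp < min_for_len[len] || cp > 0x10FFFF
	  || (cp >= 0xD800 && cp <= 0xDFFF))
	return false;

      chars.push_back (cp);
      i += len;
    }
  return true;
}
```

Once the buffer is filled like this, later lookups can index code points directly instead of re-decoding bytes.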
Codepoint
Lexer::peek_codepoint_input ()
{
peek_codepoint_input and skip_codepoint_input are no longer needed.
For now, they are just wrappers around peek_input and skip_input, respectively.
358ffbe to 57f3b06
void
rust_input_source_test ()
{
  std::string src = u8"_abcde\tXYZ\v\f";
  std::vector<uint32_t> expected
    = {'_', 'a', 'b', 'c', 'd', 'e', '\t', 'X', 'Y', 'Z', '\v', '\f'};
  test_buffer_input_source (src, expected);
I don't know how to turn a std::string into a FILE, so only BufferInputSource is tested for now.
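One possible way to get a `FILE *` backed by the contents of a `std::string` is to write the string into a temporary file created with `std::tmpfile` and rewind it. This is only a sketch: the helper name `file_from_string` is hypothetical, and it is not necessarily the approach the selftests ended up using. On POSIX systems, `fmemopen` is an alternative that avoids touching the filesystem.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: returns a FILE* positioned at the start of a
// temporary file holding `src`, or nullptr on failure.  The caller is
// responsible for calling std::fclose, which also removes the file.
FILE *
file_from_string (const std::string &src)
{
  FILE *f = std::tmpfile ();
  if (f == nullptr)
    return nullptr;

  if (std::fwrite (src.data (), 1, src.size (), f) != src.size ())
    {
      std::fclose (f);
      return nullptr;
    }

  std::rewind (f); // read from the beginning again
  return f;
}
```

A test could then pass the returned handle to whatever constructor takes a `FILE *` and compare the decoded characters against the same expected vector used for the buffer-based test.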
Added unit tests for BufferInputSource. See #2307 (comment)
gcc/rust/ChangeLog:

	* lex/rust-lex.cc (is_float_digit): Change types of param to `uint32_t`
	(is_x_digit): Likewise
	(is_octal_digit): Likewise
	(is_bin_digit): Likewise
	(check_valid_float_dot_end): Likewise
	(is_whitespace): Likewise
	(is_non_decimal_int_literal_separator): Likewise
	(is_identifier_start): Likewise
	(is_identifier_continue): Likewise
	(Lexer::skip_broken_string_input): Likewise
	(Lexer::build_token): Remove handling BOM
	(Lexer::parse_in_type_suffix): Modify use of `current_char`
	(Lexer::parse_in_decimal): Likewise
	(Lexer::parse_escape): Likewise
	(Lexer::parse_utf8_escape): Likewise
	(Lexer::parse_partial_string_continue): Likewise
	(Lexer::parse_partial_hex_escape): Likewise
	(Lexer::parse_partial_unicode_escape): Likewise
	(Lexer::parse_byte_char): Likewise
	(Lexer::parse_byte_string): Likewise
	(Lexer::parse_raw_byte_string): Likewise
	(Lexer::parse_raw_identifier): Likewise
	(Lexer::parse_non_decimal_int_literal): Likewise
	(Lexer::parse_decimal_int_or_float): Likewise
	(Lexer::peek_input): Change return type to `Codepoint`
	(Lexer::get_input_codepoint_length): Change to return 1
	(Lexer::peek_codepoint_input): Change to be wrapper of `peek_input`
	(Lexer::skip_codepoint_input): Change to be wrapper of `skip_input`
	(Lexer::test_get_input_codepoint_n_length): Deleted
	(Lexer::split_current_token): Deleted
	(Lexer::test_peek_codepoint_input): Deleted
	(Lexer::start_line): Move backwards
	(assert_source_content): New helper function for selftest
	(test_buffer_input_source): New helper function for selftest
	(test_file_input_source): Likewise
	(rust_input_source_test): New test
	* lex/rust-lex.h (rust_input_source_test): New test
	* rust-lang.cc (run_rust_tests): Add selftest

Signed-off-by: Raiki Tamura <tamaron1203@gmail.com>
static const int max_column_hint = 80;

Optional<std::ofstream &> dump_lex_out;
These lines are just moved backwards, not changed.
@philberty @CohenArthur
// return 0xFFFE;
return 0;
/* TODO: assert that this TokenId is a "simple token" like punctuation and not
 * like "IDENTIFIER"? */
We have a token enum that you can implement a switch statement on to figure that out.
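A rough sketch of what such a switch could look like is below. The enum here is a hypothetical, trimmed-down stand-in; the real token enum in the lexer has many more entries and may use different names, and `is_simple_token` is an illustrative name rather than an existing function.

```cpp
#include <cassert>

// Hypothetical stand-in for the lexer's token enum.
enum TokenId
{
  COMMA,
  SEMICOLON,
  LEFT_PAREN,
  RIGHT_PAREN,
  IDENTIFIER,
  INT_LITERAL,
  STRING_LITERAL
};

// Returns true for tokens whose text is fully determined by their id
// (punctuation and the like), false for tokens that carry a value.
bool
is_simple_token (TokenId id)
{
  switch (id)
    {
    case COMMA:
    case SEMICOLON:
    case LEFT_PAREN:
    case RIGHT_PAREN:
      return true;
    case IDENTIFIER:
    case INT_LITERAL:
    case STRING_LITERAL:
      return false;
    }
  return false; // unreachable for valid ids; keeps the compiler happy
}
```

Listing every enumerator without a `default` label lets the compiler warn (for example with `-Wswitch`) when a new token kind is added but not classified.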
philberty
left a comment
This looks good to me, nothing to add here. Great work!
CohenArthur
left a comment
Looks great! Amazing work, thank you!
}
}

// TODO remove this function
Please open an issue so we don't forget :)
  return 1;
}

// TODO remove this function
Mention this function in the issue as well
  return peek_input ();
}

// TODO remove this function
Addresses #2287, #2309
In this PR, I have modified `peek_input(int n)` and `skip_input(int n)` to handle UTF-8 characters. To do so, I also dramatically modified `InputSource` to decode UTF-8 and buffer its characters.
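
As a rough illustration of why buffering decoded characters simplifies peeking and skipping, the sketch below shows a stream over already-decoded code points where both operations work in units of characters rather than bytes. The class name `CodepointStream`, the `END_OF_INPUT` sentinel, and the exact meaning of `n` are assumptions made for this example and do not mirror the actual `Lexer` or `InputSource` interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sentinel returned when peeking past the end of the input.
const uint32_t END_OF_INPUT = 0xFFFFFFFF;

// Hypothetical stream over code points that were decoded up front.
class CodepointStream
{
public:
  explicit CodepointStream (std::vector<uint32_t> decoded)
    : chars (std::move (decoded)), pos (0)
  {}

  // Look at the code point `n` positions ahead without consuming anything.
  uint32_t peek (std::size_t n = 0) const
  {
    if (n >= chars.size () - pos)
      return END_OF_INPUT;
    return chars[pos + n];
  }

  // Consume `n` code points (assumption: n counts characters, default 1).
  void skip (std::size_t n = 1)
  {
    std::size_t remaining = chars.size () - pos;
    pos += (n < remaining) ? n : remaining;
  }

private:
  std::vector<uint32_t> chars; // one entry per Unicode code point
  std::size_t pos;	       // index of the next unread code point
};
```

Because every element is a whole code point, a multi-byte character such as U+3042 occupies exactly one slot, so a helper that reports the length of the current character can simply return 1.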