-
Notifications
You must be signed in to change notification settings - Fork 45
Add experimental SIMD tokenizer #1934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codex Review: Here are some suggestions.
Reply with @codex fix comments to fix any unresolved comments.
About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex fix this CI failure" or "@codex address that feedback".
| } | ||
|
|
||
| const CHUNK: usize = 64; | ||
|
|
||
| fn is_ident_start(b: u8) -> bool { | ||
| b.is_ascii_alphabetic() || b == b'_' || b >= 0x80 | ||
| } | ||
|
|
||
| fn skip_whitespace(bytes: &[u8]) -> usize { | ||
| let mut i = 0; | ||
| while i + CHUNK <= bytes.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[P1] Avoid infinite loop on non‑ASCII identifiers
When the SIMD tokenizer encounters a byte ≥ 0x80 it treats it as a valid identifier start, but take_identifier only advances over ASCII letters/digits/underscores. For non‑ASCII input len becomes 0 and pos += len never advances, so while pos < bytes.len() spins forever as soon as the source contains any UTF‑8 characters beyond ASCII. Either reject non‑ASCII in is_ident_start or teach take_identifier to advance over those bytes.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the final PR Bugbot will review for you during this billing cycle.
Your free Bugbot reviews will reset on October 13
Details
You are on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle.
To receive Bugbot reviews on all of your PRs, please upgrade to Bugbot Pro by visiting the Cursor dashboard. Your first 14 days will be free!
|
|
||
| str_buff = resort_res.forward_string; | ||
| element_buffer.append(&mut resort_res.elements); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Duplicate EndOfFile Segments in SIMD Path
The SIMD tokenizer adds an EndOfFile token, leading to duplicate EndOfFile segments. The elements_to_segments method also unconditionally adds an EndOfFile segment, resulting in two when the SIMD tokenizer is enabled, unlike the single segment from the non-SIMD path.
Additional Locations (1)
| i += 1; | ||
| } | ||
| i | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Identifier Parsing Mismatch Causes Infinite Loop
The is_ident_start function incorrectly validates non-ASCII bytes, including UTF-8 continuation bytes, as valid identifier starts. This clashes with take_identifier, which only consumes ASCII characters. This inconsistency causes incorrect tokenization of non-ASCII identifiers and, more critically, an infinite loop when take_identifier fails to advance the position.
Additional Locations (1)
Benchmark for 96d48aeClick to view benchmark
|
Benchmark for 6cb3df5Click to view benchmark
|
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
Summary
simd-tokenizerfeature tosqruff-lib-coreand gateportable_simdusageLexerto tokenize withSimdTokenizerwhen the feature is enabledTesting
cargo test -p sqruff-lib-corecargo test -p sqruff-lib-core --features simd-tokenizerhttps://chatgpt.com/codex/tasks/task_e_68bb26bf18688330a6be5fbb04e56cc5