U+200D (zero-width joiner) breaks the parsing #26

Open
mexus opened this issue Dec 26, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@mexus

mexus commented Dec 26, 2023

Long story short:

fn main() {
    assert_eq!(voca_rs::strip::strip_tags("<p>\u{200D}</p>after"), "after");
}

This leads to:

thread 'main' panicked at src/main.rs:2:5:
assertion `left == right` failed
  left: ""
 right: "after"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

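I believe it is caused by how the string is split into grapheme clusters: the zero-width joiner gets attached to the preceding ">", so the closing bracket of the tag is never seen on its own: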

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let graphemes = "<p>\u{200D}</p>".graphemes(true).collect::<Vec<_>>();
    assert_eq!(graphemes, ["<", "p", ">\u{200d}", "<", "/", "p", ">"]);
}

It is very hard to work correctly with Unicode, and it is even harder to make non-trivial assumptions that actually hold (like "a grapheme is a character or something like that", or "nothing would be attached to a normal character in a grapheme") 😢
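One possible way around this, sketched under the assumption that tag delimiters only need to be matched as single code points, is to scan `char`s instead of grapheme clusters, so that nothing following ">" can hide it. `strip_tags_by_char` is a hypothetical helper, not voca_rs's implementation, and it deliberately ignores comments, quoted attributes containing ">", and other HTML details.

```rust
/// Hypothetical helper: drops everything between '<' and '>' (inclusive),
/// scanning Unicode scalar values (`char`) rather than grapheme clusters.
fn strip_tags_by_char(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_tag = false;
    for c in input.chars() {
        match c {
            // A '<' always opens a tag in this simplified model.
            '<' => in_tag = true,
            // A '>' closes the tag only if one is currently open.
            '>' if in_tag => in_tag = false,
            // Anything outside a tag is ordinary text and is kept.
            _ if !in_tag => out.push(c),
            // Anything inside a tag is dropped.
            _ => {}
        }
    }
    out
}

fn main() {
    // The zero-width joiner is kept as ordinary text, and "after" survives.
    let stripped = strip_tags_by_char("<p>\u{200D}</p>after");
    assert_eq!(stripped, "\u{200D}after");
    println!("{stripped:?}");
}
```

Note that in this sketch the joiner itself stays in the output, since it is text content of the paragraph; whether it should be dropped as well is a separate decision.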

@mexus
Author

mexus commented Dec 26, 2023

P.S.

The zero-width joiner is not the only "character" in Unicode that causes this mess. There are at least 2527 characters that trigger exactly the same behavior, and here is a script that finds all of them:

fn main() {
    let mess = (0..=u32::MAX)
        .flat_map(char::from_u32)
        .filter(|c| *c != '<')
        .filter(|c| voca_rs::strip::strip_tags(&format!("<p>{c}</p>after")).is_empty())
        .collect::<Vec<_>>();
    println!("in total = {}", mess.len());
    println!("{mess:?}");
}

(Better run it with --release; it might take some time.)
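The candidates can also be enumerated directly as a `char` range, which skips the surrogate gap and everything above `char::MAX` instead of probing all of `u32`. Below is a minimal sketch of that variant, assuming the check is passed in as a closure. `find_offenders` and the toy predicate in `main` are hypothetical stand-ins; the real check would be the `strip_tags(...).is_empty()` call from the script above.

```rust
/// Hypothetical helper: collects every `char` (except '<') for which
/// `is_broken` returns true. A `char` range yields only valid Unicode
/// scalar values, so no `char::from_u32` filtering is needed.
fn find_offenders(is_broken: impl Fn(char) -> bool) -> Vec<char> {
    ('\0'..=char::MAX)
        .filter(|c| *c != '<')
        .filter(|c| is_broken(*c))
        .collect()
}

fn main() {
    // Toy predicate standing in for the real check, which would be
    // `voca_rs::strip::strip_tags(&format!("<p>{c}</p>after")).is_empty()`.
    let offenders = find_offenders(|c| c == '\u{200D}');
    assert_eq!(offenders, ['\u{200D}']);
    println!("in total = {}", offenders.len());
}
```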

@a-merezhanyi a-merezhanyi added the bug Something isn't working label Dec 26, 2023