Description
I have used this tokenizer with this test string:
"जब मैंने सुबह उठकर खिड़की खोली तो मैंने देखा कि बाहर एक सुंदर सा पक्षी गुनगुनाते हुए पेड़ की डाल पर बैठा था, जिसका रंग सुनहरा था और जिसकी आँखें चमकदार नीली थीं, वह इतना खूबसूरत था कि मैं उसे देखकर मंत्रमुग्ध हो गया।"
And got:
[]string{
"जब",
"म",
"\xe0",
"\xa5",
"\x88",
"\xe0",
"\xa4",
"\x82",
"न",
"\xe0",
"\xa5",
"\x87",
"स",
// and so on
}
At indexes where diacritics are expected, there seem to be several separate one-byte-long sequences.
So, if I understand correctly, this string tokenizer does not try to honor Unicode code point/grapheme boundaries in the byte sequence, and is therefore primarily meant for escaped, ASCII-like content, right?
It would be nice if the description mentioned that these Unicode boundaries are not considered. (github.com/rivo/uniseg seems to be the tool for that.)
Thanks