Unicode codepoint/grapheme boundaries #34

@tassa-yoniso-manasi-karoto

Description

I have used this tokenizer with this test string:
"जब मैंने सुबह उठकर खिड़की खोली तो मैंने देखा कि बाहर एक सुंदर सा पक्षी गुनगुनाते हुए पेड़ की डाल पर बैठा था, जिसका रंग सुनहरा था और जिसकी आँखें चमकदार नीली थीं, वह इतना खूबसूरत था कि मैं उसे देखकर मंत्रमुग्ध हो गया।"

And got:

[]string{
	"जब",
	"म",
	"\xe0",
	"\xa5",
	"\x88",
	"\xe0",
	"\xa4",
	"\x82",
	"न",
	"\xe0",
	"\xa5",
	"\x87",
	"स",
	// and so on
}

At the indexes where diacritics are expected, the output contains several separate one-byte sequences.
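For context, here is a minimal stdlib-only sketch (not part of this tokenizer) of why those one-byte fragments appear: the Devanagari vowel sign ै (U+0948) encodes to the three UTF-8 bytes 0xE0 0xA5 0x88, exactly the fragments seen above, so any splitter that works byte by byte can cut through the middle of a codepoint. `unicode/utf8` walks the string codepoint by codepoint instead (though even that still separates a base letter from its combining marks, which is where grapheme clusters come in):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesOf decodes s codepoint by codepoint with the standard
// library, so no rune is ever split mid-byte.
func runesOf(s string) []rune {
	out := make([]rune, 0, len(s))
	for i, w := 0, 0; i < len(s); i += w {
		var r rune
		r, w = utf8.DecodeRuneInString(s[i:])
		out = append(out, r)
	}
	return out
}

func main() {
	s := "मैं" // म (U+092E) + vowel sign ै (U+0948) + anusvara ं (U+0902)
	fmt.Printf("% x\n", []byte(s)) // e0 a4 ae e0 a5 88 e0 a4 82
	fmt.Printf("%U\n", runesOf(s)) // [U+092E U+0948 U+0902]
}
```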

So, if I understand correctly, this string tokenizer does not try to honor Unicode codepoint/grapheme boundaries in the byte sequence, and is therefore primarily meant for escaped, ASCII-like content, right?

It would be nice if the description mentioned that these Unicode boundaries are not considered. (github.com/rivo/uniseg seems to be the tool for that.)

Thanks

Metadata

Labels

enhancement (New feature or request)
