Unicode codepoint/grapheme boundaries #34

@tassa-yoniso-manasi-karoto

Description

I have used this tokenizer with this test string:
"जब मैंने सुबह उठकर खिड़की खोली तो मैंने देखा कि बाहर एक सुंदर सा पक्षी गुनगुनाते हुए पेड़ की डाल पर बैठा था, जिसका रंग सुनहरा था और जिसकी आँखें चमकदार नीली थीं, वह इतना खूबसूरत था कि मैं उसे देखकर मंत्रमुग्ध हो गया।"

And got:

[]string{
	"जब",
	"म",
	"\xe0",
	"\xa5",
	"\x88",
	"\xe0",
	"\xa4",
	"\x82",
	"न",
	"\xe0",
	"\xa5",
	"\x87",
	"स",
	// and so on
}

At the indexes where diacritics are expected, the output contains several separate one-byte sequences.
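For context, here is a minimal stdlib-only sketch (not part of this tokenizer) of why those one-byte fragments appear: the Devanagari vowel sign ै (U+0948) encodes to the three UTF-8 bytes 0xE0 0xA5 0x88, exactly the fragments seen above, so any splitter that works byte by byte can cut through the middle of a codepoint. `unicode/utf8` walks the string codepoint by codepoint instead (though even that still separates a base letter from its combining marks, which is where grapheme clusters come in):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesOf decodes s codepoint by codepoint with the standard
// library, so no rune is ever split mid-byte.
func runesOf(s string) []rune {
	out := make([]rune, 0, len(s))
	for i, w := 0, 0; i < len(s); i += w {
		var r rune
		r, w = utf8.DecodeRuneInString(s[i:])
		out = append(out, r)
	}
	return out
}

func main() {
	s := "मैं" // म (U+092E) + vowel sign ै (U+0948) + anusvara ं (U+0902)
	fmt.Printf("% x\n", []byte(s)) // e0 a4 ae e0 a5 88 e0 a4 82
	fmt.Printf("%U\n", runesOf(s)) // [U+092E U+0948 U+0902]
}
```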

So, if I understand correctly, this string tokenizer does not try to honor Unicode codepoint/grapheme boundaries in the byte sequence, and is therefore primarily meant for escaped, ASCII-like content, right?

It would be nice if the description mentioned that these Unicode boundaries are not considered. (github.com/rivo/uniseg seems to be the tool for that.)

Thanks

Metadata

Labels

enhancement (New feature or request)
