I notice that raw texts are very often taken from the web and contain HTML codes such as &amp;amp; or &amp;quot; in the raw text.
These strings are not recognized by ucto and are split into multiple tokens: & amp ;
Obviously the user is responsible for clean text input, but these HTML codes are easily overlooked in large quantities of data.
We could:
- add a rule to recognize HTML codes and keep them as 1 token
- add a rule to recognize HTML codes and replace them with the actual character they represent
- perhaps only give a warning --this text contains HTML codes-- ?
- just close this issue without any changes ;-)
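
To make the second and third options concrete, here is a minimal sketch of what they could look like as a preprocessing step run before the text reaches ucto. It is not ucto code and does not use ucto's API: the names `ENTITY_RE`, `find_html_entities` and `clean_text` are made up for illustration, and the only library call involved is `html.unescape` from the Python standard library.

```python
import html
import re
import sys

# Matches named (&amp;), decimal (&#38;) and hexadecimal (&#x26;) character references.
# Hypothetical pattern for illustration; it only checks the shape, not whether the name exists.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def find_html_entities(text):
    """Return every substring that looks like an HTML character reference."""
    return ENTITY_RE.findall(text)


def clean_text(text, warn=True):
    """Optionally warn about HTML codes, then replace them with the characters they represent."""
    found = find_html_entities(text)
    if warn and found:
        # Option 3: only report that the input contains HTML codes.
        print(
            f"warning: this text contains HTML codes: {sorted(set(found))}",
            file=sys.stderr,
        )
    # Option 2: decode the references into the actual characters.
    return html.unescape(text)


if __name__ == "__main__":
    print(clean_text("Tom &amp; Jerry said &quot;hi&quot;"))
```

The pattern can produce false positives on text that merely looks like a character reference, so the warning is best read as a hint rather than a definite diagnosis.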