HTML Ampersand Character Codes

I notice that raw texts are very often taken from the web and still contain HTML character codes such as &amp; or &quot;.

These strings are not recognized by ucto and are split into multiple tokens, e.g. &amp; becomes: & amp ;

Obviously the user is responsible for clean text input, but these HTML codes are easily overlooked in large quantities of data.
We could:
- add a rule to recognize HTML codes and keep them as a single token
- add a rule to recognize HTML codes and replace them with the actual character they represent
- perhaps only give a warning ("this text contains HTML codes")?
- just close this issue without any changes ;-)
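
To make the "replace" and "warn" options concrete, here is a rough sketch of what they could look like as a preprocessing step run on the text before it is passed to ucto. This is not ucto code and does not use any ucto API; the names `ENTITY_RE`, `find_entities` and `clean_entities` are made up for illustration, and the pattern is only an approximation of what counts as an HTML code (named, decimal and hexadecimal references ending in a semicolon).

```python
import html
import re
import warnings

# Assumption: an "HTML code" is a named (&amp;), decimal (&#38;) or
# hexadecimal (&#x26;) character reference terminated by a semicolon.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def find_entities(text):
    """Return all HTML character references found in text, in order."""
    return ENTITY_RE.findall(text)


def clean_entities(text, replace=True):
    """Warn if text contains HTML codes; optionally decode them.

    Hypothetical helper for illustration only, not part of ucto.
    """
    found = find_entities(text)
    if found:
        warnings.warn(f"this text contains HTML codes: {sorted(set(found))}")
    if not replace:
        return text
    # Decode only the matched references; html.unescape leaves
    # unknown names such as &foo; unchanged.
    return ENTITY_RE.sub(lambda m: html.unescape(m.group(0)), text)


if __name__ == "__main__":
    print(clean_entities("Tom &amp; Jerry said &quot;hi&quot;"))
```

With `replace=False` the text is returned unchanged and only the warning is issued, which corresponds to the third option in the list. Keeping the codes as one token (the first option) would have to happen in the tokenizer rules themselves and is not covered by this sketch.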


HTML Ampersand Character Codes #57
