HTML Ampersand Character Codes #57

@Irishx

I notice that raw texts are very often taken from the web and contain strings such as &amp; or &quot;.

These strings are not recognized by ucto and are split into multiple tokens: & amp ;

Obviously the user is responsible for clean text input, but these HTML codes are easily overlooked in large quantities of data.
We could:

  • add a rule to recognize HTML codes and keep them as 1 token
  • add a rule to recognize HTML codes and replace them with the actual character they represent
  • perhaps only give a warning --this text contains HTML codes-- ?
  • just close this issue without any changes ;-)
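
To make the first three options concrete, here is a minimal Python sketch. It is not ucto code and does not use ucto's rule syntax; the function names and the regular expression are hypothetical and only illustrate the behaviour each option would have. The only library call relied on is html.unescape from the Python standard library.

```python
import html
import re

# Assumed pattern for HTML codes: named (&amp;), decimal (&#38;)
# and hexadecimal (&#x26;) character references.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")

# Naive stand-in tokenizer: an HTML code, a run of word characters,
# or a single punctuation character. Real ucto rules are far richer.
TOKEN_RE = re.compile(ENTITY_RE.pattern + r"|\w+|[^\w\s]")


def tokenize_keep_entities(text):
    """Option 1: recognize HTML codes and keep each one as a single token."""
    return TOKEN_RE.findall(text)


def decode_entities(text):
    """Option 2: replace HTML codes with the characters they represent."""
    return html.unescape(text)


def contains_entities(text):
    """Option 3: report whether the text contains HTML codes (for a warning)."""
    return ENTITY_RE.search(text) is not None


if __name__ == "__main__":
    sample = "Tom &amp; Jerry"
    print(tokenize_keep_entities(sample))
    print(decode_entities(sample))
    if contains_entities(sample):
        print("warning: this text contains HTML codes")
```

Decoding before tokenization (option 2) changes the input text, so token offsets would no longer match the original file; keeping the code as one token (option 1) preserves the input but pushes the decoding problem downstream.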
