Here you split words only on whitespace. You should use word boundaries instead (in a regexp: \b, or something equivalent, such as a set of common separators). Otherwise, the word goes undetected in many contexts. For example, I have the word oster-monath in the disallowed words file, but when it appears in a sentence between quotation marks ("oster-monath") or next to a comma (oster-monath, ), it is not detected.
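To illustrate the difference, here is a minimal sketch in Python. It is not the project's actual code: the function names `find_by_whitespace` and `find_disallowed` are hypothetical, and I use lookarounds as one possible equivalent of \b that also copes with words that begin or end with punctuation.

```python
import re


def find_by_whitespace(text, disallowed):
    """Whitespace-only tokenization: punctuation stays glued to the token."""
    tokens = text.split()
    return [word for word in disallowed if word in tokens]


def find_disallowed(text, disallowed):
    """Match each disallowed word only where it is not part of a longer word."""
    found = []
    for word in disallowed:
        # (?<!\w) and (?!\w) behave like \b here, but still work if the
        # disallowed word itself starts or ends with a non-word character.
        pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
        if re.search(pattern, text):
            found.append(word)
    return found


words = ["oster-monath"]
quoted = 'The text says "oster-monath" here.'
comma = "He wrote oster-monath, and then stopped."

print(find_by_whitespace(quoted, words))  # misses the quoted word
print(find_disallowed(quoted, words))     # finds it
print(find_disallowed(comma, words))      # finds it next to the comma
```

With whitespace splitting, the tokens are `"oster-monath"` (quotes included) and `oster-monath,` (comma included), so neither equals the entry in the disallowed words file.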
Case is very important for spelling in most languages. I think the disallowed words should be matched case-sensitively. Case-insensitive matching is sometimes used in NLP, but not in spell-checking! It could be made optional for each language.
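A per-language option could look roughly like the sketch below. This is only an assumption about how it might be wired up: the `CASE_SENSITIVE` table, the language codes, and the `is_disallowed` function are hypothetical names, not existing settings of the tool.

```python
import re

# Hypothetical per-language setting; case-sensitive unless stated otherwise.
CASE_SENSITIVE = {"en": True, "de": True}


def is_disallowed(text, word, lang, case_sensitive=CASE_SENSITIVE):
    """Check whether `word` occurs in `text` as a whole word.

    Matching is case-sensitive by default; a language can opt out
    through the (hypothetical) `case_sensitive` table.
    """
    flags = 0 if case_sensitive.get(lang, True) else re.IGNORECASE
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return re.search(pattern, text, flags) is not None


sentence = "I visited my Polish friends."

# Case-sensitive: "polish" (the verb) is disallowed, "Polish" is left alone.
print(is_disallowed(sentence, "polish", "en"))

# Same check with case-insensitive matching switched on for "en".
print(is_disallowed(sentence, "polish", "en", {"en": False}))
```

The first call reports no match and the second reports a match, which is exactly the kind of false positive that case-insensitive matching produces in a spell-checking context.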