Recently, an --add-tokens option was introduced in Ucto to add extra 'TOKENS' to the configuration.
We might consider extending this, so a user could add extra, non-default rules/items to the tokenizer.
Some caveats to consider:
- are the extra rules additional, or do they override?
- make it possible to disable a certain rule
- are the additions language specific? How to express that?
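
To make the three questions concrete, here is a rough Python sketch of one possible merge semantics. It is not Ucto code and does not reflect Ucto's configuration format: `ExtraRules`, `merge_rules`, and the rule names and patterns are all hypothetical, chosen only to illustrate the choices (add vs. override, disabling a rule, language scoping).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ExtraRules:
    """Hypothetical container for user-supplied rules (not Ucto's format)."""
    rules: Dict[str, str] = field(default_factory=dict)  # rule name -> pattern
    disabled: Set[str] = field(default_factory=set)      # rule names to switch off
    language: Optional[str] = None                       # None: applies to all languages
    override: bool = False                               # may extras replace defaults?


def merge_rules(defaults: Dict[str, str], extra: ExtraRules, language: str) -> Dict[str, str]:
    """Return the effective rule set for `language` under the assumed semantics."""
    merged = dict(defaults)
    # Language-specific additions are ignored for every other language.
    if extra.language is not None and extra.language != language:
        return merged
    for name, pattern in extra.rules.items():
        # Additive mode: an existing default wins unless override is requested.
        if name in merged and not extra.override:
            continue
        merged[name] = pattern
    # Disabling is applied last, so it also removes a freshly added rule.
    for name in extra.disabled:
        merged.pop(name, None)
    return merged


defaults = {"URL": r"https?://\S+", "SMILEY": r"[:;]-?[)(]"}
extra = ExtraRules(
    rules={"URL": r"www\.\S+", "HASHTAG": r"#\w+"},
    disabled={"SMILEY"},
    language="nld",
)
effective = merge_rules(defaults, extra, "nld")
```

In this sketch the default is additive (the built-in `URL` pattern is kept), overriding is opt-in, and a rule set without a language applies everywhere. Each of these defaults is exactly the kind of decision the caveats above ask about.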