Boosting tagger and lemmatizer using additional data #4760
Replies: 2 comments
-
These are all reasonable things to want to include, and I hope most or all of this will be possible in the future, but right now none of it is very easy with spaCy, sorry! We would like to develop a neural lemmatizer, but first we're working on a morphological analyzer, and it's still very much under development.

The options for lemmatization are a bit limited, since spaCy only supports lookup-based and rule-based lemmatizers. The rule-based lemmatizer can currently use the UD POS tag and suffix rules along with a table of exceptions, but it doesn't have any other features or other kinds of rules. That's enough for English and for languages with a similar amount of suffix-based inflection, but it's not great for many other languages.

You might be able to rely heavily on the exceptions table to get a more POS-based lookup than the plain lookup lemmatizer gives you, but you might still run into some issues because the supported POS tags are still a bit English-centric. What language are you working on?
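To make the description above concrete, here is a minimal plain-Python sketch of how a rule-based lemmatizer can combine a POS tag, an exceptions table, and suffix rules. It is not spaCy's implementation or API: the table contents, the `KNOWN_LEMMAS` index, and the `lemmatize` function are all hypothetical names chosen for illustration.

```python
# Hypothetical tables for illustration only; not spaCy's actual data.
EXCEPTIONS = {
    "NOUN": {"mice": "mouse", "children": "child"},
    "VERB": {"went": "go", "was": "be"},
}

# Suffix rules per UD POS tag: (suffix to strip, replacement).
SUFFIX_RULES = {
    "NOUN": [("ies", "y"), ("ses", "s"), ("s", "")],
    "VERB": [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")],
}

# Small index of known lemmas used to validate rule output (assumed to exist).
KNOWN_LEMMAS = {
    "NOUN": {"city", "bus", "cat"},
    "VERB": {"walk", "carry", "jump"},
}


def lemmatize(word, upos):
    """Return a lemma for `word` given its UD POS tag `upos`."""
    form = word.lower()

    # 1. The exceptions table wins over everything else.
    exceptions = EXCEPTIONS.get(upos, {})
    if form in exceptions:
        return exceptions[form]

    # 2. Try each suffix rule; prefer a candidate found in the lemma index.
    candidates = []
    for old, new in SUFFIX_RULES.get(upos, []):
        if form.endswith(old):
            candidate = form[: len(form) - len(old)] + new
            if candidate in KNOWN_LEMMAS.get(upos, set()):
                return candidate
            candidates.append(candidate)

    # 3. Fall back to the first rule that matched, or to the form itself.
    return candidates[0] if candidates else form


print(lemmatize("cities", "NOUN"))
print(lemmatize("went", "VERB"))
```

Putting more entries into the exceptions table, keyed by POS tag, is what makes this behave like a POS-based lookup: any form listed there bypasses the suffix rules entirely.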
-
Thank you very much for your prompt response, and good luck with the further development! The language I use spaCy for is Serbian, which is highly inflected: for example, nouns can vary in gender, number and case. A lookup table would therefore be highly repetitive, so at least some heuristics would be desirable. For now, I think we will try to work out some rules. Thank you once again!
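As a rough illustration of why a few rules can replace many lookup entries, the sketch below maps the inflected forms of one Serbian noun to a single lemma. The rule list is an assumption made for this example and covers only one declension class (feminine nouns ending in -a); `NOUN_SUFFIX_RULES` and `lemmatize_noun` are hypothetical names, not part of spaCy.

```python
# Illustrative suffix rules for a single declension class
# (feminine nouns in -a); longer suffixes are tried first.
NOUN_SUFFIX_RULES = [
    ("ama", "a"),
    ("om", "a"),
    ("e", "a"),
    ("i", "a"),
    ("u", "a"),
    ("o", "a"),
]


def lemmatize_noun(form):
    """Apply the first matching suffix rule, or return the form unchanged."""
    for old, new in NOUN_SUFFIX_RULES:
        if form.endswith(old):
            return form[: -len(old)] + new
    return form


# Seven inflected forms that a plain lookup table would list one by one.
forms = ["žena", "žene", "ženi", "ženu", "ženo", "ženom", "ženama"]
lemmas = {lemmatize_noun(f) for f in forms}
print(lemmas)
```

Forms with consonant alternations, such as "knjizi" from "knjiga", are not handled by suffix stripping alone and would still need entries in an exceptions table.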
-
Info about spaCy
Dear spaCy developers,
I have two questions about including external resources in the tagger training procedure.
I saw that a tagger model can be improved using the word vectors. But:
and
More precisely, we would like to include external resources, in addition to our tagger training corpora, that give for each word its lemma, universal part-of-speech tag and grammatical categories.
Many thanks