Reading the docstring for validate_single_subject in validate.py, I see that it is permitted to use the MISC column to specify an outer subject, instead of the deprel subtypes csubj:outer and nsubj:outer, when they are very rare. Without having done the experiment, I suspect that the ARCOSG corpus for Scottish Gaelic is small enough for this to be the case.
I'm reviewing proper names and pondering whether to use nmod:desc, in line with English, when I retag the various Sirs, Lords, Ladies, Reverends and Professors (Sir, am Mòrar, Leadaidh, an t-Urramach, an t-Ollamh). However, I'm concerned that they may be too sparse, and that I should instead stick to nmod:unmarked and put something in the MISC column.
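For anyone who wants to run the experiment, here is a minimal sketch of counting deprel labels in a CoNLL-U file. It assumes only the standard ten tab-separated columns with DEPREL in column 8; the function name `count_deprels` and the three-token sample are made up for illustration and are not taken from ARCOSG or from validate.py.

```python
from collections import Counter


def count_deprels(conllu_text):
    """Count DEPREL values (column 8) over the basic words of a CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence breaks) and sentence-level comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id = cols[0]
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 3.1).
        if "-" in token_id or "." in token_id:
            continue
        counts[cols[7]] += 1
    return counts


# Toy data, invented for this example only (placeholder word forms).
SAMPLE = "\n".join([
    "# sent_id = toy-1",
    "1\tw1\t_\tNOUN\t_\t_\t2\tnmod:desc\t_\t_",
    "2\tw2\t_\tPROPN\t_\t_\t3\tnsubj\t_\t_",
    "3\tw3\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "",
])

counts = count_deprels(SAMPLE)
for label in ("nsubj:outer", "csubj:outer", "nmod:desc"):
    print(label, counts[label])
```

On a real treebank you would read each file's contents with `open(path, encoding="utf-8").read()` and pass the string in; a `Counter` returns 0 for labels that never occur, so absent subtypes show up explicitly.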
Does anyone with more experience of parsing corpora have a feel for how few examples can be plausibly learnt by current parsers?
It also feels a bit unsatisfactory to have annotation decisions based on how big your corpus is, but I'd rather have something useful with a few odd corner cases than something less useful but perfect.
I wouldn't read too much into that comment in the validator—the history is that some flexibility was put in to ease the transition to a new validation rule for outer subjects.
In general, aside from a few "semi-mandatory" subtypes listed on this page, different subtypes will make sense for different languages (and the same goes for different MISC features). Frequency is only one factor to consider; others are how many different relations annotators will have to learn, how confusing it would be to have the same label for two different constructions, and whether you want to encourage crosslingual comparisons with other treebanks that use the subtype.
(BTW, the UCxn project's specification of MISC information could be an alternative to a distinct subtype in some cases.)
> the history is that some flexibility was put in to ease the transition to a new validation rule for outer subjects.
I wouldn't say frequency was not related to this - in UD Coptic, for example, the outer labels would have appeared only in dev/test, so they would have made it impossible to test a parser properly on the corpus. I think that might have been the proximal cause for adding the MISC option.
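The situation described for Coptic, where a label occurs only in dev/test, can be checked mechanically. The following is a rough sketch, assuming each split is available as a CoNLL-U string; the helper names `deprels_in` and `splits_containing`, the split names, and the toy data are all hypothetical and not part of any UD tool.

```python
def deprels_in(conllu_text):
    """Return the set of DEPREL values (column 8) used in a CoNLL-U string."""
    labels = set()
    for line in conllu_text.splitlines():
        # Skip blank lines and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed lines, multiword token ranges and empty nodes.
        if len(cols) != 10 or "-" in cols[0] or "." in cols[0]:
            continue
        labels.add(cols[7])
    return labels


def splits_containing(label, splits):
    """Return the sorted names of the splits in which `label` occurs."""
    return sorted(name for name, text in splits.items()
                  if label in deprels_in(text))


# Toy splits, invented for this example only (placeholder word forms).
TRAIN = "\n".join([
    "1\tw1\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tw2\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "",
])
DEV = "\n".join([
    "1\tw1\t_\tNOUN\t_\t_\t3\tnsubj:outer\t_\t_",
    "2\tw2\t_\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tw3\t_\tADJ\t_\t_\t0\troot\t_\t_",
    "",
])
SPLITS = {"train": TRAIN, "dev": DEV}

for label in ("nsubj", "nsubj:outer"):
    found = splits_containing(label, SPLITS)
    note = "" if "train" in found else "  <- never seen in training"
    print(label, found, note)
```

A label that is flagged here cannot be learnt from the training split at all, which is a stricter problem than merely being rare.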
I don't think there is a general tendency to move some subtypes to MISC. For normal subtype candidates, you would probably just omit them. But in the specific case of nsubj:outer, you may need a way to tell the validator that two subjects under one node are not an error. That's why you need MISC in this case, if you want to avoid the subtype.