Infrequent subtypes: when to use MISC instead? #1080

colinbatchelor · 2025-01-16T18:09:02Z

Reading the docstring for validate_single_subject in validate.py, I see that it is permitted to use the MISC column to specify an outer subject instead of the deprel subtypes csubj:outer and nsubj:outer when they are very rare. Without having done the experiment, I suspect that the ARCOSG corpus for Scottish Gaelic is small enough for this to be the case.

I'm reviewing proper names and pondering whether to use nmod:desc in line with English when I retag the various Sirs, Lords, Ladies, Reverends and Professors (Sir, am Mòrar, Leadaidh, an t-Urramach, an t-Ollamh) but I'm concerned that they may be a bit sparse and I should stick to nmod:unmarked and put something in the MISC column.

Does anyone with more experience of parsing corpora have a feel for how few examples can be plausibly learnt by current parsers?

It also feels a bit unsatisfactory to have annotation decisions based on how big your corpus is, but I'd rather have something useful with a few odd corner cases than something less useful but perfect.

nschneid · 2025-01-16T18:29:00Z

I wouldn't read too much into that comment in the validator—the history is that some flexibility was put in to ease the transition to a new validation rule for outer subjects.

In general, aside from a few "semi-mandatory" subtypes listed on this page, different subtypes will make sense for different languages (and the same goes for different MISC features). Frequency is only one factor to consider; others are how many different relations annotators will have to learn, how confusing it would be to have the same label for two different constructions, and whether you want to encourage crosslingual comparisons with other treebanks that use the subtype.

(BTW, the UCxn project's specification of MISC information could be an alternative to a distinct subtype in some cases.)

amir-zeldes · 2025-01-16T20:44:34Z

> the history is that some flexibility was put in to ease the transition to a new validation rule for outer subjects.

I wouldn't say frequency was not related to this - in UD Coptic, for example, the outer labels would have appeared only in dev/test, so they would have made it impossible to test a parser properly on the corpus. I think that might have been the proximal cause for adding the MISC option.
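For anyone wanting to check whether their own treebank has the same problem, here is a minimal sketch that counts DEPREL values per split and lists labels that occur in dev/test but never in train. It assumes standard 10-column CoNLL-U input passed in as strings; the function names `deprel_counts` and `labels_missing_from_train` are made up for illustration and are not part of any UD tool.

```python
from collections import Counter


def deprel_counts(conllu_text):
    """Count DEPREL values (column 8) over basic word lines of a CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if not cols[0].isdigit():
            continue
        counts[cols[7]] += 1
    return counts


def labels_missing_from_train(train_text, *eval_texts):
    """Return the sorted DEPREL labels seen in any eval split but not in train."""
    train_labels = set(deprel_counts(train_text))
    missing = set()
    for text in eval_texts:
        missing |= set(deprel_counts(text)) - train_labels
    return sorted(missing)
```

Calling `labels_missing_from_train(train, dev, test)` on the contents of the three split files gives the subtypes a parser would never see during training. The raw counts from `deprel_counts` also show how sparse a candidate subtype such as nmod:desc would be.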

dan-zeman · 2025-01-16T21:06:47Z

I don't think there is a general tendency to move some subtypes to MISC. For normal subtype candidates, you would probably just omit them. But in the specific case of nsubj:outer, you may need a way to tell the validator that two subjects under one node are not an error. That's why you need MISC in this case, if you want to avoid the subtype.
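To see how many nodes in a treebank would actually be affected, a rough sketch like the following lists heads that have more than one subject dependent not already marked with the :outer subtype. This is only an approximation for exploring the data, not the validator's own logic: the function name `heads_with_multiple_subjects` is hypothetical, and the MISC-based exemption is not modelled because the exact MISC attribute is not given in this thread.

```python
from collections import defaultdict


def heads_with_multiple_subjects(sentence_lines):
    """Map head ID -> IDs of its non-outer subject dependents, where there are 2+.

    `sentence_lines` is an iterable of CoNLL-U lines for a single sentence.
    """
    subjects = defaultdict(list)
    for line in sentence_lines:
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Only basic word lines: 10 columns and a plain integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        base, _, subtype = cols[7].partition(":")
        # Treat nsubj/csubj (with any subtype except "outer") as a subject.
        if base in ("nsubj", "csubj") and subtype != "outer":
            subjects[cols[6]].append(cols[0])
    return {head: ids for head, ids in subjects.items() if len(ids) > 1}
```

Each head returned here is a place where you would need either the :outer subtype or the MISC marking to keep the single-subject check from firing.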

dan-zeman added question dependencies labels Jan 16, 2025

dan-zeman added this to the v2.16 milestone Jan 16, 2025
