Infrequent subtypes: when to use MISC instead? #1080

Open
colinbatchelor opened this issue Jan 16, 2025 · 3 comments

Comments

@colinbatchelor
Contributor

Reading the docstring for validate_single_subject in validate.py, I see that the MISC column may be used to mark an outer subject instead of the deprel subtypes csubj:outer and nsubj:outer when these are very rare. I haven't run the experiment, but I suspect that the ARCOSG corpus for Scottish Gaelic is small enough for this to be the case.

I'm reviewing proper names and pondering whether to use nmod:desc, in line with English, when I retag the various Sirs, Lords, Ladies, Reverends and Professors (Sir, am Mòrar, Leadaidh, an t-Urramach, an t-Ollamh). My concern is that they may be too sparse, in which case I should stick to nmod:unmarked and put something in the MISC column.

Does anyone with more experience of parsing corpora have a feel for how few examples can be plausibly learnt by current parsers?

It also feels a bit unsatisfactory to have annotation decisions based on how big your corpus is, but I'd rather have something useful with a few odd corner cases than something less useful but perfect.
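
One way to put numbers on "a bit sparse" before deciding is to count how often each subtyped relation occurs in the treebank. The following is a minimal sketch, assuming standard 10-column CoNLL-U input; the function names, the threshold and the toy rows are illustrative only and are not taken from ARCOSG or from the UD tools.

```python
from collections import Counter


def count_deprels(conllu_text):
    """Count DEPREL values (8th column) over the word lines of a CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank line between sentences, or a comment line
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[7]] += 1
    return counts


def rare_subtypes(counts, threshold=10):
    """Return subtyped relations (containing ':') seen fewer than `threshold` times."""
    return {rel: n for rel, n in counts.items() if ":" in rel and n < threshold}


# Toy rows with made-up tokens; the labels are only there to exercise the counter.
SAMPLE = "\n".join([
    "# text = toy example",
    "\t".join(["1", "w1", "w1", "NOUN", "_", "_", "2", "nmod:desc", "_", "_"]),
    "\t".join(["2", "w2", "w2", "PROPN", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "w3", "w3", "NOUN", "_", "_", "2", "nmod:unmarked", "_", "_"]),
    "\t".join(["4", "w4", "w4", "NOUN", "_", "_", "2", "nmod:unmarked", "_", "_"]),
])

if __name__ == "__main__":
    print(rare_subtypes(count_deprels(SAMPLE), threshold=2))
```

The threshold is arbitrary: the counts only tell you how sparse a label is, not how many examples a given parser needs.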

@nschneid
Contributor

I wouldn't read too much into that comment in the validator—the history is that some flexibility was put in to ease the transition to a new validation rule for outer subjects.

In general, aside from a few "semi-mandatory" subtypes listed on this page, different subtypes will make sense for different languages (and the same goes for different MISC features). Frequency is only one factor to consider; others are how many different relations annotators will have to learn, how confusing it would be to have the same label for two different constructions, and whether you want to encourage crosslingual comparisons with other treebanks that use the subtype.

(BTW, the UCxn project's specification of MISC information could be an alternative to a distinct subtype in some cases.)

@amir-zeldes
Contributor

> the history is that some flexibility was put in to ease the transition to a new validation rule for outer subjects.

I wouldn't say frequency was not related to this - in UD Coptic, for example, the outer labels would have appeared only in dev/test, so they would have made it impossible to test a parser properly on the corpus. I think that might have been the proximal cause for adding the MISC option.
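
The situation described here, where a label occurs in dev/test but never in train, can be checked mechanically. Below is a minimal sketch, assuming each split is available as a CoNLL-U string; `deprel_set`, `unseen_in_train` and the helper `row` are hypothetical names for illustration, and the rows are toy data, not UD Coptic.

```python
def deprel_set(conllu_text):
    """Collect the distinct DEPREL values (8th column) in a CoNLL-U string."""
    rels = set()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Only regular word lines: skip ranges (1-2) and empty nodes (1.1).
        if len(cols) == 10 and cols[0].isdigit():
            rels.add(cols[7])
    return rels


def unseen_in_train(train_text, eval_text):
    """Return the relations that occur in the evaluation split but not in train."""
    return sorted(deprel_set(eval_text) - deprel_set(train_text))


def row(tok_id, head, deprel):
    """Build one toy CoNLL-U word line; all other columns are left as '_'."""
    return "\t".join([tok_id, "w", "_", "_", "_", "_", head, deprel, "_", "_"])


TRAIN = "\n".join([row("1", "2", "nsubj"), row("2", "0", "root")])
DEV = "\n".join([row("1", "3", "nsubj:outer"), row("2", "3", "nsubj"), row("3", "0", "root")])

if __name__ == "__main__":
    print(unseen_in_train(TRAIN, DEV))
```

Any label this reports cannot be learnt from the training split, so a parser trained on it cannot be expected to predict that label.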

@dan-zeman
Member

I don't think there is a general tendency to move some subtypes to MISC. For normal subtype candidates, you would probably just omit them. But in the specific case of nsubj:outer, you may need a way to tell the validator that two subjects under one node are not an error. That's why you need MISC in this case, if you want to avoid the subtype.
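
To make the point about two subjects under one node concrete, here is a simplified sketch of such a check. It is not the validator's actual code: the real validate_single_subject handles more cases. The MISC string `Subject=Outer` is an assumption about what the docstring specifies, so it is passed as a parameter and should be verified against validate.py. The function and helper names are hypothetical.

```python
from collections import defaultdict

SUBJECT_RELS = {"nsubj", "csubj"}


def multiple_subject_heads(conllu_sentence, misc_flag="Subject=Outer"):
    """Return ids of heads that govern more than one subject not marked as outer.

    A subject counts as outer if its relation has the subtype ':outer' or if
    its MISC column contains `misc_flag` (assumed value, check validate.py).
    """
    inner_subjects = defaultdict(int)
    for line in conllu_sentence.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword token ranges and empty nodes
        base, _, subtype = cols[7].partition(":")
        if base not in SUBJECT_RELS:
            continue
        misc = cols[9].split("|") if cols[9] != "_" else []
        if subtype == "outer" or misc_flag in misc:
            continue  # explicitly marked as an outer subject
        inner_subjects[cols[6]] += 1
    return sorted((h for h, n in inner_subjects.items() if n > 1), key=int)


def row(tok_id, head, deprel, misc="_"):
    """Build one toy CoNLL-U word line; unused columns are left as '_'."""
    return "\t".join([tok_id, "w", "_", "_", "_", "_", head, deprel, "_", misc])


UNMARKED = "\n".join([row("1", "3", "nsubj"), row("2", "3", "nsubj"), row("3", "0", "root")])
SUBTYPED = "\n".join([row("1", "3", "nsubj:outer"), row("2", "3", "nsubj"), row("3", "0", "root")])
VIA_MISC = "\n".join([row("1", "3", "nsubj", "Subject=Outer"), row("2", "3", "nsubj"), row("3", "0", "root")])

if __name__ == "__main__":
    for name, sent in [("unmarked", UNMARKED), ("subtyped", SUBTYPED), ("via MISC", VIA_MISC)]:
        print(name, multiple_subject_heads(sent))
```

Only the unmarked variant is flagged; the other two tell the checker, by subtype or by MISC, that the second subject is intentional.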
