data - lemma should be data or datum? #1075
Comments
The number of "data" is controversial... In EWT (and for many speakers) it is singular, and thus the lemma is "data". That seems to me less presumptuous than assuming speakers have a distinct underlying singular form. (The word "datum" does not occur in EWT or GUM.) There are a few cases where "data" is the subject and the verb takes either singular or plural agreement. I wouldn't mind saying the number feature defaults to singular unless the sentence gives evidence for plural, and the wordform simply does not distinguish number for most speakers. (I also see "data" as a lemma for 1 token in GUM, but it is "datum" for 66 tokens.)
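For anyone who wants to check lemma counts like these in their own copy of a treebank, here is a minimal sketch. It assumes plain CoNLL-U input with the usual ten tab-separated columns; the `lemma_counts` and `row` helpers and the two sample rows are made up for illustration and are not taken from EWT or GUM.

```python
from collections import Counter


def lemma_counts(conllu_text, form="data"):
    """Count the lemmas assigned to one word form in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank lines and sentence-level comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed lines, multiword-token ranges (1-2), empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[1].lower() == form:
            counts[cols[2]] += 1
    return counts


def row(idx, form, lemma, upos, xpos, feats, head, deprel):
    """Build one 10-column CoNLL-U token line (DEPS and MISC left empty)."""
    return "\t".join(
        [str(idx), form, lemma, upos, xpos, feats, str(head), deprel, "_", "_"]
    )


# Hypothetical rows: one token lemmatized EWT-style, one GUM-style.
sample = "\n".join([
    "# text = data is",
    row(1, "data", "data", "NOUN", "NN", "Number=Sing", 0, "root"),
    "",
    "# text = data are",
    row(1, "data", "datum", "NOUN", "NNS", "Number=Plur", 0, "root"),
])

print(lemma_counts(sample))
```

Running the same function over the real `.conllu` files would show how each treebank splits between "data" and "datum".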
This is really just echoing the behavior of the old TreeTagger lemmatizer, which many of the huge legacy CQP corpora at Georgetown used. So "datum" is the lemma our students have been encountering there, and I guess it just got carried over to GUM based on corpus searches (and at some point Stanza might then have picked it up from GUM). My gut feeling is that at this point "datum" is wrong for English, so I'm inclined to agree with what you're saying, but I'm not sure if the default number should be Sing or Plur. Oddly, GUM only has one hit for a present-tense verb with "data" as nsubj, and it's plural... I'm pretty sure that's the rarer behavior in spoken language, but "data" comes up more often in writing. I thought it might be a blip, but in OntoNotes it's 7 plurals to 1 singular! So now I'm a bit torn - I certainly use data as a singular all the time, but lemmatization is meant to collapse things to 'authoritative' forms, and written corpus data from genre-diverse datasets seems to point in the opposite direction from how we all probably speak right now. What do you think?
Need to also look for
Right, nice catch with pass! It's still majority Plur in GUM... But it also has a lot of senses, like if we're talking about how much data a cellular carrier offers, we wouldn't say "they have good messaging rates but the international data is/(*are) bad", right? I guess in sum I would feel more comfortable with Sing as the default, but yeah, should be Plur if it's unambiguous. I'll change to xpos=NN if it's ambiguous and the FEATS will auto-fit that in GUM & co.
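The default being proposed here could be sketched roughly as follows. This is only an illustration: the `default_number_for_data` function is hypothetical, and using NNS for the unambiguous plural case is my assumption (the thread only states NN for the ambiguous case). How the agreement evidence gets extracted from the sentence is left out entirely.

```python
def default_number_for_data(agreement_evidence=None):
    """Pick (xpos, feats) for a token "data".

    agreement_evidence: "Plur" if the sentence unambiguously shows plural
    agreement (e.g. "the data are"), "Sing" if it shows singular agreement,
    or None if the sentence gives no evidence either way.
    """
    if agreement_evidence == "Plur":
        # Assumption: unambiguous plural gets the PTB plural noun tag.
        return ("NNS", "Number=Plur")
    # Singular evidence and the ambiguous case both fall back to Sing.
    return ("NN", "Number=Sing")


print(default_number_for_data(None))
print(default_number_for_data("Plur"))
```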
In EWT, the lemma of "data" is "data", whereas in GUM it is "datum". Something we can unify? Then I can go badger the other treebanks...