data - lemma should be data or datum? #1075

Open
AngledLuffa opened this issue Dec 31, 2024 · 4 comments
Comments

@AngledLuffa

In EWT, the lemma of data is data, whereas in GUM it is datum. Something we can unify? Then I can go badger the other treebanks...

@nschneid
Contributor

The number of "data" is controversial...in EWT (and for many speakers) it is singular and thus the lemma is "data". That seems to me less presumptuous than assuming speakers have a distinct underlying singular form. (The word "datum" does not occur in EWT or GUM.)

There are a few cases where "data" is the subject and the verb takes either singular or plural agreement. I wouldn't mind saying the number feature defaults to singular unless the sentence gives evidence for plural, and the wordform simply does not distinguish number for most speakers.

(I also see "data" as a lemma for 1 token in GUM—but it is "datum" for 66 tokens.)
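A lemma tally like the one above can be reproduced with a short script. This is only a minimal sketch: it assumes plain 10-column CoNLL-U input, the function name `lemma_counts` is hypothetical, and the two-sentence sample is made up for illustration rather than taken from EWT or GUM.

```python
from collections import Counter

# Made-up CoNLL-U sample for illustration only (not from EWT or GUM).
SAMPLE = """\
1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_
2\tdata\tdatum\tNOUN\tNNS\tNumber=Plur\t3\tnsubj\t_\t_
3\tshow\tshow\tVERB\tVBP\t_\t0\troot\t_\t_

1\tData\tdata\tNOUN\tNN\tNumber=Sing\t2\tnsubj\t_\t_
2\tmatters\tmatter\tVERB\tVBZ\t_\t0\troot\t_\t_
"""


def lemma_counts(conllu_text, form="data"):
    """Count the lemmas assigned to every token whose form matches `form`."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        if cols[1].lower() == form:
            counts[cols[2]] += 1  # column 3 of CoNLL-U is LEMMA
    return counts


print(lemma_counts(SAMPLE))
```

Running it over a treebank's `.conllu` files (read with `open(path, encoding="utf-8").read()`) would give the per-lemma token counts for that treebank.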

@amir-zeldes
Contributor

This really just echoes the behavior of the old TreeTagger lemmatizer, which many of the huge legacy CQP corpora at Georgetown used. So "datum" is the lemma our students have been encountering there, and I guess it got carried over to GUM based on corpus searches (and at some point Stanza might then have picked it up from GUM).

My gut feeling is that at this point "datum" is wrong for English, so I'm inclined to agree with what you're saying, but I'm not sure if the default number should be Sing or Plur. Oddly, GUM only has one hit for a present-tense verb with "data" as nsubj, and it's plural... I'm pretty sure that's the rarer behavior in spoken language, but "data" comes up more often in writing. I thought it might be a blip, but in OntoNotes it's 7 plurals to 1 singular!

So now I'm a bit torn - I certainly use data as a singular all the time, but lemmatization is meant to collapse things to 'authoritative' forms, and written corpus data from genre-diverse datasets seems to point in the opposite direction from how we all probably speak right now. What do you think?

@nschneid
Contributor

nschneid commented Jan 4, 2025

Need to also look for nsubj:pass :) GUM: https://universal.grew.fr/?custom=677966e49249b (a lot of academic writing so I am not surprised to see the plural there). EWT: https://universal.grew.fr/?custom=67796722a8433 (only singular)
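For anyone who wants to run the same check offline rather than through Grew, here is a rough sketch of the query logic. It assumes 10-column CoNLL-U input and relies on the `Number` feature of the governing verb or of its `aux`/`cop` dependents; the function names and the sample sentences are hypothetical, and this is not the actual Grew pattern behind the links above.

```python
from collections import Counter

# Made-up CoNLL-U sample for illustration only (not from EWT or GUM).
SAMPLE = """\
1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_
2\tdata\tdata\tNOUN\tNNS\tNumber=Plur\t4\tnsubj:pass\t_\t_
3\twere\tbe\tAUX\tVBD\tNumber=Plur\t4\taux:pass\t_\t_
4\tcollected\tcollect\tVERB\tVBN\t_\t0\troot\t_\t_

1\tData\tdata\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_
2\tis\tbe\tAUX\tVBZ\tNumber=Sing\t3\tcop\t_\t_
3\texpensive\texpensive\tADJ\tJJ\t_\t0\troot\t_\t_
"""


def parse_sentences(text):
    """Yield each sentence as a list of 10-column token rows."""
    sent = []
    for line in text.splitlines():
        if not line.strip():
            if sent:
                yield sent
                sent = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) == 10 and cols[0].isdigit():
                sent.append(cols)
    if sent:
        yield sent


def verb_number(sent, head_id):
    """Number shown by the verbal head or by its aux/cop dependents, else None."""
    # Only trust the head itself if it is verbal; a nominal or adjectival
    # predicate's own Number says nothing about subject agreement.
    candidates = [t for t in sent if t[0] == head_id and t[3] in ("VERB", "AUX")]
    candidates += [t for t in sent
                   if t[6] == head_id and t[7].split(":")[0] in ("aux", "cop")]
    for tok in candidates:
        for feat in tok[5].split("|"):
            if feat.startswith("Number="):
                return feat.split("=", 1)[1]
    return None


def data_subject_agreement(text):
    """Tally verb agreement for 'data' used as nsubj or nsubj:pass."""
    counts = Counter()
    for sent in parse_sentences(text):
        for tok in sent:
            if tok[1].lower() == "data" and tok[7] in ("nsubj", "nsubj:pass"):
                counts[verb_number(sent, tok[6]) or "unknown"] += 1
    return counts


print(data_subject_agreement(SAMPLE))
```

Cases where no verb carries a `Number` feature (past tense without an auxiliary, modals, and so on) land in the `"unknown"` bucket, which corresponds to the ambiguous cases discussed in this thread.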

@amir-zeldes
Contributor

Right, nice catch with pass! It's still majority Plur in GUM... But it also has a lot of senses, like if we're talking about how much data a cellular carrier offers, we wouldn't say "they have good messaging rates but the international data is/(*are) bad", right?

I guess in sum I would feel more comfortable with Sing as the default, but yeah, should be Plur if it's unambiguous. I'll change to xpos=NN if it's ambiguous and the FEATS will auto-fit that in GUM & co.
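The rule sketched in this comment could be written down roughly as follows. This is an illustration, not GUM's actual build code: the function name `annotate_data` is hypothetical, the use of `NNS` for the unambiguous plural case is an assumption (the thread only names `NN` for the ambiguous case), and the lemma "data" follows the direction the discussion leans toward.

```python
def annotate_data(agreement=None):
    """Return (lemma, xpos, feats) for a token with the form 'data'.

    agreement: "Plur" if the sentence unambiguously shows plural agreement,
               "Sing" or None otherwise (None means no evidence either way).
    """
    if agreement == "Plur":
        # Assumption: an unambiguous plural gets NNS and Number=Plur.
        return ("data", "NNS", "Number=Plur")
    # Default from the thread: ambiguous or singular cases get NN,
    # from which Number=Sing follows.
    return ("data", "NN", "Number=Sing")


for evidence in (None, "Sing", "Plur"):
    print(evidence, annotate_data(evidence))
```

The `agreement` value would come from checking the verb that "data" is the subject of, for example its `Number` feature or that of an auxiliary or copula.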

amir-zeldes added a commit to amir-zeldes/gum that referenced this issue Jan 7, 2025
amir-zeldes added a commit to gucorpling/gentle that referenced this issue Jan 7, 2025

4 participants