data - lemma should be data or datum? #1075

AngledLuffa · 2024-12-31T06:18:02Z

In EWT, the lemma of data is data, whereas in GUM it is datum. Something we can unify? Then I can go badger the other treebanks...


nschneid · 2024-12-31T16:01:29Z

The number of "data" is controversial...in EWT (and for many speakers) it is singular and thus the lemma is "data". That seems to me less presumptuous than assuming speakers have a distinct underlying singular form. (The word "datum" does not occur in EWT or GUM.)

There are a few cases where "data" is the subject and the verb takes either singular or plural agreement. I wouldn't mind saying the number feature defaults to singular unless the sentence gives evidence for plural, and the wordform simply does not distinguish number for most speakers.

(I also see "data" as a lemma for 1 token in GUM—but it is "datum" for 66 tokens.)

amir-zeldes · 2025-01-04T16:29:04Z

This is really an echo of the old TreeTagger lemmatizer, which many of the huge legacy CQP corpora at Georgetown used. So "datum" is the lemma our students have been encountering there, and I guess it just got carried over to GUM based on corpus searches (and at some point Stanza might then have picked it up from GUM).

My gut feeling is that at this point "datum" is wrong for English, so I'm inclined to agree with what you're saying, but I'm not sure if the default number should be Sing or Plur. Oddly, GUM only has one hit for a present-tense verb with "data" as nsubj, and it's plural... I'm pretty sure that's the rarer behavior in spoken language, but "data" comes up more often in writing. I thought it might be a blip, but in OntoNotes it's 7 plurals to 1 singular!

So now I'm a bit torn - I certainly use data as a singular all the time, but lemmatization is meant to collapse things to 'authoritative' forms, and written corpus data from genre diverse datasets seems to point in the opposite direction from how we all probably speak right now. What do you think?

nschneid · 2025-01-04T16:52:18Z

Need to also look for nsubj:pass :) GUM: https://universal.grew.fr/?custom=677966e49249b (a lot of academic writing so I am not surprised to see the plural there). EWT: https://universal.grew.fr/?custom=67796722a8433 (only singular)
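For anyone who wants to reproduce this kind of count offline rather than through Grew-match, here is a minimal sketch in plain Python. It is not the query behind the links above: the function names and the four-sentence sample are made up for illustration, and it assumes agreement is read off the finite head verb or, failing that, off an `aux`, `aux:pass`, or `cop` dependent of the head.

```python
from collections import Counter

# Invented sample (not from EWT or GUM). Fields are space-separated here for
# readability and converted to real tab-separated CoNLL-U just below.
RAW = """\
# text = Data are collected
1 Data data NOUN NNS Number=Plur 3 nsubj:pass _ _
2 are be AUX VBP Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin 3 aux:pass _ _
3 collected collect VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root _ _

# text = Data shows trends
1 Data data NOUN NN Number=Sing 2 nsubj _ _
2 shows show VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
3 trends trend NOUN NNS Number=Plur 2 obj _ _

# text = Data is bad
1 Data data NOUN NN Number=Sing 3 nsubj _ _
2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 3 cop _ _
3 bad bad ADJ JJ Degree=Pos 0 root _ _

# text = Data showed trends
1 Data data NOUN NN Number=Sing 2 nsubj _ _
2 showed show VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 0 root _ _
3 trends trend NOUN NNS Number=Plur 2 obj _ _
"""
SAMPLE = "\n".join(
    line if not line or line.startswith("#") else "\t".join(line.split())
    for line in RAW.splitlines()
)


def read_sentences(conllu):
    """Yield each sentence as a list of 10-column token rows."""
    sent = []
    for line in conllu.splitlines():
        if not line.strip():
            if sent:
                yield sent
                sent = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():  # skip multiword ranges and empty nodes
                sent.append(cols)
    if sent:
        yield sent


def feats(tok):
    """Parse the FEATS column into a dict."""
    return {} if tok[5] == "_" else dict(kv.split("=", 1) for kv in tok[5].split("|"))


def finite_number(tok):
    """Number of a finite VERB/AUX, or None if the token gives no evidence."""
    f = feats(tok)
    if tok[3] in ("VERB", "AUX") and f.get("VerbForm") == "Fin":
        return f.get("Number")
    return None


def agreement_counts(conllu, form="data"):
    """Count the agreement shown by predicates whose subject is `form`."""
    counts = Counter()
    for sent in read_sentences(conllu):
        by_id = {tok[0]: tok for tok in sent}
        for tok in sent:
            if tok[1].lower() != form or tok[7] not in ("nsubj", "nsubj:pass"):
                continue
            head = by_id.get(tok[6])
            if head is None:
                continue
            helpers = [t for t in sent
                       if t[6] == head[0] and t[7] in ("aux", "aux:pass", "cop")]
            numbers = [finite_number(t) for t in [head] + helpers]
            counts[next((n for n in numbers if n), "unmarked")] += 1
    return counts


print(agreement_counts(SAMPLE))
```

On the invented sample this tallies two singular, one plural, and one unmarked (past-tense) predicate. Checking auxiliaries and copulas matters because a passive participle or an adjectival predicate carries no agreement of its own.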

amir-zeldes · 2025-01-07T19:24:33Z

Right, nice catch with pass! It's still majority Plur in GUM... But it also has a lot of senses, like if we're talking about how much data a cellular carrier offers, we wouldn't say "they have good messaging rates but the international data is/(*are) bad", right?

I guess in sum I would feel more comfortable with Sing as the default, but yeah, should be Plur if it's unambiguous. I'll change to xpos=NN if it's ambiguous and the FEATS will auto-fit that in GUM & co.
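The convention proposed here is small enough to write down as a rule. The sketch below is only my reading of it, not GUM's actual build code: the function name is hypothetical, the input is assumed to be the Number values seen on agreeing verbs, and pairing `NNS` with `Number=Plur` is the usual PTB-style assumption rather than something stated in this thread.

```python
def annotate_data(agreement_evidence):
    """Return (lemma, xpos, feats) for one token of "data".

    `agreement_evidence` is an iterable of Number values ("Sing"/"Plur")
    observed on words that agree with the token; it may be empty.
    """
    # Plural only when every piece of evidence says so (assumed reading of
    # "should be Plur if it's unambiguous").
    if set(agreement_evidence) == {"Plur"}:
        return ("data", "NNS", "Number=Plur")
    # No evidence, singular evidence, or conflicting evidence: default to Sing.
    return ("data", "NN", "Number=Sing")


print(annotate_data(["Plur"]))  # plural agreement, e.g. "the data are ..."
print(annotate_data([]))        # no agreement evidence in the sentence
```

Either way the lemma stays "data", so the lemma question and the number question are decided separately.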


AngledLuffa added the English label Dec 31, 2024

dan-zeman added this to the v2.16 milestone Dec 31, 2024

dan-zeman added standard needed lemmatization labels Dec 31, 2024

amir-zeldes added a commit to amir-zeldes/gum that referenced this issue Jan 7, 2025

Singular data by default

b827378

* See UniversalDependencies/docs#1075

amir-zeldes added a commit to gucorpling/gentle that referenced this issue Jan 7, 2025

Singular data by default

d5fa6e5

* See UniversalDependencies/docs#1075
