Should Number=Ptan be used instead of Number=Plur for English plural-only words? #999
Comments
I was today years old when I learned of the It would have the benefit of explaining why lemmas contain the plural morphology; lemmatizers/checkers would have to implement a fixed list of pluralia tantum anyway, and this makes it explicit in the data. A counterargument might be that it is a rare value that does not really have morphosyntactic consequences for English beyond the lemma—morphosyntactically it is a kind of plural, so users may expect |
Here is a nice little summary of pluralia tantum and singularia tantum in English: https://english.stackexchange.com/questions/407446/does-english-have-any-singularia-tantum-besides-mass-nouns @amir-zeldes and I are on board with implementing Let's NOT implement |
I've rewritten https://universaldependencies.org/en/feat/Number.html to better reflect how we currently use the feature. The page isn't updating immediately, but you can look at the diff. Note that, contra the original post above, I don't think species falls under the category of pluralia tantum. Here's how I defined it:
Note that even for pluralia tantum, the "s" can sometimes be chopped off when used attributively or in a compound ("pant leg", "scissor kick"). Hence the qualification "(at least when serving as a nominal head)".
Right - looks like I responded in the wrong issue! All I need is a list/notes if you want to kick anything OFF the list in the GUM validator.
Current GUM validator lists where the lemma is allowed to end with "s":
Additional items in EWT that are being flagged:
I don't have clear intuitions about all of these, so I'm hoping somebody else can weigh in. Note that only a subset are pluralia tantum: "economics", "series", "species", and "news" (at least) are not.
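For illustration, a check of the kind being discussed could be sketched as follows. This is a minimal sketch and not the GUM validator's actual code: the function name `flag_undepluralized`, the token tuple layout, and the contents of the allow-list (taken only from words mentioned in this thread) are all assumptions.

```python
# Hypothetical allow-list built from words mentioned in this thread;
# the actual GUM/EWT validator lists are not reproduced here.
S_FINAL_LEMMAS_OK = {
    "glasses", "pants", "scissors", "pyjamas",   # pluralia tantum candidates
    "series", "species", "news", "economics",    # s-final lemmas, not Ptan per the note above
}

def flag_undepluralized(tokens, allowed=S_FINAL_LEMMAS_OK):
    """Return (form, lemma) pairs for NNS tokens whose lemma still looks plural.

    tokens: iterable of (form, lemma, xpos) tuples.
    A token is flagged when its lemma equals the lowercased form, ends in "s",
    and is not on the allow-list.
    """
    flagged = []
    for form, lemma, xpos in tokens:
        if xpos != "NNS":
            continue
        if lemma == form.lower() and lemma.endswith("s") and lemma not in allowed:
            flagged.append((form, lemma))
    return flagged
```

Comparing the lemma with the form, rather than only testing for a final "s", avoids flagging legitimate singular lemmas such as "class".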
I don't understand what you want to do. In English, there are only two numbers, singular and plural. OK, some nouns can only be singular or only plural. This is an interesting property, but one that concerns the lexicon. These nouns behave as normal singular or plural nouns. If you add another value for
That's literally the proposal - annotate those nouns. |
There are various plural-only words such as in "I put on my glasses". With these, the lemma is the same as the form, not the depluralized form. I.e. the lemma for "glasses" in this case is "glasses" not "glass". The proposal is to mark these uses as If |
These are Latin terms:
We know what "plurale tantum" means. That is not the question. A plurale tantum is a lexical unit whose occurrences are all plural. The properties of the lexical unit must not be confused with those of its occurrences.
@sylvainkahane I understand |
Would |
I do have a slight worry that algorithms projecting agreement features onto verbs from their subjects would naïvely copy |
Ptan is a canonical annotation value of UD, and I assume in any language where it is used, it implies Plur, so @sylvainkahane's objection is not really English-specific IMO; it sounds like a general criticism of Ptan. But other languages do use it in exactly this way, so it is UD English that's unusual here. I think it's not so odd to have more specific values that imply other values. For example, in many languages, numbers are essentially nouns (e.g. Semitic), or there is no really strong distinction between ADJ and NOUN, but we still use the more specific tag where appropriate. Should we avoid NUM in Arabic just because it obscures the fact that Arabic cardinal numbers are morphosyntactically also nouns? I think in such cases it can be understood that a language should implement the most specific labels possible, and an implication hierarchy such as Ptan -> (subtype of) Plur is understood.
Such algorithms wouldn't get very far anyway: even just for plain coordination of two singulars you need to switch to Plur, so I would say the problem would be with the algorithm, not the annotation. I don't mind if UD says as a whole that Ptan is not a value of Number, but if it is, then I see nothing about the English case to suggest this shouldn't be used here.
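The implication hierarchy mentioned above (Ptan as a subtype of Plur) can be made explicit on the consumer side. A minimal sketch, assuming a hand-written mapping and two hypothetical helpers, `agreement_number` and `subject_verb_number`, neither of which belongs to any UD tool; the coordination rule only covers the "and"-type case from the comment above:

```python
# Assumed mapping: a more specific Number value -> the general value it implies.
NUMBER_IMPLIES = {"Ptan": "Plur"}

def agreement_number(number):
    """Collapse a specific Number value to the one relevant for agreement."""
    return NUMBER_IMPLIES.get(number, number)

def subject_verb_number(conjunct_numbers):
    """Number to project onto a verb from the Number values of its subject conjuncts.

    Assumes "and"-type coordination: more than one conjunct always yields Plur,
    so copying the head noun's value naively would be wrong there as well.
    """
    if len(conjunct_numbers) > 1:
        return "Plur"
    return agreement_number(conjunct_numbers[0])
```

With this, a subject tagged Number=Ptan projects Plur onto its verb instead of an unexpected Ptan value.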
Yes, In some languages (English, probably), the special bit is a property of the lexeme, as @sylvainkahane points out. In others (Czech, for instance) it also has morphosyntactic implications (you must use different forms of numerals with plurale tantum than with normal plural nouns). |
OK, so if we're doing this we need a list. Here's what I gleaned from the above plus the GUM exempt plural form lemmas (presumably EWT has some more):
Not Ptan: species, series, biceps, triceps Any additions/comments welcome! |
Disagree on series: one TV show or one set of 7 playoff games is one series |
Right, one series - so there can be a single one, or multiple series. So it's not Ptan, just a noun whose singular form is identical to the plural form (like "sheep"), no? |
Yeah: "That series is canceled." "Those series are canceled." It ends in -ies because of the Latin source, nothing to do with pluralization. Ptan means the form cannot be used in the singular or made singular in the same sense. (Maybe "economics" is valid as Ptan in the plural: if I say "the economics are sound", that has nothing to do with multiple "economic"s, and is a different sense from talking about the field of economics.)
My mistake, sounds good |
Implemented for EWT. Scripts: UniversalDependencies/UD_English-EWT@547b675...cd0d92f#diff-e02db0ba7788687b383704df1414689e399e8d52709bac84029ab9a86d64c109 I disambiguated "respects" manually, but haven't worked on the "-ics" nouns ("politics", etc.), except to apply |
Number=Ptan was implemented for English, at least EWT and GUM! |
Going back a bit - does this apply to This occurs a few times in LinES: @LarsAhrenberg |
Yeah I would treat "pyjamas" like "pants". |
Words such as "glasses" and "species" are in their plural form according to English pluralization rules. Regarding the lemmas, with some exceptions:
EWT is correct here, but these cases are very likely to confuse lemmatizers trained on the UD English corpora, because they do not follow the regular English plural lemmatization rules.
Making use of the plurale tantum annotation (Number=Ptan), which already exists, would make the intention clearer and allow lemmatizers to differentiate between NNS/Number=Plur and NNS/Number=Ptan.
Note: This should also apply to dates like 1980s, where the lemma retains the form's plural suffix.
Relevant issues:
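To show how a lemmatizer or checker could tell NNS/Number=Plur apart from NNS/Number=Ptan, here is a minimal sketch that reads the XPOS and FEATS columns of a CoNLL-U token line. The helper names `parse_feats` and `lemma_keeps_plural` are hypothetical, and the two token lines are made up for the example rather than copied from EWT.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string such as "Number=Plur" into a dict."""
    if feats == "_":  # "_" marks an empty FEATS column
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

def lemma_keeps_plural(conllu_line):
    """True if the token is tagged NNS with Number=Ptan.

    Under the proposal, such tokens keep their plural morphology in the lemma.
    """
    cols = conllu_line.rstrip("\n").split("\t")
    xpos, feats = cols[4], cols[5]  # columns: ID FORM LEMMA UPOS XPOS FEATS ...
    return xpos == "NNS" and parse_feats(feats).get("Number") == "Ptan"

# Illustrative token lines (invented for this example)
ptan = "4\tglasses\tglasses\tNOUN\tNNS\tNumber=Ptan\t2\tobj\t_\t_"
plur = "4\tbooks\tbook\tNOUN\tNNS\tNumber=Plur\t2\tobj\t_\t_"
```

A lemmatizer could route tokens for which `lemma_keeps_plural` is true past its regular depluralization rule.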