Part of Speech 's' #9

Closed
fredsonaguiar opened this issue Jul 20, 2021 · 9 comments

@fredsonaguiar

In the PWN:3.0 and PWN:3.1 data, one may find lemmas with partOfSpeech="s", such as:

[...]
<Lemma writtenForm="well-connected" partOfSpeech="s" />
<Lemma writtenForm="humongous" partOfSpeech="s" />
<Lemma writtenForm="trimotored" partOfSpeech="s" />
<Lemma writtenForm="right-hand" partOfSpeech="s" />
[...]

In https://wordnet.princeton.edu/documentation/lexnames5wn the POS described are: NOUN; VERB; ADJECTIVE; and ADVERB.

This might be a confusion with the ss_type values from https://wordnet.princeton.edu/documentation/wndb5wn: NOUN (n); VERB (v); ADJECTIVE (a); ADJECTIVE SATELLITE (s); and ADVERB (r).

@goodmami
Collaborator

So lemmas with partOfSpeech="s" should probably be partOfSpeech="a"?

It might also be good for https://github.com/globalwordnet/schemas/ to change the partOfSpeech attribute on <Synset> in WN-LMF to sstype or something, but being a backwards-incompatible change it might be harder to get that through.

@fcbond
Contributor

fcbond commented Jul 26, 2021 via email

@jmccrae

jmccrae commented Jul 26, 2021

Satellite is a fundamentally different part-of-speech in the structure of Princeton WordNet and certain parts of the structure, as well as related technical implementations (sense key calculation), rely on this. Linguistically, IMHO, it is not a sensible distinction and it leads to all kinds of issues (see OEWN's Issue globalwordnet/english-wordnet#35 for the start of the rabbit hole).

My opinion is that PWN uses 'satellite' as a part-of-speech value on the same level as 'noun' and this should be respected in any export of PWN. OEWN may at some point, I hope, get round to removing this distinction.

My opinion on adding sstype to the schema is that this just leads to the part-of-speech being duplicated for most synsets. It is easier for implementations just to understand s as a value for part-of-speech.

@goodmami
Collaborator

@jmccrae I think you're addressing a different issue than what @FredsoNerd raised.

Part-of-speech is, linguistically, a syntactic property and not a semantic property. Therefore, in WNDB, lexical entries (in the index files) have a pos field while synsets (in the data files) have an ss_type field, even though their values are mostly the same (compare Index File Format and Data File Format here: https://wordnet.princeton.edu/documentation/wndb5wn). Both pos and ss_type can have n, v, a, and r, but only ss_type may have the fifth: s.

@FredsoNerd was pointing out that partOfSpeech="s" is appearing on <Lemma> elements, not (just) on <Synset> elements. For instance:

$ grep well-connected *.adj
data.adj:00567414 00 s 01 well-connected 0 001 & 00566099 a 0000 | connected by blood or close acquaintance with people of wealth or social position; "a well-connected Edinburgh family"  
index.adj:well-connected a 1 1 & 1 0 00567414

Note that the ss_type is s in data.adj but the pos is a in index.adj, whereas in the WN-LMF conversion of PWN (and OEWN) the lemma seems to copy the ss_type value from the data file instead of using the pos value from the index file:

$ grep  '"well-connected"' wn30.xml 
      <Lemma writtenForm="well-connected" partOfSpeech="s" />

Unless I've misunderstood the WNDB format, it appears this is an error, and the <Lemma> elements should have partOfSpeech="a" while their corresponding synsets retain the partOfSpeech="s" attribute.
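To make the pos versus ss_type distinction concrete, here is a minimal sketch that reads the two values from the sample lines above. The function names are hypothetical (they are not from any WordNet library), and the field positions follow my reading of the Index File Format and Data File Format descriptions in wndb5wn; the gloss is shortened for brevity.

```python
# Sample lines taken from the grep output above (gloss shortened).
DATA_LINE = ('00567414 00 s 01 well-connected 0 001 & 00566099 a 0000 '
             '| connected by blood or close acquaintance')
INDEX_LINE = 'well-connected a 1 1 & 1 0 00567414'


def parse_data_line(line):
    """Return (synset_offset, ss_type, words) from a data.* line."""
    fields = line.split('|', 1)[0].split()  # drop the gloss
    offset, _lex_filenum, ss_type, w_cnt = fields[:4]
    n_words = int(w_cnt, 16)  # w_cnt is a two-digit hexadecimal number
    # Words alternate with their lex_id, so take every second field.
    words = [fields[4 + 2 * i] for i in range(n_words)]
    return offset, ss_type, words


def parse_index_line(line):
    """Return (lemma, pos, synset_offsets) from an index.* line."""
    fields = line.split()
    lemma, pos, synset_cnt, p_cnt = fields[:4]
    # Skip the pointer symbols, then sense_cnt and tagsense_cnt.
    start = 4 + int(p_cnt) + 2
    offsets = fields[start:start + int(synset_cnt)]
    return lemma, pos, offsets


offset, ss_type, words = parse_data_line(DATA_LINE)
lemma, pos, offsets = parse_index_line(INDEX_LINE)
print(lemma, pos, ss_type, offset in offsets)
```

Under this reading, the lemma's part of speech would come from `pos` in the index file and the synset's type from `ss_type` in the data file.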

@jmccrae

jmccrae commented Jul 27, 2021

I guess that is an interpretation... the schema description of the format is kind of clear that that is not the interpretation we have made so far:
https://globalwordnet.github.io/schemas/
I was also checking the first paper we wrote (https://aclanthology.org/2016.gwc-1.9.pdf) and it says that s is for sentence!
On the other hand, I don't think it breaks anything to use a different value of partOfSpeech for the lemma and synset, although I would not be in favour of this in projects like OEWN, as I hope we can remove satellites entirely in the future.

@arademaker

> Satellite is a fundamentally different part-of-speech in the structure of Princeton WordNet and certain parts of the structure, as well as related technical implementations (sense key calculation), rely on this.

But PWN never claimed that s is a part of speech. That is the point: synset types are not parts of speech.

@arademaker

arademaker commented Aug 3, 2021

> I was also checking the first paper we wrote (https://aclanthology.org/2016.gwc-1.9.pdf) and it says that s is for sentence!

Oh, was this the interpretation at that time, or a typo?

@goodmami
Collaborator

goodmami commented Aug 3, 2021

Also, regardless of the interpretation and what we want to do moving forward, we should be careful that we're not changing or losing information in PWN, which is fixed (see #5 (comment)). Currently there's no info loss (in the entropy sense), as we can replace partOfSpeech="s" with partOfSpeech="a" on Lemmas, but the information has been changed (imagine a user searching for "well-connected" with POS "a" and finding nothing).
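As a rough illustration of that replacement, the sketch below rewrites partOfSpeech="s" to "a" on Lemma elements only and leaves Synset elements alone. The XML snippet is a simplified stand-in with made-up ids, not an excerpt from the real wn30.xml, and `fix_lemma_pos` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a WN-LMF file; the ids are invented for illustration.
SAMPLE = '''<LexicalResource>
  <Lexicon id="demo">
    <LexicalEntry id="e1">
      <Lemma writtenForm="well-connected" partOfSpeech="s" />
      <Sense id="s1" synset="ss1" />
    </LexicalEntry>
    <Synset id="ss1" partOfSpeech="s" />
  </Lexicon>
</LexicalResource>'''


def fix_lemma_pos(root):
    """Set partOfSpeech="a" on every Lemma that has "s"; return the count."""
    changed = 0
    for lemma in root.iter('Lemma'):
        if lemma.get('partOfSpeech') == 's':
            lemma.set('partOfSpeech', 'a')
            changed += 1
    return changed


root = ET.fromstring(SAMPLE)
n_changed = fix_lemma_pos(root)
print(n_changed, root.find('.//Synset').get('partOfSpeech'))
```

Because only Lemma elements are touched, the satellite information stays on the synsets, so nothing would be lost by the rewrite.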

@jmccrae

jmccrae commented Aug 4, 2021

I think "s for sentence" was a mistake; I guess it was from me.

This was referenced Sep 22, 2021
@fcbond fcbond closed this as completed in 3847e8e Sep 25, 2021