Begin cleaning up some remaining needed ExtPos tags #64

Open
AngledLuffa opened this issue Dec 28, 2024 · 21 comments

Comments

@AngledLuffa
Contributor

@nschneid

One of questionable ExtPosness: up to

# newdoc id = w05004
# sent_id = w05004031
# text = It is the portion from this second boundary up to the outer boundary
1       It      it      PRON    PRP     Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  4       nsubj   4:nsubj _
2       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4       cop     4:cop   _
3       the     the     DET     DT      Definite=Def|PronType=Art       4       det     4:det   _
4       portion portion NOUN    NN      Number=Sing     0       root    0:root  _
5       from    from    ADP     IN      _       8       case    8:case  _
6       this    this    DET     DT      Number=Sing|PronType=Dem        8       det     8:det   _
7       second  second  ADJ     JJ      Degree=Pos|NumForm=Word|NumType=Ord     8       amod    8:amod  _
8       boundary        boundary        NOUN    NN      Number=Sing     4       nmod    4:nmod:from     _
9       up      up      ADP     RB      _       13      case    13:case _
10      to      to      ADP     IN      _       9       fixed   9:fixed _
11      the     the     DET     DT      Definite=Def|PronType=Art       13      det     13:det  _
12      outer   outer   ADJ     JJ      Degree=Pos      13      amod    13:amod _
13      boundary        boundary        NOUN    NN      Number=Sing     4       nmod    4:nmod:up_to    SpaceAfter=No

It seems kind of similar to "up to 90%" or "up to 20,000", which typically get labeled ExtPos=ADV:

29      up      up      ADP     IN      ExtPos=ADV      32      advmod  32:advmod       _
30      to      to      ADP     IN      _       29      fixed   29:fixed        _
31      6       6       NUM     CD      NumForm=Digit|NumType=Card      32      nummod  32:nummod       _
32      months  month   NOUN    NNS     Number=Plur     28      obl     28:obl  SpaceAfter=No
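
A minimal sketch of how the two snippets above could be compared mechanically, assuming the rule of interest is simply "a token that heads a `fixed` dependent should carry an `ExtPos` feature". This is not the UD validator's actual logic; the helper names (`parse_conllu`, `fixed_heads_missing_extpos`, `make_block`) are hypothetical, and the input is assumed to be tab-separated CoNLL-U.

```python
def parse_conllu(block):
    """Parse one tab-separated CoNLL-U sentence into a list of token dicts."""
    tokens = []
    for line in block.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines such as "# text = ..."
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        feats = {}
        if cols[5] != "_":
            feats = dict(pair.split("=", 1) for pair in cols[5].split("|"))
        tokens.append({"id": int(cols[0]), "form": cols[1], "feats": feats,
                       "head": int(cols[6]), "deprel": cols[7]})
    return tokens


def fixed_heads_missing_extpos(tokens):
    """Return forms of tokens that head a `fixed` dependent but lack ExtPos."""
    by_id = {t["id"]: t for t in tokens}
    head_ids = sorted({t["head"] for t in tokens if t["deprel"] == "fixed"})
    return [by_id[h]["form"] for h in head_ids
            if h in by_id and "ExtPos" not in by_id[h]["feats"]]


def make_block(rows):
    """Join column lists into tab-separated CoNLL-U lines."""
    return "\n".join("\t".join(r) for r in rows)


# Tokens 9-10 of w05004031 ("up to the outer boundary"): no ExtPos on "up".
untagged = make_block([
    ["9", "up", "up", "ADP", "RB", "_", "13", "case", "13:case", "_"],
    ["10", "to", "to", "ADP", "IN", "_", "9", "fixed", "9:fixed", "_"],
])
# Tokens 29-30 of the "up to 6 months" snippet: ExtPos=ADV on "up".
tagged = make_block([
    ["29", "up", "up", "ADP", "IN", "ExtPos=ADV", "32", "advmod", "32:advmod", "_"],
    ["30", "to", "to", "ADP", "IN", "_", "29", "fixed", "29:fixed", "_"],
])

print(fixed_heads_missing_extpos(parse_conllu(untagged)))  # ['up']
print(fixed_heads_missing_extpos(parse_conllu(tagged)))    # []
```

The check only reports where a tag is absent; whether the right fix is adding ExtPos or dropping `fixed` is the annotation question discussed in this thread.
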
@nschneid
Contributor

According to https://universaldependencies.org/en/dep/fixed.html#approximators-quantity-modifiers I don't think it should be fixed as it is not a quantity. "to" could attach as case alongside "up".

@nschneid
Contributor

image

@AngledLuffa
Contributor Author

alright, i took a first pass at this sentence, "so that" and "such as", and "at best" / "at worst"

#65

@AngledLuffa
Contributor Author

"after all" is not treated as fixed in EWT?

i snuck that into the current PR while waiting for a review. there are still a variety of fixed expressions of the form "as well as" which need updates

@AngledLuffa
Contributor Author

"more or less" is treated as not fixed in GUM. Are we happy with that analysis? @nschneid @amir-zeldes

# text = however, in the study by Hein and colleagues it was found that children responded more or less frequently based on factors such as stimuli type.
16      more    more    ADV     RBR     Degree=Cmp      19      advmod  19:advmod       _
17      or      or      CCONJ   CC      _       18      cc      18:cc   _
18      less    less    ADV     RBR     Degree=Cmp      16      conj    16:conj:or|19:advmod    _

If so, I can remove that one here in PUD

@nschneid
Contributor

Definitely not fixed here (it just means "more frequently or less frequently"), and not listed on https://universaldependencies.org/en/dep/fixed.html.

@AngledLuffa
Contributor Author

Should "close to" be treated similarly to "more than"?

# sent_id = newsgroup-groups.google.com_alt.animals.bears_1125853b1f13cff6_ENG_20040126_171100-0056
# text = (I think there are close to 24,000 groups)
1       (       (       PUNCT   -LRB-   _       3       punct   3:punct SpaceAfter=No
2       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      3       nsubj   3:nsubj _
3       think   think   VERB    VBP     Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin   0       root    0:root  _
4       there   there   PRON    EX      _       5       expl    5:expl  _
5       are     be      VERB    VBP     Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin   3       ccomp   3:ccomp _
6       close   close   ADJ     JJ      Degree=Pos      9       advmod  9:advmod        _
7       to      to      ADP     IN      _       8       case    8:case  _
8       24,000  24000   NUM     CD      NumForm=Digit|NumType=Card      6       obl     6:obl:to        _
9       groups  group   NOUN    NNS     Number=Plur     5       nsubj   5:nsubj SpaceAfter=No
10      )       )       PUNCT   -RRB-   _       3       punct   3:punct _

vs

# sent_id = newsgroup-groups.google.com_misc.consumers_a534e32067078b08_ENG_20060116_030800-0058
# text = The occupation of Iraq has become a guerrilla war, a siege that has lasted more than a thousand days.
11      a       a       DET     DT      Definite=Ind|PronType=Art       12      det     12:det  _
12      siege   siege   NOUN    NN      Number=Sing     9       appos   9:appos|15:nsubj        _
13      that    that    PRON    WDT     PronType=Rel    15      nsubj   12:ref  _
14      has     have    AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   15      aux     15:aux  _
15      lasted  last    VERB    VBN     Tense=Past|VerbForm=Part        12      acl:relcl       12:acl:relcl    Cxn=rc-that-nsubj
16      more    more    ADJ     JJR     Degree=Cmp|ExtPos=ADV   19      advmod  19:advmod       _
17      than    than    ADP     IN      _       16      fixed   16:fixed        _
18      a       a       DET     DT      Definite=Ind|PronType=Art       19      det     19:det  _
19      thousand        thousand        NUM     CD      NumForm=Word|NumType=Card       20      nummod  20:nummod       _
20      days    day     NOUN    NNS     Number=Plur     15      obl:unmarked    15:obl:unmarked SpaceAfter=No|TemporalNPAdjunct=Yes

@AngledLuffa
Contributor Author

#67

AngledLuffa mentioned this issue Dec 28, 2024

@nschneid
Contributor

"Close to" is not documented in the fixed list. From EWT, where it pertains to a quantity:

image

@AngledLuffa
Contributor Author

Got it, so rather than treat it as a single MWT, we'll treat the "to" as similar to ... "similar to", "far from", etc. Although to be honest I don't quite see why it's different from "more than" in this context. Another possibility would be to add it to the list.

@AngledLuffa
Contributor Author

btw that analysis doesn't quite work here, since the thing being counted isn't specifically stated. perhaps like this, with just the analysis of "to" changing?

6       a       a       DET     DT      Definite=Ind|PronType=Art       7       det     7:det   _
7       population      population      NOUN    NN      Number=Sing     5       obj     5:obj   _
8       of      of      ADP     IN      _       13      case    13:case _
9       close   close   ADJ     JJ      Degree=Pos      13      advmod  13:advmod       _
10      to      to      ADP     IN      _       9       fixed   9:fixed _
11      half    half    DET     PDT     NumForm=Word|NumType=Frac|PronType=Ind  13      compound        13:compound     _
12      a       a       DET     DT      Definite=Ind|PronType=Art       13      det     13:det  _
13      million million NUM     CD      NumForm=Word|NumType=Card       7       nmod    7:nmod:of       SpaceAfter=No

--->

6       a       a       DET     DT      Definite=Ind|PronType=Art       7       det     7:det   _
7       population      population      NOUN    NN      Number=Sing     5       obj     5:obj   _
8       of      of      ADP     IN      _       13      case    13:case _
9       close   close   ADJ     JJ      Degree=Pos      13      advmod  13:advmod       _
10      to      to      ADP     IN      _       13      case    13:case _
11      half    half    DET     PDT     NumForm=Word|NumType=Frac|PronType=Ind  13      compound        13:compound     _
12      a       a       DET     DT      Definite=Ind|PronType=Art       13      det     13:det  _
13      million million NUM     CD      NumForm=Word|NumType=Card       7       nmod    7:nmod:of       SpaceAfter=No
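
The before/after change above can be sketched as a small rewrite over CoNLL-U rows: every `fixed` dependent of a chosen token is reattached as `case` to that token's own head, and the enhanced-dependency column is updated to match. The function name `fixed_to_case` is hypothetical, rows are assumed to be lists of the ten CoNLL-U columns, and this only illustrates the edit being proposed, not a tool used in the repository.

```python
def fixed_to_case(rows, head_id):
    """Reattach `fixed` dependents of `head_id` as `case` on that token's head.

    `rows` is a list of 10-column CoNLL-U rows (lists of strings).
    Returns new rows; the input is left unmodified.
    """
    by_id = {r[0]: r for r in rows}
    new_head = by_id[head_id][6]  # HEAD column of the former fixed head
    out = []
    for r in rows:
        r = list(r)
        if r[7] == "fixed" and r[6] == head_id:
            r[6] = new_head             # HEAD
            r[7] = "case"               # DEPREL
            r[8] = f"{new_head}:case"   # DEPS (enhanced dependencies)
        out.append(r)
    return out


# Tokens 9-10 from the "close to half a million" snippet.
before = [
    ["9", "close", "close", "ADJ", "JJ", "Degree=Pos", "13", "advmod", "13:advmod", "_"],
    ["10", "to", "to", "ADP", "IN", "_", "9", "fixed", "9:fixed", "_"],
]
after = fixed_to_case(before, "9")
print("\t".join(after[1]))
```

With the sample rows, token 10 ends up with head 13, relation `case`, and enhanced dependency `13:case`, matching the second block above.
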

@AngledLuffa
Contributor Author

another option here would be to attach "close" to "population" and "half a million" to "close"

@nschneid
Contributor

Honestly I have never been sure about the policy on approximators or how it should generalize to new expressions. Maybe @amir-zeldes has thoughts?

@AngledLuffa
Contributor Author

In the meantime, call this good for this round of ExtPos improvements and see what remaining complaints the validator has?

@nschneid
Contributor

Sure

@AngledLuffa
Contributor Author

Is "as much as <quantity>" similar to "more than <quantity>", so that we could add it to the list, or is it not close enough?

There are a few examples of "as much as" in EWT and GUM, but none of them are followed by a quantity like the "more than <quantity>" expressions which are marked as fixed.
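
A rough sketch of the kind of search described here: find occurrences of a phrase and keep only those with a numeral shortly after it. The function name `phrase_before_quantity` and the `window` parameter are hypothetical, "quantity" is approximated as "a NUM token within the next `window` tokens", and the sample tokens are taken from sentences quoted earlier in this thread.

```python
def phrase_before_quantity(tokens, phrase, window=2):
    """Return start indices where `phrase` occurs with a NUM token soon after.

    `tokens` is a list of (form, upos) pairs; `window` is how many tokens
    after the phrase are inspected for a numeral.
    """
    words = [w.lower() for w in phrase.split()]
    n = len(words)
    hits = []
    for i in range(len(tokens) - n + 1):
        forms = [form.lower() for form, _ in tokens[i:i + n]]
        if forms != words:
            continue
        following = tokens[i + n:i + n + window]
        if any(upos == "NUM" for _, upos in following):
            hits.append(i)
    return hits


more_than = [("lasted", "VERB"), ("more", "ADJ"), ("than", "ADP"),
             ("a", "DET"), ("thousand", "NUM"), ("days", "NOUN")]
close_to = [("close", "ADJ"), ("to", "ADP"), ("24,000", "NUM"), ("groups", "NOUN")]
up_to_np = [("up", "ADP"), ("to", "ADP"), ("the", "DET"),
            ("outer", "ADJ"), ("boundary", "NOUN")]

print(phrase_before_quantity(more_than, "more than"))  # [1]
print(phrase_before_quantity(close_to, "close to"))    # [0]
print(phrase_before_quantity(up_to_np, "up to"))       # []
```

The window is a heuristic: it lets "more than a thousand" count as a quantity use while "up to the outer boundary" does not, but it would miss quantities that sit further from the phrase.
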

@AngledLuffa
Contributor Author

any thoughts on "as much as"? "as well" and "as well as" are pretty straightforward, and those are the only remaining ExtPos needed in PUD

@amir-zeldes

Honestly I have never been sure about the policy on approximators or how it should generalize to new expressions. Maybe @amir-zeldes has thoughts?

I prefer treating "close to" as fixed over making the two words case sisters; the "to" doesn't make sense without "close". Either that or do it transparently (i.e. "close" is the head and "to half a million" is obl, but then it's not a nummod anymore).

any thoughts on "as much as"?

The problem is that we'd like to keep the fixed list small and closed. I can understand wanting it to be fixed and we could do that, but there will be productive extensions of that. Real example:

  • But , the price is affordable . revenue as very much as 60 -80 % apart

@nschneid
Contributor

nschneid commented Jan 1, 2025

The problem is that we'd like to keep the fixed list small and closed.

It's already on there: https://universaldependencies.org/en/dep/fixed.html#approximators-quantity-modifiers

Are you suggesting we remove entries?

  • But , the price is affordable . revenue as very much as 60 -80 % apart

Well this is marginally attested on the web, but it's not grammatical for me.

@nschneid
Contributor

nschneid commented Jan 1, 2025

BTW if we're going to extend the approximator list to add "close to" ('approximately') what about "greater than" (similar to "more than"), "in excess of", "in the ballpark of"...seems like it's actually a semi-open class.

@AngledLuffa
Contributor Author

It's already on there: https://universaldependencies.org/en/dep/fixed.html#approximators-quantity-modifiers

Look at that! I guess ctrl-f failed me where using my own eyes would have worked out better.

close to

Not sure where the line is on how much we can add, but "close to" looks very similar to "up to"

3 participants