Way to silence fixed-gap validator warning? #1003
Why? I do not see how a gap inside a Do you have some examples? If something can come in between, then is this not a sign that the syntax is not "frozen" (as per guidelines) and so has to be made explicit?
In languages with Wackernagel particles, such as ⲇⲉ in Classical Greek or Coptic, fixed expressions can often be interrupted if they happen to stand in the first position in the sentence, simply because the enclitic particle has to appear in the second position. It would be strange to consider such expressions fixed except when they happen to begin a sentence which has such a particle. The placement of the particle in those cases is fully automatic and does not respect syntactic phrasal constructions as a constraint.
Wackernagel particles are a clear case where the exception is needed. The one case in EWT is "due largely to". It's a bit borderline, but when we last discussed this, the consensus was that "due to" is sufficiently frozen to annotate it as such even if there are occasional internal modifiers.
A Swedish example is “för … sedan”, meaning “… ago”, as in “för 20 år sedan”, meaning “20 years ago”. You can insert any time expression between “för” (“for”) and “sedan” (“then”), but the combination of “för” and “sedan” is completely frozen and syntactically anomalous.
Joakim
With regard to Wackernagel particles:
So something is wrong with such a
I would strongly argue against that (although I have no quick references at hand, I am sorry), given that we observe such particles also sometimes considering whole phrases, not only single elements. So they do interact with (or "respect") syntax.
This is an interesting and tricky case. But could a lexicocentric approach not favour tying both members to the head år instead of making them depend on each other? Is this not one of those cases that should be shifted to a different MWE level?
I disagree with this - such cases are not morphosyntactically flexible, they retain both the same valency structure and the same constituent words exactly, with no change. The behavior of Wackernagel particles inserted in the middle of a fixed phrase can be explained on purely phonological grounds, when the first part of the fixed expression is stressed.
Well, if it happens only some of the time, and we consider it a fixed expression there, shouldn't we want it to have the same structure even when a particle interrupts it? It means exactly the same thing, and the particle isn't dominated by either part of the fixed expression, indicating the "disrespect" of syntax I was referring to. For example, Greek εἰμή is generally considered a lexicon entry meaning "unless" and is often tokenized as one word. But historically it has two parts (<if not), and we can find both uninterrupted cases, annotated in UD as
Notice that the intervening particle is dominated by the root from outside the fixed expression - it is not properly part of the phrase that it projects. I understand why the annotators would want such cases to be annotated in a way that is consistent with the much more common non-interrupted cases.
Another thought: it seems that So, for example in Latin, we observe
but never ever
even if we can clearly identify a root dic and a TAM-person affix unt, the stress goes on dic-, etc. Conversely we find:
where enim intervenes only after the phrase ADP+NOUN ad verba (in the second place with respect to the co-ordinated blocks), but this is of course not an argument to say that the ADP forms a fixed block, because among other things we can find ad primam mulierem 'to the first woman', with some material in between ADP and NOUN. In this case it is an ADJ, but a discourse PART is equally valid for this argument, all the more since we can observe the non-occurrence of cases like (2).
@Stormur are you saying that etymology is always paramount? If so, how do we know that we should not divide unless (<on+less), or whatever (what+ever)? It's true that they are spelled together, but so is εἰ μή, at least in post-classical Greek, and "whatever" is even interruptible in "whatsoever". I am not a Greek lexicographer, but if Autenrieth treats it as a word I think he must have had a reason. Some possibilities that come to mind are the frequent omission of the verb next to it, the limitation of the meaning to a specific subset in that construction, the gradual death of μή as a negation despite the survival of this construction, and more. There is a whole discussion on it here for example. I agree that dicunt is different, because "unt" is an inflectional suffix, and not a 'syntactic word' as discussed in #1006 . But UD Latin also has interrupted fixed expressions, for example si forte "perhaps", is interrupted in this example, here too due to enclisis, and is still annotated as fixed: I think the view that any interruption, incl. by a structure not dominated by the fixed expression itself, disqualifies it as |
The fixed guidelines were recently revised taking into account input from MWE folks at Dagstuhl. If further substantive changes are to be considered I think it would have to go through UniDive. The question for this issue is whether, given the current guidelines, it would make sense to tell the validator when a (generally rare kind of) annotation is intentional and shouldn't trigger a warning.
Etymology is not paramount, but it surely can be decisive in choosing some annotation strategies. In this case, though, I do not think it is even etymology, it is simply a composition of words. It is not at least in the sense in which we see that non 'not' in Latin is derived from ne unum 'not one', and I would never propose to split something similar (probably the same goes for unless). The interruptibility of whatever and the identification of so as an independent word is in fact an argument for splitting: it seems at least that whatever is not as much a word as dicunt. A lexicographic entry like that by Autenrieth is not necessarily taking a stance that impacts on a UD-style annotation. I do not even think it is implying something towards wordhood, but just identifying a very common co-occurrence of two terms. We can surely observe an evolution of an expression like εἰμή, but if a unitary treatment of its possible (and very often just supposed...) nuances still makes sense at a morphosyntactic level, I do not see much reason to let other factors interfere.
Hm, this was either left out from the reannotation of
I could point out that it could be seen as the contrary: the tying together of word co-occurrences as I would not give too much leeway to languages as much as I would see some suggestions in the guidelines of how to favour a non-
I do not know how much they coincide, but for example PARSEME (as far as I understand from the papers) is also pushing towards a reanalysis of In any case I would not eliminate the warning as it points to many factual non-ideal annotations (as I think I have shown in the previous cases). Maybe I could envision an option for the validator to suppress warnings in general? A kind of "less strict validation"? But I would leave them there somewhere as possible reminders that, very probably, some intervention needs to be done.
I am not an expert on Latin, and it's very possible "si forte" is not a good candidate for Ultimately, it's about consistency and knowing the language in question and its UD annotations in detail. I'm not really involved in those decisions for Greek or Latin, but I am for English, and I definitely don't want to split up "whatever", which is quite lexicalized and equivalent to a single 'syntactic word' in every function I can think of. Other English corpus designers have seen it the same way, so that's the English-specific decision - the Greek one can be similar or different, but it's not trivial to distinguish that "unless" is different from "εἰμή", or how many words "nevertheless", or "whatsoever" or "gonna" should all be. |
The validator issues a warning if there are words intervening between elements of a fixed expression (https://github.com/UniversalDependencies/tools/blob/cf9d1ae087e01a0a8646d0352315528fcbfc3ab8/validate.py#L1868).
This is just a warning because there are some legitimate cases, either due to a systematic construction in a language or due to an exceptional sentence. Could there be a way to indicate this in the data so as to remove the warning? E.g. FixedGap=Yes in MISC.
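To make the proposal concrete, here is a minimal Python sketch of how a gap check could honour such a flag. It is not the code in validate.py: the function names are hypothetical, the check is simplified (it does not, for instance, treat punctuation specially), and placing FixedGap=Yes on the token that carries the fixed relation is an assumption, since the proposal does not say which token should hold it. The test sentence is the Swedish "för 20 år sedan" example from this thread with an illustrative annotation, not one copied from a treebank.

```python
def parse_conllu(sentence):
    """Parse one CoNLL-U sentence into a list of token dicts (basic tokens only)."""
    rows = []
    for line in sentence.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip multiword-token ranges and empty nodes
            continue
        rows.append({
            "id": int(cols[0]),
            "form": cols[1],
            "head": int(cols[6]),
            "deprel": cols[7],
            "misc": cols[9],
        })
    return rows


def fixed_gap_warnings(rows):
    """Return head ids of fixed expressions that contain a gap and are not
    marked FixedGap=Yes (hypothetical MISC attribute) on a fixed dependent."""
    by_head = {}
    for row in rows:
        if row["deprel"].split(":")[0] == "fixed":
            by_head.setdefault(row["head"], []).append(row)

    warnings = []
    for head, deps in by_head.items():
        ids = sorted([head] + [d["id"] for d in deps])
        contiguous = ids == list(range(ids[0], ids[-1] + 1))
        if contiguous:
            continue
        # Assumption: the flag sits on any dependent attached via fixed.
        if any("FixedGap=Yes" in d["misc"].split("|") for d in deps):
            continue
        warnings.append(head)
    return warnings


def make_sentence(misc_for_sedan):
    """Build an illustrative annotation of 'för 20 år sedan' (not treebank data)."""
    tokens = [
        ("1", "för", "för", "ADP", "_", "_", "3", "case", "_", "_"),
        ("2", "20", "20", "NUM", "_", "_", "3", "nummod", "_", "_"),
        ("3", "år", "år", "NOUN", "_", "_", "0", "root", "_", "_"),
        ("4", "sedan", "sedan", "ADV", "_", "_", "1", "fixed", "_", misc_for_sedan),
    ]
    return "\n".join("\t".join(t) for t in tokens)


if __name__ == "__main__":
    print(fixed_gap_warnings(parse_conllu(make_sentence("_"))))
    print(fixed_gap_warnings(parse_conllu(make_sentence("FixedGap=Yes"))))
```

With this design the warning stays on by default, and an annotator has to opt out token by token, so unintended gaps would still be reported.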