Orthographic vs. pronunciation order of dollar signs/currency symbols #73
Comments
Reopening this as I think reordering tokens for CGEL is going to cause problems as we scale up parsing. Is there an argument to be made that the dollar sign notation is not completely equivalent to its spelled-out version?
It seems to me that the currency symbol denotes a unit of measure that is a mass noun despite its plural ending, whereas the spelled-out word "dollars" can be construed either way. I wonder if, on this basis, we could conclude that "3" rather than "$" is the head of "$3". If "$" is a modifier, then we could keep the linear order of tokens. (This would be a departure from UD, but as a dependency-only formalism, UD does not care about projectivity to the extent that we do.) Other examples to consider: "your over $10 in savings", "$10-20 in savings". If "10" is a noun head, then it can coordinate with another number, can take a determiner, etc. The case of "over" is tricky (approximator conundrum; #71) but I think we need to allow for "over" taking an NP complement as in "over a million" or "over a third of the population", and then the PP essentially gets coerced into a Nom. (Another way to analyze "$3" without reordering it might be a flat structure, but currently CGELBank only does that at the lexical level; we would need nested phrasal structure for coordinations like "$10-20" where the "$" scopes over the whole coordination.) |
Counterargument to Mod: "$" can't be omitted from the full NP even when there is strong context?
just looks wrong. So maybe this is an orthographic construction which just doesn't follow general English syntax and we should call it a flat structure with the possibility of nesting. |
I've been thinking about this for a few days, and I'm still against keeping the original ordering, setting aside practical processing issues. If those are significant, then a flat analysis seems like an acceptable compromise. |
Some thoughts on this:
You can omit $ in at least some cases where there is very strong context. See, e.g., titles of the Reddit posts below: "How can I afford 1000 a month rent on 18 an hour at 20 years old?" "Is 1,189 too much to pay for rent if I make 3,600 a month? (PA)" Also, is "USD" a postpositive modifier in "$100 USD"? I'm not sure what else it could be if the internal structure is anything other than flat. Sequences like "100 USD" (w/o "$") are also broadly attested, and presumably "USD" has the same function in this context (which would leave one candidate for the head, "100").* So unless we want to go with the 'flat' analysis, it seems to me that there are cases of [numeral]+[currency symbol] / [currency symbol]+[numeral] where we may have to say the numeral is the head. *Also, I think the same kind of number agreement data that support the numeral-as-head analysis for
|
It occurs to me that "USD" resembles units of temperature measurement, e.g. "It's 100 (degrees) (Fahrenheit) out today." |
Or C$ |
https://github.com/UniversalDependencies/UD_English-EWT/blob/5aefb34b5082283ae50c02359852bfed1301a0dd/not-to-release/sources/newsgroup/groups.google.com_alt.animals_0e65f540816d780c_ENG_20041116_124800.xml.conllu#L51 https://github.com/UniversalDependencies/UD_English-EWT/blob/5aefb34b5082283ae50c02359852bfed1301a0dd/not-to-release/sources/newsgroup/groups.google.com_civilization_1201f7692b7769fb_ENG_20050908_010400.xml.conllu#L145 |
Yeah I don't know that we ever really standardized it in UD. |
In the expression "over $300", syntactically we want to analyze this as "[[over 300] dollars]". This would be nonprojective in the original string.
Today's decision with @BrettRey: the CGEL-tokenized version of the sentence would move all currency symbols to pronunciation order, so "$300" in the original string (`text` line) becomes "300 $" in the CGEL tokenization (`sent` line). And post-head `Det` function is prohibited across the board.
The validator should check that this rule is applied consistently, and the evaluation script needs to know about this case when calculating character offsets (eval script should use `sent` line for char offsets, so not an issue).
It should not be hard to check in the UD data for `xpos=$` heading a `nummod` to its right. The symbol should be moved after the entire `nummod` subtree (which may involve coordination).
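A rough sketch of the reordering rule described above, in Python. This is not the CGELBank conversion code: the token dictionaries are a simplified, hand-built stand-in for CoNLL-U rows, the function names `tok`, `subtree_ids`, and `move_currency_symbols` are hypothetical, and the example dependency analyses of "over $300" and "$10-20" are assumptions made for illustration rather than trees taken from UD data.

```python
def tok(id, form, head, xpos, deprel):
    """Build a simplified token record (hypothetical stand-in for a CoNLL-U row)."""
    return {"id": id, "form": form, "head": head, "xpos": xpos, "deprel": deprel}


def subtree_ids(tokens, root_id):
    """Return the ids of root_id and all of its descendants."""
    ids = {root_id}
    changed = True
    while changed:
        changed = False
        for t in tokens:
            if t["head"] in ids and t["id"] not in ids:
                ids.add(t["id"])
                changed = True
    return ids


def move_currency_symbols(tokens):
    """Move each xpos=$ token after the whole subtree of its right-hand nummod."""
    order = [t["id"] for t in tokens]
    for t in tokens:
        if t["xpos"] != "$":
            continue
        # nummod dependents of the symbol that sit to its right
        right_nummods = [
            c["id"]
            for c in tokens
            if c["head"] == t["id"] and c["deprel"] == "nummod" and c["id"] > t["id"]
        ]
        if not right_nummods:
            continue  # nothing to reorder for this symbol
        span = set()
        for rid in right_nummods:
            span |= subtree_ids(tokens, rid)
        last = max(span)  # rightmost token of the nummod subtree(s)
        order.remove(t["id"])
        order.insert(order.index(last) + 1, t["id"])
    forms = {t["id"]: t["form"] for t in tokens}
    return [forms[i] for i in order]


# "over $300": assumed analysis with "$" as head and "over" attached to "300"
over_300 = [
    tok(1, "over", 3, "RB", "advmod"),
    tok(2, "$", 0, "$", "root"),
    tok(3, "300", 2, "CD", "nummod"),
]

# "$10-20": assumed analysis with "20" coordinated with "10"
range_10_20 = [
    tok(1, "$", 0, "$", "root"),
    tok(2, "10", 1, "CD", "nummod"),
    tok(3, "-", 4, "SYM", "punct"),
    tok(4, "20", 2, "CD", "conj"),
]

print(" ".join(move_currency_symbols(over_300)))
print(" ".join(move_currency_symbols(range_10_20)))
```

Because the symbol is placed after the rightmost token of the `nummod` subtree, a coordination such as "$10-20" moves the symbol past both numbers, which is the case the last sentence of the decision singles out. A validator could apply the same search and flag any `sent` line where an `xpos=$` token still precedes its `nummod`.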