Hyphenation/spacing in multi-word tokens #74
Comments
I would rather not have to add anything special to the tree. I would say the tokens can be aligned to character offsets, with the proviso that a hyphen may be inserted or deleted just like a space. (But some hyphens are tokens of their own, serving as Coordinators.)
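A minimal sketch of that kind of alignment, for illustration only: the function name `align_tokens` and the choice of exactly space and hyphen as ignorable characters are assumptions, not taken from the project's code. Spaces and hyphens may appear on one side and be missing on the other, while a token that consists only of a hyphen has to match a literal hyphen in the text.

```python
IGNORABLE = {" ", "-"}  # assumption: only spaces and hyphens may be inserted/deleted


def align_tokens(text, tokens):
    """Return (start, end) character offsets in `text` for each token.

    Hypothetical helper: spaces and hyphens may be present on one side and
    absent on the other. A token made up only of ignorable characters
    (e.g. a hyphen serving as a coordinator) must match the text literally.
    """
    spans = []
    pos = 0
    for tok in tokens:
        core = [c for c in tok if c not in IGNORABLE]
        if not core:
            # Standalone hyphen token: skip spaces, then match it literally.
            while pos < len(text) and text[pos] == " ":
                pos += 1
            if not text.startswith(tok, pos):
                raise ValueError(f"cannot align token {tok!r} at offset {pos}")
            spans.append((pos, pos + len(tok)))
            pos += len(tok)
            continue
        # Skip separators that precede the token in the text.
        while pos < len(text) and text[pos] in IGNORABLE:
            pos += 1
        start = pos
        for c in core:
            # A space or hyphen inside the text may be absent from the token.
            while pos < len(text) and text[pos] in IGNORABLE:
                pos += 1
            if pos >= len(text) or text[pos] != c:
                raise ValueError(f"cannot align token {tok!r} at offset {pos}")
            pos += 1
        spans.append((start, pos))
    return spans
```

With this sketch, `align_tokens("good-looking", ["good", "looking"])` and `align_tokens("good looking", ["good-looking"])` both succeed without any extra markup in the tree, and a separate `"-"` token still gets its own one-character span.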
One way to implement this might be to take the
With the current implementation, I think all spans are indexed by offsets in the tree's dehydrated string (the concatenation of terminals ignoring spaces, hyphens, and corrections). This means that
So a sequence of multiple nodes for a span being aligned by Levenshtein is not always due to unaries. Is this what we want? Maybe it's fine in practice, even if we could in principle be a bit more precise about spans by assigning indices for left-gaps/insertions and right-gaps/insertions.
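To make the offset scheme concrete, here is a rough sketch of indexing terminals by their position in a dehydrated string. The names `dehydrate` and `dehydrated_spans` are hypothetical, and corrections are not modelled because their representation is not shown in this thread; only spaces and hyphens are dropped.

```python
import re

# Assumption: "dehydration" here drops whitespace and hyphens only.
_SKIP = re.compile(r"[\s\-]")


def dehydrate(terminal):
    """Remove spaces and hyphens from a single terminal string."""
    return _SKIP.sub("", terminal)


def dehydrated_spans(terminals):
    """Return the dehydrated string and one (start, end) span per terminal."""
    spans = []
    offset = 0
    for t in terminals:
        n = len(dehydrate(t))
        spans.append((offset, offset + n))
        offset += n
    return "".join(dehydrate(t) for t in terminals), spans
```

Under these assumptions, a terminal that is only a hyphen receives a zero-width span, e.g. `(6, 6)` for the middle terminal of `["mother", "-", "daughter"]`, so its offsets alone do not distinguish it from the boundary between its neighbours. That may be one way several nodes end up competing for the same span.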
Output of unaligned trees from the eval script is below. Apologies for the lack of spacing; currently alignment of the two trees is done by ignoring punctuation and any possible spaces between the tokens in the tree.
Most of these issues are due to hyphenation/spacing in tokens that consist of multiple words. How should we handle this? One idea is that we consistently use the `:subt` relation and the aligner prioritises those over the surface form, so e.g. both "good looking" and "good-looking" are indicated to have the same underlying structure: `:subt "good" :subt "looking"`.
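As a rough illustration of the `:subt` idea, the sketch below derives the same subtoken sequence from either surface form. The helper names `subtokens` and `subt_annotation` are made up for this example, and splitting on whitespace and hyphens is an assumption about how subtokens would be delimited.

```python
import re


def subtokens(surface):
    """Split a multi-word token on runs of spaces and hyphens (assumed delimiters)."""
    return tuple(p for p in re.split(r"[\s\-]+", surface) if p)


def subt_annotation(surface):
    """Render the subtokens as a sequence of `:subt "..."` relations."""
    return " ".join(f':subt "{p}"' for p in subtokens(surface))
```

An aligner could then compare `subtokens(...)` of the two trees' tokens rather than their surface strings, so that "good looking" and "good-looking" compare equal. A hyphen that is a token of its own would need separate handling, since this split discards it.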