Hyphenation/spacing in multi-word tokens #74

Open
aryamanarora opened this issue Mar 24, 2023 · 3 comments

Comments

@aryamanarora
Member

aryamanarora commented Mar 24, 2023

Output of unaligned trees from the eval script is below. Apologies for the lack of spacing; currently alignment of the two trees is done by ignoring punctuation and any possible spaces between the tokens in the tree.

Tree #3 not aligned.
     ourplansincluderaisingprivatecapitaltodevelopbuildflighttestandoperatethisearthmoonhighwayforthebenefitofthecountryandthebenefitofourinvestors
     ourplansincluderaisingprivatecapitaltodevelopbuildflight-testandoperatethisearthmoonhighwayforthebenefitofthecountryandthebenefitofourinvestors
Tree #5 not aligned.
     ineedsomethingreliableandgood looking
     ineedsomethingreliableandgood-looking
Tree #10 not aligned.
     idon'twanttohavetodealwiththosedealadaywebsiteslikegroupon
     idon'twanttohavetodealwiththosedeal-a-daywebsiteslikegroupon
Tree #22 not aligned.
     apacmanfrogwillneedaheatsourcethatcreatesabaskingtempintheupper80'sfforatleast10-12hoursaday
     apacmanfrogwillneedaheatsourcethatcreatesabaskingtempintheupper80'sfforat least10-12hoursaday
Tree #36 not aligned.
     janetteelbertsonadministrativecoordinatorewslegaleb3326telephone713 853-7906facsimile713 [email protected]
     janetteelbertsonadministrativecoordinatorewslegaleb3326telephone713853-7906facsimile713646-2600e-mailaddressjanette.elbertson@enron.com
Tree #49 not aligned.
     afterthatpointalqaedawasajointenterprisebetweentheegyptianextremistsandthepolyglotarabsaroundbinladenonlysomeofwhomweresaudi
     afterthatpointal-qaedawasajointenterprisebetweentheegyptianextremistsandthepolyglotarabsaroundbinladenonlysomeofwhomweresaudi

Most of these issues are due to hyphenation/spacing in tokens that consist of multiple words. How should we handle this? One idea is to use the :subt relation consistently and have the aligner prioritise it over the surface form, so that e.g. both "good looking" and "good-looking" are marked as having the same underlying structure, :subt "good" :subt "looking".

@nschneid
Contributor

I would rather not have to add anything special to the tree—I would say the tokens can be aligned to character offsets with the provision that a hyphen may be inserted or deleted just like a space. (But some hyphens are tokens of their own, serving as Coordinators.)
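
A minimal sketch of that provision, not the eval script's actual code: strip spaces and hyphens before comparing, but keep a map back to the original character offsets. The names `JOINERS`, `strip_joiners`, and `same_modulo_joiners` are hypothetical, and the sketch treats every hyphen as optional, so it does not yet distinguish hyphens that are Coordinator tokens.

```python
# Assumption: only the ASCII space and hyphen-minus count as optional joiners.
JOINERS = {" ", "-"}

def strip_joiners(text):
    """Return (stripped, offsets).

    stripped is text with joiners removed; offsets[k] is the index in the
    original text of the k-th character of stripped.
    """
    kept = [(i, ch) for i, ch in enumerate(text) if ch not in JOINERS]
    stripped = "".join(ch for _, ch in kept)
    offsets = [i for i, _ in kept]
    return stripped, offsets

def same_modulo_joiners(a, b):
    """True if a and b differ only in where spaces and hyphens occur."""
    return strip_joiners(a)[0] == strip_joiners(b)[0]

# Tree #5 from the output above
print(same_modulo_joiners("ineedsomethingreliableandgood looking",
                          "ineedsomethingreliableandgood-looking"))  # True
```

The `offsets` list is what would let a span computed on the stripped string be reported in terms of either original string.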

@nschneid
Contributor

nschneid commented Mar 24, 2023

One way to implement this might be to take the sent line before each tree, convert the -- gaps to a special character so they won't be confused with hyphens, and compute edit distance to get an alignment.
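
A rough sketch of how that could look, under several assumptions: gaps appear in the sent line as a standalone `--` token, the placeholder character `GAP` is arbitrary, and inserting or deleting a space or hyphen is given cost 0 so that "good looking" and "good-looking" align for free. `mark_gaps`, `indel_cost`, and `align` are hypothetical names, not functions from the eval script.

```python
GAP = "\u2205"       # hypothetical placeholder so gaps are not mistaken for hyphens
FREE = {" ", "-"}    # characters that may be inserted or deleted at no cost

def mark_gaps(sent):
    """Replace standalone '--' tokens with the GAP placeholder."""
    return " ".join(GAP if tok == "--" else tok for tok in sent.split(" "))

def indel_cost(ch):
    return 0 if ch in FREE else 1

def align(a, b):
    """Weighted Levenshtein distance with backtrace.

    Returns (cost, pairs); each pair is (i, j), with None on the side that
    has no counterpart (an insertion or deletion).
    """
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + indel_cost(a[i - 1])
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + indel_cost(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match/substitute
                dist[i - 1][j] + indel_cost(a[i - 1]),        # delete from a
                dist[i][j - 1] + indel_cost(b[j - 1]),        # insert from b
            )
    # Walk back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + indel_cost(a[i - 1]):
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs

cost, pairs = align("good looking", "good-looking")
print(cost)  # 0
```

Hyphens that are tokens of their own would still be absorbed at zero cost here; handling them would need extra information from the tree.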

@nschneid
Contributor

nschneid commented Apr 10, 2023

With the current implementation, I think all spans are indexed by offsets in the tree's dehydrated string (the concatenation of terminals ignoring spaces, hyphens, and corrections). This means that

  • gaps and insertions are length-0 spans
  • a hyphen coordinator will be indexed like an insertion
  • multiple consecutive gap/insertion terminals will have the same span (are they sequenced for alignment in LTR order?)
  • a nonterminal constituent starting or ending with a gap/insertion will likely have the same span as a nonterminal without the gap/insertion.

So a sequence of multiple nodes for a span being aligned by Levenshtein is not always due to unaries.
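
To make the consequences above concrete, here is a toy illustration (not the actual implementation). It assumes terminals are plain strings, that a gap is written `--`, and that dehydration simply drops spaces and hyphens; corrections are left out. `dehydrate` and `terminal_spans` are hypothetical names.

```python
def dehydrate(text):
    """Drop spaces and hyphens, as in the dehydrated string described above."""
    return "".join(ch for ch in text if ch not in " -")

def terminal_spans(terminals):
    """Return a (start, end) offset pair into the dehydrated string per terminal."""
    spans = []
    pos = 0
    for text in terminals:
        width = len(dehydrate(text))
        spans.append((pos, pos + width))
        pos += width
    return spans

# A hyphen coordinator followed by a gap: both get the same length-0 span.
print(terminal_spans(["a", "-", "--", "b"]))
# [(0, 1), (1, 1), (1, 1), (1, 2)]
```

Under these assumptions, nothing in the spans alone records which of the two length-0 terminals comes first, which is the sequencing question raised in the third bullet.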

Is this what we want? Maybe it's fine in practice, even if we could in principle be a bit more precise about spans by assigning indices for left-gaps/insertions and right-gaps/insertions.
