Hyphenation/spacing in multi-word tokens #74

aryamanarora · 2023-03-24T21:03:04Z

Output of unaligned trees from the eval script is below. Apologies for the lack of spacing; currently alignment of the two trees is done by ignoring punctuation and any possible spaces between the tokens in the tree.

Tree #3 not aligned.
     ourplansincluderaisingprivatecapitaltodevelopbuildflighttestandoperatethisearthmoonhighwayforthebenefitofthecountryandthebenefitofourinvestors
     ourplansincluderaisingprivatecapitaltodevelopbuildflight-testandoperatethisearthmoonhighwayforthebenefitofthecountryandthebenefitofourinvestors
Tree #5 not aligned.
     ineedsomethingreliableandgood looking
     ineedsomethingreliableandgood-looking
Tree #10 not aligned.
     idon'twanttohavetodealwiththosedealadaywebsiteslikegroupon
     idon'twanttohavetodealwiththosedeal-a-daywebsiteslikegroupon
Tree #22 not aligned.
     apacmanfrogwillneedaheatsourcethatcreatesabaskingtempintheupper80'sfforatleast10-12hoursaday
     apacmanfrogwillneedaheatsourcethatcreatesabaskingtempintheupper80'sfforat least10-12hoursaday
Tree #36 not aligned.
     janetteelbertsonadministrativecoordinatorewslegaleb3326telephone713 853-7906facsimile713 646-2600e-mailaddressjanette.elbertson@enron.com
     janetteelbertsonadministrativecoordinatorewslegaleb3326telephone713853-7906facsimile713646-2600e-mailaddressjanette.elbertson@enron.com
Tree #49 not aligned.
     afterthatpointalqaedawasajointenterprisebetweentheegyptianextremistsandthepolyglotarabsaroundbinladenonlysomeofwhomweresaudi
     afterthatpointal-qaedawasajointenterprisebetweentheegyptianextremistsandthepolyglotarabsaroundbinladenonlysomeofwhomweresaudi

Most of these issues are due to hyphenation/spacing in tokens that consist of multiple words. How should we handle this? One idea is to use the :subt relation consistently and have the aligner prioritise it over the surface form, so that, e.g., both "good looking" and "good-looking" are marked as having the same underlying structure :subt "good" :subt "looking".

nschneid · 2023-03-24T21:48:19Z

I would rather not have to add anything special to the tree—I would say the tokens can be aligned to character offsets with the provision that a hyphen may be inserted or deleted just like a space. (But some hyphens are tokens of their own, serving as Coordinators.)
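
A minimal sketch of that comparison, assuming the aligner simply treats spaces and hyphens as optional; `loose_key` is a hypothetical helper, not a function from the eval script, and it does not yet distinguish hyphens that are standalone Coordinator tokens:

```python
def loose_key(text: str) -> str:
    """Drop spaces and hyphens so that spelling variants compare equal."""
    return "".join(ch for ch in text if ch not in " -")


# Two of the unaligned pairs from the eval output above.
pairs = [
    ("ineedsomethingreliableandgood looking",
     "ineedsomethingreliableandgood-looking"),
    ("idon'twanttohavetodealwiththosedealadaywebsiteslikegroupon",
     "idon'twanttohavetodealwiththosedeal-a-daywebsiteslikegroupon"),
]

for left, right in pairs:
    # Each pair differs only in spaces/hyphens, so the keys match.
    print(loose_key(left) == loose_key(right))  # True
```

Character offsets into the key can then serve as alignment indices, at the cost of also erasing any hyphen that is a token of its own.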

nschneid · 2023-03-24T21:50:07Z

One way to implement this might be to take the sent line before each tree, convert the -- gaps to a special character so they won't be confused with hyphens, and compute edit distance to get an alignment.
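
A rough sketch of this, under stated assumptions: the placeholder character `GAP`, the helper `protect_gaps`, and the zero-cost treatment of spaces and hyphens are all illustrative choices, not the eval script's actual behaviour. It computes only the distance; recovering an alignment would additionally need a backtrace through the table.

```python
GAP = "\u2425"  # hypothetical placeholder for a "--" gap


def protect_gaps(sent: str) -> str:
    """Replace "--" gap markers so they are not mistaken for hyphens."""
    return sent.replace("--", GAP)


def edit_distance(a: str, b: str, free: str = " -") -> int:
    """Levenshtein distance where inserting or deleting a character
    from `free` (space or hyphen by default) costs nothing."""
    def cost(ch: str) -> int:
        return 0 if ch in free else 1

    # prev[j] = distance between the empty prefix of a and b[:j]
    prev = [0]
    for ch in b:
        prev.append(prev[-1] + cost(ch))

    for ca in a:
        cur = [prev[0] + cost(ca)]
        for j, cb in enumerate(b, start=1):
            delete = prev[j] + cost(ca)
            insert = cur[j - 1] + cost(cb)
            substitute = prev[j - 1] + (0 if ca == cb else 1)
            cur.append(min(delete, insert, substitute))
        prev = cur
    return prev[-1]


print(edit_distance("good looking", "good-looking"))  # 0
print(edit_distance("a--b", "ab"))                    # 0: gap looks like hyphens
print(edit_distance(protect_gaps("a--b"), "ab"))      # 1: gap is preserved
```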

nschneid · 2023-04-10T03:01:06Z

With the current implementation, I think all spans are indexed by offsets in the tree's dehydrated string (the concatenation of terminals ignoring spaces, hyphens, and corrections). This means that

- gaps and insertions are length-0 spans
- a hyphen coordinator will be indexed like an insertion
- multiple consecutive gap/insertion terminals will have the same span (are they sequenced for alignment in LTR order?)
- a nonterminal constituent starting or ending with a gap/insertion will likely have the same span as a nonterminal without the gap/insertion.
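
To make these consequences concrete, here is a toy sketch of span indexing over a dehydrated string. It assumes terminals are plain strings and that a gap is written `--`; `dehydrated_spans` is a hypothetical function for illustration, not the current implementation, and it leaves corrections out entirely.

```python
def dehydrated_spans(terminals):
    """Return (start, end) offsets of each terminal in the dehydrated
    string, i.e. the concatenation with spaces and hyphens removed."""
    spans, offset = [], 0
    for text in terminals:
        core = text.replace(" ", "").replace("-", "")
        spans.append((offset, offset + len(core)))
        offset += len(core)
    return spans


# Toy terminals: a hyphen coordinator ("-") and two consecutive gaps ("--").
terminals = ["cats", "-", "dogs", "--", "--", "sleep"]
for text, span in zip(terminals, dehydrated_spans(terminals)):
    print(text, span)
# cats (0, 4)
# - (4, 4)
# dogs (4, 8)
# -- (8, 8)
# -- (8, 8)
# sleep (8, 13)
```

The hyphen coordinator and both gaps come out as length-0 spans, and the two consecutive gaps share the span (8, 8), so only their order in the list distinguishes them.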

So a sequence of multiple nodes for a span being aligned by Levenshtein is not always due to unaries.

Is this what we want? Maybe it's fine in practice, even if we could in principle be a bit more precise about spans by assigning indices for left-gaps/insertions and right-gaps/insertions.
