TICCL-LDcalc: request to remove existing filter on underscore/hyphen bigram corrections

* In short:

It seems to me we have misguidedly imposed a restriction on TICCL-LDcalc that stops it from returning higher-order n-gram pairs in which the variant and the Correction Candidate (CC) differ only in a single (?) underscore (= space) or hyphen. I suppose I at some point expected this restriction to lighten TICCL's overall workload. The result is that the later modules cannot converge on the best-fitting resolution of the split word, because the unigram solution contradicts those offered by the bigrams and possibly the trigrams. Ultimately, FoLiA-correct fails to find the right bi- and trigrams to correct.

Example LD-calc output:
```
is_hon_derd~1~1~ir_honderd~1~1~13664231956~2~9~0~1~1~0~0
is_hon_derd~1~1~isa_Honderd~98765433~98765433~2984709275~2~9~1~1~1~0~0
is_hon_derd~1~1~ishonderd~1~1~22081616064~2~9~0~1~1~0~0
```
We do not get the CC: 'is_honderd'.
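
To check other variants for the same gap, a minimal Python sketch along the following lines may help. It is not part of TICCL: the helper names are hypothetical, and the field positions (variant in field 0, CC in field 3 of the `~`-separated lines) are read off the sample output above rather than taken from any documentation.

```python
SEPARATORS = ("_", "-")


def single_separator_deletions(variant):
    """Return every string obtained by deleting exactly one underscore or hyphen."""
    return {
        variant[:i] + variant[i + 1:]
        for i, ch in enumerate(variant)
        if ch in SEPARATORS
    }


def ccs_in_ldcalc(lines, variant):
    """Collect the CCs listed for `variant` (assumed: variant = field 0, CC = field 3)."""
    ccs = set()
    for line in lines:
        fields = line.strip().split("~")
        if len(fields) > 3 and fields[0] == variant:
            ccs.add(fields[3])
    return ccs


# The three LD-calc lines quoted above.
ldcalc_lines = [
    "is_hon_derd~1~1~ir_honderd~1~1~13664231956~2~9~0~1~1~0~0",
    "is_hon_derd~1~1~isa_Honderd~98765433~98765433~2984709275~2~9~1~1~1~0~0",
    "is_hon_derd~1~1~ishonderd~1~1~22081616064~2~9~0~1~1~0~0",
]

expected = single_separator_deletions("is_hon_derd")
missing = sorted(expected - ccs_in_ldcalc(ldcalc_lines, "is_hon_derd"))
print(missing)  # ['is_honderd', 'ishon_derd']
```

Only strings that actually occur in the frequency list can be real CCs, so 'ishon_derd' is most likely noise here; 'is_honderd' is the one we are missing.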

As a result, the bi/trigram correction never gets the most plausible resolution for split words, yet still gets hundreds of less plausible Correction Candidates (CCs). This leads to suboptimal ranking of the CCs and chaos further on in the pipeline, especially in TICCL-chainclean, which on the current very large test on about 2.3 million pages of HTRed text now fails to make progress even after days.

We observe the same for hyphens in n-gram corrections. See the section 'Hyphens:' below.

The restriction is possibly implemented as simply as this: for the confusion values for underscore or hyphen, do not return word pairs where the CC would be a bi- or trigram, i.e. only unigrams are allowed as CC. (This will probably not fully cover it...)
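
To make that guess concrete, here is a small Python sketch of what such a filter could look like. This is purely my reconstruction for discussion, not the actual TICCL-LDcalc code: the function names are hypothetical, it compares the strings directly instead of using the character confusion values, and the case-insensitive comparison is an assumption.

```python
def differs_only_by_one_separator(variant, cc):
    """True if `cc` equals `variant` with exactly one '_' or '-' removed.

    Assumption: case differences are ignored in the comparison.
    """
    v, c = variant.lower(), cc.lower()
    return any(v[:i] + v[i + 1:] == c for i, ch in enumerate(v) if ch in "_-")


def is_ngram(word):
    """A word is treated as a bi- or trigram if it still contains an underscore."""
    return "_" in word


def suspected_filter_drops(variant, cc):
    """Hypothetical rule: drop the pair when the only difference is one
    separator and the CC is itself still a bi- or trigram."""
    return differs_only_by_one_separator(variant, cc) and is_ngram(cc)


print(suspected_filter_drops("Hon_derd", "Honderd"))                  # False: unigram CC is kept
print(suspected_filter_drops("is_hon_derd", "is_honderd"))            # True: bigram CC is dropped
print(suspected_filter_drops("ge-arresteerd_en", "gearresteerd_en"))  # True: bigram CC is dropped
```

Under this reading, the rule reproduces what we observe: the unigram pair survives, while the bigram CCs we are missing are exactly the ones it drops.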

However implemented, I would now like to see the restriction removed.


* The story, more in full, for both underscores and hyphens:

- Underscores:

TICCL-rank currently correctly returns e.g. the unigram pair:
```
Hon_derd~1~23~Honderd~110023864~121380612~11040808032~1~7~1~1~1~0~76
```
'Grep' on the ranked list:
```
reynaert@violet:/reddata/NATAR/RANK$ cat NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc.RANK.ranked |grep 'Hon_derd'

Hon_derd#1#honderd#110122179#11040808032#1#0.997116
Te_Hon_derd#1#ten_honderd#98766330#1125720992#2#0.998336
Hon_derd_halve#1#honderd_halven#98765438#1125720992#2#0.917197
```
The bigram, i.e. the split unigram, is correctly resolved. We also get two trigrams containing the bigram.

The CC for the first trigram 'Te_Hon_derd' is 'nice' in light of the fact that we currently prefer what we now regard as the archaic Dutch form with 'ten'. However, the more plausible form for these diachronic texts would have 'te', which has the higher corpus frequencies (you need to subtract the artifrq '98765432' to get at the actual corpus frequencies):
```
reynaert@violet:/reddata/NATAR/RANK$ grep -i '^Te_Honderd' /reddata/NATAR/UNKMERGE/NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.clean |head -n 5

te_honderd	98767463
te_Honderd	98765777
Te_honderd	98765740
te_honderd_en	98765505
Te_Honderd	98765497

reynaert@violet:/reddata/NATAR/RANK$ grep -i '^Ten_Honderd' /reddata/NATAR/UNKMERGE/NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.clean |head -n 5

ten_honderd	98766330
ten_Honderd	98765600
Ten_honderd	98765523
Ten_Honderd	98765485
ten_honderd_en	98765472
```
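
For convenience, the subtraction can be done with a few lines of Python. The values are copied from the two listings above; the function name is hypothetical, and leaving values below the artifrq untouched is an assumption on my part (such values appear elsewhere in the list, e.g. a frequency of 2).

```python
ARTIFRQ = 98765432  # the artificial frequency mentioned above


def corpus_frequency(listed, artifrq=ARTIFRQ):
    """Strip the artifrq from a listed frequency.

    Assumption: values below the artifrq carry no artificial boost
    and are returned unchanged.
    """
    return listed - artifrq if listed >= artifrq else listed


# Listed frequencies taken from the two grep outputs above.
listed = {
    "te_honderd": 98767463,
    "ten_honderd": 98766330,
    "te_Honderd": 98765777,
    "ten_Honderd": 98765600,
}

for form, freq in listed.items():
    print(form, corpus_frequency(freq))
# te_honderd 2031
# ten_honderd 898
# te_Honderd 345
# ten_Honderd 168
```

So 'te_honderd' occurs 2031 times against 898 for 'ten_honderd', which is why I would regard 'te' as the more plausible form here.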
For the second trigram 'Hon_derd_halve' we see that the bigram CC that simply keeps 'halve' is not returned by TICCL-LDcalc:
```
(LMdev) reynaert@violet:RANK$ grep '^Hon_derd_halve' NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc

Hon_derd_halve~1~1~Honderd_haive~1~1~13198108815~2~12~0~1~1~0~0
Hon_derd_halve~1~1~Honderd_halven~98765433~98765439~1125720992~2~12~1~1~0~0~0
Hon_derd_halve~1~1~honderd_halven~98765438~98765439~1125720992~2~12~1~1~0~0~0
Hon_derd_halve~1~1~honderd_halver~98765433~98765433~1722007593~2~12~1~1~0~0~0
Hon_derd_halve~1~1~honderd_halv~1~1~22633548775~2~12~0~1~0~0~0
Hon_derd_halve~1~1~honderd_hatve~1~1~14509133807~2~12~0~1~1~0~0
Hon_derd_halve~1~1~honderd_helve~98765433~98765433~13473584596~2~12~1~1~1~0~0
Hon_derd_halve~1~1~honderd_zalve~98765433~98765433~3400807475~2~12~1~1~1~0~0
```
After TICCL-rank this results in:
```
(LMdev) reynaert@violet:RANK$ grep '11040808032' NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc.RANK.ranked |grep '#honderd#'

hon_derd#22#honderd#110122179#11040808032#1#0.997135
Hon_derd#1#honderd#110122179#11040808032#1#0.997116
```
But at the higher n-gram level, and allowing for more character confusion than only an extra space (represented here as an underscore):
```
(LMdev) reynaert@violet:RANK$ grep '_' NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc.RANK.ranked |grep '^Hon_derd'

Hon_derd#1#honderd#110122179#11040808032#1#0.997116
Hon_derd_halve#1#honderd_halven#98765438#1125720992#2#0.917197
```
This results in chaos down the line: TICCL-chain and especially TICCL-chainclean fail to further resolve these contradictory results.
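
One way to surface such contradictions in the ranked list is sketched below in Python. It is only an illustration of the problem, not how TICCL-chain or TICCL-chainclean work: the helper names are hypothetical, the field positions (variant in field 0, CC in field 2 of the `#`-separated lines) are read off the ranked output above, and comparing case-insensitively is an assumption.

```python
def top_ccs(ranked_lines):
    """Map each variant to its first listed CC (assumed: variant = field 0, CC = field 2)."""
    best = {}
    for line in ranked_lines:
        fields = line.strip().split("#")
        if len(fields) > 2:
            best.setdefault(fields[0], fields[2])
    return best


def find_conflicts(best):
    """List cases where applying a shorter pair inside a longer variant
    gives a different result than the CC ranked for the longer variant."""
    conflicts = []
    for short, short_cc in best.items():
        needle = "_" + short + "_"
        for long_variant, long_cc in best.items():
            padded = "_" + long_variant + "_"  # pad so matches fall on token boundaries
            if long_variant == short or needle not in padded:
                continue
            implied = padded.replace(needle, "_" + short_cc + "_", 1)[1:-1]
            if implied.lower() != long_cc.lower():
                conflicts.append((short, long_variant, implied, long_cc))
    return conflicts


# Ranked lines quoted earlier in this issue.
ranked = [
    "Hon_derd#1#honderd#110122179#11040808032#1#0.997116",
    "Te_Hon_derd#1#ten_honderd#98766330#1125720992#2#0.998336",
    "Hon_derd_halve#1#honderd_halven#98765438#1125720992#2#0.917197",
]

for short, long_variant, implied, ranked_cc in find_conflicts(top_ccs(ranked)):
    print(f"{long_variant}: {short} implies {implied}, ranked CC is {ranked_cc}")
# Te_Hon_derd: Hon_derd implies Te_honderd, ranked CC is ten_honderd
# Hon_derd_halve: Hon_derd implies honderd_halve, ranked CC is honderd_halven
```

Both trigrams come out as conflicting with the resolution of 'Hon_derd', which is the kind of disagreement the later modules then have to untangle.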

- Hyphens:

We see the same happening with hyphens.

Our current corpus frequency list has the following bigrams:
```
reynaert@violet:/reddata/NATAR/RANK$ grep -i 'ge-arresteerd_en' /reddata/NATAR/UNKMERGE/NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.clean

ge-arresteerd_En	2
ge-arresteerd_en	2
```
versus:
```
reynaert@violet:/reddata/NATAR/RANK$ grep -i 'gearresteerd_en' /reddata/NATAR/UNKMERGE/NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.clean

gearresteerd_en	98765472
gearresteerd_En	98765449
Gearresteerd_En	98765436
```
Here too, TICCL-LDcalc does not return the most plausible CC:
```
reynaert@violet:/reddata/NATAR/RANK$ grep 'ge-arresteerd_en' NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc

ge-arresteerd_en~2~4~Gearresteerden~98765447~110000088~46763859681~2~14~1~1~1~0~2
ge-arresteerd_en~2~4~ge-arresteerde~5~5~23207337056~2~14~0~1~0~0~2
ge-arresteerd_en~2~4~gearresteerden~110000073~110000088~46763859681~2~14~1~1~1~0~2
ge-arresteerdens~1~1~ge-arresteerd_en~2~4~4345431517~2~14~0~1~0~0~0
```
We hope this can be remedied shortly!
Thanks!
MRE
