- In short:
It seems to me we have misguidedly imposed a restriction on TICCL-LDcalc: it does not return higher-order n-gram pairs in which the variant and the Correction Candidate (CC) differ only in a single (?) underscore (= space) or hyphen. I suppose I at some point expected this restriction to lighten TICCL's overall workload. As a result, the later modules cannot converge on the best-fitting resolution of a split word, because the unigram solution contradicts those offered by the bigrams or possibly trigrams. Ultimately, FoLiA-correct fails to find the right bigrams and trigrams to correct.
Example LD-calc output:
is_hon_derd~1~1~ir_honderd~1~1~13664231956~2~9~0~1~1~0~0
is_hon_derd~1~1~isa_Honderd~98765433~98765433~2984709275~2~9~1~1~1~0~0
is_hon_derd~1~1~ishonderd~1~1~22081616064~2~9~0~1~1~0~0
We do not get the CC: 'is_honderd'.
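A quick way to confirm the missing candidate is to parse the LDcalc lines and inspect the CC column. The sketch below assumes only what the example lines show: fields are separated by `~`, the variant is the first field and the CC is the fourth. The helper name `candidates_for` is made up for illustration and is not part of TICCL.

```python
# Sample TICCL-LDcalc output lines copied from the example above.
LDCALC_LINES = [
    "is_hon_derd~1~1~ir_honderd~1~1~13664231956~2~9~0~1~1~0~0",
    "is_hon_derd~1~1~isa_Honderd~98765433~98765433~2984709275~2~9~1~1~1~0~0",
    "is_hon_derd~1~1~ishonderd~1~1~22081616064~2~9~0~1~1~0~0",
]


def candidates_for(variant, lines):
    """Collect the CCs listed for `variant`.

    Assumption: field 0 is the variant and field 3 is the CC.
    """
    ccs = []
    for line in lines:
        fields = line.strip().split("~")
        if len(fields) > 3 and fields[0] == variant:
            ccs.append(fields[3])
    return ccs


ccs = candidates_for("is_hon_derd", LDCALC_LINES)
print(ccs)
print("is_honderd" in ccs)  # False: the expected CC is absent
```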
As a result, the bi-/trigram correction never gets the most plausible resolution for split words, yet still gets hundreds of less plausible CCs. This leads to suboptimal ranking of the CCs and chaos further on in the pipeline, especially in TICCL-chainclean, which in the current very large test on about 2.3 million pages of HTRed text fails to make progress even after days.
We observe the same for hyphens in n-gram corrections. See section 'Hyphens:' below.
The restriction is possibly implemented as simply as this: for the confusion values for underscore or hyphen, do not return word pairs where the CC would be a bi- or trigram, i.e. only unigrams are allowed as CC. (This will probably not fully cover it...)
However implemented, I would now like to see the restriction removed.
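To make the affected pairs concrete, here is a minimal sketch of how they could be recognised. It is not TICCL-LDcalc's actual code: the function names are hypothetical, and ignoring case in the comparison is an assumption.

```python
def differs_by_one_separator(variant, cc, separators="_-"):
    """True if deleting exactly one separator from `variant` yields `cc`.

    Assumption: case is ignored when comparing the two forms.
    """
    v, c = variant.lower(), cc.lower()
    if len(v) != len(c) + 1:
        return False
    return any(
        ch in separators and v[:i] + v[i + 1:] == c
        for i, ch in enumerate(v)
    )


def is_ngram(word):
    """Treat any form that still contains an underscore as a bi- or trigram."""
    return "_" in word


# Pairs of (variant, CC) taken from the examples in this issue.
pairs = [
    ("Hon_derd", "Honderd"),                 # unigram CC: currently returned
    ("is_hon_derd", "is_honderd"),           # bigram CC: currently missing
    ("ge-arresteerd_en", "gearresteerd_en"), # bigram CC: currently missing
]
for variant, cc in pairs:
    affected = differs_by_one_separator(variant, cc) and is_ngram(cc)
    print(variant, cc, "affected" if affected else "not affected")
```

Pairs reported as "affected" are the ones that, judging from the output shown in this issue, never make it into the LDcalc results.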
- The story, more in full, for both underscores and hyphens:
- Underscores:
TICCL-rank currently correctly returns e.g. the unigram pair:
Hon_derd~1~23~Honderd~110023864~121380612~11040808032~1~7~1~1~1~0~76
'Grep' on the ranked list:
reynaert@violet:/reddata/NATAR/RANK$ cat NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc.RANK.ranked |grep 'Hon_derd'
Hon_derd#1#honderd#110122179#11040808032#1#0.997116
Te_Hon_derd#1#ten_honderd#98766330#1125720992#2#0.998336
Hon_derd_halve#1#honderd_halven#98765438#1125720992#2#0.917197
The bigram, i.e. the split unigram, is correctly resolved. We also get two trigrams containing the bigram.
The CC for the first trigram 'Te_Hon_derd' is 'nice' in light of the fact that we currently prefer what we now regard as the archaic form with 'ten' in Dutch. However, the more plausible form for these diachronic texts would have 'te', which has higher corpus frequencies (you need to subtract the artifrq '98765432' to get the actual corpus frequencies):
reynaert@violet:/reddata/NATAR/RANK$ grep -i '^Te_Honderd' /reddata/NATAR/UNKMERGE/NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.clean |head -n 5
te_honderd 98767463
te_Honderd 98765777
Te_honderd 98765740
te_honderd_en 98765505
Te_Honderd 98765497
reynaert@violet:/reddata/NATAR/RANK$ grep -i '^Ten_Honderd' /reddata/NATAR/UNKMERGE/NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.clean |head -n 5
ten_honderd 98766330
ten_Honderd 98765600
Ten_honderd 98765523
Ten_Honderd 98765485
ten_honderd_en 98765472
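The subtraction mentioned above can be worked out for the two top forms. This is a small sketch, assuming that any listed frequency at or above the artifrq includes it; the function name `corpus_frequency` is illustrative only.

```python
ARTIFRQ = 98765432  # artificial frequency mentioned above

# Listed frequencies copied from the two grep results above.
listed = {
    "te_honderd": 98767463,
    "ten_honderd": 98766330,
}


def corpus_frequency(listed_frequency, artifrq=ARTIFRQ):
    """Strip the artificial frequency when it appears to be included.

    Assumption: values at or above `artifrq` include it, lower values do not.
    """
    if listed_frequency >= artifrq:
        return listed_frequency - artifrq
    return listed_frequency


for form, freq in listed.items():
    print(form, corpus_frequency(freq))
# te_honderd 2031
# ten_honderd 898
```

So 'te_honderd' occurs 2031 times in the corpus against 898 for 'ten_honderd'.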
For the second trigram 'Hon_derd_halve' we see that the actual bigram containing just 'halve' is not returned by TICCL-LDcalc:
(LMdev) reynaert@violet:RANK$ grep '^Hon_derd_halve' NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc
Hon_derd_halve~1~1~Honderd_haive~1~1~13198108815~2~12~0~1~1~0~0
Hon_derd_halve~1~1~Honderd_halven~98765433~98765439~1125720992~2~12~1~1~0~0~0
Hon_derd_halve~1~1~honderd_halven~98765438~98765439~1125720992~2~12~1~1~0~0~0
Hon_derd_halve~1~1~honderd_halver~98765433~98765433~1722007593~2~12~1~1~0~0~0
Hon_derd_halve~1~1~honderd_halv~1~1~22633548775~2~12~0~1~0~0~0
Hon_derd_halve~1~1~honderd_hatve~1~1~14509133807~2~12~0~1~1~0~0
Hon_derd_halve~1~1~honderd_helve~98765433~98765433~13473584596~2~12~1~1~1~0~0
Hon_derd_halve~1~1~honderd_zalve~98765433~98765433~3400807475~2~12~1~1~1~0~0
After TICCL-rank this results in:
(LMdev) reynaert@violet:RANK$ grep '11040808032' NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc.RANK.ranked |grep '#honderd#'
hon_derd#22#honderd#110122179#11040808032#1#0.997135
Hon_derd#1#honderd#110122179#11040808032#1#0.997116
But at the higher n-gram level, allowing for more character confusion than only an extra space (represented here as an underscore):
(LMdev) reynaert@violet:RANK$ grep '_' NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc.RANK.ranked |grep '^Hon_derd'
Hon_derd#1#honderd#110122179#11040808032#1#0.997116
Hon_derd_halve#1#honderd_halven#98765438#1125720992#2#0.917197
This results in chaos down the line: TICCL-chain and especially TICCL-chainclean fail to further resolve these contradictory results.
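The contradiction can be illustrated with a small check on the ranked list. This is only a sketch: it assumes the ranked format is `#`-separated with the variant in the first field and the CC in the third, it only handles the case where the shorter variant is a prefix of the longer one, and the function names are hypothetical.

```python
# Ranked lines copied from the grep result above.
RANKED_LINES = [
    "Hon_derd#1#honderd#110122179#11040808032#1#0.997116",
    "Hon_derd_halve#1#honderd_halven#98765438#1125720992#2#0.917197",
]


def parse_ranked(lines):
    """Map each variant to its ranked CC (assumed: field 0 = variant, field 2 = CC)."""
    best = {}
    for line in lines:
        fields = line.strip().split("#")
        best[fields[0]] = fields[2]
    return best


def contradictions(best):
    """List cases where a longer n-gram's CC disagrees with the CC obtained by
    substituting the shorter variant's resolution into it (case ignored)."""
    found = []
    for short, short_cc in best.items():
        for longer, longer_cc in best.items():
            if longer != short and longer.startswith(short + "_"):
                expected = (short_cc + longer[len(short):]).lower()
                if expected != longer_cc.lower():
                    found.append((short, longer, expected, longer_cc))
    return found


result = contradictions(parse_ranked(RANKED_LINES))
for short, longer, expected, actual in result:
    print(f"{longer}: expected {expected} (from {short}), ranked CC is {actual}")
```

For the two lines above, substituting the bigram resolution gives 'honderd_halve', while the ranked CC for the trigram is 'honderd_halven'.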
- Hyphens:
We see the same happening with hyphens.
Our current corpus frequency list has the following bigrams:
reynaert@violet:/reddata/NATAR/RANK$ grep -i 'ge-arresteerd_en' /reddata/NATAR/UNKMERGE/NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.clean
ge-arresteerd_En 2
ge-arresteerd_en 2
versus:
reynaert@violet:/reddata/NATAR/RANK$ grep -i 'gearresteerd_en' /reddata/NATAR/UNKMERGE/NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.clean
gearresteerd_en 98765472
gearresteerd_En 98765449
Gearresteerd_En 98765436
Here too, TICCL-LDcalc does not return the most plausible CC:
reynaert@violet:/reddata/NATAR/RANK$ grep 'ge-arresteerd_en' NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc
ge-arresteerd_en~2~4~Gearresteerden~98765447~110000088~46763859681~2~14~1~1~1~0~2
ge-arresteerd_en~2~4~ge-arresteerde~5~5~23207337056~2~14~0~1~0~0~2
ge-arresteerd_en~2~4~gearresteerden~110000073~110000088~46763859681~2~14~1~1~1~0~2
ge-arresteerdens~1~1~ge-arresteerd_en~2~4~4345431517~2~14~0~1~0~0~0
We hope this can be remedied shortly!
Thanks!
MRE