Hey,
getRawTextContent seems to assume that the word pieces in the bidiTexts array appear in ascending order of their y coordinate, but they probably appear in whatever order the PDF rendering software added them. I have a file where a line of text that looks like this on screen:
Some label 12.3 4455
appears in that array as
Some label | 3 | . | 2 | 1 | 5 | 5 | 4 | 4
That is, the character groups follow one another in the right order, but the individual characters within each group are reversed. As a result, getRawTextContent() naively returns
Some label 3.21 5544
The y value of each item gives the correct position, but the code does not take it into account (it only checks the absolute y difference).
So I wonder if this has a known solution before I go and change the code myself?
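In case it helps the discussion, here is a rough sketch of the kind of change I have in mind. It is not a tested patch. It assumes each item in the array has the shape { str, x, y }, which I have not checked against the library; the function name orderPieces and its options are made up for illustration. Pieces are grouped into lines by an absolute-difference check on one axis and then sorted by position along the other axis, and both axes are parameters because I am not sure they are the same for every file.

```javascript
// Sketch only: the item shape { str, x, y } and every name below are
// assumptions for illustration, not the library's actual API.
function orderPieces(items, options = {}) {
  const { lineAxis = 'y', orderAxis = 'x', tolerance = 2 } = options;
  const lines = [];

  for (const item of items) {
    const current = lines[lines.length - 1];
    // Treat the piece as part of the current line when its line-axis
    // coordinate is within `tolerance` of that line's first piece.
    if (current && Math.abs(current[0][lineAxis] - item[lineAxis]) <= tolerance) {
      current.push(item);
    } else {
      lines.push([item]);
    }
  }

  // Sort a copy of each line by position instead of trusting array order.
  return lines.map((line) =>
    [...line].sort((a, b) => a[orderAxis] - b[orderAxis]).map((piece) => piece.str)
  );
}

// Made-up coordinates that mimic the reversed character groups from my file.
const pieces = [
  { str: 'Some label', x: 0, y: 100 },
  { str: '3', x: 75, y: 100 },
  { str: '.', x: 70, y: 100 },
  { str: '2', x: 65, y: 100 },
  { str: '1', x: 60, y: 100 },
  { str: '5', x: 100, y: 100 },
  { str: '5', x: 95, y: 100 },
  { str: '4', x: 90, y: 100 },
  { str: '4', x: 85, y: 100 },
];
const ordered = orderPieces(pieces);
```

How the sorted pieces get joined back into a string (where the spaces go) is left out on purpose, since that depends on how getRawTextContent already handles it.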
Cheers for the nice work.