ligature question / annotation preservation when converting to html #861
Unanswered
DookTibs asked this question in Looking for help
Replies: 1 comment, 1 reply
-
Uff - a long question!
As per the text marker annotations:
-
Hi - I'm new to PyMuPDF and having some trouble that I think is related to ligatures.
What I'm trying to do:
But certain PDFs are giving me difficulty. For instance, see this tiny screenshot of a sample PDF I'm working with:
In this sentence, the words "polyfluoroalkyl" and "fluoroether" have what I think is called a ligature: the "fl" portion does not appear to be two separate characters (for instance, if I use my mouse to highlight letters in those words, the "f" and the "l" cannot be highlighted independently).
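As a side note on what a ligature can look like once text is extracted: assuming the "fl" glyph comes out as the single Unicode character U+FB02 (an assumption, since it depends on how the PDF's font is encoded), a minimal sketch of expanding it with the standard library might be:

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as the 'fl' ligature (U+FB02).

    NFKC normalization replaces each compatibility character with its
    plain equivalent, so U+FB02 becomes the two letters "f" and "l".
    Note that NFKC also rewrites other compatibility characters
    (for example superscript digits), which may or may not be wanted.
    """
    return unicodedata.normalize("NFKC", text)


# Hypothetical extracted word containing the single-character ligature.
extracted = "poly\ufb02uoroalkyl"
searchable = expand_ligatures(extracted)
print("polyfluoroalkyl" in searchable)
```

This only normalizes plain text after extraction; it does not change how the HTML output is split into spans.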
I found an issue where you discuss something related:
#745
and so I'm explicitly passing flags=fitz.TEXT_PRESERVE_WHITESPACE+fitz.TEXT_PRESERVE_IMAGES in my call to page.getText, so as to not preserve ligatures. The converted HTML that PyMuPDF produces for this sentence then looks like this (I added some extra whitespace between spans to make it more readable; edit: attaching it as an image, as I can't get the HTML to not render in this comment):
The "fl" is getting wrapped in a span with a slightly different font. And when viewing this html in either a text editor or a web browser, the "f" and "l" are now separate characters.
So the generated HTML looks great, basically indistinguishable from the source PDF. But if I wanted to, for example, search for "polyfluoroalkyl", it's complicated by that extra span thrown in there. (For this specific example I could of course work around it, but I need to handle many such cases, so I'm trying to figure out a general solution.) Is there a way to get words like "polyfluoroalkyl" to not be split over multiple spans in the generated text?
(Similarly, the word "BACKGROUND" at the start of the sentence is wrapped in two separate spans, to account for the larger "B" at the start of the word. That will also potentially cause issues when searching.)
What PyMuPDF is doing here makes a lot of sense to me, but I'm wondering if there's a general approach that I'm missing that might be better for my particular goal.
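To make the search problem concrete, here is one possible workaround sketched with the standard library only: flatten each paragraph's spans into plain text before searching. The sample markup below is made up for illustration (the real output was only posted as an image), and the assumption that each sentence sits inside a `<p>` of `<span>` elements is mine, not something confirmed in this thread.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p>, ignoring inline tags such as <span>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._parts = None     # text pieces of the <p> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self._parts is not None:
            self.paragraphs.append("".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


def paragraphs_containing(html, term):
    """Return the flattened text of every paragraph that contains term."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return [p for p in parser.paragraphs if term.lower() in p.lower()]


# Hypothetical markup imitating the span splitting described above.
sample_html = (
    '<p><span style="font-size:14pt">B</span>'
    '<span style="font-size:9pt">ACKGROUND: poly</span>'
    '<span style="font-family:X">fl</span>'
    '<span style="font-size:9pt">uoroalkyl substances</span></p>'
)
hits = paragraphs_containing(sample_html, "polyfluoroalkyl")
print(hits)
```

This finds which paragraph contains the term but loses the mapping back to individual spans, so wrapping the match in highlight markup would need extra bookkeeping.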
Or, as an alternative approach, I briefly experimented with annotating the PDF instead of the HTML: using page.getTextWords(), getting the Rect from the match, and calling addHighlightAnnot on that. That worked well and I could find words like polyfluoroalkyl, but when converting to HTML those annotations were not carried through. Is there a way to preserve PDF annotations when converting to another format?
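For context, the word-matching step of that experiment might look roughly like the sketch below. It assumes page.getTextWords() returns tuples of the form (x0, y0, x1, y1, word, block_no, line_no, word_no); the helper name and the sample data are hypothetical, and the PyMuPDF calls appear only in comments so the sketch runs without a PDF.

```python
def matching_word_boxes(words, term):
    """Return (x0, y0, x1, y1) boxes of the words that contain term.

    words is expected to hold tuples shaped like the output of
    page.getTextWords(); only the first five fields are used.
    """
    term = term.lower()
    boxes = []
    for x0, y0, x1, y1, text, *_ in words:
        if term in text.lower():
            boxes.append((x0, y0, x1, y1))
    return boxes


# Made-up word tuples standing in for page.getTextWords() output.
sample_words = [
    (72.0, 100.0, 140.5, 112.0, "polyfluoroalkyl", 0, 0, 0),
    (144.0, 100.0, 190.0, 112.0, "substances", 0, 0, 1),
    (72.0, 114.0, 130.0, 126.0, "fluoroether,", 0, 1, 0),
]
boxes = matching_word_boxes(sample_words, "fluoroether")

# With a real page, each box would presumably be highlighted like this:
#   for box in boxes:
#       page.addHighlightAnnot(fitz.Rect(box))
print(boxes)
```

This covers only finding the rectangles; it does not address whether the resulting annotations survive the HTML conversion, which is the open question here.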