Question: Inserting unicode any utf-8 without detecting the language with a custom font #690
Replies: 10 comments
-
| The Tesseract PDF renderer code has some useful info (at least for me) on the way they do it: https://github.com/tesseract-ocr/tesseract/blob/master/src/api/pdfrenderer.cpp#L35 |
-
| In case you're looking for hOCR files for my example program: (You will need to gunzip it; I made my program stop after one page for all my tests) ... A Chinese hOCR file will follow momentarily. |
-
| Sorry, I now pushed the latest code. Also a branch that does load the glyphless font, but nothing seems to get added to the PDF. Here is a similar file with hOCR (but really, one character being added to the PDF with such a glyphless font could be enough): (You might want to  I suppose the glyphless font requires more hacks that Tesseract applies to map all characters to 0, as mentioned in the pdf renderer. | 
-
| I haven't tried inserting text with a glyphless font with PyMuPDF before. | 
-
| There is the repo https://github.com/jbarlow83/OCRmyPDF, which has some overlaps with your work I believe ... | 
-
| Just tried it: Also tried insertText with the font: more or less the same. It does not complain about the glyphless font, but text extraction returns only spaces. |
-
| For your information, I am studying the Tesseract C++ code some more, and they seem to perform quite a few interesting hacks. It may not be reasonable to assume that these will work with pymupdf. I will get back to you in a few days. Thanks. |
-
| Interesting to see where this leads to. FYI: MuPDF v1.18.0 (not PyMuPDF yet) contains native support for OCR-based text extraction via Tesseract. | 
-
| Understood. I am trying to integrate OCR results into PDFs, not OCR PDF files. I'll keep you posted. I will have some minimal Python code that generates a small PDF (by hand) that I will then manipulate with pymupdf, I think. |
-
| As a follow-up... I've ported the Tesseract pdfrenderer.cpp to Python here: https://git.archive.org/merlijn/archive-pdf-tools/-/blob/master/pdfrenderer.py The other file in that repo (recode.py) uses OCR result files (hOCR) and an input PDF with images to create a new searchable PDF. In the  Tesseract does a lot of neat hacks/tricks to get the size to be small. If you're interested I can try to work with you on supporting something similar with regard to text insertion in pymupdf, but I'm content with the pdfrenderer.py that I wrote -- it works with all Unicode and the output PDF is really small. |
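For readers wondering what such a renderer has to emit per word, here is a minimal sketch of the PDF text operators involved. It is not taken from pdfrenderer.py. It assumes a composite font registered as `/F1` (a hypothetical resource name) with 2-byte Identity-H encoding, and it assumes every glyph advances by half an em (`advance_em=0.5`), which is my understanding of the glyphless font but is unverified here. The operators themselves (`Tr`, `Tf`, `Tm`, `Tz`, `Tj`) are standard PDF.

```python
def pdf_hex_string(text):
    """Encode text as a PDF hex string of UTF-16BE code units."""
    return "<" + text.encode("utf-16-be").hex().upper() + ">"


def h_stretch_percent(text, box_width, fontsize, advance_em=0.5):
    """Value for the Tz operator so that `text` spans `box_width` points.

    advance_em is an assumption: every glyph advances by half an em.
    """
    # With a 2-byte encoding, each UTF-16 code unit selects one glyph.
    units = len(text.encode("utf-16-be")) // 2
    return 100.0 * box_width / (units * advance_em * fontsize)


def invisible_word_ops(text, x, y, box_width, fontsize, font="/F1"):
    """Build a text object that places `text` invisibly at (x, y)."""
    tz = h_stretch_percent(text, box_width, fontsize)
    return (
        "BT\n"
        "3 Tr\n"                            # render mode 3: invisible text
        f"{font} {fontsize:g} Tf\n"         # select font and size
        f"1 0 0 1 {x:g} {y:g} Tm\n"         # move to the word origin
        f"{tz:.2f} Tz\n"                    # horizontal scaling in percent
        f"{pdf_hex_string(text)} Tj\n"      # show the encoded string
        "ET"
    )


print(invisible_word_ops("世界", 72, 700, 40, 12))
```

Because the string is written as UTF-16BE code units, any Unicode text can be encoded this way. Whether a viewer extracts it correctly still depends on the font's ToUnicode mapping, which this sketch does not cover.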
-
I am hoping to create PDF files from 'hOCR' (output format of OCR engines) and create a (hidden!) text layer on top of a PDF with images. I already have a working proof of concept of this, although it's in very early stages: https://git.archive.org/merlijn/archive-pdf-tools/-/blob/master/hocr2pdf.py
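For context, the part of hOCR that matters for a text layer is each word's text and bounding box. The following is only a sketch of how those could be pulled out with the standard library, not the code in hocr2pdf.py. It assumes words are `span` elements with class `ocrx_word` and a `bbox x0 y0 x1 y1` entry in the `title` attribute, as in Tesseract's output; the sample markup is made up.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Sketch-level simplification: assumes no span nested inside a word.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Made-up sample in the shape Tesseract emits.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 600 800'>
<span class='ocr_line' title='bbox 10 20 300 50'>
<span class='ocrx_word' title='bbox 10 20 110 50; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 120 20 300 50; x_wconf 91'>世界</span>
</span></div>"""

words = parse_hocr_words(SAMPLE)
print(words)
```

The boxes are in image pixels, so they still need to be converted to PDF points (using the scan resolution) before any text is placed.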
Changing `render_mode=0` to `render_mode=3` will indeed make the text invisible. But it only supports a very limited set of characters. In any case, it will look something like this with my current code:
I am not using the TextWriter interface since I need to be able to have the text fill the text boxes, with my own morph code.
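To make the idea concrete for readers, here is a rough sketch of stretching an invisible word across its box with PyMuPDF's `insert_text`, `render_mode=3` and `morph`. It is not the code from hocr2pdf.py: the helper names are made up, the box is assumed to be in PDF points already, and the baseline is approximated by the bottom of the box. It uses the built-in `helv` font, so it shares the limited character coverage mentioned above.

```python
def horizontal_scale(box_width, text_width):
    """Factor by which text must be stretched horizontally to span the box."""
    if text_width <= 0:
        raise ValueError("text_width must be positive")
    return box_width / text_width


def insert_hidden_word(page, text, box, fontsize=11, fontname="helv"):
    """Insert `text` invisibly so that it spans box = (x0, y0, x1, y1)."""
    import fitz  # PyMuPDF; imported here so horizontal_scale() works without it

    x0, y0, x1, y1 = box
    # Width the text would have without any stretching.
    natural = fitz.get_text_length(text, fontname=fontname, fontsize=fontsize)
    # Assumption: the bottom of the box stands in for the text baseline.
    origin = fitz.Point(x0, y1)
    stretch = fitz.Matrix(horizontal_scale(x1 - x0, natural), 1)
    page.insert_text(
        origin,
        text,
        fontsize=fontsize,
        fontname=fontname,
        render_mode=3,            # 3 = neither fill nor stroke: invisible
        morph=(origin, stretch),  # scale around the insertion point
    )
```

A call would look like `insert_hidden_word(page, "Hello", (72, 700, 172, 712))` on a page obtained from `fitz.open()`. Only the horizontal direction is scaled here; the font size would still have to be chosen to match the box height.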
What I would like to do is use a glyphless font (this one is extracted from Tesseract: https://wizzup.org/glyphless.ttf), but I've had trouble loading the font. I believe such a font will save a lot in size of the PDF, since it is a very small font (572 bytes), and since I don't want to actually see the text, and just make it selectable, that should work fine?
I could not figure out how to load the `glyphless.ttf` font using MuPDF and render text with it -- any tips? Thanks!