[Feature request] Add PaddleOCR as an available OCR engine to SubtitleEdit #9204
Comments
A PR is welcome. |
Is there a Windows Command line OCR download? |
Setting up PaddleOCR is done via pip. So to set it up you need to:
Afterwards it can be run from the command line, for example via: That is the reason I said that this request isn't so easy: we would need to check for quite a few prerequisites and set up the required tools correctly. From what I've seen, the GPU version also requires an NVIDIA graphics card. The correct paddlepaddle version also depends on the installed NVIDIA driver and the supported CUDA version, so quite a few checks are needed. But afterwards you can run it from the Windows command line. |
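The prerequisite checks described above could start with something like the following Python sketch. It only tests whether the relevant executables can be found on PATH. The tool names in the default tuple are assumptions (in particular that a pip install exposes a `paddleocr` command and that `nvidia-smi` is present when an NVIDIA driver is installed), and the helper names are made up for illustration. It does not verify that the installed paddlepaddle build matches the driver or CUDA version.

```python
import shutil


def check_prerequisites(tools=("python", "pip", "paddleocr", "nvidia-smi")):
    """Map each tool name to True if an executable with that name is on PATH.

    The default tool names are assumptions for illustration only.
    """
    return {tool: shutil.which(tool) is not None for tool in tools}


def missing_tools(status):
    """Return the names of the tools that were not found."""
    return [name for name, found in status.items() if not found]


if __name__ == "__main__":
    status = check_prerequisites()
    for name, found in status.items():
        print(f"{name}: {'found' if found else 'missing'}")
    if missing_tools(status):
        print("Some prerequisites are missing; OCR may not be able to run.")
```

A missing `nvidia-smi` would only hint that the GPU version is unlikely to work; the CPU version might still be usable.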
I must be doing something wrong: Gives
This is very easy for all the other OCR engines... |
I'm currently not on my PC so I can't test it myself right now. |
Those AI vision models usually don't work very well on binarized subs; they are mostly used on raw images. Test it on (this image can't be properly binarized): Here is raw and binarized (I think Chinese): A few more examples extracted with SubsMask2Img [the first one is averaged]: These raw images can be extracted with SubsMask2Img [part of InpaintDelogo] or VSF. Btw, maybe some day I'll release a currently non-public command-line tool that does AI OCR. |
@Purfview I for one would be very interested! Currently I use VideoSubFinder to extract the frames, and then either run them through FineReader, or compile them into PDFs with 50 images per file, and from Google Drive open them in Google Docs, which performs OCR on them without having to pay Google for it. |
@darnn I don't know any public models for Hebrew, you would need to train one yourself. |
@niksedk Your image is indeed not working with PaddleOCR; I'm also not getting a correct output. But Tesseract also fails on this image:
But as already mentioned by @Purfview, PaddleOCR works better with RGB images. One thing I also noticed is that there needs to be some "empty" space around the text. If the image is cropped completely to just the letters, it actually performs worse. Now the result is correct:
PaddleOCR supports Chinese, English, French, German, Korean and Japanese at the moment, but from what I've read online you can train it yourself for other languages. The big advantage of PaddleOCR shows, for example, in the first image from @Purfview, where the lighting is really bad: Here PaddleOCR gets a really good result:
whereas Tesseract gets it completely wrong:
Currently I'm also using VSF: I run the RGB images it creates through PaddleOCR with a script and automatically create an .srt file from the extracted data. |
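A workflow like this one could be sketched roughly as follows in Python. This is not the commenter's script. It assumes that the VSF image file names encode start and end times as `H_MM_SS_mmm__H_MM_SS_mmm_...`, and it takes the OCR step as a caller-supplied function (`recognize`, a hypothetical name) so that no particular PaddleOCR API is implied.

```python
import re
from pathlib import Path

# Assumed VSF naming scheme: H_MM_SS_mmm__H_MM_SS_mmm_<suffix>.<ext>
NAME_RE = re.compile(
    r"^(\d+)_(\d{2})_(\d{2})_(\d{3})__(\d+)_(\d{2})_(\d{2})_(\d{3})"
)


def parse_times(filename):
    """Return (start_ms, end_ms) parsed from a file name, or None."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
    start = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
    end = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
    return start, end


def format_timestamp(total_ms):
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    seconds, ms = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


def build_srt(image_paths, recognize):
    """Build SRT text; `recognize` maps an image path to its OCR text."""
    timed = []
    for path in image_paths:
        times = parse_times(Path(path).name)
        if times is None:
            continue  # skip files that do not follow the assumed naming
        text = recognize(path).strip()
        if text:
            timed.append((times[0], times[1], text))
    timed.sort()  # order subtitles by start time
    blocks = [
        f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        for index, (start, end, text) in enumerate(timed, start=1)
    ]
    return "\n".join(blocks)
```

In practice `recognize` would wrap whatever OCR call is used, and the returned string would be written to an .srt file.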
https://github.com/YaoFANGUK/video-subtitle-extractor |
Latest beta now has a very basic "Paddle OCR" implementation: https://github.com/SubtitleEdit/subtitleedit/releases/download/4.0.10/SubtitleEditBeta.zip Let me know how it works (requires that paddleocr is installed) - no progress/info when model is downloaded. |
Thank you for this suggestion. It works very well in extracting hard subs as srt. I'm amazed. |
I noticed that Tesseract performs really badly on some images, and especially badly on Chinese characters, for example. It was so bad that I looked for alternatives to Tesseract.
Then I came across PaddleOCR. In my own testing it performs better than Tesseract 5.5 in almost all cases, and in other languages like Chinese the difference is like night and day. The OCR results from Tesseract for this language were close to unusable, whereas PaddleOCR achieves close to perfect results.
So I would really appreciate the addition of this engine to SubtitleEdit. PaddleOCR also has GPU support, and in addition to being much better it was also substantially faster.
I think this is not an easy request, but it would be quite a big upgrade over Tesseract for some languages.
I include two examples, one in English and one in Chinese, in the following .zip, where Tesseract failed completely and PaddleOCR achieved a perfect result with no fine-tuning. The results from using SubtitleEdit and from running the engines from the command line are also included.
OCR comparison.zip