
[Feature request] Add PaddleOCR as an available OCR engine to Subtitle Edit #9204

timminator opened this issue Jan 11, 2025 · 12 comments

@timminator
I noticed that Tesseract would perform really badly on some images, especially on Chinese characters. It was so bad that I looked for alternatives to Tesseract.
Then I came across PaddleOCR. From my own testing it performs better than Tesseract 5.5 in almost all cases, and in languages like Chinese it's more like night and day. The OCR results from Tesseract for this language were close to unusable, whereas PaddleOCR achieves close-to-perfect results.
So I would really appreciate the addition of this engine to Subtitle Edit. PaddleOCR also has GPU support, and in addition to being far more accurate it was also substantially faster.

I think this is not an easy request, but it would be quite a big upgrade over Tesseract for some languages.
I include two examples in the following .zip, one in English and one in Chinese, where Tesseract failed completely and PaddleOCR achieved a perfect result with no fine-tuning. The results from running the engines in Subtitle Edit and from the command line are also included.
OCR comparison.zip

@niksedk
Member

niksedk commented Jan 11, 2025

A PR is welcome.
Pretty hard to see what is what for me...

@niksedk
Member

niksedk commented Jan 11, 2025

Is there a Windows Command line OCR download?

@timminator
Author

Setting up PaddleOCR is done via pip.

So to set it up you need to:

  1. Install Python
  2. Install paddlepaddle via:
    python -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
    More information can be found here and here. A CPU version is also available.
  3. Install setuptools: pip install setuptools
  4. Install paddleocr: pip install paddleocr

Afterwards it can be run from the command line for example via: paddleocr -h.

That is the reason I said that this request isn't so easy: we would need to check for quite a few prerequisites and set up the required tools correctly. From what I've seen, the GPU version also requires an NVIDIA graphics card, and the correct paddlepaddle version depends on the installed NVIDIA driver and the supported CUDA version. So quite a few checks are needed. But afterwards you can run it from the Windows command line.
But the results with PaddleOCR are really good, so I think it would be worth the time to get it into Subtitle Edit.
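The checks described above could be sketched roughly like this. This is just an illustration of the idea, not an existing Subtitle Edit API; the function name is my own, and the GPU check is only a heuristic:

```python
# Sketch of the prerequisite checks mentioned above: is Python available,
# are the paddlepaddle/paddleocr packages installed, and is an NVIDIA
# driver plausibly present (nvidia-smi on PATH)?
import importlib.util
import shutil


def check_paddleocr_prerequisites() -> dict:
    """Return a dict of prerequisite-name -> bool for this machine."""
    return {
        # A Python interpreter somewhere on PATH.
        "python": shutil.which("python") is not None
        or shutil.which("python3") is not None,
        # paddlepaddle installs the importable module "paddle".
        "paddlepaddle": importlib.util.find_spec("paddle") is not None,
        "paddleocr": importlib.util.find_spec("paddleocr") is not None,
        # Rough GPU hint: nvidia-smi suggests an NVIDIA driver is installed.
        "nvidia_driver": shutil.which("nvidia-smi") is not None,
    }


print(check_paddleocr_prerequisites())
```

A real integration would also need to match the paddlepaddle build against the installed CUDA version, which this sketch does not attempt.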

@niksedk
Member

niksedk commented Jan 12, 2025

I must be doing something wrong:

Image1

Gives

paddleocr --image_dir c:\temp\Image1.png --det false --lang en

[2025/01/12 11:12:00] ppocr INFO: **********c:\temp\Image1.png**********
[2025/01/12 11:12:01] ppocr INFO: ('Mhy mommy al mays sars.', 0.8831787109375)

This is very easy for all the other OCR engines...

@timminator
Author

I'm currently not on my PC so I can't test it myself right now.
But I always run it like this:
paddleocr --image_dir c:\temp\Image1.png --use_angle_cls true --lang en --show_log false
Maybe you could try this? I will try it myself later.

@Purfview
Contributor

Purfview commented Jan 12, 2025

I must be doing something wrong:

Those AI vision models usually don't work very well on binarized subs, those are mostly used on raw images.

Test it on (this image can't be properly binarized):

[image]

Here is raw and binarized (I think Chinese):

[image]

[image]

A few more examples extracted with SubsMask2Img [the first one is averaged]:

[image]

[image]

These raw images can be extracted with SubsMask2Img [part of InpaintDelogo] or VSF.

Btw, maybe some day I'll release a currently non-public command-line tool that does AI OCR.

@darnn

darnn commented Jan 12, 2025

@Purfview I for one would be very interested! Currently I use VideoSubFinder to extract the frames, and then either run them through FineReader, or compile them into PDFs with 50 images per file, and from Google Drive open them in Google Docs, which performs OCR on them without having to pay Google for it.
Those are the only two things, last I checked, that handle Hebrew, or non-binarized images, with any degree of accuracy.

@Purfview
Contributor

Purfview commented Jan 12, 2025

@darnn I don't know any public models for Hebrew, you would need to train one yourself.

@timminator
Author

@niksedk Your image is indeed not working with PaddleOCR; I'm also not getting a correct output. But Tesseract also fails on this image:

.\tesseract.exe "D:\temp\402347744-973f47fa-74c6-446a-a235-0c5889fe4b56.png" stdout -l eng
My mommy always sald
fhere were no monsters.

But as already mentioned by @Purfview, PaddleOCR works better with RGB images. One thing I also noticed is that there needs to be some "empty" space around the text; if the image is cropped tightly to just the letters, it actually performs worse.
I took a screenshot with some space around it, like this:

Screenshot 2025-01-12 154728

Now the result is correct:

paddleocr --image_dir "D:\temp\Screenshot 2025-01-12 154728.png" --use_angle_cls true --lang en --show_log false
[2025/01/12 16:10:28] ppocr INFO: **********D:\temp\Screenshot 2025-01-12 154728.png**********
[2025/01/12 16:10:29] ppocr INFO: [[[92.0, 56.0], [735.0, 60.0], [734.0, 118.0], [91.0, 113.0]], ('My mommy always said', 0.9907816052436829)]
[2025/01/12 16:10:29] ppocr INFO: [[[90.0, 149.0], [732.0, 153.0], [731.0, 198.0], [89.0, 194.0]], ('there were no monsters.', 0.9906995296478271)]
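Log lines in the format above can be parsed into usable data with a short script. A minimal sketch, assuming the `ppocr INFO:` log format shown here (the helper name is my own):

```python
# Parse paddleocr's logged results into (box, text, confidence) tuples.
# Header lines like "********path********" are skipped because they are
# not valid Python literals.
import ast
import re

LOG_RE = re.compile(r"ppocr INFO: (.+)$")


def parse_ppocr_lines(lines):
    results = []
    for line in lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        try:
            payload = ast.literal_eval(m.group(1))
        except (ValueError, SyntaxError):
            continue  # e.g. the "**********path**********" header line
        if isinstance(payload[0], str):
            # --det false mode: ('text', confidence), no bounding box
            results.append((None, payload[0], payload[1]))
        else:
            # detection mode: [box, ('text', confidence)]
            box, (text, confidence) = payload
            results.append((box, text, confidence))
    return results


log = [
    "[2025/01/12 16:10:29] ppocr INFO: **********D:\\temp\\shot.png**********",
    "[2025/01/12 16:10:29] ppocr INFO: [[[92.0, 56.0], [735.0, 60.0], "
    "[734.0, 118.0], [91.0, 113.0]], "
    "('My mommy always said', 0.9907816052436829)]",
]
print(parse_ppocr_lines(log))
```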

PaddleOCR supports Chinese, English, French, German, Korean and Japanese at the moment, but from what I've read online you can train it yourself for other languages.

The big advantage of PaddleOCR shows, for example, in the first image from @Purfview, where the lighting is really bad:

68747470733a2f2f692e696d6775722e636f6d2f676f786a344b422e706e67

Here PaddleOCR gets a really good result:

paddleocr --image_dir "D:\temp\68747470733a2f2f692e696d6775722e636f6d2f676f786a344b422e706e67.png" --use_angle_cls true --show_log false
[2025/01/12 18:31:11] ppocr INFO: *********D:\temp\68747470733a2f2f692e696d6775722e636f6d2f676f786a344b422e706e67.png**********
[2025/01/12 18:31:12] ppocr INFO: [[[113.0, 21.0], [567.0, 19.0], [567.0, 47.0], [113.0, 48.0]], ('et cesser de pretendre chercher', 0.9668408036231995)]
[2025/01/12 18:31:12] ppocr INFO: [[[141.0, 60.0], [539.0, 60.0], [539.0, 87.0], [141.0, 87.0]], ('qui se fait passer pour nous.', 0.9560474753379822)]

whereas Tesseract gets it completely wrong:

.\tesseract.exe "D:\temp\68747470733a2f2f692e696d6775722e636f6d2f676f786a344b422e706e67.png" stdout -l eng
Estimating resolution as 218
Detected 24 diacritics
e chercne
nous

Currently I'm also using VSF, and then a script puts the created RGB images through PaddleOCR and automatically creates a .srt file from the extracted data.
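The .srt-generation part of such a script can be sketched in a few lines. This is a minimal illustration, not the actual script; the timing tuples are hypothetical and would in practice come from VSF's frame timestamps:

```python
# Turn timed OCR results into SubRip (.srt) text.
# SRT timestamps use the form HH:MM:SS,mmm.

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 83.5 -> '00:01:23,500'."""
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def build_srt(entries):
    """entries: iterable of (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for i, (start, end, text) in enumerate(entries, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


print(build_srt([
    (0.0, 2.5, "My mommy always said"),
    (2.5, 5.0, "there were no monsters."),
]))
```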

@techguru0

https://github.com/YaoFANGUK/video-subtitle-extractor
I use this for hardsub extraction, and it works VERY well from what I've done so far,
and it uses PaddleOCR.

@niksedk
Copy link
Member

niksedk commented Jan 14, 2025

Latest beta now has a very basic "Paddle OCR" implementation: https://github.com/SubtitleEdit/subtitleedit/releases/download/4.0.10/SubtitleEditBeta.zip

Let me know how it works (requires that paddleocr is installed) - there is no progress/info while the model is downloaded.

@MbuguaDavid

https://github.com/YaoFANGUK/video-subtitle-extractor I use this for hardsub extraction, and it works VERY well from what I've done so far, and it uses PaddleOCR

Thank you for this suggestion. It works very well in extracting hard subs as srt. I'm amazed.
https://youtu.be/yCk5s5sPP5o
