
[Feature request] Add PaddleOCR as an available OCR engine to Subtitle Edit #9204

timminator opened this issue Jan 11, 2025 · 12 comments

@timminator
I noticed that Tesseract would perform really badly on some images, especially on Chinese characters. It was so bad that I looked for alternatives to Tesseract.
Then I came across PaddleOCR. From my own testing it performs better than Tesseract 5.5 in almost all cases, and in languages like Chinese it's more like night and day. The OCR results from Tesseract for this language were close to unusable, whereas PaddleOCR achieves close-to-perfect results.
So I would really appreciate the addition of this engine to Subtitle Edit. PaddleOCR also has GPU support, and in addition to being far more accurate it was also substantially faster.

I think this is not an easy request, but it would be quite a big upgrade over Tesseract for some languages.
I include two examples in the following .zip, one in English and one in Chinese, where Tesseract failed completely and PaddleOCR achieved a perfect result with no fine-tuning. The results from running the engines in Subtitle Edit and from the command line are also included.
OCR comparison.zip

@niksedk
Member

niksedk commented Jan 11, 2025

A PR is welcome.
Pretty hard to see what is what for me...

@niksedk
Member

niksedk commented Jan 11, 2025

Is there a Windows Command line OCR download?

@timminator
Author

Setting up PaddleOCR is done via pip.

So to set it up you need to:

  1. Install Python
  2. Install paddlepaddle via:
    python -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
    More information can be found here and here. A CPU version is also available.
  3. Install setuptools: pip install setuptools
  4. Install paddleocr: pip install paddleocr

Afterwards it can be run from the command line for example via: paddleocr -h.

That is the reason I said that this request isn't so easy: we would need to check for quite a few prerequisites and set up the required tools correctly. From what I've seen, the GPU version also requires an NVIDIA graphics card, and the correct paddlepaddle version depends on the installed NVIDIA driver and the supported CUDA version. So quite a few checks are needed. But afterwards you can run it from the Windows command line.
But the results with PaddleOCR are really good, so I think it would be worth the time to get it into Subtitle Edit.
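The checks described above could be sketched roughly like this. This is just an illustration of the idea, not an existing Subtitle Edit API; the function name is my own, and the GPU check is only a heuristic:

```python
# Sketch of the prerequisite checks mentioned above: is Python available,
# are the paddlepaddle/paddleocr packages installed, and is an NVIDIA
# driver plausibly present (nvidia-smi on PATH)?
import importlib.util
import shutil


def check_paddleocr_prerequisites() -> dict:
    """Return a dict of prerequisite-name -> bool for this machine."""
    return {
        # A Python interpreter somewhere on PATH.
        "python": shutil.which("python") is not None
        or shutil.which("python3") is not None,
        # paddlepaddle installs the importable module "paddle".
        "paddlepaddle": importlib.util.find_spec("paddle") is not None,
        "paddleocr": importlib.util.find_spec("paddleocr") is not None,
        # Rough GPU hint: nvidia-smi suggests an NVIDIA driver is installed.
        "nvidia_driver": shutil.which("nvidia-smi") is not None,
    }


print(check_paddleocr_prerequisites())
```

A real integration would also need to match the paddlepaddle build against the installed CUDA version, which this sketch does not attempt.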

@niksedk
Member

niksedk commented Jan 12, 2025

I must be doing something wrong:

Image1

Gives

paddleocr --image_dir c:\temp\Image1.png --det false --lang en

[2025/01/12 11:12:00] ppocr INFO: **********c:\temp\Image1.png**********
[2025/01/12 11:12:01] ppocr INFO: ('Mhy mommy al mays sars.', 0.8831787109375)

This is very easy for all the other OCR engines...

@timminator
Author

I'm currently not on my PC so I can't test it myself right now.
But I always run it like this:
paddleocr --image_dir c:\temp\Image1.png --use_angle_cls true --lang en --show_log false
Maybe you could try this? I will try it myself later.

@Purfview
Contributor

Purfview commented Jan 12, 2025

I must be doing something wrong:

Those AI vision models usually don't work very well on binarized subs, those are mostly used on raw images.

Test it on (this image can't be properly binarized):

[image]

Here is raw and binarized (I think Chinese):

[image]

[image]

A few more examples extracted with SubsMask2Img [the first one is averaged]:

[image]

[image]

These raw images can be extracted with SubsMask2Img [part of InpaintDelogo] or VSF.

Btw, maybe some day I'll release a currently non-public command-line tool that does AI OCR.

@darnn

darnn commented Jan 12, 2025

@Purfview I for one would be very interested! Currently I use VideoSubFinder to extract the frames, and then either run them through FineReader, or compile them into PDFs with 50 images per file, and from Google Drive open them in Google Docs, which performs OCR on them without having to pay Google for it.
Those are the only two things, last I checked, that handle Hebrew, or non-binarized images, with any degree of accuracy.

@Purfview
Contributor

Purfview commented Jan 12, 2025

@darnn I don't know any public models for Hebrew, you would need to train one yourself.

@timminator
Author

@niksedk Your image is indeed not working with PaddleOCR; I'm also not getting a correct output. But Tesseract also fails on this image:

.\tesseract.exe "D:\temp\402347744-973f47fa-74c6-446a-a235-0c5889fe4b56.png" stdout -l eng
My mommy always sald
fhere were no monsters.

But as already mentioned by @Purfview, PaddleOCR works better with RGB images. One thing I also noticed is that there needs to be some "empty" space around the text; if the image is cropped tightly to just the letters, it actually performs worse.
I took a screenshot with some space around it, like this:

Screenshot 2025-01-12 154728

Now the result is correct:

paddleocr --image_dir "D:\temp\Screenshot 2025-01-12 154728.png" --use_angle_cls true --lang en --show_log false
[2025/01/12 16:10:28] ppocr INFO: **********D:\temp\Screenshot 2025-01-12 154728.png**********
[2025/01/12 16:10:29] ppocr INFO: [[[92.0, 56.0], [735.0, 60.0], [734.0, 118.0], [91.0, 113.0]], ('My mommy always said', 0.9907816052436829)]
[2025/01/12 16:10:29] ppocr INFO: [[[90.0, 149.0], [732.0, 153.0], [731.0, 198.0], [89.0, 194.0]], ('there were no monsters.', 0.9906995296478271)]
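Log lines in the format above can be parsed into usable data with a short script. A minimal sketch, assuming the `ppocr INFO:` log format shown here (the helper name is my own):

```python
# Parse paddleocr's logged results into (box, text, confidence) tuples.
# Header lines like "********path********" are skipped because they are
# not valid Python literals.
import ast
import re

LOG_RE = re.compile(r"ppocr INFO: (.+)$")


def parse_ppocr_lines(lines):
    results = []
    for line in lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        try:
            payload = ast.literal_eval(m.group(1))
        except (ValueError, SyntaxError):
            continue  # e.g. the "**********path**********" header line
        if isinstance(payload[0], str):
            # --det false mode: ('text', confidence), no bounding box
            results.append((None, payload[0], payload[1]))
        else:
            # detection mode: [box, ('text', confidence)]
            box, (text, confidence) = payload
            results.append((box, text, confidence))
    return results


log = [
    "[2025/01/12 16:10:29] ppocr INFO: **********D:\\temp\\shot.png**********",
    "[2025/01/12 16:10:29] ppocr INFO: [[[92.0, 56.0], [735.0, 60.0], "
    "[734.0, 118.0], [91.0, 113.0]], "
    "('My mommy always said', 0.9907816052436829)]",
]
print(parse_ppocr_lines(log))
```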

PaddleOCR supports Chinese, English, French, German, Korean and Japanese at the moment, but from what I've read online you can train it yourself for other languages.

The big advantage of PaddleOCR shows, for example, in the first image from @Purfview, where the lighting is really bad:

68747470733a2f2f692e696d6775722e636f6d2f676f786a344b422e706e67

Here PaddleOCR gets a really good result:

paddleocr --image_dir "D:\temp\68747470733a2f2f692e696d6775722e636f6d2f676f786a344b422e706e67.png" --use_angle_cls true --show_log false
[2025/01/12 18:31:11] ppocr INFO: *********D:\temp\68747470733a2f2f692e696d6775722e636f6d2f676f786a344b422e706e67.png**********
[2025/01/12 18:31:12] ppocr INFO: [[[113.0, 21.0], [567.0, 19.0], [567.0, 47.0], [113.0, 48.0]], ('et cesser de pretendre chercher', 0.9668408036231995)]
[2025/01/12 18:31:12] ppocr INFO: [[[141.0, 60.0], [539.0, 60.0], [539.0, 87.0], [141.0, 87.0]], ('qui se fait passer pour nous.', 0.9560474753379822)]

whereas Tesseract gets it completely wrong:

.\tesseract.exe "D:\temp\68747470733a2f2f692e696d6775722e636f6d2f676f786a344b422e706e67.png" stdout -l eng
Estimating resolution as 218
Detected 24 diacritics
e chercne
nous

Currently I'm also using VSF, and then a script puts the created RGB images through PaddleOCR and automatically creates a .srt file from the extracted data.
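The .srt-generation part of such a script can be sketched in a few lines. This is a minimal illustration, not the actual script; the timing tuples are hypothetical and would in practice come from VSF's frame timestamps:

```python
# Turn timed OCR results into SubRip (.srt) text.
# SRT timestamps use the form HH:MM:SS,mmm.

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 83.5 -> '00:01:23,500'."""
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def build_srt(entries):
    """entries: iterable of (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for i, (start, end, text) in enumerate(entries, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


print(build_srt([
    (0.0, 2.5, "My mommy always said"),
    (2.5, 5.0, "there were no monsters."),
]))
```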

@techguru0

https://github.com/YaoFANGUK/video-subtitle-extractor
I use this for hardsub extraction, and it works VERY well from what I've done so far,
and it uses PaddleOCR.

@niksedk
Copy link
Member

niksedk commented Jan 14, 2025

Latest beta now has a very basic "Paddle OCR" implementation: https://github.com/SubtitleEdit/subtitleedit/releases/download/4.0.10/SubtitleEditBeta.zip

Let me know how it works (requires that paddleocr is installed) - there is no progress/info while the model is downloaded.

@MbuguaDavid

https://github.com/YaoFANGUK/video-subtitle-extractor I use this for hardsub extraction, and it works VERY well from what I've done so far, and it uses PaddleOCR

Thank you for this suggestion. It works very well in extracting hard subs as srt. I'm amazed.
https://youtu.be/yCk5s5sPP5o
