
[Feature Request] OCR Video Frame to Text #8474

Open
zeem12 opened this issue Jun 4, 2024 · 2 comments
Comments


zeem12 commented Jun 4, 2024

Yes, I’ve checked out issue #340 and others, but I think my concept is a bit different from what has been requested before.

Right now, I’m subbing a Japanese TV program. As you might know, they often use large subtitles or colorful captions (the image below is just an example).

[Example frame showing a large on-screen caption]

To speed up my workflow, I usually take a screenshot, run it through an OCR engine like Google Lens, and paste the result back into Subtitle Edit. But I was wondering: could this be done faster, in a single step?

My idea is a feature similar to the Whisper integration, but instead of converting the audio of the selected lines to text, it would take one video frame from the start/middle/end of each selected line and convert it to text with Tesseract or another OCR engine.

So it wouldn’t have to detect the start and end of hardcoded subtitles automatically by scanning every frame; it would just extract one frame from each selected line (which can be done with ffmpeg, as in the Whisper selected-lines feature) and then OCR it, roughly as sketched below.
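
Just to illustrate the idea, here is a minimal sketch of the flow outside Subtitle Edit, assuming ffmpeg and Tesseract are installed and on PATH. The file names, timestamp, and language code are only placeholders for this example, not anything from Subtitle Edit’s code:

```python
import subprocess

def ocr_frame(video_path: str, timestamp: str, lang: str = "jpn") -> str:
    """Grab one frame at `timestamp` with ffmpeg, then OCR it with Tesseract."""
    frame_png = "frame.png"
    # Putting -ss before -i makes ffmpeg seek quickly; -frames:v 1 writes exactly one frame.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", timestamp, "-i", video_path, "-frames:v", "1", frame_png],
        check=True, capture_output=True,
    )
    # Using "stdout" as the output base makes Tesseract print the recognized text
    # instead of writing a .txt file.
    result = subprocess.run(
        ["tesseract", frame_png, "stdout", "-l", lang],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

# e.g. OCR the midpoint of a selected line running from 00:01:23.400 to 00:01:25.800
print(ocr_frame("episode01.mp4", "00:01:24.600"))
```

Subtitle Edit would just do this per selected line and put the OCR result into that line’s text, the same way the Whisper feature fills in transcribed audio.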


I think this could be a game-changing feature, especially for those who work with Japanese/Korean TV and variety shows, which use a lot of on-screen text.


epubc commented Jun 4, 2024

Have you tried the VideoSubFinder program? I think it meets your requirements.


zeem12 commented Jun 4, 2024

> Have you tried the VideoSubFinder program? I think it meets your requirements.

I’ve already tried VideoSubFinder, but it’s not efficient enough for my needs. I’m hoping to do everything in one window within Subtitle Edit itself, similar to how we transcribe audio from selected lines with Whisper in a single click. I’d also like more flexibility to manually crop and time the subtitles.

VideoSubFinder captures all text within a constant “capture window”, and the timing is automatically generated. However, captions in Japanese TV programs don’t usually have a fixed or constant position. I also don’t need VideoSubFinder’s auto timing feature since I prefer to time the subtitles manually. Moreover, I only need to OCR some of the hardcoded captions that I select, not all of them automatically like VideoSubFinder does.

The feature I’m requesting would also benefit teams and collaborations. For instance, one person could create the timing for the hardcoded captions, and another could extract and translate the text based on the timing that’s already been set. This could really streamline the process and make collaborative work much more efficient.
