integrate srt reading for diarization, splitting and speech recognition #18

Draft

zobadaniel wants to merge 1 commit into Softcatala:main from ZalozbaDev:use_srt_as_input_for_annotate_diarization_stt

Conversation

@zobadaniel

This PR adds support for specifying a speaker-annotated .srt file as input to the dubbing process.

The steps of audio chunking, speaker diarization and speech-to-text will not be performed on the audio; instead, the information from the .srt file will be used.

The process relies on a .srt file with the following properties:

  • only one line of text per subtitle entry
  • a speaker annotation at the beginning of each line

Example:

5
00:00:13,480 --> 00:00:17,920
[SPEAKER_01]: Deswegen ist er der Kapitän der englischen Nationalmannschaft.

6
00:00:18,039 --> 00:00:21,320
[SPEAKER_01]: Er ist als Spieler sehr gereift und dominiert das Spielgeschehen.

The code uses pysrt for reading the subtitle file.
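
Below is a minimal sketch (not the PR's actual code) of how such a file could be read with pysrt and turned into (start, end, speaker, text) segments; the regular expression and helper names are illustrative assumptions, not part of this change.

import re
import pysrt

# Matches lines of the form "[SPEAKER_01]: some text".
SPEAKER_RE = re.compile(r"^\[(?P<speaker>[^\]]+)\]:\s*(?P<text>.*)$")

def to_seconds(t):
    # pysrt's SubRipTime exposes hours/minutes/seconds/milliseconds.
    return t.hours * 3600 + t.minutes * 60 + t.seconds + t.milliseconds / 1000.0

def read_annotated_srt(path):
    # Yield (start, end, speaker, text) for every speaker-annotated entry.
    for item in pysrt.open(path):
        match = SPEAKER_RE.match(item.text.strip())
        if not match:
            continue  # entry without a speaker prefix, skip it
        yield (to_seconds(item.start), to_seconds(item.end),
               match.group("speaker"), match.group("text"))

For the first example entry above, this would yield (13.48, 17.92, "SPEAKER_01", "Deswegen ist er der Kapitän der englischen Nationalmannschaft.").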

Please let me know what needs to be changed to have this merged.

@jordimas
Collaborator

Hello,
Could you please explain the use case for this? In what scenario would this workflow be useful for the end user?
Thanks!

@zobadaniel
Author

For some movies, subtitles are already available, often with the main speakers already marked. Here is a link describing how one of the German broadcasters does it: https://www.ard.de/die-ard/EBU-TT-D-Basic-DE-XML-Format-fuer-die-Distribution-von-Untertiteln-in-den-ARD-Mediatheken-102.pdf

Based on that, you can easily obtain an .srt file for the video (this tool, for instance, already does it for you: https://mediathekview.de/). Conversion from that format to the one described here is straightforward; I'd add a Python script in a separate PR if this functionality is accepted in general.

The main advantage I see over automated segmentation, diarization and speech recognition is the amount of text. These subtitles are not word-for-word transcripts but somewhat condensed. This makes it much easier to fit the synthesized target-language audio into the foreseen timeslot, and thus requires fewer or less drastic audio speed-ups.
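
As a purely illustrative example of the effect on speed-ups (the slot is taken from the subtitle example above; the synthesized duration is made up):

slot_seconds = 17.92 - 13.48                    # 4.44 s available for this entry
synthesized_seconds = 5.2                       # hypothetical length of the TTS output
speedup = synthesized_seconds / slot_seconds    # ~1.17, i.e. play ~17% faster

Shorter, condensed subtitle text typically yields shorter synthesized audio, pushing this factor closer to 1.0.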
