Bug description
When using Coqui-XTTS for dubbing, the final video (video_dub.mp4) is generated
with only ~0.02s of audio, making it effectively silent. The video track is correct.
Root cause
The final ffmpeg mux command re-encodes audio_mix.mp3 to AAC:
ffmpeg -i Video.mp4 -i audio_mix.mp3 -c:v copy -c:a aac -map 0:v -map 1:a -shortest video_dub.mp4
The audio_mix.mp3 file is valid (ffprobe shows correct duration, ~198s), but the
MP3→AAC transcoding produces only 0.02s in the output container. This appears to be
an ffmpeg compatibility issue with the MP3 header generated during the audio mixing step.
Diagnostic evidence
Using ffprobe on the intermediate files:
| File | Codec | Duration | Status |
|---|---|---|---|
| audio_dub_solo.ogg | pcm_s32le, 41000 Hz | 198.55s | ✅ OK |
| audio_mix.mp3 | mp3, 44100 Hz | 198.58s | ✅ OK |
| audio_Voiceless.wav | pcm_s16le, 44100 Hz | 190.12s | ✅ OK |
| video_dub.mp4 (audio stream) | aac, 44100 Hz | 0.02s | ❌ Bug |
The audio source files are all correct. The problem occurs specifically during the
MP3→AAC transcode in the mux step.
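The ffprobe check can be scripted so that the truncation is caught automatically. The following is a minimal sketch, not code from SoniTranslate: the helper names (`build_probe_cmd`, `parse_probe_output`, `probe_audio`, `is_truncated`) and the 50% threshold are assumptions made for illustration. It reads the codec, sample rate, and duration of the first audio stream and flags an output whose audio is much shorter than its source.

```python
import json
import subprocess


def build_probe_cmd(path):
    """Return an ffprobe command that reports the first audio stream as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a:0",
        "-show_entries", "stream=codec_name,sample_rate,duration:format=duration",
        "-of", "json",
        path,
    ]


def parse_probe_output(text):
    """Extract (codec, sample_rate, duration_seconds) from ffprobe JSON output."""
    info = json.loads(text)
    streams = info.get("streams") or [{}]
    stream = streams[0]
    # Some containers report the duration only at the format level.
    raw = stream.get("duration") or info.get("format", {}).get("duration")
    try:
        duration = float(raw)
    except (TypeError, ValueError):
        duration = None  # missing or "N/A"
    rate = stream.get("sample_rate")
    return stream.get("codec_name"), int(rate) if rate else None, duration


def probe_audio(path):
    """Run ffprobe on a file (requires ffprobe on PATH)."""
    result = subprocess.run(
        build_probe_cmd(path), capture_output=True, text=True, check=True
    )
    return parse_probe_output(result.stdout)


def is_truncated(source_duration, output_duration, min_ratio=0.5):
    """True when the output audio is shorter than min_ratio of the source.

    A ratio is used instead of an exact match because -shortest may
    legitimately trim the audio to the video length. The 0.5 threshold
    is an arbitrary illustrative choice.
    """
    if source_duration is None or output_duration is None:
        return True  # an unknown duration is treated as suspicious
    return output_duration < min_ratio * source_duration
```

With the durations from the table, `is_truncated(198.58, 0.02)` flags the broken output, while an output trimmed to roughly the video length would pass.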
Note: audio_dub_solo.ogg uses pcm_s32le at 41000 Hz, which is non-standard for
OGG (normally Vorbis/Opus at 44100/48000 Hz). This unusual format may contribute to
the MP3 header issue downstream.
Suggested fix
Convert audio_mix.mp3 to WAV before the final mux. WAV→AAC transcoding works
correctly. The fix is a ~4 line change before the final ffmpeg call:
# Convert MP3 to WAV before muxing (fixes MP3→AAC transcode failure)
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y", "-i", "audio_mix.mp3",
        "-acodec", "pcm_s16le", "-ar", "44100", "-ac", "2",
        "audio_mix.wav",
    ],
    capture_output=True,
    check=True,  # raise CalledProcessError if ffmpeg fails instead of continuing silently
)
# Then use audio_mix.wav instead of audio_mix.mp3 in the final mux command
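For completeness, the final mux with the WAV input might look like the sketch below. It reuses the flags of the mux command quoted under "Root cause" and only swaps the audio input; `-y` is added here to overwrite an existing output. The function names `build_mux_cmd` and `mux` are hypothetical, since how SoniTranslate assembles this command internally is not shown in this report.

```python
import subprocess


def build_mux_cmd(video_path, audio_path, output_path):
    """Build the final mux command: copy the video stream, encode audio to AAC."""
    return [
        "ffmpeg", "-y",            # -y is an addition: overwrite output if present
        "-i", video_path,
        "-i", audio_path,
        "-c:v", "copy",
        "-c:a", "aac",
        "-map", "0:v", "-map", "1:a",
        "-shortest",
        output_path,
    ]


def mux(video_path, audio_path, output_path):
    """Run the mux and raise CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(
        build_mux_cmd(video_path, audio_path, output_path),
        capture_output=True,
        check=True,
    )


# Example call, using the file names from this report:
# mux("Video.mp4", "audio_mix.wav", "video_dub.mp4")
```

Keeping the command builder separate from the call that runs it makes it easy to log or inspect the exact ffmpeg invocation when the output is wrong.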
Environment
- Platform: Google Colab (free tier)
- TTS engine: Coqui-XTTS
- SoniTranslate version: latest from Colab notebook
- ffmpeg version: (Colab default)
Steps to reproduce
- Open the SoniTranslate Colab notebook
- Select Coqui-XTTS as the TTS engine
- Process any video with dubbing
- The output video (video_dub.mp4) has only ~0.02s of audio