Crash during alignment of near-30-sec audio sources #429

yoadsn · 2025-01-23T19:59:39Z

Hi - Thanks for the great library and work - You are awesome.

I would like to point out a crash I stumbled upon, with as much detail as I can provide.
Version - 2.18.1
Repro is not platform specific, or model specific, or engine specific - so should be easy to repro.
(Attached a sample audio file)

The repro is quite simple from user-land perspective - start with a simple align call:

import stable_whisper
model = stable_whisper.load_model('tiny')
tx = "anything"
model.align('repro_audio.mp3', tx)

This, in some specific scenario can cause a crash which I will do my best to describe below.

First - Call Stack would go to

_skip_nonspeech (/stable_whisper/non_whisper/alignment.py:904)
align (/stable_whisper/non_whisper/alignment.py:296)
align (/stable_whisper/alignment.py:221)

and return on:

stable-ts/stable_whisper/non_whisper/alignment.py

Line 905 in e7ff3dd

return

Where the code would detect a final (and only!) "non-speech" timing which, if taking into consideration the min-word duration ends after the audio end.
Code would look like this at this point:

        self._seek_sample = round(nonspeech_ends[0] * self.sample_rate)
        if self._seek_sample + (self.options.post.min_word_dur * self.sample_rate) >= curr_total_samples:
            self._seek_sample = curr_total_samples
            return # <---- This will return

        self._time_offset = self._seek_sample / self.sample_rate

Now, for an audio source which is 30s - it is the first iteration within the main loop of Aligner.align.
This loop will "continue" upon the first iteration since the _skip_nonspeech method above returns None.

stable-ts/stable_whisper/non_whisper/alignment.py

Line 298 in e7ff3dd

continue

                audio_segment = self._skip_nonspeech(audio_segment)
                if audio_segment is None:
                    continue # <---- continues here

At this point - the loop re-initializes, but the _seek_sample has been pushed by _skip_nonspeech to the end of the audio source. Thus, the _time_offset is at the end, and no more chunks are available to process. The loop will break.

stable-ts/stable_whisper/non_whisper/alignment.py

Line 292 in e7ff3dd

break

                self._time_offset = self._seek_sample / self.sample_rate
                audio_segment = self.audio_loader.next_chunk(self._seek_sample, self.n_samples)
                if audio_segment is None:
                    break # <--- Breaks since no more chunks

Finally, the pbar update outside the loop will reference last_ts which was never initialized - and will crash the align method.

            self._update_pbar(tqdm_pbar, last_ts, self.failure_count <= self.max_fail) # Will crash

Throws:
UnboundLocalError: local variable 'last_ts' referenced before assignment

So far my analysis of the crash. My speculation is that the logic that pushes the seek to the end might be flawed, since the audio sources where this happens - are indeed short, but do have speech on them. So there should be no reason to skip the first (and only) 30s.
The logic looks for the "first" non speech to see how much to "skip ahead" but when this non speech is the only one and at the end - something does not make sense.
I am not sure, since this part is not documented - but perhaps you will see it clearly.

        self._seek_sample = round(nonspeech_ends[0] * self.sample_rate) # This value + min word duration go over the total.
        if self._seek_sample + (self.options.post.min_word_dur * self.sample_rate) >= curr_total_samples:
            self._seek_sample = curr_total_samples
            return

I am attaching a sample audio file (in Hebrew) that reproduces this (regardless of text to be aligned). I call the align with default settings and with the "tiny" open ai mode, on a cpu.
https://drive.google.com/file/d/11ZcSy0AUSZGZ3qQqbBEoqYVcwO6uY9lZ/view?usp=sharing

The text was updated successfully, but these errors were encountered:

-fixed `nonspeech_skip` causing alignment to skip sections of speech -fixed "'last_ts' referenced before assignment" error for alignment (#429)

jianfch · 2025-01-23T22:14:04Z

Thanks for the detailed report.
This issue should fixed in fc0d0da.

yoadsn · 2025-01-24T07:13:18Z

Thank you @jianfch for the quick fix!
Looks great from reading the code and running local tests.
Also - I really appreciate the extra code docs! Will help in the future with contributing back.

yoadsn · 2025-01-29T10:48:50Z

Dear Friend,
This bug fix not only affects the crash - but also prevents weird "skipping" in long form alignment.
So I am so glad you have created a fix - my one request, can this be released (2.8.3 perhaps).
We would love to have this fix officially deployed on our pipeline.

jianfch · 2025-01-29T15:34:54Z

Good call. Alignment is a major feature so this fix justifies a minor version bump.

jianfch added a commit that referenced this issue Jan 23, 2025

fixed nonspeech_skip

fc0d0da

-fixed `nonspeech_skip` causing alignment to skip sections of speech -fixed "'last_ts' referenced before assignment" error for alignment (#429)

yoadsn closed this as completed Jan 24, 2025

yoadsn reopened this Jan 29, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crash during alignment of near-30-sec audio sources #429

Crash during alignment of near-30-sec audio sources #429

yoadsn commented Jan 23, 2025

jianfch commented Jan 23, 2025

yoadsn commented Jan 24, 2025

yoadsn commented Jan 29, 2025

jianfch commented Jan 29, 2025

Crash during alignment of near-30-sec audio sources #429

Crash during alignment of near-30-sec audio sources #429

Comments

yoadsn commented Jan 23, 2025

jianfch commented Jan 23, 2025

yoadsn commented Jan 24, 2025

yoadsn commented Jan 29, 2025

jianfch commented Jan 29, 2025