Bugfix: Illogical "Avoid computing higher temperatures on no_speech" #1903

Merged: 4 commits, Dec 1, 2024


@Purfview
Contributor

Purfview commented Dec 17, 2023

Bugfix for #1279

The bug: #1279 treats a segment as "silence" even when decoding has failed due to compression_ratio_threshold [+no_speech_threshold], although further down the code such a segment is no longer considered "silence".

A segment should count as "silence" only when decoding has failed due to logprob_threshold [+no_speech_threshold].

As described in the help text here:

parser.add_argument("--no_speech_threshold", type=optional_float, default=0.6, help="if the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence")

And in the code here:

if no_speech_threshold is not None:
    # no voice activity check
    should_skip = result.no_speech_prob > no_speech_threshold
    if (
        logprob_threshold is not None
        and result.avg_logprob > logprob_threshold
    ):
        # don't skip if the logprob is high enough, despite the no_speech_prob
        should_skip = False
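
To make the mismatch concrete, here is a minimal sketch of the silence override in the fallback check, before and after this change. It is a simplified illustration and not the actual diff: `Result`, `needs_fallback`, and the `fixed` flag are hypothetical names, and the threshold defaults are assumed to be the usual CLI defaults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    # Hypothetical stand-in for the decoding-result fields used here.
    avg_logprob: float
    no_speech_prob: float
    compression_ratio: float


def needs_fallback(
    result: Result,
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
    no_speech_threshold: Optional[float] = 0.6,
    fixed: bool = True,
) -> bool:
    """Return True if a higher temperature should be tried."""
    fallback = False
    if (
        compression_ratio_threshold is not None
        and result.compression_ratio > compression_ratio_threshold
    ):
        fallback = True  # output is too repetitive
    logprob_failed = (
        logprob_threshold is not None
        and result.avg_logprob < logprob_threshold
    )
    if logprob_failed:
        fallback = True  # average log probability is too low
    if (
        no_speech_threshold is not None
        and result.no_speech_prob > no_speech_threshold
    ):
        # Old behaviour (fixed=False): any high no_speech_prob counts as silence.
        # New behaviour (fixed=True): silence also requires the logprob failure.
        if not fixed or logprob_failed:
            fallback = False
    return fallback


# Made-up numbers: a repetitive hallucination with a confident logprob.
loop = Result(avg_logprob=-0.3, no_speech_prob=0.9, compression_ratio=6.68)
print(needs_fallback(loop, fixed=False), needs_fallback(loop, fixed=True))
```

With these made-up values the old check skips the fallback although the segment would not be treated as silence later, while the fixed check retries at a higher temperature. A segment that fails on logprob_threshold with a high no_speech_prob is still treated as silence in both variants.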

@Purfview
Contributor Author

Related: SYSTRAN/faster-whisper#621

@Purfview
Contributor Author

Purfview commented Dec 17, 2023

I think this bug can trigger hallucination loops: on some hallucinations it prevents the prompt reset on high temperature, because higher temperatures are not computed for what is not actual "silence".

@Purfview
Contributor Author

@TheoBoyer @jongwook, it would be great if you could have a look.

@TheoBoyer
Contributor

This change is consistent with the rest of the code, so I'm not against it.

The original PR indeed skipped processing based on the logprob_threshold, but it was also contingent on logprob_threshold being set. @jongwook modified this. I assume the intention was to make the process independent of whether a threshold is set, but there may be reasons for this change that I'm unaware of.

However, I'm skeptical about involving logprob_threshold in silence discrimination in the first place.
The approach figure in the original paper clearly shows that there shouldn't be any decoding after no_speech.

[Image: "Approach" figure from the original paper]

PR #1279 was created because no_speech does not depend on token decoding; hence, regardless of the tokens decoded, no_speech_prob will remain unchanged.

In the (too) few experiments I conducted, the model seemed capable of hallucinating high-probability tokens during silences. It would be beneficial if someone could further investigate the relevance of incorporating logprob_threshold in silence discrimination. I'm also interested to know if any related experiments already exist.

@Purfview
Contributor Author

Purfview commented Dec 18, 2023

> However, I'm skeptical about involving logprob_threshold in silence discrimination in the first place.

no_speech_threshold alone is pretty unreliable; the model can generate a no_speech_prob close to 1.0 on perfectly fine speech.

@Purfview
Contributor Author

> I think this bug can trigger hallucination loops: on some hallucinations it prevents the prompt reset on high temperature, because higher temperatures are not computed for what is not actual "silence".

My guess was right. Today I encountered one:

DEBUG: Compression ratio threshold is not met with temperature 0.0 (6.677966 > 2.400000)
[04:17.320 --> 04:29.020]  been doing it for a long time. I'm a professional. I'm a professional. I'm a
[04:29.020 --> 04:29.340]  professional. I'm a professional. I'm a professional. I'm a professional. I'm
[04:29.340 --> 04:34.560]  a professional. I'm a professional. I'm a professional. I'm a professional. I'm
[04:34.560 --> 04:38.360]  a professional. I'm a professional. I'm a professional. I'm a professional. I'm
[04:38.360 --> 05:03.750]  a professional. I'm a professional. I'm a professional. I'm a professional. I'm

No hallucination loop with this bugfix:

DEBUG: Compression ratio threshold is not met with temperature 0.0 (6.677966 > 2.400000)
DEBUG: Compression ratio threshold is not met with temperature 0.2 (8.533333 > 2.400000)
DEBUG: Compression ratio threshold is not met with temperature 0.4 (8.884615 > 2.400000)
[04:17.320 --> 04:22.640]  got me feeling natural. Finding a natural-seeming way to fail at any given task.
[04:23.700 --> 04:27.140]  In each of the commercials that I'm in, I'm the one who simply can't go on
[04:27.140 --> 04:33.340]  without the product. It's ridiculous that we don't have the product. Show them.
DEBUG: Reset prompt. prompt_reset_on_temperature threshold is met 0.600000 > 0.500000
DEBUG: Log probability threshold is not met with temperature 0.0 (-1.344815 < -1.000000)
DEBUG: Log probability threshold is not met with temperature 0.2 (-1.150256 < -1.000000)
[04:33.340 --> 04:35.340]  No, you shouldn't.
[04:36.020 --> 04:36.300]  Please.
[04:36.560 --> 04:37.520]  You wanna see?
[04:38.020 --> 04:39.080]  Yeah, I wanna see.
[04:43.260 --> 04:44.120]  She's amazing.
[05:03.870 --> 05:05.110]  I just...
[05:05.110 --> 05:05.650]  I...

Bugfix for openai#1279

It's "silence" when decoding has failed due to `compression_ratio_threshold` too, when further down the code it's not "silence" anymore.

"Silence" should be only when decoding has failed due to `logprob_threshold`.

Like described there:
https://github.com/openai/whisper/blob/8bc8860694949db53c42ba47ddc23786c2e02a8b/whisper/transcribe.py#L421

And in code there:
https://github.com/openai/whisper/blob/8bc8860694949db53c42ba47ddc23786c2e02a8b/whisper/transcribe.py#L243-L251
@Purfview
Contributor Author

Another example of hallucination fix: #1962

@Purfview
Contributor Author

Purfview commented Nov 29, 2024

@jongwook Why is this bugfix still not merged?

Maybe it's confusing; read the description of #1279:

> In decode_with_fallback, we compute higher temperatures in the case where compression_ratio is too high or avg_logprob is too low. But as the computation of no_speech_prob doesn't depend on sampling, we can avoid computing higher temperatures if we detect in the first one that the no_speech condition is fulfilled.

This PR retains the full functionality described in the quote above, and fixes the #1279 bug where computing higher temperatures is skipped even when the no_speech condition is not fulfilled; it should be skipped only when the condition is fulfilled.

That bug can cause hallucination loops, and it is probably responsible for a large portion of the reported hallucinations.
As I understand it, the sole reason for the fallback is to recover from hallucinations, and this bug prevents that.
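
For illustration, here is a rough sketch of how the early "silence" exit interacts with the temperature fallback and the prompt reset. It assumes a temperature schedule of 0.0 to 1.0 in steps of 0.2 and the 0.5 reset threshold seen in the debug log above. `Decoder`, `fake_decode`, the `fixed` flag, and all decoder outputs are made up for the example and are not taken from the real code.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical decoder: maps a temperature to
# (compression_ratio, avg_logprob, no_speech_prob).
Decoder = Callable[[float], Tuple[float, float, float]]


def decode_with_fallback(
    decode: Decoder,
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
    no_speech_threshold: float = 0.6,
    fixed: bool = True,
) -> float:
    """Return the temperature whose result is accepted."""
    used = temperatures[0]
    for t in temperatures:
        used = t
        compression_ratio, avg_logprob, no_speech_prob = decode(t)
        too_repetitive = compression_ratio > compression_ratio_threshold
        low_logprob = avg_logprob < logprob_threshold
        retry = too_repetitive or low_logprob
        # Silence override: the old logic (fixed=False) ignores the logprob check.
        if no_speech_prob > no_speech_threshold and (low_logprob or not fixed):
            retry = False
        if not retry:
            break
    return used


def fake_decode(t: float) -> Tuple[float, float, float]:
    # Made-up behaviour: repetitive output below temperature 0.6,
    # a confident logprob, and a high no_speech_prob throughout.
    return (6.7 if t < 0.6 else 1.5, -0.4, 0.9)


PROMPT_RESET_ON_TEMPERATURE = 0.5  # value seen in the debug log above

for fixed in (False, True):
    t = decode_with_fallback(fake_decode, fixed=fixed)
    reset = t > PROMPT_RESET_ON_TEMPERATURE
    print(f"fixed={fixed}: accepted temperature {t}, reset prompt: {reset}")
```

In this toy run the old logic accepts the repetitive result at temperature 0.0, so the prompt is never reset and the loop can feed itself. The fixed logic keeps retrying until 0.6, which also crosses the reset threshold.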

@jongwook jongwook merged commit 90db0de into openai:main Dec 1, 2024
9 checks passed
@Purfview Purfview deleted the patch-1 branch December 1, 2024 07:11
joelvaneenwyk pushed a commit to joelvaneenwyk/whisper that referenced this pull request Dec 31, 2024
…penai#1903)

* Bugfix: Illogical "Avoid computing higher temperatures on no_speech"

* Fix if "logprob_threshold=None"

---------

Co-authored-by: Jong Wook Kim <[email protected]>