
Improve decoding #291

Merged

ggerganov merged 23 commits into master from decoding on Jan 15, 2023
Conversation

ggerganov (Owner) commented Dec 18, 2022

ref #278 #133 #172 #255 #270

The goal of this PR is to reach OpenAI decoding parity and potentially go beyond it.

Several ideas for improving the decoding strategy will be explored.
These ideas may also improve segment and token timestamp precision, but there are no guarantees.


Implemented decoding strategies

  • Average log probability threshold support
    Decoded sequences can be discarded based on the average logprob of their tokens. When the avg logprob is below the threshold, the model wasn't very confident in the transcription, so we apply a fallback strategy to generate a better sequence (see the first sketch after this list)
  • Entropy-based threshold support
    This is similar to OpenAI's compression ratio threshold logic, used to determine if a sequence is too repetitive. However, instead of using zlib compression, whisper.cpp uses a basic entropy metric H = -sum(p*log(p)) over the last 32 tokens of the sequence to determine whether the decoding has degraded into endless repetition. Low entropy means more repetition. This approach needs further testing - the entropy threshold will probably need some adjustment (see the entropy sketch after this list)
  • Temperature support
    By default, decoding starts with T = 0, deterministically sampling the best token each time based on the computed logits. Upon failure, we increase the temperature and start sampling tokens from a discrete probability distribution obtained by scaling the logits with 1/T (see the sampling sketch after this list)
  • The Greedy decoding strategy
    Uses --best-of independent decoders for T > 0. Each decoder keeps a separate decoding sequence. At temperature T > 0.5 we clear any previous context. The rationale is that the context can sometimes confuse the decoder and drive it into a failure case
  • The BeamSearch decoding strategy
    At T = 0 we start with --beam-size independent decoders. Each one generates the top --beam-size sequences from its current state. From all generated candidate sequences, we pick the top --beam-size based on the logprob sum of their tokens and reassign them to the decoders (see the re-ranking sketch after this list). Upon failure, we increase the temperature and fall back to the Greedy strategy. The BeamSearch decoder is --beam-size times more computationally heavy than the Greedy decoder
    I think it is worth exploring a strategy that initially uses 1 beam at T = 0 and only activates --beam-size decoders upon failure. This would significantly speed up the processing and I hope it would keep the transcription quality high. Will probably add a flag for that
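
For illustration, here is a minimal sketch of the average-logprob test; the function name and the default threshold value are assumptions for this sketch (OpenAI's reference implementation defaults to -1.0), not the actual whisper.cpp code:

```cpp
// Minimal sketch (not the actual whisper.cpp code): decide whether a decoded
// sequence should trigger the fallback based on its average token logprob.
#include <vector>

bool needs_fallback(const std::vector<float> & token_logprobs,
                    float logprob_threshold = -1.0f) { // assumed default
    if (token_logprobs.empty()) {
        return true; // an empty sequence carries no confidence information
    }
    double sum = 0.0;
    for (float lp : token_logprobs) {
        sum += lp;
    }
    const double avg = sum / token_logprobs.size();
    return avg < logprob_threshold; // low confidence -> retry at higher T
}
```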
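A minimal sketch of the entropy metric over the last 32 tokens, following the formula H = -sum(p*log(p)) given above; the names are illustrative rather than the actual whisper.cpp symbols:

```cpp
// Minimal sketch: entropy of the token frequencies in the last `window`
// tokens of the decoded sequence. Low entropy -> repetitive tail.
#include <algorithm>
#include <cmath>
#include <map>
#include <vector>

double sequence_entropy(const std::vector<int> & tokens, size_t window = 32) {
    const size_t n = std::min(window, tokens.size());
    if (n == 0) {
        return 0.0;
    }

    std::map<int, int> counts; // token id -> occurrences in the window
    for (size_t i = tokens.size() - n; i < tokens.size(); ++i) {
        counts[tokens[i]]++;
    }

    double h = 0.0;
    for (const auto & kv : counts) {
        const double p = double(kv.second) / double(n);
        h -= p * std::log(p); // H = -sum(p * log(p))
    }
    return h; // below the threshold -> discard and fall back
}
```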
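A minimal sampling sketch for the temperature behavior described above (argmax at T = 0, otherwise sample from softmax of the logits scaled by 1/T); again, illustrative names only:

```cpp
// Minimal sketch: greedy at T = 0, otherwise sample from softmax(logits / T).
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

int sample_token(const std::vector<float> & logits, float T, std::mt19937 & rng) {
    if (T <= 0.0f) {
        // deterministic: pick the most likely token
        return (int) (std::max_element(logits.begin(), logits.end()) - logits.begin());
    }

    // scale logits by 1/T and softmax (subtract max for numerical stability)
    const float max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_l) / T);
        sum += probs[i];
    }
    for (auto & p : probs) {
        p /= sum;
    }

    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);
}
```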
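And a minimal sketch of the beam re-ranking step, where all candidate sequences are scored by the logprob sum of their tokens and the top --beam-size are kept; the Candidate type here is hypothetical:

```cpp
// Minimal sketch (hypothetical Candidate type): keep the top beam_size
// candidate sequences, ranked by the sum of their token logprobs.
#include <algorithm>
#include <vector>

struct Candidate {
    std::vector<int> tokens;
    double sum_logprob; // sum of the logprobs of the tokens in the sequence
};

std::vector<Candidate> top_candidates(std::vector<Candidate> cands, size_t beam_size) {
    std::sort(cands.begin(), cands.end(),
              [](const Candidate & a, const Candidate & b) {
                  return a.sum_logprob > b.sum_logprob; // higher is better
              });
    if (cands.size() > beam_size) {
        cands.resize(beam_size); // survivors are reassigned to the decoders
    }
    return cands;
}
```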

Development notes

@ggerganov ggerganov linked an issue Jan 8, 2023 that may be closed by this pull request
@ggerganov ggerganov added this to the 1.1.0 milestone Jan 8, 2023
@ggerganov ggerganov marked this pull request as ready for review January 14, 2023 21:08
ggerganov (Owner, Author) commented

This should be pretty close to OpenAI's decoding implementation.
There might be a few bugs left, but I think it's pretty much ready to merge.

The few failure cases that I have are now correctly transcribed using either --best-of 5 or --beam-size 5.
By default, I am thinking about leaving beam-search off and having a temperature + --best-of 5 fallback.

stevevaius2015 commented

Thank you for your immense work and this wonderful project

@ggerganov ggerganov merged commit 8de452c into master Jan 15, 2023
@ggerganov ggerganov deleted the decoding branch January 15, 2023 09:30
RndyP commented Jan 15, 2023

VC++ reports a potentially uninitialized variable `tid` on line 3124 of whisper.cpp:

[screenshot of the VC++ warning]

RndyP commented Jan 15, 2023

First, thanks for all the hard work on this!

I am playing around with 1.1.0 as I write this.

Still seeing the issue that was closed in #172. The problem is worse now in that the "echoes" may eat up CPU like crazy. My test case is to repeat the number "six" multiple times. (Sorry about the 666 humor.) I send Whisper an audio block of about 4 seconds with the word "six" repeated at least 3 times, and instead of returning a large number of sixes, it will now crunch for up to 1 minute and return various odd strings.

ggerganov (Owner, Author) commented Jan 16, 2023

@RndyP
Thanks for the reports - these are very useful.
The "crazy CPU" problem is due to the new entropy-based threshold for discarding repetitive sequences. I will fine-tune the parameters to try to avoid the case that you observe and also add parameters to enable/disable and control this functionality for the main example.

dennislysenko commented

Thanks for the fixes 👍

rock3125 pushed a commit to rock3125/whisper.cpp that referenced this pull request Feb 21, 2023
* whisper : prepare infra for new decoding strategies

* whisper : apply logit filters and compute logprobs

* whisper : add whisper_get_logits()

* whisper : separate self and cross attention memory

Initial step needed for supporting parallel decoders

* whisper : move probs_id buffer to whisper_context

* whisper : refactor kv cache into separate struct

* whisper : move self-attention kv cache to whisper_decoder

* whisper : wip decoding parameters + strategies

* whisper : wip decoding parameters + strategies (part 2)

* whisper : wip decoding parameters + strategies (part 3)

* whisper : wip decoding parameters + strategies (part 4)

* whisper : fix prompt_past update to not include prompt_init

* whisper : temperature + best_of support

* whisper : support for compression_ration_threshold

We actually use entropy, but it is similar

* command : fix example to use logits instead of obsolete probs

* whisper : handle empty sequence ranking

* whisper : add WHISPER_DEBUG + diagnostic prints + new main args

* whisper : minor fixes

* whisper : add beam-search support

* whisper : bug fix when there no previous context

* whisper : add comments

* stream : disable temperature fallback

For real-time processing, we always want a single decoder running at T=0

* whisper.swiftui : update example - fix paths + add empty folders
anandijain pushed a commit to anandijain/whisper.cpp that referenced this pull request Apr 28, 2023
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023