
Improve decoding #291

Merged

ggerganov merged 23 commits into master from decoding on Jan 15, 2023
Conversation

ggerganov (Owner) commented Dec 18, 2022

ref #278 #133 #172 #255 #270

The goal of this PR is to reach OpenAI decoding parity and potentially go beyond it.

Several ideas for improving the decoding strategy will be explored.
These ideas may also improve segment and token timestamp precision, but there are no guarantees.


Implemented decoding strategies

  • Average log probability threshold support
    Decoded sequences can be discarded based on the average logprob of their tokens. When the avg logprob is below the threshold, the model wasn't very confident in the transcription, so we apply a fallback strategy to generate a better sequence (see the first sketch after this list)
  • Entropy-based threshold support
    This is similar to OpenAI's compression ratio threshold logic, used to determine if a sequence is too repetitive. However, instead of using zlib compression, whisper.cpp uses a basic entropy metric H = -sum(p*log(p)) over the last 32 tokens of the sequence to determine whether the decoding has degraded into endless repetition. Low entropy means more repetition. This approach needs further testing - the entropy threshold will probably need some adjustment (see the entropy sketch after this list)
  • Temperature support
    By default, decoding starts with T = 0, deterministically sampling the best token each time based on the computed logits. Upon failure, we increase the temperature and start sampling tokens from a discrete probability distribution obtained by scaling the logits with 1/T (see the sampling sketch after this list)
  • The Greedy decoding strategy
    Uses --best-of independent decoders for T > 0. Each decoder keeps a separate decoding sequence. At temperature T > 0.5 we clear any previous context. The rationale is that the context can sometimes confuse the decoder and drive it into a failure case
  • The BeamSearch decoding strategy
    At T = 0 we start with --beam-size independent decoders. Each one generates the top --beam-size sequences from its current state. From all generated candidate sequences, we pick the top --beam-size based on the logprob sum of their tokens and reassign them to the decoders (see the re-ranking sketch after this list). Upon failure, we increase the temperature and fall back to the Greedy strategy. The BeamSearch decoder is --beam-size times more computationally heavy than the Greedy decoder
    I think it is worth exploring a strategy that initially uses 1 beam at T = 0 and only activates --beam-size decoders upon failure. This would significantly speed up the processing and I hope it would keep the transcription quality high. Will probably add a flag for that
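
For illustration, here is a minimal sketch of the average-logprob test; the function name and the default threshold value are assumptions for this sketch (OpenAI's reference implementation defaults to -1.0), not the actual whisper.cpp code:

```cpp
// Minimal sketch (not the actual whisper.cpp code): decide whether a decoded
// sequence should trigger the fallback based on its average token logprob.
#include <vector>

bool needs_fallback(const std::vector<float> & token_logprobs,
                    float logprob_threshold = -1.0f) { // assumed default
    if (token_logprobs.empty()) {
        return true; // an empty sequence carries no confidence information
    }
    double sum = 0.0;
    for (float lp : token_logprobs) {
        sum += lp;
    }
    const double avg = sum / token_logprobs.size();
    return avg < logprob_threshold; // low confidence -> retry at higher T
}
```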
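A minimal sketch of the entropy metric over the last 32 tokens, following the formula H = -sum(p*log(p)) given above; the names are illustrative rather than the actual whisper.cpp symbols:

```cpp
// Minimal sketch: entropy of the token frequencies in the last `window`
// tokens of the decoded sequence. Low entropy -> repetitive tail.
#include <algorithm>
#include <cmath>
#include <map>
#include <vector>

double sequence_entropy(const std::vector<int> & tokens, size_t window = 32) {
    const size_t n = std::min(window, tokens.size());
    if (n == 0) {
        return 0.0;
    }

    std::map<int, int> counts; // token id -> occurrences in the window
    for (size_t i = tokens.size() - n; i < tokens.size(); ++i) {
        counts[tokens[i]]++;
    }

    double h = 0.0;
    for (const auto & kv : counts) {
        const double p = double(kv.second) / double(n);
        h -= p * std::log(p); // H = -sum(p * log(p))
    }
    return h; // below the threshold -> discard and fall back
}
```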
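A minimal sampling sketch for the temperature behavior described above (argmax at T = 0, otherwise sample from softmax of the logits scaled by 1/T); again, illustrative names only:

```cpp
// Minimal sketch: greedy at T = 0, otherwise sample from softmax(logits / T).
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

int sample_token(const std::vector<float> & logits, float T, std::mt19937 & rng) {
    if (T <= 0.0f) {
        // deterministic: pick the most likely token
        return (int) (std::max_element(logits.begin(), logits.end()) - logits.begin());
    }

    // scale logits by 1/T and softmax (subtract max for numerical stability)
    const float max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_l) / T);
        sum += probs[i];
    }
    for (auto & p : probs) {
        p /= sum;
    }

    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);
}
```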
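And a minimal sketch of the beam re-ranking step, where all candidate sequences are scored by the logprob sum of their tokens and the top --beam-size are kept; the Candidate type here is hypothetical:

```cpp
// Minimal sketch (hypothetical Candidate type): keep the top beam_size
// candidate sequences, ranked by the sum of their token logprobs.
#include <algorithm>
#include <vector>

struct Candidate {
    std::vector<int> tokens;
    double sum_logprob; // sum of the logprobs of the tokens in the sequence
};

std::vector<Candidate> top_candidates(std::vector<Candidate> cands, size_t beam_size) {
    std::sort(cands.begin(), cands.end(),
              [](const Candidate & a, const Candidate & b) {
                  return a.sum_logprob > b.sum_logprob; // higher is better
              });
    if (cands.size() > beam_size) {
        cands.resize(beam_size); // survivors are reassigned to the decoders
    }
    return cands;
}
```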

Development notes

@ggerganov ggerganov linked an issue Jan 8, 2023 that may be closed by this pull request
@ggerganov ggerganov added this to the 1.1.0 milestone Jan 8, 2023
@ggerganov ggerganov marked this pull request as ready for review January 14, 2023 21:08
ggerganov (Owner, Author) commented

This should be pretty close to OpenAI's decoding implementation.
There might be a few bugs left, but I think it's pretty much ready to merge.

The few failure cases that I have are now correctly transcribed using either --best-of 5 or --beam-size 5.
By default, I am thinking about leaving beam-search off and having a temperature + --best-of 5 fallback.

stevevaius2015 commented

Thank you for your immense work and this wonderful project

@ggerganov ggerganov merged commit 8de452c into master Jan 15, 2023
@ggerganov ggerganov deleted the decoding branch January 15, 2023 09:30
RndyP commented Jan 15, 2023

VC++ reports a potentially uninitialized variable `tid` on line 3124 of whisper.cpp:

[screenshot of the VC++ warning]

RndyP commented Jan 15, 2023

First, thanks for all the hard work on this!

I am playing around with 1.1.0 as I write this.

Still seeing the issue that was closed in #172. The problem is worse now in that the "echoes" may eat up CPU like crazy. My test case is to repeat the number "six" multiple times. (Sorry about the 666 humor.) I send Whisper an audio block of about 4 seconds with the word "six" repeated at least 3 times, and instead of returning a large number of sixes, it will now crunch for up to 1 minute and return various odd strings.

ggerganov (Owner, Author) commented Jan 16, 2023

@RndyP
Thanks for the reports - these are very useful.
The "crazy CPU" problem is due to the new entropy-based threshold for discarding repetitive sequences. I will fine-tune the parameters to try to avoid the case that you observe and also add parameters to enable/disable and control this functionality for the main example.

dennislysenko commented

Thanks for the fixes 👍

rock3125 pushed a commit to rock3125/whisper.cpp that referenced this pull request Feb 21, 2023
* whisper : prepare infra for new decoding strategies

* whisper : apply logit filters and compute logprobs

* whisper : add whisper_get_logits()

* whisper : separate self and cross attention memory

Initial step needed for supporting parallel decoders

* whisper : move probs_id buffer to whisper_context

* whisper : refactor kv cache into separate struct

* whisper : move self-attention kv cache to whisper_decoder

* whisper : wip decoding parameters + strategies

* whisper : wip decoding parameters + strategies (part 2)

* whisper : wip decoding parameters + strategies (part 3)

* whisper : wip decoding parameters + strategies (part 4)

* whisper : fix prompt_past update to not include prompt_init

* whisper : temperature + best_of support

* whisper : support for compression_ration_threshold

We actually use entropy, but it is similar

* command : fix example to use logits instead of obsolete probs

* whisper : handle empty sequence ranking

* whisper : add WHISPER_DEBUG + diagnostic prints + new main args

* whisper : minor fixes

* whisper : add beam-search support

* whisper : bug fix when there no previous context

* whisper : add comments

* stream : disable temperature fallback

For real-time processing, we always want a single decoder running at T=0

* whisper.swiftui : update example - fix paths + add empty folders
anandijain pushed a commit to anandijain/whisper.cpp that referenced this pull request Apr 28, 2023
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023