
Integer quantisation support #540

Merged
merged 15 commits into master from 4-bit on Apr 30, 2023
Conversation

@ggerganov (Owner) commented Feb 26, 2023

  • Add integer quantization support: Q4_0, Q4_1, Q4_2, Q5_0, Q5_1, Q8_0
  • Add quantize tool for model quantization
  • Update all WASM examples to support quantized models
  • Sync talk-llama with latest llama.cpp

Usage:

# quantize a model with Q5_0 method
make quantize
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

# run main example as usual, specifying the quantized model file
./main -m models/ggml-base.en-q5_0.bin ./samples/gb0.wav
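
For programmatic use, a quantized model goes through the same C API as an f16 one; the ftype is read from the model file header at load time. A minimal sketch using the whisper.h C API (whisper_init_from_file appears in the logs below; the whisper_full calls are the standard whisper.cpp entry points; WAV loading is stubbed out for brevity):

#include <stdio.h>
#include "whisper.h"

int main(void) {
    // a quantized model loads the same way as an f16 one - the ftype
    // is picked up from the model file header
    struct whisper_context * ctx = whisper_init_from_file("models/ggml-base.en-q5_0.bin");
    if (ctx == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // 16 kHz mono f32 PCM would normally come from a WAV loader;
    // one second of silence here keeps the sketch self-contained
    static float pcm[16000];

    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    if (whisper_full(ctx, params, pcm, 16000) == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }
    }

    whisper_free(ctx);
    return 0;
}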
@regstuff commented

Wow, the speed and memory improvements are insane, especially what you did with GPT-J!
Did you do any accuracy benchmarks, either for Whisper or GPT?
It would be interesting to see the speed/accuracy picture when using something like quantized medium vs unquantized small, since they both have about the same footprint.

@ggerganov (Owner, Author) commented

@regstuff
I haven't done an accuracy evaluation yet. The accuracy definitely drops, especially for the smaller models, but I cannot say by how much yet.

For example, I observe that GPT-2 117M and 345M completely fail with Q4_0 quantisation, while 345M works with Q4_1 since it is more accurate. The Whisper tiny-q4_0 and base-q4_0 models often fail as well.

Overall, the intuition is that the larger the model, the more resilient to quantisation it will be. I think.
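
For intuition on why Q4_1 is more accurate than Q4_0: Q4_0 stores a single scale per block of weights, so values are reconstructed symmetrically around zero, while Q4_1 additionally stores a per-block offset, which handles asymmetric blocks much better. A rough sketch of the two schemes (block size and rounding modeled on the ggml reference kernels of this era, simplified for illustration rather than the exact code):

#include <math.h>
#include <stdio.h>

#define QK 32 // values per quantization block

// Q4_0: x ~= d * q, with q a signed 4-bit level in [-7, 7]
static float q4_0_sq_error(const float * x) {
    float amax = 0.0f;
    for (int i = 0; i < QK; ++i) {
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    }
    const float d  = amax / 7.0f;
    const float id = d != 0.0f ? 1.0f/d : 0.0f;

    float err = 0.0f;
    for (int i = 0; i < QK; ++i) {
        const float r = d * roundf(x[i]*id); // reconstructed value
        err += (x[i] - r) * (x[i] - r);
    }
    return err;
}

// Q4_1: x ~= m + d * q, with q an unsigned 4-bit level in [0, 15];
// the extra offset m means no levels are wasted on values the block
// does not contain
static float q4_1_sq_error(const float * x) {
    float min = x[0], max = x[0];
    for (int i = 1; i < QK; ++i) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }
    const float d  = (max - min) / 15.0f;
    const float id = d != 0.0f ? 1.0f/d : 0.0f;

    float err = 0.0f;
    for (int i = 0; i < QK; ++i) {
        const float r = min + d * roundf((x[i] - min)*id);
        err += (x[i] - r) * (x[i] - r);
    }
    return err;
}

int main(void) {
    // an all-positive block: Q4_0 has to waste half its range on
    // negative levels, Q4_1 does not
    float x[QK];
    for (int i = 0; i < QK; ++i) {
        x[i] = 0.5f + 0.02f*i;
    }
    printf("Q4_0 squared error: %f\n", q4_0_sq_error(x));
    printf("Q4_1 squared error: %f\n", q4_1_sq_error(x));
    return 0;
}

On a skewed block like this, Q4_1's reconstruction error comes out roughly an order of magnitude lower, which matches the observation that 345M survives Q4_1 but not Q4_0.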

@lele85 commented Mar 9, 2023

Sorry in advance because I'm pretty much out of my depth here, but I'm trying things, so feel free to dismiss me as a noob :)

I played around with the WASM version, converting the tiny model to q4_0 using your tool here: ggerganov/ggml#27. The size improvements are fantastic, but at least on my M1 Max (8 threads) I don't see a dramatic performance increase:

Audio length: 196.9 sec
Tiny (f16): 33.12 sec
Tiny (q4_0): 25.7 sec

The quality of the transcription is way lower. Is the choice to use 4-bit quantization instead of 8-bit driven by something specific? Is a higher resolution in the quantization related in any way to the performance of the algorithm?
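
On the 4-bit vs 8-bit question, a back-of-the-envelope footprint comparison may help. The block layouts below are an assumption based on the ggml quantization formats of this period (32 weights per block, packed quants plus an fp32 scale and, for Q4_1, an fp32 offset), so treat the numbers as approximate:

#include <stdio.h>

int main(void) {
    const int qk = 32; // weights per quantization block (assumed)

    // bytes per block: packed quantized values + per-block metadata
    const int q4_0_bytes = qk/2 + 4;     // 4-bit quants + fp32 scale
    const int q4_1_bytes = qk/2 + 4 + 4; // 4-bit quants + fp32 scale + fp32 offset
    const int q8_0_bytes = qk   + 4;     // 8-bit quants + fp32 scale

    printf("Q4_0: %.1f bits/weight\n", 8.0*q4_0_bytes/qk); // 5.0
    printf("Q4_1: %.1f bits/weight\n", 8.0*q4_1_bytes/qk); // 6.0
    printf("Q8_0: %.1f bits/weight\n", 8.0*q8_0_bytes/qk); // 9.0
    return 0;
}

So 8-bit costs nearly twice the memory of 4-bit per weight, which is presumably why the 4-bit variants were tried first; the accuracy concern is what Q4_1 and the Q5 formats in this PR try to address.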

@meakbiyik (Contributor) commented

As a small note on this PR: my tests of this branch on Neoverse V1 CPUs (with the correct compilation flags set) show a dramatic drop in performance for the medium model. Running bench on the classic medium.en:

whisper_init_from_file: loading model from 'models/ggml-medium.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = f16
whisper_model_load: type          = 4
whisper_model_load: mem required  = 1720.00 MB (+   43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     = 1462.35 MB
whisper_model_load: model size    = 1462.12 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   553.80 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 12803.90 ms /     1 runs (12803.90 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 13357.77 ms

medium.en-q4_0:

whisper_init_from_file: loading model from '/models/ggml-medium.en-q4_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = q4_0
whisper_model_load: type          = 4
whisper_model_load: mem required  =  726.00 MB (+   43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  468.71 MB
whisper_model_load: model size    =  468.48 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   239.97 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 19738.27 ms /     1 runs (19738.27 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 19978.31 ms
@ocordeiro commented

Hi @ggerganov, this whisper/4-bit branch doesn't work with quantized models from ggml/master.
The old ggml/gq conversion (now deleted) works.

@ggerganov ggerganov marked this pull request as ready for review April 30, 2023 15:24
@ggerganov ggerganov merged commit 794b162 into master Apr 30, 2023
@ggerganov ggerganov deleted the 4-bit branch April 30, 2023 15:52
@ggerganov ggerganov changed the title 4-bit Integer quantisation Apr 30, 2023
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
* whisper : add integer quantization support

* examples : add common-ggml + prepare to add "quantize" tool

* whisper : quantization tool ready

* whisper : fix F32 support

* whisper : try to fix shared lib linkage

* wasm : update quantized models to Q5

* bench.wasm : remove "medium" button

* bench.wasm : fix custom model button

* ggml : add Q5_0 and Q5_1 WASM SIMD

* wasm : add quantized models to all WASM examples

* wasm : bump DB version number to 2

* talk-llama : update example to latest llama.cpp

* node : increase test timeout to 10s

* readme : add information for model quantization

* wasm : add links to other examples
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023