
Integer quantisation support #540

Merged
merged 15 commits into master from 4-bit on Apr 30, 2023
Conversation

@ggerganov (Owner) commented Feb 26, 2023

  • Add integer quantization support: Q4_0, Q4_1, Q4_2, Q5_0, Q5_1, Q8_0
  • Add quantize tool for model quantization
  • Update all WASM examples to support quantized models
  • Sync talk-llama with latest llama.cpp

Usage:

# quantize a model with Q5_0 method
make quantize
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

# run main example as usual, specifying the quantized model file
./main -m models/ggml-base.en-q5_0.bin ./samples/gb0.wav
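
For programmatic use, a quantized model goes through the same C API as an f16 one; the ftype is read from the model file header at load time. A minimal sketch using the whisper.h C API (whisper_init_from_file appears in the logs below; the whisper_full calls are the standard whisper.cpp entry points; WAV loading is stubbed out for brevity):

#include <stdio.h>
#include "whisper.h"

int main(void) {
    // a quantized model loads the same way as an f16 one - the ftype
    // is picked up from the model file header
    struct whisper_context * ctx = whisper_init_from_file("models/ggml-base.en-q5_0.bin");
    if (ctx == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // 16 kHz mono f32 PCM would normally come from a WAV loader;
    // one second of silence here keeps the sketch self-contained
    static float pcm[16000];

    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    if (whisper_full(ctx, params, pcm, 16000) == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }
    }

    whisper_free(ctx);
    return 0;
}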
@regstuff commented

Wow, the speed and memory improvements are insane, especially what you did with GPT-J!
Did you do any accuracy benchmarks, either for Whisper or GPT?
It would be interesting to see the speed/accuracy picture when using something like quantized medium vs unquantized small, since they both have about the same footprint.

@ggerganov (Owner, Author) commented

@regstuff
I haven't done an accuracy evaluation yet. The accuracy definitely drops, especially for the smaller models, but I cannot say by how much yet.

For example, I observe that GPT-2 117M and 345M completely fail with Q4_0 quantisation, while 345M works with Q4_1 since it is more accurate. The Whisper tiny-q4_0 and base-q4_0 models often fail as well.

Overall, the intuition is that the larger the model, the more resilient to quantisation it will be. I think.
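
For intuition on why Q4_1 is more accurate than Q4_0: Q4_0 stores a single scale per block of weights, so values are reconstructed symmetrically around zero, while Q4_1 additionally stores a per-block offset, which handles asymmetric blocks much better. A rough sketch of the two schemes (block size and rounding modeled on the ggml reference kernels of this era, simplified for illustration rather than the exact code):

#include <math.h>
#include <stdio.h>

#define QK 32 // values per quantization block

// Q4_0: x ~= d * q, with q a signed 4-bit level in [-7, 7]
static float q4_0_sq_error(const float * x) {
    float amax = 0.0f;
    for (int i = 0; i < QK; ++i) {
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    }
    const float d  = amax / 7.0f;
    const float id = d != 0.0f ? 1.0f/d : 0.0f;

    float err = 0.0f;
    for (int i = 0; i < QK; ++i) {
        const float r = d * roundf(x[i]*id); // reconstructed value
        err += (x[i] - r) * (x[i] - r);
    }
    return err;
}

// Q4_1: x ~= m + d * q, with q an unsigned 4-bit level in [0, 15];
// the extra offset m means no levels are wasted on values the block
// does not contain
static float q4_1_sq_error(const float * x) {
    float min = x[0], max = x[0];
    for (int i = 1; i < QK; ++i) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }
    const float d  = (max - min) / 15.0f;
    const float id = d != 0.0f ? 1.0f/d : 0.0f;

    float err = 0.0f;
    for (int i = 0; i < QK; ++i) {
        const float r = min + d * roundf((x[i] - min)*id);
        err += (x[i] - r) * (x[i] - r);
    }
    return err;
}

int main(void) {
    // an all-positive block: Q4_0 has to waste half its range on
    // negative levels, Q4_1 does not
    float x[QK];
    for (int i = 0; i < QK; ++i) {
        x[i] = 0.5f + 0.02f*i;
    }
    printf("Q4_0 squared error: %f\n", q4_0_sq_error(x));
    printf("Q4_1 squared error: %f\n", q4_1_sq_error(x));
    return 0;
}

On a skewed block like this, Q4_1's reconstruction error comes out roughly an order of magnitude lower, which matches the observation that 345M survives Q4_1 but not Q4_0.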

@lele85 commented Mar 9, 2023

Sorry in advance because I'm pretty much out of my depth here, but I'm trying things, so feel free to dismiss me as a noob :)

I played around with the WASM version, converting the tiny model to q4_0 using your tool here: ggerganov/ggml#27. The size improvements are fantastic, but at least on my M1 Max (8 threads) I don't see a dramatic performance increase:

Audio length: 196.9 sec
Tiny (f16): 33.12 sec
Tiny (q4_0): 25.7 sec

The quality of the transcription is way lower. Is the choice to use 4-bit quantization instead of 8-bit driven by something specific? Is a higher resolution in the quantization related in any way to the performance of the algorithm?
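
On the 4-bit vs 8-bit question, a back-of-the-envelope footprint comparison may help. The block layouts below are an assumption based on the ggml quantization formats of this period (32 weights per block, packed quants plus an fp32 scale and, for Q4_1, an fp32 offset), so treat the numbers as approximate:

#include <stdio.h>

int main(void) {
    const int qk = 32; // weights per quantization block (assumed)

    // bytes per block: packed quantized values + per-block metadata
    const int q4_0_bytes = qk/2 + 4;     // 4-bit quants + fp32 scale
    const int q4_1_bytes = qk/2 + 4 + 4; // 4-bit quants + fp32 scale + fp32 offset
    const int q8_0_bytes = qk   + 4;     // 8-bit quants + fp32 scale

    printf("Q4_0: %.1f bits/weight\n", 8.0*q4_0_bytes/qk); // 5.0
    printf("Q4_1: %.1f bits/weight\n", 8.0*q4_1_bytes/qk); // 6.0
    printf("Q8_0: %.1f bits/weight\n", 8.0*q8_0_bytes/qk); // 9.0
    return 0;
}

So 8-bit costs nearly twice the memory of 4-bit per weight, which is presumably why the 4-bit variants were tried first; the accuracy concern is what Q4_1 and the Q5 formats in this PR try to address.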

@meakbiyik (Contributor) commented

As a small note on this PR: my tests of this branch on Neoverse V1 CPUs (with the correct compilation flags set) show a dramatic drop in performance for the medium model. Running bench on the classic medium.en:

whisper_init_from_file: loading model from 'models/ggml-medium.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = f16
whisper_model_load: type          = 4
whisper_model_load: mem required  = 1720.00 MB (+   43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     = 1462.35 MB
whisper_model_load: model size    = 1462.12 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   553.80 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 12803.90 ms /     1 runs (12803.90 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 13357.77 ms

medium.en-q4_0:

whisper_init_from_file: loading model from '/models/ggml-medium.en-q4_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = q4_0
whisper_model_load: type          = 4
whisper_model_load: mem required  =  726.00 MB (+   43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  468.71 MB
whisper_model_load: model size    =  468.48 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   239.97 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 19738.27 ms /     1 runs (19738.27 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 19978.31 ms
@ocordeiro commented

Hi @ggerganov, this whisper/4-bit branch doesn't work with quantized models from ggml/master.
The old ggml/gq conversion (now deleted) works.

@ggerganov ggerganov marked this pull request as ready for review April 30, 2023 15:24
@ggerganov ggerganov merged commit 794b162 into master Apr 30, 2023
@ggerganov ggerganov deleted the 4-bit branch April 30, 2023 15:52
@ggerganov ggerganov changed the title 4-bit Integer quantisation Apr 30, 2023
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
* whisper : add integer quantization support

* examples : add common-ggml + prepare to add "quantize" tool

* whisper : quantization tool ready

* whisper : fix F32 support

* whisper : try to fix shared lib linkage

* wasm : update quantized models to Q5

* bench.wasm : remove "medium" button

* bench.wasm : fix custom model button

* ggml : add Q5_0 and Q5_1 WASM SIMD

* wasm : add quantized models to all WASM examples

* wasm : bump DB version number to 2

* talk-llama : update example to latest llama.cpp

* node : increase test timeout to 10s

* readme : add information for model quantization

* wasm : add links to other examples
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023