Reduce memory usage during Whisper inference #431

Merged
ggerganov merged 15 commits into master from mem on Feb 4, 2023
Conversation

ggerganov (Owner) commented on Jan 19, 2023

The idea is to avoid keeping all intermediate tensors of the computation graph in memory by introducing "scratch" buffers to ggml (see #272 (comment)).
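For reference, a minimal sketch of how the new scratch mechanism is exercised via `ggml_set_scratch()`; the buffer sizes and the specific ops are illustrative, not the actual whisper.cpp values:

```c
// two externally-allocated scratch regions (sizes are illustrative)
static uint8_t scratch0[32*1024*1024];
static uint8_t scratch1[32*1024*1024];

// route the data of subsequently created tensors into region 0
ggml_set_scratch(ctx0, (struct ggml_scratch) {
    /*.offs =*/ 0,
    /*.size =*/ sizeof(scratch0),
    /*.data =*/ scratch0,
});
struct ggml_tensor * cur = ggml_mul_mat(ctx0, model.w, inp);

// switch to region 1 so the result above stays intact while it is
// still needed as an input to the next op
ggml_set_scratch(ctx0, (struct ggml_scratch) {
    /*.offs =*/ 0,
    /*.size =*/ sizeof(scratch1),
    /*.data =*/ scratch1,
});
cur = ggml_gelu(ctx0, cur);
```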

I initially thought it would be enough to keep just the last 2 intermediate tensors at each point.
However, that is not the case, since we have operations like this:

cur = ggml_add(ctx0,
        ggml_repeat(ctx0,
            model.e_conv_2_b,
            cur),
        cur);

The tensor cur is used as an input to 2 new intermediate tensors (the ggml_repeat result and the ggml_add result), so we need to keep more than 2 tensors alive in the "scratch" buffer.

Initial results

Using scratch buffers during inference, we reduce the total memory usage for the base model from 500 MB to just 213 MB. As an extra bonus, the decoder seems to be about 30% faster on M1 Pro, without any loss of precision compared to master.

The main drawback is that the scratch buffer selection is currently done manually in whisper.cpp, which makes the code quite unreadable and very error-prone. I think this can be automated by analysing the nodes in the created compute graphs and assigning them to the correct scratch buffers, but the assignment algorithm is not trivial to implement and would require some major refactoring in ggml. For now, I think it is better to just clean up the code a little bit and wait to see if a better idea pops up.
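To illustrate, the graph-building code now has to switch scratch regions around essentially every op. Roughly how this looks (a simplified sketch; use_buf is a hypothetical helper and the layer/tensor names are illustrative):

```c
// hypothetical helper wrapping ggml_set_scratch(): select pre-allocated
// region i, or restore normal context allocation for i == -1
static void use_buf(struct ggml_context * ctx, int i) {
    if (i >= 0) {
        ggml_set_scratch(ctx, (struct ggml_scratch) { 0, buf_scratch_size[i], buf_scratch[i], });
    } else {
        ggml_set_scratch(ctx, (struct ggml_scratch) { 0, 0, NULL, });
    }
}

// inside the graph construction:
use_buf(ctx0, 0);
cur = ggml_norm(ctx0, inpL);

use_buf(ctx0, 1);
cur = ggml_mul_mat(ctx0, layer.attn_q_w, cur);

use_buf(ctx0, 0);
cur = ggml_add(ctx0, ggml_repeat(ctx0, layer.attn_q_b, cur), cur);

// ...and so on for dozens of ops - forgetting a single switch silently
// overwrites a tensor that is still needed as an input
```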

Memory usage change:

| Model  | Disk   | Mem (old) | Mem (new) |
| ------ | ------ | --------- | --------- |
| tiny   |  75 MB | ~390 MB   | ~125 MB   |
| base   | 142 MB | ~500 MB   | ~210 MB   |
| small  | 466 MB | ~1.0 GB   | ~600 MB   |
| medium | 1.5 GB | ~2.6 GB   | ~1.7 GB   |
| large  | 2.9 GB | ~4.7 GB   | ~3.3 GB   |

Development notes:

  • Cannot use ggml_cpy with scratch tensors
  • Special-cased constant ggml tensors - need a better fix
  • We now only compute the logits for the last token in whisper_decode() (see the sketch after this list)
  • Use different scratch buffers for every other layer?
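
Regarding the logits note above, the idea is to take a 1-row view of the decoder output before the final projection, so only the last token is projected to the vocabulary. A sketch (the tensor and variable names are illustrative):

```c
// view the last row of cur, which has shape [n_state, n_tokens]
struct ggml_tensor * cur_last = ggml_view_2d(ctx0, cur,
        n_state, 1,                   // ne0, ne1: a single row of n_state values
        cur->nb[1],                   // keep the original row stride
        (n_tokens - 1)*cur->nb[1]);   // byte offset of the last row

// project only that row: logits are [n_vocab, 1] instead of [n_vocab, n_tokens]
struct ggml_tensor * logits = ggml_mul_mat(ctx0, model.d_te, cur_last);
```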
@ggerganov ggerganov merged commit f3ee4a9 into master Feb 4, 2023
@ggerganov ggerganov deleted the mem branch February 4, 2023 07:45
rock3125 pushed a commit to rock3125/whisper.cpp that referenced this pull request Feb 21, 2023
* ggml : add "scratch" buffer support

* ggml : support for scratch ring-buffer

* ggml : bug fix in ggml_repeat()

* ggml : error on scratch buffer overflow

* whisper : use scratch buffers during inference (base model only)

* whisper : update memory usage for all models

* whisper : fix encoder memory usage

* whisper : use whisper_context functions instead of macros

* whisper : fix FF + remove it from README

* ggml : reuse ggml_new_i32

* ggml : refactor the scratch buffer storage

* whisper : reorder scratch buffers in the decoder

* main : add option to disable temp fallback

* Update README.md
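
One of the commits above, "ggml : error on scratch buffer overflow", turns exhaustion of a scratch region into a hard failure instead of silent corruption. A minimal sketch of such a check in the tensor allocation path (the field names are assumptions, not the exact ggml code):

```c
// when a scratch buffer is active, fail loudly if the requested
// tensor data does not fit into the remaining scratch space
if (ctx->scratch.offs + data_size > ctx->scratch.size) {
    GGML_PRINT("%s: not enough space in the scratch memory\n", __func__);
    assert(false);
}
```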