Experiments with GPU CUDA acceleration...sort of #220

Open
Topping1 opened this issue Dec 6, 2022 · 17 comments
Labels
performance CPU and memory usage - results and comparisons

Comments

@Topping1
Contributor

Topping1 commented Dec 6, 2022

The CUDA Toolkit documentation (link) states that NVBLAS is a drop-in BLAS replacement.
It also states: "The NVBLAS Library is a GPU-accelerated Library that implements BLAS (Basic Linear Algebra Subprograms). It can accelerate most BLAS Level-3 routines by dynamically routing BLAS calls to one or more NVIDIA GPUs present in the system, when the characteristics of the call make it speed up on a GPU." One of those Level-3 routines is sgemm (matrix multiplication), which is used extensively by ggml.c.
In theory, IF CORRECTLY CONFIGURED, NVBLAS can intercept the calls to the OpenBLAS function cblas_sgemm and accelerate them using a CUDA-compatible graphics card installed in the system.
There is not much information about the specific steps to enable it, but I could piece together this step-by-step guide:

1-Install the CUDA Toolkit from the official link

2-Create the file /etc/nvblas.conf with the following contents:

NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so
NVBLAS_GPU_LIST ALL

/usr/lib/x86_64-linux-gnu/libopenblas.so is the location of libopenblas.so on my system. You have to point it to the correct location on yours (it should not be that different).

3-Create an environment variable pointing to nvblas.conf:
export NVBLAS_CONFIG_FILE=/etc/nvblas.conf

4-Create an environment variable pointing to the location of libnvblas.so:
export LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.11
It is not clear which .so file is needed here. For example, on my system I can find the following:
/usr/local/cuda/lib64/libnvblas.so
/usr/local/cuda/lib64/libnvblas.so.11
/usr/local/cuda/lib64/libnvblas.so.11.11.3.6
/usr/local/cuda-11.8/lib64/libnvblas.so
/usr/local/cuda-11.8/lib64/libnvblas.so.11
/usr/local/cuda-11.8/lib64/libnvblas.so.11.11.3.6

5-Download the source code of whisper.cpp with:
git clone https://github.com/ggerganov/whisper.cpp

6-Inside the whisper.cpp folder, execute
cmake -DWHISPER_SUPPORT_OPENBLAS=ON .

7-Inside the whisper.cpp folder, execute
make
You should now have a compiled main executable with BLAS support turned on.

8-Now, at least in my case, when I run a test transcription, the program confirms that it is using BLAS (BLAS = 1), but NVBLAS does not seem to be intercepting the calls: NVTOP does not show any GPU usage and no nvblas.log is created.

If someone can figure out how to make this work, it has the potential to substantially accelerate transcription speed on x64.
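For context, the Level-3 call in question is ggml.c's cblas_sgemm (the exact call is quoted later in this thread). A minimal, standalone sketch of that kind of call with made-up small matrices, which is what NVBLAS would have to intercept (build with something like gcc sgemm_cblas.c -lopenblas):

// Minimal sketch (not the actual ggml.c code): row-major d = y * x^T,
// the same form of Level-3 call that ggml.c issues through OpenBLAS.
#include <stdio.h>
#include <cblas.h>

int main(void) {
    enum { ne11 = 2, ne01 = 3, ne10 = 4 };   // rows of y, rows of x, shared dimension
    const float y[ne11 * ne10] = { 1, 2, 3, 4,   5, 6, 7, 8 };
    const float x[ne01 * ne10] = { 1, 0, 0, 0,   0, 1, 0, 0,   0, 0, 1, 0 };
    float d[ne11 * ne01] = { 0 };

    // zT = y * xT (same shape of call as in ggml.c)
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                ne11, ne01, ne10,
                1.0f, y, ne10,
                      x, ne10,
                0.0f, d, ne01);

    for (int i = 0; i < ne11; i++) {
        for (int j = 0; j < ne01; j++) printf("%4.1f ", d[i*ne01 + j]);
        printf("\n");
    }
    return 0;
}

If NVBLAS intercepted the CBLAS entry point, preloading libnvblas ahead of a binary like this would be enough to route the multiplication to the GPU.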

@RYucel

RYucel commented Dec 6, 2022 via email

@ggerganov ggerganov added the performance CPU and memory usage - results and comparisons label Dec 6, 2022
@Topping1
Contributor Author

Topping1 commented Dec 7, 2022

I have spotted this in the documentation as an alternative to intercepting OpenBLAS calls with the LD_PRELOAD environment variable:
To use the NVBLAS Library, the user application must be relinked against NVBLAS in addition to the original CPU Blas (technically only NVBLAS is needed unless some BLAS routines not supported by NVBLAS are used by the application). To be sure that the linker links against the exposed symbols of NVBLAS and not the ones from the CPU BLAS, the NVBLAS Library needs to be put before the CPU BLAS on the linkage command line.

@ggerganov, can you please advise how to link the shared library libnvblas.so so that it comes before OpenBLAS on the link command line? Also, I'm not sure where to apply this: ggml.c, whisper.cpp or main.cpp? Any help would be appreciated.

@misutoneko

For that LD_PRELOAD trick, maybe that directory also needs to be added to LD_LIBRARY_PATH?
You can use LD_DEBUG=all to see what's going on in more detail.
To get a long list of options of what's possible, use LD_DEBUG=help cat

Btw the different library names that you see are mostly symlinked together, so it shouldn't matter much which one you choose.

@ggerganov
Owner

ggerganov commented Dec 8, 2022

@Topping1
I got it working and there is a significant performance boost when using libnvblas. Here are initial results:

You got everything correct, except it seems that the cblas_ calls are not intercepted by libnvblas. Instead we have to use the native Fortran BLAS API. An initial demonstration is available on the nvblas branch, so make sure to check out the branch and rebuild.
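For illustration only (a sketch, not the actual code on the nvblas branch): the Fortran-style symbol sgemm_ is column-major and takes every argument by reference, so the row-major product d = y * x^T can be expressed by swapping the operands and the transposes. Link against any BLAS that exposes the Fortran interface, e.g. gcc sgemm_fortran.c -lopenblas; preloading libnvblas then intercepts the sgemm_ symbol.

// Sketch only: calling the Fortran BLAS symbol directly from C.
#include <stdio.h>

extern void sgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const float *alpha, const float *a, const int *lda,
                   const float *b, const int *ldb,
                   const float *beta, float *c, const int *ldc);

int main(void) {
    enum { ne11 = 2, ne01 = 3, ne10 = 4 };
    const float y[ne11 * ne10] = { 1, 2, 3, 4,   5, 6, 7, 8 };
    const float x[ne01 * ne10] = { 1, 0, 0, 0,   0, 1, 0, 0,   0, 0, 1, 0 };
    float d[ne11 * ne01] = { 0 };

    // Row-major d = y * x^T is the column-major product d^T = x * y^T,
    // hence the swapped operands/transposes compared to the CBLAS call.
    const int m = ne01, n = ne11, k = ne10;
    const int lda = ne10, ldb = ne10, ldc = ne01;
    const float one = 1.0f, zero = 0.0f;
    sgemm_("T", "N", &m, &n, &k, &one, x, &lda, y, &ldb, &zero, d, &ldc);

    for (int i = 0; i < ne11; i++) {
        for (int j = 0; j < ne01; j++) printf("%4.1f ", d[i*ne01 + j]);
        printf("\n");
    }
    return 0;
}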

Here is a comparison on a machine with GeForce GTX 1660, running the large model on jfk.wav:

  • Without libnvblas:
./bin/main -m ../models/ggml-large.bin ../samples/jfk.wav
whisper_model_load: loading model from '../models/ggml-large.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 5
whisper_model_load: adding 1608 extra tokens
whisper_model_load: mem_required  = 4712.00 MB
whisper_model_load: ggml ctx size = 2950.97 MB
whisper_model_load: memory size   =  304.38 MB
whisper_model_load: model size    = 2950.66 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | 

main: processing '../samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =  1538.72 ms
whisper_print_timings:      mel time =    53.92 ms
whisper_print_timings:   sample time =     8.39 ms
whisper_print_timings:   encode time = 12416.90 ms / 388.03 ms per layer
whisper_print_timings:   decode time =  1605.45 ms / 50.17 ms per layer
whisper_print_timings:    total time = 15623.76 ms
  • With libnvblas (using the LD_PRELOAD trick):
NVBLAS_CONFIG_FILE=/etc/nvblas.conf LD_PRELOAD=/usr/local/cuda-11.8/targets/x86_64-linux/lib/libnvblas.so ./bin/main -m ../models/ggml-large.bin ../samples/jfk.wav
whisper_model_load: loading model from '../models/ggml-large.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 5
whisper_model_load: adding 1608 extra tokens
whisper_model_load: mem_required  = 4712.00 MB
whisper_model_load: ggml ctx size = 2950.97 MB
whisper_model_load: memory size   =  304.38 MB
whisper_model_load: model size    = 2950.66 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | 

main: processing '../samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[NVBLAS] NVBLAS_CONFIG_FILE environment variable is set to '/etc/nvblas.conf'

[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =  1535.95 ms
whisper_print_timings:      mel time =    52.52 ms
whisper_print_timings:   sample time =     7.75 ms
whisper_print_timings:   encode time =  7362.63 ms / 230.08 ms per layer
whisper_print_timings:   decode time =  1535.42 ms / 47.98 ms per layer
whisper_print_timings:    total time = 10494.65 ms

This shows that the Encoder is almost 2x faster: 12416.90 ms vs 7362.63 ms.

My /etc/nvblas.conf looks like this:

NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_TRACE_LOG_ENABLED

So I think with some fine-tuning and porting the appropriate matrix multiplications to libnvblas we can get decent GPU support. Likely, it will not be optimal, but hopefully decent. Thanks for this idea!

@RYucel

RYucel commented Dec 9, 2022 via email

@Topping1
Contributor Author

Topping1 commented Dec 16, 2022

> Any windows executable soon for this?

@RYucel After reading some documentation, it seems that this nvblas trick is only applicable to Linux. I think it can be implemented on Windows, but with more changes to the code.

@Topping1
Contributor Author

> @Topping1 I got it working and there is a significant performance boost when using libnvblas. Here are initial results: [...]
>
> So I think with some fine-tuning and porting the appropriate matrix multiplications to libnvblas we can get decent GPU support. Likely, it will not be optimal, but hopefully decent. Thanks for this idea!

@ggerganov thanks very much for your efforts!
I did some additional digging and found an interesting read here. It says:
> Use of NVBLAS_AUTOPIN_MEM_ENABLED flag can be essential for good performance, something not obvious from the documentation.

Basically you have to add the line
NVBLAS_AUTOPIN_MEM_ENABLED
to the /etc/nvblas.conf file.

I ran bench on a Ryzen 5 PRO 2400G with a Quadro P1000 and got the following results. With this nvblas.conf:

NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_TRACE_LOG_ENABLED

results are:

whisper_print_timings:     load time =   219.68 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  3044.64 ms / 507.44 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3264.37 ms

and with this nvblas.conf:

NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_TRACE_LOG_ENABLED
NVBLAS_AUTOPIN_MEM_ENABLED

results are:

whisper_print_timings:     load time =   230.46 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2176.08 ms / 362.68 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  2406.60 ms

2176.08 ms vs 3044.64 ms encode time...not bad at all.

On a related note, I think CLBlast has a similar way to call the matrix multiplication functions, and in that case the hardware acceleration would be via OpenCL. One downside is that you have to "tune" the installation for your particular GPU to get any decent speedup. I believe there was a mention of it here: #173. I got as far as optimizing for my GPU, but there were errors installing the library. I will try again to see what the performance difference is compared to nvblas.

@ggerganov ggerganov pinned this issue Dec 16, 2022
@ggerganov
Owner

I tried using NVBLAS_AUTOPIN_MEM_ENABLED but it does not make a difference on my video card. Strange.

But in any case, I like this approach because it is very un-intrusive. I'm thinking about the best way to integrate it into the build. Using the LD_PRELOAD trick is not very convenient - probably better to link directly to libnvblas.

@Topping1
Contributor Author

> I tried using NVBLAS_AUTOPIN_MEM_ENABLED but it does not make a difference on my video card. Strange.
>
> But in any case, I like this approach because it is very un-intrusive. I'm thinking about the best way to integrate it into the build. Using the LD_PRELOAD trick is not very convenient - probably better to link directly to libnvblas.

Strange that you don't get any speedup, but I guess it depends on the specific setup. I agree that linking directly to the libnvblas library is the way to go. The only inconvenience left would be the nvblas.conf file, but I think even that could be generated programmatically (a rough sketch below).
One behavior that I have noticed is that enabling nvblas improves the encoding time but degrades the decoding time compared to a build without BLAS support. I think it could be caused by the other OpenBLAS calls in the ggml.c code.
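A rough sketch of what generating the config programmatically could look like, mirroring the conf files shown earlier in this thread (the OpenBLAS path is system-specific and only an example):

// Sketch only: write an nvblas.conf so no hand-maintained /etc/nvblas.conf is needed.
#include <stdio.h>

static int write_nvblas_conf(const char *path, const char *cpu_blas_lib) {
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "NVBLAS_LOGFILE nvblas.log\n");
    fprintf(f, "NVBLAS_CPU_BLAS_LIB %s\n", cpu_blas_lib);
    fprintf(f, "NVBLAS_GPU_LIST ALL\n");
    fprintf(f, "NVBLAS_AUTOPIN_MEM_ENABLED\n");
    return fclose(f);
}

int main(void) {
    // The OpenBLAS path is system-specific - same caveat as in the original post.
    if (write_nvblas_conf("nvblas.conf", "/usr/lib/x86_64-linux-gnu/libopenblas.so") != 0) {
        fprintf(stderr, "failed to write nvblas.conf\n");
        return 1;
    }
    // libnvblas still has to be pointed at the file, e.g.
    //   export NVBLAS_CONFIG_FILE=$PWD/nvblas.conf
    printf("wrote nvblas.conf\n");
    return 0;
}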

@esonec

esonec commented Jan 5, 2023

How can I run Whisper on the GPU on Windows 11 (CUDA)?

@Benoit9

Benoit9 commented Jan 14, 2023

If you are interested in GPU acceleration with minimal code changes, I suggest you try CLBlast: https://github.com/CNugteren/CLBlast (and a nice presentation by its author: https://cnugteren.github.io/downloads/CLBlastIWOCL18.pdf)

All I had to do to use it in Whisper.cpp was:

Replace the openblas include in ggml.c:

//#include <cblas.h>
#include <clblast_netlib_c.h>

Change the Makefile to link with it:

ifdef WHISPER_OPENBLAS
	CFLAGS  += -DGGML_USE_OPENBLAS 
	LDFLAGS += -lclblast
endif

And on my Ubuntu laptop, install the Intel OpenCL driver:

apt-get install intel-opencl-icd

I also had to install CLBlast from source with the -DNETLIB=ON cmake flag in order to get the clblast_netlib_c.h functionality.

With Intel Iris Xe integrated graphics, I could match the performance of 8-core OpenBLAS with just one active core + the GPU. With 4 cores, I got about 2x better performance. I am curious what you would get with an Nvidia GPU.

And this is with the "netlib" bindings of CLBlast. As the author points out (https://github.com/CNugteren/CLBlast/blob/master/doc/bindings.md), they are for people who don't want to touch OpenCL, but they come with a severe performance loss... Presumably the batched versions of gemm and proper buffer handling could be much faster.
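For reference, a rough sketch of what "proper buffer handling" could look like with CLBlast's native C API (clblast_c.h) instead of the netlib bindings: the matrices live in explicit device buffers and the GEMM is queued directly. This is only an illustration with made-up tiny matrices and most error handling omitted; build with something like gcc clblast_sgemm.c -lclblast -lOpenCL.

// Sketch only: same row-major C = A * B^T as the call in ggml.c, but with
// explicit OpenCL buffers instead of the netlib drop-in bindings.
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>
#include <clblast_c.h>

int main(void) {
    enum { M = 2, N = 3, K = 4 };
    const float a[M*K] = { 1, 2, 3, 4,   5, 6, 7, 8 };              // row-major M x K
    const float b[N*K] = { 1, 0, 0, 0,   0, 1, 0, 0,   0, 0, 1, 0 }; // row-major N x K
    float c[M*N] = { 0 };                                            // result of a * b^T

    // Boilerplate: pick a device and create a context/queue.
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    // Explicit device buffers - this is the part the netlib bindings hide
    // (and redo on every call).
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof(a), NULL, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof(b), NULL, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(c), NULL, NULL);
    clEnqueueWriteBuffer(queue, da, CL_TRUE, 0, sizeof(a), a, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, db, CL_TRUE, 0, sizeof(b), b, 0, NULL, NULL);

    cl_event event = NULL;
    CLBlastStatusCode status = CLBlastSgemm(
        CLBlastLayoutRowMajor, CLBlastTransposeNo, CLBlastTransposeYes,
        M, N, K,
        1.0f, da, 0, K,
              db, 0, K,
        0.0f, dc, 0, N,
        &queue, &event);

    if (status == CLBlastSuccess) {
        clWaitForEvents(1, &event);
        clEnqueueReadBuffer(queue, dc, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);
        for (int i = 0; i < M*N; i++) printf("%4.1f ", c[i]);
        printf("\n");
        clReleaseEvent(event);
    }
    clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}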

@misutoneko

misutoneko commented Jan 15, 2023

@Benoit9, yep I gave CLBlast a try (very briefly) with my old GTX660 a while back.
I thought it might be a nice alternative for older GPUs, but it didn't work out very well in my case.
The memory requirements were the main problem, as I could only barely run the base model (and even that one was crashy).
So maybe it's better with a newer GPU with more memory.
I also found out that my elderly CPU has AVX support, so the performance problem kinda got solved for me that way...:D

@Compaile

Compaile commented Feb 7, 2023

Any more updates on this? Running it on the GPU would be very cool.

@joshuachris2001

joshuachris2001 commented Feb 17, 2023

Would this work on older NVIDIA GPUs, for instance one that throws errors with the Whisper torch implementation?
Also, does the GPU need to hold the entire model in memory? (Sorry, I'm an ML noob.)

@d3xMachina

It would be nice to have an implementation that also works on AMD, like HERE. It's fast and uses only 4 GB with the large model on an RX 6800 XT. I'm worried this repo won't be maintained though.

@rklasen

rklasen commented Apr 1, 2023

Does this still work with the current master branch? Inference on my CPU is quite slow, about 0.5 tokens/s on an AMD 3900X with the 7B model.

@vricosti

vricosti commented Apr 3, 2023

I found in this ticket what I was talking about here: #713
This fragmentation around AI technology is nonsense... CUDA, OpenCL, CoreML - you can find a fork of whisper for each OS/hardware combination (the original Python, a Swift version with CoreML, DirectCompute, pure CPU).
Energy is wasted for nothing...
How can the situation be improved?

Anyway I have also tested with CLBlast compiled locally and I get an exception inside:

// zT = y * xT
                cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                        ne11, ne01, ne10,
                        1.0f,    y, ne10,
                                 x, ne10,
                        0.0f,    d, ne01);

When I try to transcribe a simple wav file:

vricosti@iMac ~/Dev/Perso/jarvis/whisper.cpp/build/bin/Release
./main -m models/ggml-medium.bin -l fr -t 8 -f /Users/vricosti/Dev/Perso/jarvis/jarvis-open-ai/20230403_180429.wav
whisper_init_from_file_no_state: loading model from 'models/ggml-medium.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 4
whisper_model_load: mem required = 1725.00 MB (+ 43.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 1462.35 MB
whisper_model_load: model size = 1462.12 MB
whisper_init_state: kv self size = 42.00 MB
whisper_init_state: kv cross size = 140.62 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

main: processing '/Users/vricosti/Dev/Perso/jarvis/jarvis-open-ai/20230403_180429.wav' (80000 samples, 5.0 sec), 8 threads, 1 processors, lang = fr, task = transcribe, timestamps = 1 ...

CLBlast: OpenCL error: clBuildProgram: -11
libc++abi: terminating with uncaught exception of type std::runtime_error: CLBlast returned with error code -11
[1] 97401 abort ./main -m models/ggml-medium.bin -l fr -t 8 -f

And I am not having much luck with the stream app either: sometimes it only recognizes some words, while my small custom Python app recognizes a lot more, so something is weird.

Does anyone know about this? https://github.com/ROCm-Developer-Tools/HIP
