
Metal support #127

Draft
wants to merge 2 commits into master

Conversation

@ggerganov
Owner

ggerganov commented Nov 7, 2022

This is a quick and dirty implementation of GPU support for Apple hardware using Metal Performance Shaders. It demonstrates how part of the feed-forward layer in the encoder can be offloaded to the GPU.

On my MacBook M1 Pro, I don't observe a significant performance gain compared to the original implementation. Either I have a problem in my MPS integration, or the AMX coprocessor is simply doing a good enough job and adding Metal does not really help.

In any case, this PR can be a good starting point for anyone interested in adding GPU support to ggml. I think a similar approach can be taken for CUDA.

For now, I don't plan to merge this into master unless the performance gets better.
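
For anyone curious what the offloading boils down to, here is a rough, self-contained sketch of a single f32 matrix multiplication going through MPS, assuming ARC is enabled. This is illustrative only, not the actual code in this PR, and all names in it are made up:

#import <Metal/Metal.h>
#import <MetalPerformanceShaders/MetalPerformanceShaders.h>
#include <string.h>

// illustrative sketch: C = A (M x K) * B (K x N), row-major float32
static void mps_mul_mat_f32(const float * a, const float * b, float * c,
                            int M, int K, int N) {
    id<MTLDevice>       device = MTLCreateSystemDefaultDevice();
    id<MTLCommandQueue> queue  = [device newCommandQueue];

    // copy the inputs into Metal buffers and allocate one for the result
    id<MTLBuffer> bufA = [device newBufferWithBytes:a length:M*K*sizeof(float)
                                            options:MTLResourceStorageModeShared];
    id<MTLBuffer> bufB = [device newBufferWithBytes:b length:K*N*sizeof(float)
                                            options:MTLResourceStorageModeShared];
    id<MTLBuffer> bufC = [device newBufferWithLength:M*N*sizeof(float)
                                             options:MTLResourceStorageModeShared];

    MPSMatrixDescriptor * dA = [MPSMatrixDescriptor matrixDescriptorWithRows:M columns:K
                                    rowBytes:K*sizeof(float) dataType:MPSDataTypeFloat32];
    MPSMatrixDescriptor * dB = [MPSMatrixDescriptor matrixDescriptorWithRows:K columns:N
                                    rowBytes:N*sizeof(float) dataType:MPSDataTypeFloat32];
    MPSMatrixDescriptor * dC = [MPSMatrixDescriptor matrixDescriptorWithRows:M columns:N
                                    rowBytes:N*sizeof(float) dataType:MPSDataTypeFloat32];

    MPSMatrix * mA = [[MPSMatrix alloc] initWithBuffer:bufA descriptor:dA];
    MPSMatrix * mB = [[MPSMatrix alloc] initWithBuffer:bufB descriptor:dB];
    MPSMatrix * mC = [[MPSMatrix alloc] initWithBuffer:bufC descriptor:dC];

    MPSMatrixMultiplication * mul =
        [[MPSMatrixMultiplication alloc] initWithDevice:device
                                           transposeLeft:NO transposeRight:NO
                                              resultRows:M resultColumns:N interiorColumns:K
                                                   alpha:1.0 beta:0.0];

    id<MTLCommandBuffer> cb = [queue commandBuffer];
    [mul encodeToCommandBuffer:cb leftMatrix:mA rightMatrix:mB resultMatrix:mC];
    [cb commit];
    [cb waitUntilCompleted];   // CPU blocks here until the GPU work is done

    memcpy(c, bufC.contents, M*N*sizeof(float));
}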

Seems to be only marginally faster compared to pure AMX
@DiegoGiovany

can't build it on my M1 Max:

c++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main -framework Accelerate
Undefined symbols for architecture arm64:
"_ggml_mtl_alloc", referenced from:
_ggml_new_tensor_mtl_impl in ggml.o
"_ggml_mtl_init", referenced from:
_ggml_init in ggml.o
"_ggml_mtl_mul_mat_f16", referenced from:
_ggml_compute_forward_mul_mat_f16_f32 in ggml.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [main] Error 1

@ggerganov
Owner Author

@DiegoGiovany
Forgot to update the Makefile - it should work now. make clean + make

@latenitefilms

This may or may not be helpful, but Warren Moore writes:

I don’t have any Apple Silicon devices, nor do I know much about ML or Whisper, so I’m not of much help.

But, the use of managed buffers without the use of explicit synchronization (via a blit encoder) is suspicious; I don’t see how this could work on a discrete GPU as-written.

Also, I’m not sure if the data dependencies allow concurrent execution, but calling -waitUntilCompleted forces the CPU thread to wait for GPU work to finish. There would be less overhead if encoders could be batched into fewer command buffers.

Finally, it leaks all of the Metal resources it creates, since ARC is disabled in the target. Any thread that encodes commands should have an autorelease pool, and resources should be explicitly released if ARC is disabled.
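
For what it's worth, an illustrative sketch of what the explicit synchronization plus an autorelease pool could look like for a managed buffer. This is not code from the PR; the helper name and arguments are made up:

#import <Metal/Metal.h>
#include <string.h>

// hypothetical helper: read a managed buffer back on the CPU after the GPU work
// that writes to it has been committed on the same queue
static void read_back_managed(id<MTLCommandQueue> queue, id<MTLBuffer> buf,
                              void * dst, size_t size) {
    @autoreleasepool {
        id<MTLCommandBuffer> cb = [queue commandBuffer];

        // for MTLStorageModeManaged buffers on a discrete GPU, the GPU copy has
        // to be synchronized back before the CPU can see the results
        id<MTLBlitCommandEncoder> blit = [cb blitCommandEncoder];
        [blit synchronizeResource:buf];
        [blit endEncoding];

        [cb commit];
        [cb waitUntilCompleted];

        memcpy(dst, buf.contents, size);

        // the command buffer and blit encoder are autoreleased objects, so with
        // ARC disabled the surrounding @autoreleasepool is what reclaims them
    }
}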

@vade

vade commented Dec 20, 2022

Hi. Firstly, thanks for this repo. This project is awesome!

Forgive me if I'm incorrect in understanding the ramifications of this, but one thought after a brief look at this PR: it might make sense to decouple the command buffer commit / wait / read-back cycle from each function call, as in ggml_mtl_mul_mat_vec_f16.

Would it be feasible to instead upload the first set of operands into an MTLBuffer as needed, keep the compute on the GPU, encode all of the multiplies into a single command buffer, skip the intermediate read-backs, and do a single

    // the completion handler must be registered before the command buffer is committed
    [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
        memcpy(...);   // copy the final result back to the CPU here
    }];
    [commandBuffer commit];

at the very end of the calculation? This would remove any CPU / GPU pipeline stalls, keep the compute on the GPU, and also allow some work to be done on the CPU while waiting for the GPU to complete.

Forgive me if I don't get the side effects of this proposed change (I'm not familiar enough with the internals of how Whisper works).

Thank you!

@ggerganov
Owner Author

@vade
Yes, absolutely. It's definitely better to put as many operations as possible in a single command buffer and only read the data back once at the end. The thing is that this will require a refactoring of the ggml interface, or something clever.

For example, if I have the following operations:

auto c = ggml_mul_mat(ctx, a, b);
auto e = ggml_mul_mat(ctx, c, d);
auto g = ggml_mul_mat(ctx, e, f);

// do something with "g"
...

Ideally I would want this to be a single command buffer with 3 matrix multiplications that starts with a and b as input and returns g without waiting for the intermediate results. To achieve that, we either need some clever logic in ggml to determine when a command buffer starts and ends by analysing the forward compute graph, or we need some explicit ggml interface calls to be called manually by the user whenever CPU/GPU synchronisation has to occur.
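
For illustration, the explicit-interface option could look roughly like this. These are purely hypothetical calls; ggml_gpu_begin and ggml_gpu_commit_and_wait do not exist, only ggml_mul_mat is real:

// hypothetical explicit markers around a region whose ops share one command buffer
ggml_gpu_begin(ctx);                     // open a command buffer

struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);
struct ggml_tensor * e = ggml_mul_mat(ctx, c, d);
struct ggml_tensor * g = ggml_mul_mat(ctx, e, f);

ggml_gpu_commit_and_wait(ctx);           // commit once, read back only "g"

// do something with "g"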

The proposal in this PR is a very rough starting point and is for sure far from optimal.
Many things can be improved.

@vade

vade commented Dec 23, 2022

Thanks @ggerganov - and to be clear, I wasn't trying to point out any flaws. I'm aware this entire endeavor is a work in progress and there are a lot of moving pieces (and bravo on that!).

I was hesitant to mention it only because I'm not entirely familiar with the code base or Whisper's internals as implemented here.

Does it make sense to break down some changes that would benefit GPU pipelining on all supported platforms? My suspicion is that anything Metal benefits from would also benefit CUDA, etc.

May I propose a few baby steps to break this potentially large change into manageable pieces for all platforms and make integration easier?

  • Identify which functions are ripe for pipelining, and which groups of layers in the whisper encoder / decoder can benefit from GPU work
  • Refactor the method signatures of those functions without any changes to how the code currently works. This would establish a baseline
  • Identify the locations in the code that require GPU synchronization.
  • Stub in a GPU submit / block / wait-for-GPU-to-finish function with a defined method signature that doesn't actually do anything yet
  • Use the above as a rebase for the Metal branch, giving us entry points to add the pipelined GPU operations and allowing other platforms to eventually benefit from the proposed changes.

Apologies, I'm not intending to step in and try to manage your project, just to start a conversation and make a set of actionable proposals that the community can rally around :)

Thank you, and again, this project is really awesome.

My assumptions for the changes would be (a rough signature sketch follows the list):

  • new function that creates a GPU context
  • new function that creates a GPU command buffer from the context / command queue created above
  • refactor the method signatures of the existing GPU-bound functions to take an additional argument (the command buffer)
  • new function that handles GPU submission and blocking, which takes in the command buffer created by the new function above
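
A sketch of what those signatures might look like in ggml's C style (all placeholder names I'm making up here, nothing from the actual code base):

// hypothetical placeholder API, for discussion only
struct ggml_gpu_context;    // would own the device and the command queue
struct ggml_gpu_cmd_buf;    // would wrap a single command buffer

struct ggml_gpu_context * ggml_gpu_init(void);
struct ggml_gpu_cmd_buf * ggml_gpu_new_cmd_buf(struct ggml_gpu_context * ctx);

// existing GPU ops would take the command buffer as an extra argument
void ggml_gpu_mul_mat_f16(struct ggml_gpu_cmd_buf * cb,
                          const struct ggml_tensor * src0,
                          const struct ggml_tensor * src1,
                                struct ggml_tensor * dst);

// single submission + blocking wait at the end of the pipelined region
void ggml_gpu_submit_and_wait(struct ggml_gpu_cmd_buf * cb);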

LMK - I'm happy to help, and potentially even sponsor some of this development.

@voidfel

voidfel commented Apr 14, 2023

Hi, just curious if this is still on the roadmap and being actively worked on? Thanks for your hard work.

@williamjeong2

ggerganov/llama.cpp#1642
llama.cpp has been updated to support Metal. I hope that whisper.cpp will also be updated to have the same capability.

@ggerganov
Owner Author

Yes, it will come for sure

@voidfel

voidfel commented Jun 14, 2023

It is already optimized for Apple silicon via ARM NEON, the Accelerate framework and Core ML. I am using the medium.en model and it is super fast on my M1 Pro 16GB; it is absolutely amazing. Only the first run is slow. Can Metal make it even faster? That would be unbelievable.
