
Metal support #127

Draft
wants to merge 2 commits into master

Conversation

@ggerganov
Owner

ggerganov commented Nov 7, 2022

This is a quick and dirty implementation of GPU support for Apple hardware using Metal Performance Shaders. It demonstrates how part of the feed-forward layer in the encoder can be offloaded to the GPU.

On my MacBook M1 Pro, I don't observe a significant performance gain compared to the original implementation. Either I have a problem in my MPS integration, or the AMX coprocessor is simply doing a good enough job and adding Metal does not really help.

In any case, this PR can be a good starting point for anyone interested in adding GPU support to ggml. I think a similar approach can be taken for CUDA.

For now, I don't plan to merge this into master unless the performance gets better.
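
For anyone curious what the offloading boils down to, here is a rough, self-contained sketch of a single f32 matrix multiplication going through MPS, assuming ARC is enabled. This is illustrative only, not the actual code in this PR, and all names in it are made up:

#import <Metal/Metal.h>
#import <MetalPerformanceShaders/MetalPerformanceShaders.h>
#include <string.h>

// illustrative sketch: C = A (M x K) * B (K x N), row-major float32
static void mps_mul_mat_f32(const float * a, const float * b, float * c,
                            int M, int K, int N) {
    id<MTLDevice>       device = MTLCreateSystemDefaultDevice();
    id<MTLCommandQueue> queue  = [device newCommandQueue];

    // copy the inputs into Metal buffers and allocate one for the result
    id<MTLBuffer> bufA = [device newBufferWithBytes:a length:M*K*sizeof(float)
                                            options:MTLResourceStorageModeShared];
    id<MTLBuffer> bufB = [device newBufferWithBytes:b length:K*N*sizeof(float)
                                            options:MTLResourceStorageModeShared];
    id<MTLBuffer> bufC = [device newBufferWithLength:M*N*sizeof(float)
                                             options:MTLResourceStorageModeShared];

    MPSMatrixDescriptor * dA = [MPSMatrixDescriptor matrixDescriptorWithRows:M columns:K
                                    rowBytes:K*sizeof(float) dataType:MPSDataTypeFloat32];
    MPSMatrixDescriptor * dB = [MPSMatrixDescriptor matrixDescriptorWithRows:K columns:N
                                    rowBytes:N*sizeof(float) dataType:MPSDataTypeFloat32];
    MPSMatrixDescriptor * dC = [MPSMatrixDescriptor matrixDescriptorWithRows:M columns:N
                                    rowBytes:N*sizeof(float) dataType:MPSDataTypeFloat32];

    MPSMatrix * mA = [[MPSMatrix alloc] initWithBuffer:bufA descriptor:dA];
    MPSMatrix * mB = [[MPSMatrix alloc] initWithBuffer:bufB descriptor:dB];
    MPSMatrix * mC = [[MPSMatrix alloc] initWithBuffer:bufC descriptor:dC];

    MPSMatrixMultiplication * mul =
        [[MPSMatrixMultiplication alloc] initWithDevice:device
                                           transposeLeft:NO transposeRight:NO
                                              resultRows:M resultColumns:N interiorColumns:K
                                                   alpha:1.0 beta:0.0];

    id<MTLCommandBuffer> cb = [queue commandBuffer];
    [mul encodeToCommandBuffer:cb leftMatrix:mA rightMatrix:mB resultMatrix:mC];
    [cb commit];
    [cb waitUntilCompleted];   // CPU blocks here until the GPU work is done

    memcpy(c, bufC.contents, M*N*sizeof(float));
}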

Seems to be only marginally faster compared to pure AMX
@DiegoGiovany

can't build it on my M1 Max:

c++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main -framework Accelerate
Undefined symbols for architecture arm64:
"_ggml_mtl_alloc", referenced from:
_ggml_new_tensor_mtl_impl in ggml.o
"_ggml_mtl_init", referenced from:
_ggml_init in ggml.o
"_ggml_mtl_mul_mat_f16", referenced from:
_ggml_compute_forward_mul_mat_f16_f32 in ggml.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [main] Error 1

@ggerganov
Owner Author

@DiegoGiovany
Forgot to update the Makefile - it should work now. make clean + make

@latenitefilms

This may or may not be helpful, but Warren Moore writes:

I don’t have any Apple Silicon devices, nor do I know much about ML or Whisper, so I’m not of much help.

But, the use of managed buffers without the use of explicit synchronization (via a blit encoder) is suspicious; I don’t see how this could work on a discrete GPU as-written.

Also, I’m not sure if the data dependencies allow concurrent execution, but calling -waitUntilCompleted forces the CPU thread to wait for GPU work to finish. There would be less overhead if encoders could be batched into fewer command buffers.

Finally, it leaks all of the Metal resources it creates, since ARC is disabled in the target. Any thread that encodes commands should have an autorelease pool, and resources should be explicitly released if ARC is disabled.
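
For what it's worth, an illustrative sketch of what the explicit synchronization plus an autorelease pool could look like for a managed buffer. This is not code from the PR; the helper name and arguments are made up:

#import <Metal/Metal.h>
#include <string.h>

// hypothetical helper: read a managed buffer back on the CPU after the GPU work
// that writes to it has been committed on the same queue
static void read_back_managed(id<MTLCommandQueue> queue, id<MTLBuffer> buf,
                              void * dst, size_t size) {
    @autoreleasepool {
        id<MTLCommandBuffer> cb = [queue commandBuffer];

        // for MTLStorageModeManaged buffers on a discrete GPU, the GPU copy has
        // to be synchronized back before the CPU can see the results
        id<MTLBlitCommandEncoder> blit = [cb blitCommandEncoder];
        [blit synchronizeResource:buf];
        [blit endEncoding];

        [cb commit];
        [cb waitUntilCompleted];

        memcpy(dst, buf.contents, size);

        // the command buffer and blit encoder are autoreleased objects, so with
        // ARC disabled the surrounding @autoreleasepool is what reclaims them
    }
}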

@vade

vade commented Dec 20, 2022

Hi. Firstly, thanks for this repo. This project is awesome!

Forgive me if I'm incorrect in understanding the ramifications of this, but one thought after a brief look at this PR: it might make sense to decouple the command buffer commit / wait / read-back cycle from each function call, as in ggml_mtl_mul_mat_vec_f16.

Would it be feasible to instead upload the first set of operands into an MTLBuffer as needed, keep the compute on the GPU, encode all of the multiplies into a single command buffer, skip the intermediate read-backs, and do a single

    // the completion handler must be registered before the command buffer is committed
    [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
        memcpy(...);   // copy the final result back to the CPU here
    }];
    [commandBuffer commit];

at the very end of the calculation? This would remove any CPU / GPU pipeline stalls, keep the compute on the GPU, and also allow some work to be done on the CPU while waiting for the GPU to complete.

Forgive me if I don't get the side effects of this proposed change (I'm not familiar enough with the internals of how Whisper works).

Thank you!

@ggerganov
Owner Author

@vade
Yes, absolutely. It's definitely better to put as many operations as possible in a single command buffer and only read the data back once at the end. The thing is that this will require a refactoring of the ggml interface, or something clever.

For example, if I have the following operations:

auto c = ggml_mul_mat(ctx, a, b);
auto e = ggml_mul_mat(ctx, c, d);
auto g = ggml_mul_mat(ctx, e, f);

// do something with "g"
...

Ideally I would want this to be a single command buffer with 3 matrix multiplications that starts with a and b as input and returns g without waiting for the intermediate results. To achieve that, we either need some clever logic in ggml to determine when a command buffer starts and ends by analysing the forward compute graph, or we need some explicit ggml interface calls to be called manually by the user whenever CPU/GPU synchronisation has to occur.
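
For illustration, the explicit-interface option could look roughly like this. These are purely hypothetical calls; ggml_gpu_begin and ggml_gpu_commit_and_wait do not exist, only ggml_mul_mat is real:

// hypothetical explicit markers around a region whose ops share one command buffer
ggml_gpu_begin(ctx);                     // open a command buffer

struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);
struct ggml_tensor * e = ggml_mul_mat(ctx, c, d);
struct ggml_tensor * g = ggml_mul_mat(ctx, e, f);

ggml_gpu_commit_and_wait(ctx);           // commit once, read back only "g"

// do something with "g"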

The proposal in this PR is a very rough starting point and is for sure far from optimal.
Many things can be improved.

@vade

vade commented Dec 23, 2022

Thanks @ggerganov - and to be clear, I wasn't trying to point out any flaws. I'm aware this entire endeavor is a work in progress and there are a lot of moving pieces (and bravo on that!).

I was hesitant to mention it only because I'm not entirely familiar with the code base or Whisper's internals as implemented here.

Does it make sense to break down some changes that would benefit GPU pipelining on all supported platforms? My suspicion is that anything Metal benefits from would also benefit CUDA, etc.

May I propose a few baby steps to break this potentially large change into manageable pieces for all platforms and make integration easier?

  • Identify which functions are ripe for pipelining, and which groups of layers in the whisper encoder / decoder can benefit from GPU work
  • Refactor the method signatures of those functions without any changes to how the code currently works. This would establish a baseline
  • Identify the locations in the code that require GPU synchronization.
  • Stub in a GPU submit / block / wait-for-GPU-to-finish function with a defined method signature that doesn't actually do anything yet
  • Use the above as a rebase for the Metal branch, giving us entry points to add the pipelined GPU operations and allowing other platforms to eventually benefit from the proposed changes.

Apologies, I'm not intending to step in and try to manage your project, just to start a conversation and make a set of actionable proposals that the community can rally around :)

Thank you, and again, this project is really awesome.

My assumptions for the changes would be (a rough signature sketch follows the list):

  • new function that creates a GPU context
  • new function that creates a GPU command buffer from the context / command queue created above
  • refactor the method signatures of the existing GPU-bound functions to take an additional argument (the command buffer)
  • new function that handles GPU submission and blocking, which takes in the command buffer created by the new function above
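
A sketch of what those signatures might look like in ggml's C style (all placeholder names I'm making up here, nothing from the actual code base):

// hypothetical placeholder API, for discussion only
struct ggml_gpu_context;    // would own the device and the command queue
struct ggml_gpu_cmd_buf;    // would wrap a single command buffer

struct ggml_gpu_context * ggml_gpu_init(void);
struct ggml_gpu_cmd_buf * ggml_gpu_new_cmd_buf(struct ggml_gpu_context * ctx);

// existing GPU ops would take the command buffer as an extra argument
void ggml_gpu_mul_mat_f16(struct ggml_gpu_cmd_buf * cb,
                          const struct ggml_tensor * src0,
                          const struct ggml_tensor * src1,
                                struct ggml_tensor * dst);

// single submission + blocking wait at the end of the pipelined region
void ggml_gpu_submit_and_wait(struct ggml_gpu_cmd_buf * cb);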

LMK - I'm happy to help, and potentially even sponsor some of this development.

@voidfel

voidfel commented Apr 14, 2023

Hi, just curious if this is still on the roadmap and being actively worked on? Thanks for your hard work.

@williamjeong2

ggerganov/llama.cpp#1642
llama.cpp has been updated to support Metal. I hope that whisper.cpp will also be updated to have the same capability.

@ggerganov
Owner Author

Yes, it will come for sure

@voidfel

voidfel commented Jun 14, 2023

It is already optimized for Apple silicon via ARM NEON, the Accelerate framework and Core ML. I am using the medium.en model and it is super fast on my M1 Pro 16GB; it is absolutely amazing. Only the first run is slow. Can Metal make it even faster? That would be unbelievable.
