
Diminishing returns with increasing number of threads #200

Open
savchenko opened this issue Nov 29, 2022 · 16 comments
Labels
performance CPU and memory usage - results and comparisons

Comments

@savchenko

It seems like 7 threads is a sweet spot, after which performance starts decreasing:

[graph: transcription time vs. number of threads]

Is this expected?

Latest build from the GitHub Workflows
Windows 21H2
AMD 3700X
@j-f1

j-f1 commented Nov 29, 2022

How many CPUs do you have? It might be that once you are using all of your cores you start to lose performance from excess thread switching.

@savchenko
Author

@j-f1, the Ryzen 3700X has 8 cores and 16 threads.

@RYucel

RYucel commented Nov 29, 2022 via email

@savchenko
Author

@RYucel, as you can see from the graph above, there is still a benefit of ~250 ms when increasing the number of threads from 4 to 6. Anything higher is indeed pointless.

@RYucel

RYucel commented Nov 29, 2022 via email

ggerganov added the performance label Dec 1, 2022
@ggerganov
Owner

@savchenko

Yes, I observe the same behaviour on M1 Pro - 7 threads is the sweet spot.
Thanks for pointing this out - I actually thought that 8 threads was best.

My explanation is that the computation becomes memory-bound at some point, so you stop gaining performance with more CPU power. It's the memory that limits us.
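
For intuition, here is a back-of-the-envelope roofline check. The model size and bandwidth figures are assumptions for illustration, not measurements:

```c
#include <stdio.h>

int main(void) {
    // assumed numbers: ~1.5 GB of weights for the medium model,
    // ~200 GB/s of memory bandwidth on an M1 Pro
    const double model_bytes = 1.5e9;
    const double mem_bw_bps  = 200e9;

    // if a full pass over the network streams the weights from memory
    // at least once, bandwidth alone bounds the pass time from below,
    // no matter how many cores are computing
    printf("bandwidth-bound lower limit: %.2f ms per pass\n",
           1e3 * model_bytes / mem_bw_bps);
    return 0;
}
```

With these assumed numbers the floor is ~7.5 ms per pass, and extra threads cannot push below it.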

@jonvaldes

jonvaldes commented Dec 4, 2022

I've been running some tests under Superluminal, and I believe I'm seeing some waste when running on multiple threads.

The way ggml works, it spawns new threads every time ggml_graph_compute is invoked, but in some cases in whisper.cpp this gets pretty bad, especially in whisper_decode. For example, here's what one of those invocations looks like in Superluminal:
[Superluminal capture: timeline of a single short-lived ggml worker thread]

The thread only lives for 2.7 ms (which is already worrying, as there are thousands of these threads being spawned), but of that time only about 1 ms is spent on actual work. The rest is calls to atomic_load, or overhead from creating and destroying the thread.

It looks like trying to make these threads longer-lived and using a lighter synchronization mechanism should bring some nice perf gains here.
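
A minimal sketch of that longer-lived-threads idea, assuming a pthread-style pool that parks workers on a condition variable between graphs (names and structure are illustrative, not ggml's actual internals):

```c
#include <pthread.h>
#include <stdbool.h>

// workers are created once and reused, instead of being
// spawned and joined for every ggml_graph_compute call
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    void (*job)(void *);    // work for the current graph, NULL = idle
    void *arg;
    bool  shutdown;
} pool_t;

static void *worker(void *p) {
    pool_t *pool = (pool_t *) p;
    for (;;) {
        pthread_mutex_lock(&pool->mutex);
        while (pool->job == NULL && !pool->shutdown) {
            pthread_cond_wait(&pool->cond, &pool->mutex);  // park, don't exit
        }
        if (pool->shutdown) {
            pthread_mutex_unlock(&pool->mutex);
            return NULL;
        }
        void (*job)(void *) = pool->job;
        void *arg = pool->arg;
        pool->job = NULL;   // claim the job so other workers keep waiting
        pthread_mutex_unlock(&pool->mutex);
        job(arg);           // the thread survives for the next invocation
    }
}
```

The trade-off, as discussed below, is that waking a parked thread through a condition variable is slower than spinning, which matters when graph nodes only run for microseconds.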

@ggerganov
Owner

@jonvaldes

Thanks for this analysis!
I guess I will have to make the threads wait on a condition variable instead of joining them when ggml_graph_compute finishes.

Regarding the atomic_load - once the threads are started, I found that busy-looping on an atomic counter is much more efficient than waiting on and notifying a condition variable. It is probably more wasteful energy-wise, but since I am more interested in performance, it was the better option. I think I can add a "low-power" mode that uses the standard condition-variable mechanism instead of busy loops. That would make the CPU go less crazy.
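
Roughly, the busy-loop style described here looks like this (a sketch of the idea, not the actual ggml code):

```c
#include <stdatomic.h>

// spin until all workers have checked in; wake-up latency is a few
// cycles, which is why this can beat cond_wait/notify for very short
// graph nodes - at the cost of burning a core while idle (hence the
// "low-power" condition-variable mode mentioned above)
static void barrier_spin(atomic_int *n_ready, int n_threads) {
    while (atomic_load_explicit(n_ready, memory_order_acquire) < n_threads) {
        // busy wait; a pause/yield hint here would trade a little
        // latency for less power and less contention
    }
}
```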

@debasish-mihup

@savchenko Which model did you use and what was the duration of the audio segment used for testing?

@savchenko
Author

@debasish-mihup, medium.en and "long enough to run for many minutes".

@yujinqiu

@ggerganov I profiled it with FlameGraph on my Linux host.
With 8 threads, you can see that ggml_compute_forward_mul_mat spends only about 24.71% of the CPU time doing actual work, while 72.53% (97.24 - 24.71) is wasted. I suspect this is the reason why Metal doesn't work as expected - it's not the bottleneck.
[FlameGraph: 8 threads]

I'm not familiar with C++, but from the code I guess that decreasing the thread count can help reduce the busy-waiting time.
Here is the FlameGraph with 4 threads; you can see that ggml_compute_forward_mul_mat now spends 63.21% doing actual work, and only 32.19% (95.4 - 63.21) of the CPU time is busy waiting.
[FlameGraph: 4 threads]

@vitacon

vitacon commented Dec 25, 2022

> The thread only lives for 2.7 ms (which is already worrying, as there are thousands of these threads being spawned), but of that time only about 1 ms is spent on actual work.

Does the length of input affect the quality of output?
Wouldn't it be more efficient to stop creating these micro-threads and instead split the input audio into several segments, letting each segment be munched by a separate thread? (Of course, the results would then have to be pasted together from the different threads.)

These are the times from my CPU (AMD Ryzen 5 3600, 6 cores / 12 threads) with different numbers of threads:

threads    time (ms)
1          779815.88
2          441046.56
4          277384.97
6          252671.91
8          236560.52
10         214721.44
11         203417.19
12         208065.34

Two parallel tasks (6 + 6 threads): 183298.86 ms

I suppose it should be possible to get much closer to the ideal time (779 815 ms / 12 threads ≈ 64 984 ms). It would just require finding the right places to cut the original audio without splitting any word. Actually, skipping silent parts (an audio gate) would also help.
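
A sketch of that segment-splitting idea, with a small fixed overlap so boundary words survive (illustrative only - the segment count, the overlap, and the stitching policy are all assumptions, and whisper.cpp's actual parallel path may differ):

```c
#include <pthread.h>

// split the PCM input into one chunk per thread, padded with a small
// overlap so no word is cut in half; a smarter splitter would cut at
// silence instead (the audio-gate idea above)
enum { SAMPLE_RATE = 16000, N_SEG = 12, OVERLAP = SAMPLE_RATE / 5 /* 200 ms */ };

typedef struct {
    const float *samples;   // start of this segment's PCM data
    int          n_samples; // length including the overlap pad
    int          seg_id;    // used to stitch results back in order
} segment_t;

static void *transcribe_segment(void *arg) {
    segment_t *seg = (segment_t *) arg;
    // ... run an independent decoder instance over seg->samples, then
    // drop the transcript of the leading OVERLAP region for every
    // segment except the first ...
    (void) seg;
    return NULL;
}

static void transcribe_parallel(const float *pcm, int n_total) {
    pthread_t th[N_SEG];
    segment_t seg[N_SEG];
    const int step = n_total / N_SEG;
    for (int i = 0; i < N_SEG; i++) {
        const int start  = i * step - (i > 0 ? OVERLAP : 0);
        seg[i].samples   = pcm + start;
        seg[i].n_samples = (i == N_SEG - 1) ? n_total - start
                                            : step + (i > 0 ? OVERLAP : 0);
        seg[i].seg_id    = i;
        pthread_create(&th[i], NULL, transcribe_segment, &seg[i]);
    }
    for (int i = 0; i < N_SEG; i++) {
        pthread_join(th[i], NULL);
    }
}
```

The catch is accuracy at the seams: each decoder loses cross-segment context, so the stitched result can differ from a single sequential pass.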

@ggerganov
Owner

I tried to eliminate thread creation/joining in #343 but performance did not improve. My hypothesis is that mutex locks are actually very expensive - more expensive than creating and joining threads. But I'm not sure if I'm correct.

I agree that there is a lot of performance to be gained in the decoder. ggml_graph_compute is called many, many times and there is significant overhead from these calls. But I don't know yet what the best way to improve this is.

@Ono-Sendai

There's something very wrong with the multithreading support.
I have a Ryzen 5950X (16 cores, 32 hardware threads).
Setting n_threads = 16 gives inference times (2 trials performed): 5.22 s, 4.99 s.
Setting n_threads = 32 gives inference times: 124.09 s, 196.79 s.
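
One pragmatic guard against that oversubscription cliff (a hypothetical helper, not part of whisper.cpp's API):

```c
#include <unistd.h>

// clamp the requested thread count to the number of online CPUs;
// note that sysconf reports logical CPUs, so on an SMT machine like
// the 5950X you may want to divide by two to target physical cores
static int clamp_threads(int n_requested) {
    long n_cpu = sysconf(_SC_NPROCESSORS_ONLN);
    if (n_cpu > 0 && n_requested > (int) n_cpu) {
        n_requested = (int) n_cpu;
    }
    return n_requested;
}
```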

@Ono-Sendai

Something like 80% of the total computation time is spent in ggml_graph_compute_thread calling atomic_load.

@heshpdx

heshpdx commented Jun 4, 2023

I collected data on two of the many-core server systems I have in my lab, both aarch64. I used a Chinese audio file which is 73 seconds long, and tested the latest mainline build with a 5-bit quantized model (q5_1):

./main -t 1 -l chinese -d 73300 -bo 5 -m models/ggml-model-whisper-base-q5_1.bin -f Tencent-chinese.wav

The Huawei machine has 48 cores on an SoC, and the Ampere machine has 80 cores on an SoC. Neither has SMT. I ran a few different trials and took the best time for each thread count. The best time on the Huawei was with 13 threads, and for Ampere it was at 20 threads.

[graph: best time vs. thread count, Huawei 48-core]

[graph: best time vs. thread count, Ampere 80-core]

The Ampere machine has large private L2 caches; when we bind the threads so the OS doesn't schedule them all over the place, we keep the caches hot (for data and locks), which leads to better CPU usage - although that mainly matters in the region on the right, after we have already hit the minimum at 16 threads. Using 80 threads is twice as slow as using 16 threads. Maybe there just isn't enough work to stay efficient past 16 threads? Are there knobs to partition the work at a coarser granularity per thread?

for i in `seq 1 80`; do num=$((i-1)); time=`perf stat taskset -c 0-$num ./main -l chinese -d 73300 -bo 5 -m models/ggml-model-whisper-base-q5_1.bin -f Tencent-chinese.wav -t $i |& grep seconds\ time | awk '{print $1}'`; echo "$i,$time" >> stats.csv; done