Use Memory Segment API for aligned vector loads. #132

Open
jatin-bhateja opened this issue Oct 20, 2023 · 8 comments

jatin-bhateja commented Oct 20, 2023

Hi All,

Most of the vectorized code in SimdOps.java uses the fromArray API to load array contents into vectors.

Starting with JDK 20, the Vector API added support for loading and storing vectors from MemorySegments.

Using the fromMemorySegment / intoMemorySegment APIs one can ensure aligned vector loads / stores. Since most of the code uses SPECIES_PREFERRED, the vector size (64 bytes) matches the cache-line size on x86 AVX-512 targets.

Thus, if the first vector load in the vector loop happens from an address that is not a multiple of the cache-line / vector size, every successive vector load will span two cache lines, which can carry a significant performance penalty.
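
For illustration, a minimal sketch (not the jvector code itself) of an aligned, MemorySegment-backed dot-product loop; it assumes JDK 21+ with --add-modules jdk.incubator.vector (and --enable-preview for the FFM API on 21), and the class / method names are made up:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteOrder;

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class AlignedDotSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Dot product over two MemorySegments. floatCount is assumed to be a
    // multiple of SPECIES.length(), so there is no scalar tail loop.
    static float dot(MemorySegment a, MemorySegment b, int floatCount) {
        FloatVector acc = FloatVector.zero(SPECIES);
        long byteLength = (long) floatCount * Float.BYTES;
        long step = (long) SPECIES.length() * Float.BYTES;
        for (long offset = 0; offset < byteLength; offset += step) {
            // Offsets are in bytes; because the segments are 64-byte aligned,
            // no load straddles a cache line on an AVX-512 target.
            FloatVector va = FloatVector.fromMemorySegment(SPECIES, a, offset, ByteOrder.LITTLE_ENDIAN);
            FloatVector vb = FloatVector.fromMemorySegment(SPECIES, b, offset, ByteOrder.LITTLE_ENDIAN);
            acc = va.fma(vb, acc);
        }
        return acc.reduceLanes(VectorOperators.ADD);
    }

    public static void main(String[] args) {
        int floatCount = 1024;
        // Arena.allocate(byteSize, byteAlignment): request 64-byte alignment so
        // each segment starts on a cache-line boundary.
        MemorySegment a = Arena.global().allocate((long) floatCount * Float.BYTES, 64);
        MemorySegment b = Arena.global().allocate((long) floatCount * Float.BYTES, 64);
        System.out.println(dot(a, b, floatCount));
    }
}
```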

The following PMU events can be used to count the number of split loads against the total number of memory loads.

[screenshot: PMU events for counting split loads vs. total memory loads]

Best Regards,
Jatin

jbellis (Owner) commented Oct 20, 2023

Thanks, Jatin!

@tjake did you test MemorySegment vectors instead of float[], or am I thinking of something else?

tjake (Collaborator) commented Oct 22, 2023

I did. @jatin-bhateja, look at #90; you can run the JMH benchmark yourself.
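
For reference, a skeleton of what such a comparison might look like with JMH; the real benchmark lives in PR #90, and the names, sizes, and settings below are illustrative only (JDK 21+ with --add-modules jdk.incubator.vector assumed):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteOrder;
import java.util.concurrent.TimeUnit;

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class ArrayVsSegmentSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
    static final int SIZE = 1024;

    float[] arr = new float[SIZE];
    // 64-byte-aligned backing storage for the MemorySegment variant.
    MemorySegment seg = Arena.global().allocate((long) SIZE * Float.BYTES, 64);

    @Benchmark
    public void fromArray(Blackhole bh) {
        FloatVector acc = FloatVector.zero(SPECIES);
        for (int i = 0; i < SIZE; i += SPECIES.length()) {
            acc = acc.add(FloatVector.fromArray(SPECIES, arr, i));
        }
        bh.consume(acc.reduceLanes(VectorOperators.ADD));
    }

    @Benchmark
    public void fromMemorySegment(Blackhole bh) {
        FloatVector acc = FloatVector.zero(SPECIES);
        for (long o = 0; o < (long) SIZE * Float.BYTES; o += SPECIES.length() * Float.BYTES) {
            acc = acc.add(FloatVector.fromMemorySegment(SPECIES, seg, o, ByteOrder.LITTLE_ENDIAN));
        }
        bh.consume(acc.reduceLanes(VectorOperators.ADD));
    }
}
```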

jatin-bhateja (Author) commented Oct 23, 2023

Thanks for the link @tjake, I will take a look and get back.

jatin-bhateja (Author) commented Nov 2, 2023

Hi @tjake,

I ran the SimilarityBench JMH microbenchmark included with PR #90, with the following modifications, on an Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Ice Lake server).

IPROMPT>diff src/main/java/org/sample/SimilarityBench.java /home/jatinbha/sandboxes/jvector/jvector-native/src/test/java/microbench/SimilarityBench.java 
28d27
< import java.lang.foreign.Arena;
39,40c38,39
<     private static final int SIZE = 1024;
<     private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_512;
---
>     private static final int SIZE = 2;
>     private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_64;
43,44c42,43
<         MemorySegment fm0 = Arena.global().allocate(SIZE * Float.BYTES, 64);
<         MemorySegment fm1 = Arena.global().allocate(SIZE * Float.BYTES, 64);
---
>         MemorySegment fm0 = MemorySegment.ofBuffer(ByteBuffer.allocateDirect(SIZE * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN));
>         MemorySegment fm1 = MemorySegment.ofBuffer(ByteBuffer.allocateDirect(SIZE * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN));
85c84
<         for (int i = 0; i < SIZE * Float.BYTES; i += SPECIES.length() * Float.BYTES) {
---
>         for (int i = 0; i < SIZE; i += SPECIES.length()) {
109d107
< 


The following are the results, along with the relevant PMU counters.

[screenshots: JMH throughput results and PMU counters for the array-backed and MemorySegment-backed runs]

Benchmarking was done on the unmodified jvector-80-native-vectors branch after some minor build fixes. As can be seen, with array-based backing storage around 78% of vector loads are split across cache lines. The split penalty improves significantly with memory segments: split loads are almost negligible compared to the total number of loads, and there is around a 15% improvement in throughput.

I will spend more time analyzing NativeVectorizationProvider.

Best Regards,
Jatin

tjake (Collaborator) commented Nov 2, 2023

Hey @jatin-bhateja thanks for taking a look!

So looks like the ValueLayout isn't aligned and allocateDirect is? Am I reading it right?

jatin-bhateja (Author) commented:

> So looks like the ValueLayout isn't aligned and allocateDirect is? Am I reading it right?

Yes, JDK 21 introduced a new API, Arena.allocate, to allocate aligned memory segments.
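
For reference, a small sketch (class name made up; assumes JDK 21+) that contrasts the two allocation paths and prints how far each segment's base address is from a 64-byte boundary:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;

public class AlignmentProbe {
    public static void main(String[] args) {
        long bytes = 1024L * Float.BYTES;

        // Explicitly requests 64-byte alignment via the byteAlignment argument.
        MemorySegment arenaSeg = Arena.global().allocate(bytes, 64);

        // No 64-byte alignment is requested here, so the base address may or
        // may not land on a cache-line boundary.
        MemorySegment directSeg = MemorySegment.ofBuffer(ByteBuffer.allocateDirect((int) bytes));

        System.out.println("Arena.allocate offset within cache line: " + arenaSeg.address() % 64);
        System.out.println("allocateDirect offset within cache line: " + directSeg.address() % 64);
    }
}
```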

tjake (Collaborator) commented Nov 2, 2023

Hi @jatin-bhateja, I was able to reproduce the split-load drop with aligned memory, but I don't see a 15% bump; I only see a ~5% improvement over arrays. Any idea why?

Also, since these are mostly vector embeddings, 1024 is pretty large. When I run with 128 floats I see a 2% loss over arrays. With 1536 (the OpenAI embedding size) I see an 11% improvement.

jatin-bhateja (Author) commented Nov 3, 2023

> When I run with 128 floats I see a 2% loss over arrays

Hi @tjake,
It all depends on the percentage of cycles spent in the vector computation loops compared to the rest of the application; a speedup over a small number of iterations may not impact overall throughput significantly (e.g., with 128 floats and 16-lane 512-bit vectors the loop body runs only 8 iterations). In addition, fromMemorySegment takes more instructions to execute than the fromArray API. Your observations are in line with this.

I will study your implementation in detail and am happy to contribute.

Best Regards
