Use Memory Segment API for aligned vector loads. #132

Open
jatin-bhateja opened this issue Oct 20, 2023 · 8 comments

jatin-bhateja commented Oct 20, 2023

Hi All,

Most of the vectorized code in SimdOps.java uses the fromArray API to load array contents into vectors.

Starting with JDK 20, the Vector API added support for loading and storing vectors from MemorySegments.

Using the fromMemorySegment / intoMemorySegment APIs one can ensure aligned vector loads / stores. Since most of the code uses SPECIES_PREFERRED, the vector size (64 bytes) matches the cache-line size on x86 AVX-512 targets.

Thus, if the first vector load in the vector loop happens from an address that is not a multiple of the cache-line / vector size, every successive vector load will span two cache lines, which can carry a significant performance penalty.
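
For illustration, a minimal sketch (not the jvector code itself) of an aligned, MemorySegment-backed dot-product loop; it assumes JDK 21+ with --add-modules jdk.incubator.vector (and --enable-preview for the FFM API on 21), and the class / method names are made up:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteOrder;

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class AlignedDotSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Dot product over two MemorySegments. floatCount is assumed to be a
    // multiple of SPECIES.length(), so there is no scalar tail loop.
    static float dot(MemorySegment a, MemorySegment b, int floatCount) {
        FloatVector acc = FloatVector.zero(SPECIES);
        long byteLength = (long) floatCount * Float.BYTES;
        long step = (long) SPECIES.length() * Float.BYTES;
        for (long offset = 0; offset < byteLength; offset += step) {
            // Offsets are in bytes; because the segments are 64-byte aligned,
            // no load straddles a cache line on an AVX-512 target.
            FloatVector va = FloatVector.fromMemorySegment(SPECIES, a, offset, ByteOrder.LITTLE_ENDIAN);
            FloatVector vb = FloatVector.fromMemorySegment(SPECIES, b, offset, ByteOrder.LITTLE_ENDIAN);
            acc = va.fma(vb, acc);
        }
        return acc.reduceLanes(VectorOperators.ADD);
    }

    public static void main(String[] args) {
        int floatCount = 1024;
        // Arena.allocate(byteSize, byteAlignment): request 64-byte alignment so
        // each segment starts on a cache-line boundary.
        MemorySegment a = Arena.global().allocate((long) floatCount * Float.BYTES, 64);
        MemorySegment b = Arena.global().allocate((long) floatCount * Float.BYTES, 64);
        System.out.println(dot(a, b, floatCount));
    }
}
```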

The following PMU events can be used to count the number of split loads against the total number of memory loads.

[screenshot: PMU events for counting split loads vs. total memory loads]

Best Regards,
Jatin

jbellis (Owner) commented Oct 20, 2023

Thanks, Jatin!

@tjake did you test MemorySegment vectors instead of float[], or am I thinking of something else?

tjake (Collaborator) commented Oct 22, 2023

I did. @jatin-bhateja, look at #90; you can run the JMH benchmark yourself.
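
For reference, a skeleton of what such a comparison might look like with JMH; the real benchmark lives in PR #90, and the names, sizes, and settings below are illustrative only (JDK 21+ with --add-modules jdk.incubator.vector assumed):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteOrder;
import java.util.concurrent.TimeUnit;

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class ArrayVsSegmentSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
    static final int SIZE = 1024;

    float[] arr = new float[SIZE];
    // 64-byte-aligned backing storage for the MemorySegment variant.
    MemorySegment seg = Arena.global().allocate((long) SIZE * Float.BYTES, 64);

    @Benchmark
    public void fromArray(Blackhole bh) {
        FloatVector acc = FloatVector.zero(SPECIES);
        for (int i = 0; i < SIZE; i += SPECIES.length()) {
            acc = acc.add(FloatVector.fromArray(SPECIES, arr, i));
        }
        bh.consume(acc.reduceLanes(VectorOperators.ADD));
    }

    @Benchmark
    public void fromMemorySegment(Blackhole bh) {
        FloatVector acc = FloatVector.zero(SPECIES);
        for (long o = 0; o < (long) SIZE * Float.BYTES; o += SPECIES.length() * Float.BYTES) {
            acc = acc.add(FloatVector.fromMemorySegment(SPECIES, seg, o, ByteOrder.LITTLE_ENDIAN));
        }
        bh.consume(acc.reduceLanes(VectorOperators.ADD));
    }
}
```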

jatin-bhateja (Author) commented Oct 23, 2023

Thanks for the link @tjake, I will take a look and get back.

jatin-bhateja (Author) commented Nov 2, 2023

Hi @tjake,

I ran the SimilarityBench JMH microbenchmark included with PR #90, with the following modifications, on an Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Ice Lake server).

IPROMPT>diff src/main/java/org/sample/SimilarityBench.java /home/jatinbha/sandboxes/jvector/jvector-native/src/test/java/microbench/SimilarityBench.java 
28d27
< import java.lang.foreign.Arena;
39,40c38,39
<     private static final int SIZE = 1024;
<     private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_512;
---
>     private static final int SIZE = 2;
>     private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_64;
43,44c42,43
<         MemorySegment fm0 = Arena.global().allocate(SIZE * Float.BYTES, 64);
<         MemorySegment fm1 = Arena.global().allocate(SIZE * Float.BYTES, 64);
---
>         MemorySegment fm0 = MemorySegment.ofBuffer(ByteBuffer.allocateDirect(SIZE * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN));
>         MemorySegment fm1 = MemorySegment.ofBuffer(ByteBuffer.allocateDirect(SIZE * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN));
85c84
<         for (int i = 0; i < SIZE * Float.BYTES; i += SPECIES.length() * Float.BYTES) {
---
>         for (int i = 0; i < SIZE; i += SPECIES.length()) {
109d107
< 


The following are the results, along with the relevant PMU counters.

[screenshots: JMH throughput results and PMU counters for the array-backed and MemorySegment-backed runs]

Benchmarking was done on the unmodified jvector-80-native-vectors branch after some minor build fixes. As can be seen, with array-based backing storage around 78% of vector loads are split across cache lines. The split penalty improves significantly with memory segments: split loads are almost negligible compared to the total number of loads, and there is around a 15% improvement in throughput.

I will spend more time analyzing NativeVectorizationProvider.

Best Regards,
Jatin

tjake (Collaborator) commented Nov 2, 2023

Hey @jatin-bhateja thanks for taking a look!

So looks like the ValueLayout isn't aligned and allocateDirect is? Am I reading it right?

jatin-bhateja (Author) commented:

> So looks like the ValueLayout isn't aligned and allocateDirect is? Am I reading it right?

Yes, JDK 21 introduced a new API, Arena.allocate, to allocate aligned memory segments.
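
For reference, a small sketch (class name made up; assumes JDK 21+) that contrasts the two allocation paths and prints how far each segment's base address is from a 64-byte boundary:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;

public class AlignmentProbe {
    public static void main(String[] args) {
        long bytes = 1024L * Float.BYTES;

        // Explicitly requests 64-byte alignment via the byteAlignment argument.
        MemorySegment arenaSeg = Arena.global().allocate(bytes, 64);

        // No 64-byte alignment is requested here, so the base address may or
        // may not land on a cache-line boundary.
        MemorySegment directSeg = MemorySegment.ofBuffer(ByteBuffer.allocateDirect((int) bytes));

        System.out.println("Arena.allocate offset within cache line: " + arenaSeg.address() % 64);
        System.out.println("allocateDirect offset within cache line: " + directSeg.address() % 64);
    }
}
```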

tjake (Collaborator) commented Nov 2, 2023

Hi @jatin-bhateja, I was able to reproduce the split-load drop with aligned memory, but I don't see a 15% bump; I only see a ~5% improvement over arrays. Any idea why?

Also, since these are mostly vector embeddings, 1024 is pretty large. When I run with 128 floats I see a 2% loss over arrays. With 1536 (the OpenAI embedding size) I see an 11% improvement.

jatin-bhateja (Author) commented Nov 3, 2023

> When I run with 128 floats I see a 2% loss over arrays

Hi @tjake,
It all depends on the percentage of cycles spent in the vector computation loops compared to the rest of the application; a speedup over a small number of iterations may not impact overall throughput significantly (e.g., with 128 floats and 16-lane 512-bit vectors the loop body runs only 8 iterations). In addition, fromMemorySegment takes more instructions to execute than the fromArray API. Your observations are in line with this.

I will study your implementation in detail and am happy to contribute.

Best Regards
