I am working on optimizing a Single Producer Single Consumer (SPSC) queue in C++. The following is a minimal reproducible example of my implementation:

#include <atomic>
#include <cstddef>
#include <iostream>
#include <thread>

struct Queue {
    alignas(64) std::atomic<size_t> write_index; // Align to cache line size
    alignas(64) std::atomic<size_t> read_index;  // Align to cache line size
    int* data;
    size_t capacity;

    Queue(size_t cap) : write_index(0), read_index(0), capacity(cap + 1) {
        data = new int[capacity];
    }

    ~Queue() {
        delete[] data;
    }

    bool push(int val) {
        size_t write_index_local = write_index.load(std::memory_order_relaxed);
        size_t next_write_index = (write_index_local + 1) % capacity;

        if (next_write_index == read_index.load(std::memory_order_acquire)) {
            // Queue is full
            return false;
        }

        data[write_index_local] = val;
        write_index.store(next_write_index, std::memory_order_release);
        return true;
    }

    bool pop(int& val) {

        const size_t write_index_local = write_index.load(std::memory_order_acquire);
        const size_t read_index_local = read_index.load(std::memory_order_relaxed);

        if (read_index_local == write_index_local) {
            // Queue is empty
            return false;
        }


        val = data[read_index_local];
        size_t next_read_index = (read_index_local + 1) % capacity;
        read_index.store(next_read_index, std::memory_order_release);

        return true;
    }
};

void producer(Queue& queue) {
    for (int i = 0; i < 20; ++i) {
        while (!queue.push(i)) {
            // Busy-wait until there is space in the queue
        }
        std::cout << "Pushed: " << i << std::endl;
    }
}

void consumer(Queue& queue) {
    for (int i = 0; i < 20; ++i) {
        int val;
        while (!queue.pop(val)) {
            // Busy-wait until there is an item to pop
        }
        std::cout << "Popped: " << val << std::endl;
    }
}

int main() {
    Queue queue(10); // Create a queue with capacity 10 (internally 11)

    std::thread producer_thread(producer, std::ref(queue));
    std::thread consumer_thread(consumer, std::ref(queue));

    producer_thread.join();
    consumer_thread.join();

    return 0;
}

I'm observing two cache misses in the pop function:

1. When fetching write_index.
2. When fetching the data at read_index.

I want to optimize this to reduce the number of cache misses. Whenever write_index changes, pop has to fetch both write_index and the data. Ideally, I would like to fetch write_index and the data at read_index at the same time, so that the data load does not miss.

I thought about using prefetching techniques, but I'm not sure about the best approach for this scenario.
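For illustration, this is the kind of prefetching I have in mind (a sketch only, not benchmarked; it uses the GCC/Clang intrinsic __builtin_prefetch, and the struct name PrefetchQueue is mine). The idea is to issue a prefetch for the data slot before the acquire load of write_index, so the two fetches can overlap:

```cpp
#include <atomic>
#include <cstddef>

// Sketch: same queue as above, but pop() requests the data slot's cache
// line before the (potentially missing) acquire load of write_index.
struct PrefetchQueue {
    alignas(64) std::atomic<size_t> write_index{0};
    alignas(64) std::atomic<size_t> read_index{0};
    int* data;
    size_t capacity;

    explicit PrefetchQueue(size_t cap) : capacity(cap + 1) {
        data = new int[capacity];
    }
    ~PrefetchQueue() { delete[] data; }

    bool push(int val) {
        size_t w = write_index.load(std::memory_order_relaxed);
        size_t next = (w + 1) % capacity;
        if (next == read_index.load(std::memory_order_acquire)) return false;
        data[w] = val;
        write_index.store(next, std::memory_order_release);
        return true;
    }

    bool pop(int& val) {
        const size_t r = read_index.load(std::memory_order_relaxed);
        // Start pulling in the slot now so it can overlap with the acquire
        // load below. A prefetch is only a hint and is safe even if the
        // queue turns out to be empty, since data[r] is always valid memory.
        __builtin_prefetch(&data[r], /*rw=*/0, /*locality=*/3);
        const size_t w = write_index.load(std::memory_order_acquire);
        if (r == w) return false;
        val = data[r];
        read_index.store((r + 1) % capacity, std::memory_order_release);
        return true;
    }
};
```

My worry is that if the producer writes the slot after the prefetch, coherence invalidates the prefetched line anyway and the hint buys nothing, which is part of why I'm unsure this is the right approach.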

Is it possible to fetch both write_index and data at read_index together? Are there any specific data layout strategies or memory order considerations that can help in minimizing cache misses in this context?
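For reference, one layout idea I have been toying with (again a sketch, not benchmarked; the member write_index_cached and the struct name are my own) is keeping a consumer-local cached copy of write_index, so pop only touches the producer's cache line when the cached value says the queue looks empty:

```cpp
#include <atomic>
#include <cstddef>

// Sketch: the consumer caches the last-seen write_index. The acquire load
// of write_index (and the cache-line transfer it implies) happens only
// when the cached value indicates the queue is empty, not on every pop().
// A stale cache is safe: it can only under-report how many items exist.
struct CachedIndexQueue {
    alignas(64) std::atomic<size_t> write_index{0};
    alignas(64) std::atomic<size_t> read_index{0};
    alignas(64) size_t write_index_cached = 0; // consumer-private copy
    int* data;
    size_t capacity;

    explicit CachedIndexQueue(size_t cap) : capacity(cap + 1) {
        data = new int[capacity];
    }
    ~CachedIndexQueue() { delete[] data; }

    bool push(int val) {
        size_t w = write_index.load(std::memory_order_relaxed);
        size_t next = (w + 1) % capacity;
        if (next == read_index.load(std::memory_order_acquire)) return false;
        data[w] = val;
        write_index.store(next, std::memory_order_release);
        return true;
    }

    bool pop(int& val) {
        const size_t r = read_index.load(std::memory_order_relaxed);
        if (r == write_index_cached) {
            // Refresh the cached copy only when we appear empty.
            write_index_cached = write_index.load(std::memory_order_acquire);
            if (r == write_index_cached) return false;
        }
        val = data[r];
        read_index.store((r + 1) % capacity, std::memory_order_release);
        return true;
    }
};
```

The same trick could be mirrored on the producer side with a cached read_index, but I don't know whether this actually addresses the data-slot miss or just the index miss.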

  • "I'm observing two cache misses": how are you observing this? What is your expected use pattern? data[read_index_local] is going to be in cache if you're repeatedly calling pop; why do you think that's not the case? Have you run benchmarks?
    – Passer By
    Commented Jul 7 at 10:04
  • "I'm observing two cache misses in pop function" I am wondering if this is actually true. I cannot reproduce it, even by tweaking your program (e.g. removing issues in it) or simulating caches with tools like cachegrind. Did you mean "expecting" instead of "observing"? Commented Jul 7 at 11:43
  • In practice, I think your program runs for too short a time (<0.1 ms on my machine) to get any useful profiling information. One thread can run far ahead of the other, and the queue can be filled before the other thread has even started (this seems to be what often happens on my machine). The spin lock hammers the cache line when it is not necessary, causing performance issues in some cases. Commented Jul 7 at 11:44
  • Side note: caches have limited size. Eliminating all cache misses is in general impossible. Sure, you can try to minimise them, but you will never get rid of them entirely. Commented Jul 7 at 17:07
