Strange performance problems when going below 6 cores per dual port #1116

Open
Civil opened this issue Mar 14, 2024 · 0 comments

Civil commented Mar 14, 2024

I'm currently trying to benchmark NICs on a system that has a relatively low core count (16 cores, 32 threads), and I'm running into a strange scalability problem when I allocate fewer than 6 cores per dual port.

With 6 cores per dual port on 5 NICs I can easily get about 100 Mpps TX per NIC (and 65-85 Mpps RX, depending on the NIC, which is in line with what )

This is performance on ConnectX-6 with 6 cores per dual-port:
[Screenshot: ConnectX-6 TX/RX rates with 6 cores per dual port]

vtune hotspots:

Elapsed Time: 68.307s
    CPU Time: 341.346s
        Effective Time: 341.346s
        Spin Time: 0s
        Overhead Time: 0s
    Total Thread Count: 17
    Paused Time: 0s

Top Hotspots
Function                                                                                                  Module         CPU Time  % of CPU Time(%)
--------------------------------------------------------------------------------------------------------  -------------  --------  ----------------
rte_rdtsc                                                                                                 _t-rex-64      132.300s             38.8%
std::priority_queue<CGenNode*, std::vector<CGenNode*, std::allocator<CGenNode*>>, CGenNodeCompare>::push  _t-rex-64       38.642s             11.3%
mlx5_tx_burst_empw_inline                                                                                 libmlx5-64.so   19.090s              5.6%
mlx5_tx_cseg_init                                                                                         libmlx5-64.so   15.152s              4.4%
CNodeGenerator::handle_stl_node                                                                           _t-rex-64       11.840s              3.5%
[Others]                                                                                                  N/A            124.322s             36.4%
Effective Physical Core Utilization: 31.6% (5.049 out of 16)
 | The metric value is low, which may signal a poor physical CPU cores
 | utilization caused by:
 |     - load imbalance
 |     - threading runtime overhead
 |     - contended synchronization
 |     - thread/process underutilization
 |     - incorrect affinity that utilizes logical cores instead of physical
 |       cores
 | Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism
 | or run the Locks and Waits analysis to identify parallel bottlenecks for
 | other parallel runtimes.
 |
    Effective Logical Core Utilization: 15.8% (5.061 out of 32)
     | The metric value is low, which may signal a poor logical CPU cores
     | utilization. Consider improving physical core utilization as the first
     | step and then look at opportunities to utilize logical cores, which in
     | some cases can improve processor throughput and overall performance of
     | multi-threaded applications.
     |
Collection and Platform Info
    Application Command Line: ./_t-rex-64 "-i" "-c" "6" "--cfg" "/etc/trex_single.yaml" "--mlx5-so"
    Operating System: 6.5.0-0.deb12.4-amd64 12.5
    Computer Name: spr-testbench
    Result Size: 13.2 MB
    Collection start time: 22:11:36 14/03/2024 UTC
    Collection stop time: 22:12:46 14/03/2024 UTC
    Collector Type: Event-based counting driver,User-mode sampling and tracing
    CPU
        Name: Intel(R) Xeon(R) Processor code named Sapphirerapids
        Frequency: 3.096 GHz
        Logical CPU Count: 32
        LLC size: 47.2 MB
        Cache Allocation Technology
            Level 2 capability: available
            Level 3 capability: available

Some information from vtune performance-snapshot:

Elapsed Time: 45.157s
    IPC: 2.652
    SP GFLOPS: 0.000
    DP GFLOPS: 0.562
    Average CPU Frequency: 4.918 GHz
Logical Core Utilization: 16.4% (5.255 out of 32)
    Physical Core Utilization: 32.8% (5.241 out of 16)
Microarchitecture Usage: 43.4% of Pipeline Slots
    Retiring: 43.4% of Pipeline Slots
        Light Operations: 38.7% of Pipeline Slots
        Heavy Operations: 4.7% of Pipeline Slots
    Front-End Bound: 2.2% of Pipeline Slots
        Front-End Latency: 0.6% of Pipeline Slots
        Front-End Bandwidth: 1.6% of Pipeline Slots
    Bad Speculation: 1.2% of Pipeline Slots
        Branch Mispredict: 0.8% of Pipeline Slots
        Machine Clears: 0.4% of Pipeline Slots
    Back-End Bound: 53.2% of Pipeline Slots
        Memory Bound: 14.2% of Pipeline Slots
            L1 Bound: 2.6% of Clockticks
            L2 Bound: 0.0% of Clockticks
            L3 Bound: 1.0% of Clockticks
                L3 Latency: 0.2% of Clockticks
            DRAM Bound: 0.0% of Clockticks
                Memory Bandwidth: 0.1% of Clockticks
                Memory Latency: 9.2% of Clockticks
                    Local DRAM: 0.0% of Clockticks
                    Remote DRAM: 0.0% of Clockticks
                    Remote Cache: 0.0% of Clockticks
        Core Bound: 39.0% of Pipeline Slots
Memory Bound: 14.2% of Pipeline Slots
    Cache Bound: 3.7% of Clockticks
    DRAM Bound: 0.0% of Clockticks
        DRAM Bandwidth Bound: 0.0% of Elapsed Time
    NUMA: % of Remote Accesses: 0.0%
Vectorization: 0.0% of Packed FP Operations
    Instruction Mix
        HP FLOPs: 0.0% of uOps
            Packed: 0.0%
                128-bit: 0.0%
                256-bit: 0.0%
                512-bit: 0.0%
            Scalar: 0.0%
        SP FLOPs: 0.0% of uOps
            Packed: 11.4% from SP FP
                128-bit: 11.0% from SP FP
                256-bit: 0.4% from SP FP
                512-bit: 0.0% from SP FP
            Scalar: 88.6% from SP FP
        DP FLOPs: 0.8% of uOps
            Packed: 0.0% from DP FP
                128-bit: 0.0% from DP FP
                256-bit: 0.0% from DP FP
                512-bit: 0.0% from DP FP
            Scalar: 100.0% from DP FP
        AMX BF16 FLOPs: 0.0% of uOps
        x87 FLOPs: 0.0% of uOps
        Non-FP: 99.2% of uOps
    FP Arith/Mem Rd Instr. Ratio: 0.031
    FP Arith/Mem Wr Instr. Ratio: 0.069

PCIe Bandwidth: 13.345 GB/s
PCI Device Class    PCIe Bandwidth, GB/s
------------------  --------------------
Network controller                13.345
Bridge                             0.000
[Unknown]                          0.000

And here is the result for 5 cores per dual port:
[Screenshot: TX/RX rates with 5 cores per dual port]

Here the drop in TX from 100 Mpps per port to 25 Mpps is clearly visible (a quarter of the original rate for a difference of only one core; per-core TX falls from roughly 17 Mpps with 6 cores to 5 Mpps with 5 cores), and performance is WAY less stable (it can briefly go up to 40 Mpps but then drops again).

From vtune hotspots:

Elapsed Time: 82.541s
    CPU Time: 381.038s
        Effective Time: 381.038s
        Spin Time: 0s
        Overhead Time: 0s
    Total Thread Count: 16
    Paused Time: 0s

Top Hotspots
Function                                                                                                  Module         CPU Time  % of CPU Time(%)
--------------------------------------------------------------------------------------------------------  -------------  --------  ----------------
rte_rdtsc                                                                                                 _t-rex-64      295.096s             77.4%
std::priority_queue<CGenNode*, std::vector<CGenNode*, std::allocator<CGenNode*>>, CGenNodeCompare>::push  _t-rex-64       16.046s              4.2%
rte_delay_us_block                                                                                        _t-rex-64        6.892s              1.8%
rte_pause                                                                                                 _t-rex-64        5.934s              1.6%
check_cqe                                                                                                 libmlx5-64.so    4.816s              1.3%
[Others]                                                                                                  N/A             52.254s             13.7%
Effective Physical Core Utilization: 29.1% (4.657 out of 16)
 | The metric value is low, which may signal a poor physical CPU cores
 | utilization caused by:
 |     - load imbalance
 |     - threading runtime overhead
 |     - contended synchronization
 |     - thread/process underutilization
 |     - incorrect affinity that utilizes logical cores instead of physical
 |       cores
 | Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism
 | or run the Locks and Waits analysis to identify parallel bottlenecks for
 | other parallel runtimes.
 |
    Effective Logical Core Utilization: 14.6% (4.668 out of 32)
     | The metric value is low, which may signal a poor logical CPU cores
     | utilization. Consider improving physical core utilization as the first
     | step and then look at opportunities to utilize logical cores, which in
     | some cases can improve processor throughput and overall performance of
     | multi-threaded applications.
     |
Collection and Platform Info
    Application Command Line: ./_t-rex-64 "-i" "-c" "5" "--cfg" "/etc/trex_single.yaml" "--mlx5-so"
    Operating System: 6.5.0-0.deb12.4-amd64 12.5
    Computer Name: spr-testbench
    Result Size: 13.7 MB
    Collection start time: 22:16:15 14/03/2024 UTC
    Collection stop time: 22:17:39 14/03/2024 UTC
    Collector Type: Event-based counting driver,User-mode sampling and tracing
    CPU
        Name: Intel(R) Xeon(R) Processor code named Sapphirerapids
        Frequency: 3.096 GHz
        Logical CPU Count: 32
        LLC size: 47.2 MB
        Cache Allocation Technology
            Level 2 capability: available
            Level 3 capability: available

The composition is completely different: most of the time is now spent just reading the timestamp counter (rte_rdtsc).
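
For context, here is a minimal sketch (not TRex code; the wait_until helper is hypothetical) of what an rte_rdtsc-style TSC read plus a busy-wait on it boils down to on x86-64. If a core mostly spins waiting for the next scheduled send time instead of streaming bursts back-to-back, nearly all profiler samples land in the TSC read itself, which would match the rte_rdtsc-dominated hotspot list above and the low IPC in the snapshot below.

```cpp
// Minimal sketch (not TRex code): an rte_rdtsc-style TSC read and a
// hypothetical busy-wait on it, to illustrate the pattern the profile
// suggests. Every iteration of the wait loop is just another TSC read.
#include <cstdint>

static inline uint64_t read_tsc() {
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));   // raw time-stamp counter
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

// Hypothetical helper: spin until the TSC reaches a target tick.
static inline void wait_until(uint64_t target_tsc) {
    while (read_tsc() < target_tsc) {
        // nothing useful happens here; CPU time accumulates in read_tsc
    }
}
```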

And the vtune performance snapshot:

Elapsed Time: 59.648s
    IPC: 0.894
    SP GFLOPS: 0.000
    DP GFLOPS: 0.163
    Average CPU Frequency: 4.921 GHz
Logical Core Utilization: 13.4% (4.277 out of 32)
    Physical Core Utilization: 26.6% (4.263 out of 16)
Microarchitecture Usage: 19.9% of Pipeline Slots
    Retiring: 19.9% of Pipeline Slots
        Light Operations: 13.0% of Pipeline Slots
        Heavy Operations: 6.8% of Pipeline Slots
    Front-End Bound: 2.2% of Pipeline Slots
        Front-End Latency: 0.6% of Pipeline Slots
        Front-End Bandwidth: 1.6% of Pipeline Slots
    Bad Speculation: 0.5% of Pipeline Slots
        Branch Mispredict: 0.4% of Pipeline Slots
        Machine Clears: 0.0% of Pipeline Slots
    Back-End Bound: 77.4% of Pipeline Slots
        Memory Bound: 5.9% of Pipeline Slots
            L1 Bound: 1.2% of Clockticks
            L2 Bound: 0.0% of Clockticks
            L3 Bound: 1.1% of Clockticks
                L3 Latency: 1.6% of Clockticks
            DRAM Bound: 0.0% of Clockticks
                Memory Bandwidth: 0.1% of Clockticks
                Memory Latency: 3.5% of Clockticks
                    Local DRAM: 0.0% of Clockticks
                    Remote DRAM: 0.0% of Clockticks
                    Remote Cache: 0.0% of Clockticks
        Core Bound: 71.5% of Pipeline Slots
Memory Bound: 5.9% of Pipeline Slots
    Cache Bound: 2.3% of Clockticks
    DRAM Bound: 0.0% of Clockticks
        DRAM Bandwidth Bound: 0.0% of Elapsed Time
    NUMA: % of Remote Accesses: 0.0%
Vectorization: 0.0% of Packed FP Operations
    Instruction Mix
        HP FLOPs: 0.0% of uOps
            Packed: 0.0%
                128-bit: 0.0%
                256-bit: 0.0%
                512-bit: 0.0%
            Scalar: 0.0%
        SP FLOPs: 0.0% of uOps
            Packed: 24.7% from SP FP
                128-bit: 24.7% from SP FP
                256-bit: 0.0% from SP FP
                512-bit: 0.0% from SP FP
            Scalar: 75.3% from SP FP
        DP FLOPs: 0.7% of uOps
            Packed: 0.0% from DP FP
                128-bit: 0.0% from DP FP
                256-bit: 0.0% from DP FP
                512-bit: 0.0% from DP FP
            Scalar: 100.0% from DP FP
        AMX BF16 FLOPs: 0.0% of uOps
        x87 FLOPs: 0.0% of uOps
        Non-FP: 99.3% of uOps
    FP Arith/Mem Rd Instr. Ratio: 0.031
    FP Arith/Mem Wr Instr. Ratio: 0.072

PCIe Bandwidth: 4.280 GB/s
PCI Device Class    PCIe Bandwidth, GB/s
------------------  --------------------
Network controller                 4.280
Bridge                             0.000
[Unknown]                          0.000

That looks like a COMPLETELY different workload, one that should be far less stressful for the cores (IPC < 1 compared to 2.6 before), which points to a problem either in the scheduling logic within trex or somewhere within DPDK.
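
For reference, the hotspot names above (std::priority_queue<CGenNode*, ...>::push, CNodeGenerator::handle_stl_node, rte_rdtsc) look like the classic time-ordered event-scheduler pattern. Below is an illustrative sketch of that general pattern, not the actual TRex implementation (all types and helpers here are made up), just to show where a scheduling problem would turn into TSC spinning:

```cpp
// Illustrative sketch of a time-ordered packet scheduler (NOT the actual
// TRex CNodeGenerator): nodes sit in a min-heap keyed by send time; the
// core pops the earliest node and spins on the TSC until it is due. If
// the schedule leaves gaps, most cycles go to the spin, not the TX path.
#include <cstdint>
#include <queue>
#include <vector>

struct GenNode {
    uint64_t send_time_tsc;   // when this packet/burst is due
};

struct NodeCompare {
    bool operator()(const GenNode* a, const GenNode* b) const {
        return a->send_time_tsc > b->send_time_tsc;   // min-heap on time
    }
};

static inline uint64_t now_tsc() {
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

static void transmit(GenNode*) { /* stand-in for the mlx5 TX burst */ }

void run_scheduler(std::priority_queue<GenNode*, std::vector<GenNode*>, NodeCompare>& q) {
    while (!q.empty()) {
        GenNode* node = q.top();
        while (now_tsc() < node->send_time_tsc) {
            // busy-wait until the node is due: this is where the
            // rte_rdtsc time accumulates when cores have idle gaps
        }
        q.pop();
        transmit(node);
        // a real generator would re-arm the node with its next send
        // time and push it back, which is where priority_queue::push
        // shows up in the hotspot list
    }
}
```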
