I'm currently trying to benchmark NICs on a system with a relatively low core count (16 cores, 32 threads), and I've run into a strange scalability problem when I allocate fewer than 6 cores per dual port.
With 6 cores per dual port on 5 NICs I can easily get about 100 Mpps TX per NIC (and 65-85 Mpps RX, depending on the NIC, which is in line with what )
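For reference, the worker cores per dual port come from the `dual_if` section of the TRex platform config passed via `--cfg`. A minimal sketch of what a config like `/etc/trex_single.yaml` could look like for one dual port with 6 cores (the PCI addresses and thread IDs below are hypothetical, not my actual mapping):

```yaml
- port_limit: 2
  version: 2
  interfaces: ["81:00.0", "81:00.1"]   # hypothetical PCI addresses of the dual port
  platform:
    master_thread_id: 0
    latency_thread_id: 1
    dual_if:
      - socket: 0
        threads: [2, 3, 4, 5, 6, 7]    # 6 worker cores shared by this port pair
```

Running with `-c 5` instead of `-c 6` just means each dual port uses one fewer of these worker threads.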
This is the performance on ConnectX-6 with 6 cores per dual port:

![CleanShot 2024-03-14 at 23 12 02@2x](https://cdn.statically.io/img/private-user-images.githubusercontent.com/844380/313009147-77d0a6b9-b8dd-455e-9b64-ad905c8071be.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE0MDM2NDQsIm5iZiI6MTcyMTQwMzM0NCwicGF0aCI6Ii84NDQzODAvMzEzMDA5MTQ3LTc3ZDBhNmI5LWI4ZGQtNDU1ZS05YjY0LWFkOTA1YzgwNzFiZS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzE5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxOVQxNTM1NDRaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT02MDA1Y2ZhZDgyMzY1YjA2ZmIxM2EzNDg1ZmY4NTY3NTVmZjVlZjBhMDk3NWYyYjM1Njk1ZTQ3N2YzNzMyZWMyJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.osTqjW9HzSoOk3a54bbXBV63n-ll8OqpxnZ7t3tPaOE)
vtune hotspots:

```
vtune: Executing actions 75 % Generating a report Elapsed Time: 68.307s
CPU Time: 341.346s
Effective Time: 341.346s
Spin Time: 0s
Overhead Time: 0s
Total Thread Count: 17
Paused Time: 0s
Top Hotspots
Function Module CPU Time % of CPU Time(%)
-------------------------------------------------------------------------------------------------------- ------------- -------- ----------------
rte_rdtsc _t-rex-64 132.300s 38.8%
std::priority_queue<CGenNode*, std::vector<CGenNode*, std::allocator<CGenNode*>>, CGenNodeCompare>::push _t-rex-64 38.642s 11.3%
mlx5_tx_burst_empw_inline libmlx5-64.so 19.090s 5.6%
mlx5_tx_cseg_init libmlx5-64.so 15.152s 4.4%
CNodeGenerator::handle_stl_node _t-rex-64 11.840s 3.5%
[Others] N/A 124.322s 36.4%
Effective Physical Core Utilization: 31.6% (5.049 out of 16)
| The metric value is low, which may signal a poor physical CPU cores
| utilization caused by:
| - load imbalance
| - threading runtime overhead
| - contended synchronization
| - thread/process underutilization
| - incorrect affinity that utilizes logical cores instead of physical
| cores
| Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism
| or run the Locks and Waits analysis to identify parallel bottlenecks for
| other parallel runtimes.
|
Effective Logical Core Utilization: 15.8% (5.061 out of 32)
| The metric value is low, which may signal a poor logical CPU cores
| utilization. Consider improving physical core utilization as the first
| step and then look at opportunities to utilize logical cores, which in
| some cases can improve processor throughput and overall performance of
| multi-threaded applications.
|
Collection and Platform Info
Application Command Line: ./_t-rex-64 "-i" "-c" "6" "--cfg" "/etc/trex_single.yaml" "--mlx5-so"
Operating System: 6.5.0-0.deb12.4-amd64 12.5
Computer Name: spr-testbench
Result Size: 13.2 MB
Collection start time: 22:11:36 14/03/2024 UTC
Collection stop time: 22:12:46 14/03/2024 UTC
Collector Type: Event-based counting driver,User-mode sampling and tracing
CPU
Name: Intel(R) Xeon(R) Processor code named Sapphirerapids
Frequency: 3.096 GHz
Logical CPU Count: 32
LLC size: 47.2 MB
Cache Allocation Technology
Level 2 capability: available
Level 3 capability: available
```
Some information from vtune performance-snapshot:

```
vtune: Executing actions 75 % Generating a report Elapsed Time: 45.157s
IPC: 2.652
SP GFLOPS: 0.000
DP GFLOPS: 0.562
Average CPU Frequency: 4.918 GHz
Logical Core Utilization: 16.4% (5.255 out of 32)
Physical Core Utilization: 32.8% (5.241 out of 16)
Microarchitecture Usage: 43.4% of Pipeline Slots
Retiring: 43.4% of Pipeline Slots
Light Operations: 38.7% of Pipeline Slots
Heavy Operations: 4.7% of Pipeline Slots
Front-End Bound: 2.2% of Pipeline Slots
Front-End Latency: 0.6% of Pipeline Slots
Front-End Bandwidth: 1.6% of Pipeline Slots
Bad Speculation: 1.2% of Pipeline Slots
Branch Mispredict: 0.8% of Pipeline Slots
Machine Clears: 0.4% of Pipeline Slots
Back-End Bound: 53.2% of Pipeline Slots
Memory Bound: 14.2% of Pipeline Slots
L1 Bound: 2.6% of Clockticks
L2 Bound: 0.0% of Clockticks
L3 Bound: 1.0% of Clockticks
L3 Latency: 0.2% of Clockticks
DRAM Bound: 0.0% of Clockticks
Memory Bandwidth: 0.1% of Clockticks
Memory Latency: 9.2% of Clockticks
Local DRAM: 0.0% of Clockticks
Remote DRAM: 0.0% of Clockticks
Remote Cache: 0.0% of Clockticks
Core Bound: 39.0% of Pipeline Slots
Memory Bound: 14.2% of Pipeline Slots
Cache Bound: 3.7% of Clockticks
DRAM Bound: 0.0% of Clockticks
DRAM Bandwidth Bound: 0.0% of Elapsed Time
NUMA: % of Remote Accesses: 0.0%
Vectorization: 0.0% of Packed FP Operations
Instruction Mix
HP FLOPs: 0.0% of uOps
Packed: 0.0%
128-bit: 0.0%
256-bit: 0.0%
512-bit: 0.0%
Scalar: 0.0%
SP FLOPs: 0.0% of uOps
Packed: 11.4% from SP FP
128-bit: 11.0% from SP FP
256-bit: 0.4% from SP FP
512-bit: 0.0% from SP FP
Scalar: 88.6% from SP FP
DP FLOPs: 0.8% of uOps
Packed: 0.0% from DP FP
128-bit: 0.0% from DP FP
256-bit: 0.0% from DP FP
512-bit: 0.0% from DP FP
Scalar: 100.0% from DP FP
AMX BF16 FLOPs: 0.0% of uOps
x87 FLOPs: 0.0% of uOps
Non-FP: 99.2% of uOps
FP Arith/Mem Rd Instr. Ratio: 0.031
FP Arith/Mem Wr Instr. Ratio: 0.069
PCIe Bandwidth: 13.345 GB/s
PCI Device Class PCIe Bandwidth, GB/s
------------------ --------------------
Network controller 13.345
Bridge 0.000
[Unknown] 0.000
```
And here is for 5 cores per dual port:

![CleanShot 2024-03-14 at 23 17 12@2x](https://cdn.statically.io/img/private-user-images.githubusercontent.com/844380/313010203-c53cee5d-eefd-493a-9adb-5f74e316db36.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE0MDM2NDQsIm5iZiI6MTcyMTQwMzM0NCwicGF0aCI6Ii84NDQzODAvMzEzMDEwMjAzLWM1M2NlZTVkLWVlZmQtNDkzYS05YWRiLTVmNzRlMzE2ZGIzNi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzE5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxOVQxNTM1NDRaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0zMDBiOWRmM2QzM2M1YjBhODYwNTNkMjBjOTYzN2ZlNGVmMmQwZWY0MDVhNmMxNjYzZTM0YWQyNTZhZmM3ZDQ3JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.oA2l-gLtuDhZqVgU77j_zIcvkaeNoYyKD7h1v4jKY94)

Here the drop in TX is clearly visible: from 100 Mpps per port down to 25 Mpps, a quarter of the original for a difference of only one core, and performance is WAY less stable (it can briefly go up to 40 Mpps but drops back down).

From vtune hotspots:

```
vtune: Executing actions 75 % Generating a report Elapsed Time: 82.541s
CPU Time: 381.038s
Effective Time: 381.038s
Spin Time: 0s
Overhead Time: 0s
Total Thread Count: 16
Paused Time: 0s
Top Hotspots
Function Module CPU Time % of CPU Time(%)
-------------------------------------------------------------------------------------------------------- ------------- -------- ----------------
rte_rdtsc _t-rex-64 295.096s 77.4%
std::priority_queue<CGenNode*, std::vector<CGenNode*, std::allocator<CGenNode*>>, CGenNodeCompare>::push _t-rex-64 16.046s 4.2%
rte_delay_us_block _t-rex-64 6.892s 1.8%
rte_pause _t-rex-64 5.934s 1.6%
check_cqe libmlx5-64.so 4.816s 1.3%
[Others] N/A 52.254s 13.7%
Effective Physical Core Utilization: 29.1% (4.657 out of 16)
| The metric value is low, which may signal a poor physical CPU cores
| utilization caused by:
| - load imbalance
| - threading runtime overhead
| - contended synchronization
| - thread/process underutilization
| - incorrect affinity that utilizes logical cores instead of physical
| cores
| Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism
| or run the Locks and Waits analysis to identify parallel bottlenecks for
| other parallel runtimes.
|
Effective Logical Core Utilization: 14.6% (4.668 out of 32)
| The metric value is low, which may signal a poor logical CPU cores
| utilization. Consider improving physical core utilization as the first
| step and then look at opportunities to utilize logical cores, which in
| some cases can improve processor throughput and overall performance of
| multi-threaded applications.
|
Collection and Platform Info
Application Command Line: ./_t-rex-64 "-i" "-c" "5" "--cfg" "/etc/trex_single.yaml" "--mlx5-so"
Operating System: 6.5.0-0.deb12.4-amd64 12.5
Computer Name: spr-testbench
Result Size: 13.7 MB
Collection start time: 22:16:15 14/03/2024 UTC
Collection stop time: 22:17:39 14/03/2024 UTC
Collector Type: Event-based counting driver,User-mode sampling and tracing
CPU
Name: Intel(R) Xeon(R) Processor code named Sapphirerapids
Frequency: 3.096 GHz
Logical CPU Count: 32
LLC size: 47.2 MB
Cache Allocation Technology
Level 2 capability: available
Level 3 capability: available
```
The composition is completely different, and most of the time is spent just getting the time.
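For context on why `rte_rdtsc` dominates: TRex's scheduler (the `CNodeGenerator` / `CGenNode` priority queue visible in the hotspots) paces transmission by busy-waiting on the TSC until the next node's scheduled send time. A simplified sketch of that pattern, assuming absolute TSC deadlines and not the actual TRex code, shows how a stalled TX path turns directly into `rte_rdtsc` samples:

```cpp
#include <cstddef>
#include <cstdint>
#include <x86intrin.h>  // __rdtsc(), the instruction rte_rdtsc() wraps

// Simplified pacing loop over a schedule of absolute TSC deadlines.
// When TX keeps up, cycles go into building and sending packets; when
// the TX path stalls (queue backpressure, slow completions), the loop
// spends its time reading the TSC, which is exactly what shows up as
// the rte_rdtsc hotspot in the 5-core profile.
void pacing_loop(const uint64_t* deadline_tsc, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        while (__rdtsc() < deadline_tsc[i]) {
            // busy-wait until this node's scheduled send time
        }
        // tx_one_packet(i);  // hypothetical send step
    }
}
```

This would also explain the low IPC below: a tight TSC-read loop retires very few instructions per cycle compared to the packet-building code of the 6-core run.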
And the vtune performance snapshot:

```
vtune: Executing actions 75 % Generating a report Elapsed Time: 59.648s
IPC: 0.894
SP GFLOPS: 0.000
DP GFLOPS: 0.163
Average CPU Frequency: 4.921 GHz
Logical Core Utilization: 13.4% (4.277 out of 32)
Physical Core Utilization: 26.6% (4.263 out of 16)
Microarchitecture Usage: 19.9% of Pipeline Slots
Retiring: 19.9% of Pipeline Slots
Light Operations: 13.0% of Pipeline Slots
Heavy Operations: 6.8% of Pipeline Slots
Front-End Bound: 2.2% of Pipeline Slots
Front-End Latency: 0.6% of Pipeline Slots
Front-End Bandwidth: 1.6% of Pipeline Slots
Bad Speculation: 0.5% of Pipeline Slots
Branch Mispredict: 0.4% of Pipeline Slots
Machine Clears: 0.0% of Pipeline Slots
Back-End Bound: 77.4% of Pipeline Slots
Memory Bound: 5.9% of Pipeline Slots
L1 Bound: 1.2% of Clockticks
L2 Bound: 0.0% of Clockticks
L3 Bound: 1.1% of Clockticks
L3 Latency: 1.6% of Clockticks
DRAM Bound: 0.0% of Clockticks
Memory Bandwidth: 0.1% of Clockticks
Memory Latency: 3.5% of Clockticks
Local DRAM: 0.0% of Clockticks
Remote DRAM: 0.0% of Clockticks
Remote Cache: 0.0% of Clockticks
Core Bound: 71.5% of Pipeline Slots
Memory Bound: 5.9% of Pipeline Slots
Cache Bound: 2.3% of Clockticks
DRAM Bound: 0.0% of Clockticks
DRAM Bandwidth Bound: 0.0% of Elapsed Time
NUMA: % of Remote Accesses: 0.0%
Vectorization: 0.0% of Packed FP Operations
Instruction Mix
HP FLOPs: 0.0% of uOps
Packed: 0.0%
128-bit: 0.0%
256-bit: 0.0%
512-bit: 0.0%
Scalar: 0.0%
SP FLOPs: 0.0% of uOps
Packed: 24.7% from SP FP
128-bit: 24.7% from SP FP
256-bit: 0.0% from SP FP
512-bit: 0.0% from SP FP
Scalar: 75.3% from SP FP
DP FLOPs: 0.7% of uOps
Packed: 0.0% from DP FP
128-bit: 0.0% from DP FP
256-bit: 0.0% from DP FP
512-bit: 0.0% from DP FP
Scalar: 100.0% from DP FP
AMX BF16 FLOPs: 0.0% of uOps
x87 FLOPs: 0.0% of uOps
Non-FP: 99.3% of uOps
FP Arith/Mem Rd Instr. Ratio: 0.031
FP Arith/Mem Wr Instr. Ratio: 0.072
PCIe Bandwidth: 4.280 GB/s
PCI Device Class PCIe Bandwidth, GB/s
------------------ --------------------
Network controller 4.280
Bridge 0.000
[Unknown] 0.000
```
That looks like a COMPLETELY different workload, one which should be way less stressful for the cores (IPC < 1 compared to 2.6 before), and it points to some problem with the scheduling logic within TRex or some problem within DPDK.
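If it is a DPDK/driver-side issue, one way to narrow it down would be to dump the extended port stats during the 5-core run and look for TX errors or per-queue drops that would indicate the mlx5 side is backpressuring the scheduler. A minimal sketch using DPDK's xstats API (port 0 assumed, error handling mostly omitted):

```cpp
#include <cinttypes>
#include <cstdio>
#include <vector>
#include <rte_ethdev.h>

// Dump all extended stats for one port. On mlx5 this includes error and
// per-queue counters that would show whether the NIC is the bottleneck.
static void dump_xstats(uint16_t port_id) {
    // A first call with no buffer returns the number of available xstats.
    int n = rte_eth_xstats_get_names(port_id, nullptr, 0);
    if (n <= 0)
        return;
    std::vector<rte_eth_xstat_name> names(n);
    std::vector<rte_eth_xstat> vals(n);
    rte_eth_xstats_get_names(port_id, names.data(), n);
    if (rte_eth_xstats_get(port_id, vals.data(), n) != n)
        return;
    for (int i = 0; i < n; ++i)
        std::printf("%s: %" PRIu64 "\n", names[vals[i].id].name, vals[i].value);
}
```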