T-rex 3.04 with Mellanox ConnectX-6 and dummy port shows errors instead of stats while sending traffic #1101
Comments
Fun fact: I tried v2.87 and there the stats look correct. So I wanted to find out which version introduced the change, but starting with v2.88 I always run into this error:
Seems to be related to the DPDK update, so the hunt between v2.87 and v3.04 is just a bit too much :/
v3.03 looks good, might be the same issue as seen in #1091
I have the same problem as the author. With ConnectX-6, it behaves differently from the output above. However, it works fine with ConnectX-5 and ConnectX-4 (I hope that helps to narrow it down). I've tried git bisecting between v3.03 and v3.04, and indeed b0bc6c7 is the commit that broke it.
@Civil thanks for the bisect. The commit you mentioned is the DPDK upgrade; the bug might be related to the new DPDK, and that upgrade was the main objective of the new version.
@hhaim if that helps, I've tried to adapt DPDK 23.11 in my fork: https://github.com/Civil/trex-core/tree/dpdk2311 But as I'm not that familiar with your build system and its assumptions, I probably did it in a somewhat suboptimal way. It builds and runs like that; however, the problem with ConnectX-6 persists: stats are broken, but now in a slightly different way than before.
Overall, while porting, I had a few small complications:
I've also split my work into two commits: one to import DPDK and a second one to port the patch (I did a diff against vanilla 23.03). I suspect that it is still a DPDK-related problem, but I'm not familiar enough with the code to build a minimal repro to report to the DPDK mailing list, and it will take a bit of time to familiarize myself with the code base before I'm able to do something like that...
@Civil looking at the TUI output above, it seems there is an issue with the Tx counters too (ports 0 and 1). Are you using a CX-6 2x100Gbps with a PCIe4 motherboard? We have CX-5de cards with a PCIe3 motherboard in our lab, and you can't reach 100gbps total bandwidth there due to the PCIe bandwidth shortage. I would reduce the rate and look into the detailed counters (the show counters command in the Console) to understand the reason for the errors. We don't have a CX-6 in our lab, so we can't test it; upgrading DPDK with the NVIDIA driver is usually painful because of its dependency on the kernel and OFED.
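As a rough sanity check on the PCIe point above, the theoretical link bandwidth can be estimated from the per-lane transfer rate and the 128b/130b line encoding used by PCIe Gen3 and Gen4. This is only a back-of-the-envelope sketch in Python: the helper name is made up for illustration, and protocol (TLP/DLLP) overhead is not modeled, so real throughput is lower than these numbers.

```python
# Per-lane raw transfer rates in GT/s for the two PCIe generations discussed.
GT_PER_LANE = {3: 8.0, 4: 16.0}


def pcie_link_gbps(gen: int, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth in Gbit/s (hypothetical helper).

    Only the 128b/130b line encoding is accounted for; packet-level
    protocol overhead is ignored, so usable throughput is lower.
    """
    return GT_PER_LANE[gen] * lanes * 128 / 130


for gen in (3, 4):
    print(f"PCIe Gen{gen} x16: {pcie_link_gbps(gen):.1f} Gbit/s")
```

Under these assumptions a Gen3 x16 slot tops out at roughly 126 Gbit/s before protocol overhead, which is below the 200 Gbit/s a 2x100G card can offer in aggregate, while Gen4 x16 comes out at roughly 252 Gbit/s.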
@hhaim ports 0 and 1 are a single CX-6 2x100; ports 2 and 3 are a CX-5 2x100, cdat (PCIe Gen4 version), included to show that it works just fine with the older generation.
Yes, both cards link at x16 Gen4.
I have CX-4, 5 and 6 on hand, so I can run any kind of test on any of those cards, if you have something in mind.
If that matters, this machine of mine runs Debian 12. I've tried OFED 5.7-1.0.2.0 and OFED 24.01-0.3.3.1 and haven't noticed any difference at all. As for the kernel, I've tried 6.1 (stock) and 6.5 (backports). And as I've said, both trex 3.04 (dpdk 23.03) and trex after my attempt to port dpdk 23.11 show the same behavior. And, as the original reporter said, trex 3.03 (dpdk 22.03) works fine. I've so far tried to revert some of the changes around the stats/xstats structure in
If that matters, I've tried a few firmware versions on the ConnectX-6; currently I'm at 22.40.1000, but I've also tried 22.31.1014.
@Civil it seems that the trex driver counter code for CX-6 needs to be modified. To understand it: the mapping between the raw counters and the trex counters is in this function
this function maps
Honestly, I don't see anything wrong with the raw counters there, but I'll have a look at the code later; maybe I'll find a problem there. What makes me wonder is why
@Civil it seems that the TUI and counters work well at a low rate of 1kpps. BTW the TUI works with trex-driver-counters; this is the function that converts the (stats -x
@hhaim the pps rate doesn't change the behavior: even if I send just 1 pps, it is counted as ierrors in the trex per-port stats table and in the TUI with dpdk 23.03+. But here are the outputs for a 10mpps test:
For dpdk-23.11:
v3.04 dpdk-23.03:
v3.03 (just
For some reason it shows me the error I showed you above. I haven't had time to dig into the function code yet, but at least I have a clue that something is wrong with how it is converted inside trex.
So about
So that is expected when a network card family can provide more counters than a particular card supports: there can be more names than actual stats, but the indexes should still match. I have a question: how performance critical is
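The relationship between names and values described above can be modeled in a few lines. This is a Python sketch, not TRex or DPDK code: it assumes, as I understand DPDK's xstats API, that each returned entry carries an id that indexes into the names table, and all counter names and numbers below are made up for illustration.

```python
from typing import Dict, List, Tuple


def pair_xstats(names: List[str], values: List[Tuple[int, int]]) -> Dict[str, int]:
    """Pair each (id, value) entry with its name; ids index into `names`."""
    return {names[i]: v for i, v in values if 0 <= i < len(names)}


# Made-up data: the names table lists five counters, the card reports four.
names = ["rx_good_packets", "tx_good_packets", "rx_errors",
         "rx_phy_packets", "rx_unsupported_example"]
values = [(0, 1000), (1, 900), (2, 0), (3, 1000)]

counters = pair_xstats(names, values)
print(counters["rx_good_packets"])
# A name without a reported value simply falls back to a default.
print(counters.get("rx_unsupported_example", 0))
```

Extra names at the end of the table are harmless here, because every value is matched to its name through the id rather than through an assumption about how many entries exist.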
So I've rewritten extended_stats to be much simpler, and after that it works well on both ConnectX-5 and 6. This is what I did: Civil@a22596a If you are interested, I can port it to current master and send a PR. It might require a bit of extra work, as I only made sure that my use case works well and I probably lack historical context on some of the decisions, but it can be a good start for you.
Nice. I looked into your commit; it is strange that some counters are no longer computed as diffs (e.g. ibytes/obytes/errors), for example:
Was that due to a kernel driver change? If you can create a PR, I could run it in our regression to verify it on our hardware.
That was because I just removed the code (keeping only the counter names), rewrote it from scratch, and didn't understand the value of taking the diff. I can bring it back, but I thought it only makes sense for the moments when a driver counter wraps (which I assumed happens only if the counter is not actually a uint64 underneath, and I haven't seen any comments about that in the DPDK code). In the PR I can put the diffs back in place; they shouldn't hurt, as I think the main benefit is having a single reliable way to get a counter without relying on its exact position in the array (and that is what changed in DPDK).
@Civil understood. It is better to keep the diff for all the counters; it is not related to the wrap (wrap is another issue).
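One way to picture "keeping the diff" is a reader that remembers a baseline of raw driver counters and reports values relative to it. The sketch below is a Python model with invented class and counter names, not the TRex implementation; it assumes the diff is taken against a snapshot captured when the stats are cleared, which the thread does not spell out.

```python
from typing import Dict


class DiffCounters:
    """Report counters relative to a baseline snapshot (hypothetical model)."""

    def __init__(self) -> None:
        self._base: Dict[str, int] = {}

    def clear(self, raw: Dict[str, int]) -> None:
        # The raw counters keep running, so "clearing" only records a baseline.
        self._base = dict(raw)

    def read(self, raw: Dict[str, int]) -> Dict[str, int]:
        # Counters are found by name; names without a baseline start from 0.
        return {name: value - self._base.get(name, 0)
                for name, value in raw.items()}


stats = DiffCounters()
stats.clear({"rx_good_packets": 5000, "rx_errors": 2})
now = stats.read({"rx_good_packets": 5100, "rx_errors": 2})
print(now)  # {'rx_good_packets': 100, 'rx_errors': 0}
```

Because both the baseline and the current values are keyed by name, the result does not depend on where a counter sits in the driver's array.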
OK, understood; thanks for the clarification. I've opened a PR that keeps the diff:
P.S. I've also verified that the same problem happens on BlueField-2 without the PR, and that the PR fixes it as well (which is expected, I guess, because it is essentially a ConnectX-6 with extras). In case you need more tests or more information, I now have both a CX6 and a BF2 in my homelab.
@norg I think that now that my PR has been accepted, trex (from master, or with the patch from the PR applied manually) should work for you as well. At least I can now get everything working even with a mix of different ConnectX cards.
Awesome, sounds great. Thanks!
While I have some T-rex setups running with Intel XL710 to achieve 40G with one port, I struggle with a 100G setup based on Mellanox ConnectX-6 cards.
So far the OFED installation worked, the T-Rex setup as well, and I can even see the traffic on the DUT, but the report from T-rex shows errors instead of the rate, which I was used to from the setup with Intel NICs and DPDK.
T-Rex output:
Run command:
I use -p to have the full flow received by the single port of the DUT (it's the Suricata IDS so I just need passive traffic forwarding towards it).
The /etc/trex_cfg.yaml looks like this:
While 82:00.1 is the port of the Mellanox NIC on the machine doing the traffic replay and the first destination mac being the DUT. But even if I remove the port info section it still shows those errors.
This type of setup with the -p flag and also the dummy in the config works with the Intel setup where I see the traffic rate and all those stats (of course just for opackets and obytes while ipackets is 0).
Is this just an output issue? I would like to see the stats and especially traffic rate being sent so I can verify on the DUT side if the traffic rate is received.
Thanks