Connection issue for a multiple zone cluster with calico 3.28.0 #8860

Open
lzhecheng opened this issue May 28, 2024 · 3 comments

lzhecheng commented May 28, 2024

Expected Behavior

A Node can reach a service whose endpoint is on another Node (different zone) immediately.

Current Behavior

A Node cannot reach a service whose endpoint is on another Node (different zone) immediately. The first packet is dropped and the second one works.

Possible Solution

Use calico 3.27.3

Steps to Reproduce (for bugs)

  1. Create a multi-zone cluster
  2. Create a service with endpoints on nodes in different zones
  3. Run wget from a Node against the service
  4. If the endpoint is not on the same Node, the first packet is lost
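
A minimal sketch for checking step 3 without wget, assuming the client is Linux: there an unanswered SYN is retransmitted after an initial timeout of about 1 second, so a TCP connect that takes roughly that long suggests the first SYN was dropped. The function names and the 0.9 s threshold are illustrative assumptions, not part of the original report.

```python
import socket
import time

# Assumption: the Linux initial SYN retransmission timeout is about 1 second,
# so a handshake that takes nearly that long most likely needed a second SYN.
SYN_RETRANSMIT_THRESHOLD_S = 0.9


def time_connect(host: str, port: int, timeout: float = 5.0) -> float:
    """Return the seconds taken to complete one TCP handshake to host:port."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # the handshake is complete once create_connection returns
    return time.monotonic() - start


def looks_retransmitted(elapsed: float) -> bool:
    """Heuristic: True if the handshake was slow enough to imply a lost SYN."""
    return elapsed >= SYN_RETRANSMIT_THRESHOLD_S
```

Calling `time_connect(<service IP>, <service port>)` from a Node a few times and passing each result to `looks_retransmitted` should show whether every new connection pays the retransmission delay. A slow result can also come from ordinary latency, so confirm with a packet capture.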

Context

Details here: kubernetes-sigs/cloud-provider-azure#6293

Your Environment

  • Calico version
  • Orchestrator version (e.g. kubernetes, mesos, rkt):
  • Operating System and version:
  • Link to your project (optional):

matthewdupre (Member) commented

Sounds like a regression, we'll have a look


sfudeus commented Jun 25, 2024

I might have a similar issue and was just about to open a bug report. For me, this is related to VXLAN checksum offloading.
For vxlan-tunneled traffic, the first SYN is lost and has to be resent.
Please advise if I should add my data here or create a dedicated issue.

@lzhecheng Can you try disabling checksum offloading (featureDetectOverride: ChecksumOffloadBroken=true in FelixConfiguration) to see if it makes a difference for you?
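
A hedged sketch of how that override could be applied: it builds a merge patch for the FelixConfiguration spec and the matching kubectl command line, without running it. The resource name `default` and the use of `kubectl patch` are assumptions about the cluster setup (some installations manage this resource with calicoctl instead); the helper names are hypothetical.

```python
import json
import shlex


def felix_override_patch(broken: bool = True) -> dict:
    """Build a merge patch that sets featureDetectOverride on FelixConfiguration."""
    value = "true" if broken else "false"
    return {"spec": {"featureDetectOverride": f"ChecksumOffloadBroken={value}"}}


def kubectl_patch_command(patch: dict, name: str = "default") -> str:
    """Return a shell-quoted kubectl command; `name` is assumed to be 'default'."""
    args = [
        "kubectl", "patch", "felixconfiguration", name,
        "--type=merge", "-p", json.dumps(patch),
    ]
    return shlex.join(args)


print(kubectl_patch_command(felix_override_patch()))
```

Printing the command rather than executing it leaves room to review it first. To revert, remove the field or patch it back with `felix_override_patch(broken=False)`.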


sfudeus commented Jun 26, 2024

Here are the basics of what I am observing.
With VXLAN checksum offloading enabled, the first SYN packet is lost for any traffic that is

  • directed at a NodePort or an external/loadBalancer IP, and
  • forwarded to a pod on a different subnet (i.e. requiring VXLAN, which is likely always the case when using vxlanMode: Always instead of CrossSubnet).

I could not observe packet loss when sending traffic to the pod IP directly, only when it went through the K8s iptables rules for NodePort/LoadBalancer, possibly because of NAT?

I observed the packet loss to happen only on the destination node, between the physical interface and the pod interface, i.e. I could still see the first SYN packet VXLAN encapsulated on the physical interface, but only the second SYN popped up on the pod interface (cali*).
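
The two capture points described above could be watched with tcpdump commands along these lines. This is a sketch that only builds the argument lists: the interface names are placeholders, and UDP port 4789 is assumed because it is Calico's default VXLAN port, which a cluster may override.

```python
import shlex

VXLAN_UDP_PORT = 4789  # assumption: Calico's default vxlanPort is in use

# Matches TCP segments with SYN set and ACK clear (initial SYNs, IPv4 only).
SYN_ONLY_FILTER = "tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn"


def capture_commands(phys_iface: str, pod_iface: str) -> dict:
    """Return tcpdump argument lists for the physical and pod-side interfaces."""
    return {
        # Encapsulated traffic as it arrives on the destination node.
        "physical": ["tcpdump", "-ni", phys_iface, f"udp port {VXLAN_UDP_PORT}"],
        # Decapsulated SYNs as they reach the pod's cali* interface.
        "pod": ["tcpdump", "-ni", pod_iface, SYN_ONLY_FILTER],
    }


# Hypothetical interface names; substitute the real ones on the destination node.
for label, args in capture_commands("eth0", "cali0123456789a").items():
    print(label, shlex.join(args))
```

If the first SYN shows up in the physical capture but only the retransmitted one appears in the pod capture, that matches the behavior reported here.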

My test client was only running in the hostNetwork, I didn't test from a pod (yet).
