
Calico potentially losing track of state intermittently? #8942

Open
henryzhao95 opened this issue Jun 26, 2024 · 1 comment

@henryzhao95

Expected Behavior

We have Argo CD running in numerous Kubernetes clusters. Each deployment includes:

  • argocd-redis-ha-server StatefulSet pods with a redis container listening on port 6379
  • argocd-redis-ha-server StatefulSet pods with a sentinel container listening on port 26379
  • argocd-redis-ha-haproxy ReplicaSet pods with a haproxy container listening on ports 6379 and 9101, fronted by a Kubernetes Service

We have Calico NetworkPolicies in place to allow ingress to these ports, for example:

  ingress:
    - action: Allow
      destination:
        ports:
          - 26379
          - 6379
      protocol: TCP
      source:
        namespaceSelector: name == 'argocd'
        selector: >-
          app.kubernetes.io/name in {'argocd-redis-ha',
          'argocd-redis-ha-haproxy', 'argocd-server', 'argocd-repo-server',
          'argocd-application-controller'}
  order: 150
  selector: app.kubernetes.io/name in {'argocd-redis-ha', 'argocd-redis-ha-haproxy'}
  types:
    - Ingress

So we expect Argo CD to work, with nothing being denied. (We also have a log-and-deny-all rule at the end.)
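
As a sanity check (a sketch, assuming calicoctl is available and the policy is a namespaced Calico NetworkPolicy in the argocd namespace; the policy name below is only a placeholder), the policy as stored in the datastore can be dumped and compared against the intended rule:

# List the Calico NetworkPolicies in the argocd namespace.
calicoctl get networkpolicy -n argocd

# Dump one policy in full to confirm the ingress rule, selector and order above.
# "argocd-redis-ha" is a placeholder name -- substitute the real policy name.
calicoctl get networkpolicy argocd-redis-ha -n argocd -o yaml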

Current Behavior

From time to time (roughly once a month per cluster), at random and not coinciding with new calico-node or Argo pods, we see a burst of 3 blocked Argo flows spaced roughly 100 seconds apart, e.g. one at 4:57:39 pm, one at 4:59:19 pm, and one at 5:01:00 pm.

These blocked flows report the inverse of the flow we'd normally expect.
e.g. Blocked: argocd-redis-ha-server:26379 --> argocd-redis-ha-haproxy:40962
Expected flow: argocd-redis-ha-haproxy:40962 --> argocd-redis-ha-server:26379

e.g. Blocked: argocd-redis-ha-server:6379 --> argocd-redis-ha-haproxy:51418
Expected flow: argocd-redis-ha-haproxy:51418 --> argocd-redis-ha-server:6379

I don't see anything out of the ordinary in the Calico pod logs. My understanding of networking is weak, but it feels like Calico, which should be stateful, is intermittently losing track of the state of the network flows. Is that possible? Or are there other theories?

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment

  • Calico version v3.27.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): EKS with Kubelet
    v1.28.8-eks-ae9a62a
  • Operating System and version: Amazon Linux 2, 5.10.217-205.860.amzn2.x86_64
  • Link to your project (optional):
@fasaxc
Member

fasaxc commented Jun 27, 2024

Yes, Calico is a stateful firewall; we track connections in the kernel's connection tracking ("conntrack") table. You can see conntrack entries with conntrack -L to list them all, or conntrack -E to watch for changes.
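
For example, to narrow the view down to the ports in this issue (a sketch; these commands need the conntrack CLI from conntrack-tools and have to be run on the node hosting the affected pods):

# List tracked TCP connections whose original destination port is redis or sentinel.
conntrack -L -p tcp --orig-port-dst 6379
conntrack -L -p tcp --orig-port-dst 26379

# Stream conntrack events (new/update/destroy) for the redis port as they happen.
conntrack -E -p tcp --orig-port-dst 6379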

The fact that the denied packets are in the reverse direction suggests that there was a previous connection that was being tracked but it was cleaned up. This could be for a few reasons:

  • The connection was closed and these are retransmitted FIN packets at the end of the connection.
  • The connection was silent for a very long time and the connection tracking entry timed out. The timeout is controlled with several sysctl settings:
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300

The net.netfilter.nf_conntrack_tcp_timeout_established timeout is the one for connections that were fully established. It is typically very long (days) but connections that are silent for a long time do hit it.

  • The connection tracking entry was deliberately removed. Calico does this when a local pod is torn down, to prevent a later pod with the same IP from re-using connection tracking entries.
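
To narrow down which of these is happening, a rough sketch of what could be checked on the affected node (again assuming the conntrack CLI is available; 6379 is the redis port from this issue):

# Inspect the relevant conntrack timeouts in effect on the node.
sysctl net.netfilter.nf_conntrack_tcp_timeout_established
sysctl net.netfilter.nf_conntrack_tcp_timeout_fin_wait
sysctl net.netfilter.nf_conntrack_tcp_timeout_time_wait

# Watch only DESTROY events for connections to the redis port; a destroy shortly
# before a burst of reverse-direction denies would point at entry removal or expiry.
conntrack -E -e DESTROY -p tcp --orig-port-dst 6379

If an idle timeout turns out to be the cause, the usual mitigations are TCP keepalives on the client or server side, or a longer established timeout.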