
calico nodes 3.27.3 and 3.28.0 crash on IPv6 only cluster #8941

Open
telmich opened this issue Jun 25, 2024 · 3 comments

telmich commented Jun 25, 2024

Expected Behavior

calico-node runs

Current Behavior

calico-node crashes

Possible Solution

Unclear

Steps to Reproduce (for bugs)

Run

VERSION=v3.28.0
helm repo add projectcalico https://docs.projectcalico.org/charts
helm repo update
helm upgrade --install --namespace tigera calico projectcalico/tigera-operator --version $VERSION --create-namespace

Result:

[23:57] sun:~% kubectl get pods -n calico-system
NAME                            READY   STATUS             RESTARTS            AGE
calico-node-2rdj6               1/1     Running            3                   213d
calico-node-g9vfn               0/1     CrashLoopBackOff   5 (<invalid> ago)   4m35s
calico-node-gxt2v               1/1     Running            7                   213d
calico-node-khpx8               1/1     Running            5 (90d ago)         164d
calico-node-lgjt7               1/1     Running            0                   12m
calico-node-ms4wr               1/1     Running            0                   16m
calico-node-rn5pl               1/1     Running            0                   20m
calico-node-vhnfg               1/1     Running            6                   164d
calico-typha-579b449cb9-7wlhc   1/1     Running            0                   4m35s
calico-typha-579b449cb9-fflxc   1/1     Running            0                   4m35s
calico-typha-579b449cb9-gljhc   1/1     Running            0                   4m35s
[23:57] sun:~% 
[23:53] sun:~% kubectl -n calico-system logs  calico-node-g9vfn         
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
2024-06-25 21:58:37.304 [INFO][4] startup/startup.go 431: Early log level set to info
2024-06-25 21:58:37.304 [INFO][4] startup/utils.go 126: Using NODENAME environment for node name server70
2024-06-25 21:58:37.304 [INFO][4] startup/utils.go 138: Determined node name: server70
2024-06-25 21:58:37.304 [INFO][4] startup/startup.go 95: Starting node server70 with version v3.28.0
2024-06-25 21:58:37.305 [INFO][4] startup/startup.go 436: Checking datastore connection
2024-06-25 21:58:37.318 [INFO][4] startup/startup.go 460: Datastore connection verified
2024-06-25 21:58:37.318 [INFO][4] startup/startup.go 105: Datastore is ready
2024-06-25 21:58:37.327 [WARNING][4] startup/winutils.go 150: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-06-25 21:58:37.338 [INFO][4] startup/startup.go 489: Initialize BGP data
2024-06-25 21:58:37.338 [WARNING][4] startup/autodetection_methods.go 99: Unable to auto-detect an IPv4 address: no valid IPv4 addresses found on the host interfaces
2024-06-25 21:58:37.338 [WARNING][4] startup/startup.go 511: Couldn't autodetect an IPv4 address. If auto-detecting, choose a different autodetection method. Otherwise provide an explicit address.
2024-06-25 21:58:37.338 [INFO][4] startup/startup.go 395: Clearing out-of-date IPv4 address from this node IP=""
2024-06-25 21:58:37.339 [INFO][4] startup/startup.go 399: Clearing out-of-date IPv6 address from this node IP=""
2024-06-25 21:58:37.350 [WARNING][4] startup/utils.go 48: Terminating
Calico node failed to start

Context

This is a 1.30.1 k8s cluster that was running Calico 3.26.4 before the upgrade:

[23:50] sun:~% helm -n tigera ls
NAME  	NAMESPACE	REVISION	UPDATED                               	STATUS  	CHART                  	APP VERSION
calico	tigera   	1       	2023-11-25 18:33:44.68511143 +0100 CET	deployed	tigera-operator-v3.26.4	v3.26.4    

The node on which calico-node crashes has the following IP+ routing information:

[00:07] server70.place6:~# ip a sh dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 3c:4a:92:79:bc:b6 brd ff:ff:ff:ff:ff:ff
    inet6 2a0a:e5c0:3:0:3e4a:92ff:fe79:bcb6/64 scope global dynamic mngtmpaddr proto kernel_ra 
       valid_lft 86400sec preferred_lft 14400sec
    inet6 fe80::3e4a:92ff:fe79:bcb6/64 scope link proto kernel_ll 
       valid_lft forever preferred_lft forever
[00:07] server70.place6:~# ip -6 r
2a0a:e5c0:3::/64 dev eth0 proto kernel metric 256 expires 86399sec pref medium
2a0a:e5c0:12:1:49ff:6d53:4609:8304 dev calia65cbcf37b3 metric 1024 pref medium
2a0a:e5c0:12:1:49ff:6d53:4609:8309 dev cali9602dfc7711 metric 1024 pref medium
2a0a:e5c0:12:1:49ff:6d53:4609:8316 dev cali8175fd46771 metric 1024 pref medium
2a0a:e5c0:12:1:49ff:6d53:4609:8321 dev calib6cb42879fe metric 1024 pref medium
2a0a:e5c0:12:1:49ff:6d53:4609:8328 dev caliacd966f9edc metric 1024 pref medium
2a0a:e5c0:12:1:49ff:6d53:4609:832b dev caliab9ad2efa7b metric 1024 pref medium
2a0a:e5c0:12:1:49ff:6d53:4609:8332 dev cali434564fd4ad metric 1024 pref medium
2a0a:e5c0:12:1:49ff:6d53:4609:833a dev caliab11d221847 metric 1024 pref medium
2a0a:e5c0:12:1:49ff:6d53:4609:833c dev cali66576bc8dc4 metric 1024 pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
fe80::/64 dev caliacd966f9edc proto kernel metric 256 pref medium
fe80::/64 dev caliab11d221847 proto kernel metric 256 pref medium
fe80::/64 dev calib6cb42879fe proto kernel metric 256 pref medium
fe80::/64 dev cali9602dfc7711 proto kernel metric 256 pref medium
fe80::/64 dev calia65cbcf37b3 proto kernel metric 256 pref medium
fe80::/64 dev cali434564fd4ad proto kernel metric 256 pref medium
fe80::/64 dev cali66576bc8dc4 proto kernel metric 256 pref medium
fe80::/64 dev cali8175fd46771 proto kernel metric 256 pref medium
fe80::/64 dev caliab9ad2efa7b proto kernel metric 256 pref medium
default via fe80::20d:b9ff:fe4a:fbe8 dev eth0 proto ra metric 1024 expires 599sec hoplimit 64 pref medium
default via fe80::20d:b9ff:fe4a:f9ac dev eth0 proto ra metric 1024 expires 595sec hoplimit 64 pref high
[00:07] server70.place6:~# 

Your Environment

  • Calico version 3.28.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): k8s 1.30.1
  • Operating System and version: Alpine Linux

Logs from 3.26.4

For reference, logs from a running pod on another node are attached as calico-node-lgjt7-log.txt.


telmich commented Jun 25, 2024

For reference, the Tigera Installation:

% kubectl -n tigera get installations.operator.tigera.io -o yaml 
apiVersion: v1
items:
- apiVersion: operator.tigera.io/v1
  kind: Installation
  metadata:
    annotations:
      meta.helm.sh/release-name: calico
      meta.helm.sh/release-namespace: tigera
    creationTimestamp: "2023-11-25T17:36:04Z"
    finalizers:
    - tigera.io/operator-cleanup
    - operator.tigera.io/installation-controller
    generation: 5
    labels:
      app.kubernetes.io/managed-by: Helm
    name: default
    resourceVersion: "365039407"
    uid: e131aee1-1715-4658-a1ad-667068394c13
  spec:
    calicoNetwork:
      bgp: Enabled
      hostPorts: Enabled
      ipPools:
      - allowedUses:
        - Workload
        - Tunnel
        blockSize: 122
        cidr: 2a0a:e5c0:12:1::/64
        disableBGPExport: false
        encapsulation: None
        name: default-ipv6-ippool
        natOutgoing: Disabled
        nodeSelector: all()
      linuxDataplane: Iptables
      linuxPolicySetupTimeoutSeconds: 0
      multiInterfaceMode: None
      nodeAddressAutodetectionV4:
        firstFound: true
      nodeAddressAutodetectionV6:
        firstFound: true
      windowsDataplane: Disabled
    cni:
      ipam:
        type: Calico
      type: Calico
    controlPlaneReplicas: 2
    flexVolumePath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
    imagePullSecrets: []
    kubeletVolumePluginPath: /var/lib/kubelet
    kubernetesProvider: ""
    logging:
      cni:
        logFileMaxAgeDays: 30
        logFileMaxCount: 10
        logFileMaxSize: 100Mi
        logSeverity: Info
    nodeUpdateStrategy:
      rollingUpdate:
        maxUnavailable: 1
      type: RollingUpdate
    nonPrivileged: Disabled
    variant: Calico
  status:
    conditions:
    - lastTransitionTime: "2024-06-25T22:20:42Z"
      message: 'Pod calico-system/calico-node-9pj2s has crash looping container: calico-node'
      observedGeneration: 5
      reason: PodFailure
      status: "True"
      type: Degraded
    - lastTransitionTime: "2024-06-25T22:20:42Z"
      message: ""
      observedGeneration: 5
      reason: Unknown
      status: "False"
      type: Ready
    - lastTransitionTime: "2024-06-25T22:20:42Z"
      message: ""
      observedGeneration: 5
      reason: Unknown
      status: "False"
      type: Progressing
kind: List
metadata:
  resourceVersion: ""
telmich changed the title from "calico nodes with 3.28.0 crashes on IPv6 only cluster" to "calico nodes 3.27.3 and 3.28.0 crash on IPv6 only cluster" on Jun 26, 2024

telmich commented Jun 26, 2024

Interestingly, same result with 3.27.3:

VERSION=v3.27.3
helm repo add projectcalico https://docs.projectcalico.org/charts
helm repo update
helm upgrade --install --namespace tigera calico projectcalico/tigera-operator --version $VERSION --create-namespace

% kubectl get pods -n calico-system   
NAME                                       READY   STATUS             RESTARTS            AGE
calico-kube-controllers-5d7d79486c-kfnc2   1/1     Running            0                   2m1s
calico-node-2bpv2                          0/1     CrashLoopBackOff   3 (34s ago)         2m2s
calico-node-65cgv                          0/1     CrashLoopBackOff   3 (<invalid> ago)   2m2s
calico-node-gxt2v                          1/1     Running            7                   213d
calico-node-khpx8                          1/1     Running            5 (91d ago)         164d
calico-node-lgjt7                          1/1     Running            0                   9h
calico-node-ms4wr                          1/1     Running            0                   9h
calico-node-vhnfg                          1/1     Running            6                   164d
calico-node-w9rsn                          0/1     CrashLoopBackOff   3 (<invalid> ago)   2m2s
calico-typha-5665d494b5-7sgrw              1/1     Running            0                   2m3s
calico-typha-5665d494b5-hhfnm              1/1     Running            0                   2m3s
calico-typha-5665d494b5-k5fvz              1/1     Running            0                   2m3s
csi-node-driver-29f8f                      2/2     Running            0                   2m2s
csi-node-driver-2br86                      2/2     Running            0                   75s
csi-node-driver-2q6rd                      2/2     Running            0                   69s
csi-node-driver-42gmf                      2/2     Running            0                   83s
csi-node-driver-ftqq2                      2/2     Running            0                   64s
csi-node-driver-n6k9b                      2/2     Running            0                   53s
csi-node-driver-rbzqz                      2/2     Running            0                   59s
csi-node-driver-tkqf7                      2/2     Running            0                   95s

% kubectl -n calico-system logs calico-node-2bpv2
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
2024-06-26 07:28:35.394 [INFO][4] startup/startup.go 445: Early log level set to info
2024-06-26 07:28:35.394 [INFO][4] startup/utils.go 126: Using NODENAME environment for node name server69
2024-06-26 07:28:35.394 [INFO][4] startup/utils.go 138: Determined node name: server69
2024-06-26 07:28:35.394 [INFO][4] startup/startup.go 95: Starting node server69 with version v3.27.3
2024-06-26 07:28:35.395 [INFO][4] startup/startup.go 450: Checking datastore connection
2024-06-26 07:28:35.410 [INFO][4] startup/startup.go 474: Datastore connection verified
2024-06-26 07:28:35.410 [INFO][4] startup/startup.go 105: Datastore is ready
2024-06-26 07:28:35.418 [WARNING][4] startup/winutils.go 144: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-06-26 07:28:35.430 [INFO][4] startup/startup.go 503: Initialize BGP data
2024-06-26 07:28:35.430 [WARNING][4] startup/autodetection_methods.go 99: Unable to auto-detect an IPv4 address: no valid IPv4 addresses found on the host interfaces
2024-06-26 07:28:35.430 [WARNING][4] startup/startup.go 525: Couldn't autodetect an IPv4 address. If auto-detecting, choose a different autodetection method. Otherwise provide an explicit address.
2024-06-26 07:28:35.431 [INFO][4] startup/startup.go 409: Clearing out-of-date IPv4 address from this node IP=""
2024-06-26 07:28:35.431 [INFO][4] startup/startup.go 413: Clearing out-of-date IPv6 address from this node IP=""
2024-06-26 07:28:35.446 [WARNING][4] startup/utils.go 48: Terminating
Calico node failed to start

caseydavenport (Member) commented

2024-06-25 21:58:37.338 [WARNING][4] startup/autodetection_methods.go 99: Unable to auto-detect an IPv4 address: no valid IPv4 addresses found on the host interfaces

Seems like Calico is configured to detect an IPv4 address on the host, which it is failing to do because this is an IPv6-only cluster.

You might want to adjust your Installation to disable IPv4 autodetection, like this:

      nodeAddressAutodetectionV4: {}
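
For reference, a minimal sketch of one way to apply that change from the command line (assuming the Installation resource is named default, as in the YAML above; the patch path follows that same spec):

# hypothetical patch: replace the IPv4 autodetection block with an empty object
# so calico-node stops trying to find an IPv4 address on the host interfaces
kubectl patch installations.operator.tigera.io default --type=json -p '[
  {"op": "replace",
   "path": "/spec/calicoNetwork/nodeAddressAutodetectionV4",
   "value": {}}
]'

If the operator accepts the change, it should roll out new calico-node pods with IPv4 autodetection disabled.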
