Failed to attach bpf program to pod interface #8897

Open
xander-sh opened this issue Jun 10, 2024 · 9 comments
Labels
area/bpf eBPF Dataplane issues kind/support

Comments

@xander-sh

xander-sh commented Jun 10, 2024

Context

After upgrading Calico to version 3.27.2 in clusters running PostgreSQL managed by Patroni and deployed by the Zalando operator, we started to see errors in the database pods.

2024-06-01T07:40:03+03:00  2024-06-01 04:40:03 UTC [35341]: [1-1] 665aa623.8a0d 0 [unknown] [unknown] [unknown] 10.222.59.107 LOG:  connection received: host=10.222.59.107 port=35708
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35341]: [2-1] 665aa623.8a0d 0 airflow adm [unknown] 10.222.59.107 LOG:  connection authenticated: identity="adm" method=md5 (/home/postgres/pgdata/pgroot/data/pg_hba.conf:12)
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35341]: [3-1] 665aa623.8a0d 0 airflow adm [unknown] 10.222.59.107 LOG:  connection authorized: user=adm database=airflow SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256)
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35341]: [4-1] 665aa623.8a0d 0 airflow adm [unknown] 10.222.59.107 LOG:  disconnection: session time: 0:00:00.008 user=adm database=airflow host=10.222.59.107 port=35708
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35342]: [1-1] 665aa623.8a0e 0 [unknown] [unknown] [unknown] 10.222.20.175 LOG:  connection received: host=10.222.20.175 port=42608
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35342]: [2-1] 665aa623.8a0e 0 airflow adm [unknown] 10.222.20.175 LOG:  connection authenticated: identity="adm" method=md5 (/home/postgres/pgdata/pgroot/data/pg_hba.conf:12)
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35342]: [3-1] 665aa623.8a0e 0 airflow adm [unknown] 10.222.20.175 LOG:  connection authorized: user=adm database=airflow SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256)
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35342]: [4-1] 665aa623.8a0e 0 airflow adm [unknown] 10.222.20.175 LOG:  disconnection: session time: 0:00:00.012 user=adm database=airflow host=10.222.20.175 port=42608
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35344]: [1-1] 665aa623.8a10 0 [unknown] [unknown] [unknown] [local] LOG:  connection received: host=[local]
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35343]: [1-1] 665aa623.8a0f 0 [unknown] [unknown] [unknown] [local] LOG:  connection received: host=[local]
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35345]: [1-1] 665aa623.8a11 0 [unknown] [unknown] [unknown] [local] LOG:  connection received: host=[local]
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35343]: [2-1] 665aa623.8a0f 0 airflow postgres [unknown] [local] LOG:  connection authorized: user=postgres database=airflow application_name=pgq ticker
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35344]: [2-1] 665aa623.8a10 0 postgres postgres [unknown] [local] LOG:  connection authorized: user=postgres database=postgres application_name=pgq ticker
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35345]: [2-1] 665aa623.8a11 0 postgres postgres [unknown] [local] LOG:  connection authorized: user=postgres database=postgres application_name=pgq ticker
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35344]: [3-1] 665aa623.8a10 0 postgres postgres pgq ticker [local] LOG:  disconnection: session time: 0:00:00.002 user=postgres database=postgres host=[local]
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35343]: [3-1] 665aa623.8a0f 0 airflow postgres pgq ticker [local] LOG:  disconnection: session time: 0:00:00.003 user=postgres database=airflow host=[local]
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35345]: [3-1] 665aa623.8a11 0 postgres postgres pgq ticker [local] LOG:  disconnection: session time: 0:00:00.002 user=postgres database=postgres host=[local]
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.583 28 LOG {ticks: 0, maint: 0, retry: 0}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03,952 INFO: no action. I am (newfmm-db-airflow-1), the leader with the lock
2024-06-01T07:40:18+03:00 2024-06-01 04:40:13,846 INFO: Lock owner: newfmm-db-airflow-1; I am newfmm-db-airflow-1
2024-06-01T07:40:18+03:00 2024-06-01 04:40:18,853 ERROR: Request to server https://10.222.64.1:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.222.64.1', port=443): Read timed out. (read timeout=4.999803479760885)",)
2024-06-01T07:40:20+03:00 2024-06-01 04:40:20,501 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7fef011b63c8>, 'Connection to 10.222.64.1 timed out. (connect timeout=1.115929365158081)')': /api/v1/namespaces/airflow/endpoints/newfmm-db-airflow
2024-06-01T07:40:21+03:00 2024-06-01 04:40:21,619 ERROR: Request to server https://10.222.64.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.222.64.1', port=443): Max retries exceeded with url: /api/v1/namespaces/airflow/endpoints/newfmm-db-airflow (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7feef9d858d0>, 'Connection to 10.222.64.1 timed out. (connect timeout=1.115929365158081)'))",)
2024-06-01T07:40:23+03:00 2024-06-01 04:40:23,572 ERROR: Request to server https://10.222.64.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.222.64.1', port=443): Max retries exceeded with url: /api/v1/namespaces/airflow/endpoints/newfmm-db-airflow (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7feef9d855f8>, 'Connection to 10.222.64.1 timed out. (connect timeout=1)'))",)
2024-06-01T07:40:23+03:00 2024-06-01 04:40:23,572 INFO: Demoting self (immediate-nolock)
2024-06-01T07:40:23+03:00 2024-06-01 04:40:23,572 ERROR: failed to update leader lock

At the same time we see these errors in calico-node:

2024-06-01T07:40:03+03:00  2024-06-01 04:40:03.948 [INFO][69] felix/calc_graph.go 507: Local endpoint updated id=WorkloadEndpoint(node=stage1-6495d69d65xsx98l-tkf2b, orchestrator=k8s, workload=airflow/newfmm-db-airflow-1, name=eth0)
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.949 [INFO][69] felix/int_dataplane.go 1954: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-1" endpoint_id:"eth0" > endpoint:<state:"active" name:"cali58c64a0c8be" profile_ids:"kns.airflow" profile_ids:"ksa.airflow.postgres-pod" ipv4_nets:"10.222.59.119/32" > 
2024-06-01T07:40:03+03:00 libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [WARNING][69] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:1, PolicyIdx:24, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"to", HookLayout:hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [WARNING][69] felix/bpf_ep_mgr.go 1659: Failed to add policy to workload, removing from iptables allow list error=attaching program to wep: error attaching tc program invalid argument id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-0" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-6" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-9" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-c" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-2" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-8" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-a" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [WARNING][69] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:0, PolicyIdx:25, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"from", HookLayout:hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [WARNING][69] felix/bpf_ep_mgr.go 1675: Failed to apply policy to endpoint, leaving it dirty error=attaching program to wep: error attaching tc program invalid argument name="cali58c64a0c8be" wepID=orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-1" endpoint_id:"eth0" 
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-d" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.969 [INFO][69] felix/endpoint_mgr.go 648: Updating endpoint routes. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.969 [INFO][69] felix/endpoint_mgr.go 490: Re-evaluated workload endpoint status adminUp=true failed=false known=true operUp=true status="up" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/endpoint_mgr.go 602: Updating per-endpoint chains. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.969 [INFO][69] felix/status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="up" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.969 [INFO][69] felix/endpoint_mgr.go 1215: Applying /proc/sys configuration to interface. ifaceName="cali58c64a0c8be"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.976 [INFO][69] felix/status_combiner.go 81: Endpoint up for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"} ipVersion=0x4 status="up"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.976 [INFO][69] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"} status="up"
2024-06-01T07:40:08+03:00 libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-01T07:40:08+03:00 2024-06-01 04:40:08.995 [WARNING][69] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:1, PolicyIdx:24, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"to", HookLayout:hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-01T07:40:08+03:00 2024-06-01 04:40:08.995 [WARNING][69] felix/bpf_ep_mgr.go 1675: Failed to apply policy to endpoint, leaving it dirty error=attaching program to wep: error attaching tc program invalid argument name="cali58c64a0c8be" wepID=orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-1" endpoint_id:"eth0" 
2024-06-01T07:40:08+03:00 libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-01T07:40:08+03:00 2024-06-01 04:40:08.995 [WARNING][69] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:0, PolicyIdx:25, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"from", HookLayout:hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}

If we check the logs from before those errors, we can see that BPF programs had been re-applied to that interface every 10 seconds for a long period of time.

2024-06-01T07:39:33+03:00  2024-06-01 04:39:33.953 [INFO][69] felix/int_dataplane.go 1954: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-1" endpoint_id:"eth0" > endpoint:<state:"active" name:"cali58c64a0c8be" profile_ids:"kns.airflow" profile_ids:"ksa.airflow.postgres-pod" ipv4_nets:"10.222.59.119/32" > 
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.953 [INFO][69] felix/calc_graph.go 507: Local endpoint updated id=WorkloadEndpoint(node=stage1-6495d69d65xsx98l-tkf2b, orchestrator=k8s, workload=airflow/newfmm-db-airflow-1, name=eth0)
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/bpf_ep_mgr.go 2539: /proc/sys/net/ipv4/conf/cali58c64a0c8be/accept_local set to 1
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-0" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-6" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-9" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-c" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/bpf_ep_mgr.go 1824: Finished applying BPF programs for workload ifaceName="cali58c64a0c8be" timeTaken=12.953379ms
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/bpf_ep_mgr.go 1644: Updated workload interface. iface="cali58c64a0c8be"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-2" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-8" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-a" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-d" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.967 [INFO][69] felix/endpoint_mgr.go 648: Updating endpoint routes. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.967 [INFO][69] felix/endpoint_mgr.go 490: Re-evaluated workload endpoint status adminUp=true failed=false known=true operUp=true status="up" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/endpoint_mgr.go 602: Updating per-endpoint chains. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.967 [INFO][69] felix/endpoint_mgr.go 1215: Applying /proc/sys configuration to interface. ifaceName="cali58c64a0c8be"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.967 [INFO][69] felix/status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="up" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.971 [INFO][69] felix/status_combiner.go 81: Endpoint up for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"} ipVersion=0x4 status="up"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.971 [INFO][69] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"} status="up"
2024-06-01T07:39:43+03:00 2024-06-01 04:39:43.950 [INFO][69] felix/calc_graph.go 507: Local endpoint updated id=WorkloadEndpoint(node=stage1-6495d69d65xsx98l-tkf2b, orchestrator=k8s, workload=airflow/newfmm-db-airflow-1, name=eth0)
2024-06-01T07:39:43+03:00 2024-06-01 04:39:43.950 [INFO][69] felix/int_dataplane.go 1954: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-1" endpoint_id:"eth0" > endpoint:<state:"active" name:"cali58c64a0c8be" profile_ids:"kns.airflow" profile_ids:"ksa.airflow.postgres-pod" ipv4_nets:"10.222.59.119/32" > 
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.003 [INFO][69] felix/attach.go 174: Program attached to TC. attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:1, PolicyIdx:24, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"to", HookLayout:hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.035 [INFO][69] felix/attach.go 174: Program attached to TC. attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:0, PolicyIdx:25, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"from", HookLayout:hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.039 [INFO][69] felix/bpf_ep_mgr.go 1905: Attached programs to the WEP hook=egress iface="cali58c64a0c8be" type="workload"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.040 [INFO][69] felix/bpf_ep_mgr.go 1905: Attached programs to the WEP hook=ingress iface="cali58c64a0c8be" type="workload"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.040 [INFO][69] felix/bpf_ep_mgr.go 1824: Finished applying BPF programs for workload ifaceName="cali58c64a0c8be" timeTaken=89.975517ms
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/bpf_ep_mgr.go 2539: /proc/sys/net/ipv4/conf/cali58c64a0c8be/accept_local set to 1
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/bpf_ep_mgr.go 1644: Updated workload interface. iface="cali58c64a0c8be"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-0" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-2" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-6" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-8" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-9" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-a" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-c" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-d" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/endpoint_mgr.go 602: Updating per-endpoint chains. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.042 [INFO][69] felix/endpoint_mgr.go 648: Updating endpoint routes. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.044 [INFO][69] felix/endpoint_mgr.go 1215: Applying /proc/sys configuration to interface. ifaceName="cali58c64a0c8be"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.044 [INFO][69] felix/endpoint_mgr.go 490: Re-evaluated workload endpoint status adminUp=true failed=false known=true operUp=true status="up" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.044 [INFO][69] felix/status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="up" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.049 [INFO][69] felix/status_combiner.go 81: Endpoint up for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"} ipVersion=0x4 status="up"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.049 [INFO][69] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"} status="up"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.049 [INFO][69] felix/summary.go 100: Summarising 17 dataplane reconciliation loops over 1m0s: avg=20ms longest=99ms (resync-filter-v4,update-workload-iface)

The interface cali58c64a0c8be lacks some BPF programs:
calid2e5dc9 - a healthy interface
cali58c64a0 - a faulty interface

root@stage1-6495d69d65xsx98l-tkf2b:/# bpftool prog | grep calid2e5d
7766942: sched_cls  name p4E_calid2e5dc9  tag 5126a1005cd55b97
7766943: sched_cls  name p4I_calid2e5dc9  tag 456628228769520b
8131461: sched_cls  name p4I_calid2e5dc9  tag 456628228769520b
8131483: sched_cls  name p4E_calid2e5dc9  tag 5126a1005cd55b97
root@stage1-6495d69d65xsx98l-tkf2b:/# bpftool prog | grep cali58c64a0
6479986: sched_cls  name p4i_cali58c64a0  tag 0dbf0d10b02282f0
6479987: sched_cls  name p4e_cali58c64a0  tag b1d839878b133ee8
root@stage1-6495d69d65xsx98l-tkf2b:/# ls  /sys/fs/bpf/tc/cali58c64a0c8be_*
/sys/fs/bpf/tc/cali58c64a0c8be_egr:
cali_jump2

/sys/fs/bpf/tc/cali58c64a0c8be_igr:
cali_jump2
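
For comparison, the pin directories of the healthy interface can be listed the same way (the glob below assumes the healthy interface's full name begins with calid2e5d):

ls /sys/fs/bpf/tc/calid2e5d*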

Using calico-bpf we see the following information:

─$ kubectl -n kube-system exec calico-node-lggqf -- calico-node -bpf ifstate dump | grep cali58c64a0c8be
Defaulted container "calico-node" out of: calico-node, upgrade-ipam (init), install-cni (init), mount-bpffs (init)
 2656 : {flags: workload,ready, XDPPolicy: -1, IngressPolicy: 24, EgressPolicy: 25, IngressFilter: -1, EgressFilter: -1, name: cali58c64a0c8be}

 ╰─$ kubectl -n kube-system exec calico-node-lggqf -- calico-node -bpf counters dump --iface=cali58c64a0c8be
Defaulted container "calico-node" out of: calico-node, upgrade-ipam (init), install-cni (init), mount-bpffs (init)
+----------+--------------------------------+-----------+---------+-----+
| CATEGORY |              TYPE              |  INGRESS  | EGRESS  | XDP |
+----------+--------------------------------+-----------+---------+-----+
| Accepted | by another program             |         0 |       0 | N/A |
|          | by failsafe                    |         0 |       0 | N/A |
|          | by policy                      |   1046273 | 1139052 | N/A |
| Dropped  | by policy                      |         0 |       0 | N/A |
|          | failed decapsulation           |         0 |       0 | N/A |
|          | failed encapsulation           |         0 |       0 | N/A |
|          | incorrect checksum             |         0 |       0 | N/A |
|          | malformed IP packets           |         0 |       0 | N/A |
|          | packets with unknown route     |         0 |       0 | N/A |
|          | packets with unknown source    |         0 |       0 | N/A |
|          | packets with unsupported IP    |         0 |       0 | N/A |
|          | options                        |           |         |     |
|          | too short packets              |         0 |       0 | N/A |
| Total    | packets                        | 166970130 | 2528111 | N/A |
+----------+--------------------------------+-----------+---------+-----+
###dumped cali58c64a0c8be counters.

╰─$ kubectl -n kube-system exec calico-node-lggqf -- calico-node -bpf policy dump cali58c64a0c8be all
Defaulted container "calico-node" out of: calico-node, upgrade-ipam (init), install-cni (init), mount-bpffs (init)
IfaceName: cali58c64a0c8be
Hook: tc egress
Error:
Policy Info:
// Start of rule action:"allow" rule_id:"cKE9PSPlsXvdnn0_"
// count = 0
IfaceName: cali58c64a0c8be
Hook: tc ingress
Error:
Policy Info:
// Start of rule action:"allow" rule_id:"QSpjq8vswFLjKjfu"
// count = 53409
2024-06-10 06:55:18.476 [ERROR][11151] confd/policy_debug.go 79: Failed to dump policy info. error=stat /var/run/calico/bpf/policy/cali58c64a0c8be_xdp_v4.json: no such file or directory

Expected Behavior

These errors should not happen, and connections should remain stable.

Current Behavior

After the events described in the context, the pod is no longer able to establish connections to anything.

Possible Solution

Steps to Reproduce (for bugs)

Your Environment

  • Calico version 3.27.2 with the eBPF dataplane.
    Felix configuration
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  annotations:
  name: default
  resourceVersion: "283159629"
  uid: 222bf9d5-6012-4cb4-9d49-2b3575416a14
spec:
  bpfEnabled: true
  bpfExternalServiceMode: Tunnel
  bpfLogLevel: ""
  logSeverityScreen: Debug
  prometheusMetricsEnabled: true
  reportingInterval: 0s
  vxlanEnabled: true
  vxlanPort: 4799
  vxlanVNI: 4096
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes v1.22.15
  • Operating System and version: Ubuntu 20.04.4 LTS, kernel: 5.15.0-67-generic, cri: containerd://1.5.5
  • Link to your project (optional):
@tomastigera added the area/bpf (eBPF Dataplane issues) label on Jun 18, 2024
@tomastigera
Contributor

2024-06-01T07:40:08+03:00 libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-01T07:40:08+03:00 2024-06-01 04:40:08.995 [WARNING][69] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:1, PolicyIdx:24, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"to", HookLayout:hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-01T07:40:08+03:00 2024-06-01 04:40:08.995 [WARNING][69] felix/bpf_ep_mgr.go 1675: Failed to apply policy to endpoint, leaving it dirty error=attaching program to wep: error attaching tc program invalid argument name="cali58c64a0c8be" wepID=orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-1" endpoint_id:"eth0" 
2024-06-01T07:40:08+03:00 libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-01T07:40:08+03:00 2024-06-01 04:40:08.995 [WARNING][69] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:0, PolicyIdx:25, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"from", HookLayout:hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}

Never seen that libbpf error. Might be that there is no qdisc set up at the device? You could use tc to show which qdiscs are attached.
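
For example, something like this on the node should show whether the clsact qdisc and its filters are present (standard iproute2 commands; the interface name is taken from the logs above):

tc qdisc show dev cali58c64a0c8be
tc filter show dev cali58c64a0c8be ingress
tc filter show dev cali58c64a0c8be egress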

root@stage1-6495d69d65xsx98l-tkf2b:/# bpftool prog | grep calid2e5d
7766942: sched_cls  name p4E_calid2e5dc9  tag 5126a1005cd55b97
7766943: sched_cls  name p4I_calid2e5dc9  tag 456628228769520b
8131461: sched_cls  name p4I_calid2e5dc9  tag 456628228769520b
8131483: sched_cls  name p4E_calid2e5dc9  tag 5126a1005cd55b97
root@stage1-6495d69d65xsx98l-tkf2b:/# bpftool prog | grep cali58c64a0
6479986: sched_cls  name p4i_cali58c64a0  tag 0dbf0d10b02282f0
6479987: sched_cls  name p4e_cali58c64a0  tag b1d839878b133ee8

It is weird to see the policy programs attached twice. Also the lowercase i/e vs uppercase I/E in the program names is strange; might be some upgrade issue 🤔

What version did you upgrade from?

@tomastigera
Contributor

You can use bpftool net to see which programs are directly attached to the device. Those ⬆️ policy programs are eventually called as part of a chain of programs.
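
For example (assuming the same interface as above):

bpftool net show dev cali58c64a0c8be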

@tomastigera
Contributor

Given the lowercase e in p4e_cali58c64a0, it seems like you upgraded from 3.25-ish? Seems like some cleanup did not happen.

The error message comes from https://github.com/torvalds/linux/blob/92e5605a199efbaee59fb19e15d6cc2103a04ec2/net/sched/cls_api.c#L2316
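
For reference, that is the extack message the kernel returns when a tc filter operation targets a priority/protocol pair with no matching filter. A rough, untested sketch of provoking the same message on a scratch dummy0 device (the chain must already exist, hence the placeholder matchall filter at a different pref):

ip link add dummy0 type dummy
tc qdisc add dev dummy0 clsact
tc filter add dev dummy0 ingress pref 1 matchall action pass
tc filter del dev dummy0 ingress pref 49152 protocol all
# Error: Filter with specified priority/protocol not found.
ip link del dummy0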

Are the failing pods ones that existed before the upgrade, or does it happen to new pods as well?

@xander-sh
Author

We upgraded from Calico 3.25.0 (applied the new manifest without deleting the old version).

Are the failing pods ones that existed before the upgrade, or does it happen to new pods as well?

It happens with new pods as well (the StatefulSet recreates them with the same pod name).

You can use bpftool net to see which programs are directly connected to the device

I restarted the StatefulSet pod (network connectivity in the pod is normal; the 403 from the curl below comes from the API server itself, so the network path works):

NAMESPACE   NAME                READY    STATUS      RESTARTS     AGE     IP              NODE
airflow     newfmm-db-airflow-0  2/2     Running     0            6m10s   10.222.47.32    stage1-6495d69d65xsx98l-7s8td

❯ k exec -it newfmm-db-airflow-0 sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "postgres" out of: postgres, prometheus-postgres-exporter
# curl -k https://kubernetes.default
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {
  },
  "code": 403
}

root@stage1-6495d69d65xsx98l-7s8td:/# ip ro get 10.222.47.32
10.222.47.32 dev calieca2ba1ed43 src 10.220.1.11 uid 0
    cache
root@stage1-6495d69d65xsx98l-7s8td:/# bpftool net| grep calieca2
calieca2ba1ed43(3984) clsact/ingress cali_tc_preambl:[195661] id 195661
calieca2ba1ed43(3984) clsact/egress cali_tc_preambl:[195660] id 195660

After restarting PostgreSQL, we will watch for it to crash again and collect information with bpftool and tc.
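
For reference, the set of commands we plan to capture (the interface name below is a placeholder):

tc qdisc show dev <cali-iface>
tc filter show dev <cali-iface> ingress
tc filter show dev <cali-iface> egress
bpftool net show dev <cali-iface>
ls /sys/fs/bpf/tc/<cali-iface>_igr /sys/fs/bpf/tc/<cali-iface>_egr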

@xander-sh
Author

After 4 days, a container in the pod transitioned to an unhealthy state.

❯ k -n airflow get pods -o wide newfmm-db-airflow-0
NAME                  READY   STATUS    RESTARTS   AGE    IP             NODE                            NOMINATED NODE   READINESS GATES
newfmm-db-airflow-0   1/2     Running   0          4d5h   10.222.47.32   stage1-6495d69d65xsx98l-7s8td   <none>           <none>

Some diagnostics:

root@stage1-6495d69d65xsx98l-7s8td:/# ip ro get 10.222.47.32
10.222.47.32 dev calieca2ba1ed43 src 10.220.1.11 uid 0
    cache

root@stage1-6495d69d65xsx98l-7s8td:/# bpftool prog | grep calid2b
# empty :( 
root@stage1-6495d69d65xsx98l-7s8td:/# bpftool net| grep calieca2
calieca2ba1ed43(3984) clsact/ingress cali_tc_preambl:[400338] id 400338
calieca2ba1ed43(3984) clsact/egress cali_tc_preambl:[400339] id 400339

[root@stage1-6495d69d65xsx98l-7s8td /]# tc -s qdisc show dev calieca2ba1ed43
qdisc noqueue 0: root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc clsact ffff: parent ffff:fff1
 Sent 191201052976 bytes 376857785 pkt (dropped 53362, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
# count of dropped packets increases every second
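
A simple way to watch the drop counter climbing (standard tooling, nothing Calico-specific):

watch -n 1 'tc -s qdisc show dev calieca2ba1ed43'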

Logs from calico-node at debug level; a larger dump is attached:
0.log

2024-06-23T18:22:34.654135139+03:00 stdout F 2024-06-23 15:22:34.654 [DEBUG][58967] felix/bpf_ep_mgr.go 1430: Ignoring interface that doesn't match the host data/l3 interface regex iface="calieca2ba1ed43"
2024-06-23T18:22:34.658959833+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/attach.go 353: Already have a clsact qdisc on this interface iface="calieca2ba1ed43"
2024-06-23T18:22:34.65897197+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/bpf_ep_mgr.go 1896: readiness: 0 iface="calieca2ba1ed43"
2024-06-23T18:22:34.659075818+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/bpf_ep_mgr.go 2742: ensureProgramAttached 4 hook=ingress iface="calieca2ba1ed43" type="workload"
2024-06-23T18:22:34.659079655+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/map.go 130: needs generic object file from_wep_fib_no_log_co-re.o AttachType=hook.AttachType{Hook:0, Family:4, Type:"workload", LogLevel:"off", FIB:true, ToHostDrop:false, DSR:false}
2024-06-23T18:22:34.659081934+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/bpf_ep_mgr.go 1896: readiness: 0 iface="calieca2ba1ed43"
2024-06-23T18:22:34.659166532+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/map.go 133: generic object file already loaded from_wep_fib_no_log_co-re.o layout=hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}
2024-06-23T18:22:34.659169858+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/bpf_ep_mgr.go 2742: ensureProgramAttached 4 hook=egress iface="calieca2ba1ed43" type="workload"
2024-06-23T18:22:34.659171724+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/syscall.go 66: UpdateMapEntry(28, [62 0 0 0], [99 0 0 0])
2024-06-23T18:22:34.65917393+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/map.go 130: needs generic object file to_wep_no_log_co-re.o AttachType=hook.AttachType{Hook:1, Family:4, Type:"workload", LogLevel:"off", FIB:true, ToHostDrop:false, DSR:false}
2024-06-23T18:22:34.65921784+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/syscall.go 119: Map metadata fd=0x1c mapInfo=&maps.MapInfo{Type:3, KeySize:4, ValueSize:4, MaxEntries:10000}
2024-06-23T18:22:34.65933278+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/map.go 133: generic object file already loaded to_wep_no_log_co-re.o layout=hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}
2024-06-23T18:22:34.659336321+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/syscall.go 66: UpdateMapEntry(28, [63 0 0 0], [99 0 0 0])
2024-06-23T18:22:34.659417834+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/syscall.go 119: Map metadata fd=0x1c mapInfo=&maps.MapInfo{Type:3, KeySize:4, ValueSize:4, MaxEntries:10000}
2024-06-23T18:22:34.664465985+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 291: Found old calico program prog=tc.attachedProg{pref:"32768", handle:"0x1"}
2024-06-23T18:22:34.664703578+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 291: Found old calico program prog=tc.attachedProg{pref:"32768", handle:"0x1"}
2024-06-23T18:22:34.664710538+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_jump2 k 4 v 4
2024-06-23T18:22:34.66471307+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_progs2 k 4 v 4
2024-06-23T18:22:34.664794065+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_arp2 k 8 v 12
2024-06-23T18:22:34.664797995+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_ct3 k 16 v 88
2024-06-23T18:22:34.664800341+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_fe3 k 16 v 20
2024-06-23T18:22:34.66492409+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_be k 8 v 8
2024-06-23T18:22:34.664934394+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_aff k 16 v 16
2024-06-23T18:22:34.664936359+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_jump2 k 4 v 4
2024-06-23T18:22:34.664938181+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_state4 k 4 v 464
2024-06-23T18:22:34.664979423+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_progs2 k 4 v 4
2024-06-23T18:22:34.664989684+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 452: ConfigureProgram HookLayout=hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}
2024-06-23T18:22:34.665120085+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_arp2 k 8 v 12
2024-06-23T18:22:34.665124979+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_ct3 k 16 v 88
2024-06-23T18:22:34.665140656+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_fe3 k 16 v 20
2024-06-23T18:22:34.665142986+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_be k 8 v 8
2024-06-23T18:22:34.665278386+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_aff k 16 v 16
2024-06-23T18:22:34.665282781+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 104: Pinning map cali_state4 k 4 v 464
2024-06-23T18:22:34.665401168+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 452: ConfigureProgram HookLayout=hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}
2024-06-23T18:22:34.666112544+03:00 stdout F libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-23T18:22:34.666293187+03:00 stdout F 2024-06-23 15:22:34.666 [WARNING][58967] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:0, PolicyIdx:62, Iface:"calieca2ba1ed43", LogLevel:"off"}, LogFilte
r:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"from", HookLayout:hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0xb}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-23T18:22:34.667112806+03:00 stdout F libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-23T18:22:34.66730246+03:00 stdout F 2024-06-23 15:22:34.667 [WARNING][58967] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:1, PolicyIdx:63, Iface:"calieca2ba1ed43", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"to", HookLayout:hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0xb}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-23T18:22:34.667438952+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/delta_tracker.go 125: Set bpfMap="cali_iface" k={ifIndex: 3984} v={flags: workload,ready, XDPPolicy: -1, IngressPolicy: 63, EgressPolicy: 62, IngressFilter: -1, EgressFilter: -1, name: calieca2ba1ed43}
2024-06-23T18:22:34.667445162+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/delta_tracker.go 142: Set: Key in dataplane already, ignoring. bpfMap="cali_iface"
2024-06-23T18:22:34.667515079+03:00 stdout F 2024-06-23 15:22:34.667 [WARNING][58967] felix/bpf_ep_mgr.go 1675: Failed to apply policy to endpoint, leaving it dirty error=attaching program to wep: error attaching tc program invalid argument name="calieca2ba1ed43" wepID=orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-0" endpoint_id:"eth0"
2024-06-23T18:22:34.667528735+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/caching_map.go 153: Applying deletions to DP map. name="cali_iface"
2024-06-23T18:22:34.667533857+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/caching_map.go 129: Applying updates to DP map. name="cali_iface"
2024-06-23T18:22:34.667626354+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/bpf_ep_mgr.go 1327: CompleteDeferredWork done.
2024-06-23T18:22:34.667631888+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/endpoint_mgr.go 445: Reporting endpoint status. dirtyEndpoints=set.Set{}
2024-06-23T18:22:34.667792509+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/ipsets.go 276: Asked to resync with the dataplane on next update. family="inet"
2024-06-23T18:22:34.667796401+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/ipsets.go 215: Asked to resync with the dataplane on next update.
2024-06-23T18:22:34.667798229+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/wireguard.go 656: Wireguard is not enabled, skipping sync ipVersion=0x4
2024-06-23T18:22:34.668229581+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/ipsets.go 342: Resyncing ipsets with dataplane. family="inet"
2024-06-23T18:22:34.668234483+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/delta_tracker.go 227: Loading cache of dataplane state. ipsetFamily="inet"
2024-06-23T18:22:34.668237291+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/delta_tracker.go 291: Updated dataplane state. desiredUpdates=0 inDataplaneNotDesired=0 ipsetFamily="inet" totalNumInDP=0
2024-06-23T18:22:34.668239276+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/ipsets.go 243: Doing full resync of BPF IP sets map bpf family="inet"
2024-06-23T18:22:34.668246501+03:00 stdout F 2024-06-23 15:22:34.668 [DEBUG][58967] felix/syscall.go 119: Map metadata fd=0xe mapInfo=&maps.MapInfo{Type:11, KeySize:20, ValueSize:4, MaxEntries:1048576}
2024-06-23T18:22:34.669262736+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/ipsets.go 406: List of ipsets: [] family="inet"
2024-06-23T18:22:34.669269369+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/delta_tracker.go 227: Loading cache of dataplane state.
2024-06-23T18:22:34.669271576+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/delta_tracker.go 291: Updated dataplane state. desiredUpdates=1 inDataplaneNotDesired=0 totalNumInDP=0
2024-06-23T18:22:34.669343478+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/delta_tracker.go 227: Loading cache of dataplane state.
2024-06-23T18:22:34.669347348+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/delta_tracker.go 291: Updated dataplane state. desiredUpdates=12 inDataplaneNotDesired=0 totalNumInDP=0
@tomastigera
Contributor

root@stage1-6495d69d65xsx98l-7s8td:/# ip ro get 10.222.47.32
10.222.47.32 dev calieca2ba1ed43 src 10.220.1.11 uid 0
cache

root@stage1-6495d69d65xsx98l-7s8td:/# bpftool prog | grep calid2b

empty :(

calid2b: why are you grepping for this unrelated device?

root@stage1-6495d69d65xsx98l-7s8td:/# bpftool net| grep calieca2
calieca2ba1ed43(3984) clsact/ingress cali_tc_preambl:[400338] id 400338
calieca2ba1ed43(3984) clsact/egress cali_tc_preambl:[400339] id 400339

This looks like programs are attached. I will have a look at the logs.
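
For example, to confirm those two preamble programs still exist and inspect their metadata (IDs taken from the bpftool net output above):

bpftool prog show id 400338
bpftool prog show id 400339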

@xander-sh
Author

why are you grepping for this unrelated device?

Sorry, that was my mistake.
Diagnostic info from the other StatefulSet pod, newfmm-db-airflow-1, with IP 10.222.59.94 and Calico interface cali58c64a0c8be:

❯ k -n airflow get pods -o wide newfmm-db-airflow-1
NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE                            NOMINATED NODE   READINESS GATES
newfmm-db-airflow-1   1/2     Running   0          5d22h   10.222.59.94   stage1-6495d69d65xsx98l-tkf2b   <none>           <none>
root@stage1-6495d69d65xsx98l-tkf2b:/# ip ro get 10.222.59.94
10.222.59.94 dev cali58c64a0c8be src 10.220.1.3 uid 0
    cache
root@stage1-6495d69d65xsx98l-tkf2b:/# bpftool net | grep cali58c64a0c8be
cali58c64a0c8be(2817) clsact/ingress cali_tc_preambl:[9162962] id 9162962
cali58c64a0c8be(2817) clsact/egress cali_tc_preambl:[9162963] id 9162963
root@stage1-6495d69d65xsx98l-tkf2b:/# bpftool prog | grep cali58
6479986: sched_cls  name p4i_cali58c64a0  tag 0dbf0d10b02282f0
6479987: sched_cls  name p4e_cali58c64a0  tag b1d839878b133ee8
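
It might also help to dump the preamble's pinned jump maps to see which program IDs they point at (a sketch; the pin paths are taken from the earlier ls output):

bpftool map dump pinned /sys/fs/bpf/tc/cali58c64a0c8be_igr/cali_jump2
bpftool map dump pinned /sys/fs/bpf/tc/cali58c64a0c8be_egr/cali_jump2
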
@legolego621

legolego621 commented Jul 9, 2024

The problem also exists with controller and node version v3.27.3.
We see it on clusters running Kubernetes 1.22 and 1.27.

@ChandonPierre

ChandonPierre commented Jul 12, 2024

We're seeing the same problem on k8s 1.29 + kernel 6.5 + calico 3.27.2

If this is at all helpful:

 bpftool net list | grep cali1c22a1cbbe2
cali1c22a1cbbe2(26) clsact/ingress cali_tc_preambl:[132768] id 132768
cali1c22a1cbbe2(26) clsact/egress cali_tc_preambl:[132769] id 132769

tc filter show dev cali1c22a1cbbe2   # returns nothing

 bpftool prog show | grep -E "132768|132769" -A 4
132768: sched_cls  name cali_tc_preambl  tag 4e084811a5e4fbbf  gpl
        loaded_at 2024-07-12T11:40:22+0000  uid 0
        xlated 2144B  jited 1410B  memlock 4096B  map_ids 247,66441,739,738,66442
        btf_id 33214
132769: sched_cls  name cali_tc_preambl  tag 4e084811a5e4fbbf  gpl
        loaded_at 2024-07-12T11:40:22+0000  uid 0
        xlated 2144B  jited 1410B  memlock 4096B  map_ids 247,66443,739,738,66444
        btf_id 33215

bpftool map show | grep -E "33214|33215" -B 2
66441: array  name tc_pream.rodata  flags 0x480
        key 4B  value 220B  max_entries 1  memlock 4096B
        btf_id 33214
--
66443: array  name tc_pream.rodata  flags 0x480
        key 4B  value 220B  max_entries 1  memlock 4096B
        btf_id 33215

bpftool map dump id 66441
[{
        "value": {
            ".rodata": [{
                    "__globals": {
                        "host_ip": 2311630346,
                        "tunnel_mtu": 1500,
                        "vxlan_port": 4789,
                        "intf_ip": 16907945,
                        "ext_to_svc_mark": 0,
                        "psnat_start": 20000,
                        "psnat_len": 29999,
                        "host_tunnel_ip": 2311630346,
                        "flags": 16,
                        "wg_port": 0,
                        "__pad": 0,
                        "natin_idx": 0,
                        "natout_idx": 0,
                        "iface_name": [99,97,108,105,49,99,50,50,97,49,99,98,98,101,50,0
                        ],
                        "log_filter_jmp": 4294967295,
                        "jumps": [39,37,40,41,42,4294967295,43,44,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295
                        ],
                        "__scratch": []
                    }
                }
            ]
        }
    }
]

bpftool map dump id 66443
[{
        "value": {
            ".rodata": [{
                    "__globals": {
                        "host_ip": 2311630346,
                        "tunnel_mtu": 1500,
                        "vxlan_port": 4789,
                        "intf_ip": 16907945,
                        "ext_to_svc_mark": 0,
                        "psnat_start": 20000,
                        "psnat_len": 29999,
                        "host_tunnel_ip": 2311630346,
                        "flags": 16,
                        "wg_port": 0,
                        "__pad": 0,
                        "natin_idx": 0,
                        "natout_idx": 0,
                        "iface_name": [99,97,108,105,49,99,50,50,97,49,99,98,98,101,50,0
                        ],
                        "log_filter_jmp": 4294967295,
                        "jumps": [45,36,46,47,48,4294967295,49,50,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295
                        ],
                        "__scratch": []
                    }
                }
            ]
        }
    }
]
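
For what it's worth, the dump decodes cleanly with plain shell (nothing Calico-specific is assumed here): 4294967295 is 0xffffffff, i.e. an empty jump slot, and iface_name is just the ASCII bytes of the device name:

printf '%x\n' 4294967295   # ffffffff, i.e. -1 / unused slot
echo 99 97 108 105 49 99 50 50 97 49 99 98 98 101 50 | awk '{ for (i = 1; i <= NF; i++) printf "%c", $i; print "" }'   # prints cali1c22a1cbbe2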