TL;DR
Problem: communication issues between pods on different nodes
Identification: kubectl describe on the non-ready calico pod in the kube-system namespace shows the error "calico/node is not ready: BIRD is not ready: BGP not established"
Solution:
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=can-reach=www.google.com
The Full Story
A few days ago, our test Kubernetes cluster started having communication issues:
Some of the pods were unable to communicate with other pods. Looking deeper into this, I found that the problem was communication between pods that reside on different nodes.
The Kubernetes diagram displayed here demonstrates the problem. Green arrows represent successful communication, and a red arrow represents failed communication.
Node-to-node communication works fine, as does communication between pods on the same node. However, communication from pods on node-c to pods on other nodes fails. This is odd, since node-b and node-c themselves can reach each other.
As this is our test environment, which had been working fine for several months, I assumed someone had messed it up, so I tried rebooting the nodes, reinstalling Kubernetes, and even manually overriding DNS resolution, but the problem remained.
Finally, I realized this is a Kubernetes CNI issue.
In this bare-metal Kubernetes cluster, we use Calico as the CNI.
I checked the status of the calico pods:
kubectl get pods -n kube-system
And I found that the calico pod on node-c is not in a Ready state:
NAME                READY   STATUS    RESTARTS   AGE   IP             NODE
calico-node-f9dgm   1/1     Running   0          45h   10.195.5.136   node-a
calico-node-g69z9   1/1     Running   0          45h   10.195.5.135   node-b
calico-node-xb92h   0/1     Running   0          45h   10.195.5.133   node-c
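On a cluster with more than a handful of nodes, spotting the non-ready pod by eye gets tedious. Here is a minimal sketch of a filter, assuming the default kubectl get pods column layout where the second column is READY in ready/total form. The function name not_ready_pods is my own, not part of kubectl.

```shell
# Print the names of pods whose READY column (ready/total) is not fully ready.
# Reads `kubectl get pods` output on stdin; assumes READY is the second column.
not_ready_pods() {
  awk '$1 != "NAME" && split($2, r, "/") == 2 && r[1] != r[2] { print $1 }'
}

# Usage against a live cluster:
#   kubectl get pods -n kube-system -o wide | grep calico-node | not_ready_pods
```

Fed the listing above, this prints only calico-node-xb92h.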
I ran kubectl describe on the non-ready pod and found readiness probe errors:
Warning  Unhealthy  32m  kubelet  Readiness probe failed: 2021-05-16 08:40:25.359 [INFO][199] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.195.5.135,10.195.5.136
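The peers named in the message are node-b (10.195.5.135) and node-a (10.195.5.136), that is, every other node. That points at node-c's own advertised address rather than at a single broken link. A small sketch for pulling the peer list out of the event text, assuming the message wording shown above; the function name bgp_failed_peers is hypothetical.

```shell
# Print, one per line, the peer IPs named in a BIRD readiness failure.
# Reads `kubectl describe pod` output (or just the event text) on stdin.
bgp_failed_peers() {
  sed -n 's/.*BGP not established with \([0-9.,]*\).*/\1/p' | tr ',' '\n' | sort -u
}

# Usage against a live cluster:
#   kubectl describe pod calico-node-xb92h -n kube-system | bgp_failed_peers
```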
I then found the following bug report:
calico/node is not ready: BIRD is not ready: BGP not established
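This means that Calico had selected the wrong IP address for the node. I used the recommended solution, which forces Calico to select an IP address that has external network connectivity. With can-reach, Calico picks the local address that the node's routing would use to reach the given destination: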
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=can-reach=www.google.com
Other IP address selection methods are available here.
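For reference, here is a sketch of a small wrapper that checks the method string before applying it and then waits for the daemonset to roll out. The accepted forms below (first-found, kubernetes-internal-ip, can-reach=, interface=, skip-interface=, cidr=) are the ones I recall from the Calico documentation; treat the list as an assumption and check it against the docs for your Calico version. Both function names are hypothetical.

```shell
# Return 0 if the argument looks like a Calico IP autodetection method.
# The accepted forms are an assumption; verify against your Calico version.
valid_autodetect_method() {
  case "$1" in
    first-found|kubernetes-internal-ip) return 0 ;;
    can-reach=?*|interface=?*|skip-interface=?*|cidr=?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Apply the method to the calico-node daemonset and wait for the rollout.
set_autodetect_method() {
  valid_autodetect_method "$1" || { echo "unknown method: $1" >&2; return 1; }
  kubectl set env daemonset/calico-node -n kube-system "IP_AUTODETECTION_METHOD=$1" &&
    kubectl rollout status daemonset/calico-node -n kube-system
}

# Usage:
#   set_autodetect_method can-reach=www.google.com
```

Once the rollout finishes, running kubectl get pods -n kube-system again should show every calico-node pod as 1/1.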