Monday, May 17, 2021

Solving Nodes Communication Problem with Calico


TL;DR 

Problem: communication issues between pods on different nodes

Identification: running kubectl describe on the non-ready calico pod in the kube-system namespace shows the error "calico/node is not ready: BIRD is not ready: BGP not established"

Solution:

kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=can-reach=www.google.com



The Full Story


A few days ago, our test Kubernetes cluster started having communication issues:

Some of the pods were unable to communicate with other pods. Digging deeper, I found that the problem was communication between pods that reside on different nodes.






The Kubernetes diagram displayed here demonstrates the problem. Green arrows represent successful communication, and a red arrow represents failed communication.

Communication between all the nodes works fine, as does communication between the pods within each node. However, communication between pods on node-c and pods on other nodes fails. This is a strange state, since communication between node-b and node-c themselves is fine.

As this is our test environment, which had been working fine for several months, I assumed someone had messed it up, so I tried rebooting the nodes, reinstalling Kubernetes, and even manually overriding DNS resolution, but the problem remained.

Finally, I realized this is a Kubernetes CNI issue.

In this bare-metal Kubernetes cluster, we use Calico as the CNI.

I checked the status of the Calico pods:



kubectl get pods -n kube-system -o wide



And I found that the Calico pod on node-c was not in a Ready state:



NAME                                       READY   STATUS    RESTARTS   AGE   IP                NODE
calico-node-f9dgm                          1/1     Running   0          45h   10.195.5.136      node-a   
calico-node-g69z9                          1/1     Running   0          45h   10.195.5.135      node-b   
calico-node-xb92h                          0/1     Running   0          45h   10.195.5.133      node-c   
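
On a cluster with more pods, picking out the 0/1 entry by eye gets tedious. A minimal sketch of filtering the listing for calico-node pods that are not fully ready, assuming the column layout shown above (NAME first, READY second, NODE seventh). The function name not_ready_pods is made up for this example, and the column positions can shift if the RESTARTS column contains extra text.

```shell
#!/bin/sh
# Print "<pod> <node>" for every calico-node pod that is not fully ready.
# Reads the output of `kubectl get pods -n kube-system -o wide` on stdin.
# Assumption: column 1 is NAME, column 2 is READY (ready/total), column 7 is NODE.
not_ready_pods() {
  awk 'NR > 1 && $1 ~ /^calico-node-/ {
         split($2, r, "/")                 # "0/1" -> r[1]="0", r[2]="1"
         if (r[1] != r[2]) print $1, $7    # fewer ready containers than total
       }'
}

# Usage against a live cluster:
#   kubectl get pods -n kube-system -o wide | not_ready_pods
```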



I ran kubectl describe on the non-ready pod and found the readiness probe errors:



Warning  Unhealthy       32m  kubelet  Readiness probe failed: 2021-05-16 08:40:25.359 [INFO][199] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.195.5.135,10.195.5.136
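
The two addresses at the end of the message are the other two nodes from the listing (node-b and node-a), so node-c had no BGP peering with anyone. A small sketch for pulling that peer list out of the describe output, assuming the "BGP not established with ip1,ip2" wording shown above. The helper name unestablished_peers is hypothetical.

```shell
#!/bin/sh
# Print each BGP peer IP named in a calico/node readiness message, one per line.
# Reads text (for example `kubectl describe pod` output) on stdin.
# Assumption: the message has the form "BGP not established with ip1,ip2,...".
unestablished_peers() {
  sed -n 's/.*BGP not established with \([0-9.,]*\).*/\1/p' |
    tr ',' '\n' |
    sort -u        # the same event can appear several times in describe output
}

# Usage (pod name taken from the listing in this post):
#   kubectl describe pod calico-node-xb92h -n kube-system | unestablished_peers
```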


I then found the following bug report:
calico/node is not ready: BIRD is not ready: BGP not established


In our case, this meant that Calico had selected the wrong IP address for the node. I used the recommended solution to force Calico to select an IP address with external network connectivity:


kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=can-reach=www.google.com
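
To confirm which address Calico picked on each node, before or after the change, one option is to compare it with the node's InternalIP. The sketch below is untested against this cluster and rests on an assumption: that Calico records its chosen address in the node annotation projectcalico.org/IPv4Address, as it does on the installations I know of. The helper name mismatched_nodes is made up for this example.

```shell
#!/bin/sh
# Reads lines of "<node> <internal-ip> <calico-ip>/<prefix>" on stdin and
# prints the nodes where Calico's address differs from the node's InternalIP.
mismatched_nodes() {
  awk '{
         split($3, a, "/")     # strip the /prefix from the Calico address
         if (a[1] != $2) print $1, "node=" $2, "calico=" a[1]
       }'
}

# Usage, assuming the annotation name mentioned above:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.addresses[?(@.type=="InternalIP")].address}{" "}{.metadata.annotations.projectcalico\.org/IPv4Address}{"\n"}{end}' | mismatched_nodes
```

No output means every node's Calico address matches its InternalIP.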


Other IP address autodetection methods are described in the Calico documentation.
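
As far as I know, the values Calico accepts for IP_AUTODETECTION_METHOD include first-found, can-reach=DESTINATION, interface=REGEX, skip-interface=REGEX and cidr=CIDR, and newer releases may add more. A small sketch that prints, without running, the same kubectl command for any of these forms. The function name autodetect_cmd and the example interface pattern are hypothetical.

```shell
#!/bin/sh
# Print (do not run) the kubectl command that sets a given autodetection method.
# Only the method forms listed in the paragraph above are accepted.
autodetect_cmd() {
  case "$1" in
    first-found|can-reach=?*|interface=?*|skip-interface=?*|cidr=?*)
      printf 'kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=%s\n' "$1"
      ;;
    *)
      echo "unsupported method: $1" >&2
      return 1
      ;;
  esac
}

# Usage: review the printed command, then run it yourself.
#   autodetect_cmd 'interface=eth.*'      # hypothetical interface pattern
#   autodetect_cmd 'can-reach=www.google.com'
```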


