On Kubernetes Pod failures and Restart policy

A job in Kubernetes is responsible for creating and managing pods that perform a task until it terminates successfully, whereas ordinary pods (with the default restartPolicy of Always) are restarted continuously regardless of the exit code. If a pod fails before the job terminates successfully, the job controller creates a new pod, and depending on how the job is configured, we might end up with duplicate pods.

Let’s consider a one-shot job that fails to terminate successfully. If we check the status of our pod, we notice that it has been restarted multiple times and has ended up in the CrashLoopBackOff status.

    kubectl get pod -a -l job-name=demojob
    NAME            READY   STATUS             RESTARTS   AGE
    demojob-3ddk0   0/1     CrashLoopBackOff   4          3m
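
A job manifest along the following lines would reproduce this behaviour. This is only a minimal sketch: the job name matches the `job-name=demojob` label used above, but the container name, the image, and the always-failing command are assumptions chosen for illustration and are not taken from the original setup.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: demojob
spec:
  template:
    spec:
      # OnFailure: the kubelet restarts the failed container inside the same pod
      restartPolicy: OnFailure
      containers:
        - name: demo                        # hypothetical container name
          image: busybox                    # hypothetical image
          command: ["sh", "-c", "exit 1"]   # always fails, to force restarts
```

The job controller labels the pods it creates with `job-name=<name of the job>`, which is why the selector in the `kubectl get pod` command finds them.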

If we change the restartPolicy from OnFailure to Never, we can avoid this kind of crash loop. For example:

    kubectl get pod -a -l job-name=demojob
    NAME            READY   STATUS    RESTARTS   AGE
    demojob-0wm49   0/1     Error     0          1m
    demojob-6h9s2   0/1     Error     0          39s
    demojob-hkzw0   1/1     Running   0          6s
    demojob-k5swz   0/1     Error     0          28s
    demojob-m1rdw   0/1     Error     0          19s
    demojob-x157b   0/1     Error     0          57s

What we are seeing here is a set of duplicate pods in the Error state. With the restart policy set to Never, we tell the kubelet not to restart pods on failure. But the job controller notices a pod in the Error state and creates a new pod to replace it, so we end up with multiple duplicate pods. Since it is uncommon for a pod to fail on startup, we can set the restart policy to OnFailure and avoid the duplicate pods.
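
As a rough sketch of that recommendation, the relevant fragment of a job manifest might look like the following. The `backoffLimit` field is part of the batch/v1 job spec, but whether it is available depends on the cluster version, and the value 4 is an arbitrary assumption for illustration.

```yaml
spec:
  # Number of retries before the job is marked as failed.
  # Defaults to 6 when omitted; 4 is only an illustrative value.
  backoffLimit: 4
  template:
    spec:
      # Restart the failed container in place instead of
      # leaving an Error pod behind and creating a new one.
      restartPolicy: OnFailure
```

Capping the retries this way keeps a job that can never succeed from restarting indefinitely.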