Job
A Job is a supervisor in Kubernetes that manages Pods meant to run a specific task to completion and then terminate gracefully. A Job creates the specified number of Pods and ensures that they successfully complete their workloads or tasks. When a Job is created, it creates and tracks the Pods that were specified in its configuration. When the specified number of Pods complete successfully, the Job is considered complete. If a Pod fails because of an underlying node failure, the Job will create a new Pod to replace it. This also means that the application or code running in the Pod should be capable of gracefully handling the case where a replacement Pod comes up while the original process may still be running.
The Pods created by a Job aren’t deleted following completion of the job. The Pods run to completion and stay in the cluster with a Completed status.
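If you want to list only these finished Pods, a field selector on the Pod phase works; a minimal example, assuming the default namespace:

kubectl get pods --field-selector=status.phase=Succeeded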
Creating a Simple Job That Finishes in Finite Time
In this exercise, we will create our first Job, which will run a container that counts down from 9 to 1 and then finishes.
To successfully complete this exercise, perform the following steps:
Create a file called countdown.yaml with the following content:
apiVersion: batch/v1
kind: Job
metadata:
  name: countdown
spec:
  template:
    metadata:
      name: countdown
    spec:
      containers:
      - name: counter
        image: centos:7
        command:
        - "/bin/bash"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1 ; do echo $i ; done"
      restartPolicy: Never
Run the following command to create the Job using the kubectl apply command:
kubectl apply -f countdown.yaml
You should see the following response:
job.batch/countdown created
Run the following command to check the status of the Job:
kubectl get jobs
NAME        COMPLETIONS   DURATION   AGE
countdown   1/1           24s        59s
We can see that the Job required one completion and that it has completed (1/1). Run the following command to check the Pod created by the Job:
kubectl get pods
NAME              READY   STATUS      RESTARTS   AGE
countdown-jfwzk   0/1     Completed   0          75s
We can see that the Job has created a Pod named countdown-jfwzk to run the countdown task specified in the Job template, and that the Pod has a Completed status.
Next, run the following command to check the logs for the Pod created by the Job:
kubectl logs -f countdown-jfwzk
9
8
7
6
5
4
3
2
1
We can see that the Pod printed the countdown. Finally, delete the Job to clean up:
kubectl delete job countdown
job.batch "countdown" deleted
Job Completions And Parallelism
Multiple Sequential Jobs (Fixed Completion Count)
For example, we may have a queue of messages that needs processing. We must spawn consumer jobs that pull messages from the queue until it’s empty. To implement this pattern with Kubernetes Jobs, we set the .spec.completions parameter to a positive, non-zero number. The Job keeps spawning pods until that number of pods has terminated successfully; the Job regards itself as complete when all of them have exited with a successful exit code. Let’s look at an example. Modify our definition file to look as follows:
apiVersion: batch/v1
kind: Job
metadata:
  name: job-2
spec:
  completions: 5
  template:
    metadata:
      name: job-2
    spec:
      containers:
      - name: job-2
        image: r.deso.tech/library/alpine
        command: ["/bin/sh","-c"]
        args: ["echo 'consuming a message'; sleep 5"]
      restartPolicy: OnFailure
This definition is very similar to the one we used before with some differences:
- We specify the completions parameter to be 5.
- We change the command that the container inside the pod runs to include a five-second delay, so that we can watch the pods being created and terminated.
Issue the following command:
kubectl apply -f job-2.yaml && kubectl get pods --watch
This command applies the new definition file to the cluster and immediately starts displaying the pods and their status. The --watch flag saves us from having to type the command over and over, as it automatically displays any changes in the pod statuses:
job.batch/job-2 created
NAME          READY   STATUS              RESTARTS   AGE
job-2-nvd7q   0/1     ContainerCreating   0          1s
job-2-nvd7q   0/1     ContainerCreating   0          2s
job-2-nvd7q   1/1     Running             0          5s
job-2-nvd7q   0/1     Completed           0          10s
job-2-x4mw6   0/1     Pending             0          0s
job-2-x4mw6   0/1     Pending             0          0s
job-2-x4mw6   0/1     ContainerCreating   0          0s
job-2-x4mw6   0/1     ContainerCreating   0          1s
job-2-x4mw6   1/1     Running             0          3s
job-2-x4mw6   0/1     Completed           0          8s
job-2-shg6x   0/1     Pending             0          0s
job-2-shg6x   0/1     Pending             0          0s
job-2-shg6x   0/1     ContainerCreating   0          0s
job-2-shg6x   0/1     ContainerCreating   0          2s
job-2-shg6x   1/1     Running             0          4s
job-2-shg6x   0/1     Completed           0          9s
job-2-cvtl6   0/1     Pending             0          0s
job-2-cvtl6   0/1     Pending             0          0s
job-2-cvtl6   0/1     ContainerCreating   0          0s
job-2-cvtl6   0/1     ContainerCreating   0          2s
job-2-cvtl6   1/1     Running             0          4s
job-2-cvtl6   0/1     Completed           0          9s
job-2-w9rjm   0/1     Pending             0          0s
job-2-w9rjm   0/1     Pending             0          0s
job-2-w9rjm   0/1     ContainerCreating   0          0s
job-2-w9rjm   0/1     ContainerCreating   0          2s
job-2-w9rjm   1/1     Running             0          4s
job-2-w9rjm   0/1     Completed           0          9s
As you can see from the above output, the Job created the first pod. When that pod terminated without failure, the Job spawned the next one, and so on, until all five pods had been created and had terminated successfully.
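Once the watch shows the last pod completing, you can also confirm the result at the Job level; the DURATION and AGE values below are only indicative:

kubectl get job job-2
NAME    COMPLETIONS   DURATION   AGE
job-2   5/5           48s        62s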
Delete the Job to clean up:
kubectl delete job job-2
job.batch "job-2" deleted
Multiple Parallel Jobs (Work Queue)
Another pattern may involve the need to run multiple jobs, but instead of running them one after another, we need to run several of them in parallel. Parallel processing decreases the overall execution time. It has its application in many domains, like data science and AI.
We will use a simple Linux command for shuffling text: shuf. It shuffles its input by outputting a random permutation of its input lines, and each output permutation is equally likely.
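To get a feel for the form used in the Job below, shuf can also pick one random integer from a range; the number returned will of course vary from run to run:

shuf -i 5-10 -n 1
7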
Modify the definition file to look as follows:
apiVersion: batch/v1
kind: Job
metadata:
  name: job-3
spec:
  parallelism: 5
  template:
    metadata:
      name: job-3
    spec:
      containers:
      - name: job-3
        image: r.deso.tech/library/alpine
        command: ["/bin/sh","-c"]
        args: ["echo 'consuming a message'; sleep $(shuf -i 5-10 -n 1)"]
      restartPolicy: OnFailure
Here we didn’t set the .spec.completions parameter; instead, we specified parallelism. With completions unset, the success of any single pod signals success for the whole Job. The Job now has the following behavior: five pods are launched at the same time, all executing the same task. When one of the pods terminates successfully, no more pods get spawned; the remaining pods finish their work, and the Job then completes. Let’s apply this definition:
kubectl apply -f job-3.yaml && kubectl get pods --watch
job.batch/job-3 created
NAME          READY   STATUS              RESTARTS   AGE
job-3-9l2j7   0/1     ContainerCreating   0          0s
job-3-c95k7   0/1     ContainerCreating   0          0s
job-3-hkrpd   0/1     ContainerCreating   0          0s
job-3-qn24t   0/1     ContainerCreating   0          0s
job-3-vqzxw   0/1     ContainerCreating   0          0s
job-3-9l2j7   0/1     ContainerCreating   0          3s
job-3-qn24t   0/1     ContainerCreating   0          3s
job-3-hkrpd   0/1     ContainerCreating   0          4s
job-3-vqzxw   0/1     ContainerCreating   0          4s
job-3-c95k7   0/1     ContainerCreating   0          4s
job-3-9l2j7   1/1     Running             0          5s
job-3-qn24t   1/1     Running             0          6s
job-3-vqzxw   1/1     Running             0          6s
job-3-hkrpd   1/1     Running             0          6s
job-3-c95k7   1/1     Running             0          7s
job-3-qn24t   0/1     Completed           0          11s
job-3-c95k7   0/1     Completed           0          13s
job-3-hkrpd   0/1     Completed           0          14s
job-3-9l2j7   0/1     Completed           0          14s
job-3-vqzxw   0/1     Completed           0          15s
...
In this scenario, the Kubernetes Job spawns five pods at the same time. It is the responsibility of the pods to know whether or not their peers have finished. In our example, we assume that we are consuming messages from a message queue (like RabbitMQ). When there are no more messages to consume, the pod detects this and exits successfully. Once the first pod exits successfully:
- No more pods are spawned.
- Existing pods finish their work and exit as well.
In the above example, we changed the command that the pod executes to make it sleep for a random number of seconds (from five to ten) before it terminates. This way, we are roughly simulating how multiple pods can work together on an external data source like a message queue or an API.
Once all the pods have terminated, check their final status:
kubectl get pod
NAME          READY   STATUS      RESTARTS   AGE
job-3-9l2j7   0/1     Completed   0          3m27s
job-3-c95k7   0/1     Completed   0          3m27s
job-3-hkrpd   0/1     Completed   0          3m27s
job-3-qn24t   0/1     Completed   0          3m27s
job-3-vqzxw   0/1     Completed   0          3m27s
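Note that completions and parallelism are not mutually exclusive: setting both gives you a fixed amount of work processed by a fixed number of concurrent workers. A sketch of such a definition (the name and values are only illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: job-parallel-fixed
spec:
  completions: 10   # ten pods must finish successfully in total
  parallelism: 5    # at most five pods run at the same time
  template:
    metadata:
      name: job-parallel-fixed
    spec:
      containers:
      - name: worker
        image: r.deso.tech/library/alpine
        command: ["/bin/sh","-c"]
        args: ["echo 'consuming a message'; sleep 5"]
      restartPolicy: OnFailure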
Kubernetes Job Failure and Concurrency Considerations
A process running through a Kubernetes Job is different from a daemon. A daemon represents a long-running service (for example, an API). Traditionally, a daemon is programmed so that it can be stopped or restarted without issues. On the other hand, a process that is designed to run once may run into errors if it exits prematurely. A Kubernetes Job restarts the container inside the pod if it fails. It may also restart the whole pod or reschedule it to another node for multiple reasons, for example:
- The pod was consuming more resources than the node has available.
- The node crashed.
- The node got rebooted or upgraded.
Even if you set .spec.parallelism = 1, .spec.completions = 1, and .spec.template.spec.restartPolicy = "Never", there is no guarantee that the process runs only once; in rare cases the same program may be started twice.
The bottom line is: the process that runs through a Job must be able to handle premature exit (lock files, cached data, etc.). It must also be capable of surviving while multiple instances of it are running.
The Pod Failure Limit
If you set restartPolicy = "OnFailure" and your Pod has a problem that makes it always fail (a configuration error, an unreachable database or API, etc.), does this mean that the Kubernetes Job keeps restarting the Pod indefinitely? Fortunately, no. A Kubernetes Job retries a failed pod after a back-off delay that starts at ten seconds and doubles with each failure (10, 20, 40, 80 seconds, and so on), capped at six minutes. After a default of six failed retries, the Job no longer restarts the failing Pod.
You can override this behavior by setting spec.backoffLimit to the number of times the Kubernetes Job should retry the failing Pod before giving up.
Notice that with restartPolicy = "OnFailure", the pod running the Job is terminated once the backoff limit is reached, which can make it more challenging to trace back what caused the Job to fail. In such a case, you may want to set restartPolicy = "Never" and start debugging the problem: each failed attempt then leaves its own pod behind, so its logs remain available.
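A minimal sketch of that debugging setup, assuming a deliberately failing command: with restartPolicy: Never, every failed attempt produces its own pod, so you can still run kubectl logs against each of them.

apiVersion: batch/v1
kind: Job
metadata:
  name: job-debug
spec:
  backoffLimit: 3   # keep the number of failed pods manageable
  template:
    metadata:
      name: job-debug
    spec:
      containers:
      - name: job-debug
        image: r.deso.tech/library/alpine
        command: ["/bin/sh","-c"]
        args: ["echo 'simulated failure'; exit 1"]
      restartPolicy: Never   # each failure leaves a pod you can inspect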
Limiting The Kubernetes Job Execution Time
Sometimes, you are more interested in running your Job for a specific amount of time regardless of whether or not the process completes successfully. Think of an AI application that needs to consume data from Google Analytics. You’re using a cloud instance, and the provider charges you for the amount of CPU and network resources you are utilizing. You are using a Kubernetes Job for data consumption, and you don’t want it to run for more than one hour.
Kubernetes Jobs offer the spec.activeDeadlineSeconds parameter. Setting this parameter to a number of seconds terminates the Job as soon as that amount of time has elapsed. Notice that this setting takes precedence over spec.backoffLimit: if the Job reaches its deadline while a pod is failing and being retried, the failing pod is not restarted again; the Job stops immediately.
In the following example, we are creating a Job that has both a backoff limit and a deadline:
apiVersion: batch/v1
kind: Job
metadata:
  name: job-4
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 20
  template:
    metadata:
      name: job-4
    spec:
      containers:
      - name: job-4-container
        image: r.deso.tech/library/alpine
        command: ["/bin/sh", "-c"]
        args: ["echo 'Consuming data'; sleep 1; exit 1"]
      restartPolicy: OnFailure
Apply the definition and watch the pods:
kubectl apply -f job-4.yaml && kubectl get pods --watch
job.batch/job-4 created
NAME          READY   STATUS              RESTARTS   AGE
job-4-ztrgj   0/1     ContainerCreating   0          1s
job-4-ztrgj   0/1     ContainerCreating   0          3s
job-4-ztrgj   1/1     Running             0          5s
job-4-ztrgj   0/1     Error               0          6s
job-4-ztrgj   0/1     Error               1          8s
job-4-ztrgj   0/1     CrashLoopBackOff    1          9s
job-4-ztrgj   0/1     Terminating         1          20s
job-4-ztrgj   0/1     Terminating         1          24s
job-4-ztrgj   0/1     Terminating         1          27s
As you can see from the above output, the Kubernetes Job started the pod, but since the pod fails shortly after starting, the Job restarts it. It keeps restarting the failing pod, but once the activeDeadlineSeconds (twenty seconds) is reached, the Job exits immediately, even though it had restarted the pod only twice and had not yet reached the backoffLimit of five.
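If you want to confirm why the Job stopped, its status conditions record the reason. The exact set of conditions can vary with your cluster version, but you should find one of type Failed with reason DeadlineExceeded; for example:

kubectl get job job-4 -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}'
DeadlineExceeded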