Job

Creating a Simple Job That Finishes in Finite Time

A Job is a supervisor in Kubernetes that can be used to manage Pods that are supposed to run a determined task and then terminate gracefully. A Job creates the specified number of Pods and ensures that they successfully complete their workloads or tasks. When a Job is created, it creates and tracks the Pods that were specified in its configuration. When a specified number of Pods complete successfully, the Job is considered complete. If a Pod fails because of underlying node failures, the Job will create a new Pod to replace it. This also means that the application or code running on the Pod should be capable of gracefully handling a case where a new Pod comes up during the runtime of the process.

The Pods created by a Job aren’t deleted following completion of the job. The Pods run to completion and stay in the cluster with a Completed status.

Creating a Simple Job That Finishes in Finite Time

In this exercise, we will create our first Job, which will run a container that simply waits for 10 seconds and then finishes.

To successfully complete this exercise, perform the following steps:

Create a file called runonce.yaml with the following content:

apiVersion: batch/v1
kind: Job
metadata:
  name: countdown
spec:
  template:
    metadata:
      name: countdown
    spec:
      containers:
      - name: counter
        image: centos:7
        command:
         - "bin/bash"
         - "-c"
         - "for i in 9 8 7 6 5 4 3 2 1 ; do echo $i ; done"
      restartPolicy: Never

Run the following command to create the Deployment using the kubectl apply command:

kubectl apply -f countdown.yaml

You should see the following response:

job.batch/countdown created

Run the following command to check the status of the Job:

kubectl get jobs

NAME        COMPLETIONS   DURATION   AGE
countdown   1/1           24s        59s

We can see that the Job requires one completion. Run the following command to check the Pod running the Job:

kubectl get pods

NAME              READY   STATUS      RESTARTS   AGE
countdown-jfwzk   0/1     Completed   0          75s

We can see that the Job has created a Pod named countdown-jfwzk to run the countdown task specified in the Job template. We can see that the Pod has a Completed status.

Next, run the following command to check the logs for the Pod created by the Job:

kubectl logs -f countdown-jfwzk

We can see that the Pod printed the countdown.

kubectl delete job countdown

job.batch "countdown" deleted

Jobs Completions And Parallelism

Multiple Single Job

For example, we may have a queue of messages that needs processing. We must spawn consumer jobs that pull messages from the queue until it’s empty. To implement this pattern in Kubernetes Jobs, we set the .spec.completions parameter to a number (must be a non-zero, positive number). The Job starts spawning pods up till the completions number. The Job regards itself as complete when all the pods terminate with a successful exit code. Let’s have an example. Modify our definition file to look as follows:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-2
spec:
  completions: 5
  template:
    metadata:
      name: job-2
    spec:
      containers:
      - name: job-2
        image: r.deso.tech/library/alpine
        command: ["/bin/sh","-c"]
        args: ["echo 'consuming a message'; sleep 5"]
      restartPolicy: OnFailure

kubectl apply -f job-2.yaml

job.batch/job-2 created

This definition is very similar to the one we used before with some differences:

We specify the completions parameter to be 5.
We change the command that the container inside the pod used to include a five-second delay, which ensures that we can see the pods created and terminated.

Issue the following command:

kubectl apply -f job-2.yaml && kubectl get pods --watch

This command applies the new definition file to the cluster and immediately starts displaying the pods and their status. The --watch flag saves us from having to type the command over and over as it automatically displays any changes in the pod statuses:

job.batch/job-2 created

NAME          READY   STATUS              RESTARTS   AGE
job-2-nvd7q   0/1     ContainerCreating   0          1s
job-2-nvd7q   0/1     ContainerCreating   0          2s
job-2-nvd7q   1/1     Running             0          5s
job-2-nvd7q   0/1     Completed           0          10s
job-2-x4mw6   0/1     Pending             0          0s
job-2-x4mw6   0/1     Pending             0          0s
job-2-x4mw6   0/1     ContainerCreating   0          0s
job-2-x4mw6   0/1     ContainerCreating   0          1s
job-2-x4mw6   1/1     Running             0          3s
job-2-x4mw6   0/1     Completed           0          8s
job-2-shg6x   0/1     Pending             0          0s
job-2-shg6x   0/1     Pending             0          0s
job-2-shg6x   0/1     ContainerCreating   0          0s
job-2-shg6x   0/1     ContainerCreating   0          2s
job-2-shg6x   1/1     Running             0          4s
job-2-shg6x   0/1     Completed           0          9s
job-2-cvtl6   0/1     Pending             0          0s
job-2-cvtl6   0/1     Pending             0          0s
job-2-cvtl6   0/1     ContainerCreating   0          0s
job-2-cvtl6   0/1     ContainerCreating   0          2s
job-2-cvtl6   1/1     Running             0          4s
job-2-cvtl6   0/1     Completed           0          9s
job-2-w9rjm   0/1     Pending             0          0s
job-2-w9rjm   0/1     Pending             0          0s
job-2-w9rjm   0/1     ContainerCreating   0          0s
job-2-w9rjm   0/1     ContainerCreating   0          2s
job-2-w9rjm   1/1     Running             0          4s
job-2-w9rjm   0/1     Completed           0          9s

As you can see from the above output, the Job created the first pod. When the pod terminated without failure, the Job spawned the next one all long till the last of the ten pods were created and terminated with no failure.

kubectl delete job job-2

job.batch "job-2" deleted

Multiple Parallel Jobs (Work Queue)

Another pattern may involve the need to run multiple jobs, but instead of running them one after another, we need to run several of them in parallel. Parallel processing decreases the overall execution time. It has its application in many domains, like data science and AI.

We will use a simple command of linux for shuffling text: shuf. It shuffles its input by outputting a random permutation of its input lines. Each output permutation is equally likely.

Modify the definition file to look as follows:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-3
spec:
  parallelism: 5
  template:
    metadata:
      name: job-3
    spec:
      containers:
      - name: job-3
        image: r.deso.tech/library/alpine
        command: ["/bin/sh","-c"]
        args: ["echo 'consuming a message'; sleep $(shuf -i 5-10 -n 1)"]
      restartPolicy: OnFailure

Here we didn’t set the .spec.completions parameter. Instead, we specified the parallelism one. The completions parameter in our case defaults to parallelism (5). The Job now has the following behavior: five pods will get launched at the same time; all of them are execution the same Job. When one of the pods terminates successfully, this means that the whole Job is done. No more pods get spawned, and the Job eventually terminates. Let’s apply this definition:

kubectl apply -f job-3.yaml && kubectl get pods --watch

job.batch/job-3 created
NAME          READY   STATUS              RESTARTS   AGE
job-3-9l2j7   0/1     ContainerCreating   0          0s
job-3-c95k7   0/1     ContainerCreating   0          0s
job-3-hkrpd   0/1     ContainerCreating   0          0s
job-3-qn24t   0/1     ContainerCreating   0          0s
job-3-vqzxw   0/1     ContainerCreating   0          0s
job-3-9l2j7   0/1     ContainerCreating   0          3s
job-3-qn24t   0/1     ContainerCreating   0          3s
job-3-hkrpd   0/1     ContainerCreating   0          4s
job-3-vqzxw   0/1     ContainerCreating   0          4s
job-3-c95k7   0/1     ContainerCreating   0          4s
job-3-9l2j7   1/1     Running             0          5s
job-3-qn24t   1/1     Running             0          6s
job-3-vqzxw   1/1     Running             0          6s
job-3-hkrpd   1/1     Running             0          6s
job-3-c95k7   1/1     Running             0          7s
job-3-qn24t   0/1     Completed           0          11s
job-3-c95k7   0/1     Completed           0          13s
job-3-hkrpd   0/1     Completed           0          14s
job-3-9l2j7   0/1     Completed           0          14s
job-3-vqzxw   0/1     Completed           0          15s
. . .

In this scenario, Kubernetes Job is spawning five pods at the same time. It is the responsibility of the pods to know whether or not their peers have finished. In our example, we assume that we are consuming messages from a message queue (like RabbitMQ). When there are no more messages to consume, the job receives a notification that it should exit. Once the first pod exits successfully:

No more pods are spawned.
Existing pods finish their work and exit as well.

In the above example, we changed the command that the pod executes to make it sleep for a random number of seconds (from five to ten) before it terminates. This way, we are roughly simulating how multiple pods can work together on an external data source like a message queue or an API.

kubectl get pod

NAME          READY   STATUS      RESTARTS   AGE
job-3-9l2j7   0/1     Completed   0          3m27s
job-3-c95k7   0/1     Completed   0          3m27s
job-3-hkrpd   0/1     Completed   0          3m27s
job-3-qn24t   0/1     Completed   0          3m27s
job-3-vqzxw   0/1     Completed   0          3m27s

Kubernetes Job Failure and Concurrency Considerations

A process running through a Kubernetes Job is different than a daemon. A daemon represents some service with a web interface (for example, an API). Traditionally, a daemon is programmed to be stopped or restarted without issues. On the other hand, a process that is designed to run once may run into errors when it exits prematurely. Kubernetes Jobs restarts the container inside the pod if it fails. It may also restart the whole pod or reschedule it to another node for multiple reasons. For example:

The pod was consuming more resources than the node has.
The node crashed.
The node got rebooted or upgraded.

Even if you set .spec.parallelism = 1, .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never", there is no guarantee that the Kubernetes Job runs the process more than once.

The bottom line is: the process that runs through a Job must be able to handle premature exit (lock files, cached data, etc.). It must also be capable of surviving while multiple instances of it are running.

The Pod Failure Limit

If you set the restartPolicy = "OnFailure" and your Pod had a problem that makes it always fail (a configuration error, an unreachable database, API…etc.) does this mean that the Kubernetes Job keep on restarting the Pod indefinitely? Fortunately, no. A Kubernetes Job will retry running a failed pod every time interval. This time interval starts at ten seconds then it doubles, i.e. 10,20,40,80…etc. As soon as six minutes have passed, the Job will no longer restart the failing Pod.

You can override this behavior by setting the spec.backoffLimit to the number of times the Kubernetes Job should restart the failing Pod.

Notice that setting restartPolicy = "OnFailure" terminates the container when the backoff limit is reached. This way, it may be more challenging when you need to trace back what caused the Job to fail. In such a case, you may want to set the restartPolicy = "Never" and start debugging the problem.

Limiting The Kubernetes Job Execution Time

Sometimes, you are more interested in running your Job for a specific amount of time regardless of whether or not the process completes successfully. Think of an AI application that needs to consume data from Google Analytics. You’re using a cloud instance, and the provider charges you for the amount of CPU and network resources you are utilizing. You are using a Kubernetes Job for data consumption, and you don’t want it to run for more than one hour.

Kubernetes Jobs offer the spec.activeDeadlineSeconds parameter. Setting this parameter to a number will terminate the Job immediately once this number of seconds is reached.

Notice that this setting overrides spec.backoffLimit, which means that if the pod fails and the Job reaches its deadline limit, it will not restart the failing pod. It will stop immediately.

In the following example, we are creating a Job that has both a backoff limit and a deadline:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-4
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 20
  template:
    metadata:
      name: job-4
    spec:
      containers:
      - name: job-4-container
        image: r.deso.tech/library/alpine
        command: ["/bin/sh", "-c"]
        args: ["echo 'Consuming data'; sleep 1; exit 1"]
      restartPolicy: OnFailure

kubectl apply -f job-4.yaml && kubectl get pods --watch

job.batch/job-4 created
NAME          READY   STATUS              RESTARTS   AGE
job-4-ztrgj   0/1     ContainerCreating   0          1s
job-4-ztrgj   0/1     ContainerCreating   0          3s
job-4-ztrgj   1/1     Running             0          5s
job-4-ztrgj   0/1     Error               0          6s
job-4-ztrgj   0/1     Error               1          8s
job-4-ztrgj   0/1     CrashLoopBackOff    1          9s
job-4-ztrgj   0/1     Terminating         1          20s
job-4-ztrgj   0/1     Terminating         1          24s
job-4-ztrgj   0/1     Terminating         1          27s

As you can see from the above output, the Kubernetes Job started the pod, but since the pod shortly fails, it restarts it. It keeps on restarting the failing pod. But, once the activeDeadlineSeconds is reached (twenty seconds), the job exits immediately although it restarted the pod twice only and hasn’t gotten to the backoffLimit of five.