Why this matters
Most ML systems need scheduled or one-off batch inference: nightly scoring, weekly backfills, or temporary reprocessing after a model update. Kubernetes Jobs and CronJobs give you reliable execution, retries, scaling, resource limits, and automatic cleanup—all essential for robust MLOps.
- Run large predictions as parallel shards.
- Schedule off-peak jobs to save cost.
- Retry failures safely without re-running everything.
- Track status, logs, and outcomes for auditability.
Who this is for
- MLOps engineers deploying batch inference on Kubernetes.
- Data scientists moving prototypes to production in clusters.
- Platform engineers standardizing ML batch patterns.
Prerequisites
- Comfort with basic Kubernetes objects (Pods, Deployments, Namespaces).
- Can build and push a container image with your inference code.
- Basic YAML editing and using kubectl.
Concept explained simply
A Job runs a finite task until it completes. Think: “Do this work and finish.” A CronJob runs a Job on a schedule. Think: “Do this same task every night at 02:00.”
Mental model
- Job = a checklist with N boxes (completions). parallelism = how many boxes you check at once.
- CronJob = a calendar that creates a new Job at specific times.
- Retries/backoff = what happens if a box fails; Kubernetes tries again up to backoffLimit.
- Cleanup (ttlSecondsAfterFinished) = auto-remove completed Jobs after a delay.
Key Kubernetes objects for batch ML
- Job: one-off batch run; supports parallelism, completions, retries.
- CronJob: scheduled Jobs; supports concurrencyPolicy and history limits.
- Pod template: command to run inference, resources, volumes, secrets.
Worked examples
Example 1 — One-off Job for a single dataset
apiVersion: batch/v1
kind: Job
metadata:
  name: infer-once
spec:
  backoffLimit: 3
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: infer
        image: your-batch-infer:latest
        args: ["--input", "/data/input.parquet", "--output", "/data/preds.parquet"]
        resources:
          requests: { cpu: "1", memory: "2Gi" }
          limits: { cpu: "2", memory: "4Gi" }
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ml-bucket-pvc
Use this when you have a single file to score and just need it to finish once.
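The Job above only works if the container accepts the same flags its args pass. A minimal entrypoint sketch, assuming Python inside the image; the `score` function and the in-memory rows are hypothetical stand-ins for your real Parquet I/O and model call:

```python
import argparse

def score(rows):
    # Hypothetical model call; replace with your real inference logic.
    return [{"id": r["id"], "pred": 0.5} for r in rows]

def main(argv=None):
    parser = argparse.ArgumentParser(description="One-off batch inference entrypoint")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)
    # In production, read args.input and write args.output as Parquet
    # (e.g. with pandas or pyarrow); two dummy rows keep the sketch self-contained.
    rows = [{"id": 1}, {"id": 2}]
    preds = score(rows)
    print(f"scored {len(preds)} rows from {args.input} -> {args.output}")
    return preds

if __name__ == "__main__":
    main()
```

Let the process exit nonzero on any unhandled error: that exit code is what triggers the Job's retry/backoff behavior.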
Example 2 — Nightly CronJob with no overlap
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-batch-infer
spec:
  schedule: "0 2 * * *"         # run every day at 02:00
  concurrencyPolicy: Forbid     # do not start a new run while the previous one is still running
  startingDeadlineSeconds: 600  # a missed run may still be started within 10 minutes
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 2
      ttlSecondsAfterFinished: 86400
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: infer
            image: your-batch-infer:stable
            env:
            - name: MODEL_VERSION
              value: "v3"
            # Note: Kubernetes does not expand {{yesterday}}; resolve the
            # target date inside the container, or inject it another way.
            args: ["--date", "{{yesterday}}", "--model", "$(MODEL_VERSION)"]
            resources:
              requests: { cpu: "500m", memory: "1Gi" }
              limits: { cpu: "1", memory: "2Gi" }
Use this for predictable, scheduled scoring. Forbid avoids overlapping runs.
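Since Kubernetes does not template values like {{yesterday}}, a common pattern is to resolve the run date inside the container at startup. A small sketch, assuming UTC dates and an optional explicit --date override for backfills:

```python
from datetime import datetime, timedelta, timezone

def resolve_run_date(override=None):
    """Return the partition date to score as an ISO string.

    An explicit --date value (override) wins; otherwise default to
    yesterday in UTC, since the 02:00 nightly run scores the prior day.
    """
    if override:
        return override
    yesterday = datetime.now(timezone.utc).date() - timedelta(days=1)
    return yesterday.isoformat()
```

Keeping the override path makes the same image usable for both the nightly schedule and ad-hoc reprocessing of a specific day.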
Example 3 — Sharded Job with Indexed completions
apiVersion: batch/v1
kind: Job
metadata:
  name: infer-sharded
spec:
  completions: 100
  parallelism: 10
  completionMode: Indexed
  backoffLimit: 2
  ttlSecondsAfterFinished: 7200
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: your-batch-infer:latest
        env:
        # In Indexed mode the kubelet also sets JOB_COMPLETION_INDEX
        # automatically; this fieldRef just makes the index explicit.
        - name: SHARD_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        args: ["--shard-index", "$(SHARD_INDEX)", "--num-shards", "100"]
        resources:
          requests: { cpu: "1", memory: "2Gi" }
          limits: { cpu: "2", memory: "4Gi" }
Each Pod processes one shard, indexed 0..99, enabling deterministic re-runs for failed shards.
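Inside the worker, the shard index must map deterministically to a slice of the data so a re-run of shard k reprocesses exactly the same records. One simple sketch using contiguous row ranges; the output path layout is an assumption, not a Kubernetes convention:

```python
def shard_bounds(item_count, num_shards, shard_index):
    """Return the half-open range of item positions owned by one shard.

    Early shards absorb the remainder, so every item belongs to exactly
    one shard and a retried shard always sees the same slice.
    """
    base, extra = divmod(item_count, num_shards)
    start = shard_index * base + min(shard_index, extra)
    size = base + (1 if shard_index < extra else 0)
    return range(start, start + size)

def output_path(shard_index):
    # A unique per-shard output path prevents parallel workers colliding.
    return f"/data/preds/shard-{shard_index:03d}.parquet"
```

Because the mapping depends only on (item_count, num_shards, shard_index), re-running a single failed shard is safe and cheap.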
Patterns and parameters cheat-sheet
- parallelism: number of Pods running at once.
- completions: total successful Pods needed before Job is complete.
- completionMode: Indexed enables stable shard indices; NonIndexed treats Pods as interchangeable.
- backoffLimit: maximum retries for failed Pods before Job fails.
- activeDeadlineSeconds: overall time limit for the Job.
- ttlSecondsAfterFinished: auto-delete a finished Job (and its Pods) N seconds after it completes.
- CronJob concurrencyPolicy: Allow, Forbid, Replace (choose to avoid overlap).
- startingDeadlineSeconds: how long past a missed schedule time a run may still start; beyond this window the run is skipped.
- History limits: successfulJobsHistoryLimit and failedJobsHistoryLimit for retention.
- restartPolicy: Never (common; the Job replaces failed Pods) or OnFailure (containers restart inside the same Pod).
- Resources: set requests/limits to match model memory/CPU needs.
- Data access: use volumes, object-store gateways, or service endpoints.
- Secrets/config: mount credentials and model versions safely via Secrets/ConfigMaps.
Monitoring and debugging
- kubectl get jobs and kubectl describe job <name> show status, conditions, and succeeded/failed counts.
- kubectl logs for Pod output; add structured logs (JSON) to simplify parsing.
- Check events in describe output for scheduling or image pull issues.
- For CronJobs: inspect lastScheduleTime and lastSuccessfulTime in status.
- Emit metrics (duration, processed rows, errors) from your container logs.
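One JSON record per event keeps those logs trivially parseable by log pipelines. A minimal sketch; the field names (shard, rows, errors, duration_s) are illustrative, not a standard:

```python
import json
import time

def log_run(event, **fields):
    """Emit one structured log line to stdout and return it.

    Downstream collectors can parse each line as a standalone JSON object.
    """
    record = {"ts": time.time(), "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line

# Example: record the outcome of one shard at the end of a run.
log_run("shard_done", shard=7, rows=12500, errors=0, duration_s=41.2)
```

Printing to stdout is enough: kubectl logs and cluster log collectors pick it up without extra wiring.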
Exercises (do these hands-on)
Note: The quick test is available to everyone. Only logged-in users have their exercise and test progress saved.
Exercise 1 — Sharded Job for 100 splits
- Create a Job with completions=100, parallelism=10, completionMode=Indexed.
- Expose the shard index to the container as SHARD_INDEX via the Pod annotation batch.kubernetes.io/job-completion-index.
- Command should look like: --shard-index $(SHARD_INDEX) --num-shards 100.
- Set backoffLimit=2, restartPolicy=Never, and ttlSecondsAfterFinished=3600.
Exercise 2 — Nightly CronJob without overlap
- Create a CronJob scheduled at 02:00 daily.
- Use concurrencyPolicy=Forbid and startingDeadlineSeconds=600.
- Keep only 3 successful and 2 failed Job histories.
- Set Job backoffLimit=1 and ttlSecondsAfterFinished=86400.
- [ ] I validated my YAML with kubectl apply --dry-run=client -f file.yaml
- [ ] I confirmed the Job/CronJob status fields behave as expected in a test cluster
- [ ] I saw retries happen when I forced a failure
Common mistakes and self-check
- Overlapping CronJobs: Fix with concurrencyPolicy=Forbid or Replace.
- Infinite retries: Set backoffLimit and/or activeDeadlineSeconds.
- Memory OOM kills: Increase memory limits or reduce parallelism; check Pod OOMKilled status.
- Data collisions: In sharded jobs, ensure each shard writes to unique output paths.
- Leaving clutter: Use ttlSecondsAfterFinished and history limits.
Self-check prompts
- Does a failed shard re-run only that shard (Indexed), not all work?
- Can your CronJob survive a missed schedule within the deadline?
- Are logs sufficient to audit inputs, model version, and outputs?
Practical projects
- Build a sharded batch inference pipeline that processes 1M rows split into 200 shards with Indexed Jobs and writes shard outputs to unique paths.
- Create a nightly CronJob that scores new data, computes summary metrics, and posts a compact report to a storage location.
- Add auto-cleanup (TTL) and a simple retention policy for successful and failed runs, and document recovery steps for failed shards.
Learning path
- Now: Jobs and CronJobs for batch inference (this page).
- Next: Resource tuning and autoscaling for batch Pods; advanced scheduling (affinity, taints/tolerations).
- Then: Orchestrating multi-step batch flows (Pipelines) and integrating model registries and feature stores.
Next steps
- Implement the exercises in a sandbox namespace.
- Add structured logging and success/failure metrics to your container.
- Review alerting for missed or failed CronJobs.
Mini challenge
Scenario: Backfill last 14 days safely
Create a CronJob that runs every hour but only processes one missing day at a time using a parameter (e.g., --date). Avoid overlap, retry twice on failure, and auto-clean completed Jobs after 12 hours. Hint: Use Forbid, startingDeadlineSeconds, and an idempotent container that checks which dates are pending.
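The idempotency hinted at above can live in a small date-selection step: each hourly run picks the oldest unprocessed day in the window and exits cleanly as a no-op once the backfill is done. A sketch, where the completed set stands in for however you record finished dates (done-markers, a manifest file, etc.):

```python
from datetime import date, timedelta

def next_pending_date(today, completed, window_days=14):
    """Return the oldest ISO date in the backfill window not yet processed,
    or None when the backfill is finished (the container then exits 0)."""
    for offset in range(window_days, 0, -1):  # oldest day first
        day = (today - timedelta(days=offset)).isoformat()
        if day not in completed:
            return day
    return None
```

Combined with concurrencyPolicy=Forbid, this guarantees at most one day is in flight at a time, and re-running the CronJob never double-processes a completed day.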
Ready to test yourself?
Scroll to the Quick Test below. Not logged in? Your answers are not saved, but you can still practice for free.