Job Requeue and Job Rerun
Contents
- About Job Requeue
- Automatic Job Requeue
- Job-level automatic requeue
- Reverse Requeue
- Exclusive Job Requeue
- User-Specified Job Requeue
- Automatic Job Rerun
About Job Requeue
A networked computing environment is vulnerable to failures and temporary conditions in network services or processor resources. For example, you might get NFS stale handle errors, disk full errors, process table full errors, or network connectivity problems. Your application can also be subject to external conditions such as software license problems, or to an occasional failure caused by a bug in the application.
Such errors are temporary and tend to happen at one time but not another, or on one host but not another. Without automatic recovery, you might not learn that all your jobs exited because of temporary errors until 12 hours later.
LSF provides a way to automatically recover from temporary errors. You can configure certain exit values so that a job that exits with one of them is automatically requeued as if it had not yet been dispatched. The job is then retried later. You can also configure your queue so that a requeued job is not scheduled to hosts on which it previously failed to run.
Automatic Job Requeue
You can configure a queue to automatically requeue a job if it exits with a specified exit value.
- The job is requeued to the head of the queue from which it was dispatched, unless the LSB_REQUEUE_TO_BOTTOM parameter in lsf.conf is set.
- When a job is requeued, LSF does not save the output from the failed run.
- When a job is requeued, LSF does not notify the user by sending mail.
- A job terminated by a signal is not requeued.
The reserved keyword all specifies all exit codes. Exit codes are typically between 0 and 255. Use a tilde (~) to exclude specified exit codes from the list.
For example:
REQUEUE_EXIT_VALUES=all ~1 ~2 EXCLUDE(9)
Jobs that exit with any exit code except 1 and 2 are requeued. Jobs that exit with code 9 are requeued so that the failed job is not rerun on the same host (exclusive job requeue).
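The exit-value syntax described above can be modeled in a few lines of Python. This is only a sketch of how the syntax reads, not LSF's own parser: the function names are hypothetical, and the choice to let EXCLUDE() take precedence over a tilde exclusion of the same code is an assumption.

```python
import re

def parse_requeue_exit_values(spec):
    """Split a REQUEUE_EXIT_VALUES-style string into its parts (hypothetical helper)."""
    include_all = False
    included, excluded, exclusive = set(), set(), set()
    # Codes inside EXCLUDE(...) request exclusive job requeue.
    for group in re.findall(r"EXCLUDE\(([^)]*)\)", spec):
        exclusive.update(int(code) for code in group.split())
    remainder = re.sub(r"EXCLUDE\([^)]*\)", " ", spec)
    for token in remainder.split():
        if token == "all":
            include_all = True
        elif token.startswith("~"):
            excluded.add(int(token[1:]))   # ~N removes N from the list
        else:
            included.add(int(token))
    return include_all, included, excluded, exclusive

def requeue_action(spec, exit_code):
    """Return what the sketch would do with a job that exits with exit_code."""
    include_all, included, excluded, exclusive = parse_requeue_exit_values(spec)
    if exit_code in exclusive:             # assumption: EXCLUDE() is checked first
        return "exclusive requeue"
    if exit_code in excluded:
        return "no requeue"
    if include_all or exit_code in included:
        return "requeue"
    return "no requeue"

for code in (1, 3, 9):
    print(code, requeue_action("all ~1 ~2 EXCLUDE(9)", code))
```

With the example value, code 1 is left alone, code 3 is requeued, and code 9 is requeued exclusively.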
Configure automatic job requeue
- To configure automatic job requeue, set REQUEUE_EXIT_VALUES in the queue definition (lsb.queues) or in an application profile (lsb.applications) and specify the exit codes that cause the job to be requeued.
Application-level exit values override queue-level values. Job-level exit values (bsub -Q) override application-level and queue-level values.
Begin Queue
...
REQUEUE_EXIT_VALUES = 99 100
...
End Queue
This configuration enables jobs that exit with 99 or 100 to be requeued.
Control how many times a job can be requeued
By default, if a job fails and its exit value falls into REQUEUE_EXIT_VALUES, LSF requeues the job automatically. Jobs that fail repeatedly are requeued five times by default.
- To limit the number of times a failed job is requeued, set MAX_JOB_REQUEUE cluster wide (lsb.params), in the queue definition (lsb.queues), or in an application profile (lsb.applications).
Specify an integer greater than zero (0).
MAX_JOB_REQUEUE in lsb.applications overrides lsb.queues, and lsb.queues overrides lsb.params configuration. Specifying a job-level exit value using bsub -Q overrides all MAX_JOB_REQUEUE settings.
When MAX_JOB_REQUEUE is set, if a job fails and its exit value falls into REQUEUE_EXIT_VALUES, the number of times the job has been requeued is increased by 1 and the job is requeued. When the requeue limit is reached, the job is suspended with PSUSP status. If a job fails and its exit value is not specified in REQUEUE_EXIT_VALUES, the job is not requeued.
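The precedence and counting rules above can be sketched in Python as follows. Both functions are hypothetical models, not LSF code. The sketch assumes that a job is requeued at most MAX_JOB_REQUEUE times and that the next matching failure leaves it in PSUSP; the text does not spell out this boundary, so treat it as one possible reading.

```python
def effective_requeue_limit(params=None, queue=None, application=None):
    """Pick the most specific MAX_JOB_REQUEUE value, or None if none is set.

    Precedence follows the text: lsb.applications, then lsb.queues, then lsb.params.
    """
    for value in (application, queue, params):
        if value is not None:
            return value
    return None

def run_with_requeue_limit(exit_codes, requeue_values, limit):
    """Replay successive exit codes of one job; return (final status, requeue count)."""
    requeue_count = 0
    for code in exit_codes:
        if code not in requeue_values:
            # Not a requeue exit value: the job simply finishes.
            return ("DONE" if code == 0 else "EXIT"), requeue_count
        if limit is not None and requeue_count >= limit:
            # Assumption: the limit is already used up, so the job is suspended.
            return "PSUSP", requeue_count
        requeue_count += 1                 # the job goes back to the queue
    return "PEND", requeue_count           # still waiting for its next run

limit = effective_requeue_limit(params=10, queue=5, application=3)
print(limit)
print(run_with_requeue_limit([99, 99, 99, 99], {99, 100}, limit))
```

Here the application-level value 3 wins, and a job that keeps exiting with 99 is requeued three times before the fourth failure suspends it.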
Viewing the requeue retry limit
- Run bjobs -l to display the job exit code and reason if the job requeue limit is exceeded.
- Run bhist -l to display the exit code and reason for finished jobs if the job requeue limit is exceeded.
How job requeue retry limit is recovered
The job requeue limit is recovered when LSF is restarted and reconfigured. LSF replays the job requeue limit from the JOB_STATUS event and its pending reason in lsb.events.
Job-level automatic requeue
Use bsub -Q to submit a job that is automatically requeued if it exits with the specified exit values. Use spaces to separate multiple exit codes. The reserved keyword all specifies all exit codes. Exit codes are typically between 0 and 255. Use a tilde (~) to exclude specified exit codes from the list.
Job-level requeue exit values override application-level and queue-level configuration of the parameter REQUEUE_EXIT_VALUES, if defined.
Jobs running with the specified exit code share the same application and queue with other jobs.
For example:
bsub -Q "all ~1 ~2 EXCLUDE(9)" myjob
Jobs that exit with any exit code except 1 and 2 are requeued. Jobs that exit with code 9 are requeued so that the failed job is not rerun on the same host (exclusive job requeue).
Define an exit code as EXCLUDE(exit_code) to enable exclusive job requeue. Exclusive job requeue does not work for parallel jobs.
If mbatchd is restarted, it does not remember the previous hosts from which the job exited with an exclusive requeue exit code. In this situation, it is possible for a job to be dispatched to hosts on which the job has previously exited with an exclusive exit code.
Use bmod -Q to modify or cancel job-level requeue exit values. bmod -Q does not affect running jobs. For rerunnable and requeue jobs, bmod -Q affects the next run.
MultiCluster jobs
Job forwarding model
For jobs sent to a remote cluster, arguments of bsub -Q take effect on remote clusters.
Lease model
The arguments of bsub -Q apply to jobs running on remote leased hosts as if they are running on local hosts.
Reverse Requeue
By default, if you use automatic job requeue, jobs are requeued to the head of a queue. You can have jobs requeued to the bottom of a queue instead. The job priority does not change.
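The difference between the two placements can be pictured with a simple list model. This is a deliberately simplified sketch: the function name and job labels are hypothetical, and the list stands for jobs of equal priority only, since the real scheduler also orders jobs by priority.

```python
def requeue_position(pending, job, to_bottom=False):
    """Return the pending list after requeuing job at the head or at the bottom."""
    if to_bottom:
        return pending + [job]   # reverse requeue: the job goes to the bottom
    return [job] + pending       # default: the job goes to the head

pending = ["job2", "job3"]
print(requeue_position(pending, "job1"))                  # head of the queue
print(requeue_position(pending, "job1", to_bottom=True))  # bottom of the queue
```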
Configure reverse requeue
You must already use automatic job requeue (REQUEUE_EXIT_VALUES in lsb.queues).
To configure reverse requeue:
- Set LSB_REQUEUE_TO_BOTTOM in lsf.conf to 1.
- Reconfigure the cluster:
  a. lsadmin reconfig
  b. badmin mbdrestart
Exclusive Job Requeue
You can configure automatic job requeue so that a failed job is not rerun on the same host.
Limitations
- If mbatchd is restarted, this feature might not work properly, since LSF forgets which hosts have been excluded. If a job ran on a host and exited with an exclusive exit code before mbatchd was restarted, the job could be dispatched to the same host again after mbatchd is restarted.
- Exclusive job requeue does not work for MultiCluster jobs or parallel jobs
- A job terminated by a signal is not requeued
Configure exclusive job requeue
- Set REQUEUE_EXIT_VALUES in the queue definition (lsb.queues) and define the exit code using parentheses and the keyword EXCLUDE:
EXCLUDE(exit_code...)
exit_code has the following form:
"[all] [~number ...] | [number ...]"
The reserved keyword all specifies all exit codes. Exit codes are typically between 0 and 255. Use a tilde (~) to exclude specified exit codes from the list.
Jobs are requeued to the head of the queue. The output from the failed run is not saved, and the user is not notified by LSF.
When a job exits with any of the specified exit codes, it is requeued, but it is not dispatched to the same host again.
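The host-exclusion rule can be sketched as a filter over the queue's host list. The function name, the history structure, and the host names hostA, hostB, hostC with exit codes 30 and 20 are illustrative choices for this sketch, not LSF internals.

```python
def eligible_hosts(queue_hosts, history, exclusive_codes):
    """Return the hosts a requeued job may still be dispatched to.

    history is a list of (host, exit_code) pairs from the job's earlier runs.
    """
    excluded = {host for host, code in history if code in exclusive_codes}
    return [host for host in queue_hosts if host not in excluded]

hosts = ["hostA", "hostB", "hostC"]
exclusive = {20}                      # corresponds to EXCLUDE(20)

history = [("hostA", 30)]             # 30 is an ordinary requeue value
print(eligible_hosts(hosts, history, exclusive))

history.append(("hostA", 20))         # exclusive exit code on hostA
print(eligible_hosts(hosts, history, exclusive))

history.clear()                       # models an mbatchd restart: exclusions are forgotten
print(eligible_hosts(hosts, history, exclusive))
```

An ordinary requeue leaves all three hosts eligible, an exclusive exit code removes hostA, and clearing the history shows why a restart of mbatchd can send the job back to a host it already failed on.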
Begin Queue
...
REQUEUE_EXIT_VALUES=30 EXCLUDE(20)
HOSTS=hostA hostB hostC
...
End Queue
A job in this queue can be dispatched to hostA, hostB, or hostC.
If a job running on hostA exits with value 30 and is requeued, it can be dispatched to hostA, hostB, or hostC. However, if a job running on hostA exits with value 20 and is requeued, it can only be dispatched to hostB or hostC.
If the job runs on hostB and exits with a value of 20 again, it can only be dispatched on hostC. Finally, if the job runs on hostC and exits with a value of 20, it cannot be dispatched to any of the hosts, so it is pending forever.
User-Specified Job Requeue
You can use brequeue to kill a job and requeue it. When the job is requeued, it is assigned the PEND status and the job's new position in the queue is after other jobs of the same priority.
Requeue a job
- To requeue one job, use brequeue.
- You can only use brequeue on running (RUN), user-suspended (USUSP), or system-suspended (SSUSP) jobs.
- Users can only requeue their own jobs. Only root and LSF administrator can requeue jobs submitted by other users.
- You cannot use brequeue on interactive batch jobs
brequeue 109
LSF kills the job with job ID 109, and requeues it in the PEND state. If job 109 has a priority of 4, it is placed after all the other jobs with the same priority.
brequeue -u User5 45 67 90
LSF kills and requeues 3 jobs belonging to User5. The jobs have the job IDs 45, 67, and 90.
Automatic Job Rerun
Job requeue vs. job rerun
Automatic job requeue occurs when a job finishes and has a specified exit code (usually indicating some type of failure).
Automatic job rerun occurs when the execution host becomes unavailable while a job is running. It does not occur if the job itself fails.
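The distinction can be summarized as a small decision function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the model only covers the conditions named in this document (host availability, exit values, signals, and whether the job is rerunnable).

```python
def recovery_action(host_available, exit_code, requeue_values,
                    rerunnable, killed_by_signal=False):
    """Decide between rerun, requeue, and no automatic recovery."""
    if not host_available:
        # The execution host was lost while the job was running.
        return "rerun" if rerunnable else "none"
    if killed_by_signal:
        # A job terminated by a signal is not requeued.
        return "none"
    if exit_code in requeue_values:
        # The job itself finished with a configured requeue exit value.
        return "requeue"
    return "none"

print(recovery_action(True, 99, {99, 100}, rerunnable=True))     # job failed
print(recovery_action(False, None, {99, 100}, rerunnable=True))  # host lost
```

The first call models a job that fails with a configured exit value and is requeued; the second models a lost execution host, which triggers a rerun only because the job is rerunnable.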
About job rerun
When a job is rerun or restarted, it is first returned to the queue from which it was dispatched with the same options as the original job. The priority of the job is set sufficiently high to ensure the job gets dispatched before other jobs in the queue. The job uses the same job ID number. It is executed when a suitable host is available, and an email message is sent to the job owner informing the user of the restart.
Automatic job rerun can be enabled at the job level, by the user, or at the queue level, by the LSF administrator. If automatic job rerun is enabled, the following conditions cause LSF to rerun the job:
- The execution host becomes unavailable while a job is running
- The system fails while a job is running
When LSF reruns a job, it returns the job to the submission queue, with the same job ID. LSF dispatches the job as if it were a new submission, even if the job has been checkpointed.
Once a job is rerun, LSF schedules resizable jobs based on their initial allocation request.
Execution host fails
If the execution host fails, LSF dispatches the job to another host. You receive a mail message informing you of the host failure and the requeuing of the job.
LSF system fails
If the LSF system fails, LSF requeues the job when the system restarts.
Configure queue-level job rerun
- To enable automatic job rerun at the queue level, set RERUNNABLE in lsb.queues to yes.
Submit a rerunnable job
- To enable automatic job rerun at the job level, use bsub -r.
Interactive batch jobs (bsub -I) cannot be rerunnable.
Submit a job as not rerunnable
- To disable automatic job rerun at the job level, use bsub -rn.
Disable post-execution for rerunnable jobs
Running of post-execution commands upon restart of a rerunnable job may not always be desirable; for example, if the post-exec removes certain files, or does other cleanup that should only happen if the job finishes successfully.
- Use LSB_DISABLE_RERUN_POST_EXEC=Y in lsf.conf to prevent the post-exec from running when a job is rerun.
Platform Computing Inc.
www.platform.com