Platform Computing Corp.

Job Requeue and Job Rerun

About Job Requeue

A networked computing environment is vulnerable to failures and temporary conditions in network services or processor resources. For example, you might get NFS stale handle errors, disk full errors, process table full errors, or network connectivity problems. Your application can also be subject to external conditions such as software license problems, or an occasional failure due to a bug in your application.

Such errors are temporary: they tend to happen at one time but not another, or on one host but not another. You might be upset to learn that all your jobs exited due to temporary errors and that you did not know about it until 12 hours later.

LSF provides a way to automatically recover from temporary errors. You can configure certain exit values so that if a job exits with one of those values, the job is automatically requeued as if it had not yet been dispatched. The job is then retried later. You can also configure your queue so that a requeued job is not scheduled to hosts on which the job previously failed to run.

Automatic Job Requeue

You can configure a queue to automatically requeue a job if it exits with a specified exit value.

The reserved keyword all specifies all exit codes. Exit codes are typically between 0 and 255. Use a tilde (~) to exclude specified exit codes from the list.

For example:

REQUEUE_EXIT_VALUES=all ~1 ~2 EXCLUDE(9) 

Jobs that exit with any exit code except 1 and 2 are requeued. Jobs that exit with exit code 9 are requeued so that the failed job is not rerun on the same host (exclusive job requeue).
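To make the syntax concrete, the following Python sketch models how a value such as all ~1 ~2 EXCLUDE(9) could be interpreted. It is not LSF code: the function names are hypothetical, the sketch assumes that all expands to 0 through 255, and it does not model how LSF treats a successful exit (0) or how tilde exclusions interact with EXCLUDE.

```python
import re

# Assumption for illustration: "all" expands to the typical 0..255 range.
ALL_CODES = frozenset(range(256))


def _expand(tokens):
    """Turn tokens such as ['all', '~1', '~2'] or ['99', '100'] into a set."""
    included, excluded = set(), set()
    for tok in tokens:
        if tok.lower() == "all":
            included |= ALL_CODES
        elif tok.startswith("~"):
            excluded.add(int(tok[1:]))
        else:
            included.add(int(tok))
    return included - excluded


def parse_requeue_spec(spec):
    """Return (requeue_codes, exclusive_codes) for a spec string."""
    exclusive = set()
    for inner in re.findall(r"EXCLUDE\(([^)]*)\)", spec):
        exclusive |= _expand(inner.split())
    # Remove the EXCLUDE(...) groups, then read the remaining plain codes.
    plain = re.sub(r"EXCLUDE\([^)]*\)", " ", spec)
    requeue = _expand(plain.split()) | exclusive
    return requeue, exclusive


def requeue_action(spec, exit_code):
    """Classify one exit code against a spec string."""
    requeue, exclusive = parse_requeue_spec(spec)
    if exit_code in exclusive:
        return "requeue-exclusive"
    if exit_code in requeue:
        return "requeue"
    return "no-requeue"
```

With the example value above, requeue_action("all ~1 ~2 EXCLUDE(9)", 1) returns "no-requeue", and the same call with exit code 9 returns "requeue-exclusive".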

Configure automatic job requeue

  1. To configure automatic job requeue, set REQUEUE_EXIT_VALUES in the queue definition (lsb.queues) or in an application profile (lsb.applications) and specify the exit codes that cause the job to be requeued.

    Application-level exit values override queue-level values. Job-level exit values (bsub -Q) override application-level and queue-level values.

    Begin Queue
    ...
    REQUEUE_EXIT_VALUES = 99 100
    ...
    End Queue 
     

    This configuration enables jobs that exit with 99 or 100 to be requeued.
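    One way a job could take advantage of this configuration is to run the real command through a small wrapper that reports exit code 99 when the failure looks temporary. The Python sketch below is illustrative only: the wrapper, the function names, and the error-message markers are assumptions, not part of LSF.

```python
import subprocess

# One of the values listed in REQUEUE_EXIT_VALUES = 99 100.
REQUEUE_CODE = 99

# Illustrative markers only (assumption); adjust them to the messages
# your application actually prints for temporary failures.
TRANSIENT_MARKERS = ("Stale file handle", "No space left on device")


def classify(returncode, stderr_text):
    """Map a finished command to the exit code the job should report."""
    if returncode == 0:
        return 0
    if any(marker in stderr_text for marker in TRANSIENT_MARKERS):
        return REQUEUE_CODE
    # Any other failure is passed through unchanged, so it is requeued
    # only if that code is itself listed in REQUEUE_EXIT_VALUES.
    return returncode


def run_job(argv):
    """Run the real command; return the code to hand to sys.exit()."""
    proc = subprocess.run(argv, stderr=subprocess.PIPE, text=True)
    return classify(proc.returncode, proc.stderr)
```

    A wrapper script would end with sys.exit(run_job(sys.argv[1:])), and the job would be submitted as the wrapper followed by the real command line.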

Control how many times a job can be requeued

By default, if a job fails and its exit value falls into REQUEUE_EXIT_VALUES, LSF requeues the job automatically. Jobs that fail repeatedly are requeued five times by default.

  1. To limit the number of times a failed job is requeued, set MAX_JOB_REQUEUE cluster wide (lsb.params), in the queue definition (lsb.queues), or in an application profile (lsb.applications).
  2. Specify an integer greater than zero (0).

    MAX_JOB_REQUEUE in lsb.applications overrides lsb.queues, and lsb.queues overrides lsb.params configuration. Specifying a job-level exit value using bsub -Q overrides all MAX_JOB_REQUEUE settings.

When MAX_JOB_REQUEUE is set, if a job fails and its exit value falls into REQUEUE_EXIT_VALUES, the number of times the job has been requeued is increased by 1 and the job is requeued. When the requeue limit is reached, the job is suspended with PSUSP status. If a job fails and its exit value is not specified in REQUEUE_EXIT_VALUES, the job is not requeued.
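The precedence and counting rules above can be sketched in Python as follows. This is a simplified model rather than LSF's implementation: the function names and the NOT_REQUEUED label are hypothetical, and the sketch assumes that a job is suspended on the first qualifying failure after it has already been requeued MAX_JOB_REQUEUE times.

```python
def effective_limit(params=None, queue=None, application=None):
    """Pick the MAX_JOB_REQUEUE value that applies.

    lsb.applications overrides lsb.queues, which overrides lsb.params.
    None means the parameter is not set at that level.
    """
    for value in (application, queue, params):
        if value is not None:
            return value
    return None  # not configured at any level


def after_failure(exit_code, requeue_codes, requeue_count, limit):
    """Return (state, requeue_count) after one failed run."""
    if exit_code not in requeue_codes:
        # Exit value is not in REQUEUE_EXIT_VALUES: no requeue.
        return "NOT_REQUEUED", requeue_count
    if limit is not None and requeue_count >= limit:
        # Requeue limit reached: the job is suspended.
        return "PSUSP", requeue_count
    # Otherwise count the requeue and return the job to the queue.
    return "PEND", requeue_count + 1


limit = effective_limit(params=10, queue=5, application=2)  # -> 2
state, count = "PEND", 0
while state == "PEND":
    state, count = after_failure(99, {99, 100}, count, limit)
print(state, count)
```

Under these assumptions the loop ends with the job in PSUSP after two requeues.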

Viewing the requeue retry limit

  1. Run bjobs -l to display the job exit code and reason if the job requeue limit is exceeded.
  2. Run bhist -l to display the exit code and reason for finished jobs if the job requeue limit is exceeded.

How job requeue retry limit is recovered

The job requeue limit is recovered when LSF is restarted and reconfigured. LSF replays the job requeue limit from the JOB_STATUS event and its pending reason in lsb.events.

Job-level automatic requeue

Use bsub -Q to submit a job that is automatically requeued if it exits with the specified exit values. Use spaces to separate multiple exit codes. The reserved keyword all specifies all exit codes. Exit codes are typically between 0 and 255. Use a tilde (~) to exclude specified exit codes from the list.

Job-level requeue exit values override application-level and queue-level configuration of the parameter REQUEUE_EXIT_VALUES, if defined.

Jobs running with the specified exit code share the same application and queue with other jobs.

For example:

bsub -Q "all ~1 ~2 EXCLUDE(9)" myjob 

Jobs that exit with any exit code except 1 and 2 are requeued. Jobs that exit with exit code 9 are requeued so that the failed job is not rerun on the same host (exclusive job requeue).
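If jobs are submitted from a script rather than typed at a shell prompt, the exit-value list must reach bsub as a single argument. The Python sketch below builds the same command as the example above; the helper name is hypothetical and the sketch only constructs and prints the command line, it does not submit anything.

```python
import shlex


def bsub_requeue_argv(exit_spec, command):
    """Build an argument list for bsub -Q with the given exit values."""
    # The whole exit-value list is one argument, exactly as the
    # double quotes make it one argument on the command line.
    return ["bsub", "-Q", exit_spec, *command]


argv = bsub_requeue_argv("all ~1 ~2 EXCLUDE(9)", ["myjob"])

# shlex.join shows how the list would be quoted for a POSIX shell.
print(shlex.join(argv))
```

Passing the list directly to subprocess.run(argv) avoids shell quoting altogether, because no shell is involved.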

Define an exit code as EXCLUDE(exit_code) to enable exclusive job requeue. Exclusive job requeue does not work for parallel jobs.

If mbatchd is restarted, it does not remember the previous hosts from which the job exited with an exclusive requeue exit code. In this situation, it is possible for a job to be dispatched to hosts on which the job has previously exited with an exclusive exit code.

Use bmod -Q to modify or cancel job-level requeue exit values. bmod -Q does not affect running jobs. For rerunnable and requeue jobs, bmod -Q affects the next run.

MultiCluster jobs

Job forwarding model

For jobs sent to a remote cluster, arguments of bsub -Q take effect on remote clusters.

Lease model

The arguments of bsub -Q apply to jobs running on remote leased hosts as if they are running on local hosts.

Reverse Requeue

By default, if you use automatic job requeue, jobs are requeued to the head of a queue. You can have jobs requeued to the bottom of a queue instead. The job priority does not change.

Configure reverse requeue

You must already use automatic job requeue (REQUEUE_EXIT_VALUES in lsb.queues).

To configure reverse requeue:

  1. Set LSB_REQUEUE_TO_BOTTOM in lsf.conf to 1.
  2. Reconfigure the cluster:

    a. lsadmin reconfig

    b. badmin mbdrestart

Exclusive Job Requeue

You can configure automatic job requeue so that a failed job is not rerun on the same host.

Limitations

Configure exclusive job requeue

  1. Set REQUEUE_EXIT_VALUES in the queue definition (lsb.queues) and define the exit code using parentheses and the keyword EXCLUDE:

    EXCLUDE(exit_code...) 
     

    exit_code has the following form:

    "[all] [~number ...] | [number ...]"

    The reserved keyword all specifies all exit codes. Exit codes are typically between 0 and 255. Use a tilde (~) to exclude specified exit codes from the list.

    Jobs are requeued to the head of the queue. The output from the failed run is not saved, and the user is not notified by LSF.

When a job exits with any of the specified exit codes, it is requeued, but it is not dispatched to the same host again.

Begin Queue
...
REQUEUE_EXIT_VALUES=30 EXCLUDE(20)
HOSTS=hostA hostB hostC
...
End Queue 

A job in this queue can be dispatched to hostA, hostB or hostC.

If a job running on hostA exits with value 30 and is requeued, it can be dispatched to hostA, hostB, or hostC. However, if a job running on hostA exits with value 20 and is requeued, it can only be dispatched to hostB or hostC.

If the job runs on hostB and exits with a value of 20 again, it can only be dispatched to hostC. Finally, if the job runs on hostC and exits with a value of 20, it cannot be dispatched to any of the hosts, so it remains pending indefinitely.
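The walk-through above can be reproduced with a short Python sketch. It models only the host-narrowing effect described in this section; the function name and data layout are hypothetical, and real scheduling involves many more factors.

```python
def candidate_hosts(queue_hosts, failures, exclusive_codes):
    """Hosts still eligible after a list of (host, exit_code) failures."""
    # A host is ruled out only if the job exited there with an
    # exclusive requeue exit code.
    banned = {host for host, code in failures if code in exclusive_codes}
    return [host for host in queue_hosts if host not in banned]


hosts = ["hostA", "hostB", "hostC"]   # HOSTS=hostA hostB hostC
exclusive = {20}                      # EXCLUDE(20)

history = []
for failed_host, code in [("hostA", 30), ("hostA", 20),
                          ("hostB", 20), ("hostC", 20)]:
    history.append((failed_host, code))
    print(failed_host, code, "->",
          candidate_hosts(hosts, history, exclusive))
```

After the last failure the list of candidate hosts is empty, which corresponds to the job staying in the pending state.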

User-Specified Job Requeue

You can use brequeue to kill a job and requeue it. When the job is requeued, it is assigned the PEND status and the job's new position in the queue is after other jobs of the same priority.

Requeue a job

  1. To requeue one job, use brequeue.

brequeue 109

LSF kills the job with job ID 109, and requeues it in the PEND state. If job 109 has a priority of 4, it is placed after all the other jobs with the same priority.

brequeue -u User5 45 67 90 

LSF kills and requeues 3 jobs belonging to User5. The jobs have the job IDs 45, 67, and 90.

Automatic Job Rerun

Job requeue vs. job rerun

Automatic job requeue occurs when a job finishes and has a specified exit code (usually indicating some type of failure).

Automatic job rerun occurs when the execution host becomes unavailable while a job is running. It does not occur if the job itself fails.
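The distinction can be summarized as a small decision function. This Python sketch is a simplified model with hypothetical names; it considers only the two conditions described above and ignores everything else that influences scheduling.

```python
def recovery_action(host_unavailable, exit_code, rerunnable, requeue_codes):
    """Decide which automatic recovery mechanism applies, if any.

    host_unavailable: the execution host was lost while the job ran.
    exit_code: the job's exit value, or None if it never finished.
    rerunnable: automatic job rerun is enabled for the job or queue.
    requeue_codes: exit values that trigger automatic requeue.
    """
    if host_unavailable:
        # Rerun reacts to the host, not to the job's own result.
        return "rerun" if rerunnable else "none"
    if exit_code is not None and exit_code in requeue_codes:
        # Requeue reacts to the exit value of a finished job.
        return "requeue"
    return "none"
```

For example, recovery_action(False, 99, True, {99, 100}) returns "requeue", while recovery_action(True, None, True, {99, 100}) returns "rerun".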

About job rerun

When a job is rerun or restarted, it is first returned to the queue from which it was dispatched with the same options as the original job. The priority of the job is set sufficiently high to ensure the job gets dispatched before other jobs in the queue. The job uses the same job ID number. It is executed when a suitable host is available, and an email message is sent to the job owner informing the user of the restart.

Automatic job rerun can be enabled at the job level by the user, or at the queue level by the LSF administrator. If automatic job rerun is enabled, the following conditions cause LSF to rerun the job: the execution host fails, or the LSF system fails.

When LSF reruns a job, it returns the job to the submission queue, with the same job ID. LSF dispatches the job as if it was a new submission, even if the job has been checkpointed.

Once a job is rerun, LSF schedules resizable jobs based on their initial allocation request.

Execution host fails

If the execution host fails, LSF dispatches the job to another host. You receive a mail message informing you of the host failure and the requeuing of the job.

LSF system fails

If the LSF system fails, LSF requeues the job when the system restarts.

Configure queue-level job rerun

  1. To enable automatic job rerun at the queue level, set RERUNNABLE in lsb.queues to yes.

Submit a rerunnable job

  1. To enable automatic job rerun at the job level, use bsub -r.

    Interactive batch jobs (bsub -I) cannot be rerunnable.

Submit a job as not rerunnable

  1. To disable automatic job rerun at the job level, use bsub -rn.

Disable post-execution for rerunnable jobs

Running of post-execution commands upon restart of a rerunnable job may not always be desirable; for example, if the post-exec removes certain files, or does other cleanup that should only happen if the job finishes successfully.

  1. Use LSB_DISABLE_RERUN_POST_EXEC=Y in lsf.conf to prevent the post-exec from running when a job is rerun.

Platform Computing Inc.
www.platform.com