Job chunking is done after a suitable host is found for the job. MultiCluster jobs can be chunked, but they are forwarded to the remote execution cluster one at a time, and chunked in the execution cluster. Therefore, the CHUNK_JOB_SIZE parameter in the submission queue is ignored by MultiCluster jobs that are forwarded to a remote cluster for execution.
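Chunking is controlled per queue with the CHUNK_JOB_SIZE parameter in lsb.queues. The following fragment is a sketch; the queue name and values are illustrative, not required settings:

```
Begin Queue
QUEUE_NAME     = short_jobs
CHUNK_JOB_SIZE = 4        # dispatch up to 4 jobs together as one chunk
PRIORITY       = 40
DESCRIPTION    = Queue for short sequential jobs
End Queue
```

As described above, this parameter takes effect only in the cluster where the jobs actually run; a submission queue's CHUNK_JOB_SIZE is ignored for jobs forwarded to a remote cluster.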
If MultiCluster jobs are chunked, and one job in the chunk starts to run, both clusters display the WAIT status for the remaining jobs. However, the execution cluster sees these jobs in the PEND state, while the submission cluster sees these jobs in the RUN state. This affects scheduling calculations for fairshare and limits on both clusters.
If fairshare scheduling is enabled, resource usage information is a factor used in the calculation of dynamic user priority. MultiCluster jobs count towards a user’s fairshare priority in the execution cluster, and do not affect fairshare calculations in the submission cluster.
There is no requirement that both clusters use fairshare or have the same fairshare policies. However, if you submit a job and specify a local user group for fairshare purposes (bsub -G), your job cannot run remotely unless you also belong to a user group of the same name in the execution cluster.
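For example, if fairshare is configured in a queue as follows (an illustrative sketch; the group name groupA and share values are assumptions, and FAIRSHARE syntax is per lsb.queues):

```
Begin Queue
QUEUE_NAME = normal
FAIRSHARE  = USER_SHARES[[groupA, 10] [default, 1]]
End Queue
```

then a job submitted with bsub -G groupA can only run remotely if a user group named groupA is also defined in the execution cluster and the submitting user belongs to it there.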
For more information on fairshare, see Administering Platform LSF.
A parallel job can be forwarded to another cluster, but the job cannot start unless the execution cluster has enough hosts and resources to run the entire job. A parallel job cannot span clusters.
Resizable jobs are not supported across MultiCluster clusters. This implies the following behavior:
Only bresize release is supported in the job forwarding model from the execution cluster:
The submission cluster logs all events related to bresize release in its lsb.events file.
The submission cluster logs JOB_RESIZE events in the lsb.acct file after the allocation is changed.
Users can view allocation changes from the submission cluster through bjobs, bhist, bacct, busers, bqueues, and so on.
If job requeue is enabled, LSF requeues jobs that finish with exit codes that indicate job failure.
For more information on job requeue, see Administering Platform LSF.
brequeue in the submission cluster causes the job to be requeued in the send-jobs queue.
brequeue in the execution cluster causes the job to be requeued in the receive-jobs queue.
If job requeue (REQUEUE_EXIT_VALUES in lsb.queues) is enabled in the receive-jobs queue, and the job’s exit code matches, the execution cluster requeues the job (it does not return to the submission cluster). Exclusive job requeue works properly.
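A receive-jobs queue with job requeue enabled might look like the following sketch (the queue name, cluster name, and exit values are illustrative):

```
Begin Queue
QUEUE_NAME          = receive_jobs
RCVJOBS_FROM        = submission_cluster
REQUEUE_EXIT_VALUES = 99 100 EXCLUDE(9)
End Queue
```

With this configuration, a forwarded job that exits with code 99 or 100 is requeued within the execution cluster, and exit code 9 triggers exclusive requeue, so the job is not dispatched to the same execution host again.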
If the execution cluster does not requeue the job, the job returns to the send-jobs cluster, and gets a second chance to be requeued. If job requeue is enabled in the send-jobs queue, and the job’s exit code matches, the submission cluster requeues the job.
Exclusive job requeue values configured in the send-jobs queue always cause the job to be requeued, but for MultiCluster jobs the exclusive feature does not work; these jobs could be dispatched to the same remote execution host as before.
The pre-execution command retry limit (MAX_PREEXEC_RETRY, LOCAL_MAX_PREEXEC_RETRY, and REMOTE_MAX_PREEXEC_RETRY), job requeue limit (MAX_JOB_REQUEUE), and job preemption retry limit (MAX_JOB_PREEMPT) configured in lsb.params, lsb.queues, and lsb.applications on the execution cluster are applied.
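These limits can be set at the queue level on the execution cluster, for example (values shown are illustrative assumptions):

```
Begin Queue
QUEUE_NAME               = receive_jobs
MAX_JOB_REQUEUE          = 3     # requeue a job at most 3 times
LOCAL_MAX_PREEXEC_RETRY  = 5     # pre-exec retries for local jobs
REMOTE_MAX_PREEXEC_RETRY = 5     # pre-exec retries for forwarded jobs
MAX_JOB_PREEMPT          = 2     # preemption retry limit
End Queue
```

The same parameters can also be set cluster-wide in lsb.params or per application profile in lsb.applications; in all cases, the execution cluster's values govern forwarded jobs.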
If the forwarded job's requeue count exceeds the limit on the execution cluster, the job exits and returns to the submission cluster, where it remains pending for rescheduling.
If job rerun is enabled, LSF automatically restarts running jobs that are interrupted due to failure of the execution host.
If queue-level job rerun (RERUNNABLE in lsb.queues) is enabled in both send-jobs and receive-jobs queues, only the receive-jobs queue reruns the job.
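Job rerun is enabled per queue with the RERUNNABLE parameter in lsb.queues, for example (queue name is illustrative):

```
Begin Queue
QUEUE_NAME = receive_jobs
RERUNNABLE = yes     # automatically rerun jobs interrupted by host failure
End Queue
```

When this is set in the receive-jobs queue, the execution cluster handles the rerun, regardless of the setting in the send-jobs queue.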
For more information on job rerun, see Administering Platform LSF.
If job rerun is enabled in the receive-jobs queue, the execution cluster reruns the job. While the job is pending in the execution cluster, the job status is returned to the submission cluster.
If the receive-jobs queue does not enable job rerun, the job returns to the submission cluster and gets a second chance to be rerun. If job rerun is enabled at the user level, or is enabled in the send-jobs queue, the submission cluster reruns the job.