Checkpoint a MultiCluster job

Checkpointing of a MultiCluster job is supported only when the send-jobs queue is configured to forward jobs to a single remote receive-jobs queue and never uses local hosts.

Checkpointable MultiCluster jobs resume on the same host.

For more information on checkpointing, see Administering Platform LSF.

Configuration

Checkpointing MultiCluster jobs

To enable checkpointing of MultiCluster jobs, define a checkpoint directory in one of the following places:
  1. In both the send-jobs and receive-jobs queues (CHKPNT in lsb.queues).
  2. In an application profile (CHKPNT_DIR, CHKPNT_PERIOD, CHKPNT_INITPERIOD, CHKPNT_METHOD in lsb.applications) of both the submission cluster and the execution cluster.

LSF uses the directory specified in the execution cluster and ignores the directory specified in the submission cluster.
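As an illustration of the queue-level option, the following lsb.queues fragments sketch a send-jobs queue and its matching receive-jobs queue. The queue names (send_q, recv_q), the cluster names, the directory /share/chkpnt, and the 30-minute period are hypothetical placeholders. HOSTS = none is included on the assumption that it is how your site keeps the send-jobs queue off local hosts. Check the parameters against your own lsb.queues reference before using them.

```text
# lsb.queues in the submission cluster (hypothetical name: clusterA)
Begin Queue
QUEUE_NAME   = send_q
SNDJOBS_TO   = recv_q@clusterB     # forward to a single remote receive-jobs queue
HOSTS        = none                # assumption: never use local hosts
CHKPNT       = /share/chkpnt 30    # checkpoint directory, optional period in minutes
End Queue

# lsb.queues in the execution cluster (hypothetical name: clusterB)
Begin Queue
QUEUE_NAME   = recv_q
RCVJOBS_FROM = clusterA
CHKPNT       = /share/chkpnt 30    # this directory is the one LSF actually uses
End Queue
```

Because only the execution cluster's directory is used, the value in the send-jobs queue mainly serves to mark the jobs as checkpointable.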

LSF writes the checkpoint file in a subdirectory named with the submission cluster name and submission cluster job ID. This allows LSF to checkpoint multiple jobs to the same checkpoint directory. For example, suppose the submission cluster is clusterA, the submission job ID is 789, and the send-jobs queue enables checkpointing. The job is forwarded to clusterB, where the execution job ID is 123 and the receive-jobs queue specifies a checkpoint directory called XYZ_dir. LSF saves the checkpoint file in:

XYZ_dir/clusterA/789/
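The naming rule can be sketched in shell as follows. The variable names are illustrative only and are not LSF environment variables. Note that the execution cluster job ID (123) plays no part in the path.

```shell
#!/bin/sh
# Build the checkpoint subdirectory from the values in the example above.
# All variable names here are hypothetical, chosen for illustration.
chkpnt_dir="XYZ_dir"             # directory from the receive-jobs queue
submission_cluster="clusterA"    # cluster where the job was submitted
submission_job_id=789            # job ID in the submission cluster

checkpoint_path="${chkpnt_dir}/${submission_cluster}/${submission_job_id}/"
echo "$checkpoint_path"
```

A script like this can help an administrator locate the checkpoint files for a forwarded job when only the submission-side details are known.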
Tip:

You cannot use bsub -k to make a MultiCluster job checkpointable.

Checkpoint a job

To checkpoint and stop a MultiCluster job, run bmig in the execution cluster and specify the local job ID.
Tip:

You cannot run bmig from the submission cluster. You cannot use bmig -m to specify a host.

Force a checkpointed job

Use brun to force any pending job to be dispatched immediately to a specific host, regardless of user limits and fairshare priorities. This is the only way to resume a checkpointed job on a different host. By default, these jobs attempt to restart from the last checkpoint.
Tip:

Use brun -b to make checkpointable jobs start over from the beginning. For example, this might be necessary if the new host does not have access to the old checkpoint directory.
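The two brun forms described above can be sketched with a small helper that only prints the command line instead of running it. The function name brun_command, the host name hostB, and the job ID 123 are hypothetical. Which job ID applies in your MultiCluster setup is not covered here and is left as an assumption.

```shell
#!/bin/sh
# Print the brun command that would force a job onto a given host.
# Usage: brun_command HOST JOB_ID [resume|restart]
#   resume  (default): the job attempts to restart from its last checkpoint
#   restart          : adds -b so the job starts over from the beginning
brun_command() {
    host="$1"
    job_id="$2"
    mode="${3:-resume}"
    if [ "$mode" = "restart" ]; then
        echo "brun -b -m $host $job_id"
    else
        echo "brun -m $host $job_id"
    fi
}

brun_command hostB 123
brun_command hostB 123 restart
```

Printing the command first gives you a chance to confirm the host and job ID before forcing a dispatch.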

Example

In this example, users in a remote cluster submit work to a data center using a send-jobs queue that is configured to forward jobs to only one receive-jobs queue. You are the administrator of the data center and you need to shut down a host for maintenance. The host is busy running checkpointable MultiCluster jobs.

  1. Before you perform maintenance on a host in the execution cluster, take these steps:
    1. Run badmin hclose to close the host and prevent additional jobs from starting on the host.
    2. Run bmig and specify the execution cluster job IDs of the checkpointable MultiCluster jobs running on the host. For example, if jobs from a remote cluster use job IDs 123 and 456 in the local cluster, type the following command to checkpoint and stop the jobs:
      bmig 123 456

      You cannot use bmig -m to specify a host.

    3. Allow the checkpoint process to complete. The jobs are requeued to the submission cluster. From there, they are forwarded to the same receive-jobs queue again and scheduled on the same host. However, while the host is closed, they do not start.
    4. Shut down LSF daemons on the host.
  2. After you perform maintenance on the host, take these steps:
    1. Start LSF daemons on the host.
    2. Use badmin hopen to open the host. The MultiCluster jobs resume automatically.
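The LSF commands in this procedure can be collected into a short script, sketched below. It defaults to a dry run that only prints each command. The variable names (MAINT_HOST, MAINT_JOB_IDS, DRY_RUN), the host name hostA, and the helper functions are hypothetical. Waiting for the checkpoint to complete and stopping or starting the LSF daemons are left as manual steps, because the exact commands depend on your site.

```shell
#!/bin/sh
# Sketch of the maintenance sequence. With DRY_RUN=1 (the default here)
# commands are printed, not executed. Set DRY_RUN=0 on a real cluster.
DRY_RUN="${DRY_RUN:-1}"
MAINT_HOST="${MAINT_HOST:-hostA}"          # hypothetical host name
MAINT_JOB_IDS="${MAINT_JOB_IDS:-123 456}"  # execution cluster job IDs

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

before_maintenance() {
    run badmin hclose "$MAINT_HOST"   # stop new jobs from starting on the host
    # Intentionally unquoted so each job ID becomes a separate argument.
    run bmig $MAINT_JOB_IDS           # checkpoint and stop the MultiCluster jobs
    # Manual steps: wait for checkpointing to finish, then shut down
    # the LSF daemons on the host.
}

after_maintenance() {
    # Manual step: start the LSF daemons on the host first.
    run badmin hopen "$MAINT_HOST"    # reopen the host so the jobs can resume
}

# Run one phase when called as: script.sh before | after
case "${1:-}" in
    before) before_maintenance ;;
    after)  after_maintenance ;;
esac
```

Run the script with the argument before ahead of the maintenance window and with after once the daemons are back up.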