Kernel-level checkpoint and restart is enabled by default. LSF users make a job checkpointable in one of two ways: by submitting the job with bsub -k and specifying a checkpoint directory, or by submitting the job to a queue that defines a checkpoint directory in its CHKPNT parameter.
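As a minimal sketch of the first option, the helper below composes a bsub -k submission line. The directory, period, and job name are hypothetical values, and the helper prints the command instead of running it so that it can be read without access to a cluster.

```shell
# Hypothetical site values, not taken from the text above.
CHKPNT_DIR="/share/chkpnt/my_job"   # assumed checkpoint directory
CHKPNT_PERIOD=240                   # assumed checkpoint period (bsub -k takes minutes)

# build_submit_cmd is an illustrative helper, not part of LSF.
# It prints the bsub command line for a checkpointable job:
#   $1 = checkpoint directory, $2 = checkpoint period, $3 = job command
build_submit_cmd() {
  printf 'bsub -k "%s %s" %s\n' "$1" "$2" "$3"
}

# Print the command for review; paste it into a shell to submit.
build_submit_cmd "$CHKPNT_DIR" "$CHKPNT_PERIOD" my_job
```

For the second option, the queue definition would carry an entry along the lines of CHKPNT = /share/chkpnt 240, and jobs submitted to that queue need no -k option.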
To enable user-level checkpoint and restart, you must link your application object files to the LSF checkpoint libraries provided in LSF_LIBDIR. You do not have to change any code within your application. For instructions on how to link application files, see the Platform LSF Programmer’s Guide.
For application-level checkpoint and restart, once LSF_SERVERDIR contains one or more checkpoint and restart executables, users can specify the external checkpoint executable to associate with each checkpointable job they submit. At restart, LSF invokes the corresponding external restart executable.
Each checkpoint or restart executable must have a directory/name combination that is unique within the cluster. For example, you can write two different checkpoint executables that are both named echkpnt.fluent and save them as LSF_SERVERDIR/echkpnt.fluent and my_execs/echkpnt.fluent. To run checkpoint and restart executables from a directory other than LSF_SERVERDIR, you must configure the parameter LSB_ECHKPNT_METHOD_DIR in lsf.conf.
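The effect of LSB_ECHKPNT_METHOD_DIR can be pictured with the following sketch. It is an illustrative model only, not LSF's actual lookup code: resolve_echkpnt is a hypothetical helper, and the paths in the usage lines are assumed values.

```shell
# resolve_echkpnt is a hypothetical helper that models which file would be
# chosen for a given method name. It uses LSB_ECHKPNT_METHOD_DIR when that
# variable is set and non-empty, and falls back to LSF_SERVERDIR otherwise.
#   $1 = method name, for example "fluent"
resolve_echkpnt() {
  method="$1"
  dir="${LSB_ECHKPNT_METHOD_DIR:-$LSF_SERVERDIR}"
  printf '%s/echkpnt.%s\n' "$dir" "$method"
}

# Usage with assumed paths:
LSF_SERVERDIR="/opt/lsf/etc"
LSB_ECHKPNT_METHOD_DIR=""
resolve_echkpnt fluent    # resolves under LSF_SERVERDIR

LSB_ECHKPNT_METHOD_DIR="/home/user1/my_execs"
resolve_echkpnt fluent    # resolves under the configured directory
```

In lsf.conf the setting itself is a single line, such as LSB_ECHKPNT_METHOD_DIR=/home/user1/my_execs, with the path being an assumption for this example.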
An echkpnt.application must return a value of 0 when checkpointing succeeds and a non-zero value when checkpointing fails.
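The return-value contract can be sketched as follows. This is a skeleton under stated assumptions, not a working checkpoint method: save_state is a hypothetical placeholder for the application-specific step, and the arguments that LSF passes to the executable are not covered here.

```shell
# save_state is a hypothetical stand-in for the real checkpoint step.
# Here it only writes a marker file into the given directory.
#   $1 = directory to write into
save_state() {
  dir="$1"
  if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
    return 1
  fi
  date > "$dir/state.marker" || return 1
  return 0
}

# run_checkpoint maps the outcome of save_state to the status that
# LSF expects: 0 when checkpointing succeeds, non-zero when it fails.
run_checkpoint() {
  if save_state "$1"; then
    return 0
  fi
  echo "checkpoint failed for $1" >&2
  return 1
}
```

An echkpnt.application script built on this skeleton would end by calling run_checkpoint with its checkpoint directory and then exiting with that function's status, so that the result reaches LSF unchanged.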