Configuring Job Controls
After a job is started, it can be killed, suspended, or resumed by the system, an LSF user, or LSF administrator. LSF job control actions cause the status of a job to change. This chapter describes how to configure job control actions to override or augment the default job control actions.
Contents
- Default Job Control Actions
- Configuring Job Control Actions
- Customizing Cross-Platform Signal Conversion
Default Job Control Actions
LSF supports the following default actions for job controls:
- SUSPEND
- RESUME
- TERMINATE
On successful completion of the job control action, the LSF job control commands cause the status of a job to change.
The environment variable LS_EXEC_T is set to the value JOB_CONTROLS for a job when a job control action is initiated.
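A script that can be started both by hand and by LSF may want to check this variable. The following is a minimal /bin/sh sketch; the function name invoked_as_job_control is hypothetical and not part of LSF.

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): succeeds when LS_EXEC_T
# has the value JOB_CONTROLS, as described above.
invoked_as_job_control() {
    [ "${LS_EXEC_T:-}" = "JOB_CONTROLS" ]
}

if invoked_as_job_control; then
    echo "running as a job control action"
else
    echo "not running as a job control action"
fi
```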
See Killing Jobs for more information about job controls and the LSF commands that perform them.
SUSPEND action
Change a running job from RUN state to one of the following states:
- USUSP or PSUSP in response to bstop
- SSUSP state when the LSF system suspends the job
The default action is to send the following signals to the job:
- SIGTSTP for parallel or interactive jobs. SIGTSTP is caught by the master process and passed to all the slave processes running on other hosts.
- SIGSTOP for sequential jobs. SIGSTOP cannot be caught by user programs. The SIGSTOP signal can be configured with the LSB_SIGSTOP parameter in lsf.conf.
LSF invokes the SUSPEND action when:
- The user or LSF administrator issues a bstop or bkill command to the job
- Load conditions on the execution host satisfy any of:
  - The suspend conditions of the queue, as specified by the STOP_COND parameter in lsb.queues
  - The scheduling thresholds of the queue or the execution host
- The run window of the queue closes
- The job is preempted by a higher priority job
RESUME action
Change a suspended job from SSUSP, USUSP, or PSUSP state to the RUN state. The default action is to send the signal SIGCONT.
LSF invokes the RESUME action when:
- The user or LSF administrator issues a bresume command to the job
- Load conditions on the execution host satisfy all of:
  - The resume conditions of the queue, as specified by the RESUME_COND parameter in lsb.queues
  - The scheduling thresholds of the queue and the execution host
- A closed run window of the queue opens again
- A preempted job finishes
TERMINATE action
Terminate a job. This usually causes the job to change to EXIT status. The default action is to send SIGINT first, then SIGTERM 10 seconds after SIGINT, then SIGKILL 10 seconds after SIGTERM. The delay between signals allows user programs to catch the signals and clean up before the job terminates.
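To use that delay, a job has to catch the signals itself. The following is a minimal sketch of a job script that does so with the shell trap builtin; the scratch directory is only a stand-in for whatever the job must release, and its name is an assumption made for this example.

```shell
#!/bin/sh
# Sketch of a job script that cleans up when it receives SIGINT or SIGTERM
# from the default TERMINATE action. SIGKILL cannot be trapped, so the
# cleanup has to finish before SIGKILL arrives.
# Assumption: the scratch directory stands in for whatever the job holds.
scratch="${TMPDIR:-/tmp}/job_scratch.$$"

cleanup() {
    rm -rf "$scratch"
}

# On SIGINT or SIGTERM: clean up, then exit with a non-zero status.
trap 'cleanup; exit 1' INT TERM

mkdir -p "$scratch"
# ... the real work of the job goes here ...
cleanup
```

If the cleanup can take longer than the interval between signals, either shorten the cleanup or lengthen the interval.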
To override the 10 second interval, use the parameter JOB_TERMINATE_INTERVAL in the lsb.params file. See the Platform LSF Configuration Reference for information about the lsb.params file.
LSF invokes the TERMINATE action when:
- The user or LSF administrator issues a bkill or brequeue command to the job
- The TERMINATE_WHEN parameter in the queue definition (lsb.queues) causes a SUSPEND action to be redirected to TERMINATE
- The job reaches its CPULIMIT, MEMLIMIT, RUNLIMIT or PROCESSLIMIT
If the execution of an action is in progress, no further actions are initiated unless it is the TERMINATE action. A TERMINATE action is issued for all job states except PEND.
Windows job control actions
On Windows, actions equivalent to the UNIX signals have been implemented to do the default job control actions. Job control messages replace the SIGINT and SIGTERM signals, but only customized applications will be able to process them. Termination is implemented by the TerminateProcess() system call.
See the Platform LSF Programmer's Guide for more information about LSF signal handling on Windows.
Configuring Job Control Actions
Several situations may require overriding or augmenting the default actions for job control. For example:
- Notifying users when their jobs are suspended, resumed, or terminated
- An application holds resources (for example, licenses) that are not freed by suspending the job. The administrator can set up an action to be performed that causes the license to be released before the job is suspended and re-acquired when the job is resumed.
- The administrator wants the job checkpointed before being:
  - Suspended when a run window closes
  - Killed when the RUNLIMIT is reached
- A distributed parallel application must receive a catchable signal when the job is suspended, resumed or terminated to propagate the signal to remote processes.
To override the default actions for the SUSPEND, RESUME, and TERMINATE job controls, specify the JOB_CONTROLS parameter in the queue definition in lsb.queues.
See the Platform LSF Configuration Reference for information about the lsb.queues file.
JOB_CONTROLS parameter (lsb.queues)
The JOB_CONTROLS parameter has the following format:
    Begin Queue
    ...
    JOB_CONTROLS = SUSPEND[signal | CHKPNT | command] \
                   RESUME[signal | command] \
                   TERMINATE[signal | CHKPNT | command]
    ...
    End Queue
When LSF needs to suspend, resume, or terminate a job, it invokes one of the following actions as specified by SUSPEND, RESUME, and TERMINATE.
signal
A UNIX signal name (for example, SIGTSTP or SIGTERM). The specified signal is sent to the job.
Not all UNIX systems support the same set of signals. To display a list of the symbolic names of the signals (without the SIG prefix) supported on your system, use the kill -l command.
CHKPNT
Checkpoint the job. Only valid for SUSPEND and TERMINATE actions.
- If the SUSPEND action is CHKPNT, the job is checkpointed and then stopped by sending the SIGSTOP signal to the job automatically.
- If the TERMINATE action is CHKPNT, then the job is checkpointed and killed automatically.
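Before putting a signal name into a JOB_CONTROLS action, it can be worth checking that the execution host knows it. The following is a rough sketch built on kill -l; the function name signal_supported is hypothetical, and the output format of kill -l varies between shells, so the sketch accepts names with or without the SIG prefix.

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): succeeds if the given signal
# name, with or without the SIG prefix, appears in the output of kill -l
# in the current shell.
signal_supported() {
    name=${1#SIG}
    # Put every word on its own line, drop any SIG prefix, then look for
    # an exact whole-line match.
    kill -l | tr ' \t' '\n\n' | sed 's/^SIG//' | grep -Fx "$name" >/dev/null
}
```

For example, signal_supported SIGTSTP && echo "SIGTSTP is available" prints the message only on hosts where the shell lists that signal.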
command
A /bin/sh command line.
- Do not quote the command line inside an action definition.
- Do not specify a signal followed by an action that triggers the same signal (for example, do not specify JOB_CONTROLS=TERMINATE[bkill] or JOB_CONTROLS=TERMINATE[brequeue]). This will cause a deadlock between the signal and the action.
Using a command as a job control action
- The command line for the action is run with /bin/sh -c so you can use shell features in the command.
- The command is run as the user of the job.
- All environment variables set for the job are also set for the command action. The following additional environment variables are set:
  - LSB_JOBPGIDS - a list of current process group IDs of the job
  - LSB_JOBPIDS - a list of current process IDs of the job
- For the SUSPEND action command, the environment variables LSB_SUSP_REASONS and LSB_SUSP_SUBREASONS are also set. Use them together in your custom job control to determine the exact load threshold that caused a job to be suspended.
  - LSB_SUSP_REASONS - an integer representing a bitmap of suspending reasons as defined in lsbatch.h. The command can take different actions depending on the reason for suspending the job.
  - LSB_SUSP_SUBREASONS - an integer representing the load index that caused the job to be suspended. When the suspending reason SUSP_LOAD_REASON (suspended by load) is set in LSB_SUSP_REASONS, LSB_SUSP_SUBREASONS is set to one of the load index values defined in lsf.h.
- The standard input, output, and error of the command are redirected to the null device, so you cannot tell directly whether the command runs correctly. The default null device on UNIX is /dev/null.
- Make sure the command line is correct. To see the output from the command line for testing purposes, redirect the output to a file inside the command line.
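The points above can be combined into a small script used as a SUSPEND command. The following is a sketch only. It assumes that the configured command takes the place of the default signal, so the script stops the job's processes itself. The log path, the SUSPEND_LOG variable, the function names, and the SUSP_LOAD_REASON_BIT placeholder are all invented for this example; take the real bit value from lsbatch.h on your installation.

```shell
#!/bin/sh
# Sketch of a script for use as a SUSPEND command action.
# Assumptions (not from the LSF documentation): the log path and the
# SUSP_LOAD_REASON_BIT value are placeholders. With the placeholder 0 the
# load check never matches; set the real value from lsbatch.h.
LOG="${SUSPEND_LOG:-/tmp/suspend_action.$$.log}"
SUSP_LOAD_REASON_BIT="${SUSP_LOAD_REASON_BIT:-0}"

# Succeeds if mask $2 is non-zero and all of its bits are set in bitmap $1.
reason_is_set() {
    [ "$2" -ne 0 ] && [ $(( $1 & $2 )) -eq "$2" ]
}

suspend_job() {
    # Everything is appended to the log because LSF redirects the
    # command's own output to the null device.
    {
        echo "suspending job ${LSB_JOBID:-unknown}: reasons=${LSB_SUSP_REASONS:-0}"
        if reason_is_set "${LSB_SUSP_REASONS:-0}" "$SUSP_LOAD_REASON_BIT"; then
            echo "suspended by load, load index=${LSB_SUSP_SUBREASONS:-unknown}"
        fi
        # Stop every process of the job listed in LSB_JOBPIDS.
        for pid in ${LSB_JOBPIDS:-}; do
            kill -STOP "$pid" || echo "could not stop $pid"
        done
    } >>"$LOG" 2>&1
}

# Only act when LSF started this script as a job control action.
if [ "${LS_EXEC_T:-}" = "JOB_CONTROLS" ]; then
    suspend_job
fi
```

Keeping the bit value in a variable avoids hard-coding a number that may differ between LSF versions.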
TERMINATE job actions
Use caution when configuring TERMINATE job actions that do more than just kill a job. For example, resource usage limits that terminate jobs change the job state to SSUSP while LSF waits for the job to end. If the job is not killed by the TERMINATE action, it remains suspended indefinitely.
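One way to reduce that risk is to make sure the extra work can never prevent the final kill. The following is a sketch under stated assumptions: the function names, the TERMINATE_LOG variable, and the grace period argument are hypothetical, and the default of 10 seconds simply mirrors the default interval described earlier.

```shell
#!/bin/sh
# Sketch of a TERMINATE command action that does extra work but always
# ends by killing the job. notify_owner and TERMINATE_LOG are hypothetical
# placeholders, not LSF features.
notify_owner() {
    echo "job ${LSB_JOBID:-unknown} is being terminated" \
        >>"${TERMINATE_LOG:-/dev/null}"
}

terminate_job() {
    grace=${1:-10}          # seconds between SIGTERM and SIGKILL
    notify_owner || true    # a failed notification must not block the kill
    for pid in ${LSB_JOBPIDS:-}; do
        kill -TERM "$pid" 2>/dev/null || true
    done
    sleep "$grace"
    # SIGKILL cannot be caught, so any process still alive ends here.
    for pid in ${LSB_JOBPIDS:-}; do
        kill -KILL "$pid" 2>/dev/null || true
    done
    return 0
}

# Only act when LSF started this script as a job control action.
if [ "${LS_EXEC_T:-}" = "JOB_CONTROLS" ]; then
    terminate_job
fi
```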
TERMINATE_WHEN parameter (lsb.queues)
In certain situations you may want to terminate the job instead of calling the default SUSPEND action. For example, you may want to kill jobs if the run window of the queue is closed. Use the TERMINATE_WHEN parameter to configure the queue to invoke the TERMINATE action instead of SUSPEND.
See the Platform LSF Configuration Reference for information about the lsb.queues file and the TERMINATE_WHEN parameter.
Syntax
    TERMINATE_WHEN = [LOAD] [PREEMPT] [WINDOW]
Example
The following defines a night queue that will kill jobs if the run window closes.
    Begin Queue
    NAME = night
    RUN_WINDOW = 20:00-08:00
    TERMINATE_WHEN = WINDOW
    JOB_CONTROLS = TERMINATE[ kill -KILL $LSB_JOBPIDS; echo "job $LSB_JOBID killed by queue run window" | mail $USER ]
    End Queue
LSB_SIGSTOP parameter (lsf.conf)
Use LSB_SIGSTOP to configure the SIGSTOP signal sent by the default SUSPEND action.
If LSB_SIGSTOP is set to anything other than SIGSTOP, the SIGTSTP signal that is normally sent by the SUSPEND action is not sent. For example, if LSB_SIGSTOP=SIGKILL, the three default signals sent by the TERMINATE action (SIGINT, SIGTERM, and SIGKILL) are sent 10 seconds apart.
See the Platform LSF Configuration Reference for information about the lsf.conf file.
Avoiding signal and action deadlock
Do not configure a job control action that contains the signal or command that triggers that same action. This will cause a deadlock between the signal and the action.
For example, the bkill command uses the TERMINATE action, so a deadlock results when the TERMINATE action itself contains the bkill command.
Any of the following job control specifications will cause a deadlock:
JOB_CONTROLS=TERMINATE[bkill]
JOB_CONTROLS=TERMINATE[brequeue]
JOB_CONTROLS=RESUME[bresume]
JOB_CONTROLS=SUSPEND[bstop]
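A simple check can catch these combinations before a queue definition is deployed. The following is a rough heuristic sketch, not an LSF tool: the function name action_is_deadlock_free is hypothetical, and it only looks at whitespace-separated words, so commands hidden inside other scripts are not detected.

```shell
#!/bin/sh
# Hypothetical lint helper (not an LSF command): fails if the command
# configured for a job control action invokes an LSF command that
# triggers that same action. The pairs are the ones listed above.
action_is_deadlock_free() {
    action=$1
    command=$2
    case "$action" in
        TERMINATE) forbidden="bkill brequeue" ;;
        RESUME)    forbidden="bresume" ;;
        SUSPEND)   forbidden="bstop" ;;
        *)         return 0 ;;
    esac
    for word in $command; do
        # Compare the base name without a trailing ";" so that full paths
        # and command separators do not hide a match.
        base=${word##*/}
        base=${base%;}
        for bad in $forbidden; do
            [ "$base" = "$bad" ] && return 1
        done
    done
    return 0
}
```

For example, action_is_deadlock_free TERMINATE "kill -KILL 1234" succeeds, while action_is_deadlock_free TERMINATE "bkill" fails.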
Customizing Cross-Platform Signal Conversion
LSF supports signal conversion between UNIX and Windows for remote interactive execution through RES.
On Windows, the CTRL+C and CTRL+BREAK key combinations are treated as signals for console applications (these signals are also called console control actions).
LSF supports these two Windows console signals for remote interactive execution. LSF regenerates these signals for user tasks on the execution host.
Default signal conversion
In a mixed Windows/UNIX environment, LSF has the following default conversion between the Windows console signals and the UNIX signals:
For example, if you issue the lsrun or bsub -I commands from a Windows console but the task is running on a UNIX host, pressing the CTRL+C keys will generate a UNIX SIGINT signal to your task on the UNIX host. The opposite is also true.
Custom signal conversion
For lsrun (but not bsub -I), LSF allows you to define your own signal conversion using the following environment variables:
- LSF_NT2UNIX_CTRLC
- LSF_NT2UNIX_CTRLB
For example:
- LSF_NT2UNIX_CTRLC=SIGXXXX
- LSF_NT2UNIX_CTRLB=SIGYYYY
Here, SIGXXXX and SIGYYYY are UNIX signal names such as SIGQUIT or SIGINT. The conversions will then be: CTRL+C=SIGXXXX and CTRL+BREAK=SIGYYYY.
If both LSF_NT2UNIX_CTRLC and LSF_NT2UNIX_CTRLB are set to the same value (LSF_NT2UNIX_CTRLC=SIGXXXX and LSF_NT2UNIX_CTRLB=SIGXXXX), CTRL+C will be generated on the Windows execution host.
For bsub -I, there is no conversion other than the default conversion.
Platform Computing Inc.
www.platform.com