LSF collects job information and reports the final status of a job. A job that finishes normally reports an exit status of 0; any non-zero status means that the job exited abnormally.
Most of the time, an abnormal job exit is related either to the job itself or to the system it ran on, not to an LSF error. This document explains some of the information LSF provides about abnormal job termination.
If an LSF server host fails, jobs running on that host are lost. No other jobs are affected. For jobs to be automatically rerun from the beginning, or restarted from a checkpoint on another host, after they are lost to a host failure, you must submit them with the appropriate options.
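For example, a sketch using the standard bsub resubmission options (the job script name, checkpoint directory, and checkpoint period are illustrative):

```shell
# Submit a rerunnable job: if the execution host fails, LSF
# automatically reruns the job from the beginning on another host.
bsub -r ./myjob.sh

# Submit a checkpointable job (directory and 10-minute period are
# illustrative); LSF can restart it from the last checkpoint.
bsub -k "/share/ckpt 10" ./myjob.sh
```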
If all of the hosts in a cluster go down, all running jobs are lost. When a host comes back up and takes over as master, it reads the lsb.events file to get the state of all batch jobs. Jobs that were running when the systems went down are assumed to have exited, and email is sent to the submitting user. Pending jobs remain in their queues, and are scheduled as hosts become available.
A job might terminate abnormally for various reasons, and from any state. An abnormally terminated job goes into the EXIT state. Situations in which a job terminates abnormally include:
The job is cancelled by its owner or the LSF administrator, either while pending or after being dispatched to a host.
The job cannot be dispatched before it reaches its termination deadline, and is therefore aborted by LSF.
The job fails to start successfully; for example, the user specified the wrong executable when the job was submitted.
The job exits with a non-zero exit status.
You can configure hosts so that LSF detects an abnormally high rate of job exit from a host. See Administering Platform LSF for more information.
The most common cause of abnormal LSF job termination is the application's own exit value. If your application exits with an explicit value less than 128, bjobs and bhist display the actual exit code of the application; for example, "Exited with exit code 3." You would have to refer to the application code for the meaning of exit code 3.
It is possible for a job to explicitly exit with an exit code greater than 128, which can be confused with the corresponding UNIX signal. Make sure that applications you write do not use exit codes greater than 128.
When you send a signal that terminates the job, LSF reports either the signal or the signal_value+128. If the return status is greater than 128, and the job was terminated with a signal, then return_status-128=signal. For example, return status 133 means that the job was terminated with signal 5 (SIGTRAP on most systems, 133-128=5). A job with exit status 130 was terminated with signal 2 (SIGINT on most systems, 130-128 = 2).
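The same return_status - 128 = signal arithmetic can be reproduced outside LSF in a POSIX shell (a minimal sketch; SIGTERM is signal 15 on most systems):

```shell
# A child shell terminates itself with SIGTERM (signal 15).
sh -c 'kill -TERM $$'
status=$?
# The parent shell reports 128 + 15 = 143.
echo "status=$status signal=$((status - 128))"
# prints: status=143 signal=15
```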
Some operating systems define exit codes in the range 0-255. As a result, negative exit values, or values greater than 255, wrap around into that range. The most common example is a program that exits with -1, which LSF reports as exit code 255.
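This wrap-around can be demonstrated in a POSIX shell (a sketch, independent of LSF): the operating system truncates the exit status to 8 bits, so only the low byte of the value survives.

```shell
# Values greater than 255 are reduced modulo 256.
sh -c 'exit 257'
echo "257 wraps to $?"              # 257 mod 256 = 1

# The low byte of -1 in two's complement is 255, which is
# why a program that exits -1 is reported as exit code 255.
echo "-1 wraps to $(( -1 & 255 ))"
```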
How or why the job may have been signaled, or exited with a certain exit code, can be application and/or system specific. The application or system logs might be able to give a better description of the problem.
In most cases, bjobs and bhist show the application exit value (128 + signal). In some cases, bjobs and bhist show the actual signal value.
If LSF sends catchable signals to the job, it displays the exit value. For example, if you run bkill jobID to kill the job, LSF passes SIGINT, which causes the job to exit with exit code 130 (SIGINT is 2 on most systems, 128+2 = 130).
The following example shows a job that exited with exit code 139, which means that the job was terminated with signal 11 (SIGSEGV on most systems, 139-128=11). This means that the application had a core dump.
bjobs -l 2012
Job <2012>, User , Project , Status , Queue , Command
Fri Dec 27 22:47:28: Submitted from host , CWD <$HOME>;
Fri Dec 27 22:47:37: Started on , Execution Home , Execution CWD ;
Fri Dec 27 22:48:02: Exited with exit code 139. The CPU time used is 0.2 seconds.

SCHEDULING PARAMETERS:
          r15s   r1m  r15m   ut    pg    io   ls   it   tmp   swp   mem
loadSched   -     -     -     -     -     -    -    -     -     -     -
loadStop    -     -     -     -     -     -    -    -     -     -     -

          cpuspeed  bandwidth
loadSched     -         -
loadStop      -         -
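Outside LSF, the same status can be reproduced at the shell level (a sketch): a child process killed by SIGSEGV (signal 11 on most systems) is reported with status 128 + 11 = 139.

```shell
# The child shell kills itself with SIGSEGV; the parent shell
# reports the wait status as 128 + 11 = 139.
sh -c 'kill -SEGV $$' 2>/dev/null
echo "status=$?"
# prints: status=139
```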
When LSF takes action on a job, it may send multiple signals. To terminate a job, LSF sends SIGINT, SIGTERM, and SIGKILL in succession until the job terminates. As a result, the job may exit with any of the corresponding exit values at the system level. Other actions may send warning signals (such as SIGUSR2) to applications. For specific signal sequences, refer to the LSF documentation for that feature.
Run bhist to see the actions that LSF takes on a job:
bhist -l 1798
Job <1798>, User <user1>, Command <sleep 10000>
Tue Feb 25 16:35:31: Submitted from host <hostA>, to Queue <normal>, CWD <$HOME/lsf_7.0/conf/lsbatch/lsf_7.0/configdir>;
Tue Feb 25 16:35:51: Dispatched to <hostA>;
Tue Feb 25 16:35:51: Starting (Pid 12955);
Tue Feb 25 16:35:53: Running with execution home </home/user1>, Execution CWD </home/user1/Testing/lsf_7.0/conf/lsbatch/lsf_7.0/configdir>, Execution Pid <12955>;
Tue Feb 25 16:38:20: Signal <KILL> requested by user or administrator <user1>;
Tue Feb 25 16:38:22: Exited with exit code 130. The CPU time used is 0.1 seconds;

Summary of time in seconds spent in various states by Tue Feb 25 16:38:22
PEND    PSUSP   RUN    USUSP   SSUSP   UNKWN   TOTAL
20      0       151    0       0       0       171
Here we see that LSF itself sent the signal to terminate the job, and the job exited with 130 (130 - 128 = 2 = SIGINT).
Use bacct -l to view job exit information logged to lsb.acct:
bacct -l 7265

Accounting information about jobs that are:
- submitted by all users.
- accounted on all projects.
- completed normally or exited.
- executed on all hosts.
- submitted to all queues.
- accounted on all service classes.
------------------------------------------------------------------------------
Job <7265>, User <lsfadmin>, Project <default>, Status <EXIT>, Queue <normal>,
Command <srun sleep 100000>
Thu Sep 16 15:22:09: Submitted from host <hostA>, CWD <$HOME>;
Thu Sep 16 15:22:20: Dispatched to 4 Hosts/Processors <4*hostA>;
Thu Sep 16 15:23:21: Completed <exit>; TERM_RUNLIMIT: job killed after reaching LSF run time limit.

Accounting information about this job:
Share group charged </lsfadmin>
CPU_T   WAIT   TURNAROUND   STATUS   HOG_FACTOR   MEM   SWAP
0.04    11     72           exit     0.0006       0K    0K
------------------------------------------------------------------------------
SUMMARY:      ( time unit: second )
Total number of done jobs:      0    Total number of exited jobs:    1
Total CPU time consumed:      0.0    Average CPU time consumed:    0.0
Maximum CPU time of a job:    0.0    Minimum CPU time of a job:    0.0
Total wait time in queues:   11.0
Average wait time in queue:  11.0
Maximum wait time in queue:  11.0    Minimum wait time in queue:  11.0
Average turnaround time:       72    (seconds/job)
Maximum turnaround time:       72    Minimum turnaround time:       72
Average hog factor of a job: 0.00    ( cpu time / turnaround time )
Maximum hog factor of a job: 0.00    Minimum hog factor of a job: 0.00
When LSF detects that a job is terminated, bacct -l displays one of the following termination reasons:
If a queue-level JOB_CONTROL is configured, LSF cannot determine the result of the action; the termination reason reflects only what LSF believes the reason to be.
LSF is not guaranteed to catch external signals sent directly to the job.
In MultiCluster, a brequeue request sent from the submission cluster is translated to TERM_OWNER or TERM_ADMIN in the remote execution cluster. The termination reason in the email notification sent from the execution cluster, and in lsb.acct, is set to TERM_OWNER or TERM_ADMIN.
LSF also provides additional information in the POST_EXEC of the job. Use this information to detect conditions where LSF has terminated the job and take the appropriate action.
The job exit information in the POST_EXEC is defined in two parts:
Queue-level POST_EXEC commands should be written by the cluster administrator to perform whatever task is necessary for specific exit situations.
System-level enforced limits, such as the CPU and memory limits listed above, cannot be shown in LSB_JOBEXIT_INFO because the operating system, not LSF, performs the action. Set appropriate parameters in the queue or at job submission so that LSF enforces the limits, which makes this information available to LSF.
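A minimal sketch of queue-level POST_EXEC logic along these lines, assuming LSF exports LSB_JOBID and LSB_JOBEXIT_STAT alongside the LSB_JOBEXIT_INFO variable named above (check your LSF version's documentation for the exact encoding of these variables; the messages are illustrative):

```shell
# Queue-level POST_EXEC sketch: report why a job exited.
report_job_exit() {
    if [ -n "${LSB_JOBEXIT_INFO:-}" ]; then
        # LSF itself took action on the job (e.g. a run limit was hit).
        echo "job ${LSB_JOBID:-?}: LSF action: $LSB_JOBEXIT_INFO"
    elif [ "${LSB_JOBEXIT_STAT:-0}" -ne 0 ]; then
        # The job itself failed; a status above 128 usually
        # means 128 + signal number.
        echo "job ${LSB_JOBID:-?}: abnormal exit status ${LSB_JOBEXIT_STAT}"
    fi
}

# Example invocation with values LSF might export:
LSB_JOBID=7265 \
LSB_JOBEXIT_INFO="TERM_RUNLIMIT: job killed after reaching LSF run time limit" \
report_job_exit
```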
For the RMS integrations with LSF (HP AlphaServer SC and Linux QsNet), LSF jobs running through RMS return the rms_run() return code as the job exit code. RMS documents certain exit codes and their corresponding job exit reasons.
See the rms_run() man page for more information.