Understanding Platform LSF job exit information

Contents

  • Why did my job exit?

  • How LSF translates events into exit codes

  • Application and system exit values

  • LSF job termination reason logging

  • Job termination by LSF exit information

  • LSF RMS integration exit values

Why did my job exit?

LSF collects job information and reports the final status of a job. By convention, a job that finishes normally reports a status of 0. Any non-zero status means that the job has exited abnormally.

Most of the time, an abnormal job exit is caused by the job itself or by the system it ran on, not by an LSF error. This document explains some of the information LSF provides about abnormal job termination.

How LSF translates events into exit codes

The following table summarizes LSF exit behavior for some common error conditions.

Error condition                     LSF exit code  Operating system  System exit code equivalent  Meaning
Command not found                   127            all               1 or 127                     The command shell returns 1 if the command is not found. If the command cannot be found inside a job script, LSF returns exit code 127.
Directory not available for output  0              all               1                            LSF sends the output back to the user through email if the directory is not available for output (bsub -o).
LSF internal error                  -127, 127      all               N/A                          RES returns -127 or 127 for all internal problems.
Out of memory                       N/A            all               N/A                          Exit code depends on the error handling of the application itself.
LSF job states                      0              all               N/A                          Exit code 0 is returned for all job states.


Host failure

If an LSF server host fails, jobs running on that host are lost. No other jobs are affected. For lost jobs to be automatically rerun from the beginning or restarted from a checkpoint on another host, you must submit them with specific options at initial job submission.

  • If a job is submitted with bsub -r or to a queue with RERUNNABLE set, it reruns automatically on host failure.

  • If a job is submitted with bsub -k or to a checkpointable queue or application profile, it can be restarted if the host fails and the checkpoint succeeds.

If all of the hosts in a cluster go down, all running jobs are lost. When a host comes back up and takes over as master, it reads the lsb.events file to get the state of all batch jobs. Jobs that were running when the systems went down are assumed to have exited, and email is sent to the submitting user. Pending jobs remain in their queues, and are scheduled as hosts become available.

Exited jobs

A job might terminate abnormally for various reasons. Job termination can happen from any state. An abnormally terminated job goes into EXIT state. The situations where a job terminates abnormally include:

  • The job is cancelled by its owner or the LSF administrator while pending, or after being dispatched to a host.

  • The job is not able to be dispatched before it reaches its termination deadline, and thus is aborted by LSF.

  • The job fails to start successfully. For example, the wrong executable is specified by the user when the job is submitted.

  • The job exits with a non-zero exit status.

You can configure hosts so that LSF detects an abnormally high rate of job exit from a host. See Administering Platform LSF for more information.

Application and system exit values

LSF monitors a job while it is running and reports the exit code returned by the job itself. LSF collects this exit code through the wait3() system call on UNIX platforms. The exit code is a result of the system exit values. Use bhist or bjobs to see the exit code for your job.

Application exit values

The most common cause of abnormal LSF job termination is a non-zero application exit value. If your application exits with an explicit exit value less than 128, bjobs and bhist display the actual exit code of the application; for example, Exited with exit code 3. You would have to refer to the application code for the meaning of exit code 3.

It is possible for a job to explicitly exit with an exit code greater than 128, which can be confused with the corresponding UNIX signal. Make sure that applications you write do not use exit codes greater than 128.

System signal exit values

When you send a signal that terminates the job, LSF reports either the signal or the signal_value+128. If the return status is greater than 128, and the job was terminated with a signal, then return_status-128=signal. For example, return status 133 means that the job was terminated with signal 5 (SIGTRAP on most systems, 133-128=5). A job with exit status 130 was terminated with signal 2 (SIGINT on most systems, 130-128 = 2).
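As a minimal sketch of the arithmetic above, the following Python function interprets an exit status using the 128 + signal convention. The function name describe_exit_status is hypothetical, and the signal name is looked up on the host that runs the script, which may differ from the execution host.

```python
import signal


def describe_exit_status(status):
    """Interpret a job exit status using the 128 + signal convention."""
    if status == 0:
        return "finished normally"
    if status > 128:
        signum = status - 128
        try:
            # Signal names are looked up on the local host only.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown on this host"
        # "possibly" because an application can also exit explicitly
        # with a value greater than 128.
        return f"possibly terminated by signal {signum} ({name})"
    return f"application exit code {status}"


for status in (0, 3, 130, 133):
    print(status, "->", describe_exit_status(status))
```

The result is only a hint: it cannot distinguish a real signal from an application that deliberately exits with a value above 128.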

Some operating systems define exit codes as 0-255. As a result, negative exit values or values greater than 255 may wrap around within that range. The most common example is a program that exits with -1, which LSF reports as "exit code 255".
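The wrap-around can be reproduced with modular arithmetic. This is a sketch that assumes an 8-bit exit status (0-255); the helper name wrapped_exit_status is hypothetical.

```python
def wrapped_exit_status(value):
    """Map an arbitrary integer exit value into the 0-255 range."""
    # Python's % operator returns a non-negative result for a positive
    # modulus, which matches the wrap-around described above.
    return value % 256


print(wrapped_exit_status(-1))   # 255
print(wrapped_exit_status(256))  # 0
print(wrapped_exit_status(300))  # 44
```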

How or why the job may have been signaled, or exited with a certain exit code, can be application and/or system specific. The application or system logs might be able to give a better description of the problem.

Note:

Termination signals are operating system dependent, so signal 5 may not be SIGTRAP and 11 may not be SIGSEGV on all UNIX and Linux systems. You need to pay attention to the execution host type in order to correctly translate the exit value if the job has been signaled.

bhist and bjobs output

In most cases, bjobs and bhist show the application exit value (128 + signal). In some cases, bjobs and bhist show the actual signal value.

If LSF sends catchable signals to the job, it displays the exit value. For example, if you run bkill jobID to kill the job, LSF passes SIGINT, which causes the job to exit with exit code 130 (SIGINT is 2 on most systems, 128+2 = 130).

If LSF sends uncatchable signals to the job, then the entire process group for the job exits with the corresponding signal. For example, if you run bkill -s SEGV jobID to kill the job, bjobs and bhist show Exited by signal 7.

Example

The following example shows a job that exited with exit code 139, which means that the job was terminated with signal 11 (SIGSEGV on most systems, 139-128=11). This means that the application had a core dump.

bjobs -l 2012 
Job <2012>, User , Project , Status , Queue , Command 
Fri Dec 27 22:47:28: Submitted from host , CWD <$HOME>; 
Fri Dec 27 22:47:37: Started on , Execution Home , Execution CWD ; 
Fri Dec 27 22:48:02: Exited with exit code 139. The CPU time used is 0.2 seconds.  
SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem 
loadSched   -     -     -     -       -     -    -     -     -      -      - 
loadStop    -     -     -     -       -     -    -     -     -      -      -
                 cpuspeed    bandwidth
 loadSched          -            -
 loadStop           -            -

LSF job termination reason logging

When LSF takes action on a job, it may send multiple signals. In the case of job termination, LSF sends SIGINT, SIGTERM, and SIGKILL in succession until the job has terminated. As a result, the job may exit with any of the corresponding exit values at the system level. Other actions may send "warning" signals (such as SIGUSR2) to applications. For specific signal sequences, refer to the LSF documentation for that feature.
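Because SIGINT and SIGTERM can be caught, an application has a chance to clean up before SIGKILL arrives. The following Python sketch shows one way a job might do that. The handler name, the cleanup step, and the choice to exit with 128 plus the signal number are illustrative assumptions, not LSF requirements.

```python
import signal
import sys


def handle_termination(signum, frame):
    """Clean up, then exit with 128 + signal number."""
    # Placeholder for application-specific cleanup (flush files, remove locks).
    print(f"caught signal {int(signum)}, cleaning up", file=sys.stderr)
    sys.exit(128 + signum)


def install_handlers():
    """Register the handler for the catchable termination signals."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, handle_termination)
    # SIGKILL cannot be caught, so no handler can be registered for it.
```

A job script would call install_handlers() once at startup, before beginning its real work. With this handler, a job terminated by SIGINT exits with code 130 and one terminated by SIGTERM exits with code 143 on systems where those signals are numbered 2 and 15.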

Run bhist to see the actions that LSF takes on a job:

bhist -l 1798 
Job <1798>, User <user1>, Command <sleep 10000> 
Tue Feb 25 16:35:31: Submitted from host <hostA>, to Queue <normal>, CWD <$H
                     OME/lsf_7.0/conf/lsbatch/lsf_7.0/configdir>; 
Tue Feb 25 16:35:51: Dispatched to <hostA>; 
Tue Feb 25 16:35:51: Starting (Pid 12955); 
Tue Feb 25 16:35:53: Running with execution home </home/user1>, Execution CWD <
                     /home/user1/Testing/lsf_7.0/conf/lsbatch/lsf_7.0/configdir>,
                     Execution Pid <12955>; 
Tue Feb 25 16:38:20: Signal <KILL> requested by user or administrator <user1>; 
Tue Feb 25 16:38:22: Exited with exit code 130. The CPU time used is 0.1 seconds; 
Summary of time in seconds spent in various states by  Tue Feb 25 16:38:22  
PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL 
 20       0        151      0        0        0        171

Here we see that LSF itself sent the signal to terminate the job, and the job exited with exit code 130 (130-128 = 2 = SIGINT).
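When many jobs need to be checked, the exit line can be pulled out of bhist output with a small parser. The sketch below is based only on the two wordings that appear in the sample output in this document ("Exited with exit code N" and "Exited by signal N"); other LSF versions may word these lines differently, and the function name parse_exit_line is hypothetical.

```python
import re

# Matches the two exit wordings shown in the sample bhist output.
EXIT_PATTERN = re.compile(r"Exited (with exit code|by signal) (\d+)")


def parse_exit_line(line):
    """Return ("exit_code", n) or ("signal", n), or None if no match."""
    match = EXIT_PATTERN.search(line)
    if match is None:
        return None
    kind = "exit_code" if match.group(1) == "with exit code" else "signal"
    return kind, int(match.group(2))


sample = "Tue Feb 25 16:38:22: Exited with exit code 130. The CPU time used is 0.1 seconds;"
print(parse_exit_line(sample))
```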

When a job finishes, LSF reports the last job termination action it took against the job and logs it into lsb.acct.

If a running job exits because of node failure, LSF sets the correct exit information in lsb.acct, lsb.events, and the job output file.

View logged job exit information (bacct -l)

  1. Use bacct -l to view job exit information logged to lsb.acct:

    bacct -l 7265
    Accounting information about jobs that are: 
      - submitted by all users.
      - accounted on all projects.
      - completed normally or exited
      - executed on all hosts.
      - submitted to all queues.
      - accounted on all service classes.
    ------------------------------------------------------------------------------
    Job <7265>, User <lsfadmin>, Project <default>, Status <EXIT>, Queue <normal>, 
                         Command <srun sleep 100000>
    Thu Sep 16 15:22:09: Submitted from host <hostA>, CWD <$HOME>;
    Thu Sep 16 15:22:20: Dispatched to 4 Hosts/Processors <4*hostA>;
    Thu Sep 16 15:23:21: Completed <exit>; TERM_RUNLIMIT: job killed after reaching
                         LSF run time limit.
    Accounting information about this job:
         Share group charged </lsfadmin>
         CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP
          0.04       11             72     exit         0.0006     0K      0K
    ------------------------------------------------------------------------------
    SUMMARY:      ( time unit: second ) 
     Total number of done jobs:       0      Total number of exited jobs:     1
     Total CPU time consumed:       0.0      Average CPU time consumed:     0.0
     Maximum CPU time of a job:     0.0      Minimum CPU time of a job:     0.0
     Total wait time in queues:    11.0
     Average wait time in queue:   11.0
     Maximum wait time in queue:   11.0      Minimum wait time in queue:   11.0
     Average turnaround time:        72 (seconds/job)
     Maximum turnaround time:        72      Minimum turnaround time:        72
     Average hog factor of a job:  0.00 ( cpu time / turnaround time )
     Maximum hog factor of a job:  0.00      Minimum hog factor of a job:  0.00
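The termination reason keyword can be extracted from this output with a regular expression. This is a minimal sketch that assumes the keyword always begins with TERM_ as in the sample above; the function name find_termination_reasons is hypothetical, and the sample text is copied from the bacct -l output shown here.

```python
import re


def find_termination_reasons(bacct_output):
    """Return every TERM_* keyword found in bacct -l output, in order."""
    return re.findall(r"\bTERM_[A-Z_]+\b", bacct_output)


sample_output = (
    "Thu Sep 16 15:23:21: Completed <exit>; TERM_RUNLIMIT: job killed after reaching\n"
    "                     LSF run time limit.\n"
)
print(find_termination_reasons(sample_output))
```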

Termination reasons displayed by bacct

When LSF detects that a job is terminated, bacct -l displays one of the following termination reasons:


Keyword displayed by bacct

Termination reason

Integer value logged to JOB_FINISH in lsb.acct

TERM_ADMIN

Job killed by root or LSF administrator

15

TERM_BUCKET_KILL

Job killed with bkill -b

23

TERM_CHKPNT

Job killed after checkpointing

13

TERM_CPULIMIT

Job killed after reaching LSF CPU usage limit

12

TERM_CWD_NOTEXIST

Current working directory is not accessible or does not exist on the execution host

25

TERM_DEADLINE

Job killed after deadline expires

6

TERM_EXTERNAL_SIGNAL

Job killed by a signal external to LSF

17

TERM_FORCE_ADMIN

Job killed by root or LSF administrator without time for cleanup

9

TERM_FORCE_OWNER

Job killed by owner without time for cleanup

8

TERM_LOAD

Job killed after load exceeds threshold

3

TERM_MEMLIMIT

Job killed after reaching LSF memory usage limit

16

TERM_OTHER

Member of a chunk job in WAIT state killed and requeued after being switched to another queue.

4

TERM_OWNER

Job killed by owner

14

TERM_PREEMPT

Job killed after preemption

1

TERM_PROCESSLIMIT

Job killed after reaching LSF process limit

7

TERM_REQUEUE_ADMIN

Job killed and requeued by root or LSF administrator

11

TERM_REQUEUE_OWNER

Job killed and requeued by owner

10

TERM_RMS

Job exited from an RMS system error

18

TERM_RUNLIMIT

Job killed after reaching LSF run time limit

5

TERM_SLURM

Job terminated abnormally in SLURM (node failure)

22

TERM_SWAP

Job killed after reaching LSF swap usage limit

20

TERM_THREADLIMIT

Job killed after reaching LSF thread limit

21

TERM_UNKNOWN

LSF cannot determine a termination reason—0 is logged but TERM_UNKNOWN is not displayed

0

TERM_WINDOW

Job killed after queue run window closed

2

TERM_ZOMBIE

Job exited while LSF is not available

19


Tip:

The integer values logged to the JOB_FINISH event in lsb.acct and termination reason keywords are mapped in lsbatch.h.
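For scripts that read lsb.acct directly, the table above can be expressed as a lookup. The following sketch copies the integer values from that table only; lsbatch.h remains the authoritative mapping, and the names TERMINATION_REASONS and termination_keyword are hypothetical.

```python
# Integer values and keywords copied from the table in this section.
# The table does not list a keyword for the value 24.
TERMINATION_REASONS = {
    0: "TERM_UNKNOWN",
    1: "TERM_PREEMPT",
    2: "TERM_WINDOW",
    3: "TERM_LOAD",
    4: "TERM_OTHER",
    5: "TERM_RUNLIMIT",
    6: "TERM_DEADLINE",
    7: "TERM_PROCESSLIMIT",
    8: "TERM_FORCE_OWNER",
    9: "TERM_FORCE_ADMIN",
    10: "TERM_REQUEUE_OWNER",
    11: "TERM_REQUEUE_ADMIN",
    12: "TERM_CPULIMIT",
    13: "TERM_CHKPNT",
    14: "TERM_OWNER",
    15: "TERM_ADMIN",
    16: "TERM_MEMLIMIT",
    17: "TERM_EXTERNAL_SIGNAL",
    18: "TERM_RMS",
    19: "TERM_ZOMBIE",
    20: "TERM_SWAP",
    21: "TERM_THREADLIMIT",
    22: "TERM_SLURM",
    23: "TERM_BUCKET_KILL",
    25: "TERM_CWD_NOTEXIST",
}


def termination_keyword(value):
    """Translate a JOB_FINISH integer into its termination keyword."""
    return TERMINATION_REASONS.get(value, f"unrecognized value {value}")


print(termination_keyword(5))
```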

Restrictions

  • If a queue-level JOB_CONTROL is configured, LSF cannot determine the result of the action. The termination reason only reflects what the termination reason could be in LSF.

  • LSF is not guaranteed to catch external signals sent directly to the job.

  • In MultiCluster, a brequeue request sent from the submission cluster is translated to TERM_OWNER or TERM_ADMIN in the remote execution cluster. The termination reason in the email notification sent from the execution cluster as well as that in the lsb.acct is set to TERM_OWNER or TERM_ADMIN.

Example output of bacct and bhist


Example termination cause

Termination reason in bacct -l

Example bhist output

bkill -s KILL

bkill job_ID

Completed <exit>; TERM_OWNER or TERM_ADMIN

Thu Mar 13 17:32:05: Signal <KILL> requested by user or administrator <user2>;

Thu Mar 13 17:32:06: Exited by signal 2. The CPU time used is 0.1 seconds;

bkill -r

Completed <exit>; TERM_FORCE_ADMIN or TERM_FORCE_OWNER when sbatchd is not reachable. Otherwise, TERM_OWNER or TERM_ADMIN

Thu Mar 13 17:32:05: Signal <KILL> requested by user or administrator <user2>;

Thu Mar 13 17:32:06: Exited by signal 2. The CPU time used is 0.1 seconds;

TERMINATE_WHEN

Completed <exit>; TERM_LOAD/TERM_WINDOW/TERM_PREEMPT

Thu Mar 13 17:33:16: Signal <KILL> requested by user or administrator <user2>;

Thu Mar 13 17:33:18: Exited by signal 2. The CPU time used is 0.1 seconds;

Memory limit reached

Completed <exit>; TERM_MEMLIMIT

Thu Mar 13 19:31:13: Exited by signal 2. The CPU time used is 0.1 seconds;

Run limit reached

Completed <exit>; TERM_RUNLIMIT

Thu Mar 13 20:18:32: Exited by signal 2. The CPU time used is 0.1 seconds.

CPU limit

Completed <exit>; TERM_CPULIMIT

Thu Mar 13 18:47:13: Exited by signal 24. The CPU time used is 62.0 seconds;

Swap limit

Completed <exit>; TERM_SWAP

Thu Mar 13 18:47:13: Exited by signal 24. The CPU time used is 62.0 seconds;

Regular job exits when host crashes

Rusage 0, Completed <exit>; TERM_ZOMBIE

Thu Jun 12 15:49:02: Unknown; unable to reach the execution host;

Thu Jun 12 16:10:32: Running;

Thu Jun 12 16:10:38: Exited with exit code 143. The CPU time used is 0.0 seconds;

brequeue -r

For each requeue, Completed <exit>; TERM_REQUEUE_ADMIN or TERM_REQUEUE_OWNER

Thu Mar 13 17:46:39: Signal <REQUEUE_PEND> requested by user or administrator <user2>;

Thu Mar 13 17:46:56: Exited by signal 2. The CPU time used is 0.1 seconds;

bchkpnt -k

On the first run: Completed <exit>; TERM_CHKPNT

Wed Apr 16 16:00:48: Checkpoint succeeded (actpid 931249);

Wed Apr 16 16:01:03: Exited with exit code 137. The CPU time used is 0.0 seconds;

kill -9 <RES> and job

Completed <exit>; TERM_EXTERNAL_SIGNAL

Thu Mar 13 17:30:43: Exited by signal 15. The CPU time used is 0.1 seconds;

Others

Completed <exit>;

Thu Mar 13 17:30:43: Exited with 3; The CPU time used is 0.1 seconds;


Job termination by LSF exit information

LSF also provides additional information in the POST_EXEC of the job. Use this information to detect conditions where LSF has terminated the job and take the appropriate action.

The job exit information in the POST_EXEC is defined in two parts:

  • LSB_JOBEXIT_STAT: the raw wait3() output (converted using the wait macros in /usr/include/sys/wait.h)

  • LSB_JOBEXIT_INFO: defined only if the job exit was due to a defined LSF reason.

Queue-level POST_EXEC commands should be written by the cluster administrator to perform whatever task is necessary for specific exit situations.
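As an illustration, a POST_EXEC command could be a short script that reads the two variables and reports what it finds. The sketch below is an assumption about how such a script might look, not a supplied LSF tool: the function name summarize_job_exit is hypothetical, and it only assumes that LSB_JOBEXIT_INFO, when defined, holds a space-separated value whose last word names the reason (for example SIGNAL -14 SIG_TERM_USER).

```python
import os


def summarize_job_exit(env):
    """Build a one-line summary from the POST_EXEC exit variables."""
    stat = env.get("LSB_JOBEXIT_STAT", "undefined")
    parts = env.get("LSB_JOBEXIT_INFO", "").split()
    if parts:
        # The last word names the LSF-defined reason, e.g. SIG_TERM_USER.
        return f"terminated by LSF: {parts[-1]} (raw status {stat})"
    return f"no LSF-defined reason (raw status {stat})"


if __name__ == "__main__":
    # In a real POST_EXEC, the values come from the job environment.
    print(summarize_job_exit(os.environ))
```

An administrator could extend the branch for LSF-defined reasons to notify the user, requeue the job, or archive partial results, depending on the site policy.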

Note:

Limits enforced at the system level, such as the CPU and memory limits listed above, cannot be shown in LSB_JOBEXIT_INFO because the operating system, not LSF, performs the action. Set appropriate parameters in the queue or at job submission to allow LSF to enforce the limits, which makes this information available to LSF.

Common LSB_JOBEXIT_STAT and LSB_JOBEXIT_INFO values

The following table lists common scenarios that are covered and not covered by LSB_JOBEXIT_INFO.

Example termination cause

LSB_JOBEXIT_STAT

LSB_JOBEXIT_INFO

Example bhist output

Job killed with the SIGINT bkill -s INT 520

33280

SIGNAL 2 INT

Fri Feb 14 16:48:00: Exited with exit code 130. The CPU time used is 0.2 seconds;

Job killed with SIGTERM bkill -s TERM 521

36608

SIGNAL 15 TERM

Fri Feb 14 16:49:50: Exited with exit code 143. The CPU time used is 0.2 seconds;

Job killed with SIGKILL bkill -s KILL 522

33280

SIGNAL -14 SIG_TERM_USER

Fri Feb 14 16:51:03: Exited with exit code 130. The CPU time used is 0.2 seconds;

Automatic migration when MIG is defined at queue level

33280

SIGNAL -1 SIG_CHKPNT

Fri Feb 14 17:32:17: Job has been requeued; Fri Feb 14 17:32:17: Pending: Migrating job is waiting for rescheduling;

bsub -I "hostname;exit 130"

33280

Undefined

Fri Feb 14 14:41:51: Exited with exit code 130. The CPU time used is 0.2 seconds;

Killing the job with bkill command bkill 210

33280

SIGNAL -14 SIG_TERM_USER

Fri Feb 14 14:45:51: Exited with exit code 130. The CPU time used is 0.2 seconds;

Job being brequeued. brequeue -r Job <211> is being requeued

33280

SIGNAL -23 SIG_KILL_REQUEUE

Fri Feb 14 14:48:15: Signal <REQUEUE_PEND> requested by user or administrator <iayaz>; Fri Feb 14 14:48:18: Exited with exit code 130. The CPU time used is 0.2 second

Job being migrated bmig -m togni Job <213> is being migrated

33280

SIGNAL -1 SIG_CHKPNT

Fri Feb 14 15:04:42: Migration requested by user or administrator <iayaz>; Specified Hosts <togni>; Fri Feb 14 15:04:44: Job is being requeued; Fri Feb 14 15:05:01: Job has been requeued; Fri Feb 14 15:05:01: Pending: Migrating job is waiting for rescheduling;

Job killed due to REQUEUE_EXIT_VALUE bsub "sleep 100;exit 34"

8704

Undefined

Fri Feb 14 15:10:21: Pending: Requeued job is waiting for rescheduling;(exit code 34)>;

Job killed by LSF when CPULIMIT enforced by LSF

158

SIGNAL -24 SIG_TERM_CPULIMIT

Wed Feb 19 14:18:13: Exited by signal 30. The CPU time used is 89.4 seconds.

Job killed because queue level CPULIMIT is reached.

40448

Undefined

Fri Feb 14 15:30:01: Exited with exit code 158. The CPU time used is 61.2 seconds;

Job killed because queue level RUNLIMIT is reached.

37120

Undefined

Fri Feb 14 15:37:44: Exited with exit code 145. The CPU time used is 0.2 seconds;

Job killed due to checkpointing. bchkpnt -k 838 Job <838> is being checkpointed

9

SIGNAL -1 SIG_CHKPNT

Fri Feb 14 17:59:12: Checkpoint succeeded (actpid 25298); Fri Feb 14 17:59:12: Exited by signal 9. The CPU time used is 0.1 seconds;

Job killed when it reaches the MEMLIMIT bsub -M 5 "/home/iayaz/script/memwrite -m 10 -r 2"

2

SIGNAL -25 SIG_TERM_MEMLIMIT

Fri Feb 21 10:50:50: Exited by signal 2. The CPU time used is 0.1 seconds;

Job killed when termination time approaches bsub -t 21:11:10 sleep 500;date

37120

Undefined

Exited with exit code 145. The CPU time used is 0.2 seconds;

Job killed when TERMINATE_WHEN = LOAD

33280

SIGNAL -15 SIG_TERM_LOAD

Exited with exit code 130. The CPU time used is 7.2 seconds.

Job killed when TERMINATE_WHEN = PREEMPT

33280

SIGNAL -16 SIG_TERM_PREEMPT

Exited with exit code 130. The CPU time used is 0.3 seconds;
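The LSB_JOBEXIT_STAT values in the table above can be related to the bhist output by decoding the raw wait status. The following sketch assumes the traditional UNIX layout (low 7 bits hold the terminating signal, bit 0x80 is the core dump flag, and the next 8 bits hold the exit code); on a real system the wait macros in sys/wait.h are the authoritative way to decode the value. The function name decode_wait_status is hypothetical.

```python
def decode_wait_status(status):
    """Decode a raw wait status, assuming the traditional UNIX layout."""
    signal_number = status & 0x7F
    if signal_number == 0:
        # Normal exit: the exit code is stored in the second byte.
        return {"kind": "exited", "exit_code": (status >> 8) & 0xFF}
    # Killed by a signal; stopped or continued statuses are not handled here.
    return {
        "kind": "signaled",
        "signal": signal_number,
        "core_dumped": bool(status & 0x80),
    }


for raw in (33280, 36608, 8704, 9, 158):
    print(raw, "->", decode_wait_status(raw))
```

Under this assumption, 33280 decodes to exit code 130, 36608 to exit code 143, and 8704 to exit code 34, while 9 and 158 decode to signals 9 and 30, which agrees with the bhist lines shown in the corresponding rows.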


LSF RMS integration exit values

For the RMS integrations with LSF (HP AlphaServer SC and Linux QsNet), LSF jobs running through RMS return the rms_run() return code as the job exit code. RMS documents certain exit codes and corresponding job exit reasons.

See the rms_run() man page for more information.

Upon successful completion, rms_run() returns the global OR of the exit status values of the processes in the parallel program. If one of the processes is killed, rms_run() returns a status value of 128 plus the signal number. It can also return the following codes:

Return Code   RMS Meaning
0             A process exited with the code 127 (GLOBAL EXIT), which indicates success, causing all of the processes to exit.
123           A process exited with the code 123 (GLOBAL ERROR) causing all of the processes to exit.
124           The node the job was executing on has been removed from the system.
125           One or more processes were still running when the exit timeout expired.
126           The resource is inadequate for the request.
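The rules above can be summarized in a short lookup. This is a sketch based only on the description in this section, not on the rms_run() implementation; the names RMS_RETURN_CODES, describe_rms_return_code, and global_exit_status are hypothetical.

```python
from functools import reduce
from operator import or_

# Special return codes as described in the table above.
RMS_RETURN_CODES = {
    123: "a process exited with code 123 (GLOBAL ERROR)",
    124: "the node the job was executing on has been removed from the system",
    125: "processes were still running when the exit timeout expired",
    126: "the resource is inadequate for the request",
}


def global_exit_status(statuses):
    """Combine per-process exit statuses with a bitwise OR."""
    return reduce(or_, statuses, 0)


def describe_rms_return_code(code):
    """Give a rough reading of an rms_run() return code."""
    if code == 0:
        return "success"
    if code in RMS_RETURN_CODES:
        return RMS_RETURN_CODES[code]
    if code > 128:
        return f"a process was killed by signal {code - 128}"
    return f"global OR of the process exit statuses: {code}"


print(global_exit_status([0, 2, 4]))  # 6
print(describe_rms_return_code(137))
```

Because the return value is a bitwise OR, a non-zero result shows that at least one process failed but does not identify which one.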