
Error and Event Logging

System Directories and Log Files

LSF uses directories for temporary work files, log files, transaction files, and spooling.

LSF keeps track of all jobs in the system by maintaining a transaction log in the work subtree. The LSF log files are found in the directory LSB_SHAREDIR/cluster_name/logdir.

The following files maintain the state of the LSF system:

lsb.events

LSF uses the lsb.events file to keep track of the state of all jobs. Each job is a transaction from job submission to job completion. The LSF system keeps track of everything associated with the job in the lsb.events file.

lsb.events.n

The events file is automatically trimmed, and old job events are stored in lsb.events.n files. When mbatchd starts, it refers only to the lsb.events file, not to the lsb.events.n files. The bhist command can refer to these files.

Job script files in the info directory

When a user issues a bsub command from a shell prompt, LSF collects all of the commands issued on the bsub line and spools the data to mbatchd, which saves the bsub command script in the info directory (or in one of its subdirectories if MAX_INFO_DIRS is defined in lsb.params) for use at dispatch time or if the job is rerun. The info directory is managed by LSF and should not be modified by anyone.
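For example, a submission like the following (the script name and options are illustrative) is spooled to mbatchd and kept in the info directory so the job can be dispatched later or rerun:

    bsub -q normal -o %J.out ./my_script.sh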

Log directory permissions and ownership

Ensure that the LSF_LOGDIR directory is writable by root. The LSF administrator must own LSF_LOGDIR.

Log levels and descriptions

Number   Level        Description
0        LOG_EMERG    Log only those messages that indicate the system is unusable.
1        LOG_ALERT    Log only those messages for which action must be taken immediately.
2        LOG_CRIT     Log only those messages that are critical.
3        LOG_ERR      Log only those messages that indicate error conditions.
4        LOG_WARNING  Log only those messages that are warnings or more serious messages. This is the default level of debug information.
5        LOG_NOTICE   Log those messages that indicate normal but significant conditions, as well as warnings and more serious messages.
6        LOG_INFO     Log all informational messages and more serious messages.
7        LOG_DEBUG    Log all debug-level messages.
8        LOG_TRACE    Log all available messages.

Support for UNICOS accounting

In Cray UNICOS environments, LSF writes to the Network Queuing System (NQS) accounting data file, nqacct, on the execution host. This lets you track LSF jobs and other jobs together, through NQS.

Support for IRIX Comprehensive System Accounting (CSA)

The IRIX 6.5.9 Comprehensive System Accounting facility (CSA) writes an accounting record for each process in the pacct file, which is usually located in the /var/adm/acct/day directory. IRIX system administrators then use the csabuild command to organize and present the records on a job-by-job basis.

The LSF_ENABLE_CSA parameter in lsf.conf enables LSF to write job events to the pacct file for processing through CSA. For LSF job accounting, records are written to pacct at the start and end of each LSF job.
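A minimal sketch of enabling CSA job accounting, assuming LSF_ENABLE_CSA takes the usual y|Y form (check the Platform LSF Configuration Reference for the exact syntax):

    # lsf.conf
    LSF_ENABLE_CSA=Y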

See the Platform LSF Configuration Reference for more information about the LSF_ENABLE_CSA parameter.

See the IRIX 6.5.9 resource administration documentation for information about CSA.

Managing Error Logs

Error logs maintain important information about LSF operations. When you see any abnormal behavior in LSF, you should first check the appropriate error logs to find out the cause of the problem.

LSF log files grow over time. These files should occasionally be cleared, either by hand or using automatic scripts.

Daemon error logs

LSF log files are reopened each time a message is logged, so if you rename or remove a daemon log file, the daemons will automatically create a new log file.

The LSF daemons log messages when they detect problems or unusual situations.

The daemons can be configured to put these messages into files.

Each LSF system daemon logs its error messages to its own file, named after the daemon and the host: daemon_name.log.host_name.

LSF daemons log error messages at different levels so that you can choose to log all messages, or only messages that are deemed critical. Message logging for LSF daemons (except LIM) is controlled by the parameter LSF_LOG_MASK in lsf.conf. Possible values for this parameter can be any log priority symbol that is defined in /usr/include/sys/syslog.h. The default value for LSF_LOG_MASK is LOG_WARNING.

important:  
LSF_LOG_MASK in lsf.conf no longer specifies LIM logging level in LSF Version 7. For LIM, you must use EGO_LOG_MASK in ego.conf to control message logging for LIM. The default value for EGO_LOG_MASK is LOG_WARNING.
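For example, to log debug-level messages for all daemons, including LIM, you would set the mask in both files (a sketch; the values shown are illustrative):

    # lsf.conf -- controls message logging for all LSF daemons except LIM
    LSF_LOG_MASK=LOG_DEBUG

    # ego.conf -- controls message logging for LIM in LSF Version 7
    EGO_LOG_MASK=LOG_DEBUG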

Set the log files owner

Prerequisites: You must be the cluster administrator. The performance monitoring (perfmon) metrics must be enabled or you must set LC_PERFM to debug.

You can set the owner of the log files for the LSF daemons (not including mbschd). The default owner is the LSF administrator.

restriction:  
Applies to UNIX hosts only.
restriction:  
This change only takes effect for daemons that are running as root.
  1. Edit lsf.conf and add the parameter LSF_LOGFILE_OWNER.
  2. Specify a user account name to set the owner of the log files.
  3. Shut down the LSF daemon or daemons you want to set the log file owner for: run lsfshutdown on the host.
  4. Delete or move any existing log files.
    important:  
    If you do not clear out the existing log files, the file ownership does not change.
  5. Restart the LSF daemons you shut down: run lsfstartup on the host. (A sketch of the full sequence follows these steps.)
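A sketch of the full sequence, assuming a hypothetical owner account named logadmin and an illustrative log directory path (adjust names and paths for your site):

    # lsf.conf
    LSF_LOGFILE_OWNER=logadmin

    # On the host, as root:
    lsfshutdown                       # shut down the LSF daemons
    rm /usr/local/lsf/log/*.log.*     # clear out or move the existing log files (path is illustrative)
    lsfstartup                        # restart the daemons; new log files are owned by logadmin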

View the number of file descriptors remaining

Prerequisites: The performance monitoring (perfmon) metrics must be enabled or you must set LC_PERFM to debug.

The mbatchd daemon can log a large number of files in a short period when you submit a large number of jobs to LSF. You can view the remaining file descriptors at any time.

restriction:  
Applies to UNIX hosts only.
  1. Run badmin perfmon view.
  2. The free, used, and total numbers of file descriptors are displayed (see the sketch after these steps).

    On 64-bit AIX5 hosts, if the file descriptor limit has never been changed, the maximum value is displayed: 9223372036854775797.
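A sketch of the sequence, assuming the performance metrics are not yet enabled (the sampling period is illustrative):

    badmin perfmon start 60     # enable performance metric collection, sampling every 60 seconds
    badmin perfmon view         # display metrics, including free, used, and total file descriptors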

Error logging

If the optional LSF_LOGDIR parameter is defined in lsf.conf, error messages from LSF servers are logged to files in this directory.

If LSF_LOGDIR is defined, but the daemons cannot write to files there, the error log files are created in /tmp.

If LSF_LOGDIR is not defined, errors are logged to the system error logs (syslog) using the LOG_DAEMON facility. syslog messages are highly configurable, and the default configuration varies widely from system to system. Start by looking for the file /etc/syslog.conf, and read the man pages for syslog(3) and syslogd(1).

If the error log is managed by syslog, it is probably already being automatically cleared.

If LSF daemons cannot find lsf.conf when they start, they will not find the definition of LSF_LOGDIR. In this case, error messages go to syslog. If you cannot find any error messages in the log files, they are likely in the syslog.
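For example, to direct daemon error logs to a shared directory instead of syslog, you might define the following in lsf.conf (the path is illustrative):

    # lsf.conf
    LSF_LOGDIR=/var/log/lsf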

System Event Log

The LSF daemons keep an event log in the lsb.events file. The mbatchd daemon uses this information to recover from server failures, host reboots, and mbatchd restarts. The lsb.events file is also used by the bhist command to display detailed information about the execution history of batch jobs, and by the badmin command to display the operational history of hosts, queues, and daemons.

By default, mbatchd automatically backs up and rewrites the lsb.events file after every 1000 batch job completions. This value is controlled by the MAX_JOB_NUM parameter in the lsb.params file. The old lsb.events file is moved to lsb.events.1, and each old lsb.events.n file is moved to lsb.events.n+1. LSF never deletes these files. If disk storage is a concern, the LSF administrator should arrange to archive or remove old lsb.events.n files periodically.
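For example, to archive the event log after every 2000 completed jobs instead of the default 1000 (the value is illustrative; run badmin reconfig after editing lsb.params):

    # lsb.params
    Begin Parameters
    MAX_JOB_NUM = 2000
    End Parameters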

caution:  
Do not remove or modify the current lsb.events file. Removing or modifying the lsb.events file could cause batch jobs to be lost.

Duplicate Logging of Event Logs

To recover from server failures, host reboots, or mbatchd restarts, LSF uses information stored in lsb.events. To improve the reliability of LSF, you can configure LSF to maintain copies of these logs, to use as a backup.

If the host that contains the primary copy of the logs fails, LSF will continue to operate using the duplicate logs. When the host recovers, LSF uses the duplicate logs to update the primary copies.

How duplicate logging works

By default, the event log is located in LSB_SHAREDIR. Typically, LSB_SHAREDIR resides on a reliable file server that also contains other critical applications necessary for running jobs, so if that host becomes unavailable, the subsequent failure of LSF is a secondary issue. LSB_SHAREDIR must be accessible from all potential LSF master hosts.

When you configure duplicate logging, the duplicates are kept on the file server, and the primary event logs are stored on the first master host. In other words, LSB_LOCALDIR is used to store the primary copy of the batch state information, and the contents of LSB_LOCALDIR are copied to a replica in LSB_SHAREDIR, which resides on a central file server. This has the following effects:

Failure of file server

If the file server containing LSB_SHAREDIR goes down, LSF continues to process jobs. Client commands such as bhist, which directly read LSB_SHAREDIR, will not work.

When the file server recovers, the current log files are replicated to LSB_SHAREDIR.

Failure of first master host

If the first master host fails, the primary copies of the files (in LSB_LOCALDIR) become unavailable. Then, a new master host is selected. The new master host uses the duplicate files (in LSB_SHAREDIR) to restore its state and to log future events. There is no duplication by the second or any subsequent LSF master hosts.

When the first master host becomes available after a failure, it updates the primary copies of the files (in LSB_LOCALDIR) from the duplicates (in LSB_SHAREDIR) and continues operations as before.

If the first master host does not recover, LSF will continue to use the files in LSB_SHAREDIR, but there is no more duplication of the log files.

Simultaneous failure of both hosts

If the master host containing LSB_LOCALDIR and the file server containing LSB_SHAREDIR both fail simultaneously, LSF will be unavailable.

Network partitioning

We assume that network partitioning does not cause a cluster to split into two independent clusters, each simultaneously running mbatchd.

This can happen with certain network topologies and failure modes. For example, connectivity is lost between the first master host, M1, and both the file server and the secondary master host, M2. Both M1 and M2 run the mbatchd service, with M1 logging events to LSB_LOCALDIR and M2 logging to LSB_SHAREDIR. When connectivity is restored, the changes made by M2 to LSB_SHAREDIR are lost when M1 updates LSB_SHAREDIR from its copy in LSB_LOCALDIR.

The archived event files are only available on LSB_LOCALDIR, so in the case of network partitioning, commands such as bhist cannot access these files. As a precaution, you should periodically copy the archived files from LSB_LOCALDIR to LSB_SHAREDIR.

Setting an event update interval

If NFS traffic is too high and you want to reduce network traffic, use EVENT_UPDATE_INTERVAL in lsb.params to specify how often to back up the data and synchronize the LSB_SHAREDIR and LSB_LOCALDIR directories.

The directories are always synchronized when data is logged to the files, or when mbatchd is started on the first LSF master host.
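For example, to back up and synchronize LSB_LOCALDIR and LSB_SHAREDIR every 300 seconds (the value is illustrative; EVENT_UPDATE_INTERVAL is specified in seconds):

    # lsb.params
    Begin Parameters
    EVENT_UPDATE_INTERVAL = 300
    End Parameters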

Automatic archiving and duplicate logging

Event logs

Archived event logs, lsb.events.n, are not replicated to LSB_SHAREDIR. If LSF starts a new event log while the file server containing LSB_SHAREDIR is down, you might notice a gap in the historical data in LSB_SHAREDIR.

Configure duplicate logging

To enable duplicate logging, set LSB_LOCALDIR in lsf.conf to a directory on the first master host (the first host configured in lsf.cluster.cluster_name) that will be used to store the primary copies of lsb.events. This directory should only exist on the first master host.

  1. Edit lsf.conf and set LSB_LOCALDIR to a local directory that exists only on the first master host.
  2. Use the commands lsadmin reconfig and badmin mbdrestart to make the changes take effect.
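A minimal sketch, assuming a hypothetical directory path (LSB_LOCALDIR must exist only on the first master host):

    # lsf.conf
    LSB_LOCALDIR=/usr/local/lsf/work/cluster1/localdir

    # Apply the change:
    lsadmin reconfig
    badmin mbdrestart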

LSF Job Termination Reason Logging

When a job finishes, LSF reports the last job termination action it took against the job and logs it into lsb.acct.

If a running job exits because of node failure, LSF sets the correct exit information in lsb.acct, lsb.events, and the job output file. Jobs terminated by a signal from LSF, the operating system, or an application have the signal logged as the LSF exit code. Exit codes are not the same as the termination actions.

View logged job exit information (bacct -l)

  1. Use bacct -l to view job exit information logged to lsb.acct. For example:

    bacct -l 7265 
     
    Accounting information about jobs that are:  
      - submitted by all users. 
      - accounted on all projects. 
      - completed normally or exited 
      - executed on all hosts. 
      - submitted to all queues. 
      - accounted on all service classes. 
    ------------------------------------------------------------------------------ 
     
    Job <7265>, User <lsfadmin>, Project <default>, Status <EXIT>, Queue <normal>,  
                         Command <srun sleep 100000> 
    Thu Sep 16 15:22:09: Submitted from host <hostA>, CWD <$HOME>; 
    Thu Sep 16 15:22:20: Dispatched to 4 Hosts/Processors <4*hostA>; 
    Thu Sep 16 15:22:20: slurm_id=21793;ncpus=4;slurm_alloc=n[13-14]; 
    Thu Sep 16 15:23:21: Completed <exit>; TERM_RUNLIMIT: job killed after reaching 
                         LSF run time limit. 
     
    Accounting information about this job: 
         Share group charged </lsfadmin> 
         CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP 
          0.04       11             72     exit         0.0006     0K      0K 
    ------------------------------------------------------------------------------ 
     
    SUMMARY:      ( time unit: second )  
     Total number of done jobs:       0      Total number of exited jobs:     1 
     Total CPU time consumed:       0.0      Average CPU time consumed:     0.0 
     Maximum CPU time of a job:     0.0      Minimum CPU time of a job:     0.0 
     Total wait time in queues:    11.0 
     Average wait time in queue:   11.0 
     Maximum wait time in queue:   11.0      Minimum wait time in queue:   11.0 
     Average turnaround time:        72 (seconds/job) 
     Maximum turnaround time:        72      Minimum turnaround time:        72 
     Average hog factor of a job:  0.00 ( cpu time / turnaround time ) 
     Maximum hog factor of a job:  0.00      Minimum hog factor of a job:  0.00 
    

Termination reasons displayed by bacct

When LSF detects that a job is terminated, bacct -l displays one of the following termination reason keywords. The value is the integer logged to the JOB_FINISH event in lsb.acct.

Keyword displayed by bacct   Value   Termination reason
TERM_ADMIN                   15      Job killed by root or LSF administrator
TERM_BUCKET_KILL             23      Job killed with bkill -b
TERM_CHKPNT                  13      Job killed after checkpointing
TERM_CPULIMIT                12      Job killed after reaching LSF CPU usage limit
TERM_CWD_NOTEXIST            25      Current working directory is not accessible or does not exist on the execution host
TERM_DEADLINE                6       Job killed after deadline expires
TERM_EXTERNAL_SIGNAL         17      Job killed by a signal external to LSF
TERM_FORCE_ADMIN             9       Job killed by root or LSF administrator without time for cleanup
TERM_FORCE_OWNER             8       Job killed by owner without time for cleanup
TERM_LOAD                    3       Job killed after load exceeds threshold
TERM_MEMLIMIT                16      Job killed after reaching LSF memory usage limit
TERM_OTHER                   4       Member of a chunk job in WAIT state killed and requeued after being switched to another queue
TERM_OWNER                   14      Job killed by owner
TERM_PREEMPT                 1       Job killed after preemption
TERM_PROCESSLIMIT            7       Job killed after reaching LSF process limit
TERM_REQUEUE_ADMIN           11      Job killed and requeued by root or LSF administrator
TERM_REQUEUE_OWNER           10      Job killed and requeued by owner
TERM_RMS                     18      Job exited from an RMS system error
TERM_RUNLIMIT                5       Job killed after reaching LSF run time limit
TERM_SLURM                   22      Job terminated abnormally in SLURM (node failure)
TERM_SWAP                    20      Job killed after reaching LSF swap usage limit
TERM_THREADLIMIT             21      Job killed after reaching LSF thread limit
TERM_UNKNOWN                 0       LSF cannot determine a termination reason; 0 is logged but TERM_UNKNOWN is not displayed
TERM_WINDOW                  2       Job killed after queue run window closed
TERM_ZOMBIE                  19      Job exited while LSF is not available

tip:  
The integer values logged to the JOB_FINISH event in lsb.acct and the termination reason keywords are mapped in lsbatch.h.
Restrictions

Example output of bacct and bhist

Example termination cause: bkill -s KILL; bkill job_ID
Termination reason in bacct -l: Completed <exit>; TERM_OWNER or TERM_ADMIN
Example bhist output:
Thu Mar 13 17:32:05: Signal <KILL> requested by user or administrator <user2>;
Thu Mar 13 17:32:06: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause: bkill -r
Termination reason in bacct -l: Completed <exit>; TERM_FORCE_ADMIN or TERM_FORCE_OWNER when sbatchd is not reachable; otherwise, TERM_OWNER or TERM_ADMIN
Example bhist output:
Thu Mar 13 17:32:05: Signal <KILL> requested by user or administrator <user2>;
Thu Mar 13 17:32:06: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause: TERMINATE_WHEN
Termination reason in bacct -l: Completed <exit>; TERM_LOAD, TERM_WINDOW, or TERM_PREEMPT
Example bhist output:
Thu Mar 13 17:33:16: Signal <KILL> requested by user or administrator <user2>;
Thu Mar 13 17:33:18: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause: Memory limit reached
Termination reason in bacct -l: Completed <exit>; TERM_MEMLIMIT
Example bhist output:
Thu Mar 13 19:31:13: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause: Run limit reached
Termination reason in bacct -l: Completed <exit>; TERM_RUNLIMIT
Example bhist output:
Thu Mar 13 20:18:32: Exited by signal 2. The CPU time used is 0.1 seconds.

Example termination cause: CPU limit reached
Termination reason in bacct -l: Completed <exit>; TERM_CPULIMIT
Example bhist output:
Thu Mar 13 18:47:13: Exited by signal 24. The CPU time used is 62.0 seconds;

Example termination cause: Swap limit reached
Termination reason in bacct -l: Completed <exit>; TERM_SWAP
Example bhist output:
Thu Mar 13 18:47:13: Exited by signal 24. The CPU time used is 62.0 seconds;

Example termination cause: Regular job exits when host crashes
Termination reason in bacct -l: Completed <exit>; TERM_ZOMBIE (resource usage is logged as 0)
Example bhist output:
Thu Jun 12 15:49:02: Unknown; unable to reach the execution host;
Thu Jun 12 16:10:32: Running;
Thu Jun 12 16:10:38: Exited with exit code 143. The CPU time used is 0.0 seconds;

Example termination cause: brequeue -r
Termination reason in bacct -l: For each requeue, Completed <exit>; TERM_REQUEUE_ADMIN or TERM_REQUEUE_OWNER
Example bhist output:
Thu Mar 13 17:46:39: Signal <REQUEUE_PEND> requested by user or administrator <user2>;
Thu Mar 13 17:46:56: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause: bchkpnt -k
Termination reason in bacct -l: On the first run, Completed <exit>; TERM_CHKPNT
Example bhist output:
Wed Apr 16 16:00:48: Checkpoint succeeded (actpid 931249);
Wed Apr 16 16:01:03: Exited with exit code 137. The CPU time used is 0.0 seconds;

Example termination cause: kill -9 of RES and the job
Termination reason in bacct -l: Completed <exit>; TERM_EXTERNAL_SIGNAL
Example bhist output:
Thu Mar 13 17:30:43: Exited by signal 15. The CPU time used is 0.1 seconds;

Example termination cause: Others
Termination reason in bacct -l: Completed <exit>;
Example bhist output:
Thu Mar 13 17:30:43: Exited with 3; The CPU time used is 0.1 seconds;

Understanding LSF job exit codes

Exit codes are generated by LSF when jobs end due to signals received instead of exiting normally. LSF collects exit codes via the wait3() system call on UNIX platforms. The LSF exit code is derived from the system exit value: exit codes less than 128 relate to application exit values, while exit codes greater than 128 relate to system signal exit values (LSF adds 128 to the signal value). Use bhist to see the exit code for your job.

How or why the job may have been signaled, or exited with a certain exit code, can be application and/or system specific. The application or system logs might be able to give a better description of the problem.

tip:  
Termination signals are operating system dependent, so signal 5 may not be SIGTRAP and signal 11 may not be SIGSEGV on all UNIX and Linux systems. Pay attention to the execution host type in order to correctly translate the exit value if the job has been signaled.

Application exit values

The most common cause of abnormal LSF job termination is the application exit value. If your application had an explicit exit value less than 128, bjobs and bhist display the actual exit code of the application; for example, Exited with exit code 3. You would have to refer to the application code for the meaning of exit code 3.

It is possible for a job to explicitly exit with an exit code greater than 128, which can be confused with the corresponding system signal. Make sure that applications you write do not use exit codes greater than 128.

System signal exit values

Jobs terminated with a system signal are returned by LSF as exit codes greater than 128 such that exit_code-128=signal_value. For example, exit code 133 means that the job was terminated with signal 5 (SIGTRAP on most systems, 133-128=5). A job with exit code 130 was terminated with signal 2 (SIGINT on most systems, 130-128 = 2).

Some operating systems define exit values as 0-255. As a result, negative exit values or values greater than 255 may have a wrap-around effect on that range. The most common example is a program that exits -1, which is reported as exit code 255 in LSF.
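A minimal sketch of the wrap-around, using a hypothetical bash script submitted with bsub:

    cat > wrap_demo.sh << 'EOF'
    #!/bin/bash
    exit -1            # bash truncates the status to 8 bits, so the shell reports 255
    EOF
    chmod +x wrap_demo.sh
    bsub -o wrap_demo.out ./wrap_demo.sh
    # After the job finishes, bhist shows: Exited with exit code 255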

bhist and bjobs output

In most cases, bjobs and bhist show the application exit value (128 + signal). In some cases, bjobs and bhist show the actual signal value.

If LSF sends catchable signals to the job, it displays the exit value. For example, if you run bkill jobID to kill the job, LSF passes SIGINT, which causes the job to exit with exit code 130 (SIGINT is 2 on most systems, 128+2 = 130).

If LSF sends uncatchable signals to the job, then the entire process group for the job exits with the corresponding signal. For example, if you run bkill -s SEGV jobID to kill the job, bjobs and bhist show:

Exited by signal 7 

Example

The following example shows a job that exited with exit code 139, which means that the job was terminated with signal 11 (SIGSEGV on most systems, 139-128=11). This means that the application had a core dump.

bjobs -l 2012

Job <2012>, User , Project , Status , Queue , Command 
Fri Dec 27 22:47:28: Submitted from host , CWD <$HOME>;
Fri Dec 27 22:47:37: Started on , Execution Home , Execution CWD ;
Fri Dec 27 22:48:02: Exited with exit code 139. The CPU time used is 0.2 seconds.

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      - 
 		                cpuspeed    bandwidth 
 loadSched          -            - 
 loadStop           -            - 
