LSF batch saves a lot of valuable information about the system and jobs. Such information is logged by mbatchd in the files lsb.events and lsb.acct under the directory $LSB_SHAREDIR/your_cluster/logdir, where LSB_SHAREDIR is defined in the lsf.conf file and your_cluster is the name of your Platform LSF cluster.
mbatchd logs such information for several purposes.
Some of the events serve as the backup of mbatchd’s memory. In case mbatchd crashes, all critical information from the event file can then be used by the newly started mbatchd to restore the current state of LSF batch.
The events can be used to produce historical information about the LSF batch system and user jobs.
Such information can be used to produce accounting or statistic reports.
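The log directory layout described above, $LSB_SHAREDIR/your_cluster/logdir, can be assembled in a program rather than hard-coded. The following is a minimal sketch only: lsf_log_path() is a hypothetical helper, not part of LSBLIB, and it assumes the caller has already obtained the LSB_SHAREDIR value (for example by reading lsf.conf) and knows the cluster name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not an LSBLIB call): writes
 * "<sharedir>/<cluster>/logdir/<file>" into buf.
 * Returns 0 on success, -1 if an argument is missing or buf is too small.
 */
int lsf_log_path(char *buf, size_t size, const char *sharedir,
                 const char *cluster, const char *file)
{
    int n;

    if (buf == NULL || size == 0 ||
        sharedir == NULL || cluster == NULL || file == NULL)
        return -1;

    n = snprintf(buf, size, "%s/%s/logdir/%s", sharedir, cluster, file);
    if (n < 0 || (size_t)n >= size)
        return -1;              /* output error or truncated path */

    return 0;
}
```

The resulting string can be passed to fopen() for either lsb.events or lsb.acct.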
LSBLIB provides a function to read information from these files into a well-defined data structure:
struct eventRec *lsb_geteventrec(log_fp, lineNum)
FILE *log_fp;     /* File handle for either an event log file or job log file */
int *lineNum;     /* Line number of the next event record */
The parameter log_fp is returned by a successful fopen() call. On a successful return, the content of lineNum is modified to indicate the line number of the next event record in the log file. This value can then be used to report the line number when an error occurs while reading the log file. It should be initialized to 0 before lsb_geteventrec() is called for the first time.
lsb_geteventrec() returns the following data structure:
struct eventRec {
    char version[MAX_VERSION_LEN];    /* Version number of the mbatchd */
    int type;                         /* Type of the event */
    time_t eventTime;                 /* Event time stamp */
    union eventLog eventLog;          /* Event data */
};
The event type is used to determine the structure of the data in eventLog. LSBLIB remembers the storage allocated for the previously returned data structure and automatically frees it before returning the next event record.
lsb_geteventrec() returns NULL and sets lsberrno to LSBE_EOF when there are no more records in the event file.
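Because LSBLIB frees the previously returned record before returning the next one, a pointer into an old record should be treated as invalid after the next call. Any field a program needs later has to be copied first. The sketch below shows one way to do that for string fields; keep_string() is a hypothetical helper, not an LSBLIB function.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not an LSBLIB call): returns a heap copy of src
 * that the caller owns and must free(), or NULL if src is NULL or
 * memory cannot be allocated.
 */
char *keep_string(const char *src)
{
    char *copy;
    size_t len;

    if (src == NULL)
        return NULL;

    len = strlen(src) + 1;          /* include the terminating '\0' */
    copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, src, len);

    return copy;
}
```

For example, a program could call keep_string() on a string field of the current record before calling lsb_geteventrec() again, and free() the copy once it is no longer needed.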
Each event type corresponds to a different data structure in the union:
union eventLog {
    struct jobNewLog jobNewLog;                       /* EVENT_JOB_NEW */
    struct jobStartLog jobStartLog;                   /* EVENT_JOB_START */
    struct jobStatusLog jobStatusLog;                 /* EVENT_JOB_STATUS */
    struct jobSwitchLog jobSwitchLog;                 /* EVENT_JOB_SWITCH */
    struct jobMoveLog jobMoveLog;                     /* EVENT_JOB_MOVE */
    struct queueCtrlLog queueCtrlLog;                 /* EVENT_QUEUE_CTRL */
    struct hostCtrlLog hostCtrlLog;                   /* EVENT_HOST_CTRL */
    struct mbdStartLog mbdStartLog;                   /* EVENT_MBD_START */
    struct mbdDieLog mbdDieLog;                       /* EVENT_MBD_DIE */
    struct unfulfillLog unfulfillLog;                 /* EVENT_MBD_UNFULFILL */
    struct jobFinishLog jobFinishLog;                 /* EVENT_JOB_FINISH */
    struct loadIndexLog loadIndexLog;                 /* EVENT_LOAD_INDEX */
    struct migLog migLog;                             /* EVENT_MIG */
    struct calendarLog calendarLog;                   /* Shared by all calendar events */
    struct jobForceRequestLog jobForceRequestLog;     /* EVENT_JOB_FORCE */
    struct jobForwardLog jobForwardLog;               /* EVENT_JOB_FORWARD */
    struct jobAcceptLog jobAcceptLog;                 /* EVENT_JOB_ACCEPT */
    struct statusAckLog statusAckLog;                 /* EVENT_STATUS_ACK */
    struct signalLog signalLog;                       /* EVENT_JOB_SIGNAL */
    struct jobExecuteLog jobExecuteLog;               /* EVENT_JOB_EXECUTE */
    struct jobRequeueLog jobRequeueLog;               /* EVENT_JOB_REQUEUE */
    struct sigactLog sigactLog;                       /* EVENT_JOB_SIGACT */
    struct jobStartAcceptLog jobStartAcceptLog;       /* EVENT_JOB_START_ACCEPT */
    struct jobMsgLog jobMsgLog;                       /* EVENT_JOB_MSG */
    struct jobMsgAckLog jobMsgAckLog;                 /* EVENT_JOB_MSG_ACK */
    struct chkpntLog chkpntLog;                       /* EVENT_CHKPNT */
    struct jobOccupyReqLog jobOccupyReqLog;           /* EVENT_JOB_OCCUPY_REQ */
    struct jobVacatedLog jobVacatedLog;               /* EVENT_JOB_VACATED */
    struct jobCleanLog jobCleanLog;                   /* EVENT_JOB_CLEAN */
    struct jobExceptionLog jobExceptionLog;           /* EVENT_JOB_EXCEPTION */
    struct jgrpNewLog jgrpNewLog;                     /* EVENT_JGRP_ADD */
    struct jgrpCtrlLog jgrpCtrlLog;                   /* EVENT_JGRP_CTR */
    struct logSwitchLog logSwitchLog;                 /* EVENT_LOG_SWITCH */
    struct jobModLog jobModLog;                       /* EVENT_JOB_MODIFY */
    struct jgrpStatusLog jgrpStatusLog;               /* EVENT_JGRP_STATUS */
    struct jobAttrSetLog jobAttrSetLog;               /* EVENT_JOB_ATTR_SET */
    struct jobExternalMsgLog jobExternalMsgLog;       /* EVENT_JOB_EXT_MSG */
    struct jobChunkLog jobChunkLog;                   /* EVENT_JOB_CHUNK */
    struct sbdUnreportedStatusLog sbdUnreportedStatusLog; /* EVENT_SBD_UNREPORTED_STATUS */
    struct rsvFinishLog rsvFinishLog;
    struct hgCtrlLog hgCtrlLog;
    struct cpuProfileLog cpuProfileLog;
    struct dataLoggingLog dataLoggingLog;
    struct jobRunRusageLog jobRunRusageLog;
    struct eventEOSLog eventEOSLog;
    struct slaLog slaLog;
    struct perfmonLog perfmonLog;
    struct taskFinishLog taskFinishLog;
    struct jobResizeNotifyStartLog jobResizeNotifyStartLog;
    struct jobResizeNotifyAcceptLog jobResizeNotifyAcceptLog;
    struct jobResizeNotifyDoneLog jobResizeNotifyDoneLog;
    struct jobResizeReleaseLog jobResizeReleaseLog;
    struct jobResizeCancelLog jobResizeCancelLog;
    struct jobResizeLog jobResizeLog;
};
The detailed data structures in the above union are defined in lsbatch.h and described in lsb_geteventrec(3).
Below is an example program that takes a job name as its argument and displays a chronological history of all jobs matching that name. This program assumes that the lsb.events file is in /local/lsf/mnt/work/cluster1/logdir.
/******************************************************
* The program takes a job name as the argument and returns
* the information of the job with this given name
******************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <lsf/lsbatch.h>

/* job IDs of all jobs submitted under the requested name */
static int *matchedIds = NULL;
static int numMatched = 0;

/* remember a job ID; returns -1 if memory cannot be allocated */
static int rememberJob(int jobId)
{
    int *tmp = realloc(matchedIds, (numMatched + 1) * sizeof(int));

    if (tmp == NULL)
        return -1;
    matchedIds = tmp;
    matchedIds[numMatched++] = jobId;
    return 0;
}

/* return 1 if the job ID belongs to a job with the requested name */
static int isMatched(int jobId)
{
    int i;

    for (i = 0; i < numMatched; i++)
        if (matchedIds[i] == jobId)
            return 1;
    return 0;
}

int main(int argc, char **argv)
{
    char *eventFile = "/local/lsf/mnt/work/cluster1/logdir/lsb.events"; /* location of lsb.events */
    FILE *fp;                         /* file handle for lsb.events */
    struct eventRec *record;          /* pointer to the return struct of lsb_geteventrec() */
    int lineNum = 0;                  /* line number of next event */
    char *jobName;                    /* specified job name */
    int i;
    struct jobNewLog *newJob;         /* new job event record */
    struct jobStartLog *startJob;     /* start job event record */
    struct jobStatusLog *statusJob;   /* job status change event record */

    /* check if the input is in the right format: "./geteventrec JOBNAME" */
    if (argc != 2) {
        printf("Usage: %s job_name\n", argv[0]);
        exit(-1);
    }
    jobName = argv[1];

    /* initialize LSBLIB and get the configuration environment */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* open the file for read */
    fp = fopen(eventFile, "r");
    if (fp == NULL) {
        perror(eventFile);
        exit(-1);
    }

    /* get events and print the records that belong to jobs with the given name */
    for (;;) {
        record = lsb_geteventrec(fp, &lineNum);
        if (record == NULL) {
            if (lsberrno == LSBE_EOF)
                exit(0);
            lsb_perror("lsb_geteventrec");
            exit(-1);
        }

        /* record->type decides which member of the eventLog union is valid */
        switch (record->type) {
        case EVENT_JOB_NEW:
            newJob = &(record->eventLog.jobNewLog);
            /* the job name is read only here, where jobNewLog is the valid member */
            if (newJob->jobName == NULL || strcmp(newJob->jobName, jobName) != 0)
                continue;
            /* remember the job ID so later events can be matched by ID */
            if (rememberJob(newJob->jobId) < 0) {
                perror("realloc");
                exit(-1);
            }
            printf("%sJob <%d> submitted by <%s> from <%s> to <%s> queue\n",
                   ctime(&record->eventTime), newJob->jobId, newJob->userName,
                   newJob->fromHost, newJob->queue);
            continue;

        case EVENT_JOB_START:
            startJob = &(record->eventLog.jobStartLog);
            if (!isMatched(startJob->jobId))
                continue;
            printf("%sJob <%d> started on ", ctime(&record->eventTime),
                   startJob->jobId);
            for (i = 0; i < startJob->numExHosts; i++)
                printf("<%s> ", startJob->execHosts[i]);
            printf("\n");
            continue;

        case EVENT_JOB_STATUS:
            statusJob = &(record->eventLog.jobStatusLog);
            if (!isMatched(statusJob->jobId))
                continue;
            printf("%sJob <%d> status changed to: ",
                   ctime(&record->eventTime), statusJob->jobId);
            switch (statusJob->jStatus) {
            case JOB_STAT_PEND:
                printf("pending\n");
                break;
            case JOB_STAT_RUN:
                printf("running\n");
                break;
            case JOB_STAT_SSUSP:
            case JOB_STAT_USUSP:
            case JOB_STAT_PSUSP:
                printf("suspended\n");
                break;
            case JOB_STAT_UNKWN:
                printf("unknown (sbatchd unreachable)\n");
                break;
            case JOB_STAT_EXIT:
                printf("exited\n");
                break;
            case JOB_STAT_DONE:
                printf("done\n");
                break;
            default:
                printf("\nError: unknown job status %d\n", statusJob->jStatus);
                break;
            }
            continue;

        default:
            /* Only display a few selected event types */
            continue;
        }
    }
    exit(0);
}
In the above program, events that are of no interest are skipped. The job status codes are defined in lsbatch.h. The lsb.acct file, which stores job accounting information, can be processed in the same way. Since there is currently only one event type (EVENT_JOB_FINISH) in lsb.acct, processing is simpler than in the above example.
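As a rough illustration of processing lsb.acct, the sketch below prints one line per finished job. It makes several assumptions. turnaround_seconds() is a hypothetical helper. HAVE_LSBATCH is a hypothetical macro used here only so that the helper can be compiled without LSBLIB. The jobFinishLog fields jobId, userName, queue, submitTime and endTime are assumed to be as declared in lsbatch.h; check lsb_geteventrec(3) for your version.

```c
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper: seconds between submission and completion,
 * or -1 if the two time stamps are inconsistent.
 */
long turnaround_seconds(time_t submitTime, time_t endTime)
{
    if (endTime < submitTime)
        return -1;
    return (long)difftime(endTime, submitTime);
}

#ifdef HAVE_LSBATCH  /* hypothetical macro: define when building against LSBLIB */
#include <lsf/lsbatch.h>

int main(int argc, char **argv)
{
    FILE *fp;
    struct eventRec *record;
    struct jobFinishLog *done;
    int lineNum = 0;                 /* must start at 0 */

    if (argc != 2) {
        fprintf(stderr, "Usage: %s path_to_lsb.acct\n", argv[0]);
        return 1;
    }
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        return 1;
    }
    fp = fopen(argv[1], "r");
    if (fp == NULL) {
        perror(argv[1]);
        return 1;
    }

    while ((record = lsb_geteventrec(fp, &lineNum)) != NULL) {
        if (record->type != EVENT_JOB_FINISH)
            continue;                /* defensive: only one type is expected */
        done = &record->eventLog.jobFinishLog;
        printf("Job <%d> user <%s> queue <%s> turnaround %ld s\n",
               done->jobId, done->userName, done->queue,
               turnaround_seconds(done->submitTime, done->endTime));
    }

    /* NULL with anything other than LSBE_EOF means a read error */
    if (lsberrno != LSBE_EOF) {
        fprintf(stderr, "line %d: ", lineNum);
        lsb_perror("lsb_geteventrec");
        fclose(fp);
        return 1;
    }
    fclose(fp);
    return 0;
}
#endif
```

Unlike the lsb.events example, no switch over event types is needed here, and lineNum is used to report where a read error occurred.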