Platform LSF batch system

LSF batch is a layered, distributed load-sharing batch system built on top of Platform LSF base. The services provided by LSF batch extend the Platform LSF base services. Application programmers can access batch services through the LSF batch library (LSBLIB). The diagram below shows the components of LSF batch and their relationship:

LSF batch accepts user jobs and holds them in queues until suitable hosts are available. It runs the jobs on LSF batch execution hosts, which are the hosts that a site deems suitable for running batch jobs.

LSBLIB consists of the LSF API, the direct user interface to the rest of the LSF batch system. Platform LSF APIs provide easy access to the services of Platform LSF servers. The API routines hide the details of the interaction between the application and Platform LSF servers in a platform-independent way.

LSF batch services are provided by two daemons: one mbatchd (master batch daemon) running in each Platform LSF cluster, and one sbatchd (slave batch daemon) running on each batch server host.

Application and Platform LSF batch interactions

LSF batch operation relies on the services provided by Platform LSF base. LSF batch contacts the master LIM to get load and resource information about every batch server host. The diagram below shows the typical operation of LSF batch:

LSF batch executes jobs by sending user requests from the submission host to the master host. The master host puts the job in a queue and dispatches the job to an execution host. The job is run and the results are emailed to the user.

Unlike in LSF base, the submission host in LSF batch does not interact directly with the execution host.

  1. bsub or lsb_submit() submits a job to LSF for execution.

  2. To access LSF base services, the submitted job proceeds through the Platform LSF batch library (LSBLIB), which contains LSF base library information.

  3. The LIM communicates the job’s information to the cluster’s master LIM. Periodically, the LIM on individual machines gathers its 12 built-in load indices and forwards this information to the master LIM.

  4. The master LIM determines the best host to run the job and sends this information back to the submission host’s LIM.

  5. Information about the chosen execution host is passed through the LSF batch library.

  6. Information about the host to execute the job is passed back to bsub or lsb_submit().

  7. To enter the batch system, bsub or lsb_submit() sends the job to LSBLIB.

  8. Using LSBLIB services, the job is sent to the mbatchd running on the cluster’s master host.

  9. The mbatchd puts the job in an appropriate queue and waits for the appropriate time to dispatch the job. User jobs are held in batch queues by mbatchd, which checks the load information on all candidate hosts periodically.

  10. When an execution host with the necessary resources becomes available, the mbatchd dispatches the job to that host, where the host’s sbatchd receives it. When more than one host is available, the best host is chosen.

  11. Once a job is sent to an sbatchd, that sbatchd controls the execution of the job and reports the job’s status to mbatchd. The sbatchd creates a child sbatchd to handle job execution.

  12. The child sbatchd sends the job to the RES.

  13. The RES creates the execution environment to run the job.

  14. The job is run in the execution environment.

  15. The results of the job are sent to the email system.

  16. The email system sends the job’s results to the user.
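
Steps 9 and 10 above (holding jobs in a queue, then dispatching each to the best available host) can be illustrated with a small simulation. This is only a sketch of the idea, not the mbatchd scheduling algorithm: the Host and Job classes, the memory requirement, the single load figure, and the "lowest load wins" rule are all assumptions made for illustration, and no LSBLIB routine is called.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    free_mem_mb: int  # hypothetical resource measure
    load: float       # hypothetical load figure; lower is better


@dataclass
class Job:
    job_id: int
    mem_mb: int       # hypothetical resource requirement


def pick_host(job, hosts):
    """Return the least-loaded host that can satisfy the job, or None."""
    candidates = [h for h in hosts if h.free_mem_mb >= job.mem_mb]
    return min(candidates, key=lambda h: h.load) if candidates else None


def dispatch_pass(queue, hosts):
    """One periodic pass: dispatch the jobs that fit, keep the rest queued."""
    dispatched = []
    still_pending = deque()
    while queue:
        job = queue.popleft()
        host = pick_host(job, hosts)
        if host is None:
            still_pending.append(job)        # no suitable host yet; job stays queued
            continue
        host.free_mem_mb -= job.mem_mb       # reserve the resource on the chosen host
        dispatched.append((job.job_id, host.name))
    queue.extend(still_pending)
    return dispatched


hosts = [Host("hostA", 4096, 0.9), Host("hostB", 2048, 0.2)]
queue = deque([Job(101, 1024), Job(102, 3000), Job(103, 8192)])
result = dispatch_pass(queue, hosts)
```

In this run, job 101 goes to the less loaded hostB, job 102 fits only on hostA, and job 103 remains in the queue because no host has enough free memory. A later pass would pick it up once a suitable host became available.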

The mbatchd always runs on the host where the master LIM runs. The sbatchd on the master host automatically starts the mbatchd. If the master LIM moves to a different host, the current mbatchd automatically resigns and a new mbatchd is started automatically on the new master host.
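
The rule that the mbatchd follows the master LIM can be modeled as a small reconciliation step. The sketch below is a toy model under stated assumptions: the Cluster class, its attributes, and the event strings are hypothetical and do not correspond to any LSF interface.

```python
class Cluster:
    """Toy model: the mbatchd always lives on the master LIM's host."""

    def __init__(self, hosts, master):
        self.hosts = list(hosts)
        self.master_lim_host = master
        self.mbatchd_host = None
        self.events = []          # human-readable trace of what happened
        self.reconcile()

    def reconcile(self):
        """Make the mbatchd location match the master LIM location."""
        if self.mbatchd_host == self.master_lim_host:
            return
        if self.mbatchd_host is not None:
            # The old mbatchd gives up its role.
            self.events.append(f"mbatchd on {self.mbatchd_host} resigns")
        # The sbatchd on the (new) master host starts the mbatchd.
        self.mbatchd_host = self.master_lim_host
        self.events.append(f"sbatchd on {self.mbatchd_host} starts mbatchd")

    def move_master_lim(self, new_host):
        """Simulate the master LIM moving to another host."""
        if new_host not in self.hosts:
            raise ValueError(f"unknown host: {new_host}")
        self.master_lim_host = new_host
        self.reconcile()


cluster = Cluster(["hostA", "hostB"], master="hostA")
cluster.move_master_lim("hostB")
```

After the move, cluster.mbatchd_host is "hostB", and cluster.events records the initial start on hostA, the resignation on hostA, and the new start on hostB, in that order.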

The log files store important system and job information so that a newly started mbatchd can restore the state of the previous mbatchd. The log files also provide historical information about jobs, queues, hosts, and LSF batch servers.
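
Restoring state from a log amounts to replaying recorded events in order. The sketch below shows that idea with an invented record format (SUBMIT, START, and FINISH lines); this format is an assumption made for illustration and is not the layout of the actual LSF batch log files.

```python
def replay(event_lines):
    """Rebuild a job table by replaying events in the order they were logged.

    Hypothetical record formats (not the real LSF log layout):
        SUBMIT <job_id> <queue>
        START  <job_id> <host>
        FINISH <job_id>
    """
    jobs = {}
    for line in event_lines:
        parts = line.split()
        if not parts:
            continue                              # skip blank lines
        event, job_id = parts[0], int(parts[1])
        if event == "SUBMIT":
            jobs[job_id] = {"queue": parts[2], "status": "PEND", "host": None}
        elif event == "START":
            jobs[job_id]["status"] = "RUN"
            jobs[job_id]["host"] = parts[2]
        elif event == "FINISH":
            jobs[job_id]["status"] = "DONE"
    return jobs


sample_log = [
    "SUBMIT 101 normal",
    "SUBMIT 102 priority",
    "START 101 hostB",
    "FINISH 101",
    "",
]
restored = replay(sample_log)
```

Here the restored table shows job 101 as finished on hostB and job 102 still pending in its queue, which is the information a newly started daemon would need to carry on from where the previous one stopped.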