Running and monitoring Session Scheduler jobs

Create a Session Scheduler session and run tasks

  1. Create a task definition file.

    For example:

    cat my.tasks
    sleep 10
    hostname
    uname
    ls
  2. Use bsub with the ssched application profile to submit a Session Scheduler job with the task definition.
    bsub -app ssched bsub_options ssched [task_options] [-tasks task_definition_file] [command [arguments]]
    For example:
    bsub -app ssched ssched -tasks my.tasks

When all tasks finish, Session Scheduler exits, all temporary files are deleted, the session job is cleaned from the system, and Session Scheduler output is captured and included in the standard LSF job e-mail.
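
The two steps above can also be scripted. The following is a minimal sketch, assuming a POSIX shell; it writes the same my.tasks file as the example and runs the submission only if bsub is found on the PATH. The variable names are illustrative.

```shell
# Write a task definition file with one task command per line.
# The file name my.tasks and the commands match the example above.
tasks_file=my.tasks
printf '%s\n' 'sleep 10' 'hostname' 'uname' 'ls' > "$tasks_file"

# Count the tasks that the session would run.
task_count=$(wc -l < "$tasks_file" | tr -d ' ')
echo "Wrote $task_count tasks to $tasks_file"

# Submit only where LSF is available (same command as the example above).
if command -v bsub >/dev/null 2>&1; then
    bsub -app ssched ssched -tasks "$tasks_file"
fi
```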

You can also submit a Session Scheduler job without a task definition file to specify a single task.

Note:

The submission directory path can contain up to 4094 characters.

See the ssched command reference for detailed information about all task options.

Submit a Session Scheduler job as a parallel LSF job

Use the -n option of bsub to submit a Session Scheduler job as a parallel LSF job.
bsub -app ssched -n num_hosts ssched [task_options] [-tasks task_definition_file] [command [arguments]]
For example:
bsub -app ssched -n 2 ssched -tasks my.tasks

Submit task array jobs

Use the -J option to submit a task array from the command line. No task definition file is needed:
-J task_name[index_list]

The index list must be enclosed in square brackets. The index list is a comma-separated list whose elements have the syntax start[-end[:step]] where start, end and step are positive integers. If the step is omitted, a step of one (1) is assumed. The task array index starts at one (1).

All tasks in the array share the same option parameters. Each element of the array is distinguished by its array index.
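
To see which indexes a given index list selects, the syntax described above can be expanded by hand. The following is a rough sketch in POSIX shell. The function expand_index_list is a hypothetical helper written for illustration, not part of ssched or LSF, and it does only minimal validation.

```shell
# expand_index_list "1-5:2,8" prints the selected indexes, one per line.
# Hypothetical helper for illustration; not part of ssched or LSF.
expand_index_list() {
    # Split the comma-separated list into its elements.
    old_ifs=$IFS
    IFS=,
    set -- $1
    IFS=$old_ifs
    for element in "$@"; do
        range=${element%%:*}            # "start" or "start-end"
        step=1                          # default step when omitted
        case $element in
            *:*) step=${element##*:} ;;
        esac
        [ "$step" -ge 1 ] || return 1   # step must be a positive integer
        start=${range%%-*}
        end=$start                      # a bare start selects one index
        case $range in
            *-*) end=${range##*-} ;;
        esac
        i=$start
        while [ "$i" -le "$end" ]; do
            echo "$i"
            i=$((i + step))
        done
    done
}

# Prints 1, 2, 3, 10, 15, 20 (one per line).
expand_index_list "1-3,10-20:5"
```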

See the ssched command reference for detailed information about all task options.

Submit tasks with automatic task requeue

Use the -Q option to specify requeue exit values for the tasks:
-Q "exit_code ..."

-Q enables automatic task requeue and sets the LSB_EXIT_REQUEUE environment variable. Use spaces to separate multiple exit codes. LSF does not save the output from the failed task, and does not notify the user that the task failed.

If a job is killed by a signal, the exit value is 128+signal_value. Use the sum of 128 and the signal value as the exit code in the parameter. For example, to rerun a task that is killed by signal 9 (SIGKILL), specify the exit value 128+9=137.
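
The arithmetic above can be checked in a shell. The following is a small sketch, assuming a shell such as bash or dash that reports 128+signal_value for a child killed by a signal. The function signal_exit_code and the choice of signals are illustrative only.

```shell
# Exit value reported for a task killed by a signal: 128 + signal number.
# Hypothetical helper for illustration.
signal_exit_code() {
    echo $((128 + $1))
}

# Build a -Q argument that requeues on SIGKILL (9) and SIGTERM (15).
requeue_codes="$(signal_exit_code 9) $(signal_exit_code 15)"
echo "-Q \"$requeue_codes\""

# Cross-check against the shell: run a child that kills itself with SIGKILL
# and record the exit status that the parent shell reports.
status=0
sh -c 'kill -9 $$' || status=$?
echo "observed exit status: $status"
```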

The SSCHED_REQUEUE_LIMIT setting limits the number of times a task can be requeued.

See the ssched command reference for detailed information about all task options.

Monitor Session Scheduler jobs

  1. Run bjobs -ss to get summary information for Session Scheduler jobs and tasks.
    JOBID OWNER    JOB_NAME NTASKS PEND DONE RUN EXIT
    1     lsfadmin job1     10     4    4    2   0
    2     lsfadmin job2     10     10   0    0   0
    3     lsfadmin job3     10     10   0    0   0

    The output shows the job ID, owner, job name, and total number of tasks for each Session Scheduler job, followed by the number of tasks in each state: PEND, DONE, RUN, and EXIT.

  2. Use bjobs -l -ss or bread to track the progress of the Session Scheduler job.
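
The summary columns lend themselves to simple post-processing. The following is a sketch that assumes the column order shown in the sample output above and job names without spaces. It reads a copy of that sample rather than a live cluster, and counts tasks in the DONE and EXIT states as finished.

```shell
# Sample summary copied from the bjobs -ss example above.
# On a live cluster, the same awk program could read from: bjobs -ss
summary='JOBID OWNER JOB_NAME NTASKS PEND DONE RUN EXIT
1 lsfadmin job1 10 4 4 2 0
2 lsfadmin job2 10 10 0 0 0
3 lsfadmin job3 10 10 0 0 0'

# Skip the header line, then print job ID, job name, and the
# percentage of finished tasks (DONE + EXIT out of NTASKS).
progress=$(printf '%s\n' "$summary" | awk 'NR > 1 && $4 > 0 {
    printf "%s %s %d%%\n", $1, $3, ($6 + $8) * 100 / $4
}')
printf '%s\n' "$progress"
```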

Kill a Session Scheduler session

Use bkill to kill the Session Scheduler session. All temporary files are deleted, and the session job is cleaned from the system.

Check your job submission

Use the -C option to sanity-check all parameters and the task definition file.

ssched exits after the check is complete. An exit code of 0 indicates no errors were found. A non-zero exit code indicates errors. You can run ssched -C outside of LSF.

See the ssched command reference for detailed information about all task options.

Example output of ssched -C:

ssched -C -tasks my.tasks
Error in tasks file line 1: -XXX 123 sleep 0
Unsupported option: -XXX
Error in tasks file line 2: -o my.out
A command must be specified

Only the ssched parameters are checked, not the task command itself. The task command must exist and be executable, but ssched -C cannot detect whether it does. To check a task definition file, remember to specify the -tasks option.
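
Because ssched -C signals errors through its exit code, a script can use it as a gate before submission. The following is a sketch only. The function submit_if_valid and the SSCHED and BSUB override variables are hypothetical names introduced here so that the logic can be dry-run without LSF installed.

```shell
# submit_if_valid TASKS_FILE
# Hypothetical helper: submits the session job only when the check exits 0.
# SSCHED and BSUB default to the real commands and can be overridden.
submit_if_valid() {
    if "${SSCHED:-ssched}" -C -tasks "$1"; then
        "${BSUB:-bsub}" -app ssched ssched -tasks "$1"
    else
        echo "ssched -C reported errors in $1; not submitting" >&2
        return 1
    fi
}

# Dry run with stand-in commands: "true" plays a passing check,
# and "echo" prints the submission arguments instead of running bsub.
SSCHED=true
BSUB=echo
submit_if_valid my.tasks
```

On a cluster, leave SSCHED and BSUB unset so that the real commands run.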

Enable recoverable Session Scheduler sessions

By default, Session Scheduler sessions are unrecoverable. In the event of a system crash, the session job must be resubmitted, and all tasks are rerun.

However, the Session Scheduler supports application-level checkpoint/restart using Platform LSF's existing facilities. If the user specifies a checkpoint directory when submitting the session job, the job can be restarted using brestart. After a restart, only those tasks that have not yet completed are resubmitted and run.

To enable recoverable sessions, when submitting the session job:
  1. Provide a writable directory on a shared file system.
  2. Specify the ssched checkpoint method with the bsub -k option.

You do not need to call bchkpnt. The Session Scheduler automatically checkpoints itself after each task completes.

For example:
bsub -app ssched -k "/share/scratch method=ssched" -n 8 ssched -tasks simpton.tasks
Job <123> is submitted to default queue <normal>.
...
brestart /share/scratch 123