Resizable jobs

Enabling resizable jobs allows Session Scheduler to run jobs that request a minimum and a maximum number of slots, and to dynamically use however many slots are available at any given time.

Session Scheduler automatically releases idle resources for all resizable jobs. The typical use case is a "long tail" scenario, where all short-running tasks complete within one session except for a few long-running tasks. Those long-running tasks occupy a small number of hosts, leaving most of the originally allocated resources idle. Session Scheduler automatically detects those idle resources, shuts down the execution agents running on those hosts, and releases the resources back to LSF.

When additional resources are added to Session Scheduler, it recognizes those resources and makes use of them to run tasks.

Resizable job

A job whose job slot allocation can grow and shrink during its run time. The allocation change request may be triggered automatically or by the bresize command.

When users run the bresize release command to forcibly release resources from Session Scheduler, Session Scheduler recognizes the released resources and shuts down the execution agents running on those hosts.

Autoresizable job

A resizable job with a minimum and maximum slot request. LSF automatically schedules and allocates additional resources to satisfy the job's maximum request as the job runs.

Configuring resizable jobs

Session Scheduler jobs are resizable when RESIZABLE_JOBS=Y is set in the ssched application profile.

Session Scheduler jobs are autoresizable when RESIZABLE_JOBS=AUTO is set in the ssched application profile, or when RESIZABLE_JOBS=Y is set in the ssched application profile and jobs are submitted with the bsub option -ar.

Session Scheduler jobs are not resizable when the parameter RESIZABLE_JOBS in the ssched application profile is commented out.

When the bresize release command is run on Session Scheduler jobs, the parameter DJOB_RESIZE_GRACE_PERIOD=seconds in the ssched application profile configures a time interval for Session Scheduler to react and take necessary actions such as shutting down execution agents.
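
As an illustration of the parameters described above, the following is a minimal sketch of an ssched application profile in lsb.applications. The DESCRIPTION text is a made-up placeholder, the 30-second grace period is borrowed from the bresize example later in this section, and any other parameters your existing profile contains are omitted here.

```text
Begin Application
NAME                     = ssched
DESCRIPTION              = Session Scheduler jobs (placeholder description)
# Y: jobs are resizable; AUTO: jobs are autoresizable.
# Comment this line out to make jobs not resizable.
RESIZABLE_JOBS           = Y
# Seconds Session Scheduler has to react after bresize release.
DJOB_RESIZE_GRACE_PERIOD = 30
End Application
```

Changes to lsb.applications normally take effect only after the configuration is reloaded, typically with badmin reconfig.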

Examples

Adding new resources

For an autoresizable job, new resources are added as they become available. Consider the example where RESIZABLE_JOBS=AUTO is set in the ssched application profile:

bsub -app ssched -n 1,3 ssched -tasks ./my.tasks

The autoresizable job requests 1 to 3 hosts, but only 1 is available. The job starts running on 1 host while the request for 2 more pends. When the additional hosts become free, the job allocation grows to 3 hosts automatically. Selected output from bhist:

Dispatched to <host1>;
Starting (Pid 24721);
"Tasks:PEND=3 RUN=0 DONE=0 EXIT=0"
"Tasks:PEND=2 RUN=1 DONE=0 EXIT=0"
"Tasks:PEND=2 RUN=1 DONE=0 EXIT=0"
Additional allocation on 1 Hosts/Processors <host2>
Resize notification accepted
External Message "Tasks:PEND=2 RUN=1 DONE=0 EXIT=0"
Additional allocation on 1 Hosts/Processors <host03>
Resize notification accepted
External Message "Tasks:PEND=1 RUN=2 DONE=0 EXIT=0"
External Message "Tasks:PEND=0 RUN=3 DONE=0 EXIT=0"
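
The task-status messages in this output all follow the same "Tasks:PEND=n RUN=n DONE=n EXIT=n" pattern, so they are easy to process in a script. The following is a minimal sketch, not part of Session Scheduler: task_counts is a hypothetical helper name, and the output format it prints is an arbitrary choice made for illustration.

```shell
#!/bin/sh
# task_counts: hypothetical helper that extracts the PEND/RUN/DONE/EXIT
# counters from one task-status line and prints them with their total.
# Returns 1 if the line contains no "Tasks:" message.
task_counts() {
    case $1 in
        *Tasks:*) ;;
        *) return 1 ;;
    esac
    # Keep only the text after "Tasks:" and drop the closing double quote
    # (and anything that follows it).
    counts=${1#*Tasks:}
    counts=${counts%\"*}
    pend=0 run=0 done_n=0 exit_n=0
    # Unquoted on purpose: split the counters on whitespace.
    for pair in $counts; do
        case $pair in
            PEND=*) pend=${pair#PEND=} ;;
            RUN=*)  run=${pair#RUN=} ;;
            DONE=*) done_n=${pair#DONE=} ;;
            EXIT=*) exit_n=${pair#EXIT=} ;;
        esac
    done
    echo "pend=$pend run=$run done=$done_n exit=$exit_n total=$((pend + run + done_n + exit_n))"
}

# Example: the second-to-last status line from the output above.
task_counts 'External Message "Tasks:PEND=1 RUN=2 DONE=0 EXIT=0"'
# prints: pend=1 run=2 done=0 exit=0 total=3
```

The total stays constant while the job runs, so a change in the RUN counter between two messages reflects tasks starting on newly added hosts or finishing.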

Releasing resources

Idle resources are released for both resizable and autoresizable jobs. For example, with RESIZABLE_JOBS=Y set in the ssched application profile, an autoresizable job is submitted:

bsub -ar -app ssched -n 1,3 ssched -tasks ./my.longtail

The autoresizable job requests 1 to 3 hosts, and 3 are available, so the job starts running on all 3 right away. This long-tail job has many short tasks and only a few long ones, and it soon finishes running on 2 of the 3 hosts. Session Scheduler sees the 2 idle hosts and releases them from the allocation. Selected output from bhist:

Submitted from host <host11>, to Queue <normal>, CWD <$HOME>, 3 Processors Requested
Dispatched to 3 Hosts/Processors <host01> <host02> <host03>;
"Tasks:PEND=3 RUN=0 DONE=0 EXIT=0"
"Tasks:PEND=0 RUN=3 DONE=0 EXIT=0"

After two tasks are done, Session Scheduler releases two hosts:

Release allocation on 2 Hosts/Processors <host02> <host03> by user or administrator <user01>, Cancel pending allocation request;
Resize notification accepted;
"Tasks:PEND=0 RUN=1 DONE=2 EXIT=0"

Releasing resources using bresize

Resources can be released on demand for both resizable and autoresizable jobs. For example, with DJOB_RESIZE_GRACE_PERIOD=30 and RESIZABLE_JOBS=Y set in the ssched application profile, a resizable job is submitted:

bsub -app ssched -n 1,3 ssched -tasks ./my.longtail

Job <5> is submitted to the default queue <normal>

The resizable job starts running on host02, host03, and host04. The following command releases host02 from the job:

bresize release "host02" 5

Session Scheduler shuts down the execution agent on host02. The job continues to run on host03 and host04.

Selected output from bhist:

Submitted from host <delpe07.lsf.platform.com>, to Queue <normal>, CWD <$HOME>, 3 Processors Requested;
Dispatched to 3 Hosts/Processors <host02> <host03> <host04>;
"Tasks:PEND=3 RUN=0 DONE=0 EXIT=0"
"Tasks:PEND=0 RUN=3 DONE=0 EXIT=0"
Release allocation on 1 Hosts/Processors <host02> by user or administrator <user01>
Resize notification accepted
"Tasks:PEND=1 RUN=2 DONE=0 EXIT=0"
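
To pull out only the resize-related lines of a job history like the ones shown in these examples, a simple text filter is usually enough. The following is a minimal sketch: resize_events is a hypothetical helper name, and the patterns are taken from the sample output in this section. It assumes each event sits on a single line; real bhist -l output may wrap long lines, in which case some events could be missed.

```shell
#!/bin/sh
# resize_events: hypothetical filter that reads text on stdin and keeps
# only allocation changes, resize notifications, and task-status messages.
resize_events() {
    grep -E 'Additional allocation|Release allocation|Resize notification|Tasks:'
}

# Example usage (requires an LSF cluster; job ID 5 is the job from the
# bresize example above):
#   bhist -l 5 | resize_events
```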