
Reserving Resources


About Resource Reservation

When a job is dispatched, the system assumes that the resources that the job consumes will be reflected in the load information. However, many jobs do not consume the resources they require when they first start. Instead, they will typically use the resources over a period of time.

For example, a job requiring 100 MB of swap is dispatched to a host having 150 MB of available swap. The job starts off initially allocating 5 MB and gradually increases the amount consumed to 100 MB over a period of 30 minutes. During this period, another job requiring more than 50 MB of swap should not be started on the same host to avoid over-committing the resource.

To prevent LSF from overcommitting resources, you can reserve them. Resource reservation requirements can be specified as part of the resource requirements when a job is submitted, or can be configured into the queue-level resource requirements.

Pending job resize allocation requests are not supported in slot reservation policies. Newly added or removed resources are reflected in the pending job predicted start time calculation.

Resource reservation limits

Maximum and minimum values for consumable resource requirements can be set for individual queues, so jobs will only be accepted if they have resource requirements within a specified range. This can be useful when queues are configured to run jobs with specific memory requirements, for example. Jobs requesting more memory than the maximum limit for the queue will not be accepted, and will not take memory resources away from the smaller memory jobs the queue is designed to run.

Resource reservation limits are set at the queue level by the parameter RESRSV_LIMIT in lsb.queues.
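
For example, the following lsb.queues fragment sketches such a limit (the queue name and values are illustrative only):

Begin Queue
QUEUE_NAME = bigmem
.
RESRSV_LIMIT = [mem=1000,8000]
.
End Queue 

With this limit, a job whose rusage string requests less than 1000 MB or more than 8000 MB of memory is rejected by this queue.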

How resource reservation works

When deciding whether to schedule a job on a host, LSF considers the reserved resources of jobs that have previously started on that host. For each load index, the amount reserved by all jobs on that host is summed and then subtracted from (or added to, if the index is increasing) the current value of the resource as reported by the LIM, giving the amount available for scheduling new jobs:

available amount = current value - reserved amount for all jobs 

For example:

bsub -R "rusage[tmp=30:duration=30:decay=1]" myjob 

will reserve 30 MB of temp space for the job. As the job runs, the amount reserved will decrease at approximately 1 MB/minute such that the reserved amount is 0 after 30 minutes.

Queue-level and job-level resource reservation

The queue-level resource requirement parameter RES_REQ may also specify the resource reservation. If a queue reserves a certain amount of a resource (and the RESRSV_LIMIT parameter is not being used), you cannot reserve a greater amount of that resource at the job level.

For example, if the output of bqueues -l command contains:

RES_REQ: rusage[mem=40:swp=80:tmp=100] 

the following submission will be rejected because the requested amount of certain resources exceeds the queue's specification:

bsub -R "rusage[mem=50:swp=100]" myjob 

When both RES_REQ and RESRSV_LIMIT are set in lsb.queues for a consumable resource, the queue-level RES_REQ no longer acts as a hard limit for the merged RES_REQ rusage values from the job and application levels. In this case only the limits set by RESRSV_LIMIT must be satisfied, and the queue-level RES_REQ acts as a default value.
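
As an illustrative sketch of this behavior (the values are hypothetical), consider a queue configured with both parameters:

RES_REQ = rusage[mem=40]
RESRSV_LIMIT = [mem=20,200]

A job submitted with bsub -R "rusage[mem=100]" is accepted because 100 MB falls within the 20 MB to 200 MB range set by RESRSV_LIMIT, even though it exceeds the 40 MB in the queue-level RES_REQ; the 40 MB now serves only as the default reservation for jobs that do not request mem themselves.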

Using Resource Reservation

Queue-level resource reservation

At the queue level, resource reservation allows you to specify the amount of resources to reserve for jobs in the queue. It also serves as the upper limit on resource reservation if a user also specifies resource reservation when submitting a job.

Queue-level resource reservation and pending reasons

The use of RES_REQ affects the pending reasons as displayed by bjobs. If RES_REQ is specified in the queue and the loadSched thresholds are not specified, then the pending reasons for each individual load index will not be displayed.

Configuring resource reservation at the queue level

Queue-level resource reservations and resource reservation limits can be configured as parameters in lsb.queues. The resource reservation requirement can be configured at the queue level as part of the queue level resource requirements. Use the resource usage (rusage) section of the resource requirement string to specify the amount of resources a job should reserve after it is started.

Examples
Begin Queue
.
RES_REQ = select[type==any] rusage[swp=100:mem=40:duration=60]
RESRSV_LIMIT = [mem=30,100]
.
End Queue 

This will allow a job to be scheduled on any host that the queue is configured to use and will reserve 100 MB of swap and 40 MB of memory for a duration of 60 minutes. The requested memory reservation of 40 MB falls inside the allowed limits set by RESRSV_LIMIT of 30 MB to 100 MB.

Begin Queue
.
RES_REQ = select[type==any] rusage[mem=20||mem=10:swp=20]
.
End Queue 

This will allow a job to be scheduled on any host that the queue is configured to use. The job will attempt to reserve 20 MB of memory, or 10 MB of memory and 20 MB of swap if the 20 MB of memory is unavailable. In this case no limits are defined by RESRSV_LIMIT.

Job-level resource reservation

  1. To specify resource reservation at the job level, use bsub -R and include the resource usage section in the resource requirement string.
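
For example, the following submission (the values are illustrative) reserves 50 MB of swap and 20 MB of memory for the job:

bsub -R "rusage[swp=50:mem=20]" myjob 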

Configure per-resource reservation

  1. To enable greater flexibility in how numeric resources are reserved by jobs, configure the ReservationUsage section in lsb.resources to reserve resources such as license tokens per resource as PER_JOB, PER_SLOT, or PER_HOST:
  2. Begin ReservationUsage
    RESOURCE             METHOD
    licenseX             PER_JOB
    licenseY             PER_HOST
    licenseZ             PER_SLOT
    End ReservationUsage 
     

    Only user-defined numeric resources can be reserved. Built-in resources such as mem, cpu, and swp cannot be configured in the ReservationUsage section.

    The cluster-wide RESOURCE_RESERVE_PER_SLOT parameter in lsb.params is obsolete. Configuration in lsb.resources overrides RESOURCE_RESERVE_PER_SLOT if it also exists for the same resource.

    The RESOURCE_RESERVE_PER_SLOT parameter still controls resources that are not configured in lsb.resources. Resources not reserved in lsb.resources are reserved per job.

    PER_HOST reservation means that, for a parallel job, LSF reserves one instance of the resource for each host the job uses. For example, some application licenses are charged only once no matter how many copies of the application are running, provided those copies run on the same host under the same user.
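
    With the sample configuration above, a parallel job could reserve the license resources as follows (the resource names follow the example above and the values are illustrative):

    bsub -n 4 -R "rusage[licenseX=1:licenseY=1:licenseZ=1]" myjob 

    Because licenseX is reserved PER_JOB, one token is reserved for the whole job; licenseY (PER_HOST) reserves one token on each host the job uses; licenseZ (PER_SLOT) reserves one token for each of the 4 slots.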

Assumptions and limitations

Memory Reservation for Pending Jobs

About memory reservation for pending jobs

By default, the rusage string reserves resources for running jobs. Because resources are not reserved for pending jobs, some memory-intensive jobs could be pending indefinitely because smaller jobs take the resources immediately before the larger jobs can start running. The more memory a job requires, the worse the problem is.

Memory reservation for pending jobs solves this problem by reserving memory as it becomes available, until the total required memory specified on the rusage string is accumulated and the job can start. Use memory reservation for pending jobs if memory-intensive jobs often compete for memory with smaller jobs in your cluster.

Configure memory reservation for pending jobs

RESOURCE_RESERVE parameter
  1. Use the RESOURCE_RESERVE parameter in lsb.queues to reserve host memory for pending jobs.
  2. The amount of memory reserved is based on the currently available memory when the job is pending. Reserved memory expires at the end of the time period represented by the number of dispatch cycles specified by the value of MAX_RESERVE_TIME set on the RESOURCE_RESERVE parameter.

Configure lsb.modules
  1. To enable memory reservation for sequential jobs, add the LSF scheduler plugin module name for resource reservation (schmod_reserve) to the lsb.modules file:
  2. Begin PluginModule
    SCH_PLUGIN                 RB_PLUGIN              SCH_DISABLE_PHASES 
    schmod_default                ()                          () 
    schmod_reserve                ()                          () 
    schmod_preemption             ()                          () 
    End PluginModule 
    
Configure lsb.queues
  1. Set the RESOURCE_RESERVE parameter in a queue defined in lsb.queues.
  2. If both RESOURCE_RESERVE and SLOT_RESERVE are defined in the same queue, job slot reservation and memory reservation are both enabled and an error is displayed when the cluster is reconfigured. SLOT_RESERVE is ignored.

Example queues

The following queue enables memory reservation for pending jobs:

Begin Queue
QUEUE_NAME = reservation
DESCRIPTION = For resource reservation
PRIORITY=40
RESOURCE_RESERVE = MAX_RESERVE_TIME[20]
End Queue 

Use memory reservation for pending jobs

  1. Use the rusage string in the -R option to bsub or the RES_REQ parameter in lsb.queues to specify the amount of memory required for the job. Submit the job to a queue with RESOURCE_RESERVE configured.
  2. See Examples for examples of jobs that use memory reservation.

    note:  
    Compound resource requirements do not support use of the || operator within the component rusage simple resource requirements, multiple -R options, or the cu section.

How memory reservation for pending jobs works

Amount of memory reserved

The amount of memory reserved is based on the currently available memory when the job is pending. For example, if LIM reports that a host has 300 MB of memory available, the job submitted by the following command:

bsub -R "rusage[mem=400]" -q reservation my_job 

will be pending and reserve the 300 MB of available memory. As other jobs finish, the memory that becomes available is added to the reserved memory until 400 MB accumulates, and the job starts.

No memory is reserved if no job slots are available for the job because the job could not run anyway, so reserving memory would waste the resource.

Only memory is accumulated while the job is pending; other resources specified on the rusage string are only reserved when the job is running. Duration and decay have no effect on memory reservation while the job is pending.

How long memory is reserved (MAX_RESERVE_TIME)

Reserved memory expires at the end of the time period represented by the number of dispatch cycles specified by the value of MAX_RESERVE_TIME set on the RESOURCE_RESERVE parameter. If a job has not accumulated enough memory to start by the time MAX_RESERVE_TIME expires, it releases all its reserved memory so that other pending jobs can run. After the reservation time expires, the job cannot reserve slots or memory for one scheduling session, so other jobs have a chance to be dispatched. After one scheduling session, the job can reserve available resources again for another period specified by MAX_RESERVE_TIME.

Examples

lsb.queues

The following queues are defined in lsb.queues:

Begin Queue
QUEUE_NAME = reservation
DESCRIPTION = For resource reservation
PRIORITY=40
RESOURCE_RESERVE = MAX_RESERVE_TIME[20]
End Queue 

Assumptions

Assume one host in the cluster with 10 CPUs and 1 GB of free memory currently available.

Sequential jobs

Each of the following sequential jobs requires 400 MB of memory and runs for 300 minutes.

Job 1:

bsub -W 300 -R "rusage[mem=400]" -q reservation myjob1 

The job starts running, using 400 MB of memory and one job slot.

Job 2:

Submitting a second job with the same requirements yields the same result.

Job 3:

Submitting a third job with the same requirements reserves one job slot and reserves all free memory if the amount of free memory is between 20 MB and 200 MB (some free memory may be used by the operating system or other software).

Time-based Slot Reservation

Existing LSF slot reservation works in simple environments, where the host-based MXJ limit is the only constraint on a job slot request. It is less effective in complex environments, where more than one constraint exists (for example, job topology or a generic slot limit).

Current slot reservation by start time (RESERVE_BY_STARTTIME) resolves several reservation issues among multiple candidate host groups, but it cannot help in other cases.

Time-based slot reservation versus greedy slot reservation

With time-based reservation, a set of pending jobs gets a future allocation and an estimated start time so that the system can reserve a place for each job. Reservations use the estimated start time, which is based on future allocations.

Time-based resource reservation provides a more accurate predicted start time for pending jobs because LSF considers job scheduling constraints and requirements, including job topology and resource limits, for example.

restriction:  
Time-based reservation does not work with job chunking.
Start time and future allocation

The estimated start time for a future allocation is the earliest start time when all considered job constraints are satisfied in the future. There may be a small delay of a few minutes between the job finish time on which the estimate was based and the actual start time of the allocated job.

For compound resource requirement strings, the predicted start time is based on the simple resource requirement term (contained in the compound resource requirement) with the latest predicted start time.

If a job cannot be placed in a future allocation, the scheduler uses greedy slot reservation to reserve slots. Existing LSF slot reservation is a simple greedy algorithm.

Reservation decisions made by greedy slot reservation do not have an accurate estimated start time or information about future allocation. The calculated job start time used for backfill scheduling is uncertain, so bjobs displays:

Job will start no sooner than indicated time stamp 

Time-based reservation and greedy reservation compared

Start time prediction                                            Time-based reservation   Greedy reservation
Backfill scheduling if free slots are available                  Yes                      Yes
Correct with no job topology                                     Yes                      Yes
Correct for job topology requests                                Yes                      No
Correct based on resource allocation limits                      Yes*                     No
Correct for memory requests                                      Yes                      No
When no slots are free for reservation                           Yes                      No
Future allocation and reservation based on earliest start time   Yes                      No
bjobs displays best estimate                                     Yes                      No
bjobs displays predicted future allocation                       Yes                      No
Absolute predicted start time for all jobs                       No                       No
Advance reservation considered                                   No                       No

* Guaranteed if only two limits are defined.

Greedy reservation example

A cluster has four hosts: A, B, C and D, with 4 CPUs each. Four jobs are running in the cluster: Job1, Job2, Job3 and Job4. According to calculated job estimated start time, the job finish times (FT) have this order: FT(Job2) < FT(Job1) < FT(Job4) < FT(Job3).

Now, a user submits a high priority job. It pends because it requests -n 6 -R "span[ptile=2]". This resource requirement means this pending job needs three hosts with two CPUs on each host. The default greedy slot reservation calculates job start time as the job finish time of Job4 because after Job4 finishes, three hosts with a minimum of two slots are available.

Greedy reservation indicates that the pending job starts no sooner than when Job 2 finishes.

In contrast, time-based reservation can determine that the pending job starts in 2 hours. It is a much more accurate reservation.

Configuring time-based slot reservation

Greedy slot reservation is the default slot reservation mechanism and time-based slot reservation is disabled.

LSB_TIME_RESERVE_NUMJOBS (lsf.conf)
  1. Use LSB_TIME_RESERVE_NUMJOBS=maximum_reservation_jobs in lsf.conf to enable time-based slot reservation. The value must be a positive integer.
  2. LSB_TIME_RESERVE_NUMJOBS controls the maximum number of jobs that use time-based slot reservation. For example, if LSB_TIME_RESERVE_NUMJOBS=4, only the top 4 jobs get their future allocation information.

  3. Use LSB_TIME_RESERVE_NUMJOBS=1 to allow only the highest priority job to get accurate start time prediction.
  4. Smaller values are better than larger values because after the first pending job starts, the estimated start times of the remaining jobs may change. For example, you could configure LSB_TIME_RESERVE_NUMJOBS based on the number of exclusive host partitions or host groups, as shown in the sketch after this list.
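
For example, a minimal lsf.conf entry (the value 4 is illustrative) that lets the top four pending jobs receive a future allocation and an estimated start time:

LSB_TIME_RESERVE_NUMJOBS=4 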

Some scheduling examples

  1. Job5 requests -n 6 -R "span[ptile=2]", which will require three hosts with 2 CPUs on each host. As in the greedy slot reservation example, four jobs are running in the cluster: Job1, Job2, Job3 and Job4. Two CPUs are available now, 1 on host A, and 1 on host D:


  2. Job2 finishes, freeing 2 more CPUs for future allocation, 1 on host A, and 1 on host C:


  3. Job4 finishes, freeing 4 more CPUs for future allocation, 2 on host A, and 2 on host C:


  4. Job1 finishes, freeing 2 more CPUs for future allocation, 1 on host C, and 1 on host D:


  5. Job5 can now be placed with 2 CPUs on host A, 2 CPUs on host C, and 2 CPUs on host D. The estimated start time is shown as the finish time of Job1:


Assumptions and limitations

Slot limit enforcement

The following slot limits are enforced:

Memory request

To request memory resources, configure RESOURCE_RESERVE in lsb.queues.

When RESOURCE_RESERVE is used, LSF considers memory and slot requests during the time-based reservation calculation. LSF does not reserve slots or memory if any other resource requirement is not satisfied.

If SLOT_RESERVE is configured, time-based reservation does not make a slot reservation if any other type of resource requirement, including a memory request, is not satisfied. For example, if a job cannot run because it cannot obtain a required license, the job remains pending without any reservation.

Host partition and queue-level scheduling

If host partitions are configured, LSF first schedules jobs on the host partitions and then goes through each queue to schedule jobs. The same job may be scheduled several times, once for each host partition and a final time at the queue level. The available candidate hosts may be different each time.

Because of this difference, the same job may get different estimated start times, future allocation, and reservation in different host partitions and queue-level scheduling. With time-based reservation configured, LSF always keeps the same reservation and future allocation with the earliest estimated start time.

bjobs displays future allocation information
Predicted start time may be postponed for some jobs

If a pending job cannot be placed in a future resource allocation, the scheduler can skip it in the start time reservation calculation and fall back to use greedy slot reservation. There are two possible reasons:

Either way, the scheduler continues calculating predicted start time for the remaining jobs without considering the skipped job.

Later, once the resource request of the skipped job can be satisfied and the job can be placed in a future allocation, the scheduler reevaluates the predicted start times of the remaining jobs, which may postpone their start times.

To minimize the overhead in recalculating the predicted start times to include previously skipped jobs, you should configure a small value for LSB_TIME_RESERVE_NUMJOBS in lsf.conf.

Reservation scenarios

Scenario 1

Even though no running jobs finish and no host status in the cluster changes, a job's future allocation may still change from time to time.

Why this happens

In each scheduling cycle, the scheduler recalculates a job's reservation information, estimated start time, and opportunity for future allocation. The candidate host list for the job may be reordered according to the current load, and the reordered list is used for the entire scheduling cycle, including the future allocation calculation. A different order of candidate hosts may therefore lead to a different future allocation, but the estimated start time should remain the same.

For example, suppose there are two hosts in the cluster, hostA and hostB, with 4 CPUs each. Job 1 is running and occupies 2 CPUs on hostA and 2 CPUs on hostB. Job 2 requests 6 CPUs. If the order of hosts is hostA, hostB, the future allocation of Job 2 is 4 CPUs on hostA and 2 CPUs on hostB. If the order of hosts changes to hostB, hostA in the next scheduling cycle, the future allocation of Job 2 is 4 CPUs on hostB and 2 CPUs on hostA.

Scenario 2

If you set JOB_ACCEPT_INTERVAL to a non-zero value, then after a job is dispatched, the estimated start time and future allocation of pending jobs may fluctuate momentarily during the JOB_ACCEPT_INTERVAL period.

Why this happens

The scheduler does a time-based reservation calculation in each cycle. If JOB_ACCEPT_INTERVAL is set to a non-zero value, then once a new job has been dispatched to a host, that host does not accept new jobs for the JOB_ACCEPT_INTERVAL period. Because the host is not considered for the entire scheduling cycle, no time-based reservation calculation is done for it, which may cause a slight change in the estimated start time and future allocation of pending jobs. After JOB_ACCEPT_INTERVAL has passed, the host becomes available for the time-based reservation calculation again, and the pending job's estimated start time and future allocation become accurate again.
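
As a sketch, JOB_ACCEPT_INTERVAL can be set in lsb.params (or per queue in lsb.queues); the value shown here is illustrative:

Begin Parameters
JOB_ACCEPT_INTERVAL = 1
End Parameters 

With this setting, a host that has just been dispatched a job is not considered again for one dispatch turn, which is when the momentary fluctuation described above can occur.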

Examples

Example 1

Three hosts, 4 CPUs each: qat24, qat25, and qat26. Job 11895 uses 4 slots on qat24 (10 hours). Job 11896 uses 4 slots on qat25 (12 hours), and job 11897 uses 2 slots on qat26 (9 hours).

Job 11898 is submitted and requests -n 6 -R "span[ptile=2]".

bjobs -l 11898
Job <11898>, User <user2>, Project <default>, Status <PEND>, Queue <challenge>,
                     Job Priority <50>, Command <sleep 100000000>
..
RUNLIMIT
 840.0 min of hostA
Fri Apr 22 15:18:56: Reserved <2> job slots on host(s) <2*qat26>;
Sat Apr 23 03:28:46: Estimated Job Start Time;
                     alloc=2*qat25 2*qat24 2*qat26.lsf.platform.com  
Example 2

Two RMS hosts, sierraA and sierraB, 8 CPUs per host. Job 3873 uses 4*sierra0 and will last for 10 hours. Job 3874 uses 4*sierra1 and will run for 12 hours. Job 3875 uses 2*sierra2 and 2*sierra3, and will run for 13 hours.

Job 3876 is submitted and requests -n 6 -ext "RMS[nodes=3]".

bjobs -l 3876
Job <3876>, User <user2>, Project <default>, Status <PEND>, Queue <rms>, Extsch
                     ed <RMS[nodes=3]>, Command <sleep 1000000>
Fri Apr 22 15:35:28: Submitted from host <sierraa>, CWD <$HOME>, 6 Processors R
                     equested;
RUNLIMIT 
 840.0 min of sierraa
Fri Apr 22 15:35:46: Reserved <4> job slots on host(s) <4*sierrab>;
Sat Apr 23 01:34:12: Estimated job start time;
                                     rms_alloc=2*sierra[0,2-3]
... 
Example 3

Rerun example 1, but this time, use greedy slot reservation instead of time-based reservation:

bjobs -l 12103
Job <12103>, User <user2>, Project <default>, Status <PEND>, Queue <challenge>,
                     Job Priority <50>, Command <sleep 1000000>
Fri Apr 22 16:17:59: Submitted from host <qat26>, CWD <$HOME>, 6 Processors Req
                     uested, Requested Resources <span[ptile=2]>;

RUNLIMIT
 720.0 min of qat26
Fri Apr 22 16:18:09: Reserved <2> job slots on host(s) <2*qat26.lsf.platform.co
                     m>;
Sat Apr 23 01:39:13: Job will start no sooner than indicated time stamp; 

Viewing Resource Reservation Information

View host-level resource information (bhosts)

  1. Use bhosts -l to show the amount of resources reserved on each host. In the following example, 143 MB of memory is reserved on hostA, and no memory is currently available on the host.
  2. bhosts -l hostA
    HOST  hostA
    STATUS     CPUF   JL/U   MAX   NJOBS    RUN   SSUSP   USUSP   RSV  DISPATCH_WINDOW
    ok       20.00       -     4       2      1       0       0     1       -
    
     CURRENT LOAD USED FOR SCHEDULING:
                r15s   r1m  r15m    ut     pg    io    ls    it    tmp     swp    mem
     Total       1.5   1.2   2.0   91%    2.5     7    49     0   911M    915M     0M
     Reserved    0.0   0.0   0.0    0%    0.0     0     0     0     0M      0M   143M
    
  3. Use bhosts -s to view information about shared resources.

View queue-level resource information (bqueues)

  1. Use bqueues -l to see the resource usage configured at the queue level.
  2. bqueues -l reservation
    QUEUE: reservation
      -- For resource reservation
    
    PARAMETERS/STATISTICS
    PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV 
    40      0 Open:Active       -    -    -    -     4     0     0     0     0    4 
    
    SCHEDULING PARAMETERS
               r15s   r1m   r15m   ut    pg    io   ls    it    tmp    swp    mem 
    loadSched  -       -      -     -     -    -    -     -      -      -      -     
    loadStop   -       -      -     -     -    -    -     -      -      -      -    
    
                 cpuspeed    bandwidth 
    loadSched          -            - 
    loadStop           -            - 
    
    SCHEDULING POLICIES:  RESOURCE_RESERVE
    
    USERS:   all users
    HOSTS:   all
    
    Maximum resource reservation time: 600 seconds 
    

View reserved memory for pending jobs (bjobs)

If the job memory requirements cannot be satisfied, bjobs -l shows the pending reason. bjobs -l shows both reserved slots and reserved memory.

  1. For example, the following job reserves 60 MB of memory on hostA:
  2. bsub -m hostA -n 2 -q reservation -R"rusage[mem=60]" sleep 8888
    Job <3> is submitted to queue <reservation>. 
     

    bjobs -l shows the reserved memory:

    bjobs -lp
    Job <3>, User <user1>, Project <default>, Status <PEND>, Queue <reservation>,
                         Command <sleep 8888>
    Tue Jan 22 17:01:05: Submitted from host <user1>, CWD </home/user1/>, 2 Processors
                         Requested, Requested Resources <rusage[mem=60]>,
                         Specified Hosts <hostA>;
    Tue Jan 22 17:01:15: Reserved <1> job slot on host <hostA>;
    Tue Jan 22 17:01:15: Reserved <60> megabyte memory on host <60M*hostA>;
     PENDING REASONS:
     Not enough job slot(s): hostA;

     SCHEDULING PARAMETERS
                r15s   r1m   r15m   ut    pg    io   ls    it    tmp    swp    mem
     loadSched    -     -      -     -     -    -    -     -      -      -      -
     loadStop     -     -      -     -     -    -    -     -      -      -      -

                  cpuspeed    bandwidth
     loadSched          -            -
     loadStop           -            -

View per-resource reservation (bresources)

  1. Use bresources to display per-resource reservation configurations from lsb.resources:
  2. The following example displays all resource reservation configurations:

    bresources -s
    Begin ReservationUsage
    RESOURCE             METHOD
    licenseX             PER_JOB
    licenseY             PER_HOST
    licenseZ             PER_SLOT
    End ReservationUsage 
     

    The following example displays only licenseZ configuration:

    bresources -s licenseZ
    RESOURCE             METHOD
    licenseZ             PER_SLOT
