Working with Queues
Contents
- Queue States
- Viewing Queue Information
- Control Queues
- Add and Remove Queues
- Manage Queues
- Handling Job Exceptions in Queues
Queue States
Queue states, displayed by bqueues, describe the ability of a queue to accept and start batch jobs using a combination of the following states:
- Open: queues accept new jobs
- Closed: queues do not accept new jobs
- Active: queues start jobs on available hosts
- Inactive: queues hold all jobs
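As a minimal sketch of how the two state pairs combine, the following Python helper splits a STATUS value such as Open:Active into an "accepts jobs" flag and a "starts jobs" flag. The names QueueState and parse_queue_status are hypothetical and not part of LSF; treating any dispatch label that begins with "Inact" as inactive is an assumption intended to cover the longer inactive labels that bqueues -l prints.

```python
from typing import NamedTuple


class QueueState(NamedTuple):
    accepts_jobs: bool  # Open (True) or Closed (False)
    starts_jobs: bool   # Active (True) or Inactive (False)


def parse_queue_status(status: str) -> QueueState:
    """Split a STATUS value such as 'Open:Active' into its two halves.

    Hypothetical helper for illustration; it is not an LSF API.
    """
    admission, _, dispatch = status.partition(":")
    if admission not in ("Open", "Closed"):
        raise ValueError(f"unexpected admission state: {admission!r}")
    if dispatch == "Active":
        starts = True
    elif dispatch.startswith("Inact"):
        # Assumption: every inactive label starts with 'Inact'.
        starts = False
    else:
        raise ValueError(f"unexpected dispatch state: {dispatch!r}")
    return QueueState(accepts_jobs=(admission == "Open"), starts_jobs=starts)


if __name__ == "__main__":
    for value in ("Open:Active", "Open:Inactive", "Closed:Active"):
        print(value, parse_queue_status(value))
```

A queue reported as Open:Inactive therefore still accepts submissions but holds them until it becomes Active again.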
Queue state can be changed by an LSF administrator or root.
Queues can also be activated and inactivated by run windows and dispatch windows (configured in lsb.queues, displayed by bqueues -l).
bqueues -l displays Inact_Adm when explicitly inactivated by an Administrator (badmin qinact), and Inact_Win when inactivated by a run or dispatch window.
Viewing Queue Information
The bqueues command displays information about queues. The bqueues -l option also gives current statistics about the jobs in a particular queue, such as the total number of jobs in the queue, the number of jobs running, suspended, and so on.
To view the...                      Run...
Available queues                    bqueues
Queue status                        bqueues
Detailed queue information          bqueues -l
State change history of a queue     badmin qhist
Queue administrators                bqueues -l for queue
In addition to the procedures listed here, see the bqueues(1) man page for more details.
View available queues and queue status
- Run bqueues. You can view the current status of a particular queue or all queues. The bqueues command also displays available queues in the cluster.
bqueues
QUEUE_NAME     PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
interactive     400 Open:Active       -    -    -    -     2     0     2     0
priority         43 Open:Active       -    -    -    -    16     4    11     1
night            40 Open:Inactive     -    -    -    -     4     4     0     0
short            35 Open:Active       -    -    -    -     6     1     5     0
license          33 Open:Active       -    -    -    -     0     0     0     0
normal           30 Open:Active       -    -    -    -     0     0     0     0
idle             20 Open:Active       -    -    -    -     6     3     1     2
A dash (-) in any entry means that the column does not apply to the row. In this example no queues have per-queue, per-user, per-processor, or per-host job limits configured, so the MAX, JL/U, JL/P, and JL/H entries are shown as a dash.
Job slots required by parallel jobs
important:
A parallel job with N components requires N job slots.
View detailed queue information
- To see the complete status and configuration for each queue, run bqueues -l.
Specify queue names to select specific queues. The following example displays details for the queue normal.
bqueues -l normal
QUEUE: normal
  -- For normal low priority jobs, running only if hosts are lightly loaded. This is the default queue.
PARAMETERS/STATISTICS
PRIO NICE STATUS       MAX JL/U JL/P NJOBS PEND  RUN SSUSP USUSP
  40   20 Open:Active  100   50   11     1    1    0     0     0
Migration threshold is 30 min.
CPULIMIT              RUNLIMIT
20 min of IBM350      342800 min of IBM350
FILELIMIT DATALIMIT STACKLIMIT CORELIMIT MEMLIMIT PROCLIMIT
  20000 K   20000 K     2048 K   20000 K   5000 K         3
SCHEDULING PARAMETERS
           r15s  r1m  r15m   ut   pg   io   ls   it  tmp  swp  mem
loadSched     -  0.7   1.0  0.2  4.0   50    -    -    -    -    -
loadStop      -  1.5   2.5    -  8.0  240    -    -    -    -    -
           cpuspeed  bandwidth
loadSched         -          -
loadStop          -          -
SCHEDULING POLICIES: FAIRSHARE PREEMPTIVE PREEMPTABLE EXCLUSIVE
USER_SHARES: [groupA, 70] [groupB, 15] [default, 1]
DEFAULT HOST SPECIFICATION : IBM350
RUN_WINDOWS: 2:40-23:00 23:30-1:30
DISPATCH_WINDOWS: 1:00-23:50
USERS: groupA/ groupB/ user5
HOSTS: hostA, hostD, hostB
ADMINISTRATORS: user7
PRE_EXEC: /tmp/apex_pre.x > /tmp/preexec.log 2>&1
POST_EXEC: /tmp/apex_post.x > /tmp/postexec.log 2>&1
REQUEUE_EXIT_VALUES: 45
View the state change history of a queue
- Run badmin qhist to display the times when queues are opened, closed, activated, and inactivated.
badmin qhist
Wed Mar 31 09:03:14: Queue <normal> closed by user or administrator <root>.
Wed Mar 31 09:03:29: Queue <normal> opened by user or administrator <root>.
View queue administrators
- Run bqueues -l for the queue.
View exception status for queues (bqueues)
- Use bqueues to display the configured threshold for job exceptions and the current number of jobs in the queue in each exception state.
For example, queue normal configures a JOB_IDLE threshold of 0.10, a JOB_OVERRUN threshold of 5 minutes, and a JOB_UNDERRUN threshold of 2 minutes. The following bqueues command shows no overrun jobs, one job that finished in less than 2 minutes (underrun), and one job that triggered an idle exception (less than idle factor of 0.10):
bqueues -l normal
QUEUE: normal
  -- For normal low priority jobs, running only if hosts are lightly loaded. This is the default queue.
PARAMETERS/STATISTICS
PRIO NICE STATUS       MAX JL/U JL/P JL/H NJOBS PEND  RUN SSUSP USUSP  RSV
  30   20 Open:Active    -    -    -    -     0    0    0     0     0    0
STACKLIMIT MEMLIMIT
    2048 K   5000 K
SCHEDULING PARAMETERS
           r15s  r1m  r15m   ut   pg   io   ls   it  tmp  swp  mem
loadSched     -    -     -    -    -    -    -    -    -    -    -
loadStop      -    -     -    -    -    -    -    -    -    -    -
           cpuspeed  bandwidth
loadSched         -          -
loadStop          -          -
JOB EXCEPTION PARAMETERS
           OVERRUN(min)  UNDERRUN(min)  IDLE(cputime/runtime)
Threshold             5              2                   0.10
Jobs                  0              1                      1
USERS: all users
HOSTS: all allremote
CHUNK_JOB_SIZE: 3
Control Queues
Queues are controlled by an LSF Administrator or root issuing a command or through configured dispatch and run windows.
Close a queue
- Run badmin qclose:
badmin qclose normal
Queue <normal> is closed
When a user tries to submit a job to a closed queue, the following message is displayed:
bsub -q normal ...
normal: Queue has been closed
Open a queue
- Run badmin qopen:
badmin qopen normal
Queue <normal> is opened
Inactivate a queue
- Run badmin qinact:
badmin qinact normal
Queue <normal> is inactivated
Activate a queue
- Run badmin qact:
badmin qact normal
Queue <normal> is activated
Log a comment when controlling a queue
- Use the -C option of badmin queue commands qclose, qopen, qact, and qinact to log an administrator comment in lsb.events.
badmin qclose -C "change configuration" normal
The comment text change configuration is recorded in lsb.events.
A new event record is recorded for each queue event. For example:
badmin qclose -C "add user" normal
followed by
badmin qclose -C "add user user1" normal
will generate records in lsb.events:
"QUEUE_CTRL" "7.0" 1050082373 1 "normal" 32185 "lsfadmin" "add user"
"QUEUE_CTRL" "7.0" 1050082380 1 "normal" 32185 "lsfadmin" "add user user1"
- Use badmin hist or badmin qhist to display administrator comments for closing and opening queues.
badmin qhist
Fri Apr 4 10:50:36: Queue <normal> closed by administrator <lsfadmin> change configuration.
bqueues -l also displays the comment text:
bqueues -l normal
QUEUE: normal
  -- For normal low priority jobs, running only if hosts are lightly loaded. This is the default queue.
PARAMETERS/STATISTICS
PRIO NICE STATUS         MAX JL/U JL/P JL/H NJOBS PEND  RUN SSUSP USUSP  RSV
  30   20 Closed:Active    -    -    -    -     0    0    0     0     0    0
Interval for a host to accept two jobs is 0 seconds
THREADLIMIT
          7
SCHEDULING PARAMETERS
           r15s  r1m  r15m   ut   pg   io   ls   it  tmp  swp  mem
loadSched     -    -     -    -    -    -    -    -    -    -    -
loadStop      -    -     -    -    -    -    -    -    -    -    -
           cpuspeed  bandwidth
loadSched         -          -
loadStop          -          -
JOB EXCEPTION PARAMETERS
           OVERRUN(min)  UNDERRUN(min)  IDLE(cputime/runtime)
Threshold             -              2                      -
Jobs                  -              0                      -
USERS: all users
HOSTS: all
RES_REQ: select[type==any]
ADMIN ACTION COMMENT: "change configuration"
Configure Dispatch Windows
A dispatch window specifies one or more time periods during which batch jobs are dispatched to run on hosts. Jobs are not dispatched outside of configured windows. Dispatch windows do not affect job submission and running jobs (they are allowed to run until completion). By default, queues are always Active; you must explicitly configure dispatch windows in the queue to specify a time when the queue is Inactive.
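The effect of a dispatch window can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: it handles just the time-of-day form H:MM-H:MM, it treats both ends of a window as inclusive, and the function names in_window and queue_is_active are hypothetical rather than anything LSF provides.

```python
from datetime import time


def _parse_clock(text: str) -> time:
    """Turn 'H:MM' into a datetime.time value."""
    hour, minute = text.split(":")
    return time(int(hour), int(minute))


def in_window(now: time, window: str) -> bool:
    """Return True if `now` falls inside one 'H:MM-H:MM' window."""
    start_text, end_text = window.split("-")
    start, end = _parse_clock(start_text), _parse_clock(end_text)
    if start <= end:
        return start <= now <= end
    # The window wraps past midnight, for example 23:30-1:30.
    return now >= start or now <= end


def queue_is_active(now: time, windows: str) -> bool:
    """No configured windows means always Active; otherwise any open window is enough."""
    specs = windows.split()
    return not specs or any(in_window(now, spec) for spec in specs)


if __name__ == "__main__":
    print(queue_is_active(time(5, 0), "4:30-12:00"))   # inside the window
    print(queue_is_active(time(13, 0), "4:30-12:00"))  # outside the window
```

Under this model a job submitted at 13:00 to a queue with the window 4:30-12:00 is accepted but stays pending until the window opens again.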
To configure a dispatch window:
- Edit lsb.queues.
- Create a DISPATCH_WINDOW keyword for the queue and specify one or more time windows.
Begin Queue
QUEUE_NAME      = queue1
PRIORITY        = 45
DISPATCH_WINDOW = 4:30-12:00
End Queue
- Reconfigure the cluster:
  - Run lsadmin reconfig.
  - Run badmin reconfig.
- Run bqueues -l to display the dispatch windows.
Configure Run Windows
A run window specifies one or more time periods during which jobs dispatched from a queue are allowed to run. When a run window closes, running jobs are suspended, and pending jobs remain pending. The suspended jobs are resumed when the window opens again. By default, queues are always Active and jobs can run until completion. You must explicitly configure run windows in the queue to specify a time when the queue is Inactive.
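The job transitions described above can be modeled as a small state mapping. The sketch below is a toy model, not LSF behavior verbatim: it assumes that jobs suspended by a closing run window are reported as SSUSP, that every SSUSP job is resumed when the window reopens, and that user-suspended jobs (USUSP) are left alone. The function name apply_run_window is hypothetical.

```python
from typing import Dict


def apply_run_window(job_states: Dict[str, str], window_open: bool) -> Dict[str, str]:
    """Return the job states after a run window opens or closes.

    Toy model: RUN -> SSUSP when the window closes, SSUSP -> RUN when it
    opens. PEND and USUSP jobs are never changed by the window.
    """
    updated = {}
    for job_id, state in job_states.items():
        if not window_open and state == "RUN":
            updated[job_id] = "SSUSP"   # running jobs are suspended
        elif window_open and state == "SSUSP":
            updated[job_id] = "RUN"     # suspended jobs are resumed
        else:
            updated[job_id] = state     # pending jobs remain pending
    return updated


if __name__ == "__main__":
    jobs = {"5308": "RUN", "5310": "PEND"}
    closed = apply_run_window(jobs, window_open=False)
    print(closed)
    print(apply_run_window(closed, window_open=True))
```

The contrast with a dispatch window is that a run window acts on jobs that have already started, while a dispatch window only controls whether new jobs are started.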
To configure a run window:
- Edit lsb.queues.
- Create a RUN_WINDOW keyword for the queue and specify one or more time windows.
Begin Queue
QUEUE_NAME = queue1
PRIORITY   = 45
RUN_WINDOW = 4:30-12:00
End Queue
- Reconfigure the cluster:
  - Run lsadmin reconfig.
  - Run badmin reconfig.
- Run bqueues -l to display the run windows.
Add and Remove Queues
Add a queue
- Log in as the LSF administrator on any host in the cluster.
- Edit lsb.queues to add the new queue definition.
You can copy another queue definition from this file as a starting point; remember to change the QUEUE_NAME of the copied queue.
- Save the changes to lsb.queues.
- Run badmin reconfig to reconfigure mbatchd.
.Adding a queue does not affect pending or running jobs.
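Because a copied definition is easy to save with its old name, a quick check for repeated QUEUE_NAME values can be useful before reconfiguring. The following Python sketch scans lsb.queues-style text for Begin Queue ... End Queue sections. It is a simplified reader written for illustration: the helper names are hypothetical, and it assumes one KEY = value pair per line and that "#" starts a comment.

```python
from collections import Counter
from typing import List


def queue_names(lsb_queues_text: str) -> List[str]:
    """Collect QUEUE_NAME values from Begin Queue ... End Queue sections."""
    names = []
    in_queue = False
    for raw in lsb_queues_text.splitlines():
        line = raw.split("#", 1)[0]          # assumption: '#' starts a comment
        tokens = line.split()
        if tokens == ["Begin", "Queue"]:
            in_queue = True
        elif tokens == ["End", "Queue"]:
            in_queue = False
        elif in_queue and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "QUEUE_NAME":
                names.append(value.strip())
    return names


def duplicate_queue_names(lsb_queues_text: str) -> List[str]:
    """Return the queue names that are defined more than once."""
    counts = Counter(queue_names(lsb_queues_text))
    return sorted(name for name, count in counts.items() if count > 1)


if __name__ == "__main__":
    sample = """
Begin Queue
QUEUE_NAME = queue1
PRIORITY   = 45
End Queue

Begin Queue
QUEUE_NAME = queue1   # copied, name not changed yet
PRIORITY   = 30
End Queue
"""
    print(duplicate_queue_names(sample))
```

A check like this only catches one class of mistake and is not a replacement for LSF's own configuration validation.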
Remove a queue
important:
Before removing a queue, make sure there are no jobs in that queue.
If there are jobs in the queue, move pending and running jobs to another queue, then remove the queue. If you remove a queue that has jobs in it, the jobs are temporarily moved to a queue named lost_and_found. Jobs in the lost_and_found queue remain pending until the user or the LSF administrator uses the bswitch command to switch the jobs into an existing queue. Jobs in other queues are not affected.
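The lost_and_found behavior can be pictured with a toy model that tracks only which queue each job belongs to. The functions remove_queue and switch_all below are hypothetical stand-ins for removing a queue definition and for switching jobs with bswitch; they do not call LSF and ignore job state entirely.

```python
from typing import Dict

LOST_AND_FOUND = "lost_and_found"


def remove_queue(job_queues: Dict[int, str], removed: str) -> Dict[int, str]:
    """Jobs still in a removed queue end up in lost_and_found."""
    return {job: (LOST_AND_FOUND if queue == removed else queue)
            for job, queue in job_queues.items()}


def switch_all(job_queues: Dict[int, str], source: str, target: str) -> Dict[int, str]:
    """Move every job in `source` to `target`, leaving other jobs untouched."""
    return {job: (target if queue == source else queue)
            for job, queue in job_queues.items()}


if __name__ == "__main__":
    jobs = {5308: "night", 5310: "night"}
    stranded = remove_queue(jobs, "night")
    print(stranded)                                   # both jobs in lost_and_found
    print(switch_all(stranded, LOST_AND_FOUND, "idle"))
```

Switching the jobs before the queue is removed avoids the intermediate lost_and_found step altogether, which is why the procedure that follows closes the queue and moves its jobs first.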
- Log in as the LSF administrator on any host in the cluster.
- Close the queue to prevent any new jobs from being submitted.
badmin qclose night
Queue <night> is closed
- Move all pending and running jobs into another queue.
Below, the bswitch -q night argument chooses jobs from the night queue, and the job ID number 0 specifies that all jobs should be switched:
bjobs -u all -q night
JOBID USER  STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5308  user5 RUN  night hostA     hostD     job5     Nov 21 18:16
5310  user5 PEND night hostA     hostC     job10    Nov 21 18:17
bswitch -q night idle 0
Job <5308> is switched to queue <idle>
Job <5310> is switched to queue <idle>
- Edit lsb.queues and remove or comment out the definition for the queue being removed.
- Save the changes to lsb.queues.
- Run badmin reconfig to reconfigure mbatchd.
Manage Queues
Restrict host use by queues
You may want a host to be used only to run jobs submitted to specific queues. For example, if you just added a host for a specific department such as engineering, you may only want jobs submitted to the queues engineering1 and engineering2 to be able to run on the host.
- Log on as root or the LSF administrator on any host in the cluster.
- Edit lsb.queues, and add the host to the HOSTS parameter of specific queues.
Begin Queue
QUEUE_NAME = queue1
...
HOSTS=mynewhost hostA hostB
...
End Queue
- Save the changes to lsb.queues.
- Use badmin ckconfig to check the new queue definition. If any errors are reported, fix the problem and check the configuration again.
- Run badmin reconfig to reconfigure mbatchd.
- If you add a host to a queue, the new host will not be recognized by jobs that were submitted before you reconfigured. If you want the new host to be recognized, you must use the command badmin mbdrestart.
Add queue administrators
Queue administrators are optionally configured after installation. They have limited privileges; they can perform administrative operations (open, close, activate, inactivate) on the specified queue, or on jobs running in the specified queue. Queue administrators cannot modify configuration files, or operate on LSF daemons or on queues they are not configured to administer.
To switch a job from one queue to another, you must have administrator privileges for both queues.
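The two-queue rule can be written as a short predicate. This Python sketch covers only the queue administrator rule stated above; it does not model root, the LSF administrator, or any other privilege. The dictionaries, user names, and function names are hypothetical examples, with ADMINISTRATORS entries treated as either user names or group names.

```python
from typing import Dict, Set


def is_queue_admin(user: str, queue: str,
                   administrators: Dict[str, Set[str]],
                   groups: Dict[str, Set[str]]) -> bool:
    """True if `user` is listed for `queue` directly or through a listed group."""
    entries = administrators.get(queue, set())
    if user in entries:
        return True
    return any(user in groups.get(entry, set()) for entry in entries)


def can_switch_job(user: str, source: str, target: str,
                   administrators: Dict[str, Set[str]],
                   groups: Dict[str, Set[str]]) -> bool:
    """A queue administrator needs privileges on both the source and target queues."""
    return all(is_queue_admin(user, queue, administrators, groups)
               for queue in (source, target))


if __name__ == "__main__":
    admins = {"normal": {"user7", "GroupA"}, "night": {"user7"}}
    groups = {"GroupA": {"user1"}}
    print(can_switch_job("user7", "normal", "night", admins, groups))  # admin of both
    print(can_switch_job("user1", "normal", "night", admins, groups))  # admin of one only
```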
- In the lsb.queues file, between Begin Queue and End Queue for the appropriate queue, specify the ADMINISTRATORS parameter, followed by the list of administrators for that queue. Separate the administrator names with a space. You can specify user names and group names.
Begin Queue
ADMINISTRATORS = User1 GroupA
End Queue
Handling Job Exceptions in Queues
You can configure queues so that LSF detects exceptional conditions while jobs are running and takes appropriate action automatically. You can customize which exceptions are detected and the corresponding actions. By default, LSF does not detect any exceptions.
Job exceptions LSF can detect
If you configure job exception handling in your queues, LSF detects the following job exceptions:
- Job underrun - jobs end too soon (run time is less than expected). Underrun jobs are detected when a job exits abnormally.
- Job overrun - job runs too long (run time is longer than expected). By default, LSF checks for overrun jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for job overrun.
- Idle job - running job consumes less CPU time than expected (in terms of CPU time/runtime). By default, LSF checks for idle jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for idle jobs.
Configuring job exception handling (lsb.queues)
You can configure your queues to detect job exceptions. Use the following parameters:
JOB_IDLE
Specify a threshold for idle jobs. The value should be a number between 0.0 and 1.0 representing CPU time/runtime. If the job idle factor is less than the specified threshold, LSF invokes eadmin to trigger the action for a job idle exception.
JOB_OVERRUN
Specify a threshold for job overrun. If a job runs longer than the specified run time, LSF invokes eadmin to trigger the action for a job overrun exception.
JOB_UNDERRUN
Specify a threshold for job underrun. If a job exits before the specified number of minutes, LSF invokes eadmin to trigger the action for a job underrun exception.
Example
The following queue defines thresholds for all types of job exceptions:
Begin Queue
...
JOB_UNDERRUN = 2
JOB_OVERRUN = 5
JOB_IDLE = 0.10
...
End Queue
For this queue:
- A job underrun exception is triggered for jobs running less than 2 minutes
- A job overrun exception is triggered for jobs running longer than 5 minutes
- A job idle exception is triggered for jobs with an idle factor (CPU time/runtime) less than 0.10
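The three rules above can be expressed as a small classifier. The Python sketch below uses the thresholds from this example queue as defaults. It is an illustration rather than LSF's implementation: the function name job_exceptions is hypothetical, run time and CPU time are both given in minutes, finished jobs are only checked for underrun, and the exit status of the job is not modeled.

```python
from typing import List, Optional


def job_exceptions(run_minutes: float, cpu_minutes: float, finished: bool,
                   underrun: Optional[float] = 2,
                   overrun: Optional[float] = 5,
                   idle: Optional[float] = 0.10) -> List[str]:
    """Return the exceptions a job would trigger; None disables a threshold."""
    found = []
    if finished:
        # Underrun: the job ended before the configured number of minutes.
        if underrun is not None and run_minutes < underrun:
            found.append("underrun")
        return found
    # Overrun: a running job has exceeded the configured run time.
    if overrun is not None and run_minutes > overrun:
        found.append("overrun")
    # Idle: the idle factor (CPU time/runtime) is below the threshold.
    if idle is not None and run_minutes > 0 and cpu_minutes / run_minutes < idle:
        found.append("idle")
    return found


if __name__ == "__main__":
    print(job_exceptions(1.5, 1.0, finished=True))    # ended after 1.5 minutes
    print(job_exceptions(6.0, 5.5, finished=False))   # still running after 6 minutes
    print(job_exceptions(4.0, 0.2, finished=False))   # idle factor 0.05
```

Passing None for a threshold mirrors a queue that does not configure the corresponding parameter, so that exception is never reported.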
Configuring thresholds for job exception handling
By default, LSF checks for job exceptions every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for overrun, underrun, and idle jobs.
Tuning
tip:
Tune EADMIN_TRIGGER_DURATION carefully. Shorter values may raise false alarms; longer values may not trigger exceptions frequently enough.
Platform Computing Inc.
www.platform.com