



Using Platform LSF HPC for Linux/QsNet


RMS Version 2.8.1 and 2.8.2



About Platform LSF HPC for Linux/QsNet


What Platform LSF HPC for Linux/QsNet does

Platform LSF HPC for Linux/QsNet combines the strengths of Platform LSF HPC, the Quadrics Resource Management System (RMS), and the Quadrics QsNet data network to provide a comprehensive Distributed Resource Management (DRM) solution on Linux.

LSF HPC acts primarily as the workload scheduler, providing policy and topology-based scheduling and fault tolerance. RMS acts as a parallel execution subsystem for CPU allocation and node selection.

Assumptions and limitations

Compatibility with earlier releases

In this version of Platform LSF HPC for Linux/QsNet:



Configuring Platform LSF HPC for Linux/QsNet


Automatic configuration at installation

lsb.hosts

For the default host, lsfinstall enables "!" in the MXJ column of the Host section of lsb.hosts. For example:

Begin Host
HOST_NAME MXJ   r1m     pg    ls    tmp  DISPATCH_WINDOW  # Keywords
#hostA     () 3.5/4.5   15/   12/15  0      ()            # Example
default    !    ()      ()    ()     ()     ()            
End Host

lsb.modules

During installation, lsfinstall adds the schmod_rms external scheduler plugin module name to the PluginModule section of lsb.modules to enable the RMS scheduler plugin module:

Begin PluginModule
SCH_PLUGIN              RB_PLUGIN           SCH_DISABLE_PHASES 
schmod_default               ()                      () 
schmod_fcfs                  ()                      () 
schmod_fairshare             ()                      () 
schmod_limit                 ()                      () 
schmod_preemption            ()                      () 
...
schmod_rms                   ()                      () 
End PluginModule


The schmod_rms plugin name must be configured after the standard LSF plugin names in the PluginModule list.

See the Platform LSF Configuration Reference for more information about lsb.modules.

lsb.queues

During installation, LSF HPC defines a queue named rms in LSB_CONFDIR/lsb.queues for RMS jobs running in LSF HPC.

Begin Queue
QUEUE_NAME   = rms
PJOB_LIMIT   = 1
PRIORITY     = 30
NICE         = 20
STACKLIMIT   = 5256
DEFAULT_EXTSCHED = RMS[RMS_SNODE]  # LSF uses this scheduling policy if
                                   # -extsched is not defined.
# MANDATORY_EXTSCHED = RMS[RMS_SNODE] # LSF enforces this scheduling policy
RES_REQ = select[rms==1]
DESCRIPTION  = Run RMS jobs only on hosts that have resource 'rms' defined
End Queue


To make the rms queue the default queue, set DEFAULT_QUEUE=rms in lsb.params.
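
For example, a minimal sketch of the relevant lsb.params fragment (any other parameters already in your file are unchanged):

Begin Parameters
DEFAULT_QUEUE = rms     # jobs submitted without -q go to the rms queue
End Parameters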

Use the bqueues -l command to view the queue configuration details. Before using LSF HPC, see the Platform LSF Configuration Reference to understand queue configuration parameters in lsb.queues.
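
For example, to view the rms queue configuration:

bqueues -l rms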

lsf.conf

During installation, lsfinstall sets the following parameters in lsf.conf:

lsf.shared

During installation, the Boolean resource rms is defined in lsf.shared:

Begin Resource
RESOURCENAME     TYPE      INTERVAL  INCREASING   DESCRIPTION
...
rms              Boolean   ()        ()           (RMS)
...
End Resource


You should add the rms resource name under the RESOURCES column of the Host section of lsf.cluster.cluster_name. Hosts without the rms resource specified are not considered for scheduling RMS jobs.

lsf.cluster.cluster_name

For each RMS host, hostsetup adds the rms Boolean resource to the Host section of lsf.cluster.cluster_name.
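
A sketch of what the Host section might look like after hostsetup runs, assuming two RMS hosts named hostA and hostB (column values other than RESOURCES are illustrative):

Begin Host
HOSTNAME   model   type   server  r1m   mem   swp   RESOURCES
hostA      !       !      1       -     -     -     (rms)
hostB      !       !      1       -     -     -     (rms)
End Host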

Setting dedicated LSF partitions (recommended)

You should use the RMS rcontrol command to prevent prun jobs from running directly on partitions dedicated to LSF.

Example

# rcontrol set partition=parallel configuration=day type=batch

See the RMS rcontrol command in the RMS Reference Manual for complete syntax and usage.

Customizing job control actions (optional)

By default, LSF carries out job control actions by sending the appropriate signal to suspend, terminate, or resume a job. If your jobs need special job control actions, configure the JOB_CONTROLS parameter of the rms queue to run the RMS rcontrol command instead of the default job controls.

JOB_CONTROLS parameter in lsb.queues

Use the JOB_CONTROLS parameter in lsb.queues to configure suspend, terminate, or resume job controls for the queue:

JOB_CONTROLS = SUSPEND[command] | 
               RESUME[command] | 
               TERMINATE[command]

where command is an rcontrol command of the form:

rcontrol [suspend | kill | resume] batchid=cluster_name@$LSB_JOBID

Example TERMINATE job control action

Begin Queue
QUEUE_NAME=rms
...
JOB_CONTROLS = TERMINATE[rcontrol kill batchid=cluster1@$LSB_JOBID]
...
End Queue
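
The same approach applies to SUSPEND and RESUME. A sketch with all three job controls configured, assuming a cluster named cluster1:

Begin Queue
QUEUE_NAME=rms
...
JOB_CONTROLS = SUSPEND[rcontrol suspend batchid=cluster1@$LSB_JOBID] RESUME[rcontrol resume batchid=cluster1@$LSB_JOBID] TERMINATE[rcontrol kill batchid=cluster1@$LSB_JOBID]
...
End Queue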


Configuration notes

Resource to determine RMS-enabled hosts

The hostsetup script configures lsf.cluster.cluster_name to assign the Boolean resource rms to all LSF hosts that run on an RMS partition. The rms resource is defined in the default lsf.shared template file at installation.

Maximum job slot limit (MXJ in lsb.hosts)

By default, the maximum job slot limit is set to the number of CPUs that LIM reports. This is specified by MXJ=! in the Host section of LSB_CONFDIR/lsb.hosts:

Begin Host
HOST_NAME     MXJ     r1m       pg    ls     tmp  DISPATCH_WINDOW  # Keywords
...
default       !       ()        ()    ()     ()   ()               # Example
...
End Host


Do not change this default.

Per-processor job slot limit (PJOB_LIMIT in lsb.queues)

By default, the per-processor job slot limit is 1 (PJOB_LIMIT=1 in the rms queue in lsb.queues).


Do not change this default.

Maximum number of sbatchd connections (lsb.params)

If LSF operates on a large system (for example, a system with more than 32 hosts), you may need to configure the MAX_SBD_CONNS parameter in lsb.params. MAX_SBD_CONNS controls the maximum number of files mbatchd can have open and connected to sbatchd.

In a very busy cluster, with many jobs being dispatched, running, and finishing at the same time, mbatchd may take a very long time to update job status changes and to dispatch new jobs. If your cluster shows this behavior, set MAX_SBD_CONNS = (number_of_hosts) * 2 or 300, whichever is less. Setting MAX_SBD_CONNS too high can slow down the rate at which mbatchd dispatches new jobs.
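
For example, in a hypothetical 100-host cluster, 100 * 2 = 200, which is less than 300, so you would set the following in lsb.params:

Begin Parameters
MAX_SBD_CONNS = 200     # 100 hosts * 2 = 200, which is less than 300
End Parameters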


Configuring default and mandatory topology scheduling options

Use the DEFAULT_EXTSCHED and MANDATORY_EXTSCHED queue parameters in lsb.queues to configure default and mandatory topology scheduling options.

DEFAULT_EXTSCHED=RMS[[allocation_type][;topology][;flags]]

Specifies default topology scheduling options for the queue.

-extsched options on the bsub command are merged with DEFAULT_EXTSCHED options, and -extsched options override any conflicting queue-level options set by DEFAULT_EXTSCHED.

For example, if DEFAULT_EXTSCHED=RMS[RMS_SNODE;rails=2] and a job is submitted with -extsched "RMS[base=hostA;rails=1;ptile=2]", LSF uses the following resulting options for scheduling:

RMS_SNODE;rails=1;base=hostA;ptile=2

DEFAULT_EXTSCHED can be used in combination with MANDATORY_EXTSCHED in the same queue. For example:

LSF uses the resulting options for scheduling:

RMS_SNODE;rails=2;base=hostA;ptile=4

If topology options (nodes, ptile, or base) or rail flags (rails or railmask) are set in DEFAULT_EXTSCHED, and you do not want to specify values for these options, use the keyword with no value in the -extsched option of bsub. For example, if DEFAULT_EXTSCHED=RMS[nodes=2], and you do not want to specify any node option at all, use -extsched "RMS[RMS_SNODE;nodes=]".

See the bsub command for more information.

Obsolete -extsched syntax


Syntax in the form DEFAULT_EXTSCHED=extsched_options is obsolete (for example, DEFAULT_EXTSCHED=RMS_SNODE). You should use the syntax of the form DEFAULT_EXTSCHED=RMS[extsched_options] (for example, DEFAULT_EXTSCHED=RMS[RMS_SNODE]).

The queue named rms defined during installation specifies DEFAULT_EXTSCHED=RMS[RMS_SNODE]. LSF uses the RMS_SNODE scheduling policy if no allocation type is specified at job submission.

MANDATORY_EXTSCHED=RMS[[allocation_type][;topology][;flags]]

Specifies mandatory topology scheduling options for the queue.

-extsched options on the bsub command are merged with MANDATORY_EXTSCHED options, and MANDATORY_EXTSCHED options override any conflicting job-level options set by -extsched.

For example, if MANDATORY_EXTSCHED=RMS[RMS_SNODE;rails=2] and a job is submitted with -extsched "RMS[base=hostA;rails=1;ptile=2]", LSF uses the following resulting options for scheduling:

RMS_SNODE;rails=2;base=hostA;ptile=2

MANDATORY_EXTSCHED can be used in combination with DEFAULT_EXTSCHED in the same queue. For example:

LSF uses the following resulting options for scheduling:

RMS_SNODE;rails=2;base=hostA;ptile=4

See the bsub command for more information.

If you want to prevent users from setting the topology options (nodes, ptile, or base) or flags (rails or railmask) in the -extsched option of bsub, use the keyword with no value in MANDATORY_EXTSCHED. For example, if the job is submitted with -extsched "RMS[RMS_SNODE;nodes=4]", use MANDATORY_EXTSCHED=RMS[nodes=] to override the job-level nodes setting.

Obsolete -extsched syntax


Syntax in the form MANDATORY_EXTSCHED=extsched_options is obsolete (for example, MANDATORY_EXTSCHED=RMS_SNODE). You should use the syntax of the form MANDATORY_EXTSCHED=RMS[extsched_options] (for example, MANDATORY_EXTSCHED=RMS[RMS_SNODE]).



Operating Platform LSF HPC for Linux/QsNet


RMS hosts and RMS jobs

An RMS host has the rms Boolean resource in the RESOURCES column of the Host section in lsf.cluster.cluster_name.

An RMS job has appropriate external scheduler options at the command line (bsub -extsched) or queue level (DEFAULT_EXTSCHED or MANDATORY_EXTSCHED in rms queue in lsb.queues).

RMS jobs only run on RMS hosts, and non-RMS jobs only run on non-RMS hosts.

Platform LSF HPC RMS topology support plugin

The Platform LSF HPC RMS external scheduler plugin runs on each LSF HPC host within an RMS partition. The RMS plugin is started by mbschd and handles all communication between the LSF HPC scheduler and RMS. It translates LSF HPC concepts (hosts and job slots) into RMS concepts (nodes, number of CPUs, allocation options, topology).

LSF HPC topology adapter for RMS (RLA)

The Platform LSF HPC topology adapter for RMS (RLA) is located on each LSF HPC host within an RMS partition. RLA is started by sbatchd and is the interface for the LSF RMS plugin and the RMS system.

To schedule a job, the RMS external scheduler plugin calls RLA to:

LSF scheduling policies and RMS topology support

-extsched options                                            Normal jobs  Preemptive jobs  Backfill jobs  Job slot reservation
RMS_SLOAD or RMS_SNODE                                       Yes          Yes              Yes            Yes
RMS_SLOAD or RMS_SNODE with nodes/ptile/base specification   Yes          Yes              Yes            Yes
RMS_MCONT                                                    Yes          Yes              Yes            Yes
RMS_MCONT with nodes/ptile/base specification                Yes          Yes              Yes            Yes

Supported RMS prun allocation options

RMS option   Description                          LSF equivalent
-B           Base node index                      -extsched "RMS[base=base_node_name]"
-C           Number of CPUs per node              -extsched "RMS[ptile=cpus_per_node]"
             (-C is obsolete in RMS prun)
-I           Allocate immediately or fail         LSF overrides; uses immediate allocation
-N           Number of nodes to use               -extsched "RMS[nodes=nodes]"
-n           Number of CPUs to use                -n
-R           immediate (same as -I)               LSF overrides
             rails                                Passed with -extsched
             railmask                             Passed with -extsched
-i/-o/-e     Input/output/error redirection       bsub -i/-o/-e
-m           Block or cyclic job distribution     Must be passed directly by the user on the
                                                  prun command line (not through LSF)

LSF host preference and RMS allocation options

LSF host preference (for example, bsub -m hostA) is only supported for RMS_SLOAD allocation. Host preference is not taken into account for RMS_SNODE and RMS_MCONT allocation. All hosts in the preference list must be within the same host group or RMS partition.

RMS_SNODE

LSF sorts nodes according to RMS topology (numbering of nodes and domains), which takes precedence over LSF sorting order. LSF host preferences (for example, bsub -m hostA) are not taken into account.

RMS_SLOAD

LSF sorts nodes based on host preference and load information, which take precedence over RMS topology (numbering of nodes and domains).

The allocation starts from the first host specified in the list of LSF hosts and continues until the allocation specification is satisfied.


Use RMS_SLOAD on smaller clusters, where the job placement decision should be influenced by host load, or where you want to keep a specific host preference.
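
For example, the following submission keeps a host preference under RMS_SLOAD allocation (hostA, hostB, and parallel_app are placeholders):

bsub -n 8 -m "hostA hostB" -q rms -ext "RMS[RMS_SLOAD]" prun parallel_app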

RMS_MCONT

LSF sorts nodes based on RMS topological order (numbering of nodes and domains); LSF host preferences are not taken into account.

RMS rail allocation options

Rails are the layers of a Quadrics switch network. In multirail systems, you can use the rails and railmask rail allocation options. Specify either rails or railmask, but not both.

Using rail options in LSF

To use the RMS rail options with LSF jobs, use the -extsched option of bsub:

[rails=number | railmask=bitmask]

For example, the following job uses two rails:

bsub -n 4 -q rms -extsched "RMS[RMS_MCONT; rails=2]"

At job submission, LSF checks the validity of rail options against the LSB_RMS_MAXNUMRAILS parameter, if it is set in lsf.conf. LSB_RMS_MAXNUMRAILS specifies the maximum value allowed for the rails option; the default is 32. If incorrect rail option values pass this check, the job pends forever.
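
For example, to reject jobs that request more than two rails, you could set the following in lsf.conf (the value is illustrative; adjust it for your system):

LSB_RMS_MAXNUMRAILS=2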

For more information about the rails and railmask options, see the RMS prun command in the RMS Reference Manual.



Submitting and Monitoring Jobs


bsub command

To submit a job, use the bsub command.

Syntax

bsub -ext[sched] "RMS[[allocation_type][;topology][;flags]]" job_name

Specify topology scheduling policy options for RMS jobs either in the -extsched option, or with DEFAULT_EXTSCHED or MANDATORY_EXTSCHED in the rms queue definition in lsb.queues.


You can abbreviate the -extsched option to -ext.

The options set by -extsched can be combined with the queue-level MANDATORY_EXTSCHED or DEFAULT_EXTSCHED parameters. If -extsched and MANDATORY_EXTSCHED set the same option, the MANDATORY_EXTSCHED setting is used. If -extsched and DEFAULT_EXTSCHED set the same options, the -extsched setting is used.

Obsolete -extsched syntax


The -extsched syntax in the form -extsched "extsched_options" is obsolete (for example, -extsched "RMS_SNODE"). You should use the syntax of the form -extsched "RMS[extsched_options]" (for example, -extsched "RMS[RMS_SNODE]").

Shell builtin commands and metasymbols


Shell-specific command syntax is not supported when submitting jobs to run in RMS.

For example, bsub "cmd1|cmd2" is not valid because of the pipe (|).

RMS allocation limitation

If you use RMS_MCONT or RMS_SNODE allocation options, the ptile option in the span section of the resource requirement string (bsub -R "span[ptile=n]") is not supported.

You should use -extsched "RMS[ptile=n]" to define the locality of jobs instead of -R "span[ptile=n]".
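
For example, instead of submitting with -R "span[ptile=4]", a job that requests 8 CPUs with 4 CPUs per node could be submitted as follows (myjob is a placeholder):

bsub -n 8 -q rms -ext "RMS[RMS_SNODE;ptile=4]" myjob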

RMS allocation and topology scheduling options

The topology options nodes and ptile, and the rails flag are limited by the values of the corresponding parameters in lsf.conf:

If topology options (nodes, ptile, or base) or flags (rails or railmask) are set in DEFAULT_EXTSCHED, and you do not want to specify values for these options, use the keyword with no value in the -extsched option of bsub. For example, if DEFAULT_EXTSCHED=RMS[nodes=2], and you do not want to specify any node option at all, use -extsched "RMS[RMS_SNODE;nodes=]".

Running jobs on any host type

You can specify several types of topology scheduling options at job submission, and LSF schedules jobs appropriately. Jobs that do not specify RMS-related options can be dispatched to RMS hosts, and jobs with RMS-related options can be dispatched to non-RMS hosts.

Use the LSF resource requirements specification (-R option of bsub or RES_REQ in queue definition in lsb.queues) to identify the host types required for your job.

For example, HP pset hosts and Linux/QsNet hosts running RMS can exist in the same LSF cluster. Use the -R option to define resource requirements for either HP pset hosts or Linux/QsNet hosts. Your job will run on either host type, but not both.

For example:

bsub -n 8 -R "pset || rms" -ext "RMS[ptile=2];PSET[PTILE=2]" myjob

Runs myjob on an HP pset host or an RMS host if one is available, but not both. If it runs on an RMS host, the RMS[ptile=2] option is applied. If it runs on a pset host, the RMS option is ignored and the PSET[PTILE=2] option is applied.

Viewing nodes allocated to your job

Running jobs (bjobs -l)

Use bjobs -l to see RMS allocation information for a running job. For example, the following job allocates nodes on hosts hostA and hostB, with resource ID 3759 in the partition named parallel:

bjobs -l 723

Job <723>, User <user1>, Project <default>, Status <DONE>, Queue <rms>, 
                     Extsched <RMS[nodes=3;base=hostA]>, Command <hostname>
Wed Aug  6 17:09:44: Submitted from host <hostA>, CWD <$HOME>, 7 Processors R
                     equested;

 STACKLIMIT
   5256 K  
Wed Aug  6 17:09:58: Started on 7 Hosts/Processors <3*hostA> <4*hostB>, 
                     Execution Home </home/user1>, Execution CWD </home/user1>; 
Wed Aug  6 17:09:58: rms_rid=parallel.3759;rms_alloc=3*hostA 2*hostB[2-3];
Wed Aug  6 17:10:01: Done successfully. The CPU time used is 0.1 seconds.

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

 EXTERNAL MESSAGES:
 MSG_ID  FROM     POST_TIME   MESSAGE                   ATTACHMENT 
 0          -             -         -                           -  
 1      user1 Aug  6 17:09    RMS[nodes=3;base=hostA]     N  

Finished jobs (bhist -l)

Use bhist -l to see RMS allocation information for finished jobs. For example:

bhist -l 723

Job <723>, User <user1>, Project <default>, Extsched <RMS[nodes=3;base=hostA
                     ]>, Command <hostname>
Wed Aug  6 17:09:44: Submitted from host <hostA>, to Queue <rms>, CWD <$HOME>,
                     7 Processors Requested;
Wed Aug  6 17:09:58: Dispatched to 7 Hosts/Processors <3*hostA> <4*hostB>; 
Wed Aug  6 17:09:58: rms_rid=parallel.3759;rms_alloc=3*hostA 2*hostB[2-3];
Wed Aug  6 17:09:58: Starting (Pid 971318);
Wed Aug  6 17:10:00: Running with execution home </home/user1>, Execution CWD
                     </home/user1>, Execution Pid <971318>;
Wed Aug  6 17:10:01: Done successfully. The CPU time used is 0.1 seconds;
Wed Aug  6 17:10:01: Post job process done successfully;

Summary of time in seconds spent in various states by  Wed Aug  6 17:10:01
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  14       0        3        0        0        0        17 

Job accounting information (bacct -l)

Use bacct -l to see RMS allocation information logged to lsb.acct. For example:

bacct -l 3088

Accounting information about jobs that are: 
  - submitted by all users.
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
------------------------------------------------------------------------------

Job <3088>, User <user1>, Project <default>, Status <DONE>, Queue <rms>,
                     Command <prun hostname>
Wed Aug  6 17:09:44: Submitted from host <hostS>, CWD <$HOME>;
Wed Aug  6 17:09:58: Dispatched to 7 Hosts/Processors <3*hostA> <4*hostB>; 
Wed Aug  6 17:09:58: rms_rid=parallel.3759;rms_alloc=3*hostA 2*hostB[2-3];
Wed Aug  6 17:10:01: Completed <done>.

Accounting information about this job:
     CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP
      0.43        8             43     exit         0.0100  1024K      0K
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second ) 
 Total number of done jobs:       0      Total number of exited jobs:     1
 Total CPU time consumed:       0.4      Average CPU time consumed:     0.4
 Maximum CPU time of a job:     0.4      Minimum CPU time of a job:     0.4
 Total wait time in queues:     8.0
 Average wait time in queue:    8.0
 Maximum wait time in queue:    8.0      Minimum wait time in queue:    8.0
 Average turnaround time:        43 (seconds/job)
 Maximum turnaround time:        43      Minimum turnaround time:        43
 Average hog factor of a job:  0.01 ( cpu time / turnaround time )
 Maximum hog factor of a job:  0.01      Minimum hog factor of a job:  0.01

Example job submissions

Example 1: Submitting a job with a script

The following script defines a job requiring 128 CPUs:

#!/bin/sh
# myscript: requires 128 CPUs in total
./sequential_pre_processor      # serial pre-processing step
prun -n 64 parallel1 &          # run two 64-CPU parallel programs concurrently
prun -n 64 parallel2 &
wait                            # wait for both parallel programs to finish
./sequential_prog               # serial intermediate step
prun -n 128 parallel3           # run a 128-CPU parallel program
./write_results                 # serial post-processing step

Submit the job with the following command:

$ bsub -n 128 ./myscript

Example 2: Submitting a job without a script

$ bsub -n 128 prun parallel_app

Example 3: Submitting a job with prun specified in the queue

The following job assumes that prun is specified in the JOB_STARTER parameter of the default queue definition in lsb.queues.

$ bsub -n 128 parallel_app

prun with no arguments uses all CPUs allocated to it.
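
A sketch of the corresponding queue fragment, assuming you add JOB_STARTER to the rms queue and make it your default queue (adjust for whichever queue is actually your default):

Begin Queue
QUEUE_NAME  = rms
JOB_STARTER = prun
...
End Queue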

Example 4: Using ptile and nodes topology options

To force a multithreaded application to run within a single node, use -extsched "RMS[nodes=1]". For example:

$ bsub -n 3 -extsched "RMS[RMS_SLOAD;nodes=1]" prun mt_app

To ensure that a job only uses nodes where no other jobs are running, use RMS[ptile=cpus_per_node]. For example, on an ES40 machine with 4 CPUs per node:

$ bsub -n 20 -extsched "RMS[RMS_SLOAD;ptile=4]" prun parallel_app





