Learn more about Platform products at http://www.platform.com




Using Platform LSF HPC with SLURM


SLURM Version 1.2.25



About Platform LSF HPC and SLURM

Simple Linux Utility for Resource Management (SLURM)

SLURM is a resource management system suitable for use on large and small Linux clusters. It was jointly developed by Lawrence Livermore National Laboratory (LLNL), HP, Bull, and Linux NetworX. As a resource manager, SLURM allocates exclusive or non-exclusive access to resources (compute nodes) for users to perform work, and provides a framework to start, execute, and monitor work (normally parallel jobs) on the set of allocated nodes.

A SLURM system consists of two daemons:

  • slurmctld, the central controller daemon, which monitors node and partition state and manages allocations
  • slurmd, which runs on each compute node and launches and monitors tasks

The SLURM configuration file (slurm.conf) must be available on each node of the system. Use the SLURM scontrol show config command to view the current SLURM configuration.

SLURM terminology

LSF job terminology

Platform LSF HPC and SLURM system architecture

What Platform LSF HPC does

LSF HPC acts primarily as the workload scheduler and node allocator on top of the SLURM system, providing policy and topology-based scheduling for user tasks. SLURM provides a job execution and monitoring layer for LSF. LSF uses SLURM interfaces to:

LSF daemons run on a single front-end node with resource manager role, which represents the whole SLURM cluster. From the point of view of users, a SLURM cluster is one LSF host with multiple CPUs. LIM communicates with the SLURM system to get all available resource metrics for each compute node and reports resource and load information to the master LIM.

Supported features

Platform LSF HPC provides the following capabilities:

Assumptions and limitations



Installing a New Platform LSF HPC Cluster with SLURM

Platform LSF HPC distribution

The Platform LSF HPC distribution consists of the following files:

Installing Platform LSF HPC (lsfinstall)

The installation program for Platform LSF HPC is lsfinstall.

What lsfinstall does

Begin Host
HOST_NAME     MXJ     r1m       pg    ls     tmp  DISPATCH_WINDOW  # Keywords
...
default       ()      ()        ()    ()     ()   ()               # Example
SLINUX64       !      ()        ()    ()     ()   ()
...
End Host


Do not change the default MXJ=! in lsb.hosts.

Preinstallation checks

  1. Log on as root to the node with resource manager role.
  2. Check for the existence of /var/lsf/lsfslurm.

    If the file does not exist, touch a file with that name:

    # touch /var/lsf/lsfslurm
    
  3. Make sure there is a shared file system available and mounted on all SLURM nodes, with a verified mount point. For example: /hptc_cluster/lsf/tmp.
  4. Make sure that users' home directories can be accessed from all SLURM nodes.
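The checks above can be sketched as a small shell function. The paths are parameters so the sketch stays testable; in production they would be /var/lsf/lsfslurm and a shared mount point such as /hptc_cluster/lsf/tmp.

```shell
# Minimal sketch of the preinstallation checks; paths are parameters
# here only so the sketch can be exercised outside a real cluster.
check_preinstall() {
    flag_file=$1    # normally /var/lsf/lsfslurm
    shared_dir=$2   # normally a mount point shared by all SLURM nodes

    # Step 2: create the flag file if it does not exist
    if [ ! -f "$flag_file" ]; then
        mkdir -p "$(dirname "$flag_file")" || return 1
        touch "$flag_file" || return 1
    fi

    # Step 3: verify the shared directory exists and is writable
    [ -d "$shared_dir" ] && [ -w "$shared_dir" ] || return 1

    echo "preinstall checks passed"
}
```

Checking that users' home directories are reachable (step 4) must be done on each node and is not shown here.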

Running lsfinstall

  1. Log on as root to the node with resource manager role.
  2. Change to the directory containing the distribution files.

    For example:

    # cd /tmp
    
  3. Use the zcat and tar commands to uncompress and extract lsf7Update5_lsfinstall.tar.Z:
    # zcat lsf7Update5_lsfinstall.tar.Z | tar xvf -
    
    


    Do not extract the Platform LSF HPC distribution files.

  4. Change to lsf7Update5_lsfinstall:
    # cd /tmp/lsf7Update5_lsfinstall
    
  5. Read lsf7Update5_lsfinstall/install.config and decide which installation variables you need to set.
  6. Edit lsf7Update5_lsfinstall/install.config to set the installation variables you need.

    Uncomment the options you want in the template file, and replace the example values with your own settings.


    The sample values in the install.config template file are examples only. They are not default installation values.

  7. Run lsfinstall as root:
    # ./lsfinstall -f install.config
    

See the Platform LSF Command Reference for more information about lsfinstall and the Platform LSF Configuration Reference for more information about the install.config file.

Required install.config variables

Variables that require an absolute path

Adding RLA port to the NIS or NIS+ database (optional)

By default, LSB_RLA_PORT is configured in LSF_ENVDIR/lsf.conf during installation. If you have configured other LSF HPC ports in NIS or NIS+, you should also configure the RLA port in the NIS database before installing LSF HPC. lsfinstall checks if this port is already defined in NIS and does not add it to lsf.conf if it is already defined.

See Administering Platform LSF for information about modifying the NIS or NIS+ database.



Configuring Platform LSF HPC with SLURM

Recommended SLURM configuration (slurm.conf)

General parameters

Partitions

LSF configuration notes

Resource to determine SLURM-enabled hosts

If not already configured, you must add the Boolean resource slurm in the RESOURCES column of the Host section in lsf.cluster.cluster_name for all nodes that run in an LSF partition.

For example:

Begin   Host
HOSTNAME  model    type  server  r1m  mem  swp  RESOURCES    #Keywords
hostA     !        !     1       3.5  ()   ()   (slurm)
End     Host

The slurm resource is defined in the default lsf.shared template file at installation.

Maximum job slot limit (MXJ in lsb.hosts)

By default, lsfinstall sets the maximum job slot limit to the number of CPUs that LIM reports. This is specified by MXJ=! for host type SLINUX64 in the Host section of LSB_CONFDIR/lsb.hosts:

Begin Host
HOST_NAME     MXJ     r1m       pg    ls     tmp  DISPATCH_WINDOW  # Keywords
...
default       ()      ()        ()    ()     ()   ()               # Example
SLINUX64       !      ()        ()    ()     ()   ()
...
End Host


Do not change the default MXJ=! in lsb.hosts.

schmod_slurm plugin

The SLURM scheduling plugin schmod_slurm must be configured as the last scheduler plugin module in lsb.modules.

Maximum number of sbatchd connections (lsb.params)

If LSF HPC operates on a large system (for example, a system with more than 32 nodes), you may need to configure the parameter MAX_SBD_CONNS in lsb.params. MAX_SBD_CONNS controls the maximum number of files mbatchd can have open and connected to sbatchd. The default value of MAX_SBD_CONNS is 32.

In a very busy cluster with many jobs being dispatched, running, and finishing at the same time, mbatchd may take a very long time to update the status of a job and to dispatch new jobs. If your cluster shows this behavior, set MAX_SBD_CONNS to (number of nodes) * 2 or 300, whichever is less. Setting MAX_SBD_CONNS too high may slow down mbatchd dispatching new jobs.
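The sizing rule above can be computed directly. The node count below is a hypothetical example:

```shell
# Recommended MAX_SBD_CONNS: (number of nodes) * 2, capped at 300.
nodes=200
rec=$((nodes * 2))
if [ "$rec" -gt 300 ]; then
    rec=300
fi
echo "MAX_SBD_CONNS = $rec"
```

For the 200-node example this prints MAX_SBD_CONNS = 300; a 100-node cluster would yield 200. Set the resulting value in lsb.params.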

RLA status file directory (lsf.conf)

Use LSB_RLA_WORKDIR=directory to specify the location of the RLA status file. The RLA status file keeps track of job information to allow RLA to recover its original state when it restarts. When RLA first starts, it creates the directory defined by LSB_RLA_WORKDIR if it does not exist, then creates subdirectories for each host.

You should avoid using /tmp or any other directory that is automatically cleaned up by the system. Unless your installation has restrictions on the LSB_SHAREDIR directory, you should use the default:

LSB_SHAREDIR/cluster_name/rla_workdir
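For example, with LSB_SHAREDIR expanding to /usr/share/lsf/work and a cluster named cluster1 (both hypothetical values), the default corresponds to the following lsf.conf fragment:

```
# lsf.conf fragment; the path shown is an example expansion of the default
LSB_RLA_WORKDIR=/usr/share/lsf/work/cluster1/rla_workdir
```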

Other LSF HPC configuration parameters (lsf.conf)

The following lsf.conf parameters control when LIM starts to report the number of usable CPUs. They are all optional.

Two parameters determine whether the system is considered stable:

For more information

Customizing job control actions (optional)

By default, LSF HPC carries out job control actions by sending the appropriate signal to suspend, terminate, or resume a job. If your jobs need special job control actions, change the default job control actions in your queue configuration.

JOB_CONTROLS parameter (lsb.queues)

Use the JOB_CONTROLS parameter in lsb.queues to configure suspend, terminate, or resume job controls for the queue:

JOB_CONTROLS = SUSPEND[command] | 
               RESUME[command] | 
               TERMINATE[command]

where command is:

See the Platform LSF Configuration Reference for more information about the JOB_CONTROLS parameter in lsb.queues.

Example job control actions

Begin Queue
QUEUE_NAME=slurm
...
JOB_CONTROLS = TERMINATE[/opt/scripts/act.sh; kill -s TERM -$LSB_JOBPGIDS; scancel $SLURM_JOBID]
               SUSPEND[/opt/scripts/act.sh; kill -s STOP -$LSB_JOBPGIDS; scancel -s STOP $SLURM_JOBID]
               RESUME[/opt/scripts/act.sh; kill -s CONT -$LSB_JOBPGIDS; scancel -s CONT $SLURM_JOBID]
...
End Queue


Some environments may require a TSTP signal instead of STOP.

Verifying that the configuration is correct

  1. Log on as root to the LSF HPC master host.
  2. Set your LSF HPC user environment. For example:
    • For csh or tcsh:
      % source /usr/share/lsf/conf/cshrc.lsf
      
    • For sh, ksh, or bash:
      $ . /usr/share/lsf/conf/profile.lsf
      
  3. Start LSF HPC:
    # lsadmin limstartup
    # lsadmin resstartup
    # badmin hstartup
    
    


    You must be root to start LSF.

  4. Test your cluster by running some basic LSF HPC commands (e.g., lsid and lshosts).
  5. Use the lsload -l and bhosts -l commands to display load information for the cluster.

Example lsload -l output

The status for all nodes should be ok. Hosts with the static resource slurm defined report only the ls load index. The output should look similar to this:

# lsload -l
HOST_NAME  status  r15s   r1m  r15m   ut   pg    io  ls  it   tmp   swp   mem
hostA      ok         -     -     -    -    -     -   1   -     -     -     -

See How LSF HPC reports resource metrics for more information about how load indices are displayed by lsload.

Example bhosts -l output

The status for all nodes should be ok. The output should look similar to this:

# bhosts -l
HOST  hostA
STATUS     CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV DISPATCH_WINDOW
ok        16.00     -      8      0      0      0      0      0      -

 CURRENT LOAD USED FOR SCHEDULING:
              r15s   r1m  r15m    ut    pg    io   ls    it   tmp   swp   mem
 Total          -     -     -     -     -     -     1    -     -     -     - 
 Reserved      0.0   0.0   0.0    0%   0.0     0    0     0    0M    0M    0M


 LOAD THRESHOLD USED FOR SCHEDULING:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

When a partition is down, bhosts shows all LSF HPC hosts belonging to the partition as closed_Adm.

Making LSF HPC available to users

After verifying that LSF HPC is operating properly, make LSF available to your users by having them include LSF_ENVDIR/cshrc.lsf or LSF_ENVDIR/profile.lsf in their .cshrc or .profile.



Operating Platform LSF HPC with SLURM

Platform LSF HPC SLURM allocation plugin

The Platform LSF HPC external scheduler plugin for SLURM (schmod_slurm) is loaded on the LSF HPC master host by mbschd and handles all communication between the LSF HPC scheduler and SLURM. It translates LSF HPC concepts (hosts and job slots) into SLURM concepts (nodes, allocation options, and allocation shape).

Platform LSF HPC allocation adapter (RLA)

The Platform LSF HPC allocation adapter (RLA) is located on each LSF HPC host. RLA is started by sbatchd and runs on the SLURM node with resource manager role. It is the interface for the LSF SLURM plugin and the SLURM system.

To schedule a job, the SLURM external scheduler plugin calls RLA to:

The SLURM allocation plugin works with RLA to perform the allocation calculation and uses RLA services to allocate and de-allocate nodes. sbatchd places jobs within the allocated nodes.

Job lifecycle

How jobs run

LSF schedules jobs based on their resource requirements and communicates with the SLURM system to allocate the resources needed for the job to run. LSF provides node-level scheduling for parallel jobs and CPU-level scheduling for serial jobs.

After the LSF scheduler creates SLURM resources, it saves the allocation information to the LSF event file (lsb.events) and the accounting file (lsb.acct).

When LSF starts a job, it sets the SLURM_JOBID and SLURM_NPROCS environment variables in the job environment. SLURM_JOBID associates the LSF job with the SLURM allocated resources. SLURM_NPROCS corresponds to the bsub -n option. The LSF job file is started on the same node where the LSF daemons run. You must use srun or mpirun to launch the real tasks on the allocated nodes.

After the job finishes, LSF cleans up the SLURM resources.

1. Job submission

Use bsub with the -ext SLURM[] external scheduler parameter to submit jobs.

In a mixed cluster, use -R "select[defined(slurm)]" to explicitly run jobs on a SLURM cluster.

Use srun to launch real parallel tasks on the allocated nodes.

2. Job scheduling

For each job, the SLURM scheduler plugin

3. Job execution

When a job starts, sbatchd

By default, pre-execution programs start on the resource manager node. You can use srun to launch pre-execution programs on all allocated nodes. See Running pre-execution programs for more information.

Interactive batch jobs (bsub -I) are started on the resource manager node.

Normal batch jobs (bsub without -I) are started on the first node of the SLURM allocation.

When sbatchd receives a signal request, it sends signals directly to job processes running on the resource manager node and uses the SLURM scancel command to propagate the signal to all remote tasks.

4. Job finish (Done/Exit)

For interactive jobs, sbatchd considers a job finished if:

sbatchd uses the SLURM scontrol command to check whether a job exited because of a SLURM node failure (NODE_FAIL). If so, sbatchd sets the TERM_SLURM job termination reason and the job exit code to 123. Configure REQUEUE_EXIT_VALUES in the queue to enable automatic job requeue.

Post-execution commands run on the resource manager node.

After mbschd receives a job finish event, the SLURM plugin contacts RLA to clean up the SLURM job allocation.

Supported srun -A allocation shape options

-n, --ntasks=ntasks
    Number of processes (tasks) to run. Total CPUs required = ncpus * ntasks.
    LSF equivalent: bsub -n

-c, --cpus-per-task=ncpus
    Number of CPUs per task. Minimum CPUs per node = MAX(ncpus, mincpus).
    LSF equivalent: not provided; the meaning of this option is already covered by bsub -n and -ext "SLURM[mincpus=num_cpus]"

-N, --nodes=min[-max]
    Minimum number of nodes in the allocation request. Optionally, specifies a range of minimum to maximum nodes. The allocation will contain at least the minimum number of nodes, but cannot exceed the maximum.
    LSF equivalent: -ext "SLURM[nodes=min[-max]]"

--mincpus=n
    Minimum number of CPUs on the node. Minimum CPUs per node = MAX(-c ncpus, --mincpus=n). The default is 1.
    LSF equivalent: -ext "SLURM[mincpus=num_cpus]"

--mem=MB
    Minimum amount of real memory on each node, in MB.
    LSF equivalent: -ext "SLURM[mem=MB]"

--tmp=MB
    Minimum amount of space on the /tmp file system on each node, in MB.
    LSF equivalent: -ext "SLURM[tmp=MB]"

-C, --constraint=list
    A list of constraints on the node allocation. The constraint list is a logical expression containing multiple features separated by | (OR: all nodes must have at least one of the listed features) and & (AND: all nodes must have all listed features).
    LSF equivalent: -ext "SLURM[constraint=list]"

-w, --nodelist=node1,..,nodeN
    Request a specific list of nodes: a comma-separated list of node names, or a list of node ranges, that must be included in the allocation. If you specify a node list with contiguous allocation, the nodes in the node list must be contiguous for the job to run; you cannot specify a non-contiguous node list. nodelist cannot specify the first execution node; SLURM starts the job on the leftmost node in the allocation.
    LSF equivalent: -ext "SLURM[nodelist=node_list]"

-x, --exclude=node1,..,nodeN
    Comma-separated list of node name ranges that must be excluded from the allocation.
    LSF equivalent: -ext "SLURM[exclude=node_list]"

-p, --partition=partition
    Request resources from the specified partition.
    LSF equivalent: one RootOnly partition for LSF, named lsf

--contiguous
    Fit the allocation in a single block of nodes with consecutive node indices.
    LSF equivalent: -ext "SLURM[contiguous=yes]"

How LSF HPC reports resource metrics

LSF treats an entire SLURM cluster as one LSF host with multiple CPUs and provides a single system image to end users. The following tables summarize the static resource metrics and load indices reported for SLURM clusters.

Static resource metrics

Only the following static resource metrics are available for each compute node:

ncpus
    Total number of usable CPUs on the host. Calculated as the minimum of the CPUs of all available nodes in the LSF partition and the number of licensed CPUs. If the total number of usable CPUs is 0, LIM sets ncpus to 1 and closes the host.

maxmem
    Maximum amount of memory available for user processes. Calculated as the minimum value over all compute nodes.

maxtmp
    Maximum /tmp space available on the host. Calculated as the minimum value over all compute nodes.

maxswap
    Total available swap space. Not available (-).

ndisks
    Number of disks attached to the host. Not available (-).

Load indices

The number of login users (ls) is the only load index that LSF reports. For load indices that cannot be calculated (r15s, r1m, r15m, ut, pg, io, it, tmp, swp, and mem), lshosts and lsload display not available (-).

r15s
    15-second exponentially averaged CPU run queue length. Not available (-).

r1m
    1-minute exponentially averaged CPU run queue length. Not available (-).

r15m
    15-minute exponentially averaged CPU run queue length. Not available (-).

ut
    CPU utilization exponentially averaged over the last minute, in the 0-1 interval. Not available (-).

pg
    Memory paging rate exponentially averaged over the last minute, in pages per second. Not available (-).

io
    I/O rate exponentially averaged over the last minute, in KB per second. Not available (-).

ls
    Number of currently logged-in users. The value on the resource manager node.

it
    Idle time of the host (keyboard not touched on all login sessions), in seconds. Not available (-).

tmp
    Amount of free space on /tmp, in MB. Not available (-).

swp
    Amount of free swap space, in MB. Not available (-).

mem
    Amount of free memory, in MB. Not available (-).

Custom load indices


You can configure LSF HPC to report other load indices. For more information about LSF HPC load indices, see Administering Platform LSF.

LSF HPC Licensing

LSF HPC licenses are managed by the Platform LSF HPC licensing mechanism, which determines whether LSF HPC is correctly licensed for the appropriate number of CPUs on the LSF HPC host.

The LSF license is not transferable to any other hosts in the LSF HPC cluster.

The following LSF HPC features are enabled:

How to get additional LSF HPC licenses

To get licenses for additional LSF HPC features, contact Platform Computing at license@platform.com. For example, to enable Platform MultiCluster licenses in your LSF HPC cluster, get a license key for the feature lsf_multicluster.

For more information about LSF HPC features and licensing, see Administering Platform LSF.

Best-fit and first-fit cluster-wide allocation policies

By default, LSF applies a first-fit allocation policy to select from the nodes available for the job. The allocations are made left to right for all parallel jobs, and right to left for all serial jobs (all other job requirements being equal).

In a heterogeneous SLURM cluster, a best-fit allocation may be preferable for clusters where a mix of serial and parallel jobs run. In this context, best fit means: "the nodes that minimally satisfy the requirements." Nodes with the maximum number of CPUs are chosen first. For parallel and serial jobs, the nodes with minimal memory, minimal tmp space, and minimal weight are chosen.

To enable best-fit allocation, specify LSB_SLURM_BESTFIT=Y in lsf.conf.

Node failover

The failover mechanism on SLURM clusters requires two nodes with resource manager role, where one node is the master and the other is the backup. At any time, LSF daemons should be started and running only on the master node.

When failover happens, the administrator must restart the LSF daemons on the backup node, which then becomes the new master node. LSF daemons and clients on other hosts can communicate with the new LSF daemons.

Requeuing exited jobs when the resource manager node fails

LSF jobs already started on the master node exit with exit code 122 if the master node goes down. To make sure that these jobs are restarted when the LSF daemons restart either on the backup node (as new master) or on the original master node, configure REQUEUE_EXIT_VALUES in lsb.queues to requeue the jobs automatically.

For example:

Begin Queue
QUEUE_NAME   = high
...
REQUEUE_EXIT_VALUES = 122
...
End Queue

Threshold conditions to report number of CPUs

When a SLURM system starts, some compute nodes may take time to come up. If LSF starts to report the number of CPUs before all nodes are up, smaller queued jobs might be started when the better choice would be to start a larger job that needs more CPUs.

To make sure all usable nodes are available for LSF to dispatch jobs to, use the following parameters in lsf.conf to control when LSF starts to report total usable CPUs on a SLURM cluster:

Running pre-execution programs

Though LSF daemons only run on a SLURM node with resource manager role, batch jobs can run on any SLURM nodes with compute role that satisfy the scheduling and allocation requirements.

By default, where pre-execution commands run depends on the type of job:

Before starting a pre-exec program, LSF sets the SLURM_JOBID environment variable. To enable srun to launch pre-execution on the first allocated node and other allocated nodes, your pre-exec program should pick up the SLURM_JOBID environment variable. SLURM_JOBID gives LSF HPC the information it needs to run the job on the nodes required by your pre-exec program.
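A hedged sketch of such a pre-execution program, written as a shell function for illustration (in practice this would be the body of the pre-exec script); it assumes srun is on the PATH and that LSF has already set SLURM_JOBID:

```shell
# Pre-execution sketch: run a check step inside the existing allocation.
# srun --jobid attaches the step to the LSF-created allocation instead of
# requesting a new one; /bin/true is a placeholder for a real check.
preexec_nodes_check() {
    if [ -z "$SLURM_JOBID" ]; then
        echo "SLURM_JOBID not set: not running under an LSF/SLURM allocation" >&2
        return 1
    fi
    srun --jobid="$SLURM_JOBID" /bin/true
}
```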

Controlling where a pre-execution program starts and runs

To run a pre-execution program on

See the SLURM srun command man page for more information about the SLURM_JOBID environment variable and the -N option.

Support for SLURM batch mode (srun -b)

Platform LSF HPC uses the SLURM srun -b --jobid=SLURM_JOBID command to launch jobs on the first node of the SLURM allocation.

In LSF job pre-execution programs

Do not use srun -b --jobid=SLURM_JOBID inside pre-execution programs. The command returns immediately after the SLURM batch job is submitted, which can cause the pre-exec script to exit with success while the real task is still running in batch mode.

In LSF job commands

Application-level checkpointing

To enable application-level checkpoint, the checkpoint directory specified for checkpointable jobs (CHKPNT=chkpnt_dir parameter in the configuration of the preemptable queue) must be accessible by all SLURM nodes configured in the LSF partition.

Platform LSF HPC creates checkpoint trigger files in the job working directory to trigger the checkpoint process of applications. Since the specified checkpoint directory must be accessible by all nodes of the LSF partition, the checkpoint trigger files will be readable by the application at run time.

For more information

For information about checkpointing and restart, and about checkpointing specific applications, see Administering Platform LSF.



Submitting and Monitoring Jobs with SLURM

bsub command

To submit a job, use the bsub command.

Syntax

bsub -n min_cpus[,max_cpus] 
-ext[sched] "SLURM[[allocation_options][;allocation_options]...]" job_name

Specify allocation options for SLURM jobs either in the -ext option, or with DEFAULT_EXTSCHED or MANDATORY_EXTSCHED in a queue definition in lsb.queues.


You can abbreviate the -extsched option to -ext.

The options set by -ext can be combined with the queue-level MANDATORY_EXTSCHED or DEFAULT_EXTSCHED parameters.

The -ext "SLURM[]" options override the DEFAULT_EXTSCHED parameter, and the MANDATORY_EXTSCHED parameter overrides -ext "SLURM[]" options.
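The precedence above can be illustrated with a small shell function. This is a simplified model for a single option, not the actual LSF merge logic (which combines individual options within the SLURM[] string):

```shell
# Simplified model of SLURM[] option precedence:
# DEFAULT_EXTSCHED < job-level -ext < MANDATORY_EXTSCHED.
effective_option() {
    default=$1; job=$2; mandatory=$3
    result=$default
    if [ -n "$job" ]; then result=$job; fi             # -ext overrides DEFAULT_EXTSCHED
    if [ -n "$mandatory" ]; then result=$mandatory; fi # MANDATORY_EXTSCHED overrides -ext
    echo "$result"
}

effective_option "nodes=2" "nodes=4" ""        # job-level wins: prints nodes=4
effective_option "nodes=2" "nodes=4" "nodes=8" # mandatory wins: prints nodes=8
```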

Controlling and monitoring jobs

Use bkill, bstop, and bresume to kill, suspend, and resume jobs.

Use bjobs, bacct, and bhist to view allocation and job resource usage information.

Examples

For more information

Running jobs on any host type

You can specify several types of allocation options at job submission and LSF HPC will schedule jobs appropriately. Jobs that do not specify SLURM-related options can be dispatched to SLURM hosts, and jobs with SLURM-related options can be dispatched to non-SLURM hosts.

Use the LSF HPC resource requirements specification (-R option of bsub or RES_REQ in queue definition in lsb.queues) to identify the host types required for your job.

SLURM hosts can exist in the same LSF HPC cluster with other host types. Use the -R option to define host type resource requirements. For example, 64-bit Linux hosts are host type LINUX64 and SLURM hosts are host type SLINUX64.

Examples

% bsub -n 8 -R "type==any" -ext "SLURM[nodes=4-4];RMS[ptile=2]" myjob

If myjob runs on an RMS-enabled host, the RMS ptile option is applied. If it runs on any other host type, the SLURM and RMS options are ignored.

Viewing nodes allocated to your job

After LSF allocates nodes for a job, it attaches the allocation information to the job, so you can view the allocation through bjobs, bhist, and bacct.

The job allocation information string has the form:

slurm_id=slurm_jobid;ncpus=number;slurm_alloc=node_list;

where:

For example:

Tue Aug 31 16:22:27: slurm_id=60;ncpus=4;slurm_alloc=n[14-15];
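The string splits cleanly on semicolons; for example, a small shell sketch that extracts the node list from the line above:

```shell
# Extract fields from an LSF/SLURM allocation information string.
alloc="slurm_id=60;ncpus=4;slurm_alloc=n[14-15];"

# Pull out one field by name (here: the allocated node list)
slurm_alloc=$(echo "$alloc" | tr ';' '\n' | sed -n 's/^slurm_alloc=//p')
echo "$slurm_alloc"    # prints n[14-15]
```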

Running jobs (bjobs -l)

Use bjobs -l to see SLURM allocation information for a running job.

For example, the following job allocates nodes on host hostA:

% bsub -n 8 -ext "SLURM[]" mpirun -srun /usr/share/lsf7slurm/bin/hw
Job <7267> is submitted to default queue <normal>.

bjobs output looks like this:

% bjobs -l 7267

Job <7267>, User <user1>, Project <default>, Status <DONE>, Queue <normal>,
                     Extsched <SLURM[]>, Command <mpirun -srun /usr/share/lsf7
                     slurm/bin/hw>
Thu Sep 16 15:29:06: Submitted from host <hostA>, CWD </usr/share/lsf7slurm/b
                     in>, 8 Processors Requested;
Thu Sep 16 15:29:16: Started on 8 Hosts/Processors <8*hostA>;
Thu Sep 16 15:29:16: slurm_id=21795;ncpus=8;slurm_alloc=n[13-16];
Thu Sep 16 15:29:23: Done successfully. The CPU time used is 0.0 seconds.

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

 EXTERNAL MESSAGES:
 MSG_ID FROM       POST_TIME      MESSAGE                            ATTACHMENT
 0          -             -                        -                      -
 1      user1      Sep 16 15:29   SLURM[]                                 N

Finished jobs (bhist -l)

Use bhist -l to see SLURM allocation information for finished jobs. For example:

% bhist -l 7267

Job <7267>, User <user1>, Project <default>, Extsched <SLURM[]>, Command <mpi
                     run -srun /usr/share/lsf7slurm/bin/hw>
Thu Sep 16 15:29:06: Submitted from host <hostA>, to Queue <normal>, CWD </u
                     sr/share/lsf7slurm/bin>, 8 Processors Requested;
Thu Sep 16 15:29:16: Dispatched to 8 Hosts/Processors <8*hostA>;
Thu Sep 16 15:29:16: slurm_id=21795;ncpus=8;slurm_alloc=n[13-16];
Thu Sep 16 15:29:16: Starting (Pid 5804);
Thu Sep 16 15:29:23: Done successfully. The CPU time used is 0.0 seconds;
Thu Sep 16 15:29:23: Post job process done successfully;

Summary of time in seconds spent in various states by  Thu Sep 16 15:29:23
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  10       0        7        0        0        0        17

Job accounting information (bacct -l)

Use bacct -l to see SLURM allocation information logged to lsb.acct. For example:

% bacct -l 7267

Accounting information about jobs that are:
  - submitted by all users.
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------

Job <7267>, User <user1>, Project <default>, Status <DONE>, Queue <normal>,
                     Command <mpirun -srun /usr/share/lsf7slurm/bin/hw>
Thu Sep 16 15:29:06: Submitted from host <hostA>, CWD </usr/share/lsf7slurm/b
                     in>;
Thu Sep 16 15:29:16: Dispatched to 8 Hosts/Processors <8*hostA>;
Thu Sep 16 15:29:16: slurm_id=21795;ncpus=8;slurm_alloc=n[13-16];
Thu Sep 16 15:29:23: Completed <done>.

Accounting information about this job:
     Share group charged </user1>
     CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP
      0.05       10             17     done         0.0029     0K      0K
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second )
 Total number of done jobs:       1      Total number of exited jobs:     0
 Total CPU time consumed:       0.1      Average CPU time consumed:     0.1
 Maximum CPU time of a job:     0.1      Minimum CPU time of a job:     0.1
 Total wait time in queues:     7.0
 Average wait time in queue:    7.0
 Maximum wait time in queue:    7.0      Minimum wait time in queue:    7.0
 Average turnaround time:        28 (seconds/job)
 Maximum turnaround time:        28      Minimum turnaround time:        28
 Average hog factor of a job:  0.00 ( cpu time / turnaround time )
 Maximum hog factor of a job:  0.00      Minimum hog factor of a job:  0.00

Example job submissions

Environment

On one SLURM cluster, the lsf partition is configured with 4 compute nodes, each with 2 CPUs. LSF is installed on hostA. lsid, lshosts, bhosts, and sinfo show the configuration:

% lsid
Platform LSF HPC 7.0.3.108010 for SLURM, Jun  3 2008
Copyright 1992-2009 Platform Computing Corporation

My cluster name is cluster1
My master name is hostA

% lshosts
HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
hostA       SLINUX6 Itanium2  16.0     8     1M      -    Yes (slurm)

% bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV 
hostA              ok              -      8      0      0      0      0      0
% sinfo
PARTITION AVAIL TIMELIMIT NODES  STATE NODELIST
lsf          up  infinite     4  alloc n[13-16]

Submit a job script with -I

Multiple srun commands are specified inside the script:

% cat myjobscript.sh 
#!/bin/sh
srun hostname
srun uname -a

Submit the job:

% bsub -I -n 4 < myjobscript.sh
Job <1> is submitted to default queue <normal>.
<<Waiting for dispatch ...>>
<<Starting on hostA>>
n13
n13
n14
n14
Linux n13 2.4.21-15.9smp #1 SMP Wed Aug 25 01:07:12 EDT 2009 ia64 ia64 ia64 
GNU/Linux
Linux n13 2.4.21-15.9smp #1 SMP Wed Aug 25 01:07:12 EDT 2009 ia64 ia64 ia64 
GNU/Linux
Linux n14 2.4.21-15.9smp #1 SMP Wed Aug 25 01:07:12 EDT 2009 ia64 ia64 ia64 
GNU/Linux
Linux n14 2.4.21-15.9smp #1 SMP Wed Aug 25 01:07:12 EDT 2009 ia64 ia64 ia64 
GNU/Linux

Submit /bin/sh with -Ip

Use the SLURM sinfo command to show node information:

% sinfo
PARTITION AVAIL TIMELIMIT NODES  STATE NODELIST
lsf          up  infinite     4  alloc n[13-16]

Submit the job:

% bsub -n8 -Ip /bin/sh
Job <2> is submitted to default queue <normal>.
<<Waiting for dispatch ...>>
<<Starting on hostA>>

sinfo shows the allocation:

% sinfo
PARTITION AVAIL TIMELIMIT NODES  STATE NODELIST
lsf          up  infinite     4  alloc n[13-16]

View the SLURM job ID:

% env | grep SLURM
SLURM_JOBID=18
SLURM_NPROCS=8

Run some commands:

% srun hostname
n13
n13
n14
n14
n15
n15
n16
n16

% srun -n 5 hostname
n13
n13
n14
n15
n16

% exit
exit

Use bjobs to see the interactive jobs:

% bjobs -l 2

Job <2>, User <user1>, Project <default>, Status <DONE>, Queue <normal>, Int
                     eractive pseudo-terminal mode, Command </bin/sh>
Wed Sep 22 18:05:56: Submitted from host <hostA>, CWD <$HOME>, 8 Processors 
                     Requested;
Wed Sep 22 18:06:06: Started on 8 Hosts/Processors <8*hostA>;
Wed Sep 22 18:06:06: slurm_id=18;ncpus=8;slurm_alloc=n[13-16];
Wed Sep 22 18:10:37: Done successfully. The CPU time used is 0.6 seconds.

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

Use bhist to see the history of the finished jobs:

% bhist -l 2

Job <2>, User <user1>, Project <default>, Interactive pseudo-terminal mode, 
                     Command </bin/sh>
Wed Sep 22 18:05:56: Submitted from host <hostA>, to Queue <normal>, CWD <$H
                     OME>, 8 Processors Requested;
Wed Sep 22 18:06:06: Dispatched to 8 Hosts/Processors <8*hostA>;
Wed Sep 22 18:06:06: slurm_id=18;ncpus=8;slurm_alloc=n[13-16];
Wed Sep 22 18:06:06: Starting (Pid 9462);
Wed Sep 22 18:10:24: Done successfully. The CPU time used is 0.6 seconds;
Wed Sep 22 18:10:37: Post job process done successfully;

Summary of time in seconds spent in various states by  Wed Sep 22 18:10:37
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  10       0        258      0        0        0        268

Use the SLURM sinfo command to see node state:

% sinfo
PARTITION AVAIL TIMELIMIT NODES  STATE NODELIST
lsf          up  infinite     4  alloc n[13-16]

Use the SLURM scontrol command to see the SLURM job information:

% scontrol show job 18
JobId=18 UserId=user1(502) GroupId=lsfadmin(502)
   Name=lsf7slurm@2 JobState=COMPLETED
   Priority=4294901743 Partition=lsf BatchFlag=0
   AllocNode:Sid=n16:8833 TimeLimit=UNLIMITED
   StartTime=09/22-18:06:01 EndTime=09/22-18:10:24
   NodeList=n[13-16] NodeListIndecies=-1
   ReqProcs=0 MinNodes=0 Shared=0 Contiguous=0
   MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   ReqNodeList=(null) ReqNodeListIndecies=-1
   ExcNodeList=(null) ExcNodeListIndecies=-1

Run an MPI job

Submit the job:

% bsub -I -n6 -ext "SLURM[nodes=3]" /opt/mpi/bin/mpirun -srun 
/usr/share/lsf7slurm/bin/hw
Job <6> is submitted to default queue <normal>.
<<Waiting for dispatch ...>>
<<Starting on hostA>>
I'm process 0! from ( n13 pid 27222)
Greetings from process 1! from ( n13 pid 27223)
Greetings from process 2! from ( n14 pid 14011)
Greetings from process 3! from ( n14 pid 14012)
Greetings from process 4! from ( n15 pid 18227)
Greetings from process 5! from ( n15 pid 18228)
mpirun exits with status: 0

Use bjobs to see the job:

% bjobs -l 6

Job <6>, User <user1>, Project <default>, Status <DONE>, Queue <normal>, Int
                     eractive mode, Extsched <SLURM[nodes=3]>, Command </opt/
                     mpi/bin/mpirun -srun /usr/share/lsf7slurm/bin/hw>
Wed Sep 22 18:16:51: Submitted from host <hostA>, CWD <$HOME>, 6 Processors 
                     Requested;
Wed Sep 22 18:17:02: Started on 6 Hosts/Processors <6*hostA>;
Wed Sep 22 18:17:02: slurm_id=22;ncpus=6;slurm_alloc=n[13-15];
Wed Sep 22 18:17:09: Done successfully. The CPU time used is 0.0 seconds.

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

 EXTERNAL MESSAGES:
 MSG_ID FROM      POST_TIME      MESSAGE                             ATTACHMENT 
 0      -         -              -                                        -
 1      user1     Sep 22 18:16   SLURM[nodes=3]                           N

Use bhist to see the history of the finished job:

% bhist -l 6

Job <6>, User <user1>, Project <default>, Interactive mode, Extsched <SLURM[
                     nodes=3]>, Command </opt/mpi/bin/mpirun -srun /usr/share
                     /lsf7slurm/bin/hw>
Wed Sep 22 18:16:51: Submitted from host <hostA>, to Queue <normal>, CWD <$H
                     OME>, 6 Processors Requested;
Wed Sep 22 18:17:02: Dispatched to 6 Hosts/Processors <6*hostA>;
Wed Sep 22 18:17:02: slurm_id=22;ncpus=6;slurm_alloc=n[13-15];
Wed Sep 22 18:17:02: Starting (Pid 11216);
Wed Sep 22 18:17:09: Done successfully. The CPU time used is 0.0 seconds;
Wed Sep 22 18:17:09: Post job process done successfully;

Summary of time in seconds spent in various states by  Wed Sep 22 18:17:09
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  11       0        7        0        0        0        18

Run a job with a SLURM allocation request

Submit jobs to a SLURM cluster with three compute nodes (n13, n14, and n16).

[ Top ]


SLURM Command Reference

bsub command

-ext[sched] "SLURM[[allocation_options][;allocation_options] ...]"

Specifies allocation options for SLURM jobs.


You can abbreviate the -extsched option to -ext.

The options set by -ext can be combined with the queue-level MANDATORY_EXTSCHED or DEFAULT_EXTSCHED parameters.

The -ext "SLURM[]" options override the DEFAULT_EXTSCHED parameter, and the MANDATORY_EXTSCHED parameter overrides -ext "SLURM[]" options.

See lsb.queues file for more information about MANDATORY_EXTSCHED and DEFAULT_EXTSCHED.

allocation_options

Specifies the SLURM allocation shape options for the job:

Usage

If allocation options are set in DEFAULT_EXTSCHED, and you do not want to specify values for these options, use the keyword with no value in the -ext option of bsub. For example, if DEFAULT_EXTSCHED=SLURM[nodes=2], and you do not want to specify any node option at all, use -ext "SLURM[nodes=]".
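
The "keyword with no value" behavior described above can be pictured with a small illustrative parser for the SLURM[...] option string. This sketch is an assumption for illustration only, not LSF's actual implementation:

```python
# Illustrative parser for a SLURM[...] extsched option string.
# A keyword with no value (e.g. nodes=) parses to an empty string,
# which LSF treats as "clear this option".

def parse_slurm_options(extsched):
    """Parse 'SLURM[k1=v1;k2=v2]' into a dict; 'k=' maps to ''."""
    inner = extsched[len("SLURM["):-1]
    opts = {}
    for item in filter(None, inner.split(";")):
        key, _, value = item.partition("=")
        opts[key] = value
    return opts

print(parse_slurm_options("SLURM[nodes=2;tmp=100]"))  # {'nodes': '2', 'tmp': '100'}
print(parse_slurm_options("SLURM[nodes=]"))           # {'nodes': ''} -- keyword with no value
```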

[ Top ]


SLURM File Reference

lsb.queues file

DEFAULT_EXTSCHED

Syntax

DEFAULT_EXTSCHED=SLURM[[allocation_options][;allocation_options] ...]

Description

Specifies SLURM allocation options for the queue.

-ext options on the bsub command are merged with DEFAULT_EXTSCHED options, and -ext options override any conflicting queue-level options set by DEFAULT_EXTSCHED.

For example, if DEFAULT_EXTSCHED=SLURM[nodes=2;tmp=100] and a job is submitted with -ext "SLURM[nodes=3;tmp=]", LSF HPC uses the following resulting options for scheduling:

SLURM[nodes=3]

nodes=3 in the -ext option overrides nodes=2 in DEFAULT_EXTSCHED, and tmp= in -ext option overrides tmp=100 in DEFAULT_EXTSCHED.

DEFAULT_EXTSCHED can be used in combination with MANDATORY_EXTSCHED in the same queue. For example, if DEFAULT_EXTSCHED=SLURM[nodes=2;tmp=100], MANDATORY_EXTSCHED=SLURM[contiguous=yes;tmp=200], and a job is submitted with -ext "SLURM[nodes=3;tmp=]", LSF HPC uses the resulting options for scheduling:

SLURM[nodes=3;contiguous=yes;tmp=200]

nodes=3 in the -ext option overrides nodes=2 in DEFAULT_EXTSCHED, and tmp= in -ext option overrides tmp=100 in DEFAULT_EXTSCHED. MANDATORY_EXTSCHED adds contiguous=yes, and overrides tmp= in -ext option and tmp=100 in DEFAULT_EXTSCHED with tmp=200.
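
The precedence described above (DEFAULT_EXTSCHED is overridden by -ext, which is overridden by MANDATORY_EXTSCHED, and an empty value clears an option) can be sketched as a dictionary merge. This is an illustration of the documented merge rules, not LSF source code:

```python
# Sketch of the documented extsched merge precedence:
# DEFAULT_EXTSCHED < bsub -ext < MANDATORY_EXTSCHED.
# An empty value (e.g. tmp=) clears the option at that level.

def merge_extsched(default, ext, mandatory):
    merged = dict(default)
    for layer in (ext, mandatory):
        for key, value in layer.items():
            if value == "":
                merged.pop(key, None)  # keyword with no value clears it
            else:
                merged[key] = value
    return merged

result = merge_extsched(
    default={"nodes": "2", "tmp": "100"},           # DEFAULT_EXTSCHED=SLURM[nodes=2;tmp=100]
    ext={"nodes": "3", "tmp": ""},                  # bsub -ext "SLURM[nodes=3;tmp=]"
    mandatory={"contiguous": "yes", "tmp": "200"},  # MANDATORY_EXTSCHED=SLURM[contiguous=yes;tmp=200]
)
print(result)  # {'nodes': '3', 'contiguous': 'yes', 'tmp': '200'}
```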

If allocation options are set in DEFAULT_EXTSCHED, and you do not want to specify values for these options, use the keyword with no value in the -ext option of bsub. For example, if DEFAULT_EXTSCHED=SLURM[nodes=2], and you do not want to specify any node option at all, use -ext "SLURM[nodes=]".

See bsub command for more information.

Default

Undefined.

MANDATORY_EXTSCHED

Syntax

MANDATORY_EXTSCHED=SLURM[[allocation_options][;allocation_options] ...]

Description

Specifies mandatory SLURM allocation options for the queue.

-ext options on the bsub command are merged with MANDATORY_EXTSCHED options, and MANDATORY_EXTSCHED options override any conflicting job-level options set by -ext.

For example, if MANDATORY_EXTSCHED=SLURM[contiguous=yes;tmp=200] and a job is submitted with -ext "SLURM[nodes=3;tmp=100]", LSF HPC uses the following resulting options for scheduling:

SLURM[nodes=3;contiguous=yes;tmp=200]

MANDATORY_EXTSCHED can be used in combination with DEFAULT_EXTSCHED in the same queue. For example, if DEFAULT_EXTSCHED=SLURM[nodes=2;tmp=100], MANDATORY_EXTSCHED=SLURM[contiguous=yes;tmp=200], and a job is submitted with -ext "SLURM[nodes=3;tmp=]", LSF HPC uses the resulting options for scheduling:

SLURM[nodes=3;contiguous=yes;tmp=200]

nodes=3 in the -ext option overrides nodes=2 in DEFAULT_EXTSCHED, and tmp= in -ext option overrides tmp=100 in DEFAULT_EXTSCHED. MANDATORY_EXTSCHED adds contiguous=yes, and overrides tmp= in -ext option and tmp=100 in DEFAULT_EXTSCHED with tmp=200.

If you want to prevent users from setting allocation options in the -ext option of bsub, use the keyword with no value. For example, if the job is submitted with -ext "SLURM[nodes=4]", use MANDATORY_EXTSCHED=SLURM[nodes=] to override this setting.

See bsub command for more information.

Default

Undefined.

lsf.conf file

LSB_RLA_PORT

Syntax

LSB_RLA_PORT=port_number

Description

TCP port used for communication between the LSF HPC allocation adapter (RLA) and the SLURM scheduler plugin.

Default

6883

LSB_RLA_TIMEOUT

Syntax

LSB_RLA_TIMEOUT=seconds

Description

Defines the communications timeout between RLA and its clients (for example, sbatchd and the SLURM scheduler plugin).

Default

10 seconds

LSB_RLA_UPDATE

Syntax

LSB_RLA_UPDATE=seconds

Description

Specifies how often the LSF HPC scheduler refreshes free node information from RLA.

Default

600 seconds

LSB_RLA_WORKDIR

Syntax

LSB_RLA_WORKDIR=directory

Description

Directory to store the RLA status file. Allows RLA to recover its original state when it restarts. When RLA first starts, it creates the directory defined by LSB_RLA_WORKDIR if it does not exist, then creates subdirectories for each host.

You should avoid using /tmp or any other directory that is automatically cleaned up by the system. Unless your installation has restrictions on the LSB_SHAREDIR directory, you should use the default for LSB_RLA_WORKDIR.

Default

LSB_SHAREDIR/cluster_name/rla_workdir

LSB_SLURM_BESTFIT

Syntax

LSB_SLURM_BESTFIT=y | Y

Description

Enables best-fit node allocation.

By default, LSF applies a first-fit allocation policy to select from the nodes available for the job. The allocations are made left to right for all parallel jobs, and right to left for all serial jobs (all other job requirements being equal).

In a heterogeneous SLURM cluster where a mix of serial and parallel jobs run, a best-fit allocation may be preferable. In this context, best fit means the nodes that minimally satisfy the requirements: nodes with the maximum number of CPUs are chosen first, and for both parallel and serial jobs, the nodes with minimal memory, minimal tmp space, and minimal weight are chosen.
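
The difference between the two policies can be sketched as follows. The node attributes and sort key here are assumptions for illustration; they are not LSF's internal data model:

```python
# Hypothetical node table: (name, ncpus, free_mem_MB, free_tmp_MB, weight).
nodes = [
    ("n13", 2, 4096, 500, 1),
    ("n14", 4, 2048, 200, 1),
    ("n15", 4, 1024, 100, 1),
    ("n16", 2, 8192, 900, 2),
]

def first_fit(nodes, need_mem):
    # Default policy sketch: left-to-right scan, take the first node
    # that satisfies the job's requirement.
    return next(n for n in nodes if n[2] >= need_mem)

def best_fit(nodes, need_mem):
    # Best-fit sketch: prefer the most CPUs, then minimal memory,
    # tmp space, and weight -- the node that minimally satisfies
    # the requirement.
    ok = [n for n in nodes if n[2] >= need_mem]
    return min(ok, key=lambda n: (-n[1], n[2], n[3], n[4]))

print(first_fit(nodes, 1000)[0])  # n13
print(best_fit(nodes, 1000)[0])   # n15
```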

Default

Undefined

LSF_ENABLE_EXTSCHEDULER

Syntax

LSF_ENABLE_EXTSCHEDULER=y | Y

Description

Enables external scheduling for Platform LSF HPC.

Default

Y (automatically set by lsfinstall)

LSF_HPC_EXTENSIONS

Syntax

LSF_HPC_EXTENSIONS="extension_name ..."

Description

Enables Platform LSF HPC extensions.

Valid values

The following extension names are supported:

Default

Undefined

LSF_HPC_NCPU_COND

Syntax

LSF_HPC_NCPU_COND=and | or

Description

Defines how the LSF_HPC_NCPU_* conditions are combined: with and, all conditions must be satisfied; with or, satisfying any one condition is sufficient.

Default

or

LSF_HPC_NCPU_INCREMENT

Syntax

LSF_HPC_NCPU_INCREMENT=increment

Description

Defines the upper limit on the number of CPUs that can change since the last checking cycle.

Default

0

LSF_HPC_NCPU_INCR_CYCLES

Syntax

LSF_HPC_NCPU_INCR_CYCLES=incr_cycles

Description

Minimum number of consecutive cycles where the number of CPUs changed does not exceed LSF_HPC_NCPU_INCREMENT. LSF checks total usable CPUs every 2 minutes.

Default

1

LSF_HPC_NCPU_THRESHOLD

Syntax

LSF_HPC_NCPU_THRESHOLD=threshold

Description

Defines a threshold as a percentage of the total usable CPUs in the LSF partition.

Default

80
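
Based only on the parameter descriptions above, the interaction of the LSF_HPC_NCPU_* settings might be sketched as follows. The exact evaluation logic inside LSF is an assumption here; only the 2-minute checking cycle, the threshold percentage, the increment/cycle stability check, and the and/or combination come from the text:

```python
# Hedged sketch of how the LSF_HPC_NCPU_* conditions could combine.
# Evaluated once per checking cycle (LSF checks total usable CPUs
# every 2 minutes).

def ncpu_condition(usable, total, stable_cycles,
                   threshold=80, incr_cycles=1, cond="or"):
    # Condition 1: usable CPUs reach LSF_HPC_NCPU_THRESHOLD percent
    # of the total CPUs in the LSF partition.
    threshold_ok = usable >= total * threshold / 100.0
    # Condition 2: the CPU count stayed within LSF_HPC_NCPU_INCREMENT
    # for at least LSF_HPC_NCPU_INCR_CYCLES consecutive cycles.
    increment_ok = stable_cycles >= incr_cycles
    # LSF_HPC_NCPU_COND selects how the two conditions combine.
    if cond == "and":
        return threshold_ok and increment_ok
    return threshold_ok or increment_ok

print(ncpu_condition(usable=7, total=8, stable_cycles=0, cond="and"))  # False
print(ncpu_condition(usable=7, total=8, stable_cycles=0, cond="or"))   # True
```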

LSF_NON_PRIVILEGED_PORTS

Syntax

LSF_NON_PRIVILEGED_PORTS=y | Y

Description

Disables the use of privileged ports.

By default, LSF daemons and clients running under the root account use privileged ports to communicate with each other. If LSF_NON_PRIVILEGED_PORTS is not defined and LSF_AUTH is not defined in lsf.conf, LSF daemons check the privileged port of the request message for authentication.

If LSF_NON_PRIVILEGED_PORTS=Y is defined, LSF clients (LSF commands and daemons) do not use privileged ports to communicate with daemons, and LSF daemons do not check the privileged ports of incoming requests for authentication.

Default

Undefined

LSF_SLURM_BINDIR

Syntax

LSF_SLURM_BINDIR=absolute_path

Description

Specifies an absolute path to the directory containing the SLURM commands. If you install SLURM in a different location from the default, you must define LSF_SLURM_BINDIR.

Default

/opt/hptc/slurm/bin

LSF_SLURM_DISABLE_CLEANUP

Syntax

LSF_SLURM_DISABLE_CLEANUP=y | Y

Description

Disables cleanup of non-LSF jobs running in a SLURM LSF partition.

By default, only LSF jobs are allowed to run within a SLURM LSF partition. LSF periodically cleans up any jobs submitted outside of LSF; the cleanup interval is defined by LSB_RLA_UPDATE.

For example, the following srun job is not submitted through LSF, so it is terminated:

% srun -n 4 -p lsf sleep 100000
srun: error: n13: task[0-1]: Terminated
srun: Terminating job

If LSF_SLURM_DISABLE_CLEANUP=Y is set, this job would be allowed to run.

Default

Undefined

LSF_SLURM_TMPDIR

Syntax

LSF_SLURM_TMPDIR=path

Description

Specifies the LSF tmp directory for SLURM clusters. The default LSF_TMPDIR /tmp cannot be shared across nodes, so LSF_SLURM_TMPDIR must specify a path that is accessible on all SLURM nodes.

LSF_SLURM_TMPDIR only affects SLURM machine configuration. It is ignored on other systems in a mixed cluster environment.

The location of LSF tmp directory is determined in the following order:

Default

/hptc_cluster/lsf/tmp

[ Top ]


[ Platform Documentation ] [ Title ] [ Contents ] [ Previous ] [ Next ] [ Index ]


      Date Modified: August 20, 2009
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2009 Platform Computing Corporation. All rights reserved.