
Working with Hosts


Host status

Host status describes the ability of a host to accept and run batch jobs in terms of daemon states, load levels, and administrative controls. The bhosts and lsload commands display host status.

bhosts

Displays the current status of the host:

STATUS
Description
ok
Host is available to accept and run new batch jobs.
unavail
Host is down, or LIM and sbatchd are unreachable.
unreach
LIM is running but sbatchd is unreachable.
closed
Host will not accept new jobs. Use bhosts -l to display the reasons.
unlicensed
Host does not have a valid license.

bhosts -l

Displays the closed reasons. A closed host does not accept new batch jobs:

Closed reason
Description
closed_Adm
An LSF administrator or root explicitly closed the host using badmin hclose. Running jobs are not affected.
closed_Busy
The value of a load index exceeded a threshold (configured in lsb.hosts, displayed by bhosts -l). Running jobs are not affected. Indices that exceed thresholds are identified with an asterisk (*).
closed_Excl
An exclusive batch job (i.e., bsub -x) is running on the host.
closed_cu_Excl
An exclusive compute unit job (i.e., bsub -R "cu[excl]") is running within the compute unit containing this host.
closed_Full
The configured maximum number of running jobs has been reached. Running jobs are not affected.
closed_LIM
sbatchd is running but LIM is unavailable.
closed_Lock
An LSF administrator or root explicitly locked the host using lsadmin limlock. Running jobs are suspended (SSUSP). Use lsadmin limunlock to unlock LIM on the local host.
closed_Wind
Host is closed by a dispatch window defined in lsb.hosts. Running jobs are not affected.
closed_EGO
For EGO-enabled SLA scheduling, closed_EGO indicates that the host is closed because it has not been allocated by EGO to run LSF jobs. Hosts allocated from EGO display status ok.

bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS   RUN  SSUSP  USUSP   RSV
hostA              ok            -      55    2      2    0      0       0
hostB              closed        -      20    16     16   0      0       0
... 
bhosts -l hostB
HOST  hostB
STATUS           CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV DISPATCH_WINDOW
closed_Adm      23.10     -     55      2      2      0      0      0      -
CURRENT LOAD USED FOR SCHEDULING:
             r15s   r1m  r15m    ut    pg    io   ls    it   tmp   swp   mem
Total         1.0  -0.0  -0.0    4%   9.4   148    2     3 4231M  698M  233M
Reserved      0.0   0.0   0.0    0%   0.0     0    0     0    0M    0M    0M
LOAD THRESHOLD USED FOR SCHEDULING:
          r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
loadSched   -     -     -     -       -     -    -     -     -      -      -  
loadStop    -     -     -     -       -     -    -     -     -      -      -  
                cpuspeed    bandwidth 
 loadSched          -            - 
 loadStop           -            - 

lsload

Displays the current state of the host:

Status
Description
ok
Host is available to accept and run batch jobs and remote tasks.
-ok
LIM is running but RES is unreachable.
busy
Does not affect batch jobs, only used for remote task placement (i.e., lsrun). The value of a load index exceeded a threshold (configured in lsf.cluster.cluster_name, displayed by lshosts -l). Indices that exceed thresholds are identified with an asterisk (*).
lockW
Does not affect batch jobs, only used for remote task placement (i.e., lsrun). Host is locked by a run window (configured in lsf.cluster.cluster_name, displayed by lshosts -l).
lockU
Will not accept new batch jobs or remote tasks. An LSF administrator or root explicitly locked the host using lsadmin limlock, or an exclusive batch job (bsub -x) is running on the host. Running jobs are not affected. Use lsadmin limunlock to unlock LIM on the local host.
unavail
Host is down, or LIM is unavailable.
unlicensed
The host does not have a valid license.

lsload
HOST_NAME       status  r15s   r1m  r15m   ut    pg  ls    it   tmp   swp   mem
hostA               ok   0.0   0.0   0.0   4%   0.4   0  4316   10G  302M  252M
hostB               ok   1.0   0.0   0.0   4%   8.2   2    14 4231M  698M  232M
... 
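
For example, to manually lock the local LIM (which produces the lockU status) and later unlock it, an LSF administrator runs the commands described above:

lsadmin limlock 
lsadmin limunlock 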

How LIM Determines Host Models and Types

The LIM (load information manager) daemon/service automatically collects information about hosts in an LSF cluster and accurately determines running host models and types. You can manually define up to 1024 host models and types in lsf.shared.

If lsf.shared does not define all of the host models and types found in the cluster, LIM attempts to match an unrecognized running host to one of the models and types that are defined.

LIM supports both exact matching of host models and types, and "fuzzy" matching, where an entered host model name or type is slightly different from what is defined in lsf.shared (or in ego.shared if EGO is enabled in the LSF cluster).

How does "fuzzy" matching work?

LIM reads host models and types that have been manually configured in lsf.shared. The format for entering host models and types is model_bogomips_architecture (for example, x15_4604_OpterontmProcessor142, IA64_2793, or SUNWUltra510_360_sparc). Names can be up to 64 characters long.

When LIM attempts to match a running host's model with what is entered in lsf.shared, it first attempts an exact match, and only then attempts a fuzzy match.
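
For example, a minimal sketch of a manually configured model in the HostModel section of lsf.shared, using this naming format (the model name and CPU factor here are illustrative, not values from your cluster):

Begin HostModel
MODELNAME    CPUFACTOR   ARCHITECTURE                       # keyword
Opteron142   60.0        (x15_4604_OpterontmProcessor142) 
End HostModel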

How LIM attempts to make matches

If the architecture name of the running host is the same as a definition in lsf.shared (exact match), LIM reports the reference index of the exact match; LIM detects an exact match between the model and the input architecture string.

If the architecture name is similar to what is defined in lsf.shared (fuzzy match), LIM reports a fuzzy match based on detection of 1 or 2 fields in the input architecture string:
  • For input architecture strings with only one field, if LIM cannot detect an exact match for the input string, it reports the best match. A best match is the model field that shares the most characters with the input string.
  • For input architecture strings with two fields:
    1. If LIM cannot detect an exact match, it attempts to find a best match by identifying the model field with the most characters that match the input string.
    2. LIM then attempts to find the best match on the bogomips field.
  • For input architecture strings with three fields:
    1. If LIM cannot detect an exact match, it attempts to find a best match by identifying the model field with the most characters that match the input string.
    2. After finding the best match for the model field, LIM attempts to find the best match on the architecture field.
    3. LIM then attempts to find the closest match on the bogomips field, with wildcards supported (the bogomips field can be a wildcard).

If the architecture name is illegal, LIM reports the default host model. An illegal name is one that does not follow the permitted format for an architecture string, for example, a name whose first character is not an English-language letter.

Viewing Host Information

LSF uses some or all of the hosts in a cluster as execution hosts. The host list is configured by the LSF administrator. Use the bhosts command to view host information. Use the lsload command to view host load information.

To view...
Run...
All hosts in the cluster and their status
bhosts
Condensed host groups in an uncondensed format
bhosts -X
Detailed server host information
bhosts -l and lshosts -l
Host load by host
lsload
Host architecture information
lshosts
Host history
badmin hhist
Host model and type information
lsinfo
Job exit rate and load for hosts
bhosts -l and bhosts -x
Dynamic host information
lshosts

View all hosts in the cluster and their status

  1. Run bhosts to display information about all hosts and their status.
  2. bhosts displays condensed information for hosts that belong to condensed host groups. When displaying members of a condensed host group, bhosts lists the host group name instead of the name of the individual host. For example, in a cluster with a condensed host group (groupA), an uncondensed host group (groupB containing hostC and hostE), and a host that is not in any host group (hostF), bhosts displays the following:

    bhosts 
    HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV 
    groupA             ok              5      8     4       2      0      1      1 
    hostC              ok              -      3     0       0      0      0      0 
    hostE              ok              2      4     2       1      0      0      1 
    hostF              ok              -      2     2       1      0      1      0 
     

    Define condensed host groups in the HostGroups section of lsb.hosts. To find out more about condensed host groups and to see the configuration for the above example, see Defining condensed host groups.
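
A minimal sketch of the kind of lsb.hosts configuration behind this example, assuming a CONDENSE column marks groupA as condensed (see Defining condensed host groups for the documented configuration):

Begin HostGroup
GROUP_NAME   CONDENSE   GROUP_MEMBER 
groupA       Y          (hostA hostD hostB) 
groupB       N          (hostC hostE) 
End HostGroup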

View uncondensed host information

  1. Run bhosts -X to display all hosts in an uncondensed format, including those belonging to condensed host groups:
  2. bhosts -X 
    HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV 
    hostA              ok              2      2      0      0      0      0      0 
    hostD              ok              2      4      2      1      0      0      1 
    hostB              ok              1      2      2      1      0      1      0 
    hostC              ok              -      3      0      0      0      0      0 
    hostE              ok              2      4      2      1      0      0      1 
    hostF              ok              -      2      2      1      0      1      0 
    

View detailed server host information

  1. Run bhosts -l host_name and lshosts -l host_name to display all information about each server host such as the CPU factor and the load thresholds to start, suspend, and resume jobs:
  2. bhosts -l hostB 
    HOST  hostB 
    STATUS   CPUF   JL/U   MAX   NJOBS   RUN   SSUSP   USUSP  RSV  DISPATCH_WINDOWS 
    ok       20.20   -      -     0       0      0      0      0    - 
    CURRENT LOAD USED FOR SCHEDULING: 
             r15s  r1m   r15m   ut   pg   io   ls   it   tmp   swp   mem 
    Total    0.1   0.1    0.1   9%  0.7  24   17    0    394M  396M  12M 
    Reserved 0.0   0.0    0.0   0%  0.0   0    0    0      0M    0M   0M 
    LOAD THRESHOLD USED FOR SCHEDULING: 
                r15s  r1m   r15m  ut   pg   io  ls   it   tmp  swp  mem 
    loadSched   -     -     -     -    -     -   -    -     -    -    - 
    loadStop    -     -     -     -    -     -   -    -     -    -    - 
     
                    cpuspeed    bandwidth 
     loadSched          -            - 
     loadStop           -            - 
    lshosts -l hostB 
    HOST_NAME:  hostB 
    type model cpuf ncpus ndisks maxmem maxswp maxtmp rexpri server nprocs ncores nthreads 
    LINUX86 PC6000 116.1       2      1  2016M  1983M 72917M      0    Yes   1    2        2 
     
    RESOURCES: Not defined 
    RUN_WINDOWS:  (always open) 
     
    LICENSES_ENABLED: (LSF_Base LSF_Manager LSF_MultiCluster)  
    LICENSE_NEEDED: Class(E)  
     
    LOAD_THRESHOLDS: 
      r15s   r1m   r15m   ut   pg   io   ls   it   tmp   swp   mem 
         -   1.0      -    -    -    -    -    -     -     -    4M 
    

View host load by host

The lsload command reports the current status and load levels of hosts in a cluster. The lshosts -l command shows the load thresholds.

The lsmon command provides a dynamic display of the load information. The LSF administrator can find unavailable or overloaded hosts with these tools.

  1. Run lsload to see load levels for each host:
  2. lsload 
    HOST_NAME status r15s r1m  r15m ut  pg  ls it tmp swp  mem 
    hostD     ok     1.3  1.2  0.9  92% 0.0 2  20 5M  148M 88M 
    hostB     -ok    0.1  0.3  0.7  0%  0.0 1  67 45M 25M  34M 
    hostA     busy   8.0  *7.0 4.9  84% 4.6 6  17 1M  81M  27M 
     

    The first line lists the load index names, and each following line gives the load levels for one host.

Viewing host architecture (type and model) information

An LSF cluster may consist of hosts of differing architectures and speeds. The lshosts command displays configuration information about hosts. All these parameters are defined by the LSF administrator in the LSF configuration files, or determined by the LIM directly from the system.

Host types represent binary compatible hosts; all hosts of the same type can run the same executable. Host models give the relative CPU performance of different processors. For example:

lshosts 
HOST_NAME   type    model cpuf ncpus maxmem maxswp server  RESOURCES 
hostD     SUNSOL SunSparc  6.0     1    64M   112M    Yes  (solaris cserver) 
hostM       RS6K   IBM350  7.0     1    64M   124M    Yes  (cserver aix) 
hostC       SGI6     R10K 14.0    16  1024M  1896M    Yes  (irix cserver) 
hostA       HPPA    HP715  6.0     1    98M   200M    Yes  (hpux fserver) 

In the above example, the host type SUNSOL represents Sun SPARC systems running Solaris, and SGI6 represents an SGI server running IRIX 6. The lshosts command also displays the resources available on each host.

type

The host CPU architecture. Hosts that can run the same binary programs should have the same type.

An UNKNOWN type or model indicates the host is down, or LIM on the host is down. See UNKNOWN host type or model for instructions on measures to take.

When automatic detection of host type or model fails (the host type configured in lsf.shared cannot be found), the type or model is set to DEFAULT. LSF still works on the host, but a DEFAULT model may be inefficient because of incorrect CPU factors. A DEFAULT type may also cause binary incompatibility, because a job from a DEFAULT host type can be migrated to another DEFAULT host type.

View host history

  1. Run badmin hhist to view the history of a host such as when it is opened or closed:
  2. badmin hhist hostB 
    Wed Nov 20 14:41:58: Host <hostB> closed by administrator <lsf>. 
    Wed Nov 20 15:23:39: Host <hostB> opened by administrator <lsf>. 
    

View host model and type information

  1. Run lsinfo -m to display information about host models that exist in the cluster:
  2. lsinfo -m 
    MODEL_NAME      CPU_FACTOR      ARCHITECTURE 
    PC1133               23.10      x6_1189_PentiumIIICoppermine 
    HP9K735               4.50      HP9000735_125 
    HP9K778               5.50      HP9000778 
    Ultra5S              10.30      SUNWUltra510_270_sparcv9 
    Ultra2               20.20      SUNWUltra2_300_sparc 
    Enterprise3000       20.00      SUNWUltraEnterprise_167_sparc 
    
  3. Run lsinfo -M to display all host models defined in lsf.shared:
  4. lsinfo -M 
    MODEL_NAME      CPU_FACTOR      ARCHITECTURE 
    UNKNOWN_AUTO_DETECT      1.00      UNKNOWN_AUTO_DETECT 
    DEFAULT               1.00       
    LINUX133              2.50      x586_53_Pentium75 
    PC200                 4.50      i86pc_200 
    Intel_IA64           12.00      ia64 
    Ultra5S              10.30      SUNWUltra5_270_sparcv9 
    PowerPC_G4           12.00      x7400G4 
    HP300                 1.00       
    SunSparc             12.00  
    
  5. Run lim -t to display the type, model, and matched type of the current host. You must be the LSF administrator to use this command:
  6. lim -t 
    Host Type             : NTX64 
    Host Architecture     : EM64T_1596 
    Physical Processors   : 2 
    Cores per Processor   : 4 
    Threads per Core      : 2 
    License Needed        : Class(B),Multi-cores 
    Matched Type          : NTX64 
    Matched Architecture  : EM64T_3000 
    Matched Model         : Intel_EM64T 
    CPU Factor            : 60.0 
    

View job exit rate and load for hosts

  1. Run bhosts to display the exception threshold for job exit rate and the current load value for hosts:
  2. In the following example, EXIT_RATE for hostA is configured as 4 jobs per minute. hostA does not currently exceed this rate.

    bhosts -l hostA 
    HOST  hostA 
    STATUS           CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV DISPATCH_WINDOW 
    ok              18.60     -      1      0      0      0      0      0      - 
     
     CURRENT LOAD USED FOR SCHEDULING: 
                  r15s   r1m  r15m    ut    pg    io   ls    it   tmp   swp   mem 
     Total         0.0   0.0   0.0    0%   0.0     0    1     2  646M  648M  115M 
     Reserved      0.0   0.0   0.0    0%   0.0     0    0     0    0M    0M    0M 
     
     
                 share_rsrc host_rsrc 
     Total              3.0       2.0 
     Reserved           0.0       0.0 
     
     
     LOAD THRESHOLD USED FOR SCHEDULING: 
               r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem 
     loadSched   -     -     -     -       -     -    -     -     -      -      -   
     loadStop    -     -     -     -       -     -    -     -     -      -      -   
     
                    cpuspeed    bandwidth 
     loadSched          -            - 
     loadStop           -            - 
     
     THRESHOLD AND LOAD USED FOR EXCEPTIONS: 
                JOB_EXIT_RATE 
     Threshold    4.00 
     Load         0.00 
    
  3. Use bhosts -x to see hosts whose job exit rate has exceeded the threshold for longer than JOB_EXIT_RATE_DURATION, and are still high. By default, these hosts are closed the next time LSF checks host exceptions and invokes eadmin.
  4. If no hosts exceed the job exit rate, bhosts -x displays:

    There is no exceptional host found 
    

View dynamic host information

  1. Use lshosts to display information on dynamically added hosts.
  2. An LSF cluster may consist of static and dynamic hosts. The lshosts command displays configuration information about hosts. All these parameters are defined by the LSF administrator in the LSF configuration files, or determined by the LIM directly from the system.

    Host types represent binary compatible hosts; all hosts of the same type can run the same executable. Host models give the relative CPU performance of different processors. Server represents the type of host in the cluster. "Yes" is displayed for LSF servers, "No" is displayed for LSF clients, and "Dyn" is displayed for dynamic hosts.

    For example:

    lshosts 
    HOST_NAME   type    model cpuf ncpus maxmem maxswp server  RESOURCES 
    hostA      SOL64 Ultra60F 23.5     1    64M   112M    Yes  () 
    hostB    LINUX86 Opteron8 60.0     1    94M   168M    Dyn  () 
     

    In the above example, hostA is a static host while hostB is a dynamic host.

Controlling Hosts

Hosts are opened and closed by an LSF Administrator or root issuing a command or through configured dispatch windows.

Close a host

  1. Run badmin hclose:
  2. badmin hclose hostB 
    Close <hostB> ...... done 
     

    If the command fails, it may be because the host is unreachable due to network problems, or because the LSF daemons on the host are not running.

Open a host

  1. Run badmin hopen:
  2. badmin hopen hostB 
    Open <hostB> ...... done 
    

Configure Dispatch Windows

A dispatch window specifies one or more time periods during which a host will receive new jobs. The host will not receive jobs outside of the configured windows. Dispatch windows do not affect job submission and running jobs (they are allowed to run until completion). By default, dispatch windows are not configured.

To configure dispatch windows:

  1. Edit lsb.hosts.
  2. Specify one or more time windows in the DISPATCH_WINDOW column:
  3. Begin Host 
    HOST_NAME     r1m      pg    ls     tmp    DISPATCH_WINDOW 
    ... 
    hostB         3.5/4.5  15/   12/15  0      (4:30-12:00) 
    ... 
    End Host 
    
  4. Reconfigure the cluster:
    1. Run lsadmin reconfig to reconfigure LIM.
    2. Run badmin reconfig to reconfigure mbatchd.
  5. Run bhosts -l to display the dispatch windows.

Log a comment when closing or opening a host

  1. Use the -C option of badmin hclose and badmin hopen to log an administrator comment in lsb.events:
  2. badmin hclose -C "Weekly backup" hostB 
     

    The comment text Weekly backup is recorded in lsb.events. If you close or open a host group, each host group member displays with the same comment string.

    A new event record is recorded for each host open or host close event. For example:

    badmin hclose -C "backup" hostA

    followed by

    badmin hclose -C "Weekly backup" hostA

    generates the following records in lsb.events:

    "HOST_CTRL" "7.0 1050082346 1 "hostA" 32185 "lsfadmin" "backup" "HOST_CTRL" "7.0 1050082373 1 "hostA" 32185 "lsfadmin" "Weekly backup"
  3. Use badmin hist or badmin hhist to display administrator comments for closing and opening hosts:
  4. badmin hhist 
    Fri Apr  4 10:35:31: Host <hostB> closed by administrator 
    <lsfadmin> Weekly backup. 
     

    bhosts -l also displays the comment text:

    bhosts -l hostA 
    HOST  hostA 
    STATUS           CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV DISPATCH_WINDOW 
    closed_Adm       1.00     -      -      0      0      0      0      0      - 
    CURRENT LOAD USED FOR SCHEDULING: 
                 r15s   r1m  r15m    ut    pg    io   ls    it   tmp   swp   mem 
    Total         0.0   0.0   0.0    2%   0.0    64    2    11 7117M  512M  432M 
    Reserved      0.0   0.0   0.0    0%   0.0     0    0     0    0M    0M    0M 
    LOAD THRESHOLD USED FOR SCHEDULING: 
              r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem 
    loadSched   -     -     -     -       -     -    -     -     -      -      - 
    loadStop    -     -     -     -       -     -    -     -     -      -      - 
     
                    cpuspeed    bandwidth 
     loadSched          -            - 
     loadStop           -            - 
     
    THRESHOLD AND LOAD USED FOR EXCEPTIONS: 
               JOB_EXIT_RATE 
     Threshold    2.00 
     Load         0.00 
     
    ADMIN ACTION COMMENT: "Weekly backup" 

How events are displayed and recorded in MultiCluster lease model

In the MultiCluster resource lease model, host control administrator comments are recorded only in the lsb.events file on the local cluster. badmin hist and badmin hhist display only events that are recorded locally. Host control messages are not passed between clusters in the MultiCluster lease model. For example, if you close an exported host in both the consumer and the provider cluster, the host close events are recorded separately in their local lsb.events.

Adding a Host

You use the lsfinstall command to add a host to an LSF cluster.


Add a host of an existing type using lsfinstall

restriction:  
lsfinstall is not compatible with clusters installed with lsfsetup. To add a host to a cluster originally installed with lsfsetup, you must upgrade your cluster.
  1. Verify that the host type already exists in your cluster:
    1. Log on to any host in the cluster. You do not need to be root.
    2. List the contents of the LSF_TOP/7.0 directory. The default is /usr/share/lsf/7.0. If the host type currently exists, there is a subdirectory with the name of the host type. If it does not exist, go to Add a host of a new type using lsfinstall.
  2. Add the host information to lsf.cluster.cluster_name:
    1. Log on to the LSF master host as root.
    2. Edit LSF_CONFDIR/lsf.cluster.cluster_name, and specify the following in the Host section:
      • The name of the host.
      • The model and type, or specify ! to automatically detect the type or model.
      • Specify 1 for LSF server or 0 for LSF client.
      • Begin Host 
        HOSTNAME  model  type      server  r1m  mem  RESOURCES  REXPRI 
        hosta     !      SUNSOL6   1       1.0  4    ()         0 
        hostb     !      SUNSOL6   0       1.0  4    ()         0 
        hostc     !      HPPA1132  1       1.0  4    ()         0 
        hostd     !      HPPA1164  1       1.0  4    ()         0 
        End Host 
        
    3. Save your changes.
  3. Run lsadmin reconfig to reconfigure LIM.
  4. Run badmin mbdrestart to restart mbatchd.
  5. From /usr/share/lsf/7.0/install, run hostsetup to set up the new host and configure its daemons to start automatically at boot:
  6. ./hostsetup --top="/usr/share/lsf" --boot="y" 
    
  7. Start LSF on the new host:
  8. lsadmin limstartup 
    lsadmin resstartup 
    badmin hstartup 
    
  9. Run bhosts and lshosts to verify your changes.

Add a host of a new type using lsfinstall

restriction:  
lsfinstall is not compatible with clusters installed with lsfsetup. To add a host to a cluster originally installed with lsfsetup, you must upgrade your cluster.
  1. Verify that the host type does not already exist in your cluster:
    1. Log on to any host in the cluster. You do not need to be root.
    2. List the contents of the LSF_TOP/7.0 directory. The default is /usr/share/lsf/7.0. If the host type currently exists, there will be a subdirectory with the name of the host type. If the host type already exists, go to Add a host of an existing type using lsfinstall.
  2. Get the LSF distribution tar file for the host type you want to add.
  3. Log on as root to any host that can access the LSF install directory.
  4. Change to the LSF install directory. The default is
  5. /usr/share/lsf/7.0/install 
    
  6. Edit install.config:
    1. For LSF_TARDIR, specify the path to the tar file. For example:
    2. LSF_TARDIR="/usr/share/lsf_distrib/7.0" 
      
    3. For LSF_ADD_SERVERS, list the new host names enclosed in quotes and separated by spaces. For example:
    4. LSF_ADD_SERVERS="hosta hostb" 
      
    5. Run ./lsfinstall -f install.config. This automatically creates the host information in lsf.cluster.cluster_name.
  7. Run lsadmin reconfig to reconfigure LIM.
  8. Run badmin reconfig to reconfigure mbatchd.
  9. From /usr/share/lsf/7.0/install, run hostsetup to set up the new host and configure its daemons to start automatically at boot:
  10. ./hostsetup --top="/usr/share/lsf" --boot="y" 
    
  11. Start LSF on the new host:
  12. lsadmin limstartup 
    lsadmin resstartup 
    badmin hstartup 
    
  13. Run bhosts and lshosts to verify your changes.

Remove a Host

Removing a host from LSF involves preventing any additional jobs from running on the host, removing the host from LSF, and removing the host from the cluster.

caution:  
Never remove the master host from LSF. If you want to remove your current default master from LSF, change lsf.cluster.cluster_name to assign a different default master host. Then remove the host that was once the master host.
  1. Log on to the LSF host as root.
  2. Run badmin hclose to close the host. This prevents jobs from being dispatched to the host and allows running jobs to finish.
  3. Stop all running daemons manually.
  4. Remove any references to the host in the Host section of LSF_CONFDIR/lsf.cluster.cluster_name.
  5. Remove any other references to the host, if applicable, from your other LSF configuration files.
  6. Log off the host to be removed, and log on as root or the primary LSF administrator to any other host in the cluster.
  7. Run lsadmin reconfig to reconfigure LIM.
  8. Run badmin mbdrestart to restart mbatchd.
  9. If you configured LSF daemons to start automatically at system startup, remove the LSF section from the host's system startup files.
  10. If any users of the host use lstcsh as their login shell, change their login shell to tcsh or csh. Remove lstcsh from the /etc/shells file.

Remove a Host from Master Candidate List

You can remove a host from the master candidate list so that it can no longer be the master should failover occur. You can choose to either keep it as part of the cluster or remove it.

  1. Shut down the current LIM:
  2. lsadmin limshutdown host_name

    If the host was the current master, failover occurs.

  3. In lsf.conf, remove the host name from LSF_MASTER_LIST.
  4. Run lsadmin reconfig for the remaining master candidates.
  5. If the host you removed as a master candidate still belongs to the cluster, start up the LIM again:
  6. lsadmin limstartup host_name

Adding Hosts Dynamically

By default, all configuration changes made to LSF are static. To add or remove hosts within the cluster, you must manually change the configuration and restart all master candidates.

Dynamic host configuration allows you to add and remove hosts without manual reconfiguration. To enable dynamic host configuration, all of the parameters described in the following table must be defined.

Parameter
Defined in ...
Description
LSF_MASTER_LIST
lsf.conf
Defines a list of master host candidates. These hosts receive information when a dynamic host is added to or removed from the cluster. Do not add dynamic hosts to this list, because dynamic hosts cannot be master hosts.
LSF_DYNAMIC_HOST_WAIT_TIME
lsf.conf
Defines the length of time a dynamic host waits before sending a request to the master LIM to add the host to the cluster.
LSF_HOST_ADDR_RANGE
lsf.cluster.cluster_name
Identifies the range of IP addresses for hosts that can dynamically join or leave the cluster.

important:  
If you choose to enable dynamic hosts when you install LSF, the installer adds the parameter LSF_HOST_ADDR_RANGE to lsf.cluster.cluster_name using a default value that allows any host to join the cluster. To enable security, configure LSF_HOST_ADDR_RANGE in lsf.cluster.cluster_name after installation to restrict the hosts that can join your cluster.
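
Taken together, a minimal sketch of the configuration that enables dynamic hosts might look like the following (the host names, wait time, and address range are illustrative):

# lsf.conf 
LSF_MASTER_LIST="hosta hostb" 
LSF_DYNAMIC_HOST_WAIT_TIME=60 
 
# lsf.cluster.cluster_name (Parameters section) 
LSF_HOST_ADDR_RANGE=100-110.34.1-10.4-56 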

How dynamic host configuration works

Master LIM

The master LIM runs on the master host for the cluster. The master LIM receives requests to add hosts, and tells the master host candidates defined by the parameter LSF_MASTER_LIST to update their configuration information when a host is dynamically added or removed.

Upon startup, both static and dynamic hosts wait to receive an acknowledgement from the master LIM. This acknowledgement indicates that the master LIM has added the host to the cluster. Static hosts normally receive an acknowledgement because the master LIM has access to static host information in the LSF configuration files. Dynamic hosts do not receive an acknowledgement, however, until they announce themselves to the master LIM. The parameter LSF_DYNAMIC_HOST_WAIT_TIME in lsf.conf determines how long a dynamic host waits before sending a request to the master LIM to add the host to the cluster.

Master candidate LIMs

The parameter LSF_MASTER_LIST defines the list of master host candidates. These hosts receive updated host information from the master LIM so that any master host candidate can take over as master host for the cluster.

important:  
Master candidate hosts should share LSF configuration and binaries.

Dynamic hosts cannot be master host candidates. By defining the parameter LSF_MASTER_LIST, you ensure that LSF limits the list of master host candidates to specific, static hosts.

mbatchd

mbatchd gets host information from the master LIM; when it detects the addition or removal of a dynamic host within the cluster, mbatchd automatically reconfigures itself.

tip:  
After adding a host dynamically, you might have to wait for mbatchd to detect the host and reconfigure. Depending on system load, mbatchd might wait up to a maximum of 10 minutes before reconfiguring.
lsadmin command

Use the command lsadmin limstartup to start the LIM on a newly added dynamic host.

Allowing only certain hosts to join the cluster

By default, any host can be dynamically added to the cluster. To enable security, define LSF_HOST_ADDR_RANGE in lsf.cluster.cluster_name to identify a range of IP addresses for hosts that are allowed to dynamically join the cluster as LSF hosts. IP addresses can have either a dotted quad notation (IPv4) or IP Next Generation (IPv6) format. You can use IPv6 addresses if you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf; you do not have to map IPv4 addresses to an IPv6 format.

Configure LSF to run batch jobs on dynamic hosts

Before you run batch jobs on a dynamic host, complete any or all of the following steps, depending on your cluster configuration.

  1. Configure queues to accept all hosts by defining the HOSTS parameter in lsb.queues using the keyword all.
  2. Define host groups that accept wildcards in the HostGroup section of lsb.hosts. For example, define linuxrack* as a GROUP_MEMBER within a host group definition (see the sketch after this list).
  3. Add a dynamic host to a host group using the command badmin hghostadd.
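
A sketch of the corresponding configuration, assuming a queue named normal and a host group named linux_dyn (both names are illustrative):

# lsb.queues 
Begin Queue 
QUEUE_NAME = normal 
HOSTS      = all 
End Queue 
 
# lsb.hosts 
Begin HostGroup 
GROUP_NAME   GROUP_MEMBER 
linux_dyn    (linuxrack*) 
End HostGroup 
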
Changing a dynamic host to a static host

If you want to change a dynamic host to a static host, first use the command badmin hghostdel to remove the dynamic host from any host group that it belongs to, and then configure the host as a static host in lsf.cluster.cluster_name.

Adding dynamic hosts

Add a dynamic host in a shared file system environment

In a shared file system environment, you do not need to install LSF on each dynamic host. The master host will recognize a dynamic host as an LSF host when you start the daemons on the dynamic host.

  1. In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_WAIT_TIME, in seconds, and assign a value greater than zero.
  2. LSF_DYNAMIC_HOST_WAIT_TIME specifies the length of time a dynamic host waits before sending a request to the master LIM to add the host to the cluster.

    For example:

    LSF_DYNAMIC_HOST_WAIT_TIME=60 
    
  3. In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_TIMEOUT.
  4. LSF_DYNAMIC_HOST_TIMEOUT specifies the length of time (minimum 10 minutes) a dynamic host is unavailable before the master host removes it from the cluster. Each time LSF removes a dynamic host, mbatchd automatically reconfigures itself.

    note:  
    For very large clusters, defining this parameter could decrease system performance.

    For example:

    LSF_DYNAMIC_HOST_TIMEOUT=60m 
    
  5. In lsf.cluster.cluster_name on the master host, define the parameter LSF_HOST_ADDR_RANGE.
  6. LSF_HOST_ADDR_RANGE enables security by defining a list of hosts that can join the cluster. Specify IP addresses or address ranges for hosts that you want to allow in the cluster.

    tip:  
    If you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf, IP addresses can have either a dotted quad notation (IPv4) or IP Next Generation (IPv6) format; you do not have to map IPv4 addresses to an IPv6 format.

    For example:

    LSF_HOST_ADDR_RANGE=100-110.34.1-10.4-56 
     

    All hosts belonging to a domain with an address having the first number between 100 and 110, then 34, then a number between 1 and 10, then a number between 4 and 56 are allowed access. In this example, no IPv6 hosts are allowed.

  7. Log on as root to each host you want to join the cluster.
  8. Source the LSF environment. For csh or tcsh, source cshrc.lsf; for sh, ksh, or bash, source profile.lsf.
  9. If you want LSF to start automatically when the host reboots, run hostsetup with the --boot="y" option.
  10. Use the following commands to start LSF:
  11. lsadmin limstartup 
    lsadmin resstartup 
    badmin hstartup 
    
Add a dynamic host in a non-shared file system environment

In a non-shared file system environment, you must install LSF binaries, a localized lsf.conf file, and shell environment scripts (cshrc.lsf and profile.lsf) on each dynamic host.

Specify installation options in the slave.config file

All dynamic hosts are slave hosts, because they cannot serve as master host candidates. The slave.config file contains parameters for configuring all slave hosts.

  1. Define the required parameters.
  2. LSF_SERVER_HOSTS="host_name [host_name ...]"

    LSF_ADMINS="user_name [user_name ...]"

    LSF_TOP="/path"

  3. Define the optional parameters.
  4. LSF_LIM_PORT=port_number

    important:  
    If the master host does not use the default LSF_LIM_PORT, you must specify the same LSF_LIM_PORT defined in lsf.conf on the master host.
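
A minimal sketch of a slave.config that combines these parameters (the paths, host names, and port number are illustrative):

LSF_TOP="/usr/share/lsf" 
LSF_ADMINS="lsfadmin" 
LSF_SERVER_HOSTS="hosta hostb" 
LSF_LIM_PORT=3879    # only if this matches a non-default LSF_LIM_PORT on the master host 
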
Add local resources on a dynamic host to the cluster

Prerequisites: Ensure that the resource name and type are defined in lsf.shared, and that the ResourceMap section of lsf.cluster.cluster_name contains at least one resource mapped to at least one static host. LSF can add local resources as long as the ResourceMap section is defined; you do not need to map the local resources.
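
A sketch of this prerequisite configuration, assuming a numeric resource named verilog and a Boolean resource named linux, with verilog mapped to a static host hosta (the resource names, host name, and column layout are illustrative assumptions):

# lsf.shared (Resource section) 
Begin Resource 
RESOURCENAME   TYPE      INTERVAL   INCREASING   DESCRIPTION 
verilog        Numeric   ()         N            (verilog licenses) 
linux          Boolean   ()         ()           (linux operating system) 
End Resource 
 
# lsf.cluster.cluster_name (ResourceMap section) 
Begin ResourceMap 
RESOURCENAME   LOCATION 
verilog        (1@[hosta]) 
End ResourceMap 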

  1. In the slave.config file, define the parameter LSF_LOCAL_RESOURCES.
  2. For numeric resources, define name-value pairs:

    "[resourcemap value*resource_name]" 
     

    For Boolean resources, the value is the resource name in the following format:

    "[resource resource_name]"

    For example:

    LSF_LOCAL_RESOURCES="[resourcemap 1*verilog] [resource linux]"
    tip:  
    If LSF_LOCAL_RESOURCES are already defined in a local lsf.conf on the dynamic host, lsfinstall does not add resources you define in LSF_LOCAL_RESOURCES in slave.config.

    When the dynamic host sends a request to the master host to add it to the cluster, the dynamic host also reports its local resources. If the local resource is already defined in lsf.cluster.cluster_name as default or all, it cannot be added as a local resource.

Install LSF on a dynamic host
  1. Run lsfinstall -s -f slave.config.
  2. lsfinstall creates a local lsf.conf for the dynamic host, which sets the following parameters:

    LSF_CONFDIR="/path"

    LSF_GET_CONF=lim

    LSF_LIM_PORT=port_number (same as the master LIM port number)

    LSF_LOCAL_RESOURCES="resource ..."

    tip:  
    Do not duplicate LSF_LOCAL_RESOURCES entries in lsf.conf. If local resources are defined more than once, only the last definition is valid.

    LSF_SERVER_HOSTS="host_name [host_name ...]"

    LSF_VERSION=7.0

    important:  
    If LSF_STRICT_CHECKING is defined in lsf.conf to protect your cluster in untrusted environments, and your cluster has dynamic hosts, LSF_STRICT_CHECKING must be configured in the local lsf.conf on all dynamic hosts.
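
For example, assuming strict checking is enabled for your cluster, the local lsf.conf on each dynamic host would also contain:

LSF_STRICT_CHECKING=Y 
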
Configure dynamic host parameters
  1. In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_WAIT_TIME, in seconds, and assign a value greater than zero.
  2. LSF_DYNAMIC_HOST_WAIT_TIME specifies the length of time a dynamic host waits before sending a request to the master LIM to add the host to the cluster.

    For example:

    LSF_DYNAMIC_HOST_WAIT_TIME=60 
    
  3. In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_TIMEOUT.
  4. LSF_DYNAMIC_HOST_TIMEOUT specifies the length of time (minimum 10 minutes) a dynamic host is unavailable before the master host removes it from the cluster. Each time LSF removes a dynamic host, mbatchd automatically reconfigures itself.

    note:  
    For very large clusters, defining this parameter could decrease system performance.

    For example:

    LSF_DYNAMIC_HOST_TIMEOUT=60m 
    
  5. In lsf.cluster.cluster_name on the master host, define the parameter LSF_HOST_ADDR_RANGE.
  6. LSF_HOST_ADDR_RANGE enables security by defining a list of hosts that can join the cluster. Specify IP addresses or address ranges for hosts that you want to allow in the cluster.

    tip:  
    If you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf, IP addresses can have either a dotted quad notation (IPv4) or IP Next Generation (IPv6) format; you do not have to map IPv4 addresses to an IPv6 format.

    For example:

    LSF_HOST_ADDR_RANGE=100-110.34.1-10.4-56 
     

    All hosts belonging to a domain with an address having the first number between 100 and 110, then 34, then a number between 1 and 10, then a number between 4 and 56 are allowed access. No IPv6 hosts are allowed.

Start LSF daemons
  1. Log on as root to each host you want to join the cluster.
  2. Source the LSF environment. For csh or tcsh, source cshrc.lsf; for sh, ksh, or bash, source profile.lsf.
  3. If you want LSF to start automatically when the host reboots, run hostsetup with the --boot="y" option.
  4. If this is the first time the host is joining the cluster, start the LSF daemons on the host:

     lsadmin limstartup 
     lsadmin resstartup 
     badmin hstartup 

Removing dynamic hosts

To remove a dynamic host from the cluster, you can either set a timeout value, or you can edit the hostcache file.

Remove a host by setting a timeout value

LSF_DYNAMIC_HOST_TIMEOUT specifies the length of time (minimum 10 minutes) a dynamic host is unavailable before the master host removes it from the cluster. Each time LSF removes a dynamic host, mbatchd automatically reconfigures itself.

note:  
For very large clusters, defining this parameter could decrease system performance. If you want to use this parameter to remove dynamic hosts from a very large cluster, disable the parameter after LSF has removed the unwanted hosts.
  1. In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_TIMEOUT.
  2. To specify minutes rather than hours, append m or M to the value.

    For example:

    LSF_DYNAMIC_HOST_TIMEOUT=60m 
    
Remove a host by editing the hostcache file

Dynamic hosts remain in the cluster unless you intentionally remove them. Only the cluster administrator can modify the hostcache file.

  1. Shut down the cluster.
  2. lsfshutdown

    This shuts down LSF on all hosts in the cluster and prevents LIMs from trying to write to the hostcache file while you edit it.

  3. In the hostcache file $EGO_WORKDIR/lim/hostcache, delete the line for the dynamic host that you want to remove.
  4. Close the hostcache file, and then start up the cluster.
  5. lsfrestart

Automatically Detect Operating System Types and Versions

LSF can automatically detect most operating system types and versions so that you do not need to add them to the lsf.shared file manually. The list of automatically detected operating systems is updated regularly.

  1. Edit lsf.shared.
  2. In the Resource section, remove the comment from the following line:
  3. ostype   String   ()   ()   ()   (Operating system and version) 
    
  4. In $LSF_SERVERDIR, rename tmp.eslim.ostype to eslim.ostype.
  5. Run the following commands to restart the LIM and master batch daemon:
    1. lsadmin reconfig
    2. badmin mbdrestart
  6. To view operating system types and versions, run lshosts -l or lshosts -s.
  7. LSF displays the operating system types and versions in your cluster, including any that LSF automatically detects as well as those you have defined manually in the HostType section of lsf.shared.

You can specify ostype in your resource requirement strings. For example, when submitting a job you can specify the following resource requirement: -R "select[ostype=RHEL2.6]".
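
For example, a job submission using this resource requirement might look like the following (the job command my_job is illustrative):

bsub -R "select[ostype=RHEL2.6]" my_job 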

Modify how long LSF waits for new operating system types and versions

Prerequisites: You must enable LSF to automatically detect operating system types and versions.

You can configure how long LSF waits for OS type and version detection.

  1. In lsf.conf, modify the value for EGO_ESLIM_TIMEOUT.
  2. The value is time in seconds.
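
For example, to allow 60 seconds for detection to complete (the value is illustrative):

EGO_ESLIM_TIMEOUT=60 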

Add Host Types and Host Models to lsf.shared

The lsf.shared file contains a list of host type and host model names for most operating systems. You can add to this list or customize the host type and host model names. A host type and host model name can be any alphanumeric string up to 39 characters long.

Add a custom host type or model

  1. Log on as the LSF administrator on any host in the cluster.
  2. Edit lsf.shared:
    1. For a new host type, modify the HostType section:
    2. Begin HostType 
      TYPENAME                     # Keyword 
      DEFAULT 
      IBMAIX564 
      LINUX86 
      LINUX64 
      NTX64 
      NTIA64 
      SUNSOL 
      SOL732 
      SOL64 
      SGI658 
      SOLX86 
      HPPA11 
      HPUXIA64 
      MACOSX 
      End HostType 
      
    3. For a new host model, modify the HostModel section:
    4. Add the new model and its CPU speed factor relative to other models. For more details on tuning CPU factors, see Tuning CPU Factors.

      Begin HostModel 
      MODELNAME  CPUFACTOR   ARCHITECTURE # keyword 
      # x86 (Solaris, Windows, Linux): approximate values, based on SpecBench results 
      # for Intel processors (Sparc/Win) and BogoMIPS results (Linux). 
      PC75             1.5   (i86pc_75  i586_75  x586_30) 
      PC90             1.7   (i86pc_90  i586_90  x586_34 x586_35 x586_36) 
      HP9K715          4.2   (HP9000715_100) 
      SunSparc          12.0         ()  
      CRAYJ90           18.0         ()  
      IBM350            18.0         ()  
      End HostModel 
      
  3. Save the changes to lsf.shared.
  4. Run lsadmin reconfig to reconfigure LIM.
  5. Run badmin reconfig to reconfigure mbatchd.

Registering Service Ports

LSF uses dedicated UDP and TCP ports for communication. All hosts in the cluster must use the same port numbers to communicate with each other.

The service port numbers can be any numbers from 1024 to 65535 that are not already used by other services. To make sure that the port numbers you supply are not already used by applications registered in your service database, check /etc/services or use the command ypcat services.

By default, port numbers for LSF services are defined in the lsf.conf file. You can also configure ports by modifying /etc/services or the NIS or NIS+ database. If you define port numbers in lsf.conf, port numbers defined in the service database are ignored.

lsf.conf

  1. Log on to any host as root.
  2. Edit lsf.conf and add the following lines:
  3. LSF_RES_PORT=3878 
    LSB_MBD_PORT=3881 
    LSB_SBD_PORT=3882 
    
  4. Add the same entries to lsf.conf on every host.
  5. Save lsf.conf.
  6. Run lsadmin reconfig to reconfigure LIM.
  7. Run badmin mbdrestart to restart mbatchd.
  8. Run lsfstartup to restart all daemons in the cluster.

/etc/services

Configure services manually
tip:  
During installation, use the hostsetup --boot="y" option to set up the LSF port numbers in the service database.
  1. Use the LSF_TOP/version/install/instlib/example.services file as a guide for adding LSF entries to the services database.
  2. If any other service listed in your services database has the same port number as one of the LSF services, you must change the port number for the LSF service. You must use the same port numbers on every LSF host.

  3. Log on to any host as root.
  4. Edit the /etc/services file by adding the contents of the LSF_TOP/version/install/instlib/example.services file:
  5. # /etc/services entries for LSF daemons 
    # 
    res     3878/tcp # remote execution server 
    lim     3879/udp # load information manager 
    mbatchd 3881/tcp # master lsbatch daemon 
    sbatchd 3882/tcp # slave lsbatch daemon 
    # 
    # Add this if ident is not already defined 
    # in your /etc/services file 
    ident 113/tcp auth tap # identd 
    
  6. Run lsadmin reconfig to reconfigure LIM.
  7. Run badmin reconfig to reconfigure mbatchd.
  8. Run lsfstartup to restart all daemons in the cluster.
NIS or NIS+ database

If you are running NIS, you only need to modify the services database once per NIS master. On some hosts the NIS database and commands are in the /var/yp directory; on others, NIS is found in /etc/yp.

  1. Log on to any host as root.
  2. Run lsfshutdown to shut down all the daemons in the cluster.
  3. To find the name of the NIS master host, use the command:
  4. ypwhich -m services 
    
  5. Log on to the NIS master host as root.
  6. Edit the /var/yp/src/services or /etc/yp/src/services file on the NIS master host, adding the contents of the LSF_TOP/version/install/instlib/example.services file:
  7. # /etc/services entries for LSF daemons. 
    # 
    res     3878/tcp # remote execution server 
    lim     3879/udp # load information manager 
    mbatchd 3881/tcp # master lsbatch daemon 
    sbatchd 3882/tcp # slave lsbatch daemon 
    # 
    # Add this if ident is not already defined 
    # in your /etc/services file 
    ident 113/tcp auth tap # identd 
     

    Make sure that all the lines you add either contain valid service entries or begin with a comment character (#). Blank lines are not allowed.

  8. Change the directory to /var/yp or /etc/yp.
  9. Use the following command:
  10. ypmake services 
     

    On some hosts the master copy of the services database is stored in a different location.

    On systems running NIS+ the procedure is similar. Refer to your system documentation for more information.

  11. Run lsadmin reconfig to reconfigure LIM.
  12. Run badmin reconfig to reconfigure mbatchd.
  13. Run lsfstartup to restart all daemons in the cluster.

Host Naming

LSF needs to match host names with the corresponding Internet host addresses.

LSF looks up host names and addresses in the following ways:

  • In the /etc/hosts file
  • Through NIS or NIS+
  • Through DNS

Each host is configured to use one or more of these mechanisms.

Network addresses

Each host has one or more network addresses; usually one for each network to which the host is directly connected. Each host can also have more than one name.

Official host name

The first name configured for each address is called the official name.

Host name aliases

Other names for the same host are called aliases.

LSF uses the configured host naming system on each host to look up the official host name for any alias or host address. This means that you can use aliases as input to LSF, but LSF always displays the official name.

Using host name ranges as aliases

The default host file syntax

ip_address official_name [alias [alias ...]] 

is powerful and flexible, but it is difficult to configure in systems where a single host name has many aliases, and in multihomed host environments.

In these cases, the hosts file can become very large and unmanageable, and configuration is prone to error.

The syntax of the LSF hosts file supports host name ranges as aliases for an IP address. This simplifies the host name alias specification.

To use host name ranges as aliases, the host names must consist of a fixed node group name prefix and node indices, specified in a form like:

host_name[index_x-index_y, index_m, index_a-index_b] 

For example:

atlasD0[0-3,4,5-6, ...]  

is equivalent to:

atlasD0[0-6, ...] 

The node list does not need to be a continuous range (some nodes can be configured out). Node indices can be numbers or letters (both upper case and lower case).

Example

Some systems map internal compute nodes to single LSF host names. A hosts file might contain 64 lines, each specifying an LSF host name and 32 node names that correspond to each LSF host:

... 
177.16.1.1 atlasD0 atlas0 atlas1 atlas2 atlas3 atlas4 ... atlas31 
177.16.1.2 atlasD1 atlas32 atlas33 atlas34 atlas35 atlas36 ... atlas63 
... 

In the new format, you still map the nodes to the LSF hosts, so the number of lines remains the same, but the format is simplified because you only have to specify ranges for the nodes, not each node individually as an alias:

... 
177.16.1.1 atlasD0 atlas[0-31] 
177.16.1.2 atlasD1 atlas[32-63] 
... 

You can use either an IPv4 or an IPv6 format for the IP address (if you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf).

Host name services

Solaris

On Solaris systems, the /etc/nsswitch.conf file controls the name service.

Other UNIX platforms

On other UNIX platforms, the following rules apply:

For more information

The man pages for the gethostbyname function, the ypbind and named daemons, the resolver functions, and the hosts, svc.conf, nsswitch.conf, and resolv.conf files explain host name lookups in more detail.

Hosts with Multiple Addresses

Multi-homed hosts

Hosts that have more than one network interface usually have one Internet address for each interface. Such hosts are called multi-homed hosts. For example, dual-stack hosts are multi-homed because they have both an IPv4 and an IPv6 network address.

LSF identifies hosts by name, so it needs to match each of these addresses with a single host name. To do this, the host name information must be configured so that all of the Internet addresses for a host resolve to the same name.

There are two ways to do this:

  • Modify the system host name configuration so that all of a host's addresses resolve to the same official name; the change affects the whole system
  • Create a private hosts file for LSF (LSF_CONFDIR/hosts), as described below, so that the change affects LSF only

Multiple network interfaces

Some system manufacturers recommend that each network interface, and therefore, each Internet address, be assigned a different host name. Each interface can then be directly accessed by name. This setup is often used to make sure NFS requests go to the nearest network interface on the file server, rather than going through a router to some other interface. Configuring this way can confuse LSF, because there is no way to determine that the two different names (or addresses) mean the same host. LSF provides a workaround for this problem.

All host naming systems can be configured so that host address lookups always return the same name, while still allowing access to network interfaces by different names. Each host has an official name and a number of aliases, which are other names for the same host. By configuring all interfaces with the same official name but different aliases, you can refer to each interface by a different alias name while still providing a single official name for the host.

Configuring the LSF hosts file

If your LSF clusters include hosts that have more than one interface and are configured with more than one official host name, you must either modify the host name configuration, or create a private hosts file for LSF to use.

The LSF hosts file is stored in LSF_CONFDIR. The format of LSF_CONFDIR/hosts is the same as for /etc/hosts.

In the LSF hosts file, duplicate the system hosts database information, except make all entries for the host use the same official name. Configure all the other names for the host as aliases so that you can still refer to the host by any name.

Example

For example, if your /etc/hosts file contains:

AA.AA.AA.AA  host-AA host # first interface 
BB.BB.BB.BB  host-BB      # second interface 

then the LSF_CONFDIR/hosts file should contain:

AA.AA.AA.AA  host host-AA # first interface 
BB.BB.BB.BB  host host-BB # second interface 

Example /etc/hosts entries

No unique official name

The following example is for a host with two interfaces, where the host does not have a unique official name.

# Address          Official name    Aliases 
# Interface on network A 
AA.AA.AA.AA        host-AA.domain   host.domain host-AA host 
# Interface on network B 
BB.BB.BB.BB        host-BB.domain   host-BB host 

Looking up the address AA.AA.AA.AA finds the official name host-AA.domain. Looking up address BB.BB.BB.BB finds the name host-BB.domain. No information connects the two names, so there is no way for LSF to determine that both names, and both addresses, refer to the same host.

To resolve this case, you must configure these addresses using a unique host name. If you cannot make this change to the system file, you must create an LSF hosts file and configure these addresses using a unique host name in that file.

Both addresses have the same official name

Here is the same example, with both addresses configured for the same official name.

# Address          Official name    Aliases 
# Interface on network A 
AA.AA.AA.AA        host.domain      host-AA.domain host-AA host 
# Interface on network B 
BB.BB.BB.BB        host.domain      host-BB.domain host-BB host 

With this configuration, looking up either address returns host.domain as the official name for the host. LSF (and all other applications) can determine that all the addresses and host names refer to the same host. Individual interfaces can still be specified by using the host-AA and host-BB aliases.

Example for a dual-stack host

Dual-stack hosts have more than one IP address. You must associate the host name with both addresses, as shown in the following example:

# Address                              Official name    Aliases 
# Interface IPv4 
AA.AA.AA.AA                            host.domain      host-AA.domain 
# Interface IPv6 
BBBB:BBBB:BBBB:BBBB:BBBB:BBBB::BBBB    host.domain      host-BB.domain 

With this configuration, looking up either address returns host.domain as the official name for the host. LSF (and all other applications) can determine that all the addresses and host names refer to the same host. Individual interfaces can still be specified by using the host-AA and host-BB aliases.

Sun Solaris example

For example, Sun NIS uses the /etc/hosts file on the NIS master host as input, so the format for NIS entries is the same as for the /etc/hosts file. Since LSF can resolve this case, you do not need to create an LSF hosts file.

DNS configuration

The configuration format is different for DNS. The same result can be produced by configuring an address (A) record for each Internet address. Following the previous example:

# name            class  type address
host.domain       IN     A    AA.AA.AA.AA
host.domain       IN     A    BB.BB.BB.BB
host-AA.domain    IN     A    AA.AA.AA.AA
host-BB.domain    IN     A    BB.BB.BB.BB 

Looking up the official host name can return either address. Looking up the interface-specific names returns the correct address for each interface.

For a dual-stack host:

# name            class  type  address
host.domain       IN     A     AA.AA.AA.AA
host.domain       IN     AAAA  BBBB:BBBB:BBBB:BBBB:BBBB:BBBB::BBBB
host-AA.domain    IN     A     AA.AA.AA.AA
host-BB.domain    IN     AAAA  BBBB:BBBB:BBBB:BBBB:BBBB:BBBB::BBBB 
PTR records in DNS

Address-to-name lookups in DNS are handled using PTR records. The PTR records for both addresses should be configured to return the official name:

# address                  class  type  name
AA.AA.AA.AA.in-addr.arpa   IN     PTR   host.domain
BB.BB.BB.BB.in-addr.arpa   IN     PTR   host.domain 

For a dual-stack host, the PTR record for the IPv6 address belongs in the ip6.arpa zone, with the address written in reverse, one hexadecimal digit at a time (abbreviated here with an ellipsis):

# address                  class  type  name
AA.AA.AA.AA.in-addr.arpa   IN     PTR   host.domain
B.B.B.B. ... .B.B.B.B.ip6.arpa     IN     PTR   host.domain 

If it is not possible to change the system host name database, create a hosts file local to the LSF installation and configure entries for the multi-homed hosts only. Host names and addresses not found in this hosts file are looked up in the standard name system on your host.

Using IPv6 Addresses

IP addresses can have either a dotted quad notation (IPv4) or IP Next Generation (IPv6) format. You can use IPv6 addresses if you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf; you do not have to map IPv4 addresses to an IPv6 format.

LSF supports IPv6 addresses for the following platforms:

Enable both IPv4 and IPv6 support

  1. Configure the parameter LSF_ENABLE_SUPPORT_IPV6=Y in lsf.conf.
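
For example, lsf.conf contains the following line:

LSF_ENABLE_SUPPORT_IPV6=Y 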

Configure hosts for IPv6

Follow the steps in this procedure if you do not have an IPv6-enabled DNS server or an IPv6-enabled router. IPv6 is supported on some linux2.4 kernels and on all linux2.6 kernels.

  1. Configure the kernel.
    a. Does the entry /proc/net/if_inet6 exist?
      • If yes, the kernel is already configured for IPv6. Go to step 2.
      • If no, go to step b.
    b. To load the IPv6 module into the kernel, execute the following command as root:
      modprobe ipv6
    c. To check that the module loaded correctly, execute the command:
      lsmod | grep -w 'ipv6'
  2. Add an IPv6 address to the host by executing the following command as root:
    /sbin/ifconfig eth0 inet6 add 3ffe:ffff:0:f101::2/64
  3. Display the IPv6 address using ifconfig.
  4. Repeat step 1 through step 3 for other hosts in the cluster.
  5. To configure IPv6 networking, add the addresses of all IPv6 hosts to /etc/hosts on each host, as in the example after this procedure.
    note:  
    For IPv6 networking, hosts must be on the same subnet.
  6. Test IPv6 communication between hosts using the command ping6.
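
For example, each host's /etc/hosts file might contain entries like the following. The addresses and host names are illustrative only:

# Address               Official name    Aliases 
3ffe:ffff:0:f101::2     hostA.domain     hostA 
3ffe:ffff:0:f101::3     hostB.domain     hostB 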

Specify host names with condensed notation

A number of commands require you to specify host names; you can specify host name ranges instead. Condensed notation can be used with the following commands:

You must specify a valid range of hosts, where the start number is smaller than the end number.
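
For example, assuming the cluster contains hosts named host1 through host20, and using bsub -m as one of the commands that accept condensed notation, the following submission is equivalent to listing all twenty hosts individually:

bsub -m "host[1-20]" ./myjob 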

Host Groups

You can define a host group within LSF or use an external executable to retrieve host group members.

Use bhosts to view a list of existing hosts. Use bmgroup to view host group membership.

Where to use host groups

LSF host groups can be used in defining the following parameters in LSF configuration files:

Configure host groups

  1. Log in as the LSF administrator to any host in the cluster.
  2. Open lsb.hosts.
  3. Add the HostGroup section if it does not exist. For example:
    Begin HostGroup 
    GROUP_NAME        GROUP_MEMBER 
    groupA            (all) 
    groupB            (groupA ~hostA ~hostB) 
    groupC            (hostX hostY hostZ) 
    groupD            (groupC ~hostX) 
    groupE            (all ~groupC ~hostB) 
    groupF            (hostF groupC hostK) 
    desk_tops         (hostD hostE hostF hostG) 
    Big_servers       (!) 
    End HostGroup 
  4. Enter a group name under the GROUP_NAME column.
    note:  
    External host groups must be defined in the egroup executable.
  5. Specify hosts in the GROUP_MEMBER column.
  6. (Optional) To tell LSF that the group members should be retrieved using egroup, put an exclamation mark (!) in the GROUP_MEMBER column.
  7. Save your changes.
  8. Run badmin ckconfig to check the group definition. If any errors are reported, fix the problem and check the configuration again.
  9. Run badmin mbdrestart to apply the new configuration.
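
Once defined, host groups can be used in most places where a host name is accepted. For example, using groupC from the configuration above:

bsub -m "groupC" ./myjob 
bmgroup groupC 

The first command submits a job restricted to the hosts in groupC; the second displays the membership of groupC.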

Using wildcards and special characters to define host names

You can use special characters when defining host group members under the GROUP_MEMBER column to specify hosts. These are useful for defining several hosts in a single entry, such as a range of hosts or all host names containing a certain string.

If a host matches more than one host group, that host is a member of all groups. If any host group is a condensed host group, the status and other details of the hosts are counted towards all of the matching host groups.

When defining host group members, you can use string literals and the following special characters:

Restrictions

You cannot use more than one set of square brackets in a single host group definition.

The following example is not correct:

... (hostA[1-10]B[1-20] hostC[101-120]) 

The following example is correct:

... (hostA[1-20] hostC[101-120]) 

You cannot define subgroups that contain wildcards and special characters. The following definition for groupB is not correct because groupA defines hosts with a wildcard:

Begin HostGroup 
GROUP_NAME   GROUP_MEMBER 
groupA       (hostA*) 
groupB       (groupA) 
End HostGroup 
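
If you need a group whose members are selected by a wildcard, one workaround is to repeat the pattern directly in the group definition instead of referencing the subgroup. A sketch, reusing the pattern from groupA:

Begin HostGroup 
GROUP_NAME   GROUP_MEMBER 
groupA       (hostA*) 
groupB       (hostA* hostK) 
End HostGroup 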

Defining condensed host groups

You can define condensed host groups to display the information for their member hosts as a summary for the entire group. This lets you see the total statistics of the host group as a whole, rather than adding up the per-host data yourself, so you can better plan the distribution of jobs submitted to the hosts and host groups in your cluster.

To define condensed host groups, add a CONDENSE column to the HostGroup section. Under this column, enter Y to define a condensed host group or N to define an uncondensed host group, as shown in the following:

Begin HostGroup 
GROUP_NAME   CONDENSE   GROUP_MEMBER 
groupA          Y       (hostA hostB hostD) 
groupB          N       (hostC hostE) 
End HostGroup 

The following commands display condensed host group information:

For the bhosts output of this configuration, see Viewing Host Information.

Use bmgroup -l to see whether host groups are condensed or not.

Hosts belonging to multiple condensed host groups

If you configure a host to belong to more than one condensed host group using wildcards, bjobs can display any of those host groups as the execution host name.

For example, host groups hg1 and hg2 include the same hosts:

Begin HostGroup 
GROUP_NAME      CONDENSE    GROUP_MEMBER        # Key words 
hg1               Y         (host*) 
hg2               Y         (hos*) 
End HostGroup 

Submit jobs using bsub -m:

bsub -m "hg2" sleep 1001 

bjobs displays hg1 as the execution host instead of hg2:

bjobs 
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME 
520     user1   RUN   normal     host5       hg1        sleep 1001 Apr 15 13:50 
521     user1   RUN   normal     host5       hg1        sleep 1001 Apr 15 13:50 
522     user1   PEND  normal     host5                  sleep 1001 Apr 15 13:51 

Importing external host groups (egroup)

When the membership of a host group changes frequently, or when the group contains a large number of members, you can use an external executable called egroup to retrieve a list of members rather than having to configure the group membership manually. You can write a site-specific egroup executable that retrieves host group names and the hosts that belong to each group. For information about how to use the external host and user groups feature, see the Platform LSF Configuration Reference.
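
As a minimal sketch, assuming the interface described in the Platform LSF Configuration Reference (LSF runs the egroup executable in LSF_SERVERDIR as egroup -m group_name and reads a space-separated list of member host names from its standard output), a site-specific egroup might look like the following. The membership file path is hypothetical:

#!/bin/sh 
# egroup - return the members of an externally defined host group. 
# Assumed invocation by LSF: egroup -m <group_name> 
if [ "$1" = "-m" ]; then 
    case "$2" in 
        Big_servers) 
            # Site-specific source of membership; a flat file is 
            # used here purely for illustration. 
            cat /usr/local/lsf/conf/big_servers.members 
            ;; 
    esac 
fi 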

Compute Units

Compute units are similar to host groups, with the added feature of granularity allowing the construction of clusterwide structures that mimic network architecture. Job scheduling using compute unit resource requirements optimizes job placement based on the underlying system architecture, minimizing communications bottlenecks. Compute units are especially useful when running communication-intensive parallel jobs spanning several hosts.

Resource requirement strings can specify compute unit requirements, such as running a job exclusively on its compute units (excl), spreading a job evenly over multiple compute units (balance), or choosing compute units based on other criteria.
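
For example, the following submissions sketch both options. The job name is illustrative, and the enclosure type assumes the COMPUTE_UNIT_TYPES configuration shown later in this section:

bsub -n 64 -R "cu[excl]" ./my_parallel_job 
bsub -n 64 -R "cu[type=enclosure:balance]" ./my_parallel_job 

The first job runs only on compute units it does not share with other jobs; the second is spread evenly over enclosures.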

For a complete description of compute units see Controlling Job Locality using Compute Units in Chapter 34, "Running Parallel Jobs".

Compute unit configuration

To enforce consistency, compute unit configuration has the following requirements:

tip:  
To use the compute unit features for host-level job allocation, configure each individual host as a compute unit.

Where to use compute units

LSF compute units can be used in defining the following parameters in LSF configuration files:

Configure compute units

  1. Log in as the LSF administrator to any host in the cluster.
  2. Open lsb.params.
  3. Add the COMPUTE_UNIT_TYPES parameter if it does not already exist and list your compute unit types in order of granularity (finest first). For example:
    COMPUTE_UNIT_TYPES=enclosure rack cabinet 
  4. Save your changes.
  5. Open lsb.hosts.
  6. Add the ComputeUnit section if it does not exist. For example:
    Begin ComputeUnit 
    NAME        MEMBER         TYPE 
    encl1       (hostA hg1)    enclosure 
    encl2       (hostC hostD)  enclosure 
    encl3       (hostE hostF)  enclosure 
    encl4       (hostG hg2)    enclosure 
    rack1       (encl1 encl2)  rack 
    rack2       (encl3 encl4)  rack 
    cab1        (rack1 rack2)  cabinet 
    End ComputeUnit 
  7. Enter a compute unit name under the NAME column.
    note:  
    External compute units must be defined in the egroup executable.
  8. Specify hosts or host groups in the MEMBER column of the finest granularity compute unit type. Specify compute units in the MEMBER column of coarser compute unit types.
  9. (Optional) To tell LSF that the compute unit members of a finest granularity compute unit should be retrieved using egroup, put an exclamation mark (!) in the MEMBER column.
  10. Specify the type of compute unit in the TYPE column.
  11. Save your changes.
  12. Run badmin ckconfig to check the compute unit definition. If any errors are reported, fix the problem and check the configuration again.
  13. Run badmin mbdrestart to apply the new configuration.
  14. To view configured compute units, run bmgroup -cu.

Using wildcards and special characters to define names in compute units

You can use special characters when defining compute unit members under the MEMBER column to specify hosts, host groups, and compute units. These are useful for defining several names in a single entry, such as a range of hosts or all names containing a certain string.

When defining host, host group, and compute unit members of compute units, you can use string literals and the following special characters:

Restrictions

You cannot use more than one set of square brackets in a single compute unit definition.

The following example is not correct:

... (hostA[1-10]B[1-20] hostC[101-120]) 

The following example is correct:

... (hostA[1-20] hostC[101-120]) 

The keywords all, allremote, all@cluster, other, and default cannot be used when defining compute units.

Defining condensed compute units

You can define condensed compute units to display the information for their member hosts as a summary for the entire unit, including the slot usage for each compute unit. This lets you see the statistics of the compute unit as a whole, rather than adding up the per-host data yourself, so you can better plan the distribution of jobs submitted to the hosts and compute units in your cluster.

To define condensed compute units, add a CONDENSE column to the ComputeUnit section. Under this column, enter Y to define a condensed compute unit or N to define an uncondensed compute unit, as shown in the following:

Begin ComputeUnit 
NAME    CONDENSE   MEMBER                TYPE 
enclA   Y          (hostA hostB hostD)   enclosure 
enclB   N          (hostC hostE)         enclosure 
End ComputeUnit 

The following commands display condensed compute unit information:

For the bhosts output of this configuration, see Viewing Host Information.

Use bmgroup -l to see whether host groups are condensed or not.

Importing external compute units (egroup)

When the membership of a compute unit changes frequently, or when the compute unit contains a large number of members, you can use an external executable called egroup to retrieve a list of members rather than having to configure the membership manually. You can write a site-specific egroup executable that retrieves compute unit names and the hosts that belong to each group, and compute units of the finest granularity can contain egroups as members. For information about how to use the external host and user groups feature, see the Platform LSF Configuration Reference.

Using compute units with advance reservation

When running exclusive compute unit jobs (with the resource requirement cu[excl]), the advance reservation can affect hosts outside the advance reservation but in the same compute unit as follows:

Ideally, all hosts belonging to a compute unit should be either entirely inside or entirely outside of an advance reservation.

Tuning CPU Factors

CPU factors are used to differentiate the relative speed of different machines. LSF runs jobs on the best possible machines so that response time is minimized.

To achieve this, it is important that you define correct CPU factors for each machine model in your cluster.

How CPU factors affect performance

Incorrect CPU factors can reduce performance in the following ways:

  • If the CPU factor of a host is too high, LSF prefers that host and may send it more jobs than it can run efficiently.
  • If the CPU factor of a host is too low, LSF may send jobs to slower hosts even when the faster host is available.

Both of these conditions are somewhat self-correcting. If the CPU factor for a host is too high, jobs are sent to that host until the CPU load threshold is reached. LSF then marks that host as busy, and no further jobs will be sent there. If the CPU factor is too low, jobs may be sent to slower hosts. This increases the load on the slower hosts, making LSF more likely to schedule future jobs on the faster host.

Guidelines for setting CPU factors

CPU factors should be set based on a benchmark that reflects your workload. If there is no such benchmark, CPU factors can be set based on raw CPU power.

The CPU factor of the slowest hosts should be set to 1, and faster hosts should be proportional to the slowest.

Example

Consider a cluster with two hosts: hostA and hostB. In this cluster, hostA takes 30 seconds to run a benchmark and hostB takes 15 seconds to run the same test. The CPU factor for hostA should be 1, and the CPU factor of hostB should be 2 because it is twice as fast as hostA.
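
In lsf.shared, these factors could be recorded against the host models of hostA and hostB. The model and architecture names below are hypothetical:

Begin HostModel 
MODELNAME   CPUFACTOR   ARCHITECTURE   # keyword 
ModelA      1.0         (archA)        # hostA's model: 30-second benchmark 
ModelB      2.0         (archB)        # hostB's model: 15-second benchmark 
End HostModel 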

View normalized ratings

  1. Run lsload -N to display normalized ratings.
    LSF uses a normalized CPU performance rating to decide which host has the most available CPU power. Hosts in your cluster are displayed in order from best to worst. Normalized CPU run queue length values are based on an estimate of the time it would take each host to run one additional unit of work, given that an unloaded host with CPU factor 1 runs one unit of work in one unit of time.

Tune CPU factors

  1. Log in as the LSF administrator on any host in the cluster.
  2. Edit lsf.shared, and change the HostModel section. For example:
    Begin HostModel 
    MODELNAME  CPUFACTOR   ARCHITECTURE # keyword 
    #HPUX (HPPA) 
    HP9K712S         2.5   (HP9000712_60) 
    HP9K712M         2.5   (HP9000712_80) 
    HP9K712F         4.0   (HP9000712_100) 
    End HostModel 
    See the Platform LSF Configuration Reference for information about the lsf.shared file.
  3. Save the changes to lsf.shared.
  4. Run lsadmin reconfig to reconfigure LIM.
  5. Run badmin reconfig to reconfigure mbatchd.

Handling Host-level Job Exceptions

You can configure hosts so that LSF detects exceptional conditions while jobs are running and takes appropriate action automatically. You can customize which exceptions are detected and the corresponding actions. By default, LSF does not detect any exceptions.

Host exceptions LSF can detect

If you configure host exception handling, LSF can detect jobs that exit repeatedly on a host. The host can still be available to accept jobs, but some other problem prevents the jobs from running. Typically, jobs dispatched to such "black hole" or "job-eating" hosts exit abnormally. LSF monitors the job exit rate for hosts and closes a host if its rate exceeds a threshold you configure (EXIT_RATE in lsb.hosts).

If EXIT_RATE is specified for the host, LSF invokes eadmin if the job exit rate for a host remains above the configured threshold for longer than 5 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change how frequently LSF checks the job exit rate.

Use GLOBAL_EXIT_RATE in lsb.params to set a cluster-wide job exit rate threshold. If EXIT_RATE is not specified for the host in lsb.hosts, GLOBAL_EXIT_RATE defines a default exit rate for all hosts in the cluster. A host-level EXIT_RATE overrides the GLOBAL_EXIT_RATE value.

Configuring host exception handling (lsb.hosts)

EXIT_RATE

Specify a threshold for exited jobs. If the job exit rate remains exceeded for 5 minutes, or for the period specified by JOB_EXIT_RATE_DURATION in lsb.params, LSF invokes eadmin to trigger a host exception.

Example

The following Host section defines a job exit rate of 20 jobs for all hosts, and an exit rate of 10 jobs on hostA.

Begin Host 
HOST_NAME    MXJ      EXIT_RATE  # Keywords 
Default      !        20 
hostA        !        10 
End Host 

Configuring thresholds for host exception handling

By default, LSF checks the number of exited jobs every 5 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change this default.
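
For example, the following lsb.params settings set a cluster-wide default exit rate and a 10-minute checking period. The values are illustrative only:

Begin Parameters 
GLOBAL_EXIT_RATE=20           # default job exit rate for hosts without EXIT_RATE 
JOB_EXIT_RATE_DURATION=10     # check the job exit rate over 10-minute periods 
End Parameters 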

Tuning
tip:  
Tune JOB_EXIT_RATE_DURATION carefully. Shorter values may raise false alarms; longer values may not trigger exceptions frequently enough.
Example

Consider the following sequence of events: the job exit rate of hostA exceeds the configured threshold (EXIT_RATE for hostA in lsb.hosts). LSF monitors hostA from time t1 to time t2, where t2 = t1 + JOB_EXIT_RATE_DURATION (lsb.params). At t2, the exit rate is still high, and a host exception is detected. At t3 (defined by EADMIN_TRIGGER_DURATION in lsb.params), LSF invokes eadmin and the host exception is handled. By default, LSF closes hostA and sends email to the LSF administrator. Because hostA is closed and cannot accept any new jobs, the exit rate drops quickly.

