Working with Hosts
Contents
- Host status
- How LIM Determines Host Models and Types
- Viewing Host Information
- Controlling Hosts
- Adding a Host
- Remove a Host
- Adding Hosts Dynamically
- Automatically Detect Operating System Types and Versions
- Add Host Types and Host Models to lsf.shared
- Registering Service Ports
- Host Naming
- Hosts with Multiple Addresses
- Using IPv6 Addresses
- Specify host names with condensed notation
- Host Groups
- Compute Units
- Tuning CPU Factors
- Handling Host-level Job Exceptions
Host status
Host status describes the ability of a host to accept and run batch jobs in terms of daemon states, load levels, and administrative controls. The bhosts and lsload commands display host status.
bhosts
Displays the current status of the host:
bhosts
HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
hostA      ok         -   55      2    2      0      0    0
hostB      closed     -   20     16   16      0      0    0
...
bhosts -l
Displays the closed reasons. A closed host does not accept new batch jobs:
bhosts -l hostB
HOST  hostB
STATUS       CPUF  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
closed_Adm  23.10     -   55      2    2      0      0    0  -
CURRENT LOAD USED FOR SCHEDULING:
           r15s   r1m  r15m  ut   pg   io  ls  it    tmp   swp   mem
Total       1.0  -0.0  -0.0  4%  9.4  148   2   3  4231M  698M  233M
Reserved    0.0   0.0   0.0  0%  0.0    0   0   0     0M    0M    0M
LOAD THRESHOLD USED FOR SCHEDULING:
           r15s   r1m  r15m  ut   pg   io  ls  it    tmp   swp   mem
loadSched     -     -     -   -    -    -   -   -      -     -     -
loadStop      -     -     -   -    -    -   -   -      -     -     -
           cpuspeed  bandwidth
loadSched         -          -
loadStop          -          -
lsload
Displays the current state of the host:
lsload
HOST_NAME  status  r15s  r1m  r15m  ut   pg  ls    it    tmp   swp   mem
hostA      ok       0.0  0.0   0.0  4%  0.4   0  4316    10G  302M  252M
hostB      ok       1.0  0.0   0.0  4%  8.2   2    14  4231M  698M  232M
...
How LIM Determines Host Models and Types
The LIM (load information manager) daemon/service automatically collects information about hosts in an LSF cluster, and accurately determines running host models and types. At most, 1024 model types can be manually defined in lsf.shared.
If lsf.shared is not fully defined with all known host models and types found in the cluster, LIM attempts to match an unrecognized running host to one of the models and types that is defined.
LIM supports both exact matching of host models and types, and "fuzzy" matching, where an entered host model name or type is slightly different from what is defined in lsf.shared (or in ego.shared if EGO is enabled in the LSF cluster).
How does "fuzzy" matching work?
LIM reads host models and types that have been manually configured in lsf.shared. The format for entering host models and types is model_bogomips_architecture (for example, x15_4604_OpterontmProcessor142, IA64_2793, or SUNWUltra510_360_sparc). Names can be up to 64 characters long.
When LIM attempts to match a running host model with what is entered in lsf.shared, it first attempts an exact match, then proceeds to make a fuzzy match.
How LIM attempts to make matches
Viewing Host Information
LSF uses some or all of the hosts in a cluster as execution hosts. The host list is configured by the LSF administrator. Use the bhosts command to view host information. Use the lsload command to view host load information.
View all hosts in the cluster and their status
- Run bhosts to display information about all hosts and their status.
bhosts displays condensed information for hosts that belong to condensed host groups. When displaying members of a condensed host group, bhosts lists the host group name instead of the name of the individual host. For example, in a cluster with a condensed host group (groupA), an uncondensed host group (groupB containing hostC and hostE), and a host that is not in any host group (hostF), bhosts displays the following:
bhosts
HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
groupA     ok         5    8      4    2      0      1    1
hostC      ok         -    3      0    0      0      0    0
hostE      ok         2    4      2    1      0      0    1
hostF      ok         -    2      2    1      0      1    0
Define condensed host groups in the HostGroup section of lsb.hosts. To find out more about condensed host groups and to see the configuration for the above example, see Defining condensed host groups.
View uncondensed host information
- Run bhosts -X to display all hosts in an uncondensed format, including those belonging to condensed host groups:
bhosts -X
HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
hostA      ok         2    2      0    0      0      0    0
hostD      ok         2    4      2    1      0      0    1
hostB      ok         1    2      2    1      0      1    0
hostC      ok         -    3      0    0      0      0    0
hostE      ok         2    4      2    1      0      0    1
hostF      ok         -    2      2    1      0      1    0
View detailed server host information
- Run bhosts -l host_name and lshosts -l host_name to display all information about each server host such as the CPU factor and the load thresholds to start, suspend, and resume jobs:
bhosts -l hostB
HOST  hostB
STATUS   CPUF  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOWS
ok      20.20     -    -      0    0      0      0    0  -
CURRENT LOAD USED FOR SCHEDULING:
           r15s  r1m  r15m  ut   pg  io  ls  it   tmp   swp  mem
Total       0.1  0.1   0.1  9%  0.7  24  17   0  394M  396M  12M
Reserved    0.0  0.0   0.0  0%  0.0   0   0   0    0M    0M   0M
LOAD THRESHOLD USED FOR SCHEDULING:
           r15s  r1m  r15m  ut   pg  io  ls  it   tmp   swp  mem
loadSched     -    -     -   -    -   -   -   -     -     -    -
loadStop      -    -     -   -    -   -   -   -     -     -    -
           cpuspeed  bandwidth
loadSched         -          -
loadStop          -          -
lshosts -l hostB
HOST_NAME:  hostB
type     model    cpuf  ncpus  ndisks  maxmem  maxswp  maxtmp  rexpri  server  nprocs  ncores  nthreads
LINUX86  PC6000  116.1      2       1   2016M   1983M  72917M       0     Yes       1       2         2
RESOURCES: Not defined
RUN_WINDOWS: (always open)
LICENSES_ENABLED: (LSF_Base LSF_Manager LSF_MultiCluster)
LICENSE_NEEDED: Class(E)
LOAD_THRESHOLDS:
r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
   -  1.0     -   -   -   -   -   -    -    -   4M
View host load by host
The lsload command reports the current status and load levels of hosts in a cluster. The lshosts -l command shows the load thresholds.
The lsmon command provides a dynamic display of the load information. The LSF administrator can find unavailable or overloaded hosts with these tools.
- Run lsload to see load levels for each host:
lsload
HOST_NAME  status  r15s   r1m  r15m   ut   pg  ls  it  tmp   swp  mem
hostD      ok       1.3   1.2   0.9  92%  0.0   2  20   5M  148M  88M
hostB      -ok      0.1   0.3   0.7   0%  0.0   1  67  45M   25M  34M
hostA      busy     8.0  *7.0   4.9  84%  4.6   6  17   1M   81M  27M
The first line lists the load index names, and each following line gives the load levels for one host.
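For scripted monitoring, the lsload table can be turned into one record per host, keyed by the load index names on the first line. The following is a minimal Python sketch, assuming the whitespace-separated layout shown in the example. The helper names parse_lsload and hosts_needing_attention are hypothetical, and SAMPLE is copied from the example output rather than captured from a live cluster.

```python
# Sample text copied from the lsload example; not captured from a real cluster.
SAMPLE = """\
HOST_NAME  status  r15s   r1m  r15m   ut   pg  ls  it  tmp   swp  mem
hostD      ok       1.3   1.2   0.9  92%  0.0   2  20   5M  148M  88M
hostB      -ok      0.1   0.3   0.7   0%  0.0   1  67  45M   25M  34M
hostA      busy     8.0  *7.0   4.9  84%  4.6   6  17   1M   81M  27M
"""


def parse_lsload(text):
    """Return one dict per host, keyed by the header names on the first line."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    # zip() stops at the shorter sequence, so rows with missing fields still parse.
    return [dict(zip(header, line.split())) for line in lines[1:]]


def hosts_needing_attention(rows):
    """Return the names of hosts whose status column is anything other than 'ok'."""
    return [row["HOST_NAME"] for row in rows if row["status"] != "ok"]


if __name__ == "__main__":
    for name in hosts_needing_attention(parse_lsload(SAMPLE)):
        print(name)
```

Values are kept as strings on purpose: fields such as 92%, 5M, and *7.0 carry units or markers that a caller may want to interpret differently.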
Viewing host architecture (type and model) information
An LSF cluster may consist of hosts of differing architectures and speeds. The lshosts command displays configuration information about hosts. All these parameters are defined by the LSF administrator in the LSF configuration files, or determined by the LIM directly from the system.
Host types represent binary compatible hosts; all hosts of the same type can run the same executable. Host models give the relative CPU performance of different processors. For example:
lshosts
HOST_NAME  type    model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
hostD      SUNSOL  SunSparc   6.0      1     64M    112M     Yes  (solaris cserver)
hostM      RS6K    IBM350     7.0      1     64M    124M     Yes  (cserver aix)
hostC      SGI6    R10K      14.0     16   1024M   1896M     Yes  (irix cserver)
hostA      HPPA    HP715      6.0      1     98M    200M     Yes  (hpux fserver)
In the above example, the host type SUNSOL represents Sun SPARC systems running Solaris, and SGI6 represents an SGI server running IRIX 6. The lshosts command also displays the resources available on each host.
type
The host CPU architecture. Hosts that can run the same binary programs should have the same type.
An UNKNOWN type or model indicates the host is down, or LIM on the host is down. See UNKNOWN host type or model for instructions on measures to take.
When automatic detection of host type or model fails (the host type configured in lsf.shared cannot be found), the type or model is set to DEFAULT. LSF will work on the host, but a DEFAULT model may be inefficient because of incorrect CPU factors. A DEFAULT type may also cause binary incompatibility because a job from a DEFAULT host type can be migrated to another DEFAULT host type.
View host history
- Run badmin hhist to view the history of a host such as when it is opened or closed:
badmin hhist hostB
Wed Nov 20 14:41:58: Host <hostB> closed by administrator <lsf>.
Wed Nov 20 15:23:39: Host <hostB> opened by administrator <lsf>.
View host model and type information
- Run lsinfo -m to display information about host models that exist in the cluster:
lsinfo -m
MODEL_NAME      CPU_FACTOR  ARCHITECTURE
PC1133               23.10  x6_1189_PentiumIIICoppermine
HP9K735               4.50  HP9000735_125
HP9K778               5.50  HP9000778
Ultra5S              10.30  SUNWUltra510_270_sparcv9
Ultra2               20.20  SUNWUltra2_300_sparc
Enterprise3000       20.00  SUNWUltraEnterprise_167_sparc
- Run lsinfo -M to display all host models defined in lsf.shared:
lsinfo -M
MODEL_NAME           CPU_FACTOR  ARCHITECTURE
UNKNOWN_AUTO_DETECT        1.00  UNKNOWN_AUTO_DETECT
DEFAULT                    1.00
LINUX133                   2.50  x586_53_Pentium75
PC200                      4.50  i86pc_200
Intel_IA64                12.00  ia64
Ultra5S                   10.30  SUNWUltra5_270_sparcv9
PowerPC_G4                12.00  x7400G4
HP300                      1.00
SunSparc                  12.00
- Run lim -t to display the type, model, and matched type of the current host. You must be the LSF administrator to use this command:
lim -t
Host Type             : NTX64
Host Architecture     : EM64T_1596
Physical Processors   : 2
Cores per Processor   : 4
Threads per Core      : 2
License Needed        : Class(B),Multi-cores
Matched Type          : NTX64
Matched Architecture  : EM64T_3000
Matched Model         : Intel_EM64T
CPU Factor            : 60.0
View job exit rate and load for hosts
- Run bhosts -l to display the exception threshold for job exit rate and the current load value for hosts.
In the following example, EXIT_RATE for hostA is configured as 4 jobs per minute. hostA does not currently exceed this rate:
bhosts -l hostA
HOST  hostA
STATUS   CPUF  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
ok      18.60     -    1      0    0      0      0    0  -
CURRENT LOAD USED FOR SCHEDULING:
           r15s  r1m  r15m  ut   pg  io  ls  it   tmp   swp   mem
Total       0.0  0.0   0.0  0%  0.0   0   1   2  646M  648M  115M
Reserved    0.0  0.0   0.0  0%  0.0   0   0   0    0M    0M    0M
           share_rsrc  host_rsrc
Total             3.0        2.0
Reserved          0.0        0.0
LOAD THRESHOLD USED FOR SCHEDULING:
           r15s  r1m  r15m  ut   pg  io  ls  it   tmp   swp   mem
loadSched     -    -     -   -    -   -   -   -     -     -     -
loadStop      -    -     -   -    -   -   -   -     -     -     -
           cpuspeed  bandwidth
loadSched         -          -
loadStop          -          -
THRESHOLD AND LOAD USED FOR EXCEPTIONS:
           JOB_EXIT_RATE
Threshold           4.00
Load                0.00
- Use bhosts -x to see hosts whose job exit rate has exceeded the threshold for longer than JOB_EXIT_RATE_DURATION, and are still high. By default, these hosts are closed the next time LSF checks host exceptions and invokes eadmin.
If no hosts exceed the job exit rate, bhosts -x displays:
There is no exceptional host found
View dynamic host information
- Use lshosts to display information on dynamically added hosts.
An LSF cluster may consist of static and dynamic hosts. The lshosts command displays configuration information about hosts. All these parameters are defined by the LSF administrator in the LSF configuration files, or determined by the LIM directly from the system.
Host types represent binary compatible hosts; all hosts of the same type can run the same executable. Host models give the relative CPU performance of different processors. Server represents the type of host in the cluster. "Yes" is displayed for LSF servers, "No" is displayed for LSF clients, and "Dyn" is displayed for dynamic hosts.
For example:
lshosts
HOST_NAME  type     model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
hostA      SOL64    Ultra60F  23.5      1     64M    112M     Yes  ()
hostB      LINUX86  Opteron8  60.0      1     94M    168M     Dyn  ()
In the above example, hostA is a static host while hostB is a dynamic host.
Controlling Hosts
Hosts are opened and closed by an LSF Administrator or root issuing a command or through configured dispatch windows.
Close a host
- Run badmin hclose:
badmin hclose hostB
Close <hostB> ...... done
If the command fails, it may be because the host is unreachable through network problems, or because the daemons on the host are not running.
Open a host
- Run badmin hopen:
badmin hopen hostB
Open <hostB> ...... done
Configure Dispatch Windows
A dispatch window specifies one or more time periods during which a host will receive new jobs. The host will not receive jobs outside of the configured windows. Dispatch windows do not affect job submission and running jobs (they are allowed to run until completion). By default, dispatch windows are not configured.
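The rule described above, that a host receives new jobs only inside a configured window and is always open when no window is configured, can be sketched in a few lines of Python. This is an illustration only, not LSF's scheduler logic. It assumes windows written as hour:minute-hour:minute, as in the (4:30-12:00) example in this section, and ignores any day-of-week prefix. The helper names are hypothetical, and treating the end time as exclusive and allowing a window to wrap past midnight are assumptions of the sketch.

```python
from datetime import time


def parse_clock(text):
    """Parse 'H' or 'H:MM' into a datetime.time."""
    hour, _, minute = text.partition(":")
    return time(int(hour), int(minute or 0))


def parse_window(window):
    """Split '(4:30-12:00)' or '4:30-12:00' into (start, end) times."""
    start, _, end = window.strip("() ").partition("-")
    return parse_clock(start), parse_clock(end)


def in_window(now, window):
    """True if 'now' falls inside the window (end time treated as exclusive)."""
    start, end = parse_window(window)
    if start <= end:
        return start <= now < end
    # Assumption: a window such as 22:00-6:00 wraps past midnight.
    return now >= start or now < end


def accepts_new_jobs(now, windows):
    """No configured windows means the host is always open for dispatch."""
    if not windows:
        return True
    return any(in_window(now, window) for window in windows)


if __name__ == "__main__":
    print(accepts_new_jobs(time(9, 0), ["(4:30-12:00)"]))
    print(accepts_new_jobs(time(13, 0), ["(4:30-12:00)"]))
```

The check concerns dispatch only: as stated above, jobs that are already running on the host continue to completion after the window closes.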
To configure dispatch windows:
- Edit lsb.hosts.
- Specify one or more time windows in the DISPATCH_WINDOW column:
Begin Host
HOST_NAME  r1m      pg   ls     tmp  DISPATCH_WINDOW
...
hostB      3.5/4.5  15/  12/15  0    (4:30-12:00)
...
End Host
- Reconfigure the cluster:
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
- Run bhosts -l to display the dispatch windows.
Log a comment when closing or opening a host
- Use the -C option of badmin hclose and badmin hopen to log an administrator comment in lsb.events:
badmin hclose -C "Weekly backup" hostB
The comment text Weekly backup is recorded in lsb.events. If you close or open a host group, each host group member displays with the same comment string.
A new event record is recorded for each host open or host close event. For example:
badmin hclose -C "backup" hostA
followed by
badmin hclose -C "Weekly backup" hostA
generates the following records in lsb.events:
"HOST_CTRL" "7.0" 1050082346 1 "hostA" 32185 "lsfadmin" "backup"
"HOST_CTRL" "7.0" 1050082373 1 "hostA" 32185 "lsfadmin" "Weekly backup"
- Use badmin hist or badmin hhist to display administrator comments for closing and opening hosts:
badmin hhist
Fri Apr 4 10:35:31: Host <hostB> closed by administrator <lsfadmin> Weekly backup.
bhosts -l also displays the comment text:
bhosts -l
HOST  hostA
STATUS       CPUF  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
closed_Adm   1.00     -    -      0    0      0      0    0  -
CURRENT LOAD USED FOR SCHEDULING:
           r15s  r1m  r15m  ut   pg  io  ls  it    tmp   swp   mem
Total       0.0  0.0   0.0  2%  0.0  64   2  11  7117M  512M  432M
Reserved    0.0  0.0   0.0  0%  0.0   0   0   0     0M    0M    0M
LOAD THRESHOLD USED FOR SCHEDULING:
           r15s  r1m  r15m  ut   pg  io  ls  it    tmp   swp   mem
loadSched     -    -     -   -    -   -   -   -      -     -     -
loadStop      -    -     -   -    -   -   -   -      -     -     -
           cpuspeed  bandwidth
loadSched         -          -
loadStop          -          -
THRESHOLD AND LOAD USED FOR EXCEPTIONS:
           JOB_EXIT_RATE
Threshold           2.00
Load                0.00
ADMIN ACTION COMMENT: "Weekly backup"
How events are displayed and recorded in MultiCluster lease model
In the MultiCluster resource lease model, host control administrator comments are recorded only in the lsb.events file on the local cluster. badmin hist and badmin hhist display only events that are recorded locally. Host control messages are not passed between clusters in the MultiCluster lease model. For example, if you close an exported host in both the consumer and the provider cluster, the host close events are recorded separately in their local lsb.events.
Adding a Host
Use the lsfinstall command to add a host to an LSF cluster.
Add a host of an existing type using lsfinstall
restriction:
lsfinstall is not compatible with clusters installed with lsfsetup. To add a host to a cluster originally installed with lsfsetup, you must upgrade your cluster.
- Verify that the host type already exists in your cluster:
- Log on to any host in the cluster. You do not need to be root.
- List the contents of the LSF_TOP/7.0 directory. The default is /usr/share/lsf/7.0. If the host type currently exists, there is a subdirectory with the name of the host type. If it does not exist, go to Add a host of a new type using lsfinstall.
- Add the host information to lsf.cluster.cluster_name:
- Log on to the LSF master host as root.
- Edit LSF_CONFDIR/lsf.cluster.cluster_name, and specify the following in the Host section:
- The name of the host.
- The model and type, or specify ! to automatically detect the type or model.
- Specify 1 for LSF server or 0 for LSF client.
Begin Host
HOSTNAME  model  type      server  r1m  mem  RESOURCES  REXPRI
hosta     !      SUNSOL6   1       1.0  4    ()         0
hostb     !      SUNSOL6   0       1.0  4    ()         0
hostc     !      HPPA1132  1       1.0  4    ()         0
hostd     !      HPPA1164  1       1.0  4    ()         0
End Host
- Save your changes.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin mbdrestart to restart mbatchd.
- From /usr/share/lsf/7.0/install, run hostsetup to set up the new host and configure the daemons to start automatically at boot:
./hostsetup --top="/usr/share/lsf" --boot="y"
- Start LSF on the new host:
lsadmin limstartup
lsadmin resstartup
badmin hstartup
- Run bhosts and lshosts to verify your changes.
- If any host type or host model is UNKNOWN, follow the steps in UNKNOWN host type or model to fix the problem.
- If any host type or host model is DEFAULT, follow the steps in DEFAULT host type or model to fix the problem.
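The verification step can be automated by scanning lshosts output for hosts whose type or model is reported as UNKNOWN or DEFAULT. The following Python sketch assumes the column layout shown in the lshosts examples in this document. The function name suspect_hosts is hypothetical, and the hostX and hostY rows in SAMPLE are made up for illustration. Matching on a prefix (so that a truncated or suffixed value is still caught) is a design choice of the sketch, not documented LSF behavior.

```python
# Values that call for follow-up, per the steps above.
SUSPECT = ("UNKNOWN", "DEFAULT")

# hostA is taken from the lshosts example; hostX and hostY are hypothetical rows.
SAMPLE = """\
HOST_NAME  type     model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
hostA      SOL64    Ultra60F  23.5  1      64M     112M    Yes     ()
hostX      UNKNOWN  UNKNOWN   1.0   -      -       -       Yes     ()
hostY      LINUX86  DEFAULT   1.0   1      94M     168M    Yes     ()
"""


def classify(value):
    """Return 'UNKNOWN' or 'DEFAULT' if the value starts with one of them, else None."""
    return next((name for name in SUSPECT if value.startswith(name)), None)


def suspect_hosts(lshosts_text):
    """Map each host name to the sorted list of suspect labels found in type/model."""
    lines = [line for line in lshosts_text.splitlines() if line.strip()]
    header = lines[0].split()
    type_col = header.index("type")
    model_col = header.index("model")
    problems = {}
    for line in lines[1:]:
        fields = line.split()
        labels = {classify(fields[type_col]), classify(fields[model_col])}
        labels.discard(None)
        if labels:
            problems[fields[0]] = sorted(labels)
    return problems


if __name__ == "__main__":
    for host, labels in suspect_hosts(SAMPLE).items():
        print(host, ",".join(labels))
```

A host flagged here still needs the manual follow-up described in the UNKNOWN and DEFAULT troubleshooting topics; the script only narrows down where to look.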
Add a host of a new type using lsfinstall
restriction:
lsfinstall is not compatible with clusters installed with lsfsetup. To add a host to a cluster originally installed with lsfsetup, you must upgrade your cluster.
- Verify that the host type does not already exist in your cluster:
- Log on to any host in the cluster. You do not need to be root.
- List the contents of the LSF_TOP/7.0 directory. The default is /usr/share/lsf/7.0. If the host type currently exists, there will be a subdirectory with the name of the host type. If the host type already exists, go to Add a host of an existing type using lsfinstall.
- Get the LSF distribution tar file for the host type you want to add.
- Log on as root to any host that can access the LSF install directory.
- Change to the LSF install directory. The default is /usr/share/lsf/7.0/install
- Edit install.config:
- For LSF_TARDIR, specify the path to the tar file. For example:
LSF_TARDIR="/usr/share/lsf_distrib/7.0"
- For LSF_ADD_SERVERS, list the new host names enclosed in quotes and separated by spaces. For example:
LSF_ADD_SERVERS="hosta hostb"
- Run ./lsfinstall -f install.config. This automatically creates the host information in lsf.cluster.cluster_name.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
- From /usr/share/lsf/7.0/install, run hostsetup to set up the new host and configure the daemons to start automatically at boot:
./hostsetup --top="/usr/share/lsf" --boot="y"
- Start LSF on the new host:
lsadmin limstartup
lsadmin resstartup
badmin hstartup
- Run bhosts and lshosts to verify your changes.
- If any host type or host model is UNKNOWN, follow the steps in UNKNOWN host type or model to fix the problem.
- If any host type or host model is DEFAULT, follow the steps in DEFAULT host type or model to fix the problem.
Remove a Host
Removing a host from LSF involves preventing any additional jobs from running on the host, removing the host from LSF, and removing the host from the cluster.
caution:
Never remove the master host from LSF. If you want to remove your current default master from LSF, change lsf.cluster.cluster_name to assign a different default master host. Then remove the host that was once the master host.
- Log on to the LSF host as root.
- Run badmin hclose to close the host. This prevents jobs from being dispatched to the host and allows running jobs to finish.
- Stop all running daemons manually.
- Remove any references to the host in the Host section of LSF_CONFDIR/lsf.cluster.cluster_name.
- Remove any other references to the host, if applicable, from the following LSF configuration files:
LSF_CONFDIR/lsf.shared
LSB_CONFDIR/cluster_name/configdir/lsb.hosts
LSB_CONFDIR/cluster_name/configdir/lsb.queues
LSB_CONFDIR/cluster_name/configdir/lsb.resources
- Log off the host to be removed, and log on as root or the primary LSF administrator to any other host in the cluster.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin mbdrestart to restart mbatchd.
- If you configured LSF daemons to start automatically at system startup, remove the LSF section from the host's system startup files.
- If any users of the host use lstcsh as their login shell, change their login shell to tcsh or csh. Remove lstcsh from the /etc/shells file.
Remove a Host from Master Candidate List
You can remove a host from the master candidate list so that it can no longer be the master should failover occur. You can choose to either keep it as part of the cluster or remove it.
- Shut down the current LIM:
lsadmin limshutdown host_name
If the host was the current master, failover occurs.
- In lsf.conf, remove the host name from LSF_MASTER_LIST.
- Run lsadmin reconfig for the remaining master candidates.
- If the host you removed as a master candidate still belongs to the cluster, start up the LIM again:
lsadmin limstartup host_name
Adding Hosts Dynamically
By default, all configuration changes made to LSF are static. To add or remove hosts within the cluster, you must manually change the configuration and restart all master candidates.
Dynamic host configuration allows you to add and remove hosts without manual reconfiguration. To enable dynamic host configuration, all of the parameters described in the following table must be defined.
important:
If you choose to enable dynamic hosts when you install LSF, the installer adds the parameter LSF_HOST_ADDR_RANGE to lsf.cluster.cluster_name using a default value that allows any host to join the cluster. To enable security, configure LSF_HOST_ADDR_RANGE in lsf.cluster.cluster_name after installation to restrict the hosts that can join your cluster.
How dynamic host configuration works
Master LIM
The master LIM runs on the master host for the cluster. The master LIM receives requests to add hosts, and tells the master host candidates defined by the parameter LSF_MASTER_LIST to update their configuration information when a host is dynamically added or removed.
Upon startup, both static and dynamic hosts wait to receive an acknowledgement from the master LIM. This acknowledgement indicates that the master LIM has added the host to the cluster. Static hosts normally receive an acknowledgement because the master LIM has access to static host information in the LSF configuration files. Dynamic hosts do not receive an acknowledgement, however, until they announce themselves to the master LIM. The parameter LSF_DYNAMIC_HOST_WAIT_TIME in lsf.conf determines how long a dynamic host waits before sending a request to the master LIM to add the host to the cluster.
Master candidate LIMs
The parameter LSF_MASTER_LIST defines the list of master host candidates. These hosts receive updated host information from the master LIM so that any master host candidate can take over as master host for the cluster.
important:
Master candidate hosts should share LSF configuration and binaries.
Dynamic hosts cannot be master host candidates. By defining the parameter LSF_MASTER_LIST, you ensure that LSF limits the list of master host candidates to specific, static hosts.
mbatchd
mbatchd gets host information from the master LIM; when it detects the addition or removal of a dynamic host within the cluster, mbatchd automatically reconfigures itself.
tip:
After adding a host dynamically, you might have to wait for mbatchd to detect the host and reconfigure. Depending on system load, mbatchd might wait up to a maximum of 10 minutes before reconfiguring.
lsadmin command
Use the command lsadmin limstartup to start the LIM on a newly added dynamic host.
Allowing only certain hosts to join the cluster
By default, any host can be dynamically added to the cluster. To enable security, define LSF_HOST_ADDR_RANGE in lsf.cluster.cluster_name to identify a range of IP addresses for hosts that are allowed to dynamically join the cluster as LSF hosts. IP addresses can have either a dotted quad notation (IPv4) or IP Next Generation (IPv6) format. You can use IPv6 addresses if you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf; you do not have to map IPv4 addresses to an IPv6 format.
Configure LSF to run batch jobs on dynamic hosts
Before you run batch jobs on a dynamic host, complete any or all of the following steps, depending on your cluster configuration.
- Configure queues to accept all hosts by defining the HOSTS parameter in lsb.queues using the keyword all.
- Define host groups that will accept wild cards in the HostGroup section of lsb.hosts.
For example, define linuxrack* as a GROUP_MEMBER within a host group definition.
- Add a dynamic host to a host group using the command badmin hghostadd.
Changing a dynamic host to a static host
If you want to change a dynamic host to a static host, first use the command badmin hghostdel to remove the dynamic host from any host group that it belongs to, and then configure the host as a static host in lsf.cluster.cluster_name.
Adding dynamic hosts
Add a dynamic host in a shared file system environment
In a shared file system environment, you do not need to install LSF on each dynamic host. The master host will recognize a dynamic host as an LSF host when you start the daemons on the dynamic host.
- In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_WAIT_TIME, in seconds, and assign a value greater than zero.
LSF_DYNAMIC_HOST_WAIT_TIME specifies the length of time a dynamic host waits before sending a request to the master LIM to add the host to the cluster.
For example:
LSF_DYNAMIC_HOST_WAIT_TIME=60
- In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_TIMEOUT.
LSF_DYNAMIC_HOST_TIMEOUT specifies the length of time (minimum 10 minutes) a dynamic host is unavailable before the master host removes it from the cluster. Each time LSF removes a dynamic host, mbatchd automatically reconfigures itself.
note:
For very large clusters, defining this parameter could decrease system performance.
For example:
LSF_DYNAMIC_HOST_TIMEOUT=60m
- In lsf.cluster.cluster_name on the master host, define the parameter LSF_HOST_ADDR_RANGE.
LSF_HOST_ADDR_RANGE enables security by defining a list of hosts that can join the cluster. Specify IP addresses or address ranges for hosts that you want to allow in the cluster.
tip:
If you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf, IP addresses can have either a dotted quad notation (IPv4) or IP Next Generation (IPv6) format; you do not have to map IPv4 addresses to an IPv6 format.
For example:
LSF_HOST_ADDR_RANGE=100-110.34.1-10.4-56
All hosts belonging to a domain with an address having the first number between 100 and 110, then 34, then a number between 1 and 10, then a number between 4 and 56 will be allowed access. In this example, no IPv6 hosts are allowed.
- Log on as root to each host you want to join the cluster.
- Source the LSF environment:
- For csh or tcsh:
source LSF_TOP/conf/cshrc.lsf
- For sh, ksh, or bash:
. LSF_TOP/conf/profile.lsf
- Do you want LSF to start automatically when the host reboots?
- If no, go to step 7.
- If yes, run the hostsetup command. For example:
cd /usr/share/lsf/7.0/install
./hostsetup --top="/usr/share/lsf" --boot="y"
For complete hostsetup usage, enter hostsetup -h.
- Use the following commands to start LSF:
lsadmin limstartup
lsadmin resstartup
badmin hstartup
Add a dynamic host in a non-shared file system environment
In a non-shared file system environment, you must install LSF binaries, a localized lsf.conf file, and shell environment scripts (cshrc.lsf and profile.lsf) on each dynamic host.
Specify installation options in the slave.config file
All dynamic hosts are slave hosts, because they cannot serve as master host candidates. The slave.config file contains parameters for configuring all slave hosts.
- Define the required parameters.
LSF_SERVER_HOSTS="host_name [host_name ...]"
LSF_ADMINS="user_name [user_name ...]"
LSF_TOP="/path"
- Define the optional parameters.
LSF_LIM_PORT=port_number
important:
If the master host does not use the default LSF_LIM_PORT, you must specify the same LSF_LIM_PORT defined in lsf.conf on the master host.
Add local resources on a dynamic host to the cluster
Prerequisites: Ensure that the resource name and type are defined in lsf.shared, and that the ResourceMap section of lsf.cluster.cluster_name contains at least one resource mapped to at least one static host. LSF can add local resources as long as the ResourceMap section is defined; you do not need to map the local resources.
- In the slave.config file, define the parameter LSF_LOCAL_RESOURCES.
For numeric resources, define name-value pairs:
"[resourcemap value*resource_name]"
For Boolean resources, the value is the resource name in the following format:
"[resource resource_name]"
For example:
LSF_LOCAL_RESOURCES="[resourcemap 1*verilog] [resource linux]"
tip:
If LSF_LOCAL_RESOURCES are already defined in a local lsf.conf on the dynamic host, lsfinstall does not add resources you define in LSF_LOCAL_RESOURCES in slave.config.
When the dynamic host sends a request to the master host to add it to the cluster, the dynamic host also reports its local resources. If the local resource is already defined in lsf.cluster.cluster_name as default or all, it cannot be added as a local resource.
Install LSF on a dynamic host
- Run lsfinstall -s -f slave.config.
lsfinstall creates a local lsf.conf for the dynamic host, which sets the following parameters:
LSF_CONFDIR="/path"
LSF_GET_CONF=lim
LSF_LIM_PORT=port_number (same as the master LIM port number)
LSF_LOCAL_RESOURCES="resource ..."
tip:
Do not duplicate LSF_LOCAL_RESOURCES entries in lsf.conf. If local resources are defined more than once, only the last definition is valid.
LSF_SERVER_HOSTS="host_name [host_name ...]"
LSF_VERSION=7.0
important:
If LSF_STRICT_CHECKING is defined in lsf.conf to protect your cluster in untrusted environments, and your cluster has dynamic hosts, LSF_STRICT_CHECKING must be configured in the local lsf.conf on all dynamic hosts.
Configure dynamic host parameters
- In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_WAIT_TIME, in seconds, and assign a value greater than zero.
LSF_DYNAMIC_HOST_WAIT_TIME specifies the length of time a dynamic host waits before sending a request to the master LIM to add the host to the cluster.
For example:
LSF_DYNAMIC_HOST_WAIT_TIME=60
- In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_TIMEOUT.
LSF_DYNAMIC_HOST_TIMEOUT specifies the length of time (minimum 10 minutes) a dynamic host is unavailable before the master host removes it from the cluster. Each time LSF removes a dynamic host, mbatchd automatically reconfigures itself.
note:
For very large clusters, defining this parameter could decrease system performance.
For example:
LSF_DYNAMIC_HOST_TIMEOUT=60m
- In lsf.cluster.cluster_name on the master host, define the parameter LSF_HOST_ADDR_RANGE.
LSF_HOST_ADDR_RANGE enables security by defining a list of hosts that can join the cluster. Specify IP addresses or address ranges for hosts that you want to allow in the cluster.
tip:
If you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf, IP addresses can have either a dotted quad notation (IPv4) or IP Next Generation (IPv6) format; you do not have to map IPv4 addresses to an IPv6 format.
For example:
LSF_HOST_ADDR_RANGE=100-110.34.1-10.4-56
All hosts belonging to a domain with an address having the first number between 100 and 110, then 34, then a number between 1 and 10, then a number between 4 and 56 will be allowed access. No IPv6 hosts are allowed.
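To make the address-range rule concrete, the following Python sketch checks an IPv4 address against a range written in the form used in the example, where each dot-separated field is either a single number or a low-high pair. It is an illustration of the rule as described here, not LSF's own matching code. The function names are hypothetical, range endpoints are assumed to be inclusive, and only a single four-field IPv4 range is handled.

```python
def field_matches(spec, octet):
    """Match one address field against 'N' or 'LOW-HIGH' (endpoints assumed inclusive)."""
    low, separator, high = spec.partition("-")
    if separator:
        return int(low) <= octet <= int(high)
    return int(spec) == octet


def address_allowed(address, addr_range):
    """Return True if a dotted-quad IPv4 address falls inside the given range."""
    octets = [int(part) for part in address.split(".")]
    specs = addr_range.split(".")
    # The sketch only understands four-field IPv4 ranges.
    if len(octets) != 4 or len(specs) != 4:
        return False
    return all(field_matches(spec, octet) for spec, octet in zip(specs, octets))


if __name__ == "__main__":
    allowed_range = "100-110.34.1-10.4-56"
    for candidate in ("105.34.7.20", "105.35.7.20", "111.34.1.4"):
        print(candidate, address_allowed(candidate, allowed_range))
```

A check like this can be useful for validating a planned LSF_HOST_ADDR_RANGE value against a list of host addresses before editing lsf.cluster.cluster_name.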
Start LSF daemons
- Log on as root to each host you want to join the cluster.
- Source the LSF environment:
- For csh or tcsh:
source LSF_TOP/conf/cshrc.lsf
- For sh, ksh, or bash:
. LSF_TOP/conf/profile.lsf
- Do you want LSF to start automatically when the host reboots?
- If no, go to step 4.
- If yes, run the hostsetup command. For example:
cd /usr/share/lsf/7.0/install
./hostsetup --top="/usr/share/lsf" --boot="y"
For complete hostsetup usage, enter hostsetup -h.
- Is this the first time the host is joining the cluster?
- If no, use the following commands to start LSF:
lsadmin limstartup
lsadmin resstartup
badmin hstartup
- If yes, you must start the daemons from the local host. For example, if you want to start the daemons on hostB from hostA, use the following commands:
rsh hostB lsadmin limstartup
rsh hostB lsadmin resstartup
rsh hostB badmin hstartup
Removing dynamic hosts
To remove a dynamic host from the cluster, you can either set a timeout value, or you can edit the hostcache file.
Remove a host by setting a timeout value
LSF_DYNAMIC_HOST_TIMEOUT specifies the length of time (minimum 10 minutes) a dynamic host is unavailable before the master host removes it from the cluster. Each time LSF removes a dynamic host, mbatchd automatically reconfigures itself.
note:
For very large clusters, defining this parameter could decrease system performance. If you want to use this parameter to remove dynamic hosts from a very large cluster, disable the parameter after LSF has removed the unwanted hosts.
- In lsf.conf on the master host, define the parameter LSF_DYNAMIC_HOST_TIMEOUT.
To specify minutes rather than hours, append m or M to the value.
For example:
LSF_DYNAMIC_HOST_TIMEOUT=60m
Remove a host by editing the hostcache file
Dynamic hosts remain in the cluster unless you intentionally remove them. Only the cluster administrator can modify the hostcache file.
- Shut down the cluster.
lsfshutdown
This shuts down LSF on all hosts in the cluster and prevents LIMs from trying to write to the hostcache file while you edit it.
- In the hostcache file $EGO_WORKDIR/lim/hostcache, delete the line for the dynamic host that you want to remove.
- If EGO is enabled, the hostcache file is in $EGO_WORKDIR/lim/hostcache.
- If EGO is not enabled, the hostcache file is in $LSB_SHAREDIR.
- Close the hostcache file, and then start up the cluster.
lsfrestart
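If you prefer to script the line deletion rather than edit by hand, a minimal Python sketch could look like the following. The layout of a hostcache line is not described here, so the sketch assumes only that each dynamic host has its own line and that the host name appears on that line as a whitespace-separated token. The function name and the sample lines are hypothetical. Run something like this only while the cluster is shut down, and keep a backup of the original file.

```python
def drop_host_lines(hostcache_text: str, host_name: str) -> str:
    """Return the text without any line that contains host_name as a whole token."""
    kept = [
        line
        for line in hostcache_text.splitlines(keepends=True)
        if host_name not in line.split()
    ]
    return "".join(kept)


if __name__ == "__main__":
    # Invented sample content, used only to show the filtering behaviour.
    sample = "hostA field1 field2\nhostB field1 field2\nhostAB field1\n"
    print(drop_host_lines(sample, "hostA"), end="")
```

Matching on whole tokens rather than substrings keeps a host such as hostAB in place when hostA is removed.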
Automatically Detect Operating System Types and Versions
LSF can automatically detect most operating system types and versions so that you do not need to add them to the lsf.shared file manually. The list of automatically detected operating systems is updated regularly.
- Edit lsf.shared.
- In the Resource section, remove the comment from the following line:
ostype String () () () (Operating system and version)
- In $LSF_SERVERDIR, rename tmp.eslim.ostype to eslim.ostype.
- Run the following commands to restart the LIM and master batch daemon:
lsadmin reconfig
badmin mbdrestart
- To view operating system types and versions, run lshosts -l or lshosts -s.
LSF displays the operating system types and versions in your cluster, including any that LSF automatically detects as well as those you have defined manually in the HostType section of lsf.shared.
You can specify ostype in your resource requirement strings. For example, when submitting a job you can specify the following resource requirement:
-R "select[ostype=RHEL2.6]"
Modify how long LSF waits for new operating system types and versions
Prerequisites: You must enable LSF to automatically detect operating system types and versions.
You can configure how long LSF waits for OS type and version detection.
- In lsf.conf, modify the value for EGO_ESLIM_TIMEOUT.
The value is time in seconds.
Add Host Types and Host Models to lsf.shared
The lsf.shared file contains a list of host type and host model names for most operating systems. You can add to this list or customize the host type and host model names. A host type and host model name can be any alphanumeric string up to 39 characters long.
Add a custom host type or model
- Log on as the LSF administrator on any host in the cluster.
- Edit lsf.shared:
- For a new host type, modify the HostType section:
Begin HostType
TYPENAME # Keyword
DEFAULT
IBMAIX564
LINUX86
LINUX64
NTX64
NTIA64
SUNSOL
SOL732
SOL64
SGI658
SOLX86
HPPA11
HPUXIA64
MACOSX
End HostType
- For a new host model, modify the HostModel section:
Add the new model and its CPU speed factor relative to other models. For more details on tuning CPU factors, see Tuning CPU Factors.
Begin HostModel
MODELNAME CPUFACTOR ARCHITECTURE # keyword
# x86 (Solaris, Windows, Linux): approximate values, based on SpecBench results
# for Intel processors (Sparc/Win) and BogoMIPS results (Linux).
PC75 1.5 (i86pc_75 i586_75 x586_30)
PC90 1.7 (i86pc_90 i586_90 x586_34 x586_35 x586_36)
HP9K715 4.2 (HP9000715_100)
SunSparc 12.0 ()
CRAYJ90 18.0 ()
IBM350 18.0 ()
End HostModel
- Save the changes to lsf.shared.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
Registering Service Ports
LSF uses dedicated UDP and TCP ports for communication. All hosts in the cluster must use the same port numbers to communicate with each other.
The service port numbers can be any numbers ranging from 1024 to 65535 that are not already used by other services. To make sure that the port numbers you supply are not already used by applications registered in your service database, check /etc/services or use the command ypcat services.
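The check described above can be automated. The following Python sketch parses text in the /etc/services format (the same layout that ypcat services prints) and reports any port you intend to use that is already registered under a different service name. The function names are hypothetical, and the sketch is only an aid for reviewing a services database; it does not query LSF or the running system.

```python
def parse_services(text: str) -> dict:
    """Map (port, protocol) to the service name for each entry in services-format text."""
    used = {}
    for line in text.splitlines():
        entry = line.split("#", 1)[0].split()
        if len(entry) < 2 or "/" not in entry[1]:
            continue  # comment-only, blank, or malformed line
        port, protocol = entry[1].split("/", 1)
        used[(int(port), protocol)] = entry[0]
    return used


def find_conflicts(services_text: str, wanted: dict) -> dict:
    """Return the wanted services whose (port, protocol) is held by another name."""
    used = parse_services(services_text)
    return {
        name: used[key]
        for name, key in wanted.items()
        if key in used and used[key] != name
    }


def port_in_allowed_range(port: int) -> bool:
    """Service ports must lie between 1024 and 65535 inclusive."""
    return 1024 <= port <= 65535


if __name__ == "__main__":
    with open("/etc/services") as handle:
        database = handle.read()
    planned = {"res": (3878, "tcp"), "lim": (3879, "udp"),
               "mbatchd": (3881, "tcp"), "sbatchd": (3882, "tcp")}
    print(find_conflicts(database, planned))
```

An empty dictionary means none of the planned ports is registered to another service in the text that was checked. To check an NIS database instead, pass the captured output of ypcat services as the first argument.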
By default, port numbers for LSF services are defined in the lsf.conf file. You can also configure ports by modifying /etc/services or the NIS or NIS+ database. If you define port numbers in lsf.conf, port numbers defined in the service database are ignored.
lsf.conf
- Log on to any host as root.
- Edit lsf.conf and add the following lines:
LSF_RES_PORT=3878
LSB_MBD_PORT=3881
LSB_SBD_PORT=3882
- Add the same entries to lsf.conf on every host.
- Save lsf.conf.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin mbdrestart to restart mbatchd.
- Run lsfstartup to restart all daemons in the cluster.
/etc/services
Configure services manually
tip:
During installation, use the hostsetup --boot="y" option to set up the LSF port numbers in the service database.
- Use the file LSF_TOP/version/install/instlib/example.services as a guide for adding LSF entries to the services database.
If any other service listed in your services database has the same port number as one of the LSF services, you must change the port number for the LSF service. You must use the same port numbers on every LSF host.
- Log on to any host as root.
- Edit the /etc/services file by adding the contents of the LSF_TOP/version/install/instlib/example.services file:
# /etc/services entries for LSF daemons
#
res 3878/tcp # remote execution server
lim 3879/udp # load information manager
mbatchd 3881/tcp # master lsbatch daemon
sbatchd 3882/tcp # slave lsbatch daemon
#
# Add this if ident is not already defined
# in your /etc/services file
ident 113/tcp auth tap # identd
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
- Run lsfstartup to restart all daemons in the cluster.
NIS or NIS+ database
If you are running NIS, you only need to modify the services database once per NIS master. On some hosts the NIS database and commands are in the /var/yp directory; on others, NIS is found in /etc/yp.
- Log on to any host as root.
- Run lsfshutdown to shut down all the daemons in the cluster.
- To find the name of the NIS master host, use the command:
ypwhich -m services
- Log on to the NIS master host as root.
- Edit the /var/yp/src/services or /etc/yp/src/services file on the NIS master host, adding the contents of the LSF_TOP/version/install/instlib/example.services file:
# /etc/services entries for LSF daemons.
#
res 3878/tcp # remote execution server
lim 3879/udp # load information manager
mbatchd 3881/tcp # master lsbatch daemon
sbatchd 3882/tcp # slave lsbatch daemon
#
# Add this if ident is not already defined
# in your /etc/services file
ident 113/tcp auth tap # identd
Make sure that all the lines you add either contain valid service entries or begin with a comment character (#). Blank lines are not allowed.
- Change the directory to /var/yp or /etc/yp.
- Use the following command:
ypmake services
On some hosts the master copy of the services database is stored in a different location.
On systems running NIS+ the procedure is similar. Refer to your system documentation for more information.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
- Run lsfstartup to restart all daemons in the cluster.
Host Naming
LSF needs to match host names with the corresponding Internet host addresses.
LSF looks up host names and addresses in the following ways:
- In the /etc/hosts file
- Sun Network Information Service/Yellow Pages (NIS or YP)
- Internet Domain Name Service (DNS).
- DNS is also known as the Berkeley Internet Name Domain (BIND) or named, which is the name of the BIND daemon.
Each host is configured to use one or more of these mechanisms.
Network addresses
Each host has one or more network addresses; usually one for each network to which the host is directly connected. Each host can also have more than one name.
Official host name
The first name configured for each address is called the official name.
Host name aliases
Other names for the same host are called aliases.
LSF uses the configured host naming system on each host to look up the official host name for any alias or host address. This means that you can use aliases as input to LSF, but LSF always displays the official name.
Using host name ranges as aliases
The default host file syntax
ip_address official_name [alias [alias ...]]
is powerful and flexible, but it is difficult to configure in systems where a single host name has many aliases, and in multihomed host environments.
In these cases, the hosts file can become very large and unmanageable, and configuration is prone to error.
The syntax of the LSF hosts file supports host name ranges as aliases for an IP address. This simplifies the host name alias specification.
To use host name ranges as aliases, the host names must consist of a fixed node group name prefix and node indices, specified in a form like:
host_name[index_x-index_y, index_m, index_a-index_b]
For example:
atlasD0[0-3,4,5-6, ...]
is equivalent to:
atlasD0[0-6, ...]
The node list does not need to be a continuous range (some nodes can be configured out). Node indices can be numbers or letters (both upper case and lower case).
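As an illustration of how such a range expands, the following Python sketch turns a bracketed name into the list of host names it stands for. It is not LSF's parser: the function name is hypothetical, only numeric indices are handled (letter indices, which the text above also allows, are left out), and text after the closing bracket is passed through unchanged as an assumption.

```python
import re


def expand_host_range(pattern: str) -> list:
    """Expand e.g. 'atlas[0-2,5]' into ['atlas0', 'atlas1', 'atlas2', 'atlas5']."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # no bracket expression: treat as a plain host name
    prefix, body, suffix = match.groups()
    names = []
    for item in body.split(","):
        item = item.strip()
        if "-" in item:
            start, end = (int(part) for part in item.split("-", 1))
            indices = range(start, end + 1)  # ranges are inclusive at both ends
        else:
            indices = [int(item)]
        names.extend(f"{prefix}{index}{suffix}" for index in indices)
    return names


if __name__ == "__main__":
    print(expand_host_range("atlasD0[0-3,4,5-6]"))
    print(expand_host_range("atlasD0[0-6]"))
```

Both calls in the usage example produce the same seven names, which is the equivalence stated above.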
Example
Some systems map internal compute nodes to single LSF host names. A host file might contain 64 lines, each specifying an LSF host name and 32 node names that correspond to each LSF host:
...
177.16.1.1 atlasD0 atlas0 atlas1 atlas2 atlas3 atlas4 ... atlas31
177.16.1.2 atlasD1 atlas32 atlas33 atlas34 atlas35 atlas36 ... atlas63
...
In the new format, you still map the nodes to the LSF hosts, so the number of lines remains the same, but the format is simplified because you only have to specify ranges for the nodes, not each node individually as an alias:
...
177.16.1.1 atlasD0 atlas[0-31]
177.16.1.2 atlasD1 atlas[32-63]
...
You can use either an IPv4 or an IPv6 format for the IP address (if you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf).
Host name services
Solaris
On Solaris systems, the /etc/nsswitch.conf file controls the name service.
Other UNIX platforms
On other UNIX platforms, the following rules apply:
- If your host has an /etc/resolv.conf file, your host is using DNS for name lookups.
- If the command ypcat hosts prints out a list of host addresses and names, your system is looking up names in NIS.
- Otherwise, host names are looked up in the /etc/hosts file.
For more information
The man pages for the gethostbyname function, the ypbind and named daemons, the resolver functions, and the hosts, svc.conf, nsswitch.conf, and resolv.conf files explain host name lookups in more detail.
Hosts with Multiple Addresses
Multi-homed hosts
Hosts that have more than one network interface usually have one Internet address for each interface. Such hosts are called multi-homed hosts. For example, dual-stack hosts are multi-homed because they have both an IPv4 and an IPv6 network address.
LSF identifies hosts by name, so it needs to match each of these addresses with a single host name. To do this, the host name information must be configured so that all of the Internet addresses for a host resolve to the same name.
There are two ways to do this:
- Modify the system hosts file (/etc/hosts). The changes affect the whole system.
- Create an LSF hosts file (LSF_CONFDIR/hosts). LSF is then the only application that resolves the addresses to the same host.
Multiple network interfaces
Some system manufacturers recommend that each network interface, and therefore, each Internet address, be assigned a different host name. Each interface can then be directly accessed by name. This setup is often used to make sure NFS requests go to the nearest network interface on the file server, rather than going through a router to some other interface. Configuring this way can confuse LSF, because there is no way to determine that the two different names (or addresses) mean the same host. LSF provides a workaround for this problem.
All host naming systems can be configured so that host address lookups always return the same name, while still allowing access to network interfaces by different names. Each host has an official name and a number of aliases, which are other names for the same host. By configuring all interfaces with the same official name but different aliases, you can refer to each interface by a different alias name while still providing a single official name for the host.
Configuring the LSF hosts file
If your LSF clusters include hosts that have more than one interface and are configured with more than one official host name, you must either modify the host name configuration, or create a private hosts file for LSF to use.
The LSF hosts file is stored in LSF_CONFDIR. The format of LSF_CONFDIR/hosts is the same as for /etc/hosts.
In the LSF hosts file, duplicate the system hosts database information, except make all entries for the host use the same official name. Configure all the other names for the host as aliases so that you can still refer to the host by any name.
Example
For example, if your /etc/hosts file contains:
AA.AA.AA.AA host-AA host # first interface
BB.BB.BB.BB host-BB # second interface
then the LSF_CONFDIR/hosts file should contain:
AA.AA.AA.AA host host-AA # first interface
BB.BB.BB.BB host host-BB # second interface
Example /etc/hosts entries
No unique official name
The following example is for a host with two interfaces, where the host does not have a unique official name.
# Address Official name Aliases
# Interface on network A
AA.AA.AA.AA host-AA.domain host.domain host-AA host
# Interface on network B
BB.BB.BB.BB host-BB.domain host-BB host
Looking up the address AA.AA.AA.AA finds the official name host-AA.domain. Looking up address BB.BB.BB.BB finds the name host-BB.domain. No information connects the two names, so there is no way for LSF to determine that both names, and both addresses, refer to the same host.
To resolve this case, you must configure these addresses using a unique host name. If you cannot make this change to the system file, you must create an LSF hosts file and configure these addresses using a unique host name in that file.
Both addresses have the same official name
Here is the same example, with both addresses configured for the same official name.
# Address Official name Aliases
# Interface on network A
AA.AA.AA.AA host.domain host-AA.domain host-AA host
# Interface on network B
BB.BB.BB.BB host.domain host-BB.domain host-BB host
With this configuration, looking up either address returns host.domain as the official name for the host. LSF (and all other applications) can determine that all the addresses and host names refer to the same host. Individual interfaces can still be specified by using the host-AA and host-BB aliases.
Example for a dual-stack host
Dual-stack hosts have more than one IP address. You must associate the host name with both addresses, as shown in the following example:
# Address Official name Aliases
# Interface IPv4
AA.AA.AA.AA host.domain host-AA.domain
# Interface IPv6
BBBB:BBBB:BBBB:BBBB:BBBB:BBBB::BBBB host.domain host-BB.domain
With this configuration, looking up either address returns host.domain as the official name for the host. LSF (and all other applications) can determine that all the addresses and host names refer to the same host. Individual interfaces can still be specified by using the host-AA and host-BB aliases.
Sun Solaris example
For example, Sun NIS uses the /etc/hosts file on the NIS master host as input, so the format for NIS entries is the same as for the /etc/hosts file. Since LSF can resolve this case, you do not need to create an LSF hosts file.
DNS configuration
The configuration format is different for DNS. The same result can be produced by configuring two address (A) records for each Internet address. Following the previous example:
# name class type address
host.domain IN A AA.AA.AA.AA
host.domain IN A BB.BB.BB.BB
host-AA.domain IN A AA.AA.AA.AA
host-BB.domain IN A BB.BB.BB.BB
Looking up the official host name can return either address. Looking up the interface-specific names returns the correct address for each interface.
For a dual-stack host, the IPv6 address uses an AAAA record rather than an A record:
# name class type address
host.domain IN A AA.AA.AA.AA
host.domain IN AAAA BBBB:BBBB:BBBB:BBBB:BBBB:BBBB::BBBB
host-AA.domain IN A AA.AA.AA.AA
host-BB.domain IN AAAA BBBB:BBBB:BBBB:BBBB:BBBB:BBBB::BBBB
PTR records in DNS
Address-to-name lookups in DNS are handled using PTR records. The PTR records for both addresses should be configured to return the official name:
# address class type name
AA.AA.AA.AA.in-addr.arpa IN PTR host.domain
BB.BB.BB.BB.in-addr.arpa IN PTR host.domain
For a dual-stack host:
# address class type name
AA.AA.AA.AA.in-addr.arpa IN PTR host.domain
BBBB:BBBB:BBBB:BBBB:BBBB:BBBB::BBBB.ip6.arpa IN PTR host.domain
The addresses above are placeholders. In a real zone file, an IPv4 PTR name lists the octets in reverse order under in-addr.arpa, and an IPv6 PTR name lists the hexadecimal digits of the full address in reverse order, separated by dots, under ip6.arpa.
If it is not possible to change the system host name database, create the hosts file local to the LSF system, and configure entries for the multi-homed hosts only. Host names and addresses not found in the hosts file are looked up in the standard name system on your host.
Using IPv6 Addresses
IP addresses can have either a dotted quad notation (IPv4) or IP Next Generation (IPv6) format. You can use IPv6 addresses if you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf; you do not have to map IPv4 addresses to an IPv6 format.
LSF supports IPv6 addresses for the following platforms:
- Linux 2.4
- Linux 2.6
- Solaris 10
- Windows
- XP
- 2003
- 2000 with Service Pack 1 or higher
- AIX 5
- HP-UX
- 11i
- 11iv1
- 11iv2
- 11.11
- SGI Altix ProPack 3, 4, and 5
- IRIX 6.5.19 and higher, Trusted IRIX 6.5.19 and higher
- Mac OS 10.2 and higher
- Cray XT3
- IBM Power 5 Series
Enable both IPv4 and IPv6 support
- Configure the parameter LSF_ENABLE_SUPPORT_IPV6=Y in lsf.conf.
Configure hosts for IPv6
Follow the steps in this procedure if you do not have an IPv6-enabled DNS server or an IPv6-enabled router. IPv6 is supported on some linux2.4 kernels and on all linux2.6 kernels.
- Configure the kernel.
- Does the entry /proc/net/if_inet6 exist?
- If it does not, load the IPv6 module into the kernel by executing the following command as root:
modprobe ipv6
- To check that the module loaded correctly, execute the command:
lsmod | grep -w 'ipv6'
- Add an IPv6 address to the host by executing the following command as root:
/sbin/ifconfig eth0 inet6 add 3ffe:ffff:0:f101::2/64
- Display the IPv6 address using ifconfig.
- Repeat step 1 through step 3 for other hosts in the cluster.
- To configure IPv6 networking, add the addresses for all IPv6 hosts to /etc/hosts on each host.
note:
For IPv6 networking, hosts must be on the same subnet.
- Test IPv6 communication between hosts using the command ping6.
Specify host names with condensed notation
A number of commands often require you to specify host names. You can now specify host name ranges instead. You can use condensed notation with the following commands:
bacct
bhist
bjobs
bmig
bmod
bpeek
brestart
brsvadd
brsvmod
brsvs
brun
bsub
bswitch
You must specify a valid range of hosts, where the start number is smaller than the end number.
- Run the command you want and specify the host names as a range.
For example:
bsub -m "host[1-100].corp.com"
The job is submitted to host1.corp.com, host2.corp.com, host3.corp.com, all the way to host100.corp.com.
- Run the command you want and specify host names as a combination of ranges and individuals.
For example:
bsub -m "host[1-10,12,20-25].corp.com"
The job is submitted to host1.corp.com, host2.corp.com, host3.corp.com, up to and including host10.corp.com. It is also submitted to host12.corp.com and the hosts between and including host20.corp.com and host25.corp.com.
Host Groups
You can define a host group within LSF or use an external executable to retrieve host group members.
Use bhosts to view a list of existing hosts. Use bmgroup to view host group membership.
Where to use host groups
LSF host groups can be used in defining the following parameters in LSF configuration files:
- HOSTS in lsb.queues for authorized hosts for the queue
- HOSTS in lsb.hosts in the HostPartition section to list host groups that are members of the host partition
Configure host groups
- Log in as the LSF administrator to any host in the cluster.
- Open lsb.hosts.
- Add the HostGroup section if it does not exist.
Begin HostGroup
GROUP_NAME GROUP_MEMBER
groupA (all)
groupB (groupA ~hostA ~hostB)
groupC (hostX hostY hostZ)
groupD (groupC ~hostX)
groupE (all ~groupC ~hostB)
groupF (hostF groupC hostK)
desk_tops (hostD hostE hostF hostG)
Big_servers (!)
End HostGroup
- Enter a group name under the GROUP_NAME column.
External host groups must be defined in the egroup executable.
- Specify hosts in the GROUP_MEMBER column.
(Optional) To tell LSF that the group members should be retrieved using egroup, put an exclamation mark (!) in the GROUP_MEMBER column.
- Save your changes.
- Run badmin ckconfig to check the group definition. If any errors are reported, fix the problem and check the configuration again.
- Run badmin mbdrestart to apply the new configuration.
Using wildcards and special characters to define host names
You can use special characters when defining host group members under the GROUP_MEMBER column to specify hosts. These are useful to define several hosts in a single entry, such as for a range of hosts, or for all host names with a certain text string.
If a host matches more than one host group, that host is a member of all groups. If any host group is a condensed host group, the status and other details of the hosts are counted towards all of the matching host groups.
When defining host group members, you can use string literals and the following special characters:
- Use a tilde (~) to exclude specified hosts or host groups from the list. The tilde can be used in conjunction with the other special characters listed below. The following example matches all hosts in the cluster except for hostA, hostB, and all members of the groupA host group:
... (all ~hostA ~hostB ~groupA)
- Use an asterisk (*) as a wildcard character to represent any number of characters. The following example matches all hosts beginning with the text string "hostC" (such as hostCa, hostC1, or hostCZ1):
... (hostC*)
- Use square brackets with a hyphen ([integer1-integer2]) to define a range of non-negative integers at the end of a host name. The first integer must be less than the second integer. The following example matches all hosts from hostD51 to hostD100:
... (hostD[51-100])
- Use square brackets with commas ([integer1, integer2 ...]) to define individual non-negative integers at the end of a host name. The following example matches hostD101, hostD123, and hostD321:
... (hostD[101,123,321])
- Use square brackets with commas and hyphens (such as [integer1-integer2, integer3, integer4-integer5]) to define different ranges of non-negative integers at the end of a host name. The following example matches all hosts from hostD1 to hostD100, hostD102, all hosts from hostD201 to hostD300, and hostD320:
... (hostD[1-100,102,201-300,320])
Restrictions
You cannot use more than one set of square brackets in a single host group definition.
The following example is not correct:
... (hostA[1-10]B[1-20] hostC[101-120])
The following example is correct:
... (hostA[1-20] hostC[101-120])
You cannot define subgroups that contain wildcards and special characters. The following definition for groupB is not correct because groupA defines hosts with a wildcard:
Begin HostGroup
GROUP_NAME GROUP_MEMBER
groupA (hostA*)
groupB (groupA)
End HostGroup
Defining condensed host groups
You can define condensed host groups to display information for their hosts as a summary for the entire group. This is useful because it allows you to see the total statistics of the host group as a whole instead of having to add up the data yourself. This allows you to better plan the distribution of jobs submitted to the hosts and host groups in your cluster.
To define condensed host groups, add a CONDENSE column to the HostGroup section. Under this column, enter Y to define a condensed host group or N to define an uncondensed host group, as shown in the following:
Begin HostGroup
GROUP_NAME CONDENSE GROUP_MEMBER
groupA Y (hostA hostB hostD)
groupB N (hostC hostE)
End HostGroup
The following commands display condensed host group information:
bhosts
bhosts -w
bjobs
bjobs -w
For the bhosts output of this configuration, see Viewing Host Information.
Use bmgroup -l to see whether host groups are condensed or not.
Hosts belonging to multiple condensed host groups
If you configure a host to belong to more than one condensed host group using wildcards, bjobs can display any of the host groups as execution host name.
For example, host groups hg1 and hg2 include the same hosts:
Begin HostGroup
GROUP_NAME CONDENSE GROUP_MEMBER # Key words
hg1 Y (host*)
hg2 Y (hos*)
End HostGroup
Submit jobs using bsub -m:
bsub -m "hg2" sleep 1001
bjobs displays hg1 as the execution host instead of hg2:
bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
520 user1 RUN normal host5 hg1 sleep 1001 Apr 15 13:50
521 user1 RUN normal host5 hg1 sleep 1001 Apr 15 13:50
522 user1 PEND normal host5 sleep 1001 Apr 15 13:51
Importing external host groups (egroup)
When the membership of a host group changes frequently, or when the group contains a large number of members, you can use an external executable called egroup to retrieve a list of members rather than having to configure the group membership manually. You can write a site-specific egroup executable that retrieves host group names and the hosts that belong to each group. For information about how to use the external host and user groups feature, see the Platform LSF Configuration Reference.
Compute Units
Compute units are similar to host groups, with the added feature of granularity allowing the construction of clusterwide structures that mimic network architecture. Job scheduling using compute unit resource requirements optimizes job placement based on the underlying system architecture, minimizing communications bottlenecks. Compute units are especially useful when running communication-intensive parallel jobs spanning several hosts.
Resource requirement strings can specify compute unit requirements such as running a job exclusively (excl), spreading a job evenly over multiple compute units (balance), or choosing compute units based on other criteria.
For a complete description of compute units see Controlling Job Locality using Compute Units in Chapter 34, "Running Parallel Jobs".
Compute unit configuration
To enforce consistency, compute unit configuration has the following requirements:
- Hosts and host groups appear in the finest granularity compute unit type, and nowhere else.
- Hosts appear in the membership list of at most one compute unit of the finest granularity.
- All compute units of the same type have the same type of compute units (or hosts) as members.
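The three requirements can be expressed as a small consistency check. The Python sketch below is only an illustration with a hypothetical function name and data layout: compute unit types are given finest first, and each compute unit maps to its type and member names. Any member that is not itself a compute unit is treated as a host, so host groups are not expanded into their hosts. The names in the usage example are illustrative.

```python
from collections import Counter


def check_compute_units(types: list, units: dict) -> list:
    """Return a list of problems; an empty list means all checks passed.

    types: compute unit types, finest granularity first.
    units: maps a compute unit name to (type, list of member names).
    """
    problems = []
    finest = types[0]
    host_counts = Counter()
    member_kind = {}  # compute unit type -> kind of member first seen for it
    for name, (unit_type, members) in units.items():
        for member in members:
            kind = units[member][0] if member in units else "host"
            if kind == "host" and unit_type != finest:
                problems.append(f"{name}: {member} belongs in a {finest} unit")
            if kind == "host" and unit_type == finest:
                host_counts[member] += 1
            expected = member_kind.setdefault(unit_type, kind)
            if kind != expected:
                problems.append(f"{name}: {member} is a {kind}, expected a {expected}")
    for host, count in host_counts.items():
        if count > 1:
            problems.append(f"{host}: listed in {count} {finest} units")
    return problems


if __name__ == "__main__":
    unit_types = ["enclosure", "rack", "cabinet"]
    layout = {
        "encl1": ("enclosure", ["hostA", "hostB"]),
        "encl2": ("enclosure", ["hostA", "hostD"]),  # hostA repeated on purpose
        "rack1": ("rack", ["encl1", "encl2"]),
    }
    for problem in check_compute_units(unit_types, layout):
        print(problem)
```

A check of this kind is a convenience for reviewing a planned layout; badmin ckconfig remains the authoritative check of the actual configuration.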
tip:
Configure each individual host as a compute unit to use the compute unit features for host level job allocation.
Where to use compute units
LSF compute units can be used in defining the following parameters in LSF configuration files:
- EXCLUSIVE in lsb.queues for the compute unit type allowed for the queue.
- HOSTS in lsb.queues for the hosts on which jobs from this queue can be run.
- RES_REQ in lsb.queues for queue compute unit resource requirements.
- RES_REQ in lsb.applications for application profile compute unit resource requirements.
Configure compute units
- Log in as the LSF administrator to any host in the cluster.
- Open lsb.params.
- Add the COMPUTE_UNIT_TYPES parameter if it does not already exist and list your compute unit types in order of granularity (finest first).
COMPUTE_UNIT_TYPES=enclosure rack cabinet
- Save your changes.
- Open lsb.hosts.
- Add the ComputeUnit section if it does not exist.
Begin ComputeUnit
NAME MEMBER TYPE
encl1 (hostA hg1) enclosure
encl2 (hostC hostD) enclosure
encl3 (hostE hostF) enclosure
encl4 (hostG hg2) enclosure
rack1 (encl1 encl2) rack
rack2 (encl3 encl4) rack
cab1 (rack1 rack2) cabinet
End ComputeUnit
- Enter a compute unit name under the NAME column.
External compute units must be defined in the egroup executable.
- Specify hosts or host groups in the MEMBER column of the finest granularity compute unit type. Specify compute units in the MEMBER column of coarser compute unit types.
(Optional) To tell LSF that the compute unit members of a finest granularity compute unit should be retrieved using egroup, put an exclamation mark (!) in the MEMBER column.
- Specify the type of compute unit in the TYPE column.
- Save your changes.
- Run badmin ckconfig to check the compute unit definition. If any errors are reported, fix the problem and check the configuration again.
- Run badmin mbdrestart to apply the new configuration.
To view configured compute units, run bmgroup -cu.
Using wildcards and special characters to define names in compute units
You can use special characters when defining compute unit members under the MEMBER column to specify hosts, host groups, and compute units. These are useful to define several names in a single entry such as a range of hosts, or for all names with a certain text string.
When defining host, host group, and compute unit members of compute units, you can use string literals and the following special characters:
- Use a tilde (~) to exclude specified hosts, host groups, or compute units from the list. The tilde can be used in conjunction with the other special characters listed below. The following example matches all hosts in group12 except for hostA and hostB:
... (group12 ~hostA ~hostB)
- Use an asterisk (*) as a wildcard character to represent any number of characters. The following example matches all hosts beginning with the text string "hostC" (such as hostCa, hostC1, or hostCZ1):
... (hostC*)
- Use square brackets with a hyphen ([integer1-integer2]) to define a range of non-negative integers at the end of a name. The first integer must be less than the second integer. The following example matches all hosts from hostD51 to hostD100:
... (hostD[51-100])
- Use square brackets with commas ([integer1, integer2 ...]) to define individual non-negative integers at the end of a name. The following example matches hostD101, hostD123, and hostD321:
... (hostD[101,123,321])
- Use square brackets with commas and hyphens (such as [integer1-integer2, integer3, integer4-integer5]) to define different ranges of non-negative integers at the end of a name. The following example matches all hosts from hostD1 to hostD100, hostD102, all hosts from hostD201 to hostD300, and hostD320:
... (hostD[1-100,102,201-300,320])
Restrictions
You cannot use more than one set of square brackets in a single compute unit definition.
The following example is not correct:
... (hostA[1-10]B[1-20] hostC[101-120])
The following example is correct:
... (hostA[1-20] hostC[101-120])
The keywords all, allremote, all@cluster, other, and default cannot be used when defining compute units.
Defining condensed compute units
You can define condensed compute units to display information for their hosts as a summary for the entire group, including the slot usage for each compute unit. This is useful because it allows you to see statistics of the compute unit as a whole instead of having to add up the data yourself. This allows you to better plan the distribution of jobs submitted to the hosts and compute units in your cluster.
To define condensed compute units, add a CONDENSE column to the ComputeUnit section. Under this column, enter Y to define a condensed compute unit or N to define an uncondensed compute unit, as shown in the following:
Begin ComputeUnit
NAME CONDENSE MEMBER TYPE
enclA Y (hostA hostB hostD) enclosure
enclB N (hostC hostE) enclosure
End ComputeUnit
The following commands display condensed host information:
bhosts
bhosts -w
bjobs
bjobs -w
For the
bhosts
output of this configuration, see Viewing Host Information.Use
bmgroup -l
to see whether host groups are condensed or not.Importing external host groups (egroup)
When the membership of a compute unit changes frequently, or when the compute unit contains a large number of members, you can use an external executable called egroup to retrieve a list of members rather than having to configure the membership manually. You can write a site-specific egroup executable that retrieves compute unit names and the hosts that belong to each group, and compute units of the finest granularity can contain egroups as members. For information about how to use the external host and user groups feature, see the Platform LSF Configuration Reference.
Using compute units with advance reservation
When running exclusive compute unit jobs (with the resource requirement cu[excl]), the advance reservation can affect hosts outside the advance reservation but in the same compute unit as follows:
- An exclusive compute unit job dispatched to a host inside the advance reservation will lock the entire compute unit, including any hosts outside the advance reservation.
- An exclusive compute unit job dispatched to a host outside the advance reservation will lock the entire compute unit, including any hosts inside the advance reservation.
Ideally, the hosts belonging to a compute unit should be either all inside or all outside an advance reservation.
Tuning CPU Factors
CPU factors are used to differentiate the relative speed of different machines. LSF runs jobs on the best possible machines so that response time is minimized.
To achieve this, it is important that you define correct CPU factors for each machine model in your cluster.
How CPU factors affect performance
Incorrect CPU factors can reduce performance in the following ways:
- If the CPU factor for a host is too low, that host may not be selected for job placement when a slower host is available. This means that jobs would not always run on the fastest available host.
- If the CPU factor is too high, jobs are run on the fast host even when they would finish sooner on a slower but lightly loaded host. This causes the faster host to be overused while the slower hosts are underused.
Both of these conditions are somewhat self-correcting. If the CPU factor for a host is too high, jobs are sent to that host until the CPU load threshold is reached. LSF then marks that host as busy, and no further jobs will be sent there. If the CPU factor is too low, jobs may be sent to slower hosts. This increases the load on the slower hosts, making LSF more likely to schedule future jobs on the faster host.
Guidelines for setting CPU factors
CPU factors should be set based on a benchmark that reflects your workload. If there is no such benchmark, CPU factors can be set based on raw CPU power.
Set the CPU factor of the slowest hosts to 1, and set the CPU factors of faster hosts in proportion to their speed relative to the slowest hosts.
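The guideline above amounts to dividing the slowest benchmark time by each host's benchmark time. The following Python sketch shows that arithmetic; cpu_factors is a hypothetical helper written for this illustration, and the benchmark times are sample values rather than measurements.

```python
# Hypothetical helper (not part of LSF): derives CPU factors from the time
# each host model takes to run the same benchmark.
def cpu_factors(benchmark_seconds):
    if not benchmark_seconds:
        raise ValueError("at least one benchmark time is required")
    if any(seconds <= 0 for seconds in benchmark_seconds.values()):
        raise ValueError("benchmark times must be positive")
    # The slowest host takes the longest time and gets CPU factor 1.
    slowest = max(benchmark_seconds.values())
    # A host that finishes in half the time gets twice the factor.
    return {host: slowest / seconds for host, seconds in benchmark_seconds.items()}


print(cpu_factors({"hostA": 30.0, "hostB": 15.0}))
# {'hostA': 1.0, 'hostB': 2.0}
```

The resulting values are the numbers you would enter in the CPUFACTOR column for the corresponding host models.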
Example
Consider a cluster with two hosts: hostA and hostB. In this cluster, hostA takes 30 seconds to run a benchmark and hostB takes 15 seconds to run the same test. The CPU factor for hostA should be 1, and the CPU factor of hostB should be 2 because it is twice as fast as hostA.
View normalized ratings
Run lsload -N to display normalized ratings.
LSF uses a normalized CPU performance rating to decide which host has the most available CPU power. Hosts in your cluster are displayed in order from best to worst. Normalized CPU run queue length values are based on an estimate of the time it would take each host to run one additional unit of work, given that an unloaded host with CPU factor 1 runs one unit of work in one unit of time.
Tune CPU factors
- Log in as the LSF administrator on any host in the cluster.
- Edit lsf.shared, and change the HostModel section:
  Begin HostModel
  MODELNAME   CPUFACTOR   ARCHITECTURE   # keyword
  #HPUX (HPPA)
  HP9K712S    2.5         (HP9000712_60)
  HP9K712M    2.5         (HP9000712_80)
  HP9K712F    4.0         (HP9000712_100)
  See the Platform LSF Configuration Reference for information about the lsf.shared file.
- Save the changes to lsf.shared.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
Handling Host-level Job Exceptions
You can configure hosts so that LSF detects exceptional conditions while jobs are running and takes appropriate action automatically. You can customize which exceptions are detected and the corresponding actions. By default, LSF does not detect any exceptions.
Host exceptions LSF can detect
If you configure host exception handling, LSF can detect jobs that exit repeatedly on a host. The host can still be available to accept jobs, but some other problem prevents the jobs from running. Typically jobs dispatched to such "black hole", or "job-eating" hosts exit abnormally. LSF monitors the job exit rate for hosts, and closes the host if the rate exceeds a threshold you configure (EXIT_RATE in lsb.hosts).
If EXIT_RATE is specified for the host, LSF invokes eadmin if the job exit rate for a host remains above the configured threshold for longer than 5 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change how frequently LSF checks the job exit rate.
Use GLOBAL_EXIT_RATE in lsb.params to set a cluster-wide threshold in minutes for exited jobs. If EXIT_RATE is not specified for the host in lsb.hosts, GLOBAL_EXIT_RATE defines a default exit rate for all hosts in the cluster. Host-level EXIT_RATE overrides the GLOBAL_EXIT_RATE value.
Configuring host exception handling (lsb.hosts)
EXIT_RATE
Specify a threshold for exited jobs. If the job exit rate is exceeded for 5 minutes or the period specified by JOB_EXIT_RATE_DURATION in lsb.params, LSF invokes eadmin to trigger a host exception.
Example
The following Host section defines a job exit rate of 20 jobs for all hosts, and an exit rate of 10 jobs on hostA.
Begin Host
HOST_NAME   MXJ   EXIT_RATE   # Keywords
Default     !     20
hostA       !     10
End Host
Configuring thresholds for host exception handling
By default, LSF checks the number of exited jobs every 5 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change this default.
Tuning
Tip: Tune JOB_EXIT_RATE_DURATION carefully. Shorter values may raise false alarms; longer values may not trigger exceptions frequently enough.
Example
In the following diagram, the job exit rate of hostA exceeds the configured threshold (EXIT_RATE for hostA in lsb.hosts). LSF monitors hostA from time t1 to time t2 (t2 = t1 + JOB_EXIT_RATE_DURATION in lsb.params). At t2, the exit rate is still high, and a host exception is detected. At t3 (EADMIN_TRIGGER_DURATION in lsb.params), LSF invokes eadmin and the host exception is handled. By default, LSF closes hostA and sends email to the LSF administrator. Since hostA is closed and cannot accept any new jobs, the exit rate drops quickly.
![]()
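The detection timeline described above can be modeled roughly as follows. This is an illustrative Python sketch, not LSF's implementation: both function names, the sampled (time, exit rate) input, the use of minutes as the time unit, and the strict greater-than comparison are assumptions made for this example. The sketch covers only the window from t1 to t2 and does not model the later point at which eadmin is invoked (EADMIN_TRIGGER_DURATION).

```python
# Hypothetical model of host exception detection (not LSF code).

def effective_exit_rate(host_exit_rate, global_exit_rate):
    # A host-level EXIT_RATE overrides GLOBAL_EXIT_RATE; None means "not set".
    return host_exit_rate if host_exit_rate is not None else global_exit_rate


def detect_host_exception(samples, threshold, duration):
    """Return the time t2 at which a host exception is detected, or None.

    samples   -- (time_in_minutes, exit_rate) pairs in increasing time order
    threshold -- the exit rate threshold that applies to the host
    duration  -- stand-in for JOB_EXIT_RATE_DURATION, in minutes
    """
    exceeded_since = None  # t1: when the rate first rose above the threshold
    for time_now, rate in samples:
        if rate > threshold:
            if exceeded_since is None:
                exceeded_since = time_now
            # t2 = t1 + duration: the rate stayed high for the whole window.
            if time_now - exceeded_since >= duration:
                return time_now
        else:
            # The rate dropped back, so the monitoring window starts over.
            exceeded_since = None
    return None


threshold = effective_exit_rate(host_exit_rate=10, global_exit_rate=20)
samples = [(0, 5), (1, 12), (2, 15), (3, 14), (4, 13), (5, 12), (6, 11)]
print(detect_host_exception(samples, threshold, duration=5))
# 6
```

In this sample data the rate first exceeds the threshold at minute 1 and is still above it at minute 6, so the exception is detected at minute 6. A single dip below the threshold restarts the window, which is one way to picture why shorter durations are more prone to false alarms.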
Platform Computing Inc.
www.platform.com |