Working with Your Cluster
Contents
- Viewing cluster information
- Example directory structures
- Cluster administrators
- Controlling daemons
- Controlling mbatchd
- Reconfiguring your cluster
Viewing cluster information
LSF provides commands for users to access information about the cluster. Cluster information includes the cluster master host, cluster name, cluster resource definitions, cluster administrator, and so on.
| To view the ... | Run ... |
| --- | --- |
| Version of LSF | lsid |
| Cluster name | lsid |
| Current master host | lsid |
| Cluster administrators | lsclusters |
| Configuration parameters | bparams |
View LSF version, cluster name, and current master host
- Run lsid to display the version of LSF, the name of your cluster, and the current master host:

lsid
Platform LSF 7 Update 5 March 3 2009
Copyright 1992-2009 Platform Computing Corporation
My cluster name is cluster1
My master name is hostA

View cluster administrators
- Run lsclusters to find out who your cluster administrator is and see a summary of your cluster:

lsclusters
CLUSTER_NAME   STATUS   MASTER_HOST   ADMIN      HOSTS   SERVERS
cluster1       ok       hostA         lsfadmin   6       6

If you are using the LSF MultiCluster product, the output of lsclusters contains one line for each of the clusters that your local cluster is connected to.

View configuration parameters
- Run bparams to display the generic configuration parameters of LSF. These include default queues, job dispatch interval, job checking interval, and job accepting interval.

bparams
Default Queues: normal idle
Job Dispatch Interval: 20 seconds
Job Checking Interval: 15 seconds
Job Accepting Interval: 20 seconds

- Run bparams -l to display the information in long format, which gives a brief description of each parameter and the name of the parameter as it appears in lsb.params.

bparams -l
System default queues for automatic queue selection:
DEFAULT_QUEUE = normal idle
The interval for dispatching jobs by master batch daemon:
MBD_SLEEP_TIME = 20 (seconds)
The interval for checking jobs by slave batch daemon:
SBD_SLEEP_TIME = 15 (seconds)
The interval for a host to accept two batch jobs subsequently:
JOB_ACCEPT_INTERVAL = 1 (* MBD_SLEEP_TIME)
The idle time of a host for resuming pg suspended jobs:
PG_SUSP_IT = 180 (seconds)
The amount of time during which finished jobs are kept in core:
CLEAN_PERIOD = 3600 (seconds)
The maximum number of finished jobs that are logged in current event file:
MAX_JOB_NUM = 2000
The maximum number of retries for reaching a slave batch daemon:
MAX_SBD_FAIL = 3
The number of hours of resource consumption history:
HIST_HOURS = 5
The default project assigned to jobs.
DEFAULT_PROJECT = default
Sync up host status with master LIM is enabled:
LSB_SYNC_HOST_STAT_LIM = Y
MBD child query processes will only run on the following CPUs:
MBD_QUERY_CPUS=1 2 3

- Run bparams -a to display all configuration parameters and their values in lsb.params.

For example:

bparams -a
lsb.params configuration at Fri Jun 8 10:27:52 CST 2007
MBD_SLEEP_TIME = 20
SBD_SLEEP_TIME = 15
JOB_ACCEPT_INTERVAL = 1
SUB_TRY_INTERVAL = 60
LSB_SYNC_HOST_STAT_LIM = N
MAX_JOBINFO_QUERY_PERIOD = 2147483647
PEND_REASON_UPDATE_INTERVAL = 30

Viewing daemon parameter configuration
- Display all configuration settings for running LSF daemons.
  - Use lsadmin showconf to display all configured parameters and their values in lsf.conf or ego.conf for LIM.
  - Use badmin showconf to display all configured parameters and their values in lsf.conf or ego.conf for mbatchd and sbatchd.

In a MultiCluster environment, lsadmin showconf and badmin showconf only display the parameters of daemons on the local cluster.

Running lsadmin showconf and badmin showconf from a master candidate host will reach all server hosts in the cluster. Running lsadmin showconf and badmin showconf from a slave-only host may not be able to reach other slave-only hosts.

You cannot run lsadmin showconf and badmin showconf from client hosts. lsadmin shows only server host configuration, not client host configuration.
lsadmin showconf and badmin showconf only display the values used by LSF.
lsadmin showconf and badmin showconf display EGO_MASTER_LIST from wherever it is defined. You can define either LSF_MASTER_LIST in lsf.conf or EGO_MASTER_LIST in ego.conf. LIM reads lsf.conf first, and then ego.conf if EGO is enabled in the LSF cluster. LIM only takes the value of LSF_MASTER_LIST if EGO_MASTER_LIST is not defined at all in lsf.conf.

For example, if EGO is enabled in the LSF cluster, and you define LSF_MASTER_LIST in lsf.conf and EGO_MASTER_LIST in ego.conf, lsadmin showconf and badmin showconf display the value of EGO_MASTER_LIST in ego.conf.

If EGO is disabled, ego.conf is not loaded, so whatever is defined in lsf.conf is displayed.

- Display mbatchd and root sbatchd configuration.
  - Use badmin showconf mbd to display the parameters configured in lsf.conf or ego.conf that apply to mbatchd.
  - Use badmin showconf sbd to display the parameters configured in lsf.conf or ego.conf that apply to root sbatchd.
- Display LIM configuration.
Use lsadmin showconf lim to display the parameters configured in lsf.conf or ego.conf that apply to root LIM.

By default, lsadmin displays the local LIM parameters. You can specify a host to display the LIM parameters for that host.

Examples
- Show mbatchd configuration:

badmin showconf mbd
MBD configuration at Fri Jun 8 10:27:52 CST 2007
LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
LSF_LOG_MASK=LOG_WARNING
LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
LSF_EGO_DAEMON_CONTROL=N
...

- Show sbatchd configuration on a specific host:

badmin showconf sbd hosta
SBD configuration for host <hosta> at Fri Jun 8 10:27:52 CST 2007
LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
LSF_LOG_MASK=LOG_WARNING
LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
LSF_EGO_DAEMON_CONTROL=N
...

- Show sbatchd configuration for all hosts:

badmin showconf sbd all
SBD configuration for host <hosta> at Fri Jun 8 10:27:52 CST 2007
LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
LSF_LOG_MASK=LOG_WARNING
LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
LSF_EGO_DAEMON_CONTROL=N
...
SBD configuration for host <hostb> at Fri Jun 8 10:27:52 CST 2007
LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
LSF_LOG_MASK=LOG_WARNING
LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
LSF_EGO_DAEMON_CONTROL=N
...

- Show lim configuration:

lsadmin showconf lim
LIM configuration for host <hosta> at Fri Jun 8 10:27:52 CST 2007
LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
LSF_LOG_MASK=LOG_WARNING
LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
LSF_EGO_DAEMON_CONTROL=N
...

- Show lim configuration for a specific host:

lsadmin showconf lim hosta
LIM configuration for host <hosta> at Fri Jun 8 10:27:52 CST 2007
LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
LSF_LOG_MASK=LOG_WARNING
LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
LSF_EGO_DAEMON_CONTROL=N
...

- Show lim configuration for all hosts:

lsadmin showconf lim all
LIM configuration for host <hosta> at Fri Jun 8 10:27:52 CST 2007
LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
LSF_LOG_MASK=LOG_WARNING
LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
LSF_EGO_DAEMON_CONTROL=N
...
LIM configuration for host <hostb> at Fri Jun 8 10:27:52 CST 2007
LSB_SHAREDIR=/scratch/dev/lsf/user1/0604/work
LSF_CONFDIR=/scratch/dev/lsf/user1/0604/conf
LSF_LOG_MASK=LOG_WARNING
LSF_ENVDIR=/scratch/dev/lsf/user1/0604/conf
LSF_EGO_DAEMON_CONTROL=N
...

Example directory structures
UNIX and Linux
The following figures show typical directory structures for a new UNIX or Linux installation with lsfinstall. Depending on which products you have installed and platforms you have selected, your directory structure may vary.
![]()
Microsoft Windows
The following diagram shows an example directory structure for a Windows installation.
![]()
Cluster administrators
Primary cluster administrator
Required. The first cluster administrator, specified during installation. The primary LSF administrator account owns the configuration and log files. The primary LSF administrator has permission to perform clusterwide operations, change configuration files, reconfigure the cluster, and control jobs submitted by all users.
Other cluster administrators
Optional. May be configured during or after installation.
Cluster administrators can perform administrative operations on all jobs and queues in the cluster. Cluster administrators have the same cluster-wide operational privileges as the primary LSF administrator except that they do not have permission to change LSF configuration files.
Add cluster administrators
- In the ClusterAdmins section of lsf.cluster.cluster_name, specify the list of cluster administrators following ADMINISTRATORS, separated by spaces.

You can specify user names and group names.

The first administrator in the list is the primary LSF administrator. All others are cluster administrators.

For example:

Begin ClusterAdmins
ADMINISTRATORS = lsfadmin admin1 admin2
End ClusterAdmins

- Save your changes.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin mbdrestart to restart mbatchd.

Controlling daemons
Permissions required
To control all daemons in the cluster, you must
- Be logged on as root or as a user listed in the /etc/lsf.sudoers file. See the Platform LSF Configuration Reference for configuration details of lsf.sudoers.
- Be able to run the rsh or ssh commands across all LSF hosts without having to enter a password. See your operating system documentation for information about configuring the rsh and ssh commands. The shell command specified by LSF_RSH in lsf.conf is used before rsh is tried.

Daemon commands
The following is an overview of commands you use to control LSF daemons.
sbatchd
Restarting sbatchd on a host does not affect jobs that are running on that host.

If sbatchd is shut down, the host is not available to run new jobs. Existing jobs running on that host continue, but the results are not sent to the user until sbatchd is restarted.

LIM and RES
Jobs running on the host are not affected by restarting the daemons.
If a daemon is not responding to network connections, lsadmin displays an error message with the host name. In this case you must kill and restart the daemon manually.

If the LIM and the other daemons on the current master host shut down, another host automatically takes over as master.
If the RES is shut down while remote interactive tasks are running on the host, the running tasks continue but no new tasks are accepted.
Controlling mbatchd
You use the badmin command to control mbatchd.

Reconfigure mbatchd
If you add a host to a host group, a host to a queue, or change resource configuration in the Hosts section of lsf.cluster.cluster_name, the change is not recognized by jobs that were submitted before you reconfigured. If you want the new host to be recognized, you must restart mbatchd.
- Run badmin reconfig.

When you reconfigure the cluster, mbatchd is not restarted. Only configuration files are reloaded.

Restart mbatchd
- Run badmin mbdrestart.

LSF checks configuration files for errors and prints the results to stderr. If no errors are found, the following occurs:
- Configuration files are reloaded
- mbatchd is restarted
- Events in lsb.events are reread and replayed to recover the running state of the last mbatchd

tip:
Whenever mbatchd is restarted, it is unavailable to service requests. In large clusters where there are many events in lsb.events, restarting mbatchd can take some time. To avoid replaying events in lsb.events, use the command badmin reconfig.

Log a comment when restarting mbatchd
- Use the -C option of badmin mbdrestart to log an administrator comment in lsb.events.

For example:

badmin mbdrestart -C "Configuration change"

The comment text Configuration change is recorded in lsb.events.

- Run badmin hist or badmin mbdhist to display administrator comments for mbatchd restart.

Shut down mbatchd
- Run badmin hshutdown to shut down sbatchd on the master host.

For example:

badmin hshutdown hostD
Shut down slave batch daemon on <hostD> .... done

- Run badmin mbdrestart:

badmin mbdrestart
Checking configuration files ...
No errors found.

This causes mbatchd and mbschd to exit. mbatchd cannot be restarted, because sbatchd is shut down. All LSF services are temporarily unavailable, but existing jobs are not affected. When mbatchd is later started by sbatchd, its previous status is restored from the event log file and job scheduling continues.

Customize batch command messages
LSF displays error messages when a batch command cannot communicate with mbatchd. Users see these messages when the batch command retries the connection to mbatchd.

You can customize three of these messages to provide LSF users with more detailed information and instructions.
- In the file lsf.conf, identify the parameter for the message that you want to customize.

The following lists the parameters you can use to customize messages when a batch command does not receive a response from mbatchd.
- Specify a message string, or specify an empty string:
  - To specify a message string, enclose the message text in quotation marks (") as shown in the following example:

LSB_MBD_BUSY_MSG="The mbatchd daemon is busy. Your command will retry every 5 minutes. No action required."

  - To specify an empty string, type quotation marks (") as shown in the following example:

LSB_MBD_BUSY_MSG=""

Whether you specify a message string or an empty string, or leave the parameter undefined, the batch command retries the connection to mbatchd at the intervals specified by the parameters LSB_API_CONNTIMEOUT and LSB_API_RECVTIMEOUT.

note:
Before Version 7.0, LSF displayed the following message for all three message types: "batch daemon not responding...still trying." To display the previous default message, you must define each of the three message parameters and specify "batch daemon not responding...still trying" as the message string.

- Save and close the lsf.conf file.

Reconfiguring your cluster
After changing LSF configuration files, you must tell LSF to reread the files to update the configuration. Use the following commands to reconfigure a cluster:
lsadmin reconfig
badmin reconfig
badmin mbdrestart
The reconfiguration commands you use depend on which files you change in LSF. The following table is a quick reference.
Reconfigure the cluster with lsadmin and badmin
To make a configuration change take effect, use this method to reconfigure the cluster.
- Log on to the host as root or the LSF administrator.
- Run lsadmin reconfig to reconfigure LIM:

lsadmin reconfig

The lsadmin reconfig command checks for configuration errors.

If no errors are found, you are prompted to either restart lim on master host candidates only, or to confirm that you want to restart lim on all hosts. If fatal errors are found, reconfiguration is aborted.

- Run badmin reconfig to reconfigure mbatchd:

badmin reconfig

The badmin reconfig command checks for configuration errors.

If fatal errors are found, reconfiguration is aborted.
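
The two-command sequence above can be wrapped in a small script so that both steps always run in the same order. The following is a minimal Python sketch, not part of LSF. The function names are hypothetical, and the sketch assumes that an aborted reconfiguration is reported through a non-zero exit status, which this document does not state. The commands stay attached to the terminal so that their confirmation prompts can still be answered by hand.

```python
import subprocess
from typing import Callable, List, Sequence

# Commands from the procedure above, in the order they are run.
RECONFIG_STEPS = (
    ("lsadmin", "reconfig"),  # reconfigure LIM
    ("badmin", "reconfig"),   # reconfigure mbatchd
)


def run_interactive(argv: Sequence[str]) -> int:
    """Run a command attached to the current terminal so that its prompts
    can be answered by hand, and return its exit status."""
    return subprocess.run(list(argv)).returncode


def reconfigure_cluster(
    run: Callable[[Sequence[str]], int] = run_interactive,
) -> List[str]:
    """Run each step in order and return the steps that succeeded.

    ASSUMPTION: an aborted reconfiguration yields a non-zero exit status.
    If that holds, later steps are skipped once a step fails.
    """
    completed: List[str] = []
    for argv in RECONFIG_STEPS:
        if run(argv) != 0:
            break
        completed.append(" ".join(argv))
    return completed
```

Calling reconfigure_cluster() with no arguments on an LSF host runs the real commands. Passing a different run function makes it possible to rehearse the sequence without touching the cluster.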
Reconfigure the cluster by restarting mbatchd
To replay and recover the running state of the cluster, use this method to reconfigure the cluster.
- Run badmin mbdrestart to restart mbatchd:

badmin mbdrestart

The badmin mbdrestart command checks for configuration errors.

If no fatal errors are found, you are asked to confirm mbatchd restart. If fatal errors are found, the command exits without taking any action.

tip:
If the lsb.events file is large, or many jobs are running, restarting mbatchd can take some time. In addition, mbatchd is not available to service requests while it is restarted.

View configuration errors
- Run lsadmin ckconfig -v.
- Run badmin ckconfig -v.

This reports all errors to your terminal.
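
If you prefer to collect the output of both checks in one place, for example to attach it to a change record, the two commands can be run from a short script. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that both commands run without interactive prompts and that a non-zero exit status indicates a problem, neither of which is stated in this document.

```python
import subprocess
from typing import Callable, Dict, Sequence, Tuple

# The two checks listed above.
CHECKS = (
    ("lsadmin", "ckconfig", "-v"),
    ("badmin", "ckconfig", "-v"),
)


def run_captured(argv: Sequence[str]) -> Tuple[int, str]:
    """Run a command and return (exit status, combined stdout and stderr)."""
    proc = subprocess.run(
        list(argv),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


def check_configuration(
    run: Callable[[Sequence[str]], Tuple[int, str]] = run_captured,
) -> Dict[str, Tuple[int, str]]:
    """Run every check, even if an earlier one fails, and return the
    results keyed by command line."""
    return {" ".join(argv): run(argv) for argv in CHECKS}
```

Unlike a reconfiguration, both checks are run regardless of the first result, so a single pass reports problems from both.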
How reconfiguring the cluster affects licenses
If the license server goes down, LSF can continue to operate for a period of time until it attempts to renew licenses.
Reconfiguring causes LSF to renew licenses. If no license server is available, LSF does not reconfigure the system because the system would lose all its licenses and stop working.
If you have multiple license servers, reconfiguration proceeds provided LSF can contact at least one license server. In this case, LSF still loses the licenses on servers that are down, so LSF may have fewer licenses available after reconfiguration.
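
The license behavior described above can be summarized as a small decision rule. The following Python sketch only models the prose in this section. The function name and the input format (a reachability flag and a license count for each server) are hypothetical, and the sketch does not reflect how LSF itself tracks licenses.

```python
from typing import Dict, Optional, Tuple


def licenses_after_reconfig(
    servers: Dict[str, Tuple[bool, int]],
) -> Optional[int]:
    """Model the rule described above.

    servers maps a license server name to (reachable, license_count).
    Returns None when no server is reachable, meaning that LSF does not
    reconfigure. Otherwise returns the number of licenses still available,
    counting only the servers that can be contacted.
    """
    reachable_counts = [count for up, count in servers.values() if up]
    if not reachable_counts:
        return None
    return sum(reachable_counts)
```

For example, with one reachable server that holds 4 licenses and one unreachable server that holds 2, reconfiguration proceeds and 4 licenses remain available.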
Platform Computing Inc.
www.platform.com