


MultiCluster Setup


Contents

  - Setup Overview
  - Non-Uniform Name Spaces
  - Restricted Awareness of Remote Clusters
  - Security of Daemon Communication
  - Authentication Between Clusters
  - Resource Use Updating for MultiCluster Jobs
  - MultiCluster Information Cache


Setup Overview

System requirements

The setup procedures will guide you through configuring your system to meet each requirement. However, you might find it helpful to understand the system requirements before you begin. This section includes:

  - Requirements to install MultiCluster
  - Requirements for MultiCluster communication to occur between 2 clusters
  - Requirements for resource sharing to occur between 2 clusters
  - Requirements for jobs to run across clusters

Requirements to install MultiCluster

MultiCluster is a licensed product; you must obtain a license from Platform Computing in order to run MultiCluster.

You can use MultiCluster to link two or more LSF clusters. The participating clusters can then be configured to share resources.

MultiCluster files are automatically installed by LSF's regular Setup program (lsfinstall). Install LSF and make sure each cluster works properly as a standalone cluster before you proceed to configure MultiCluster.

Installation and configuration procedures

To install and configure MultiCluster, take these steps:

  1. Plan the cluster
  2. Required tasks to establish communication between clusters
  3. Additional tasks that might be required to establish communication between clusters
  4. Testing communication between clusters
  5. Required tasks to establish resource sharing
  6. Optional tasks

Plan the cluster

  1. Read the overview to learn about how MultiCluster can be useful to you. See MultiCluster Overview.
  2. Decide which clusters will participate. Read about setup to learn about the issues that could prevent clusters from working together. See MultiCluster Setup.
  3. Decide which resources you want to share.
  4. Decide how you will share the resources among clusters. To learn about the various configuration options, see MultiCluster Job Forwarding Model or MultiCluster Resource Leasing Model.
  5. Read about setup to learn about configuration options common to both models. See MultiCluster Setup.

Required tasks to establish communication between clusters

  1. For each participating cluster, obtain and install a valid MultiCluster license. See Licensing MultiCluster.
  2. For each participating cluster, add the MultiCluster product to the LSF cluster configuration file. See Installing MultiCluster products.
  3. For resource sharing to work between clusters, the clusters should have common definitions of host types, host models, and resources. Configure this information in lsf.shared. See Setting common resource definitions.
  4. To establish communication, clusters must be aware of other clusters and know how to contact them. Add each cluster name and its master host name to the Cluster section of lsf.shared. See Defining participating clusters and valid master hosts.

Additional tasks that might be required to establish communication between clusters

  1. By default, LSF assumes a uniform user name space within a cluster and between clusters. If your clusters do not have a uniform user name space, configure account mapping. See Non-Uniform Name Spaces.
  2. By default, LSF daemons in a MultiCluster environment use privileged port authentication. With MultiCluster, you can configure LSF daemons to use non-privileged ports instead. See Security of Daemon Communication.

Testing communication between clusters

  1. Restart each cluster using the lsadmin and badmin commands.
    % lsadmin limrestart all
    % badmin mbdrestart
    
  2. To verify that MultiCluster is enabled, run lsclusters and bclusters.
    % lsclusters
    CLUSTER_NAME    STATUS    MASTER_HOST    ADMIN    HOSTS    SERVERS
    cluster1         ok       hostA          admin1   1        1
    cluster2         ok       hostD          admin2   3        3
    
    % bclusters
    [Remote Batch Information]
    No local queue sending/receiving jobs from remote clusters
    

Required tasks to establish resource sharing

  1. Optional. Run a simple test of resource sharing. See Testing the Resource Leasing Model or Testing the Job Forwarding Model.
  2. Configure resource-sharing policies between clusters. See MultiCluster Job Forwarding Model or MultiCluster Resource Leasing Model.

Optional tasks

  1. By default, all the clusters in a MultiCluster environment are aware of all the other clusters. This makes it possible for clusters to share resources or information. You can restrict awareness of remote clusters at the cluster level. See Restricted Awareness of Remote Clusters.
  2. With MultiCluster, LSF daemons can use non-privileged ports (by default, LSF daemons in a MultiCluster environment use privileged port authentication). You can also choose the method of daemon authentication. See Security of Daemon Communication and Authentication Between Clusters.
  3. When a local cluster requests load or host information from a remote cluster, the information is cached. If the local cluster is required to display the same information again, LSF displays the cached information, unless the cache has expired. The expiry period for cached information is configurable. See Cache thresholds.
  4. The default configuration of LSF is that clusters share information about the resources used by other clusters, and the information is updated every 5 minutes by the execution or provider cluster. You can disable the feature or modify how often MultiCluster resource use is updated. See Configuring resource use updating for MultiCluster jobs.
  5. To learn about optional features related to each configuration model, see MultiCluster Job Forwarding Model or MultiCluster Resource Leasing Model.

Licensing MultiCluster

To license MultiCluster, do the following:

  1. Send the license server host IDs for each participating cluster to your LSF vendor, and Platform Computing will generate license keys for you.
  2. Append the new FEATURE lines to your existing LSF license.dat files, so that each participating cluster is appropriately licensed. The feature line required to license MultiCluster is lsf_multicluster (see the sketch after these steps).
  3. To make the change take effect, restart each LSF cluster and each license server.
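
The exact FEATURE lines are generated by Platform for your license server, so the following is only a sketch of where the lsf_multicluster feature appears in license.dat; the daemon name, version, expiry date, license count, and key shown here are placeholders, not real values:

FEATURE lsf_multicluster lsf_ld 7.000 31-dec-2010 50 ABCD1234EFGH5678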

Installing MultiCluster products

MultiCluster files are automatically installed by LSF's regular Setup program (lsfinstall). Install LSF and make sure each cluster works properly as a standalone cluster before you proceed to configure MultiCluster.

To make each cluster run MultiCluster, add LSF_MultiCluster to the products specified in the Parameters section of lsf.cluster.cluster_name:

Begin Parameters
PRODUCTS=LSF_Base LSF_Manager LSF_MultiCluster 
End Parameters

Setting common ports

Participating clusters must use the same port numbers for the LIM, RES, mbatchd, and sbatchd daemons.

By default, all clusters have the identical settings, as shown:

LSF_LIM_PORT=7869
LSF_RES_PORT=6878
LSB_MBD_PORT=6881
LSB_SBD_PORT=6882

LSF_LIM_PORT change

The default for LSF_LIM_PORT has changed to accommodate Platform EGO default port configuration. On EGO, default ports start with lim at 7869, and are numbered consecutively for the EGO pem, vemkd, and egosc daemons.

This is different from previous LSF releases where the default LSF_LIM_PORT was 6879. LSF res, sbatchd, and mbatchd continue to use the default pre-7.0 ports 6878, 6881, and 6882.

Upgrade installation preserves existing port settings for lim, res, sbatchd, and mbatchd. EGO pem, vemkd, and egosc use default EGO ports starting at 7870, if they do not conflict with existing lim, res, sbatchd, and mbatchd ports.

Troubleshooting

To check your port numbers, check the LSF_TOP/conf/lsf.conf file in each cluster. (LSF_TOP is the LSF installation directory. On UNIX, this is defined in the install.config file.) Make sure you have identical settings in each cluster for the following parameters:

LSF_LIM_PORT
LSF_RES_PORT
LSB_MBD_PORT
LSB_SBD_PORT
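
For example, you can list the port settings from lsf.conf in each cluster and compare the output. This is only a sketch; substitute your actual LSF_TOP directory:

% grep -E "^(LSF_LIM_PORT|LSF_RES_PORT|LSB_MBD_PORT|LSB_SBD_PORT)" LSF_TOP/conf/lsf.conf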

Setting common resource definitions

For resource sharing to work between clusters, the clusters should have common definitions of host types, host models, and resources. Each cluster finds this information in lsf.shared, so the best way to configure MultiCluster is to make sure lsf.shared is identical for each cluster. If you do not have a shared file system, replicate lsf.shared across all clusters.
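
For example, if the clusters share a custom Boolean resource, define it identically in the Resource section of lsf.shared in every cluster. The snippet below is only a sketch; the resource name and description are placeholders, not part of the default configuration:

Begin Resource
RESOURCENAME   TYPE      INTERVAL   INCREASING   DESCRIPTION
bigmem         Boolean   ()         ()           (Hosts with large memory)
End Resource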

Defining participating clusters and valid master hosts

To enable MultiCluster, define all participating clusters in the Cluster section of the LSF_TOP/conf/lsf.shared file.

  1. For ClusterName, specify the name of each participating cluster. On UNIX, each cluster name is defined by LSF_CLUSTER_NAME in the install.config file.
  2. For Servers, specify one or more candidate master hosts for the cluster (these are the first hosts listed in the Host section of lsf.cluster.cluster_name). A cluster will not participate in MultiCluster resource sharing unless its current master host is listed here.

Example

Begin Cluster
ClusterName  Servers 
Cluster1     (hostA hostB)
Cluster2     (hostD)
End Cluster

In this example, hostA should be the master host of Cluster1 (the first host listed in lsf.cluster.cluster1 HOST section) with hostB as the backup, and hostD should be the master host of Cluster2. If the master host fails in Cluster1, MultiCluster will still work because the backup master is also listed here. However, if the master host fails in Cluster2, MultiCluster will not recognize any other host as the master, so Cluster2 will no longer participate in MultiCluster resource sharing.

EGO_PREDEFINED_RESOURCES in lsf.conf

When Platform EGO is enabled in the LSF cluster (LSF_ENABLE_EGO=Y), you can also set several EGO parameters related to LIM, PIM, and ELIM in either lsf.conf or ego.conf.

All clusters must have the same value of EGO_PREDEFINED_RESOURCES in lsf.conf so that the nprocs, ncores, and nthreads host resources of remote clusters are usable.

See Administering Platform LSF for more information about configuring Platform LSF on EGO.



Non-Uniform Name Spaces

By default, LSF assumes a uniform user name space within a cluster and between clusters.

User account mapping

To support the execution of batch jobs across non-uniform user name spaces between clusters, LSF allows user account mapping.

See Account mapping between clusters.

File transfer

By default, LSF uses lsrcp for file transfer (the bsub -f option). The lsrcp utility depends on a uniform user ID across clusters, so file transfer with bsub -f might not work in a non-uniform user name space.

Account mapping between clusters

By default, LSF assumes a uniform user name space within a cluster and between clusters. To support the execution of batch jobs across non-uniform user name spaces between clusters, LSF allows user account mapping.

For a job submitted by one user account in one cluster to run under a different user account on a host that belongs to a remote cluster, both the local and remote clusters must have the account mapping properly configured. System-level account mapping is configured by the LSF administrator, while user-level account mapping can be configured by LSF users.

System-level account mapping

You must be an LSF administrator to configure system level account mapping.

System-level account mapping is defined in the UserMap section of lsb.users. The submission cluster proposes a set of user mappings (defined using the keyword export) and the execution cluster accepts a set of user mappings (defined using the keyword import). For a user's job to run, the mapping must be both proposed and accepted.

Example

lsb.users on cluster1:

Begin UserMap
LOCAL    REMOTE                             DIRECTION
user1    user2@cluster2                     export
user3    (user4@cluster2 user6@cluster2)    export
End UserMap

lsb.users on cluster2:

Begin UserMap
LOCAL            REMOTE                   DIRECTION
user2            user1@cluster1           import
(user6 user8)    user3@cluster1           import
End UserMap

Cluster1 configures user1 to run jobs as user2 in cluster2, and user3 to run jobs as user4 or user6 in cluster2.

Cluster2 configures user1 from cluster1 to run jobs as user2, and user3 from cluster1 to run jobs as user6 or user8.

Only mappings configured in both clusters work. The common account mappings are for user1 to run jobs as user2, and for user3 to run jobs as user6. Therefore, these mappings work, but the mappings of user3 to users 4 and 8 are only half-done and so do not work.

User-level account mapping

To set up your own account mapping, set up an .lsfhosts file in your home directory with Owner Read-Write permissions only. Do not give other users and groups permissions on this file.
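
For example, on UNIX you could create the file and restrict its permissions as follows (a sketch; run the commands in your own home directory):

% touch ~/.lsfhosts
% chmod 600 ~/.lsfhosts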

Account mapping can specify cluster names in place of host names.

Example #1

You have two accounts: user1 on cluster1, and user2 on cluster2. To run jobs in either cluster, configure .lsfhosts as shown.

On each host in cluster1:

% cat ~user1/.lsfhosts
cluster2 user2

On each host in cluster2:

% cat ~user2/.lsfhosts
cluster1 user1

Example #2

You have the account user1 on cluster1, and want to run jobs on cluster2 under the lsfguest account. Configure .lsfhosts as shown.

On each host in cluster1:

% cat ~user1/.lsfhosts
cluster2 lsfguest send

On each host in cluster2:

% cat ~lsfguest/.lsfhosts
cluster1 user1 recv

Example #3

You have a uniform account name (user2) on all hosts in cluster2, and a uniform account name (user1) on all hosts in cluster1 except hostX. On hostX, you have the account name user99.

To use both clusters transparently, configure .lsfhosts in your home directories on different hosts as shown.

On hostX in cluster1:

% cat ~user99/.lsfhosts
cluster1    user1
hostX       user99
cluster2    user2

On every other host in cluster1:

% cat ~user1/.lsfhosts
cluster2    user2
hostX       user99

On each host in cluster2:

% cat ~user2/.lsfhosts
cluster1    user1
hostX       user99



Restricted Awareness of Remote Clusters

By default, all the clusters in a MultiCluster environment are aware of all the other clusters. This makes it possible for clusters to share resources or information when you configure MultiCluster links between them.

You can restrict awareness of remote clusters at the cluster level, by listing which of the other clusters in the MultiCluster environment are allowed to interact with the local cluster. In this case, the local cluster cannot display information about unrecognized clusters and does not participate in MultiCluster resource sharing with unrecognized clusters.

How it works

By default, the local cluster can obtain information about all other clusters specified in lsf.shared. The default behavior of RES is to accept requests from all the clusters in lsf.shared.

If the RemoteClusters section in lsf.cluster.cluster_name is defined, the local cluster has a list of recognized clusters and is only aware of those clusters; it is not aware of the other clusters in the MultiCluster environment.

However, remote clusters might still be aware of this cluster.

Example

This example illustrates how the RemoteClusters list works.

The MultiCluster environment consists of 4 clusters with a common lsf.shared:

CLUSTERS
cluster1
cluster2
cluster3
cluster4

In addition, cluster2 is configured with a RemoteClusters list in lsf.cluster.cluster_name:

Begin RemoteClusters
CLUSTERNAME
cluster3
cluster4
End RemoteClusters

Because of the RemoteClusters list, local applications in cluster2 are aware of cluster3 and cluster4, but not cluster1. For example, if you view information or configure queues using the keyword all, LSF will behave as if you specified the list of recognized clusters instead of all clusters in lsf.shared.
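
For example, running lsclusters from a host in cluster2 would list only the local cluster and the recognized clusters, omitting cluster1. The output below is only a sketch; the master hosts, administrators, and host counts are placeholders:

% lsclusters
CLUSTER_NAME    STATUS    MASTER_HOST    ADMIN    HOSTS    SERVERS
cluster2         ok       hostD          admin2   3        3
cluster3         ok       hostE          admin3   2        2
cluster4         ok       hostF          admin4   2        2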

Adding or modifying RemoteClusters list

You must have cluster administrator privileges in the local cluster to perform this task.

  1. Open lsf.cluster.cluster_name of the local cluster.
  2. If it does not already exist, create the RemoteClusters section as shown:
    Begin RemoteClusters
    CLUSTERNAME
    ...
    End RemoteClusters
    
  3. Edit the RemoteClusters section. Under the heading CLUSTERNAME, specify the names of the remote clusters that you want the local cluster to recognize.

    These clusters must also be listed in lsf.shared, so the RemoteClusters list is always a subset of the clusters list in lsf.shared.



Security of Daemon Communication

With MultiCluster, LSF daemons can be configured to communicate over non-privileged ports (by default, LSF daemons in a MultiCluster environment use privileged port authentication).

If you are concerned about the security of daemon communication when privileged port authentication is disabled, you can use an eauth program to enable any method of authentication for secure communication between clusters. See Authentication Between Clusters.

Requirements

Steps

  1. To make LSF daemons use non-privileged ports, edit lsf.conf in every cluster as shown:

    LSF_MC_NON_PRIVILEGED_PORTS=Y

  2. To make the changes take effect, restart the master LIM and MBD in every cluster. For example, if a cluster's master host is hostA, run the following commands in that cluster:
    lsadmin limrestart hostA
    badmin mbdrestart
    



Authentication Between Clusters

For extra security, you can use any method of external authentication between any two clusters in the MultiCluster grid.

Because this is configured for individual clusters, not globally, different cluster pairs can use different systems of authentication. You use a different eauth program for each different authentication mechanism.

If no common external authentication method has been configured, two clusters communicate with the default security, which is privileged port authentication.

eauth executables

Contact Platform Professional Services for more information about the eauth programs that Platform distributes to allow LSF to work with different security mechanisms. If you already have an eauth that works with LSF for daemon authentication within the cluster, use a copy of it.

If different clusters use different methods of authentication, set up multiple eauth programs.

Steps

  1. Copy the corresponding eauth program to LSF_SERVERDIR.
  2. Name the eauth program eauth.method_name (see the sketch after these steps).

    If you happen to use the same eauth program for daemon authentication within the cluster, you should have two copies, one named eauth (used by LSF) and one named eauth.method_name (used by MultiCluster).
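
For example, if the method name is KRB, installing the program might look like the following sketch, assuming LSF_SERVERDIR is set in your environment and /tmp/krb_eauth is the eauth program you obtained for Kerberos:

% cp /tmp/krb_eauth $LSF_SERVERDIR/eauth.KRB
% chmod 755 $LSF_SERVERDIR/eauth.KRB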

MultiCluster configuration

Steps

  1. Edit the lsf.cluster.cluster_name RemoteClusters section.

    If the cluster does not already include a RemoteClusters list, you must add it now. To preserve the existing behavior, specify all remote clusters in the list, even if a cluster's preferred authentication method is the default method.

  2. If necessary, add the AUTH column to the RemoteClusters section.
  3. For each remote cluster, specify the preferred authentication method. Set AUTH to method_name (using the same method name that identifies the corresponding eauth program). For default behavior, specify a dash (-).
  4. To make the changes take effect in a working cluster, run the following commands:
    lsadmin limrestart master_host 
    badmin mbdrestart
    

Repeat the steps for each cluster that will use external authentication, making sure that the configurations of paired-up clusters match.

Configuration example

In this example, Cluster1 and Cluster2 use Kerberos authentication with each other, but not with Cluster3. It does not matter how Cluster3 is configured, because there is no extra authentication unless the configurations of both clusters agree.

Cluster1

lsf.cluster.cluster1:

Begin RemoteClusters
CLUSTERNAME  EQUIV   CACHE_INTERVAL   RECV_FROM   AUTH
cluster2       Y           60            Y        KRB
cluster3       N           30            N         -
End RemoteClusters

LSF_SERVERDIR in Cluster1 includes an eauth executable named eauth.KRB.

Cluster2

lsf.cluster.cluster2:

Begin RemoteClusters
CLUSTERNAME  EQUIV   CACHE_INTERVAL   RECV_FROM   AUTH
cluster1       Y           60            Y        KRB
cluster3       N           30            N         -
End RemoteClusters

LSF_SERVERDIR in Cluster2 includes an eauth executable named eauth.KRB.



Resource Use Updating for MultiCluster Jobs

By default, clusters share information about the resources used by other clusters, and the execution or provider cluster updates this information every 5 minutes. You can disable the feature or modify how often MultiCluster resource use is updated. Depending on load, updating the information very frequently can affect the performance of LSF.

Configuring resource use updating for MultiCluster jobs

To change the timing of resource usage updating between clusters, set MC_RUSAGE_UPDATE_INTERVAL in lsb.params in the execution or provider cluster. Specify how often to update the information in the submission or consumer cluster, in seconds.

To disable LSF resource usage updating between clusters, specify zero:

MC_RUSAGE_UPDATE_INTERVAL=0
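
For example, to update resource use every 10 minutes (600 seconds) instead of the default 5 minutes, you might set the parameter as follows in lsb.params (a sketch):

Begin Parameters
MC_RUSAGE_UPDATE_INTERVAL=600
End Parameters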

Restriction

You must configure this parameter manually; you cannot use LSF GUI tools to add or modify this parameter.



MultiCluster Information Cache

When a local cluster requests load or host information from a remote cluster, the information is cached. If the local cluster is required to display the same information again, LSF displays the cached information, unless the cache has expired.

The expiry period for cached information is configurable, so you can view more up-to-date information if you don't mind connecting to the remote cluster more often.

It is more efficient to get information from a local cluster than from a remote cluster. Caching remote cluster information locally minimizes excessive communication between clusters.

Cache thresholds

The cache threshold is the maximum time that remote cluster information can remain in the local cache.

There are two cache thresholds, one for load information, and one for host information. The threshold for host information is always double the threshold for load information.

By default, cached load information expires after 60 seconds and cached host information expires after 120 seconds.

How it works

When a local cluster requests load or host information from a remote cluster, the information is cached by the local master LIM.

When the local cluster is required to display the same information again, LSF evaluates the age of the information in the cache. If the cached information has not expired, LSF uses it; otherwise, LSF requests fresh information from the remote cluster.

Configuring cache threshold

Set CACHE_INTERVAL in the RemoteClusters section of lsf.cluster.cluster_name, and specify the number of seconds to cache load information.
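
For example, to cache load information from cluster2 for 90 seconds (host information would then be cached for 180 seconds), the RemoteClusters section might contain the following sketch; the cluster name is a placeholder:

Begin RemoteClusters
CLUSTERNAME    CACHE_INTERVAL
cluster2       90
End RemoteClusters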





      Date Modified: March 13, 2009

Copyright © 1994-2009 Platform Computing Corporation. All rights reserved.