In this model, the cluster that is starving for resources sends jobs over to the cluster that has resources to spare. Job status, pending reason, and resource usage are returned to the submission cluster. When the job is done, the exit code returns to the submission cluster.
By default, bhosts shows information about hosts and resources that are available to the local cluster and information about jobs that are scheduled by the local cluster.
The bjobs command shows all jobs associated with hosts in the cluster, including MultiCluster jobs. Jobs from remote clusters can be identified by the FROM_HOST column, which shows the remote cluster name and the submission or consumer cluster job ID in the format host_name@remote_cluster_name:remote_job_ID.
If the MultiCluster job is running under the job forwarding model, the QUEUE column shows a local queue, but if the MultiCluster job is running under the resource leasing model, the name of the remote queue is shown in the format queue_name@remote_cluster_name.
Use -w or -l to prevent the MultiCluster information from being truncated.
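The two identifier formats described above can be split mechanically when bjobs output is post-processed. The following is a minimal sketch, assuming the values have already been extracted from untruncated output (-w or -l); the function names and the sample values are hypothetical and are not part of LSF.

```python
from typing import NamedTuple, Tuple


class RemoteJobOrigin(NamedTuple):
    host: str
    cluster: str
    job_id: str  # kept as a string; the sketch does not assume a numeric ID


def parse_from_host(value: str) -> RemoteJobOrigin:
    """Split 'host_name@remote_cluster_name:remote_job_ID' into its parts."""
    host, at, rest = value.partition("@")
    cluster, colon, job_id = rest.partition(":")
    if not (host and at and cluster and colon and job_id):
        raise ValueError(f"not a remote FROM_HOST value: {value!r}")
    return RemoteJobOrigin(host, cluster, job_id)


def parse_leased_queue(value: str) -> Tuple[str, str]:
    """Split 'queue_name@remote_cluster_name' into (queue, cluster)."""
    queue, at, cluster = value.partition("@")
    if not (queue and at and cluster):
        raise ValueError(f"not a remote queue value: {value!r}")
    return queue, cluster


# Hypothetical sample values, for illustration only.
origin = parse_from_host("hostA@cluster2:1234")
queue, cluster = parse_leased_queue("normal@cluster2")
```

A FROM_HOST value without the @ and : separators raises ValueError here, which is one way to tell local jobs apart from MultiCluster jobs.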
The bclusters command displays remote resource provider and consumer information, resource flow information, and connection status between the local and remote cluster.
Use -app to view available application profiles in remote clusters.
Information related to the job forwarding model is displayed under the heading Job Forwarding Information.
LOCAL_QUEUE: Name of a local MultiCluster send-jobs or receive-jobs queue.
REMOTE: For send-jobs queues, shows the name of the receive-jobs queue in a remote cluster.
CLUSTER: For send-jobs queues, shows the name of the remote cluster containing the receive-jobs queue.
For receive-jobs queues, shows the name of the remote cluster that can send jobs to the local queue.
STATUS: Indicates the connection status between the local queue and remote queue. The status is one of the following:
ok: The two clusters can exchange information and the system is properly configured.
disc: Communication between the two clusters has not been established. This can occur because there are no jobs waiting to be dispatched, or because the remote master cannot be located.
reject: The remote queue rejects jobs from the send-jobs queue. The local queue and remote queue are connected and the clusters communicate, but the queue-level configuration is not correct. For example, the send-jobs queue in the submission cluster points to a receive-jobs queue that does not exist in the remote cluster.
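When the STATUS column is checked from a script, the three conditions described above can be mapped to a short explanation. This is a sketch only: it assumes the status strings are ok, disc, and reject, and it assumes the caller has already extracted (LOCAL_QUEUE, STATUS) pairs from the Job Forwarding Information section. The function names are hypothetical.

```python
# Assumption: the STATUS column contains one of "ok", "disc", or "reject".
STATUS_MEANING = {
    "ok": "clusters can exchange information; configuration is correct",
    "disc": "no communication established (no pending jobs, or remote master not found)",
    "reject": "remote queue rejects jobs; check queue-level configuration",
}


def describe_status(status):
    """Return a short explanation for a STATUS value."""
    return STATUS_MEANING.get(status.strip(), "unrecognized status")


def queues_needing_attention(rows):
    """Given (local_queue, status) pairs, return the queues whose status is not ok."""
    return [queue for queue, status in rows if status.strip() != "ok"]


# Hypothetical queue names, for illustration only.
rows = [("send_q1", "ok"), ("send_q2", "reject"), ("recv_q1", "disc")]
flagged = queues_needing_attention(rows)
```

A disc status is not necessarily an error, because it can simply mean that no jobs are waiting to be dispatched, so a script might treat it as a warning rather than a failure.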
If the job is rejected, it returns to the submission cluster.
For example, consider the following application profile configurations:
On the submission cluster (Cluster1) in the lsb.applications file:
Begin Application
NAME = fluent
DESCRIPTION = FLUENT Version 6.2
CPULIMIT = 180/bp860-10 # 3 hours of host hostA
FILELIMIT = 20000
DATALIMIT = 20000 # jobs data segment limit
CORELIMIT = 20000
PROCLIMIT = 5 # job processor limit
PRE_EXEC = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
POST_EXEC = /usr/local/lsf/misc/testq_post |grep -v "Hi"
REQUEUE_EXIT_VALUES = 55 34 78
End Application

Begin Application
NAME = catia
DESCRIPTION = CATIA V5
CPULIMIT = 24:0/bp860-10 # 24 hours of host hostA
FILELIMIT = 20000
DATALIMIT = 20000 # jobs data segment limit
CORELIMIT = 20000
PROCLIMIT = 5 # job processor limit
PRE_EXEC = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
POST_EXEC = /usr/local/lsf/misc/testq_post |grep -v "Hi"
REQUEUE_EXIT_VALUES = 55 34 78
End Application

Begin Application
NAME = djob
DESCRIPTION = distributed jobs
FILELIMIT = 20000
DATALIMIT = 2000000 # jobs data segment limit
RTASK_GONE_ACTION="KILLJOB_TASKEXIT IGNORE_TASKCRASH"
DJOB_ENV_SCRIPT = /lsf/djobs/proj_1/djob_env
DJOB_RU_INTERVAL = 300
DJOB_HB_INTERVAL = 30
DJOB_COMMFAIL_ACTION="KILL_TASKS"
End Application
On the execution cluster (Cluster2) in the lsb.applications file:
Begin Application
NAME = dyna
DESCRIPTION = ANSYS LS-DYNA
CPULIMIT = 8:0/amd64dcore # 8 hours of host model SunIPC
FILELIMIT = 20000
DATALIMIT = 20000 # jobs data segment limit
CORELIMIT = 20000
PROCLIMIT = 5 # job processor limit
PRE_EXEC = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
POST_EXEC = /usr/local/lsf/misc/testq_post |grep -v "Hi"
REQUEUE_EXIT_VALUES = 55 255 78
End Application

Begin Application
NAME = default
DESCRIPTION = global defaults
CORELIMIT = 0 # No core files
STACKLIMIT = 200000 # Give large default
RERUNNABLE = Y #
RES_REQ = order[mem:ut] # change the default ordering method
End Application
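To see which application profiles are defined on each side, the NAME entries can be pulled out of the two lsb.applications files and compared. The following is a rough sketch, not an LSF tool: it assumes one parameter per line, treats everything after # as a comment, and ignores quoting, so it is only suitable for simple files like the ones shown here. The function name is hypothetical.

```python
def application_names(text):
    """Return the NAME value of each Begin Application ... End Application block."""
    names = []
    in_block = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments (naive)
        if line == "Begin Application":
            in_block = True
        elif line == "End Application":
            in_block = False
        elif in_block:
            key, eq, value = line.partition("=")
            if eq and key.strip() == "NAME":
                names.append(value.strip())
    return names


# Abbreviated versions of the two configurations, for illustration only.
cluster1_conf = """
Begin Application
NAME = fluent
End Application
Begin Application
NAME = catia
End Application
Begin Application
NAME = djob
End Application
"""
cluster2_conf = """
Begin Application
NAME = dyna
CORELIMIT = 20000
End Application
Begin Application
NAME = default
CORELIMIT = 0 # No core files
End Application
"""

only_on_execution = set(application_names(cluster2_conf)) - set(application_names(cluster1_conf))
```

With these inputs, only_on_execution contains dyna and default, the profiles that exist only on Cluster2.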
Verify that MultiCluster is enabled:
lsclusters
CLUSTER_NAME   STATUS   MASTER_HOST   ADMIN   HOSTS   SERVERS
cluster1       ok       master_c1     admin   1       1
cluster2       ok       master_c2     admin   2       2
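The same check can be scripted. The sketch below assumes the lsclusters output has been captured as text and has the column layout shown above, with the cluster name in the first column and the status in the second; the function name is hypothetical.

```python
def cluster_statuses(lsclusters_output):
    """Map each cluster name to its STATUS value from captured lsclusters output."""
    statuses = {}
    for line in lsclusters_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "CLUSTER_NAME":
            continue
        statuses[fields[0]] = fields[1]
    return statuses


sample = """\
CLUSTER_NAME   STATUS   MASTER_HOST   ADMIN   HOSTS   SERVERS
cluster1       ok       master_c1     admin   1       1
cluster2       ok       master_c2     admin   2       2
"""

statuses = cluster_statuses(sample)
all_ok = all(status == "ok" for status in statuses.values())
```

If all_ok is false, the statuses dictionary shows which cluster is not reporting ok.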
View available applications on remote clusters from the submission cluster (Cluster1):
bclusters -app
REMOTE_CLUSTER   APP_NAME   DESCRIPTION
cluster2         dyna       ANSYS LS-DYNA
cluster2         default    global defaults
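To check from a script whether a given application profile is available remotely, the output above can be parsed into rows. This is a minimal sketch that assumes the three-column layout shown here, where only the DESCRIPTION column may contain spaces; the function name is hypothetical.

```python
def parse_remote_apps(output):
    """Parse captured 'bclusters -app' output into (cluster, app_name, description) tuples."""
    apps = []
    for line in output.splitlines():
        parts = line.split(None, 2)  # the description keeps its internal spaces
        # Skip blank or short lines and the header row.
        if len(parts) < 2 or parts[0] == "REMOTE_CLUSTER":
            continue
        description = parts[2].strip() if len(parts) == 3 else ""
        apps.append((parts[0], parts[1], description))
    return apps


sample = """\
REMOTE_CLUSTER   APP_NAME   DESCRIPTION
cluster2         dyna       ANSYS LS-DYNA
cluster2         default    global defaults
"""

apps = parse_remote_apps(sample)
has_dyna = any(cluster == "cluster2" and name == "dyna" for cluster, name, _ in apps)
```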
View available applications on remote clusters from the execution cluster (Cluster2):