Sometimes the LIM is up, but executing the lsload command prints the following error message:
If the LIM has just been started, this is normal, because the LIM needs time to initialize by reading configuration files and contacting other LIMs. If the LIM does not become available within one or two minutes, check the LIM error log for the host you are working on.
To prevent communication timeouts when starting or restarting the local LIM, define the parameter LSF_SERVER_HOSTS in the lsf.conf file. The client then contacts the LIM on one of the hosts listed in LSF_SERVER_HOSTS and executes the command, provided that at least one host in the list has a LIM that is up and running.
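As an illustration only, the following Python sketch reads the LSF_SERVER_HOSTS setting from the text of an lsf.conf file so that you can confirm the list is defined before restarting a LIM. The host names hostA, hostB, and hostC and the helper name server_hosts are hypothetical, and the parser assumes the common form of a single quoted, space-separated list on one line.

```python
def server_hosts(lsf_conf_text):
    """Return the host names listed in LSF_SERVER_HOSTS, or [] if undefined."""
    for line in lsf_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "LSF_SERVER_HOSTS":
            # Drop the surrounding quotes, then split on whitespace.
            return value.strip().strip("\"'").split()
    return []


# Hypothetical lsf.conf fragment; the host names are placeholders.
SAMPLE_LSF_CONF = """\
# server hosts that clients may contact
LSF_SERVER_HOSTS="hostA hostB hostC"
"""

if __name__ == "__main__":
    print(server_hosts(SAMPLE_LSF_CONF))
```

An empty result suggests the parameter is not set, in which case clients depend on the local LIM alone.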
When the local LIM is running but there is no master LIM in the cluster, LSF applications display the following message:
Sometimes the master LIM is up, but executing the lsload or lshosts command prints the following error message:
If the /etc/hosts file on the host where the master LIM is running maps the host name to the loopback IP address (127.0.0.1), LSF client LIMs cannot contact the master LIM. When the master LIM starts up, it sets its official host name and IP address to the loopback address. Any client that requests the master LIM address then receives 127.0.0.1 and, when it tries to connect, ends up connecting to itself instead of the master host.
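One way to look for this misconfiguration is to scan the hosts file for the master host name on a loopback line. The Python sketch below is a minimal example under stated assumptions: it works on the text of a hosts file in the usual "address name [alias ...]" layout, the helper names and the sample entries (hostA, hostB, 192.168.1.10) are hypothetical, and it compares names case-insensitively.

```python
import ipaddress


def loopback_names(hosts_text):
    """Return the set of names (lowercased) mapped to a loopback address."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not a parsable address; ignore the line
        if addr.is_loopback:
            names.update(name.lower() for name in fields[1:])
    return names


def is_mapped_to_loopback(hosts_text, host_name):
    """True if host_name, or its short form, appears on a loopback line."""
    full = host_name.lower()
    short = full.split(".", 1)[0]
    names = loopback_names(hosts_text)
    return full in names or short in names


# Hypothetical hosts file in which hostA is wrongly tied to 127.0.0.1.
SAMPLE_HOSTS = """\
127.0.0.1     localhost hostA
192.168.1.10  hostB
"""

if __name__ == "__main__":
    print(is_mapped_to_loopback(SAMPLE_HOSTS, "hostA"))
    print(is_mapped_to_loopback(SAMPLE_HOSTS, "hostB"))
```

To check a real host, pass the contents of /etc/hosts and the master host name. A True result indicates the mapping described above.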
A command may fail with the following error message because of a non-uniform file name space:
chdir(...) failed: no such file or directory
This error occurs when you execute a command remotely and your current working directory either does not exist on the remote host or is mapped to a different name there.
This reports most errors. You should also check whether there is any email in the LSF administrator’s mailbox. If mbatchd is running but sbatchd dies on some hosts, it may be because mbatchd has not been configured to use those hosts.
mbatchd allows sbatchd to run only on the hosts listed in the Host section of the lsb.hosts file. If you try to configure an unknown host in the HostGroup or HostPartition sections of the lsb.hosts file, or as a HOSTS definition for a queue in the lsb.queues file, mbatchd logs the following message:
mbatchd on host: LSB_CONFDIR/cluster1/configdir/file(line #): Host hostname is not used by lsbatch; ignored
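To find every host that mbatchd has ignored, you can search the log text for this message. The following Python sketch is one possible approach. It matches only the fixed wording "Host ... is not used by lsbatch; ignored" taken from the message above. The sample log lines, the host names, and the helper name ignored_hosts are hypothetical.

```python
import re

# Matches the fixed part of the message and captures the host name.
IGNORED_HOST = re.compile(r"Host (\S+) is not used by lsbatch; ignored")


def ignored_hosts(log_text):
    """Return the host names reported as ignored, in order, without repeats."""
    seen = []
    for match in IGNORED_HOST.finditer(log_text):
        host = match.group(1)
        if host not in seen:
            seen.append(host)
    return seen


# Hypothetical log excerpt modeled on the message template in the text.
SAMPLE_LOG = """\
mbatchd on hostA: LSB_CONFDIR/cluster1/configdir/lsb.hosts(12): Host hostX is not used by lsbatch; ignored
mbatchd on hostA: LSB_CONFDIR/cluster1/configdir/lsb.queues(40): Host hostY is not used by lsbatch; ignored
mbatchd on hostA: LSB_CONFDIR/cluster1/configdir/lsb.queues(57): Host hostX is not used by lsbatch; ignored
"""

if __name__ == "__main__":
    print(ignored_hosts(SAMPLE_LOG))
```

Each name returned is a candidate to add to the Host section of lsb.hosts or to remove from the group, partition, or queue definition that mentions it.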
If you start sbatchd on a host that is not known by mbatchd, mbatchd rejects the sbatchd. The sbatchd logs the following message and exits.
If you see DEFAULT in the output of lim -t, automatic detection of the host type or model has failed, and the host type configured in lsf.shared cannot be found. LSF will work on the host, but a DEFAULT model may be inefficient because of incorrect CPU factors. A DEFAULT type may also cause binary incompatibility because a job from a DEFAULT host type can be migrated to another DEFAULT host type.
lshosts
HOST_NAME  type     model    cpuf  ncpus  maxmem  maxswp  server  RESOURCES
hostA      DEFAULT  DEFAULT  1     2      256M    710M    Yes     ()
If model is DEFAULT, LSF will work correctly but the host will have a CPU factor of 1, which may not make efficient use of the host model.
If type is DEFAULT, there may be binary incompatibility. For example, suppose there are two hosts, one running Solaris and the other HP. If both hosts are set to type DEFAULT, jobs running on the Solaris host can be migrated to the HP host and vice versa.
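To spot affected hosts across a large cluster, you can filter the lshosts output for DEFAULT in the type and model columns. The Python sketch below is a minimal example that assumes the whitespace-separated layout shown in the lshosts output above, with a header row containing "type" and "model". The hostA row comes from that output. The hostB row and the helper name find_default_hosts are hypothetical.

```python
def find_default_hosts(lshosts_output):
    """Map each host name to the columns ('type', 'model') that are DEFAULT."""
    lines = [ln for ln in lshosts_output.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    type_idx = header.index("type")
    model_idx = header.index("model")
    flagged = {}
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) <= max(type_idx, model_idx):
            continue  # malformed or truncated row
        problems = []
        if fields[type_idx] == "DEFAULT":
            problems.append("type")
        if fields[model_idx] == "DEFAULT":
            problems.append("model")
        if problems:
            flagged[fields[0]] = problems
    return flagged


# hostA is the row from the text; hostB is a made-up, correctly detected host.
SAMPLE_LSHOSTS = """\
HOST_NAME  type     model    cpuf  ncpus  maxmem  maxswp  server  RESOURCES
hostA      DEFAULT  DEFAULT  1     2      256M    710M    Yes     ()
hostB      SOL64    Ultra2   20.0  2      256M    710M    Yes     ()
"""

if __name__ == "__main__":
    print(find_default_hosts(SAMPLE_LSHOSTS))
```

A host flagged for "model" only still runs correctly but with a CPU factor of 1. A host flagged for "type" carries the binary incompatibility risk described above.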