MAX_SBD_FAIL

Syntax

MAX_SBD_FAIL=integer

Description

The maximum number of retries for reaching a non-responding slave batch daemon, sbatchd.

The interval between retries is defined by MBD_SLEEP_TIME. If mbatchd fails to reach a host and has retried MAX_SBD_FAIL times, the host is considered unreachable.
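As an illustration of how the two parameters combine, the following is a minimal sketch of an lsb.params fragment. The values 20 and 5 are hypothetical and chosen only for the example. Assuming retries are spaced one MBD_SLEEP_TIME apart, this configuration would let mbatchd keep trying for roughly 5 x 20 = 100 seconds before it marks the host unreachable; actual timing may differ.

```
# lsb.params (illustrative values only, not recommendations)
Begin Parameters
MBD_SLEEP_TIME = 20     # seconds between retries (hypothetical value)
MAX_SBD_FAIL = 5        # retries before the host is considered unreachable (hypothetical value)
End Parameters
```

After editing lsb.params, run badmin reconfig so that mbatchd rereads the file.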

If you define LSB_SYNC_HOST_STAT_LIM=Y, mbatchd obtains the host status from the master LIM before it polls sbatchd. After the master LIM has reported a host as unavailable (LIM is down) or unreachable (sbatchd is down) MAX_SBD_FAIL times, mbatchd reports the host status as unavailable or unreachable, respectively.

When a host becomes unavailable, mbatchd assumes that all jobs running on that host have exited, and it schedules all rerunnable jobs (jobs submitted with the bsub -r option) to be rerun on another host.

Default

3