- blaunch Distributed Application Framework
- OpenMP Jobs
- PVM Jobs
- SGI Vendor MPI Support
- HP Vendor MPI Support
- LSF Generic Parallel Job Launcher Framework
- How the Generic PJL Framework Works
- Tuning PAM Scalability and Fault Tolerance
- Running Jobs with Task Geometry
- Enforcing Resource Usage Limits for Parallel Tasks
- Example Integration: LAM/MPI
- Tips for Writing PJL Wrapper Scripts
- Other Integration Options
blaunch Distributed Application Framework
Most MPI implementations and many distributed applications use rsh and ssh as their task launching mechanism. The blaunch command provides a drop-in replacement for rsh and ssh as a transparent method for launching parallel and distributed applications within LSF.

The following figure illustrates blaunch processing:

[Figure: blaunch processing]
About the blaunch command
Similar to the LSF lsrun command, blaunch transparently connects directly to the RES/SBD on the remote host, creates and tracks the remote tasks, and provides the connection back to LSF. There is no need to insert pam or taskstarter into the rsh or ssh calling sequence, or to configure any wrapper scripts.
blaunch supports the same core command line options as rsh and ssh. Whereas the host name value for rsh and ssh can only be a single host name, you can use the -z option to specify a space-delimited list of hosts where tasks are started in parallel. All other rsh and ssh options are silently ignored.
You cannot run blaunch directly from the command line as a standalone command.
blaunch only works within an LSF job; it can only be used to launch tasks on remote hosts that are part of a job allocation. On success, blaunch exits with 0.

Windows: blaunch is supported on Windows 2000 or later with the following exceptions:
- Only the following signals are supported: SIGKILL, SIGSTOP, SIGCONT.
- The -n option is not supported.
- CMD.EXE /C <user command line> is used as the intermediate command shell when -no-shell is not specified.
- CMD.EXE /C is not used when -no-shell is specified.
- Windows Vista User Account Control must be configured correctly to run jobs.
See the Platform LSF Command Reference for more information about the blaunch command.
LSF APIs for the blaunch distributed application framework
LSF provides the following APIs for programming your own applications to use the blaunch distributed application framework:
- lsb_launch() -- a synchronous API call to allow source level integration with vendor MPI implementations. This API launches the specified command (argv) on the remote nodes in parallel. LSF must be installed before integrating your MPI implementation with lsb_launch(). The lsb_launch() API requires the full set of liblsf.so and libbat.so (or liblsf.a and libbat.a).
- lsb_getalloc() -- allocates memory for a host list to be used for launching parallel tasks through blaunch and the lsb_launch() API. It is the responsibility of the caller to free the host list when it is no longer needed. On success, the host list is a list of strings. Before freeing the host list, the individual elements must be freed. An application using the lsb_getalloc() API is assumed to be part of an LSF job, with LSB_MCPU_HOSTS set in the environment.

See the Platform LSF API Reference for more information about these APIs.
The blaunch job environment
blaunch determines from the job environment what job it is running under, and what the allocation for the job is. These are determined by examining the environment variables LSB_JOBID, LSB_JOBINDEX, and LSB_MCPU_HOSTS. If any of these variables do not exist, blaunch exits with a non-zero value. Similarly, if blaunch is used to start a task on a host not listed in LSB_MCPU_HOSTS, the command exits with a non-zero value.

The job submission script contains the blaunch command in place of rsh or ssh. The blaunch command does sanity checking of the environment, checking for LSB_JOBID and LSB_MCPU_HOSTS. The blaunch command contacts the job RES to validate the information determined from the job environment. When the job RES receives the validation request from blaunch, it registers with the root sbatchd to handle signals for the job.

The job RES periodically requests resource usage for the remote tasks. This message also acts as a heartbeat for the job. If a resource usage request is not made within a certain period of time, it is assumed that the job is gone and that the remote tasks should be shut down. This timeout is configurable in an application profile in lsb.applications.

The blaunch command also honors the parameters LSB_CMD_LOG_MASK, LSB_DEBUG_CMD, and LSB_CMD_LOGDIR when defined in lsf.conf or as environment variables. The environment variables take precedence over the values in lsf.conf.

To ensure that no other users can run jobs on hosts allocated to tasks launched by blaunch, set LSF_DISABLE_LSRUN=Y in lsf.conf. When LSF_DISABLE_LSRUN=Y is defined, RES refuses remote connections from lsrun and lsgrun unless the user is either an LSF administrator or root. LSF_ROOT_REX must be defined for remote execution by root. Other remote execution commands, such as ch and lsmake, are not affected.

Temporary directory for tasks launched by blaunch
By default, LSF creates a temporary directory for a job only on the first execution host. If LSF_TMPDIR is set in lsf.conf, the path of the job temporary directory on the first execution host is set to LSF_TMPDIR/job_ID.tmpdir.

If LSB_SET_TMPDIR=Y, the environment variable TMPDIR is set equal to the path specified by LSF_TMPDIR. This value for TMPDIR overrides any value that might be set in the submission environment.

Tasks launched through the blaunch distributed application framework make use of the LSF temporary directory specified by LSF_TMPDIR:
- When the environment variable TMPDIR is set on the first execution host, the blaunch framework propagates this environment variable to all execution hosts when launching remote tasks.
- The job RES or the task RES creates the directory specified by TMPDIR if it does not already exist before starting the job.
- The directory created by the job RES or task RES has permission 0700 and is owned by the execution user.
- If the TMPDIR directory was created by the task RES, LSF deletes the temporary directory and its contents when the task is complete.
- If the TMPDIR directory was created by the job RES, LSF deletes the temporary directory and its contents when the job is done.
- If the TMPDIR directory is on a shared file system, it is assumed to be shared by all the hosts allocated to the blaunch job, so LSF does not remove TMPDIR directories created by the job RES or task RES.

Automatic generation of the job host file
LSF automatically places the allocated hosts for a job into the $LSB_HOSTS and $LSB_MCPU_HOSTS environment variables. Since most MPI implementations and parallel applications expect to read the allocated hosts from a file, LSF creates a host file in the default job output directory $HOME/.lsbatch on the execution host before the job runs, and deletes it after the job has finished running. The name of the host file has the format:

.lsb.<jobID>.hostfile

The host file contains one host per line. For example, if LSB_MCPU_HOSTS="hostA 2 hostB 2 hostC 1", the host file contains:

hostA
hostA
hostB
hostB
hostC

LSF publishes the full path to the host file by setting the environment variable LSB_DJOB_HOSTFILE.
Configuring application profiles for the blaunch framework
You can configure an application profile in lsb.applications to control the behavior of a parallel or distributed application when a remote task exits. Specify a value for RTASK_GONE_ACTION in the application profile to define what LSF does when a remote task exits. The default behavior is:

When ...                           LSF ...
Task exits with zero value         Does nothing
Task exits with non-zero value     Does nothing
Task crashes                       Shuts down the entire job
RTASK_GONE_ACTION has the following syntax:

RTASK_GONE_ACTION="[KILLJOB_TASKDONE | KILLJOB_TASKEXIT] [IGNORE_TASKCRASH]"
Where:
- IGNORE_TASKCRASH
A remote task crashes. LSF does nothing. The job continues to launch the next task.
- KILLJOB_TASKDONE
A remote task exits with zero value. LSF terminates all tasks in the job.
- KILLJOB_TASKEXIT
A remote task exits with non-zero value. LSF terminates all tasks in the job.
For example:

RTASK_GONE_ACTION="IGNORE_TASKCRASH KILLJOB_TASKEXIT"

RTASK_GONE_ACTION only applies to the blaunch distributed application framework.

When defined in an application profile, the LSB_DJOB_RTASK_GONE_ACTION variable is set when running bsub -app for the specified application. You can also use the environment variable LSB_DJOB_RTASK_GONE_ACTION to override the value set in the application profile.
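As a sketch, the corresponding application profile stanza in lsb.applications might look like the following (the profile name djob and the description are illustrative):

```
Begin Application
NAME              = djob
DESCRIPTION       = blaunch distributed application profile
RTASK_GONE_ACTION = "IGNORE_TASKCRASH KILLJOB_TASKEXIT"
End Application
```

Jobs submitted with bsub -app djob then pick up this task-exit behavior.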
By default, LSF shuts down the entire job if the connection with the task RES is lost, whether through connection failure, validation timeout, or heartbeat timeout. You can configure an application profile in lsb.applications so that only the current tasks are shut down, not the entire job.

Use DJOB_COMMFAIL_ACTION="KILL_TASKS" to define the behavior of LSF when it detects a communication failure between itself and one or more tasks. If not defined, LSF terminates all tasks, and shuts down the job. If set to KILL_TASKS, LSF tries to kill all the current tasks of a parallel or distributed job associated with the communication failure.

DJOB_COMMFAIL_ACTION only applies to the blaunch distributed application framework. When defined in an application profile, the LSB_DJOB_COMMFAIL_ACTION environment variable is set when running bsub -app for the specified application.

LSF can run an appropriate script that is responsible for setup and cleanup of the job launching environment. You can specify the name of the appropriate script in an application profile in lsb.applications.

Use DJOB_ENV_SCRIPT to define the path to a script that sets the environment for the parallel or distributed job launcher. The script runs as the user, and is part of the job. DJOB_ENV_SCRIPT only applies to the blaunch distributed application framework. If a full path is specified, LSF uses the path name for the execution. If a full path is not specified, LSF looks for the script in LSF_BINDIR.
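As a sketch, a DJOB_ENV_SCRIPT could look like the following. LSF invokes the script with a setup argument before launching the job and a cleanup argument after it finishes; the function wrapper and the per-job temporary directory shown here are illustrative assumptions, not LSF requirements:

```shell
#!/bin/sh
# Sketch of a DJOB_ENV_SCRIPT for the blaunch framework, written as a
# function so it can be exercised directly.
djob_env() {
    # Hypothetical per-job work area, derived from LSB_JOBID so that
    # setup and cleanup (separate invocations) agree on the path.
    workdir="${TMPDIR:-/tmp}/myapp.${LSB_JOBID:-0}"
    case "$1" in
    setup)
        # A non-zero exit at setup makes the job exit.
        mkdir -p "$workdir" || return 1
        ;;
    cleanup)
        # The job's real exit value is used regardless of cleanup's result.
        rm -rf "$workdir"
        ;;
    *)
        echo "usage: djob_env setup|cleanup" >&2
        return 2
        ;;
    esac
    return 0
}

djob_env setup && echo "environment ready"
djob_env cleanup
```

With DJOB_ENV_SCRIPT pointing at such a script, LSF performs the setup call before launching tasks and the cleanup call after the job is done.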
The specified script must support a setup argument and a cleanup argument. LSF invokes the script with the setup argument before launching the actual job to set up the environment, and with the cleanup argument after the job is finished.

LSF assumes that if setup cannot be performed, the environment to run the job does not exist. If the script returns a non-zero value at setup, an error is printed to stderr of the job, and the job exits.

Regardless of the return value of the script at cleanup, the real job exit value is used. If the return value of the script is non-zero, an error message is printed to stderr of the job.

When defined in an application profile, the LSB_DJOB_ENV_SCRIPT variable is set when running bsub -app for the specified application.

For example, if DJOB_ENV_SCRIPT=mpich.script, LSF runs

$LSF_BINDIR/mpich.script setup

to set up the environment to run an MPICH job. After the job completes, LSF runs

$LSF_BINDIR/mpich.script cleanup

On cleanup, the mpich.script file could, for example, remove any temporary files and release resources used by the job. Changes to the LSB_DJOB_ENV_SCRIPT environment variable made by the script are visible to the job.

Use DJOB_HB_INTERVAL in an application profile in lsb.applications to configure an interval in seconds used to update the heartbeat between LSF and the tasks of a parallel or distributed job. DJOB_HB_INTERVAL only applies to the blaunch distributed application framework.

When DJOB_HB_INTERVAL is specified, the interval is scaled according to the number of tasks in the job:
max(DJOB_HB_INTERVAL, 10) + host_factor

where host_factor = 0.01 * number of hosts allocated for the job
When defined in an application profile, the LSB_DJOB_HB_INTERVAL variable is set in the parallel or distributed job environment. You should not manually change the value of LSB_DJOB_HB_INTERVAL.
By default, the interval is equal to SBD_SLEEP_TIME in lsb.params, where the default value of SBD_SLEEP_TIME is 30 seconds.

Use DJOB_RU_INTERVAL in an application profile in lsb.applications to configure an interval in seconds used to update the resource usage for the tasks of a parallel or distributed job. DJOB_RU_INTERVAL only applies to the blaunch distributed application framework.

When DJOB_RU_INTERVAL is specified, the interval is scaled according to the number of tasks in the job:
max(DJOB_RU_INTERVAL, 10) + host_factor

where host_factor = 0.01 * number of hosts allocated for the job
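As a worked example of the scaling formula: with DJOB_RU_INTERVAL=5 and 200 allocated hosts, the effective interval is max(5, 10) + 0.01 * 200 = 12 seconds. The computation can be checked in shell:

```shell
# Compute the effective update interval from the scaling formula:
#   max(configured_interval, 10) + 0.01 * number_of_allocated_hosts
effective_interval() {
    awk -v i="$1" -v hosts="$2" 'BEGIN {
        base = (i > 10) ? i : 10        # max(interval, 10)
        printf "%g\n", base + 0.01 * hosts
    }'
}

effective_interval 5 200    # prints 12
effective_interval 30 50    # prints 30.5
```

The same arithmetic applies to DJOB_HB_INTERVAL.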
When defined in an application profile, the LSB_DJOB_RU_INTERVAL variable is set in the parallel or distributed job environment. You should not manually change the value of LSB_DJOB_RU_INTERVAL.
By default, the interval is equal to SBD_SLEEP_TIME in lsb.params, where the default value of SBD_SLEEP_TIME is 30 seconds.

How blaunch supports task geometry and process group files
The current support for task geometry in LSF requires the user submitting a job to specify the wanted task geometry by setting the environment variable LSB_PJL_TASK_GEOMETRY in the submission environment before job submission. LSF checks for LSB_PJL_TASK_GEOMETRY and modifies LSB_MCPU_HOSTS appropriately.

The environment variable LSB_PJL_TASK_GEOMETRY is checked for all parallel jobs. If LSB_PJL_TASK_GEOMETRY is set and users submit a parallel job (a job that requests more than 1 slot), LSF attempts to shape LSB_MCPU_HOSTS accordingly.
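For example, to place tasks 0 and 2 together on one node and tasks 1 and 3 on another, the geometry is exported before submission. The task-count derivation and the job name below are illustrative:

```shell
# Group tasks onto nodes: tasks 0 and 2 on the first node of the
# allocation, tasks 1 and 3 on the second node.
export LSB_PJL_TASK_GEOMETRY="{(0,2)(1,3)}"

# Derive the task count from the geometry string (each digit group
# is one task ID), so it could be passed to bsub -n.
ntasks=$(echo "$LSB_PJL_TASK_GEOMETRY" | tr '(),{}' '\n\n\n\n\n' | grep -c '[0-9]')
echo "$ntasks"   # 4

# The job would then be submitted as (not run here):
#   bsub -n "$ntasks" blaunch ./myjob
```

LSF then shapes LSB_MCPU_HOSTS so that each parenthesized group of task IDs lands on one node of the allocation.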
Resource collection for all commands in a job script
Parallel and distributed jobs are typically launched with a job script. If your job script runs multiple commands, you can ensure that resource usage is collected correctly for all commands in a job script by configuring LSF_HPC_EXTENSIONS=CUMULATIVE_RUSAGE in lsf.conf. Resource usage is collected for jobs in the job script, rather than being overwritten when each command is executed.

Resizable jobs and blaunch
Because a resizable job can be resized at any time, the blaunch framework is aware of newly added resources (hosts) and released resources. When a validation request comes with additional resources, the blaunch framework accepts the request and launches the remote tasks accordingly. When part of an allocation is released, the blaunch framework makes sure no remote tasks are running on the released resources, terminating remote tasks on the released hosts if necessary. Any further validation requests with those released resources are rejected.

The blaunch framework provides the following functionality for resizable jobs:
- The blaunch command and the lsb_getalloc() API call access the up-to-date resource allocation through the LSB_DJOB_HOSTFILE environment variable.
- A validation request (to launch remote tasks) with additional resources succeeds.
- A validation request (to launch remote tasks) with released resources fails.
- Remote tasks on released resources are terminated, and the blaunch framework terminates tasks on a host when the host has been completely removed from the allocation.
- When releasing resources, LSF allows a configurable grace period (DJOB_RESIZE_GRACE_PERIOD in lsb.applications) for tasks to clean up and exit. By default, there is no grace period.
- When remote tasks are launched on newly added hosts but the notification command fails, those remote tasks are terminated.
Submitting jobs with blaunch
Use bsub to call blaunch, or to invoke an execution script that calls blaunch. The blaunch command assumes that bsub -n implies one task per job slot.

- Submit a job:
bsub -n 4 blaunch myjob
- Submit a job to launch tasks on a specific host:
bsub -n 4 blaunch hostA myjob
- Submit a job with a host list:
bsub -n 4 blaunch -z "hostA hostB" myjob
- Submit a job with a host file:
bsub -n 4 blaunch -u ./hostfile myjob
- Submit a job to an application profile:
bsub -n 4 -app djob blaunch myjob

Example execution scripts
Launching MPICH-P4 tasks
To launch MPICH-P4 tasks through LSF using the blaunch framework, substitute the path to rsh or ssh with the path to blaunch. For example:

Sample mpirun script changes:

...
# Set default variables
AUTOMOUNTFIX="sed -e s@/tmp_mnt/@/@g"
DEFAULT_DEVICE=ch_p4
RSHCOMMAND="$LSF_BINDIR/blaunch"
SYNCLOC=/bin/sync
CC="cc"
...

You must also set special arguments for the ch_p4 device:

#! /bin/sh
#
# mpirun.ch_p4.args
#
# Special args for the ch_p4 device
setrshcmd="yes"
givenPGFile=0
case $arg in
...

Sample job submission script:

#! /bin/sh
#
# job script for MPICH-P4
#
#BSUB -n 2
#BSUB -R 'span[ptile=1]'
#BSUB -o %J.out
#BSUB -e %J.err
NUMPROC=`wc -l $LSB_DJOB_HOSTFILE | cut -f 1 -d ' '`
mpirun -n $NUMPROC -machinefile $LSB_DJOB_HOSTFILE ./myjob

Launching ANSYS jobs
To launch an ANSYS job through LSF using the blaunch framework, substitute the path to rsh or ssh with the path to blaunch. For example:

#BSUB -o stdout.txt
#BSUB -e stderr.txt
# Note: This case statement should be used to set up any
# environment variables needed to run the different versions
# of Ansys. All versions in this case statement that have the
# string "version list entry" on the same line will appear as
# choices in the Ansys service submission page.
case $VERSION in
10.0) #version list entry
    export ANSYS_DIR=/usr/share/app/ansys_inc/v100/Ansys
    export ANSYSLMD_LICENSE_FILE=1051@licserver.company.com
    export MPI_REMSH=/opt/lsf/bin/blaunch
    program=${ANSYS_DIR}/bin/ansys100
    ;;
*)
    echo "Invalid version ($VERSION) specified"
    exit 1
    ;;
esac
if [ -z "$JOBNAME" ]; then
    export JOBNAME=ANSYS-$$
fi
if [ $CPUS -eq 1 ]; then
    ${program} -p ansys -j $JOBNAME -s read -l en-us -b -i $INPUT $OPTS
else
    if [ $MEMORY_ARCH = "Distributed" ]; then
        HOSTLIST=`echo $LSB_HOSTS | sed s/" "/":1:"/g`
        ${program} -j $JOBNAME -p ansys -pp -dis -machines \
            ${HOSTLIST}:1 -i $INPUT $OPTS
    else
        ${program} -j $JOBNAME -p ansys -pp -dis -np $CPUS \
            -i $INPUT $OPTS
    fi
fi
OpenMP Jobs
Platform LSF provides the ability to start parallel jobs that use OpenMP to communicate between processes on shared-memory machines and MPI to communicate across networked and non-shared-memory machines.
This implementation allows you to specify the number of machines and to reserve an equal number of processors per machine. When the job is dispatched, PAM only starts one process per machine.
The OpenMP specifications are owned and managed by the OpenMP Architecture Review Board. See www.openmp.org for detailed information.

OpenMP esub

An esub for OpenMP jobs, esub.openmp, is installed with Platform LSF. The OpenMP esub sets the environment variable LSF_PAM_HOSTLIST_USE=unique, and starts PAM.

Use bsub -a openmp to submit OpenMP jobs.

Submitting OpenMP jobs
To run an OpenMP job with MPI on multiple hosts, specify the number of processors and the number of processes per machine. For example, to reserve 32 processors and run 4 processes per machine:

bsub -a openmp -n 32 -R "span[ptile=4]" myOpenMPJob

myOpenMPJob runs across 8 machines (32/4=8) and PAM starts 1 MPI process per machine.

To run a parallel OpenMP job on a single host, specify the number of processors:

bsub -a openmp -n 4 -R "span[hosts=1]" myOpenMPJob
PVM Jobs
Parallel Virtual Machine (PVM) is a parallel programming system distributed by Oak Ridge National Laboratory. PVM programs are controlled by the PVM hosts file, which contains host names and other information.
PVM esub

An esub for PVM jobs, esub.pvm, is installed with Platform LSF. The PVM esub calls the pvmjob script.

Use bsub -a pvm to submit PVM jobs.

pvmjob script

The pvmjob shell script is invoked by esub.pvm to run PVM programs as parallel LSF jobs. The pvmjob script reads the LSF environment variables, sets up the PVM hosts file and then runs the PVM job. If your PVM job needs special options in the hosts file, you can modify the pvmjob script.

Example

For example, if the command line to run your PVM job is:

myjob data1 -o out1

the following command submits this job to run on 10 processors:

bsub -a pvm -n 10 myjob data1 -o out1

Other parallel programming packages can be supported in the same way.
SGI Vendor MPI Support
Compiling and linking your MPI program
You must use the SGI C compiler (cc by default). You cannot use mpicc to build your programs.

For example, use one of the following compilation commands to build the program mpi_sgi:

- On IRIX/TRIX:
cc -g -64 -o mpi_sgi mpi_sgi.c -lmpi
f90 -g -64 -o mpi_sgi mpi_sgi.f -lmpi
cc -g -n32 -mips3 -o mpi_sgi mpi_sgi.c -lmpi
- On Altix:
efc -g -o mpi_sgi mpi_sgi.f -lmpi
ecc -g -o mpi_sgi mpi_sgi.c -lmpi
gcc -g -o mpi_sgi mpi_sgi.c -lmpi

System requirements
SGI MPI has the following system requirements:
- Your SGI systems must be running IRIX 6.5.24 or higher, or SGI Altix ProPack 3.0 or higher, with the latest operating system patches applied. Use the uname command to determine your system configuration. For example:
uname -aR
IRIX64 hostA 6.5 6.5.17f 07121148 IP27
- SGI MPI version: use one of the following commands to determine your installation:
Configuring LSF to work with SGI MPI
To use 32-bit or 64-bit SGI MPI with Platform LSF, set the following parameters in
lsf.conf
:
- Set LSF_VPLUGIN to the full path to the MPI library
libxmpi.so
.For example:
You can specify multiple paths for LSF_VPLUGIN, separated by colons (
:
). For example, the following configures both/usr/lib32/libxmpi.so
for SGI IRIX, and/usr/lib/libxmpi.so
for SGI IRIX:LSF_VPLUGIN="/usr/lib32/libxmpi.so:/usr/lib/libxmpi.so"- LSF_PAM_USE_ASH=Y enables LSF to use the SGI Array Session Handler (ASH) to propagate signals to the parallel jobs.
See the SGI system documentation and the array_session(5) man page for more information about array sessions.

For PAM to access the libxmpi.so library, the file permission mode must be 755 (-rwxr-xr-x).

For PAM jobs on Altix, the SGI Array Services daemon arrayd must be running and AUTHENTICATION must be set to NONE in the SGI array services authentication file /usr/lib/array/arrayd.auth (comment out the AUTHENTICATION NOREMOTE method and uncomment the AUTHENTICATION NONE method).

To run multihost MPI applications, you must also enable rsh without password prompt between hosts:
- The remote host must be defined in the arrayd configuration.
- Configure .rhosts so that rsh does not require a password.

The pam command
The pam command invokes the Platform Parallel Application Manager (PAM) to run parallel batch jobs in LSF. It uses the mpirun library and SGI array services to spawn the child processes needed for the parallel tasks that make up your MPI application. It starts these tasks on the systems allocated by LSF. The allocation includes the number of execution hosts needed, and the number of child processes needed on each host.

The -mpi option on the bsub and pam command line is equivalent to mpirun in the SGI environment.

The -auto_place option on the pam command line tells the mpirun library to launch the MPI application according to the resources allocated by LSF.

The -n option on the pam command line notifies PAM to wait for -n number of TaskStarters to return.

You can use both bsub -n and pam -n in the same job submission. The number specified in the pam -n option should be less than or equal to the number specified by bsub -n. If the number of tasks specified with pam -n is greater than the number specified by bsub -n, the pam -n is ignored.

For example, you can specify:

bsub -n 5 pam -n 2 a.out

Here, the job requests 5 processors, but PAM only starts 2 parallel tasks.
Examples
To run a job and have LSF select the host, the command:
mpirun -np 4 a.out
is entered as:
bsub -n 4 pam -mpi -auto_place a.out

To run a single-host job and have LSF select the host, the command:
mpirun -np 4 a.out
is entered as:
bsub -n 4 -R "span[hosts=1]" pam -mpi -auto_place a.out

To run a multihost job (5 processors per host) and have LSF select the hosts, the command:
mpirun hosta -np 5 a.out : hostb -np 5 a.out
is entered as:
bsub -n 10 -R "span[ptile=5]" pam -mpi -auto_place a.out

For a complete list of mpirun options and environment variable controls, refer to the SGI mpirun man page.

Limitations
- SBD and MBD take a few seconds to get the process IDs and process group IDs of the PAM jobs from the SGI MPI components. If you use bstop, bresume, or bkill before this happens, uncontrolled MPI child processes may be left running.
- A single MPI job cannot run on a heterogeneous architecture. The entire job must run on systems of a single architecture.
HP Vendor MPI Support
When you use mpirun in stand-alone mode, you specify the host names to be used by the MPI job.

Automatic Platform MPI library configuration

During installation, lsfinstall sets LSF_VPLUGIN in lsf.conf to the full path to the MPI library libmpirm.sl. For example:

LSF_VPLUGIN="/opt/mpi/lib/pa1.1/libmpirm.sl"

On Linux hosts running Platform MPI, you must manually set the full path to the vendor MPI library libmpirm.so. For example, if Platform MPI is installed in /opt/hpmpi:

LSF_VPLUGIN="/opt/hpmpi/lib/linux_ia32/libmpirm.so"

The pam command
The pam command invokes the Platform Parallel Application Manager (PAM) to run parallel batch jobs in LSF. It uses the mpirun library to spawn the child processes needed for the parallel tasks that make up your MPI application. It starts these tasks on the systems allocated by LSF. The allocation includes the number of execution hosts needed, and the number of child processes needed on each host.

Automatic host allocation by LSF

To achieve better resource utilization, you can have LSF manage the allocation of hosts, coordinating the start-up phase with mpirun. This is done by preceding the regular mpirun command with:

bsub pam -mpi

The -mpi option on the bsub and pam command line is equivalent to mpirun in the Platform MPI environment. The -mpi option must be the first option of the pam command.

How to run Platform MPI jobs
- Add the Platform MPI command mpirun to the $PATH environment variable.
- Set the MPI_ROOT environment variable to point to the Platform MPI installation directory.
- Set LSF_VPLUGIN in lsf.conf or in your environment.
- Submit the job with the -lsb_hosts option:
bsub -I -n 3 pam -mpi mpirun -lsb_hosts myjob

For example, to run a single-host job and have LSF select the host, the command:
mpirun -np 14 a.out
is entered as:
bsub pam -mpi mpirun -np 14 a.out

For example, to run a multi-host job and have LSF select the hosts, the command:
mpirun -f appfile
is entered as:
bsub -n 8 -R "span[ptile=4]" pam -mpi mpirun -f appfile

where appfile contains the following entries:

-h host1 -np 4 a.out
-h host2 -np 4 b.out

In this example, host1 and host2 are used in place of actual host names and refer to the actual hosts that LSF allocates to the job.
LSF Generic Parallel Job Launcher Framework
Any parallel execution environment (for example, a vendor MPI, or an MPI package like MPICH-GM, MPICH-P4, or LAM/MPI) can be made compatible with LSF using the generic parallel job launcher (PJL) framework.
Vendor MPIs for SGI MPI and Platform MPI are already integrated with Platform LSF.
The generic PJL integration is a framework that allows you to integrate any vendor's parallel job launcher with Platform LSF. PAM does not launch the parallel jobs directly, but manages the job to monitor job resource usage and provide job control over the parallel tasks.
System requirements
- Vendor parallel package is installed and operating properly
- LSF cluster is installed and operating properly
How the Generic PJL Framework Works
Terminology
First execution host -- The host name at the top of the execution host list as determined by LSF. Starts PAM.

Execution hosts -- The most suitable hosts to execute the batch job as determined by LSF.

Task -- A process that runs on a host; the individual process of a parallel application. A parallel job consists of multiple tasks that could be executed on different hosts.

PJL (Parallel Job Launcher) -- Any executable script or binary capable of starting parallel tasks on all hosts assigned for a parallel job (for example, mpirun).

SBD -- Slave Batch Daemons (SBDs) are batch job execution agents residing on the execution hosts. sbatchd receives jobs from mbatchd in the form of a job specification and starts RES to run the job according to the specification. sbatchd reports the batch job status to mbatchd whenever the job state changes.

esub -- Reads the environment variable LSF_PJL_TYPE, and generates the appropriate pam command line to invoke the PJL. The esub programs provided in LSF_SERVERDIR set this variable to the proper type.

TS (TaskStarter) -- An executable responsible for starting a parallel task on a host and reporting the process ID and host name to PAM. TS is located in LSF_BINDIR.

PAM (Parallel Application Manager) -- The supervisor of any parallel LSF job. PAM allows LSF to collect resources used by the job and perform job control. PAM starts the PJL and maintains connection with RES on all execution hosts. It collects resource usage, and reports the resource usage of tasks and its own PID and PGID to sbatchd. It propagates signals to all process groups and individual tasks, and cleans up tasks as needed.

PJL wrapper -- A script that starts the PJL. The wrapper is typically used to set up the environment for the parallel job and then invokes the PJL.

RES (Remote Execution Server) -- An LSF daemon running on each server host. Accepts remote execution requests to provide transparent and secure remote execution of jobs and tasks. RES manages all remote tasks and forwards signals, standard I/O, resource consumption data, and parallel job information between PAM and the tasks.
Architecture
Running a parallel job using a non-integrated PJL
Without the generic PJL framework, the PJL starts tasks directly on each host, and manages the job.
Even if the MPI job was submitted through LSF, LSF never receives information about the individual tasks. LSF is not able to track job resource usage or provide job control.
If you simply replace PAM with a parallel job launcher that is not integrated with LSF, LSF loses control of the process and is not able to monitor job resource usage or provide job control. LSF never receives information about the individual tasks.
PAM is the resource manager for the job. The key step in the integration is to place TS in the job startup hierarchy, just before the task starts. TS must be the parent process of each task in order to collect the task process ID (PID) and pass it to PAM.
The following figure illustrates the relationship between PAM, PJL, PJL wrapper, TS, and the parallel job tasks.
[Figure: relationship between PAM, PJL wrapper, PJL, TS, and the parallel job tasks]
- Instead of starting the PJL directly, PAM starts the specified PJL wrapper on a single host.
- The PJL wrapper starts the PJL (for example, mpirun).
- Instead of starting tasks directly, PJL starts TS on each host selected to run the parallel job.
- TS starts the task.

Each TS reports its task PID and host name back to PAM. Now PAM can perform job control and resource usage collection through RES.
TaskStarter also collects the exit status of the task and reports it to PAM. When PJL exits, PAM exits with the same termination status as the PJL.
If you choose to customize mpirun.lsf and your job scripts call mpirun.lsf more than once, make use of the environment variables that call a custom command, script, or binary when needed:
- $MPIRUN_LSF_PRE_EXEC: Runs before calling pam..PJL_wrapper.
- $MPIRUN_LSF_POST_EXEC: Runs after calling pam..PJL_wrapper.
These environment variables are run as the user.
Integration methods
There are two ways to integrate the PJL.

In the first method, PAM rewrites the PJL command line to insert TS in the correct position, and sets callback information for TS to communicate with PAM. Use this method when:
- You always use the same number of PJL arguments
- The job in the PJL command line is the executable application that starts the parallel tasks
For details, see Integration Method 1.

In the second method, you rewrite or wrap the PJL to include TS and callback information for TS to communicate with PAM. This method of integration is the most flexible, but may be more difficult to implement. Use this method when:
- The number of PJL arguments is uncertain
- Parallel tasks have a complex startup sequence
- The job in the PJL command line could be a script instead of the executable application that starts the parallel tasks
For details, see Integration Method 2.
Error handling
- If PAM cannot start PJL, no tasks are started and PAM exits.
- If PAM does not receive all the TS registration messages (host name and PID) within the timeout specified by LSF_HPC_PJL_LOADENV_TIMEOUT in
lsf.conf
, it assumes that the job cannot be executed. It kills the PJL, kills all the tasks that have been successfully started (if any), and exits. The default for LSF_HPC_PJL_LOADENV_TIMEOUT is 300 seconds.
- If TS cannot start the task, it reports this to PAM and exits. After all tasks report, PAM checks that every task has started. If any task did not start, PAM kills the PJL, sends a message to kill all the remote tasks that were successfully started, and exits.
- If TS terminates before it can report the exit status of the task to PAM, PAM never receives all the exit statuses. It then exits when the PJL exits.
- If the PJL exits before all TS have registered the exit status of the tasks, then PAM assumes the parallel job is completed, and communicates with RES, which signals the tasks.
Using the pam -n option (SGI MPI only)
The
-n
option on thepam
command line specifies the number of tasks that PAM should start. You can use both
bsub -n
and
pam -n
in the same job submission. The number specified in the
pam -n
option should be less than or equal to the number specified by
bsub -n
. If the number of tasks specified with
pam -n
is greater than the number specified by
bsub -n
, the
pam -n
option is ignored. For example, you can specify:
bsub -n 5 pam -n 2 -mpi a.out
Here, 5 processors are reserved for the job, but PAM only starts 2 parallel tasks.
Custom job controls for parallel jobs
As with sequential LSF jobs, you can use the JOB_CONTROLS parameter in the queue (
lsb.queues
) to configure custom job controls for your parallel jobs.
Using the LSB_JOBRES_PID and LSB_PAMPID environment variables
How to use these two variables in your job control scripts:
- If
pam
and the job RES are in the same process group, use LSB_JOBRES_PID. Here is an example of JOB_CONTROLS defined in the queue:
JOB_CONTROLS = TERMINATE[kill -CONT -$LSB_JOBRES_PID; kill -TERM -$LSB_JOBRES_PID]
- If
pam
and the job RES are in different process groups (for example, pam
is started by a wrapper, which could set its own PGID), use both LSB_JOBRES_PID and LSB_PAMPID to make sure your parallel jobs are cleaned up:
JOB_CONTROLS = TERMINATE[kill -CONT -$LSB_JOBRES_PID -$LSB_PAMPID; kill -TERM -$LSB_JOBRES_PID -$LSB_PAMPID]
LSB_PAM_PID may not be available when the job first starts. It takes some time for pam to register its PID with
sbatchd
.
See the Platform LSF Configuration Reference for information about JOB_CONTROLS in the
lsb.queues
file.See Administering Platform LSF for information about configuring job controls.
Sample job termination script for queue job control
By default, LSF sends a SIGUSR2 signal to terminate a job that has reached its run limit or deadline. Some applications do not respond to the SIGUSR2 signal (for example, LAM/MPI), so jobs may not exit immediately when a job run limit is reached. You should configure your queues with a custom job termination action specified by the JOB_CONTROLS parameter.
Use the following sample job termination control script for the TERMINATE job control in the
hpc_linux
queue for LAM/MPI jobs:
#!/bin/sh
#JOB_CONTROL_LOG=job.control.log.$LSB_BATCH_JID
JOB_CONTROL_LOG=/dev/null
kill -CONT -$LSB_JOBRES_PID >>$JOB_CONTROL_LOG 2>&1
if [ "$LSB_PAM_PID" != "" -a "$LSB_PAM_PID" != "0" ]; then
    kill -TERM $LSB_PAM_PID >>$JOB_CONTROL_LOG 2>&1
    MACHINETYPE=`uname -a | cut -d" " -f 5`
    while [ "$LSB_PAM_PID" != "0" -a "$LSB_PAM_PID" != "" ] # pam is running
    do
        if [ "$MACHINETYPE" = "CRAY" ]; then
            PIDS=`(ps -ef; ps auxww) 2>/dev/null | egrep ".*[/\[ \t]pam[] \t]*$" | sed -n "/grep/d;s/^ *[^ \t]* *\([0-9]*\).*/\1/p" | sort -u`
        else
            PIDS=`(ps -ef; ps auxww) 2>/dev/null | egrep " pam |/pam | pam$|/pam$" | sed -n "/grep/d;s/^ *[^ \t]* *\([0-9]*\).*/\1/p" | sort -u`
        fi
        echo PIDS=$PIDS >> $JOB_CONTROL_LOG
        if [ "$PIDS" = "" ]; then # no pam is running
            break;
        fi
        foundPamPid="N"
        for apid in $PIDS
        do
            if [ "$apid" = "$LSB_PAM_PID" ]; then # pam is running
                foundPamPid="Y"
                break
            fi
        done
        if [ "$foundPamPid" = "N" ]; then
            break # pam has exited
        fi
        sleep 2
    done
fi
# Use other termination signals if SIGTERM is
# caught and ignored by your application.
kill -TERM -$LSB_JOBRES_PID >>$JOB_CONTROL_LOG 2>&1
exit 0
- Create a job control script named
job_terminate_control.sh
.- Make the script executable:
chmod +x job_terminate_control.sh
- Edit the
hpc_linux
queue inlsb.queues
to configure yourjob_terminate_control.sh
script as the TERMINATE action in the JOB_CONTROLS parameter. For example:Begin Queue QUEUE_NAME = hpc_linux_tv PRIORITY = 30 NICE = 20 # ... JOB_CONTROLS = TERMINATE[kill -CONT -$LSB_JOBRES_PID; kill -TERM -$LSB_JOBRES_PID]JOB_CONTROLS = TERMINATE [/
path/job_terminate_control.sh]
TERMINATE_WHEN = LOAD PREEMPT WINDOW RERUNNABLE = NO INTERACTIVE = NO DESCRIPTION = Platform LSF TotalView Debug queue. End Queue- Reconfigure your cluster to make the change take effect:
#badmin mbdrestart
[ Top ]
Integration Method 1
When to use this integration method
In this method, PAM rewrites the PJL command line to insert TS in the correct position, and set callback information for TS to communicate with PAM.
Use this method when:
- You always use the same number of PJL arguments
- The job in the PJL command line is the executable application that starts the parallel tasks
Using pam to call the PJL
Submit jobs using
pam
in the following format:
pam [other_pam_options] -g num_args pjl [pjl_options] job [job_options]
The command line includes:
- The
pam
command and its options (other_pam_options)- the
pam -g
num_args option- The parallel job launcher or PJL wrapper (pjl) and its options (pjl_options)
- The job to run (job) and its options (job_options)
The
-g
option is required to use the generic PJL framework. You must specify all the otherpam
options before-g
.
num_args specifies how many space-separated arguments in the command line are related to the PJL, including the PJL itself (after that, the rest of the command line is assumed to be related to the binary application that launches the parallel tasks).
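The counting rule can be sketched in shell: word-split the PJL portion of the command line and use the word count as the -g value. The PJL name and options here are the hypothetical 3_arg_pjl example, not a real launcher:

```shell
#!/bin/sh
# Sketch: num_args is the number of space-separated words that belong to
# the PJL, counting the PJL binary itself. The PJL shown is hypothetical.
pjl_part="3_arg_pjl -a -b group_name"

set -- $pjl_part      # word-split the PJL portion of the command line
num_args=$#
echo "pam [pam_options] -g $num_args $pjl_part job [job_options]"
```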
For example:
- A PJL named
no_arg_pjl
takes no options, so-g 1
is required after the otherpam
options:
pam [pam_options] -g 1 no_arg_pjl job [job_options]
- A PJL named
3_arg_pjl
and takes the options-a
,-b
, and group_name, so the option
-g 4
is required after the otherpam
options:
pam [pam_options] -g 4 3_arg_pjl -a -b group_name job [job_options]
How PAM inserts TaskStarter
Before the PJL is started, PAM automatically modifies the command line and inserts the TS, the host and port for TS to contact PAM, and the LSF_ENVDIR in the correct position before the actual job.
TS is placed between the PJL and the parallel application. In this way, the TS starts each task, and LSF can monitor resource usage and control the task.
For example, if your LSF directory is
/usr/share/lsf
and you input:
pam [pam_options] -g 3 my_pjl -b group_name job [job_options]
PAM automatically modifies the PJL command line to:
my_pjl -b group_name /usr/share/lsf/TaskStarter -p host_name:port_number -c /usr/share/lsf/conf job [job_options]
See Example Integration: LAM/MPI.
[ Top ]
Integration Method 2
When to use this integration method
In this method, you rewrite or wrap the PJL to include TS and callback information for TS to communicate with PAM. This method of integration is the most flexible, but may be more difficult to implement.
Use this method when:
- The number of PJL arguments varies
- Parallel tasks have a complex startup sequence
- The job in the PJL command line could be a script instead of the executable application that starts the parallel tasks
Using pam to call the PJL
Submit jobs using
pam
in the following format:
pam [other_pam_options] -g pjl_wrap [pjl_wrap_options] job [job_options]
The command line includes:
- The PJL wrapper script (pjl_wrap) and its options (pjl_wrap_options). This wrapper script must insert TS in the correct position before the actual job command.
- The job to run (job) and its options (job_options)
The job could be a wrapper script that starts the application that starts the parallel tasks, or it could be the executable application itself.
The
-g
option is required to use the generic PJL framework. You must specify all the otherpam
options before-g
.Placing TaskStarter in your code
Every task of the job must be started by the TaskStarter binary provided by Platform Computing.
When you use this method, PAM does not insert TS for you. You must modify your code to use TS and the LSF_TS_OPTIONS environment variable. LSF_TS_OPTIONS is created by PAM on the first execution host and contains the callback information for TS to contact PAM.
You must insert TS and the PAM callback information directly in front of the executable application that starts the parallel tasks.
To place TS and its options, you can modify either the PJL wrapper or the job script, depending on your implementation. If the package requires the path, specify the full path to
TaskStarter
.Example
This example modifies the PJL wrapper. The job script includes both the PJL wrapper and the job itself.
Without the integration, your job submission command line is:
bsub -n 2 jobscript
Your job script is:
#!/bin/sh
if [ -n "$ENV1" ]; then
    pjl -opt1 job1
else
    pjl -opt2 -opt3 job2
fi
After the integration, your job submission command line includes the
pam
command:bsub -n 2 pam -g new_jobscriptYour new job script inserts TS and LSF_TS_OPTIONS before the jobs:
#!/bin/sh
if [ -n "$ENV1" ]; then
    pjl -opt1 /usr/share/lsf/TaskStarter $LSF_TS_OPTIONS job1
else
    pjl -opt2 -opt3 /usr/share/lsf/TaskStarter $LSF_TS_OPTIONS job2
fi
See Example Integration: LAM/MPI.
[ Top ]
Tuning PAM Scalability and Fault Tolerance
To improve performance and scalability for large parallel jobs, tune the following parameters.
Parameters for PAM (lsf.conf)
For better performance, you can adjust the following parameters in
lsf.conf
. The user's environment can override these.
LSF_HPC_PJL_LOADENV_TIMEOUT
Timeout value in seconds for PJL to load or unload the environment. For example, the time needed for IBM POE to load or unload adapter windows.
At job startup, the PJL times out if the first task fails to register within the specified timeout value. At job shutdown, the PJL times out if it fails to exit after the last TaskStarter termination report within the specified timeout value.
Default: LSF_HPC_PJL_LOADENV_TIMEOUT=300
LSF_PAM_RUSAGE_UPD_FACTOR
This factor adjusts the update interval according to the following calculation:
RUSAGE_UPDATE_INTERVAL + num_tasks * 1 * LSF_PAM_RUSAGE_UPD_FACTOR.
PAM updates resource usage for each task for every SBD_SLEEP_TIME + num_tasks * 1 seconds (by default, SBD_SLEEP_TIME=15). For large parallel jobs, this interval is too long. As the number of parallel tasks increases, LSF_PAM_RUSAGE_UPD_FACTOR causes more frequent updates.
Default: LSF_PAM_RUSAGE_UPD_FACTOR=0.01
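Read literally, the interval arithmetic above works out as follows. This is a sketch that just evaluates the documented formula with the defaults SBD_SLEEP_TIME=15 and LSF_PAM_RUSAGE_UPD_FACTOR=0.01; it is not LSF code:

```shell
#!/bin/sh
# Sketch: evaluate the documented update-interval formula for a given
# task count, using SBD_SLEEP_TIME=15 and LSF_PAM_RUSAGE_UPD_FACTOR=0.01.
update_interval() {
    awk -v n="$1" 'BEGIN { printf "%g\n", 15 + n * 1 * 0.01 }'
}

update_interval 100     # interval in seconds for a 100-task job
update_interval 1000    # interval in seconds for a 1000-task job
```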
[ Top ]
Running Jobs with Task Geometry
Specifying task geometry allows you to group tasks of a parallel job step to run together on the same node. Task geometry allows for flexibility in how tasks are grouped for execution on system nodes. You cannot specify the particular nodes that these groups run on; the scheduler decides which nodes run the specified groupings.
Task geometry is supported for all Platform LSF MPI integrations including IBM POE, LAM/MPI, MPICH-GM, MPICH-P4, and Intel® MPI.
Use the LSB_PJL_TASK_GEOMETRY environment variable to specify task geometry for your jobs. LSB_PJL_TASK_GEOMETRY overrides any process group or command file placement options.
The environment variable LSB_PJL_TASK_GEOMETRY is checked for all parallel jobs. If LSB_PJL_TASK_GEOMETRY is set when users submit a parallel job (a job that requests more than one slot), LSF attempts to shape LSB_MCPU_HOSTS accordingly.
The
mpirun.lsf
script sets the LSB_MCPU_HOSTS environment variable in the job according to the task geometry specification. The PJL wrapper script controls the actual PJL to start tasks based on the new LSB_MCPU_HOSTS and task geometry.
Syntax
setenv LSB_PJL_TASK_GEOMETRY "{(
task_ID,
...)
...}"
For example, to submit a job to spawn 8 tasks and span 4 nodes, specify:
setenv LSB_PJL_TASK_GEOMETRY "{(2,5,7)(0,6)(1,3)(4)}"
- Tasks 2, 5, and 7 run on one node
- Tasks 0 and 6 run on another node
- Tasks 1 and 3 run on a third node
- Task 4 runs on one node alone
Each task_ID number corresponds to a task ID in the job, and each set of parentheses contains the task IDs assigned to one node. Tasks can appear in any order, but the specification must begin with task 0 and include every task ID number; you cannot skip a task ID. Use braces to enclose the entire task geometry specification, use parentheses to enclose each group of tasks assigned to one node, and use commas to separate task IDs.
For example:
setenv LSB_PJL_TASK_GEOMETRY "{(1)(2)}"
is incorrect because it does not start from task 0.
setenv LSB_PJL_TASK_GEOMETRY "{(0)(3)}"
is incorrect because it does not specify tasks 1 and 2.
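The validity rules can be checked mechanically. The following sketch is not part of LSF and assumes no duplicate task IDs; it verifies that a specification starts at task 0 and skips no IDs:

```shell
#!/bin/sh
# Sketch: print "ok" if a geometry contains exactly tasks 0..N-1,
# "bad" otherwise. Assumes no duplicate task IDs.
check_geometry() {
    ids=$(printf '%s' "$1" | sed 's/[{}(),]/ /g')   # keep only the task IDs
    max=-1
    count=0
    for id in $ids; do
        count=$((count + 1))
        [ "$id" -gt "$max" ] && max=$id
    done
    # IDs 0..max with no gaps means there are exactly max+1 of them
    if [ $((max + 1)) -eq "$count" ]; then echo ok; else echo bad; fi
}

check_geometry "{(2,5,7)(0,6)(1,3)(4)}"   # complete: ok
check_geometry "{(1)(2)}"                  # missing task 0: bad
check_geometry "{(0)(3)}"                  # missing tasks 1 and 2: bad
```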
LSB_PJL_TASK_GEOMETRY cannot request more hosts than specified by the
bsub -n
option. For example:
setenv LSB_PJL_TASK_GEOMETRY "{(0)(1)(2)}"
specifies three nodes, one task per node. A correct job submission must request at least 3 hosts:
bsub -n 3 -R "span[ptile=1]" -I -a mpich_gm mpirun.lsf my_job
Job <564> is submitted to queue <hpc_linux>.
<<Waiting for dispatch ...>>
<<Starting on hostA>>
...
Planning your task geometry specification
You should plan your task geometry in advance and specify the job resource requirements so that LSF selects appropriate hosts.
Use
bsub -n
and
-R "span[ptile=]"
to make sure LSF selects appropriate hosts to run the job, so that:
- The correct number of nodes is specified
- All execution hosts have the same number of available slots
- The
ptile
value is the maximum number of CPUs required on one node by task geometry specifications.LSB_PJL_TASK_GEOMETRY only guarantees the geometry but does not guarantee the host order. You must make sure each host selected by LSF can run any group of tasks specified in LSB_PJL_TASK_GEOMETRY.
You can also use
bsub -x
to run jobs exclusively on a host. No other jobs share the node once this job is scheduled.
Usage notes and limitations
- MPICH-P4 jobs:
MPICH-P4
mpirun
requires the first task to run on the local node OR all tasks to run on remote nodes (-nolocal
). If the LSB_PJL_TASK_GEOMETRY environment variable is set,
mpirun.lsf
makes sure the task group that contains task 0 in LSB_PJL_TASK_GEOMETRY runs on the first node.
- LAM/MPI jobs:
You should not specify
mpirun n
manually on the command line; you should use LSB_PJL_TASK_GEOMETRY for consistency with other Platform LSF MPI integrations. LSB_PJL_TASK_GEOMETRY overrides the
mpirun n
option.
- OpenMP jobs:
Each thread of an OpenMP job is counted as a task. For example, suppose the task geometry specification is:
setenv LSB_PJL_TASK_GEOMETRY "{(1)(2,3,4)(0,5)}"
and task 5 is an
openmp
job that spawns 3 threads. From this specification, the job spans 3 nodes, and the maximum number of CPUs required on one node is 4 (because
requires 4 CPUs). The job should be submitted as:
bsub -n 12 -R "span[ptile=4]" -a openmp mpirun.lsf myjob
Examples
For the following task geometry:
setenv LSB_PJL_TASK_GEOMETRY "{(2,5,7)(0,6)(1,3)(4)}"
The job submission should look like:
bsub -n 12 -R "span[ptile=3]" -a poe mpirun.lsf myjob
If task 6 is an OpenMP job that spawns 4 threads, the job submission is:
bsub -n 20 -R "span[ptile=5]" -a poe mpirun.lsf myjob
Do not use -a openmp or set LSF_PAM_HOSTLIST_USE for OpenMP jobs.
A POE job has three tasks:
task0
,
task1
, and
task2
. Task
task2
spawns 3 threads. Tasks
task0
and
task1
run on one node and
task2
runs on the other node. The job submission is:
bsub -a poe -n 6 -R "span[ptile=3]" mpirun.lsf -cmdfile mycmdfile
where
mycmdfile
contains:
task0
task1
task2
The order of the tasks in the task geometry specification must match the order of tasks in
mycmdfile
:
setenv LSB_PJL_TASK_GEOMETRY "{(0,1)(2)}"
If the order of tasks in
mycmdfile
changes, you must change the task geometry specification accordingly.For example, if
mycmdfile
contains:task0 task2 task1the task geometry must be changed to:
setenv LSB_PJL_TASK_GEOMETRY "{(0,2)(1)}"[ Top ]
Enforcing Resource Usage Limits for Parallel Tasks
A typical Platform LSF parallel job launches its tasks across multiple hosts. By default, you can enforce limits only on the total resources used by all the tasks in the job. Because PAM only reports the sum of parallel task resource usage, LSF does not enforce resource usage limits on individual tasks in a parallel job.
For example, resource usage limits cannot control the memory allocation of a single task of a parallel job to prevent it from allocating excessive memory and bringing down the entire system. For some jobs, the total resource usage may exceed a configured resource usage limit even if no single task does, and the job is terminated when it does not need to be.
Attempting to limit individual tasks by setting a system-level swap hard limit (RLIMIT_AS) in the system limit configuration file (
/etc/security/limits.conf
) is not satisfactory, because it only prevents tasks running on that host from allocating more memory than they should; other tasks in the job can continue to run, with unpredictable results.By default, custom job controls (JOB_CONTROL in
lsb.queues
) apply only to the entire job, not individual parallel tasks.Enabling resource usage limit enforcement for parallel tasks
Use the LSF_HPC_EXTENSIONS options TASK_SWAPLIMIT and TASK_MEMLIMIT in
lsf.conf
to enable resource usage limit enforcement and job control for parallel tasks. When TASK_SWAPLIMIT or TASK_MEMLIMIT is set in LSF_HPC_EXTENSIONS, LSF terminates the entire parallel job if any single task exceeds the limit setting for memory and swap limits.Other resource usage limits (CPU limit, process limit, run limit, and so on) continue to be enforced for the entire job, not for individual tasks.
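For example, an lsf.conf entry enabling both task-level limits might look like the following sketch (include only the extensions you need):

```
LSF_HPC_EXTENSIONS="TASK_SWAPLIMIT TASK_MEMLIMIT"
```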
For detailed information about resource usage limits in LSF, see the "Runtime Resource Usage Limits" chapter in Administering Platform LSF.
Assumptions and behavior
- To enforce resource usage limits by parallel task, you must use the LSF generic PJL framework (PAM/TS) to launch your parallel jobs.
- This feature only affects parallel jobs monitored by PAM. It has no effect on other LSF jobs.
- LSF_HPC_EXTENSIONS=TASK_SWAPLIMIT overrides the default behavior of swap limits (
bsub -v
,bmod -v
, or SWAPLIMIT inlsb.queues
).- LSF_HPC_EXTENSIONS=TASK_MEMLIMIT overrides the default behavior of memory limits (
bsub -M
,bmod -M
, or MEMLIMIT inlsb.queues
).- LSF_HPC_EXTENSIONS=TASK_MEMLIMIT overrides LSB_MEMLIMIT_ENFORCE=Y or LSB_JOB_MEMLIMIT=Y in
lsf.conf
- When a parallel job is terminated because of task limit enforcement, LSF sets a value in the LSB_JOBEXIT_INFO environment variable for any post-execution programs.
- When a parallel job is terminated because of task limit enforcement, LSF logs the job termination reason in
lsb.acct
file, and
bacct
displays the termination reason.
[ Top ]
Example Integration: LAM/MPI
The script
lammpirun_wrapper
is the PJL wrapper. Use either Integration Method 1 or Integration Method 2 to call this script:
pam [other_pam_options] -g num_args lammpirun_wrapper job [job_options]
pam [other_pam_options] -g lammpirun_wrapper job [job_options]
Example script
#!/bin/sh # # ----------------------------------------------------- # Source the LSF environment. Optional. # ----------------------------------------------------- . ${LSF_ENVDIR}/lsf.conf # ----------------------------------------------------- # Set up the variable LSF_TS representing the TaskStarter. # ----------------------------------------------------- LSF_TS="$LSF_BINDIR/TaskStarter" # --------------------------------------------------------------------- # Define the function to handle external signals: # - display the signal received and the shutdown action to the user # - log the signal received and the daemon shutdown action # - exit gracefully by shutting down the daemon # - set the exit code to 1 # ---------------------------------------------------------------------- # lammpirun_exit() { trap '' 1 2 3 15 echo "Signal Received, Terminating the job<${TMP_JOBID}> and run lamhalt ..." echo "Signal Received, Terminating the job<${TMP_JOBID}> and run lamhalt ..." >>$LOGFILE $LAMHALT_CMD >>$LOGFILE 2>&1 exit 1 } #lammpirun_exit #----------------------------------- # Name: who_am_i # Synopsis: who_am_i # Environment Variables: # Description: # It returns the name of the current user. # Return Value: # User name. 
#----------------------------------- who_am_i() { if [ `uname` = ConvexOS ] ; then _my_name=`whoami | sed -e "s/[ ]//g"` else _my_name=`id | sed -e 's/[^(]*(\([^)]*\)).*/\1/' | sed -e "s/[ ]//g"` fi echo $_my_name } # who_am_i # # ----------------------------------------------------- # Set up the script's log file: # - create and set the variable LOGDIR to represent the log file directory # - fill in your own choice of directory LOGDIR # - the log directory you choose must be accessible by the user from all hosts # - create a log file with a unique name, based on the job ID # - if the log directory is not specified, the log file is /dev/null # - the first entry logs the file creation date and file name # - we create and set a second variable DISPLAY_JOBID to format the job # ID properly for writing to the log file # ---------------------------------------------------- # # # Please specify your own LOGDIR, # Your LOGDIR must be accessible by the user from all hosts. # LOGDIR="" TMP_JOBID="" if [ -z "$LSB_JOBINDEX" -o "$LSB_JOBINDEX" = "0" ]; then TMP_JOBID="$LSB_JOBID" DISPLAY_JOBID="$LSB_JOBID" else TMP_JOBID="$LSB_JOBID"_"$LSB_JOBINDEX" DISPLAY_JOBID="$LSB_JOBID[$LSB_JOBINDEX]" fi if [ -z "$LOGDIR" ]; then LOGFILE="/dev/null" else LOGFILE="${LOGDIR}/lammpirun_wrapper.job${TMP_JOBID}.log" fi # # ----------------------------------------------------- # Create and set variables to represent the commands used in the script: # - to modify this script to use different commands, edit this section # ---------------------------------------------------- # TPING_CMD="tping" LAMMPIRUN_CMD="mpirun" LAMBOOT_CMD="lamboot" LAMHALT_CMD="lamhalt" # # ----------------------------------------------------- # Define an exit value to rerun the script if it fails # - create and set the variable EXIT_VALUE to represent the requeue exit value # - we assume you have enabled job requeue in LSF # - we assume 66 is one of the job requeue values you specified in LSF # 
---------------------------------------------------- #
#
# EXIT_VALUE should not be set to 0
EXIT_VALUE="66"
#
# -----------------------------------------------------
# Write the first entry to the script's log file
# - date of creation
# - name of log file
# ---------------------------------------------------- #
my_name=`who_am_i`
echo "`date` $my_name" >>$LOGFILE
# -----------------------------------------------------
# Use the signal handling function to handle specific external signals.
# ---------------------------------------------------- #
trap lammpirun_exit 1 2 3 15
#
# -----------------------------------------------------
# Set up a hosts file in the specific format required by LAM MPI:
# - remove any old hosts file
# - create a new hosts file with a unique name using the LSF job ID
# - write a comment at the start of the hosts file
# - if the hosts file was not created properly, display an error to
#   the user and exit
# - define the variables HOST, NUM_PROC, FLAG, and TOTAL_CPUS to
#   help with parsing the host information
# - LSF's selected hosts are described in LSB_MCPU_HOSTS environment variable
# - parse LSB_MCPU_HOSTS into the components
# - write the new hosts file using this information
# - write a comment at the end of the hosts file
# - log the contents of the new hosts file to the script log file
# ---------------------------------------------------- #
LAMHOST_FILE=".lsf_${TMP_JOBID}_lammpi.hosts"
if [ -d "$HOME" ]; then
    LAMHOST_FILE="$HOME/$LAMHOST_FILE"
fi
#
#
# start a new host file from scratch
rm -f $LAMHOST_FILE
echo "# LAMMPI host file created by LSF on `date`" >> $LAMHOST_FILE
# check if we were able to start writing the conf file
if [ -f $LAMHOST_FILE ]; then
    :
else
    echo "$0: can't create $LAMHOST_FILE"
    exit 1
fi
HOST=""
NUM_PROC=""
FLAG=""
TOTAL_CPUS=0
for TOKEN in $LSB_MCPU_HOSTS
do
    if [ -z "$FLAG" ]; then
        HOST="$TOKEN"
        FLAG="0"
    else
        NUM_PROC="$TOKEN"
        TOTAL_CPUS=`expr $TOTAL_CPUS + $NUM_PROC`
        FLAG="1"
    fi
    if [ "$FLAG" = "1" ]; then
_x=0 while [ $_x -lt $NUM_PROC ] do echo "$HOST" >>$LAMHOST_FILE _x=`expr $_x + 1` done # get ready for the next host FLAG="" HOST="" NUM_PROC="" fi done # last thing added to LAMHOST_FILE echo "# end of LAMHOST file" >> $LAMHOST_FILE echo "Your lamboot hostfile looks like:" >> $LOGFILE cat $LAMHOST_FILE >> $LOGFILE # ----------------------------------------------------- # Process the command line: # - extract [mpiopts] from the command line # - extract jobname [jobopts] from the command line # ----------------------------------------------------- ARG0=`$LAMMPIRUN_CMD -h 2>&1 | \ egrep '^[[:space:]]+-[[:alpha:][:digit:]-]+[[:space:]][[:space:]]' | \ awk '{printf "%s ", $1}'` # get -ton,t and -w / nw options TMPARG=`$LAMMPIRUN_CMD -h 2>&1 | \ egrep '^[[:space:]]+-[[:alpha:]_-]+[[:space:]]*(,|/)[[:space:]]- [[:alpha:]]*' | sed 's/,/ /'| sed 's/\// /' | \ awk '{printf "%s %s ", $1, $2}'` ARG0="$ARG0 $TMPARG" ARG1=`$LAMMPIRUN_CMD -h 2>&1 | \ egrep '^[[:space:]]+-[[:alpha:]_- ]+[[:space:]]+<[[:alpha:][:space:]_]+>[[:space:]]' | \ awk '{printf "%s ", $1}'` while [ $# -gt 0 ] do MPIRunOpt="0" #single-valued options for option in $ARG1 do if [ "$option" = "$1" ]; then MPIRunOpt="1" case "$1" in -np|-c) shift shift ;; *) LAMMPI_OPTS="$LAMMPI_OPTS $1" #get option name shift LAMMPI_OPTS="$LAMMPI_OPTS $1" #get option value shift ;; esac break fi done if [ $MPIRunOpt = "1" ]; then : else #Non-valued options for option in $ARG0 do if [ $option = "$1" ]; then MPIRunOpt="1" case "$1" in -v) shift ;; *) LAMMPI_OPTS="$LAMMPI_OPTS $1" shift ;; esac break fi done fi if [ $MPIRunOpt = "1" ]; then : else JOB_CMDLN="$*" break fi done # ----------------------------------------------------------------------------- # Set up the CMD_LINE variable representing the integrated section of the # command line: # - LSF_TS, script variable representing the TaskStarter binary. # TaskStarter must start each and every job task process. 
# - LSF_TS_OPTIONS, LSF environment variable containing all necessary
#   information for TaskStarter to callback to LSF's Parallel Application
#   Manager.
# - JOB_CMDLN, script variable containing the job and job options
#------------------------------------------------------------------------------
if [ -z "$LSF_TS_OPTIONS" ]
then
    echo CMD_LINE="$JOB_CMDLN" >> $LOGFILE
    CMD_LINE="$JOB_CMDLN "
else
    echo CMD_LINE="$LSF_TS $LSF_TS_OPTIONS $JOB_CMDLN" >> $LOGFILE
    CMD_LINE="$LSF_TS $LSF_TS_OPTIONS $JOB_CMDLN "
fi
#
# -----------------------------------------------------
# Pre-execution steps required by LAMMPI:
# - define the variable LAM_MPI_SOCKET_SUFFIX using the LSF
#   job ID and export it
# - run lamboot command and log the action
# - append the hosts file to the script log file
# - run tping command and log the action and output
# - capture the result of tping and test for success before proceeding
# - exits with the "requeue" exit value if pre-execution setup failed
# ---------------------------------------------------- #
LAM_MPI_SOCKET_SUFFIX="${LSB_JOBID}_${LSB_JOBINDEX}"
export LAM_MPI_SOCKET_SUFFIX
echo $LAMBOOT_CMD $LAMHOST_FILE >>$LOGFILE
$LAMBOOT_CMD $LAMHOST_FILE >>$LOGFILE 2>&1
echo $TPING_CMD N -c 1 >>$LOGFILE
$TPING_CMD N -c 1 >>$LOGFILE 2>&1
EXIT_VALUE="$?"
if [ "$EXIT_VALUE" = "0" ]; then
#
# -----------------------------------------------------
# Run the parallel job launcher:
# - log the action
# - trap the exit value
# ---------------------------------------------------- #
#call mpirun -np # a.out
echo "Your command line looks like:" >> $LOGFILE
echo $LAMMPIRUN_CMD $LAMMPI_OPTS -v C $CMD_LINE >> $LOGFILE
$LAMMPIRUN_CMD $LAMMPI_OPTS -v C $CMD_LINE
EXIT_VALUE=$?
# # ----------------------------------------------------- # Post-execution steps required by LAMMPI: # - run lamhalt # - log the action # ---------------------------------------------------- # echo $LAMHALT_CMD >>$LOGFILE $LAMHALT_CMD >>$LOGFILE 2>&1 fi # # ----------------------------------------------------- # Clean up after running this script: # - delete the hosts file we created # - log the end of the job # - log the exit value of the job # ---------------------------------------------------- # # cleanup temp and conf file then exit rm -f $LAMHOST_FILE echo "Job<${DISPLAY_JOBID}> exits with exit value $EXIT_VALUE." >>$LOGFILE 2>&1 # To support multiple jobs inside one job script # Sleep one sec to allow next lamd start up, otherwise tping will return error sleep 1 exit $EXIT_VALUE # # ----------------------------------------------------- # End the script. # ---------------------------------------------------- #[ Top ]
Tips for Writing PJL Wrapper Scripts
A wrapper script is often used to call the PJL. We assume the PJL is not integrated with LSF, so if PAM were to start the PJL directly, the PJL would not automatically use the hosts that LSF selected, or allow LSF to collect resource usage information.
The wrapper script can set up the environment before starting the actual job.
The script should create and use its own log file, for troubleshooting purposes. For example, it should log a message each time it runs a command, and it should also log the result of the command. The first entry might record the successful creation of the log file itself.
Set up aliases for the commands used in the script, and identify the full path to the command. Use the alias throughout the script, instead of calling the command directly. This makes it simple to change the path or the command at a later time, by editing just one line.
If the script is interrupted or terminated before it finishes, it should exit gracefully and undo any work it started. This might include closing files it was using, removing files it created, shutting down daemons it started, and recording the signal event in the log file for troubleshooting purposes.
In LSF, job requeue is an optional feature that depends on the job's exit value. PAM exits with the same exit value as the PJL or its wrapper script. For some or all errors, the script can exit with a special value that causes LSF to requeue the job.
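As a sketch of this pattern (the value 66 is only an example and must match a requeue exit value configured for the queue, e.g. in REQUEUE_EXIT_VALUES):

```shell
#!/bin/sh
# Sketch: return a designated requeue exit value when a setup step fails.
# 66 is an example; it must appear in the queue's REQUEUE_EXIT_VALUES.
REQUEUE_EXIT=66

run_setup_step() {
    # $1 stands in for the exit status of a real pre-execution command
    if [ "$1" -ne 0 ]; then
        echo "setup failed; exiting $REQUEUE_EXIT so LSF requeues the job"
        return "$REQUEUE_EXIT"
    fi
    return 0
}

run_setup_step 0 && echo "setup ok, starting the PJL"
```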
Use
/dev/null
to redirect any screen output to a null file.Set LSF_ENVDIR and source the
lsf.conf
file. This gives you access to LSF configuration settings.The hosts LSF has selected to run the job are described by the environment variable LSB_MCPU_HOSTS. This environment variable specifies a list, in quotes, consisting of one or more host names paired with the number of processors to use on that host:
"host_name number_processors host_name number_processors ..."
Parse this variable into the components and create a host file in the specific format required by the vendor PJL. In this way, the hosts LSF has chosen are passed to the PJL.
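The parsing step can be sketched in a few lines of portable shell. This sketch assumes a machine-file format of one host name per processor; the real format depends on your PJL:

```shell
#!/bin/sh
# Sketch: expand an LSB_MCPU_HOSTS-style value ("host ncpus host ncpus ...")
# into one host name per processor, a common machine-file format.
expand_hosts() {
    set -- $1                     # word-split "host n host n ..."
    while [ $# -ge 2 ]; do
        host=$1; n=$2; shift 2
        i=0
        while [ "$i" -lt "$n" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done
}

# Example value; in a real job, LSF sets LSB_MCPU_HOSTS itself.
expand_hosts "hostA 2 hostB 1"
```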
Depending on the vendor, the PJL may require some special pre-execution work, such as initializing environment variables or starting daemons. You should log each pre-exec task in the log file, and also check the result and handle errors if a required task failed.
If an external resource is used to identify MPI-enabled hosts, LSF has selected hosts based on the availability of that resource. However, there is some time delay between LSF scheduling the job and the script starting the PJL. It's a good idea to make the script verify that required resources are still available on the selected hosts (and exit if the hosts are no longer able to execute the parallel job). Do this immediately before starting the PJL.
The most important function of the wrapper script is to start the PJL and have it execute the parallel job on the hosts selected by LSF. Normally, you use a version of the mpirun command.
Depending on the vendor, the PJL may require some special post-execution work, such as stopping daemons. You should log each post-exec task in the log file, and also check the result and handle errors if any task failed.
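The launch step typically looks like the sketch below. The mpirun path, the `-machinefile` option name, and `myapp` are all illustrative; check your vendor's mpirun documentation for the real option names. The sketch only builds and prints the command line; the real wrapper would exec it:

```shell
#!/bin/sh
# Build the PJL command line the wrapper would exec.
MPIRUN=/opt/mpi/bin/mpirun   # hypothetical PJL path
HOSTFILE=./machinefile       # machinefile built from LSB_MCPU_HOSTS
NPROC=4                      # normally computed from LSB_MCPU_HOSTS

cmd="$MPIRUN -machinefile $HOSTFILE -np $NPROC myapp"
echo "would exec: $cmd"
# Real wrapper:  exec $cmd   -- PAM then sees the PJL's exit value.
```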
The script should exit gracefully. This might include closing files it used, removing files it created, shutting down daemons it started, and recording each action in the log file for troubleshooting purposes.
[ Top ]
Other Integration Options
Once the PJL integration is successful, you might be interested in the following LSF features.
For more information about these features, see the LSF documentation.
Using a job starter
A job starter is a wrapper script that can set up the environment before starting the actual job.
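A job starter is installed with the JOB_STARTER queue parameter; the starter receives the job command as its arguments. A minimal sketch, where the PATH setting is just an illustrative example of environment setup:

```shell
#!/bin/sh
# Minimal job starter sketch: prepare the environment, then run the job.
job_starter () {
    PATH=/opt/mpi/bin:$PATH; export PATH   # hypothetical MPI location
    exec "$@"                # replace the starter with the actual job
}

( job_starter echo "job running" )
```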
Using external resources
You may need to identify MPI-enabled hosts.
If all hosts in the LSF cluster can be used to run the parallel jobs, with no restrictions, you don't need to differentiate between regular hosts and MPI-enabled hosts.
You can use an external resource to identify suitable hosts for running your parallel jobs.
To identify MPI-enabled hosts, you can configure a static Boolean resource in LSF.
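For example, a static Boolean resource (named `mpi` here; the name is your choice) could be defined in the Resource section of lsf.shared, then assigned to the appropriate hosts in the cluster file. This fragment is a sketch of the configuration format:

```
Begin Resource
RESOURCENAME   TYPE      INTERVAL   INCREASING   DESCRIPTION
mpi            Boolean   ()         ()           (MPI-enabled host)
End Resource
```

Jobs can then request such hosts with a resource requirement, for example `bsub -R "mpi" ...`.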
For some integrations, to make sure the parallel jobs are sent to suitable hosts, you must track a dynamic resource (such as free ports). You can use an LSF ELIM to report the availability of these. See Administering Platform LSF for information about writing ELIMs.
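An ELIM periodically writes lines of the form "number_of_resources name1 value1 ..." to stdout. The sketch below reports a single hypothetical resource named `freeports`, with the actual probe left as a stub:

```shell
#!/bin/sh
# Minimal ELIM sketch; the resource name and the probe are hypothetical.
count_free_ports () { echo 16; }   # stub: replace with a real port probe

elim_report () {
    # "<number_of_resources> <name> <value>"
    echo "1 freeports $(count_free_ports)"
}

elim_report
# A real ELIM loops forever:  while :; do elim_report; sleep 60; done
```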
- If you create a dedicated LSF queue to manage the parallel jobs, make sure the queue's host list includes only MPI-enabled hosts.
- The bsub option -m host_name allows you to specify hosts by name. All the hosts you name are used to run the parallel job.
- The bsub option -R res_req allows you to specify any LSF resource requirements, including a list of hosts; in this case, you specify that the hosts selected must have one of the names in your host list.
Using esub
An esub program can contain logic that modifies a job before submitting it to LSF. The esub can be used to simplify the user input. An example is the LAM/MPI integration in the Platform open source FTP directory.
[ Top ]
Date Modified: January 10, 2011
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2011 Platform Computing Corporation. All rights reserved.