- Running IBM POE Jobs
- Migrating IBM Load Leveler Job Scripts to Use LSF Options
- Controlling Allocation and User Authentication for IBM POE Jobs
- Submitting IBM POE Jobs over InfiniBand
Running IBM POE Jobs
The IBM Parallel Operating Environment (POE) interfaces with the Resource Manager to allow users to run parallel jobs requiring dedicated access to the high performance switch.
The LSF HPC integration for IBM High-Performance Switch (HPS) systems provides support for submitting POE jobs from AIX hosts to run on IBM HPS hosts.
An IBM HPS system consists of multiple nodes running AIX. The system can be configured with a high-performance switch to allow high bandwidth and low latency communication between the nodes. The allocation of the switch to jobs as well as the division of nodes into pools is controlled by the HPS Resource Manager.
Run chown to change the owner of nrt_api to root, and then use chmod to set the setuid bit (chmod u+s).
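For example (a brief sketch; the location of the nrt_api binary is an assumption, so adjust the path to your installation):
# run as root
chown root $LSF_BINDIR/nrt_api
chmod u+s $LSF_BINDIR/nrt_api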
hpc_ibm queue for POE jobs
During installation, lsfinstall configures a queue in lsb.queues named hpc_ibm for running POE jobs. It defines requeue exit values to enable requeuing of POE jobs if some users submit jobs requiring exclusive access to the node. The poejob script will exit with 133 if it is necessary to requeue the job. Other types of jobs should not be submitted to the same queue. Otherwise, they will get requeued if they happen to exit with 133.

Begin Queue
QUEUE_NAME          = hpc_ibm
PRIORITY            = 30
NICE                = 20
...
RES_REQ             = select[ poe > 0 ]
REQUEUE_EXIT_VALUES = 133 134 135
...
DESCRIPTION         = Platform LSF HPC 7 for IBM. This queue is to run POE jobs ONLY.
End Queue

Configuring LSF HPC to run POE jobs
Ensure that the HPS node names are the same as their host names. That is, st_status should return the same names for the nodes that lsload returns.
1. Configure per-slot resource reservation (lsb.resources).
2. Optional. Enable exclusive mode (lsb.queues).
3. Optional. Define resource management pools (rmpool) and node locking queue threshold.
4. Optional. Define system partitions (spname).
5. Allocate switch adapter specific resources.
6. Optional. Tune PAM parameters.
7. Reconfigure to apply the changes.
1. Configure per-slot resource reservation (lsb.resources)
To support the IBM HPS architecture, LSF HPC must reserve resources based on job slots. During installation, lsfinstall configures the ReservationUsage section in lsb.resources to reserve HPS resources on a per-slot basis.
Resource usage defined in the ReservationUsage section overrides the cluster-wide RESOURCE_RESERVE_PER_SLOT parameter defined in lsb.params if it also exists.

Begin ReservationUsage
RESOURCE         METHOD
adapter_windows  PER_SLOT
ntbl_windows     PER_SLOT
csss             PER_SLOT
css0             PER_SLOT
End ReservationUsage

2. Optional. Enable exclusive mode (lsb.queues)
To support the MP_ADAPTER_USE and -adapter_use POE job options, you must enable the LSF exclusive mode for each queue. To enable exclusive mode, edit lsb.queues and set EXCLUSIVE=Y:

Begin Queue
...
EXCLUSIVE=Y
...
End Queue
3. Optional. Define resource management pools (rmpool) and node locking queue threshold
If you schedule jobs based on resource management pools, you must configure rmpool as a static resource in LSF. Resource management pools are collections of SP2 nodes that together contain all available SP2 nodes without any overlap.
For example, to configure 2 resource management pools, p1 and p2, made up of 6 SP2 nodes (sp2n1, sp2n2, sp2n3, ..., sp2n6):
- Edit lsf.shared and add an external resource called pool. For example:

Begin Resource
RESOURCENAME  TYPE     INTERVAL  INCREASING  DESCRIPTION
...
pool          Numeric  ()        ()          (sp2 resource mgmt pool)
lock          Numeric  60        Y           (IBM SP Node lock status)
...
End Resource

pool represents the resource management pool the node is in, and lock indicates whether the switch is locked.
- Edit lsf.cluster.cluster_name and allocate the pool resource. For example:

Begin ResourceMap
RESOURCENAME  LOCATION
...
pool          (p1@[sp2n1 sp2n2 sp2n3] p2@[sp2n4 sp2n5 sp2n6])
...
End ResourceMap

- Edit lsb.queues and add a threshold for the lock index in the hpc_ibm queue:

Begin Queue
QUEUE_NAME=hpc_ibm
...
lock=0
...
End Queue

The scheduling threshold on the lock index prevents dispatching to nodes that are being used in exclusive mode by other jobs.
4. Optional. Define system partitions (spname)
If you schedule jobs based on system partition names, you must configure the static resource spname. System partitions are collections of HPS nodes that together contain all available HPS nodes without any overlap. For example, to configure two system partition names, spp1 and spp2, made up of 6 SP2 nodes (sp2n1, sp2n2, sp2n3, ..., sp2n6):
- Edit lsf.shared and add an external resource called spname. For example:

Begin Resource
RESOURCENAME  TYPE    INTERVAL  INCREASING  DESCRIPTION
...
spname        String  ()        ()          (sp2 sys partition name)
...
End Resource

- Edit lsf.cluster.cluster_name and allocate the spname resource. For example:

Begin ResourceMap
RESOURCENAME  LOCATION
...
spname        (spp1@[sp2n1 sp2n3 sp2n5] spp2@[sp2n2 sp2n4 sp2n6])
...
End ResourceMap
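For example, to direct a POE job to nodes in system partition spp1 only, you could select on the spname resource (a sketch; spp1 and my_prog are placeholder names from the examples in this document):
bsub -a poe -n 4 -R "select[spname==spp1]" mpirun.lsf my_prog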
5. Allocate switch adapter specific resources
If you use a switch adapter, you must define specific resources in LSF. During installation, lsfinstall defines the following external resources in lsf.shared:
- adapter_windows -- number of free adapter windows on IBM SP Switch2 systems
- ntbl_windows -- number of free network table windows on IBM HPS systems
- css0 -- number of free adapter windows on css0 on IBM SP Switch2 systems
- csss -- number of free adapter windows on csss on IBM SP Switch2 systems
- dedicated_tasks -- number of running dedicated tasks
- ip_tasks -- number of running IP (Internet Protocol communication subsystem) tasks
- us_tasks -- number of running US (User Space communication subsystem) tasks
These resources are updated through elim.hpc.

Begin Resource
RESOURCENAME     TYPE     INTERVAL  INCREASING  DESCRIPTION
...
adapter_windows  Numeric  30        N           (free adapter windows on css0 on IBM SP)
ntbl_windows     Numeric  30        N           (free ntbl windows on IBM HPS)
poe              Numeric  30        N           (poe availability)
css0             Numeric  30        N           (free adapter windows on css0 on IBM SP)
csss             Numeric  30        N           (free adapter windows on csss on IBM SP)
dedicated_tasks  Numeric  ()        Y           (running dedicated tasks)
ip_tasks         Numeric  ()        Y           (running IP tasks)
us_tasks         Numeric  ()        Y           (running US tasks)
...
End Resource

You must edit lsf.cluster.cluster_name and allocate the external resources. For example, to configure a switch adapter for six SP2 nodes (sp2n1, sp2n2, sp2n3, ..., sp2n6):

Begin ResourceMap
RESOURCENAME     LOCATION
...
adapter_windows  [default]
ntbl_windows     [default]
css0             [default]
csss             [default]
dedicated_tasks  (0@[default])
ip_tasks         (0@[default])
us_tasks         (0@[default])
...
End ResourceMap

The adapter_windows and ntbl_windows resources are required for all POE jobs. The other three resources are only required when you run IP and US jobs at the same time.
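For example, a US-mode job on an HPS system typically selects hosts with free network table windows and reserves one window per job slot (a sketch modeled on the submission examples later in this document):
bsub -a poe -n 4 -R "select[ntbl_windows>0] rusage[ntbl_windows=1]" mpirun.lsf my_prog -euilib us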
6. Optional. Tune PAM parameters
To improve performance and scalability for large POE jobs, tune the following lsf.conf parameters. The user's environment can override these.
- LSF_HPC_PJL_LOADENV_TIMEOUT
Timeout value in seconds for PJL to load or unload the environment. For example, the time needed for IBM POE to load or unload adapter windows.
At job startup, the PJL times out if the first task fails to register within the specified timeout value. At job shutdown, the PJL times out if it fails to exit after the last Taskstarter termination report within the specified timeout value.
Default: LSF_HPC_PJL_LOADENV_TIMEOUT=300
- LSF_PAM_RUSAGE_UPD_FACTOR
This factor adjusts the update interval according to the following calculation:
RUSAGE_UPDATE_INTERVAL + num_tasks * 1 * LSF_PAM_RUSAGE_UPD_FACTOR.
PAM updates resource usage for each task for every SBD_SLEEP_TIME + num_tasks * 1 seconds (by default, SBD_SLEEP_TIME=15). For large parallel jobs, this interval is too long. As the number of parallel tasks increases, LSF_PAM_RUSAGE_UPD_FACTOR causes more frequent updates.
Default: LSF_PAM_RUSAGE_UPD_FACTOR=0.01 for large clusters.
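For example, both parameters can be set explicitly in lsf.conf (a sketch that simply restates the default values quoted above; tune them for your cluster and job sizes):
LSF_HPC_PJL_LOADENV_TIMEOUT=300
LSF_PAM_RUSAGE_UPD_FACTOR=0.01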
7. Reconfigure to apply the changes
- Run badmin ckconfig to check the configuration changes. If any errors are reported, fix the problem and check the configuration again.
- Reconfigure the cluster:
badmin reconfig
Checking configuration files ...
No errors found.
Do you want to reconfigure? [y/n] y
Reconfiguration initiated
LSF checks for any configuration errors. If no fatal errors are found, you are asked to confirm reconfiguration. If fatal errors are found, reconfiguration is aborted.
POE ELIM (elim.hpc)
An external LIM (ELIM) for POE jobs is supplied with LSF HPC.
On IBM HPS systems, the ELIM uses the st_status or ntbl_status command to collect information from the Resource Manager.
PATH variable in elim
The ELIM searches the following path for the poe and st_status commands:
PATH="/usr/bin:/bin:/usr/local/bin:/local/bin:/sbin:/usr/sbin:/usr/ucb:/usr/sbin:/usr/bsd:${PATH}"
If these commands are installed in a different directory, you must modify the PATH variable in LSF_SERVERDIR/elim.hpc to point to the correct directory.
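For example, if poe and st_status were installed under a site-specific directory such as /opt/hps/bin (a hypothetical path), you would prepend it to the PATH assignment in LSF_SERVERDIR/elim.hpc:
PATH="/opt/hps/bin:/usr/bin:/bin:/usr/local/bin:/local/bin:/sbin:/usr/sbin:/usr/ucb:/usr/bsd:${PATH}"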
POE esub (esub.poe)
The esub for POE jobs, esub.poe, is installed by lsfinstall. It is invoked using the -a poe option of bsub. By default, the POE esub sets the environment variable LSF_PJL_TYPE=poe. The job launcher, mpirun.lsf, reads the environment variable LSF_PJL_TYPE=poe and generates the appropriate pam command line to invoke POE to start the job.
The value of the bsub -n option overrides the POE -procs option. If no -n is used, the esub sets default values with the variables LSB_SUB_NUM_PROCESSORS=1 and LSB_SUB_MAX_NUM_PROCESSORS=1.
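For example, the following submission runs the job on 4 processors because bsub -n takes precedence over -procs (a sketch; my_prog is a placeholder program name):
bsub -a poe -n 4 mpirun.lsf my_prog -procs 8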
If you specify -euilib us (US mode), then -euidevice must be css0 or csss (the HPS for interprocess communications).
The -euidevice sn_all option is supported. The -euidevice sn_single option is ignored. POE jobs submitted with -euidevice sn_single use -euidevice sn_all.
POE PJL wrapper (poejob)
The POE PJL (Parallel Job Launcher) wrapper, poejob, parses the POE job options and filters out those that have been set by LSF.
Submitting POE jobs
Use bsub to submit POE jobs, including parameters required for the application and POE. PAM launches POE and collects resource usage for all running tasks in the parallel job.
Syntax
bsub -a poe [bsub_options] mpirun.lsf program_name [program_options] [poe_options]
where -a poe invokes esub.poe.
Examples
Running US jobs
To submit a POE job in US mode that runs on six processors:
bsub -a poe -n 6 mpirun.lsf my_prog -euilib us -euidevice css0
Running IP jobs
To run POE jobs in IP mode, MP_EUILIB (or -euilib) must be set to IP (Internet Protocol communication subsystem). For example:
bsub -a poe -n 6 mpirun.lsf my_prog -euilib ip ...
The POE -procs option is ignored by esub.poe. Use the bsub -n option to specify the number of processors required for the job. The default if -n is not specified is 1.
Submitting POE jobs with a job script
A wrapper script is often used to call the POE script. You can submit a job using a job script as an embedded script or directly as a job, for example:
bsub -a poe -n 4 < embedded_jobscript
bsub -a poe -n 4 jobscript
For information on generic PJL wrapper script components, see Running Parallel Jobs.
See Administering Platform LSF for information about submitting jobs with job scripts.
IBM SP Switch2 support
The SP Switch2 switch should be correctly installed and operational. By default, Platform LSF HPC only supports homogeneous clusters of IBM SP PSSP 3.4 or PSSP 3.5 SP Switch2 systems.
To verify the version of PSSP, run:
lslpp -l | grep ssp.basic
Output should look something like:
lslpp -l | grep ssp.basic
ssp.basic    3.2.0.9    COMMITTED    SP System Support Package
ssp.basic    3.2.0.9    COMMITTED    SP System Support Package
To verify the switch type, run:
SDRGetObjects Adapter css_type

Switch type             Value
SP_Switch_Adapter       2
SP_Switch_MX_Adapter    3
SP_Switch_MX2_Adapter   3
SP_Switch2_Adapter      5

SP_Switch2_Adapter indicates that you are using SP Switch2.
Use these values to configure the device_type variable in the script LSF_BINDIR/poejob. The default for device_type is 3.
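For example, on an SP Switch2 system you would change the assignment from the default 3 to 5 (a sketch; the exact assignment syntax inside the poejob script may differ):
device_type=5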
IBM High Performance Switch (HPS) support
Tasks of a parallel job running in US mode use the IBM pSeries High Performance Switch (HPS) exclusively for communication. HPS resources are referred to as network table windows. For US jobs to run, network table windows must be allocated ahead of the actual application startup.
You can run US jobs through LSF control (Load Leveler (LL) is not used). Job execution for US jobs has two stages:
- Load HPS network table windows using ntbl_api (HPS support via the AIX Switch Network Interface (SNI))
- Start the application using the POE wrapper poe_w command
Running IP jobs
IP jobs do not require loading of network table windows. You just start poe or poe_w with the proper host name list file supplied.
Starting a parallel job on a pSeries HPS system is similar to starting jobs on an SP Switch2 system:
- Load a table file to connect network table windows allocated to a task
- Launch the task over network table windows connected
- Unload the same table file to disconnect the network table window allocated to the task
Migrating IBM Load Leveler Job Scripts to Use LSF Options
You can integrate LSF with your POE jobs by modifying your job scripts to convert POE Load Leveler options to LSF options. After modifying your job scripts, your LSF job submission will be equivalent to a POE job submission:
bsub < jobscript
becomes equivalent to
llsubmit jobCmdFile
The following POE options are handled differently when converting to LSF options:
- US (User Space) options
- IP (Internet Protocol) options
- -nodes combinations
- Other Load Leveler directives
US options
Use the following combinations of US options as a guideline for converting them to LSF options.
IP options
For IP jobs that do not use a switch, adapter_use does not apply. Use the following combinations of IP options as a guideline for converting them to LSF options.
-nodes combinations
Load Leveler directives
Load Leveler job commands are handled as follows:
- Ignored by LSF
- Converted to bsub options (or queue options in lsb.queues)
- Require special handling in your job script
The following list shows each Load Leveler command and whether it is ignored, converted to a bsub option, or requires special handling:
- account_no: Ignored. Use LSF accounting.
- arguments: Ignored. Place job arguments in the job command line.
- blocking: bsub -n with span[ptile]
- all checkpoint commands: Ignored.
- class: bsub -P or -J
- comment: Ignored.
- core_limit: bsub -C
- cpu_limit: bsub -c or -n
- data_limit: bsub -D
- dependency: bsub -w
- environment: Set in job script or in esub.poe
- error: bsub -e
- executable: Ignored. Enter the job name in the job script.
- file_limit: bsub -F
- group: Ignored.
- hold: bsub -H
- image_size: bsub -v or -M
- initialdir: Ignored. The working directory is the current directory.
- input: bsub -i
- job_cpu_limit: bsub -c
- job_name: bsub -J
- job_type: Ignored. Handled by esub.poe.
- max_processors: bsub -n min, max
- min_processors: bsub -n min, max
- network: bsub -R
- node combinations: See -nodes combinations.
- notification: Set in lsf.conf
- notify_user: Set in lsf.conf
- output: bsub -o
- parallel_path: Ignored.
- preferences: bsub -R "select[...]"
- queue: bsub -q
- requirements: bsub -R and -m
- resources: bsub -R. Set rusage for each task according to the Load Leveler equivalent.
- rss_limit: bsub -M
- shell: Ignored.
- stack_limit: bsub -S
- startdate: bsub -b
- step_name: Ignored.
- task_geometry: Use the LSB_PJL_TASK_GEOMETRY environment variable to specify task geometry for your jobs. LSB_PJL_TASK_GEOMETRY overrides any mpirun n option.
- total_tasks: bsub -n
- user_priority: bsub -sp
- wall_clock_limit: bsub -W
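For example, a task_geometry directive that groups tasks 0 and 2 on one node and tasks 1 and 3 on another can be expressed before submission as follows (a sketch in csh syntax to match the job scripts below; the grouping itself is illustrative):
setenv LSB_PJL_TASK_GEOMETRY "{(0,2)(1,3)}"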
Simple job script modifications
The following example shows how to convert the POE options in a Load Leveler command file to LSF options in your job scripts for a non-shared US or IP job.
- Only one job at a time can run on a non-shared node
- An IP job can share a node with a dedicated US job (-adapter_use is dedicated)
- The POE job always runs one task per CPU, so the -cpu_use option is not used
This example uses the following POE job script to run an executable named mypoejob:

#!/bin/csh
#@ shell = /bin/csh
#@ environment = ENVIRONMENT=BATCH; COPY_ALL;\
# MP_EUILIB=us; MP_STDOUTMODE=ordered; MP_INFOLEVEL=0;
#@ network.MPI = switch,dedicated,US
#@ job_type = parallel
#@ job_name = batch-test
#@ output = $(job_name).log
#@ error = $(job_name).log
#@ account_no = USER1
#@ node = 2
#@ tasks_per_node = 8
#@ node_usage = not_shared
#@ wall_clock_limit = 1:00:00
#@ class = batch
#@ notification = never
#@ queue
# ---------------------------------------------
# Copy required workfiles to $WORKDIR, which is set
# to /scr/$user under the large GPFS work filesystem,
# named /scr.
cp ~/TESTS/mpihello $WORKDIR/mpihello
# Change directory to $WORKDIR
cd $WORKDIR
# Execute program mypoejob
poe mypoejob
poe $WORKDIR/mpihello
# Copy output data from $WORKDIR to appropriate archive FS,
# since we are currently running within a volatile
# "scratch" filesystem.
# Clean unneeded files from $WORKDIR after job ends.
rm -f $WORKDIR/mpihello
echo "Job completed at: `date`"

To convert POE options in a Load Leveler command file to LSF options
- Make sure the queue hpc_ibm is available in lsb.queues.
- Set the EXCLUSIVE parameter of the queue:
EXCLUSIVE=Y
- Create the job script for the LSF job. For example:
#!/bin/csh
# mypoe_jobscript
# Start script ---------
#BSUB -a poe
#BSUB -n 16
#BSUB -x
#BSUB -o batch_test.%J_%I.out
#BSUB -e batch_test.%J_%I.err
#BSUB -W 60
#BSUB -J batch_test
#BSUB -q hpc_ibm
setenv ENVIRONMENT BATCH
setenv MP_EUILIB us
# Copy required workfiles to $WORKDIR, which is set
# to /scr/$user under the large GPFS work filesystem,
# named /scr.
cp ~/TESTS/mpihello $WORKDIR/mpihello
# Change directory to $WORKDIR
cd $WORKDIR
# Execute program mypoejob
mpirun.lsf mypoejob -euilib us
mpirun.lsf $WORKDIR/mpihello -euilib us
# Copy output data from $WORKDIR to appropriate archive FS,
# since we are currently running within a volatile
# "scratch" filesystem.
# Clean unneeded files from $WORKDIR after job ends.
rm -f $WORKDIR/mpihello
echo "Job completed at: `date`"
# End script ----------

- Submit the job script as a redirected job, specifying the appropriate resource requirement string:
bsub -R "select[adapter_windows>0] rusage[adapter_windows=1] span[ptile=8]" < mypoe_jobscriptComparing some of the converted options
Compare the job script submission with the equivalent job submitted with all the LSF options on the command line:
bsub -x -a poe -q hpc_ibm -n 16 -R "select[adapter_windows>0] rusage[adapter_windows=1] span[ptile=8]" mpirun.lsf mypoejob -euilib us
To submit the same job as an IP job, substitute ip for us, and remove the select and rusage statements:
bsub -x -a poe -q hpc_ibm -n 16 -R "span[ptile=8]" mpirun.lsf mypoejob -euilib ip
To submit the job as a shared US or IP job, remove the bsub -x option from the job script or command line. This allows other jobs to run on the host your job is running on:
bsub -a poe -q hpc_ibm -n 16 -R "span[ptile=8]" mpirun.lsf mypoejob -euilib us
or
bsub -a poe -q hpc_ibm -n 16 -R "span[ptile=8]" mpirun.lsf mypoejob -euilib ip
Advanced job script modifications
If your environment runs any of the following:
- A mix of IP and US jobs
- A combination of dedicated and shared -adapter_use
- Unique and multiple -cpu_use
your job scripts must use the us_tasks and dedicated_tasks LSF resources.
The following examples show how to convert the POE options in a Load Leveler command file to LSF options in your job scripts for several kinds of jobs.
-adapter_use dedicated and -cpu_use unique
- This example uses the following POE job script:

#!/bin/csh
#@ shell = /bin/csh
#@ environment = ENVIRONMENT=BATCH; COPY_ALL;\
# MP_EUILIB=us; MP_STDOUTMODE=ordered; MP_INFOLEVEL=0;
#@ network.MPI = switch,dedicated,US
#@ job_type = parallel
#@ job_name = batch-test
#@ output = $(job_name).log
#@ error = $(job_name).log
#@ account_no = USER1
#@ node = 2
#@ tasks_per_node = 8
#@ node_usage = not_shared
#@ wall_clock_limit = 1:00:00
#@ class = batch
#@ notification = never
#@ queue
# ---------------------------------------------
# Copy required workfiles to $WORKDIR, which is set
# to /scr/$user under the large GPFS work filesystem,
# named /scr.
cp ~/TESTS/mpihello $WORKDIR/mpihello
# Change directory to $WORKDIR
cd $WORKDIR
# Execute program(s)
poe mypoejob
poe $WORKDIR/mpihello
# Copy output data from $WORKDIR to appropriate archive FS,
# since we are currently running within a volatile
# "scratch" filesystem.
# Clean unneeded files from $WORKDIR after job ends.
rm -f $WORKDIR/mpihello
echo "Job completed at: `date`"

- The job script for the LSF job is:
#!/bin/csh
# mypoe_jobscript
#BSUB -a poe
#BSUB -n 16
#BSUB -x
#BSUB -o batch_test.%J_%I.out
#BSUB -e batch_test.%J_%I.err
#BSUB -W 60
#BSUB -J batch_test
#BSUB -q hpc_ibm
setenv ENVIRONMENT BATCH
setenv MP_EUILIB us
# Copy required workfiles to $WORKDIR, which is set
# to /scr/$user under the large GPFS work filesystem,
# named /scr.
cp ~/TESTS/mpihello $WORKDIR/mpihello
# Change directory to $WORKDIR
cd $WORKDIR
# Execute program(s)
mpirun.lsf mypoejob -euilib us
mpirun.lsf $WORKDIR/mpihello -euilib us
# Copy output data from $WORKDIR to appropriate archive FS,
# since we are currently running within a volatile
# "scratch" filesystem.
# Clean unneeded files from $WORKDIR after job ends.
rm -f $WORKDIR/mpihello
echo "Job completed at: `date`"
# End of script ---------
- Submit the job script as a redirected job, specifying the appropriate resource requirement string:
bsub -R "select[adapter_windows>0] rusage[adapter_windows=1] span[ptile=8]" < mypoe_jobscript- Submit
mypoejob
as a single exclusive job:bsub -x -a poe -q hpc_ibm -n 16 -R "select[adapter_windows>0] rusage[adapter_windows=1] span[ptile=8]" mpirun.lsf mypoejob -euilib us[ Top ]
Controlling Allocation and User Authentication for IBM POE Jobs
About POE authentication
Establishing authentication for POE jobs means ensuring that users are permitted to run parallel jobs on the nodes they intend to use. POE supports two types of user authentication:
When interactive remote login to HPS execution nodes is not allowed, you can still run parallel jobs under Parallel Environment (PE) through LSF. PE jobs under LSF on a system with restricted access to the execution nodes use two wrapper programs, poe_w and pmd_w, to allow user authentication.
Enabling user authentication for POE jobs
To enable user authentication through the poe_w and pmd_w wrappers, you must set LSF_HPC_EXTENSIONS="LSB_POE_AUTHENTICATION" in /etc/lsf.conf.
Enforcing node and CPU allocation for POE jobs
To enable POE allocation control, use LSF_HPC_EXTENSIONS="LSB_POE_ALLOCATION" in /etc/lsf.conf. poe_w enforces the LSF allocation decision from mbatchd.
For US jobs, swtbl_api and ntbl_api validate network table windows data files with mbatchd. For IP and US jobs, poe_wrapper validates the POE host file with the information from mbatchd. If the information does not match the information from mbatchd, the job is terminated.
When LSF_HPC_EXTENSIONS="LSB_POE_ALLOCATION" is set:
- poe_w parses the POE host file and validates its contents with information from mbatchd.
- ntbl_api and swtbl_api parse the network table and switch table data files and validate their contents with information from mbatchd.
- Host names from data files must match host names as allocated by LSF
- The number of tasks per node cannot exceed the number of tasks per node as allocated by LSF
- Total number of tasks cannot exceed the total number of tasks requested at job submission (bsub -n)
Configuring POE allocation and authentication support
- Register the pmv4lsf (pmv3lsf) service with inetd:
- Add the following line to /etc/inetd.conf:
pmv4lsf stream tcp nowait root /etc/pmdv4lsf pmdv4lsf
- Make a symbolic link from pmd_w to /etc/pmdv4lsf. For example:
# ln -s $LSF_BINDIR/pmd_w /etc/pmdv4lsf
Both $LSF_BINDIR and /etc must be owned by root for the symbolic link to work. Symbolic links are not allowed under /etc on some AIX 5.3 systems, so you may need to copy $LSF_BINDIR/pmd_w to /etc/pmdv4lsf:
cp -f $LSF_BINDIR/pmd_w /etc/pmdv4lsf
- Add pmv4lsf to /etc/services. For example:
pmv4lsf 6128/tcp #pmd wrapper
- Add the poelsf service to /etc/services. The port defined for this service will be used by pmd_w and poe_w for communication with each other:
poelsf 6129/tcp #pmd_w - poe_w communication port
- Run one of the following commands to restart inetd:
# refresh -s inetd
# kill -1 "inetd_pid"
- Create the /etc/lsf.conf file if it does not already exist and add the following parameter:
LSF_HPC_EXTENSIONS="LSB_POE_ALLOCATION LSB_POE_AUTHENTICATION"
- (Optional) Two optional parameters, LSF_POE_TIMEOUT_BIND and LSF_POE_TIMEOUT_SELECT, can be added to the lsf.conf file. Both can also be set as environment variables for poe_w to read.
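For example, /etc/lsf.conf might then contain (a sketch; the timeout values shown are illustrative placeholders, not documented defaults):
LSF_HPC_EXTENSIONS="LSB_POE_ALLOCATION LSB_POE_AUTHENTICATION"
LSF_POE_TIMEOUT_BIND=120
LSF_POE_TIMEOUT_SELECT=160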
Example job scripts
For the following job script:
#!/bin/sh
# mypoe_jobscript
#BSUB -o out.%J
#BSUB -n 2
#BSUB -m "hostA"
#BSUB -a poe
export MP_EUILIB=ip
mpirun.lsf ./hmpis

Submit the job script as a redirected job, specifying the appropriate resource requirement string:
bsub -R "select[poe>0]" < mypoe_jobscriptFor the following job script:
#!/bin/sh
# mypoe_jobscript
#BSUB -o out.%J
#BSUB -n 2
#BSUB -m "hostA"
#BSUB -a poe
export MP_EUILIB=us
mpirun.lsf ./hmpis

Submit the job script as a redirected job, specifying the appropriate resource requirement string:
bsub -R "select[ntbl_windows>0] rusage[ntbl_windows=1] span[ptile=1]" < mypoe_jobscriptLimitations
- POE authentication for LSF jobs is supported on PE 3.x or PE 4.x. It is assumed that only one pmd version is installed on each node in the default location:
/usr/lpp/ppe.poe/bin/pmdv3 for PE 3.x
or
/usr/lpp/ppe.poe/bin/pmdv4 for PE 4.x
If both pmdv3 and pmdv4 are available in /usr/lpp/ppe.poe/bin, pmd_w launches pmdv3.
Submitting IBM POE Jobs over InfiniBand
Platform LSF installation adds a shared nrt_windows resource to run and monitor POE jobs over the InfiniBand interconnect.
lsf.shared

Begin Resource
RESOURCENAME     TYPE     INTERVAL  INCREASING  DESCRIPTION
...
poe              Numeric  30        N           (poe availability)
dedicated_tasks  Numeric  ()        Y           (running dedicated tasks)
ip_tasks         Numeric  ()        Y           (running IP tasks)
us_tasks         Numeric  ()        Y           (running US tasks)
nrt_windows      Numeric  30        N           (free nrt windows on IBM poe over IB)
...
End Resource

lsf.cluster.cluster_name
Begin ResourceMap
RESOURCENAME     LOCATION
poe              [default]
nrt_windows      [default]
dedicated_tasks  (0@[default])
ip_tasks         (0@[default])
us_tasks         (0@[default])
End ResourceMap

Job Submission
Run bsub -a poe to submit an IP mode job:
bsub -a poe mpirun.lsf job job_options -euilib ip poe_options
Run bsub -a poe to submit a US mode job:
bsub -a poe mpirun.lsf job job_options -euilib us poe_options
If some of the AIX hosts do not have InfiniBand support (for example, hosts that still use HPS), you must explicitly tell LSF to exclude those hosts:
bsub -a poe -R "select[nrt_windows>0]" mpirun.lsf job job_options poe_options
Job monitoring
Run lsload to display the nrt_windows and poe resources:
lsload -l
HOST_NAME  status  r15s  r1m  r15m  ut  pg   io  ls  it  tmp    swp    mem    nrt_windows  poe
hostA      ok      0.0   0.0  0.0   1%  8.1  4   1   0   1008M  4090M  6976M  128.0        1.0
hostB      ok      0.0   0.0  0.0   0%  0.7  1   0   0   1006M  4092M  7004M  128.0        1.0