Runtime options affecting parallel processing can be specified
with the XLSMPOPTS environment variable. This environment variable
must be set before you run an application, and its basic syntax has
the form:

XLSMPOPTS=["]runtime_option_name=option_setting[:runtime_option_name=option_setting]...["]
You can specify option names and settings in uppercase or lowercase.
You can add blanks before and after the colons and equal signs to
improve readability. However, if the XLSMPOPTS option string contains
embedded blanks, you must enclose the entire option string in double
quotation marks (").
For example, to have a program create 4 threads at run time and use
dynamic scheduling with a chunk size of 5, set the XLSMPOPTS
environment variable as shown below:
XLSMPOPTS=PARTHDS=4:SCHEDULE=DYNAMIC=5
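The syntax rules above can be sketched in a few lines of Python. This is only an illustration of how the option string decomposes, not the parser used by the runtime library; the function name parse_xlsmpopts is hypothetical, and lowercasing the result is a choice made here because names and settings are case-insensitive.

```python
def parse_xlsmpopts(value):
    """Split an XLSMPOPTS string into {option_name: option_setting}.

    Hypothetical helper for illustration; it mirrors the documented syntax
    (colon-separated name=setting pairs, optional enclosing double quotes).
    """
    text = value.strip()
    # Drop the optional enclosing double quotation marks.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    options = {}
    for item in text.split(":"):
        if not item.strip():
            continue
        # Only the first "=" separates the name from the setting, so a
        # setting such as DYNAMIC=5 keeps its own "=".
        name, sep, setting = item.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in option: {item!r}")
        # Blanks around colons and equal signs are allowed, so remove them.
        options[name.strip().lower()] = setting.replace(" ", "").lower()
    return options


if __name__ == "__main__":
    print(parse_xlsmpopts("PARTHDS=4:SCHEDULE=DYNAMIC=5"))
```

The quoted form with blanks, "parthds = 4 : schedule = dynamic = 5", yields the same dictionary as the compact form.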
The following are the available runtime option settings for the
XLSMPOPTS environment variable:
Scheduling options are as follows:
- schedule
- Specifies the type of scheduling algorithms and chunk size (n)
that are used for loops to which no other scheduling algorithm has
been explicitly assigned in the source code.
Work is assigned
to threads in a different manner, depending on the scheduling type
and chunk size used. Choosing chunking granularity is a tradeoff between
overhead and load balancing. The syntax for this option is
schedule=suboption, where the suboptions are defined as follows:
- affinity[=n]
- The iterations of a loop are initially divided into number_of_threads
partitions, each containing ceiling(number_of_iterations/number_of_threads)
iterations. Each partition is initially assigned to a thread and is
then further subdivided into chunks that each contain n iterations.
If n is not specified, then the chunks consist of ceiling(number_of_iterations_left_in_partition /
2) loop iterations.
When a thread becomes free, it takes the next
chunk from its initially assigned partition. If there are no more
chunks in that partition, then the thread takes the next available
chunk from a partition initially assigned to another thread.
The
work in a partition initially assigned to a sleeping thread will be
completed by threads that are active.
The affinity scheduling
type does not appear in the OpenMP API standard.
- dynamic[=n]
- The iterations of a loop are divided into chunks containing n iterations
each. If n is not specified, then the chunks consist of ceiling(number_of_iterations/number_of_threads)
iterations.
Active threads are assigned these chunks on a "first-come,
first-do" basis. Chunks of the remaining work are assigned to available
threads until all work has been assigned.
If a thread is asleep,
its assigned work will be taken over by an active thread once that
thread becomes available.
- guided[=n]
- The iterations of a loop are divided into progressively smaller
chunks until a minimum chunk size of n loop iterations is reached.
If n is not specified, the default value for n is 1
iteration.
Active threads are assigned chunks on a "first-come,
first-do" basis. The first chunk contains ceiling(number_of_iterations/number_of_threads)
iterations. Subsequent chunks consist of ceiling(number_of_iterations_left
/ number_of_threads) iterations.
- static[=n]
- The iterations of a loop are divided into chunks containing n iterations
each. Each thread is assigned chunks in a "round-robin" fashion.
This is known as block cyclic scheduling. If the value of n is
1, then the scheduling type is specifically referred to as cyclic
scheduling.
If n is not specified, the chunks will
contain ceiling(number_of_iterations/number_of_threads)
iterations. Each thread is assigned one of these chunks. This is
known as block scheduling.
If a thread
is asleep and it has been assigned work, it will be awakened so that
it may complete its work.
- n
- Must be an integral assignment expression of value 1 or greater.
Specifying schedule with no suboption
is equivalent to schedule=runtime.
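The chunk-size formulas given for the scheduling suboptions can be worked through with a short Python sketch. It only computes chunk sizes and the round-robin assignment as described above; it does not model which thread actually picks up a dynamic, guided, or affinity chunk at run time, and all function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling(a / b) for non-negative a and positive b."""
    return -(-a // b)


def static_schedule(iterations, threads, n=None):
    """Per-thread list of (start, stop) chunks for static[=n].

    Without n, each thread gets one block of ceiling(iterations/threads).
    With n, chunks of n iterations are handed out round-robin.
    """
    size = max(1, ceil_div(iterations, threads)) if n is None else n
    assignment = [[] for _ in range(threads)]
    for index, start in enumerate(range(0, iterations, size)):
        stop = min(start + size, iterations)
        assignment[index % threads].append((start, stop))
    return assignment


def dynamic_chunk_sizes(iterations, threads, n=None):
    """Chunk sizes for dynamic[=n]; chunks are claimed first-come, first-do."""
    size = max(1, ceil_div(iterations, threads)) if n is None else n
    return [min(size, iterations - s) for s in range(0, iterations, size)]


def guided_chunk_sizes(iterations, threads, n=1):
    """Chunk sizes for guided[=n]: ceiling(left/threads), never below n."""
    sizes, left = [], iterations
    while left > 0:
        size = min(left, max(n, ceil_div(left, threads)))
        sizes.append(size)
        left -= size
    return sizes


def affinity_chunk_sizes(partition_size, n=None):
    """Chunk sizes inside one affinity partition.

    With n, chunks hold n iterations; without it, each chunk takes
    ceiling(iterations_left_in_partition / 2).
    """
    sizes, left = [], partition_size
    while left > 0:
        size = min(left, n) if n is not None else ceil_div(left, 2)
        sizes.append(size)
        left -= size
    return sizes


if __name__ == "__main__":
    print(static_schedule(10, 2, 2))
    print(dynamic_chunk_sizes(12, 4, 5))
    print(guided_chunk_sizes(20, 4, 2))
    print(affinity_chunk_sizes(8))
```

For 20 iterations on 4 threads, guided=2 produces chunks that shrink from 5 iterations down to the minimum of 2, which shows the tradeoff between scheduling overhead (fewer, larger chunks) and load balancing (more, smaller chunks).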
Parallel environment options are as follows:
- parthds=num
- Specifies the number of threads (num) requested, which
is usually equivalent to the number of processors available on the
system.
Some applications cannot use more threads than the maximum
number of processors available. Other applications can experience
significant performance improvements if they use more threads than
there are processors. This option gives you full control over the
number of user threads used to run your program.
The default
value for num is the number of processors available on the
system.
- usrthds=num
- Specifies the maximum number of threads (num) that you
expect your code will explicitly create if the code does explicit
thread creation. The default value for num is 0.
- stack=num
- Specifies the largest amount of space in bytes (num) that
a thread's stack needs. The default value for num is 2097152.
Set num so it is within the acceptable upper limit. num can
be up to 256 MB for 32-bit mode, or up to the limit imposed by system
resources for 64-bit mode. An application that exceeds the upper limit
may cause a segmentation fault.
The
glibc library is compiled by default to allow a stack size of 2 MB.
Setting num to a value greater than this will cause the default stack
size to be used. If larger stack sizes are required, you should link
the program to a glibc library compiled with the FLOATING_STACKS parameter
turned on.
- stackcheck[=num]
- When -qsmp=stackcheck is in effect, enables stack overflow
checking for slave threads at runtime. num is the size of the
stack in bytes; when the remaining stack size is less than this value,
a runtime warning message is issued. If you do not specify a value
for num, the default value is 4096 bytes. Note that this option
only has an effect when -qsmp=stackcheck has also been
specified at compile time. See -qsmp for more information.
- startproc=cpu_id
- Enables thread binding and specifies the cpu_id to which
the first thread binds. If the value provided is outside the range
of available processors, a warning message is issued and no threads
are bound.
- procs=cpu_id[,cpu_id,...]
- Enables thread binding and specifies a list of cpu_id values to
which the threads are bound. If the number of CPU IDs specified is
less than the number of threads used by the program, the remaining
threads are not bound.
- stride=num
- Specifies the increment used to determine the cpu_id to
which subsequent threads bind. num must be greater than or
equal to 1. If the value provided causes a thread to bind to a CPU
outside the range of available processors, a warning message is issued
and no threads are bound.
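The interaction of startproc, stride, and procs can be illustrated with a small Python model. It rests on two assumptions that the text above does not state: CPU IDs are numbered from 0 to num_cpus - 1, and thread i binds to startproc + i * stride. The function names are hypothetical, and None stands for "not bound".

```python
def bind_by_stride(num_threads, startproc, stride, num_cpus):
    """cpu_id per thread for startproc/stride binding (illustrative model).

    Assumes thread i binds to startproc + i * stride and that valid
    CPU IDs are 0 .. num_cpus - 1.
    """
    if stride < 1:
        raise ValueError("stride must be greater than or equal to 1")
    cpu_ids = [startproc + i * stride for i in range(num_threads)]
    # If any thread would land outside the available processors,
    # a warning is issued and no threads are bound at all.
    if any(cpu < 0 or cpu >= num_cpus for cpu in cpu_ids):
        return [None] * num_threads
    return cpu_ids


def bind_by_list(num_threads, procs):
    """cpu_id per thread for procs=cpu_id[,cpu_id,...] binding.

    Threads beyond the end of the list are left unbound.
    """
    return [procs[i] if i < len(procs) else None for i in range(num_threads)]


if __name__ == "__main__":
    print(bind_by_stride(4, 0, 2, 8))   # every second CPU
    print(bind_by_stride(4, 0, 4, 8))   # runs off the end: nothing is bound
    print(bind_by_list(4, [1, 3]))      # last two threads stay unbound
```

Note the difference in failure behavior: an out-of-range startproc or stride disables binding for all threads, whereas a short procs list leaves only the remaining threads unbound.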
Performance tuning options are as follows:
- spins=num
- Specifies the number of loop spins, or iterations, before a yield
occurs.
When a thread completes its work, the thread continues
executing in a tight loop looking for new work. One complete scan
of the work queue is done during each busy-wait state. An extended
busy-wait state can make a particular application highly responsive,
but can also harm the overall responsiveness of the system unless
the thread is given instructions to periodically scan for and yield
to requests from other applications.
A complete busy-wait
state for benchmarking purposes can be forced by setting both spins and yields to
0.
The default value for num is 100.
- yields=num
- Specifies the number of yields before a sleep occurs.
When
a thread sleeps, it completely suspends execution until another thread
signals that there is work to do. This provides better system utilization,
but also adds extra system overhead for the application.
The
default value for num is 100.
- delays=num
- Specifies a period of do-nothing delay time between each scan
of the work queue. Each unit of delay is achieved by running a single
no-memory-access delay loop.
The default value for num is
500.
Dynamic profiling options are as follows:
- profilefreq=num
- Specifies the frequency with which a loop should be revisited
by the dynamic profiler to determine its appropriateness for parallel
or serial execution. The runtime library uses dynamic profiling to
dynamically tune the performance of automatically parallelized loops.
Dynamic profiling gathers information about loop running times to
determine if the loop should be run sequentially or in parallel the
next time through. Threshold running times are set by the parthreshold and seqthreshold dynamic
profiling options, described below.
The allowed values for this
option are the numbers from 0 to 32. If num is 0, all profiling
is turned off, and overheads that occur because of profiling will
not occur. If num is greater than 0, running time of the loop
is monitored once every num times through the loop. The default
for num is 16. Values of num exceeding 32 are changed
to 32.
It is important to note that dynamic profiling is not
applicable to user-specified parallel loops.
- parthreshold=num
- Specifies the time, in milliseconds, below which each loop must
execute serially. If you set num to 0, every loop that has
been parallelized by the compiler will execute in parallel. The default
setting is 0.2 milliseconds, meaning that if a loop requires fewer
than 0.2 milliseconds to execute in parallel, it should be serialized.
Typically, num is set to be equal to the parallelization
overhead. If the computation in a parallelized loop is very small
and the time taken to execute these loops is spent primarily in the
setting up of parallelization, these loops should be executed sequentially
for better performance.
- seqthreshold=num
- Specifies the time, in milliseconds, beyond which a loop that
was previously serialized by the dynamic profiler should revert to
being a parallel loop. The default setting is 5 milliseconds, meaning
that if a loop requires more than 5 milliseconds to execute serially,
it should be parallelized.
seqthreshold acts as the reverse
of parthreshold.
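The way the three dynamic profiling options work together can be sketched as a simple decision function in Python. This is a reading of the descriptions above, not the runtime library's implementation. The function names are hypothetical, and the choice of which pass within each group of num passes is monitored is an assumption.

```python
def effective_profilefreq(num):
    """Values exceeding 32 are changed to 32 (negative input is not covered
    by the text and is passed through unchanged here)."""
    return min(num, 32)


def is_monitored(pass_number, profilefreq=16):
    """Whether this pass through the loop is timed.

    Assumption: with profilefreq > 0, passes 0, num, 2*num, ... are timed.
    With profilefreq == 0, profiling is off entirely.
    """
    freq = effective_profilefreq(profilefreq)
    return freq > 0 and pass_number % freq == 0


def next_mode(current_mode, elapsed_ms, parthreshold=0.2, seqthreshold=5.0):
    """Pick 'parallel' or 'serial' for the next run of a profiled loop."""
    if current_mode == "parallel" and elapsed_ms < parthreshold:
        # Too short to pay for the parallelization overhead.
        return "serial"
    if current_mode == "serial" and elapsed_ms > seqthreshold:
        # Long enough that running in parallel is worthwhile again.
        return "parallel"
    return current_mode


if __name__ == "__main__":
    print(next_mode("parallel", 0.1))   # serial
    print(next_mode("serial", 7.5))     # parallel
    print(next_mode("parallel", 0.1, parthreshold=0))  # stays parallel
```

With parthreshold set to 0, no measured time can fall below the threshold, so every compiler-parallelized loop stays parallel, which matches the description of that setting.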