Working with Application Profiles
Application profiles improve the management of applications by separating scheduling policies (preemption, fairshare, and so on) from application-level requirements, such as pre-execution and post-execution commands, resource limits, job controls, and job chunking.
Contents
- Manage application profiles
- View application profile information
- Use application profiles
- How application profiles interact with queue and job parameters
Manage application profiles
About application profiles
Use application profiles to map common execution requirements to application-specific job containers. For example, you can define different job types according to the properties of the applications that you use; your FLUENT jobs can have different execution requirements from your CATIA jobs, but they can all be submitted to the same queue.
The following application profile defines the execution requirements for the FLUENT application:
Begin Application
NAME                = fluent
DESCRIPTION         = FLUENT Version 6.2
CPULIMIT            = 180/hostA    # 3 hours of host hostA
FILELIMIT           = 20000
DATALIMIT           = 20000        # job data segment limit
CORELIMIT           = 20000
PROCLIMIT           = 5            # job processor limit
PRE_EXEC            = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
REQUEUE_EXIT_VALUES = 55 34 78
End Application
See the lsb.applications template file for additional application profile examples.
Add or remove application profiles
Add an application profile
- Log in as the LSF administrator on any host in the cluster.
- Edit lsb.applications to add the new application profile definition. You can copy another application profile definition from this file as a starting point; remember to change the NAME of the copied profile.
- Save the changes to lsb.applications.
- Run badmin reconfig to reconfigure mbatchd.
Adding an application profile does not affect pending or running jobs.
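For example, a minimal sketch of a profile you might add; the name catia and its description are illustrative, not part of the shipped template:
Begin Application
NAME        = catia
DESCRIPTION = CATIA V5 jobs
End Application
After saving the file, badmin reconfig makes the profile active, and bapp catia confirms that it is available.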
Remove an application profile
Prerequisites: Before removing an application profile, make sure there are no pending jobs associated with the application profile.
If there are jobs in the application profile, use bmod -app to move pending jobs to another application profile, then remove the application profile. Running jobs are not affected by removing the application profile associated with them.
note:
You cannot remove a default application profile.
- Log in as the LSF administrator on any host in the cluster.
- Run bmod -app to move all pending jobs into another application profile. If you leave pending jobs associated with an application profile that has been removed, they remain pending with the pending reason Specified application profile does not exist.
- Edit lsb.applications and remove or comment out the definition for the application profile you want to remove.
- Save the changes to lsb.applications.
- Run badmin reconfig to reconfigure mbatchd.
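For example, a sketch of draining a profile before removing it; the profile names and job ID are illustrative:
bjobs -p -app fluent        # list pending jobs still associated with the profile
bmod -app catia 2308        # move pending job 2308 into another profile
After commenting out the fluent profile in lsb.applications, badmin reconfig applies the change.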
Define a default application profile
Define a default application profile to be used when a job is submitted without specifying an application profile.
- Log in as the LSF administrator on any host in the cluster.
- Set DEFAULT_APPLICATION in lsb.params to the name of the default application profile.
DEFAULT_APPLICATION=catia
- Save the changes to lsb.params.
- Run badmin reconfig to reconfigure mbatchd.
Defining a default application profile does not affect pending or running jobs.
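With DEFAULT_APPLICATION set, a job submitted without the -app option is associated with the default profile, which bjobs -l then reports in its Application field; for example (the job ID is illustrative):
bsub myjob
bjobs -l 1234 | grep Application    # shows Application <catia>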
Specify successful application exit values
Use SUCCESS_EXIT_VALUES to specify a list of exit codes that LSF considers to indicate successful execution of the application.
- Log in as the LSF administrator on any host in the cluster.
- Set SUCCESS_EXIT_VALUES to specify a list of job success exit codes for the application.
SUCCESS_EXIT_VALUES=230 222 12
- Save the changes to lsb.applications.
- Run badmin reconfig to reconfigure mbatchd.
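A sketch of how the parameter might look inside a profile in lsb.applications; the surrounding profile definition is illustrative:
Begin Application
NAME                = fluent
SUCCESS_EXIT_VALUES = 230 222 12
End Application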
Understanding successful application exit values
Jobs that exit with one of the exit codes specified by SUCCESS_EXIT_VALUES in an application profile are marked as DONE. These exit values are not counted in the EXIT_RATE calculation.
0 always indicates application success regardless of SUCCESS_EXIT_VALUES.
If both SUCCESS_EXIT_VALUES and REQUEUE_EXIT_VALUES are defined with the same exit code, REQUEUE_EXIT_VALUES takes precedence: the job is set to PEND state and requeued.
SUCCESS_EXIT_VALUES has no effect on pre-exec and post-exec commands. The value is only used for user jobs.
If the job exit value falls into SUCCESS_EXIT_VALUES, the job will be marked as DONE. Job dependencies on done jobs behave normally.
For parallel jobs, the exit status refers to the job exit status and not the exit status of individual tasks.
Exit codes for jobs terminated by LSF are excluded from success exit values even if they are specified in SUCCESS_EXIT_VALUES.
For example, if SUCCESS_EXIT_VALUES=2 is defined, jobs exiting with 2 are marked as DONE. However, if LSF cannot find the current working directory, LSF terminates the job with exit code 2, and the job is marked as EXIT. The appropriate termination reason is displayed by bacct.
MultiCluster jobs
In the job forwarding model, for jobs sent to a remote cluster, jobs exiting with success exit codes defined in the remote cluster are considered done successfully.
In the lease model, the parameters of lsb.applications apply to jobs running on remote leased hosts as if they are running on local hosts.
Use application profiles
Submit jobs to application profiles
Use the -app option of bsub to specify an application profile for the job.
- Run bsub -app to submit jobs to an application profile.
bsub -app fluent -q overnight myjob
LSF rejects the job if the specified application profile does not exist.
Modify the application profile associated with a job
Prerequisites: You can only modify the application profile for pending jobs.
- Run bmod -app application_profile_name to modify the application profile of the job. The -appn option dissociates the specified job from its application profile. If the application profile does not exist, the job is not modified.
bmod -app fluent 2308
Associates job 2308 with the application profile fluent.
bmod -appn 2308
Dissociates job 2308 from the application profile fluent.
Control jobs associated with application profiles
bstop, bresume, and bkill operate on jobs associated with the specified application profile. You must specify an existing application profile. If job_ID or 0 is not specified, only the most recently submitted qualifying job is operated on.
- Run bstop -app to suspend jobs in an application profile.
bstop -app fluent 2280
Suspends job 2280 associated with the application profile fluent.
bstop -app fluent 0
Suspends all jobs associated with the application profile fluent.
- Run bresume -app to resume jobs in an application profile.
bresume -app fluent 2280
Resumes job 2280 associated with the application profile fluent.
- Run bkill -app to kill jobs in an application profile.
bkill -app fluent
Kills the most recently submitted job associated with the application profile fluent for the current user.
bkill -app fluent 0
Kills all jobs associated with the application profile fluent for the current user.
View application profile information
View available application profiles
- Run bapp. You can view a particular application profile or all profiles.
bapp
APPLICATION_NAME     NJOBS     PEND      RUN     SUSP
fluent                   0        0        0        0
catia                    0        0        0        0
A dash (-) in any entry means that the column does not apply to the row.
View detailed application profile information
- To see the complete configuration for each application profile, run bapp -l.
bapp -l also gives current statistics about the jobs in a particular application profile, such as the total number of jobs in the profile, the number of jobs running, suspended, and so on.
Specify application profile names to see the properties of specific application profiles.
bapp -l fluent
APPLICATION NAME: fluent
 -- Application definition for Fluent v2.0
STATISTICS:
   NJOBS     PEND      RUN    SSUSP    USUSP      RSV
       0        0        0        0        0        0
PARAMETERS:
CPULIMIT
 600.0 min of hostA
RUNLIMIT
 200.0 min of hostA
PROCLIMIT
 9
FILELIMIT DATALIMIT STACKLIMIT CORELIMIT MEMLIMIT SWAPLIMIT PROCESSLIMIT THREADLIMIT
 800 K     100 K     900 K      700 K    300 K    1000 K    400          500
RERUNNABLE: Y
CHUNK_JOB_SIZE: 5
View jobs associated with application profiles
- Run bjobs -l -app application_profile_name.
bjobs -l -app fluent
Job <1865>, User <user1>, Project <default>, Application <fluent>, Status <PSUSP>, Queue <normal>, Command <ls>
Tue Jun 6 11:52:05: Submitted from host <hostA> with hold, CWD </clusters/lsf7.0/work/cluster1/logdir>;
PENDING REASONS:
Job was suspended by LSF admin or root while pending;
SCHEDULING PARAMETERS:
          r15s  r1m  r15m  ut   pg   io   ls   it   tmp  swp  mem  tlu
loadSched   -    -    -    -    -    -    -    -    -    -    -    -
loadStop    -    -    -    -    -    -    -    -    -    -    -    -
          cpuspeed  bandwidth
loadSched    -         -
loadStop     -         -
A dash (-) in any entry means that the column does not apply to the row.
Accounting information for all jobs associated with an application profile
- Run bacct -l -app application_profile_name.
bacct -l -app fluent
Accounting information about jobs that are:
  - submitted by users jchan,
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
  - associated with application profiles: fluent
------------------------------------------------------------------------------
Job <207>, User <user1>, Project <default>, Application <fluent>, Status <DONE>, Queue <normal>, Command <dir>
Wed May 31 16:52:42: Submitted from host <hostA>, CWD <$HOME/src/mainline/lsbatch/cmd>;
Wed May 31 16:52:48: Dispatched to 10 Hosts/Processors <10*hostA>
Wed May 31 16:52:48: Completed <done>.
Accounting information about this job:
     CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP
      0.02        6              6     done         0.0035     2M      5M
------------------------------------------------------------------------------
...
SUMMARY:      ( time unit: second )
 Total number of done jobs:      15      Total number of exited jobs:     4
 Total CPU time consumed:       0.4      Average CPU time consumed:     0.0
 Maximum CPU time of a job:     0.0      Minimum CPU time of a job:     0.0
 Total wait time in queues:  5305.0
 Average wait time in queue:  279.2
 Maximum wait time in queue: 3577.0      Minimum wait time in queue:    2.0
 Average turnaround time:       306 (seconds/job)
 Maximum turnaround time:      3577      Minimum turnaround time:         5
 Average hog factor of a job:  0.00 ( cpu time / turnaround time )
 Maximum hog factor of a job:  0.01      Minimum hog factor of a job:  0.00
 Total throughput:             0.14 (jobs/hour)  during  139.98 hours
 Beginning time:       May 31 16:52      Ending time:           Jun 6 12:51
View job success exit values and requeue exit code information
- Run bjobs -l to see command-line requeue exit values if defined.
bjobs -l
Job <405>, User <user1>, Project <default>, Status <PSUSP>, Queue <normal>, Command <myjob 1234>
Tue Dec 11 23:32:00: Submitted from host <hostA> with hold, CWD </scratch/dev/lsfjobs/user1/work>, Requeue Exit Values <2>;
...
- Run bapp -l to see SUCCESS_EXIT_VALUES when the parameter is defined in an application profile.
bapp -l
APPLICATION NAME: fluent
 -- Run FLUENT applications
STATISTICS:
   NJOBS     PEND      RUN    SSUSP    USUSP      RSV
       0        0        0        0        0        0
PARAMETERS:
SUCCESS_EXIT_VALUES: 230 222 12
...
- Run bhist -l to show command-line specified requeue exit values with bsub and modified requeue exit values with bmod.
bhist -l
Job <405>, User <user1>, Project <default>, Command <myjob 1234>
Tue Dec 11 23:32:00: Submitted from host <hostA> with hold, to Queue <normal>, CWD </scratch/dev/lsfjobs/user1/work>, Requeue Exit Values <1>;
Tue Dec 11 23:33:14: Parameters of Job are changed: Requeue exit values changes to: 2;
...
- Run bhist -l and bacct -l to see success exit values when a job is done successfully. If the job exited with the default success exit value 0, bhist and bacct do not display the 0 exit value.
bhist -l 405
Job <405>, User <user1>, Project <default>, Interactive pseudo-terminal mode, Command <myjob 1234>
...
Sun Oct 7 22:30:19: Done successfully. Success Exit Code: 230 222 12.
...
bacct -l 405
...
Job <405>, User <user1>, Project <default>, Status <DONE>, Queue <normal>, Command <myjob 1234>
Wed Sep 26 18:37:47: Submitted from host <hostA>, CWD </scratch/dev/lsfjobs/user1/work>;
Wed Sep 26 18:37:50: Dispatched to <hostA>;
Wed Sep 26 18:37:51: Completed <done>. Success Exit Code: 230 222 12.
...
How application profiles interact with queue and job parameters
Application profiles operate in conjunction with queue and job-level options. In general, you use application profile definitions to refine queue-level settings, or to exclude some jobs from queue-level parameters.
Application profile settings that override queue settings
The following application profile parameters override the corresponding queue setting:
- CHKPNT_DIR overrides queue CHKPNT=chkpnt_dir
- CHKPNT_PERIOD overrides queue CHKPNT=chkpnt_period
- JOB_STARTER
- LOCAL_MAX_PREEXEC_RETRY
- MAX_JOB_PREEMPT
- MAX_JOB_REQUEUE
- MAX_PREEXEC_RETRY
- MIG
- REMOTE_MAX_PREEXEC_RETRY
- REQUEUE_EXIT_VALUES
- RESUME_CONTROL overrides queue JOB_CONTROLS
- SUSPEND_CONTROL overrides queue JOB_CONTROLS
- TERMINATE_CONTROL overrides queue JOB_CONTROLS
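For example, a sketch in which the profile's REQUEUE_EXIT_VALUES replaces the queue's value for jobs submitted to that profile; the queue name and exit values are illustrative:
# lsb.queues
Begin Queue
QUEUE_NAME          = overnight
REQUEUE_EXIT_VALUES = 99
End Queue
# lsb.applications
Begin Application
NAME                = fluent
REQUEUE_EXIT_VALUES = 55 34 78    # used instead of 99 for jobs in this profile
End Application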
Application profile limits and queue limits
The following application profile limits override the corresponding queue-level soft limits:
- CORELIMIT
- CPULIMIT
- DATALIMIT
- FILELIMIT
- MEMLIMIT
- PROCESSLIMIT
- RUNLIMIT
- STACKLIMIT
- SWAPLIMIT
- THREADLIMIT
Job-level limits can override the application profile limits. The application profile limits cannot override queue-level hard limits.
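A sketch of this precedence, using illustrative values: the application profile's MEMLIMIT replaces the queue's default (soft) limit, a job-level bsub -M value overrides the profile, and none of them can exceed the queue's maximum (hard) limit:
# lsb.queues: default (soft) limit followed by maximum (hard) limit, in KB
Begin Queue
QUEUE_NAME = normal
MEMLIMIT   = 300000 500000
End Queue
# lsb.applications: overrides the queue soft limit
Begin Application
NAME     = fluent
MEMLIMIT = 400000
End Application
# Job level: overrides the profile value, still capped by the queue hard limit
bsub -app fluent -M 450000 myjob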
Processor limits
PROCLIMIT in an application profile specifies the maximum number of slots that can be allocated to a job. For parallel jobs, PROCLIMIT is the maximum number of processors that can be allocated to the job.
You can optionally specify the minimum and default number of processors. All limits must be positive integers greater than or equal to 1 that satisfy the following relationship:
1 <= minimum <= default <= maximum
Job-level processor limits (bsub -n) override application-level PROCLIMIT, which overrides queue-level PROCLIMIT. Job-level limits must fall within the maximum and minimum limits of the application profile and the queue.
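For example, a sketch of a profile that allows between 2 and 8 slots with a default of 4; the values are illustrative:
Begin Application
NAME      = fluent
PROCLIMIT = 2 4 8        # minimum default maximum
End Application
A job-level request such as bsub -app fluent -n 6 myjob falls within these bounds, while bsub -n 16 falls outside the profile's maximum.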
Absolute run limits
If you want the scheduler to treat any run limits as absolute, define ABS_RUNLIMIT=Y in lsb.params or in lsb.applications for the application profile associated with your job. When ABS_RUNLIMIT=Y is defined in lsb.params or in the application profile, the run time limit is not normalized by the host CPU factor. Absolute wall-clock run time is used for all jobs submitted with a run limit configured.
Pre-execution
Queue-level pre-execution commands run before application-level pre-execution commands. Job-level pre-execution commands (bsub -E) override application-level pre-execution commands.
Post-execution
When a job finishes, application-level post-execution commands run, followed by queue-level post-execution commands if any.
If both application-level and job-level post-execution commands (bsub -Ep) are specified, job-level post-execution overrides application-level post-execution commands. Queue-level post-execution commands run after application-level post-execution and job-level post-execution commands.
Chunk job scheduling
CHUNK_JOB_SIZE in an application profile ensures that jobs associated with the application are chunked together. CHUNK_JOB_SIZE=1 disables job chunk scheduling. Application-level job chunk definition overrides chunk job dispatch configured in the queue.
CHUNK_JOB_SIZE is ignored and jobs are not chunked under the following conditions:
- CPU limit greater than 30 minutes (CPULIMIT parameter in lsb.queues or lsb.applications)
- Run limit greater than 30 minutes (RUNLIMIT parameter in lsb.queues or lsb.applications)
- Run time estimate greater than 30 minutes (RUNTIME parameter in lsb.applications)
If CHUNK_JOB_DURATION is set in lsb.params, chunk jobs are accepted regardless of the value of CPULIMIT, RUNLIMIT, or RUNTIME.
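For example, a sketch of a profile that dispatches short jobs in chunks of 5; the profile name and values are illustrative, and the RUNTIME estimate keeps the jobs under the 30-minute threshold:
Begin Application
NAME           = short_jobs
CHUNK_JOB_SIZE = 5
RUNTIME        = 10        # estimated run time in minutes
End Application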
Rerunnable jobs
RERUNNABLE in an application profile overrides queue-level job rerun, and allows you to submit rerunnable jobs to a non-rerunnable queue. Job-level rerun (bsub -r or bsub -rn) overrides both the application profile and the queue.
Resource requirements
Application-level resource requirements can be simple (one requirement for all slots) or compound (different requirements for specified numbers of slots). When resource requirements are set at the application-level as well as the job-level or queue-level, the requirements are combined in different ways depending on whether they are simple or compound.
Simple job-level, application-level, and queue-level resource requirements are merged in the following manner:
- If resource requirements are not defined at the application level, simple job-level and simple queue-level resource requirements are merged.
- When simple application-level resource requirements are defined, simple job-level requirements usually take precedence.
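For example, a sketch of a simple application-level requirement and a job-level string that is merged with it; the resource values are illustrative:
Begin Application
NAME    = fluent
RES_REQ = select[mem > 1000] rusage[mem=500]
End Application
bsub -app fluent -R "rusage[mem=800]" myjob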
Compound application-level resource requirements are merged in the following manner:
- When a compound resource requirement is set at the application level, it will be ignored if any job-level resource requirements (simple or compound) are defined.
- In the event no job-level resource requirements are set, the compound application-level requirements interact with queue-level resource requirement strings in the following ways:
- If no queue-level resource requirement is defined or a compound queue-level resource requirement is defined, the compound application-level requirement is used.
- If a simple queue-level requirement is defined, the application-level and queue-level requirements are combined.
For internal load indices and duration, jobs are rejected if they specify resource reservation requirements at the job level or application level that exceed the requirements specified in the queue.
If RES_REQ is defined at the queue level and there are no load thresholds defined, the pending reasons for each individual load index will not be displayed by bjobs.
When LSF_STRICT_RESREQ=Y is configured in lsf.conf, resource requirement strings in select sections must conform to a more strict syntax. The strict resource requirement syntax only applies to the select section. It does not apply to the other resource requirement sections (order, rusage, same, span, or cu). When LSF_STRICT_RESREQ=Y in lsf.conf, LSF rejects resource requirement strings where an rusage section contains a non-consumable resource.
Estimated runtime and runtime limits
Instead of specifying an explicit runtime limit for jobs, you can specify an estimated run time for jobs. LSF uses the estimated value for job scheduling purposes only, and does not kill jobs that exceed this value unless the jobs also exceed a defined runtime limit. The format of the runtime estimate is the same as the run limit set by the bsub -W option or the RUNLIMIT parameter in lsb.queues and lsb.applications.
Use JOB_RUNLIMIT_RATIO in lsb.params to limit the runtime estimate users can set. If JOB_RUNLIMIT_RATIO is set to 0, no restriction is applied to the runtime estimate. The ratio does not apply to the RUNTIME parameter in lsb.applications.
The job-level runtime estimate setting overrides the RUNTIME setting in an application profile in lsb.applications.
The following LSF features use the estimated runtime value to schedule jobs:
- Job chunking
- Advance reservation
- SLA
- Slot reservation
- Backfill
Define a runtime estimate
Define the RUNTIME parameter at the application level. Use the bsub -We option at the job level.
You can specify the runtime estimate as hours and minutes, or minutes only. The following examples show an application-level runtime estimate of three hours and 30 minutes:
RUNTIME=3:30
RUNTIME=210
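The equivalent job-level estimate is given at submission time with -We, for example:
bsub -We 3:30 myjob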
Configuring normalized run time
LSF uses normalized run time for scheduling in order to account for different processing speeds of the execution hosts.
tip:
If you want the scheduler to use wall-clock (absolute) run time instead of normalized run time, define ABS_RUNLIMIT=Y in the file lsb.params or in the file lsb.applications for the application associated with your job.
LSF calculates the normalized run time using the following formula:
NORMALIZED_RUN_TIME = RUNTIME * CPU_Factor_Normalization_Host / CPU_Factor_Execute_Host
You can specify a host name or host model with the runtime estimate so that LSF uses a specific host name or model as the normalization host. If you do not specify a host name or host model, LSF uses the CPU factor for the default normalization host as described in the following table.
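For example, assuming a RUNTIME estimate of 210 minutes, a normalization host CPU factor of 2.0, and an execution host CPU factor of 4.0 (illustrative values), the normalized run time is 210 * 2.0 / 4.0 = 105 minutes.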
To specify a host name (defined in lsf.cluster.clustername) or host model (defined in lsf.shared) as the normalization host, insert the "/" character between the minutes and the host name or model, as shown in the following examples:
RUNTIME=3:30/hostA
bsub -We 3:30/hostA
LSF calculates the normalized run time using the CPU factor defined for hostA.
RUNTIME=210/Ultra5S
bsub -We 210/Ultra5S
LSF calculates the normalized run time using the CPU factor defined for host model Ultra5S.
tip:
Use lsinfo to see host name and host model information.
Guidelines for defining a runtime estimate
- You can define an estimated run time, along with a runtime limit (job level with bsub -W, application level with RUNLIMIT in lsb.applications, or queue level with RUNLIMIT in lsb.queues).
- If the runtime limit is defined, the job-level (-We) or application-level RUNTIME value must be less than or equal to the run limit. LSF ignores the estimated runtime value and uses the run limit value for scheduling when:
- The estimated runtime value exceeds the run limit value, or
- An estimated runtime value is not defined
note:
When LSF uses the run limit value for scheduling, and the run limit is defined at more than one level, LSF uses the smallest run limit value to estimate the job duration.
- For chunk jobs, ensure that the estimated runtime value is:
- Less than the CHUNK_JOB_DURATION defined in the file lsb.params, or
- Less than 30 minutes, if CHUNK_JOB_DURATION is not defined.
How estimated run time interacts with run limits
The following table includes all the expected behaviors for the combinations of job-level runtime estimate (-We), job-level run limit (-W), application-level runtime estimate (RUNTIME), application-level run limit (RUNLIMIT), and queue-level run limit (RUNLIMIT, both default and hard limit). Ratio is the value of JOB_RUNLIMIT_RATIO defined in lsb.params. The dash (-) indicates no value is defined for the job.