Platform Computing Corp.

Working with Application Profiles

Application profiles improve the management of applications by separating scheduling policies (preemption, fairshare, and so on) from application-level requirements, such as pre-execution and post-execution commands, resource limits, job controls, and job chunking.


Manage application profiles

About application profiles

Use application profiles to map common execution requirements to application-specific job containers. For example, you can define different job types according to the properties of the applications that you use; your FLUENT jobs can have different execution requirements from your CATIA jobs, but they can all be submitted to the same queue.

The following application profile defines the execution requirements for the FLUENT application:

Begin Application 
NAME         = fluent 
DESCRIPTION  = FLUENT Version 6.2 
CPULIMIT     = 180/hostA      # 3 hours of host hostA 
FILELIMIT    = 20000 
DATALIMIT    = 20000          # job data segment limit 
CORELIMIT    = 20000 
PROCLIMIT    = 5              # job processor limit 
PRE_EXEC     = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out 
REQUEUE_EXIT_VALUES = 55 34 78 
End Application 
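As a rough illustration of how such a section is structured, the Python sketch below reads Begin Application ... End Application sections into dictionaries. It is not an LSF tool: parse_application_sections is a hypothetical helper, and it assumes one KEY = value pair per line, with # starting a comment. To see how LSF itself interprets a profile, use bapp -l.

```python
from __future__ import annotations


def parse_application_sections(text: str) -> list[dict[str, str]]:
    """Return one dict per Begin Application ... End Application section.

    Hypothetical helper for illustration only; not part of LSF. Assumes one
    KEY = value pair per line and that '#' starts a comment.
    """
    sections: list[dict[str, str]] = []
    current: dict[str, str] | None = None
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if line == "Begin Application":
            current = {}
        elif line == "End Application":
            if current is not None:
                sections.append(current)
            current = None
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)  # split on the first '=' only
            current[key.strip()] = value.strip()
    return sections


sample = """
Begin Application
NAME         = fluent
CPULIMIT     = 180/hostA      # 3 hours of host hostA
PROCLIMIT    = 5              # job processor limit
PRE_EXEC     = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
REQUEUE_EXIT_VALUES = 55 34 78
End Application
"""

profile = parse_application_sections(sample)[0]
print(profile["NAME"], profile["CPULIMIT"], profile["REQUEUE_EXIT_VALUES"].split())
```

The values stay as strings, so a value such as 180/hostA still carries both the time and the normalization host.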

See the lsb.applications template file for additional application profile examples.

Add or remove application profiles

Add an application profile
  1. Log in as the LSF administrator on any host in the cluster.
  2. Edit lsb.applications to add the new application profile definition.
  3. You can copy another application profile definition from this file as a starting point; remember to change the NAME of the copied profile.

  4. Save the changes to lsb.applications.
  5. Run badmin reconfig to reconfigure mbatchd.

Adding an application profile does not affect pending or running jobs.

Remove an application profile

Prerequisites: Before removing an application profile, make sure there are no pending jobs associated with the application profile.

If there are pending jobs in the application profile, use bmod -app to move them to another application profile, then remove the application profile. Running jobs are not affected by removing the application profile associated with them.

note:  
You cannot remove a default application profile.
  1. Log in as the LSF administrator on any host in the cluster.
  2. Run bmod -app to move all pending jobs into another application profile.

    If you leave pending jobs associated with an application profile that has been removed, they remain pending with the pending reason:

    Specified application profile does not exist

  3. Edit lsb.applications and remove or comment out the definition for the application profile you want to remove.
  4. Save the changes to lsb.applications.
  5. Run badmin reconfig to reconfigure mbatchd.

Define a default application profile

Define a default application profile that is used when a job is submitted without specifying an application profile.

  1. Log in as the LSF administrator on any host in the cluster.
  2. Set DEFAULT_APPLICATION in lsb.params to the name of the default application profile. For example:

    DEFAULT_APPLICATION=catia

  3. Save the changes to lsb.params.
  4. Run badmin reconfig to reconfigure mbatchd.

Adding an application profile does not affect pending or running jobs.
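The choice of profile at submission time, as described here and under Submit jobs to application profiles, can be sketched as a small function. This is a simplified model, not LSF code: resolve_application is a hypothetical name, and the sketch assumes that the profile named by DEFAULT_APPLICATION is itself defined in lsb.applications.

```python
from __future__ import annotations

from typing import Optional


def resolve_application(requested: Optional[str],
                        default_application: Optional[str],
                        defined_profiles: set[str]) -> Optional[str]:
    """Return the application profile a newly submitted job is associated with.

    requested: value given with bsub -app, or None
    default_application: DEFAULT_APPLICATION from lsb.params, or None
    defined_profiles: profile names defined in lsb.applications
    Hypothetical model for illustration; not part of LSF.
    """
    if requested is not None:
        if requested not in defined_profiles:
            # LSF rejects a job whose -app profile does not exist.
            raise ValueError(f"application profile {requested!r} does not exist")
        return requested
    # No -app option: fall back to the cluster default (may be None).
    return default_application


profiles = {"fluent", "catia"}
print(resolve_application(None, "catia", profiles))      # catia
print(resolve_application("fluent", "catia", profiles))  # fluent
```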

Specify successful application exit values

Use SUCCESS_EXIT_VALUES to specify a list of exit codes that LSF treats as successful execution of the application.

  1. Log in as the LSF administrator on any host in the cluster.
  2. In the application profile definition in lsb.applications, set SUCCESS_EXIT_VALUES to the list of exit codes that indicate job success. For example:

    SUCCESS_EXIT_VALUES=230 222 12

  3. Save the changes to lsb.applications.
  4. Run badmin reconfig to reconfigure mbatchd.

Understanding successful application exit values

Jobs that exit with one of the exit codes specified by SUCCESS_EXIT_VALUES in an application profile are marked as DONE. These exit values are not counted in the EXIT_RATE calculation.

An exit code of 0 always indicates application success, regardless of SUCCESS_EXIT_VALUES.

If the same exit code is specified in both SUCCESS_EXIT_VALUES and REQUEUE_EXIT_VALUES, REQUEUE_EXIT_VALUES takes precedence: a job exiting with that code is set to PEND state and requeued.

SUCCESS_EXIT_VALUES has no effect on pre-execution and post-execution commands; it applies only to user jobs.

Because jobs that exit with a value in SUCCESS_EXIT_VALUES are marked as DONE, job dependencies on them behave normally.

For parallel jobs, the exit status refers to the job exit status and not the exit status of individual tasks.

Exit codes for jobs terminated by LSF are not treated as success exit values, even if they are specified in SUCCESS_EXIT_VALUES.

For example, if SUCCESS_EXIT_VALUES=2 is defined, jobs exiting with 2 are marked as DONE. However, if LSF cannot find the current working directory, LSF terminates the job with exit code 2, and the job is marked as EXIT. bacct displays the appropriate termination reason.
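The rules in this section can be summarized as a small decision function. The sketch below is an illustrative model, not LSF's implementation: final_job_state and its arguments are hypothetical, it assumes REQUEUE_EXIT_VALUES wins when a code appears in both lists, and it does not model how requeueing interacts with jobs that LSF itself terminates.

```python
from __future__ import annotations


def final_job_state(exit_code: int,
                    success_exit_values: set[int],
                    requeue_exit_values: set[int],
                    terminated_by_lsf: bool = False) -> str:
    """Return "PEND" (requeued), "DONE", or "EXIT" for a finished user job."""
    if exit_code in requeue_exit_values:
        return "PEND"   # requeue takes precedence over success values
    if exit_code == 0:
        return "DONE"   # 0 always means success
    if exit_code in success_exit_values and not terminated_by_lsf:
        return "DONE"   # LSF-terminated codes never count as success
    return "EXIT"


print(final_job_state(2, {2}, set()))                          # DONE
print(final_job_state(2, {2}, set(), terminated_by_lsf=True))  # EXIT
```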

MultiCluster jobs

In the job forwarding model, for jobs sent to a remote cluster, jobs exiting with success exit codes defined in the remote cluster are considered done successfully.

In the lease model, the parameters of lsb.applications apply to jobs running on remote leased hosts as if they are running on local hosts.

Use application profiles

Submit jobs to application profiles

Use the -app option of bsub to specify an application profile for the job.

  1. Run bsub -app to submit jobs to an application profile. For example:

    bsub -app fluent -q overnight myjob

    LSF rejects the job if the specified application profile does not exist.

Modify the application profile associated with a job

Prerequisites: You can only modify the application profile for pending jobs.

  1. Run bmod -app application_profile_name to modify the application profile of the job. If the application profile does not exist, the job is not modified.

    The -appn option dissociates the specified job from its application profile.

    bmod -app fluent 2308

    Associates job 2308 with the application profile fluent.

    bmod -appn 2308

    Dissociates job 2308 from the application profile fluent.

Control jobs associated with application profiles

bstop, bresume, and bkill operate on jobs associated with the specified application profile. You must specify an existing application profile. If job_ID or 0 is not specified, only the most recently submitted qualifying job is operated on.

  1. Run bstop -app to suspend jobs in an application profile. For example:

    bstop -app fluent 2280

    Suspends job 2280 associated with the application profile fluent.

    bstop -app fluent 0

    Suspends all jobs associated with the application profile fluent.

  2. Run bresume -app to resume jobs in an application profile. For example:

    bresume -app fluent 2280

    Resumes job 2280 associated with the application profile fluent.

  3. Run bkill -app to kill jobs in an application profile. For example:

    bkill -app fluent

    Kills the most recently submitted job associated with the application profile fluent for the current user.

    bkill -app fluent 0

    Kills all jobs associated with the application profile fluent for the current user.
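How the job_ID argument (or 0, or nothing) selects jobs can be modeled in a few lines. This is only a sketch of the selection rule stated at the start of this section: Job and select_jobs are hypothetical names, and the sketch assumes the job list is already limited to jobs the current user may control.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    job_id: int
    app: Optional[str]   # associated application profile, if any
    submit_time: float   # e.g. seconds since the epoch


def select_jobs(jobs: list[Job], app: str, job_id: Optional[int] = None) -> list[Job]:
    """Jobs that 'bstop/bresume/bkill -app app [job_ID | 0]' would act on (model)."""
    candidates = [job for job in jobs if job.app == app]
    if job_id == 0:
        return candidates                  # 0 means every qualifying job
    if job_id is not None:
        return [job for job in candidates if job.job_id == job_id]
    if not candidates:
        return []
    # No job_ID: only the most recently submitted qualifying job.
    return [max(candidates, key=lambda job: job.submit_time)]


jobs = [Job(2280, "fluent", 100.0), Job(2281, "fluent", 200.0), Job(2300, "catia", 300.0)]
print([job.job_id for job in select_jobs(jobs, "fluent")])     # [2281]
print([job.job_id for job in select_jobs(jobs, "fluent", 0)])  # [2280, 2281]
```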

View application profile information

To view the...                                   Run...
Available application profiles                   bapp
Detailed application profile information         bapp -l
Jobs associated with an application profile      bjobs -l -app application_profile_name
Accounting information for all jobs associated   bacct -l -app application_profile_name
with an application profile
Job success and requeue exit code information    bapp -l
                                                 bacct -l
                                                 bhist -l
                                                 bjobs -l

View available application profiles

  1. Run bapp. You can view a particular application profile or all profiles. For example:

    bapp
    APPLICATION_NAME  NJOBS  PEND   RUN  SUSP
    fluent               0     0     0     0
    catia                0     0     0     0

    A dash (-) in any entry means that the column does not apply to the row.

View detailed application profile information

  1. To see the complete configuration for each application profile, run bapp -l.

    bapp -l also gives current statistics about the jobs in a particular application profile, such as the total number of jobs in the profile, the number of jobs running, suspended, and so on.

    Specify application profile names to see the properties of specific application profiles.

    bapp -l fluent
    APPLICATION NAME: fluent
     -- Application definition for Fluent v2.0
    STATISTICS:
       NJOBS     PEND      RUN    SSUSP    USUSP      RSV
           0        0        0        0        0        0

    PARAMETERS:
     CPULIMIT
     600.0 min of hostA
     RUNLIMIT
     200.0 min of hostA
     PROCLIMIT
     9
     FILELIMIT DATALIMIT STACKLIMIT CORELIMIT MEMLIMIT SWAPLIMIT PROCESSLIMIT THREADLIMIT
         800 K     100 K      900 K     700 K    300 K    1000 K          400         500
    RERUNNABLE: Y
    CHUNK_JOB_SIZE: 5

View jobs associated with application profiles

  1. Run bjobs -l -app application_profile_name. For example:

    bjobs -l -app fluent
    Job <1865>, User <user1>, Project <default>, Application <fluent>,  
                         Status <PSUSP>, Queue <normal>, Command <ls> 
    Tue Jun  6 11:52:05: Submitted from host <hostA> with hold, CWD 
                         </clusters/lsf7.0/work/cluster1/logdir>; 
     PENDING REASONS: 
     Job was suspended by LSF admin or root while pending; 
     SCHEDULING PARAMETERS: 
               r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem   tlu 
     loadSched   -     -     -     -       -     -    -     -     -      -      -     - 
     loadStop    -     -     -     -       -     -    -     -     -      -      -     - 
                     cpuspeed    bandwidth
     loadSched          -            - 
     loadStop           -            - 
     

    A dash (-) in any entry means that the column does not apply to the row.

Accounting information for all jobs associated with an application profile

  1. Run bacct -l -app application_profile_name. For example:

    bacct -l -app fluent
    Accounting information about jobs that are: 
      - submitted by users jchan, 
      - accounted on all projects. 
      - completed normally or exited 
      - executed on all hosts. 
      - submitted to all queues. 
      - accounted on all service classes. 
      - associated with application profiles: fluent 
    ------------------------------------------------------------------------------ 
      
    Job <207>, User <user1>, Project <default>, Application <fluent>, Status <DONE> 
                         , Queue <normal>, Command <dir> 
    Wed May 31 16:52:42: Submitted from host <hostA>, CWD <$HOME/src/mainline/lsbatch 
                         /cmd>; 
    Wed May 31 16:52:48: Dispatched to 10 Hosts/Processors <10*hostA> 
    Wed May 31 16:52:48: Completed <done>. 
    Accounting information about this job: 
         CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP 
          0.02        6              6     done         0.0035     2M      5M 
    ------------------------------------------------------------------------------ 
    ... 
    SUMMARY:      ( time unit: second ) 
     Total number of done jobs:      15      Total number of exited jobs:     4 
     Total CPU time consumed:       0.4      Average CPU time consumed:     0.0 
     Maximum CPU time of a job:     0.0      Minimum CPU time of a job:     0.0 
     Total wait time in queues:  5305.0 
     Average wait time in queue:  279.2 
     Maximum wait time in queue: 3577.0      Minimum wait time in queue:    2.0 
     Average turnaround time:       306 (seconds/job) 
     Maximum turnaround time:      3577      Minimum turnaround time:         5 
     Average hog factor of a job:  0.00 ( cpu time / turnaround time ) 
     Maximum hog factor of a job:  0.01      Minimum hog factor of a job:  0.00 
     Total throughput:             0.14 (jobs/hour)  during  139.98 hours 
     Beginning time:       May 31 16:52      Ending time:          Jun  6 12:51 
    

View job success exit values and requeue exit code information

  1. Run bjobs -l to see requeue exit values specified on the command line, if defined. For example:

    bjobs -l
    Job <405>, User <user1>, Project <default>, Status <PSUSP>, Queue <normal>, Co
                         mmand <myjob 1234>
    Tue Dec 11 23:32:00: Submitted from host <hostA> with hold, CWD </scratch/d
                         ev/lsfjobs/user1/work>, Requeue Exit Values <2>;
    ...

  2. Run bapp -l to see SUCCESS_EXIT_VALUES when the parameter is defined in an application profile. For example:

    bapp -l
    APPLICATION NAME: fluent
     -- Run FLUENT applications

    STATISTICS:
       NJOBS     PEND      RUN    SSUSP    USUSP      RSV
           0        0        0        0        0        0

    PARAMETERS:

    SUCCESS_EXIT_VALUES: 230 222 12
    ...

  3. Run bhist -l to see requeue exit values specified on the command line with bsub and requeue exit values modified with bmod. For example:

    bhist -l
    Job <405>, User <user1>, Project <default>, Command <myjob 1234>
    Tue Dec 11 23:32:00: Submitted from host <hostA> with hold, to Queue <norma
                         l>, CWD </scratch/dev/lsfjobs/user1/work>, R
                         e-queue Exit Values <1>;
    Tue Dec 11 23:33:14: Parameters of Job are changed:
                             Requeue exit values changes to: 2;
    ...

  4. Run bhist -l and bacct -l to see success exit values when a job is done successfully. If the job exited with the default success exit value 0, bhist and bacct do not display the 0 exit value. For example:

    bhist -l 405
    Job <405>, User <user1>, Project <default>, Interactive pseudo-terminal mode, Co 
                         mmand <myjob 1234> 
    ... 
    Sun Oct  7 22:30:19: Done successfully. Success Exit Code: 230 222 12. 
    ... 
    bacct -l 405 
    ... 
    Job <405>, User <user1>, Project <default>, Status <DONE>, Queue <normal>, Comma 
                         nd <myjob 1234> 
    Wed Sep 26 18:37:47: Submitted from host <hostA>, CWD </scratch/dev/lsfjobs/user1/wo 
                         rk>; 
    Wed Sep 26 18:37:50: Dispatched to <hostA>; 
    Wed Sep 26 18:37:51: Completed <done>. Success Exit Code: 230 222 12. 
     ... 
    

How application profiles interact with queue and job parameters

Application profiles operate in conjunction with queue and job-level options. In general, you use application profile definitions to refine queue-level settings, or to exclude some jobs from queue-level parameters.

Application profile settings that override queue settings

The following application profile parameters override the corresponding queue setting:

Application profile limits and queue limits

The following application profile limits override the corresponding queue-level soft limits:

Job-level limits can override the application profile limits. The application profile limits cannot override queue-level hard limits.

Processor limits

PROCLIMIT in an application profile specifies the maximum number of slots that can be allocated to a job. For parallel jobs, PROCLIMIT is the maximum number of processors that can be allocated to the job.

You can optionally specify the minimum and default number of processors. All limits must be positive integers greater than or equal to 1 that satisfy the following relationship:

1 <= minimum <= default <= maximum

Job-level processor limits (bsub -n) override application-level PROCLIMIT, which overrides queue-level PROCLIMIT. Job-level limits must fall within the maximum and minimum limits of the application profile and the queue.
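The relationship 1 <= minimum <= default <= maximum and the override order can be checked with a short sketch. It is a model under stated assumptions, not LSF logic: check_proclimit and effective_processors are hypothetical names, each PROCLIMIT is encoded as a full (minimum, default, maximum) triple, and one processor is assumed when nothing is configured.

```python
from __future__ import annotations

from typing import Optional, Tuple

Limits = Tuple[int, int, int]  # (minimum, default, maximum), hypothetical encoding


def check_proclimit(limits: Limits) -> None:
    """Raise ValueError unless 1 <= minimum <= default <= maximum."""
    minimum, default, maximum = limits
    if not 1 <= minimum <= default <= maximum:
        raise ValueError(f"invalid PROCLIMIT {minimum} {default} {maximum}")


def effective_processors(job_n: Optional[int],
                         app: Optional[Limits],
                         queue: Optional[Limits]) -> int:
    """Processors for a job: bsub -n, else application default, else queue default."""
    configured = [limits for limits in (app, queue) if limits is not None]
    for limits in configured:
        check_proclimit(limits)
    if job_n is not None:
        # A job-level value must fit both the application and queue ranges.
        for minimum, _default, maximum in configured:
            if not minimum <= job_n <= maximum:
                raise ValueError(f"bsub -n {job_n} is outside {minimum}..{maximum}")
        return job_n
    if configured:
        # The application profile (first in the list) overrides the queue.
        return configured[0][1]
    return 1  # assumption: one processor when nothing is configured


print(effective_processors(None, (2, 4, 8), (1, 1, 16)))  # 4
print(effective_processors(6, (2, 4, 8), (1, 1, 16)))     # 6
```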

Absolute run limits

If you want the scheduler to treat any run limits as absolute, define ABS_RUNLIMIT=Y in lsb.params or in lsb.applications for the application profile associated with your job. When ABS_RUNLIMIT=Y is defined in lsb.params or in the application profile, the run time limit is not normalized by the host CPU factor. Absolute wall-clock run time is used for all jobs submitted with a run limit configured.

Pre-execution

Queue-level pre-execution commands run before application-level pre-execution commands. Job-level pre-execution commands (bsub -E) override application-level pre-execution commands.

Post-execution

When a job finishes, application-level post-execution commands run, followed by queue-level post-execution commands if any.

If both application-level and job-level post-execution commands (bsub -Ep) are specified, job-level post-execution commands override application-level post-execution commands. Queue-level post-execution commands run after the application-level or job-level post-execution commands.
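The ordering rules for pre-execution and post-execution commands can be summarized in a short sketch. It only models the order described above; command_sequence is a hypothetical name, the command strings are placeholders, and the sketch ignores what happens when a pre-execution or post-execution command fails.

```python
from __future__ import annotations

from typing import Optional


def command_sequence(queue_pre: Optional[str], app_pre: Optional[str],
                     job_pre: Optional[str], app_post: Optional[str],
                     job_post: Optional[str], queue_post: Optional[str]) -> list[str]:
    """Order in which pre-/post-execution commands and the job run (model only)."""
    sequence: list[str] = []
    if queue_pre:
        sequence.append(queue_pre)   # queue-level pre-execution runs first
    pre = job_pre or app_pre         # bsub -E overrides the application command
    if pre:
        sequence.append(pre)
    sequence.append("<job>")         # placeholder for the job itself
    post = job_post or app_post      # bsub -Ep overrides the application command
    if post:
        sequence.append(post)
    if queue_post:
        sequence.append(queue_post)  # queue-level post-execution runs last
    return sequence


print(command_sequence("q_pre", "app_pre", None, "app_post", "job_post", "q_post"))
# ['q_pre', 'app_pre', '<job>', 'job_post', 'q_post']
```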

Chunk job scheduling

CHUNK_JOB_SIZE in an application profile ensures that jobs associated with the application are chunked together. CHUNK_JOB_SIZE=1 disables job chunk scheduling. Application-level job chunk definition overrides chunk job dispatch configured in the queue.

CHUNK_JOB_SIZE is ignored and jobs are not chunked under the following conditions:

If CHUNK_JOB_DURATION is set in lsb.params, chunk jobs are accepted regardless of the value of CPULIMIT, RUNLIMIT or RUNTIME.

Rerunnable jobs

RERUNNABLE in an application profile overrides queue-level job rerun, and allows you to submit rerunnable jobs to a non-rerunnable queue. Job-level rerun (bsub -r or bsub -rn) overrides both the application profile and the queue.
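The three-level precedence for rerunnable jobs can be expressed as a small function. This is an illustrative sketch, not LSF code: effective_rerunnable is a hypothetical name, and the job-level option is encoded as the string "-r", "-rn", or None.

```python
from __future__ import annotations

from typing import Optional


def effective_rerunnable(job_option: Optional[str],
                         app_rerunnable: Optional[bool],
                         queue_rerunnable: bool) -> bool:
    """job_option: "-r", "-rn", or None (no bsub rerun option given)."""
    if job_option == "-r":
        return True                 # job level wins over everything
    if job_option == "-rn":
        return False
    if app_rerunnable is not None:  # RERUNNABLE set in the application profile
        return app_rerunnable
    return queue_rerunnable         # fall back to the queue setting


print(effective_rerunnable(None, True, False))   # True
print(effective_rerunnable("-rn", True, True))   # False
```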

Resource requirements

Application-level resource requirements can be simple (one requirement for all slots) or compound (different requirements for specified numbers of slots). When resource requirements are set at the application-level as well as the job-level or queue-level, the requirements are combined in different ways depending on whether they are simple or compound.

Simple job-level, application-level, and queue-level resource requirements are merged in the following manner:

Compound application-level resource requirements are merged in the following manner:

For internal load indices and duration, jobs are rejected if they specify resource reservation requirements at the job level or application level that exceed the requirements specified in the queue.

If RES_REQ is defined at the queue level and there are no load thresholds defined, the pending reasons for each individual load index will not be displayed by bjobs.

When LSF_STRICT_RESREQ=Y is configured in lsf.conf, resource requirement strings in select sections must conform to a more strict syntax. The strict resource requirement syntax only applies to the select section. It does not apply to the other resource requirement sections (order, rusage, same, span, or cu). When LSF_STRICT_RESREQ=Y in lsf.conf, LSF rejects resource requirement strings where an rusage section contains a non-consumable resource.

Estimated runtime and runtime limits

Instead of specifying an explicit runtime limit for jobs, you can specify an estimated run time for jobs. LSF uses the estimated value for job scheduling purposes only, and does not kill jobs that exceed this value unless the jobs also exceed a defined runtime limit. The format of the runtime estimate is the same as the run limit set by the bsub -W option or the RUNLIMIT parameter in lsb.queues and lsb.applications.

Use JOB_RUNLIMIT_RATIO in lsb.params to limit the runtime estimate users can set. If JOB_RUNLIMIT_RATIO is set to 0, no restriction is applied to the runtime estimate. The ratio does not apply to the RUNTIME parameter in lsb.applications.

The job-level runtime estimate setting overrides the RUNTIME setting in an application profile in lsb.applications.

The following LSF features use the estimated runtime value to schedule jobs:

Define a runtime estimate

Define the RUNTIME parameter at the application level. Use the bsub -We option at the job-level.

You can specify the runtime estimate as hours and minutes, or minutes only. The following examples show an application-level runtime estimate of three hours and 30 minutes:

RUNTIME=3:30
RUNTIME=210

Configuring normalized run time

LSF uses normalized run time for scheduling in order to account for different processing speeds of the execution hosts.

tip:  
If you want the scheduler to use wall-clock (absolute) run time instead of normalized run time, define ABS_RUNLIMIT=Y in the file lsb.params or in the file lsb.applications for the application associated with your job.

LSF calculates the normalized run time using the following formula:

NORMALIZED_RUN_TIME = RUNTIME * CPU_Factor_Normalization_Host / CPU_Factor_Execute_Host 
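The formula can be checked with a direct translation into Python. This is a sketch: normalized_run_time is a hypothetical name, the CPU factors in the example are made-up values, and the abs_runlimit flag only mirrors the ABS_RUNLIMIT=Y behavior described in the tip above.

```python
def normalized_run_time(runtime_minutes: float,
                        cpu_factor_normalization_host: float,
                        cpu_factor_execution_host: float,
                        abs_runlimit: bool = False) -> float:
    """NORMALIZED_RUN_TIME = RUNTIME * CPU_Factor_Normalization_Host / CPU_Factor_Execute_Host.

    With abs_runlimit=True (ABS_RUNLIMIT=Y), the wall-clock value is used unchanged.
    """
    if abs_runlimit:
        return runtime_minutes
    if cpu_factor_execution_host <= 0:
        raise ValueError("CPU factor must be positive")
    return runtime_minutes * cpu_factor_normalization_host / cpu_factor_execution_host


# Hypothetical CPU factors: normalization host 2.0, execution host 4.0.
print(normalized_run_time(210, 2.0, 4.0))                     # 105.0
print(normalized_run_time(210, 2.0, 4.0, abs_runlimit=True))  # 210
```

An execution host with a larger CPU factor than the normalization host yields a shorter normalized run time, which is the intent of accounting for faster hosts.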

You can specify a host name or host model with the runtime estimate so that LSF uses a specific host name or model as the normalization host. If you do not specify a host name or host model, LSF uses the CPU factor for the default normalization host as described in the following table.

If you define...        In the file...    Then...
DEFAULT_HOST_SPEC       lsb.queues        LSF selects the default normalization host for the queue.
DEFAULT_HOST_SPEC       lsb.params        LSF selects the default normalization host for the cluster.
No default host at      -                 LSF selects the submission host as the normalization host.
either the queue or
cluster level
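The choice of normalization host can be sketched as a fall-through. This is a model, not LSF code: normalization_host is a hypothetical name, and the sketch assumes that a host given with the estimate wins first and that a queue-level DEFAULT_HOST_SPEC is preferred over the cluster-level one when both are defined, following the order of the table above.

```python
from __future__ import annotations

from typing import Optional


def normalization_host(explicit: Optional[str],
                       queue_default_host_spec: Optional[str],
                       cluster_default_host_spec: Optional[str],
                       submission_host: str) -> str:
    """Choose the host or host model whose CPU factor normalizes the estimate."""
    if explicit:                    # e.g. hostA in RUNTIME=3:30/hostA
        return explicit
    if queue_default_host_spec:     # DEFAULT_HOST_SPEC in lsb.queues
        return queue_default_host_spec
    if cluster_default_host_spec:   # DEFAULT_HOST_SPEC in lsb.params
        return cluster_default_host_spec
    return submission_host          # no default at queue or cluster level


print(normalization_host(None, None, "Ultra5S", "hostB"))  # Ultra5S
```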

To specify a host name (defined in lsf.cluster.clustername) or host model (defined in lsf.shared) as the normalization host, insert the "/" character between the minutes and the host name or model, as shown in the following examples:

RUNTIME=3:30/hostA 
bsub -We 3:30/hostA 

LSF calculates the normalized run time using the CPU factor defined for hostA.

RUNTIME=210/Ultra5S 
bsub -We 210/Ultra5S 

LSF calculates the normalized run time using the CPU factor defined for host model Ultra5S.
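The [hours:]minutes[/host_name | /host_model] format used in these examples can be parsed in a few lines. This is an illustrative sketch only: parse_runtime_spec is a hypothetical helper, and it does not validate host names against lsf.cluster.clustername or host models against lsf.shared.

```python
from __future__ import annotations

from typing import Optional


def parse_runtime_spec(spec: str) -> tuple[int, Optional[str]]:
    """Split '[hours:]minutes[/host_name | /host_model]' into (minutes, host).

    Hypothetical helper; LSF parses RUNTIME and bsub -We itself.
    """
    time_part, _, host = spec.partition("/")
    if ":" in time_part:
        hours, minutes = time_part.split(":", 1)
        total_minutes = int(hours) * 60 + int(minutes)
    else:
        total_minutes = int(time_part)
    return total_minutes, (host or None)


print(parse_runtime_spec("3:30/hostA"))    # (210, 'hostA')
print(parse_runtime_spec("210/Ultra5S"))   # (210, 'Ultra5S')
print(parse_runtime_spec("3:30"))          # (210, None)
```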

tip:  
Use lsinfo to see host name and host model information.

Guidelines for defining a runtime estimate
  1. You can define an estimated run time, along with a runtime limit (job level with bsub -W, application level with RUNLIMIT in lsb.applications, or queue level with RUNLIMIT in lsb.queues).
  2. If the runtime limit is defined, the job-level (-We) or application-level RUNTIME value must be less than or equal to the run limit. LSF ignores the estimated runtime value and uses the run limit value for scheduling when the estimated runtime value exceeds the run limit value, or when no estimated runtime value is defined.
  3. For chunk jobs, ensure that the estimated runtime value is equal to or less than the CHUNK_JOB_DURATION value defined in lsb.params.

How estimated run time interacts with run limits

The following table lists the expected behavior for each combination of job-level runtime estimate (-We), job-level run limit (-W), application-level runtime estimate (RUNTIME), application-level run limit (RUNLIMIT), and queue-level run limit (RUNLIMIT, both default and hard limit). Ratio is the value of JOB_RUNLIMIT_RATIO defined in lsb.params. A dash (-) indicates that no value is defined.

-We  -W            App RUNTIME  App RUNLIMIT  Queue default  Queue hard
T1   -             -            -             -              -
     Result: Job is accepted. Jobs running longer than T1*ratio are killed.

T1   T2>T1*ratio   -            -             -              -
     Result: Job is rejected.

T1   T2<=T1*ratio  -            -             -              -
     Result: Job is accepted. Jobs running longer than T2 are killed.

T1   T2<=T1*ratio  T3           T4            -              -
     Result: Job is accepted. Jobs running longer than T2 are killed.
             T2 overrides T4 or T1*ratio overrides T4. T1 overrides T3.

T1   T2<=T1*ratio  -            -             T5             T6
     Result: Job is accepted. Jobs running longer than T2 are killed.
             If T2>T6, the job is rejected.

T1   -             T3           T4            -              -
     Result: Job is accepted. Jobs running longer than T1*ratio are killed.
             T1*ratio overrides T4. T1 overrides T3.

T1   -             -            -             T5             T6
     Result: Job is accepted. Jobs running longer than T1*ratio are killed.
             If T1*ratio>T6, the job is rejected.
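The rows of this table can be condensed into one function. This is a sketch of the table only, not of LSF's scheduler: runlimit_outcome is a hypothetical name, values are in minutes, it assumes JOB_RUNLIMIT_RATIO > 0, and T3, T4, and T5 are left out because, in the rows listed, they never change the outcome.

```python
from __future__ import annotations

from typing import Optional, Tuple


def runlimit_outcome(t1: float, ratio: float,
                     t2: Optional[float] = None,
                     t6: Optional[float] = None) -> Tuple[bool, Optional[float]]:
    """Return (accepted, kill_after) for the rows of the table above.

    t1: bsub -We, t2: bsub -W, t6: queue hard RUNLIMIT, ratio: JOB_RUNLIMIT_RATIO.
    """
    if ratio <= 0:
        raise ValueError("this sketch only models JOB_RUNLIMIT_RATIO > 0")
    if t2 is not None and t2 > t1 * ratio:
        return False, None                      # -W exceeds -We * ratio
    kill_after = t2 if t2 is not None else t1 * ratio
    if t6 is not None and kill_after > t6:
        return False, None                      # exceeds the queue hard limit
    return True, kill_after


print(runlimit_outcome(60, 2))           # (True, 120)
print(runlimit_outcome(60, 2, t2=100))   # (True, 100)
print(runlimit_outcome(60, 2, t6=90))    # (False, None)
```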


Platform Computing Inc.
www.platform.com