Troubleshooting


This chapter describes some techniques you can use to determine the cause of a problem within an LSF desktop support environment. It also describes how the LSF batch commands behave in an LSF desktop support environment, and how each of the LSF scheduling policies affects LSF desktop support jobs.



Desktop Server Stops Dispatching Jobs

Follow this procedure if you determine that the desktop server is not dispatching jobs.

If jobs are not being dispatched:

  1. Is the desktop server running?
    • No. Make sure that Apache and Tomcat are running, and then start the desktop server.
    • Yes. Continue to the next step.
  2. Are Apache and Tomcat running?
    • No. Run ACH_TOP/etc/lsfac_daemons start to start them. Restart the desktop server.
    • Yes. Continue to the next step.
  3. Run ACH_TOP/etc/lsfac_daemons stop to shut down the desktop server.
  4. Shut down and restart Apache and Tomcat.
  5. Start the desktop server.
  6. Are jobs being dispatched now?
    • No. Contact Platform Technical Support.
    • Yes. Continue processing.

Important


With EGO management of LSF desktop support services enabled, you should not use the command lsfac_daemons to start Apache or Tomcat services. Instead, you should use the egosh command to start these services.
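
The checks in the procedure above can be summarized as a small decision function. This is only an illustrative sketch: the function name and its boolean inputs are hypothetical, and it returns the next action as text instead of running any command.

```python
def next_dispatch_action(server_running, web_running, ego_managed=False):
    """Return the next troubleshooting action for a server that is not dispatching.

    All parameter names are hypothetical. The caller determines the three
    facts (server up, Apache/Tomcat up, EGO management enabled) by whatever
    means the site uses.
    """
    # With EGO management enabled, egosh replaces lsfac_daemons.
    start_web = ("use egosh to start Apache and Tomcat" if ego_managed
                 else "run ACH_TOP/etc/lsfac_daemons start")
    if not web_running:
        return start_web + ", then restart the desktop server"
    if not server_running:
        return "start the desktop server"
    # Everything is running but jobs are still not dispatched: full restart.
    return ("stop the desktop server, restart Apache and Tomcat, "
            "start the desktop server; if jobs are still not dispatched, "
            "contact Platform Technical Support")
```

For example, `next_dispatch_action(False, True)` returns `"start the desktop server"`.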



The Desktop Client Stops Working

Follow this procedure if you determine that a desktop client stops running jobs.

If a desktop client is not working:

  1. Look at http://host_name/servlet/StatsViewer (where host_name is the desktop server host) to see when the desktop client last requested a job or returned status. Has the desktop client logged in within the last polling interval?
    • No. The desktop client has not logged in within the interval specified by SEDPollInterval. Go to step 4.
    • Yes. The desktop client has logged in within the interval specified by SEDPollInterval. Continue with step 2.
  2. Look at the Job Status page at http://host_name/servlet/StatsViewer. Are there jobs waiting to be run?
    • No. There are no jobs to run.
    • Yes. Continue with step 3.
  3. Can you ping the desktop server from the desktop client?
    • No. There is a problem with the network.
    • Yes. Continue with step 4.
  4. Restart the desktop client by stopping the SED service and starting it again. Does the desktop client start to run a job?
    • No. Call Platform Technical Support.
    • Yes. The problem is resolved.
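
The client procedure above can be sketched as a function that compares the last contact time with the polling interval and then walks the remaining checks. The function and parameter names are hypothetical, and the polling interval is passed as a `timedelta` because this chapter does not state the unit used by SEDPollInterval.

```python
from datetime import datetime, timedelta

def triage_client(last_contact, now, poll_interval, jobs_waiting, can_ping_server):
    """Return the outcome of the desktop client checks (hypothetical helper)."""
    # Step 1: no contact within the polling interval means go to step 4.
    if now - last_contact > poll_interval:
        return "restart the SED service"
    # Step 2: nothing is queued, so the client is idle for a valid reason.
    if not jobs_waiting:
        return "no jobs to run"
    # Step 3: the client cannot reach the desktop server.
    if not can_ping_server:
        return "network problem"
    # Step 4: all checks passed but the client still does not run jobs.
    return "restart the SED service"
```

If restarting the service does not help, the procedure ends with a call to Platform Technical Support.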



Writing LSF desktop support logs to a single directory

You can make it easier to find the Tomcat and Apache log files by choosing to write these files to the directory that contains log files for other LSF desktop support services.

You must restart Apache and Tomcat after changing the configuration.

Configure the Apache log files:

Configuration file      Parameter and syntax                               Behavior
apache/conf/httpd.conf  ErrorLog path_name/error_log.host_name             The Apache error log is written to the specified directory on the specified host
                        CustomLog path_name/access_log.host_name common    The Apache events log is written to the specified directory on the specified host

Configure the Tomcat log files:

Configuration files:
  • jakarta-tomcat-4.1.30-LE-jdk14/conf/server-noexamples.xml.config
  • jakarta-tomcat-4.1.30-LE-jdk14/conf/server.xml
  • jakarta-tomcat-4.1.30-LE-jdk14/webapps/admin.xml

Syntax:
<Logger className="org.apache.catalina.logger.FileLogger" directory="logfile_directory" prefix="file_prefix_host_name" suffix=".txt" timestamp="true"/>

Behavior:
  • The Tomcat log files are written to the specified directory on the specified host
  • When your system has more than one desktop server (MED), adding the host name to the file prefix prevents log files from being overwritten

Configure the Tomcat shell script:

Shell script: jakarta-tomcat-4.1.30-LE-jdk14/bin/catalina.sh

Syntax:

Line 213
- touch "$CATALINA_BASE"/logs/catalina.out
+ touch "path_name"/catalina.out

Lines 225 and 237
- >> "$CATALINA_BASE"/logs/catalina.out 2>&1 &
+ >> "path_name"/catalina.out 2>&1 &

Behavior:
  • The Tomcat log files are written to the specified directory

Configure the LSF desktop support log directory:

Configuration file    Syntax                   Behavior
AC_TOP/wscache.conf   wscache log path_name    LSF desktop support log files are written to the specified directory
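
When several desktop servers share one log directory, it can help to generate the host-specific settings rather than type them by hand. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the underscore that joins the file prefix and the host name is an assumption, and the output is text to paste into the configuration files, not something the product reads directly.

```python
from xml.sax.saxutils import quoteattr

def apache_log_directives(log_dir, host_name):
    """Build the two httpd.conf directives with a host-specific file name."""
    return [
        f"ErrorLog {log_dir}/error_log.{host_name}",
        f"CustomLog {log_dir}/access_log.{host_name} common",
    ]

def tomcat_logger_element(log_dir, file_prefix, host_name):
    """Build the Tomcat Logger element with the host name in the prefix."""
    # Assumption: prefix and host name are joined with an underscore.
    prefix = f"{file_prefix}_{host_name}"
    # quoteattr escapes the value and wraps it in quotation marks.
    return ('<Logger className="org.apache.catalina.logger.FileLogger" '
            f'directory={quoteattr(log_dir)} prefix={quoteattr(prefix)} '
            'suffix=".txt" timestamp="true"/>')
```

For example, `apache_log_directives("/var/log/lsfdesktop", "med01")` yields `ErrorLog /var/log/lsfdesktop/error_log.med01` and `CustomLog /var/log/lsfdesktop/access_log.med01 common`; the directory and host name here are placeholders.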



Debugging the Desktop Client

You can place a desktop client in debug mode to log all debug information regarding the desktop client. The desktop client (SED) records connection error and debug messages created during job execution in the Windows application event service. LSF desktop support administrators can easily retrieve these error messages using the remote event viewer or a terminal services session.

LSF desktop support does not write any messages containing passwords or other authentication information to the Windows event service.

To set a desktop client to debug mode:

Configuration file: SEDConfig.xml

Parameter and syntax:
<SEDDebug>
<SEDName>host_name</SEDName>
</SEDDebug>
Default: Not defined
Behavior:
  • Sets a single desktop client (SED) to debug mode

Parameter and syntax:
<SEDDebug>
<SEDName>host_name</SEDName>
<SEDName>host_name</SEDName>
</SEDDebug>
Default: Not defined
Behavior:
  • Sets multiple desktop clients (SEDs) to debug mode
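
For a long list of desktop clients, the SEDDebug block can be generated instead of typed. This sketch uses Python's standard XML library; the function name and the host names in the example are hypothetical, and the result still has to be placed in SEDConfig.xml by hand.

```python
import xml.etree.ElementTree as ET

def sed_debug_fragment(host_names):
    """Return a SEDDebug element with one SEDName child per host name."""
    root = ET.Element("SEDDebug")
    for name in host_names:
        ET.SubElement(root, "SEDName").text = name
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(root, encoding="unicode")
```

Calling `sed_debug_fragment(["pc01", "pc02"])` returns `<SEDDebug><SEDName>pc01</SEDName><SEDName>pc02</SEDName></SEDDebug>`.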



Debugging a Desktop Application

LSF desktop support traps stdout and stderr messages in files specified by the bsub -e, -eo, -o, and -oo options.

Set a maximum log file size:

To prevent verbose applications from generating large log files, you can set a maximum log file size.

Configuration file: SEDConfig.xml
Parameter and syntax: <SEDMaxOutputLogSize>file_size</SEDMaxOutputLogSize>
Default: Not defined. The file size is unlimited.
Behavior:
  • Sets a maximum file size, in KB, for stderr and stdout log files



Recovering from a Power Outage

Under normal circumstances, even after a power outage, LSF desktop support jobs should continue to run. Follow the procedures listed here to ensure resumption of normal processing.

After a power outage:

Restart the desktop server. In most cases, jobs will continue to run normally within LSF desktop support.

If the desktop server cannot start:

  1. Check the file sbatchd.log.clustername (in the directory specified in LSF_LOGDIR) to see if an event record is corrupted. If an event record is corrupted, the log points to the corrupted line number in either $ACH_TOP/work/.clustername.sbd/med.events or $ACH_TOP/work/.clustername.sbd/sbd.events.
  2. If an event record is corrupted, contact Platform Technical Support for assistance.

If a job seems to be "stuck":

  1. Issue the bkill command to kill the job if you do not want the job to continue. Otherwise, issue the brequeue command to redispatch the job.
  2. If bkill does not work:
    1. Shut down the desktop server.
    2. Issue the bkill command on an LSF host to kill the job.
    3. Restart the desktop server.



Job Blocked at the Desktop Server with Many File Transfers

LSF desktop support supports a maximum of 32 file transfer requests with the -f option in the bsub command. Specifying more than 32 file transfer requests can cause abnormal behavior.

Workaround

Use zip and unzip commands to reduce the number of file transfer requests.

Example

In the following example, myjob.exe requires a total of 66 file transfers: 33 to copy files to the desktop client, and 33 to copy the results from the desktop client.

Incorrect usage

bsub -f "data1 > data1" ... -f "data33 > data33" -f "result1 < result1" ... -f "result33 < result33" myjob.exe

Recommended usage

  1. Zip the data files together into one file. For example:
    zip data.zip data1 data2 data3 ... data33
    
  2. Create a job wrapper that unzips the data files, runs the executable and zips the results. For example, the wrapper myjob.bat might look like this:
    unzip data.zip
    myjob.exe
    zip result.zip result1 result2 result3 ... result33
    
  3. Submit the job, transferring the data files, the wrapper and the executable to the desktop client, and transferring the zipped results file back from the desktop client. For example:
    bsub -f "data.zip > data.zip" -f "myjob.bat > myjob.bat" -f "myjob.exe > myjob.exe" -f "result.zip < result.zip" myjob.bat
    
  4. When the job is completed, unzip the result file:
    unzip result.zip
    

If you do not have zip and unzip on your system, you can get them from the Internet, and install them on each desktop client as required.
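
Where zip is not installed on the submission host, Python's standard zipfile module can bundle the input files in the same way. The sketch below also checks the 32-transfer limit before building the bsub argument list. The function names are hypothetical, and the code only builds the command; it does not submit anything.

```python
import zipfile
from pathlib import Path

MAX_FILE_TRANSFERS = 32  # limit stated in this section

def bundle_files(paths, archive):
    """Write the given files into one zip archive, stored by base name."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            zf.write(path, arcname=Path(path).name)
    return archive

def within_transfer_limit(transfers):
    """Return True if the number of -f requests is within the limit."""
    return len(transfers) <= MAX_FILE_TRANSFERS

def build_bsub_command(transfers, job):
    """Build a bsub argument list with one -f option per transfer request."""
    if not within_transfer_limit(transfers):
        raise ValueError("more than 32 file transfer requests; bundle the files")
    command = ["bsub"]
    for spec in transfers:
        command += ["-f", spec]
    command.append(job)
    return command
```

The resulting list can be passed to `subprocess.run` on a host where bsub is available. Passing a list instead of a single string avoids shell quoting problems with the `>` and `<` operators.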



LSF Policies in LSF Desktop Support

Because LSF desktop support runs jobs on desktop clients rather than directly on LSF hosts, some LSF scheduling policies and commands behave differently or are unsupported in LSF desktop support.

Batch commands

Command     Supported in LSF desktop support?  Description
bacct       Partially                          The data is kept, but items such as queue time are misleading, since they denote only the time before being dispatched to the MED.
bbot        Yes                                Jobs that are pending can be moved using bbot. Jobs that are running cannot.
bchkpnt     No
bclusters   Yes                                LSF desktop support does not affect this command.
bhist       Yes                                Displays only some queue and dispatch data. The desktop client that takes the job is logged in bhist.
bhosts      Yes                                Note that this is only "near" real-time data.
bhpart      Not applicable
bjobs       Yes                                The host listed is the MED host. The desktop client that actually takes the job is logged in bjobs using bpost.
bkill       Yes                                Signals can be sent, but they have no effect because signals are not supported.
bmgroup     Not applicable
bmig        Yes                                bmig forces LSF desktop support to terminate a job, which is requeued onto another MED host or the same MED host.
bmod        Not applicable                     None of the post-dispatch options are supported. You can change any parameter provided the job has not yet been dispatched to LSF desktop support.
bparams     Yes
bpeek       No                                 You cannot peek at the output of a job running on a desktop client.
bpost       Yes                                You can bpost and bread to LSF desktop support jobs.
bqueues     Partially                          bqueues displays information about LSF queues, but once a job is dispatched to an MED, it is queued on the Web server. The Web server queue is not displayed in bqueues.
bread       Yes
brequeue    Yes                                brequeue forces LSF to kill and reschedule an LSF desktop support job.
brestart    No
bresume     No                                 You cannot suspend or resume an LSF desktop support job.
bstatus     Yes                                bstatus is part of the bpost, bread command set and is supported for individual LSF desktop support jobs.
bstop       No                                 You cannot suspend or resume an LSF desktop support job.
bsub        Yes                                See bsub options.
bswitch     Yes                                bswitch can move jobs that have not been dispatched to an MED to another queue. bswitch does not operate on any chunk job that has been dispatched (same as LSF).
btop        Yes                                Works on all pending jobs.
bugroup     Yes
busers      Yes
ls*         Partially                          The ls commands work only in LSF, not in LSF desktop support.
x*          Yes                                Graphical interfaces work in accordance with their respective command line batch commands listed above.

bsub options

bsub Option                 Supported in LSF desktop support?  Description
-B (email upon dispatch)    Partially                          The user receives the email when the job is sent to the MED, not when it is dispatched to the desktop client.
-H (hold job)               Yes                                Even though you cannot stop a job in LSF desktop support, you can submit jobs and keep them in PSUSP state.
-I (interactive)            No
-K (wait)                   Yes
-N (job report email)       Yes                                Works much like LSF, although the user may receive the email before job cleanup occurs.
-r (rerunnable)             Partially                          LSF desktop support has its own concept of rerunnable that is controlled at the MED/web server. Using the -r option allows all jobs to be rerun or moved off an MED if required.
-x (exclusive execution)    Partially                          If a single job is sent to LSF desktop support, exclusive works correctly. However, if a chunk job is sent to LSF desktop support, it assumes the entire chunk job is one job and executes the entire chunk. LSF treats a chunk job as a serial pipeline, while LSF desktop support treats a chunk job as a parallel pipeline (all chunks can run at once).
-b (begin time)             Yes
-C (core size limit)        No
-c (CPU time limit)         Not recommended                    Due to differences in CPU speeds, it is recommended that you use the run time limit (-W) instead.
-D (data size limit)        No
-e (stderr)                 Yes
-eo (stderr)                Yes
-E (pre-execution)          Partially                          The pre-execution command is run on the MED host, not on the individual desktop client.
-f (file copy)              Yes                                Required parameter in LSF desktop support. LSF desktop support also uses this parameter to cache all files. Operators for this parameter are described in the Platform LSF Desktop Support User's Guide.
-F (file size limit)        No
-G (user group)             Yes
-i (stdin)                  No
-J (job array)              Yes
-k (checkpoint)             No
-L (login shell)            No
-m (specified hosts)        Not recommended                    You can specify the MED, but items will be set at the queue level, so you should specify a queue instead.
-M (memory limit)           No                                 Swap limit is supported.
-n (number of processors)   No
-o (stdout)                 Yes
-oo (stdout)                Yes
-P (project name)           Yes
-q (queue)                  Yes
-R (resource requirements)  Partially                          Only the resources attributed to the MED host are taken into consideration. Desktop clients do not have resources in LSF. Use -extsched.
-sp (priority)              Yes                                Only works on pending jobs.
-S (stack size limit)       No
-t (terminate time)         Yes                                You can use terminate time to kill jobs, but the run time limit is better. It is very difficult to guarantee that the job has actually started in LSF desktop support.
-u (email recipient)        Yes
-v (swap limit)             Yes
-w (job dependency)         Yes
-W (run time limit)         Yes                                Also indicates the time-out before rerunning a job.
-Zs (spooling)              No                                 Has no effect on LSF desktop support.
-extsched                   Partially                          Supports desktop client resource-aware scheduling.
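
The bsub options table can be turned into a simple pre-submission check. The sketch below is hypothetical: it is not part of LSF, the function name is made up, and it takes a plain list of option flags (not a full command line) and reports those the table marks as unsupported or not recommended.

```python
# Options listed as "No" in the table. -I is interactive; -i is the bsub
# option for stdin.
UNSUPPORTED = {"-I", "-C", "-D", "-F", "-i", "-k", "-L", "-M", "-n", "-S", "-Zs"}

# Options listed as "Not recommended", with the alternative from the table.
NOT_RECOMMENDED = {
    "-c": "use the run time limit (-W) instead",
    "-m": "specify a queue (-q) instead",
}

def check_bsub_options(options):
    """Return one message per problematic option, in the order given."""
    problems = []
    for opt in options:
        if opt in UNSUPPORTED:
            problems.append(f"{opt}: not supported in LSF desktop support")
        elif opt in NOT_RECOMMENDED:
            problems.append(f"{opt}: not recommended, {NOT_RECOMMENDED[opt]}")
    return problems
```

For example, `check_bsub_options(["-f", "-W", "-n"])` returns `["-n: not supported in LSF desktop support"]`. Options the table marks as partially supported are not flagged, so their descriptions still need to be read.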

LSF features

LSF Feature                                    Applies to LSF desktop support?    Description
Interactive jobs                               No
Fairshare                                      Partial                            Host-level fairshare (that is, host partitioning) is not supported, but user-level fairshare is supported. During the execution of a job, LSF desktop support reports some resource usage. Once the job completes, resource usage is recorded. This information is used by the fairshare algorithm.
esub                                           Yes
eexec                                          No
Pre/Post execution                             No
JobStarter                                     No
Queues                                         Yes                                All queue types and commands work with LSF desktop support. However, decisions on dispatch to the sbatchd/MED can realistically only be made based on the number of jobs currently being run at any MED. The MED has its own queues, which cannot be controlled.
Resource scheduling                            Partial                            Resource scheduling is done at the MED level, not at the individual desktop client level. LSF desktop support has its own desktop client resource scheduling.
Resource requirements                          Partial                            Jobs can be restricted to only run on hosts with available disk and memory.
Resource limits                                Yes, limited                       Job resource usage can be limited by CPU, memory or time.
Checkpoint, Restart & Migration                No                                 These do not make sense in an LSF desktop support environment.
Pre-emption                                    Yes
Chunk jobs                                     Yes
Authentication                                 Yes                                Authentication of nodes is done via SSL protocol, if desired.
User account mapping, environment replication  No                                 No attempt is made to duplicate the submission environment. This must be completed by the job itself.
Job controls                                   No
Job arrays                                     Yes
Parallel jobs                                  No
Job email                                      Yes
Slot limits                                    Yes                                Slot limits for user, queue and hosts are applicable to LSF desktop support.
Advanced reservation                           Partial                            Slots are only reserved at the MED level.
Multicluster forward model                     Yes                                A job can be forwarded to remote clusters.
Multicluster leasing model                     Yes                                Jobs can be dispatched to MED at remote clusters.
Deadline constraints                           Yes



      Date Modified: January 29, 2009
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2009 Platform Computing Corporation. All rights reserved.