Automatic failure recovery

The automatic failure recovery feature ensures maximum resource availability to run your workload when a system component fails or becomes unavailable due to a power outage, network failure, application deficiency, or other cause.

This feature is not applicable in Symphony DE.

About automatic failure recovery

Purpose of automatic failure recovery

Automatic failure recovery provides a way for the system to automatically restart critical system services and enables you to customize application (service) error handling for each of your applications. Symphony handles a number of failure recovery scenarios.

Benefits of automatic failure recovery

Automatic failure recovery provides a number of benefits, including:
  • Application isolation: failure of one application does not affect any other application, and failure or unavailability of a resource management (EGO) component has no impact on running workload.

  • Fault-tolerant tasks: with recoverable workload configured, automated failover and data persistence ensure that running workload submitted by an application client continues to run without user intervention when system processes or hosts fail.

  • Cluster reliability: master host failover and automatic restart of critical system services ensure high resource availability.

The following illustration shows the benefits of the automatic failure recovery feature once all workload management (SOAM) and resource management (EGO) components have started successfully. In this example, the application profile defines a recoverable session (workload) and the cluster administrator has defined a list of master candidates.

Scope


Applicability

Details

Operating system

  • All host types supported by the Symphony system.

Dependencies

  • For master host failover, you must specify one or more master host candidates.

  • Files required for failover must be on a shared file system.

  • Cluster administrator and consumer user accounts must have operating system permissions to access directories on the shared file system.

Limitation

  • Symphony does not provide automatic failure recovery of the shared file system if the shared file system becomes unavailable.


Configuration to enable automatic failure recovery

Automatic failure recovery is enabled for automatic process restart for critical system services and for restart of Symphony workload management (SOAM) components. Automatic failure recovery for applications is enabled by default in the application profile. You can also enable:
  • Session manager failover

  • Session recovery, which makes workload recoverable

  • Master host failover

Configuration to enable session manager failover

Session manager failover is enabled by default when you use a shared file system and do not change the default values for any of the following attributes:
  • SOAM > SSM > resourceGroupName

  • SOAM > SSM > workDir

  • SOAM > DataHistory > path

  • SOAM > PagingTasksInput > path

  • SOAM > PagingTasksOutput > path

  • SOAM > PagingCommonData > path

  • SOAM > PagingCommonDataUpdates > path

  • SOAM > JournalingTasks > path

  • SOAM > JournalingSessions > path

  • SOAM > JournalingSessionTagConfig > path

Changing any of these attributes could affect session manager failover. For detailed descriptions of these attributes, see the Platform Symphony Reference.
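
The check implied by the list above can be sketched in Python. This is an illustration only: the attribute paths come from the list, but the function name, the dictionary layout, and the sample values are hypothetical and do not reflect real Symphony defaults.

```python
# Attributes whose defaults must be left unchanged (paths taken from the list above).
FAILOVER_ATTRIBUTES = [
    "SOAM/SSM/resourceGroupName",
    "SOAM/SSM/workDir",
    "SOAM/DataHistory/path",
    "SOAM/PagingTasksInput/path",
    "SOAM/PagingTasksOutput/path",
    "SOAM/PagingCommonData/path",
    "SOAM/PagingCommonDataUpdates/path",
    "SOAM/JournalingTasks/path",
    "SOAM/JournalingSessions/path",
    "SOAM/JournalingSessionTagConfig/path",
]


def changed_failover_attributes(defaults, current):
    """Return the failover-related attributes whose value differs from the default.

    Both arguments are plain dicts keyed by attribute path (a hypothetical
    representation; Symphony stores these values in the application profile).
    A key missing from `current` is treated as "default left unchanged".
    """
    return [
        name
        for name in FAILOVER_ATTRIBUTES
        if name in current and current[name] != defaults.get(name)
    ]


# Hypothetical values, for illustration only.
defaults = {"SOAM/SSM/workDir": "<default work dir>"}
current = {"SOAM/SSM/workDir": "/local/scratch/ssm"}
changed = changed_failover_attributes(defaults, current)
print(changed)
```

Any name the function returns is a candidate to review before relying on session manager failover.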

Configuration to enable session recovery

Defining a recoverable session makes workload recoverable after session manager failover or restart.

Section

Attribute name and syntax

Behavior

SessionTypes > Type

recoverable=true | false

  • Specifies whether the session can be recovered after session manager failover or restart. If true, Symphony persists the common data and its update (if any) for the session, task data for tasks that have not yet been returned to the client, and data required to reconstruct those objects. If false (default), the system does not persist session and task data, and tasks must be rerun.

Important:

If the file system that the SSM uses for paging and recovery is not stable, configure flushDataAsap="true" in the application profile. The SSM then writes data directly to disk, rather than to the system cache, before continuing with the next operation. This guarantees that the data is on disk, not in the system cache, when the SSM finishes the disk write operation, so the SSM can read the data back during recovery. Refer to the application profile reference for details about this parameter.

It is strongly recommended to use a stable and reliable file system in your Symphony cluster to avoid losing any data.
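
As a rough illustration of where these two attributes sit, the following Python sketch builds a minimal profile fragment with the standard library. The element names Consumer, SessionTypes, and Type follow the section names used in this document; the root element name Profile and the session type name RecoverableSession are assumptions made for the sketch, and the result is not a complete or valid application profile.

```python
import xml.etree.ElementTree as ET

# "Profile" as the root element and the session type name are assumptions
# made for this sketch; a real application profile has more required content.
profile = ET.Element("Profile")

# flushDataAsap is described in this document as a Consumer attribute.
ET.SubElement(profile, "Consumer", {"flushDataAsap": "true"})

# recoverable is an attribute of SessionTypes > Type.
session_types = ET.SubElement(profile, "SessionTypes")
ET.SubElement(
    session_types,
    "Type",
    {"name": "RecoverableSession", "recoverable": "true"},
)

fragment = ET.tostring(profile, encoding="unicode")
print(fragment)
```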


Configuration to enable master host failover

The master candidate list defines which hosts are master candidates. By default, the list includes just one host, the master host, and there is no failover. If you configure additional candidates to enable failover, the master host is first in the list. If the master host becomes unavailable, the next host becomes the master, and so on down the list.
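
The list-order rule can be sketched in a few lines of Python. This is a simplified model, not Symphony code: the function name and host names are hypothetical, and host availability is reduced to membership in a set.

```python
def select_master(candidates, available):
    """Return the first candidate, in list order, that is currently available.

    candidates: host names in failover priority order, master host first.
    available: set of host names that are currently reachable.
    Returns None when no candidate is available.
    """
    for host in candidates:
        if host in available:
            return host
    return None


# Hypothetical host names.
candidates = ["hostA", "hostB", "hostC"]
print(select_master(candidates, {"hostA", "hostB", "hostC"}))  # hostA
print(select_master(candidates, {"hostB", "hostC"}))           # hostB
```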

For master candidate failover to work properly, the master candidates must share a file system that must always be available.

Important:

The shared file system should not reside on a master host or any of the master candidates. If the shared file system resides on a master host and the master host fails, the next candidate cannot access the necessary files.

If you have configured at least one management host for your cluster in addition to the master host but have not selected any failover candidates, the Platform Management Console dashboard displays a reminder message in red with a link to the page from which you define the master candidate list.

Configuration source

Setting

Behavior

Platform Management Console: Cluster > Summary > Master Candidates

  • Add available hosts to the Master Candidates list, or Remove hosts from the Master Candidates list.

  • Rearrange the order of master candidates: host_name > Up | Down

  • The master candidates are now set in the order you want them to fail over. The cluster automatically restarts when you click Apply, making the changes take effect.

  • All master candidates must be selected from the available management hosts. A compute host cannot be a master candidate.

  • The default configuration of the EGOManagementServices consumer provides for master candidate failover; do not change the number of slots owned by this consumer.

Alternatively, you can use the command line interface to specify a list of master candidates.

Command

Description

egoconfig masterlist host_name[,host_name, …]

  • Specifies the list of master candidates, starting with the master host, and including all of the candidates in the order of failover priority.

  • host_name specifies the name of the master host and each of the master candidates. Do not specify compute hosts in this list.
    CAUTION:

    Include all master candidates in the list when you issue this command; egoconfig masterlist overwrites the existing list.
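
Because the command replaces the whole list, it can help to assemble the full argument before running it. The Python sketch below only builds the command string; the helper name and host names are hypothetical, and the sketch does not run egoconfig.

```python
def build_masterlist_command(master_host, candidates):
    """Build the full `egoconfig masterlist` command line.

    Because the command overwrites the existing list, the result always
    contains the master host first, followed by every candidate in
    failover priority order. Duplicate names are dropped, keeping the
    first occurrence.
    """
    ordered = []
    for host in [master_host, *candidates]:
        if host not in ordered:
            ordered.append(host)
    return "egoconfig masterlist " + ",".join(ordered)


# Hypothetical host names.
command = build_masterlist_command("hostA", ["hostB", "hostC"])
print(command)  # egoconfig masterlist hostA,hostB,hostC
```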

Automatic failure recovery behavior

Automatic failure recovery behavior depends on which process fails or becomes unavailable, and on the type of host on which the process runs.

Recovery when individual processes fail or become unavailable

The following description provides details about what happens when a workload management (SOAM), Platform Management Console, reporting, or resource management (EGO) process fails or becomes unavailable independently of other processes.
Important:

Recovery of any workload management (SOAM), Platform Management Console, Reporting, or resource management (EGO) process usually takes less than one or two minutes, and can take as little as one or two seconds, provided that the host remains available.


When this process is in failure recovery…

The effects are…

Workload

Resource allocation

Lifecycle or other processes

Workload management (SOAM) processes

Service instance (si)

You can define the actions retry or fail for the SessionEnter, SessionUpdate, and Invoke methods.

If blockHost is defined as the actionOnSI for a service instance exit, timeout, exception, or control code, the system terminates the running service instance on this host and does not use this host to start any other service instance for the application. If restart is defined as the actionOnSI, the service instance tries to restart on the original host.

You can define the following actions for service instances based on specific states of the service lifecycle: keepAlive, restartService or blockHost. The session manager will continue to run the service, restart the service on the same host, or—through communications with the Virtual Execution Machine Kernel Daemon (vemkd)—block the host for use by the application associated with the service.

Service instance manager (sim)

The session manager requeues and reruns tasks for the session that was running on the service instance manager; no workload is lost.

If blockHostOnTimeout="true" in the SOAM > SIM section of the application profile and if, after a service instance manager is started, the service instance manager process cannot contact the session manager within the startUpTimeout, the system does not use this host to start any other service instance managers for the application. If blockHostOnTimeout="false", the system tries again to start the service instance manager on the original host.

If the service instance manager dies after starting successfully, the associated service instance exits. The session manager then restarts the service instance manager.

Session manager (ssm)

For recoverable sessions, the session manager persists the information needed to resume the workload without loss of data, and session manager failover or recovery is transparent to the client application. For non-recoverable sessions, the workload is lost and the client must resubmit the workload.

When it restarts, the session manager re-registers with the resource management component (EGO) and obtains a list of resources that were previously allocated to the session manager. The session manager stops and restarts all running service instance managers on those resources.

The service instance managers associated with the failed session manager also die, and requests from the Platform Management Console and command line interface fail. The session director restarts the session manager. On restart, the session manager reads only the task and session control objects, not the input/output messages; the session manager reads those messages as required when dispatching a task. Session manager monitoring information resets; the following statistical values apply to the time period that begins with session manager restart.
  • Closed sessions since SSM started

  • Aborted sessions since SSM started

  • Time of the last session aborted

  • Done tasks since SSM started

  • Error tasks since SSM started

  • Time of the last error task

When the session manager is unavailable, clients cannot create new SDK connections.
  • If the client is already connected and the session manager becomes unavailable, the Symphony APIs retry the connection.

  • If the client has not yet connected and the session manager is unavailable, the client receives an exception and must wait for the session manager to become available.

Session director (sd)

Session director failure has no impact on running workload; the session manager handles workload execution. For new workload, clients submitting workload wait momentarily for the EGO service controller to restart the session director.

Session director failure has no impact on resource allocation. The session director saves information about the resources it uses and, after restart, uses the same resources.

While the session director is down momentarily, requests from the Platform Management Console and command line interface fail. If you set view preferences for the dashboard to automatically refresh, the request succeeds once the session director has restarted. When the session director is unavailable, clients cannot create new SDK connections.
  • If the client is already connected and the session director becomes unavailable, the Symphony APIs retry the connection.

  • If the client has not yet connected and the session director is unavailable, the client receives an exception and must wait for the session director to become available.

The EGO service controller usually restarts the session director within a few seconds on the original host or on a new host if the original host has no available resources. The EGO service controller tries up to 10 times to restart the session director before setting the status to ERROR.

Repository service (rs)

Repository service failure has no effect on running workload. New workload that needs to download a service package must wait until the repository service becomes available.

Repository service failure has no effect on resource allocation.

The EGO service controller restarts the repository service on the original host or on a new host if the original host has no available resources. The EGO service controller tries up to 10 times to restart the repository service before setting the status to ERROR.

Platform Management Console processes

Web service manager (wsm)

Web service manager failure has no effect on workload.

Web service manager failure has no effect on resource allocation.

The EGO service controller restarts the Web service manager on the original host or on a new host if the original host has no available resources. The EGO service controller tries up to 10 times to restart the Web service manager before setting the status to ERROR.

The Web service manager monitors the Java process of Tomcat, a key component of the Platform Management Console, and restarts the Java process if it goes down.

Reporting processes

Platform loader controller (plc)

Loader controller failure has no effect on workload.

Loader controller failure has no effect on resource allocation.

If the loader controller becomes unavailable, the Platform Enterprise Reporting Framework cannot collect sampling data for reporting purposes. The EGO service controller restarts the loader controller on the original host or on a new host if the original host has no available resources. The EGO service controller tries up to 10 times to restart the loader controller before setting the status to ERROR.

Data purger (purger)

Data purger failure has no effect on workload.

Data purger failure has no effect on resource allocation.

If the data purger becomes unavailable, the database could temporarily grow until the data purger recovers and can once again purge the data. The time it takes for the database to run out of space depends on the size of your system. The EGO service controller restarts the data purger on the original host or on a new host if the original host has no available resources. The EGO service controller tries up to 10 times to restart the data purger before setting the status to ERROR.

Resource management (EGO) processes

Master load information manager (master lim)

Master load information manager failure has no effect on running workload. Clients submitting new workload receive an exception.

The system considers the master host unavailable and a master candidate takes over as master host. During failover to the master candidate, the system does not respond to resource allocation requests.

If no master candidate is available, the cluster is down. The system cannot restart the master load information manager; you can manually restart it, however, using the egosh ego start all command.

Virtual Execution Machine Kernel Daemon (vemkd)

Virtual Execution Machine Kernel Daemon failure has no effect on running workload. Clients submitting new workload receive an exception.

During failure recovery, the system does not respond to resource allocation requests.

The master load information manager restarts the Virtual Execution Machine Kernel Daemon.

Process execution monitor (pem)

Process execution monitor failure has no effect on running workload.

Process execution monitor failure has no effect on resource allocation.

The load information manager restarts the process execution monitor on a compute or management host. The master load information manager restarts the process execution monitor on the master host.

EGO service controller (egosc)

EGO service controller failure has no effect on running workload.

EGO service controller failure has no effect on resource allocation.

The Virtual Execution Machine Kernel Daemon restarts the EGO service controller.

Load information manager (lim)

The system considers the host unavailable and terminates workload on the unavailable host. EGO notifies the SOAM component (session director or session manager) that has been allocated to the unavailable host. The session director or session manager stops the service (service instance and service instance manager) on that host and requests another resource.

The system does not allocate any resources on the unavailable host.

The master load information manager restarts the load information manager on the compute or management host.
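
The supervision relationships in the table above, and the 10-attempt restart limit stated for the EGO service controller, can be summarized in a short Python sketch. It is a simplified model: the dictionary, the function, and the try_start callable are hypothetical names, and real restart timing and host selection are not modeled.

```python
# Which process restarts which, as described in the table above.
# None means the system cannot restart the process automatically.
RESTARTED_BY = {
    "service instance manager": "session manager",
    "session manager": "session director",
    "session director": "EGO service controller",
    "repository service": "EGO service controller",
    "web service manager": "EGO service controller",
    "loader controller": "EGO service controller",
    "data purger": "EGO service controller",
    "EGO service controller": "vemkd",
    "vemkd": "master lim",
    # On the master host, the master lim restarts the pem instead.
    "process execution monitor": "lim",
    "lim": "master lim",
    "master lim": None,
}

MAX_RESTART_ATTEMPTS = 10  # limit stated for the EGO service controller


def restart_service(try_start, max_attempts=MAX_RESTART_ATTEMPTS):
    """Call try_start() until it succeeds or the attempt limit is reached.

    try_start is a hypothetical callable that returns True on success.
    Returns ("STARTED", attempts) or ("ERROR", attempts).
    """
    for attempt in range(1, max_attempts + 1):
        if try_start():
            return "STARTED", attempt
    return "ERROR", max_attempts


# A service that only comes up on the third attempt.
outcomes = iter([False, False, True])
print(restart_service(lambda: next(outcomes)))  # ('STARTED', 3)
print(restart_service(lambda: False))           # ('ERROR', 10)
```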


Recovery when hosts fail

When processes become unavailable in combination because of a hardware failure, you see the following behavior.
Note:

The majority of the time required for failover of compute, management, and master hosts is used to confirm that the host is actually unavailable. This prevents temporary network delays or instability from triggering frequent and unnecessary host switches.


When this host is down…

The effects are…

Compute host

  • The following processes become unavailable during failure recovery:
    • Load information manager

    • Process execution monitor

    • Service instance manager

    • Service instance

  • When the session manager-service instance manager connection breaks, the session manager requeues the affected tasks. If the session manager does not recognize the broken connection, the resource manager (EGO) notifies the session manager within three minutes that the host is down.

  • The session manager requests a new resource.

  • Workload runs on the new compute host.

Management host

  • The following processes become unavailable during failure recovery:
    • Load information manager

    • Process execution monitor

    • Session director

    • Session manager

    • Repository service

    • Web service manager

    • Loader controller

    • Data purger

  • In less than three minutes, a new management host takes over and gets configuration information from the shared configuration directory.

Master host

  • The following processes become unavailable during failure recovery:
    • Master load information manager

    • Virtual Execution Machine Kernel Daemon

    • Process execution monitor

    • EGO service controller

    • Session director

    • Repository service

    Note:

    The session director and repository service can run on any management host. They become unavailable during failure recovery only if they are running on the master host.

  • By default, in less than two minutes, a management host from the master candidates list takes over and gets configuration information from the configuration directory on the shared file system.

    When the primary master host recovers, it takes over from the master candidate. The load information manager on the primary master becomes the master load information manager, and the Virtual Execution Machine Kernel Daemon and EGO service controller processes on the master candidate host are terminated and restarted on the primary master host. All other EGO services, including SOAM processes, remain running on their current host.
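
For the compute host case in the table above, the requeue step can be pictured with the following Python sketch. It is a toy model under stated assumptions: the function name, host names, and task IDs are hypothetical, and placing requeued tasks at the front of the queue is a choice made for the sketch, not documented Symphony behavior.

```python
from collections import deque


def requeue_tasks_from_host(running, pending, failed_host):
    """Move every task that was running on failed_host back to the pending queue.

    running: dict mapping host name -> list of task IDs running there.
    pending: deque of task IDs waiting to be dispatched.
    Requeued tasks go to the front of the queue, keeping their order;
    that ordering is an assumption of this sketch.
    """
    lost = running.pop(failed_host, [])
    for task_id in reversed(lost):
        pending.appendleft(task_id)
    return lost


# Hypothetical hosts and task IDs.
running = {"compute1": [1, 2], "compute2": [3]}
pending = deque([4, 5])
requeue_tasks_from_host(running, pending, "compute1")
print(list(pending))  # [1, 2, 4, 5]
print(running)        # {'compute2': [3]}
```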


Configuration to modify automatic failure recovery

You can modify:
  • Automatic failure recovery behavior for an application

  • Service instance error handling—actions for unexpected exits, timeouts, exceptions, or control codes

  • Actions for a timeout between the service instance manager and the session manager

Configuration to modify automatic failure recovery for an application

The following attributes and environment variables can be configured to change the way that automatic failure recovery works once it is enabled for an application.

Configuration source

Setting

Behavior

Application profile:

Consumer

flushDataAsap=true | false

  • Used for recoverable sessions. Specifies whether or not the session manager caches data before writing to disk.

  • When set to true, data is not cached; it is written to disk immediately. When set to false (default), data is cached before it is written to disk.
    Important:

    Setting this parameter to true could substantially degrade performance.

transientDisconnectionTimeout=seconds

  • Specifies the number of seconds the session manager waits for the client to reconnect before it aborts the session when the connection between the client and session manager is broken.

  • Specify an integer equal to or greater than 1. The default value is 30 seconds.

  • Note that if a new connection opens a previously disconnected session within the transientDisconnectionTimeout period after the original client exited abnormally, the session is not aborted, even if abortSessionIfClientDisconnect is set to true.

ioRetryDelay=seconds

  • Specifies the number of seconds to wait before retrying an I/O operation after a previous failure.

  • Specify an integer equal to or greater than 1. The default value is 1.

Application profile:

SOAM > SSM

resReq="select(select_string)" | "select(select_string) order(order_string)"

  • Describes the criteria for defining a set of resources to run session managers. Session managers should run on management hosts. When specifying a resource requirement string, you must indicate the select string "select(mg)" so that only management hosts are selected to run session managers.

  • The default value is "", which specifies any host in the ManagementHosts resource group.

Application profile:

SessionTypes > Type

abortSessionIfClientDisconnect=true | false

  • Specifies whether the session is aborted if the session manager detects that the connection between the client and the session manager is broken. The default value is true.

  • Used with the transientDisconnectionTimeout attribute.
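
The interaction between abortSessionIfClientDisconnect and transientDisconnectionTimeout can be sketched as a small decision function in Python. The function and parameter names are hypothetical, the defaults mirror the values stated above, and treating a reconnect at exactly the timeout as "in time" is an assumption of the sketch.

```python
def session_aborted_after_disconnect(
    seconds_until_reconnect,
    transient_disconnection_timeout=30,
    abort_session_if_client_disconnect=True,
):
    """Decide whether a session is aborted after the client connection breaks.

    seconds_until_reconnect: seconds before a client reopened the session,
    or None if no client ever reconnected.
    """
    if not abort_session_if_client_disconnect:
        return False  # the session is kept regardless of the disconnect
    if seconds_until_reconnect is None:
        return True  # no reconnect: aborted once the timeout expires
    # A reconnect inside the timeout window keeps the session alive.
    return seconds_until_reconnect > transient_disconnection_timeout


print(session_aborted_after_disconnect(10))    # False
print(session_aborted_after_disconnect(45))    # True
print(session_aborted_after_disconnect(None))  # True
```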

Configuration to modify service instance error handling behavior

Section

Method

Attribute name and syntax

Behavior

Service > Control > Method > Timeout

  • Register

  • CreateService

  • SessionEnter

  • SessionUpdate

actionOnSI=restartService | blockHost

  • Specifies whether to restart the service or block the host on timeout.

  • The default for Register, CreateService, SessionEnter, and SessionUpdate is blockHost.

  • Invoke

  • SessionLeave

  • The default for Invoke and SessionLeave is restartService.

  • SessionEnter

  • SessionUpdate

  • Invoke

actionOnWorkload=retry | fail

  • Specifies whether to retry the method (default) up to the number of times configured by the session and task retry limits or abort the session (SessionEnter or SessionUpdate)/fail the task (Invoke).

    Note:

    The retry counts for the SessionEnter and SessionUpdate methods are counted together. For example, if SessionEnter fails once and SessionUpdate fails twice, the session rerun count is 3.

Service > Control > Method > Exception

  • CreateService

actionOnSI=restartService | blockHost

  • Specifies whether to restart the service or block the host (default) when the specified exception (failure or fatal exception) occurs.

  • Invoke

  • SessionEnter

  • SessionUpdate

  • SessionLeave

actionOnSI=keepAlive | restartService | blockHost

  • Specifies whether to continue running the service (default), restart the service, or block the host when the specified exception (failure or fatal exception) occurs.

  • SessionEnter

  • SessionUpdate

  • Invoke

actionOnWorkload=retry | fail

  • Specifies whether to retry the method up to the number of times configured by the session and task retry limits or abort the session (SessionEnter or SessionUpdate)/fail the task (Invoke).

    Note:

    The retry counts for the SessionEnter and SessionUpdate methods are counted together. For example, if SessionEnter fails once and SessionUpdate fails twice, the session rerun count is 3.

Service > Control > Method > Exit

  • Register

  • CreateService

  • SessionEnter

  • SessionUpdate

actionOnSI=restartService | blockHost

  • Specifies whether to restart the service or block the host if the service process exits during the execution of the method.

  • The default for Register, CreateService, SessionEnter, and SessionUpdate is blockHost.

  • Invoke

  • SessionLeave

  • The default for Invoke and SessionLeave is restartService.

  • SessionEnter

  • SessionUpdate

  • Invoke

actionOnWorkload=retry | fail

  • Specifies whether to retry the method (default) up to the number of times configured by the session and task retry limits or abort the session (SessionEnter or SessionUpdate)/fail the task (Invoke).

    Note:

    The retry counts for the SessionEnter and SessionUpdate methods are counted together. For example, if SessionEnter fails once and SessionUpdate fails twice, the session rerun count is 3.

Service > Control > Method > Return

  • CreateService

  • SessionEnter

  • SessionUpdate

  • Invoke

  • SessionLeave

actionOnSI=keepAlive | restartService | blockHost

  • Specifies whether to continue running the service (default), restart the service, or block the host when the method returns normally and the specified code is returned.

  • SessionEnter

  • SessionUpdate

  • Invoke

actionOnWorkload=retry | fail | succeed

  • Specifies whether to consider the method task as having reached completion based on a normal return (default), retry the method up to the number of times configured by the session and task retry limits, or abort the session (SessionEnter or SessionUpdate)/fail the task (Invoke).

    Note:

    The retry counts for the SessionEnter and SessionUpdate methods are counted together. For example, if SessionEnter fails once and SessionUpdate fails twice, the session rerun count is 3.
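
The default actionOnSI values listed in the Timeout, Exception, Exit, and Return rows above can be collected into one lookup table. The Python sketch below is only a transcription aid: the dictionary and function names are hypothetical, and combinations that the table does not list return None.

```python
# Default actionOnSI per (event, method), transcribed from the table above.
_BLOCK_BY_DEFAULT = ("Register", "CreateService", "SessionEnter", "SessionUpdate")
_RESTART_BY_DEFAULT = ("Invoke", "SessionLeave")

DEFAULT_ACTION_ON_SI = {}

# Timeout and Exit share the same defaults.
for event in ("Timeout", "Exit"):
    for method in _BLOCK_BY_DEFAULT:
        DEFAULT_ACTION_ON_SI[(event, method)] = "blockHost"
    for method in _RESTART_BY_DEFAULT:
        DEFAULT_ACTION_ON_SI[(event, method)] = "restartService"

# Exception: CreateService blocks the host; the other methods keep running.
DEFAULT_ACTION_ON_SI[("Exception", "CreateService")] = "blockHost"
for method in ("Invoke", "SessionEnter", "SessionUpdate", "SessionLeave"):
    DEFAULT_ACTION_ON_SI[("Exception", method)] = "keepAlive"

# Return: every listed method keeps the service running by default.
for method in ("CreateService", "SessionEnter", "SessionUpdate", "Invoke", "SessionLeave"):
    DEFAULT_ACTION_ON_SI[("Return", method)] = "keepAlive"


def default_action_on_si(event, method):
    """Return the documented default, or None if the combination is not listed."""
    return DEFAULT_ACTION_ON_SI.get((event, method))


print(default_action_on_si("Timeout", "Invoke"))    # restartService
print(default_action_on_si("Exception", "Invoke"))  # keepAlive
print(default_action_on_si("Return", "Register"))   # None
```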

SessionTypes > Type

  • Invoke

taskRetryLimit=integer

  • Specifies the number of attempts to retry a task before the system fails the task.

  • The value can be 0 or greater. If you specify a value of 3 (default), the system makes 1 attempt to run the task followed by 3 retries before the task fails.

  • SessionEnter

  • SessionUpdate

sessionRetryLimit=integer

  • Specifies the number of times the session can retry binding to the service before the session is aborted.

  • The value can be 0 or greater. If you specify a value of 3 (default), the system makes 1 initial attempt to run the SessionEnter or SessionUpdate methods followed by 3 retries before the system aborts the session.
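
The arithmetic behind taskRetryLimit and sessionRetryLimit is easy to get off by one, so here is a minimal Python sketch of it. The function names are hypothetical; the rules (one initial attempt plus the configured number of retries, and a shared count for SessionEnter and SessionUpdate) are taken from the descriptions above.

```python
def max_attempts(retry_limit):
    """Total number of attempts: one initial attempt plus retry_limit retries."""
    if retry_limit < 0:
        raise ValueError("retry limit must be 0 or greater")
    return 1 + retry_limit


def session_rerun_count(session_enter_failures, session_update_failures):
    """SessionEnter and SessionUpdate failures count toward one shared total."""
    return session_enter_failures + session_update_failures


def retries_remaining(retry_limit, failures_so_far):
    """Retries still available after the given number of failures."""
    return max(0, retry_limit - failures_so_far)


print(max_attempts(3))                                  # 4
print(session_rerun_count(1, 2))                        # 3
print(retries_remaining(3, session_rerun_count(1, 2)))  # 0
```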

Configuration to modify service instance manager-session manager timeout actions

You can change how the system handles a timeout between the service instance manager and the session manager.

Section

Attribute name and syntax

Behavior

SOAM > SIM

blockHostOnTimeout="true" | "false"

  • If "true" (default), blocks the host for the application when the service instance manager times out while trying to communicate with the session manager. This means that the services associated with the application run on a different host than the one on which the timeout occurred. If "false", the service tries to restart on the original host.

  • Used with the startUpTimeout attribute.

startUpTimeout="seconds"

  • Number of seconds to wait for the service instance manager to communicate with the session manager. The default is 60 seconds. This parameter works in conjunction with blockHostOnTimeout.

  • After a service instance manager is started, if the service instance manager cannot contact the session manager within the startUpTimeout and if blockHostOnTimeout="true", the session manager requests a new host from EGO and tries to start a new service instance manager on the new host.
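
How blockHostOnTimeout and startUpTimeout combine can be sketched as a decision function in Python. The function name, parameter names, and returned strings are hypothetical labels for the outcomes described above; the defaults mirror the documented values.

```python
def sim_startup_action(seconds_to_contact, start_up_timeout=60, block_host_on_timeout=True):
    """Return what happens after a service instance manager is started.

    seconds_to_contact: seconds the service instance manager needed to reach
    the session manager, or None if it never made contact.
    """
    contacted = seconds_to_contact is not None and seconds_to_contact <= start_up_timeout
    if contacted:
        return "running"
    if block_host_on_timeout:
        # The host is blocked for the application; a new host is requested from EGO.
        return "block host, start on a new host"
    return "retry on original host"


print(sim_startup_action(5))                                 # running
print(sim_startup_action(None))                              # block host, start on a new host
print(sim_startup_action(90, block_host_on_timeout=False))   # retry on original host
```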


Automatic failure recovery interface

Actions to submit workload

No actions required. For recoverable sessions, session manager failover or recovery is transparent to the application client.

Actions to monitor

You can monitor automatic failure recovery through the Platform Management Console and from the command line. You can also set up SNMP traps to capture system events.


User

Command

Description

  • Cluster administrator

From the Platform Management Console Dashboard

  • Displays the overall health and drill-down details of the cluster, services, and workload. When a process restarts, the process ID changes.

  • Cluster administrator

From the command line:

egosh resource list -m

  • Displays the list of failover candidate hosts in the cluster and identifies which host is currently the master.

  • Cluster administrator

From the SNMP trap notifications:
  • SYS_FAILOVER_RETRIED

  • The system is trying to restart the session manager or service instance manager.

  • SYS_SSM_DOWN

  • The session manager goes down abnormally.

  • SYS_SSM_UP

  • The session manager comes up.

  • Cluster administrator

From the SNMP trap notifications:
  • SYS_VEMKD_UP

  • To receive this notification, you must first configure EGO_EVENT_PLUGIN=plugin_name and EGO_EVENT_MASK=LOG_INFO in ego.conf.

  • Indicates that the master host has failed over to a new master host, or that the cluster has been reconfigured.
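
A quick way to confirm that both settings required for SYS_VEMKD_UP are present is to scan the configuration text. The Python sketch below assumes that ego.conf consists of KEY=value lines with "#" comments; that layout, the function names, and the sample text are assumptions made for illustration.

```python
def parse_conf(text):
    """Parse KEY=value lines into a dict, skipping blanks and '#' comments.

    The KEY=value layout is an assumption about ego.conf made for this sketch.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def vemkd_up_trap_configured(text):
    """True if both settings required for SYS_VEMKD_UP are present."""
    settings = parse_conf(text)
    has_plugin = bool(settings.get("EGO_EVENT_PLUGIN"))
    has_mask = settings.get("EGO_EVENT_MASK") == "LOG_INFO"
    return has_plugin and has_mask


sample = """
# plugin_name is the placeholder used in this document
EGO_EVENT_PLUGIN=plugin_name
EGO_EVENT_MASK=LOG_INFO
"""
print(vemkd_up_trap_configured(sample))  # True
```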


You can also check the progress of failure recovery as follows:

Process

User

Command

Description

  • Service instance manager

  • Service instance

  • Cluster administrator

  • Consumer administrator

  • Consumer user

From the Platform Management Console Dashboard:

Symphony Workload > Monitor Workload > application_name

  • The presence of a running task indicates that the service instance manager and service instance processes are available.

  • If tasks are pending but no tasks are running, the service instance manager and service instance processes might be unavailable.

From the command line:

soamview app app_name -l

  • Displays the number of running and pending tasks for all sessions of an application. The presence of a running task indicates that the service instance manager and service instance processes are available.

  • If tasks are pending but no tasks are running, the service instance manager and service instance processes might be unavailable.

  • Session manager

  • Cluster administrator

  • Consumer administrator

  • Consumer user

From the command line:

soamview app app_name

  • The presence of a session manager process ID indicates that the session manager is available.

  • Session director

  • Repository service

  • Data purger

  • Loader controller

  • Web service manager

  • Cluster administrator

From the command line:

egosh service list

  • If the process appears in the STARTED state, the process is available.

  • Master load information manager

  • Virtual Execution Machine Kernel Daemon

  • EGO service controller

  • Cluster administrator

From the command line:

egosh service list

  • If the command responds, these processes are available.

  • If the command does not respond, one of these processes might be unavailable.

  • Load information manager (non-master)

  • Process execution monitor

  • Cluster administrator

From the command line:

egosh resource list

  • If a host has a status of ok, the load information manager and process execution monitor on that host are available.
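
The rule of thumb given above for the service instance manager and service instance (running tasks mean the processes are available; pending tasks with none running suggest they might not be) can be written as a tiny Python helper. The function name and return strings are hypothetical, the task counts would have to be read from the console or from soamview output by other means, and the "unknown" result for an idle application is a choice made for the sketch.

```python
def sim_availability_hint(running_tasks, pending_tasks):
    """Interpret task counts the way the monitoring table above describes."""
    if running_tasks > 0:
        return "available"
    if pending_tasks > 0:
        return "possibly unavailable"
    # No workload at all: the counts say nothing either way.
    return "unknown"


print(sim_availability_hint(4, 10))  # available
print(sim_availability_hint(0, 10))  # possibly unavailable
print(sim_availability_hint(0, 0))   # unknown
```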


Actions to control

Not applicable. Automatic failure recovery does not require user intervention.

Actions to display configuration


User

Command

Behavior

  • Cluster administrator

  • Consumer administrator

  • Consumer user

From the Platform Management Console Dashboard:

Symphony Workload > Monitor Workload > application_name > Application Profile

  • Displays settings for all of the application-level automatic failure recovery configuration.

  • Cluster administrator

  • Consumer administrator

  • Consumer user

From the command line:

soamview app app_name -p

  • Displays application profile settings for the selected application.

  • Cluster administrator

Cluster > Summary > Master Candidates

  • Displays a list of master candidates and the order in which failover occurs.

  • Cluster administrator

From the command line:

egosh resource list -m

  • Displays the list of failover candidate hosts in the cluster and identifies which host is currently the master.