Power7 System Firmware Fix History - Release level AS730

Firmware Description and History

AS730_182_182
/ FW731.82

05/29/18

Impact: Security         Severity: SPE

Response for Recent Security Vulnerabilities

  • DISRUPTIVE:  In response to recently reported security vulnerabilities, this firmware update is being released to address Common Vulnerabilities and Exposures issue number CVE-2018-3639.  In addition, Operating System updates are required in conjunction with this FW level for CVE-2018-3639.
AS730_181_093
/ FW731.81

02/19/18

Impact: Security         Severity: SPE

Response for Recent Security Vulnerabilities

  • In response to recently reported security vulnerabilities, this firmware update is being released to address Common Vulnerabilities and Exposures issue number CVE-2017-5715.  In addition, Operating System updates are available to mitigate the  CVE-2017-5753 and CVE-2017-5754 security issues.
AS730_180_093
/ FW731.80

08/29/17

Impact: Availability         Severity: ATT

New features and functions

  • DEFERRED:   Support for concurrent replacement of the DCCA on a dual DCCA system.
  • Support was added to increase the power capacity limit of the system by 30%, up to 25,000 watts, to handle workloads for drawers with high processor and memory utilization.  Highly-active workloads were driving the power capacity to the limit, resulting in system throttling that reduced performance.  These heavier workloads can now run at normal performance levels.
  • Support was added to the Advanced System Management Interface (ASMI) to be able to add an IPv4 static route definition for each ethernet interface on the service processor.  Using a static route definition,  a Hardware Management Console (HMC) configured on a private subnet that is different from the service processor subnet is now able to connect to the service processor and manage the CEC.  A static route persists until it is deleted or until the service processor settings are restored to manufacturing defaults.  The static route is managed with the ASMI panel "Network Services/Network Configuration/Static Route Configuration" IPv4 radio button.  The "Add" button is used to add a static route (only one is allowed for each ethernet interface) and the "Delete" button is used to delete the static route.
  • Support was added for a concurrent replacement of a DCCA that restores full redundancy of the service processor for the affected drawer.  The DCCA replacement is done concurrently, with the affected drawer powered up and running.
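The static route support above applies when the HMC sits on a subnet different from the service processor subnet.  The following is a minimal sketch of that check using the Python standard library.  The function name and all addresses are hypothetical and for illustration only; this is not the ASMI implementation.

```python
import ipaddress


def needs_static_route(sp_interface: str, hmc_address: str) -> bool:
    """Return True when the HMC lies outside the service processor subnet.

    sp_interface is the service processor address in CIDR form
    (for example "192.168.10.5/24"); both arguments are hypothetical inputs.
    """
    sp_network = ipaddress.ip_interface(sp_interface).network
    return ipaddress.ip_address(hmc_address) not in sp_network


# Hypothetical addresses: the first HMC shares the subnet, the second does not.
same_subnet = needs_static_route("192.168.10.5/24", "192.168.10.20")
other_subnet = needs_static_route("192.168.10.5/24", "10.0.5.20")
print(same_subnet, other_subnet)
```

When the check returns True, a static route on the corresponding ethernet interface is the case this feature addresses.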
System firmware changes that affect all systems
  • DEFERRED:   A problem was fixed for filtering Local Network Manager Controller (LNMC) errors for a Host Fabric Interface (HFI) that has failed and gone to a "not ready" state.  Without the fix, the failed HFI continues to log errors (such as "Multicast HW Internal error")  and can flood the Central Network Manager (CNM) error log file.  The HFI error conditions that can cause the extra message logging are a rare occurrence.
  • A problem was fixed for PCI adapters locking up when powered on.  The problem is rare but frequency varies with the specific adapter models.  A system power down and power up is required to get the adapter out of the locked state.
  • A problem was fixed for a Network boot/install failure using bootp in a network with switches using the Spanning Tree Protocol (STP).  A Network boot/install using lpar_netboot on the management console was enhanced to allow the number of retries to be increased.  If the user is not using lpar_netboot, the number of bootp retries can be increased using the SMS menus.  If the SMS menus are not an option, the STP in the switch can be set up to allow packets to pass through while the switch is learning the network configuration.
  • A problem was fixed that prevented a second management console from being added to the CEC.  In some cases, network outages caused defunct management console connection entries to remain in the service processor connection table, making connection slots unavailable for new management consoles.  A reset of the service processor could be used to remove the defunct entries.
  • A problem was fixed for NIM installs using the Host Fabric Interface (HFI) that either failed or appeared to hang but could complete after many hours of delay.  When the NIM install operation fails, recover by retrying the operation.  This infrequent problem is triggered by hardware instructions in the HFI FCode not executing in the required order because of missing synchronization instructions.
  • A problem was fixed for a Host Fabric Interface (HFI)  FCode driver error that caused Red Hat Enterprise 7.3 boot failures using the HFI interface.
    The problem has been seen with certain diskless boot images.  The problem is not very frequent, but once encountered, cannot be remedied without a rebuild of the Linux boot image.  The image is gzipped so simply rebuilding the image can cause gzip to compress the image differently due to the new timestamp.  This can be done several times and that may correct the issue.
  • A problem was fixed for the DCCA replacement procedure in the HMC R&V (Repair and Verify) to prevent a firmware synchronization error during the DCCA replacement.  The error was accompanied by a lost connection between the HMC and the service processor because the service processor was reset.  The fix involved a change to the error recovery of the ncfgMultSetup application on the service processor to support the DCCA replacement process.  Without the fix, the connection between the HMC and the service processor can be lost during the R&V DCCA replacement procedure, resulting in a failure of the firmware synchronization step.  With the fix, the recovery policy of the ncfgMultSetup daemon was changed so that it restarts itself to handle the setup timing windows for the new DCCA configuration instead of forcing a reset of the service processor, allowing the DCCA replacement process to complete successfully.  The error occurred only infrequently during DCCA replacements on some systems.
  • A problem was fixed for incorrect error messages from the Advanced System Management Interface (ASMI) functions when the system is powered on but in the  "Incomplete State".  For this condition, ASMI was assuming the system was powered off because it could not communicate to the PowerVM hypervisor.  With the fix, the ASMI error messages will indicate that ASMI functions have failed because of the bad hypervisor connection instead of falsely stating that the system is powered off.
System firmware changes that affect certain systems
  • On systems in IPv6 networks, a  problem was fixed for a network boot/install failing with SRC B2004158 and IP address resolution failing using neighbor solicitation to the partition firmware client.
  • For systems with an invalid P-side or T-side in the firmware, a problem was fixed in the partition firmware Run-Time Abstraction Services (RTAS) so that system Vital Product Data (VPD) is returned at least from the valid side instead of returning no VPD data.   This allows AIX host commands such as lsmcode, lsvpd, and lsattr that rely on the VPD data to work to some extent even if there is one bad code side.  Without the fix,  all the VPD data is blocked from the OS until the invalid code side is recovered by either rejecting the firmware update or attempting to update the system firmware again.
  • For systems with an IBM i load source disk attached to an Emulex-based fibre channel adapter such as F/C #5735, a problem was fixed that caused an IBM i load source boot to fail with SRC B2006110 logged and a message to the boot console of  "SPLIT-MEM Out of Room".  This problem occurred for load source disks that needed extra disk scans to be found, such as those attached to a port other than the first port of a fibre channel adapter (first port requires fewest disk scans).
  • A problem was fixed for systems in networks using the Juniper 1GbE and 10GbE switches (F/Cs #1108, #1145, and #1151) to prevent network ping errors and boot from network (bootp) failures.  The Address Resolution Protocol (ARP) table information on the Juniper aggregated switches is not being shared between the switches and that causes problems for address resolution in certain network configurations.  Therefore, the CEC network stack code has been enhanced to add three gratuitous ARPs (ARP replies sent without a request received) before each ping and bootp request to ensure that all the network switches have the latest network information for the system.
  • On systems with a PowerVM Active Memory Sharing (AMS) partition with AIX  Level 7.2.0.0 or later with Firmware Assisted Dump enabled, a problem was fixed for a Restart Dump operation failing into KDB mode.  If "q" is entered to exit from KDB mode, the partition fails to start.  The AIX partition must be powered off and back on to recover.  The problem can be circumvented by disabling Firmware Assisted Dump (default is enabled in AIX 7.2).
  • On systems with dedicated processor partitions,  a problem was fixed for the dedicated processor partition becoming intermittently unresponsive. The problem can be circumvented by changing the partition to use shared processors.
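The gratuitous ARP enhancement described above for the Juniper switch fix can be illustrated by building such a frame by hand.  The sketch below is not the CEC network stack code.  It assumes a standard Ethernet/IPv4 ARP reply (opcode 2) in which the sender and target IP addresses are both the system's own address, sent to the broadcast MAC address.  Setting the target hardware address to broadcast is one common convention and is an assumption here, as are the function name, MAC address, and IP address.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806
ARP_REPLY = 2  # the source describes gratuitous ARPs as unsolicited replies


def build_gratuitous_arp(mac: bytes, ip: str) -> bytes:
    """Build one gratuitous ARP reply frame announcing ip at mac."""
    if len(mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    ip_bytes = socket.inet_aton(ip)
    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType.
    eth_header = broadcast + mac + struct.pack("!H", ETHERTYPE_ARP)
    # ARP header: Ethernet (1), IPv4 (0x0800), lengths 6 and 4, opcode.
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, ARP_REPLY)
    # Sender and target protocol addresses are identical in a gratuitous ARP.
    arp_body = mac + ip_bytes + broadcast + ip_bytes
    return eth_header + arp_header + arp_body


# Hypothetical MAC and IP; three copies mirror the "three gratuitous ARPs".
frame = build_gratuitous_arp(bytes.fromhex("02005e100001"), "192.168.10.5")
frames_before_request = [frame] * 3
```

Actually transmitting the frames requires a raw socket and elevated privileges, which is left out of this sketch.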
AS730_165_093
/ FW731.78

07/27/17

Impact: Availability    Severity: ATT

Changes:

  • No system firmware changes. Refreshing code only to coincide with the BPC update.
AS730_163_093
/ FW731.77

04/01/16

Impact: Security         Severity: ATT

System firmware changes that affect all systems

  • A problem was fixed for logical partitions not booting after replacement of both DCCAs and service processors in the service drawer.  If the service processors contained incorrect topology data, it is not recalculated, causing bad route information and a hang when booting the partitions.  With the fix, the Local Network Management Controller (LNMC) does a recalculation for the topology when both service processors are replaced, allowing the partitions to boot successfully.
  • A problem was fixed for the Integrated Switch Network Manager (ISNM) performance counter output having an incorrect Global Counter timestamp value.  Without the fix, the global counter value is filled with the local GC ID.
  • A security problem was fixed in the lighttpd server on the service processor, where a remote attacker, while attempting authentication, could insert strings into the lighttpd server log file.  Under normal operations on the service processor, this does not impact anything because the log is disabled by default.  The Common Vulnerabilities and Exposures issue number is CVE-2015-3200.
  • A problem was fixed for reporting all optical link UE errors through the Local Network Management Controller (LNMC).  Without the fix, some of the errors are hidden from the LNMC reports because a threshold count for the error must be exceeded before it is reported to the LNMC.  Even though some of the errors are hidden from the LNMC, they are all visible in the service processor error log.
  • On the BPC, a problem was fixed for the remote hardware vitals (rvitals) command returning an incorrect input voltage when there are failed Bulk Power Regulators (BPRs) on the line cord.  With the fix, the BPC reports the highest valid value from BPRs, instead of averaging the voltages.
  • On the BPC, a problem was fixed for the remote hardware vitals (rvitals) command returning old (stale) power usage numbers for CECs that are deactivated when the power usage should be zero.  With the fix, the deactivated CECs show zero power usage.
  • A security problem was fixed in OpenSSL for a possible service processor reset on a null pointer de-reference during RSA PSS signature verification. The Common Vulnerabilities and Exposures issue number is CVE-2015-3194.
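The rvitals input voltage fix above replaces averaging with the highest valid BPR value.  The sketch below shows why that matters.  It assumes, for illustration only, that a failed BPR reports 0 volts and that a reading at or above a hypothetical threshold counts as valid; the readings, the threshold, and the function names are not taken from the BPC code.

```python
from statistics import mean


def averaged_voltage(readings):
    """Old behavior as described: average over all BPR readings."""
    return mean(readings)


def highest_valid_voltage(readings, min_valid=1.0):
    """New behavior as described: highest reading among the valid BPRs.

    min_valid is a hypothetical validity threshold.
    """
    valid = [v for v in readings if v >= min_valid]
    return max(valid) if valid else 0.0


# Hypothetical line cord with two healthy and two failed BPRs.
readings = [480.0, 478.0, 0.0, 0.0]
print(averaged_voltage(readings))       # 239.5, skewed by the failed BPRs
print(highest_valid_voltage(readings))  # 480.0
```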
AS730_158_093
/ FW731.76

10/25/15

Impact: Security         Severity:  SPE

System firmware changes that affect all systems

  • A security problem was fixed in OpenSSL where a remote attacker could crash the service processor with malformed Elliptic Curve private keys.  The Common Vulnerabilities and Exposures issue number is CVE-2015-0209.
  • A security problem was fixed in OpenSSL where a remote attacker could crash the service processor with a specially crafted X.509 certificate that causes an invalid pointer, out-of-bounds write, or a null pointer de-reference.  The Common Vulnerabilities and Exposures issue numbers are CVE-2015-0286,  CVE-2015-0287, and CVE-2015-0288.
  • A security problem was fixed in OpenSSL where a specially crafted X.509 certificate could cause the service processor to reset in a denial-of-service (DoS) attack.  The Common Vulnerabilities and Exposures issue number is CVE-2015-1789.
  • A problem was fixed for a stop condition in the processing of the Host Fabric Interface (HFI) broadcast traffic that resulted in network boots failing for cluster nodes.  This problem is intermittent and requires heavy HFI traffic to cause the error.  To help reduce this problem,  a staggered IPL of the nodes can be used in a large cluster instead of a simultaneous IPL.
  • A problem was fixed for the bulk power controller (BPC) not being able to connect to a service processor with Security Mode set to "SSLv3 Disabled".  The Advanced System Management Interface (ASMI) is used to change the Security Mode to "SSLv3 Disabled".  This highest level of security protection does not allow service processor clients to connect using the SSLv3 protocol.
AS730_155_093
/ FW731.75

09/15/15

Impact: Availability    Severity: SPE

New Features and Functions

  • For water cooled systems, the water flushing was enhanced to ensure that the water in the primary side pipes is fresh and accurately cold if the current water temperature is reading high compared to the lowest facility water temperature.
  • A security enhancement was made to prevent unsecured connections to the PTLIC Monitor. The user must now log in to the BPC service processor before accessing the PTLIC Monitor.

System firmware changes that affect all systems

  • A problem was fixed for a SRC 14020059 reported against the Motor Drive Assembly (MDA) card in the BPC.

System firmware changes that affect certain systems

  • For systems with large clusters, a problem was fixed for Local Network Management Controller (LNMC) network time-outs during a simultaneous IPL of the entire cluster.  The LNMC network response was improved by optimizing its internal tracing to make it more efficient.
AS730_153_093
/ FW731.74

06/26/15

Impact: Security         Severity:  SPE

System firmware changes that affect all systems

  • A security problem was fixed in OpenSSL for padding-oracle attacks known as Padding Oracle On Downgraded Legacy Encryption (POODLE).  This attack allows a man-in-the-middle attacker to obtain a plain text version of the encrypted session data. The Common Vulnerabilities and Exposures issue number is CVE-2014-3566.  The service processor POODLE fix is based on a selective disablement of SSLv3 using the Advanced System Management Interface (ASMI) "System Configuration/Security Configuration" menu options.  The Security Configuration options of "Disabled", "Default", and "Enabled" for SSLv3 determine the level of protection from POODLE.  The management console also requires a POODLE fix for APAR MB03838 (FIX FOR CVE-2014-3566 FOR HMC V7 R7.3.0 SP7 (PTF MH01456) ) to eliminate all vulnerability to POODLE and allow use of option 1 "Disabled" as shown below.  This HMC minimum requirement is enforced by the firmware update process for this defect.
    The POODLE fix also addresses a vulnerability commonly referred to as "Bar Mitzvah Attack" with CVE-2015-2808. The RC4 cipher algorithm, as used in the TLS protocol and SSL protocol, could allow a remote attacker to obtain sensitive information.  The use of the RC4 cipher has been discontinued.
    -1) Disabled:  This highest level of security protection does not allow service processor clients to connect using SSLv3, thereby eliminating any possibility of a POODLE attack.  All clients must be capable of using TLS to make the secured connections to the service processor to use this option.
    -2) Default:  This medium level of security protection disables SSLv3 for the web browser sessions to ASMI and for the CIM clients and assures them of POODLE-free connections.  But the legacy management consoles are allowed to use SSLv3 to connect to the service processor.  This is intended to allow non-POODLE compliant HMC levels to be able to connect to the CEC servers until they can be planned and upgraded to the POODLE compliant HMC levels.  Running a non-POODLE compliant HMC to a service processor in  "Default" mode will prevent the ASMI-proxy sessions from the HMC from connecting as these proxy sessions require SSLv3 support in ASMI.
    -3) Enabled:  This basic level of security protection enables SSLv3 for all service processor client connections.  It relies on all clients being at POODLE fix compliant levels to provide full POODLE protection using the TLS Fallback Signaling Cipher Suite Value (TLS_FALLBACK_SCSV) to prevent fallback to vulnerable SSLv3 connections.  This option is intended for customer sites on protected internal networks that have a large investment in legacy hardware that need SSLv3 to make browser and HMC connections to the service processor.  The level of POODLE protection actually achieved in "Enabled" mode is determined by the percentage of clients that are at the POODLE fix compliant levels.
  • A security problem was fixed in the OpenSSL (Secure Socket Layer) protocol that allowed a man-in-the-middle attacker, via a specially crafted fragmented handshake packet, to force a TLS/SSL server to use TLS 1.0, even if both the client and server supported newer protocol versions. The Common Vulnerabilities and Exposures issue number for this problem is CVE-2014-3511.
  • A security problem was fixed in OpenSSL for formatting fields of security certificates without null-terminating the output strings.  This could be used to disclose portions of the program memory on the service processor.  The Common Vulnerabilities and Exposures issue number for this problem is CVE-2014-3508.
  • Multiple security problems were fixed in the way that OpenSSL handled Datagram Transport Layer Security (DTLS) packets.  A specially crafted DTLS handshake packet could cause the service processor to reset.  The Common Vulnerabilities and Exposures issue numbers for these problems are CVE-2014-3505, CVE-2014-3506 and CVE-2014-3507.
  • A security problem was fixed in OpenSSL to prevent a denial of service when handling certain Datagram Transport Layer Security (DTLS) ServerHello requests.  A specially crafted DTLS handshake packet with an included Supported EC Point Format extension could cause the service processor to reset.  The Common Vulnerabilities and Exposures issue number for this problem is CVE-2014-3509.
  • A security problem was fixed in OpenSSL to prevent a denial of service by using an exploit of a null pointer de-reference during anonymous Diffie Hellman (DH) key exchange.  A specially crafted handshake packet could cause the service processor to reset.  The Common Vulnerabilities and Exposures issue number for this problem is CVE-2014-3510.
  • A security problem was fixed in OpenSSL for memory leaks that allowed remote attackers to cause a denial of service (out of memory on the service processor). The Common Vulnerabilities and Exposures issue numbers are CVE-2014-3513 and CVE-2014-3567.
  • A problem was fixed for intermittent B181EF88 SRCs and netsSlp core dumps during network configurations on the service processor.  This error caused call home activity for the SRC and dumps but otherwise had no impact to the CEC functionality.
  • A problem was fixed for the Integrated Switch Network Manager (ISNM) that caused it to put many Integrated Switch Routers (ISRs) in the cluster into a non-functional state if all the drawers of the HPC CEC were rebooted simultaneously.
  • A security problem was fixed in OpenSSL where the service processor would, under certain conditions, accept Diffie-Hellman client certificates without the use of a private key, allowing a user to falsely authenticate.  The Common Vulnerabilities and Exposures issue number is CVE-2015-0205.
  • A security problem was fixed in OpenSSL to prevent a denial of service when handling certain Datagram Transport Layer Security (DTLS) messages.  A specially crafted DTLS message could exhaust all available memory and cause the service processor to reset.  The Common Vulnerabilities and Exposures issue number is CVE-2015-0206.
  • A security problem was fixed in OpenSSL to prevent a denial of service when handling certain Datagram Transport Layer Security (DTLS) messages.  A specially crafted DTLS message could cause a null pointer de-reference and cause the service processor to reset.  The Common Vulnerabilities and Exposures issue number is CVE-2014-3571.
  • A security problem was fixed in OpenSSL to fix multiple flaws in the parsing of X.509 certificates.  These flaws could be used to modify an X.509 certificate to produce a certificate with a different fingerprint without invalidating its signature, and possibly bypass fingerprint-based blacklisting.  The Common Vulnerabilities and Exposures issue number is CVE-2014-8275.
  • A security vulnerability, commonly referred to as GHOST, was fixed in the service processor glibc functions gethostbyname() and gethostbyname2() that allowed remote users of the functions to cause a buffer overflow and execute arbitrary code with the permissions of the server application.  There is no way to exploit this vulnerability on the service processor but it has been fixed to remove the vulnerability from the firmware.  The Common Vulnerabilities and Exposures issue number is CVE-2015-0235.
  • On systems with redundant service processors,  a problem was fixed so that a backup memory clock failure with SRC B120CC62 is handled without terminating the system running on the primary memory clock.
  • A problem was fixed in the Advanced System Management Interface (ASMI) to reword a confusing message for systems with no deconfigured resources.  The "System Service Aids/Deconfiguration Records" message text for this situation was changed from "Deconfiguration data is currently not available." to "No deconfigured resources found in the system."
  • On a system with redundant service processors, a problem was fixed for bad pointer reference in the mailbox function during data synchronization between the two service processors.  The de-reference of the bad pointer caused a core dump, reset/reload, and fail-over to the backup service processor.
  • A problem was fixed with the fspremote service tool to make it support TLSv1.2 connections to the service processor to be compatible with systems that had been fixed for the OpenSSL Padding Oracle On Downgraded Legacy Encryption (POODLE) vulnerabilities.  After the POODLE fix is installed, by default the system only allows secured connections from clients using the TLSv1.2 protocol.
  • The Avago firmware for the optical transmitters was updated to the 0B.41 level that fixed a problem in the 0B.31 level, where certain lasers that were partially degraded were completely turned off by the 0B.31 firmware before their effective usability lifetime was completely finished. The 0B.41 firmware will keep the lasers operating as long as they are able to transmit data in an error-free manner.
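The three SSLv3 Security Configuration options described above for the POODLE fix can be summarized as a lookup of which client types may still connect using SSLv3.  The sketch below only restates the text of the options; the client type labels and the function name are hypothetical and do not correspond to any ASMI interface.

```python
# Whether SSLv3 connections are accepted, per Security Configuration option.
# Client type labels are illustrative names for the client categories
# mentioned in the option descriptions.
SSLV3_ALLOWED = {
    "Disabled": {"browser": False, "cim": False, "management_console": False},
    "Default": {"browser": False, "cim": False, "management_console": True},
    "Enabled": {"browser": True, "cim": True, "management_console": True},
}


def sslv3_allowed(mode: str, client_type: str) -> bool:
    """Return True if the given client type may connect using SSLv3."""
    return SSLV3_ALLOWED[mode][client_type]


# In "Default" mode only legacy management consoles may still use SSLv3.
print(sslv3_allowed("Default", "management_console"))  # True
print(sslv3_allowed("Default", "browser"))             # False
```

In every mode, clients that are capable of TLS can still connect; the table only covers the SSLv3 fallback path.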

System firmware changes that affect certain systems

  • On systems with large clusters,  a problem was fixed for optical link failures when simultaneously booting all CECs of the cluster.  Links may be left in the state of "DOWN_RECV_GOOD", which means a port on one side of an optical link did not report a state of link "up".
AS730_142_093
/ FW731.73

10/17/14

Impact: Availability    Severity: ATT

System firmware changes that affect all systems

  • A problem was fixed for a net session entry file lock error that prevented the management console from connecting to the service processor.
AS730_141_093
/ FW731.72

09/08/14

Impact: Security         Severity:  SPE

System firmware changes that affect all systems

  • A problem was fixed for I/O adapters so that BA400002 errors were changed to informational for memory boundary adjustments made to the size of DMA map-in requests.  These size adjustments were marked as UE previously for a condition that is normal.
  • A  security problem was fixed for the Lighttpd web server that allowed arbitrary SQL commands to be run on the service processor.  The Common Vulnerabilities and Exposures issue number is CVE-2014-2323.
  • A security problem was fixed for the Lighttpd web server where improperly-structured URLs could be used to view arbitrary files on the service processor.  The Common Vulnerabilities and Exposures issue number is CVE-2014-2324.
  • A  security problem was fixed in the service processor TCP/IP stack to discard illegal TCP/IP packets that have the SYN and FIN flags set at the same time.  An explicit packet discard was needed to prevent further processing of the packet that could result in a bypass of the iptables firewall rules.
  • A security problem was fixed for the Network Time Protocol (NTP) client that allowed remote attackers to execute arbitrary code via a crafted packet containing an extension field.  The Common Vulnerabilities and Exposures issue number is CVE-2009-1252.
  • A security problem was fixed for the Network Time Protocol (NTP) client for a buffer overflow that allowed remote NTP servers to execute arbitrary code via a crafted response.  The Common Vulnerabilities and Exposures issue number is CVE-2009-0159.
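The SYN and FIN discard rule described above for the TCP/IP stack fix can be expressed as a simple flag test.  This is a minimal sketch, not the service processor code.  It assumes the standard TCP header layout, in which the flags occupy byte 13 with FIN as bit 0x01 and SYN as bit 0x02; the function name and the sample header values are hypothetical.

```python
import struct

TCP_FIN = 0x01
TCP_SYN = 0x02


def has_illegal_syn_fin(tcp_header: bytes) -> bool:
    """Return True if both SYN and FIN are set in a TCP header."""
    if len(tcp_header) < 20:
        raise ValueError("TCP header must be at least 20 bytes")
    flags = tcp_header[13]
    return bool(flags & TCP_SYN) and bool(flags & TCP_FIN)


def make_header(flags: int) -> bytes:
    """Build a 20-byte TCP header with hypothetical ports and the given flags."""
    # Fields: source port, destination port, sequence, acknowledgment,
    # data offset (5 words), flags, window, checksum, urgent pointer.
    return struct.pack("!HHIIBBHHH", 40000, 443, 0, 0, 5 << 4, flags, 1024, 0, 0)


print(has_illegal_syn_fin(make_header(TCP_SYN)))            # False
print(has_illegal_syn_fin(make_header(TCP_SYN | TCP_FIN)))  # True
```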

System firmware changes that affect certain systems

  • On a system with a disk device with multiple boot partitions, a problem was fixed that caused System Management Services (SMS) to list only one boot partition.  Even though only one boot partition was listed in SMS, the AIX bootlist command could still be used to boot from any boot partition.
AS730_140_093
/ FW731.71

08/21/14

Impact: Security         Severity:  HIPER

System firmware changes that affect all systems

  • HIPER/Pervasive:  A security problem was fixed in the OpenSSL (Secure Socket Layer) protocol that allowed clients and servers, via a specially crafted handshake packet, to use weak keying material for communication.  A man-in-the-middle attacker could use this flaw to decrypt and modify traffic between the management console and the service processor.  The Common Vulnerabilities and Exposures issue number for this problem is CVE-2014-0224.
  • HIPER/Pervasive:  A security problem was fixed in OpenSSL for a buffer overflow in the Datagram Transport Layer Security (DTLS) when handling invalid DTLS packet fragments.  This could be used to execute arbitrary code on the service processor.  The Common Vulnerabilities and Exposures issue number for this problem is CVE-2014-0195.
  • HIPER/Pervasive:  Multiple security problems were fixed in the way that OpenSSL handled read and write buffers when the SSL_MODE_RELEASE_BUFFERS mode was enabled to prevent denial of service.  These could cause the service processor to reset or unexpectedly drop connections to the management console when processing certain SSL commands.  The Common Vulnerabilities and Exposures issue numbers for these problems are CVE-2010-5298 and CVE-2014-0198.
  • HIPER/Pervasive:  A security problem was fixed in OpenSSL to prevent a denial of service when handling certain Datagram Transport Layer Security (DTLS) ServerHello requests. A specially crafted DTLS handshake packet could cause the service processor to reset.  The Common Vulnerabilities and Exposures issue number for this problem is CVE-2014-0221.
  • HIPER/Pervasive:  A security problem was fixed in OpenSSL to prevent a denial of service by using an exploit of a null pointer de-reference during anonymous Elliptic Curve Diffie Hellman (ECDH) key exchange.  A specially crafted handshake packet could cause the service processor to reset.  The Common Vulnerabilities and Exposures issue number for this problem is CVE-2014-3470.
  • Help text for the Advanced System Management Interface (ASMI) "System Configuration/Hardware Deconfiguration/Clear All Deconfiguration Errors" menu option was enhanced to clarify that when selecting "Hardware Resources" value of "All hardware resources", the service processor deconfiguration data is not cleared.
    The "Service processor" must be explicitly selected for that to be cleared.
  • A problem was fixed that prevented guard error logs from being reported for FRUs that were guarded during the system power on.  This could happen if the same FRU had been previously reported as guarded on a different power on of the system.  The requirement is now met that guarded FRUs are logged on every power on of the system.
AS730_138_093
/ FW731.70

05/09/14

Impact: Availability    Severity: SPE

New Features and Functions

  • Support was dropped for Secure Sockets Layer (SSL) Version 2 and SSL weak and medium cipher suites in the service processor web server (Lighttpd).  Web browser connections to the Advanced System Management Interface (ASMI) secured port 443 (using https://) will now be rejected if those browsers do not support SSL version 3.  Supported web browsers for Power7 ASMI are Netscape (version 9.0.0.4), Microsoft Internet Explorer (version 7.0), Mozilla Firefox (version 2.0.0.11), and Opera (version 9.24).

System firmware changes that affect all systems

  • A problem was fixed that prevented the service processor from recognizing the I/O hub Host Fabric Interface (HFI) and Collective Acceleration Unit (CAU) components as valid functional units (FUs).   This caused guard reports to show "Invalid FU" as the hardware type of the components along with an incorrect "DECONFIGURED" call out hardware state.
  • A problem was fixed that caused system memory to be guarded when service processor errors on the FRU Support Interface (FSI) occurred.
  • A problem was fixed that caused a flood of predictive error (PE) logs with SRC B181E550 for Integrated Switch Router (ISR) chip recoverable errors.  The errors are logged by the service processor PRD component with signature description "io(n0p0) Undefined error code" but there is no hardware guarded.
  • A problem was fixed that caused a service processor dump to be generated with SRC B18187DA "NETC_RECV_ER" logged.
  • A problem was fixed that caused a SRC B1754201 predictive error to be logged without call out actions.  Missing call outs were added for bus errors accessing the Torrent chip.
  • A problem was fixed that could block Host Fabric Interface (HFI) array error recovery and eventually lead to a double bit error, which would cause the HFI to become unusable until the next system reboot.
  • A problem was fixed that caused an error log generated by the partition firmware to show conflicting firmware levels.  This problem occurs after a firmware update or a logical partition migration (LPM) operation on the system.
  • A problem was fixed in the isolation of PCI faults for stopped clocks so that the error would not cause a system-wide failure.  The error is now limited to the affected logical partition (LPAR).
  • A problem was fixed that caused a L2 cache error to not guard out the faulty processor, allowing the system to checkstop again on an error to the same faulty processor.
  • A problem was fixed that caused a HMC code update failure for the FSP on the accept operation with SRC B1811402, or caused the FSP to be unable to boot on the updated side.
  • DEFERRED: A problem was fixed that caused a system checkstop during hypervisor time keeping services. This deferred fix addresses a problem that has a very low probability of occurrence.  As such, customers may wait for the next planned service window to activate the deferred fix via a system reboot.
  • A problem was fixed that caused a loss of Time of Day (TOD) clock redundancy after a power repair of a Distributed Conversion and Control Assembly (DCCA).  After the DCCA repair, the primary and secondary TOD were assigned to the same oscillator in the DCCA that never lost power, even though both system oscillators were functional.
  • A problem was fixed that caused the system attention LED to be lit without a corresponding SRC and error log for the event.  This problem typically occurs when an operating system on a partition terminates abnormally.
  • DEFERRED: A problem was fixed that caused a system checkstop with SRC B113E504 for a recoverable hardware fault.  This deferred fix addresses a problem that has a very low probability of occurrence.  As such, customers may wait for the next planned service window to activate the deferred fix via a system reboot.

System firmware changes that affect certain systems

  • On systems running AIX or Linux, a problem was fixed that caused a partition to fail to boot with SRC CA260203.  This problem also can cause concurrent firmware updates to fail.
  • On systems using IPv6 addresses, the firmware was enhanced to reduce the time it takes to install an operating system using the Network Installation Manager (NIM).
  • On a partition with a large number of potentially bootable devices, a problem was fixed that caused the partition to fail to boot with a default catch, and SRC BA210000 may also be logged.
  • On systems in a high-performance computing (HPC) B-side cluster with an 8D_2S cross-coupled topology, a problem in the Local Network Management Controller (LNMC) was fixed that caused distance link (D-link) virtual channel (VC) deadlocks when using indirect routes.  Secondary routes had been erroneously included in the indirect route chain.  For this problem, the Executive Manager Server (EMS) will repeatedly log "VC Deadlock Error" messages into the /var/opt/isnm/cnm/logs/EVT_SUM.log file.
  • A problem was fixed in the run-time abstraction services (RTAS) extended error handling (EEH) for fundamental reset that caused partitions to crash during adapter updates.  The fundamental reset of adapters now returns a valid return code.  The adapter drivers using fundamental reset affected by this fix are the following:
    o QLogic PCIe Fibre Channel adapters (combo card)
    o IBM PCIe Obsidian
    o Emulex BE3-based ethernet adapters
    o Broadcom-based PCIe2 4-port 1Gb ethernet
    o Broadcom-based FlexSystem EN2024 4-port 1Gb ethernet for compute nodes
  • On systems with a DIMM error,  a problem was fixed in the service processor memory diagnostic that caused the de-configuration of all memory.  The memory diagnostic had failed all the memory due to special attention flooding caused by the bad hardware that did not allow the memory diagnostic to complete.   With the special attention flooding prevented, the memory diagnostic is now able to isolate the DIMM error to a FRU location and guard it so the system is able to IPL.
AS730_130_093
/ FW731.61

10/25/13

Impact: Availability    Severity: SPE

System firmware changes that affect certain systems

  • On systems in a high-performance computing (HPC) B-side cluster with an 8D_2S cross-coupled topology, a problem in the Local Network Management Controller (LNMC) was fixed that caused distance link (D-link) virtual channel (VC) deadlocks when using indirect routes.  Secondary routes had been erroneously included in the indirect route chain.  For this problem, the Executive Manager Server (EMS) will repeatedly log "VC Deadlock Error" messages into the /var/opt/isnm/cnm/logs/EVT_SUM.log file.
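
To check whether an EMS is showing the symptom described above, the log can be scanned for the message text.  The following is a minimal Python sketch, not an IBM-supplied tool.  The log path and the "VC Deadlock Error" text come from the fix description; the function name is hypothetical, and because the layout of each log line is not documented here, the sketch only counts lines that contain the message.

```python
from pathlib import Path
from typing import Iterable

# Path and message text are taken from the fix description above.
EVT_SUM_LOG = Path("/var/opt/isnm/cnm/logs/EVT_SUM.log")
MARKER = "VC Deadlock Error"


def count_vc_deadlock_errors(lines: Iterable[str]) -> int:
    """Count lines containing the deadlock message (hypothetical helper).

    The rest of the line format is not assumed; only the marker is matched.
    """
    return sum(1 for line in lines if MARKER in line)


if __name__ == "__main__":
    if EVT_SUM_LOG.exists():
        # errors="replace" keeps the scan going if the log has stray bytes.
        with EVT_SUM_LOG.open(errors="replace") as fh:
            total = count_vc_deadlock_errors(fh)
        print(f"{total} '{MARKER}' entries in {EVT_SUM_LOG}")
    else:
        print(f"{EVT_SUM_LOG} not found; run this on the EMS")
```

A count that keeps growing between runs is consistent with the repeated logging described in this fix.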
AS730_125_093

03/11/13

Impact: Availability    Severity: SPE

System firmware changes that affect all systems

  • A problem was fixed that caused SRC B1813221, which indicates a failure of the battery on the service processor, to be erroneously logged after a service processor reset or power cycle.
  • A problem was fixed that caused various SRCs to be erroneously logged at boot time including B181E6C7 and B1818A14.
  • A problem was fixed that caused a system to abnormally terminate due to a null pointer reference. 
  • The firmware was enhanced to reduce "sender hang" errors and failures to boot nodes via the cluster fabric.

System firmware changes that affect certain systems

  • On large clusters, a problem was fixed that caused some links in the system to remain permanently in the DOWN_RECV_GOOD state.  The links in question will not be fully utilized for data transmission.  The problem occurs with regular frequency on large clusters when re-IPLing all CECs in the system.
AS730_118_093

11/02/12

Impact: Function    Severity: SPE

System firmware changes that affect all systems

  • DEFERRED:  A problem was fixed that could cause a live lock on the power bus resulting in a system crash.
  • The firmware was enhanced to increase the performance of certain applications by updating the routing tables.
  • A problem was fixed that caused a segmentation fault in the service processor firmware.  When this occurred, a PERC error with SRC B181C350 was logged.
  • On systems on which Internet Explorer (IE) is used to access the Advanced System Management Interface (ASMI) on the Hardware Management Console (HMC), a problem was fixed that caused IE to hang for about 10 minutes after saving changes to network parameters on the ASMI.
  • A problem was fixed that caused the gateway network address  to be shown incorrectly on the System Management Services (SMS) menus when booting a partition on an iSCSI network.
  • A problem was fixed that caused a "code accept" during a concurrent firmware installation from the HMC to fail with SRC E302F85C.
  • On storage drawers in a cross-coupled topology, an attempt to place an indirect (failover) route at an SNID location in the SRT1 route table may result in a failover route that uses the opposite compute sub-cluster as a bounce point.  The firmware was enhanced to prevent this, since there are no physical links between the two compute sub-clusters in a cross-coupled topology.  Having a failover route through the opposite compute sub-cluster will lead to packet loss and application failure.
  • A problem was fixed that prevented predictive guard errors from being deleted on the secondary service processor.  This caused hardware to be erroneously guarded out if a service processor failover occurred.
  • A problem was fixed that caused the service processor to be reset during a CEC power off or reboot.  This causes the system to terminate, followed by a platform reboot.  SRC B181E6C7 is typically logged when this problem occurs.
  • A problem was fixed that caused a system crash with unrecoverable SRC B7000103 and "ErFlightRecorder" in the failing stack.
  • A problem was fixed that caused the following symptoms on user-level jobs:

      1. During job initialization when starting communication over the cluster fabric, an error message similar to the following:
              4:ERROR 629 fD4fs: Message type 21 from source 4
              4:MPI-PAMI ERROR: pami_init() failed with rc(1)
              4:ERROR: 0031-309 Connect failed during message passing initialization, task 4, reason:
      2. The initialization may succeed, but an HFI translation failure may occur, causing a time out on the cluster network and other side effects.

System firmware changes that affect certain systems

  • A problem was fixed that caused the dual-port Ethernet adapters, F/C 5270 and F/C 5708, to fail to power on with SRC B7006970.
  • On systems in a high-performance computing (HPC) cluster in 8D topology, a problem was fixed that caused a secondary route to be linked to an indirect route chain.  Jobs that are run in indirect route mode may experience hangs and performance problems.
  • The firmware was enhanced to improve the performance when indirect routing is used in large cluster systems.
AS730_103_093

06/27/12

Impact:  Availability      Severity:  SPE

System firmware changes that affect all systems

  • A problem was fixed that caused a segmentation fault in the service processor firmware.  When this occurred, a perc error with SRC B181C350 was logged.

System firmware changes that affect certain systems

  • On nodes with a single DCCA running AS730_093, a problem was fixed that prevented the node from booting, with SRC 10008732 erroneously logged.
AS730_093_093

06/13/12

Impact:  Serviceability      Severity:  SPE

System firmware changes that affect all systems

  • DEFERRED:  The firmware was enhanced to fix a potential performance degradation on systems utilizing the stride-N stream prefetch instructions dcbt (with TH=1011) or dcbtst (with TH=1011).  Typical applications executing these algorithms include High Performance Computing, data intensive applications exploiting streaming instruction prefetches, and applications utilizing the Engineering and Scientific Subroutine Library (ESSL) 5.1.
  • The firmware was enhanced to correctly handle bus errors between the P7 processor chip and the I/O hub chip.
  • The firmware was enhanced to correctly diagnose the failing FRU when SRC B1xxE504 with error signature "MCFIR[14] - Hang timer detector" was logged.
  • The firmware was enhanced to improve the FRU callouts when the number of multi-bit errors on a POWER7 processor bus exceeds the threshold.  This reduces the number of FRUs replaced on a failing system.
  • A problem was fixed that caused a system to crash when the system was in low power (or safe) mode, and the system attempted to switch over to nominal mode.
  • The firmware was enhanced to reduce the impact of heavy volume errors, which can be logged as "sender hang" errors.
  • The firmware was enhanced to reduce the number of "retry fetch CE" and "DRAM spare" error log entries that call out memory DIMMs.
  • A problem was fixed that caused the first processor module in a node to be erroneously called out if an over-temperature condition was detected, instead of the processor module that was reporting the over-temperature condition.
  • The firmware was enhanced to handle the I/O hub ISR (Integrated Switch Router) link port errors as software-recoverable, rather than as hard failures.  Before this enhancement, the links would have been guarded out even though these errors were recoverable.
  • A problem was fixed that caused a service processor kernel panic due to an out-of-memory condition, with SRC B181720D.

System firmware changes that affect certain systems

  • On systems with F/C 5708 and 5270 dual-port 10Gb Ethernet adapter cards installed, a problem was fixed that caused SRC B7006970 to be erroneously logged when the card was powered on.
  • In asymmetric and cross-coupled topologies, if there are no direct dlink connections between a storage drawer and a compute supernode (either through fail-in-place or through having a compute drawer or drawers at standby), then the storage drawer, upon restart or re-initialization of the lnmc daemon (lnmcd), does not provide a failover route to the target compute supernode even though there are suitable bounce points within the compute sub-cluster that can provide the indirect route.  The firmware was enhanced to provide this indirect route.
AS730_084_084

04/12/12

Impact: Function           Severity:  SPE

New Features and Functions

  • Support for cross-coupled compute-to-storage topology for a 2 drawer storage sub-cluster.
  • Support for cross-coupled compute-to-storage topology for a 4 drawer storage sub-cluster.

System firmware changes that affect all systems

  • The firmware was enhanced to allow a node to continue to boot when unrecoverable SRC B181B70C is logged.
  • A problem was fixed that caused an extraneous error log entry calling out DCCA-B and hub R5 when power was removed from DCCA-A, and the service processor and TPMD in DCCA-A were primary.
  • The firmware was enhanced to more gracefully handle the system shutdown that is required when a hypervisor hang condition is encountered.  Previously, SRCs B7000602, B182951C, B1813918 and A7001151 were logged, and a service processor failover occurred, when the hypervisor hang condition and subsequent system crash occurred.
  • The firmware was enhanced to cause the secondary service processor to automatically pick up configuration changes stored on the primary service processor.  This prevents the new configuration information from being lost if a service processor failover occurs before the secondary has picked up the new configuration information; typically this problem will only be encountered just after a system is installed.
  • The firmware was enhanced to gracefully recover, and log the correct error logs, if the secondary DCCA loses power.
  • A problem was fixed that prevented communication between the compute and storage networks in asymmetric ISR network topologies.  This affected network topologies DD2_64_8_2A, DD2_64_8_2B, DD2_64_8_4A, and DD2_64_8_4B.
  • A problem was fixed that caused SRC B181E6F1 ("RMGR_PERSISTENT_EVENT_TIMEOUT") to be erroneously logged.
  • The firmware was enhanced to reduce the number of memory DIMMs replaced due to correctable errors being logged.
  • A problem was fixed that caused unrecoverable SRC B130CD03 to be erroneously logged.
  • A problem was fixed that caused SRC B7000602 to be erroneously logged at power on.
  • The firmware was enhanced to prevent a potential deadlock in the opposite-side storage drawer if all of the cross-coupled dlinks between a compute supernode (at runtime) and a storage drawer (at runtime) are taken down.  This problem also affects indirect routing from compute to storage over cross-coupled links.
  • A problem was fixed that caused the Local Network Management Controller (LNMC) to be set to the wrong state during a service processor (DCCA) fail-over.  If this problem occurs, the most likely symptom will be a communication failure on the ISR network.
  • A problem was fixed that caused a partition running AIX to crash.
  • A new level of optical link firmware is included in this service pack, and the optical link firmware update function is enabled.  The new optical link device firmware will be automatically installed the next time the node is booted after this service pack is installed.  Please see "Additional Details About Installing This Service Pack" in the "Important Information" section of the Description File.
  • The firmware was enhanced to increase the threshold of soft NVRAM errors on the service processor to 32 before SRC B15xF109 is logged.  (Replacement of the service processor is recommended if more than one B15xF109 is logged per week.)
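
The replacement guidance for B15xF109 in the last item above can be expressed as a simple check over error log timestamps.  This is a minimal Python sketch under stated assumptions: the function name and the (timestamp, SRC) input format are hypothetical, "x" is assumed to stand for a single hexadecimal digit, and "more than one per week" is interpreted as two occurrences less than seven days apart.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# Assumption: the "x" in B15xF109 is one hexadecimal digit.
SRC_PATTERN = re.compile(r"^B15[0-9A-F]F109$")
WINDOW = timedelta(days=7)


def replacement_recommended(events: Iterable[Tuple[datetime, str]]) -> bool:
    """Return True if two B15xF109 entries fall less than 7 days apart.

    `events` is a hypothetical list of (timestamp, SRC) pairs taken from
    the service processor error log; other SRCs are ignored.
    """
    times = sorted(ts for ts, src in events if SRC_PATTERN.match(src))
    # After sorting, the closest pair of entries is always adjacent.
    return any(later - earlier < WINDOW
               for earlier, later in zip(times, times[1:]))
```

For example, two B15xF109 entries logged four days apart would return True, while entries ten days apart would return False.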
AS730_066_066

01/25/12

Impact: Function           Severity:  SPE

System firmware changes that affect all systems

  • HIPER/Pervasive:  The initial value of the fault isolation register (FIR) for MULTICAST_TO_HFI_TIMEOUT was changed from recoverable attention to special attention to prevent multiple deconfigured Torrent modules across multiple CECs.
  • HIPER/Not pervasive:  A problem was fixed that caused a node to hang when booting, then terminate with SRC B1813450, B181C350, and B18187D9.
  • HIPER/Not pervasive:  On systems running the Advanced Energy Manager, a problem was fixed that caused the system to crash with SRC B114E504.
  • On systems using the Advanced Energy Manager (AEM) to run in Dynamic Power Save (DPS) mode, and with deconfigured processor cores, a problem was fixed that caused the processor voltages to be set incorrectly, which in turn caused the system to use more power than it should have been using.
  • A problem was fixed that caused the processor fabric bus to be guarded out when a time-of-day (TOD) clock failure occurred.  Only the TOD clock should be guarded out.
  • A problem was fixed that caused a node to be erroneously guarded out during power on.
  • The firmware was enhanced to increase the threshold for recoverable SRC B113E504 so that the processor core reporting the SRC is not guarded out.  This prevents unnecessary performance loss and the unnecessary replacement of processor modules.
  • On the System Management Services (SMS) remote IPL (RIPL) menus, a problem was fixed that caused the SMS menu to continue to show that an Ethernet device is configured for iSCSI, even though the user has changed it to BOOTP.
  • The firmware was enhanced to report an error to the operating system (OS) when a bad packet was sent from a host fabric interface (HFI) window.

System firmware changes that affect certain systems

  • On systems running host fabric interface (HFI) Ethernet, a problem was fixed that caused the TCP/IP boot parameters to be zeroed out, and the partition to fail to boot with HFI, when the partition was powered on.
AS730_057_057

12/05/11

Impact: Function           Severity:  SPE

System firmware changes that affect all systems

  • A problem was fixed that caused the wrong field replaceable units (FRUs) to be called out when a time-of-day (TOD) clock failure occurred.
  • A problem was fixed that caused multiple octants to be erroneously guarded out when correctable elastic interface errors occurred. 
  • The firmware was enhanced to list the field replaceable units (FRUs) in the proper order when a failure occurred in which the processor module was in the failure path.
  • A problem was fixed that caused all of the I/O hub chips in a node to be erroneously guarded out when a failure of the optical clock oscillator function occurred.
  • A problem was fixed that caused the system to crash with SRC B18187DA.
  • The firmware was enhanced to improve the field replaceable units (FRUs) called out when a clock failure occurs.
  • The firmware was enhanced to call out the symbolic FRU PIOCARD, instead of the PCI riser card, with SRCs B7006900, B7006970, B7006971, B7006973, and B7006A2B.
  • The firmware was enhanced to support two new topologies, 8D_2S_A and 8D_4S_A that are extensions of the 8D topology, to support customers' configurations.
  • The firmware was enhanced to better handle a series of hardware errors at runtime that previously would have generated a check stop.
  • A problem was fixed that caused the CEC identify LED and frame identify LED to continue to blink after the successful completion of a PCI adapter replacement operation.
  • A problem was fixed that prevented the DCCA from being added to the field replaceable unit (FRU) list when the error log was generated by the temperature/pressure monitoring device (TPMD), or thermal management (TMGT), firmware.
  • A problem was fixed that caused some links to report that they were still disabled, even though they were enabled, when disabling or enabling a large number of links.
  • The firmware was enhanced to correctly log an error when the bulk power controllers' firmware levels don't match.
  • A problem was fixed that caused SRC B1502626 to be erroneously logged when a bulk power controller (BPC) firmware update was installed when the system was at runtime.
  • A problem was fixed that caused the replacement of a PCI adapter using the AIX hot plug utility to fail.
AS730_044_044

08/27/11

Impact:  New            Severity:  New

GA Level