HACMP clstrmgrES termination error

 

Introduction

The HACMP cluster failed over to another node in the cluster, and the previously active node was halted.

After the node was brought back up, the issue was investigated and the following log files were analyzed.

The AIX error log

errpt -a | more

------------------------------------------------------------------------
LABEL:          SRC_SVKO
IDENTIFIER:     BC3BE5A3

Date/Time:       Mon  8 Mar 17:10:07 2010
Sequence Number: 139533
Machine Id:      00C7DFFE4C03
Node Id:         node1
Class:           S
Type:            PERM
Resource Name:   SRC

Description
SOFTWARE PROGRAM ERROR

Probable Causes
APPLICATION PROGRAM

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        MANUALLY RESTART SUBSYSTEM IF NEEDED

Detail Data
SYMPTOM CODE
         512
SOFTWARE ERROR CODE
       -9017
ERROR CODE
           0
DETECTING MODULE
'srchevn.c'@line:'350'
FAILING MODULE
clstrmgrES
-----------------------------------------------------------------------
LABEL:          J2_FS_FULL
IDENTIFIER:     F7FA22C9

Date/Time:       Mon  8 Mar 16:39:57 2010
Sequence Number: 139532
Machine Id:      00C7DFFE4C03
Node Id:         node1
Class:           O
Type:            INFO
Resource Name:   SYSJ2

Description
UNABLE TO ALLOCATE SPACE IN FILE SYSTEM

Probable Causes
FILE SYSTEM FULL

        Recommended Actions
        INCREASE THE SIZE OF THE ASSOCIATED FILE SYSTEM
        REMOVE UNNECESSARY DATA FROM FILE SYSTEM
        USE FUSER UTILITY TO LOCATE UNLINKED FILES STILL REFERENCED

Detail Data
JFS2 MAJOR/MINOR DEVICE NUMBER
000A 0008
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/hd3, /tmp
------------------------------------------------------------------------
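
On a busy system the error log can be long, so it can help to select entries by identifier instead of paging through everything. A minimal sketch, assuming the two identifiers shown in the report above (BC3BE5A3 for SRC_SVKO and F7FA22C9 for J2_FS_FULL). The function name show_errpt_ids is made up for illustration and is not part of AIX or HACMP.

```shell
#!/bin/sh
# show_errpt_ids is a hypothetical helper name, not part of AIX or HACMP.
# It prints detailed (-a) error log entries restricted (-j) to the given
# comma-separated list of error identifiers.
show_errpt_ids() {
    errpt -a -j "$1"
}
```

For this incident it would be called as: show_errpt_ids BC3BE5A3,F7FA22C9 | more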

/usr/es/adm/cluster.log

The file /usr/es/adm/cluster.log contained the following entries.

Mar  8 17:10:07 node1 daemon:err|error snmpd[274636]: 
	EXCEPTIONS: no response after 200 seconds  (SMUX 127.0.0.1+45860+6)
Mar  8 17:10:07 node1 user:notice HACMP for AIX: clexit.rc : 
	Unexpected termination of clstrmgrES.
Mar  8 17:10:07 node1 user:notice HACMP for AIX: clexit.rc : 
	Halting system immediately!!!

As the J2_FS_FULL entry in the error log shows, the /tmp file system was full (100% utilized) at 16:39:57, about 30 minutes before clstrmgrES terminated at 17:10:07.

On this server the log file clstrmgr.debug is written to /tmp. When clstrmgrES could no longer write to this log file, it exited, which caused the node to halt and the cluster to fail over.
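
When a file system fills up like this, a common first step is to see what is occupying it. A rough sketch, assuming the standard du, sort, and head utilities. The function name largest_entries is hypothetical, and the default of 10 entries is an arbitrary choice.

```shell
#!/bin/sh
# largest_entries is a hypothetical helper name used only for illustration.
# It lists the N largest entries (sizes in KB) directly under a directory.
# Note: the "*" glob does not match hidden (dot) files.
largest_entries() {
    dir=$1
    count=${2:-10}
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, largest_entries /tmp 5 would show the five largest files or directories under /tmp, which should make it clear whether clstrmgr.debug or something else is consuming the space.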

Fix

IBM has documented this issue in the following APAR.

 IZ05428: HACMP UNEXPECTED EXIT DURING LOG CYCLE IN FULL FILESYSTEM
 
 APAR status

      Closed as program error.

Error description

      HACMP cluster manager will exit when there is
      a failure opening a new log file.

      This behaviour was designed to protect against
      unknown issues with running a cluster without
      sufficient logging space.

      This behaviour is being changed because the
      consequences of exiting are the same, or
      greater than any possible unknown issues
      with cluster manager continuing to run.

Local fix

      Properly maintain enough space in your logging
      directories to be able to maintain logs for important
      RAS information.

      It is also important to remember that when the
      cluster manager continues processing without logging,
      it will be difficult, or impossible to determine the
      flow of events for debugging, or understanding
      any cluster actions.

Problem summary

      When HACMP cluster is up and running, cluster manager will call
      exit when there is a failure opening a log file resulting in a
      node failure.

Problem conclusion

      Avoid cluster manager calling exit in case of log file opening
      issue by disabling logging and allowing cluster manager to
      continue. Logging will be disabled until the issue resulting in
      the log file open failure is resolved. For example, if log file
      creation has failed, logging will be disabled until adequate file
      system space is provided for the log file to be created. Every
      time an attempt is made to open a log file, an error is reported
      through errpt and stderr.

      It is also important to remember that when the cluster manager
      continues processing without logging, it will be difficult, or
      impossible to determine the flow of events for debugging, or
      understanding any cluster actions.

Further details can be found at the following link.

http://www-01.ibm.com/support/docview.wss?uid=isg1IZ05428
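
The APAR's local fix is to keep enough space in the logging directories, and a periodic check can help with that. The following is a sketch only, assuming the POSIX column layout produced by df -Pk (capacity in column 5, mount point in column 6). The function name fs_over_threshold is hypothetical and the threshold is whatever value suits the site.

```shell
#!/bin/sh
# fs_over_threshold is a hypothetical helper name used only for illustration.
# It reads `df -Pk` output on stdin and prints "<mount point> <pct>%" for
# every file system whose usage is at or above the given percentage.
# Assumption: mount points contain no spaces.
fs_over_threshold() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the df header line
            pct = $5
            sub(/%/, "", pct)         # "100%" -> "100"
            if (pct + 0 >= limit + 0) print $6, pct "%"
        }'
}
```

It could be run from cron as, for example, df -Pk /tmp | fs_over_threshold 90, with any output treated as a warning that the file system holding clstrmgr.debug is close to full. The value 90 is only an example.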