Subject: Re: failed ndbrequire -- reason?
From: Jonas Oreland (jona@mysql.com)
Date: 05/04/2005 04:08:40 AM
List: com.mysql.lists.cluster

Jim Hoadley wrote:

More on this problem. A couple of hours later, node 4 went down, then all nodes died, taking down the cluster.

Looks like the error messages available to me are more interesting this time.

Here's what the ndb_error logs say:

Node 1:

Date/Time: Monday 2 May 2005 - 20:15:08
Type of error: assert
Message: Assertion, probably a programming error
Fault ID: 2301
Problem data: ArrayPool<T>::getPtr
Object of reference: ../../../../../ndb/src/kernel/vm/ArrayPool.hpp line: 350 (block: BACKUP)
ProgramName: ndbd
ProcessID: 13637
TraceFile: /usr/local/mysql/ndb_1_trace.log.4
***EOM***

Node 2:

Date/Time: Monday 2 May 2005 - 20:15:22
Type of error: assert
Message: Assertion, probably a programming error
Fault ID: 2301
Problem data: ArrayPool<T>::getPtr
Object of reference: ../../../../../ndb/src/kernel/vm/ArrayPool.hpp line: 350 (block: BACKUP)
ProgramName: ndbd
ProcessID: 3666
TraceFile: /usr/local/mysql/ndb_2_trace.log.8
***EOM***

Node 3:

<none>

Node 4:

<none>
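
For readers who want to compare entries like these across nodes, here is a minimal sketch of a parser. It assumes each field sits on its own line in the actual error log file and that an entry ends with ***EOM***; the function name parse_error_entry is hypothetical and is not part of any MySQL tool.

```python
def parse_error_entry(text):
    """Parse one error log entry into a dict of field name -> value.

    Assumption: one "Key: value" field per line, entry closed by ***EOM***.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "***EOM***":
            continue
        # Split on the first ": " only, so values such as
        # "ArrayPool<T>::getPtr" or "line: 350" stay intact.
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    return fields


# The node 1 entry quoted above, one field per line.
entry = """\
Date/Time: Monday 2 May 2005 - 20:15:08
Type of error: assert
Message: Assertion, probably a programming error
Fault ID: 2301
Problem data: ArrayPool<T>::getPtr
Object of reference: ../../../../../ndb/src/kernel/vm/ArrayPool.hpp line: 350 (block: BACKUP)
ProgramName: ndbd
ProcessID: 13637
TraceFile: /usr/local/mysql/ndb_1_trace.log.4
***EOM***
"""

fields = parse_error_entry(entry)
print(fields["Fault ID"], fields["TraceFile"])
```

The TraceFile field is the useful one here, since it names the trace file that belongs to each crash.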

Here's what the ndb_out files say:

Node 1:
Error handler shutting down system
Error handler shutdown completed - exiting

Node 3:
2005-05-02 20:15:23 [NDB] INFO -- Received signal 11. Running error handler.

Node 2:
Ndb kernel is stuck in: Polling for Receive
Error handler shutting down system
Error handler shutdown completed - exiting

Node 4:
2005-05-02 20:00:04 [NDB] INFO -- Received signal 11. Running error handler.

It looks like node 1 and node 2 died, then with no node available in that node group, the management server had to shut down node 3 and node 4.
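
The node-group reasoning in the paragraph above can be sketched as follows. This is an illustration only, assuming the node group assignment reported by the show output in this thread (nodes 1 and 2 in node group 0, nodes 3 and 4 in node group 1) and the general rule that a cluster needs at least one live data node in every node group. The function name cluster_can_survive is hypothetical.

```python
# Node group membership as reported by "ndb_mgm> show" in this thread.
NODE_GROUPS = {0: {1, 2}, 1: {3, 4}}


def cluster_can_survive(alive):
    """Return True if every node group still has at least one live data node."""
    alive = set(alive)
    return all(group & alive for group in NODE_GROUPS.values())


# One node lost from each group: every group still has a replica.
print(cluster_can_survive({2, 4}))

# Nodes 1 and 2 both down: node group 0 is empty, so the remaining
# nodes cannot keep the cluster up on their own.
print(cluster_can_survive({3, 4}))
```

This only models the rule described in the paragraph; it does not explain why nodes 1 and 2 failed in the first place.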

The "object of reference" line in the error log mentions BACKUP. I began an ndbcluster BACKUP just 10 or 15 minutes prior to the crash (at 20:00). Could that have been the cause? If so, why?

The backup is definitely the cause of the bug. Exactly why is hard to say... you must include the trace files from the crashes.

The previous backup ran (at 18:00) when one of the 4 nodes was offline. When it finished I deleted the directories. Could either of these have caused some corruption?

There should be no problem taking a backup with some nodes offline. And the deleting of the directories does not matter.

Regarding the first crash in TC:
1) What version are you running?
2) That's related to ndb's internal triggers, which are (among other things) used for backup and unique indexes. Were you running a backup at the time of the failure?
3) If so, there have been a number of bug fixes lately regarding node failure during backup, which might affect the second crash (but not the TC one).

/Jonas

-- Jim

After running for many days, my cluster crashed this afternoon. Can someone help me understand the reason?

Any help would be greatly appreciated.

This was the sequence of events. Node 1 crashed while I was logged in with the mysql client. I restarted node 1, then both node 1 and node 3 (both on the same host) crashed. I restarted nodes 1 and 3 successfully and they're still running.

Here's what the error log for node 1 says. Node 3 did not have an error log. Please let me know which lines from the trace logs are relevant and I'll post those too.

Date/Time: Monday 2 May 2005 - 17:49:05
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: DbtcMain.cpp
Object of reference: DBTC (Line: 12251) 0x0000000a
ProgramName: ndbd
ProcessID: 3669
TraceFile: /usr/local/mysql/ndb_1_trace.log.2
***EOM***

Date/Time: Monday 2 May 2005 - 17:52:14
Type of error: error
Message: Node failed during system restart
Fault ID: 2308
Problem data: Unhandled node failure of started node during restart
Object of reference: NDBCNTR (Line: 1417) 0x0000000a
ProgramName: ndbd
ProcessID: 13585
TraceFile: /usr/local/mysql/ndb_1_trace.log.3
***EOM***

These are my specs.

3-host cluster:

host1 = node [1], node [3], API [6]
host2 = node [2], node [4], API [7]
host3 = mgm [5]

Each host has 6GB RAM and two 3.6GHz Xeons, running RedHat Enterprise Linux 3 with the hugemem kernel 2.4.21-27.0.4.ELhugemem #1 SMP.

Here's my config.ini:

[ndbd default]
LockPagesInMainMemory=1
TransactionDeadlockDetectionTimeout=14000
NoOfReplicas= 2
MaxNoOfConcurrentOperations=131072
DataMemory= 1900M
IndexMemory= 400M
Diskless= 0
DataDir= /var/mysql-cluster
TimeBetweenWatchDogCheck=10000
HeartbeatIntervalDbDb=10000
HeartbeatIntervalDbApi=10000
NoOfFragmentLogFiles=64

NoOfDiskPagesToDiskAfterRestartTUP=54 #40
NoOfDiskPagesToDiskAfterRestartACC=8 #20

MaxNoOfAttributes = 2000 #1000
MaxNoOfOrderedIndexes = 5000 #128
MaxNoOfUniqueHashIndexes = 5000 #64

[ndbd]
HostName= 10.0.1.199

[ndbd]
HostName= 10.0.1.200

[ndbd]
HostName= 10.0.1.199

[ndbd]
HostName= 10.0.1.200

[ndb_mgmd]
HostName= 10.0.1.198
PortNumber= 2200

[mysqld]

[mysqld]

[tcp default]
PortNumber= 2202

And show:

ndb_mgm> show
Connected to Management Server at: 10.0.1.198:2200
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=1 @10.0.1.199 (Version: 4.1.11, Nodegroup: 0)
id=2 @10.0.1.200 (Version: 4.1.11, Nodegroup: 0, Master)
id=3 @10.0.1.199 (Version: 4.1.11, Nodegroup: 1)
id=4 @10.0.1.200 (Version: 4.1.11, Nodegroup: 1)

[ndb_mgmd(MGM)] 1 node(s)
id=5 @10.0.1.198 (Version: 4.1.11)

[mysqld(API)] 2 node(s)
id=6 @10.0.1.199 (Version: 4.1.11)
id=7 @10.0.1.200 (Version: 4.1.11)

Thanks in advance.

-- Jim