4 messages in com.mysql.lists.clusterRe: Node restart issue
FromSent OnAttachments
Adam Dixon22 Aug 2005 19:06 
Jonathan Miller22 Aug 2005 19:24 
Adam Dixon22 Aug 2005 19:37 
Adam Dixon23 Aug 2005 21:22 
Subject:Re: Node restart issue
From:Adam Dixon (adam@gmail.com)
Date:08/23/2005 09:22:26 PM
List:com.mysql.lists.cluster

So does anyone have any idea whats happening here.

Just to see, I deleted all the tables in the 2 databases I am running across the cluster, then ran the 'nodeid restart' command. It restarted fine, nice and quick obviously. But as soon as I load the cluster with some data, apon running the restart command i get this still; ndb_mgm> 15 restart Restart failed. * 2003: Restart failed * 5005-Send to process or receive failed.

and the clusterlog shows; 2005-08-24 13:44:05 [MgmSrvr] ALERT -- Node 10: Node 15 Disconnected 2005-08-24 13:44:05 [MgmSrvr] INFO -- Node 10: Communication to Node 15 closed 2005-08-24 13:44:05 [MgmSrvr] INFO -- Node 10: Communication to Node 15 closed 2005-08-24 13:44:05 [MgmSrvr] ALERT -- Node 20: Node 15 Disconnected ..... <more communication closed messages>... 2005-08-24 13:44:05 [MgmSrvr] ALERT -- Node 10: Arbitration check won - node group majority 2005-08-24 13:44:05 [MgmSrvr] INFO -- Node 10: President restarts arbitration thread [state=6] 2005-08-24 13:44:05 [MgmSrvr] INFO -- Mgmt server state: nodeid 15 freed, m_reserved_nodes 0000004842140406. 2005-08-24 13:44:05 [MgmSrvr] WARNING -- Allocate nodeid (15) failed. Connection from ip 10.10.10.15. Returned error string "Id 15 already allocated by another node." 2005-08-24 13:44:05 [MgmSrvr] INFO -- Mgmt server state: node id's 15 connected but not reserved 2005-08-24 13:44:05 [MgmSrvr] INFO -- Mgmt server state: node id's 1 not connected but reserved 2005-08-24 13:44:08 [MgmSrvr] INFO -- Mgmt server state: nodeid 15 reserved for ip 10.10.10.15, m_reserved_nodes 0000004842148406. 2005-08-24 13:44:08 [MgmSrvr] INFO -- Node 1: Node 15 Connected 2005-08-24 13:44:25 [MgmSrvr] INFO -- Node 10: Communication to Node 15 opened 2005-08-24 13:44:25 [MgmSrvr] INFO -- Node 10: Node 15 Connected 2005-08-24 13:44:25 [MgmSrvr] INFO -- Node 35: Communication to Node 15 opened

But the ndbd process comes back just like normal. No trace logs or anything in the ndbd's errorlog. And absolutely nothing suss in the ndb_out log file either. Looks like it happened without error from the ndbd side of things.

Any help on this would be good. Is it a bug in the mgmd side of things?

On 8/23/05, Jonathan Miller <jmil@mysql.com> wrote:

Adam,

For the mysqld:

" New command in ndb_mgm, PURGE STALE SESSIONS, as a workaround for cases where nodes fail to allocate a node id even if it is free to use."

Was there an error log or trace file on the system that 35 resides on?

JBM

-----Original Message----- From: Adam Dixon [mailto:adam@gmail.com] Sent: Monday, August 22, 2005 9:07 PM To: clus@lists.mysql.com Subject: Node restart issue

Bit of a weird one here; I would assume that with no problem with the cluster to begin with from the mgt node running; ndb_mgm> 35 restart

I get this showing up in the ndb_mgm; Restart failed. * 2003: Restart failed * 5005-Send to process or receive failed.

And this in the clusterlog; 2005-08-23 11:16:15 [MgmSrvr] INFO -- Node 35: Node shutdown initiated 2005-08-23 11:16:25 [MgmSrvr] ALERT -- Node 1: Node 35 Disconnected 2005-08-23 11:16:25 [MgmSrvr] ALERT -- Node 10: Node 35 Disconnected 2005-08-23 11:16:25 [MgmSrvr] INFO -- Node 10: Communication to Node 35 closed 2005-08-23 11:16:25 [MgmSrvr] INFO -- Node 10: Communication to Node 35 closed 2005-08-23 11:16:25 [MgmSrvr] ALERT -- Node 15: Node 35 Disconnected 2005-08-23 11:16:25 [MgmSrvr] INFO -- Node 15: Communication to Node 35 closed 2005-08-23 11:16:25 [MgmSrvr] ALERT -- Node 20: Node 35 Disconnected 2005-08-23 11:16:25 [MgmSrvr] ALERT -- Node 25: Node 35 Disconnected 2005-08-23 11:16:25 [MgmSrvr] INFO -- Node 25: Communication to Node 35 closed 2005-08-23 11:16:25 [MgmSrvr] ALERT -- Node 30: Node 35 Disconnected 2005-08-23 11:16:25 [MgmSrvr] INFO -- Node 30: Communication to Node 35 closed 2005-08-23 11:16:25 [MgmSrvr] INFO -- Node 15: Communication to Node 35 closed 2005-08-23 11:16:25 [MgmSrvr] INFO -- Node 20: Communication to Node 35 closed 2005-08-23 11:16:25 [MgmSrvr] INFO -- Node 25: Communication to Node 35 closed 2005-08-23 11:16:25 [MgmSrvr] INFO -- Node 30: Communication to Node 35 closed 2005-08-23 11:16:25 [MgmSrvr] ALERT -- Node 10: Arbitration check won - node group majority 2005-08-23 11:16:25 [MgmSrvr] INFO -- Node 10: President restarts arbitration thread [state=6] 2005-08-23 11:16:26 [MgmSrvr] INFO -- Mgmt server state: nodeid 35 freed, m_reserved_nodes 0400000042108402. 2005-08-23 11:16:26 [MgmSrvr] WARNING -- Allocate nodeid (35) failed. Connection from ip 10.10.10.35. Returned error string "Id 35 already allocated by another node." 2005-08-23 11:16:26 [MgmSrvr] INFO -- Mgmt server state: node id's 14 18 23 27 33 35 59 connected but not reserved 2005-08-23 11:16:26 [MgmSrvr] INFO -- Mgmt server state: node id's 1 not connected but reserved 2005-08-23 11:16:29 [MgmSrvr] INFO -- Mgmt server state: nodeid 35 reserved for ip 10.10.10.35, m_reserved_nodes 0400000842108402. 2005-08-23 11:16:29 [MgmSrvr] INFO -- Node 1: Node 35 Connected 2005-08-23 11:16:45 [MgmSrvr] INFO -- Node 30: Communication to Node 35 opened 2005-08-23 11:16:45 [MgmSrvr] INFO -- Node 20: Communication to Node 35 opened 2005-08-23 11:16:45 [MgmSrvr] INFO -- Node 20: Node 35 Connected 2005-08-23 11:16:45 [MgmSrvr] INFO -- Node 35: Node 20 Connected 2005-08-23 11:16:45 [MgmSrvr] INFO -- Node 35: Node 30 Connected 2005-08-23 11:16:45 [MgmSrvr] INFO -- Node 30: Node 35 Connected 2005-08-23 11:16:45 [MgmSrvr] INFO -- Node 25: Communication to Node 35 opened 2005-08-23 11:16:45 [MgmSrvr] INFO -- Node 15: Communication to Node 35 opened 2005-08-23 11:16:46 [MgmSrvr] INFO -- Node 25: Node 35 Connected 2005-08-23 11:16:46 [MgmSrvr] INFO -- Node 10: Communication to Node 35 opened 2005-08-23 11:16:46 [MgmSrvr] INFO -- Node 35: Node 10 Connected 2005-08-23 11:16:46 [MgmSrvr] INFO -- Node 35: Node 15 Connected 2005-08-23 11:16:46 [MgmSrvr] INFO -- Node 35: Node 25 Connected 2005-08-23 11:16:46 [MgmSrvr] INFO -- Node 10: Node 35 Connected 2005-08-23 11:16:46 [MgmSrvr] INFO -- Node 15: Node 35 Connected 2005-08-23 11:16:46 [MgmSrvr] INFO -- Node 35: CM_REGCONF president = 10, own Node = 35, our dynamic id = 7

It did come back ok (well it shows up fine in all status and show) so I dont know.. seems a little scary for a managed restart of a node.

I also had something similar when a mysqld crashed, it attempted to connect again, and failed due to the id not being free twice, then it came up on one of the spare id connections i have leftover with no ip tied to it instead of the one that it is supposed to have; eg; [mysqld(API)] 9 node(s) id=2 (not connected, accepting connect from 10.10.10.1) ...(few others) id=33 (Version: 4.1.13) id=38 (not connected, accepting connect from 10.10.10.35) id=58 @10.10.10.1 (Version: 4.1.13) id=59 (Version: 4.1.13)