3 messages in com.mysql.lists.cluster, thread "Re: odd failure":
B. Keith Murphy, 19 Sep 2007 12:15
B. Keith Murphy, 20 Sep 2007 10:21
Stewart Smith, 23 Sep 2007 06:39

Subject: Re: odd failure
From: B. Keith Murphy (kmur@icontact.com)
Date: 09/20/2007 10:21:42 AM
List: com.mysql.lists.cluster

So, doing some work on this. I am fairly certain that the issue is related to
latencies of the virtual machine. I have allocated some more memory to the two
servers and am monitoring them with ganglia to see what happens. It just might
not be possible to have a reasonable environment with these VMs.

thanks,

Keith

----- Original Message -----
From: "B. Keith Murphy" <kmur@icontact.com>
To: "cluster" <clus@lists.mysql.com>
Sent: Wednesday, September 19, 2007 3:15:41 PM (GMT-0500) America/New_York
Subject: odd failure

I have set up a development cluster for our developers. It consists of two
physical servers running the SQL daemon and data node on each one with
management running on another server.

About an hour and a half ago the SQL node on one of the two servers stopped
responding. The data node part was still responding and showing up in the
ndb_mgm console. As you can see node 4 started missing heartbeats at 1:06 pm.

2007-09-19 13:06:59 [MgmSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2
2007-09-19 13:08:57 [MgmSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2
2007-09-19 13:14:43 [MgmSrvr] INFO -- Node 2: Local checkpoint 134 started. Keep GCI = 207358 oldest restorable GCI = 207369
2007-09-19 13:33:44 [MgmSrvr] WARNING -- Node 2: Node 4 missed heartbeat 2
2007-09-19 13:33:46 [MgmSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2
2007-09-19 13:33:48 [MgmSrvr] WARNING -- Node 3: Node 4 missed heartbeat 3
2007-09-19 13:33:49 [MgmSrvr] WARNING -- Node 2: Node 4 missed heartbeat 2
2007-09-19 13:33:50 [MgmSrvr] WARNING -- Node 3: Node 4 missed heartbeat 4
2007-09-19 13:33:50 [MgmSrvr] ALERT -- Node 3: Node 4 declared dead due to missed heartbeat
2007-09-19 13:33:50 [MgmSrvr] INFO -- Node 3: Communication to Node 4 closed
2007-09-19 13:33:50 [MgmSrvr] ALERT -- Node 2: Node 4 Disconnected
2007-09-19 13:33:50 [MgmSrvr] INFO -- Node 2: Communication to Node 4 closed
2007-09-19 13:33:50 [MgmSrvr] ALERT -- Node 3: Node 4 Disconnected
2007-09-19 13:33:50 [MgmSrvr] ALERT -- Node 2: Node 4 Disconnected
2007-09-19 13:33:53 [MgmSrvr] INFO -- Node 3: Communication to Node 4 opened
2007-09-19 13:33:54 [MgmSrvr] INFO -- Node 3: Node 4 Connected
2007-09-19 13:33:55 [MgmSrvr] INFO -- Node 2: Communication to Node 4 opened
2007-09-19 13:33:56 [MgmSrvr] INFO -- Node 2: Node 4 Connected
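
For anyone wanting to tally these warnings instead of reading them by eye, here is a minimal sketch in Python. It assumes the log entries look exactly like the ones quoted above; the function name summarize_missed_heartbeats and the SAMPLE_LOG constant are hypothetical, and the sample holds only a subset of the quoted lines. It counts how many missed-heartbeat warnings each node reported about each other node, and records the highest heartbeat counter seen per target node.

```python
import re
from collections import Counter

# A subset of the cluster log lines quoted in the message above.
SAMPLE_LOG = """\
2007-09-19 13:06:59 [MgmSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2
2007-09-19 13:33:48 [MgmSrvr] WARNING -- Node 3: Node 4 missed heartbeat 3
2007-09-19 13:33:49 [MgmSrvr] WARNING -- Node 2: Node 4 missed heartbeat 2
2007-09-19 13:33:50 [MgmSrvr] WARNING -- Node 3: Node 4 missed heartbeat 4
2007-09-19 13:33:50 [MgmSrvr] ALERT -- Node 3: Node 4 declared dead due to missed heartbeat
"""

# Matches only WARNING entries that carry a heartbeat counter.
# The pattern is an assumption based on the format seen in this thread.
MISSED = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmSrvr\] WARNING -- "
    r"Node (\d+): Node (\d+) missed heartbeat (\d+)"
)


def summarize_missed_heartbeats(text):
    """Return (warnings per (reporter, target), highest counter per target)."""
    counts = Counter()
    worst = {}
    # findall scans the whole text, so it works whether the entries are
    # one per line or run together on a single line.
    for _timestamp, reporter, target, counter in MISSED.findall(text):
        reporter, target, counter = int(reporter), int(target), int(counter)
        counts[(reporter, target)] += 1
        worst[target] = max(worst.get(target, 0), counter)
    return counts, worst


counts, worst = summarize_missed_heartbeats(SAMPLE_LOG)
for (reporter, target), n in sorted(counts.items()):
    print(f"node {reporter} reported {n} missed-heartbeat warning(s) for node {target}")
for target, counter in sorted(worst.items()):
    print(f"node {target}: highest missed-heartbeat counter seen = {counter}")
```

Run against the full cluster log, a summary like this makes it easier to see whether the misses cluster around particular times, which may help when comparing against VM load graphs.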

I could log into the MySQL server node as normal and was able to switch
databases and list tables. Anything against a table (select * from users for
instance) would give an error 157.

The two servers I have set up (each running a sql node and a data node) are
running in virtual machines on the same server. So I can't figure out why the
heartbeat failed. The management node is on another server, but it is on the
same network.

To get things going I ended up shutting everything down and restarting. I
couldn't get the mysql processes on the SQL nodes to shut down normally
(/etc/init.d/mysql stop) and had to kill the processes on one server. On the
second server I ended up rebooting the server just to shut it down. Once
everything was reset it looked fine. I can start and stop the mysql nodes,
etc. Everything looks normal.

Oh, I am running 5.1.20 all around on 64-bit Debian etch.

Any suggestions?

thanks,