7 messages in com.mysql.lists.cluster Re: nightly crashing

| From | Sent On | Attachments |
|---|---|---|
| Kasthuri Ilankamban | 22 Jul 2004 12:35 | |
| Devananda | 22 Jul 2004 12:49 | |
| Kasthuri Ilankamban | 22 Jul 2004 13:02 | |
| Joseph E. Sacco, Ph.D. | 22 Jul 2004 13:07 | |
| Mikael Ronström | 29 Jul 2004 08:45 | |
| Jim Hoadley | 29 Jul 2004 09:32 | |
| Mikael Ronström | 29 Jul 2004 13:01 | |

| Subject: | Re: nightly crashing |
|---|---|
| From: | Mikael Ronström (mik...@mysql.com) |
| Date: | 07/29/2004 01:01:12 PM |
| List: | com.mysql.lists.cluster |
Hi Jim,
The ndbd node dies because the watchdog thread kills it. Most likely the ndbd main thread has entered an infinite loop. Check error.log to see which trace file was produced, and send the most interesting parts of that trace file. If the code was written properly, the trace file should point out where the ndbd process got stuck.
The trace file starts with the jump address memory, which records the line numbers in the modules where the code executed. Send this part together with the first few signals recorded, which are the last signals executed before the crash.
Rgrds Mikael

PS: You'll find those files in the directory where the ndbd process executes.
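To illustrate the mechanism described above, here is a minimal sketch of a watchdog that declares a worker stuck once its progress counter stops advancing. It is not the ndbd implementation: the `Watchdog` class, its parameters, and the `demo` function are hypothetical names chosen for illustration, and the only detail taken from this thread is the "Job Handling" activity label from the console output.

```python
import threading
import time


class Watchdog:
    """Flags a worker whose progress counter stops advancing (illustrative only)."""

    def __init__(self, check_interval, max_missed):
        self.check_interval = check_interval  # seconds between checks
        self.max_missed = max_missed          # consecutive stalled checks tolerated
        self.counter = 0                      # bumped by the worker on progress
        self.activity = "idle"                # what the worker last reported doing
        self.stuck_in = None                  # set once the worker is declared stuck
        self._stop = threading.Event()

    def tick(self, activity):
        """Called by the worker each time it makes progress."""
        self.activity = activity
        self.counter += 1

    def run(self):
        """Watchdog loop: compare the counter against the last value seen."""
        last_seen = self.counter
        missed = 0
        while not self._stop.wait(self.check_interval):
            if self.counter == last_seen:
                missed += 1
                if missed >= self.max_missed:
                    # A real watchdog would shut the process down here.
                    self.stuck_in = self.activity
                    return
            else:
                last_seen = self.counter
                missed = 0

    def stop(self):
        self._stop.set()


def demo():
    """Worker ticks a few times, then stalls; returns where it got stuck."""
    dog = Watchdog(check_interval=0.05, max_missed=3)
    watcher = threading.Thread(target=dog.run, daemon=True)
    watcher.start()
    for _ in range(5):
        dog.tick("Job Handling")
        time.sleep(0.01)
    # From here on the worker never ticks again, simulating an infinite loop.
    watcher.join(timeout=2.0)
    return dog.stuck_in
```

Calling `demo()` returns the last activity the worker reported before it stalled, which is the same kind of information the "stuck in" console message gives.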
On 2004-07-29 at 18:33, Jim Hoadley wrote:
I've been running a 2-node cluster for 10 days or so. 1 API and 1 DB on each computer, MGM on a separate computer. Each night one of the nodes dies. Always the same node. The cluster is intact since the second node survives, and I restart the crashed node and it rejoins the cluster with no fuss. Obviously, I can't have this behaviour in production and would like to find the cause.
These are just test boxes with 512MB RAM, so it could be underpowered hardware, but I was wondering if there were logs or trace files I could provide that would help determine the source of the nightly crash.
What the console shows on BOX2:
[root@BOX2 2.ndb_db]#
Ndb kernel is stuck in: Job Handling
Ndb kernel is stuck in: Job Handling
Error handler shutting down system
Error handler shutdown completed - exiting
What ndb/cluster.log says:
<...>
2004-07-28 17:59:02 [MgmSrvr] INFO -- Node 3: Local checkpoint 174 started. Keep GCI = 260315 oldest restorable GCI = 248165
2004-07-28 18:58:55 [MgmSrvr] INFO -- Node 3: Local checkpoint 175 started. Keep GCI = 262051 oldest restorable GCI = 248165
2004-07-28 19:58:47 [MgmSrvr] INFO -- Node 3: Local checkpoint 176 started. Keep GCI = 263786 oldest restorable GCI = 248165
2004-07-28 20:55:03 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 2
2004-07-28 20:55:04 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 3
2004-07-28 20:55:05 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2004-07-28 20:55:05 [MgmSrvr] INFO -- Lost connection to node 2
2004-07-28 20:55:06 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 4
2004-07-28 20:55:06 [MgmSrvr] ALERT -- Node 3: Node 2 declared dead due to missed heartbeat
2004-07-28 20:55:06 [MgmSrvr] ALERT -- Node 3: Network partitioning - arbitration required
2004-07-28 20:55:06 [MgmSrvr] INFO -- Node 3: President restarts arbitration thread [state=7]
2004-07-28 20:55:06 [MgmSrvr] ALERT -- Node 3: Arbitration won - positive reply from node 1
2004-07-28 20:55:06 [MgmSrvr] INFO -- Node 3: Started arbitrator node 1 [ticket=1eab00020908f20c]
2004-07-28 20:55:07 [MgmSrvr] WARNING -- Node 3: Node 12 missed heartbeat 2
2004-07-28 20:55:08 [MgmSrvr] WARNING -- Node 3: Node 12 missed heartbeat 3
2004-07-28 20:55:10 [MgmSrvr] WARNING -- Node 3: Node 12 missed heartbeat 4
2004-07-28 20:55:10 [MgmSrvr] ALERT -- Node 3: Node 12 declared dead due to missed heartbeat
2004-07-28 20:58:27 [MgmSrvr] INFO -- Node 3: Local checkpoint 177 started. Keep GCI = 265522 oldest restorable GCI = 248165
2004-07-28 21:53:06 [MgmSrvr] INFO -- Node 3: Local checkpoint 178 started. Keep GCI = 267250 oldest restorable GCI = 248165
<...>
Any ideas? Any other places to look?
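For picking the interesting entries out of a long cluster.log, a small filter along these lines may help. It is only a sketch: the line layout (timestamp, bracketed source, level, `--`, message) is inferred from the excerpt in this message rather than from any documented format, and `parse_line`, `notable_events`, and `SAMPLE` are hypothetical names.

```python
import re

# Layout assumed from the excerpt: "<date> <time> [<source>] <LEVEL> -- <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<source>\w+)\] (?P<level>\w+) -- (?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def notable_events(lines, levels=("WARNING", "ALERT")):
    """Keep only the entries whose level is in `levels`, in original order."""
    events = []
    for line in lines:
        entry = parse_line(line)
        if entry and entry["level"] in levels:
            events.append(entry)
    return events


# Four lines copied from the excerpt in this message.
SAMPLE = [
    "2004-07-28 19:58:47 [MgmSrvr] INFO -- Node 3: Local checkpoint 176 started. "
    "Keep GCI = 263786 oldest restorable GCI = 248165",
    "2004-07-28 20:55:03 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 2",
    "2004-07-28 20:55:05 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected",
    "2004-07-28 20:55:05 [MgmSrvr] INFO -- Lost connection to node 2",
]

if __name__ == "__main__":
    for event in notable_events(SAMPLE):
        print(event["timestamp"], event["level"], event["message"])
```

Run against the full log, this would leave only the heartbeat warnings and the disconnect and arbitration alerts, which makes the time of the failure easier to spot.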
Thanks in advance.
-- Jim
Mikael Ronström, Senior Software Architect MySQL AB, www.mysql.com
Clustering: http://www.infoworld.com/article/04/04/14/HNmysqlcluster_1.html
http://www.eweek.com/article2/0,1759,1567546,00.asp




