4 messages in com.mysql.lists.cluster: ndbd killed by "signal 9", restart fa...

| From | Sent On | Attachments |
|---|---|---|
| Joachim Worringen | 05 Mar 2008 04:43 | |
| Jeff Sturm | 05 Mar 2008 06:25 | |
| Joachim Worringen | 05 Mar 2008 07:43 | |
| Joachim Worringen | 05 Mar 2008 08:45 | |
| Subject: | ndbd killed by "signal 9", restart fails with error 2311 |
|---|---|
| From: | Joachim Worringen (joac...@dolphinics.com) |
| Date: | 03/05/2008 04:43:16 AM |
| List: | com.mysql.lists.cluster |
Hi,
I'm testing a 16-machine GigE cluster with MySQL Cluster 5.0.45, with one ndbd and one mysqld per machine. I ran a series of DBT2 tests, and after about one hour, two ndbds shut down due to "signal 9" (which I had not sent them, and if it had been a real SIGKILL, they could not have shut down properly!?). The cluster then fails to restart itself, with error 2311 on one node.
My findings are documented below; config.ini is available on request. My questions are:
- What causes this "signal 9" shutdown of two nodes?
- How is error 2311 provoked (and avoided)?
- Why does the cluster fail to start if only a single ndbd has a problem?
Any pointers?
thanks, Joachim
# ndbd.1,2 were killed by signal 9 at 01:07:19 (no idea who sent this signal):
2008-03-04 23:25:26 [ndbd] INFO -- Start initiated (version 5.0.45)
2008-03-05 01:07:19 [ndbd] ALERT -- Node 1: Forced node shutdown completed, restarting, initial. Initiated by signal 9.
2008-03-05 01:07:20 [ndbd] INFO -- Ndb has terminated (pid 11618) restarting
# This killed the whole cluster, as these two nodes formed a nodegroup.
# All other nodes shut down as well:
2008-03-05 01:06:55 [ndbd] INFO -- Error handler restarting system
2008-03-05 01:06:56 [ndbd] INFO -- Error handler shutdown completed - exiting
2008-03-05 01:06:56 [ndbd] ALERT -- Node 3: Forced node shutdown completed, restarting. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart
# All 3 restart attempts that follow fail with the same error on all 16 nodes:
2008-03-05 01:05:00 [ndbd] INFO -- Start initiated (version 5.0.45)
2008-03-05 01:06:54 [ndbd] INFO -- Error handler startup restarting system
2008-03-05 01:06:54 [ndbd] INFO -- Angel received ndbd startup failure count 1.
2008-03-05 01:06:54 [ndbd] INFO -- Error handler shutdown completed - exiting
2008-03-05 01:06:54 [ndbd] ALERT -- Node 2: Forced node shutdown completed, restarting. Occured during startphase 5. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
2008-03-05 01:06:54 [ndbd] INFO -- Ndb has terminated (pid 12291) restarting
# It is unclear why the restart fails with all nodes claiming "another node failed".
# The ndb_mgmd log shows an error on ndbd.3 occurring 3 times (once per restart attempt):
2008-03-05 01:07:33 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed, restarting. Occured during startphase 1. : 'Conflict when selecting restart type(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
# This error code is reported in exactly one MySQL bug report, by multiple users (http://bugs.mysql.com/bug.php?id=21509), with no real solution. The bug was closed because the reporters did not respond.
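To line up the "Forced node shutdown" alerts from all 16 node logs, something like the following could help. This is only a minimal sketch in Python: the function name `parse_alert` and the dictionary keys are made up for illustration, and the regular expressions are fitted solely to the log lines quoted in this message (including the log's own spelling "Occured"), so other ndbd versions may need adjustments.

```python
import re
from datetime import datetime

# Matches ALERT lines of the form quoted above, e.g.
# "2008-03-05 01:07:19 [ndbd] ALERT -- Node 1: Forced node shutdown completed, ..."
ALERT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<src>\w+)\] ALERT\s+-- Node (?P<node>\d+): "
    r"Forced node shutdown completed(?P<rest>.*)$"
)


def parse_alert(line):
    """Return a dict describing a forced-shutdown alert, or None for other lines.

    Hypothetical helper: fields that the log line does not mention are None.
    """
    m = ALERT_RE.match(line.strip())
    if m is None:
        return None
    rest = m.group("rest")
    sig = re.search(r"Initiated by signal (\d+)", rest)
    err = re.search(r"Caused by error (\d+)", rest)
    # "Occured" is spelled this way in the log output itself.
    phase = re.search(r"Occured during startphase (\d+)", rest)
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "source": m.group("src"),
        "node": int(m.group("node")),
        "signal": int(sig.group(1)) if sig else None,
        "error": int(err.group(1)) if err else None,
        "startphase": int(phase.group(1)) if phase else None,
    }
```

Feeding every line of every node log through `parse_alert`, dropping the `None` results and sorting by `"time"` gives a rough timeline of which node went down first. The timestamps quoted above suggest the node clocks may not be synchronized, so the ordering across machines should be read with care.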
-- Joachim Worringen, Software Architect, Dolphin Interconnect Solutions phone ++49/(0)228/324 08 17 - http://www.dolphinics.com




