Subject: Crashed node has Bitcask merge errors on restart
From: Jeff Pollard (jeff@gmail.com)
Date: Aug 5, 2011 1:12:05 am
List: com.basho.lists.riak-users

Hey All,

We had one of our Riak nodes crash, and since being booted back up it has been in a very inconsistent state: it responds to requests for a while (a minute or two), then all requests time out for a little while, then it goes back to responding. It's been ~90 minutes since the crash and reboot of the server, and we're still in this bad state.

We use the Bitcask data store, and looking through the logs I see a lot of merge failures in sasl-error.log. See this gist for the output of tail -n 2000 on that file. The interesting bit is mostly at the bottom:

https://gist.github.com/1127104

I'm not really sure how to proceed and would love some help on the matter. For the time being we have this node pulled out of our load balancer and the rest of the nodes see this node as down, so we're still functional in production, but I'd obviously like to fix this up ASAP.

One final thing to note is that we have backups of the entire Riak data directory from before the crash, which we could restore from if that helps.