One of our Riak node servers crashed, and since booting back up it has been
in a very inconsistent state where it responds to requests for a while
(a minute or two), then all requests time out for a little while, then it
goes back to responding to requests. It's been ~90 minutes since the crash and
reboot of the server, and we're still in this bad state.
We use the Bitcask data store, and looking through the logs I see a lot of
merge failures in the sasl-error.log file. See this gist for the output of
tail -n 2000 on sasl-error.log. The interesting bit is mostly at the bottom:
I'm not really sure how to proceed and would love some help on the matter.
For the time being we have this node pulled out of our load balancer and
the rest of the nodes see this node as down, so we're still functional in
production, but I'd obviously like to fix this up ASAP.
One final thing to note is that we have backups of the entire Riak data
directory from before the crash, which we could restore from if that helps.