Subject: Re: OS support for fault tolerance
From: Jan Mikkelsen (janm...@transactionware.com)
Date: Feb 14, 2012 3:50:58 pm
On 15/02/2012, at 3:57 AM, Julian Elischer wrote:
On 2/14/12 6:23 AM, Maninya M wrote:
On multicore desktop computers, if one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main memory. Once a core fails, its previous state is retrieved from main memory, and the processes that were running on it are rescheduled on the remaining cores.
I read that the OS tolerates faults in large servers. I need to make it do this for a desktop OS. I assume I would have to change the scheduler. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine. How do I go about doing this? What exactly do I need to save for the "state" of the core? What else do I need to know? I have absolutely no experience with kernel programming or with FreeBSD. Any pointers to good sources about modifying the source code of FreeBSD would be greatly appreciated.
This question has always intrigued me, because I'm always amazed that people actually try. From my viewpoint, there's really not much you can do if the core that is currently holding the scheduler lock fails. And what do you mean by 'fails'? Do you run constant diagnostics? How do you tell when it has failed? It'd be hard to detect that 'multiply' has suddenly started giving bad results now and then.
If it just "stops" then you might be able to have a watchdog that notices, but what do you do when it was halfway through rearranging a list of items? First, you have to find out that it held the lock for the module, and then you have to find out what it had done and clean up the mess.
This requires rewriting many, many parts of the kernel to remove 'transient inconsistent states'. And even then, what do you do if it was halfway through manipulating some hardware?
And when you've figured all that out, how do you cope with the mess it made because it was dying? Say, for example, it had started calculating bad memory offsets before writing out some stuff, and had written data over random memory?
But I'm interested in any answers people may have.
Back in the '90s I spent a bunch of time looking at and using systems that
dealt with this kind of failure.
There are two basic approaches: With software support and without. The basic
distinction is what the hardware can do when something breaks. Is it able to
continue, or must it stop immediately?
Tandem had systems with both approaches:
The NonStop proprietary operating system had nodes with lock-step processors and
lots of error checking that would stop immediately when something broke. A CPU
failure turned into a node halt. There was a bunch of work to have nodes move
their state around so that terminal sessions would not be interrupted,
transactions would be rolled back, and everything would be in a consistent state.
The Integrity Unix range was based on MIPS RISC/os, with a lot of work at
Tandem. We had the R2000 and later the R3000 based systems. They had three CPUs
all in lock step with voting ("triple modular redundancy"), and entirely
duplicated memory, all with ECC. Redundant busses, separate cabinets for
controllers and separate cabinets for each side of the disk mirror. You could
pull out a CPU board and memory board, show a manager, and then plug them back in.
Tandem claimed to have removed 80% of panics from the kernel, and changed the
device driver architecture so that they could recover from some driver faults by
reinitialising driver state on a running system.
We still had some outages on this system, all caused by software. It was also
expensive: AUD$1,000,000 for a system with the same underlying CPU/memory as a
$30k MIPS workstation at the time. It was also slower because of the error
checking overhead. However, it did crash much less than the MIPS boxes.
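To make the voting idea concrete, here is a minimal software sketch of triple modular redundancy. It is only an illustration: the real Integrity systems voted in hardware on lock-stepped CPUs, and the names `vote`, `run_tmr`, `good` and `bad` are made up for this example.

```python
from collections import Counter

def vote(results):
    """Return (majority_value, indices_of_dissenting_units) for three results."""
    if len(results) != 3:
        raise ValueError("expected exactly three results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        # All three units disagree, so there is nothing to trust.
        raise RuntimeError("no majority: all three units disagree")
    faulty = [i for i, r in enumerate(results) if r != value]
    return value, faulty

def run_tmr(units, x):
    """Run the same computation on three units and vote on the outputs."""
    return vote([unit(x) for unit in units])

# Hypothetical units: two healthy multipliers and one that has gone bad.
good = lambda x: x * 7
bad = lambda x: x * 7 + 1

value, faulty = run_tmr([good, bad, good], 6)
print(value, faulty)  # 42 [1]
```

The voter masks a single bad unit and also tells you which one to pull out. It does nothing for a bug that all three units share, which fits the observation that the remaining outages were caused by software.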
Coming back to the multicore issue:
The problem when a core fails is that it has affected more than its own state.
It will be holding locks on shared resources and may have corrupted shared
memory or asked a device to do the wrong thing. By the time you detect a fault
in a core, it is too late. Checkpointing to main memory means that you need to
be able to roll back to a checkpoint, and replay operations you know about. That
involves more than CPU core state; it includes process, file and device state.
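A toy sketch of checkpoint, rollback and replay may help show why the log of operations matters as much as the saved state. Everything here is hypothetical (the `Recoverable` class, the `OPS` table, the "balance" state), and it only works because every operation is logged and touches nothing outside `state`. That is exactly the property a failing core in a real kernel does not give you.

```python
import copy

# Hypothetical operations; each is a pure function of (balance, amount).
OPS = {
    "deposit": lambda balance, amount: balance + amount,
    "withdraw": lambda balance, amount: balance - amount,
}

class Recoverable:
    """Toy state with a checkpoint and a log of operations since that checkpoint."""

    def __init__(self):
        self.state = {"balance": 0}
        self._checkpoint = copy.deepcopy(self.state)
        self._log = []

    def apply(self, op, amount):
        if op not in OPS:
            raise ValueError("unknown operation: %r" % (op,))
        self._log.append((op, amount))  # log first, then mutate
        self.state["balance"] = OPS[op](self.state["balance"], amount)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def recover(self):
        """Discard the possibly corrupt state, restore the checkpoint, replay the log."""
        self.state = copy.deepcopy(self._checkpoint)
        for op, amount in self._log:
            self.state["balance"] = OPS[op](self.state["balance"], amount)

r = Recoverable()
r.apply("deposit", 100)
r.checkpoint()
r.apply("withdraw", 30)
r.state["balance"] = -999  # simulate corruption by a failing core
r.recover()
print(r.state["balance"])  # 70
```

Restoring the checkpoint alone would give 100 and silently lose the withdrawal. It is the replay of known operations that gets back to 70.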
The Tandem lesson is that it is much easier when you involve the higher level
software in dealing with these issues. Building a system where you can make the
application programmer ignorant of the need to deal with failure is much harder
than when you expose units of work to the application programmer and can just
fail a node and replay the work somewhere else. Transactions are your friend.
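As a rough sketch of "fail a node and replay the work somewhere else": the unit of work runs on one node, nothing is visible until a single commit point, and a node failure just means the unit is retried on the next node. The names `NodeFailure`, `run_unit`, `flaky` and `healthy` are invented for this example, and real transaction systems do far more (logging, isolation, durable commit).

```python
class NodeFailure(Exception):
    """Raised when a node halts while running a unit of work."""

def run_unit(nodes, unit, committed):
    """Try a unit of work on each node in turn; commit only on success."""
    for node in nodes:
        try:
            result = node(unit)   # work happens on the node's private state
        except NodeFailure:
            continue              # fail the node, replay the unit elsewhere
        committed[unit] = result  # single commit point
        return result
    raise RuntimeError("no surviving node could run unit %r" % (unit,))

# Hypothetical nodes: one that always halts, one that works.
def flaky(unit):
    raise NodeFailure("node halted")

def healthy(unit):
    return unit * unit

committed = {}
print(run_unit([flaky, healthy], 3, committed))  # 9
print(committed)                                 # {3: 9}
```

Because the failed node never reached the commit point, there is nothing to clean up after it. That is the part that is so hard to get when a core dies in the middle of kernel code.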
There is lots of literature on this stuff. My favourite is "Transaction Processing:
Concepts and Techniques" (Gray & Reuter), which has a bunch of interesting stuff.
There is also literature on the underlying techniques. I can't recall references at the
moment; they're on the bookshelf at home.