| From | Sent On | Attachments |
|---|---|---|
| Maninya M | Feb 14, 2012 6:23 am | |
| Julian Elischer | Feb 14, 2012 8:56 am | |
| Jason Hellenthal | Feb 14, 2012 9:05 am | |
| Joshua Isom | Feb 14, 2012 9:12 am | |
| md...@FreeBSD.org | Feb 14, 2012 9:20 am | |
| Brandon Falk | Feb 14, 2012 9:25 am | |
| Rayson Ho | Feb 14, 2012 9:26 am | |
| Eitan Adler | Feb 14, 2012 10:04 am | |
| Uffe Jakobsen | Feb 14, 2012 10:43 am | |
| Julian Elischer | Feb 14, 2012 3:00 pm | |
| Jan Mikkelsen | Feb 14, 2012 3:50 pm | |
| Devin Teske | Feb 14, 2012 4:20 pm | |
| Rayson Ho | Feb 14, 2012 4:53 pm | |
| Jim Bryant | Feb 14, 2012 5:34 pm | |
| Jim Bryant | Feb 14, 2012 5:38 pm | |
| Julian Elischer | Feb 14, 2012 9:40 pm | |
| Da Rock | Feb 20, 2012 6:32 am | |
| Dieter BSD | Feb 20, 2012 10:57 am | |
| per...@pluto.rain.com | Feb 20, 2012 11:12 pm | |
| Julian Elischer | Feb 21, 2012 12:22 am | |
| Dieter BSD | Feb 24, 2012 1:09 pm | |
| Adam Vande More | Feb 24, 2012 1:28 pm | |

| Subject: | Re: OS support for fault tolerance | |
|---|---|---|
| From: | Julian Elischer (jul...@freebsd.org) | |
| Date: | Feb 21, 2012 12:22:25 am | |
| List: | org.freebsd.freebsd-hackers | |
On 2/20/12 6:32 AM, Da Rock wrote:
On 02/15/12 03:25, Brandon Falk wrote:
On 2/14/2012 12:05 PM, Jason Hellenthal wrote:
On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
On 2/14/12 6:23 AM, Maninya M wrote:
For multicore desktop computers, if one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core in main memory at specific intervals. Once a core fails, its previous state is retrieved from main memory, and the processes that were running on it are rescheduled on the remaining cores.
I read that the OS tolerates faults in large servers. I need to make it do this for a desktop OS. I assume I would have to change the scheduler. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine. How do I go about doing this? What exactly do I need to save for the "state" of the core? What else do I need to know? I have absolutely no experience with kernel programming or with FreeBSD. Any pointers to good sources about modifying the source code of FreeBSD would be greatly appreciated.
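As a rough illustration of the checkpoint-and-reschedule strategy described above, here is a toy user-space model in Python. It is only a sketch under heavy assumptions: a core's "state" is reduced to a list of process names, and `take_checkpoint` and `recover` are hypothetical helpers, not FreeBSD interfaces. Real per-core state (registers, held locks, in-flight hardware operations) is far harder to capture.

```python
import copy

def take_checkpoint(run_queues):
    """Snapshot the per-core run queues (hypothetical stand-in for 'core state')."""
    return copy.deepcopy(run_queues)

def recover(checkpoint, run_queues, failed_core):
    """Move the failed core's checkpointed processes onto the surviving cores."""
    survivors = sorted(c for c in run_queues if c != failed_core)
    if not survivors:
        raise RuntimeError("no surviving cores to reschedule onto")
    # Survivors keep whatever they are running now.
    new_queues = {c: list(run_queues[c]) for c in survivors}
    # Processes of the failed core are spread round-robin over the survivors.
    for i, proc in enumerate(checkpoint[failed_core]):
        new_queues[survivors[i % len(survivors)]].append(proc)
    return new_queues

queues = {0: ["sh"], 1: ["cc", "make"], 2: []}
snapshot = take_checkpoint(queues)
print(recover(snapshot, queues, failed_core=1))
```

Anything the failed core did between the last checkpoint and the failure is lost in this model, so the interval between checkpoints bounds how much work has to be redone.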
This question has always intrigued me, because I'm always amazed that people actually try. From my viewpoint, there's really not much you can do if the core that is currently holding the scheduler lock fails. And what do you mean by "fails"? Do you run constant diagnostics? How do you tell when it has failed? It'd be hard to detect that "multiply" has suddenly started giving bad results now and then.
If it just "stops" then you might be able to have a watchdog that notices, but what do you do when it was halfway through rearranging a list of items? First, you have to find out that it held the lock for the module, and then you have to find out what it had done and clean up the mess.
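The watchdog part, at least, is easy to picture. A minimal sketch in Python, assuming each core periodically calls a hypothetical `beat()` and a monitor polls `stalled()`; the class and method names are invented for illustration and are not a FreeBSD API:

```python
import time

class HeartbeatWatchdog:
    """Records the last heartbeat per core and reports cores that went silent."""

    def __init__(self, core_ids, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the sketch can be tested without sleeping
        now = clock()
        self.last_seen = {core: now for core in core_ids}

    def beat(self, core):
        """Called by a core to say it is still making progress."""
        self.last_seen[core] = self.clock()

    def stalled(self):
        """Return the cores whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(c for c, t in self.last_seen.items() if now - t > self.timeout)

# Simulated time: cores 0, 1 and 3 keep beating, core 2 stops.
t = [0.0]
wd = HeartbeatWatchdog([0, 1, 2, 3], timeout=1.0, clock=lambda: t[0])
t[0] = 0.9
for core in (0, 1, 3):
    wd.beat(core)
t[0] = 1.5
print(wd.stalled())
```

Note that this only answers "which core stopped". It says nothing about which locks the core held or what it was in the middle of changing, which is the hard part.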
This requires rewriting many, many parts of the kernel to remove "transient inconsistent states". And even then, what do you do if it was halfway through manipulating some hardware?
And when you've figured that all out, how do you cope with the mess it made because it was dying? Say, for example, it had started calculating bad memory offsets before writing out some stuff, and had written data over random memory?
But I'm interested in any answers people may have.
How about core redundancy? Effectively this would cut the number of available cores in half if you spread a process to run on two cores at the same time, but with an option to adjust this per process, etc. I don't see it as infeasible.
The overhead for all of the error checking and redundancy makes this idea pretty impractical. You'd have to have 2 cores to do the exact same thing, then some 'master' core that makes sure they're doing the right stuff, and if you really want to think about it... what if the core monitoring the cores fails... there's a threshold of when redundancy gets pointless.
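To make the cost concrete, here is a toy Python sketch of running the same work on two "cores" and comparing the results. The cores are simulated as plain functions, and `healthy_core`, `faulty_core` and `run_duplicated` are hypothetical names used only for this illustration:

```python
import operator

class CoreMismatch(Exception):
    """Raised when the two redundant executions disagree."""

def healthy_core(task, *args):
    return task(*args)

def faulty_core(task, *args):
    # Simulated hardware fault: the result is off by one.
    return task(*args) + 1

def run_duplicated(task, args, core_a, core_b):
    """Run the task on both cores and accept the result only if they agree."""
    a = core_a(task, *args)
    b = core_b(task, *args)
    if a != b:
        raise CoreMismatch(f"results differ: {a!r} vs {b!r}")
    return a

print(run_duplicated(operator.mul, (6, 7), healthy_core, healthy_core))
try:
    run_duplicated(operator.mul, (6, 7), healthy_core, faulty_core)
except CoreMismatch as exc:
    print("fault detected:", exc)
```

Every task costs twice the compute, and with only two copies a mismatch tells you that something is wrong but not which core is wrong. The comparison itself also has to run somewhere.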
Make no mistake here, I'm not really up with the guts of what this would require (the dog may not hunt at all). Consider me as the little boy throwing rocks at a hornets' nest :)
That out of the way, how about this scenario: why can't the master be dynamic amongst the cores? One core would be the master of any two other cores (not itself).
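One way to read the "dynamic master" idea is as a rotation: for each pair of cores doing redundant work, pick a checker from the remaining cores and change it every round. A small Python sketch under that assumption (`pick_master` is a hypothetical helper, and the round-robin policy is just one possible choice):

```python
def pick_master(cores, pair, round_no):
    """Choose a checker for the given pair from the cores outside the pair.

    The choice rotates with round_no so that no single core is the
    permanent master.
    """
    candidates = [c for c in cores if c not in pair]
    if not candidates:
        raise ValueError("no core left to act as master")
    return candidates[round_no % len(candidates)]

cores = [0, 1, 2, 3]
for round_no in range(4):
    print(round_no, pick_master(cores, (0, 1), round_no))
```

This spreads the master role around, but it does not by itself say who notices when the current master is the one misbehaving.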
Another thought (probably more sci-fi than anything else) is about using the cores as individuals which work as a team and fire a weak team member that is failing.
I have absolutely no idea how to accomplish this, but I thought it might fire a few neurons in someone who does... :)
There are so many reasons this would be ineffective on standard hardware that I have no idea where to begin, but see my email above.
Perhaps I'm missing out on something, but you can't check the checker (without infinite redundancy).
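The usual answer to "which copy is wrong" is majority voting over three or more copies, sketched below in Python. This is a generic illustration, not something proposed in the thread, and `majority_vote` and `NoMajority` are hypothetical names:

```python
from collections import Counter

class NoMajority(Exception):
    """Raised when no result is shared by more than half of the copies."""

def majority_vote(results):
    """Return the value produced by a strict majority of the redundant copies."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise NoMajority(f"no strict majority among {results!r}")
    return value

print(majority_vote([42, 42, 43]))  # one faulty copy is outvoted
```

The voter here is a single piece of code running on a single core, so it is still an unchecked checker. Adding more copies moves the problem around rather than removing it.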
Honestly, if you're worried about a core failing, please take your server cluster out of the 1000 deg C forge.
-Brandon
_______________________________________________ free...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "free...@freebsd.org"