| From | Sent On | Attachments |
|---|---|---|
| Maninya M | Feb 14, 2012 6:23 am | |
| Julian Elischer | Feb 14, 2012 8:56 am | |
| Jason Hellenthal | Feb 14, 2012 9:05 am | |
| Joshua Isom | Feb 14, 2012 9:12 am | |
| md...@FreeBSD.org | Feb 14, 2012 9:20 am | |
| Brandon Falk | Feb 14, 2012 9:25 am | |
| Rayson Ho | Feb 14, 2012 9:26 am | |
| Eitan Adler | Feb 14, 2012 10:04 am | |
| Uffe Jakobsen | Feb 14, 2012 10:43 am | |
| Julian Elischer | Feb 14, 2012 3:00 pm | |
| Jan Mikkelsen | Feb 14, 2012 3:50 pm | |
| Devin Teske | Feb 14, 2012 4:20 pm | |
| Rayson Ho | Feb 14, 2012 4:53 pm | |
| Jim Bryant | Feb 14, 2012 5:34 pm | |
| Jim Bryant | Feb 14, 2012 5:38 pm | |
| Julian Elischer | Feb 14, 2012 9:40 pm | |
| Da Rock | Feb 20, 2012 6:32 am | |
| Dieter BSD | Feb 20, 2012 10:57 am | |
| per...@pluto.rain.com | Feb 20, 2012 11:12 pm | |
| Julian Elischer | Feb 21, 2012 12:22 am | |
| Dieter BSD | Feb 24, 2012 1:09 pm | |
| Adam Vande More | Feb 24, 2012 1:28 pm | |

| Subject: | Re: OS support for fault tolerance | |
|---|---|---|
| From: | Julian Elischer (jul...@freebsd.org) | |
| Date: | Feb 14, 2012 3:00:25 pm | |
| List: | org.freebsd.freebsd-hackers | |
On 2/14/12 9:27 AM, Rayson Ho wrote:
> On Tue, Feb 14, 2012 at 11:57 AM, Julian Elischer <jul...@freebsd.org> wrote:
>> but I'm interested in any answers people may have
>
> The way other OSes handle this is by detecting any abnormal number of faults (sometimes it's not the fault of the hardware, e.g. when a particle from outer space hits a core and flips a bit), and then they disable the core(s).
>
> Solaris & mainframe (z/OS) handle it this way, but you should google and find more info since I don't remember all the details.
>
> Also, see this presentation: "Getting to know the Solaris Fault Management Architecture (FMA)": http://www.prefetch.net/presentations/SolarisFaultManagement_Presentation.pdf
True, but you can't guarantee that a CPU is going to fail in a way that you can detect like that. What if the clock just stops? I believe that even those systems that support CPU deactivation on error only catch some percentage of the problems, and that sometimes it was more a case of "bring up the system without CPU X after it all crashed in flames".
Tandem and other systems in the old days used to be able to cope with dying CPUs pretty well, but they had support from top to bottom and the software was written with 'clustering' in mind.
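
To make the two detection ideas in this exchange concrete, here is a toy sketch: retiring a core once it reports too many faults within a time window, and a heartbeat check run from a healthy core to catch one that has simply gone silent. This is only an illustration under stated assumptions, not how Solaris FMA, z/OS, or Tandem actually implement it. The class name, method names, thresholds, and timeouts are all hypothetical.

```python
from collections import defaultdict, deque


class CoreHealthMonitor:
    """Toy model (hypothetical): retire a core after too many faults in a
    time window, or when it stops sending heartbeats."""

    def __init__(self, max_faults=3, window=60.0, heartbeat_timeout=5.0):
        self.max_faults = max_faults                # assumed threshold
        self.window = window                        # seconds, assumed
        self.heartbeat_timeout = heartbeat_timeout  # seconds, assumed
        self.faults = defaultdict(deque)  # core id -> recent fault timestamps
        self.last_beat = {}               # core id -> time of last heartbeat
        self.offline = set()

    def heartbeat(self, core, now):
        """Called periodically by each core that is still running."""
        self.last_beat[core] = now

    def report_fault(self, core, now):
        """Record a detected fault; return True if the core is offline."""
        recent = self.faults[core]
        recent.append(now)
        # Drop faults that have aged out of the window, so an isolated
        # bit flip does not count against the core forever.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.max_faults:
            self.offline.add(core)
        return core in self.offline

    def check_heartbeats(self, now):
        """Run from a healthy core: retire cores that have gone silent."""
        for core, seen in self.last_beat.items():
            if now - seen > self.heartbeat_timeout:
                self.offline.add(core)
        return set(self.offline)
```

The fault counter only helps when the failing core can still report its own errors. The heartbeat check covers the stopped-clock case, but it relies on some other core staying healthy enough to notice, and neither mechanism says anything about recovering the work that was running on the dead core.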
> Rayson
>
> =================================
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
_______________________________________________ free...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "free...@freebsd.org"





