The question is: are we planning to handle >95% of the errors on >99%
of the hardware we run on, or are we really planning to spend years
trying to design something that would require special hardware?
I assume this started as: "Oh look, most CPUs have multiple cores
these days, maybe we could play with fault tolerance". That would
be useful if CPU cores failed a lot, but in reality what fails is
disks, disks, controllers, disks, random other things, and disks.
That's assuming you have avoided the garbage-quality stuff and
have the system on a UPS. If you have enough ports you can add
more disks and mirror them, or run some other flavor of RAID.
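The mirroring idea is simple enough to sketch in a few lines. This is a toy model, not real RAID-1 (the "disks" are just byte arrays, and the class name is made up), but it shows why a mirror survives a single disk failure:

```python
class Mirror:
    """Toy RAID-1: every write goes to both disks, so a read can be
    served from whichever disk is still alive."""
    def __init__(self, size):
        self.disks = [bytearray(size), bytearray(size)]
        self.failed = [False, False]

    def write(self, offset, data):
        # Writes are duplicated to every disk in the mirror set.
        for d in self.disks:
            d[offset:offset + len(data)] = data

    def read(self, offset, length):
        # Read from the first disk that hasn't failed.
        for i, d in enumerate(self.disks):
            if not self.failed[i]:
                return bytes(d[offset:offset + length])
        raise IOError("both mirrors failed")

m = Mirror(64)
m.write(0, b"important")
m.failed[0] = True                    # disk 0 dies
assert m.read(0, 9) == b"important"   # data survives on disk 1
```

The obvious limitation: it only protects against whole-disk failure, not against a controller that writes garbage to both disks at once.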
The next step is to duplicate everything. Not by looking for
a mainboard with redundant everything, but by simply adding
another computer. And rather than getting two of the same machine,
you're better off if they are different, so that they don't have
the same bugs.
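The "different machines, different bugs" point is the same idea as running two independently written implementations of the same computation and cross-checking them. A tiny illustration (the function names are made up): one version computed with a loop, one with a closed-form formula, so a bug in either one shows up as a disagreement rather than a silently wrong answer on both:

```python
def sum_squares_loop(n):
    # Implementation 1: brute-force accumulation.
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total

def sum_squares_formula(n):
    # Implementation 2: closed-form n(n+1)(2n+1)/6.
    return n * (n + 1) * (2 * n + 1) // 6

# Cross-check: a defect in one implementation is unlikely to
# produce the exact same wrong answer in the other.
for n in (0, 1, 10, 1000):
    assert sum_squares_loop(n) == sum_squares_formula(n)
```

The same reasoning applies at the hardware level: two identical boards share identical firmware bugs, two different ones mostly don't.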
The problem then is how to feed both machines the same inputs,
and compare the outputs. Do we need a third machine to supervise?
That just moves the problem: how do we avoid trouble when *it* breaks?
Can we have each machine keep an eye on the other, avoiding the
need for a third machine?
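The mutual-monitoring scheme can be sketched as a pair of watchdogs: each machine periodically "pets" the watchdog the other machine holds for it, and a machine that goes quiet past a timeout is presumed dead. The class name and timeouts below are made up for illustration; a real setup would send heartbeats over a network link:

```python
import threading
import time

class Watchdog:
    """One node's view of its peer: if the peer stops calling
    beat() within the timeout, it is presumed dead."""
    def __init__(self, timeout=0.5):
        self.timeout = timeout
        self.last_beat = time.monotonic()
        self.lock = threading.Lock()

    def beat(self):
        # Called whenever a heartbeat arrives from the peer.
        with self.lock:
            self.last_beat = time.monotonic()

    def peer_alive(self):
        with self.lock:
            return time.monotonic() - self.last_beat < self.timeout

# Each machine holds a watchdog for the other -- no third machine.
a_watches_b = Watchdog()
b_watches_a = Watchdog()

a_watches_b.beat()   # heartbeat from B
b_watches_a.beat()   # heartbeat from A
assert a_watches_b.peer_alive() and b_watches_a.peer_alive()

time.sleep(0.6)      # B goes silent past the timeout...
b_watches_a.beat()   # ...while A keeps reporting in
assert b_watches_a.peer_alive()          # B still trusts A
assert not a_watches_b.peer_alive()      # A declares B dead
```

The remaining hard part, which this sketch dodges, is the split-brain case: if the link between the two dies, each side declares the other dead and both try to take over.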