| Subject: | Re: Larry McVoy's slides on cache coherent clusters | |
|---|---|---|
| From: | ne...@xyz.com (ne...@xyz.com) | |
| Date: | Jun 27, 2002 5:34:57 pm | |
| List: | org.freebsd.freebsd-arch | |
So you know where I'm coming from, I used to be an engineer in the base OS group (I owned the disk driver) at Sequent, the company with the best NUMA product out there even if we went the way of Beta VCRs.
The slides seem to be talking about NUMA (Non-Uniform Memory Access) machines which use CC (Cache Coherency). These types of machines implement a cluster purely in hardware from what I have read of them (single memory address space is really distributed shared memory coordinated in hardware by high speed switches, etc.) and use much faster/lower latency communication methods. Examples would be SGI's Origin2000 and Origin3000 and maybe Sun's Starfire line. The big advantage is scaling and redundancy, since no one part of the system is essential for the whole thing working (which is how clusters should also work ideally).
We (Sequent) were the first and best implementation out there with our NUMA-Q line... SGI & Sun both rely on huge memory backbones rather than finesse in software to achieve performance and they still fall short. DG tried too but I've heard nothing of them of late, sort of like the US vice presidents (quick, name the last 4).
NUMA buys you no redundancy in the real sense of the word: the hardware architecture is more complex and thus more likely to fail. Of course, since you have a number of quads (or whatever an implementation may choose for the basic unit), once you've had a hardware fault you can easily remove a single quad and reboot. Unfortunately your uptime requirements have gone to hell the second a reboot is needed. As far as scaling goes, you are right: code with minimal SMP awareness (Oracle) running on a top notch OS will scale incredibly well.
I think this ties in to Mr. Lambert's question about the future of FreeBSD very much. I think the NUMA model will eventually dominate all future large systems in the next 10 years (and SMP will come to be standard on small systems) and FreeBSD will probably have to run efficiently on them to compete with Linux etc. Having seamless clusters (by this I mean clusters that work as a single system with one system image and identity) would probably be an interesting problem also, since only a few OSes have made any serious attempt at implementing them. PVM, MPI, and MOSIX cannot, for example, migrate I/O among machines (network load balancing maybe?).
*TO ME* clustering and single memory image are contradictory. You cluster for redundancy, that is to get rid of any and all single points of failure. If the janitor trips over a power cord thus taking a big bite out of your memory space you'll quickly realize that this is not redundancy.
At Sequent we found that the #1 key to scalability in a NUMA world was to NEVER move memory from one quad to the next. This means that programs should try to migrate between procs on the same quad if possible, and only move off quad as a last resort. Memory allocation has to be very aware of the fact that it is running on a collection of SMP boxen with high costs to go from proc-to-proc and prohibitive costs to go from quad-to-quad. Of course it follows that I/O must never be allowed to move over the memory backplane if possible. We had quad aware routing at all layers of the I/O stack to achieve this.
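To make the placement rule above concrete, here is a minimal sketch in Python. It is not Sequent's scheduler: the `Cpu` type, the function names, and the cost constants are all assumptions made up for illustration. The only thing taken from the text is the ordering of costs (same proc, then another proc on the same quad, then another quad).

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical relative costs. No real numbers are given here, only the
# ordering: same proc < another proc on the same quad < another quad.
COST_SAME_CPU = 0
COST_SAME_QUAD = 10
COST_OTHER_QUAD = 1000


@dataclass(frozen=True)
class Cpu:
    quad: int   # which quad (SMP building block) the processor sits on
    index: int  # processor number within that quad


def migration_cost(home: Cpu, candidate: Cpu) -> int:
    """Relative cost of running a task on `candidate` when its memory
    lives on the quad of `home`."""
    if candidate == home:
        return COST_SAME_CPU
    if candidate.quad == home.quad:
        return COST_SAME_QUAD
    return COST_OTHER_QUAD


def pick_cpu(home: Cpu, idle: List[Cpu]) -> Optional[Cpu]:
    """Pick the cheapest idle processor, leaving the home quad only as a
    last resort. Ties are broken by quad and processor number."""
    if not idle:
        return None
    return min(idle, key=lambda c: (migration_cost(home, c), c.quad, c.index))


home = Cpu(quad=0, index=1)
# An idle proc on the home quad wins over an idle proc on another quad.
on_quad = pick_cpu(home, [Cpu(1, 0), Cpu(0, 3)])
# Only when the home quad has nothing idle does the task move off quad.
off_quad = pick_cpu(home, [Cpu(1, 0), Cpu(2, 2)])
```

The large gap between the two non-zero costs is the point of the sketch: a memory allocator or I/O router following the same idea would rank candidate locations with the same three-level cost rather than treating all processors as equal.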
Of course YMMV. Last I looked neither Sun nor SGI had figured out how to squeeze out the performance and scalability that we had. IBM, who bought, chewed up, and then threw Sequent away, didn't seem to have the corporate acuity to realize that there were lessons to be learned from small companies. Oh well, I'm bitter, sue me, no, forget that, IBM probably will.
In another email on the same thread, Matt Dillon wrote:
NUMA then becomes just another, faster transport mechanism. That is the direction I believe the BSDs will take... transparent clustering with NUMA transport, network transport, or a hybrid of both.
Matt: If you don't have a single memory image you don't have NUMA. If you do have it, then the transport mechanism will be saturated just moving "RAM" around and will not be available for network, I/O or whatever else.
-michael
michael at michael dot galassi dot org