Subject: Re: [zfs-discuss] Does your device honor write barriers?
From: Peter Schuller
Date: Feb 10, 2009 1:36:08 pm

ps> A test I did was to write a minimalistic program that simply
ps> appended one block (8k in this case), fsync():ing in between,
ps> timing each fsync().

were you the one that suggested writing backwards to make the difference bigger? I guess you found that trick unnecessary---speeds differed enough when writing forwards?

No, that must have been someone else.

In this case I did a sequential test precisely because caching drives or RAID controllers should trivially be able to optimize this particular use case of sequential writing. In other words, I wanted to maximize the chance of hitting the optimization in case caching is in fact disabled.
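For readers who want to try the timing test described above, here is a minimal sketch in Python. It is not the original program; the function names (`time_fsyncs`, `summarize`) and the default block count are assumptions made for illustration. It appends 8k blocks to a file, calls fsync() after each one, and records how long each fsync() took.

```python
import os
import statistics
import time

BLOCK_SIZE = 8 * 1024  # 8k blocks, matching the test described above


def time_fsyncs(path, count=100, block_size=BLOCK_SIZE):
    """Append `count` blocks to `path`, fsync() after each one,
    and return the list of fsync() latencies in seconds."""
    block = b"\0" * block_size
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for _ in range(count):
            os.write(fd, block)
            start = time.perf_counter()
            os.fsync(fd)  # only the fsync() itself is timed
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


def summarize(latencies):
    """Return min/median/max latency in milliseconds."""
    return {
        "min_ms": min(latencies) * 1000,
        "median_ms": statistics.median(latencies) * 1000,
        "max_ms": max(latencies) * 1000,
    }
```

To use it, point `time_fsyncs()` at a file on the pool or device under test and look at the median from `summarize()`. The absolute figures only mean something when compared against what the hardware could plausibly deliver from physical disk.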

ps> * Write-back caching on the RAID controller (lowest latency).

Did you find a good way to disable this case so you could distinguish between the second two?

Yes. I disabled things specifically and got the expected results latency-wise. In particular, with the RAID controller cache disabled and drive caches not explicitly disabled, I got latencies indicating the drives did caching (too slow to be the RAID controller, too fast to be on physical disk). This I then confirmed to be the case even according to the administrative tool.

like, I thought there was some type of SYNCHRONIZE CACHE with a certain flag-bit set, which demands a flush to disk not to NVRAM, and that years ago ZFS was mistakenly sending this overly aggressive command instead of the normal ``just make it persistent'' sync, so there was that stale best-practice advice to lobotomize the array by ordering it to treat the two commands equivalent.

This is something I'm interested in, since my perception so far has been that there is only one. Some driver writer has the opinion that "flush cache" means to flush the cache, while the file system writer uses "flush cache" to mean "I want a write barrier here, or even perhaps durable persistence, but I have no way to express that so I'm going to ask for a cache flush request which I assume a battery backed RAID controller will honor by battery-backed cache rather than actually flushing drives".

Hence the impedance mismatch and a whole bunch of problems.

Is it the case that SCSI defines different "levels" of "forcefulness" to flushing? If so, I'd love to hear any specifics so I can then raise the question with relevant operating systems as to why there is no distinction between these cases at the block device level in the kernel(s).

Could you be referring to FUA/Force Unit Access perhaps, rather than a second type of cache flush?

Maybe it would be possible to send that old SYNC command on purpose. Then you could start the tool by comparing speeds with to-disk-SYNC and normal-nvramallowed-SYNC: if they're the same speed and oddly fast, then you know the array controller is lobotomized, and the second half of the test is thus invalid. If they're different speeds, then you can trust the second half is actually testing the disks, so long as you send old-SYNC. If they're the same speed but slow, then you don't have NVRAM.

True, though the absolute speeds should tell you quite a lot even without the comparison.
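The three-way comparison proposed above can be written down as a small decision function. This is only a sketch of the reasoning, in Python. It takes two already-measured latencies as input and does not send any SCSI commands itself; whether two distinct flush commands exist at all is the open question here. The function name and both thresholds (`fast_ms`, `same_ratio`) are hypothetical and would need tuning for real hardware.

```python
def interpret_sync_timings(to_disk_ms, nvram_ok_ms, fast_ms=1.0, same_ratio=2.0):
    """Classify two measured flush latencies (in milliseconds).

    to_disk_ms:  latency of the flush that demands data reach the disk
    nvram_ok_ms: latency of the flush that NVRAM is allowed to satisfy
    fast_ms:     assumed cutoff below which a latency is "oddly fast"
    same_ratio:  assumed factor within which two latencies count as equal
    """
    slower = max(to_disk_ms, nvram_ok_ms)
    faster = min(to_disk_ms, nvram_ok_ms)
    same_speed = slower <= faster * same_ratio

    if not same_speed:
        # The two flush types behave differently, so the to-disk flush
        # can be trusted to exercise the disks.
        return "distinct"
    if slower <= fast_ms:
        # Same speed and oddly fast: the controller treats both alike.
        return "controller-ignores-to-disk-flush"
    # Same speed but slow: nothing is absorbing the flushes.
    return "no-nvram"
```

For example, `interpret_sync_timings(8.0, 0.2)` returns `"distinct"`, while `interpret_sync_timings(0.2, 0.3)` returns `"controller-ignores-to-disk-flush"`.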

ps> you could write an ever increasing sequence of values to
ps> deterministic but pseudo-random pages in some larger file,
ps> such that you can, after a powerfail test, read them back in
ps> and test the sequence of numbers (after sorting it) for the
ps> existence of holes.

yeah, the perl script I linked to requires a ``server'' which is not rebooted and a ``client'' which is rebooted during the test, and the client checks in its behavior with the server. I think the server should be unnecessary---the script should just know itself, know in the check phase what it would have written. I guess the original script author is thinking more of the SYNC command and less of the write barrier, but in terms of losing pools or _corrupting_ databases, it's really only barriers that matter, and SYNC matters only because it's also an implicit barrier, doesn't matter exactly when it returns.

Correct. You need the external server to test durability, assuming you are not satisfied with timing based tests. And as you point out, the write barrier test is fundamentally different.

so....I guess you would need the listening-server to test SYNC is not returning early, like if you want to detect that someone has disabled the ZIL, or if you have an n-tier database system with retries at higher tiers or a system that's distributed or doing replication, then you do care when SYNC returns and need the not-rebooted listening-server. But you should be able to make a serverless tool just to check write barriers and thus corruption-proofness.
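A serverless barrier check along these lines might look like the following Python sketch. It is one possible reading of the idea, not an existing tool. The page size, the record layout, and all names (`page_order`, `write_sequence`, `find_holes`) are assumptions. A seed fixes the pseudo-random page order, so the check phase can recompute what would have been written without an external server. Each page is written at most once, and fsync() between writes stands in for the barrier.

```python
import os
import random
import struct

PAGE_SIZE = 8192
MAGIC = b"WBTEST01"
RECORD = struct.Struct("<8sQ")  # magic marker + 64-bit sequence number


def page_order(num_pages, seed):
    """Deterministic pseudo-random permutation of page indexes."""
    order = list(range(num_pages))
    random.Random(seed).shuffle(order)
    return order


def write_sequence(path, num_pages, seed, limit=None):
    """Write phase: run this until power is cut (or `limit` writes are done)."""
    order = page_order(num_pages, seed)
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.ftruncate(fd, num_pages * PAGE_SIZE)  # zero-filled file
        os.fsync(fd)
        for seq, page in enumerate(order[:limit], start=1):
            payload = RECORD.pack(MAGIC, seq).ljust(PAGE_SIZE, b"\0")
            os.lseek(fd, page * PAGE_SIZE, os.SEEK_SET)
            os.write(fd, payload)
            os.fsync(fd)  # barrier between write `seq` and write `seq + 1`
    finally:
        os.close(fd)


def find_holes(path, num_pages, seed):
    """Check phase: return sequence numbers missing below the highest one found.

    A non-empty result means a later write reached stable storage
    while an earlier one did not, i.e. the barrier was not honored.
    """
    order = page_order(num_pages, seed)
    found = set()
    with open(path, "rb") as f:
        for seq, page in enumerate(order, start=1):
            f.seek(page * PAGE_SIZE)
            header = f.read(RECORD.size)
            if len(header) == RECORD.size:
                magic, stored = RECORD.unpack(header)
                if magic == MAGIC and stored == seq:
                    found.add(seq)
    highest = max(found, default=0)
    return [s for s in range(1, highest + 1) if s not in found]
```

After the power-fail test, call `find_holes()` with the same `num_pages` and `seed`. An empty list only says that no reordering was observed in that run. It says nothing about whether fsync() returned early, which is the durability question that still needs the external server.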


Btw, a great example of a "non-enterprisy" case where you do care about persistence, is the pretty common case of simply running a mail server. Just for anyone reading the above paragraph and concluding it doesn't matter to mere mortals ;)