Subject: Re: [zfs-discuss] Does your device honor write barriers?
From: Peter Schuller (
Date: Feb 10, 2009 3:45:03 pm

well....if you want a write barrier, you can issue a flush-cache and wait for a reply before releasing writes behind the barrier. You will get what you want by doing this for certain. so a flush-cache is more forceful than a barrier, as long as you wait for the reply.

Yes, this is another peeve of mine since in many cases it is just so wasteful. Running an ACID compliant database on ZFS on non-battery backed storage is one example. (I started a brief conversation about fbarrier() on this list a while back. I really wish something like that would be adopted by some major OS:es, so that applications, not just kernel code, can make the distinction.)

If you have a barrier command, though, you could insert it into the command stream and NOT wait for the reply, confident nothing would be reordered across it. This way you can preserve ordering without draining the write pipe.

Also known as nirvana :)

Here's a pathological case which may be disconnected from reality in a few spots but is interesting.

The OS thinks:

* SYNC implies a write barrier. No WRITE issued after the SYNC will be performed until all WRITE issued before the SYNC are done. Also, all WRITE issued before the SYNC will be persistent, once the SYNC has returned.

This is a SYNC that includes the idea of a write barrier. You can see the idea has two pieces.

Yes. The complaint in my practical situation was that the driver had to be tweaked to not forward syncs in order to get decent performance, whereas forwarding them meant an *actual* cache flush regardless of battery-backed cache. Normally correctness was achieved, but expensively: not only were write barriers enforced by way of syncs, the syncs were literally interpreted as 'flush the cache' syncs even if the controller had battery-backed cache with the appropriate settings to allow it to cache.

* SYNC should not return until all the writes issued before the SYNC are on disk. WRITE's issued after the SYNC do not need to be on disk before returning, but they can be, because otherwise why would the host have sent them? It makes no sense. After all the goal is to get as much onto the disk as possible, isn't it? It might be Critical Business Data, so we should write it fast.

This SYNC does not include an implicit barrier. It matches what userland programmers expect from fsync(), because they really have no choice---there is not a tagged syscall queue! :)

Well, the SYNC did not include the barrier, but the context in which you use an fsync() to enforce a barrier is one where the application actually does wait for it to return before issuing dependent I/O. Have you seen this particular mode of operation be a problem in practice?

As far as I can tell any assumptions on the part of an application that calling fsync(), rather than fsync() actually returning, implies a write barrier, would be severely broken and likely to break pretty quick in practice on most setups.
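The safe pattern described above, where the application waits for fsync() to return before issuing dependent I/O, can be sketched roughly as follows. This is only an illustration: the file name, the record contents, and the helper `write_then_barrier` are made up for the example, and short writes are not handled.

```python
import os
import tempfile

def write_then_barrier(fd, first, second):
    """Write `first`, wait for fsync() to return, then write `second`.

    The ordering comes from waiting for the return, not from merely
    having called fsync().
    """
    os.write(fd, first)
    os.fsync(fd)          # blocks until the OS reports the flush as complete
    os.write(fd, second)  # dependent write, issued only after the return

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "journal.bin")  # hypothetical file name
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        write_then_barrier(fd, b"intent-record\n", b"commit-record\n")
        os.fsync(fd)      # make the second record durable as well
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        contents = f.read()
```

Whether the flush actually reaches stable storage still depends on the layers below honouring it, which is the subject of this thread.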

[snip example]

In this case the disk is not ``ignoring'' the SYNC command. The disk obeys its version of the rules, but 'C' is surprise-written before the initiator expects.

Note that even if the disk/controller didn't do this, the operating system's buffer cache is highly likely to introduce similar behavior internally. So unless you are using direct I/O, if you make this assumption on fsync() you're going to be toast even before the drive or storage controller become involved, in many practical setups.
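The difference between the two readings of SYNC can be shown with a toy model. The original example was snipped, so the command stream below (A, B, SYNC, C, all queued without waiting for replies) is an assumption, and `reorderable_pairs` is a hypothetical helper rather than anything from a real storage stack. It lists the pairs of writes that the device would be free to commit in the opposite order.

```python
from itertools import combinations

def reorderable_pairs(stream, sync_is_barrier):
    """Return (earlier, later) write pairs the device may commit out of order.

    `stream` is a list of write labels and the string "SYNC", all queued
    without the initiator waiting for any reply.
    """
    epoch = 0
    writes = []
    for cmd in stream:
        if cmd == "SYNC":
            epoch += 1            # a SYNC starts a new epoch
        else:
            writes.append((cmd, epoch))
    pairs = set()
    for (a, epoch_a), (b, epoch_b) in combinations(writes, 2):
        # With barrier semantics only writes in the same epoch may swap.
        # With persistence-only semantics nothing constrains the order.
        if not sync_is_barrier or epoch_a == epoch_b:
            pairs.add((a, b))
    return pairs

stream = ["A", "B", "SYNC", "C"]
print(sorted(reorderable_pairs(stream, sync_is_barrier=True)))
# [('A', 'B')]
print(sorted(reorderable_pairs(stream, sync_is_barrier=False)))
# [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

Under the persistence-only reading, 'C' may land before 'A' or 'B', which is the surprise write described above. An initiator that waits for the SYNC reply before sending 'C' avoids this under either reading.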

[snip correct case example]

of course this is slower, maybe MUCH slower if there is a long communication delay.

It's pretty interesting that the only commonly available method of introducing a write barrier is to use fsync(), which is a more demanding operation. At the same time, fsync() as actually implemented is very rarely useful to begin with, *except* in the context of a write barrier. That is, whenever you actually *do* want fsync() for persistence purposes, you almost always want some kind of write barrier functionality to go with it (in a preceding and/or subsequent operation). Normally simply committing a bunch of data to disk is not interesting unless you can have certain guarantees with respect to the consistency of that data.

So the common case of needing a write barrier is hindered by the only call available being a much more demanding operation, while the actual more demanding operation is not even useful that often in the absence of the previously mentioned less demanding barrier operation.

Doesn't feel that efficient that the entire world is relying on fsync(), does it...

The two kinds of synchronize-cache I was talking about were one bit-setting which writes to NVRAM, another which demands write to disk even when there is NVRAM.

That was my understanding, but I had never previously gotten the impression that there was such a distinction. At least not at the typical OS/block device layer - I am very weak on SCSI. For example most recently I considered this in the case of FreeBSD where there is BIO_FLUSH, but I'm not aware of any distinction such as the above.

Is it the case that SCSI has this, but that most OS:es simply don't use the more forceful version?

I am not sure why the second kind of flush exists at all---probably standards-committee-creep. It is not really any of the filesystem's business. but for making a single easy-to-use tool where you don't have or don't trust NVRAM knobs inside the RAID admin tool, the two kinds of sync command could be useful!

This is exactly my conclusion as well. I can see "really REALLY flush the cache" being useful as an administratively initiated command, decided upon by a human - similarly to issuing a 'sync' command to globally sync all buffers no matter what. But for filesystems/databases/other applications it truly should be completely irrelevant.

A barrier command is hypothetical. I don't know if it exists, and would be a third kind of command that I don't know if it's possible at all to issue it from userland---it was probably considered ``none of userland's business.'' or maybe the spec says it's implied by SYNC like the first initiator thinks---if so, I hope no iSCSI or FC stacks are confused like that disk was.

If it was considered none of userland's business I wholeheartedly disagree ;)

The conclusion from the previous discussion where I brought up fbarrier() seems to be that effectively you have an implicit fbarrier() in between each write() with ZFS. Imagine how nice it would be if the fbarrier() interface had been available, even if mapped to fsync() in most cases.

(Mapping fbarrier()->fsync() would not be a problem as long as fbarrier() is allowed to block.)
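A minimal sketch of that mapping, assuming an interface shape that is purely illustrative: fbarrier() is not an existing system call, and the `native_barrier` parameter below is a hypothetical stand-in for whatever cheaper ordering primitive an OS might one day expose. When no such primitive is available, the call falls back to fsync(), which is stronger than needed and blocks.

```python
import os
import tempfile

def fbarrier(fd, native_barrier=None):
    """Hypothetical call: writes issued before it are ordered ahead of
    writes issued after it. No durability is promised.

    `native_barrier` is an assumed OS facility, not a real API. Without
    it, fall back to fsync(), which over-delivers and blocks.
    """
    if native_barrier is not None:
        native_barrier(fd)
        return "barrier"
    os.fsync(fd)
    return "fsync"

with tempfile.TemporaryFile() as f:
    fd = f.fileno()
    os.write(fd, b"old-state\n")
    used = fbarrier(fd)           # today: degrades to a blocking fsync()
    os.write(fd, b"new-state\n")
```

An application written against such an interface would stay correct on every platform and would only get faster where a real barrier exists.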

I think it is true there are levels of forcefulness based on the old sometimes-you-must disable-cache-flushes-if-you-have-nvram ZFS-advice.

I become paranoid by such advice. What's stopping a RAID device from, for example, ACK:ing an I/O before it is even in the cache? I have not designed RAID controller firmware so I am not sure how likely that is, but I don't see it as an impossibility. Disabling flushing because you have battery backed nvram implies that your battery-backed nvram guarantees ordering of all writes, and that nothing is ever placed in said battery backed cache out of order. Can this assumption really be made safely under all circumstances without intimate knowledge of the controller? I would expect not.


There is another thing we could worry about---maybe this other disk-level barrier command I do not know about does exist, for drives that have NCQ or TCQ, or for other parts of the stack like FC or the iSCSI initiator-to-target interface or AVS. It might mean ``do not

I have been under the very vague-but-not-well-supported understanding that there is some kind of barrier support going on with SCSI. But it has never been such an issue for me; I have been more concerned with enabling applications to have such a thing propagated down the operating system stack at all to begin with.

Before it becomes relevant to me to start worrying about barriers at the SCSI level and whether it is implemented efficiently by certain drives or controllers, I have to see that propagation working to begin with. And as long as all the user land stuff does is an fsync(), we're not there yet.

The exception again is the in-kernel stuff which stands a better chance. On FreeBSD, last time I read ML posts/code about this, it's just a BIO_FLUSH and AFAIK there is no distinction made so ZFS is not able to communicate a barrier-as-opposed-to-sync.

reorder writes across this barrier but, I don't particularly need a reply.'' It should in theory be faster to use this command where possible because you don't have to drain the device's work queue as you do while waiting for a reply to SYNCHRONIZE CACHE---if the ordering of the queue can be pushed all the way down to the inside of the hard drive, the latency of restarting writes after the barrier can be much less than draining the entire pipe's write stream including FC or iSCSI as well, so there is significant incentive, especially on modern high throughput*latency storage, to use a barrier command instead of plain SYNCHRONIZE CACHE whenever possible.
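To get a feel for the size of that incentive, here is a deliberately crude back-of-the-envelope model. Every number and both function names are invented for illustration. It assumes each flush-and-wait stalls the pipe for one full round trip, while in-stream barriers cost nothing beyond one round trip for the final completion.

```python
def flush_and_wait_ms(transfer_ms, rtt_ms, barriers):
    """Total time when every barrier drains the pipe and waits for a reply."""
    return transfer_ms + barriers * rtt_ms

def in_stream_barrier_ms(transfer_ms, rtt_ms, barriers):
    """Total time when barriers ride along in the command stream.

    `barriers` is unused on purpose: in this toy model an in-stream
    barrier adds no stall, only the final completion costs a round trip.
    """
    return transfer_ms + rtt_ms

# Hypothetical workload: 1000 ms of pure transfer, 5 ms round trip
# to the target, 100 ordering points.
print(flush_and_wait_ms(1000, 5, 100))     # 1500
print(in_stream_barrier_ms(1000, 5, 100))  # 1005
```

The gap grows linearly with both the round-trip time and the number of ordering points, which is why the effect matters most on high-latency links.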

Now further imagine identifying and tagging distinct streams of I/O, such that your fsync() (where you want durability) of a handful of pages of data need not wait for those 50 MB of crap some other process wrote when copying some file. ;)

First things first...

But what if some part of the stack ignores these hypothetical barriers, but *does* respect the simple SYNCHRONIZE CACHE persistence command? This first round of fsync()-based tools wouldn't catch it!

On the other hand as a practical matter you can always choose to err on the side of caution and have barriers imply sync+wait still. If one is worried about these issues, and if the practical situation is such that you cannot trust the hardware/software involved, I suppose there is no other way out other than testing and adjusting until it works.

Here is another bit of FUD to worry about: the common advice for the lost SAN pools is, use multi-vdev pools. Well, that creepily matches just the scenario I described: if you need to make a write barrier that's valid across devices, the only way to do it is with the SYNCHRONIZE CACHE persistence command, because you need a reply from Device 1 before you can release writes behind the barrier to Device 2. You cannot perform that optimisation I described in the last paragraph of pushing the barrier past the high-latency link down into the device, because your initiator is the only thing these two devices have in common. Keeping the two disks in sync would in effect force the initiator to interpret the SYNC command as in my second example. However if you have just one device, you could write the filesystem to use this hypothetical barrier command instead of the persistence command for higher performance, maybe significantly higher on high-latency SAN. I don't guess that's actually what's going on though, just an interesting creepy speculation.

This would be another case where battery-backed (local to the machine) NVRAM fundamentally helps even in a situation where you are only concerned with the barrier, since there is no problem having a battery-backed controller sort out the disk-local problems itself by whatever combination of syncs/barriers, while giving instant barrier support (by effectively implementing sync-and-wait) to the operating system.

(Referring now to individual drives being battery-backed, not using a hardware raid volume.)