Subject:Re: [zfs-discuss] Does your device honor write barriers?
From:Miles Nordin (car@Ivy.NET)
Date:Feb 10, 2009 2:56:09 pm

"ps" == Peter Schuller <> writes:

    ps> This is something I'm interested in, since my perception so
    ps> far has been that there is only one. Some driver writer has
    ps> the opinion that "flush cache" means to flush the cache, while
    ps> the file system writer uses "flush cache" to mean "I want a
    ps> write barrier here, or even perhaps durable persistence, but I
    ps> have no way to express that so I'm going to ask for a cache
    ps> flush request which I assume a battery backed RAID controller
    ps> will honor by battery-backed cache rather than actually
    ps> flushing drives".

well....if you want a write barrier, you can issue a flush-cache and wait for a reply before releasing writes behind the barrier. You will get what you want by doing this for certain. so a flush-cache is more forceful than a barrier, as long as you wait for the reply.

If you have a barrier command, though, you could insert it into the command stream and NOT wait for the reply, confident nothing would be reordered across it. This way you can preserve ordering without draining the write pipe.

I guess if you mistook a cache-flush for a barrier, and just threw it in there thinking ``it'll act as a barrier---I don't have to wait for a reply'', that could mess things up if someone else in the storage stack doesn't agree that flushes imply barriers.
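A minimal userland sketch of the flush-and-wait approach, assuming a POSIX-like system where `os.fsync()` blocks until the OS reports the flush complete. The helper names `write_all` and `write_with_barrier` are made up for illustration. Whether the flush really reaches stable storage depends on every layer below the syscall, which is exactly what this thread is asking.

```python
import os


def write_all(fd, data):
    """os.write() may write fewer bytes than requested, so loop until done."""
    while data:
        written = os.write(fd, data)
        data = data[written:]


def write_with_barrier(fd, before, after):
    """Emulate a write barrier with flush-and-wait (hypothetical helper).

    Nothing in `after` is even issued until the flush covering `before`
    has returned, so the two groups cannot be reordered across the flush.
    """
    for chunk in before:
        write_all(fd, chunk)
    os.fsync(fd)  # blocks until the OS reports the flush complete
    for chunk in after:
        write_all(fd, chunk)
```

The cost is that the write pipe sits idle while waiting for the reply; a true in-stream barrier command would avoid that wait.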

Here's a pathological case which may be disconnected from reality in a few spots but is interesting.

The OS thinks:

* SYNC implies a write barrier. No WRITE issued after the SYNC will be performed until all WRITE issued before the SYNC are done. Also, all WRITE issued before the SYNC will be persistent, once the SYNC has returned.

This is a SYNC that includes the idea of a write barrier. You can see the idea has two pieces.

The drive thinks:

* To avoid tricky problems, let us use the cargo-cult behavior of always acknowledging commands in the same order we receive them. Of course even if it's not necessary to do this, there's no reason to DISallow it.

* SYNC should not return until all the writes issued before the SYNC are on disk. WRITEs issued after the SYNC do not need to be on disk before returning, but they can be, because otherwise why would the host have sent them? It makes no sense. After all the goal is to get as much onto the disk as possible, isn't it? It might be Critical Business Data, so we should write it fast.

This SYNC does not include an implicit barrier. It matches what userland programmers expect from fsync(), because they really have no choice---there is not a tagged syscall queue! :)

Anyway, the fsync() interpretation is not the only possible interpretation of what SYNC could mean, but it seems to be the one closest to what our drive follows.

        initiator says         disk says                  disk does

 t   1: WRITE A  --->
 |   2: WRITE B  --->                                      writes A
 |   3:               <---     WRITE A is done
 |   4: SYNC     --->
 v   5: WRITE C  --->                                      writes C
     6: WRITE D  --->
     7: WRITE E  --->                                      writes B
     8:               <---     WRITE B is done
     9:               <---     SYNC is also done
    10:               <---     and WRITE C is done!
    11: WRITE F  --->                                      writes E
    12:               <---     WRITE E is done
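The timeline can be checked mechanically. This is a small sketch, not a model of any real drive: `barrier_violations` is a made-up helper that takes the order commands were issued and the order writes reached the disk, and reports every write that was persisted ahead of a write issued before an earlier SYNC.

```python
import math


def barrier_violations(issued, persisted):
    """Return (pre, post) pairs where `post` was issued after a SYNC but
    reached the disk before `pre`, which was issued before that SYNC."""
    # Epoch of a write = number of SYNCs issued before it.
    epoch = {}
    syncs_seen = 0
    for cmd in issued:
        if cmd == "SYNC":
            syncs_seen += 1
        else:
            epoch[cmd] = syncs_seen

    # Writes that never reached the disk get position "infinity".
    position = {name: i for i, name in enumerate(persisted)}

    violations = []
    for pre in epoch:
        for post in epoch:
            crosses_sync = epoch[pre] < epoch[post]
            post_first = position.get(post, math.inf) < position.get(pre, math.inf)
            if crosses_sync and post_first:
                violations.append((pre, post))
    return violations


# The timeline from this message: the disk writes A, C, B, E in that order.
issued = ["A", "B", "SYNC", "C", "D", "E", "F"]
persisted = ["A", "C", "B", "E"]
print(barrier_violations(issued, persisted))  # [('B', 'C')]
```

The single reported pair is the surprise: C is on disk before B, although B was issued before the SYNC and C after it.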

In this case the disk is not ``ignoring'' the SYNC command. The disk obeys its version of the rules, but 'C' is surprise-written before the initiator expects. If the initiator knew of the disk's rule interpretation, it would implement the write barrier this way and not be surprised:

        initiator says         disk says                  disk does

 t   1: WRITE A  --->
 |   2: WRITE B  --->                                      writes A
 |   3:               <---     WRITE A is done
 |   4: SYNC     --->
 v   5: nothing
     6: nothing
     7: nothing                                            writes B
     8:               <---     WRITE B is done
     9:               <---     SYNC is also done
    10: WRITE C  --->
    11: WRITE D  --->                                      writes C
    12:               <---     WRITE C is done
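A sketch of an initiator that behaves this way, holding back everything issued after a SYNC until the SYNC's reply arrives. The class `HoldingInitiator` and its `send` callback are hypothetical; they stand in for whatever actually puts commands on the wire.

```python
from collections import deque


class HoldingInitiator:
    """Sends nothing past a SYNC until that SYNC has been acknowledged."""

    def __init__(self, send):
        self.send = send            # callable that transmits one command
        self.sync_pending = False
        self.held = deque()         # commands waiting behind the SYNC

    def write(self, name):
        if self.sync_pending:
            self.held.append(name)  # the "nothing" steps in the timeline
        else:
            self.send(name)

    def sync(self):
        if self.sync_pending:
            self.held.append("SYNC")
        else:
            self.send("SYNC")
            self.sync_pending = True

    def sync_done(self):
        """Call when the device reports the outstanding SYNC complete."""
        self.sync_pending = False
        # Release held commands in order, stopping again at the next SYNC.
        while self.held and not self.sync_pending:
            cmd = self.held.popleft()
            if cmd == "SYNC":
                self.sync()
            else:
                self.send(cmd)


wire = []
initiator = HoldingInitiator(wire.append)
for cmd in ("A", "B"):
    initiator.write(cmd)
initiator.sync()
for cmd in ("C", "D"):
    initiator.write(cmd)
print(wire)   # ['A', 'B', 'SYNC']
initiator.sync_done()
print(wire)   # ['A', 'B', 'SYNC', 'C', 'D']
```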

of course this is slower, maybe MUCH slower if there is a long communication delay.

The two kinds of synchronize-cache I was talking about were one bit-setting which writes to NVRAM, another which demands write to disk even when there is NVRAM. I am not sure why the second kind of flush exists at all---probably standards-committee-creep. It is not really any of the filesystem's business. But for making a single easy-to-use tool where you don't have or don't trust NVRAM knobs inside the RAID admin tool, the two kinds of sync command could be useful!

A barrier command is hypothetical. I don't know if it exists. It would be a third kind of command, and I don't know if it's possible at all to issue it from userland---it was probably considered ``none of userland's business.'' Or maybe the spec says it's implied by SYNC, like the first initiator thinks---if so, I hope no iSCSI or FC stacks are confused like that disk was.

    ps> Is it the case that SCSI defines different "levels" of
    ps> "forcefulness" to flushing?

I think it is true there are levels of forcefulness, based on the old sometimes-you-must-disable-cache-flushes-if-you-have-NVRAM ZFS advice. But I don't think there is ever a case where the OS has business asking for the more forceful kind of NVRAM-disallowed flush. The barriers stuff is separate from that.

    ps> Btw, a great example of a "non-enterprisy" case where you do
    ps> care about persistence [instead of just barriers], is the
    ps> pretty common case of simply running a mail server.

yeah. in that case, you have to send a ``message accepted, your message's ID in my queue is ASDFGHJ123'' to the sending MTA. Until the receiver sends this message, the sending MTA is still obligated to resend, and the receiver is allowed to harmlessly lose the message. so it's sort of like NFSv3 batched commits or a replicated database, where the ``when'' matters across two systems, not just the ordering within one system.
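A sketch of what the receiving side has to do before it may send that acceptance, assuming a POSIX filesystem and a one-file-per-message queue directory. `accept_message` and the `tmp.` naming are made up for illustration and are not how any particular MTA does it.

```python
import os
import uuid


def accept_message(queue_dir, body):
    """Persist `body` in `queue_dir`; return its queue ID only afterwards."""
    msg_id = uuid.uuid4().hex
    tmp_path = os.path.join(queue_dir, "tmp." + msg_id)
    final_path = os.path.join(queue_dir, msg_id)

    with open(tmp_path, "xb") as f:   # "x": fail rather than overwrite
        f.write(body)
        f.flush()                     # Python buffer -> OS
        os.fsync(f.fileno())          # OS -> (hopefully) stable storage

    os.rename(tmp_path, final_path)

    # Flush the directory too, so the new name itself survives a crash.
    # Opening a directory this way is POSIX-only.
    dir_fd = os.open(queue_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    # Only now is it safe to tell the sender the message was accepted.
    return msg_id
```

Here the persistence itself matters, not only the ordering: if any layer acknowledges the flush early, the sender stops retrying while the message can still be lost.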

But for the ``lost my whole ZFS pool'' problem it's only barriers that matter. I think barriers get tangled up with the durability/persistence stuff because a cheap way for a disk driver to implement a barrier is to send the persistence command, then delay all writes after the barrier until the persistence command returns. I'm not sure this is the only way to make a barrier, though---I don't know SCSI well enough.

There is another thing we could worry about---maybe this other disk-level barrier command I do not know about does exist, for drives that have NCQ or TCQ, or for other parts of the stack like FC or the iSCSI initiator-to-target interface or AVS. It might mean ``do not reorder writes across this barrier, but I don't particularly need a reply.'' It should in theory be faster to use this command where possible, because you don't have to drain the device's work queue as you do while waiting for a reply to SYNCHRONIZE CACHE. If the ordering of the queue can be pushed all the way down to the inside of the hard drive, the latency of restarting writes after the barrier can be much less than draining the entire pipe's write stream, including FC or iSCSI as well. So there is significant incentive, especially on modern high throughput*latency storage, to use a barrier command instead of plain SYNCHRONIZE CACHE whenever possible. But what if some part of the stack ignores these hypothetical barriers, but *does* respect the simple SYNCHRONIZE CACHE persistence command? This first round of fsync()-based tools wouldn't catch it!
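To put a rough number on that incentive: while the initiator waits for the SYNCHRONIZE CACHE reply, the link carries nothing for at least one round trip, so each flush-and-wait forgoes about throughput times latency worth of writes. The sketch below is a back-of-envelope lower bound under that assumption only; the function name and the example figures are illustrative, not measurements.

```python
def idle_bytes_per_flush(bandwidth_bytes_per_s, round_trip_us):
    """Bytes the link could have carried during one idle round trip."""
    return bandwidth_bytes_per_s * round_trip_us // 1_000_000


# Assumed example figures: a 1 Gbit/s link (125,000,000 bytes/s).
print(idle_bytes_per_flush(125_000_000, 100))    # 12500  (0.1 ms round trip)
print(idle_bytes_per_flush(125_000_000, 2000))   # 250000 (2 ms round trip)
```

An in-stream barrier that needs no reply would, in principle, give up none of this.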

Here is another bit of FUD to worry about: the common advice for the lost SAN pools is, use multi-vdev pools. Well, that creepily matches just the scenario I described: if you need to make a write barrier that's valid across devices, the only way to do it is with the SYNCHRONIZE CACHE persistence command, because you need a reply from Device 1 before you can release writes behind the barrier to Device 2. You cannot perform that optimisation I described in the last paragraph of pushing the barrier past the high-latency link down into the device, because your initiator is the only thing these two devices have in common. Keeping the two disks in sync would in effect force the initiator to interpret the SYNC command as in my second example. However if you have just one device, you could write the filesystem to use this hypothetical barrier command instead of the persistence command for higher performance, maybe significantly higher on high-latency SAN. I don't guess that's actually what's going on though, just an interesting creepy speculation.
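The cross-device case can be sketched in userland with two ordinary files standing in for Device 1 and Device 2. This is an illustration only, not how ZFS orders writes across vdevs, and `commit_across` is a made-up name. The point is that the only ordering tool available is waiting for the first device's flush to return.

```python
import os


def commit_across(first, payload, second, commit_record):
    """Write `payload` to `first`, then `commit_record` to `second`, so the
    record is never issued before the payload's flush has been acknowledged.
    Both arguments are file objects opened in binary write mode."""
    first.write(payload)
    first.flush()                 # Python buffer -> OS
    os.fsync(first.fileno())      # wait for Device 1's reply

    # Only the initiator knows about both devices, so only it can order them.
    second.write(commit_record)
    second.flush()
    os.fsync(second.fileno())
```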