19 messages in org.opensolaris.zfs-discuss
Re: [zfs-discuss] Does your device honor write barriers?

From                  Sent on
Bryant Eadon          Feb 10, 2009 10:35 am
Peter Schuller        Feb 10, 2009 10:52 am
Miles Nordin          Feb 10, 2009 11:23 am
Chris Ridd            Feb 10, 2009 11:27 am
Tim                   Feb 10, 2009 12:19 pm
Peter Schuller        Feb 10, 2009 1:36 pm
David Collier-Brown   Feb 10, 2009 1:55 pm
Miles Nordin          Feb 10, 2009 2:56 pm
Peter Schuller        Feb 10, 2009 3:45 pm
Bob Friesenhahn       Feb 10, 2009 4:08 pm
Jeff Bonwick          Feb 10, 2009 4:41 pm
Toby Thain            Feb 10, 2009 5:23 pm
Miles Nordin          Feb 10, 2009 6:10 pm
Frank Cusack          Feb 10, 2009 7:36 pm
Toby Thain            Feb 10, 2009 8:53 pm
Bryant Eadon          Feb 10, 2009 10:28 pm
Eric D. Mudama        Feb 11, 2009 12:25 am
David Dyer-Bennet     Feb 11, 2009 7:27 am
Frank Cusack          Feb 11, 2009 8:24 am
Subject: Re: [zfs-discuss] Does your device honor write barriers?
From:    Miles Nordin (car@Ivy.NET)
Date:    Feb 10, 2009 6:10:20 pm
List:    org.opensolaris.zfs-discuss

"jb" == Jeff Bonwick <Jeff@sun.com> writes: "tt" == Toby Thain <to@telegraphics.com.au> writes:

jb> Not if the disk drive just *ignores* barrier and flush-cache
jb> commands and returns success. Some consumer drives really do
jb> exactly that. That's the issue that people are asking ZFS to
jb> work around.

Some are asking ZFS to work around the issue, which I think is not crazy: ZFS is already designed around failures clustered together in space, so why not failures clustered together in time as well? But I'm not in their camp, not asking for that workaround. It could never deliver the kind of integrity to which the checksum tree aspires. I'm asking for a solution to the overall problem, mostly by outing, avoiding, and fixing the broken devices and storage stacks.

jb> If it were possible to detect such disks, I'd add code to ZFS
jb> that would simply refuse to use them. Unfortunately, there is
jb> no reliable way to test the functioning of synchronize-cache
jb> programmatically.

I think the situation is closer to this: there's no way to test for it when adding/attaching/replacing a device that is quick enough that the user doesn't notice it happening, and that has few enough false positives that you don't mind supporting it when it goes wrong, or defending its correctness when it damages vendor relationships.
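The closest thing to a programmatic check I can think of is a timing heuristic, and it illustrates exactly why Jeff is right. Here's a sketch, mine, made up just now, not anything that ships anywhere: time small synchronous writes. A bare 7200 RPM disk needs at least one platter revolution, roughly 8 ms, to honestly flush, so it can't complete much more than a couple hundred flushes per second. Measure thousands of fsync()s per second and either the flush is being eaten somewhere volatile, or the device has legitimate NVRAM behind it, and that second case is precisely the false positive that makes this useless as an automatic refuse-to-attach test:

/* flushrate.c -- crude flush-latency heuristic.  A sketch only:
 * fsync() reaches the disk's cache-flush command only if the
 * filesystem and driver stack pass it through, so run it on a
 * filesystem you trust to issue the flush. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    const int iters = 200;
    char buf[512];
    struct timeval t0, t1;
    double secs;
    int fd, i;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file on device under test>\n", argv[0]);
        return (1);
    }
    if ((fd = open(argv[1], O_WRONLY | O_CREAT, 0600)) == -1) {
        perror("open");
        return (1);
    }
    memset(buf, 'x', sizeof (buf));
    gettimeofday(&t0, NULL);
    for (i = 0; i < iters; i++) {
        if (pwrite(fd, buf, sizeof (buf), 0) != (ssize_t)sizeof (buf) ||
            fsync(fd) != 0) {
            perror("pwrite/fsync");
            return (1);
        }
    }
    gettimeofday(&t1, NULL);
    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d synchronous writes in %.2f s = %.0f fsync()s/sec\n",
        iters, secs, iters / secs);
    printf("much over ~250/sec on a bare 7200 RPM disk is suspicious\n");
    (void) close(fd);
    return (0);
}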

However, I think developing a qualification _procedure_ that sysadmins can actually follow, possibly involving cord-yanking, one decisive enough that we can start sharing results instead of saying ``a major vendor'' and covering our asses all the time, is quite within reach. And I think it's all but certain to uncover all sorts of problems that are not in devices, too.
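To make that concrete, here's the shape of a cord-yank harness I have in mind, again just a sketch of my own with a made-up name and layout. The writer acknowledges a block number only after fsync() returns, and you capture the acknowledgements on a second machine (pipe stdout through ssh) so they survive the crash. After yanking the cord and rebooting, the checker verifies every acknowledged block. Any acknowledged block that's missing means a flush was acknowledged and then lost, whether by the drive or by something above it:

/* yanktest.c -- sketch of a cord-yank qualification harness.
 * write mode:  ./yanktest write /pool/scratch | ssh otherbox 'cat > acked'
 *              ...then pull the plug mid-run.
 * check mode:  ./yanktest check /pool/scratch < acked   (after reboot) */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLK 512

int
main(int argc, char **argv)
{
    char buf[BLK], want[BLK];
    long seq, last = -1;
    int fd;

    if (argc != 3) {
        fprintf(stderr, "usage: %s write|check <file>\n", argv[0]);
        return (1);
    }
    if (strcmp(argv[1], "write") == 0) {
        fd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd == -1) { perror("open"); return (1); }
        for (seq = 0; ; seq++) {
            memset(buf, 0, BLK);
            (void) snprintf(buf, BLK, "%ld", seq);
            if (pwrite(fd, buf, BLK, seq * BLK) != BLK || fsync(fd) != 0) {
                perror("pwrite/fsync");
                return (1);
            }
            /* acknowledge ONLY after fsync() has returned */
            printf("%ld\n", seq);
            (void) fflush(stdout);
        }
    }
    /* check mode: the last sequence number on stdin was acknowledged,
     * so every block up to it must have survived the power cut. */
    while (scanf("%ld", &seq) == 1)
        last = seq;
    if ((fd = open(argv[2], O_RDONLY)) == -1) {
        perror("open");
        return (1);
    }
    for (seq = 0; seq <= last; seq++) {
        memset(want, 0, BLK);
        (void) snprintf(want, BLK, "%ld", seq);
        if (pread(fd, buf, BLK, seq * BLK) != BLK ||
            memcmp(buf, want, BLK) != 0) {
            printf("LOST acknowledged block %ld\n", seq);
            return (1);
        }
    }
    printf("all %ld acknowledged blocks intact\n", last + 1);
    return (0);
}

A real procedure would checksum the payloads and randomize the offsets so a drive reordering writes can't pass by luck, but even this much would be decisive enough to start putting names next to results.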

tt> This applies equally to virtual disks, of course (can we get
tt> VirtualBox to NOT ignore flushes by default?)

haha but then people would say it performs so much worse than VMware! :)

To be honest I have not absolutely verified this problem. I just hazily remember reading an email here or a bug report about it.
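(For the record, if I'm remembering the VirtualBox manual right, there is a knob for it, an extradata key, and flushes on the emulated IDE controller are dropped unless you clear it. The device path depends on which controller and LUN your virtual disk sits on, so check it against your own VM config rather than trusting my memory:

VBoxManage setextradata "VM name" \
    "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0

If that key behaves the way I remember, it would confirm Toby's point about the default.)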