|Julian Elischer||Mar 17, 2003 12:22 pm|
|Alfred Perlstein||Mar 17, 2003 12:32 pm|
|Julian Elischer||Mar 17, 2003 12:39 pm|
|Julian Elischer||Mar 17, 2003 12:44 pm|
|Miguel Mendez||Mar 17, 2003 12:59 pm|
|Bakul Shah||Mar 17, 2003 1:26 pm|
|Poul-Henning Kamp||Mar 17, 2003 1:38 pm|
|Julian Elischer||Mar 17, 2003 1:51 pm|
|Julian Elischer||Mar 17, 2003 1:56 pm|
|Bakul Shah||Mar 17, 2003 2:04 pm|
|Bakul Shah||Mar 17, 2003 2:08 pm|
|Brooks Davis||Mar 17, 2003 2:12 pm|
|Julian Elischer||Mar 17, 2003 2:25 pm|
|Jeff Roberson||Mar 17, 2003 2:30 pm|
|Brad Knowles||Mar 17, 2003 2:39 pm|
|Greg 'groggy' Lehey||Mar 17, 2003 3:48 pm|
|Julian Elischer||Mar 17, 2003 4:04 pm|
|Terry Lambert||Mar 17, 2003 7:41 pm|
|Terry Lambert||Mar 17, 2003 7:43 pm|
|Terry Lambert||Mar 17, 2003 7:49 pm|
|Terry Lambert||Mar 17, 2003 7:58 pm|
|Jeff Roberson||Mar 17, 2003 8:02 pm|
|Terry Lambert||Mar 17, 2003 8:31 pm|
|Kenneth D. Merry||Mar 17, 2003 8:34 pm|
|Bakul Shah||Mar 17, 2003 8:59 pm|
|Poul-Henning Kamp||Mar 17, 2003 10:17 pm|
|Poul-Henning Kamp||Mar 17, 2003 10:23 pm|
|Greg 'groggy' Lehey||Mar 17, 2003 10:47 pm|
|Terry Lambert||Mar 17, 2003 10:48 pm|
|Greg 'groggy' Lehey||Mar 17, 2003 10:53 pm|
|Peter Wemm||Mar 17, 2003 11:45 pm|
|Dag-Erling Smørgrav||Mar 18, 2003 1:58 am|
|Miguel Mendez||Mar 18, 2003 2:08 am|
|Brad Knowles||Mar 18, 2003 5:59 am|
|Brad Knowles||Mar 18, 2003 6:04 am|
|Brad Knowles||Mar 18, 2003 6:25 am|
|Giorgos Keramidas||Mar 18, 2003 12:12 pm|
|Brandon D. Valentine||Mar 18, 2003 12:24 pm|
|Terry Lambert||Mar 18, 2003 2:41 pm|
|Brad Knowles||Mar 18, 2003 4:15 pm|
|Terry Lambert||Mar 18, 2003 4:33 pm|
|Subject:||Re: Anyone working on fsck?|
|From:||Bakul Shah (bak...@bitblocks.com)|
|Date:||Mar 17, 2003 8:59:23 pm|
UFS is the real problem here, not fsck. Its tradeoffs for improving normal access latencies may have been right in the past but not for modern big disks.
Sorry, but the track-to-track seek latency optimizations you are referring to are turned off, given the newfs defaults, and have been for a very long time.
I was thinking of the basic idea of cylinder groups as good for normal load, not so good for fsck when you have too many CGs. I wasn't thinking of what fsck does or does not do.
Basically, the problem is that for a BG fsck, it's not possible to lock access on a cylinder group basis, and then there is a hell of a lot of data on that drive.
Note that Julian said 6 hours to fsck a TB in the normal foreground mode.
What Julian needs is a Journalling or Log-structured FS.
All that Julian wants is a faster fsck without mucking with the FS!
While I agree with you that you do need a full consistency check, it is worth thinking about how one can avoid that whenever possible. For example, if you can know where the disk head was at the time of the crash (based on which blocks were being written), it should be possible to avoid a full check.
The easiest way to do this is to provide some sort of domain control, so that you limit the files you have to check to a smaller set of files per CG, so that you don't have to check all the files to check a particular CG -- and then preload the CG's for the sets of files you *do* have to check.
If you have to visit a CG (during fsck) you have already paid the cost of the seek and rotational latency. Journalling wouldn't help here if you still have a zillion CGs.
Another way of saying this is... don't put all your space in a single FS. 8-).
Or in effect treat each CG (or a group of CGs) as a self contained filesystem (for the purpose of physical allocation) and maintain explicit import/export lists for files that span them.
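A toy sketch of the self-contained-regions idea above: each region (a CG or group of CGs) keeps its own allocation state plus explicit lists of references that span regions, so a per-region check can run independently and only the spanning references need a global pass. All of these structures are invented for illustration; nothing here reflects actual UFS layout.

```python
# Hypothetical "region" = a cylinder group (or group of them) treated
# as a self-contained filesystem for allocation purposes.
class Region:
    def __init__(self, allocated):
        self.allocated = set(allocated)  # this region's own block bitmap
        self.local_refs = set()          # blocks referenced from within
        self.exports = set()             # blocks here that other regions reference

def check_local(region):
    # Per-region pass: every locally referenced block must be allocated
    # locally.  Regions can be checked independently (even in parallel).
    return region.local_refs <= region.allocated

def check_cross(regions, imports):
    # Global pass touches only the spanning references, not every inode.
    # imports: (region_index, block) pairs referenced from elsewhere.
    return all(b in regions[r].allocated and b in regions[r].exports
               for r, b in imports)

r0 = Region({1, 2, 3}); r0.local_refs = {1, 2}; r0.exports = {3}
r1 = Region({7, 8});    r1.local_refs = {8}
print(check_local(r0), check_local(r1))  # True True
print(check_cross([r0, r1], [(0, 3)]))   # True
```

The point of the import/export lists is that after a crash only the (presumably short) cross-region lists force you out of a single region's metadata.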
The problem with having to (effectively) read every inode and direct block on the disk is really insurmountable, I think.
That is why I was suggesting putting them in one (or a small number of) contiguous area(s). On a modern ATA100 or better disk you can read a GB in under a minute. Once the data is in-core you can divide up the checking among multiple processors. This is rather like distributed garbage collection: you only need to worry about graphs that cross a node boundary. Most structures will be contained in one node.
Even for UFS it is probably worth dividing fsck in two or more processes, one doing IO, one or more doing computation.
Typically only a few cylinder groups may be inconsistent in case of a crash. Maybe some info about which cylinder groups need to be checked can be stored, so that brute-force checking of all groups can be avoided.
This would work well... under laboratory test conditions. In the field, if the machine has a more general workload (or even a heterogeneous workload, but with a hell of a lot of files, like a big mail server), this falls down, as the number of bits marking "unupdated cylinder groups" becomes large.
Possible -- it is one of the ideas I can think of. I'd have to actually model it or simulate it beyond handwaving to know one way or other. May be useful in conjunction with other ideas.
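The "record which cylinder groups need checking" idea above can be sketched as a persistent dirty map: flag a group before touching its metadata, clear the flag only after the write is stably on disk, and after a crash check only the flagged groups. The interface is invented for illustration; a real filesystem would have to persist this map with careful write ordering, and (as noted above) a busy machine may end up with most groups flagged anyway.

```python
# Hypothetical dirty-cylinder-group map.  In a real implementation the
# map itself would live on disk and be written *before* the metadata.
class DirtyCGMap:
    def __init__(self, ncg):
        self.dirty = [False] * ncg

    def begin_write(self, cg):
        # Set before modifying the group, so a crash mid-write always
        # leaves the group flagged.
        self.dirty[cg] = True

    def complete_write(self, cg):
        # Cleared only once the group's metadata is known stable.
        self.dirty[cg] = False

    def cgs_to_check(self):
        # After a crash, fsck visits only the flagged groups.
        return [i for i, d in enumerate(self.dirty) if d]

m = DirtyCGMap(8)
m.begin_write(2)
m.begin_write(5)
m.complete_write(2)
print(m.cgs_to_check())  # [5] -- only group 5 was mid-write
```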
...AND... The problem is still that you must scan (practically) everything on the disk to identify the inode or indirect block that references the block in the cylinder group in question, and THAT's the problem. If you knew a small set of CGs that needed to be checked, vs. "all of them", it would *still* bust the cache, which is what takes all the time.
Assume, on average, each file goes into indirect blocks.
On my machine the average file size is 21KB (averaged over 4,000,000 files). Even with 8KB blocksize very few will have indirect blocks. I don't know how typical my file size distribution is but I suspect the average case is probably smaller files (I store lots of datasheets, manuals, databases, PDFs, MP3s, cvs repositories, compressed tars of old stuff).
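A back-of-envelope check of the claim above: a UFS inode has 12 direct block pointers (NDADDR), so with an 8 KB block size a file needs no indirect block until it exceeds 96 KB, and a 21 KB average file occupies only three blocks.

```python
import math

BLOCK = 8 * 1024
NDADDR = 12                    # direct block pointers in a UFS inode
direct_limit = NDADDR * BLOCK  # bytes addressable with no indirection

avg_file = 21 * 1024           # the 21 KB average cited above
blocks_needed = math.ceil(avg_file / BLOCK)

print(direct_limit)            # 98304 bytes, i.e. 96 KB
print(blocks_needed)           # 3 blocks; well short of an indirect block
```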
But in any case wouldn't going forward from inodes make more sense? This is like a standard tracing garbage collection algorithm: blocks that are not reachable are free. Even for a 1 TB system with 8K blocks you need only a 2^(40-13-3) == 16 MB bitmap, or some multiple if you want more than 1 bit of state.
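The bitmap arithmetic above checks out, and the tracing direction (mark everything the inodes reference, sweep the rest) can be sketched in a few lines. The inode representation here is hypothetical; a real pass would walk direct and indirect block pointers on disk.

```python
DISK = 2 ** 40           # 1 TB
BLOCK = 2 ** 13          # 8 KB blocks
nblocks = DISK // BLOCK  # 2**27 blocks
bitmap_bytes = nblocks // 8
print(bitmap_bytes)      # 16777216 bytes == 16 MB == 2**(40-13-3)

def mark_reachable(inodes):
    # inodes: mapping ino -> list of block numbers it references
    # (stand-in for walking direct and indirect pointers).  Any
    # allocated block not marked here is, by this definition, free.
    reached = set()
    for blocks in inodes.values():
        reached.update(blocks)
    return reached

print(sorted(mark_reachable({1: [10, 11], 2: [11, 42]})))  # [10, 11, 42]
```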
The problem is reverse mapping a bit in a CG bitmap to the file that reference it... 8^p.
Why would you want to do that?!
Multithreading fsck would be an incredibly bad idea.
It depends on the actual algorithm.
Personally, I recommend using a different FS if you are going to create a honking big FS as a single partition. 8-(.
There are other issues with smaller partitions. I'd rather have a single logical filesystem where all the space can be used. If physically it is implemented as a number of smaller systems, that is okay. Also note that people can now create big honking files: video streams at 2 GB per hour, and even then you are compromising a bit on quality.