41 messages in com.selenic.mercurial-devel: [RFC] kbfiles: an extension to track ...
Subject: [RFC] kbfiles: an extension to track binary files with less wasted bandwidth
From: Andrew Pritchard
Date: Jul 26, 2011 11:23:17 am

The goal of kbfiles is to maintain the benefit of version tracking for binary files without requiring clones and pulls to download versions of large, incompressible files that will likely never be needed. These files are replaced, according to the user's configuration, with small standin files containing only the SHA1 sum of the binary file. Mercurial then tracks these standin files, keeping history small, while the binary files are retrieved only as needed (when updating, for example).
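As a rough illustration of the standin idea, the sketch below hashes a binary file in chunks and produces the text a standin might contain. The function names and the exact standin layout (hex digest followed by a newline) are assumptions made for this example, not taken from the kbfiles source.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large assets are never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def standin_text(path):
    """Return the assumed standin content: the digest plus a newline."""
    return sha1_of_file(path) + "\n"
```

Whatever the size of the binary, the standin stays at 41 bytes under this assumed layout, which is what keeps the tracked history small.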

The reasoning behind this is that binary files are frequently large and already compressed as part of their format, so compressed diffs do not track their changes well. Many types of software development (game development being a particularly strong example) involve large volumes of binary assets. Without an extension like kbfiles, a clone can end up being a single many-gigabyte transaction, whereas kbfiles splits this into smaller transactions and avoids transferring most of the data altogether. Kbfiles also avoids diffing the binary files, transferring them as they are in any given revision. Finally, the size of data stored locally is greatly decreased for common use cases, in which old versions of binary assets are not often needed.

The typical use case is to have these binary files available on a central server, though retrieving bfiles from both SSH and HTTP Mercurial repositories is supported in the wire protocol. There are three locations that will be checked to find the required big files:
- The repository-local cache, in .hg/kilnbfiles (this will be changed as needed with the name of the extension);
- The configurable system cache, defaulting to $HOME/.kilnbfiles on POSIX-y systems and AppData\Local\kilnbfiles on Windows; and
- The default or default-push remote paths in .hg/hgrc.
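The lookup order described above can be sketched roughly as follows. This is a minimal sketch, not the extension's code: the function names, the idea that cached files are stored under their SHA1 as the file name, and the `fetch_remote` callback are all assumptions made for illustration.

```python
import os


def default_system_cache():
    """Return the default system cache location described in the text."""
    if os.name == "nt":
        # Assumption: LOCALAPPDATA points at AppData\Local on Windows.
        base = os.environ.get(
            "LOCALAPPDATA", os.path.join(os.path.expanduser("~"), "AppData", "Local")
        )
        return os.path.join(base, "kilnbfiles")
    return os.path.join(os.path.expanduser("~"), ".kilnbfiles")


def find_bfile(repo_root, sha, system_cache, fetch_remote=None):
    """Check the repository-local cache, then the system cache, then the remote.

    `fetch_remote` is a hypothetical callable that downloads the bfile and
    returns its local path (or None); it stands in for the remote paths.
    """
    candidates = [
        os.path.join(repo_root, ".hg", "kilnbfiles", sha),  # repo-local cache
        os.path.join(system_cache, sha),                    # system cache
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    if fetch_remote is not None:
        return fetch_remote(sha)
    return None
```

Checking the local caches first means the network is only touched when neither cache already holds the requested version.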

The system cache may be on network storage, so that an entire network of developers may share their files over NFS or SMB.

When a file is committed as a bfile, it is copied to the repository-local cache and to the system cache, and its standin is written in .kbf/. When pushing changes to bfiles to a remote repository, any changed bfiles are uploaded with the changesets. When pulling, though, only the changesets are transferred, greatly reducing clone sizes for repositories containing heavily-edited binary files. Then, when updating to a revision with changes to bfiles, the required versions of the files are retrieved from either the system cache or the remote repository.
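The commit step might look something like the sketch below. It is an illustration under stated assumptions: the helper name `commit_bfile` is hypothetical, cached copies are assumed to be named by their SHA1, and the standin is assumed to mirror the file's relative path under .kbf/.

```python
import hashlib
import os
import shutil


def commit_bfile(repo_root, rel_path, system_cache):
    """Copy a bfile into both caches and write its standin under .kbf/."""
    src = os.path.join(repo_root, rel_path)
    # Reads the whole file at once for brevity; real code would hash in chunks.
    with open(src, "rb") as f:
        sha = hashlib.sha1(f.read()).hexdigest()

    # Copy into the repository-local cache and the system cache.
    local_cache = os.path.join(repo_root, ".hg", "kilnbfiles")
    for cache in (local_cache, system_cache):
        os.makedirs(cache, exist_ok=True)
        shutil.copyfile(src, os.path.join(cache, sha))

    # Write the standin that Mercurial would track in place of the bfile.
    standin = os.path.join(repo_root, ".kbf", rel_path)
    os.makedirs(os.path.dirname(standin), exist_ok=True)
    with open(standin, "w") as f:
        f.write(sha + "\n")
    return sha
```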

kbfiles has several mechanisms for defending its repositories against damage from non-kbfiles clients:
- add a 'kbfiles' line to .hg/requires in order to keep non-kbfiles clients from breaking things;
- add a 'bfilestore' server capability, without which the client will not attempt to interact with a remote repository when the local repository uses kbfiles; and
- prepend 'kbfiles\n' to the output of the heads command when serving kbfiles repositories to prevent non-kbfiles clients from creating broken clones.
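The three guards above can be modeled as small pure functions. The sketch below is only an illustration of the logic: the function names are hypothetical, and it operates on plain strings and sets rather than on Mercurial's own repository or wire-protocol objects.

```python
def add_requirement(requires_text, name="kbfiles"):
    """Return the contents of .hg/requires with `name` added if missing."""
    lines = requires_text.splitlines()
    if name not in lines:
        lines.append(name)
    return "".join(line + "\n" for line in lines)


def may_talk_to_remote(local_uses_kbfiles, remote_capabilities):
    """A kbfiles repository only talks to servers advertising 'bfilestore'."""
    return (not local_uses_kbfiles) or ("bfilestore" in remote_capabilities)


def guard_heads_output(heads_output, serving_kbfiles_repo):
    """Prepend 'kbfiles\\n' so clients without the extension fail to parse it."""
    if serving_kbfiles_repo:
        return "kbfiles\n" + heads_output
    return heads_output
```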

The last of these is fairly likely to be controversial, but it currently seems to be necessary. Although the HG19 bundle format as described on the wiki would appear to solve the problem with its feature strings, it does not appear to be implemented yet. If and when it is, kbfiles will replace the heads command hack with a 'kbfiles' bundle feature. Unfortunately, the result of the current approach is that non-kbfiles clients throw an exception with no mention of kbfiles; we could not find a way to make such clients display a useful error message while consistently preventing them from uploading changesets without the corresponding bfiles or creating clones that are missing files.

As it stands, as long as either the client or the server has the current version of kbfiles or either repo has been touched by the current version of kbfiles, there are no known cases that cause missing bfiles.

The extension wraps most operations on repositories to handle bfiles specially. It also explicitly handles cooperation with several other extensions, including fetch, purge, and rebase.

Bfile transfer is implemented via three additions to the wire protocol on servers with the extension loaded:
- statbfile, which returns 0, 1, or 2 depending on whether the requested bfile (as identified by the SHA1 sum) is present and valid, invalid, or missing;
- getbfile, which returns the requested bfile along with its length to allow the ssh protocol to avoid reading beyond its end (without modifying Mercurial core code that attempts to encode passed-in file-like objects as bundles); and
- putbfile, which hashes and verifies the received data and places it in the repository-local and system caches.
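The server-side behavior of the three commands can be sketched against a plain directory store. This is a simplified model, not the wire-protocol implementation: the function signatures, the constant names, and the convention of naming stored files by their SHA1 are assumptions, and whole files are read into memory for brevity.

```python
import hashlib
import os

# Status codes as described for statbfile.
PRESENT_VALID, INVALID, MISSING = 0, 1, 2


def statbfile(store_dir, sha):
    """Return 0 if the bfile is present and valid, 1 if invalid, 2 if missing."""
    path = os.path.join(store_dir, sha)
    if not os.path.isfile(path):
        return MISSING
    with open(path, "rb") as f:
        actual = hashlib.sha1(f.read()).hexdigest()
    return PRESENT_VALID if actual == sha else INVALID


def getbfile(store_dir, sha):
    """Return (length, data) so the reader knows exactly where the file ends."""
    with open(os.path.join(store_dir, sha), "rb") as f:
        data = f.read()
    return len(data), data


def putbfile(store_dirs, sha, data):
    """Verify `data` against `sha`, then store it in every given cache."""
    if hashlib.sha1(data).hexdigest() != sha:
        return False  # reject uploads whose content does not match the hash
    for store_dir in store_dirs:
        os.makedirs(store_dir, exist_ok=True)
        with open(os.path.join(store_dir, sha), "wb") as f:
            f.write(data)
    return True
```

Sending the length ahead of the data matters on a stream transport such as ssh, where the reader has no other way to tell where one response stops.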

The extension also currently supports talking to previous versions of Kiln that still serve bfiles over a different interface, via POST and GET requests to $REPO/bfile/$SHA. Although we would prefer to keep this in the extension, we are able and willing to pull it out into its own meta-extension if necessary.

We are still in the process of cleaning up the code to ship with Mercurial. Before the 'real' pull request, we will collapse it into a single patch in the hgext directory. Planned changes before then include removing compatibility shims for old versions of Mercurial and some minor rebranding to remove mentions of 'Kiln' from the code and repository layout.

We would prefer to avoid renaming the extension if possible, both to avoid adding extra code to handle both old repositories and new ones and to reflect the heritage of the extension, but we understand that parts of the Mercurial community may be opposed to the name 'kbfiles', and as such we are willing to rename to 'terafiles' if the name would otherwise block the extension from shipping with Mercurial.