atom feed7 messages in com.selenic.mercurial-devel[RFC] [revlog] parent-delta
FromSent OnAttachments
Benoit BoissinotAug 31, 2009 4:57
Matt MackallAug 31, 2009 6:19 pm 
Benoit BoissinotSep 1, 2009 12:15 am 
Matt MackallSep 1, 2009 1:18 am 
Benoit BoissinotSep 1, 2009 2:10 am 
Matt MackallSep 1, 2009 11:28 am 
Benoit BoissinotSep 1, 2009 12:44 pm 
Subject:[RFC] [revlog] parent-delta
From:Benoit Boissinot (
Date:Aug 31, 2009 4:57:09 pm

After playing with the topo-sort algorithm for revlog shrinking [1], I've build a script to emulate the result of parent delta [2] on big revlogs.

The attached script can be used to dump information about a revlog in the following way:

$ python dump .hg/store/00manifest.i > dump

This will create a file with some information about the compressed size of each revision, compressed size of the delta to each parent and the compressed size of the delta to tip.

Then with:

$ python emul < dump

the script will output the estimated size of the revlog if we were using parent delta.

Currently the emulation of delta parent works in the following way: - it computes all the possibles deltas (parent1, parent2 and tip) - if the distance between the of the base delta chain is higher than a certain threshold ("read factor", the value can be changed in the source) it will discard the potential delta. For example a threshold of 6 will only allow a given delta if the distance between the base offset and the current offset is smaller than 6 times the compressed full version. - from the remaining deltas, it chooses the smallest or insert a full (compressed) version

Results with the linux kernel manifest (158k revs, about 29k files):

original manifest: 704803499 bytes (672 MiB)

dump is about 14MB and took some 4 hours to generate (I can try to put it online if some people are interested).

parent-delta manifest with a "read factor" of 6: 363655853 bytes ( 346.8 MiB)

parent-delta manifest with a "read factor of 4: 464570388 bytes ( 443.0 MiB)

Notes about read factor: In our current implementation of tip-delta, the read factor was given in term of the uncompressed size of the full rev. Can someone give some insights to explain if using the compressed size is more (or less) meaningful?

Notes about the topo-sort+parent-delta: I need to check the percentage of parent-delta that gets inserted (vs. full revs or tip-delta). But if they are high enough and if we change the wire protocol to allow parent-delta as well, it would be much cheaper to do topo-sort on the fly since deltas would not be recomputed for every rev (as is the case today if we try to do topo-sort during pull).



[1] [2]

-- :wq

from mercurial import revlog, util, node import sys, os try: import json except ImportError: import simplejson as json

def dump(file_, progress=True): """dump the contents of an index file, suitable for diff-parent
calculations""" r = revlog.revlog(util.opener(os.getcwd(), audit=False), file_) revs = [] total = float(len(r)) for i in r: if (i+1) % 10 == 0: sys.stderr.write("Completion: %d/%d | %2.2f%%\r" % (i+1, total,
100*i/total)) n = r.node(i) pp = r.parentrevs(i) fulltext = len(revlog.compress(r.revision(n))[1]) tipdelta = len(revlog.compress(r.revdiff(i-1, i))[1]) ppdelta = [] for p in pp: if p == node.nullrev: s = None else: s = len(revlog.compress(r.revdiff(p, i))[1]) ppdelta.append(s) revs.append([node.hex(n), fulltext] + list(pp) + ppdelta + [tipdelta]) return json.dumps(revs, indent=0)

def load(s): return json.loads(s)

def emul(revs, maxreadfactor): bases = {} length = 0 for r, e in enumerate(revs): n, fulltext, p1, p2, pdelta1, pdelta2, tipdelta = e difflength = { r-1: tipdelta, p1: pdelta1, p2: pdelta2, }

candidates = [r-1, p1, p2] readdistances = [(difflength[p], length-bases[p]+difflength[p], p) for p in candidates if p != node.nullrev] # filter readdistances = [s for s in readdistances if s[1] <
maxreadfactor*fulltext] if not readdistances: bases[r] = length length += fulltext else: delta, ignored, p = min(readdistances) bases[r] = bases[p] length += delta return length

if __name__ == "__main__": action = sys.argv[1] if action == 'dump': # second arg must be a revlog, dump the index to stdout sys.stdout.write(dump(sys.argv[2])) elif action == 'emul': # read the revs from stdin revs = load( length = emul(revs, 3) print 'estimated file size: %12d bytes (%6.1f MiB)' % (length,