git.tokkee.org Git - git.git/commit

author	Linus Torvalds <torvalds@osdl.org>
	Mon, 5 Jun 2006 19:03:31 +0000 (12:03 -0700)
committer	Junio C Hamano <junkio@cox.net>
	Tue, 6 Jun 2006 00:23:31 +0000 (17:23 -0700)
commit	ce0bd64299ae148ef61a63edcac635de41254cb5
tree	6e6e2b4c332fd37e27c404b180a3efe9cce8feda
parent	87cefaaff958e30204a21757012a46883175c00f

pack-objects: improve path grouping heuristics.

This trivial patch not only simplifies the name hashing, it actually
improves packing for both git and the kernel.

The git archive pack shrinks from 6824090->6622627 bytes (a 3%
improvement), and the kernel pack shrinks from 108756213 to 108219021 (a
mere 0.5% improvement, but still, it's an improvement from making the
hashing much simpler!)

We just create a 32-bit hash, where we "age" previous characters by two
bits, so the last characters in a filename count most. So when we then
compare the hashes in the sort routine, filenames that end the same way
sort the same way.
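
A minimal sketch of the "aging" described above, not the patch text itself. The function name sort_number and the use of uint32_t are assumptions made for illustration. Each new character lands in the top byte, and everything seen earlier is shifted right by two bits.

```c
#include <stdint.h>

/*
 * Hypothetical helper (name assumed for illustration): build a 32-bit
 * "sorting number" from a path. Every new character is placed in the
 * top byte, and the previous value is aged by shifting it right two
 * bits, so the last characters of the name carry the most weight.
 */
uint32_t sort_number(const char *name)
{
	uint32_t hash = 0;
	unsigned char c;

	while ((c = (unsigned char)*name++) != 0)
		hash = (hash >> 2) + ((uint32_t)c << 24);
	return hash;
}
```

With this sketch, sort_number("a") is 0x61000000 and sort_number("ab") is 0x7A400000: the 'a' has been shifted down by two bits and the 'b' now sits in the top byte.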

It takes the subdirectory into account (unless the filename is > 16
characters), but files with the same name within the same subdirectory
will obviously sort closer than files in different subdirectories.

And, incidentally (which is why I tried the hash change in the first
place, of course) builtin-rev-list.c will sort fairly close to rev-list.c.
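
To illustrate the grouping, here is a small sketch that sorts a list of paths by such a sorting number. The names sort_number, compare_paths and sort_paths are hypothetical, and the ascending order is an assumption for the example. The real sort routine in pack-objects.c looks at more than this one key.

```c
#include <stdint.h>
#include <stdlib.h>

/* Same aging scheme as described above; the name is assumed. */
uint32_t sort_number(const char *name)
{
	uint32_t hash = 0;
	unsigned char c;

	while ((c = (unsigned char)*name++) != 0)
		hash = (hash >> 2) + ((uint32_t)c << 24);
	return hash;
}

/* qsort() comparator: order two path strings by their sorting number. */
static int compare_paths(const void *a, const void *b)
{
	uint32_t ha = sort_number(*(const char *const *)a);
	uint32_t hb = sort_number(*(const char *const *)b);

	return (ha > hb) - (ha < hb);
}

/* Sort an array of n path strings in place, ascending by sorting number. */
void sort_paths(const char **paths, size_t n)
{
	qsort(paths, n, sizeof(*paths), compare_paths);
}
```

Given the paths "Makefile", "builtin-rev-list.c", "Documentation/git.txt" and "rev-list.c", this puts rev-list.c and builtin-rev-list.c next to each other, ahead of the other two, because their last ten characters are identical.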

And no, it's not a "good hash" in the sense of being secure or unique, but
that's not what we're looking for. The whole "hash" thing is misnamed
here. It's not so much a hash as a "sorting number".

[jc: rolled in a simplification of the sorting number computation
for thin pack base objects]

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>

pack-objects.c
