summary | shortlog | log | commit | commitdiff | tree
raw | patch | inline | side by side (parent: 6d91da6)
raw | patch | inline | side by side (parent: 6d91da6)
author | Linus Torvalds <torvalds@linux-foundation.org> | |
Wed, 23 Jan 2008 02:41:14 +0000 (18:41 -0800) | ||
committer | Junio C Hamano <gitster@pobox.com> | |
Wed, 23 Jan 2008 05:46:30 +0000 (21:46 -0800) |
This creates a hash index of every single file added to the index.
Right now that hash index isn't actually used for much: I implemented a
"cache_name_exists()" function that uses it to efficiently look up a
filename in the index without having to do the O(logn) binary search,
but quite frankly, that's not why this patch is interesting.
No, the whole and only reason to create the hash of the filenames in the
index is that by modifying the hash function, you can fairly easily do
things like making it always hash equivalent names into the same bucket.
That, in turn, means that suddenly questions like "does this name exist
in the index under an _equivalent_ name?" becomes much much cheaper.
Guiding principles behind this patch:
- it shouldn't be too costly. In fact, my primary goal here was to
actually speed up "git commit" with a fully populated kernel tree, by
being faster at checking whether a file already existed in the index. I
did succeed, but only barely:
Best before:
[torvalds@woody linux]$ time git commit > /dev/null
real 0m0.255s
user 0m0.168s
sys 0m0.088s
Best after:
[torvalds@woody linux]$ time ~/git/git commit > /dev/null
real 0m0.233s
user 0m0.144s
sys 0m0.088s
so some things are actually faster (~8%).
Caveat: that's really the best case. Other things are invariably going
to be slightly slower, since we populate that index cache, and quite
frankly, few things really use it to look things up.
That said, the cost is really quite small. The worst case is probably
doing a "git ls-files", which will do very little except puopulate the
index, and never actually looks anything up in it, just lists it.
Before:
[torvalds@woody linux]$ time git ls-files > /dev/null
real 0m0.016s
user 0m0.016s
sys 0m0.000s
After:
[torvalds@woody linux]$ time ~/git/git ls-files > /dev/null
real 0m0.021s
user 0m0.012s
sys 0m0.008s
and while the thing has really gotten relatively much slower, we're
still talking about something almost unmeasurable (eg 5ms). And that
really should be pretty much the worst case.
So we lose 5ms on one "benchmark", but win 22ms on another. Pick your
poison - this patch has the advantage that it will _likely_ speed up
the cases that are complex and expensive more than it slows down the
cases that are already so fast that nobody cares. But if you look at
relative speedups/slowdowns, it doesn't look so good.
- It should be simple and clean
The code may be a bit subtle (the reasons I do hash removal the way I
do etc), but it re-uses the existing hash.c files, so it really is
fairly small and straightforward apart from a few odd details.
Now, this patch on its own doesn't really do much, but I think it's worth
looking at, if only because if done correctly, the name hashing really can
make an improvement to the whole issue of "do we have a filename that
looks like this in the index already". And at least it gets real testing
by being used even by default (ie there is a real use-case for it even
without any insane filesystems).
NOTE NOTE NOTE! The current hash is a joke. I'm ashamed of it, I'm just
not ashamed of it enough to really care. I took all the numbers out of my
nether regions - I'm sure it's good enough that it works in practice, but
the whole point was that you can make a really much fancier hash that
hashes characters not directly, but by their upper-case value or something
like that, and thus you get a case-insensitive hash, while still keeping
the name and the index itself totally case sensitive.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Right now that hash index isn't actually used for much: I implemented a
"cache_name_exists()" function that uses it to efficiently look up a
filename in the index without having to do the O(logn) binary search,
but quite frankly, that's not why this patch is interesting.
No, the whole and only reason to create the hash of the filenames in the
index is that by modifying the hash function, you can fairly easily do
things like making it always hash equivalent names into the same bucket.
That, in turn, means that suddenly questions like "does this name exist
in the index under an _equivalent_ name?" becomes much much cheaper.
Guiding principles behind this patch:
- it shouldn't be too costly. In fact, my primary goal here was to
actually speed up "git commit" with a fully populated kernel tree, by
being faster at checking whether a file already existed in the index. I
did succeed, but only barely:
Best before:
[torvalds@woody linux]$ time git commit > /dev/null
real 0m0.255s
user 0m0.168s
sys 0m0.088s
Best after:
[torvalds@woody linux]$ time ~/git/git commit > /dev/null
real 0m0.233s
user 0m0.144s
sys 0m0.088s
so some things are actually faster (~8%).
Caveat: that's really the best case. Other things are invariably going
to be slightly slower, since we populate that index cache, and quite
frankly, few things really use it to look things up.
That said, the cost is really quite small. The worst case is probably
doing a "git ls-files", which will do very little except puopulate the
index, and never actually looks anything up in it, just lists it.
Before:
[torvalds@woody linux]$ time git ls-files > /dev/null
real 0m0.016s
user 0m0.016s
sys 0m0.000s
After:
[torvalds@woody linux]$ time ~/git/git ls-files > /dev/null
real 0m0.021s
user 0m0.012s
sys 0m0.008s
and while the thing has really gotten relatively much slower, we're
still talking about something almost unmeasurable (eg 5ms). And that
really should be pretty much the worst case.
So we lose 5ms on one "benchmark", but win 22ms on another. Pick your
poison - this patch has the advantage that it will _likely_ speed up
the cases that are complex and expensive more than it slows down the
cases that are already so fast that nobody cares. But if you look at
relative speedups/slowdowns, it doesn't look so good.
- It should be simple and clean
The code may be a bit subtle (the reasons I do hash removal the way I
do etc), but it re-uses the existing hash.c files, so it really is
fairly small and straightforward apart from a few odd details.
Now, this patch on its own doesn't really do much, but I think it's worth
looking at, if only because if done correctly, the name hashing really can
make an improvement to the whole issue of "do we have a filename that
looks like this in the index already". And at least it gets real testing
by being used even by default (ie there is a real use-case for it even
without any insane filesystems).
NOTE NOTE NOTE! The current hash is a joke. I'm ashamed of it, I'm just
not ashamed of it enough to really care. I took all the numbers out of my
nether regions - I'm sure it's good enough that it works in practice, but
the whole point was that you can make a really much fancier hash that
hashes characters not directly, but by their upper-case value or something
like that, and thus you get a case-insensitive hash, while still keeping
the name and the index itself totally case sensitive.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
cache.h | patch | blob | history | |
dir.c | patch | blob | history | |
read-cache.c | patch | blob | history |
index 3a47cdc9d26886db66379c0000bbe80565e2f742..409738ca6b4b6dcefb60937c665bf0f1456a0f5a 100644 (file)
--- a/cache.h
+++ b/cache.h
#include "git-compat-util.h"
#include "strbuf.h"
+#include "hash.h"
#include SHA1_HEADER
#include <zlib.h>
};
struct cache_entry {
+ struct cache_entry *next;
unsigned int ce_ctime;
unsigned int ce_mtime;
unsigned int ce_dev;
#define CE_UPDATE (0x10000)
#define CE_REMOVE (0x20000)
#define CE_UPTODATE (0x40000)
+#define CE_UNHASHED (0x80000)
static inline unsigned create_ce_flags(size_t len, unsigned stage)
{
struct cache_tree *cache_tree;
time_t timestamp;
void *alloc;
+ struct hash_table name_hash;
};
extern struct index_state the_index;
#define refresh_cache(flags) refresh_index(&the_index, (flags), NULL, NULL)
#define ce_match_stat(ce, st, options) ie_match_stat(&the_index, (ce), (st), (options))
#define ce_modified(ce, st, options) ie_modified(&the_index, (ce), (st), (options))
+#define cache_name_exists(name, namelen) index_name_exists(&the_index, (name), (namelen))
#endif
enum object_type {
extern int write_index(struct index_state *, int newfd);
extern int discard_index(struct index_state *);
extern int verify_path(const char *path);
+extern int index_name_exists(struct index_state *istate, const char *name, int namelen);
extern int index_name_pos(struct index_state *, const char *name, int namelen);
#define ADD_CACHE_OK_TO_ADD 1 /* Ok to add */
#define ADD_CACHE_OK_TO_REPLACE 2 /* Ok to replace file/directory */
index 1b9cc7a8a81002fa2971d71c41a256444e6112fe..6543105b9622212430a9e5ed131a81074e019d9a 100644 (file)
--- a/dir.c
+++ b/dir.c
struct dir_entry *dir_add_name(struct dir_struct *dir, const char *pathname, int len)
{
- if (cache_name_pos(pathname, len) >= 0)
+ if (cache_name_exists(pathname, len))
return NULL;
ALLOC_GROW(dir->entries, dir->nr+1, dir->alloc);
diff --git a/read-cache.c b/read-cache.c
index 07abd5d7ebfc10998a3c6140a7a325d752d70b49..9477c0b398125c3759ae3692de80e75ae62e11e0 100644 (file)
--- a/read-cache.c
+++ b/read-cache.c
struct index_state the_index;
+static unsigned int hash_name(const char *name, int namelen)
+{
+ unsigned int hash = 0x123;
+
+ do {
+ unsigned char c = *name++;
+ hash = hash*101 + c;
+ } while (--namelen);
+ return hash;
+}
+
+static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+{
+ void **pos;
+ unsigned int hash = hash_name(ce->name, ce_namelen(ce));
+
+ istate->cache[nr] = ce;
+ pos = insert_hash(hash, ce, &istate->name_hash);
+ if (pos) {
+ ce->next = *pos;
+ *pos = ce;
+ }
+}
+
+/*
+ * We don't actually *remove* it, we can just mark it invalid so that
+ * we won't find it in lookups.
+ *
+ * Not only would we have to search the lists (simple enough), but
+ * we'd also have to rehash other hash buckets in case this makes the
+ * hash bucket empty (common). So it's much better to just mark
+ * it.
+ */
+static void remove_hash_entry(struct index_state *istate, struct cache_entry *ce)
+{
+ ce->ce_flags |= CE_UNHASHED;
+}
+
+static void replace_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+{
+ struct cache_entry *old = istate->cache[nr];
+
+ if (ce != old) {
+ remove_hash_entry(istate, old);
+ set_index_entry(istate, nr, ce);
+ }
+ istate->cache_changed = 1;
+}
+
+int index_name_exists(struct index_state *istate, const char *name, int namelen)
+{
+ unsigned int hash = hash_name(name, namelen);
+ struct cache_entry *ce = lookup_hash(hash, &istate->name_hash);
+
+ while (ce) {
+ if (!(ce->ce_flags & CE_UNHASHED)) {
+ if (!cache_name_compare(name, namelen, ce->name, ce->ce_flags))
+ return 1;
+ }
+ ce = ce->next;
+ }
+ return 0;
+}
+
/*
* This only updates the "non-critical" parts of the directory
* cache, ie the parts that aren't tracked by GIT, and only used
/* Remove entry, return true if there are more entries to go.. */
int remove_index_entry_at(struct index_state *istate, int pos)
{
+ struct cache_entry *ce = istate->cache[pos];
+
+ remove_hash_entry(istate, ce);
istate->cache_changed = 1;
istate->cache_nr--;
if (pos >= istate->cache_nr)
@@ -702,8 +769,7 @@ static int add_index_entry_with_check(struct index_state *istate, struct cache_e
/* existing match? Just replace it. */
if (pos >= 0) {
- istate->cache_changed = 1;
- istate->cache[pos] = ce;
+ replace_index_entry(istate, pos, ce);
return 0;
}
pos = -pos-1;
@@ -763,7 +829,7 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
memmove(istate->cache + pos + 1,
istate->cache + pos,
(istate->cache_nr - pos - 1) * sizeof(ce));
- istate->cache[pos] = ce;
+ set_index_entry(istate, pos, ce);
istate->cache_changed = 1;
return 0;
}
@@ -892,11 +958,8 @@ int refresh_index(struct index_state *istate, unsigned int flags, const char **p
has_errors = 1;
continue;
}
- istate->cache_changed = 1;
- /* You can NOT just free istate->cache[i] here, since it
- * might not be necessarily malloc()ed but can also come
- * from mmap(). */
- istate->cache[i] = new;
+
+ replace_index_entry(istate, i, new);
}
return has_errors;
}
@@ -971,6 +1034,20 @@ static void convert_from_disk(struct ondisk_cache_entry *ondisk, struct cache_en
memcpy(ce->name, ondisk->name, len + 1);
}
+static inline size_t estimate_cache_size(size_t ondisk_size, unsigned int entries)
+{
+ long per_entry;
+
+ per_entry = sizeof(struct cache_entry) - sizeof(struct ondisk_cache_entry);
+
+ /*
+ * Alignment can cause differences. This should be "alignof", but
+ * since that's a gcc'ism, just use the size of a pointer.
+ */
+ per_entry += sizeof(void *);
+ return ondisk_size + entries*per_entry;
+}
+
/* remember to discard_cache() before reading a different cache! */
int read_index_from(struct index_state *istate, const char *path)
{
* has room for a few more flags, we can allocate using the same
* index size
*/
- istate->alloc = xmalloc(mmap_size);
+ istate->alloc = xmalloc(estimate_cache_size(mmap_size, istate->cache_nr));
src_offset = sizeof(*hdr);
dst_offset = 0;
disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
ce = (struct cache_entry *)((char *)istate->alloc + dst_offset);
convert_from_disk(disk_ce, ce);
- istate->cache[i] = ce;
+ set_index_entry(istate, i, ce);
src_offset += ondisk_ce_size(ce);
dst_offset += ce_size(ce);
istate->cache_nr = 0;
istate->cache_changed = 0;
istate->timestamp = 0;
+ free_hash(&istate->name_hash);
cache_tree_free(&(istate->cache_tree));
free(istate->alloc);
istate->alloc = NULL;