summary | shortlog | log | commit | commitdiff | tree
raw | patch | inline | side by side (parent: 30d12d4)
raw | patch | inline | side by side (parent: 30d12d4)
author | Linus Torvalds <torvalds@linux-foundation.org> | |
Sat, 8 Aug 2009 04:16:46 +0000 (21:16 -0700) | ||
committer | Junio C Hamano <gitster@pobox.com> | |
Sat, 8 Aug 2009 05:32:46 +0000 (22:32 -0700) |
I think I have found a way to avoid the gcc crazyness.
Lookie here:
# TIME[s] SPEED[MB/s]
rfc3174 5.094 119.8
rfc3174 5.098 119.7
linus 1.462 417.5
linusas 2.008 304
linusas2 1.878 325
mozilla 5.566 109.6
mozillaas 5.866 104.1
openssl 1.609 379.3
spelvin 1.675 364.5
spelvina 1.601 381.3
nettle 1.591 383.6
notice? I outperform all the hand-tuned asm on 32-bit too. By quite a
margin, in fact.
Now, I didn't try a P4, and it's possible that it won't do that there, but
the 32-bit code generation sure looks impressive on my Nehalem box. The
magic? I force the stores to the 512-bit hash bucket to be done in order.
That seems to help a lot.
The diff is trivial (on top of the "rename registers with cpp" patch), as
appended. And it does seem to fix the P4 issues too, although I can
obviously (once again) only test Prescott, and only in 64-bit mode:
# TIME[s] SPEED[MB/s]
rfc3174 1.662 36.73
rfc3174 1.64 37.22
linus 0.2523 241.9
linusas 0.4367 139.8
linusas2 0.4487 136
mozilla 0.9704 62.9
mozillaas 0.9399 64.94
that's some really impressive improvement. All from just saying "do the
stores in the order I told you to, dammit!" to the compiler.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Lookie here:
# TIME[s] SPEED[MB/s]
rfc3174 5.094 119.8
rfc3174 5.098 119.7
linus 1.462 417.5
linusas 2.008 304
linusas2 1.878 325
mozilla 5.566 109.6
mozillaas 5.866 104.1
openssl 1.609 379.3
spelvin 1.675 364.5
spelvina 1.601 381.3
nettle 1.591 383.6
notice? I outperform all the hand-tuned asm on 32-bit too. By quite a
margin, in fact.
Now, I didn't try a P4, and it's possible that it won't do that there, but
the 32-bit code generation sure looks impressive on my Nehalem box. The
magic? I force the stores to the 512-bit hash bucket to be done in order.
That seems to help a lot.
The diff is trivial (on top of the "rename registers with cpp" patch), as
appended. And it does seem to fix the P4 issues too, although I can
obviously (once again) only test Prescott, and only in 64-bit mode:
# TIME[s] SPEED[MB/s]
rfc3174 1.662 36.73
rfc3174 1.64 37.22
linus 0.2523 241.9
linusas 0.4367 139.8
linusas2 0.4487 136
mozilla 0.9704 62.9
mozillaas 0.9399 64.94
that's some really impressive improvement. All from just saying "do the
stores in the order I told you to, dammit!" to the compiler.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
block-sha1/sha1.c | patch | blob | history |
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index b715916675469595c0955a15dc0bfde7e6508052..886bcf25e2f52dff239f1c744c11774af12da48a 100644 (file)
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
/* This "rolls" over the 512-bit array */
#define W(x) (array[(x)&15])
+#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
/*
* Where do we get the source from? The first 16 iterations get it from
#define SHA_MIX(t) SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
#define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
- unsigned int TEMP = input(t); W(t) = TEMP; \
- TEMP += E + SHA_ROL(A,5) + (fn) + (constant); \
- B = SHA_ROR(B, 2); E = TEMP; } while (0)
+ unsigned int TEMP = input(t); setW(t, TEMP); \
+ E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
+ B = SHA_ROR(B, 2); } while (0)
#define T_0_15(t, A, B, C, D, E) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
#define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )