Implementation of Daira reduction #98
Conversation
Your implementation looks pretty solid; I tried a few trivial optimizations but they didn't have much effect. It seems like the adds and subs are taking longer than I would have expected, closer to 1 add/cycle than the theoretical capacity of 4 adds/cycle. Maybe we should look into add/sub/mul to see how much room for improvement there is, and if we end up with significant improvements there, we could come back to see if this method can become the fastest. I started poking around the generated asm; will report back if I find anything interesting.
The generated asm for [...]. The compiler seems to handle [...]. This might be obvious, but [...].

(If we did write an asm implementation later, maybe we could find creative ways to avoid memory access, e.g. by abusing XMM registers for storage. Not sure if it's a good or bad idea...)

Also, it seems like the compiler isn't smart enough to use [...]. We could maybe use [...].
Thanks heaps for your comments Daniel. I'll have a look at the stack access of [...].

On the subject of [...].
Ah you're right, I see
I didn't study the asm too much, but I'm guessing that it's tracking the high bits of [...].

Perhaps we should try "row-wise" multiplication, as described here. Seems like that should make better use of carry chains, since only one bit is carried at a time. OTOH, my understanding is that a sequence of [...].
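To make the carry-chain point concrete, here is a minimal Rust sketch of a multi-limb addition written in the style that the compiler can, ideally, lower to one `add` followed by a chain of `adc`s. The function name and the 4-limb width are my assumptions for illustration; this is not the branch's actual helper:

```rust
// Sketch: add two 4-limb little-endian numbers with an explicit carry chain.
// With luck the optimizer recognizes this as add/adc/adc/adc; in practice
// (as discussed above) rustc does not always manage it.
fn add_4_4_carry(a: [u64; 4], b: [u64; 4]) -> ([u64; 4], bool) {
    let mut out = [0u64; 4];
    let mut carry = false;
    for i in 0..4 {
        // overflowing_add reports only one carry bit, so the incoming
        // carry is added in a second step and the two flags are OR'd.
        let (s, c1) = a[i].overflowing_add(b[i]);
        let (s, c2) = s.overflowing_add(carry as u64);
        out[i] = s;
        carry = c1 | c2;
    }
    (out, carry)
}
```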
I've checked that compiling with [...].
So, new developments: [...]
I did a slightly more detailed breakdown of the proportion of time that each stage of Daira's reduction is currently taking. Total time for a reduction is ~16.5ns; here are the line-by-line costs in nanoseconds (and proportions) for future reference (timings on the Core i7):

```rust
let (x0, x1, x2) = rebase_8(x);                              // 2.5 (15%)
// s = C * x1
let s = mul_4_2(x1, Self::C);                                // 5.3 (32%)
// t = C^2 * x2 + x0
let t = if x2 == 0 {                                         // 1.4 (8%)
    x0
} else {
    add_no_overflow(Self::C_SQR_TBL[(x2 - 1) as usize], x0)
};
// xp = kM - s + t
let xp = add_6_4(sub(Self::K_M, s), t);                      // 2.4 (14%)
// xp = (xp0, xp1)
let (xp0, xp1) = rebase_6(xp);                               // 1.2 (7%)
// u = C * xp1
let u = mul_2_2(Self::C, xp1);                               // 2.1 (13%)
// return M - u + xp0                                        // 1.9 (11%)
let res = add_no_overflow(sub(Self::ORDER, u), xp0);
```

(Note: 2.5 + 5.3 + 1.4 + 2.4 + 1.2 + 2.1 + 1.9 = 16.8 != 16.5, so there is a very slight issue with how I arrived at these numbers; probably not important though.)

The implementations of [...].
This branch contains an implementation of Daira Hopwood's reduction algorithm for the Tweedle{dee,dum} curves. More precisely, the reduction algorithm is implemented but only enabled for Tweedledee, while Tweedledum remains with the original Montgomery reduction algorithm. This allows easy benchmark comparison between the two reduction algorithms.
Unfortunately I wasn't able to get Daira's algorithm to run faster than Montgomery's, even though, after counting multiplications and additions and staring at assembly output, I think that should be possible. Anyway, since this code does not represent an improvement I'm leaving this PR as a draft and I do not recommend merging at this stage. I intend to revisit it on occasion as ideas for improvement occur to me. Suggestions for improvement would be welcome in the comments below.
Current micro-benchmark timings as of 21 Jan 2021 (in nanoseconds):
Notes:
- Compiled with `lto = "fat"` and `codegen-units = 1` in the toml file (source).
- `RUSTFLAGS = '-C target-cpu=native'`. This generated code using `mulx` instead of plain `mul` (on the i7; the i5 doesn't have bmi2 or adx), but the timings ended up exactly the same.
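For anyone reproducing these timings, a sketch of how the settings in the notes are typically applied; the profile section name and the `cargo bench` invocation are assumptions about this repo's setup, not taken from it:

```shell
# In Cargo.toml (release profile):
#   [profile.release]
#   lto = "fat"
#   codegen-units = 1
#
# Then benchmark with the host CPU's features enabled, so that on a CPU
# with bmi2/adx the backend can emit mulx (and adcx/adox) instructions:
RUSTFLAGS='-C target-cpu=native' cargo bench
```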