[agb-fixnum] Implement checked, overflowing, saturating and wrapping operations and a lot more #706
base: master
da59045 to a13537e
Thanks so much for your PR, it's quite late right now, so I can't fully review it yet. I do have some bits to highlight first.
We tried using long multiplication and found it unacceptable at the time (#443). At the time:
Given the word size is 32 bits, is this an improvement? I wrote the following "benchmark". Note that the 1000 repetitions aren't for measurement purposes; the GBA doesn't need that (no cache, no complex pipeline, no branch prediction, etc.). They're just there to make any fixed overhead of calling the test function negligible.

    use core::hint::black_box;
    use agb_fixnum::{num, Num};

    #[test_case]
    fn bench_fixed_num_multiplication_i32(_: &mut crate::Gba) {
        let a: Num<i32, 8> = black_box(num!(1_000.235));
        let b: Num<i32, 8> = black_box(num!(1_000.235));
        for _ in 0..1000 {
            let a = black_box(a);
            let b = black_box(b);
            black_box(a * b);
        }
    }

    #[test_case]
    fn bench_fixed_num_multiplication_u8(_: &mut crate::Gba) {
        let a: Num<u8, 4> = black_box(num!(1.2));
        let b: Num<u8, 4> = black_box(num!(2.7));
        for _ in 0..1000 {
            let a = black_box(a);
            let b = black_box(b);
            black_box(a * b);
        }
    }

the results of which are
Note that these benchmarks may still be flawed; it is a micro-benchmark after all. They show the existing i32 multiplication to be significantly faster in both debug and release. While I'm not certain, I'd attribute this to the overhead of calling an ARM function from Thumb and to the shifting being more expensive. I also see the existing u8 multiplication to be faster in debug and equivalent to this PR in release mode.

From your description, the rest all sounds great. I'll have a look at the code for it in more detail when I get the chance. Thanks again!
I'll take a better look in the following days, but for now:
That's unfortunate. I thought it would add a bit of overhead for the jump from Thumb to ARM and vice versa, but still be faster than three MUL (+ stuff). In this case I'd propose keeping the old implementation for fixnums with 15 or fewer bits of precision, and mine for the others, since the old one is not correct for those cases anyway.
In any case, thanks a lot for the benchmarks, they are very useful.
@Chiptun3r I've cherry-picked the changes from this PR into #711 if you'd like to check that it does what you're looking for :)
I had a quick look at it and noticed that you're relying on an upcast to 64 bits for multiplications when you need to check whether they overflow, and that is quite suboptimal. From my tests (same setup as @corwinkuiper's, in release mode):
Also, the fast solution is wrong for high precision numbers: if the multiplication of the fract parts overflows, the result is just incorrect rather than the wrapping behavior you would expect.
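To make that failure mode concrete, here is a minimal sketch comparing a split 32-bit multiplication against a 64-bit widening reference. The names `mul_split` and `mul_widening` are hypothetical and work on raw `i32` values rather than on agb_fixnum's `Num` type; the split version is only an assumption about what a "fast" path of this kind looks like, not a copy of either implementation discussed here.

```rust
/// Reference result: widen to i64, multiply, shift back, keep the low 32 bits.
fn mul_widening(a: i32, b: i32, n: u32) -> i32 {
    (((a as i64) * (b as i64)) >> n) as i32
}

/// Split ("fast") multiplication that only uses 32-bit operations.
/// `n` is the number of fractional bits and is assumed to be below 32.
fn mul_split(a: i32, b: i32, n: u32) -> i32 {
    let mask = ((1u32 << n) - 1) as i32;
    let (a_int, a_frac) = (a >> n, a & mask);
    let (b_int, b_frac) = (b >> n, b & mask);
    // The product of the fractional parts is computed in 32 bits, so its
    // high bits are silently lost once it no longer fits in a u32.
    let frac = ((a_frac as u32).wrapping_mul(b_frac as u32) >> n) as i32;
    (a_int.wrapping_mul(b_int) << n)
        .wrapping_add(a_int.wrapping_mul(b_frac))
        .wrapping_add(b_int.wrapping_mul(a_frac))
        .wrapping_add(frac)
}

fn main() {
    // 1.5 * 1.5 with 8 fractional bits: both paths agree.
    let a8 = 3 << 7;
    assert_eq!(mul_split(a8, a8, 8), mul_widening(a8, a8, 8));

    // 1.5 * 1.5 with 24 fractional bits: the fractional product is 2^46,
    // which does not fit in 32 bits, so the split path drops it entirely.
    let a24 = 3 << 23;
    println!("widening raw result: {}", mul_widening(a24, a24, 24));
    println!("split raw result:    {}", mul_split(a24, a24, 24));
}
```

With 24 fractional bits the widening path yields the raw value for 2.25, while the split path yields the raw value for 2.0, which is neither the exact nor the wrapped product.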
5605cfc to 15f7ce3
Now it uses the fast path if the precision is <= 16 (in that case the multiplication between the fract parts can't overflow in any case, so the result is always the expected, wrapping, one).
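The threshold can be checked with a little arithmetic: the largest fractional part with n bits is 2^n - 1, and its square has to fit in a `u32`. The sketch below does that check in 64 bits; `max_frac_product` and `frac_product_fits_u32` are hypothetical helper names used only for illustration and are not part of the PR.

```rust
/// Largest possible product of two n-bit fractional parts, computed in u64.
/// Assumes n <= 32 so the square cannot overflow a u64.
fn max_frac_product(n: u32) -> u64 {
    let max_frac = (1u64 << n) - 1;
    max_frac * max_frac
}

/// True when the product of any two n-bit fractional parts fits in a u32.
fn frac_product_fits_u32(n: u32) -> bool {
    max_frac_product(n) <= u64::from(u32::MAX)
}

fn main() {
    for n in [8, 15, 16, 17, 24] {
        println!("precision {n}: fits in u32 = {}", frac_product_fits_u32(n));
    }
}
```

The check holds for every precision up to and including 16 and fails from 17 onwards, which matches the cut-off described above.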
Results (debug):
Results (release):
Sadly the debug build is much slower now (even tho the dev profile has
Thanks for taking this further. I'm happy to do the later commits and PR them separately once this is merged. My current worry is that the Ideally we'd want to use something like https://doc.rust-lang.org/nightly/std/intrinsics/fn.is_val_statically_known.html but I don't think that'll ever get stabilised. I've experimented in the past with defining the abi method
I checked on godbolt and it doesn't do any kind of optimization even with O3 optimizations enabled. The problem is not only the
15f7ce3 to 5116d37
My original plan was just to implement wrapping_add, but the more I worked on it, the more I found things to fix, so here we are.
In order:
- num! macro didn't work with high precision numbers, so I fixed that too.
- FixedWidthUnsignedInteger and FixedWidthSignedInteger were quite a mess (they appear to be mutually exclusive, but the signed one actually derived from the unsigned one) and did not take advantage of the num_traits crate as much as they could, so I removed FixedWidthSignedInteger and renamed FixedWidthUnsignedInteger to FixedWidthInteger, since that is what it was.
- num_traits supports, because, for example, checked_div is the only division it supports).

All the changes are backed by tests, which I also ran on mgba to be sure the i32/u32 multiplications work there too, since the code is different for arm (btw, is there a better way than just copying the tests as functions and putting them in a rom as I did?).
Hopefully this pull request isn't too big; I tried to split all the changes into their own commits to make it easier to review.
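For readers unfamiliar with the four overflow-handling flavours named in the title, here is a minimal sketch for addition. Fixed-point addition acts directly on the raw integer, so each flavour can delegate to the matching `i32` method. `Fixed<N>` is a hypothetical stand-in type invented for this example; it is not agb_fixnum's `Num` and does not reflect how the PR implements these methods.

```rust
/// Hypothetical fixed-point number: a raw i32 with N fractional bits.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Fixed<const N: u32>(i32);

impl<const N: u32> Fixed<N> {
    /// Build a value from an integer (assumes the shifted result fits in i32).
    fn from_int(v: i32) -> Self {
        Self(v << N)
    }

    /// Returns None on overflow.
    fn checked_add(self, rhs: Self) -> Option<Self> {
        self.0.checked_add(rhs.0).map(Self)
    }

    /// Returns the wrapped result plus a flag saying whether it overflowed.
    fn overflowing_add(self, rhs: Self) -> (Self, bool) {
        let (raw, overflowed) = self.0.overflowing_add(rhs.0);
        (Self(raw), overflowed)
    }

    /// Clamps to the representable range on overflow.
    fn saturating_add(self, rhs: Self) -> Self {
        Self(self.0.saturating_add(rhs.0))
    }

    /// Wraps around on overflow (two's complement).
    fn wrapping_add(self, rhs: Self) -> Self {
        Self(self.0.wrapping_add(rhs.0))
    }
}

fn main() {
    let max = Fixed::<8>(i32::MAX);
    let one = Fixed::<8>::from_int(1);

    println!("checked:     {:?}", max.checked_add(one));
    println!("overflowing: {:?}", max.overflowing_add(one));
    println!("saturating:  {:?}", max.saturating_add(one));
    println!("wrapping:    {:?}", max.wrapping_add(one));
}
```

Multiplication and division are where the extra work discussed in this thread comes in, since the intermediate product needs more bits than the operands.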