Performance of cyl_bessel_i() on a low-powered arm64 device #92
Comments
You could create a set of test input files of varying 'quality' by adding varying amounts of Gaussian white noise and center frequency shift to see if the precision is an issue for those variables.
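A minimal sketch of that idea, assuming the GNU Radio 3.10 C++ API (in GRC this is just a File Source → Channel Model → File Sink chain); the file names and the helper function are only illustrative:

```cpp
#include <gnuradio/top_block.h>
#include <gnuradio/gr_complex.h>
#include <gnuradio/blocks/file_source.h>
#include <gnuradio/blocks/file_sink.h>
#include <gnuradio/channels/channel_model.h>
#include <string>

// Write one impaired copy of a captured IQ file for a given noise level and
// carrier frequency offset (normalized to the sample rate), so the decoder
// can be run offline against each copy and the CRC pass rate compared.
void write_impaired_copy(const std::string& in_file, const std::string& out_file,
                         double noise_voltage, double freq_offset_norm)
{
    auto tb = gr::make_top_block("impairment_sweep");
    auto src = gr::blocks::file_source::make(sizeof(gr_complex), in_file.c_str(), false);
    auto chan = gr::channels::channel_model::make(noise_voltage, freq_offset_norm);
    auto snk = gr::blocks::file_sink::make(sizeof(gr_complex), out_file.c_str());
    tb->connect(src, 0, chan, 0);
    tb->connect(chan, 0, snk, 0);
    tb->run();  // runs to completion since the file source is finite
}
```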
I didn't see any difference in response in terms of the number of packets decoded with valid CRCs. I used the Channel Model block and varied the noise_voltage and frequency_offset parameters independently in small steps until the number of valid CRCs declined to zero. On the other hand, there are too many other LoRa block configurations to draw a definite conclusion from this limited experiment. A more obvious point is that my 5 Msps sampling rate is somewhat high, and unfortunately it's the lowest usable rate of my receiver. cyl_bessel_i() is executed on the order of O(samp_rate * 2^sf) times, so reducing the input sampling rate would probably be a simpler approach for my particular problem.
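Taking that scaling at face value: with SF 11, 2^11 = 2048, so the call count grows linearly with the sample rate, and resampling 5 Msps down to 1 Msps should cut the cyl_bessel_i() workload by roughly 5x, independent of any precision change.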
Nice... a single data point to be sure, but it's a pleasant single data point. :-)
So your frame_sync of_factor is ... 20? In issue 91 it was suggested that 4 should be adequate. If you filter and decimate by 5, do you still get good results?
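For reference, a rough sketch of what "filter and decimate by 5" could look like, assuming GNU Radio 3.10's C++ API; the cutoff and transition width are just illustrative guesses around the 250 kHz LoRa bandwidth:

```cpp
#include <gnuradio/filter/firdes.h>
#include <gnuradio/filter/fir_filter_blk.h>
#include <vector>

// Low-pass filter and decimate by 5 in one FIR block: 5 Msps in, 1 Msps out.
// The filter only needs to pass the 250 kHz LoRa channel and reject anything
// that would alias after decimation.
auto make_decimator()
{
    const double samp_rate = 5e6;
    const double cutoff = 150e3;      // just above half the 250 kHz bandwidth
    const double transition = 100e3;  // arbitrary; trades tap count vs. CPU
    std::vector<float> taps =
        gr::filter::firdes::low_pass(1.0, samp_rate, cutoff, transition);
    return gr::filter::fir_filter_ccf::make(5, taps);
}
```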
It looks like the Low Pass Filter and Rational Resampler blocks are quite CPU intensive. A receiver flow with additional filtering or resampling blocks creates about 4x more load and occupies an entire Cortex-A72 core. I'm going to try just a cheapo 1 Msps radio next week.
While running the receiver on a low-powered device like a Raspberry Pi, I'm seeing a high CPU load. The signal is sampled at 5 Msps, SF 11, BW 250 kHz.
Quick profiling of a run-to-completion flow from a File Source without a Throttle block shows that boost::math::cyl_bessel_i() takes a substantial share of the time. As it turns out, the default Boost.Math policy promotes double arguments to long double, which the device struggles to compute with.
The promotion can be disabled as described in https://www.boost.org/doc/libs/1_85_0/libs/math/doc/html/math_toolkit/tradoffs.html:
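A minimal sketch of that kind of change (the wrapper and policy alias names are illustrative, not the actual gr-lora_sdr call site):

```cpp
#include <boost/math/special_functions/bessel.hpp>
#include <boost/math/policies/policy.hpp>

// Keep double arguments in double precision instead of promoting them to
// long double (the Boost.Math default). On AArch64 Linux, long double is
// 128-bit and software-emulated, which is what makes the default so slow.
using fast_bessel_policy = boost::math::policies::policy<
    boost::math::policies::promote_double<false>>;

// Illustrative wrapper: same result type, cheaper internal evaluation,
// at the cost of a few ULPs of accuracy.
inline double bessel_i0_fast(double x)
{
    return boost::math::cyl_bessel_i(0.0, x, fast_bessel_policy());
}
```

Boost.Math also documents a project-wide alternative: defining BOOST_MATH_PROMOTE_DOUBLE_POLICY to false before including any Boost.Math headers changes the default policy without touching individual call sites.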
The fix gives a whopping ~3x speedup on an RPi 4 without any decoding degradation on my signal. However, I don't know whether long double precision is strictly required here, or whether it can be downgraded just like that.