
sse non-temporal loads/stores #11

Open · KWillets wants to merge 1 commit into master
Conversation

KWillets (Collaborator)
Switching SSE to non-temporal stores gives a large speedup (~50% on my i5) in decoding. Non-temporal loads don't seem to make a difference in encoding. Both require 16-byte alignment, so we may want to make this a build option rather than a straight merge.
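For reference, the change amounts to swapping the store intrinsic. A minimal sketch (names are illustrative, not the PR's actual code): `_mm_stream_si128` writes around the cache but requires a 16-byte-aligned destination, which is why a build option is suggested.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical sketch: write one block of four decoded 32-bit integers,
 * either with a non-temporal (streaming) store or a regular store.
 * For the non-temporal path, dst must be 16-byte aligned. */
void store_block(uint32_t *dst, __m128i decoded, int use_nontemporal) {
    if (use_nontemporal) {
        /* bypasses the cache; requires 16-byte alignment */
        _mm_stream_si128((__m128i *)dst, decoded);
    } else {
        /* regular store; tolerates unaligned dst */
        _mm_storeu_si128((__m128i *)dst, decoded);
    }
}
```

After a run of streaming stores, an `_mm_sfence()` is needed before other code reads the output, since non-temporal stores are weakly ordered.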

perf in this branch tests both encoding and decoding in one target.

clean up perf and merge in decode_perf
get rid of unused function warnings
@KWillets KWillets requested a review from lemire October 28, 2017 21:37
@lemire (Member)

lemire commented Oct 31, 2017

Can you extend the benchmark so that the decoded data is actually accessed immediately rather than written to RAM?

Because here is my expectation: people do not grab compressed data from RAM, uncompress it, and then push it back to RAM for safekeeping without accessing it. People decode data and immediately make use of it. You actually want to avoid pushing the decoded data back to RAM; it is counterproductive to hold both the compressed and uncompressed data in memory.

My concern is that the performance will actually be worse in an actual application with non-temporal writes.

For encoding, it is something else... it would be kind of silly to compress the data and then, while still holding the uncompressed data, immediately uncompress the newly compressed data. So my view is that for compression, non-temporal writes make sense.

I am much less convinced regarding decoding.

I think we should measure in a realistic scenario.

Thoughts?


@KWillets (Collaborator, Author)

Yes, obviously it doesn't leave the data in cache.

Unfortunately, the unaligned compressed data is hard to write non-temporally. Maybe it could be packed into a buffer register or two and flushed in 16-byte units, but it's a headache.
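The staging idea above could look roughly like this: accumulate unaligned output bytes in a 16-byte-aligned scratch buffer and flush each full 16-byte unit with a streaming store. This is a hedged sketch of the approach being dismissed as a headache, not code from the PR; all names are made up.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical staging writer: buffers bytes until a full 16-byte unit
 * can be written with a non-temporal store. `out` must start 16-byte
 * aligned and advances in 16-byte steps. */
typedef struct {
    uint8_t scratch[16] __attribute__((aligned(16)));
    size_t fill;    /* bytes currently staged in scratch */
    uint8_t *out;   /* 16-byte-aligned output cursor */
} nt_writer;

static void nt_write(nt_writer *w, const uint8_t *src, size_t len) {
    while (len > 0) {
        size_t take = 16 - w->fill;
        if (take > len) take = len;
        memcpy(w->scratch + w->fill, src, take);
        w->fill += take; src += take; len -= take;
        if (w->fill == 16) {  /* full unit: flush non-temporally */
            _mm_stream_si128((__m128i *)w->out,
                             _mm_load_si128((const __m128i *)w->scratch));
            w->out += 16;
            w->fill = 0;
        }
    }
}

static void nt_flush(nt_writer *w) {
    memcpy(w->out, w->scratch, w->fill);  /* partial tail: plain copy */
    w->out += w->fill;
    w->fill = 0;
    _mm_sfence();  /* order streaming stores before later reads */
}
```

The extra `memcpy` per call is exactly the overhead that makes this unattractive compared with the aligned decode path.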

@lemire (Member)

lemire commented Oct 31, 2017

What I'd suggest, if we are going to merge this, is to make it an option, but not necessarily the default.

@aqrit (Collaborator)

aqrit commented Oct 6, 2018

> Unfortunately the non-aligned compressed data is hard to write non-temporally. Maybe it could be packed into a buffer register or two and flushed in 16 byte units, but it's a headache.

`_mm_maskmoveu_si128(data, ~shuffle_mask, out);` might be fast on Intel (though slow on AMD CPUs?)
