
sse non-temporal loads/stores #11

Open · KWillets wants to merge 1 commit into master
Conversation

KWillets (Collaborator)
Switching SSE to non-temporal stores gives a large speedup (~50% on my i5) in decoding. Non-temporal loads don't seem to make a difference in encoding. Both require 16-byte alignment, so we may want to make this a build option rather than a straight merge.
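For reference, the change amounts to swapping the store intrinsic. A minimal sketch (names are illustrative, not the PR's actual code): `_mm_stream_si128` writes around the cache but requires a 16-byte-aligned destination, which is why a build option is suggested.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical sketch: write one block of four decoded 32-bit integers,
 * either with a non-temporal (streaming) store or a regular store.
 * For the non-temporal path, dst must be 16-byte aligned. */
void store_block(uint32_t *dst, __m128i decoded, int use_nontemporal) {
    if (use_nontemporal) {
        /* bypasses the cache; requires 16-byte alignment */
        _mm_stream_si128((__m128i *)dst, decoded);
    } else {
        /* regular store; tolerates unaligned dst */
        _mm_storeu_si128((__m128i *)dst, decoded);
    }
}
```

After a run of streaming stores, an `_mm_sfence()` is needed before other code reads the output, since non-temporal stores are weakly ordered.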

perf in this branch tests both encoding and decoding in one target.

clean up perf and merge in decode_perf
get rid of unused function warnings
@KWillets KWillets requested a review from lemire October 28, 2017 21:37
@lemire (Member)

lemire commented Oct 31, 2017

Can you extend the benchmark so that the decoded data is actually accessed immediately rather than written to RAM?

Because here is my expectation: people do not grab compressed data from RAM, uncompress it, and then push it back to RAM for safekeeping without accessing it. People decode data and immediately make use of it. You actually want to avoid pushing the decoded data back to RAM; it is counterproductive to hold both the compressed and uncompressed data in memory.

My concern is that the performance will actually be worse in an actual application with non-temporal writes.

For encoding, it is something else... it would be kind of silly to compress the data and then, while still holding the uncompressed data, immediately uncompress the newly compressed data. So my view is that for compression, non-temporal writes make sense.

I am much less convinced regarding decoding.

I think we should measure in a realistic scenario.

Thoughts?


@KWillets (Collaborator, Author)

Yes, obviously it doesn't leave the data in cache.

Unfortunately, the unaligned compressed data is hard to write non-temporally. Maybe it could be packed into a buffer register or two and flushed in 16-byte units, but it's a headache.
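The staging idea above could look roughly like this: accumulate unaligned output bytes in a 16-byte-aligned scratch buffer and flush each full 16-byte unit with a streaming store. This is a hedged sketch of the approach being dismissed as a headache, not code from the PR; all names are made up.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical staging writer: buffers bytes until a full 16-byte unit
 * can be written with a non-temporal store. `out` must start 16-byte
 * aligned and advances in 16-byte steps. */
typedef struct {
    uint8_t scratch[16] __attribute__((aligned(16)));
    size_t fill;    /* bytes currently staged in scratch */
    uint8_t *out;   /* 16-byte-aligned output cursor */
} nt_writer;

static void nt_write(nt_writer *w, const uint8_t *src, size_t len) {
    while (len > 0) {
        size_t take = 16 - w->fill;
        if (take > len) take = len;
        memcpy(w->scratch + w->fill, src, take);
        w->fill += take; src += take; len -= take;
        if (w->fill == 16) {  /* full unit: flush non-temporally */
            _mm_stream_si128((__m128i *)w->out,
                             _mm_load_si128((const __m128i *)w->scratch));
            w->out += 16;
            w->fill = 0;
        }
    }
}

static void nt_flush(nt_writer *w) {
    memcpy(w->out, w->scratch, w->fill);  /* partial tail: plain copy */
    w->out += w->fill;
    w->fill = 0;
    _mm_sfence();  /* order streaming stores before later reads */
}
```

The extra `memcpy` per call is exactly the overhead that makes this unattractive compared with the aligned decode path.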

@lemire (Member)

lemire commented Oct 31, 2017

What I'd suggest, if we are going to merge this, is to make it an option, but not necessarily the default.

@aqrit (Collaborator)

aqrit commented Oct 6, 2018

> Unfortunately the non-aligned compressed data is hard to write non-temporally. Maybe it could be packed into a buffer register or two and flushed in 16 byte units, but it's a headache.

`_mm_maskmoveu_si128(data, ~shuffle_mask, out);` might be fast on Intel (though slow on AMD CPUs?)
