R5900: Improve the EE cache performance #12108
Draft
+289
−94
Description of Changes
Entry prefetching
I shrank a TLB entry from 48 bytes down to 16 bytes. Theoretically, on a lookup we would then prefetch up to 3 other TLB entries (since four now fit in a 64-byte cache line), which is nice because the hottest code looks up entries linearly.
Because I made the mistake of assuming this was any sort of bottleneck without checking, this change on its own actually slowed things down: we weren't memory bound here, and the precomputed entry values that were bloating the structure were actually beneficial.
This optimization combined with the ones below turned out to be an improvement, so it is present in this PR.
Common Subexpression Elimination
From
Into
Because of how hot this code is, I wanted to help the compiler and processor out a bit. Instead of repeatedly indexing into the array on every entry access, we take a reference to the entry at the top of the loop body.
This is a common pattern, so I was hoping to hit some sort of compiler heuristic, or at least access memory in a more cache-friendly way. It turns out it helps: I saw a general speed increase, and we were about 0.6% less memory bound.
Only Check Valid Entries
Instead of looking through every cache entry to see if a specific address should be cached, we can instead build a separate list of "cached entries" and only look through those.
This was the most significant optimization. It reduced the number of branches and increased branch prediction accuracy.
Overall I've seen a performance increase of around 20%.
Rationale behind Changes
I want to get more familiar with VTune profiling. The EE cache is also very slow.
Suggested Testing Steps
Test games that require the EE cache with this PR (ensure any patches we have for the game are disabled).
Run the EE cache and compare the speed to master.