
R5900: Improve the EE cache performance #12108

Draft: wants to merge 3 commits into master

Conversation

@F0bes F0bes commented Dec 19, 2024

Description of Changes

Entry prefetching

I shrank a TLB entry from 48 bytes down to 16 bytes. Theoretically, a lookup would then prefetch up to 3 neighbouring TLB entries (since four 16-byte entries fit in a 64-byte cache line), which is nice because the hottest code looks up entries linearly.

Because I made the mistake of assuming this was any sort of bottleneck without checking, this change on its own actually slowed things down. We weren't memory bound here, and the precomputed entry values that were bloating the structure were actually beneficial.

This optimization combined with the ones below turned out to be an improvement, so it is present in this PR.

Common Subexpression Elimination

From

```cpp
for (int i = 0; i < 48; i++)
{
	if (entry_list[i].isCached())
	{
		// do work with entry_list[i]
	}
}
```

Into

```cpp
for (int i = 0; i < 48; i++)
{
	const tlbentry& entry = entry_list[i];
	if (entry.isCached())
	{
		// do work with entry
	}
}
```

Because of how hot this code is, I wanted to help the compiler and processor out. Instead of repeatedly indexing into the array on every entry access, we take a reference to the entry at the top of the loop.
This is a common pattern, so I was hoping to hit some compiler heuristic, or at least access memory in a more cache-friendly way. It turns out it did: I saw a general speed increase, and we were less memory bound by around 0.6%.

Only Check Valid Entries

Instead of looking through every cache entry to see if a specific address should be cached, we can instead build a separate list of "cached entries" and only look through those.
This was the most significant optimization. It reduced the number of branches and increased branch prediction accuracy.

Overall I've seen a performance increase of around 20%.

Rationale behind Changes

I want to get more familiar with VTune profiling. The EE cache is also very slow.

Suggested Testing Steps

Test games that require the EE cache with this PR (ensure any patches we have for the game are disabled).
Run with the EE cache enabled and compare the speed to master.

@JordanTheToaster (Member)

Just one benchmark for now, but in Ape Escape 2 the EE cache is 48% faster with this PR.


@JordanTheToaster (Member)

With 128-bit SIMD, Ape Escape 2 is now 240% faster than master.

