Sim: Faster type coercion in toID #10619

Merged


Contributor

@larry-the-table-guy larry-the-table-guy commented Oct 14, 2024

Derivative of #10606

Stats

npm run full-test got 2.5% faster

(4,000,000 calls per row)

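| already id | already string | same as previous text |
| --- | --- | --- |
| 0.647 | 0.948 | 0.104 |
| 0.628 | 0.973 | 0.304 |
| 0.437 | 0.999 | 0.158 |
| 0.510 | 0.999 | 0.116 |
| 0.493 | 0.999 | 0.114 |
| 0.499 | 0.999 | 0.105 |
| 0.389 | 0.999 | 0.160 |
| 0.340 | 0.999 | 0.345 |
| 0.441 | 0.999 | 0.141 |
| 0.303 | 0.999 | 0.080 |
| 0.407 | 0.999 | 0.101 |
| 0.539 | 0.999 | 0.075 |
| 0.347 | 0.999 | 0.148 |
| 0.411 | 0.999 | 0.159 |
| 0.535 | 0.999 | 0.080 |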

About 60M calls total. Before the test perf PR, there were ~100M.

So, this change makes sense because the overwhelming majority of inputs are already strings.


When I doubled the 'toLowerCase().replace()' expression, runtime increased by 4%, indicating that there's still decent time to be saved here.

The next step would be to find the code paths that trigger 'same as previous text'.
Up to 8.76 M / 60 M calls are redundant. (this catches "AAABBBCCC" sequences, but not "ABCABCABC").
It's not practical to remove all the repeated calls, but removing the easy cases is a good start.
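To make the "AAABBBCCC" versus "ABCABCABC" distinction concrete, here is a minimal sketch of the same consecutive-repeat check that the `same_as_previous_text` counter performs. The helper name `countConsecutiveRepeats` is hypothetical and exists only for illustration.

```typescript
// Hypothetical helper: counts inputs that are identical to the immediately
// preceding input, mirroring the same_as_previous_text counter.
function countConsecutiveRepeats(inputs: readonly string[]): number {
	let repeats = 0;
	let prev: string | undefined = undefined;
	for (const text of inputs) {
		if (text === prev) repeats++;
		else prev = text;
	}
	return repeats;
}

// Grouped repeats are detected: 6 of these 9 calls repeat the previous argument.
const grouped = countConsecutiveRepeats(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']);
// Interleaved repeats are not: every call differs from the one before it, so this is 0.
const interleaved = countConsecutiveRepeats(['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']);
```

This is why the figure is an upper bound on the easy cases only: interleaved repeats are just as redundant but go uncounted.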

Then, search for the code paths that most frequently pass IDs to .get or toID.

(These measurements are probably heavily skewed towards sim code, so the last two measures might not be worthwhile in practice)

After that, it might help to do a preliminary regex to return early on strings that are already IDs.
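A rough sketch of what such a preliminary check could look like, assuming the input is already known to be a string. The names `ALREADY_ID` and `stringToID` are hypothetical, `ID` is simplified to a plain string alias, and whether the extra regex test is a net win would have to be measured.

```typescript
// Simplified stand-in for the project's ID type (assumption for this sketch).
type ID = string;

// Matches strings made only of lowercase ASCII letters and digits, which
// toLowerCase().replace(/[^a-z0-9]+/g, '') would return unchanged.
const ALREADY_ID = /^[a-z0-9]*$/;

function stringToID(text: string): ID {
	// Fast path: nothing to lowercase or strip, so return the input as-is.
	if (ALREADY_ID.test(text)) return text;
	return text.toLowerCase().replace(/[^a-z0-9]+/g, '');
}
```

The fast path also avoids allocating a new string, since the input itself is returned.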

After that, you could add a small cache of name->id, built with data from Dex. Other than speed, this can reduce sim/dex* mem usage by a little since you can re-use the string literal IDs when building the cache, instead of the computed IDs. String literals are interned so they already live forever. Might as well consider having dex* objects hold onto the interned strings instead of the computed strings.
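One way the cache idea could be sketched, under the assumption that Dex data is available as a table keyed by ID whose entries carry a display name. The table contents, `nameToIDCache`, `computeID`, and `cachedToID` are all hypothetical names for illustration, not the project's API.

```typescript
// Simplified stand-in for the project's ID type (assumption for this sketch).
type ID = string;

function computeID(text: string): ID {
	return text.toLowerCase().replace(/[^a-z0-9]+/g, '');
}

// Assumed data shape: keys are IDs written as string literals in the data file.
const speciesTable: Record<ID, {name: string}> = {
	mrmime: {name: 'Mr. Mime'},
	hooh: {name: 'Ho-Oh'},
};

const nameToIDCache = new Map<string, ID>();
for (const [id, entry] of Object.entries(speciesTable)) {
	// Store the table key itself rather than a freshly computed string,
	// so the cache reuses the literal instead of allocating a copy.
	nameToIDCache.set(entry.name, id);
	nameToIDCache.set(id, id);
}

function cachedToID(text: string): ID {
	// Fall back to the normal computation for names that are not in the cache.
	return nameToIDCache.get(text) ?? computeID(text);
}
```

Only exact spellings hit the cache in this sketch; anything else (different casing, extra spaces) takes the fallback and still gets the right ID.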


profiling code
// Counters for the current batch of 4,000,000 calls.
let already_string = 0;
let call_count = 0;
let already_id = 0;
let same_as_previous_text = 0;
// Sentinel that should never match a real input.
let prev_text = ' some unlikely string ';

export function toID(text: any): ID {
        // Count calls whose argument is identical to the previous call's argument.
        if (text === prev_text) ++same_as_previous_text;
        else prev_text = text;
        // Every 4M calls, print one table row of ratios and reset the counters.
        if (++call_count === (4 * 1000 * 1000)) {
                console.log(); // newline and symbol for grepping
                console.log(`| ${already_id / call_count} | ${already_string / call_count} | ${same_as_previous_text / call_count} |`);
                same_as_previous_text = call_count = already_id = already_string = 0;
        }
        if (typeof text !== 'string') {
                // Slow path: unwrap objects that carry an id, then coerce numbers.
                if (text) text = text.id || text.userid || text.roomid || text;
                if (typeof text === 'number') text = '' + text;
                else if (typeof text !== 'string') return '';
        } else {
                already_string++;
        }
        const id = text.toLowerCase().replace(/[^a-z0-9]+/g, '') as ID;
        // The input needed no normalization at all.
        if (id === text) already_id++;
        return id;
}

@larry-the-table-guy larry-the-table-guy marked this pull request as draft October 14, 2024 20:52
@larry-the-table-guy
Contributor Author

larry-the-table-guy commented Oct 14, 2024

Ok, technically, this breaks for boolean and others. But there's no good reason to be passing those in the first place imo. Simple fix is to do '' + text more broadly, but I'll just put this on the backburner.

Edit: aaand now I remember the original code doesn't handle bool anyway. Sigh. any is not a good type for a parameter...

@Slayer95
Contributor

Slayer95 commented Oct 14, 2024

The reason any is the parameter type is that toID is often used as a handy sanitizer. So, if an untrusted peer sends, say, a JSON object where a contained value is expected to be a non-empty string, using toID simplifies the whole validation a lot. Supporting numbers is a nice extension that rounds out the concept of an identifier.
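A minimal sketch of the sanitizer pattern described here. The `toID` body follows the profiling version posted earlier in this thread, minus the counters; `parseTarget` and the `target` field are hypothetical, and `ID` is simplified to a plain string alias.

```typescript
// Simplified stand-in for the project's ID type (assumption for this sketch).
type ID = string;

function toID(text: any): ID {
	if (typeof text !== 'string') {
		if (text) text = text.id || text.userid || text.roomid || text;
		if (typeof text === 'number') text = '' + text;
		else if (typeof text !== 'string') return '';
	}
	return text.toLowerCase().replace(/[^a-z0-9]+/g, '');
}

// Hypothetical handler: `target` is expected to be a non-empty string,
// but the peer is untrusted and may send anything.
function parseTarget(rawJSON: string): ID | null {
	let message: any;
	try {
		message = JSON.parse(rawJSON);
	} catch {
		return null; // not valid JSON
	}
	// message may be null or a primitive, so guard the property access.
	const target = toID(message?.target);
	// Objects, arrays, booleans, and missing fields all collapse to ''.
	return target || null;
}
```

A single call replaces separate checks for "is it present", "is it a string", and "is it non-empty after normalization".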

So, the fact that toID is often called with arbitrary data as its parameter is behind my comment yesterday about the huge complexity that changing callsites not to pass IDs to toID would entail.

@larry-the-table-guy
Contributor Author

I understand that's the intent, but based on #10549 (comment), the vast majority of the callsites already know they have a string.
And some know they have an ID, yet call methods like DexSpecies.get.

The function's contract is just unnecessarily weak.
IMO type coercion should have been separate from the lower+replace logic.
But it's not worth changing now.
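For illustration only, the separation could look roughly like the sketch below. The names `coerceToText` and `textToID` are hypothetical, `ID` is simplified to a plain string alias, and the coercion rules are copied from the existing toID logic rather than redesigned.

```typescript
// Simplified stand-in for the project's ID type (assumption for this sketch).
type ID = string;

// Step 1 (hypothetical name): turn arbitrary input into a string,
// using the same rules as the non-string branch of toID.
function coerceToText(value: any): string {
	if (typeof value === 'string') return value;
	if (value) value = value.id || value.userid || value.roomid || value;
	if (typeof value === 'number') return '' + value;
	return typeof value === 'string' ? value : '';
}

// Step 2 (hypothetical name): the lower+replace logic, with a strict contract.
function textToID(text: string): ID {
	return text.toLowerCase().replace(/[^a-z0-9]+/g, '');
}

// The lax entry point is then just the composition of the two steps.
function toID(value: any): ID {
	return textToID(coerceToText(value));
}
```

Callsites that already hold a string could call `textToID` directly and skip the type checks, while untrusted input would keep going through `toID`.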

@Slayer95
Contributor

Slayer95 commented Oct 14, 2024

> The vast majority of the callsites already know they have a string.

That's true.

> And some know they have an ID, yet call methods like DexSpecies.get
> But it's not worth changing now.

Yea. Changing that means that each of these functions would need a strict and a lax variant, thus doubling mental overhead from function names. And god forbid refactoring bugs, and having to think about Hidden Power *.

@larry-the-table-guy
Contributor Author

larry-the-table-guy commented Oct 14, 2024

> mental overhead

To me, the more pressing source of that is how many states the system can be in.
More possible types -> more states.
It's also why I think the lazy loading in Dex is meh. It adds another state to consider. I mean, at this very moment, there's a bug in DexStats - it reads dex.gen before that's been written.

I think a simple system comes from predictable data flow. Lazy loading and any types just don't help with that, IMO.
/endrant

Anyway, the fact that I can't meaningfully profile this ATM indicates I should be working on other problems.

@Zarel
Member

Zarel commented Oct 29, 2024

For the record, I do admit it was probably a mistake to use this function in hot code. I don't mind a separate nameToID function for known strings. toID probably shouldn't be used in inner loops at all, although that might be a lost cause at this point. Blame... uh, the lack of any way to enforce type safety when I first wrote Showdown, I guess.

@larry-the-table-guy larry-the-table-guy marked this pull request as ready for review November 9, 2024 16:46
@larry-the-table-guy
Contributor Author

Re-opened because this is now a measurable improvement.

@Zarel
Member

Zarel commented Nov 10, 2024

Thanks!

@Zarel Zarel merged commit ff8c9a0 into smogon:master Nov 10, 2024
1 check passed