Sim: Faster type coercion in `toID` #10619

larry-the-table-guy · 2024-10-14T20:42:10Z

Derivative of #10606

Stats

npm run full-test got 2.5% faster

(4,000,000 calls per row; each value is the fraction of those calls)

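| already id | already string | same as previous text |
| --- | --- | --- |
| 0.647 | 0.948 | 0.104 |
| 0.628 | 0.973 | 0.304 |
| 0.437 | 0.999 | 0.158 |
| 0.510 | 0.999 | 0.116 |
| 0.493 | 0.999 | 0.114 |
| 0.499 | 0.999 | 0.105 |
| 0.389 | 0.999 | 0.160 |
| 0.340 | 0.999 | 0.345 |
| 0.441 | 0.999 | 0.141 |
| 0.303 | 0.999 | 0.080 |
| 0.407 | 0.999 | 0.101 |
| 0.539 | 0.999 | 0.075 |
| 0.347 | 0.999 | 0.148 |
| 0.411 | 0.999 | 0.159 |
| 0.535 | 0.999 | 0.080 |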

About 60M calls total. Before the test perf PR, there were ~100M.

So, this change makes sense because the overwhelming majority of inputs are already strings.

When I doubled the `toLowerCase().replace()` expression, runtime increased by 4%, which suggests there's still decent time to be saved here.

The next step would be to find the code paths that trigger 'same as previous text'.
Up to 8.76M of the 60M calls are redundant (this catches "AAABBBCCC" sequences, but not "ABCABCABC").
It's not practical to remove all the repeated calls, but removing the easy cases is a good start.

Then, search for the code paths that most frequently pass IDs to `.get` or `toID`.
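
To illustrate what the 'same as previous text' counter does and does not catch, here is a minimal standalone sketch. `countConsecutiveRepeats` is a hypothetical helper that mirrors the `prev_text` check in the profiling code; it is not part of the PR.

```typescript
// Counts inputs that are identical to the immediately preceding input.
// Mirrors the prev_text / same_as_previous_text logic in the profiling code.
function countConsecutiveRepeats(inputs: string[]): number {
	let repeats = 0;
	let prev: string | null = null;
	for (const text of inputs) {
		if (text === prev) repeats++;
		else prev = text;
	}
	return repeats;
}

// Back-to-back repeats are counted: 2 each for A, B, and C.
console.log(countConsecutiveRepeats('AAABBBCCC'.split(''))); // 6
// Interleaved repeats are invisible to this counter.
console.log(countConsecutiveRepeats('ABCABCABC'.split(''))); // 0
```

So the 8.76M figure only reflects back-to-back duplicates, not every repeated input.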

(These measurements are probably heavily skewed towards sim code, so the last two measures might not be worthwhile in practice)

After that, it might help to do a preliminary regex to return early on strings that are already IDs.
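
A rough sketch of what such an early return might look like, assuming an anchored `[a-z0-9]` check. `toIDWithFastPath` and `ALREADY_ID` are hypothetical names, not code from this PR, and whether the extra `test()` call pays for itself has not been measured.

```typescript
// Simplified brand type for illustration; the real ID type may differ.
type ID = string & {__isID: true};

// Hypothetical pre-check: matches strings made up only of [a-z0-9] (including '').
const ALREADY_ID = /^[a-z0-9]*$/;

function toIDWithFastPath(text: string): ID {
	// If the input is already a valid ID, skip toLowerCase() and replace().
	if (ALREADY_ID.test(text)) return text as ID;
	return text.toLowerCase().replace(/[^a-z0-9]+/g, '') as ID;
}

console.log(toIDWithFastPath('pikachu')); // pikachu (fast path)
console.log(toIDWithFastPath('Mr. Mime')); // mrmime (slow path)
```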

After that, you could add a small cache of name->id, built with data from Dex. Other than speed, this can reduce sim/dex* mem usage by a little since you can re-use the string literal IDs when building the cache, instead of the computed IDs. String literals are interned so they already live forever. Might as well consider having dex* objects hold onto the interned strings instead of the computed strings.
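
A minimal sketch of the cache idea, under stated assumptions: `exampleData` is a hypothetical stand-in for Dex data, and `computeID`, `nameToIDCache`, and `cachedToID` are illustrative names. Whether reusing the table keys actually reduces memory depends on the engine and was not measured.

```typescript
// Simplified brand type for illustration; the real ID type may differ.
type ID = string & {__isID: true};

function computeID(text: string): ID {
	return text.toLowerCase().replace(/[^a-z0-9]+/g, '') as ID;
}

// Hypothetical stand-in for a Dex table: keys are ID literals, values hold display names.
const exampleData: {[id: string]: {name: string}} = {
	pikachu: {name: 'Pikachu'},
	mrmime: {name: 'Mr. Mime'},
};

// Build the cache once, storing the table's own key strings
// instead of freshly computed IDs.
const nameToIDCache = new Map<string, ID>();
for (const id in exampleData) {
	nameToIDCache.set(exampleData[id].name, id as ID);
	nameToIDCache.set(id, id as ID);
}

function cachedToID(text: string): ID {
	const hit = nameToIDCache.get(text);
	// Fall back to the normal computation on a cache miss.
	return hit !== undefined ? hit : computeID(text);
}

console.log(cachedToID('Mr. Mime')); // mrmime (cache hit)
console.log(cachedToID('Some New Name')); // somenewname (cache miss)
```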

profiling code

let already_string = 0;
let call_count = 0;
let already_id = 0;
let same_as_previous_text = 0;
let prev_text = ' some unlikely string ';

export function toID(text: any): ID {
        if (text === prev_text) ++same_as_previous_text;
        else prev_text = text;
        if (++call_count === (4 * 1000 * 1000)) {
                console.log(); // newline and symbol for grepping
                console.log(`| ${already_id / call_count} | ${already_string / call_count} | ${same_as_previous_text / call_count} |`);
                same_as_previous_text = call_count = already_id = already_string = 0;
        }
        if (typeof text !== 'string') {
                if (text) text = text.id || text.userid || text.roomid || text;
                if (typeof text === 'number') text = '' + text;
                else if (typeof text !== 'string') return '';
        } else {
                already_string++;
        }
        const id = text.toLowerCase().replace(/[^a-z0-9]+/g, '') as ID;
        if (id === text) already_id++;
        return id;
}

larry-the-table-guy · 2024-10-14T20:53:23Z

Ok, technically, this breaks for boolean and others. But there's no good reason to be passing those in the first place imo. Simple fix is to do '' + text more broadly, but I'll just put this on the backburner.

Edit: aaand now I remember the original code doesn't handle bool anyway. Sigh. any is not a good type for a parameter...

Slayer95 · 2024-10-14T23:16:27Z

The reason `any` is the parameter type is that `toID` is often used as a handy sanitizer. So, if an untrusted peer sends, say, a JSON object where a contained value is expected to be a non-empty string, then using `toID` simplifies the whole validation a lot. Supporting numbers is a nice extension that rounds out the concept of an identifier.

So, the fact that toID is often called with arbitrary data as its parameter brings about my comment yesterday regarding the huge complexity that changing callsites not to pass IDs to toID would entail.

larry-the-table-guy · 2024-10-14T23:20:53Z

I understand that's the intent, but based on #10549 (comment), the vast majority of the callsites already know they have a string.
And some know they have an ID, yet call methods like `DexSpecies.get`.

The function's contract is just unnecessarily weak.
IMO type coercion should have been separate from the lower+replace logic.
But it's not worth changing now.

Slayer95 · 2024-10-14T23:30:19Z

> The vast majority of the callsites already know they have a string.

That's true.

> And some know they have an ID, yet call methods like DexSpecies.get
> But it's not worth changing now.

Yea. Changing that means that each of these functions would need a strict and a lax variant, thus doubling mental overhead from function names. And god forbid refactoring bugs, and having to think about Hidden Power *.

larry-the-table-guy · 2024-10-14T23:51:24Z

> mental overhead

To me, the more pressing source of that is how many states the system can be in.
More possible types -> more states.
It's also why I think the lazy loading in Dex is meh. It adds another state to consider. I mean, at this very moment, there's a bug in DexStats - it reads dex.gen before that's been written.

I think a simple system comes from predictable data flow. Lazy loading and any types just don't help with that, IMO.
/endrant

Anyway, the fact that I can't meaningfully profile this ATM indicates I should be working on other problems.

Zarel · 2024-10-29T12:20:21Z

For the record, I do admit it was probably a mistake to use this function in hot code. I don't mind a separate nameToID function for known strings. toID probably shouldn't be used in inner loops at all, although that might be a lost cause at this point. Blame... uh, the lack of any way to enforce type safety when I first wrote Showdown, I guess.
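
As a sketch of how that split could look (not code from this PR): a strict `nameToID` for known strings, with the lax `toID` keeping the coercion shown in the profiling code and delegating to it. The exact signatures and the simplified `ID` type are assumptions.

```typescript
// Simplified brand type for illustration; the real ID type may differ.
type ID = string & {__isID: true};

// Strict variant: for call sites that already hold a string.
function nameToID(name: string): ID {
	return name.toLowerCase().replace(/[^a-z0-9]+/g, '') as ID;
}

// Lax variant: coerces arbitrary input first, then delegates.
// The coercion mirrors the profiling code earlier in this thread.
function toID(text: any): ID {
	if (typeof text !== 'string') {
		if (text) text = text.id || text.userid || text.roomid || text;
		if (typeof text === 'number') text = '' + text;
		else if (typeof text !== 'string') return '' as ID;
	}
	return nameToID(text);
}

console.log(nameToID('Mr. Mime')); // mrmime
console.log(toID({id: 'Pikachu'})); // pikachu
console.log(toID(null)); // '' (empty string)
```

Hot code that already has a string would call `nameToID` directly and skip the `typeof` checks; untrusted input would keep going through `toID`.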

larry-the-table-guy · 2024-11-09T16:47:11Z

Re-opened because this is now a measurable improvement.

Zarel · 2024-11-10T03:09:49Z

Thanks!

Sim: Faster type coercion in toID

b4a6d37

larry-the-table-guy marked this pull request as draft October 14, 2024 20:52

larry-the-table-guy mentioned this pull request Oct 27, 2024

Test: Correct several format IDs #10633

Merged

Return early on numbers in toID

f462716

larry-the-table-guy force-pushed the sim/toID-faster-type-coercion branch from 51f29d3 to f462716 Compare November 1, 2024 01:42

larry-the-table-guy added 2 commits November 7, 2024 08:32

Revert "Return early on numbers in toID"

811e23b

This reverts commit f462716. It would break negatives and decimals.
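
A small illustration of why an early return on numbers changes results for negatives and decimals. The contents of f462716 are not shown here, so `earlyReturnOnNumber` is only a guess at its shape (stringify and return without normalizing); `normalize` and `viaNormalize` are likewise hypothetical names.

```typescript
// The lower+replace step used by toID.
function normalize(text: string): string {
	return text.toLowerCase().replace(/[^a-z0-9]+/g, '');
}

// Assumed shape of the early return: stringify the number and skip normalization.
function earlyReturnOnNumber(n: number): string {
	return '' + n;
}

// Existing behavior: stringify, then normalize like any other string.
function viaNormalize(n: number): string {
	return normalize('' + n);
}

for (const n of [42, -1, 1.5]) {
	console.log(n, JSON.stringify(earlyReturnOnNumber(n)), JSON.stringify(viaNormalize(n)));
}
// 42 "42" "42"
// -1 "-1" "1"
// 1.5 "1.5" "15"
```

Non-negative integers come out the same either way, but `-` and `.` survive the early return, so the result is no longer a valid ID.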

Merge branch 'smogon:master' into sim/toID-faster-type-coercion

cfd4c2a

larry-the-table-guy marked this pull request as ready for review November 9, 2024 16:46

Zarel merged commit ff8c9a0 into smogon:master Nov 10, 2024
1 check passed
