Add valid_identifier_{pvn,sv} API functions #22769

leonerd · 2024-11-20T15:30:05Z

I've wanted to perform this kind of check from core code or XS modules for quite some time, but oddly we didn't appear to have an API for it until now.

There's likely places already in internal code that could be neatened by calling these functions. We can identify and fix those up later.

This set of changes requires a perldelta entry, and it is included.

leonerd · 2024-11-20T15:30:46Z

I put these functions in toke.c just because they feel related to the Perl tokenizer - i.e. it's all about the rules that perl applies to perl source code. But there's no firm reason for them to live there, if elsewhere would be better.

khwilliamson

It also occurs to me that maybe identifiers should be restricted to every character being in the same script. This word pауpаl is not what you likely think. All but one letter in it is Cyrillic. This could be used in an attack of pull requests of malicious source code. We already have a function that checks that a string is all from the same script. It could be called just before the return true, or it could easily be changed to have an extra parameter which would be true if the string also needed to be a legal identifier

And it occurs to me that the loop could be simplified further if the underlying _safe code checked that the string had positive length. Right now I think we would get a malformed warning without that check in the loop.

khwilliamson · 2024-11-20T16:13:55Z

pod/perldelta.pod

@@ -350,6 +350,12 @@ well.

 XXX

+=item *
+
+New API functions C<valid_identifier_pvn()> and C<valid_identifier_sv()> have


Shouldn't these link to the new functions?

khwilliamson · 2024-11-20T17:29:19Z

toke.c

+                break;
+            if(!isIDCONT_utf8_safe((U8 *)s, end))
+                return false;
+        }


Couldn't this loop be more simply written as

while (s < end) { s += UTF8SKIP((U8 *) s); if (s >= end || ! isIDCONT_utf8_safe((U8 *) s, end)) { return false; } }

And the second loop below similarly simplified?

Not quite; it still needs the if(s == end) break condition, but it seems fine if I add that.

Updated and force-pushed.

leonerd · 2024-11-20T19:28:57Z

It also occurs to me that maybe identifiers should be restricted to every character being in the same script. This word pауpаl is not what you likely think. All but one letter in it is Cyrillic. This could be used in an attack of pull requests of malicious source code. We already have a function that checks that a string is all from the same script. It could be called just before the return true, or it could easily be changed to have an extra parameter which would be true if the string also needed to be a legal identifier

That could be a useful behaviour to have somewhere, perhaps with an optional flag. But my intention with this function was simply to match what the parser does, for purposes of using it to address this comment: #22765 (comment)

toke.c

tonycoz · 2024-11-20T22:29:59Z

toke.c

+    if(!len)
+        return false;
+
+    if(flags & SVf_UTF8) {


This is a bit tricky, in the context of the parser it's fine - we only allow non-ASCII code points in identifiers, but if this is a general use function it has the Unicode bug:

$ ./perl -Ilib -Mutf8 -MXS::APItest -E 'my $x = "café"; say valid_identifier($x) || 0; utf8::downgrade($x); say valid_identifier($x) || 0' 1 0

This could be "fixed" (I hope) by using isIDFIRST_L1()/isIDCONT_L1(), but that would allow identifiers the parser wouldn't otherwise accept when used during parsing.

Yeah I tried to make it clear from the docs that this is all about what the parser would accept.

toke.c

leonerd · 2024-11-21T11:49:37Z

Addressed the latest round of comments, and pushed again. Will need a squash before I merge it (I kept the fixes as separate commits to view them easier).

Leont · 2024-11-21T19:26:41Z

It sounds useful, except it's not clear to me when it would be useful. Do you have some examples?

tonycoz · 2024-11-21T22:15:23Z

It sounds useful, except it's not clear to me when it would be useful. Do you have some examples?

I don't know who or what you're responding to here.

nm

leonerd · 2024-11-21T23:17:17Z

It sounds useful, except it's not clear to me when it would be useful. Do you have some examples?

I'm about to add a call to it in the class field :reader attribute handler, to check that the requested method name is actually valid. Then the new :writer can do the same. Inspired by this PR comment - #22765 (comment)

khwilliamson · 2024-11-22T20:37:38Z

One set of my comments got lost. Having worked with these kinds of functions a lot, I believe the better API is to pass the end pointer instead of the length. That pointer typically must be calculated anyway, as this function now uses the length solely for that purpose, so just skip the intermediate value. I can give further reasons if this is not enough to convince you

leonerd · 2024-11-22T20:51:19Z

@khwilliamson Oh I see. Mm. I could add a _pve variant and put the real logic in there, and then have the _pvn and _sv versions just invoke that by calculations. I suspect having a _pvn in the API anyway would still be useful. It's still the expected pattern that the majority of existing functions all follow, so it would seem odd to not have that.

leonerd · 2024-11-22T21:47:15Z

@khwilliamson ^-- that latest commit then adds such a function.

khwilliamson

I like the _pve idea, which _sv should call directly. Other than that lgtm

khwilliamson · 2024-11-23T17:38:38Z

toke.c

+    if(!sv || !SvOK(sv) || !SvPOK(sv))
+        return false;
+
+    return valid_identifier_pvn(SvPVX_const(sv), SvCUR(sv), SvUTF8(sv));


Use SvEND and call the _pve form saves work

I tried that just now but it broke the stringify test. Maybe the two separate calls to SvPV and SvEND cause it to stringify two separate buffers? Anyway, just using pv + len here works fine.

leonerd · 2024-11-23T18:57:31Z

force-pushed to squash before merge

These functions test whether a given string would be considered by the Perl parser to be a valid identifier. Three variants are provided: one taking a string start/end pair, one a string start/length pair, and one looking at the string contained in an SV.

leonerd requested a review from khwilliamson November 20, 2024 17:33

khwilliamson reviewed Nov 20, 2024

View reviewed changes

leonerd force-pushed the valid-identifier branch 2 times, most recently from 37987d7 to 36bcb73 Compare November 20, 2024 19:27

tonycoz reviewed Nov 20, 2024

View reviewed changes

toke.c Outdated Show resolved Hide resolved

tonycoz reviewed Nov 20, 2024

View reviewed changes

toke.c Outdated Show resolved Hide resolved

tonycoz reviewed Nov 20, 2024

View reviewed changes

toke.c Outdated Show resolved Hide resolved

leonerd force-pushed the valid-identifier branch from 36bcb73 to 406fceb Compare November 21, 2024 11:48

leonerd added the squash-before-merge Author must squash the commits down before merging to blead label Nov 21, 2024

leonerd force-pushed the valid-identifier branch from b0fda64 to 49b9a0b Compare November 23, 2024 15:14

khwilliamson approved these changes Nov 23, 2024

View reviewed changes

leonerd force-pushed the valid-identifier branch from de8c628 to 366d974 Compare November 23, 2024 18:57

leonerd force-pushed the valid-identifier branch from 366d974 to 049897a Compare November 23, 2024 19:39

leonerd removed the squash-before-merge Author must squash the commits down before merging to blead label Nov 23, 2024

leonerd merged commit 429dcba into Perl:blead Nov 23, 2024
33 checks passed

leonerd deleted the valid-identifier branch November 23, 2024 20:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add valid_identifier_{pvn,sv} API functions #22769

Add valid_identifier_{pvn,sv} API functions #22769

leonerd commented Nov 20, 2024

leonerd commented Nov 20, 2024

khwilliamson left a comment

khwilliamson Nov 20, 2024

khwilliamson Nov 20, 2024

leonerd Nov 20, 2024

leonerd commented Nov 20, 2024

tonycoz Nov 20, 2024

leonerd Nov 21, 2024

leonerd commented Nov 21, 2024

Leont commented Nov 21, 2024

tonycoz commented Nov 21, 2024 •

edited

Loading

leonerd commented Nov 21, 2024

khwilliamson commented Nov 22, 2024

leonerd commented Nov 22, 2024

leonerd commented Nov 22, 2024

khwilliamson left a comment

khwilliamson Nov 23, 2024

leonerd Nov 23, 2024

leonerd commented Nov 23, 2024

Add valid_identifier_{pvn,sv} API functions #22769

Add valid_identifier_{pvn,sv} API functions #22769

Conversation

leonerd commented Nov 20, 2024

leonerd commented Nov 20, 2024

khwilliamson left a comment

Choose a reason for hiding this comment

khwilliamson Nov 20, 2024

Choose a reason for hiding this comment

khwilliamson Nov 20, 2024

Choose a reason for hiding this comment

leonerd Nov 20, 2024

Choose a reason for hiding this comment

leonerd commented Nov 20, 2024

tonycoz Nov 20, 2024

Choose a reason for hiding this comment

leonerd Nov 21, 2024

Choose a reason for hiding this comment

leonerd commented Nov 21, 2024

Leont commented Nov 21, 2024

tonycoz commented Nov 21, 2024 • edited Loading

leonerd commented Nov 21, 2024

khwilliamson commented Nov 22, 2024

leonerd commented Nov 22, 2024

leonerd commented Nov 22, 2024

khwilliamson left a comment

Choose a reason for hiding this comment

khwilliamson Nov 23, 2024

Choose a reason for hiding this comment

leonerd Nov 23, 2024

Choose a reason for hiding this comment

leonerd commented Nov 23, 2024

tonycoz commented Nov 21, 2024 •

edited

Loading