Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add valid_identifier_{pvn,sv} API functions #22769

Merged
merged 1 commit into from
Nov 23, 2024

Conversation

leonerd
Copy link
Contributor

@leonerd leonerd commented Nov 20, 2024

I've wanted to perform this kind of check from core code or XS modules for quite some time, but oddly we didn't appear to have an API for it until now.

There's likely places already in internal code that could be neatened by calling these functions. We can identify and fix those up later.

  • This set of changes requires a perldelta entry, and it is included.

@leonerd
Copy link
Contributor Author

leonerd commented Nov 20, 2024

I put these functions in toke.c just because they feel related to the Perl tokenizer - i.e. it's all about the rules that perl applies to perl source code. But there's no firm reason for them to live there, if elsewhere would be better.

Copy link
Contributor

@khwilliamson khwilliamson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also occurs to me that maybe identifiers should be restricted to every character being in the same script. This word pауpаl is not what you likely think. All but one letter in it is Cyrillic. This could be used in an attack of pull requests of malicious source code. We already have a function that checks that a string is all from the same script. It could be called just before the return true, or it could easily be changed to have an extra parameter which would be true if the string also needed to be a legal identifier

And it occurs to me that the loop could be simplified further if the underlying _safe code checked that the string had positive length. Right now I think we would get a malformed warning without that check in the loop.

@@ -350,6 +350,12 @@ well.

XXX

=item *

New API functions C<valid_identifier_pvn()> and C<valid_identifier_sv()> have
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these link to the new functions?

break;
if(!isIDCONT_utf8_safe((U8 *)s, end))
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this loop be more simply written as

while (s < end) {
    s += UTF8SKIP((U8 *) s);
    if (s >= end || ! isIDCONT_utf8_safe((U8 *) s, end)) {
        return false;
    }
}

And the second loop below similarly simplified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite; it still needs the if(s == end) break condition, but it seems fine if I add that.

Updated and force-pushed.

@leonerd leonerd force-pushed the valid-identifier branch 2 times, most recently from 37987d7 to 36bcb73 Compare November 20, 2024 19:27
@leonerd
Copy link
Contributor Author

leonerd commented Nov 20, 2024

It also occurs to me that maybe identifiers should be restricted to every character being in the same script. This word pауpаl is not what you likely think. All but one letter in it is Cyrillic. This could be used in an attack of pull requests of malicious source code. We already have a function that checks that a string is all from the same script. It could be called just before the return true, or it could easily be changed to have an extra parameter which would be true if the string also needed to be a legal identifier

That could be a useful behaviour to have somewhere, perhaps with an optional flag. But my intention with this function was simply to match what the parser does, for purposes of using it to address this comment: #22765 (comment)

toke.c Outdated Show resolved Hide resolved
if(!len)
return false;

if(flags & SVf_UTF8) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit tricky, in the context of the parser it's fine - we only allow non-ASCII code points in identifiers, but if this is a general use function it has the Unicode bug:

$ ./perl -Ilib -Mutf8 -MXS::APItest -E 'my $x = "café"; say valid_identifier($x) || 0; utf8::downgrade($x); say valid_identifier($x) || 0'
1
0

This could be "fixed" (I hope) by using isIDFIRST_L1()/isIDCONT_L1(), but that would allow identifiers the parser wouldn't otherwise accept when used during parsing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I tried to make it clear from the docs that this is all about what the parser would accept.

toke.c Outdated Show resolved Hide resolved
toke.c Outdated Show resolved Hide resolved
@leonerd
Copy link
Contributor Author

leonerd commented Nov 21, 2024

Addressed the latest round of comments, and pushed again. Will need a squash before I merge it (I kept the fixes as separate commits to view them easier).

@leonerd leonerd added the squash-before-merge Author must squash the commits down before merging to blead label Nov 21, 2024
@Leont
Copy link
Contributor

Leont commented Nov 21, 2024

It sounds useful, except it's not clear to me when it would be useful. Do you have some examples?

@tonycoz
Copy link
Contributor

tonycoz commented Nov 21, 2024

It sounds useful, except it's not clear to me when it would be useful. Do you have some examples?

I don't know who or what you're responding to here.

nm

@leonerd
Copy link
Contributor Author

leonerd commented Nov 21, 2024

It sounds useful, except it's not clear to me when it would be useful. Do you have some examples?

I'm about to add a call to it in the class field :reader attribute handler, to check that the requested method name is actually valid. Then the new :writer can do the same. Inspired by this PR comment - #22765 (comment)

@khwilliamson
Copy link
Contributor

One set of my comments got lost. Having worked with these kinds of functions a lot, I believe the better API is to pass the end pointer instead of the length. That pointer typically must be calculated anyway, as this function now uses the length solely for that purpose, so just skip the intermediate value. I can give further reasons if this is not enough to convince you

@leonerd
Copy link
Contributor Author

leonerd commented Nov 22, 2024

@khwilliamson Oh I see. Mm. I could add a _pve variant and put the real logic in there, and then have the _pvn and _sv versions just invoke that by calculations. I suspect having a _pvn in the API anyway would still be useful. It's still the expected pattern that the majority of existing functions all follow, so it would seem odd to not have that.

@leonerd
Copy link
Contributor Author

leonerd commented Nov 22, 2024

@khwilliamson ^-- that latest commit then adds such a function.

Copy link
Contributor

@khwilliamson khwilliamson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the _pve idea, which _sv should call directly. Other than that lgtm

toke.c Outdated
if(!sv || !SvOK(sv) || !SvPOK(sv))
return false;

return valid_identifier_pvn(SvPVX_const(sv), SvCUR(sv), SvUTF8(sv));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use SvEND and call the _pve form saves work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that just now but it broke the stringify test. Maybe the two separate calls to SvPV and SvEND cause it to stringify two separate buffers? Anyway, just using pv + len here works fine.

@leonerd
Copy link
Contributor Author

leonerd commented Nov 23, 2024

force-pushed to squash before merge

These functions test whether a given string would be considered by the
Perl parser to be a valid identifier. Three variants are provided: one
taking a string start/end pair, one a string start/length pair, and one
looking at the string contained in an SV.
@leonerd leonerd removed the squash-before-merge Author must squash the commits down before merging to blead label Nov 23, 2024
@leonerd leonerd merged commit 429dcba into Perl:blead Nov 23, 2024
33 checks passed
@leonerd leonerd deleted the valid-identifier branch November 23, 2024 20:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants