-
Notifications
You must be signed in to change notification settings - Fork 559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add valid_identifier_{pvn,sv} API functions #22769
Conversation
I put these functions in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also occurs to me that maybe identifiers should be restricted to every character being in the same script. This word pауpаl
is not what you likely think. All but one letter in it is Cyrillic. This could be used in an attack of pull requests of malicious source code. We already have a function that checks that a string is all from the same script. It could be called just before the return true
, or it could easily be changed to have an extra parameter which would be true
if the string also needed to be a legal identifier
And it occurs to me that the loop could be simplified further if the underlying _safe
code checked that the string had positive length. Right now I think we would get a malformed warning without that check in the loop.
pod/perldelta.pod
Outdated
@@ -350,6 +350,12 @@ well. | |||
|
|||
XXX | |||
|
|||
=item * | |||
|
|||
New API functions C<valid_identifier_pvn()> and C<valid_identifier_sv()> have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't these link to the new functions?
break; | ||
if(!isIDCONT_utf8_safe((U8 *)s, end)) | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this loop be more simply written as
while (s < end) {
s += UTF8SKIP((U8 *) s);
if (s >= end || ! isIDCONT_utf8_safe((U8 *) s, end)) {
return false;
}
}
And the second loop below similarly simplified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite; it still needs the if(s == end) break
condition, but it seems fine if I add that.
Updated and force-pushed.
37987d7
to
36bcb73
Compare
That could be a useful behaviour to have somewhere, perhaps with an optional flag. But my intention with this function was simply to match what the parser does, for purposes of using it to address this comment: #22765 (comment) |
if(!len) | ||
return false; | ||
|
||
if(flags & SVf_UTF8) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit tricky, in the context of the parser it's fine - we only allow non-ASCII code points in identifiers, but if this is a general use function it has the Unicode bug:
$ ./perl -Ilib -Mutf8 -MXS::APItest -E 'my $x = "café"; say valid_identifier($x) || 0; utf8::downgrade($x); say valid_identifier($x) || 0'
1
0
This could be "fixed" (I hope) by using isIDFIRST_L1()/isIDCONT_L1(), but that would allow identifiers the parser wouldn't otherwise accept when used during parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I tried to make it clear from the docs that this is all about what the parser would accept.
36bcb73
to
406fceb
Compare
Addressed the latest round of comments, and pushed again. Will need a squash before I merge it (I kept the fixes as separate commits to view them easier). |
It sounds useful, except it's not clear to me when it would be useful. Do you have some examples? |
I don't know who or what you're responding to here. nm |
I'm about to add a call to it in the class field |
One set of my comments got lost. Having worked with these kinds of functions a lot, I believe the better API is to pass the |
@khwilliamson Oh I see. Mm. I could add a |
@khwilliamson ^-- that latest commit then adds such a function. |
b0fda64
to
49b9a0b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the _pve
idea, which _sv
should call directly. Other than that lgtm
toke.c
Outdated
if(!sv || !SvOK(sv) || !SvPOK(sv)) | ||
return false; | ||
|
||
return valid_identifier_pvn(SvPVX_const(sv), SvCUR(sv), SvUTF8(sv)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use SvEND
and call the _pve
form saves work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that just now but it broke the stringify test. Maybe the two separate calls to SvPV and SvEND cause it to stringify two separate buffers? Anyway, just using pv + len
here works fine.
de8c628
to
366d974
Compare
force-pushed to squash before merge |
These functions test whether a given string would be considered by the Perl parser to be a valid identifier. Three variants are provided: one taking a string start/end pair, one a string start/length pair, and one looking at the string contained in an SV.
366d974
to
049897a
Compare
I've wanted to perform this kind of check from core code or XS modules for quite some time, but oddly we didn't appear to have an API for it until now.
There's likely places already in internal code that could be neatened by calling these functions. We can identify and fix those up later.