Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ParseXS: refactor OUTPUT and RETVAL #22731

Merged
merged 31 commits into from
Nov 18, 2024
Merged

ParseXS: refactor OUTPUT and RETVAL #22731

merged 31 commits into from
Nov 18, 2024

Conversation

iabyn
Copy link
Contributor

@iabyn iabyn commented Nov 12, 2024

  • This set of changes does not require a perldelta entry.

My previous merge commit, 5.41.4-108-g9621dfa822, added Node::Param objects - which store information about each parameter in an XSUB's signature - but updated with any further data from an INPUT line.

This branch extends that work by:

  • Also updating each Param object with any data from OUTPUT lines.

  • Adding a synthetic Param object for RETVAL, which means that in many places, RETVAL can now be handled just as another parameter (albeit one which is declared but not initialised, and which doesn't consume an arg). There's still plenty of special-casing code for RETVAL, but it now tends to be more in the places where it will have an effect (e.g. some special-handling in INPUT_handler() if RETVAL is seen in an INPUT section) rather than the effect being a side-effect of the way the code works elsewhere.

  • Does some general refactoring of the code responsible for emitting output code, e.g. in generate_output() etc. There's still more refactoring to be done in this area, but this is a start.

  • fixes CASE statements, which were partially broken by the previous merge commit.

@iabyn
Copy link
Contributor Author

iabyn commented Nov 13, 2024

I've pushed 3 further commits which fix the build under old perls and couple of other minor things I noticed along the way.

Comment on lines 3265 to 3172
$self->blurt("Can't determine output type for '$var'");
next;
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On reading the commit message, my first impulse is to look for a new test that would have caught this.

I don't see any tests that check for this message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more of "panic: can never happen" message, which can't (or at least shouldn't) be triggered by real-life XS code. I only spotted it during a mistake I made while refactoring, when I got an "exiting via next" warning along with the indication that something was up with types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further thought, it can be triggered by real XS code. I'll add some tests on my next set of commits.

A recent piece of refactoring by me, v5.41.4-92-g9c24c4ed29, moved
some error-checking code from OUTPUT_handler() to generate_output().
The former is in a loop, the latter isn't. So in

    if (bad) { blurt(..); next }

the 'next' needs changing to 'return'.

It turns out no tests yet exercise this error condition, so it it wasn't
spotted at the time.
During earlier refactoring, the captures of a particular match were
assigned to variables; but a few lines further down, the code was still
using '$1' rather than the variable which $1 had been assigned to.

This commit makes it use the var instead.

No functional changes.
There is an optimisation whereby the OP_ENTERSUB's TARG is sometimes
used to return the result, rather than creating a new mortal.

This optimisation has been present for many years, but was untested.
This commit adds tests, including ones for custom typemap entries
of the sort that can still be optimised.

It also tests for the optimisation in conjunction with OUTLIST args.

It also adds tests for when the output code is specified explicitly
on the OUTPUT line rather than being retrieved from a typemap. Currently
this isn't actually optimised.
The perl code which does the emitting of the C code for the RETVAL value
return using TARG if possible, is in a different place than the general
RETVAL code emitting. This commit moves them into the same function.

For now this is a crude cut and paste: any simplifying of the resultant
function will be done later. For now, it's good just that more C code
generation functionality is concentrated in one place rather than being
diffused around all parts of ParseXS.pm.

In more detail: before this commit, the code looked like:

    sub process_file {
        ...
        if (RETVAL needs returning) {
            if (have template code from OUTPUT line) {
                # just directly emit the template code
            }
            elsif (can do TARG optimisation) {
                # directly emit code to do PUSHi() etc
            }
            else {
                $self->generate_output({ var => 'RETVAL', ...});
            }
        }
        ...
    }

    sub generate_output {
        ...
        if ($var eq 'RETVAL') {
                # emit generic C code to return RETVAL to the caller
        }
        ...
    }

After this commit, it looks like:

    sub process_file {
        ...
        if (RETVAL needs returning) {
            $self->generate_output({ var => 'RETVAL', ...});
        }
        ...
    }

    sub generate_output {
        ...
        if ($var eq 'RETVAL') {
            if (have template code from OUTPUT line) {
                # just directly emit the template code
                return;

            if (can do TARG optimisation) {
                # directly emit code to do PUSHi() etc
            }
            else {
                # emit generic C code to return RETVAL to the caller
            }
        }
        ...
    }
In particular, test that the 'SETMAGIC: EN/DISABLE' modifiers work,
and that duplicate vars are reported.
If RETVAL appears twice in the OUTPUT section, it will be reported
as an error (using blurt()), but parsing will continue.
During parsing of the second RETVAL, it is treated as an ordinary
parameter rather than special-cased as RETVAL. This doesn't seem useful,
although ultimately it doesn't matter, as the parse will be reported as
failing at the end anyway, so it doesn't matter what code gets
generated in the meantime.

This commit makes RETVAL still be treated as RETVAL on second appearance
in OUTPUT, which get makes the code slightly simpler.

Also, change the dup-detecting lines above it from a postfix 'if' to a
block 'if', so that the flow-control 'next' appears prominently on its
own line.
If the parser sees RETVAL in a CODE section, but there isn't an OUTPUT
section for the RETVAL, then it emits a warning. Test for this warning.

Note that it currently warns only if the OUTPUT section is not present;
it doesn't warn if the OUTPUT is present but doesn't contain RETVAL.
The next commit will fix this.
ParseXS currently emits a warning if RETVAL is seen in a CODE section
but there is no OUTPUT section. This commit extends this so it warns even
if there *is* an OUTPUT section but that section doesn't contain
RETVAL.

To stop one of our existing tests giving a false positive warning, this
commit now makes it ignore 'PERL_UNUSED_VAR(RETVAL)' in a CODE section
in terms of 'have seen RETVAL in CODE', since this isn't a real use of
the variable.

Note that the original intent of adding this warning was that, without an

    OUTPUT
        RETVAL

the emitted code would still do XSRETURN(1), but nothing would be stored
at ST(0), so whatever was returned on the stack at that slot would be
whatever was already at that position on the stack, which might be one
of the passed args, or random garbage. This is true regardless of
whether OUTPUT is completely absent, or just RETVAL missing from OUTPUT.
So it's better to warn in both cases, not just the former.

Doing this also means that the 'xsub_seen_OUTPUT' field member is no longer
needed, simplifying the code.
If a var is listed in an OUTPUT section, but is not actually a
parameter, then the parser outputs an error message:

    Error: OUTPUT foo not a parameter

This commit:

- Fixes a typo in the error message: it was previously "not an parameter".

- Extends the warning to include RETVAL where the XSUB is void.
  Previously, it would plough on, and then give an error that it
  couldn't find 'void' in the typemap, which is less obvious what the
  problem is.

- Adds tests - this warning previously had no tests for it.
Currently, an E::PXS::Node::Param object each created for each parameter
declared in the XSUB's signature. This object is then updated with any
extra data that appears in an INPUT line.

This commit extends that so that the Param object is also updated with
any extra data from an OUTPUT line.

This allows us to remove the $self->{xsub_map_varname_to_seen_in_OUTPUT}
field.

The next few commits will build upon this to simplify various bits of
code, now that the Param object contains OUTPUT state too.
The generate_output() method currently takes a hash of args. The fields
of this hash are typically generated on the fly by each caller from the
fields of an ExtUtils::ParseXS::Node::Param object.

This commit makes it so that instead, that Param object is just passed
directly to generate_output(). It simplifies the code, and is a step
along the path to the eventual goal of making generate_output() be a
method of ExtUtils::ParseXS::Node::Param rather than of
ExtUtils::ParseXS.

This commit causes one slight change in behaviour: set magic; and this
is arguably a bug fix.

There are two ways of specifying args that get updated on return (the
XS equivalent of 'sub foo { $_[0]++ }'): either by declaring the parameter
as 'OUT' in the signature:

    int
    foo(OUT int a)

or by listing it on an OUTPUT line. For the latter, the arg has set
magic called on it after the update by default, but this can be
overridden with the SETMAGIC keyword:

    OUTPUT:
        arg1               // has a SETMAGIC() call
        SETMAGIC: DISABLE
        arg2               // doesn't have a SETMAGIC() call
        SETMAGIC: ENABLE
        arg3               // has a SETMAGIC() call

On the other hand, OUT params have set magic by default *unless*
there's an OUTPUT section with at least one SETMAGIC line, in which case
the last such SETMAGIC line dictates whether the OUT arg gets set magic.
This seems wrong, and this commit changes it so that OUT params *always*
get set magic.

If this turns out to be wrong, the line to look at is

    print "\tSvSETMAGIC($arg);\n" if !$param->{in_output} || $do_setmagic;

which says to do magic always for OUT params, and to honour the SETMAGIC
in force at the line for OUTPUT params.
The last couple of commits broke the unusual (and possibly not very sane)
combination of a parameter declared both as IN_OUTLIST and listed in an
OUTPUT section.

This means that generate_output() is called twice for the same
parameter, once to update the value of the arg's SV on the stack, and
once to push its value onto the stack.

So, distinguish between those two effects by whether the caller of
generate_output() passed an extra arg (for OUTLIST) or not, rather than
by whether the Param object has an in_out field of /OUTLIST$/.

This was manifesting itself as an extra PUSHs and SvSETMAGIC appearing
compared with 5.40.0.

Fix and add tests.
A recent commit moved, from outside to inside generate_output(), the
perl code which is responsible for generating the C code which handles
the "use TARG if possible when returning RETVAL" optimisation.

This commit now follows that up by simplifying and tidying that code.
Now that it's within generate_output(), the setting of $type, looking up
of the typemap etc, have all already been done at the start of the sub,
so there's no need to duplicate that effort.

Also, set {output_code} as a field value in the faked-up RETVAL
param object, so that inside generate_output() it can just check for
 a generic $param->{output_code} field rather than the special-cased
$self->{xsub_RETVAL_typemap_code} value.
OUTPUT sections can include explicit setting code to override the
normal typemap. For example:

    OUTPUT:
        arg1                              # use standard typemap entry
        arg2  my_setiv(ST(1), (IV)arg2);  # override

Formerly, the custom code was emitted directly during parsing by
OUTPUT_handler(). This commit moves that emitting code into
generate_output(), so that the emitting with/without override
are done in the same place.

This is a small part of an effort to concentrate all the code-emitting
parts into a small number of well-defined places.
In earlier refactoring, I renamed an E::PXS object field to
xsub_seen_NO_RETURN, when it should actually have been to
xsub_seen_NO_OUTPUT.

This commit renames the field again.

Should be no functional changes.
Add a test for the specific:

    OUTPUT:
        RETVAL  template_override_code
Normally RETVAL is an implicitly declared local var, but it is legal
to also declare it as an explicit parameter. Add tests for the various
permutations of this:

  - void and long XSUB return values;
  - no param type; int type in signature; int type in INPUT;
  - with and without an autocall.

This isn't well documented as to how it should be interpreted, so these
tests are more about checking current behaviour so that inadvertent
changes are detected, rather than approving the current behaviour.

Some of the current behaviour is less than satisfactory (generally
marked with 'XXX' in the test code comments).

In particular, a commit shortly is going to change how RETVAL is
implemented internally.

It also adds a few tests for NO_OUTPUT.
Currently, RETVAL is handled almost entirely using special-case code
all over the place.

For non-void XSUBs, this commit adds a Node::Param object with a name of
RETVAL to the start of the list of Param objects in the Sig object - but
flagged as synthetic and no_init (so it gets declared but not
initialised, and doesn't consume an argument from the stack).

This makes the emitting of the

    some_type RETVAL;

declaration be part of general parameter processing, rather than being
special-cased. Soon, this new object will also be used to make emitting
of the code to return RETVAL be more regular; but for now just the
initial declaration is handled in the new fashion.

Note however, that things get complicated when RETVAL is also declared
as a parameter, e.g.

    int
    foo(RETVAL)
        long RETVAL
        ....

or

    int
    foo(long RETVAL)
        ....

or even as a parameter but without a type:

    int
    foo(RETVAL)
      # no INPUT lines
      ...

or even as an alien variable:

    int
    foo()
        long RETVAL = 99;

Much of the behaviour in such cases is unspecified, and is somewhat
emergent - i.e. it behaves that way due to the way the code happened to
be structured, rather than due to any particular intent. This commit
tries very hard to preserve the current behaviour - hence all the tests
added in the previous commit. It has however changed behaviour in a
couple of places where the old behaviour emitted broken C code. These
changes are indicated by the updates to t/001-basic.t in this commit.

To keep the backwards-compatibility behaviour, the basic approach of
this commit has been to make the RETVAL object be in one of three main
states:

fully-synthetic  What is initially pushed at the *start* of the param
                 list. RETVAL hasn't yet appeared in a signature or INPUT.
                 Its type is set to the XSUB's return type.

semi-real        Same as fully-synthetic, but with a defined
                 argument number, and with an updated position within
                 the parameter list. This signifies that a RETVAL has
                 appeared in the signature, but without a type yet
                 specified. It acts as a placeholder, causing one passed
                 argument to be skipped, and to appear in usage() error
                 messages etc.

real             The is_synthetic and no_init flags are turned off and
                 its type comes from the signature or INPUT line. This
                 is now just a normal parameter.
The previous commit added a synthetic RETVAL Node::Param object and
used it for generating RETVAL declaration code rather than
special-casing it.

This commit extends this to use the object for output processing too,
removing more special-case handling, and removing the need for some
special state fields.

The next few commits will remove more RETVAL special-handling code and
generally simplify and clean up.

Note that this commit actually introduces a bug which will be fixed in
a few commits' time: the new code doesn't correctly handle multiple
CASEs, each with a RETVAL output, such as:

    int
    foo(...)
        CASE: x
            ...
            OUTPUT: RETVAL
        CASE: y
            ...
            OUTPUT: RETVAL

The code now sees this as a duplicate OUTPUT parameter. For now, a test
in t/002-more.t and t/XSMore.xs has been disabled until this is fixed.

(Actually, CASE with duplicate INPUT or OUTPUT sections has been broken
for a few weeks - but this is the first commit that actually caused a
test failure. All will be fixed soon...)
Following on from the previous commit, which uses the new RETVAL
Node::Param object for generating output code, ensure that the
return type used for returning the XSUB's value is the XSUB's type
rather than RETVAL's type, where they differ.

For example, with

     int
     foo(long RETVAL)

RETVAL is is declared as long, but an int is returned to the caller.
Add tests to confirm.
In an XSUB like

    NO_OUTPUT int
    foo(...)

An 'int RETVAL;' declaration is emitted, but no code is emitted to
return the value of RETVAL to the caller.

Before this commit, the way this was handled was that, after all INPUT
and CODE/PPCODE/autocall code had been emitted, ParseXS.pm did the
following:

    ($implicit_OUTPUT_RETVAL, $self->{xsub_return_type}) =
                                  (0, 'void') if $self->{xsub_seen_NO_OUTPUT};

That is to say, halfway through parsing the XSUB, it suddenly pretends
that the XSUB was declared void all along, and that there wasn't an
autocall (which would give an implicit OUTPUT: RETVAL declaration).

While this works, it is a confusing action-at-a-distance: at any
particular point in the code, you don't easily know whether

    if ($self->{xsub_return_type} eq 'void) { ... }

is dealing with a real void or fake void.

This commit removes that halfway-along state modification, and instead
handles NO_OUTPUT directly where needed:

For $implicit_OUTPUT_RETVAL, it now just never sets it if NO_OUTPUT is
present, rather than setting it and then later resetting it.

For $self->{xsub_return_type}, this now always represents the actual
declared return type of the XSUB, and whenever necessary,
$self->{xsub_seen_NO_OUTPUT} is checked along side
$self->{xsub_return_type} in various places.

In addition, the $var_num variable in OUTPUT_handler() handler has been
removed, as it was a vestigial value, now only used to determine whether
to call

    $self->blurt("Error: OUTPUT $outarg not a parameter")

which can be done directly as boolean expression.

Several new NO_OUTPUT tests have been added.
3 commits ago I added a new scope. This commit re-indents that new block.

Whitespace-only (apart from deleting a stray empty comment line too).
The $self->ST($num) method typically emits "ST(" . ($num-1) . ")"
but it has special-case handling for $num being undef, or (in the case
of output), being 0.

The latter was a special-case for outputting RETVAL return code. Recent
refactoring means that the ST() method is no longer called with
$num == 0, so the method's code can be simplified (and the $is_output
parameter eliminated).

Should be no change in functionality.
For the specific case of RETVAL appearing in an OUTPUT section when
NO_OUTPUT is in force, output this new more specific error message:

    can't use RETVAL in OUTPUT when NO_OUTPUT declared

rather than the more generic (and in this case, slightly confusing)
message:

    OUTPUT RETVAL not a parameter
My recent merge commit v5.41.4-108-g9621dfa822 partially broke the CASE:
functionality; although I didn't spot it at first since the ParseXS test
suite only has a single CASE: test, and none of the XS code bundled with
perl core makes use of it.

The issue was that an XSUB with multiple CASEs has a single signature,
but potentially multiple sets of INPUT and OUTPUT sections. For example:

    int
    foo(abc, def)
        CASE: X
                int abc;
                int def;
            CODE:
                RETVAL = abc + def;
            OUTPUT:
                RETVAL

        CASE: Y
                long abc;
                long def;
            CODE:
                RETVAL = abc - def;
            OUTPUT:
                RETVAL

Once my merge commit made the data from INPUT and OUTPUT sections be
used to update each Node::Param object (which has been derived from the
signature entry) then this starts to fall apart. It only started
failing the single CASE test in t/002-more.t once *this* branch
made RETVAL an object (rather than it being special-cased).

The fix in this commit is a bit of hack. Once all of the XSUB's
signature has been parsed into a list of Node::Param objects, a copy of
this list is taken and kept as the original. Then each time the body is
parsed (typically once, but multiple times for bodies with CASE
keywords) that original list is copied and becomes the current parameter
list. The objects in that list are updated by INPUT and OUTPUT sections
as usual. Then for the start of the next CASE, the next copying blows
away all those updates.

If and when ParseXS takes on more of an AST-like appearance (with,
hypothetically, Node::INPUT, Node::OUTPUT etc objects) then this
approach should be revisited.

Note that this commit restores the CASE test in t/002-more.t which I
temporarily disabled a few commits ago.
My previous merge commit moved (and renamed) this function from
ParseXS.pm to ParseXS/Node.pm. The former file has an indent of 2, while
the latter has an indent of 4. When I moved the code across, I forgot to
reindent it. So this commit does that now.

Whitespace-only change, apart from rewrapping some code comment blocks.
A couple of newly-added regexes had (in part) /....\b*/, which gives the
warning:

    \b* matches null string many times in regex

so this commit removes the extraneous '*'
One line had: 'exists $param->{type}' as part of a condition. This
worked fine when $param was a normal hash ref (created under 'use
fields'), but the combination of:

- a recent commit which copies those hashes to $sig->{old_params}
  then resets some of the fields (to allow for multiple CASEs);

- older perls (in particular 5.8.9) which used pseudo-hashes to
  implement fields,

was causing the field to spring into existence, even though it wasn't defined.
So change the test from 'exists' to 'defined'.
A recent commit needed, at one point, to make a copy of a Node::Param
object, and did so in a clunky way. But this can be simplified by
instead using the new() method's built-in ability to take an
initialisation hash ref as an arg. So this commit does that.

However, it turns out that this new() method *shared* rather than copied
that hash ref when running in an environment where 'use fields' isn't
available - namely during the building of perl itself. So fix that too.
@iabyn iabyn merged commit b0e9f2b into blead Nov 18, 2024
67 checks passed
@iabyn iabyn deleted the davem/xs_refactor6 branch November 18, 2024 11:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants