Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

overhaul output dimension initialization #16

Closed
3 tasks done
moocow-the-bovine opened this issue Dec 20, 2024 · 21 comments
Closed
3 tasks done

overhaul output dimension initialization #16

moocow-the-bovine opened this issue Dec 20, 2024 · 21 comments

Comments

@moocow-the-bovine
Copy link
Owner

moocow-the-bovine commented Dec 20, 2024

originally discussed in and around

TODOs:

  • use RedoDimsCode
  • use null instead of zeroes to initialize output PDLs
  • reduce PMCode wrappers, dropping them wherever possible
This was referenced Dec 20, 2024
@mohawk2
Copy link
Collaborator

mohawk2 commented Dec 27, 2024

This might be none of my business, but did you mean to version-bump on the 20241226-matmult2d-bad-support branch?

@moocow-the-bovine
Copy link
Owner Author

version-bump on the 20241226-matmult2d-bad-support branch?

kind of ... in preparation for merging & releasing, but then got distracted.

@moocow-the-bovine
Copy link
Owner Author

@mohawk2
I've hit my first head-scratcher regarding RedoDimsCode here:

Is there a good way to detect whether an output PDL was passed as null in my RedoDimsCode ?


Details

  • Signature is
    indx ixIn(Ndims,NnzIn);
    nzvalsIn(NnzIn);
    missing();
    indx N();
    indx [o]ixOut(Ndims,NnzOut);
         [o]nzvalsOut(NnzOut);
    indx [o]nOut();
  • I'm passing in all output PDLs (ixOut, nzValsOut, nOut) as PDL->null()
  • I want to re-create the default-dimension logic from the PMCode here, i.e.
    • if the user passes in a pre-allocated PDL, then use that PDL directly, regardless of any relation between NnzIn and NnzOut (but Ndims still needs to align of course)
    • otherwise, fall back to setting $SIZE(NnzOut) = $SIZE(NnzIn) ... (so only if the user passes in null to the PP code or undef to the PMCode)

So far, it seems that if I pass in nulls, then $SIZE(NnzOut) == 1 when RedoDimsCode gets evaluated ... but that can also be the case if the user passed in an explicitly allocated output PDL (--> maybe they're doing a flat sum). I tried looking at ($PDL(ixOut)->state & PDL_NOMYDIMS) a la isnull(), but that evaluates to zero from RedoDimsCode.

Can those 2 cases (explicitly allocated dimension of size 1 vs. null) be teased apart here?

@moocow-the-bovine
Copy link
Owner Author

Is there a good way to detect whether an output PDL was passed as null in my RedoDimsCode ?

... after looking at the PDL_MAYBE_SIZE here as suggested by #6 (comment) , I've come up with this:

RedoDimsCode => q(
if ( $PDL(ixOut)->state & $PDL(nzvalsOut)->state & PDL_MYDIMS_TRANS ) {
/*-- output PDLs passed as null; use default output dimensions --*/
$SIZE(NnzOut) = $SIZE(NnzIn);
}),

@mohawk2 can you confirm that this is the recommended way of checking what I want to be checking (i.e. that both ixOut and nzvalsOut were passed as nulls)?

@mohawk2
Copy link
Collaborator

mohawk2 commented Dec 28, 2024

I am a bit surprised that these n-dimensional sparse objects really do vary in the number of indices each one has? But I guess that's the nature of the sparseness varying between sections of the notional fixed-size object.

That is currently the best known way of doing this. I've had a stare at the code you kindly linked to in ccs_accum_sum's PMCode, and it seems fine. As you know, my ambition is to make it easier and easier to express this sort of thing, and I'm not pretending that the P:LA solution is the best possible one yet. Please feel free to keep asking questions, and to make suggestions, or even just to express problems that the back of your mind suspects there's a "better" answer to ("surely PDL should be able to do this" is a feeling I constantly have, and act on)).

On that note, I'm hoping that the "type variable" thing you feel is fully sorted out? To recap, I believe the answer to that issue was the "implied type" (left off the Par spec) is the generic type of that xform and works as the "value type", while the "index type" you needed there is indx, still not needing any more-materialised "type variable" concept. Do you agree?

@moocow-the-bovine
Copy link
Owner Author

@mohawk2 thanks for the reply! I'll continue on this basis then.

still not needing any more-materialised "type variable" concept

Yes, I agree that's not necessary here as long as we can nail down indx to a fixed type ... and I can't think of any good reason not to do so.

sparse objects really do vary in the number of indices each one has?

for ufuncs at least yes ... there's no good (and efficient) way to know how many non-missing values a sumover() result will have based only on the number of non-missing values in its input. Using NnzOut=NnzIn works in the absence of other information, but will usually be wasteful (NnzIn is really just an upper bound for NnzOut). Accumulating over a flat (sparse) PDL will always work with NnzOut=1.

@mohawk2
Copy link
Collaborator

mohawk2 commented Dec 28, 2024

Yes, I agree that's not necessary here as long as we can nail down indx to a fixed type ... and I can't think of any good reason not to do so.

Great!

sparse objects really do vary in the number of indices each one has?

for ufuncs at least yes ... there's no good (and efficient) way to know how many non-missing values a sumover() result will have based only on the number of non-missing values in its input. Using NnzOut=NnzIn works in the absence of other information, but will usually be wasteful (NnzIn is really just an upper bound for NnzOut). Accumulating over a flat (sparse) PDL will always work with NnzOut=1.

That makes sense! I think that until such time as PDLPorters/pdl#418 gets addressed, we have to accept the limitations of PDL only handling rectangular data.

@moocow-the-bovine
Copy link
Owner Author

@mohawk2 I think that 172be81 checks all of the above boxes for the ufuncs:

  • use RedoDimsCode
  • use null instead of zeroes to initialize output PDLs
  • reduce PMCode wrappers

... I also looked into your other suggestions from #6 (comment) :

  • set the badflag as desired in the Code with $PDLSTATESETBAD(ndarrayname), outside an inserted broadcastloop -

done, and thanks for the tip -- the PP code is much cleaner this way!

just make the missing Par be the last value of your valsIn

I tried this, and it does work for the "low-level" ccs_accum_XYZ functions ... but it would cause a nontrivial amount of havoc in the CCS::Nd wrapper class, so I'm leaving $missing() as its own input parameter.


I also tried to get rid of the redefined warnings I see when running perl Makefile.PL for a bunch of (all?) symbols from PDL/Basic/Core/Types.pm.PL, e.g.

$ perl Makefile.PL
Subroutine typesrtkeys redefined at Basic/Core/Types.pm.PL line 484.
Subroutine ppdefs redefined at Basic/Core/Types.pm.PL line 502.
[...]
Subroutine orig_badvalue redefined at Basic/Core/Types.pm.PL line 811.
Subroutine PDL::Type::symbol redefined at Basic/Core/Types.pm.PL line 767.

These are emitted when evaluating pdlpp_stdargs here for the 1st (and only the 1st) PP submodule (currently CCS/Utils), arising in PDL v2.081 from a require $typespm; in PDL::Core::Dev::pdlpp_list_functions() (I also get the same warnings from v2.091). Neither wrapping the pdlpp_stdargss() call in an eval {...} nor scoping it under no warnings 'redefine' silenced them.

They're "just" warnings, so I'm not too bothered, but wonder if you might have an idea how I might get rid of them.

@mohawk2
Copy link
Collaborator

mohawk2 commented Dec 30, 2024

Great work in deleting code!

Regarding that "redefined" warning, which version of PDL are you using? You can get more information on exactly where that's triggered by changing perl Makefile.PL to perl -d:Confess Makefile.PL - I suspect that will be a bug that's got fixed in more recent PDL.

It may also be relevant that when I do make for PDL::Cluster, it complains it's changing targets for Cluster.pm. The generated Makefile has the postamble in there twice. It looks like at least that copy of pdlmaker.plm has what I'd call overly clever checking of ->can() and calling that, rather than just taking SUPER::postamble. It doesn't seem necessary or useful. I didn't look into it further than that.

@moocow-the-bovine
Copy link
Owner Author

Regarding that "redefined" warning [...] I suspect that will be a bug that's got fixed in more recent PDL.

Confirmed, sorry for the noise. I got warnings under both PDL v2.081 and v2.091, but they're gone under current git master v2.095 PDLPorters/pdl@95be933.

Also, thanks for the -d:Confess tip; I had been using the debugger to get stack traces, but that's much more convenient.

It may also be relevant that when I do make for PDL::Cluster, it complains it's changing targets for Cluster.pm. The generated Makefile has the postamble in there twice.

That should be fixed by moocow-the-bovine/PDL-Cluster@646da1a ... the problem was 2 calls to pdlpp_postamble(), 1 from the MY::postamble override in Makefile.PL and the other from pdlmaker.plm. That was probably a historical artefact (PDL::Cluster actually predates pdlmaker.plm).

It looks like at least that copy of pdlmaker.plm has what I'd call overly clever checking of ->can() and calling that, rather than just taking SUPER::postamble. It doesn't seem necessary or useful. I didn't look into it further than that.

It might very well not be necessary or useful anymore. I believe that the "overly clever" (aka "ugly") stuff is there in order to allow:

  1. pdlmaker.plm to be used for both PDL::PP modules and non-PP modules;
  2. EU:MM overrides (e.g. MY::postamble) in Makefile.PL to work as expected;
  3. all of the required pdlpp_* hand-waving to be done in one place (rather than copy+pasted to each Makefile.PL);
  4. all of my own personal "convenience" hand-waving (README.txt etc.) to be done in one place

... it also used to take care of generating & distributing PODs from *.pd for CPAN-friendliness, but that's no longer required because PDL core now does that (better).

All of that said, I think the Right Thing To Do™ would be to replace pdlmaker.plm with an independent module I could put in CONFIGURE_REQUIRES, probably just wrapping EU:MM itself. Another thing for my TODO list... sigh.

@mohawk2
Copy link
Collaborator

mohawk2 commented Dec 31, 2024

I believe you are trying to solve two fairly different problems with one Great Big Solution:

  • pure-Perl distros, which really just need the add_doc boilerplate in postamble
  • PDL::PP distros, which are well supported by PDL::Core::Dev, with about 3 lines of boilerplate (use, stdargs, postamble)

For both, you can just symlink README.pod to your main module for GitHub purposes. Your pdlmaker.plm would need to be updated in all your distros when something changes, and it would need to handle all the cases you want to deal with. That's a lot of effort, and testing it against all the scenarios would be even more.

While comparison can be the thief of joy, MetaCPAN says you have 9 PDL distros. I maintain rather more than that, especially with the recent burst of carving out those no longer belonging in main PDL. I am fanatically lazy, and yet have felt no need to make a Great Big Solution to the above two problems. I'm not telling you how to run your distros, but I am telling you how I run mine.

@moocow-the-bovine
Copy link
Owner Author

moocow-the-bovine commented Jan 1, 2025

@mohawk2 another question regarding checking for null args in RedoDimsCode using

 $PDL(myOutputPdl)->state & PDL_MYDIMS_TRANS 

... it seems that this can evaluate to true even if myOutputPdl was not passed as null. I think what I'm seeing is that it will also be true if there's an implicit type-conversion going on. So for a pp def like:

pp_def('myFunc',
  Pars => q{
    myInputPdl(Nin);
    indx [o] myOutputPdl(Nout);
  },
  RedoDimsCode => q{
    if ( $PDL(myOutputPdl)->state & PDL_MYDIMS_TRANS ) {
      fprintf(stderr, ">>> setting default output dimension <<<\n");
      $SIZE(Nout) = $SIZE(Nin);
    }
  },
  ...
);

... I'm seeing the the setting default message even if I call it with myFunc(random(10), zeroes(double, 4)) ... which (iiuc) will do an implicit type-conversion to indx on the output arg (presumably with dataflow).

Is there a better way to check for "real" nulls? I think I've also seen (long)$SIZE(someDimFromNullPdl) evaluating to -1 in places ... is that safe to rely on? Or maybe I should also be checking $PDL(myOutputPdl)->nvals == 0 ?

@mohawk2
Copy link
Collaborator

mohawk2 commented Jan 1, 2025

I believe checking $SIZE(ind) < 0 should be the correct check. Please don't cast it to long, why would you do that?

Also, I still don't see why you are trying to keep these two sizes as independent variables. Please make sure this actually makes sense. If it does, then the way is to have an OtherPars variable to set it, like I demonstrated in rldvec. Please don't have magic parameters that are hidden in characteristics of passed-in data.

EDIT: a further benefit of passing the size in as an actual parameter is that PDL, as well as doing the allocating, can get the right type, and also automatically include the broadcast dimensions.

@mohawk2
Copy link
Collaborator

mohawk2 commented Jan 1, 2025

I had a look at the latest pushed code, because the stuff you're dealing with here is pretty fundamental and I'm very interested both out of curiosity, and to see if there are features that PDL needs to add. One example is the ragged-array idea, as mentioned already.

A couple of thoughts I had: in ccs_encode_pointers, this line:

    loop (Nnz)    %{ ixval=$ix(); ++$ptr(Nplus1=>ixval); %}

doesn't do bounds-checking?

And in ccs_xindex2d, this line:

    long a_min=0, a_max=$SIZE(Nnz);

should definitely use the type PDL_Indx. I'd be surprised if any use of int or long is actually the right type, but I haven't checked.

@moocow-the-bovine
Copy link
Owner Author

I believe checking $SIZE(ind) < 0 should be the correct check

That works in ccsutils.pd, but not for the ufuncs here ... so I'm using PDL_MYDIMS_TRANS together with nvals==0 there.

Please don't cast it to long, why would you do that?

Because I was really just printf-ing through an %ld in throwaway debugging code and didn't bother to check the actual C type. I wouldn't do that on purpose.

Also, I still don't see why you are trying to keep these two sizes as independent variables.

Because they are logically independent, and I have in the past wanted/needed that independence & explicit pre-allocation when working with large structured sparse matrices (e.g. a 10000x10000 matrix where I know that each row has exactly 3 non-missing values).

the way is to have an OtherPars variable to set it

I have been adding in optional OtherPars parameters where I need them
(e.g. here in ccs_decode_pointer), but that changes the call signatures and requires more PMCode to handle. In that case, the output dimension can come from a pre-allocated output parameter (projix or nzix) if available, and otherwise needs to be passed in to the PP code via OtherPars. So the PMCode doesn't need the zeroes(), but it needs to add the OtherPars in order to keep things compatible.

Please don't have magic parameters that are hidden in characteristics of passed-in data.

Trying to avoid that.

@moocow-the-bovine
Copy link
Owner Author

moocow-the-bovine commented Jan 2, 2025

in ccs_encode_pointers, this line:

    loop (Nnz)    %{ ixval=$ix(); ++$ptr(Nplus1=>ixval); %}

doesn't do bounds-checking?

You mean ensuring that ixval < $SIZE(Nplus1) ? Not in the loop body, no. Nothing prevents a user from passing in a pathological $ix() or "lying" about $N(). But $N defaults to $ix->max+1 here, and there's a high-level check for Nplus1 > N here, so "naive" usage should be safe. (nb: I've fixed the long typecast & printf formats there too)

And in ccs_xindex2d, this line:

    long a_min=0, a_max=$SIZE(Nnz);

should definitely use the type PDL_Indx. I'd be surprised if any use of int or long is actually the right type, but I haven't checked.

good catch, thanks! that line is probably from pre-indx times. fixed in 52d5270

@mohawk2
Copy link
Collaborator

mohawk2 commented Jan 2, 2025

One thought I've just had: I see that (wisely) you're adding tests for stuff as you go. If those are for stuff that already works, I'd advocate putting them on master, then rebasing your working branch past that. With rebase -i, that also offers the opportunity to consolidate/rethink commits so far, so that each one is a small, meaningful change. Any changes that are just good anyway should be getting put on master, including any whitespace adjustments also. That allows the substantive changes being done on a branch (with any test adjustments that are inherent) are much smaller and more easily thought about.

@moocow-the-bovine
Copy link
Owner Author

putting them on master, then rebasing your working branch past that

yes, I admit that would be a better procedure. I'm not going to be doing that (this time around at least) due to limits on (a) my working memory & (b) my TUIT allowance.

consolidate/rethink commits so far, so that each one is a small, meaningful change

fwiw, I think the messy git history is much still an improvement over my bad old muscle-memory habit of svn ci -m "" :-P

@mohawk2
Copy link
Collaborator

mohawk2 commented Jan 2, 2025

Your willingness to do best practice is going to have an effect on observers' willingness to:

  • look at the changes you're making
  • keep your module(s) in their downstream CI, i.e. even knowing if your modules get broken by their changes

As always, it's up to you ;-)

@moocow-the-bovine
Copy link
Owner Author

have an effect on observers' willingness

Let's pretty-please-with-sugar-on-top all-caps-really-I-mean-it NOT get into that discussion here.

  • look at the changes you're making
  • keep your module(s) in their downstream CI, i.e. even knowing if your modules get broken by their changes

As always, it's up to you ;-)

... and others' decisions about whether, how, & in what manner they choose to use and/or engage with my code are -- as always -- their decisions. I'm not selling anything.

@moocow-the-bovine
Copy link
Owner Author

closing as resolved in PDL-CCS v1.24.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants