Automated test of substantive results tied to HARK development branch #10

sbenthall · 2023-01-25T09:06:56Z

Continued from here: sbenthall#3 (comment)

@sbenthall: Because of the design of the original cstwMPC code, which involves a lot on non-standard code execution (using exec()) and file output (saving txt files with custom data), I designed this script to work with the files output to the Results/ directory. This means that it doesn't operate like a normal Python unit test of some part of the code. Rather, the test must be ran by hand to verify results.

@llorracc : I'd really like to get this into some form that can be run automatically when we update the development branch of the HARK toolkit. Whether that requires something in the form of a unit test, I don't know, but my goal is to choose a small number of REMARKs that are "unpinned" because they give a thorough workout to the substantive, quantitative results of the toolkit and any code merge that changes those substantive results needs to be closely scrutinized to understand why.

There are a number of reasons why this is difficult.

The extremely non-standard code structure and way of outputting results makes it awkward to use conventional tooling. See Refactor execution code #2
Because getting the results from this repository is a long/slow operation, it will be prone to timeout errors, creating false negative test results. It is also potentially 'expensive' in other ways.
In general, triggering a test in one repository when there is a change in a different repository is not how things are typically done. It would be much more standard to have an automated test that triggers when there is a PR on this repository. For the above reasons, this hasn't been implemented yet.

So there are at least two components to this:

Develop the automated test for this repo
Determine the conditions under which that test is automatically triggered

While I understand the motive ('thorough workout to the substantive, quantitative results of the toolkit'), my honest view is that it would be ultimately wiser to:

(a) develop the REMARK standard so that it requires repositories to be less convoluted, enabling a more standardized 'substantive results testing' framework that works across REMARKS rather than forcing each to have a sui generis solution

(b) refactor the code in this repository to be less convoluted in how it depends on HARK, and

Then write the automated test for this repository in a way that isn't ad-hoc, but really much better would be to:

(c) make more substantive tests in HARK for any functionality that this repository depends on.

Trying to test HARK by verifying cstwMPC results is like trying to take out a fly with a bazooka -- it seems like it would be guaranteed to solve the problem, but is in fact grossly inefficient and a more elegant solution can be found if its given a little more careful thought.

llorracc · 2023-01-25T16:03:39Z

OK, I get your points.

The extremely non-standard code structure and way of outputting results makes it awkward to use conventional tooling. See Refactor execution code #2

Yes, see the reply I just posted to that issue.

Because getting the results from this repository is a long/slow operation, it will be prone to timeout errors, creating false negative test results. It is also potentially 'expensive' in other ways.

I thought we had a designated "min" configuration that was explicitly designed to be reasonably fast -- exactly for this purpose of just kicking the tires. Are you saying that even that min version is too long/slow?

With DemARKs, we test them nightly whenever there has been a commit to HARK. Seems like that would be the right way to handle any unpinned REMARKs too; or maybe if the REMARKs are heftier, we could do it every 3 days or something.

In general, triggering a test in one repository when there is a change in a different repository is not how things are typically done. It would be much more standard to have an automated test that triggers when there is a PR on this repository. For the above reasons, this hasn't been implemented yet.

In my struggles with submodules, I believe I recall that one of the uses of submodules was to accomplish this kind of goal. In any case, I thought we were doing something like this with the econ-ark.org repo - it relies on markdown metadata that lives in the REMARK repo, and I thought that when any of that metadata changed we had somehow set up a GitHub Actions workflow that triggered a rebuild of econ-ark.org to incorporate the new metadata.

In any case, I feel strongly that there should be at least a one or two unpinned REMARKs as a way of identifying cases where we've made code changes that change key results.

(a) develop the REMARK standard so that it requires repositories to be less convoluted, enabling a more standardized 'substantive results testing' framework that works across REMARKS rather than forcing each to have a sui generis solution

The natural way to develop a template would be to construct an example; I'm happy with the idea of DistributionOfWealthMPC being the testbed for this.

But I'm not on board with saying all REMARKs must comply with this more rigorous standard. It's hard enough to get people to write a reproduce.sh script.

(b) refactor the code in this repository to be less convoluted in how it depends on HARK, and

Yes.

Then write the automated test for this repository in a way that isn't ad-hoc, but really much better would be to:

(c) make more substantive tests in HARK for any functionality that this repository depends on.

Trying to test HARK by verifying cstwMPC results is like trying to take out a fly with a bazooka -- it seems like it would be guaranteed to solve the problem, but is in fact grossly inefficient and a more elegant solution can be found if its given a little more careful thought.

But we have the bazooka, and it kills the fly. It would be a lot of work to invent a directed-energy weapon to laser-target the fly. To the extent that the bazooka does what we need, we can defer inventing the laser targeting system to some future date!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automated test of substantive results tied to HARK development branch #10

Automated test of substantive results tied to HARK development branch #10

sbenthall commented Jan 25, 2023 •

edited

Loading

llorracc commented Jan 25, 2023

Automated test of substantive results tied to HARK development branch #10

Automated test of substantive results tied to HARK development branch #10

Comments

sbenthall commented Jan 25, 2023 • edited Loading

llorracc commented Jan 25, 2023

sbenthall commented Jan 25, 2023 •

edited

Loading