Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor many things, fix extra/missing commas #58

Open
wants to merge 102 commits into
base: develop
Choose a base branch
from

Conversation

NickCrews
Copy link

@NickCrews NickCrews commented Apr 7, 2023

I wanted to tackle #24, but I got distracted on the way ;) Now this thing is huge, sorry about that. If it's intimidating, take solace in the many tests which I added.

If this is just way too much to review, then I can try to split this up into more manageable pieces. I learned a lot as I went along, so if I re-do it then it should be much more cohesive. But that would be a solid day or two of work for me, so I'd love to avoid it if people are feeling heroic.

PLEASE, anyone leave some reviews.

  • Overhaul tests so that we have a lot more test cases. They are now defined in a domain specific language, and it's now easier to add a new test case. The actual contents of the results are tested now, which was sorta absurd we weren't testing that before.
  • Tries to simplify the overuse of *_CONTEXTs everywhere. When an FEC_CONTEXT is passed to a function, with all of its internal state, it's impossible to know what that function is using in the context, and what it could possibly be modifying. It's just one giant global state. I tried to make as many things pure functions as possible.
  • Improve csv parsing so there is a clear parsing API, and it's way simpler. CSV_FIELD is a first class citizen now.
  • Pull out lots of little clutter, such as string utils and path utils, into their own files, so that the actual using files are less cluttered.
  • Store constant regexes statically, so they don't have to be stored inside PERSISTENT_MEMORY. Much simpler to understand their lifecycle.
  • Extract mappings stuff into mappings.c so there is a clear, well-defined API that returns a FORM_SCHEMA object.

Also adds some tests for this
By using static variables inside helper functions,
these regexes will only get compiled/allocated once.
Since this only happens once, we don't need to worry about freeing the
memory after use.

This greatly simplifies init/cleanup
of the context struct.

While I was at it, make the helper functions
more readable at what they are doing.
Also:
- remove passing "extension" around excessively in fec.c
- stop passing around the full FEC_CONTEXT excessively when you only
need access to a single line.
- Capitalize CSV_EXTENSION since it's a constant
@NickCrews NickCrews requested a review from a team as a code owner April 7, 2023 00:41
NickCrews added 14 commits April 6, 2023 19:39
After I did all that refactoring, I now
realize we don't even use this info at all.
As we read each line, we see if it contains
any asci28. If so, then we treat that as the delimiter.
Otherwise default to comma.
It really is a private method of PARSE_CONTEXT,
we don't care about the implementation in fec.c
This is an implementation detail of the csv parser,
we don't want to have to deal with it.
Another implementation detail of the csv reader.
We never need to share one FIELD_INFO between multiple places,
and the lifetimes make sense to be tied to each other. This makes the
API and lifecycle management simpler.
If I add "-Werror" build flag, then I found a few errors.

Ideally I think we should turn on this flag by
default to help catch more bugs.
It was weird to store these regexes in the persistentMemory. We
only have one version of the mappings
per compilation (and runtime) so
we can just store them statically.

Next up would be to refactor the lookupMappings()
function from fec.c
to mappings.c

While I'm at it, clean up the regex compilation to share
the helper function we already have.
This involved creating
regex.h to hold that shared regexCompile()
tmpdir is deprecated in pytest.
We never used the load kwarg, and the
helper function did not need to be a fixture itself.
We want the reader thread to be stopped
if the parent thread stops.
Before, my pytest runs sometime hung.
Before, the testing was a bit half-baked.
We didn't actually check the contents of the output files.
We also weren't very thorough, and only checked some APIs on
some examples. Now we test all APIs on all examples.

Some example .fec files that were tracked in git weren't actually
getting tested at all. For example,
the 13360.fec file actually produces a segfault!
We need to track down the issue and fix it.
It is skipped for now.

These changes also revealed an inconsistency in  howform types
are normalized. See `python/tests/filings
@freedmand
Copy link
Contributor

Thank you for this large-scale, monumental PR! From a brief skim, looks to be lots of good ideas and practices in here. I'll take a more thorough look and double-check the perf of these changes early next week. 🎉

We didn't need all the info that PARSE_CONTEXT contained.
So to run the "trailing_commas" test, you can do:
  pytest -m trailing_commas
Now there is a single understandable
`FORM_SCHEMA *lookupSchema(const char *version, int versionLength, const char *form, int formLength);`
function.

The stuff in fec.c just wraps it with logic specific to FEC_CONTEXT
and similarly,
freeSchema() to
formSchemaFree()
We might want to write to a different dir and want control
of that outside of WRITE_CONTEXT
This is way more readable I think
It was really hard to tell what each represented
This can be inferred from whether or not filingID is NULL,
and now that the WRITE_CONTEXT doesn't depend on the filingId,
all the filingId processing can happen
in the callers of newFecContext()
These "options" should all be near each other.
Because I'm about to add another that has extra fields
and I want to make the distinction
when you hit an assert, pytest will show the values of the args
of the encolsing function.
Before, you only saw the values to dir1 and dir2,
but now you can see file1 and file2.
I thought this test indicated a bug in the lib, but
actually I just wasn't testing it correctly.
Now if the supplied value can't be coerced to
a Path, you get the error when you first call the function, not when
the file is opened int he callback.
make them shorter to conform to line legnth. improve wording a bit.
dont repeat defaults that are in the funciton
signature.
Now we can see the line of which file.
If we run into a field that is supposed to be a float, but it is an
empty string, this isn't actually a problem.
Before, we were throwing a warning.
It was hard to tell where the stirng ended before, like
if it was an empty string.
Fixes washingtonpost#24

Before, we printed exactly what was in the .fec file. If a row had
more fields or fewer fields than we expected from the schema, we
just printed it as is. This broke downstream
tooling, such as loading the .csv's into
databases.

Now, by default we
- pad short rows with empty fields
- truncate long rows
You can get back to the old behavior by setting the `raw` flag to True.
By default it is False.

Note that this could be BREAKING.

This also adjusts the warnings a bit:
BEfore, you got a warning for every extra field in a row.
Now you only get one warning per row, and we print out the full row
(even though that row is a bit mangled by the csv parser as
it removes quotes and delimiters)

The tests only currently test the default behavior. A follow up should
adjust how we define test cases. Currently, we expect a 1:1
correspondence between an input .fec file and an output. But really we
want a 1:N relationship, where one .fec file
can generate multiple outputs depending on the
options passed. That will require updating our test definition
format.
@NickCrews NickCrews changed the title Refactor many things Refactor many things, fix extra/missing commas Apr 13, 2023
@NickCrews
Copy link
Author

Seeing as how no one has wanted to tackle this bear in the last 3 weeks (don't blame them) I can shrink this down so it's way less intimidating. But before I do that I'd love a top-level review so I know that 1. I'm going in the right direction and 2. the next version will actually get reviewed and merged. Thanks! @freedmand (or whoever)

@freedmand
Copy link
Contributor

More a function of me being busy! I think I can tackle it in one pass but may be a bit before I can get to it. I really appreciate the contribution and am excited/think I can learn some things from your C style :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants