-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor many things, fix extra/missing commas #58
base: develop
Are you sure you want to change the base?
Conversation
Also adds some tests for this
By using static variables inside helper functions, these regexes will only get compiled/allocated once. Since this only happens once, we don't need to worry about freeing the memory after use. This greatly simplifies init/cleanup of the context struct. While I was at it, make the helper functions more readable at what they are doing.
Also: - remove passing "extension" around excessively in fec.c - stop passing around the full FEC_CONTEXT excessively when you only need access to a single line. - Capitalize CSV_EXTENSION since it's a constant
After I did all that refactoring, I now realize we don't even use this info at all. As we read each line, we see if it contains any asci28. If so, then we treat that as the delimiter. Otherwise default to comma.
It really is a private method of PARSE_CONTEXT, we don't care about the implementation in fec.c
This is an implementation detail of the csv parser, we don't want to have to deal with it.
Another implementation detail of the csv reader.
We never need to share one FIELD_INFO between multiple places, and the lifetimes make sense to be tied to each other. This makes the API and lifecycle management simpler.
If I add "-Werror" build flag, then I found a few errors. Ideally I think we should turn on this flag by default to help catch more bugs.
It was weird to store these regexes in the persistentMemory. We only have one version of the mappings per compilation (and runtime) so we can just store them statically. Next up would be to refactor the lookupMappings() function from fec.c to mappings.c While I'm at it, clean up the regex compilation to share the helper function we already have. This involved creating regex.h to hold that shared regexCompile()
tmpdir is deprecated in pytest.
We never used the load kwarg, and the helper function did not need to be a fixture itself.
We want the reader thread to be stopped if the parent thread stops. Before, my pytest runs sometime hung.
Before, the testing was a bit half-baked. We didn't actually check the contents of the output files. We also weren't very thorough, and only checked some APIs on some examples. Now we test all APIs on all examples. Some example .fec files that were tracked in git weren't actually getting tested at all. For example, the 13360.fec file actually produces a segfault! We need to track down the issue and fix it. It is skipped for now. These changes also revealed an inconsistency in howform types are normalized. See `python/tests/filings
Thank you for this large-scale, monumental PR! From a brief skim, looks to be lots of good ideas and practices in here. I'll take a more thorough look and double-check the perf of these changes early next week. 🎉 |
We didn't need all the info that PARSE_CONTEXT contained.
So to run the "trailing_commas" test, you can do: pytest -m trailing_commas
Now there is a single understandable `FORM_SCHEMA *lookupSchema(const char *version, int versionLength, const char *form, int formLength);` function. The stuff in fec.c just wraps it with logic specific to FEC_CONTEXT
and similarly, freeSchema() to formSchemaFree()
put the parts in order
We might want to write to a different dir and want control of that outside of WRITE_CONTEXT
This is way more readable I think
It was really hard to tell what each represented
This can be inferred from whether or not filingID is NULL, and now that the WRITE_CONTEXT doesn't depend on the filingId, all the filingId processing can happen in the callers of newFecContext()
These "options" should all be near each other.
Because I'm about to add another that has extra fields and I want to make the distinction
when you hit an assert, pytest will show the values of the args of the encolsing function. Before, you only saw the values to dir1 and dir2, but now you can see file1 and file2.
I thought this test indicated a bug in the lib, but actually I just wasn't testing it correctly.
Now if the supplied value can't be coerced to a Path, you get the error when you first call the function, not when the file is opened int he callback.
make them shorter to conform to line legnth. improve wording a bit. dont repeat defaults that are in the funciton signature.
Now we can see the line of which file.
If we run into a field that is supposed to be a float, but it is an empty string, this isn't actually a problem. Before, we were throwing a warning.
It was hard to tell where the stirng ended before, like if it was an empty string.
Fixes washingtonpost#24 Before, we printed exactly what was in the .fec file. If a row had more fields or fewer fields than we expected from the schema, we just printed it as is. This broke downstream tooling, such as loading the .csv's into databases. Now, by default we - pad short rows with empty fields - truncate long rows You can get back to the old behavior by setting the `raw` flag to True. By default it is False. Note that this could be BREAKING. This also adjusts the warnings a bit: BEfore, you got a warning for every extra field in a row. Now you only get one warning per row, and we print out the full row (even though that row is a bit mangled by the csv parser as it removes quotes and delimiters) The tests only currently test the default behavior. A follow up should adjust how we define test cases. Currently, we expect a 1:1 correspondence between an input .fec file and an output. But really we want a 1:N relationship, where one .fec file can generate multiple outputs depending on the options passed. That will require updating our test definition format.
Seeing as how no one has wanted to tackle this bear in the last 3 weeks (don't blame them) I can shrink this down so it's way less intimidating. But before I do that I'd love a top-level review so I know that 1. I'm going in the right direction and 2. the next version will actually get reviewed and merged. Thanks! @freedmand (or whoever) |
More a function of me being busy! I think I can tackle it in one pass but may be a bit before I can get to it. I really appreciate the contribution and am excited/think I can learn some things from your C style :) |
I wanted to tackle #24, but I got distracted on the way ;) Now this thing is huge, sorry about that. If it's intimidating, take solace in the many tests which I added.
If this is just way too much to review, then I can try to split this up into more manageable pieces. I learned a lot as I went along, so if I re-do it then it should be much more cohesive. But that would be a solid day or two of work for me, so I'd love to avoid it if people are feeling heroic.
PLEASE, anyone leave some reviews.