OC4IDS transforms cli #125
@jpmckinney Current status: We have built a mini transformation framework to do the conversion (with a class for each transform). It may be a little over-engineered, but it means each transform can be tested mostly independently and we can work on the transforms themselves in parallel. (They are only mostly independent, as some transforms need to be run in a certain order.) The "Fields from docs sheet" in https://docs.google.com/spreadsheets/d/1xyKXbNktcfKm6siSzM_C7aHCOsOjWUQro5aU8ZYIHyc/edit#gid=690200833 outlines each mapping expressed in the OC4IDS docs, along with the transforms and config variables associated with it. It also has notes on how we intend to do each transform. There are quite a few transforms still left to do. Most are straightforward, but some may be too difficult or require too much human judgement over the original data to make work. The CLI part of the tool is mostly done, but there will probably be a couple of extra arguments to add when the transforms are finished. Work still to do
We were hoping to get this finished this week but due to xmas and a few illnesses we are now aiming for the end of next week and we think we have the capacity to do that. |
Thanks, @kindly. I had opened #123 (and commented on the first commit) earlier, as there was no issue or Trello card that I could find to discuss. I've requested access to the spreadsheet. My main feedback is around the overall design. OCDS Kit is careful to stream as much as possible (see docs). This new command should also stream wherever possible. The main points where changes would be required are:
Other observations:
Some questions:
|
Also, I anticipate the tests file will become very long. Given that most tests follow the same format, I suggest creating a few reusable methods, so that it's easier for a developer or analyst to skim each test and quickly see what it's doing that's unique, and spend less time looking at the same boilerplate. This should be done sooner rather than later, as it will be more work the more tests are added. |
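A minimal sketch of what such reusable test helpers could look like. The names `BASE_RELEASE`, `make_release`, `check_transform` and `title_transform` are hypothetical and are not taken from the pull request; the stand-in transform only exists so the example runs on its own.

```python
import copy

# Hypothetical base release shared by all tests; each test states only what differs.
BASE_RELEASE = {
    "ocid": "ocds-213czf-1",
    "id": "1",
    "tag": ["planning"],
    "date": "2020-01-01T00:00:00Z",
}


def make_release(**changes):
    """Return a copy of the base release with top-level fields overridden."""
    release = copy.deepcopy(BASE_RELEASE)
    release.update(changes)
    return release


def check_transform(transform, releases, expected):
    """Run a transform on the releases and compare its output to the expected project."""
    output = transform(releases)
    assert output == expected, f"{transform.__name__}: {output!r} != {expected!r}"
    return output


def title_transform(releases):
    """Stand-in transform for illustration: take the first tender.title found."""
    for release in releases:
        title = release.get("tender", {}).get("title")
        if title:
            return {"title": title}
    return {}


def test_title():
    # Only the unique part of the test is visible here.
    check_transform(
        title_transform,
        [make_release(tender={"title": "Bridge repair"})],
        {"title": "Bridge repair"},
    )
```

With helpers like these, each test body shows only the input change and the expected output.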
Sorry thought that was shared.
The design of the transform framework is to take a list of releases (or potentially records) and transform them into a single project. I was assuming that there would never be a project so large that all its associated releases/contractingProcesses could not be held in memory, and as we are likely to stream out one project at a time, this has to be the case. Also, as the number of contractingProcesses per project is likely to be small, I assumed the overhead of looping through them once per transform would not be a big burden. As you said, there is currently no way to differentiate a project from OCDS data, but when there is, I assumed that would be where the streaming would happen. So the streaming would happen at a higher level (in something that calls the framework) rather than in the framework itself. There would have to be a mechanism very much like the combine command, which gathered all releases/records of a particular project and would have to hold them in memory (or in a sqlite database, like combine). This was the main reason for not requiring streaming of releases/records in the framework. I thought it unnecessary for streaming to happen at multiple levels, considering there will always be a fairly limited number of contractingProcesses per project.
Yes Duncan spotted that typo too! There are transforms, for example Cost Estimate (not yet written) that do require knowledge of releases and they were ordered for convenience. These transforms are at the contractingProcess level so it could be possible that as long as the embeddedReleases (not linked releases) are always in the record then the input of the releases may not be necessary.
Embedded releases need to be in the input records for this to work, and without packages it limits the possibility of linked releases. There is no way in OCDS to specify both. This is a consideration, especially as linked releases (contractingProcesses/releases) are the only thing currently in the OC4IDS standard. If we are happy to forgo linked releases in OC4IDS then this could be possible. My preference, for ergonomic reasons, is that it should be possible to specify any of releases/release_packages/records (with embedded releases)/record packages (with embedded releases). At the moment only the first two are possible. I am nonetheless happy if these options are restricted! The higher-level streaming function could also have a much more restricted input.
Some transforms rely on the success or failure of the previous transform, and that is only determined after all the previous compiled releases have been looked at, e.g. for Project Name we only look at tender.title if there is no planning.project.title in ANY compiled release. This could be achieved by holding some state against the object and then having a wrap-up function for each class that decides the final logic, but such code is much harder to reason about and to test, so it seemed unnecessary.
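A minimal sketch of the Project Name rule described above, assuming compiled releases are plain dicts. The function name `project_name` is hypothetical; only the field paths `planning.project.title` and `tender.title` come from the discussion.

```python
def project_name(compiled_releases):
    """Return the project title, preferring planning.project.title.

    tender.title is only consulted if no compiled release at all
    has a planning.project.title.
    """
    # First pass: look for planning.project.title in ANY compiled release.
    for release in compiled_releases:
        title = release.get("planning", {}).get("project", {}).get("title")
        if title:
            return title
    # Second pass: fall back to tender.title.
    for release in compiled_releases:
        title = release.get("tender", {}).get("title")
        if title:
            return title
    return None


example = [
    {"tender": {"title": "Road tender"}},
    {"planning": {"project": {"title": "Ring road"}}, "tender": {"title": "Other"}},
]
print(project_name(example))  # prints "Ring road"
```

The two passes are what make a single streaming pass awkward: the fallback can only be chosen once every compiled release has been seen.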
When this is agreed, this will be the level at which the streaming should happen, and it should be above the framework. The transforms themselves can be oblivious to this change if their input is a list of records and a project_id and their output is a project. This separates concerns and makes the framework easier to reason about than if we added the concept of multiple projects within it.
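A rough sketch of that separation, under stated assumptions: `project_id_of` is a hypothetical callable, since there is currently no agreed way to tell which project a release belongs to, and `group_by_project` and `stream_projects` are illustrative names, not part of OCDS Kit. Groups are held in memory here; a sqlite-backed store, as combine uses, would be an alternative.

```python
from collections import defaultdict


def group_by_project(releases, project_id_of):
    """Gather releases per project in memory, keyed by project ID."""
    groups = defaultdict(list)
    for release in releases:
        groups[project_id_of(release)].append(release)
    return groups


def stream_projects(releases, project_id_of, transform):
    """Yield one project at a time; the transform never sees more than one project."""
    for project_id, group in group_by_project(releases, project_id_of).items():
        yield transform(group, project_id)


def summarise(group, project_id):
    """Stand-in for the framework: a list of releases and a project_id in, a project out."""
    return {"id": project_id, "contractingProcesses": [{"id": r["ocid"]} for r in group]}


releases = [
    {"ocid": "ocds-1", "project": "p1"},
    {"ocid": "ocds-2", "project": "p2"},
    {"ocid": "ocds-3", "project": "p1"},
]
projects = list(stream_projects(releases, lambda r: r["project"], summarise))
```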
Happy to remove that!
The logic here is a bit slack; it should have something like The idea was not to allow transforms to be put in any order, but the transform_list is more a testing convenience, which we could remove from the public API.
This is a consideration. I initially had each class have a 'name' class variable and was going to use that as a way to specify transforms that would be run.
See above. My current thinking is that transform_list should not be a public parameter. It is more a testing convenience.
The boilerplate is better later in the tests (and we need to back-port that), as it uses run_transforms and transform_list for the tests (we need to fix some earlier ones). There may be some simplifications to this and to the source releases that would make the data changes easier to read. |
Thanks for clarifying the requirements! To summarize:
Other questions:
It'd be good to:
If you're concerned about the ease of testing the last two changes, you can always have another method ( Ideally, it'd be great to reduce the transforms down to just plain methods that accept I'll let Duncan review the logic of transforms first, but I don't think some transforms are correct, e.g. locations are determined by taking the first location(s) found in any compiled release… but a project can occur in many locations, and should be the (unique) aggregation of all locations from all compiled releases (the OC4IDS mapping is perhaps ambiguous). There are also errors like |
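To illustrate the point about locations, here is a minimal sketch of a unique aggregation across all compiled releases. The field path `planning.project.locations` and the name `aggregate_locations` are assumptions made for the example; the OC4IDS mapping may point to different source fields.

```python
import json


def aggregate_locations(compiled_releases):
    """Collect every distinct location from all compiled releases, in first-seen order."""
    seen = set()
    locations = []
    for release in compiled_releases:
        # Assumed path, for illustration only.
        found = release.get("planning", {}).get("project", {}).get("locations", [])
        for location in found:
            # Dicts are unhashable, so compare on a canonical JSON serialisation.
            key = json.dumps(location, sort_keys=True)
            if key not in seen:
                seen.add(key)
                locations.append(location)
    return locations
```

This differs from taking the first location(s) found: a project whose contracting processes cover different places keeps all of them, without duplicates.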
I have implemented most of your suggestions. All transforms are now functions. However, currently they all accept a single state object containing compiled_releases, releases and output. I do not want fixed arguments yet, until I am sure there are no awkward transforms that require different inputs that could be pre-processed. I have removed config entirely from the transforms, and it is now handled by the running function. Part of the requirements I got from @duncandewhurst was that a core set of transforms should always be run and that there are config options for running the additional ones. At the moment I detect the additional ones by looking in the docstring of the transform. This may not be ideal in the long run, and we could just have a separate list of the optional ones. Also, the concept of success is gone (even though the functions still return success or failure), and I may remove the return values entirely soon. Currently all dependent transforms can tell from the output data of previous ones whether the previous transform was successful. So at the moment there is nothing that needs ordering except 'contract_process_setup', which is run first. |
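A sketch of the shape described above, not the actual code in the pull request. `State`, `title_from_tender`, `is_optional`, `run_selected` and the `(optional)` docstring marker are all hypothetical; only `contract_process_setup` and the three state attributes are named in the discussion.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical state object shared by all transforms."""
    compiled_releases: list
    releases: list
    output: dict = field(default_factory=dict)


def contract_process_setup(state):
    """Core transform: create one contracting process per compiled release."""
    state.output["contractingProcesses"] = [
        {"id": release.get("ocid")} for release in state.compiled_releases
    ]


def title_from_tender(state):
    """Use the first tender.title as the project title. (optional)"""
    for release in state.compiled_releases:
        title = release.get("tender", {}).get("title")
        if title:
            state.output["title"] = title
            return


def is_optional(transform):
    """Detect optional transforms from a marker in the docstring (assumed convention)."""
    return "(optional)" in (transform.__doc__ or "")


def run_selected(state, transforms, enabled=()):
    """Run core transforms always, and optional ones only if named in `enabled`."""
    for transform in transforms:
        if is_optional(transform) and transform.__name__ not in enabled:
            continue
        transform(state)
    return state.output
```

Keeping an explicit list of optional transforms instead of the docstring marker would only change `is_optional`.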
Great! A few small tweaks:
If success booleans are going away (great!), then I won't suggest anything relating to them. |
@duncandewhurst |
I am fairly happy with this pull request now. Everything can be improved, but this feels like a decent first attempt. There are a few transforms that are missing; the issue for these is #129. There are some contentious areas about how we deal with cases involving multiple contracting processes; see issue #130. @duncandewhurst @jpmckinney it would be good if you could review this. |
Thanks @kindly I'm happy for this to be merged once the following is done:
|
@duncandewhurst @jpmckinney would you be able to do a final review? |
Most of my comments relate to the user-facing documentation and API.
I had a few doubts about the type checking and coercion. I also noted some inconsistency in the transform APIs (some return values, some success booleans).
I trust that the tests (as reviewed by @duncandewhurst) are testing for the correct behavior. So, I haven't reviewed that part of the code.
In short, most of my comments are minor. Looking forward to getting this merged!
Make new string casting so it works with float, int. Make number casting more explicit. Improve docstrings.
@jpmckinney I have covered the points in your comments. The type checking would be good for you to look over again. |
We should add In both those methods, should we have |
For For A decimal(float) is a lot uglier than a float(decimal), so keeping them as floats makes sense here. The only other options are:
What do you think?
Good catch! I was meaning to do that then forgot. |
I'm not sure that there's an issue. |
Also convert numbers to decimals for processing.
Great, that is no problem then. Just tried to convert everything to decimal, and if it fails return 0. |
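A minimal sketch of that conversion, assuming a helper named `to_decimal` (hypothetical; the function in the pull request may be named and structured differently). Going through `str` avoids the long binary expansion that `Decimal(float)` produces.

```python
from decimal import Decimal, InvalidOperation


def to_decimal(value):
    """Convert a value to Decimal for processing; return 0 if it cannot be converted."""
    try:
        # str() first, so that 1.1 becomes Decimal("1.1") rather than
        # Decimal("1.100000000000000088817841970012523233890533447265625").
        return Decimal(str(value))
    except (InvalidOperation, TypeError, ValueError):
        return Decimal(0)
```

For example, `to_decimal(1.1) + to_decimal("2.2")` gives `Decimal("3.3")`, while `to_decimal("n/a")` and `to_decimal(None)` both give `Decimal("0")`.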
Great, thanks! |
This is the main branch for the OCDS to OC4IDS conversion work.