Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OC4IDS transforms cli #125

Merged
merged 50 commits into from
Mar 2, 2020
Merged

OC4IDS transforms cli #125

merged 50 commits into from
Mar 2, 2020

Conversation

kindly
Copy link
Contributor

@kindly kindly commented Jan 29, 2020

This is the main branch of the OCDS 2 OC4IDS conversion branch.

@kindly kindly changed the title Oc4ids transforms cli [WIP DO NOT MERGE] OC4IDS transforms cli [WIP DO NOT MERGE] Jan 29, 2020
@kindly
Copy link
Contributor Author

kindly commented Jan 29, 2020

@jpmckinney
It would be good for you to look over this pull request as you are the maintainer. I just want to check you are happy and understand the direction of the work.

Current Status

We have build a mini transformation framework to do the conversion (with a class for each transform), which may be a little over engineered but means each transform can be tested mostly independently and means we can work on the transforms themselves in parallel. (they are mostly independent as some transforms need to be run in a certain order).
Each transform has its own unit test(s) so we are confident they work. @duncandewhurst said he would look over the tests to make sure the logic of each rule in the docs is as expected but it may be valuable to look at that too.

The "Fields from docs sheet" in https://docs.google.com/spreadsheets/d/1xyKXbNktcfKm6siSzM_C7aHCOsOjWUQro5aU8ZYIHyc/edit#gid=690200833 outlines each mapping expressed in the OC4IDS docs and the transforms and config variable that are associated with them. It also has notes as to how we intend to do each transform.

There a quite a few transforms still left to do, most are straightforward but some may be too difficult or require too much human judgement over the original data to make work.

The CLI bit of the tool is mostly done but there will probably be a couple of extra arguments to add when the transforms are finished.

Work still todo

  • The rest of the viable transforms.
  • Need to add documentation for the CLI and document what the multitude of config variables do.
  • Toucan integration - this seems straightforward as it is very similar to the merge tool, we just need to find/make a logo for it.

We were hoping to get this finished this week but due to xmas and a few illnesses we are now aiming for the end of next week and we think we have the capacity to do that.

@jpmckinney
Copy link
Member

jpmckinney commented Jan 30, 2020

Thanks, @kindly. I had opened #123 (and commented on the first commit) earlier, as there was no issue or Trello card that I could find to discuss. I've requested access to the spreadsheet.

My main feedback is around the overall design. OCDS Kit is careful to stream as much as possible (see docs). This new command should also stream wherever possible. The main points where changes would be required are:

  1. Presently, if the command receives releases, it sorts and merges them into records on its own.
    1. I don't know why ocdsmerge.sorted_releases is called (which requires loading all input into memory, and which just sorts releases by date), given that (1) the input might have releases with many OCIDs and (2) ocdskit.combine.merge will take care of sorting releases at the right time, without ever having the full input in memory (it sorts the releases for one OCID at a time).
    2. As far as I can tell, the releases that are passed in are only ever used to construct records (which are then re-worked into compiled releases with releases and embededReleases (typo) fields). In OCDS Kit, the pattern is to pipe data from one command to another, rather than to give a single command the responsibility of doing multiple steps. This also makes streaming feasible. If this command needs records, then it should only accept records. The documentation can show users an example of what to do if they are starting with releases (i.e. stream from the compile command – with whatever relevant options – into this new command). I can quickly close compile: Change --package option to --into <package|records> #122 so that the compile command has an option to output records (easy to stream) instead of record packages (harder to stream).
  2. Presently, the command loads all compiled releases into memory, and then each transform iterates over the compiled releases.
    1. As in Transform OCDS to OC4IDS #123, I don't see why the framework couldn't instead pull a record from the input (after the above changes), re-work its compiled release (same as it does now), and then run just that one compiled release through each transform. That would avoid loading the entire input into memory.

Other observations:

  1. It looks like the tool assumes that the input all belongs to the same project (though I don't see any documentation of this). In OC4IDS, there seems to be no documented mapping for project identifiers between OC4IDS and OCDS – though I think there had been discussions of how to do this. Anyhow, if the previous assumption will hold for all-time, then the above performance concerns are somewhat less serious. However, I anticipate that we will document a mapping for project identifiers, in which case it will (presumably) be simpler for users to stream all their OCDS data to this command, rather than come up with their own solution for bundling inputs per project. In that scenario, we definitely need to address the concerns above.
  2. Given that Python 3.5 will soon be EOL, and given that Python 3.6 forward have insertion-ordered dicts, you only need OrderedDict if you are using methods like move_to_end. Otherwise, it just adds pressure to processing and memory. I think we can furthermore drop the dict_cls option unless there is a real use case for it.
  3. Some of the if self.last_transforms[-1].success logic doesn't make sense to me. For example, if TitleFromTender always comes after Title, then it makes sense. However, the current framework gives users freedom to put any transforms in any order, in which case the previous transform might be anything, and its success or failure might not imply anything for what TitleFromTender should do…
  4. It looks like users can control the transforms both through the config options (which sometimes turn a transform into a no-op) or through transform_list (in which case they can simply remove the transform). The logic will be simplified if run_transforms used config to remove the transforms that would become no-ops from transform_list.

Some questions:

  1. Are there real use cases for giving users full control over both the choice and order of transforms? I can perhaps see a use case for choice, but I don't really see a use case for order.

@jpmckinney
Copy link
Member

jpmckinney commented Jan 30, 2020

Also, I anticipate the tests file will become very long. Given that most tests follow the same format, I suggest creating a few reusable methods, so that it's easier for a developer or analyst to skim each test and quickly see what it's doing that's unique, and spend less time looking at the same boilerplate. This should be done sooner rather than later, as it will be more work the more tests are added.

@kindly
Copy link
Contributor Author

kindly commented Jan 31, 2020

Thanks, @kindly. I had opened #123 (and commented on the first commit) earlier, as there was no issue or Trello card that I could find to discuss. I've requested access to the spreadsheet.

Sorry thought that was shared.

My main feedback is around the overall design. OCDS Kit is careful to stream as much as possible (see docs). This new command should also stream wherever possible.

The design of the transform framework is to take a list of releases (or potentially records) and transform them into a single project. I was assuming that there would never be a project large enough not hold all its associated releases/contractingProcesses in memory, and as we are likely to stream out a project at a time, this has to be the case. Also, as the amount of contractingProcesses per project is likely to be small I assumed the overhead of looping through them per transform would not be a big burden.

As you said there is no way currently to differentiate a project from OCDS data, but when there is, I assumed that would be where the streaming would happen. So the streaming would happen at a higher level (something that calls) to the framework itself. There would have to be a mechanism very much like the combine command which gathered all releases/records of a particular project and would have to hold them in memory (or in a sqlite database like combine). This was the main reason for not requiring streaming of releases/records in the framework. I thought it unnecessary for streaming to happen at multiple levels considering there will always be a fairly limited amount of contractingProcesses per project.
This could be avoided if we said that all projects had to appear together in the record input stream but even then a higher level function could manage this when we decided on the rules.

The main points where changes would be required are:

  1. Presently, if the command receives releases, it sorts and merges them into records on its own.

    1. I don't know why ocdsmerge.sorted_releases is called (which requires loading all input into memory, and which just sorts releases by date), given that (1) the input might have releases with many OCIDs and (2) ocdskit.combine.merge will take care of sorting releases at the right time, without ever having the full input in memory (it sorts the releases for one OCID at a time).
    2. As far as I can tell, the releases that are passed in are only ever used to construct records (which are then re-worked into compiled releases with releases and embededReleases (typo) fields).

Yes Duncan spotted that typo too!

There are transforms, for example Cost Estimate (not yet written) that do require knowledge of releases and they were ordered for convenience. These transforms are at the contractingProcess level so it could be possible that as long as the embeddedReleases (not linked releases) are always in the record then the input of the releases may not be necessary.

In OCDS Kit, the pattern is to pipe data from one command to another, rather than to give a single command the responsibility of doing multiple steps. This also makes streaming feasible. If this command needs records, then it should only accept records. The documentation can show users an example of what to do if they are starting with releases (i.e. stream from the compile command – with whatever relevant options – into this new command). I can quickly close #122 so that the compile command has an option to output records (easy to stream) instead of record packages (harder to stream).

Embeded Releases need to be in the input records for this work and without packages it limits the possibility of linked releases. There is no way in OCDS to specify both. This is a consideration especially as linked releases (contractingProcesses/releases) are the only thing currently in the OC4IDS standard. If we are happy to forgo linked releases in OC4IDS then this could be possible.

My preference for ergonomic reasons is it should that it should be possible to specify any of releases/release_packages/records (with embedded releases)/record packages (with embedded releases). At the moment only the first two are possible. I am nonetheless happy if these options could be restricted! The higher level streaming function could also have a much more restricted in input.

  1. Presently, the command loads all compiled releases into memory, and then each transform iterates over the compiled releases.

    1. As in Transform OCDS to OC4IDS #123, I don't see why the framework couldn't instead pull a record from the input (after the above changes), re-work its compiled release (same as it does now), and then run just that one compiled release through each transform. That would avoid loading the entire input into memory.

Some transforms rely on the success or failure of the previous transform and that is only determined after all the previous compiled releases are looked at eg for Project Name we only look at tender.title if there is no planning.project.title in ANY compiled release. This could be achieved by holding some state against the object and then having a wrap up function for each class the decides the final logic, but such code is much harder to reason about and to test, so seemed unnecessary.
Also aggregate transforms (like sums and lasts) are more difficult to write and test as require the holding of state for each one until the next record comes through.
So as I did not see a problem with memory or speed if we are only processing one project at a time then the simplified logic seems preferable.

Other observations:

  1. It looks like the tool assumes that the input all belongs to the same project (though I don't see any documentation of this). In OC4IDS, there seems to be no documented mapping for project identifiers between OC4IDS and OCDS – though I think there had been discussions of how to do this. Anyhow, if the previous assumption will hold for all-time, then the above performance concerns are somewhat less serious. However, I anticipate that we will document a mapping for project identifiers, in which case it will (presumably) be simpler for users to stream all their OCDS data to this command, rather than come up with their own solution for bundling inputs per project. In that scenario, we definitely need to address the concerns above.

When this is agreed, this will be level the streaming should happen and should be above the framework. The transforms themselves can be oblivious to this change if their input is a list of records and a project_id and their output is a project. This separates concerns and makes the framework easier to reason about than if we added the concept of multiple projects within it.

  1. Given that Python 3.5 will soon be EOL, and given that Python 3.6 forward have insertion-ordered dicts, you only need OrderedDict if you are using methods like move_to_end. Otherwise, it just adds pressure to processing and memory. I think we can furthermore drop the dict_cls option unless there is a real use case for it.

Happy to remove that!

  1. Some of the if self.last_transforms[-1].success logic doesn't make sense to me. For example, if TitleFromTender always comes after Title, then it makes sense. However, the current framework gives users freedom to put any transforms in any order, in which case the previous transform might be anything, and its success or failure might not imply anything for what TitleFromTender should do…

The logic here is a bit slack it should have something like and last_tranform.__class__ == 'Title'

The idea was to not be allowed to put transforms in any order but the transform_list is more a testing convenience which we could remove from the public API.

  1. It looks like users can control the transforms both through the config options (which sometimes turn a transform into a no-op) or through transform_list (in which case they can simply remove the transform). The logic will be simplified if run_transforms used config to remove the transforms that would become no-ops from transform_list.

This is a consideration. I initially had each class have a 'name' class variable and was going to use that as a way to specify transforms that would be run.

Some questions:

  1. Are there real use cases for giving users full control over both the choice and order of transforms? I can perhaps see a use case for choice, but I don't really see a use case for order.

See above I am currently thinking is that transform_list should not be a public parameter. It is more a testing convenience.

Also, I anticipate the tests file will become very long. Given that most tests follow the same format, I suggest creating a few reusable methods, so that it's easier for a developer or analyst to skim each test and quickly see what it's doing that's unique, and spend less time looking at the same boilerplate. This should be done sooner rather than later, as it will be more work the more tests are added.

The boilerplate is better later in the tests (and we need to back-port that) as it uses run_transforms and transform_list for the tests (we need to fix some earlier ones). There may be some simplifications to this and the source releases to make the data changes are easier to read.

@jpmckinney
Copy link
Member

jpmckinney commented Jan 31, 2020

Thanks for clarifying the requirements! To summarize:

  • We'll work on the assumption that a single project never exceeds memory; as such, it makes sense for the transform framework to not worry about memory.
  • Unimplemented transforms will use self.releases, and will need them to be sorted; as such, it makes sense to pre-sort in the initial state.
  • Some transforms should only be run if another transform fails. It is simpler to run a transform and then decide whether to run another transform, than to run all transforms in parallel and then decide which results to keep in a finalizer. Given our first assumption, it makes sense to take the simpler option.
  • Linked releases are needed by OC4IDS, and full releases are needed by some transforms. Since the OCDS merge routine can only provide one or the other, and since calculating the other from a record package would require expensive requests to download the linked release (to get the full release) or the packages (to calculate the linked release), it makes sense for this command to handle merging on its own.

Other questions:

  • I don't see self.records ever being used in subclasses of BaseTransform. Is it needed?

It'd be good to:

  • Remove dict_cls
  • Put the logic about which transforms to run (based on config) in run_transforms, rather than having each transform check a config variable to decide whether to run at all. That way, each transform has one less responsibility.
  • Make it the transform manager's (run_transform's) responsibility to determine whether to try another transform if one failed. As above, this simplifies the logic; transforms just transform – they don't care about context. They can just return True if successful.

If you're concerned about the ease of testing the last two changes, you can always have another method (_next_transform(…)) that encapsulates the logic about which transform to run next. run_transforms can then do something like while _next_transform(…):.

Ideally, it'd be great to reduce the transforms down to just plain methods that accept output and compiled_releases (and I suppose releases as well, for unimplemented transforms) (i.e. "modify this output using this input"), and return a tuple of the output and a success boolean. Having classes that store state and that care about the context of previous transforms is more complicated. This change should be fairly straightforward after the listed changes, as it should just require removing self. and replacing the class line with a def line.

I'll let Duncan review the logic of transforms first, but I don't think some transforms are correct, e.g. locations are determined by taking the first location(s) found in any compiled release… but a project can occur in many locations, and should be the (unique) aggregation of all locations from all compiled releases (the OC4IDS mapping is perhaps ambiguous). There are also errors like /items instead of /tender/items, and I notice that AdministrativeEntity starts by coping the administrativeEntity parties, but then later claims to 'skip' the transform if multiple are found.

@kindly
Copy link
Contributor Author

kindly commented Feb 3, 2020

I have implemented most of you suggestions.

All transforms are now functions. However currently they all accept a single state object with compiled_releases, releases and output in. I do not want fixed arguments yet until I am sure there are not any awkward transforms that require different inputs that could be pre-processed.

I have removed config entirely from the transforms and it is now done but the running function.

Part of the requirements I got for @duncandewhurst was that a core set of transforms should be run always and there is config options for running the additional ones. At the moment I have detected the additional ones by looking in the docstring of the transform, this may not be ideal in the long run and we could just have a separate list of the optional ones.

Also the concept of success (even thought the function still return success or failure) is gone and I may remove the concept entirely soon. Currently all dependant transforms can tell from the output data of previous ones to see if the previous transform was successful. So at the moment there is nothing that needs ordering except 'contract_process_setup' which is run first.

@jpmckinney
Copy link
Member

jpmckinney commented Feb 3, 2020

Great! A few small tweaks:

  • Make transform_list all-caps to indicate it's a constant, defined outside the method. When I first read the method, I thought it was a missing local variable.
  • I'd prefer a separate list of optional transforms, since docstrings aren't expected to have an impact on code execution.
  • Added: Some code-style changes, e.g. just use if var instead of if var is not "" or if len(var) or if len(var) > 0, and just use if not var instead of if len(var) == 0, etc.

If success booleans are going away (great!), then I won't suggest anything relating to them.

@kindly
Copy link
Contributor Author

kindly commented Feb 5, 2020

@duncandewhurst
Most transforms are done would you be able to spend some of today or tomorrow looking at the tests. You can comment in the spreadsheet if easier.

@kindly kindly force-pushed the oc4ids-transforms-cli branch from 6d55eff to 2955c5d Compare February 10, 2020 19:27
@kindly kindly changed the title OC4IDS transforms cli [WIP DO NOT MERGE] OC4IDS transforms cli Feb 10, 2020
@kindly
Copy link
Contributor Author

kindly commented Feb 10, 2020

I am fairly happy with this pull request now. Everything can be improved but this feels like a decent first attempt.

There are a few transforms that are missing. Here is the issue about this #129

There are some contentious areas about how we deal with multiple process cases. Issue #130
has been raised to discuss them

@duncandewhurst @jpmckinney would be good review this.

@kindly kindly force-pushed the oc4ids-transforms-cli branch from 793b434 to 33363e8 Compare February 11, 2020 09:19
@duncandewhurst
Copy link
Contributor

Thanks @kindly I'm happy for this to be merged once the following is done:

  • add note to documentation on handling of objects in arrays where ids differ across contracting processes but other properties are the same

  • add note to documentation on handling of duplicate parties across contracting processes, including what happens when id is provided but not identifer.

  • add note to documentation on handling of project sector

  • add note to docs to explain that if you supply unpackaged releases then they will be embedded in the output

  • add note to docs to explain project scope output (adding items and milestones)

  • Open issue on expanding location transform tests

@kindly
Copy link
Contributor Author

kindly commented Feb 25, 2020

@duncandewhurst
The above has been done.

@jpmckinney would you be able to do a final review?

Copy link
Member

@jpmckinney jpmckinney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of my comments relate to the user-facing documentation and API.

I had a few doubts about the type checking and coercion. I also noted some inconsistency in the transform APIs (some return values, some success booleans).

I trust that the tests (as reviewed by @duncandewhurst) are testing for the correct behavior. So, I haven't reviewed that part of the code.

In short, most of my comments are minor. Looking forward to getting this merged!

docs/cli/oc4ids.rst Outdated Show resolved Hide resolved
docs/cli/oc4ids.rst Outdated Show resolved Hide resolved
docs/cli/oc4ids.rst Outdated Show resolved Hide resolved
docs/cli/oc4ids.rst Outdated Show resolved Hide resolved
docs/cli/oc4ids.rst Outdated Show resolved Hide resolved
docs/cli/oc4ids.rst Outdated Show resolved Hide resolved
ocdskit/oc4ids_transforms.py Outdated Show resolved Hide resolved
ocdskit/oc4ids_transforms.py Outdated Show resolved Hide resolved
ocdskit/oc4ids_transforms.py Outdated Show resolved Hide resolved
ocdskit/oc4ids_transforms.py Outdated Show resolved Hide resolved
Make new string casting so it works with float, int.
Make number casting more explicit.
Improve docstrings.
@kindly
Copy link
Contributor Author

kindly commented Feb 28, 2020

@jpmckinney I have covered the points in your comments. The type checking would be good for you to look over again.

@jpmckinney
Copy link
Member

jpmckinney commented Feb 28, 2020

We should add decimal.Decimal to the if-statement in cast_string, in case the JSON was parsed into decimals (which is always the case with ijson). Similarly, in cast_number_or_zero, we could do return decimal.Decimal(item) and add decimal.Decimal to the if-statement – to avoid losing accuracy if the data was parsed to decimals.

In both those methods, should we have if item: before the log message like in check_type? We don't check for the presence of keys in the calling code, so when a key isn't set, we'll end up with a log message due to the missing key, rather than due to a bad value.

@kindly
Copy link
Contributor Author

kindly commented Feb 28, 2020

We should add decimal.Decimal to the if-statement in cast_string, in case the JSON was parsed into decimals (which is always the case with ijson). Similarly, in cast_number_or_zero, we could do return decimal.Decimal(item) and add decimal.Decimal to the if-statement – to avoid losing accuracy if the data was parsed to decimals.

For cast_string this is no problem.

For cast_number_or_zero then we have issue if some numbers are floats and some (that were strings originally) are decimals. As adding them up will means a crash. float + decimal does not work.

A decimal(float) is a lot uglier then a float(decimal) so keeping them as floats makes sense here.

The only other options are:

  • To add a better sum function that detects if the contents are all decimals or ints than use decimal, otherwise use float.
  • To have a use decimal option so we know that there will be no floats in the JSON
  • To assume ijson usage.

What do you think?

In both those methods, should we have if item: before the log message like in check_type? We don't check for the presence of keys in the calling code, so when a key isn't set, we'll end up with a log message due to the missing key, rather than due to a bad value.

Good catch! I was meaning to do that then forgot.

@jpmckinney
Copy link
Member

A decimal(float) is a lot uglier then a float(decimal) so keeping them as floats makes sense here.

I'm not sure that there's an issue. decimal.decimal(1.1) has an ugly internal representation (1.100000000000000088817841970012523233890533447265625), but this ends up getting rounded again if it's converted to float (float(decimal.Decimal(1.1)) is 1.1), which is what happens when OCDS Kit serializes back to JSON, as we haven't implemented a careful decimal serializer (see ocdskit.util._default).

Also convert numbers to decimals for processing.
@kindly
Copy link
Contributor Author

kindly commented Mar 1, 2020

Great that is no problem then. Just tried to convert everything to decimial and if it fails return 0.

@jpmckinney
Copy link
Member

Great, thanks!

@jpmckinney jpmckinney merged commit ab9eaa8 into master Mar 2, 2020
@jpmckinney jpmckinney deleted the oc4ids-transforms-cli branch March 2, 2020 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants