Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
Fork the Reversible Data Transforms repo on GitHub.
Clone your fork locally:
$ git clone git@github.com:your_name_here/RDT.git
Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:
$ mkvirtualenv RDT
$ cd RDT/
$ make install-develop
Claim or file an issue on GitHub. If there is already an issue on GitHub for the contribution you wish to make, claim it. If not, please file an issue and then claim it before creating a branch.
Create a branch for local development:
$ git checkout -b issue-[issue-number]-description-of-your-bugfix-or-feature
The naming scheme for your branch should have a prefix of the format issue-X, where X is the associated issue's number (e.g. issue-123-fix-foo-bug). If you are not developing on your own fork, further prefix the branch with your GitHub username, like githubusername/gh-123-fix-foo-bug.
Now you can make your changes locally.
Commit your changes and push your branch to GitHub:
$ git add .
$ git commit -m "Your detailed description of your changes."
$ git push origin name-of-your-branch
RDT follows certain coding style guidelines, and any change you make should conform to them. RDT uses the following third-party library to check the code style.
ruff:
$ ruff check rdt/ tests/
$ ruff format --check --diff rdt/ tests/
To run all of the code style checks in RDT, use the following command:
$ make lint
or if you are developing on Windows you can use:
$ invoke lint
There should be unit tests created specifically for any changes you add. The unit tests are expected to cover 100% of your contribution's code based on the coverage report. All the Unit Tests should comply with the following requirements:
- Unit Tests should be based only on the unittest and pytest modules.
- The tests that cover a module called rdt/path/to/a_module.py should be implemented in a separate module called tests/unit/path/to/test_a_module.py. Note that the module name has the test_ prefix and is located in a path that mirrors the one of the tested module, just inside the tests folder.
- Each method of the tested module should have at least one associated test method, and each test method should cover only one use case or scenario.
- Test case methods should start with the test_ prefix and have descriptive names that indicate which scenario they cover. Names such as test_some_method_input_none, test_some_method_value_error or test_some_method_timeout are good, but names like test_some_method_1, some_method or test_error are not.
- Each test should validate only what the code of the method being tested does, and not cover the behavior of any third party package or tool being used, which is assumed to work properly as long as it is being passed the right values.
- Any third party tool that may have any kind of random behavior, such as some Machine Learning models, databases or Web APIs, should be mocked using the mock library, and the only thing that should be tested is that our code passes the right values to them (see the sketch after this list).
- Unit tests should not use anything from outside the test and the code being tested. This includes not reading or writing to any file system or database, which should be properly mocked.
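For illustration only, here is a minimal sketch of a unit test module that follows these conventions. The module path, the clean_column function and the external_scorer dependency are hypothetical names invented for this example; they are not part of RDT.

# tests/unit/test_example_module.py
# Hypothetical tests for an equally hypothetical rdt/example_module.py.
from unittest.mock import patch

import pandas as pd

from rdt.example_module import clean_column  # hypothetical function under test


class TestCleanColumn:

    def test_clean_column_input_none(self):
        """Passing ``None`` should return an empty float Series."""
        # Run
        result = clean_column(None)

        # Assert
        pd.testing.assert_series_equal(result, pd.Series([], dtype='float64'))

    @patch('rdt.example_module.external_scorer')  # hypothetical third party dependency
    def test_clean_column_passes_values_to_scorer(self, mock_scorer):
        """The third party dependency is mocked; only what our code passes to it is checked."""
        # Setup
        data = pd.Series([1.0, None, 3.0])
        mock_scorer.return_value = 0.5

        # Run
        clean_column(data)

        # Assert
        mock_scorer.assert_called_once()

Each test method name describes the single scenario it covers, and no file system, database or network access is needed.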
To run the test suite in RDT locally, use the following command:
$ make test
or if you are developing on Windows, use:
$ invoke test
Before you submit a pull request, check that it meets these guidelines:
- It resolves an open GitHub Issue and contains its reference in the title or the comment. If there is no associated issue, feel free to create one.
- Whenever possible, it resolves only one issue. If your PR resolves more than one issue, try to split it into more than one pull request.
- The pull request should include unit tests that cover all the changed code.
- The pull request should work for all the supported Python versions. Check the GitHub Actions page and make sure that all the checks pass.
In addition to the guidelines mentioned above, there are extra steps that need to be taken
when adding a new Transformer
class. They are described in detail in this section.
When contributing a new transformer, the most obvious requirement is creating the new Transformer class. The class should inherit from BaseTransformer or one of its child classes. There are only three required methods for a transformer:
- _fit(data: pd.DataFrame): Used to store and learn any values from the input data that might be useful for the transformer.
- _transform(data: pd.DataFrame): Used to transform the input data into completely numeric data. This method should not modify the internal state of the Transformer instance.
- _reverse_transform(data: pd.DataFrame): Used to convert data that is completely numeric back into the format of the fitted data. This method should not modify the internal state of the Transformer instance.
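Below is a minimal, hedged sketch of such a class. The class name ExampleTransformer, the stored _dtypes attribute and the numeric conversion are illustrative assumptions; real transformers also declare the sdtype they support, so use the existing classes in rdt/transformers as the authoritative reference.

# Hypothetical transformer sketch; names and attributes are illustrative only.
import pandas as pd

from rdt.transformers import BaseTransformer


class ExampleTransformer(BaseTransformer):
    """Turn the input columns into numeric data and back (illustrative sketch)."""

    def _fit(self, data):
        # Learn and store anything needed later; here, the original dtypes.
        self._dtypes = data.dtypes

    def _transform(self, data):
        # Produce completely numeric output without modifying the instance state.
        return data.apply(pd.to_numeric, errors='coerce')

    def _reverse_transform(self, data):
        # Convert numeric data back to the format of the fitted data,
        # again without modifying the instance state.
        return data.astype(dict(self._dtypes))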
Each transformer class should be placed inside the rdt/transformers folder, in a module file named after the data type that the transformer operates on. The data types used by RDT are called sdtypes, and you can think of them as representing the semantic or statistical meaning of a datatype. For example, if you are writing a transformer that works with categorical data, your transformer should be placed inside the rdt/transformers/categorical.py module.
For a more detailed guide on writing transformers, refer to the Development Guide.
On top of adding the new class, unit tests must be written to cover all of the methods the new class uses. In some cases, integration tests may also be required. More details on this can be found below.
If the transformer adds a previously unsupported sdtype to RDT, then more steps will need to be taken for performance tests. A new DatasetGenerator class may need to be created for the sdtype. More details for these steps can be found below in the Transformer Performance section.
The code added for the new transformer must abide by the code style used in RDT. In addition, there are custom code style requirements that must also be met. These mostly have to do with class and method naming conventions. For example, all transformer classes must end in Transformer. They also have to inherit from the rdt.transformers.BaseTransformer class.
To validate the overall code style for your transformer, you can use the custom code validation
function, validate_transformer_code_style
. This function returns a boolean indicating whether
or not the transformer passed all the code style checks. It also prints a table describing each
check and whether or not it passed.
In [1]: from tests.contributing import validate_transformer_code_style
In [2]: valid = validate_transformer_code_style('rdt.transformers.BinaryEncoder') # Replace BinaryEncoder with your transformer
Validating source file C:\Datacebo\RDT\rdt\transformers\boolean.py
SUCCESS: The code style is correct.
Check                      Correct    Details
-------------------------  ---------  ---------------------------------------------------------
ruff_lint                  Yes        Code follows PEP8 standards.
ruff_format                Yes        Imports are properly sorted.
Transformer is subclass    Yes        The transformer is subclass of ``BaseTransformer``.
Valid module               Yes        The transformer is placed inside a valid module.
Valid test module          Yes        The transformer tests are placed inside the valid module.
Valid test function names  Yes        The transformer tests are named correctly.
Valid transformer addon    Yes        The addon is configured properly.
Importable from module     Yes        The transformer can be imported from the parent module.
In [3]: valid
Out[3]: True
- Unit tests should cover specific cases for each of the following methods: __init__, fit, transform and reverse_transform.
- Unit tests for a transformer must have 100% coverage based on the code coverage report.
- The tests should go in a module called tests/unit/transformers/{transformer_module} (see the layout sketch below).
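Continuing with the hypothetical ExampleTransformer from above, the test module could be laid out as follows. The method names are illustrative; they only show the scenario-based naming that the validator output below also demonstrates.

# tests/unit/transformers/test_example.py -- hypothetical layout
class TestExampleTransformer:

    def test___init__(self):
        """Cover the constructor's default attributes."""

    def test__fit_nan_values(self):
        """Cover how ``_fit`` handles columns that contain NaN values."""

    def test__transform_series(self):
        """Cover ``_transform`` when the input is a pandas Series."""

    def test__reverse_transform_out_of_range_values(self):
        """Cover ``_reverse_transform`` when values fall outside the expected range."""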
The transformer unit tests and their coverage can be validated using the validate_transformer_unit_tests function. This function returns a float value representing the test coverage, where 1.0 is 100%. It prints each test and whether or not it passed, along with a table summarizing the test coverage and a link to the full coverage report.
In [1]: from tests.contributing import validate_transformer_unit_tests
In [2]: test_coverage = validate_transformer_unit_tests('rdt.transformers.BinaryEncoder') # Replace BinaryEncoder with your transformer
Validating source file C:\Datacebo\RDT\rdt\transformers\boolean.py
================================================= test session starts =================================================
collected 12 items
tests/unit/transformers/test_boolean.py::TestBinaryEncoder::test___init__ PASSED [ 8%]
tests/unit/transformers/test_boolean.py::TestBinaryEncoder::test__fit_array PASSED [ 16%]
tests/unit/transformers/test_boolean.py::TestBinaryEncoder::test__fit_nan_ignore PASSED [ 25%]
tests/unit/transformers/test_boolean.py::TestBinaryEncoder::test__fit_nan_not_ignore PASSED [ 33%]
tests/unit/transformers/test_boolean.py::TestBinaryEncoder::test__reverse_transform_2d_ndarray PASSED [ 41%]
tests/unit/transformers/test_boolean.py::TestBinaryEncoder::test__reverse_transform_float_values PASSED [ 50%]
tests/unit/transformers/test_boolean.py::TestBinaryEncoder::test__reverse_transform_float_values_out_of_range PASSED [ 58%]
tests/unit/transformers/test_boolean.py::TestBinaryEncoder::test__reverse_transform_nan_ignore PASSED [ 66%]
tests/unit/transformers/test_boolean.py::TestBinaryEncoder::test__reverse_transform_nan_not_ignore PASSED [ 75%]
tests/unit/transformers/test_boolean.py::TestBinaryEncoder::test__reverse_transform_not_null_values PASSED [ 83%]
tests/unit/transformers/test_boolean.py::TestBinaryEncoder::test__transform_array PASSED [ 91%]
tests/unit/transformers/test_boolean.py::TestBinaryEncoder::test__transform_series PASSED [100%]
============================================ 12 passed, 1 warning in 0.08s ============================================
SUCCESS: The unit tests passed.
Name                          Stmts   Miss  Cover   Missing
-----------------------------------------------------------
rdt\transformers\boolean.py      37     19    49%   3-36, 40-55, 68, 88, 100
-----------------------------------------------------------
TOTAL                            37     19    49%
ERROR: The unit tests only cover 48.649% of your code.
Full coverage report here:
file:///C:/Datacebo/RDT/htmlcov/rdt_transformers_boolean_py.html
In [3]: test_coverage
Out [3]: 0.486
Integration tests should test the entire workflow of going from input data, to fitting, to transforming and finally reverse transforming the data. By default, we run integration tests for each transformer that validate the following checks:
- The Transformer correctly defines the sdtype that it supports.
- At least one Dataset Generator exists for the Transformer sdtype.
- The Transformer can transform data and produces outputs of the indicated sdtypes.
- The Transformer can reverse transform the data it produces, recovering the original sdtype.
- The HyperTransformer is able to use the Transformer and produce float values.
- The HyperTransformer is able to reverse the data that it has previously transformed and restore the original sdtype.
If you wish to test any specific end-to-end scenarios that were not covered by the above checks, add a new integration test. Integration tests can be added under tests/integration/path/to/test_a_module.py.
- Before putting up a PR, confirm that the automatic integration tests pass. If new functionality that isn't covered is added, feel free to add new integration tests, as in the sketch below.
- Integration tests for a transformer should be added under tests/integration/transformers/{transformer_module}.
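A minimal sketch of such a scenario test is shown below. The ExampleTransformer import and the column values are hypothetical, and the exact fit signature may differ between RDT versions, so check BaseTransformer for the current one.

# tests/integration/transformers/test_example.py -- hypothetical scenario test
import pandas as pd

from rdt.transformers.example import ExampleTransformer  # hypothetical transformer


def test_example_transformer_end_to_end():
    """Fit, transform and reverse transform should recover the original column."""
    # Setup: a small single-column dataset of the transformer's input sdtype.
    data = pd.DataFrame({'column': [1, 2, 3, None]})
    transformer = ExampleTransformer()

    # Run the full workflow; verify the current ``fit`` signature in ``BaseTransformer``.
    transformer.fit(data, 'column')
    transformed = transformer.transform(data)
    reversed_data = transformer.reverse_transform(transformed)

    # Assert: the transformed output is fully numeric and the round trip
    # recovers the original data.
    assert all(pd.api.types.is_numeric_dtype(dtype) for dtype in transformed.dtypes)
    pd.testing.assert_frame_equal(reversed_data, data)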
Integration tests can be validated using the validate_transformer_integration
function. This
function returns a boolean representing whether or not the transformer passes all integration
checks. It also prints a table describing each check and whether or not it passed.
In [1]: from tests.contributing import validate_transformer_integration
In [2]: valid = validate_transformer_integration('rdt.transformers.BinaryEncoder') # Replace BinaryEncoder with your transformer
Validating Integration Tests for transformer BinaryEncoder
SUCCESS: The integration tests were successful.
Check                                   Correct    Details
--------------------------------------  ---------  -----------------------------------------------------------------------------------------------------------------------
Dataset Generators                      Yes        At least one Dataset Generator exists for the Transformer sdtype.
Output Sdtypes                          Yes        The Transformer can transform data and produce output(s) of the indicated sdtype(s).
Reverse Transform                       Yes        The Transformer can reverse transform the data it produces, going back to the original sdtype.
Hypertransformer can transform          Yes        The HyperTransformer is able to use the Transformer and produce float values.
Hypertransformer can reverse transform  Yes        The HyperTransformer is able to reverse the data that it has previously transformed and restore the original sdtype.
In [3]: valid
Out [3]: True
We want to ensure our transformers are as efficient as possible, in terms of time and memory. In order to do so, we run performance tests on each transformer, based on the input sdtype specified by the transformer.
We generate test data using Dataset Generators. Each transformer should have at least one Dataset Generator that produces data of the transformer's input sdtype. If there are any specific dataset characteristics that you think may affect your transformer performance (e.g. constant data, mostly null data), consider adding a Dataset Generator for that scenario as well.
In order to test performance, we have a class that is responsible for generating data to test the transformer methods against. Each subclass implements two static methods, generate and get_performance_thresholds:
- generate takes in the number of rows to generate and outputs that number of rows of data.
- get_performance_thresholds returns the time and memory thresholds for each of the required transformer methods. These thresholds are per row.
You should make a generator for every type of column that you believe would be useful to test against. For some examples, you can look in the dataset generator folder.
The generators each have a SDTYPE class variable. This should match the sdtype that your transformer accepts as input.
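As a hedged illustration, a boolean generator could be sketched as follows. The base class import path and the threshold values are assumptions made for this example; treat the existing generators in the dataset generator folder as the authoritative reference for the exact structure.

# Hypothetical dataset generator sketch; the base class path and the
# threshold values are assumptions, not real RDT limits.
import numpy as np

from tests.datasets.base import BaseDatasetGenerator  # assumed base class location


class RandomBooleanGenerator(BaseDatasetGenerator):
    """Generate random boolean data for performance testing (illustrative only)."""

    SDTYPE = 'boolean'  # must match the sdtype your transformer accepts as input

    @staticmethod
    def generate(num_rows):
        # Return exactly ``num_rows`` rows of data of the declared sdtype.
        return np.random.choice([True, False], size=num_rows)

    @staticmethod
    def get_performance_thresholds():
        # Per-row time (seconds) and memory thresholds for each required
        # transformer method; the values here are placeholders.
        return {
            'fit': {'time': 1e-05, 'memory': 400},
            'transform': {'time': 1e-05, 'memory': 400},
            'reverse_transform': {'time': 5e-05, 'memory': 1000},
        }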
More details can be found in the Development Guide.
It is important to keep the performance of these transformers as efficient as possible. Below are some tips and common pitfalls to avoid when developing your transformer, so as to optimize performance.
- Avoid duplicate operations. If you need to do some change to an array/series, try to only do it once and reuse that variable later.
- Try to use vectorized operations when possible.
- When working with pandas Series, many operations already handle nulls. If you need to round, or get the max or min of a series, there is no need to filter out nulls before doing that calculation.
- pd.to_numeric is preferred over astype. pd.to_numeric also replaces all None values with NaNs, which can be operated on since np.nan is a float type.
- If you are working with a series that has booleans and null values, there is a nullable boolean type that can be leveraged to avoid having to filter out null values. The last two tips are illustrated in the snippet below.
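The snippet below illustrates the last two tips on a small, made-up Series; it is only a sketch of the pattern, not code taken from RDT.

import pandas as pd

# pd.to_numeric is preferred over astype: it turns None into np.nan,
# which is a float and can be operated on directly.
values = pd.Series(['1', '2', None])
numeric = pd.to_numeric(values)   # float64 Series: 1.0, 2.0, NaN
total = numeric.sum()             # nulls are skipped, no filtering needed

# For booleans mixed with nulls, the nullable boolean dtype avoids
# having to filter the nulls out first.
flags = pd.Series([True, False, None], dtype='boolean')
true_count = flags.sum()          # pd.NA is skipped automatically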
Validate the performance of your transformer using the validate_transformer_performance
function. This function returns a pandas.DataFrame
containing the performance results
of the transformer.
In [1]: from tests.contributing import validate_transformer_performance
In [2]: results = validate_transformer_performance('rdt.transformers.UnixTimestampEncoder') # Replace UnixTimestampEncoder with your transformer
Validating Performance for transformer UnixTimestampEncoder
SUCCESS: The Performance Tests were successful.
In [3]: results
Out [3]:
          Evaluation Metric         Value  Acceptable     Units  Compared to Average
0                Fit Memory  9.334700e+01         Yes  Mb / row             0.757455
1                  Fit Time  6.232677e-07         Yes   s / row             0.574041
2  Reverse Transform Memory  1.451382e+02         Yes  Mb / row             0.966153
3    Reverse Transform Time  6.641531e-07         Yes   s / row             1.080660
4          Transform Memory  8.896317e+01         Yes  Mb / row             0.656664
5            Transform Time  5.217231e-07         Yes   s / row             0.484631
Fix any performance issues that are reported. If there are no errors but performance can be improved, this function should be used for reference.
Re-run all the previous validations until they pass. For a final verification, run
validate_pull_request
and fix any errors reported. This function runs all the checks described
above. It also prints a table summarizing the results of all these checks.
In [1]: from tests.contributing import validate_pull_request
In [2]: valid = validate_pull_request('rdt.transformers.BinaryEncoder') # Replace BinaryEncoder with your transformer
...................
Check              Correct    Details
-----------------  ---------  ----------------------------------------------------------------------
Code Style         Yes        Code Style is acceptable.
Unit Tests         Yes        The unit tests are correct and run successfully.
Integration tests  Yes        The integration tests run successfully.
Performance Tests  Yes        The performance of the transformer is acceptable.
Clean Repository   Yes        There are no unexpected changes in the repository.
SUCCESS: The Pull Request can be made!
You can now commit all your changes, push to GitHub and create a Pull Request.
In [3]: valid
Out [3]: True
Once you have done everything above, you can create a PR. Do this by following the steps in the Pull Request Guidelines section. Review and fill out the checklist in the PR template to ensure your code is ready for review.
- If it does not exist, open an Issue on GitHub and describe the Transformer that will be added, including the sdtype that it handles and how it will handle it.
- Create and clone a fork of the RDT repository.
- Create a branch in this repository using the naming convention issue-[issue-number]-[transformer-name] (e.g. issue-123-address-transformer).
- Implement the Transformer class.
- Run the validate_transformer_code_style function described in the Code Style section and fix the reported errors.
- Implement Unit Tests for the Transformer.
- Run the validate_transformer_unit_tests function and fix the reported errors.
- Run the validate_transformer_integration function and fix the reported errors.
- If required, implement the Dataset Generators for the new sdtype. This is described in the Creating Dataset Generators section.
- Run the validate_transformer_performance function and fix any errors reported. If there are no errors but performance can be improved, use this function for reference.
- Run the validate_pull_request function as a final check and fix any errors reported.
- After all the previous steps pass, all the new and modified files can be committed and pushed to GitHub, and a Pull Request can be submitted. Follow the steps in the Pull Request Guidelines section to submit your Pull Request.