# Feature/manipulate version #44

**Open** - wants to merge 4 commits into `main`

`docs/tools/README.md`: 163 changes (135 additions, 28 deletions)
@@ -2,7 +2,8 @@

## ValidateCamelCasedCols

The function takes a dataframe and validates whether all columns, or a given subset
of columns, are camelCased.
The algorithm is simple; the following must hold:
* Column name must be camelCased.
* Column name must NOT contain two or more recurrent characters.
@@ -32,11 +33,15 @@ OUTPUT: True

*This is just a tool for investigation - not for production use.*

Some files - like eventhub capture files - contain a binary encoded *Body* column.
The `ExtractEventhubBody` class can help decode the column.
You can either get the encoded schema as a json schema (`extract_json_schema`) or
transform the dataframe using `transform_df`.

Be aware that the schema extraction can be a slow process, so it is not recommended
to use the extractor in a production setting.
*HINT: You should instead find a way to have a static schema definition - either as
a json schema or a pyspark struct, or by reading the schema from a target table -
and use that to decode the Body.*
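
A minimal sketch of that static-schema approach in plain pyspark (the schema and the
sample data below are hypothetical stand-ins; adapt both to your own Body payload):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# A stand-in for an eventhub capture dataframe with a binary *Body* column
df = spark.createDataFrame(
    [(bytearray(b'{"deviceId": "dev1", "timestamp": 1690000000}'),)], ["Body"]
)

# Hypothetical static schema describing the decoded Body payload
body_schema = StructType(
    [
        StructField("deviceId", StringType(), True),
        StructField("timestamp", LongType(), True),
    ]
)

# Cast the binary Body to string and parse it using the static schema
df_decoded = df.withColumn(
    "Body", F.from_json(F.col("Body").cast("string"), body_schema)
)
```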

``` python
@@ -65,9 +70,20 @@ OUTPUT:

## ModuleHelper

The `ModuleHelper` class provides developers with a useful tool for interacting with
modules in Python. Its primary purpose is to allow developers to retrieve all
modules from a given package or module in a flexible manner, without requiring
detailed knowledge of the module structure. Additionally, the `ModuleHelper` class
enables developers to retrieve classes and/or subclasses of a specified type from a
package or module, further simplifying the process of working with multiple modules.

For example, consider a scenario where a developer is working on a large-scale
Python project with numerous modules, many of which may not be directly related to
the current task at hand. By using the `ModuleHelper` class, the developer can
quickly and easily retrieve all relevant modules or classes/subclasses, without
needing to know the precise structure or location of each individual
module/class/subclass. This can save significant time and effort, as well as making
the code more maintainable and easier to understand.

### Example - `get_modules()` method

@@ -86,7 +102,9 @@ Consider the following project:
└── __init__.py
```

The modules `dataplatform.foo.main` and `dataplatform.bar.sub` can be retrieved
using the `get_modules()` method (if either module had any submodules, those would
be retrieved as well):

```python
from spetlrtools.helpers import ModuleHelper
@@ -96,7 +114,9 @@ denmark_modules = ModuleHelper.get_modules(
)
```

The above returns a dictionary, where each key points to the location of a module.
The values are the respective modules of type `ModuleType` (from the builtin
`types` library):

```python
{
@@ -129,9 +149,14 @@ class D:
... # implementation of class D
```

We have that `main.py` defines a `class A`, and that `class B` and `class C` are
subclasses (inherit) hereof. Keep in mind that `class C` inherits from `class A`,
and that `class A` is imported from the `dataplatform.foo.main` module. `class D`
just sits in `dataplatform.bar.sub` but is not a subclass of any of the other classes.

Using the `get_classes_of_type()` method from the `ModuleHelper`, all definitions of
`class A` can be retrieved together with its subclasses `class B` and `class C` (but
not `class D`):

```python
from spetlrtools.helpers import ModuleHelper
@@ -143,17 +168,35 @@ classes_and_subclasses_of_type_A = ModuleHelper.get_classes_of_type(
)
```

The above returns a dictionary, where the keys point to the location of the classes.
Each value is a dictionary containing information about the module that the class is
associated with, as well as the class itself:

```python
{
"dataplatform.foo.main.A": {"module_name": str, "module": ModuleType, "cls_name": str, "cls", type},
"dataplatform.foo.main.B": {"module_name": str, "module": ModuleType, "cls_name": str, "cls", type},
"dataplatform.bar.sub.C": {"module_name": str, "module": ModuleType, "cls_name": str, "cls", type},
"dataplatform.foo.main.A": {
"module_name": str,
"module": ModuleType,
"cls_name": str,
"cls": type
},
"dataplatform.foo.main.B": {
"module_name": str,
"module": ModuleType,
"cls_name": str,
"cls": type
},
"dataplatform.bar.sub.C": {
"module_name": str,
"module": ModuleType,
"cls_name": str,
"cls": type
},
}
```

The `get_classes_of_type()` method is configurable such that only classes of exactly
the `obj` type are returned, and not its subclasses:
```python
from spetlrtools.helpers import ModuleHelper
from dataplatform.foo.main import A
@@ -168,7 +211,12 @@ only_main_classes_of_type_A = ModuleHelper.get_classes_of_type(
The above returns:
```python
{
"dataplatform.foo.main.A": {"module_name": str, "module": ModuleType, "cls_name": str, "cls", type}
"dataplatform.foo.main.A": {
"module_name": str,
"module": ModuleType,
"cls_name": str,
"cls": type
}
}
```

@@ -188,14 +236,25 @@ only_main_classes_of_type_A = ModuleHelper.get_classes_of_type(
The above returns:
```python
{
"dataplatform.foo.main.B": {"module_name": str, "module": ModuleType, "cls_name": str, "cls", type},
"dataplatform.bar.sub.C": {"module_name": str, "module": ModuleType, "cls_name": str, "cls", type},
"dataplatform.foo.main.B": {
"module_name": str,
"module": ModuleType,
"cls_name": str,
"cls": type
},
"dataplatform.bar.sub.C": {
"module_name": str,
"module": ModuleType,
"cls_name": str,
"cls": type
},
}
```

## TaskEntryPointHelper

The `TaskEntryPointHelper` provides the method `get_all_task_entry_points()`, which
uses the `ModuleHelper` (see the documentation above) to retrieve all `task()`
methods of the subclasses of the class `TaskEntryPoint`. Note that `TaskEntryPoint`
is an abstract base class from atc-dataplatform; see the documentation over there.

### Example - `get_all_task_entry_points()` method

@@ -236,7 +295,8 @@ class Second(TaskEntryPoint):
... # implementation of the task here
```

Now, by utilizing the `get_all_task_entry_points()` method, all the `task()` class
methods can automatically be discovered as entry points:

```python
from spetlrtools.entry_points import TaskEntryPointHelper
@@ -256,9 +316,12 @@ This returns a dictionary:
}
```

The developer can add this key-value pair to the setup of their package. When new
subclasses of the `TaskEntryPoint` class are added, this function automatically
discovers the entry points for their `task()` methods.
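
As a sketch, the returned mapping can be passed straight to setuptools in a
`setup.py` (assuming, as the text above suggests, that the dictionary is in the
`entry_points` format; the argument shown here is a hypothetical package list - see
the surrounding examples for the actual call):

```python
from setuptools import setup

from spetlrtools.entry_points import TaskEntryPointHelper

setup(
    name="dataplatform",
    # Hypothetical argument; the discovered task() entry points are injected
    # directly into the package metadata.
    entry_points=TaskEntryPointHelper.get_all_task_entry_points(["dataplatform"]),
)
```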

If the developer wants to see the entry points, a path to a txt file can be added
when executing the method:

```python
from spetlrtools.entry_points import TaskEntryPointHelper
@@ -275,11 +338,15 @@ dataplatform.foo.main.First = dataplatform.foo.main:First.task
dataplatform.bar.sub.Second = dataplatform.bar.sub:Second.task
```

This way it is easy to verify and check entry points manually if the developer's
workflow depends on this.

### Example - Using the `get_all_task_entry_points()` method with a different base class

The `get_all_task_entry_points()` method is tied closely to the atc-dataplatform
`TaskEntryPoint` class. If there is a use case for implementing other custom base
classes (with a `task()` abstract class method), then an `entry_point_objects` list
variable can be set to look for different base classes. See the example below:

```python
from abc import ABC, abstractmethod
@@ -323,5 +390,45 @@ TaskEntryPointHelper.get_all_task_entry_points(
)
```

This returns a dictionary of entry points pointing to `A`, `B`, and `C` as they are
children of the new `OtherBaseClass` and `AnotherBaseClass` classes.


## Manipulate Versions

In our release pipelines, we pursue a strategy of combined manual and automated
version handling. The file `src/VERSION.txt` contains a version of the form
`major.minor.micro` in conformance with
[Python PEP-0440](https://peps.python.org/pep-0440/).
We provide a tool that automatically increments the micro and release-candidate
version so that it is higher than what is on PyPI and test.PyPI, allowing uploads
to happen automatically.

The intention is that all release candidates are uploaded only to test.PyPI, while
all final versions are uploaded to PyPI proper.

The tool supports this manipulation when used as follows:
```
usage: spetlr-manipulate-version [-h] [-t] [--name NAME] [--version-file VERSION_FILE]

Automatically set the version for upload to pypi

optional arguments:
-h, --help show this help message and exit
-t prepare pre-release version for test.pypi
--name NAME Package name, if different from name in setup.cfg
--version-file VERSION_FILE
location of version to manipulate
```

In the current repo, it can be used without arguments. The manipulations are best
illustrated by this example:

| situation | VERSION.txt | pypi.org | test.pypi.org | cli flags | new version |
|--------------------------------|-------------|----------|---------------|-----------|-------------|
| post-integration version 0.2.8 | 0.2.8 | 0.1.34 | 0.1.34rc4 | -t | 0.2.8rc1 |
| release new version 0.2.8 | 0.2.8 | 0.1.34 | 0.2.8rc1 | | 0.2.8 |
| normal post-integration | 0.2.8 | 0.2.8 | 0.2.8rc1 | -t | 0.2.9rc1 |
| second post-integration | 0.2.8 | 0.2.8 | 0.2.9rc1 | -t | 0.2.9rc2 |
| normal release | 0.2.8 | 0.2.8 | 0.2.9rc1 | | 0.2.9 |
| re-run of release | 0.2.8 | 0.2.9 | 0.2.9rc1 | | 0.2.10 |
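
The bumping rule behind the table can be sketched with `packaging.version` (the
dependency added to `setup.cfg` below). This is a conceptual sketch, not the
shipped implementation:

```python
from packaging.version import Version


def next_version(file_version: str, pypi: str, test_pypi: str, test: bool) -> str:
    """Sketch: pick a version strictly higher than what is already published."""
    v = Version(file_version)
    released = Version(pypi)
    # If the file version is already released on PyPI, bump the micro part.
    if v <= released:
        v = Version(f"{released.major}.{released.minor}.{released.micro + 1}")
    if not test:
        return str(v)
    # For test.pypi, append the next free release-candidate number.
    t = Version(test_pypi)
    rc = t.pre[1] + 1 if t.release == v.release and t.pre else 1
    return f"{v.major}.{v.minor}.{v.micro}rc{rc}"


# Reproduces the table rows above, e.g.:
assert next_version("0.2.8", "0.2.8", "0.2.9rc1", test=True) == "0.2.9rc2"
assert next_version("0.2.8", "0.2.9", "0.2.9rc1", test=False) == "0.2.10"
```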

`setup.cfg`: 2 changes (2 additions, 0 deletions)
@@ -38,6 +38,7 @@ install_requires =
requests
dateparser
pytest
packaging

[options.packages.find]
where=src
@@ -55,6 +56,7 @@ console_scripts =
spetlr-freeze-req = spetlrtools.requirements:main
spetlr-az-databricks-token = spetlrtools.az_databricks_token.main:main
spetlr-test-job = spetlrtools.test_job.main:main
spetlr-manipulate-version = spetlrtools.manipulate_version:main


[flake8]