Initial skbase design proposal #76

Merged
merged 15 commits into from
Jan 8, 2023
220 changes: 220 additions & 0 deletions enhancement_proposals/sbep_1.md
@@ -0,0 +1,220 @@
# skbase Enhancement Proposal 1

Contributors: @RNKuhns, @fkiraly

## Overview

`skbase` seeks to provide a general framework for creating and working with classes
that follow scikit-learn and sktime style design patterns. To accomplish this,
`skbase` will provide tools to make it easier for developers of other packages,
toolboxes or applications to reuse the `skbase` interfaces and design patterns.

Specifically, `skbase` will provide:

- [Base classes](#Base-Class-Interfaces) with `scikit-learn` and `sktime`
style interfaces
- [Tools for working with base classes](#Tools-For-Working-With-Base-Classes)
- Object collection (retrieval)
- Object testing
- Object comparison
- An [example repository](#Example-Repository) that serves the dual purpose of
illustrating how developers can use `skbase` in their own projects and
providing test cases
- A [template repository](#Template-Repository) that developers can clone to
easily set up a new project using `skbase`'s principles

Although the package will initially inherit some of this functionality from
`scikit-learn`, the goal is to make it easy to use the design patterns in a
variety of contexts, not just those that depend on `scikit-learn`. Accordingly,
`skbase` has a goal of providing the proposed functionality with minimal
third-party dependencies. This will eventually include the removal of any
dependency on `scikit-learn`.

The rest of this design document provides an outline of the proposed interfaces.

## Design

`skbase`'s core functionality will be available through submodules tailored to
a given use case.

- `skbase.base` will include the `BaseObject` class and related base classes.
- `skbase.lookup` will include the tools for retrieving (i.e., collecting,
looking up) any `BaseObject`s from a project.
- `skbase.validate` will include tools for validating and comparing `BaseObject`s
and/or collections of `BaseObject`s.
- `skbase.testing` will include tools for testing `BaseObject`s for interface

👍 I just wanted to say that I love the idea of skbase.validate and skbase.testing. Especially think there is a space to also specify the tools as classes themselves. (eg a LengthChecker that can validate that certain specified inputs have the same type).

Contributor

agreed! Did not notice it, but I agree with @miraep8 that tests as objects would be nice.

Contributor Author

It is an interesting idea, though beyond the intended scope of this proposal. I do think we should revisit this topic when we dive further into the skbase.testing interface (hopefully in the not too distant future).

Going this route, testing is really executing a pipeline of "checker" or "validator" objects (which makes intuitive sense). This could also be useful in parameter validation.

compliance.

The proposed [example repository](#Example-Repository) and
[repository template](#Template-Repository) will live in separate repositories.

### skbase.base: Base Class Interfaces

`skbase`'s primary API is provided through classes that allow for
`scikit-learn`'s and `sktime`'s design patterns to be re-used in additional
contexts. This includes:

- [BaseObject](#BaseObject): a base class providing the package's primary
class level interface. Other classes are subclasses of `BaseObject`.
- [BaseEstimator](#BaseEstimator): A subclass of `BaseObject` that adds a
high-level interface for *fittable* estimators
- [BaseMetaObject](#BaseMetaObject): A subclass of `BaseObject`
Contributor

hm, you're subclassing this. Is this fine?
What would an estimator be that is also a metaobject? Do we get a diamond inheritance diagram then?

Contributor Author

@fkiraly aren't forecasting pipelines really estimators that are also metaobjects? That being said, I went the Mixin route in the final version.

that provides a high-level interface for working with classes composed of
collections of `BaseObject`s.
- [Base pipeline classes](#Base-Pipeline-Classes): Subclasses of `BaseMetaObject`
that provide generic functionality for common pipeline use cases.

#### BaseObject

BaseObjects are base classes with:

- `scikit-learn` style interface to get and set parameters
- `sktime` style interface for working with *tags*

Another question - has come up recently within sktime - I think one could in theory replace a lot of the functionality currently captured in tags via input validation. Of course I think tags could still play an important role in estimator search.. but think this could still be done in a more robust way using the input tests - ie do we implement this type of input for test for this object etc. (Not necessarily against tags across the board, but think in the current implementation they are perhaps a bit under-documented..)

Contributor

here's an old, related, STEP that never got implemented: https://github.com/sktime/enhancement-proposals/blob/main/steps/05_scitype_based_IO_checks/step.md

I would do it differently today, but just FYI

Contributor Author

@miraep8 from the skbase perspective we want to make it possible to use "tags". We leave the usage to the individual packages (but I agree there can be non-tag ways to accomplish the same outcome). I do believe there are cases where tags can make the most sense.

- `sktime` style interface for cloning and re-instantiation (resetting)
- `sktime` style interface for generating test instances
- `sktime` style interface for retrieving fitted parameters
- `scikit-learn` style interface for representing objects (e.g., pretty printing
and drawing a simple block representation in HTML)
- Support for simple composition where parameter arguments are other `BaseObject`s,
including the ability to get and set the parameters of component `BaseObject`s
- An interface for parameter validation

fkiraly marked this conversation as resolved.
`BaseObject`s should also follow certain design patterns and coding practices,
including:

- Specify all parameters in the class's ``__init__`` as explicit keyword arguments.
No ``*args`` or ``**kwargs`` should be used to set class parameters.
- Keyword arguments should be stored in the class as attributes with the same name.
These should be documented in the parameters section of the docstring, and not
documented in the attributes section of the docstring.
- All instance attributes should be created in ``__init__``. If the attribute
is not assigned a value until later, initialize it as None.
- Attributes that depend on the state of the instance's parameters should end
in an underscore to easily communicate that they are "state" dependent.
- Start non-public attributes and methods with an underscore
(per standard Python conventions).
- Document all public attributes that are not parameters in the class docstring's
attributes section.
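
The conventions above can be illustrated with a minimal sketch. It uses plain Python rather than the actual `skbase` base class: `SketchObject`, `Scaler`, and their method bodies are hypothetical stand-ins written for this illustration, not the proposed implementation.

```python
import inspect


class SketchObject:
    """Hypothetical stand-in for a BaseObject-like class (not skbase's code)."""

    @classmethod
    def _param_names(cls):
        # Parameters are exactly the explicit keyword arguments of __init__.
        sig = inspect.signature(cls.__init__)
        return sorted(name for name in sig.parameters if name != "self")

    def get_params(self):
        # Works because each parameter is stored under an attribute of the same name.
        return {name: getattr(self, name) for name in self._param_names()}

    def set_params(self, **params):
        valid = self._param_names()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self

    def clone(self):
        # A fresh instance built from parameters only; no state is carried over.
        return type(self)(**self.get_params())


class Scaler(SketchObject):
    """Hypothetical example object following the conventions.

    Parameters
    ----------
    factor : float, default=1.0
        Multiplier applied to the input values.

    Attributes
    ----------
    n_seen_ : int or None
        Number of values processed by the last call to `apply`.
    """

    def __init__(self, factor=1.0):
        self.factor = factor  # parameter stored under the same name
        self.n_seen_ = None   # state attribute, created in __init__ as None
        self._cache = None    # non-public attribute, leading underscore

    def apply(self, values):
        self.n_seen_ = len(values)
        return [self.factor * v for v in values]
```

Because parameters are explicit keyword arguments stored under matching attribute names, `get_params`, `set_params`, and `clone` can be written once in the base class and reused by every subclass.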

#### BaseEstimator

Scikit-learn style [estimators](https://scikit-learn.org/stable/tutorial/statistical_inference/settings.html?highlight=estimator#estimators-objects) are *"objects that learn from data"*.

In `scikit-learn` and `sktime` these can be *regressors*, *classifiers*,
*clusterers*, *annotators*, *forecasters*, *transformers* and other types of
classes implementing learning algorithms.

`BaseEstimator` ties together the different algorithm categories through a
common high-level interface for learning (fitting) parameters from data by inheriting
from `BaseObject` and providing an additional interface for *fittable* learning
algorithms, including:

- An instance `is_fitted` property denoting whether the estimator has been
fit (`is_fitted` does this by inspecting a non-public `_is_fitted` attribute
set in each algorithm's call to its `fit` method).
- An instance `check_is_fitted` method for raising an error when an estimator
has not been fitted yet.

Although the `BaseEstimator` interface may seem like it should include a `fit`
method for learning the parameters from data, it is not included. Instead
`BaseEstimator` assumes sub-classes implement a `fit` method (and that this
method appropriately sets the `_is_fitted` attribute), since the signature of
`fit` is learning task specific. Therefore, the specific `fit` implementation
is left to child classes implemented outside of `skbase`.
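
A minimal sketch of this division of responsibility is shown below. The class names (`SketchEstimator`, `MeanEstimator`, `NotFittedError`) and the `predict` method are hypothetical and exist only for this illustration; only `is_fitted`, `check_is_fitted`, and `_is_fitted` come from the description above.

```python
class NotFittedError(ValueError, AttributeError):
    """Hypothetical error raised when an unfitted estimator is used."""


class SketchEstimator:
    """Hypothetical stand-in for the BaseEstimator interface described above."""

    def __init__(self):
        self._is_fitted = False

    @property
    def is_fitted(self):
        # Public, read-only view of the non-public flag.
        return self._is_fitted

    def check_is_fitted(self):
        if not self.is_fitted:
            raise NotFittedError(
                f"{type(self).__name__} has not been fitted yet; call `fit` first."
            )


class MeanEstimator(SketchEstimator):
    """Hypothetical child class that supplies its own task-specific `fit`."""

    def __init__(self):
        super().__init__()
        self.mean_ = None

    def fit(self, y):
        self.mean_ = sum(y) / len(y)
        self._is_fitted = True  # the child class is responsible for setting this
        return self

    def predict(self, n):
        self.check_is_fitted()
        return [self.mean_] * n
```

The base class never defines `fit`, so each child class is free to choose the signature its learning task needs, as long as it sets `_is_fitted`.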

#### BaseMetaObject

The `BaseObject` interface is designed to make it easy for developers to provide
uniform functionality for interacting with parametric objects in their
projects. This makes it particularly helpful in use cases where working with
collections of objects is important. For example, iteratively applying
computations to a dataset is commonplace in statistical and data processing
applications.

`scikit-learn` and `sktime`, for instance, include pipeline classes that let
users apply a series of estimators to their data in sequence. Meanwhile,
in data engineering pipelines, data may undergo a series of transformations prior
to being used or stored.

`skbase` supports these use cases by providing `BaseMetaObject`, a
high-level interface for working with classes composed of collections of
`BaseObject`s or `BaseEstimator`s. `BaseMetaObject` expands on the `BaseObject`
interface by adding support for composite objects whose parameter values include
collections of `BaseObject`s. This includes built-in parameter getting/setting on
nested `BaseObject`s when parameter values are collections of `BaseObject`s, and
functionality for working with the tags of nested objects.
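
One way nested parameter access could work is sketched below. It assumes the `scikit-learn` convention of addressing nested parameters as `<name>__<param>` and a hypothetical `steps` parameter holding `(name, object)` pairs; `Component` and `SketchMetaObject` are illustrative names, not part of the proposal.

```python
class Component:
    """Hypothetical parametric object with a flat get/set interface."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def get_params(self):
        return {"alpha": self.alpha}

    def set_params(self, **params):
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self


class SketchMetaObject:
    """Hypothetical composite whose `steps` parameter is a list of (name, object)."""

    def __init__(self, steps):
        self.steps = steps

    def get_params(self, deep=True):
        params = {"steps": self.steps}
        if deep:
            for name, obj in self.steps:
                params[name] = obj
                # Expose each component parameter as `<name>__<param>`.
                for key, value in obj.get_params().items():
                    params[f"{name}__{key}"] = value
        return params

    def set_params(self, **params):
        params = dict(params)
        if "steps" in params:
            self.steps = params.pop("steps")
        named = dict(self.steps)
        for key, value in params.items():
            name, sep, sub_key = key.partition("__")
            if not sep or name not in named:
                raise ValueError(f"Invalid parameter {key!r}")
            # Route `<name>__<param>` to the named component.
            named[name].set_params(**{sub_key: value})
        return self


meta = SketchMetaObject(steps=[("a", Component(alpha=0.5)), ("b", Component())])
meta.set_params(b__alpha=3.0)
```

After the last call, the component named `"b"` holds `alpha == 3.0` while `"a"` is unchanged.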

##### Base Pipeline Classes
`skbase` intends to provide additional meta classes that expand on the `BaseMetaObject` API to include specific functionality
for the type of common pipeline use cases found in `scikit-learn`, `sktime` and
data processing workflows. This functionality will eventually be made available through two sub-classes of `BaseMetaObject`:

- `BasePipeline` expands on the `BaseMetaObject` framework by providing
common functionality for (linear) step-based pipelines like those found
in `scikit-learn` and `sktime`
- `BaseDAGPipeline` expands on the `BaseMetaObject` framework by providing
Contributor

well, this is probably going to be hairy, because it may require base class polymorphism.
Have you been involved with the pywatts discussion stream?

Although, in skbase, it might be easier since there is only one scitype (or two).


Question - do we need a BasePipeline class at all? Or do we simply need to make a sufficiently flexible pipeline class? I think having a BasePipeline class opens one up to making many types of pipeline classes (which is kind of what we do in sktime now tbh). But a better way to do this is perhaps to just build a pipeline that can handle sufficient complexity of different use cases.

Contributor Author

@miraep8 I think that is a valid point for a single package. But the end-stage goal is for skbase to be used external to sktime (I for instance have other use-cases and the hope is it would make creating a new package easy for someone else too). In that case, a pipeline in a different package would be able to inherit some base functionality, but the implementation of the "action" or "state-changing" methods that the pipeline performs would be up to the external developer.

functionality for pipelines composed of directed acyclic graphs (DAGs)
Contributor

typo


Each base pipeline class will focus on providing generic functionality for
implementing the specific pipeline use case, while stopping short of implementing methods for fitting, transforming or otherwise updating the state of the pipeline.

### Tools For Working With Base Classes

`skbase` should make it easy for developers to work with BaseObjects and create
their own packages that follow `skbase`'s principles. To that end,
`skbase` includes tools that simplify the common workflows
that arise.

#### `skbase.lookup`: Collecting (Retrieving) Information on BaseObjects and Package Metadata

The need to look up classes arises in several contexts when working with
parametric objects, including the need to collect *similar* objects for
testing or reporting. For example, the developer of a machine learning package
may want to gather all the objects that subclass the base class for a given
type of learning problem so they can be tested for interface compliance (e.g.,
collect all *regressors* or *forecasters*).

`skbase` provides this through two function interfaces:

- `all_objects` provides the ability to recursively walk packages (or sub-packages)
and return the objects that meet the filters specified in the function call.
- `get_package_metadata` provides the ability to recursively walk a package's modules
and collect information on the items contained in the modules (including
objects and functions)
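
The kind of package walk behind such a lookup can be sketched with the standard library alone. The function name `collect_subclasses` and its signature are assumptions made for this illustration; they are not the proposed `all_objects` interface.

```python
import importlib
import inspect
import pkgutil


def collect_subclasses(package_name, base_class):
    """Return {qualified name: class} for subclasses of `base_class` in a package."""
    package = importlib.import_module(package_name)
    modules = [package]
    # Packages expose __path__; a plain module has none and is searched on its own.
    search_path = getattr(package, "__path__", [])
    for info in pkgutil.walk_packages(search_path, prefix=package_name + "."):
        try:
            modules.append(importlib.import_module(info.name))
        except Exception:
            continue  # skip modules that fail to import

    found = {}
    for module in modules:
        for name, obj in inspect.getmembers(module, inspect.isclass):
            # Keep only classes defined in this module, not re-exported ones.
            defined_here = obj.__module__ == module.__name__
            if defined_here and issubclass(obj, base_class) and obj is not base_class:
                found[f"{obj.__module__}.{name}"] = obj
    return found


# Usage: collect every ValueError subclass defined in the standard `json` package.
json_errors = collect_subclasses("json", ValueError)
```

A real implementation would likely add further filters (for example on tags or object type), which this sketch leaves out.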

#### `skbase.testing`: Testing BaseObjects

When building packages like `sktime` and `scikit-learn` that are made up of
related objects, all the objects need to comply with the expected
interfaces, and the required functionality needs to work as expected. `skbase`
seeks to make this easy by providing extensible functionality for automatically
collecting and testing classes that descend from `BaseObject`.

This benefits users by:

- Letting them incorporate these tests in their own projects, reducing the need
to spend time testing that their classes comply with the interface inherited
from `BaseObject`.
- Providing an extensible framework they can use to collect and test their own
object interfaces and functionality.

#### `skbase.validate`: Validating and Comparing BaseObjects

When developing packages that include parametric objects, verifying and comparing
objects is a common workflow.

To aid this `skbase` will provide functions to:
- Check if a BaseObject complies with the expected interface
- Check if two BaseObjects have the same parameters and parameter
values.
- Check if a sequence is all BaseObjects.
- Check if a collection contains named objects in allowable interface formats.
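
Two of these checks can be sketched as follows. The function names `has_same_params` and `is_sequence_of`, and the `Param` class, are hypothetical and chosen only for this illustration; the sketch assumes objects expose a `get_params` method returning a dictionary.

```python
class Param:
    """Hypothetical parametric object used only for this illustration."""

    def __init__(self, a=1, b="x"):
        self.a = a
        self.b = b

    def get_params(self):
        return {"a": self.a, "b": self.b}


def has_same_params(left, right):
    """True if both objects are the same type with equal parameter values."""
    return type(left) is type(right) and left.get_params() == right.get_params()


def is_sequence_of(objs, cls):
    """True if `objs` is a list or tuple whose elements are all instances of `cls`."""
    return isinstance(objs, (list, tuple)) and all(isinstance(o, cls) for o in objs)
```

Comparing on parameters rather than identity means two separately constructed objects with equal settings are treated as equivalent, which is usually what tests of cloning or serialization need.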

### Example Repository
This would be a simple example package that illustrates how `skbase`'s functionality
can be used to create another package. It will also be used by `skbase` to test
`skbase.testing`.

### Template Repository
To make it easy for others to use `skbase` in new projects, it is useful to provide
RNKuhns marked this conversation as resolved.
a template repository that provides a starting point. This can be accomplished
by creating and maintaining a
[cookiecutter](https://cookiecutter.readthedocs.io/en/stable/README.html#features)
template. Users can then use the `cookiecutter` package's functionality to
set up their own project.