feat: Restructure packages to streamline the addition of new algorithms
roquelopez committed Nov 8, 2024
1 parent 8352db1 commit 8217711
Showing 30 changed files with 4,244 additions and 763 deletions.
50 changes: 50 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,50 @@
Contributing to bdi-kit
=======================

There are many ways to contribute to bdi-kit, such as improving the codebase, reporting
issues or bugs, enhancing the documentation, reviewing pull requests from other developers,
adding new matching methods, or expanding support for additional standards.
See the instructions below to get started!


Formatting the Code
-------------------

We format code using the [black](https://black.readthedocs.io/en/stable/) code formatter.
The CI runs for every pull request and will fail if code is not properly formatted.
To make sure the formatting is correct, follow these steps.

Make sure you have black installed:
```
pip install black
```

To format the code, you can run the following command before committing your changes:
```
make format
```

Or you can use the black command directly:
```
black ./bdikit/
```


Adding New Matching Methods
---------------------------

Contributors can add new methods for schema and value matching by following these steps:

1. Create a Python module inside the folder of the corresponding task (e.g., `bdikit/value_matching` for value matching).

2. Define a class in the module that implements either `BaseValueMatcher` (for value matching) or `BaseSchemaMatcher` (for schema matching).

3. Add a new entry to the corresponding `matcher_factory.py` (e.g., `bdikit/value_matching/matcher_factory.py`), making sure the import path of your module is correct so it can be loaded without errors. A minimal sketch of these steps is shown below.


Code of Conduct
---------------

We abide by the principles of openness, respect, and consideration of others
of the Python Software Foundation: https://www.python.org/psf/codeofconduct/.
20 changes: 2 additions & 18 deletions README.md
@@ -14,6 +14,7 @@ The `bdi-kit` is a library that assist users in performing data harmonization. I

**Warning:** `bdi-kit` is currently in *alpha* stage and under heavy development. Expect APIs to change.


## Documentation

Documentation is available at [https://bdi-kit.readthedocs.io/](https://bdi-kit.readthedocs.io/).
@@ -36,21 +37,4 @@ pip install git+https://github.com/VIDA-NYU/bdi-kit@devel

## Contributing

We format code using the [black](https://black.readthedocs.io/en/stable/) code formatter.
The CI runs for every pull request and will fail if code is not properly formatted.
To make sure formatting is correct, you can do the following steps.

Make sure you have black installed:
```
pip install black
```

To format the code, anyone can use the command before committing your changes:
```
make format
```

Or you can use the black command directly:
```
black ./bdikit/
```
To learn more about making a contribution to bdi-kit, please see our [Contributing guide](./CONTRIBUTING.md).
154 changes: 20 additions & 134 deletions bdikit/api.py
@@ -1,19 +1,16 @@
from __future__ import annotations
import logging
from enum import Enum

from collections import defaultdict
from os.path import join, dirname
from typing import (
Union,
Type,
List,
Dict,
TypedDict,
Set,
Optional,
Tuple,
Callable,
Mapping,
Any,
)
import itertools
@@ -22,37 +19,15 @@
import panel as pn
from IPython.display import display, Markdown
from bdikit.utils import get_gdc_data, get_gdc_metadata
from bdikit.mapping_algorithms.column_mapping.algorithms import (
BaseSchemaMatcher,
SimFloodSchemaMatcher,
ComaSchemaMatcher,
CupidSchemaMatcher,
DistributionBasedSchemaMatcher,
JaccardSchemaMatcher,
GPTSchemaMatcher,
ContrastiveLearningSchemaMatcher,
TwoPhaseSchemaMatcher,
MaxValSimSchemaMatcher,
)
from bdikit.mapping_algorithms.value_mapping.value_mappers import ValueMapper
from bdikit.models.contrastive_learning.cl_api import (
DEFAULT_CL_MODEL,
)
from bdikit.mapping_algorithms.column_mapping.topk_matchers import (
TopkColumnMatcher,
CLTopkColumnMatcher,
)
from bdikit.mapping_algorithms.value_mapping.algorithms import (
ValueMatch,
BaseValueMatcher,
TFIDFValueMatcher,
GPTValueMatcher,
EditDistanceValueMatcher,
EmbeddingValueMatcher,
AutoFuzzyJoinValueMatcher,
FastTextValueMatcher,
)
from bdikit.mapping_algorithms.value_mapping.value_mappers import (

from bdikit.schema_matching.best.base import BaseSchemaMatcher
from bdikit.schema_matching.best.matcher_factory import SchemaMatchers
from bdikit.schema_matching.topk.base import BaseTopkSchemaMatcher
from bdikit.schema_matching.topk.matcher_factory import TopkMatchers
from bdikit.value_matching.base import BaseValueMatcher, ValueMatch, ValueMatchingResult
from bdikit.value_matching.matcher_factory import ValueMatchers

from bdikit.mapping_functions import (
ValueMapper,
FunctionValueMapper,
DictionaryMapper,
@@ -67,37 +42,6 @@
logger = logging.getLogger(__name__)


class SchemaMatchers(Enum):
SIMFLOOD = ("similarity_flooding", SimFloodSchemaMatcher)
COMA = ("coma", ComaSchemaMatcher)
CUPID = ("cupid", CupidSchemaMatcher)
DISTRIBUTION_BASED = ("distribution_based", DistributionBasedSchemaMatcher)
JACCARD_DISTANCE = ("jaccard_distance", JaccardSchemaMatcher)
GPT = ("gpt", GPTSchemaMatcher)
CT_LEARGNING = ("ct_learning", ContrastiveLearningSchemaMatcher)
TWO_PHASE = ("two_phase", TwoPhaseSchemaMatcher)
MAX_VAL_SIM = ("max_val_sim", MaxValSimSchemaMatcher)

def __init__(self, method_name: str, method_class: Type[BaseSchemaMatcher]):
self.method_name = method_name
self.method_class = method_class

@staticmethod
def get_instance(
method_name: str, **method_kwargs: Mapping[str, Any]
) -> BaseSchemaMatcher:
methods = {method.method_name: method.method_class for method in SchemaMatchers}

try:
return methods[method_name](**method_kwargs)
except KeyError:
names = ", ".join(list(methods.keys()))
raise ValueError(
f"The {method_name} algorithm is not supported. "
f"Supported algorithms are: {names}"
)


def match_schema(
source: pd.DataFrame,
target: Union[str, pd.DataFrame] = "gdc",
@@ -130,7 +74,7 @@ def match_schema(
if isinstance(method, str):
if method_args is None:
method_args = {}
matcher_instance = SchemaMatchers.get_instance(method, **method_args)
matcher_instance = SchemaMatchers.get_matcher(method, **method_args)
elif isinstance(method, BaseSchemaMatcher):
matcher_instance = method
else:
@@ -154,34 +98,12 @@ def _load_table_for_standard(name: str) -> pd.DataFrame:
raise ValueError(f"The {name} standard is not supported")


class TopkMatchers(Enum):
CT_LEARNING = ("ct_learning", CLTopkColumnMatcher)

def __init__(self, method_name: str, method_class: Type[TopkColumnMatcher]):
self.method_name = method_name
self.method_class = method_class

@staticmethod
def get_instance(
method_name: str, **method_kwargs: Mapping[str, Any]
) -> TopkColumnMatcher:
methods = {method.method_name: method.method_class for method in TopkMatchers}
try:
return methods[method_name](**method_kwargs)
except KeyError:
names = ", ".join(list(methods.keys()))
raise ValueError(
f"The {method_name} algorithm is not supported. "
f"Supported algorithms are: {names}"
)


def top_matches(
source: pd.DataFrame,
columns: Optional[List[str]] = None,
target: Union[str, pd.DataFrame] = "gdc",
top_k: int = 10,
method: Union[str, TopkColumnMatcher] = "ct_learning",
method: Union[str, BaseTopkSchemaMatcher] = "ct_learning",
method_args: Optional[Dict[str, Any]] = None,
) -> pd.DataFrame:
"""
@@ -210,12 +132,12 @@ def top_matches(
if isinstance(method, str):
if method_args is None:
method_args = {}
topk_matcher = TopkMatchers.get_instance(method, **method_args)
elif isinstance(method, TopkColumnMatcher):
topk_matcher = TopkMatchers.get_matcher(method, **method_args)
elif isinstance(method, BaseTopkSchemaMatcher):
topk_matcher = method
else:
raise ValueError(
"The method must be a string or an instance of TopkColumnMatcher"
"The method must be a string or an instance of BaseTopkColumnMatcher"
)

top_k_matches = topk_matcher.get_recommendations(
@@ -232,47 +154,11 @@
return pd.concat(dfs, ignore_index=True)


class ValueMatchers(Enum):
TFIDF = ("tfidf", TFIDFValueMatcher)
EDIT = ("edit_distance", EditDistanceValueMatcher)
EMBEDDINGS = ("embedding", EmbeddingValueMatcher)
AUTOFJ = ("auto_fuzzy_join", AutoFuzzyJoinValueMatcher)
FASTTEXT = ("fasttext", FastTextValueMatcher)
GPT = ("gpt", GPTValueMatcher)

def __init__(self, method_name: str, method_class: Type[BaseValueMatcher]):
self.method_name = method_name
self.method_class = method_class

@staticmethod
def get_instance(
method_name: str, **method_kwargs: Mapping[str, Any]
) -> BaseValueMatcher:
methods = {method.method_name: method.method_class for method in ValueMatchers}
try:
return methods[method_name](**method_kwargs)
except KeyError:
names = ", ".join(list(methods.keys()))
raise ValueError(
f"The {method_name} algorithm is not supported. "
f"Supported algorithms are: {names}"
)


class ValueMatchingResult(TypedDict):
source: str
target: str
matches: List[ValueMatch]
coverage: float
unique_values: Set[str]
unmatch_values: Set[str]


def match_values(
source: pd.DataFrame,
target: Union[str, pd.DataFrame],
column_mapping: Union[Tuple[str, str], pd.DataFrame],
method: str = DEFAULT_VALUE_MATCHING_METHOD,
method: Union[str, BaseValueMatcher] = DEFAULT_VALUE_MATCHING_METHOD,
method_args: Optional[Dict[str, Any]] = None,
) -> Union[pd.DataFrame, List[pd.DataFrame]]:
"""
@@ -314,11 +200,11 @@ def match_values(
if method_args is None:
method_args = {}

if "top_n" in method_args and method_args["top_n"] > 1:
if "top_k" in method_args and method_args["top_k"] > 1:
logger.warning(
f"Ignoring 'top_n' argument, use the 'top_value_matches()' method to get top-k value matches."
f"Ignoring 'top_k' argument, use the 'top_value_matches()' method to get top-k value matches."
)
method_args["top_n"] = 1
method_args["top_k"] = 1

matches = _match_values(source, target, column_mapping, method, method_args)

@@ -457,7 +343,7 @@ def _match_values(
target_domain, column_mapping_list = _format_value_matching_input(
source, target, column_mapping
)
value_matcher = ValueMatchers.get_instance(method, **method_args)
value_matcher = ValueMatchers.get_matcher(method, **method_args)
mapping_results: List[ValueMatchingResult] = []

for mapping in column_mapping_list: