Remote-only packaging of MLMD Python lib
tarilabs committed Jan 29, 2024
1 parent bf5c8d3 commit dfe7252
Showing 10 changed files with 213 additions and 189 deletions.
128 changes: 4 additions & 124 deletions README.md
@@ -1,126 +1,6 @@
A remote-only, gRPC-only MLMD Python client variant.

# ML Metadata
## See also:

[![Python](https://img.shields.io/badge/python%20-3.8%7C3.9%7C3.10-blue)](https://github.com/google/ml-metadata)
[![PyPI](https://badge.fury.io/py/ml-metadata.svg)](https://badge.fury.io/py/ml-metadata)

*ML Metadata (MLMD)* is a library for recording and retrieving metadata
associated with ML developer and data scientist workflows.

NOTE: ML Metadata may be backwards incompatible before version 1.0.

## Getting Started

For more background on MLMD and instructions on using it, see the
[getting started guide](https://github.com/google/ml-metadata/blob/master/g3doc/get_started.md)

## Installing from PyPI

The recommended way to install ML Metadata is to use the
[PyPI package](https://pypi.org/project/ml-metadata/):

```bash
pip install ml-metadata
```

Then import the relevant packages:

```python
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2
```
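A minimal connection sketch follows (an assumption for illustration: an MLMD gRPC server is already reachable on `localhost:8080`, as in the testing files further below):

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Point the client at a remote MLMD gRPC server (host/port are placeholders).
client_config = metadata_store_pb2.MetadataStoreClientConfig()
client_config.host = 'localhost'
client_config.port = 8080

store = metadata_store.MetadataStore(client_config)
print(store.get_artifact_types())  # smoke test: list the registered types
```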

### Nightly Packages

ML Metadata (MLMD) also hosts nightly packages at
https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest
nightly package, please use the following command:

```bash
pip install --extra-index-url https://pypi-nightly.tensorflow.org/simple ml-metadata
```

## Installing with Docker

This is the recommended way to build ML Metadata under Linux, and is
continuously tested at Google.

Please first install `docker` and `docker-compose` by following the directions:
[docker](https://docs.docker.com/install/);
[docker-compose](https://docs.docker.com/compose/install/).

Then, run the following at the project root:

```bash
DOCKER_SERVICE=manylinux-python${PY_VERSION}
sudo docker-compose build ${DOCKER_SERVICE}
sudo docker-compose run ${DOCKER_SERVICE}
```

where `PY_VERSION` is one of `{38, 39, 310}`.

A wheel will be produced under `dist/`, and can be installed as follows:

```shell
pip install dist/*.whl
```

## Installing from source


### 1. Prerequisites

To compile and use ML Metadata, you need to set up some prerequisites.


#### Install Bazel

If Bazel is not installed on your system, install it now by following [these
directions](https://bazel.build/versions/master/docs/install.html).

#### Install cmake
If cmake is not installed on your system, install it now by following [these
directions](https://cmake.org/install/).

### 2. Clone ML Metadata repository

```shell
git clone https://github.com/google/ml-metadata
cd ml-metadata
```

Note that these instructions will install the latest master branch of ML
Metadata. If you want to install a specific branch (such as a release branch),
pass `-b <branchname>` to the `git clone` command.

### 3. Build the pip package

ML Metadata uses Bazel to build the pip package from source:

```shell
python setup.py bdist_wheel
```

You can find the generated `.whl` file in the `dist` subdirectory.

### 4. Install the pip package

```shell
pip install dist/*.whl
```

### 5. (Optional) Build the gRPC server

ML Metadata uses Bazel to build the C++ binary from source:

```shell
bazel build -c opt --define grpc_no_ares=true //ml_metadata/metadata_store:metadata_store_server
```

## Supported platforms

MLMD is built and tested on the following 64-bit operating systems:

* macOS 10.14.6 (Mojave) or later.
* Ubuntu 20.04 or later.
* Windows 10 or later.
Upstream project: https://github.com/google/ml-metadata
Motivations for this client variant: https://github.com/opendatahub-io/model-registry/blob/main/doc/remote_only_packaging_of_MLMD_Python_lib.md
6 changes: 6 additions & 0 deletions ml_metadata-1.14.0-remote-testing/conn_config.pb
@@ -0,0 +1,6 @@
connection_config {
  sqlite {
    filename_uri: '/tmp/shared/metadata.sqlite.db'
    connection_mode: READWRITE_OPENCREATE
  }
}
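For reference, this text-format file parses as an MLMD `MetadataStoreServerConfig` (presumably handed to the gRPC server, e.g. via its `--metadata_store_server_config_file` flag); a minimal sketch of reading it from Python, assuming the repository-relative path:

```python
from google.protobuf import text_format
from ml_metadata.proto import metadata_store_pb2

# Repository-relative path from this commit; adjust as needed.
with open('ml_metadata-1.14.0-remote-testing/conn_config.pb') as f:
    server_config = text_format.Parse(
        f.read(), metadata_store_pb2.MetadataStoreServerConfig())

print(server_config.connection_config.sqlite.filename_uri)
```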
163 changes: 163 additions & 0 deletions ml_metadata-1.14.0-remote-testing/demo_test.py
@@ -0,0 +1,163 @@
from pprint import pprint

import ml_metadata as mlmd
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

def test_demo():
    # Set up the gRPC client config
    client_connection_config = metadata_store_pb2.MetadataStoreClientConfig()
    client_connection_config.host = 'localhost'
    client_connection_config.port = 8080

    store = metadata_store.MetadataStore(client_connection_config)

    # Create ArtifactTypes, e.g., DataSet
    data_type = metadata_store_pb2.ArtifactType()
    data_type.name = "DataSet"
    data_type.properties["day"] = metadata_store_pb2.INT
    data_type.properties["split"] = metadata_store_pb2.STRING
    data_type_id = store.put_artifact_type(data_type)
    pprint(data_type_id)

    # Create ArtifactTypes, e.g., SavedModel
    model_type = metadata_store_pb2.ArtifactType()
    model_type.name = "SavedModel"
    model_type.properties["version"] = metadata_store_pb2.INT
    model_type.properties["name"] = metadata_store_pb2.STRING
    model_type_id = store.put_artifact_type(model_type)
    pprint(model_type_id)

    # Create a ContextType, e.g., ModelVersion
    model_version_type = metadata_store_pb2.ContextType()
    model_version_type.name = "odh.ModelVersion"
    model_version_type.properties["model_name"] = metadata_store_pb2.STRING
    model_version_type.properties["version"] = metadata_store_pb2.STRING
    model_version_type_id = store.put_context_type(model_version_type)
    pprint(model_version_type_id)

    # Query all registered Artifact types.
    artifact_types = store.get_artifact_types()
    pprint(artifact_types)

    # Create an ExecutionType, e.g., Trainer
    trainer_type = metadata_store_pb2.ExecutionType()
    trainer_type.name = "Trainer"
    trainer_type.properties["state"] = metadata_store_pb2.STRING
    trainer_type_id = store.put_execution_type(trainer_type)
    pprint(trainer_type_id)

    # Query a registered Execution type with the returned id
    [registered_type] = store.get_execution_types_by_id([trainer_type_id])
    pprint(registered_type)

    # Create an input artifact of type DataSet
    data_artifact = metadata_store_pb2.Artifact()
    data_artifact.uri = 'path/to/data'
    data_artifact.properties["day"].int_value = 1
    data_artifact.properties["split"].string_value = 'train'
    data_artifact.type_id = data_type_id
    [data_artifact_id] = store.put_artifacts([data_artifact])
    pprint(data_artifact_id)

    # Query all registered Artifacts
    artifacts = store.get_artifacts()
    pprint(artifacts)

    # Plus, there are many ways to query the same Artifact
    [stored_data_artifact] = store.get_artifacts_by_id([data_artifact_id])
    print(stored_data_artifact)
    artifacts_with_uri = store.get_artifacts_by_uri(data_artifact.uri)
    pprint(artifacts_with_uri)

    artifacts_with_conditions = store.get_artifacts(
        list_options=mlmd.ListOptions(
            filter_query='uri LIKE "%/data" AND properties.day.int_value > 0'))
    pprint(artifacts_with_conditions)

    # Register the Execution of a Trainer run
    trainer_run = metadata_store_pb2.Execution()
    trainer_run.type_id = trainer_type_id
    trainer_run.properties["state"].string_value = "RUNNING"
    [run_id] = store.put_executions([trainer_run])
    pprint(run_id)

    # Query all registered Executions
    executions = store.get_executions_by_id([run_id])
    pprint(executions)

    # Similarly, the same execution can be queried with conditions.
    executions_with_conditions = store.get_executions(
        list_options=mlmd.ListOptions(
            filter_query='type = "Trainer" AND properties.state.string_value IS NOT NULL'))
    pprint(executions_with_conditions)

    # Define the input event
    input_event = metadata_store_pb2.Event()
    input_event.artifact_id = data_artifact_id
    input_event.execution_id = run_id
    input_event.type = metadata_store_pb2.Event.DECLARED_INPUT

    # Record the input event in the metadata store
    store.put_events([input_event])

    # Declare the output artifact of type SavedModel
    model_artifact = metadata_store_pb2.Artifact()
    model_artifact.uri = 'path/to/model/file'
    model_artifact.properties["version"].int_value = 1
    model_artifact.properties["name"].string_value = 'MNIST-v1'
    model_artifact.type_id = model_type_id
    [model_artifact_id] = store.put_artifacts([model_artifact])
    pprint(model_artifact_id)

    # Declare the output event
    output_event = metadata_store_pb2.Event()
    output_event.artifact_id = model_artifact_id
    output_event.execution_id = run_id
    output_event.type = metadata_store_pb2.Event.DECLARED_OUTPUT

    # Submit output event to the Metadata Store
    store.put_events([output_event])

    # Mark the Trainer run as completed
    trainer_run.id = run_id
    trainer_run.properties["state"].string_value = "COMPLETED"
    store.put_executions([trainer_run])

    # Create a ContextType, e.g., Experiment with a note property
    experiment_type = metadata_store_pb2.ContextType()
    experiment_type.name = "Experiment"
    experiment_type.properties["note"] = metadata_store_pb2.STRING
    experiment_type_id = store.put_context_type(experiment_type)

    # Group the model and the trainer run to an experiment.
    my_experiment = metadata_store_pb2.Context()
    my_experiment.type_id = experiment_type_id
    # Give the experiment a name
    my_experiment.name = "exp1"
    my_experiment.properties["note"].string_value = "My first experiment."
    [experiment_id] = store.put_contexts([my_experiment])

    attribution = metadata_store_pb2.Attribution()
    attribution.artifact_id = model_artifact_id
    attribution.context_id = experiment_id

    association = metadata_store_pb2.Association()
    association.execution_id = run_id
    association.context_id = experiment_id

    store.put_attributions_and_associations([attribution], [association])

    # Query the Artifacts and Executions that are linked to the Context.
    experiment_artifacts = store.get_artifacts_by_context(experiment_id)
    pprint(experiment_artifacts)
    experiment_executions = store.get_executions_by_context(experiment_id)
    pprint(experiment_executions)

    # You can also use neighborhood queries to fetch these artifacts and
    # executions with conditions.
    experiment_artifacts_with_conditions = store.get_artifacts(
        list_options=mlmd.ListOptions(
            filter_query='contexts_a.type = "Experiment" AND contexts_a.name = "exp1"'))
    pprint(experiment_artifacts_with_conditions)
    experiment_executions_with_conditions = store.get_executions(
        list_options=mlmd.ListOptions(
            filter_query='contexts_a.id = {}'.format(experiment_id)))
    pprint(experiment_executions_with_conditions)
18 changes: 9 additions & 9 deletions ml_metadata/BUILD
@@ -42,17 +42,17 @@ _public_protos = [
    "//ml_metadata/proto:metadata_store_service_pb2_grpc.py",
]

_py_extension = select({
    ":windows": [
        "//ml_metadata/metadata_store/pywrap:metadata_store_extension.pyd",
    ],
    "//conditions:default": [
        "//ml_metadata/metadata_store/pywrap:metadata_store_extension.so",
    ],
})
# _py_extension = select({
#     ":windows": [
#         "//ml_metadata/metadata_store/pywrap:metadata_store_extension.pyd",
#     ],
#     "//conditions:default": [
#         "//ml_metadata/metadata_store/pywrap:metadata_store_extension.so",
#     ],
# })

sh_binary(
    name = "move_generated_files",
    srcs = ["move_generated_files.sh"],
    data = _py_extension + _public_protos,
    data = _public_protos,
)
23 changes: 5 additions & 18 deletions ml_metadata/metadata_store/metadata_store.py
@@ -28,7 +28,8 @@

from ml_metadata import errors
from ml_metadata import proto
from ml_metadata.metadata_store.pywrap.metadata_store_extension import metadata_store as metadata_store_serialized
# fork of ml-metadata supporting ONLY remote gRPC connection
# from ml_metadata.metadata_store.pywrap.metadata_store_extension import metadata_store as metadata_store_serialized
from ml_metadata.proto import metadata_store_pb2
from ml_metadata.proto import metadata_store_service_pb2
from ml_metadata.proto import metadata_store_service_pb2_grpc
@@ -110,19 +111,7 @@ def __init__(self, config, enable_upgrade_migration: bool = False):
    self._max_num_retries = 5
    self._service_client_wrapper = None
    if isinstance(config, proto.ConnectionConfig):
      self._using_db_connection = True
      migration_options = metadata_store_pb2.MigrationOptions()
      migration_options.enable_upgrade_migration = enable_upgrade_migration
      self._metadata_store = metadata_store_serialized.CreateMetadataStore(
          config.SerializeToString(), migration_options.SerializeToString())
      logging.log(logging.INFO, 'MetadataStore with DB connection initialized')
      logging.log(logging.DEBUG, 'ConnectionConfig: %s', config)
      if config.HasField('retry_options'):
        self._max_num_retries = config.retry_options.max_num_retries
        logging.log(logging.INFO,
                    'retry options is overwritten: max_num_retries = %d',
                    self._max_num_retries)
      return
      raise RuntimeError('Unimplemented. This is a fork of ml-metadata supporting ONLY remote gRPC connection')
    if not isinstance(config, proto.MetadataStoreClientConfig):
      raise ValueError('MetadataStore is expecting either '
                       'proto.ConnectionConfig or '
@@ -220,8 +209,7 @@ def _call_method(self, method_name, request, response) -> None:
      response: a protobuf message, filled from the return value of the method.
    """
    if self._using_db_connection:
      cc_method = getattr(metadata_store_serialized, method_name)
      self._pywrap_cc_call(cc_method, request, response)
      raise RuntimeError('Unimplemented. This is a fork of ml-metadata supporting ONLY remote gRPC connection')
    else:
      grpc_method = getattr(self._metadata_store_stub, method_name)
      try:
@@ -1783,8 +1771,7 @@ def downgrade_schema(config: proto.ConnectionConfig,
  try:
    migration_options = metadata_store_pb2.MigrationOptions()
    migration_options.downgrade_to_schema_version = downgrade_to_schema_version
    metadata_store_serialized.CreateMetadataStore(
        config.SerializeToString(), migration_options.SerializeToString())
    raise RuntimeError('Unimplemented. This is a fork of ml-metadata supporting ONLY remote gRPC connection')
  except RuntimeError as e:
    if str(e).startswith('MLMD cannot be downgraded to schema_version'):
      raise errors.make_exception(str(e), errors.INVALID_ARGUMENT) from e
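Taken together, these changes mean a direct-DB `proto.ConnectionConfig` is rejected at construction time, while the gRPC client path is untouched. A small sketch of the resulting behavior (the endpoint is a placeholder):

```python
from ml_metadata import proto
from ml_metadata.metadata_store import metadata_store

# gRPC client config: the only supported mode in this fork.
store = metadata_store.MetadataStore(
    proto.MetadataStoreClientConfig(host='localhost', port=8080))

# Direct database access is no longer compiled in, so this now raises.
db_config = proto.ConnectionConfig()
db_config.sqlite.filename_uri = '/tmp/shared/metadata.sqlite.db'
try:
    metadata_store.MetadataStore(db_config)
except RuntimeError as err:
    print(err)  # Unimplemented. This is a fork of ml-metadata supporting ONLY remote gRPC connection
```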