Remove value registration wrappers (#672)
* Remove value registration wrappers

George's cloudpickle approach makes them obsolete.

* Add note about testing ingestors using cloudpickle

* A little more detail and repetition
sgillies authored Oct 21, 2024
1 parent b036227 commit d4cd10b
Showing 2 changed files with 31 additions and 18 deletions.
25 changes: 25 additions & 0 deletions README.md
@@ -38,6 +38,31 @@ Documentation uses [Quarto](https://quarto.org/) style documentation.
- For continuous integration, the token is configured for the `unittest` user and all tests should pass
- For interactive use, if your `TILEDB_REST_TOKEN` points to your own account, most tests will run, except for those that explicitly check against contents of the `unittest` account which are skipped

### Ingestion testing

This package contains modules and functions that are run in TileDB Cloud as UDFs. Local ingestor changes can be tested in the cloud using a feature of [cloudpickle](https://github.com/cloudpipe/cloudpickle?tab=readme-ov-file#overriding-pickles-serialization-mechanism-for-importable-constructs). Below is a runnable example, with the following placeholders:

- `TILEDB_NAMESPACE`: your TileDB namespace ("TileDB-Inc", for example).
- `TILEDB_ACCESS_CREDENTIAL_NAME`: the name of the stored credentials for accessing `AWS_BUCKET`.
- `IMAGE_FILE_KEY`: the key of an object in that bucket.
- `OUTPUT_GROUP_KEY`: the key to be used for the group that the `ingest()` UDF will create in `AWS_BUCKET`.

```python
import cloudpickle

import tiledb.cloud.bioimg

cloudpickle.register_pickle_by_value(tiledb.cloud.bioimg)

tiledb.cloud.bioimg.ingest(
    source="s3://AWS_BUCKET/IMAGE_FILE_KEY",
    output="s3://AWS_BUCKET/OUTPUT_GROUP_KEY",
    config=None,
    namespace="TILEDB_NAMESPACE",
    acn="TILEDB_ACCESS_CREDENTIAL_NAME",
    ingest_resources={"cpu": "8", "memory": "32Gi"},
)
```

In this case `tiledb.cloud.bioimg.ingest()` uses cloudpickle to send a local function to TileDB Cloud, and `cloudpickle.register_pickle_by_value(tiledb.cloud.bioimg)` directs cloudpickle to bring the currently imported `tiledb.cloud.bioimg` module along with the function. Your local version of the module will be used instead of the version currently deployed in TileDB Cloud.
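The pickle-by-value behavior can be demonstrated without a TileDB account. In the sketch below, the module name `mylocalmod` and the function `greet()` are hypothetical stand-ins for a locally modified package; the point is that after registration, the function's code travels inside the pickle payload rather than as a bare import reference:

```python
import sys
import types

import cloudpickle

# Build a throwaway module standing in for a locally modified package.
# ("mylocalmod" and greet() are illustrations, not part of tiledb-cloud-py.)
mod = types.ModuleType("mylocalmod")
exec("def greet():\n    return 'hello from local code'", mod.__dict__)
sys.modules["mylocalmod"] = mod

# After registration, functions from this module are pickled by value:
# their bytecode is embedded in the payload instead of a reference
# equivalent to "import mylocalmod; mylocalmod.greet".
cloudpickle.register_pickle_by_value(mod)
payload = cloudpickle.dumps(mod.greet)

restored = cloudpickle.loads(payload)
print(restored())
```

This is the same mechanism the `ingest()` example relies on, only with `tiledb.cloud.bioimg` in place of the toy module.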

Note: your local changes to the Cloud-Py package need to be installed, because cloudpickle must be able to find them at serialization time.
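One quick sanity check is to ask where Python would load the package from; the sketch below uses the stdlib `json` package as a stand-in, since the same call works for any installed package:

```python
import importlib.util


def module_origin(name: str):
    """Return the file a module would be loaded from, or None if it is
    not installed."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None


# In a real checkout you would pass "tiledb.cloud" instead of "json".
# For cloudpickle to ship your local edits, the reported origin should
# point into your working tree (e.g. after an editable install).
print(module_origin("json"))
```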

### Releasing

Releasing is entirely automated. Releases made on GitHub using tags that start with "v", like "v0.12.28", trigger sdist and wheel builds and the upload of those distributions to the Python Package Index.
24 changes: 6 additions & 18 deletions src/tiledb/cloud/soma/ingest.py
@@ -7,7 +7,6 @@
 
 import tiledb
 from tiledb.cloud import dag
-from tiledb.cloud._common import functions
 from tiledb.cloud._common import utils
 from tiledb.cloud.utilities import as_batch
 from tiledb.cloud.utilities import get_logger_wrapper
@@ -155,7 +154,7 @@ def run_ingest_workflow_udf(
             logger.debug("Submitting h5ad file: entry_input_uri=%r", entry_input_uri)
 
             node = grf.submit(
-                _ingest_h5ad_byval,
+                ingest_h5ad,
                 output_uri=entry_output_uri,
                 input_uri=entry_input_uri,
                 measurement_name=measurement_name,
@@ -182,7 +181,7 @@ def run_ingest_workflow_udf(
         # we can use it as the resources argument of this method
         # call.
         h5ad_ingest = grf.submit(
-            _ingest_h5ad_byval,
+            ingest_h5ad,
             output_uri=output_uri,
             input_uri=input_uri,
             measurement_name=measurement_name,
@@ -201,7 +200,7 @@ def run_ingest_workflow_udf(
         # Register the SOMA result if not DRY-RUN
         if not dry_run:
             register_soma = grf.submit(
-                _register_dataset_udf_byval,
+                register_dataset_udf,
                 output_uri,
                 namespace=namespace,
                 register_name=register_name,
@@ -305,7 +304,7 @@ def __getattr__(self, name: str) -> object:
         logging.info("Dry run for %s to %s", input_uri, output_uri)
         return
 
-    with _hack_patch_anndata_byval():
+    with _hack_patch_anndata():
         try:
             input_data = anndata.read_h5ad(
                 _FSPathWrapper(input_file, input_uri), "r"
@@ -425,7 +424,7 @@ def run_ingest_workflow(
     }
 
     grf.submit(
-        _run_ingest_workflow_udf_byval,
+        run_ingest_workflow_udf,
         output_uri=output_uri,
         input_uri=input_uri,
         measurement_name=measurement_name,
@@ -455,15 +454,4 @@ def run_ingest_workflow(
     }
 
 
-# FIXME: Until we fully get this version of tiledb.cloud deployed server-side,
-# we must refer to all functions by value rather than by reference
-# -- which is a fancy way of saying these functions _will not work at all_ until
-# and unless they are checked into tiledb-cloud-py and deployed server-side.
-# _All_ dev work _must_ use this idiom.
-_ingest_h5ad_byval = functions.to_register_by_value(ingest_h5ad)
-_run_ingest_workflow_byval = functions.to_register_by_value(run_ingest_workflow)
-_run_ingest_workflow_udf_byval = functions.to_register_by_value(run_ingest_workflow_udf)
-_register_dataset_udf_byval = functions.to_register_by_value(register_dataset_udf)
-_hack_patch_anndata_byval = functions.to_register_by_value(_hack_patch_anndata)
-
-ingest = as_batch(_run_ingest_workflow_byval)
+ingest = as_batch(run_ingest_workflow)

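The diff's last line wires the public `ingest` entry point directly to `run_ingest_workflow`. The sketch below illustrates the general decorator pattern such a wrapper follows; it is a hypothetical stand-in, not the actual `tiledb.cloud.utilities.as_batch` implementation, which builds and submits a task graph:

```python
import functools


def as_batch(func):
    """Sketch of a batch wrapper: calling the wrapped function records a
    submission for remote execution instead of running it inline."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        job = {"func": func.__name__, "args": args, "kwargs": kwargs}
        # A real implementation would hand `job` to TileDB Cloud here.
        return job
    return wrapper


def run_ingest_workflow(output_uri, input_uri):
    """Stand-in for the real workflow function."""


ingest = as_batch(run_ingest_workflow)
job = ingest("s3://bucket/out", input_uri="s3://bucket/in.h5ad")
print(job["func"])
```

Because the wrapper captures the function object itself, rebinding `ingest` to the plain `run_ingest_workflow` (rather than a `_byval` alias) is all the diff needs once cloudpickle handles by-value serialization.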