Extend _FileDataNodeMixin to support storage on remote storage #1859

Closed
jrobinAV opened this issue Sep 30, 2024 · 4 comments
Labels: Core: ⚙️ Configuration · Core: Data node · 💬 Discussion (Requires some discussion and decision) · 🟧 Priority: High (Must be addressed as soon) · 📝 Release Notes (Impacts the Release Notes or the Documentation in general) · 🔒 Staff only (Can only be assigned to the Taipy R&D team)

Comments

jrobinAV (Member) commented Sep 30, 2024

Today, we have an S3 object data node that reads/writes data from/to an S3 object. The added value of exposing s3_object data nodes seems greater if the user receives file data (CSV, Excel, etc.) instead of raw binary data.

We can extend the _FileDataNodeMixin by adding a config property to specify the storage as follows:

class _FileDataNodeMixin(object):
    """Mixin class designed to handle file-based data nodes
    (CSVDataNode, ParquetDataNode, ExcelDataNode, PickleDataNode, JSONDataNode, etc.)."""

    __EXTENSION_MAP = {"csv": "csv", "excel": "xlsx", "parquet": "parquet", "pickle": "p", "json": "json"}

    _IS_GENERATED_KEY = "is_generated"
    _DEFAULT_DATA_KEY = "default_data"
    _PATH_KEY = "path"
    _DEFAULT_PATH_KEY = "default_path"

    _POSSIBLE_STORAGE = ["local", "google_cloud_storage", "aws_s3_object", "azure_blob_storage", "dropbox", "google_drive"]
    _DEFAULT_STORAGE = "local"

    ...

Taipy should be able to read/write/append/download/upload the data to/from the specified storage with the right storage_type ("csv", "xlsx", "pickle", etc.).

The path properties should be made generic enough to handle every storage.

Originally posted by @jrobinAV in #1822 (review)
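For illustration, the configuration side could look like the sketch below. The storage property and the aws_* credential properties are made-up names showing the intent of the proposal, not an existing API.

from taipy import Config, Scope

# Hypothetical sketch: "storage" selects one of _POSSIBLE_STORAGE, the path is
# interpreted according to that storage, and the aws_* properties are placeholder
# names for provider-specific credentials.
sales_history_cfg = Config.configure_csv_data_node(
    id="sales_history",
    storage="aws_s3_object",                  # would default to "local" if omitted
    path="s3://my-bucket/sales/history.csv",  # generic path handled by the selected storage
    scope=Scope.GLOBAL,
    aws_access_key="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)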

jrobinAV added the 🟧 Priority: High, 📝 Release Notes, Core: Data node, 💬 Discussion, 🔒 Staff only, and Core: ⚙️ Configuration labels on Sep 30, 2024
trgiangdo (Member) commented:

There are a few problems with extending support to several remote storage providers.

  1. Problem with private files or limited access.

If the file's link is publicly accessible, reading/writing the remote file is going to be easy.

However, if the data link is private or requires authorization, the Taipy application needs to be connected to the data service somehow.

(If the link is public but read-only, uploading will still require authorization.)

Then Taipy needs to handle both the download and upload.
The developer will also need to provide some kind of token to the configuration.

Take configuring a CSV data node, for example:

from taipy import Config, Scope

Config.configure_csv_data_node(
    id="dropbox_csv_dn",
    path="https://www.dropbox.com/scl/fi/y1c7m47y9s1x5a2zbc5ex/test.csv",
    scope=Scope.GLOBAL,
    dropbox_access_token="32pikf12c7m37y9s1x5z2zcc5kx"
)

This is just an example and may not work since Dropbox uses OAuth and may also require a refresh token.
For Google Drive, it's similar.

However, Google Cloud Storage, AWS S3, and Azure Blob Storage each need yet another kind of authentication (a hypothetical per-provider sketch is shown after this list).

  2. Conflict with the default data

For a file-based data node, if the path doesn't exist, the default data will be written to the file.

However, if the path is a remote path, this may not be possible since the downloadable link is usually defined by the remote storage provider.
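To give a sense of how much the required properties diverge between providers, hypothetical configurations could look like the following. The property names are made up; only the underlying credential types (a service account key file for Google Cloud Storage, a connection string for Azure Blob Storage) reflect what each provider actually expects.

from taipy import Config, Scope

# Hypothetical property names; each provider expects a different credential type.
Config.configure_parquet_data_node(
    id="gcs_parquet_dn",
    path="gs://my-bucket/data.parquet",
    gcp_service_account_file="/secrets/service-account.json",  # GCS: service account key file
    scope=Scope.GLOBAL,
)

Config.configure_excel_data_node(
    id="azure_excel_dn",
    path="https://myaccount.blob.core.windows.net/container/data.xlsx",
    azure_connection_string="DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=...",  # Azure Blob: connection string
    scope=Scope.GLOBAL,
)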

jrobinAV (Member, Author) commented:

  1. Depending on the selected storage (local, google_cloud_storage, aws_s3_object, azure_blob_storage, Dropbox, Google Drive), the data node should be provided with some optional and required extra properties. As you proposed, these properties can easily be provided through the config for global data nodes, while for non-global data nodes, they should be set as data node properties at runtime. These properties could also be defined as environment variables, benefiting from the existing env variable mechanism (see the sketch after this list).
    I don't see this first bullet point as a problem.

  2. If Taipy needs to create the remote file based on default data, it must use an appropriate API. Tools like Dropbox or Google Drive should have APIs for creating new files; we need to check. If so, we should be able to get the downloadable link of the new file and set it as the data node path. No?
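For instance, relying on the existing environment variable mechanism (where a string config attribute can be set to "ENV[MY_VARIABLE]"), the token from the earlier example would not have to appear in the code. The dropbox_access_token property itself is still hypothetical:

from taipy import Config, Scope

# The token is resolved at runtime from the DROPBOX_ACCESS_TOKEN environment
# variable through Taipy's ENV[...] mechanism.
Config.configure_csv_data_node(
    id="dropbox_csv_dn",
    path="https://www.dropbox.com/scl/fi/y1c7m47y9s1x5a2zbc5ex/test.csv",
    scope=Scope.GLOBAL,
    dropbox_access_token="ENV[DROPBOX_ACCESS_TOKEN]",
)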

trgiangdo (Member) commented Nov 21, 2024

  1. Yes. My point is that it will be complex since it's different for each remote storage provider.

  2. What I mean is a case like this:

from taipy import Config, Scope

Config.configure_pickle_data_node(
    id="dropbox_csv_dn",
    path="https://www.dropbox.com/scl/fi/y1c7m47y9s1x5a2zbc5ex/test.p",
    scope=Scope.GLOBAL,
    dropbox_access_token="32pikf12c7m37y9s1x5z2zcc5kx",
    default_data=10,
)

where the user defines both the path and default_data, but Dropbox will then create a new file with a different path than the one we provided.

I guess we can solve this by allowing the path to be optional, and then setting the data node's path to that of the newly created file returned by Dropbox.
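Roughly, the write flow inside _FileDataNodeMixin could then look like this sketch. The _storage_client object and its methods are purely hypothetical placeholders for whatever provider client we end up using:

# Hypothetical sketch: the path is optional; when the remote provider creates
# the file from the default data, the path/link it returns is stored back on
# the data node.
def _write_default_data(self, default_data):
    serialized = self._serialize(default_data)  # e.g. CSV bytes, pickle bytes, ...
    if self._path:
        # A path was provided: upload the serialized default data to it.
        self._storage_client.upload(self._path, serialized)
    else:
        # No path: let the provider create the file, then keep the path it returns.
        created = self._storage_client.create_file(serialized)
        self._path = created.path  # e.g. the shared link returned by Dropbox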
