Extend _FileDataNodeMixin to support storage on remote storage #1859

Closed
jrobinAV opened this issue Sep 30, 2024 · 4 comments
Labels: Core: ⚙️ Configuration · Core: Data node · 💬 Discussion (Requires some discussion and decision) · 🟧 Priority: High (Must be addressed as soon) · 📝 Release Notes (Impacts the Release Notes or the Documentation in general) · 🔒 Staff only (Can only be assigned to the Taipy R&D team)

Comments

jrobinAV (Member) commented Sep 30, 2024

Today, we have an S3 object data node that reads/writes data from/to an S3 object. The added value of exposing s3_object data nodes seems greater if the user receives file data (CSV, Excel, etc.) instead of raw binary data.

We can extend the _FileDataNodeMixin by adding a config property to specify the storage as follows:

class _FileDataNodeMixin(object):
    """Mixin class designed to handle file-based data nodes
    (CSVDataNode, ParquetDataNode, ExcelDataNode, PickleDataNode, JSONDataNode, etc.)."""

    __EXTENSION_MAP = {"csv": "csv", "excel": "xlsx", "parquet": "parquet", "pickle": "p", "json": "json"}

    _IS_GENERATED_KEY = "is_generated"
    _DEFAULT_DATA_KEY = "default_data"
    _PATH_KEY = "path"
    _DEFAULT_PATH_KEY = "default_path"

    _POSSIBLE_STORAGE = ["local", "google_cloud_storage", "aws_s3_object", "azure_blob_storage", "dropbox", "google_drive"]
    _DEFAULT_STORAGE = "local"

    ...

Taipy should be able to read/write/append/download/upload the data to/from the specified storage with the right storage_type ("csv", "xlsx", "pickle", etc.).

The path properties should be made generic enough to handle every storage.

Originally posted by @jrobinAV in #1822 (review)
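For illustration, the configuration side could look like the sketch below. The storage property and the aws_* credential properties are made-up names showing the intent of the proposal, not an existing API.

from taipy import Config, Scope

# Hypothetical sketch: "storage" selects one of _POSSIBLE_STORAGE, the path is
# interpreted according to that storage, and the aws_* properties are placeholder
# names for provider-specific credentials.
sales_history_cfg = Config.configure_csv_data_node(
    id="sales_history",
    storage="aws_s3_object",                  # would default to "local" if omitted
    path="s3://my-bucket/sales/history.csv",  # generic path handled by the selected storage
    scope=Scope.GLOBAL,
    aws_access_key="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)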

jrobinAV added the 🟧 Priority: High, 📝 Release Notes, Core: Data node, 💬 Discussion, 🔒 Staff only, and Core: ⚙️ Configuration labels on Sep 30, 2024
trgiangdo (Member) commented:

There are a few problems with extending support to several remote storage providers.

  1. Problem with private files or limited access.

If the file's link is publicly accessible, reading/writing the remote file is going to be easy.

However, if the data link is private or requires authorization, the Taipy application needs to be connected to the data service somehow.

(If the link is public but read-only, uploading will still require authorization.)

Then Taipy needs to handle both the download and upload.
The developer will also need to provide some kind of token to the configuration.

Take configuring a CSV data node, for example:

from taipy import Config, Scope

Config.configure_csv_data_node(
    id="dropbox_csv_dn",
    path="https://www.dropbox.com/scl/fi/y1c7m47y9s1x5a2zbc5ex/test.csv",
    scope=Scope.GLOBAL,
    dropbox_access_token="32pikf12c7m37y9s1x5z2zcc5kx"
)

This is just an example and may not work since Dropbox uses OAuth and may also require a refresh token.
For Google Drive, it's similar.

However, Google Cloud Storage, AWS S3, and Azure Blob Storage each need yet another kind of authentication (a hypothetical per-provider sketch is shown after this list).

  2. Conflict with the default data

For a file-based data node, if the path doesn't exist, the default data will be written to the file.

However, if the path is a remote path, this may not be possible since the downloadable link is usually defined by the remote storage provider.
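To give a sense of how much the required properties diverge between providers, hypothetical configurations could look like the following. The property names are made up; only the underlying credential types (a service account key file for Google Cloud Storage, a connection string for Azure Blob Storage) reflect what each provider actually expects.

from taipy import Config, Scope

# Hypothetical property names; each provider expects a different credential type.
Config.configure_parquet_data_node(
    id="gcs_parquet_dn",
    path="gs://my-bucket/data.parquet",
    gcp_service_account_file="/secrets/service-account.json",  # GCS: service account key file
    scope=Scope.GLOBAL,
)

Config.configure_excel_data_node(
    id="azure_excel_dn",
    path="https://myaccount.blob.core.windows.net/container/data.xlsx",
    azure_connection_string="DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=...",  # Azure Blob: connection string
    scope=Scope.GLOBAL,
)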

jrobinAV (Member, Author) commented:

  1. Depending on the selected storage (local, google_cloud_storage, aws_s3_object, azure_blob_storage, Dropbox, Google Drive), the data node should be provided with some optional and required extra properties. As you proposed, these properties can easily be provided through the config for global data nodes, while for non-global data nodes, they should be set as data node properties at runtime. These properties could also be defined as environment variables, benefiting from the existing env variable mechanism (see the sketch after this list).
    I don't see this first bullet point as a problem.

  2. If Taipy needs to create the remote file based on default data, it must use an appropriate API. Tools like Dropbox or Google Drive should have APIs for creating new files; we need to check. If so, we should be able to get the downloadable link of the new file and set it as the data node path. No?
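For instance, relying on the existing environment variable mechanism (where a string config attribute can be set to "ENV[MY_VARIABLE]"), the token from the earlier example would not have to appear in the code. The dropbox_access_token property itself is still hypothetical:

from taipy import Config, Scope

# The token is resolved at runtime from the DROPBOX_ACCESS_TOKEN environment
# variable through Taipy's ENV[...] mechanism.
Config.configure_csv_data_node(
    id="dropbox_csv_dn",
    path="https://www.dropbox.com/scl/fi/y1c7m47y9s1x5a2zbc5ex/test.csv",
    scope=Scope.GLOBAL,
    dropbox_access_token="ENV[DROPBOX_ACCESS_TOKEN]",
)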

trgiangdo (Member) commented Nov 21, 2024

  1. Yes. My point is that it will be complex since it's different for each remote storage provider.

  2. What I mean is a case like this:

from taipy import Config, Scope

Config.configure_pickle_data_node(
    id="dropbox_csv_dn",
    path="https://www.dropbox.com/scl/fi/y1c7m47y9s1x5a2zbc5ex/test.p",
    scope=Scope.GLOBAL,
    dropbox_access_token="32pikf12c7m37y9s1x5z2zcc5kx",
    default_data=10,
)

where the user defines both the path and default_data, but Dropbox will then create a new file with a different path than the one we provided.

I guess we can solve this by allowing the path to be optional, and then setting the data node's path to that of the newly created file returned by Dropbox.
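Roughly, the write flow inside _FileDataNodeMixin could then look like this sketch. The _storage_client object and its methods are purely hypothetical placeholders for whatever provider client we end up using:

# Hypothetical sketch: the path is optional; when the remote provider creates
# the file from the default data, the path/link it returns is stored back on
# the data node.
def _write_default_data(self, default_data):
    serialized = self._serialize(default_data)  # e.g. CSV bytes, pickle bytes, ...
    if self._path:
        # A path was provided: upload the serialized default data to it.
        self._storage_client.upload(self._path, serialized)
    else:
        # No path: let the provider create the file, then keep the path it returns.
        created = self._storage_client.create_file(serialized)
        self._path = created.path  # e.g. the shared link returned by Dropbox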
