Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OPFS (Origin Private File System) Support #1856

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

e1arikawa
Copy link

Description:

This PR implements OPFS (Origin Private File System) support in the latest version of duckdb-wasm based on PR #1490.
This allows database files to be read and written to the OPFS.

API:
When opening a database file, you can now use the following API:

await db.open({
    path: 'opfs://test.db',
    accessMode: duckdb.DuckDBAccessMode.READ_WRITE,
});
const conn = await db.connect();

Changes:

  • Added support for OPFS.
    • Bug Fixes:
      • Resolved issues related to COPY TO functionality.
      • Fixed support for signed S3 URLs.

@carlopi
Copy link
Collaborator

carlopi commented Sep 18, 2024

This is very very welcome, thanks.

I am in the process of releasing a new version of duckdb-wasm, connected to duckdb 1.1.0.
Once that is out, I will properly review this PR.

@e1arikawa
Copy link
Author

@carlopi
Thank you very much for your update and for taking the time to review this PR.
I look forward to your feedback once the release is out.
Please let me know if there's anything I can assist with in the meantime.

@carlopi
Copy link
Collaborator

carlopi commented Sep 18, 2024

One comment would be if you could have a look at the failing test

@e1arikawa e1arikawa force-pushed the feature/opfs_support branch 2 times, most recently from bbb2067 to 383ff1b Compare September 23, 2024 02:57
@e1arikawa
Copy link
Author

e1arikawa commented Sep 24, 2024

@carlopi
I have merged the latest main branch and confirmed that the tests using DuckDB 1.1.1 pass successfully.

https://github.com/AKABANAKK/duckdb-wasm/actions/runs/11016245572

@carlopi
Copy link
Collaborator

carlopi commented Sep 25, 2024

Thanks, I gave a proper look, this looks solid, thanks a lot for the contribution.

This PR does introduces some API changes that might be somewhat unexpected, so I think the proper way forward would be:

  • tag v1.29.0 (later today)
  • merge this PR / iterate
  • tag v1.30.0 in the next weeks

@e1arikawa
Copy link
Author

e1arikawa commented Sep 25, 2024

@carlopi
Thank you for reviewing my contribution. Regarding the API changes, I will carefully test them after tagging v1.29.0.
Please proceed with the work as per your proposed plan. I appreciate your continued support.

@e1arikawa e1arikawa changed the title Add OPFS (Origin Private File System) Support to the Latest Version of duckdb-wasm Add OPFS (Origin Private File System) Support Oct 5, 2024
@e1arikawa
Copy link
Author

@carlopi
I will be using the release version that includes this PR for my work. Therefore, I would like to ask when the PR will be merged. If there is anything I can do to assist with the merge process, please let me know. Thank you for your time and support.

@amiller-gh
Copy link

amiller-gh commented Nov 13, 2024

@e1arikawa have you published your release version somewhere accessible on NPM while we wait? I'd love to kick the wheels on it in the meantime.

(edit: copied source files over from https://github.com/e1arikawa/duckdb-wasm-samples/tree/main/demo-app so I could play, but would still appreciate an NPM module for simplicity!)

await writable.write(parquetBuffer);
await writable.close();
//2. handle is empty object, because worker gets a File Handle using the file name.
await db.registerFileHandle('test.parquet', {}, duckdb.DuckDBDataProtocol.BROWSER_FSACCESS, true);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, this API choice feels a little counter-intuitive. I understand that we can't pass FileSystemSyncAccessHandle to the workers over postMessage, but if we're using the file name to open an OPFS file handle, why would we ever choose to use this over referencing opfs://test.parquet in the query? I assume with this implementation read_parquet('opfs://test.parquet') would work?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_parquet requires registering a file handler in advance with registerFileHandler. Therefore, if you register an empty handler, such as {} or null, it will enable the use of OPFS files.

if use directories, Please use registerFileHandler when working with directories.

Copy link

@amiller-gh amiller-gh Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, if you're specifying a protocol, duckdb-wasm should know how to access the data protocol out of the box – much like how SQLite does it, or how DuckDB will auto-install the parquet extension when you read a parquet file. The extra ceremony of registering a file handler should not be needed. Couldn't this OPFS auto-discovery happen here? https://github.com/duckdb/duckdb-wasm/pull/1856/files#diff-b177d9b16bfe243b4b81d4b3ccd9e36cff40a7f26d8ab15f6b78c27d869bb1a8R368

And, as said in another comment, directories are a core feature of OPFS. Ignoring them in in this implementation, and telling integrators to fall back on a slower API is not a good developer experience, especially when they may not be in control of file locations in OPFS. This limitation is additionally confusing when prepareDBFileHandle supports directories, and registerFileHandler does not.

Copy link
Author

@e1arikawa e1arikawa Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using db.registerFileHandler, please obtain the file handler from a directory in OPFS.

const datadir = await opfsRoot.getDirectoryHandle("datadir");
const fileHandle = await datadir.getFileHandle('test.parquet');

// some process

await db.registerFileHandle('test.parquet', fileHandle, duckdb.DuckDBDataProtocol.BROWSER_FSACCESS, true);

I’m using this PR for product development, and there are cases where I use the OPFS file handler for purposes other than DuckDB. That’s why I sometimes choose to manually register the file handler.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in other comments, adding more features at once in the current situation where this PR hasn’t been merged yet will only broaden the scope of the review. My focus is on getting this PR merged.

Copy link
Author

@e1arikawa e1arikawa Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The opfs:// protocol is already implemented.
However, since merging this PR is the top priority, I haven’t committed it yet.

// should get sync handle from the file name.
try {
const opfsRoot = await navigator.storage.getDirectory();
const fileHandle = await opfsRoot.getFileHandle(name);
Copy link

@amiller-gh amiller-gh Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe as written, this appears to only get shallow files. If a file path is nested deep in a directory, this will throw! E.g. if we pass opfs://parent_dir/file_name.db. For reference, prepareDBFileHandle appears to handle directory traversal.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use registerFileHandler when working with directories.

Copy link

@amiller-gh amiller-gh Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Directories are a core feature of OPFS. Ignoring them in in this implementation, and telling integrators to fall back on a slower API, is not a good developer experience, especially when they may not be in control of file locations in OPFS. This limitation is additionally confusing when prepareDBFileHandle supports directories, and registerFileHandler does not.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is not yet focused on enhancing the developer experience beyond what’s necessary to get OPFS working. The reason for this is that adding more features at once would widen the scope of the review, especially given that this PR has not yet been merged. My focus is on ensuring that this PR gets merged.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using db.registerFileHandler, please obtain the file handler from a directory in OPFS.

const datadir = await opfsRoot.getDirectoryHandle("datadir");
const fileHandle = await datadir.getFileHandle('test.parquet');
await db.registerFileHandle('test.parquet', fileHandle, duckdb.DuckDBDataProtocol.BROWSER_FSACCESS, true);

throw e;
});
try {
const handle = await fileHandle.createSyncAccessHandle();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileSystemSyncAccessHandles have strict locking requirements. If another web worker or tab has an open handle to this resource, this call will fail.

SQLite attempts to get around this with an automatic retry with exponential backoff. https://sqlite.org/forum/info/58a377083cd24a

It is also possible to implement a fully async awaitable lock with the Web Locks API: https://developer.mozilla.org/en-US/docs/Web/API/Web_Locks_API

A full implementation would likely use both approaches, since you don't know what other JS threads may be accessing the OPFS file that don't rely on web locks.

I think this is another reason to only provide opfs access through the opfs:// protocol: I assume this access handle is released once the query is done? Or do these handles stay open for the lifetime of the db connection?

If the handles are automatically closed at the end of the query, then this would free up the lock to be used by other tabs / web workers. If you give users a sanctioned API to register FileSystemSyncAccessHandles through registerFileHandle, a user may shoot themselves in the foot by forgetting to call duckdb.dropFile() when they're done.

Copy link
Author

@e1arikawa e1arikawa Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this PR does not consider other tabs.
It focuses on enabling the use of OPFS with WASM.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s natural for a file to be locked when it’s being used by another tab. Therefore, I think it’s understandable that it results in an error.

The implementation of OPFS support for WASM and handling OPFS locking across tabs should be considered separately.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that it is normal for a file to be locked when in use by another tab, but it's a library's choice about what to do about it.

Most users will not want to, won't have the skill to, or won't want to bother implementing a retry or async locking system for multi-tab or multi-worker query orchestration. Implementing this will be important for developer experience, and save the world from having hundreds of half-assed query concurrency systems lying around. And, if I understand the multithreaded cross-origin-isolation implementation correctly, this will be a requirement to use OPFS in that version of the runtime as well.

However, I 100% get the argument for separating this feature improvement form this PR! Should be a fast-follow.

Copy link
Author

@e1arikawa e1arikawa Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the first step will be to get this PR merged.

@amiller-gh
Copy link

amiller-gh commented Nov 13, 2024

My above FileSystemSyncAccessHandles locking comments aside, the added complexity is totally worth it for the increased speed.

In our test product, using handle.createSyncAccessHandle() + DuckDBDataProtocol.BROWSER_FSACCESS instead of handle.getFile() + DuckDBDataProtocol.BROWSER_FILEREADER reduced one of our large query's execution time from ~32s to 8.6s!

(I'm querying ~24M records across two joins – it won't fit in the browser's 4mb memory limit, so must be read from parquet files in OPFS).

Who knows if this speed improvement will carry over to the -coi multithreaded implementation, or if the thread safe, but much slower, handle.getFile() data protocol type will win out for parallel access, but thats a test for when I can get the COI experiments working with the Parquet extension: #1916

@e1arikawa
Copy link
Author

@amiller-gh
Please make these adjustments after this PR has been merged.

@amiller-gh
Copy link

@e1arikawa Would love to make that call, but I'm not a maintainer – just a very interested party :)

Will let the maintainers chime in on the API callouts I made. For all I know the API thrash on registerFileHandler re: directory support and potential opfs:// protocol urls will be okay.

Just calling out though, if this library wants to implement it's own OPFS file access concurrency management down the line, exposing user-land control over FileSystemSyncAccessHandle registration may make this more difficult.

@tobilg
Copy link

tobilg commented Nov 13, 2024

Maybe @carlopi can chime into this discussion? Thanks!

@e1arikawa
Copy link
Author

@amiller-gh
I completely understand what you're getting at.
However, this PR is simply to enable the use of OPFS.
If there are concerns, perhaps we can address them after merging this PR.
I’m not aiming to perfect OPFS functionality in a single PR.

@amiller-gh
Copy link

Makes sense!

If that's the case, in order to keep this well scoped, it may be best to not expose any changes to the registerFileHandle method, and only add support for passing an opfs:// uri to db.open! That magic null or {} passed as a fake file handle feels odd.

Because it's currently hard coded to use OPFS (without directory traversal), the method is now a bit of a mis-nomer when invoked like: await db.registerFileHandle(local.parquet, {}, DuckDBDataProtocol.BROWSER_FSACCESS, true).

(Side note: because there are multiple ways to get FileSystemSyncAccessHandles in the browser to use for use with the DuckDBDataProtocol.BROWSER_FSACCESS data protocol, it's not exclusive to OPFS. However, because FileSystemSyncAccessHandles cannot be shared between threads, using registerFileHandle + BROWSER_FSACCESS is only really useful when using the blocking version of the library!)

Perhaps you can add a new private method called registerOPFSHandle(name: string, uri: string, directIO: boolean). This way registerFileHandle can remain clean, and will always require a real file handle when invoked with BROWSER_FSACCESS as expected. It's a little less "magic" that way.

Alt: Just augment registerFileURL to take opfs protocol URLs instead (see code example below). Same effect, different method. The downside here is the change will be public facing, and may require more scrutiny before landing.

But yeah, will let @carlopi chime in with thoughts on API design. I'm unsure what their intent / caution level is!

In a separate PR, I'd be happy to take a stab at adding opfs URL schema, inspired by the WASM flavored HTTPFS supprt. Having that dedicated private registerOPFSHandle method would be very helpful for the implementation.

E.g.

await db.registerFileURL('local.parquet', 'opfs://origin/local.parquet', DuckDBDataProtocol.OPFS, false);
await c.query(`SELECT * FROM 'local.parquet'`);

and more simply

await c.query(`SELECT * FROM 'opfs://origin/remote.parquet'`);

@e1arikawa
Copy link
Author

@amiller-gh
Passing null or {} in place of a file handle does feel a bit awkward, but please bear with it for now. We’ll make improvements after this is merged.

@e1arikawa
Copy link
Author

e1arikawa commented Nov 16, 2024

I am planning to submit a follow-up PR after this one is merged, which will address the following:

  1. Support for the opfs:// protocol
    ex) FROM read_parquet('opfs://test.parquet');
  2. Multi-tab and multi-worker support within the same domain With Web Locks
  3. Compatibility with mobile browsers

These features are already implemented and are awaiting the merge of this PR.

@seanbirchall
Copy link

seanbirchall commented Nov 26, 2024

@e1arikawa to build your branch I just need to clone your repo and build from source?

cd duckdb-wasm
git submodule init
git submodule update
make apply_patches
make serve

Then I imagine there's somewhere I can find something like https://cdn.jsdelivr.net/npm/@duckdb/[email protected]/+esm except locally after I build that I can use in my web app?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants