-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add OPFS (Origin Private File System) Support #1856
base: main
Are you sure you want to change the base?
Conversation
This is very very welcome, thanks. I am in the process of releasing a new version of duckdb-wasm, connected to duckdb |
@carlopi |
One comment would be if you could have a look at the failing test |
bbb2067
to
383ff1b
Compare
@carlopi https://github.com/AKABANAKK/duckdb-wasm/actions/runs/11016245572 |
Thanks, I gave a proper look, this looks solid, thanks a lot for the contribution. This PR does introduces some API changes that might be somewhat unexpected, so I think the proper way forward would be:
|
@carlopi |
886b161
to
54b82e9
Compare
b4159f3
to
b78525d
Compare
@carlopi |
@e1arikawa have you published your release version somewhere accessible on NPM while we wait? I'd love to kick the wheels on it in the meantime. (edit: copied source files over from https://github.com/e1arikawa/duckdb-wasm-samples/tree/main/demo-app so I could play, but would still appreciate an NPM module for simplicity!) |
await writable.write(parquetBuffer); | ||
await writable.close(); | ||
//2. handle is empty object, because worker gets a File Handle using the file name. | ||
await db.registerFileHandle('test.parquet', {}, duckdb.DuckDBDataProtocol.BROWSER_FSACCESS, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw, this API choice feels a little counter-intuitive. I understand that we can't pass FileSystemSyncAccessHandle
to the workers over postMessage, but if we're using the file name to open an OPFS file handle, why would we ever choose to use this over referencing opfs://test.parquet
in the query? I assume with this implementation read_parquet('opfs://test.parquet')
would work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read_parquet requires registering a file handler in advance with registerFileHandler. Therefore, if you register an empty handler, such as {} or null, it will enable the use of OPFS files.
if use directories, Please use registerFileHandler when working with directories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, if you're specifying a protocol, duckdb-wasm should know how to access the data protocol out of the box – much like how SQLite does it, or how DuckDB will auto-install the parquet extension when you read a parquet file. The extra ceremony of registering a file handler should not be needed. Couldn't this OPFS auto-discovery happen here? https://github.com/duckdb/duckdb-wasm/pull/1856/files#diff-b177d9b16bfe243b4b81d4b3ccd9e36cff40a7f26d8ab15f6b78c27d869bb1a8R368
And, as said in another comment, directories are a core feature of OPFS. Ignoring them in in this implementation, and telling integrators to fall back on a slower API is not a good developer experience, especially when they may not be in control of file locations in OPFS. This limitation is additionally confusing when prepareDBFileHandle supports directories, and registerFileHandler does not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using db.registerFileHandler, please obtain the file handler from a directory in OPFS.
const datadir = await opfsRoot.getDirectoryHandle("datadir");
const fileHandle = await datadir.getFileHandle('test.parquet');
// some process
await db.registerFileHandle('test.parquet', fileHandle, duckdb.DuckDBDataProtocol.BROWSER_FSACCESS, true);
I’m using this PR for product development, and there are cases where I use the OPFS file handler for purposes other than DuckDB. That’s why I sometimes choose to manually register the file handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned in other comments, adding more features at once in the current situation where this PR hasn’t been merged yet will only broaden the scope of the review. My focus is on getting this PR merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The opfs:// protocol is already implemented.
However, since merging this PR is the top priority, I haven’t committed it yet.
// should get sync handle from the file name. | ||
try { | ||
const opfsRoot = await navigator.storage.getDirectory(); | ||
const fileHandle = await opfsRoot.getFileHandle(name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe as written, this appears to only get shallow files. If a file path is nested deep in a directory, this will throw! E.g. if we pass opfs://parent_dir/file_name.db
. For reference, prepareDBFileHandle
appears to handle directory traversal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use registerFileHandler when working with directories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Directories are a core feature of OPFS. Ignoring them in in this implementation, and telling integrators to fall back on a slower API, is not a good developer experience, especially when they may not be in control of file locations in OPFS. This limitation is additionally confusing when prepareDBFileHandle
supports directories, and registerFileHandler does not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is not yet focused on enhancing the developer experience beyond what’s necessary to get OPFS working. The reason for this is that adding more features at once would widen the scope of the review, especially given that this PR has not yet been merged. My focus is on ensuring that this PR gets merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using db.registerFileHandler, please obtain the file handler from a directory in OPFS.
const datadir = await opfsRoot.getDirectoryHandle("datadir");
const fileHandle = await datadir.getFileHandle('test.parquet');
await db.registerFileHandle('test.parquet', fileHandle, duckdb.DuckDBDataProtocol.BROWSER_FSACCESS, true);
throw e; | ||
}); | ||
try { | ||
const handle = await fileHandle.createSyncAccessHandle(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FileSystemSyncAccessHandles have strict locking requirements. If another web worker or tab has an open handle to this resource, this call will fail.
SQLite attempts to get around this with an automatic retry with exponential backoff. https://sqlite.org/forum/info/58a377083cd24a
It is also possible to implement a fully async awaitable lock with the Web Locks API: https://developer.mozilla.org/en-US/docs/Web/API/Web_Locks_API
A full implementation would likely use both approaches, since you don't know what other JS threads may be accessing the OPFS file that don't rely on web locks.
I think this is another reason to only provide opfs access through the opfs://
protocol: I assume this access handle is released once the query is done? Or do these handles stay open for the lifetime of the db connection?
If the handles are automatically closed at the end of the query, then this would free up the lock to be used by other tabs / web workers. If you give users a sanctioned API to register FileSystemSyncAccessHandles
through registerFileHandle
, a user may shoot themselves in the foot by forgetting to call duckdb.dropFile()
when they're done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this PR does not consider other tabs.
It focuses on enabling the use of OPFS with WASM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s natural for a file to be locked when it’s being used by another tab. Therefore, I think it’s understandable that it results in an error.
The implementation of OPFS support for WASM and handling OPFS locking across tabs should be considered separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that it is normal for a file to be locked when in use by another tab, but it's a library's choice about what to do about it.
Most users will not want to, won't have the skill to, or won't want to bother implementing a retry or async locking system for multi-tab or multi-worker query orchestration. Implementing this will be important for developer experience, and save the world from having hundreds of half-assed query concurrency systems lying around. And, if I understand the multithreaded cross-origin-isolation implementation correctly, this will be a requirement to use OPFS in that version of the runtime as well.
However, I 100% get the argument for separating this feature improvement form this PR! Should be a fast-follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the first step will be to get this PR merged.
My above In our test product, using (I'm querying ~24M records across two joins – it won't fit in the browser's 4mb memory limit, so must be read from parquet files in OPFS). Who knows if this speed improvement will carry over to the |
@amiller-gh |
@e1arikawa Would love to make that call, but I'm not a maintainer – just a very interested party :) Will let the maintainers chime in on the API callouts I made. For all I know the API thrash on Just calling out though, if this library wants to implement it's own OPFS file access concurrency management down the line, exposing user-land control over |
Maybe @carlopi can chime into this discussion? Thanks! |
@amiller-gh |
Makes sense! If that's the case, in order to keep this well scoped, it may be best to not expose any changes to the Because it's currently hard coded to use OPFS (without directory traversal), the method is now a bit of a mis-nomer when invoked like: (Side note: because there are multiple ways to get Perhaps you can add a new private method called Alt: Just augment But yeah, will let @carlopi chime in with thoughts on API design. I'm unsure what their intent / caution level is! In a separate PR, I'd be happy to take a stab at adding E.g. await db.registerFileURL('local.parquet', 'opfs://origin/local.parquet', DuckDBDataProtocol.OPFS, false);
await c.query(`SELECT * FROM 'local.parquet'`); and more simply await c.query(`SELECT * FROM 'opfs://origin/remote.parquet'`); |
@amiller-gh |
ad9b192
to
dedc2aa
Compare
I am planning to submit a follow-up PR after this one is merged, which will address the following:
These features are already implemented and are awaiting the merge of this PR. |
1da9a46
to
eb6ec89
Compare
@e1arikawa to build your branch I just need to clone your repo and build from source?
Then I imagine there's somewhere I can find something like |
Description:
This PR implements OPFS (Origin Private File System) support in the latest version of duckdb-wasm based on PR #1490.
This allows database files to be read and written to the OPFS.
API:
When opening a database file, you can now use the following API:
Changes: