Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DO NOT MERGE: WebAssembly standalone support #103

Open
wants to merge 41 commits into
base: master
Choose a base branch
from

Conversation

jerbob92
Copy link
Contributor

@bblanchon I promised to come back to you when I finished my WebAssembly implementation, so here it is! It's not ready to merge this in, since it will break support for the web build, but it is possible to use this as a base to create 2 WASM builds: web and standalone.

This contains the following changes:

  • A patch for PDFium to add support for the callbacks, this was necessary because the runtime that I use (Wazero) doesn't support adding functions to the function table dynamically, so it basically adds these function pointers during compile time and that calls back into the host function of the runtime. This does mean that you can't just start loading the WebAssembly file on any runtime, since you do have to implement these host methods, but the same could be said for WASI and the host calls that the web builds add. We might want to add some documentation on these methods though?
  • A patch for Emscripten to implement the WASI methods in the standalone build to make file operations work
  • An update to the latest Emscripten (you probably want to include this for the web build too)
  • A switch between debug/non-debug builds that also includes optimizations (you might want to partially include this for the web build too)
  • A change in the build step that also includes other methods that should be exposed (you might want to partially include this for the web build too)

The thing that I'm not really happy with is how the Emscripten patch is applied, but I couldn't get it to work in another way, since emsdk/upstream/emscripten is not a git repo.

@bblanchon
Copy link
Owner

Hi @jerbob92,

Thank you very much for sharing this!

I didn't expect that we'd have to patch emscriptem and add a new features PDF.
I think this is out of the scope of this repository: the goal here is to provide a pre-built PDFium, not extend PDFium.
Do you think these changes could be integrated upstream?

I'll try to cherry-pick the changes you recommend right away.
I originally downgraded emscriptem in f09df4a to work around the error "Uncaught RuntimeError: unreachable"; hopefully, this bug is gone.

Best regards,
Benoit

@jerbob92
Copy link
Contributor Author

jerbob92 commented Mar 23, 2023

I think Emscripten is not going to accept the patch in this state because it only implements what PDFium uses, nothing else.

And regarding extending PDFium: I don't think there is a way to work around this, to use function pointers in WebAssembly, the pointer needs to exist during compile time or you need to add the pointer to a method during runtime, which many runtimes don't support, that's why I had to implement various callbacks in the PDFium patch and then implement the callback logic on the runtime side. So I wouldn't really see it as extending PDFium, this is just a way to make all functionality work on in WebAssembly.

I originally downgraded emscriptem in f09df4a to work around the error "Uncaught RuntimeError: unreachable"; hopefully, this bug is gone.

I think this was related to the memory allocator that has since been disabled.

@jerbob92
Copy link
Contributor Author

@bblanchon small warning, free and malloc don't seem to export by default anymore in the latest Emscripten, I had to add them to EXPORTED_FUNCTIONS myself. Just in case you want to upgrade Emscripten in the future. The build works fine but people using the Webassembly build will need access to free and malloc, although I'm not sure if that's also the case for JS users.

@jerbob92
Copy link
Contributor Author

@bblanchon I have been working on making an embind implementation for Go/Wazero here: https://github.com/jerbob92/wazero-emscripten-embind

Would you consider merging this If I convert the changes from patches/wasm/callbacks.patch into Embind? That way the JS implementations can also benefit from it (a lot), since it makes it a lot easier to use/access C things from JS (and in my case Go).

@bblanchon
Copy link
Owner

My main concern with this PR is that it not only patches PDFium, but it also patches emscriptem.
There are roughly 1500 lines of patches in the change set!

I am concerned that this build will fail often and require frequent maintenance.
Also, I'm afraid the patches are so complex that no one would be able to take over the maintenance if you were to stop working on it.
Lastly, I don't see a lot of demand for the build, so it looks like a lot of work for very few users.

I'm okay with adding a standalone build or modifying the existing wasm build to support both, but I don't want this project to be a maintenance nightmare.

@jerbob92
Copy link
Contributor Author

Understandable, I was not talking about the emscripten patch though, I'm still working on getting that merged into Emscripten, but the current discussion there is whether they are just going to include wasi-libc into Emscripten (most of the stuff in my Emscripten patch comes from wasi-libc).

I was specifically asking about the embind integration, which is a patch I'm currently working on (in favor of the callbacks patch).

@bblanchon
Copy link
Owner

Honestly, I'm not a big fan of this callback patch, probably because I don't understand it.
It looks like you trying to overcome the limitations of the Wazero runtime, but that has never been my goal.
This repo is supposed to provide prebuilt binaries of PDFium, not to modify the library.

@jerbob92
Copy link
Contributor Author

jerbob92 commented Oct 1, 2023

I'm starting to feel like I'm the only one using the WebAssembly build :D I will try to explain:

  • WebAssembly doesn't give you any way to access C/C++ types from the runtime (they are working on a generic thing called the component model/system for that afaik, but it's not there yet).
  • So let's say you want to create an instance of a new struct, you would have to know the byte size of the struct and allocate enough memory for it, then you would also have to know the byte size of the fields in the struct and write away the memory manually to the correct addresses. Even reading an array or just a normal value requires you to do the data transformations manually directly from the memory values
  • This is exactly the same in JS as in Wazero right now if you want to integrate the pdfium-binaries WebAssembly build
  • This makes it quite hard to use libraries like pdfium in WebAssembly
  • I have created the callbacks patch to make some of these structs for me, because they require callbacks, which is what Wazero didn't support back then, all of the other structs I make manually, using the method above, this is very dirty since pdfium might add fields to the struct at some point, breaking the code
  • Emscripten has a way to let the runtime know (Embind) which structs, classes, functions, enums, memory views and more are available, and automatically map them to JS types, it also handles data transformation between the runtime side and the WASM side, it also handles reference counting and object deletion.
  • I did not have Embind support in Wazero before, so I could not use it anyway, which is why I didn't invest time in it back then
  • I do have Embind support now because I spent a long time porting the Embind JS runtime to the Wazero runtime: https://github.com/jerbob92/wazero-emscripten-embind
  • Adding Embind support to pdfium-binaries will make the WebAssembly build way more usable for everyone (JS users too), you can see it a bit as the .h file if you do a native integration, right now there is no .h file so you're just navigating blindly
  • I do not like the currently patch either, which is why I asked if you would accept it if it was written using Embind, which is native to Emscripten
  • There are some projects that include Embind integration by default (for example libheif)
  • I don't really see this as modifying the library, just making the build more usable by providing bindings for the specific build because it doesn't have any other possibility for binding.

@sungaila
Copy link
Contributor

sungaila commented Oct 1, 2023

I'm starting to feel like I'm the only one using the WebAssembly build :D

There are at least two. I have two GitHub pages hosted right now using these PDFium WebAssembly builds: https://www.sungaila.de/PDFtoImage/ and https://www.sungaila.de/PDFtoZPL/

My biggest issue right now is the WASM NuGet package. I need specific builds with specific Emscripten versions and the NuGet package would need to include one build per version. Each .NET release depends on a single Emscripten release.

I didn't find a way an easy way to modify the existing GitHub workflows to make this happen (without rewritting most bash scripts and workflow YMLs). So instead I use my fork to build with user-defined Emscripten version: https://github.com/sungaila/pdfium-binaries/blob/3213c3f882bd75891f7da8b9644111708a3fa370/.github/workflows/build-one.yml#L44C7-L48C26

@jerbob92
Copy link
Contributor Author

jerbob92 commented Oct 1, 2023

There are at least two. I have two GitHub pages hosted right now using these PDFium WebAssembly builds: https://www.sungaila.de/PDFtoImage/ and https://www.sungaila.de/PDFtoZPL/

Ah cool! But is there anyone using the JS WASM version?

It looks like the .NET runtime is already doing a lot for you in terms of data conversion and function pointer handling. So basically it looks like they built the Embind thing inside the runtime.

My biggest issue right now is the WASM NuGet package. I need specific builds with specific Emscripten versions and the NuGet package would need to include one build per version. Each .NET release depends on a single Emscripten release.

Honestly, this sounds like a major design flaw in the .NET WASM support? But seeing how they integrate the library (with the .a file and not the .wasm file), it makes sense that it works like this.

@sungaila
Copy link
Contributor

sungaila commented Oct 1, 2023

Honestly, this sounds like a major design flaw in the .NET WASM support? But seeing how they integrate the library (with the .a file and not the .wasm file), it makes sense that it works like this.

They transpile the .NET code into C++ and then into Wasm. Might explain why they want .a files but I still despise that Emscripten SDK version lock-in.

@jerbob92
Copy link
Contributor Author

jerbob92 commented Oct 1, 2023

Yeah it's probably not something they like to do, but it works like that because they include the .a file in the compilation process, which does allow them to use the WebAssembly more freely, but comes with the downside that the compiled .a file can only be used by that Emscripten version, since the .a file is not a build result, but a intermediary file during the build process to WebAssembly.

I don't see that changing without .NET changing the way how it integrates the WebAssembly.

@bblanchon
Copy link
Owner

I do not like the currently patch either, which is why I asked if you would accept it if it was written using Embind, which is native to Emscripten

@jerbob92, I didn't realize that Embind would allow you to get rid of callback.patch.
I'm okay with generating binding with Embind and including them in the wasm build.
You're right; this is not a modification of PDFium and doesn't seem to require a lot of maintenance, so please go for it!

@jerbob92
Copy link
Contributor Author

I have started implementing this but it's starting to look like I can't export everything from pdfium directly using Embind because it doesn't support everything, for example incomplete types, which most types are, and pointers to native types, so I have been hitting some roadblocks.

I'm currently in contact with Emscripten to see how we could implement this, but if it requires wrapping every function with an Embind function this whole endeavor seems kinda pointless. I'll keep you updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants