This repository has been archived by the owner on Sep 21, 2024. It is now read-only.

Can't compile Swift for TensorFlow quickly #15

Closed
philipturner opened this issue Jun 1, 2022 · 14 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@philipturner
Owner

philipturner commented Jun 1, 2022

The main reason I made the overhauls present in Swift-Colab 2.0 was so that in the future, I could run S4TF code without facing bottlenecks that make it virtually unusable. However, I am unable to compile S4TF for use in the interactive experience. This is after avoiding the problems described in #14.

The test notebook S4TF with TF 2.4 shows my effort to compile S4TF for use in the Swift interpreter. Even though that failed, I can technically compile it using %system flags like in s4tf-on-colab-example-1.ipynb and add custom code to the test suite. But that isn't ergonomic or reproducible in any way.

Specifically, the debugger shows an error when I run the following code. Back in the swift-jupyter era, the TensorFlow module was embedded in the toolchain. So the error below was likely never encountered.

import TensorFlow
print(Tensor<Float>.self)
<Cell 1>:2:7: error: cannot find 'Tensor' in scope
print(Tensor<Float>.self)
      ^~~~~~
@philipturner
Owner Author

One simple solution to #14 and #15 is a new magic command: %install-x10. But I have to be 100% sure it is necessary. If I change my mind, it's source-breaking.

@philipturner
Owner Author

philipturner commented Jun 4, 2022

It actually does load, but you need to restart the runtime first. I haven't tested it yet because I got sidetracked with a bug on the Python side. Either way, I need to investigate why this won't load into the Swift interpreter without restarting the runtime first. That restriction is not present on PythonKit, SwiftPlot, and other libraries.

@mikowals

mikowals commented Jun 5, 2022

Hi @philipturner . I can get past your error above by removing some of the flags you set to install s4tf.

I comment these flags:

//%install-swiftpm-flags -c release -Xswiftc -Onone

And then TensorFlow is available to import. I believe setting those flags actually breaks the import of any package, not just TensorFlow. I tried some other packages without clearing the flags and they also silently failed to import.

Sadly, though, problems still remain. The specific error is:

import TensorFlow
let x = Tensor(0)

Produces:

Couldn't lookup symbols:
  TensorFlow.Tensor.init(_: τ_0_0, on: TensorFlow.Device) -> TensorFlow.Tensor<τ_0_0>
  TensorFlow.Tensor.init(_: τ_0_0, on: TensorFlow.Device) -> TensorFlow.Tensor<τ_0_0>

It looks similar to swift-apis issue 1016 which I don't believe was ever fixed. The error is a generic linking or runtime availability problem though so it is likely a different cause.

The env var LD_LIBRARY_PATH in Colab looks a bit strange and points to /usr/local/nvidia/lib. I don't think any TensorFlow files end up there so maybe that is the cause.
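A minimal sketch for checking that hypothesis, assuming it is run in a Swift cell or script on the Colab machine. The helper names `searchDirectories(from:)` and `directoriesContaining(_:in:)` are hypothetical and not part of Swift-Colab; `libx10.so` is one of the S4TF binaries named elsewhere in this thread.

```swift
import Foundation

// Split a colon-separated search path (the LD_LIBRARY_PATH format) into directories.
func searchDirectories(from value: String) -> [String] {
    value.split(separator: ":").map(String.init)
}

// Return only the directories that actually contain the named library file.
func directoriesContaining(_ library: String, in directories: [String]) -> [String] {
    directories.filter { directory in
        FileManager.default.fileExists(atPath: directory + "/" + library)
    }
}

// Read the variable from the current process; treat "unset" as an empty path.
let value = ProcessInfo.processInfo.environment["LD_LIBRARY_PATH"] ?? ""
let directories = searchDirectories(from: value)
print("LD_LIBRARY_PATH entries:", directories)
print("Entries containing libx10.so:", directoriesContaining("libx10.so", in: directories))
```

If the second list comes back empty, the loader would have to find the library some other way, for example through an rpath or a system library directory.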

Thanks for the work you are doing on this. You are making impressive progress!

@philipturner
Owner Author

Thanks for investigating! I should be able to narrow this problem down to a small reproducer. Other packages like PythonKit behave just fine, so there's some specific reason S4TF is being uncooperative.

@philipturner
Owner Author

philipturner commented Jun 5, 2022

I have encountered your error "Couldn't lookup symbols" multiple times today when using PythonKit. It always happens when I forget to execute the %install command after restarting the runtime. Did you execute the command that does %install .package(...) TensorFlow before receiving that error?

I have also used PythonKit multiple times with the -c release -Xswiftc -Onone flags. What packages didn't work when you used those flags? Also, remember to $clear the SwiftPM flags when appropriate.

@mikowals

mikowals commented Jun 5, 2022

Success! I added -rpath flag.

%install-swiftpm-flags -Xlinker "-rpath=/content/Library/tensorflow-2.4.0/usr/lib"

The full colab is here.

Now this:

import TensorFlow

print(Device.default) 
let x = Tensor(0)
print(x)
print(x.device)

let y = Tensor(0, on: .defaultXLA)
print(y.device)

Shows this:

Device(kind: .CPU, ordinal: 0, backend: .TF_EAGER)
0.0
Device(kind: .CPU, ordinal: 0, backend: .TF_EAGER)
Device(kind: .CPU, ordinal: 0, backend: .XLA)

I also did some fiddling with the -c release -Xswiftc -Onone flags and determined that -c release is causing the problem. The output shows the flag working correctly: building for production when included and building for debug when excluded. But the production build leads to the import not actually working.
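A rough way to test the linking side of this is to ask the dynamic loader directly whether it can find a library. The sketch below is a diagnostic under stated assumptions, not part of Swift-Colab: `loadError(for:)` is a hypothetical helper built on the standard `dlopen` and `dlerror` functions. It only reflects the search paths of the process running it, so it will not see an rpath embedded in a different binary.

```swift
#if canImport(Glibc)
import Glibc
#elseif canImport(Darwin)
import Darwin
#endif

// Try to load a shared library by name.
// Returns nil on success, or the dynamic loader's error message on failure.
func loadError(for library: String) -> String? {
    guard let handle = dlopen(library, RTLD_NOW) else {
        if let message = dlerror() {
            return String(cString: message)
        }
        return "unknown error"
    }
    // The library loaded; release the handle again.
    dlclose(handle)
    return nil
}

// "libx10.so" is the library name used in this thread.
print(loadError(for: "libx10.so") ?? "libx10.so loaded")
```

A failure message here suggests the library is not on any default search path, which is the situation the -rpath flag works around.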

@philipturner
Owner Author

philipturner commented Jun 5, 2022

I actually ran through the entire Model Training Walkthrough tutorial on tensorflow/swift, using -c release -Xswiftc -Onone. That specific set of flags makes it take 2 minutes to compile, while standard debug mode compiles in 3 minutes. I haven't tested compiling it in debug mode. You're saying that if it's in debug mode, you don't have to restart the runtime to load the library?

I will definitely narrow this down and find the culprit, because I believe that is a bug with SwiftPM or the Swift compiler. SwiftPlot depends on C dependencies and doesn't have that issue.

%system cp /content/Library/tensorflow-2.4.0/usr/lib/libx10_optimizers_optimizer.so /usr/lib/libx10_optimizers_optimizer.so
%system cp /content/Library/tensorflow-2.4.0/usr/lib/libx10_optimizers_tensor_visitor_plan.so /usr/lib/libx10_optimizers_tensor_visitor_plan.so
%system cp /content/Library/tensorflow-2.4.0/usr/lib/libx10.so /usr/lib/libx10.so
%system cp /content/Library/tensorflow-2.4.0/usr/lib/libx10_training_loop.so /usr/lib/libx10_training_loop.so

%install-swiftpm-flags $clear
%install-swiftpm-flags -c release -Xswiftc -Onone
%install-swiftpm-flags -Xswiftc -DTENSORFLOW_USE_STANDARD_TOOLCHAIN
%install '.package(url: "https://github.com/philipturner/s4tf", .branch("fan/resurrection"))' TensorFlow

I haven't tried using -Xlinker or -rpath yet; I just copied the binaries to system include paths. If I can use your workaround to fix the issue with linking the binary files, then that solves half of my problem. The other half is copying the headers' paths into Clang modulemap files, so that they don't have to be copied into system header directories. I'm working on narrowing a SwiftPM bug affecting the latter task right now.

One fruit of this effort, although not the bug I'm tracking down: swiftlang/swift-package-manager#5482 (comment)

The bug I'm tracking is (from #14):

Two module.modulemap files that declare the same Clang module can overwrite each other, even if one is part of the documentation of a Swift package and never actually involved in the build process. This happened with the modulemap currently in the Utilities directory of s4tf/s4tf.

Utilities/module.modulemap shouldn't appear in the "build.db", and whether it appears or not is highly fickle.
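To look for this kind of collision, one option is to scan a checkout for module.modulemap files and report any module name declared in more than one of them. This is a simplified sketch: the helper names are hypothetical, the parser only recognizes unindented `module Name` or `framework module Name` lines, and scanning the current directory is an assumption.

```swift
import Foundation

// Extract top-level module names from modulemap text (simplified parser).
func declaredModules(in text: String) -> [String] {
    var names: [String] = []
    for line in text.split(whereSeparator: \.isNewline) {
        // Skip indented lines so nested submodules are not counted.
        guard line.hasPrefix("module ") || line.hasPrefix("framework module ") else { continue }
        var words = line.split(separator: " ").map(String.init)
        if words.first == "framework" { words.removeFirst() }
        guard words.count >= 2 else { continue }
        names.append(words[1].trimmingCharacters(in: CharacterSet(charactersIn: "{")))
    }
    return names
}

// Map each module name declared more than once to the sorted paths declaring it.
func duplicateModules(in modulemaps: [String: String]) -> [String: [String]] {
    var pathsByModule: [String: [String]] = [:]
    for (path, text) in modulemaps {
        for name in declaredModules(in: text) {
            pathsByModule[name, default: []].append(path)
        }
    }
    return pathsByModule.filter { $0.value.count > 1 }.mapValues { $0.sorted() }
}

// Read every module.modulemap under a root directory into [path: contents].
func modulemaps(under root: String) -> [String: String] {
    var result: [String: String] = [:]
    guard let enumerator = FileManager.default.enumerator(atPath: root) else { return result }
    for case let relative as String in enumerator where relative.hasSuffix("module.modulemap") {
        let path = root + "/" + relative
        if let text = try? String(contentsOfFile: path, encoding: .utf8) {
            result[path] = text
        }
    }
    return result
}

print(duplicateModules(in: modulemaps(under: ".")))
```

An empty dictionary means no two scanned files declare the same module. It does not show what SwiftPM records in its "build.db".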

@philipturner
Owner Author

The reason I initially decided to compile S4TF with the old TF 2.4 binary was to narrow down the source of s4tf/s4tf#14, not to make it accessible on Colab. You are welcome to see if that bug exists on the older X10 binary, or even better - help me fix that bug :)

@mikowals

mikowals commented Jun 5, 2022

> You're saying that if it's in debug mode, you don't have to restart the runtime to load the library?

Maybe. I did not try restarting when compiled with -c release because TensorFlow was not listed in the list of libraries in the output instructions about restarting. Building in debug mode definitely doesn't require a restart.

> The reason I initially decided to compile S4TF with the old TF 2.4 binary was to narrow down the source of s4tf/s4tf#14, not to make it accessible on Colab. You are welcome to see if that bug exists on the older X10 binary, or even better - help me fix that bug :)

Yes, I can see that the methods done in this Colab aren't ideal. It should allow me to run some old S4TF models using X10 on TPU if I want. I haven't tried this yet but it would be handy. However elaborate the methods to get there are...

> I actually ran through the entire Model Training Walkthrough tutorial on tensorflow/swift, using -c release -Xswiftc -Onone.

I have no doubt that those flags can work and are useful. In this instance though there appears to be some interaction with the runtime in Colab, the %install command, or the build process. There are many moving pieces here. Not sure what to say other than that it consistently works with those flags commented and fails with them.

@philipturner
Owner Author

philipturner commented Jun 5, 2022

Also, I'm planning to uncomment x10_training_loop in the package manifest on both the head branch and this TF 2.4 branch. It was deactivated in January 2021 because of some build failure with SwiftPM, but I hypothesize that has long since been fixed.

@philipturner
Owner Author

philipturner commented Jun 16, 2022

SwiftPlot has started failing to import on the first try if you use -c release -Xswiftc -Onone. You have to restart the runtime and rerun the %install command. This strange import behavior appeared sometime between the 2021-12-06 and 2021-12-23 toolchains. This is a different time frame from when S4TF started experiencing the behavior (??? to 2021-11-12). To clarify, SwiftPlot and S4TF started failing to import at different times.

This is confirmation that the behavior is a bug. Something incorrect started happening in the Swift compiler before 2021-11-12. It was exposed to a greater extent in December, causing SwiftPlot to fail. Hopefully I can fix the compiler bug and integrate a patch into the 5.7 or 5.7.1 release.

@philipturner philipturner added the "bug" (Something isn't working) label Jun 17, 2022
@philipturner
Owner Author

philipturner commented Jun 18, 2022

Even weirder - you now have to restart 2 times to use S4TF on s4tf/s4tf:main! You only need to restart once when using fan/resurrection. Both branches were tried on the same 2022-05-11 toolchain and with a factory-reset Colab instance, but I need to double-check that there are no confounding variables. Doing so is time-intensive because each test takes around 3 minutes, so I don't feel like it right now.

Something is very off here, and I'll instruct the user to avoid -c release -Xswiftc -Onone until this is narrowed down. The nature of the bug (interacting with LLDB, not yet reproducible on macOS, reproducers exist in massive code bases) makes it time-intensive to narrow down. The -c release -Xswiftc -Onone flags only reduce compile time by 33%, so the user will just have to deal with it.
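The 33% figure follows from the compile times reported earlier in this thread (about 2 minutes with the flags versus about 3 minutes in standard debug mode). A small sketch of that arithmetic, where `fractionSaved(baseline:improved:)` is a hypothetical helper:

```swift
// Compile times reported earlier in this thread, in minutes.
let debugMinutes = 3.0
let releaseOnoneMinutes = 2.0

// Fraction of compile time saved relative to the baseline build.
func fractionSaved(baseline: Double, improved: Double) -> Double {
    (baseline - improved) / baseline
}

let saved = fractionSaved(baseline: debugMinutes, improved: releaseOnoneMinutes)
print("Saved about \(Int((saved * 100).rounded()))%")  // prints "Saved about 33%"
```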

@philipturner philipturner self-assigned this Jun 18, 2022
@philipturner philipturner added the "help wanted" (Extra attention is needed) label Jun 18, 2022
@philipturner philipturner changed the title from "Can't run Swift for TensorFlow" to "Can't compile Swift for TensorFlow quickly" Jun 18, 2022
@philipturner philipturner removed their assignment Jun 18, 2022
@philipturner
Owner Author

v2.2 was released, and the README has instructions for compiling Swift for TensorFlow. I noted the issue with -c release -Xswiftc -Onone, keeping the SwiftPM flags directive commented out. Let me know whether this works for you!

@philipturner
Owner Author

> It should allow me to run some old S4TF models using X10 on TPU if I want. I haven't tried this yet but it would be handy. However elaborate the methods to get there are...

@mikowals I just got S4TF to run on a TPU. Look at the "TPU Tests" notebook at the bottom of the README. It was 8 TPUs at once, on the Colab free tier! I had never experienced using a TPU before. Could you provide some old X10 models designed for TPU, so that I can include them in the test suite?


2 participants