
splatfacto method in Colab broken? #3455

Closed

fasteinke opened this issue Sep 29, 2024 · 10 comments

Comments


fasteinke commented Sep 29, 2024

Describe the bug
Running the demo.ipynb fails to start training

To Reproduce
Steps to reproduce the behavior:

  1. Select an example dataset - here, desolation
  2. Paste the simplest command into the xterm: ns-train splatfacto --data data/nerfstudio/desolation
  3. Training reaches the "setting up CUDA, this will take a few minutes" stage; it cycles for quite a while and then throws an error
  4. The xterm then goes haywire and constantly prompts for input, so the error message is lost

Previous attempts to use this method at least started training; now it has problems even earlier.
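
For context, the command sequence looks roughly like the sketch below; the ns-download-data invocation is my assumption of how the demo notebook fetches the example capture, while the ns-train command is the one quoted in step 2.

# Hypothetical reconstruction of the two steps above (download flag assumed,
# train command as reported):
ns-download-data nerfstudio --capture-name=desolation
ns-train splatfacto --data data/nerfstudio/desolation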

fasteinke (Author)

That was fast!! I'm impressed ...

fasteinke (Author)

Okay, now I'm confused ... these look to be files to allow me to run locally. But my issue is with how the notebook runs in Colab - do the files there need to be altered in some fashion?

nerfstudio-project deleted a comment Sep 29, 2024
brentyi (Collaborator) commented Sep 29, 2024

(deleted comment because malware, unfortunately I don't have experience with Colab so not the best person to help with the actual issue)

fasteinke (Author)

That's very nasty!!! ... Looks like I need to be on the ball with regard to GitHub responses - not something I was aware was happening ...

fasteinke (Author)

To flesh this out, I tried running it with the splatfacto-big method, just in case ...

Same error:

[03:15:38] Saving config to: outputs/desolation/splatfacto/2024-10-01_031537/config.yml experiment_config.py:136
Saving checkpoints to: outputs/desolation/splatfacto/2024-10-01_031537/nerfstudio_models trainer.py:142
Auto image downscale factor of 2 nerfstudio_dataparser.py:484
load_3D_points is true, but the dataset was processed with an outdated ns-process-data that didn't convert colmap points to .ply! Update the colmap
dataset automatically? [y/n]: y
Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to /root/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 233M/233M [00:00<00:00, 267MB/s]
╭─────────────── viser ───────────────╮
│ HTTP      │ http://0.0.0.0:7007     │
│ Websocket │ ws://0.0.0.0:7007       │
╰─────────────────────────────────────╯
[03:16:08] Caching / undistorting eval images full_images_datamanager.py:230
[NOTE] Not running eval iterations since only viewer is enabled.
Use --vis {wandb, tensorboard, viewer+wandb, viewer+tensorboard} to run with eval.
No Nerfstudio checkpoint to load, so training from scratch.
Disabled comet/tensorboard/wandb event writers
[03:16:12] Caching / undistorting train images full_images_datamanager.py:230
Printing profiling stats, from longest to shortest duration in seconds
Trainer.train_iteration: 559.3889
VanillaPipeline.get_train_loss_dict: 559.3837
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/gsplat/cuda/_backend.py", line 83, in
from gsplat import csrc as _C
ImportError: cannot import name 'csrc' from 'gsplat' (/usr/local/lib/python3.10/dist-packages/gsplat/__init__.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '10']' returned non-zero exit status 1.

brentyi (Collaborator) commented Oct 1, 2024

It seems like gsplat is not installing/building correctly on Colab? Related: nerfstudio-project/gsplat#315

It's also possible that recent changes to gsplat will help; it's pinned to 1.3.0 in nerfstudio, but since nerfstudio-project/gsplat#365 was merged there are now pre-built wheels.

cc @liruilong940607 but I think he's very busy these days + also doesn't use Colab.
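
For anyone trying this in the meantime, a minimal sketch of swapping in a pre-built wheel from a Colab cell, assuming the docs.gsplat.studio/whl index and the 1.4.0 pin that come up later in this thread:

# Hypothetical extra Colab cell (not part of the stock demo.ipynb): replace the
# source-built gsplat 1.3.0 with a pre-built wheel from the gsplat wheel index.
!pip uninstall -y gsplat
!pip install gsplat==1.4.0 --index-url https://docs.gsplat.studio/whl

If the wheel doesn't match the installed torch/CUDA build, an ABI mismatch (undefined symbol) error like the one reported further down is the likely result.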

liruilong940607 (Contributor) commented Oct 1, 2024

Thanks for looping me in @brentyi !

I did a quick test on Colab (T4 GPU) and I was able to install the latest gsplat on it. So it might just be an issue in the previous version (though I can't think of what might cause this).

The Colab: https://colab.research.google.com/drive/10HVUf6e8_pRrMj4cmQ5Xepoq6BdkJkav?usp=sharing

fasteinke (Author) commented Oct 2, 2024

Thanks for the input ... some progress made ...

I added a cell to demo.ipynb, following the "Install Nerfstudio and Dependencies" cell:

!pip install gsplat==1.4.0 --index-url https://docs.gsplat.studio/whl

which appeared to work; it uninstalled 1.3.0 and installed 1.4.0.

But this time there was a different error:

...
Trainer.train_iteration: 501.1180
VanillaPipeline.get_train_loss_dict: 501.1118
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/gsplat/cuda/_backend.py", line 83, in
from gsplat import csrc as _C
ImportError: /usr/local/lib/python3.10/dist-packages/gsplat/csrc.so: undefined symbol: _ZN2at4_ops10zeros_like4callERKNS_6TensorESt8optionalIN3c1010ScalarTypeEES5_INS6_6LayoutEES5_INS6_6DeviceEES5_IbES5_INS6_12MemoryFormatEE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '10']' returned non-zero exit status 1.

liruilong940607 (Contributor)

Hey, installing gsplat's prebuilt wheels works fine for me, see:

https://colab.research.google.com/drive/10HVUf6e8_pRrMj4cmQ5Xepoq6BdkJkav?usp=sharing

You need to figure out the torch and CUDA versions on the system and choose the correct prebuilt wheel for gsplat.
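
A minimal sketch of doing that from Colab cells; the versioned index URL pattern (e.g. a pt21cu121 suffix) is an assumption about how the gsplat wheel index is organized and should be checked against the gsplat docs:

# Hypothetical Colab cells: check the installed torch / CUDA combination, then
# install the gsplat wheel built against that same combination.
!python -c "import torch; print(torch.__version__, torch.version.cuda)"
# e.g. "2.1.2+cu121 12.1" -> assumed matching index suffix pt21cu121 (verify):
!pip install gsplat --index-url https://docs.gsplat.studio/whl/pt21cu121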

fasteinke (Author) commented Oct 3, 2024

Thanks!!! ... I had a misunderstanding about using "pip install ... --index-url ..." - so, on the next round I installed the correct version, and the processing kicked off nicely ...
