Running models with big dataset tips #853
Unanswered
danieltomasz asked this question in Q&A
Replies: 1 comment · 5 replies
-
I am experimenting with running big models on my laptop (an M1 with 16 GB of RAM); the project is to explore the feasibility and limitations of different approaches. I have a dataset containing responses in fMRI voxels (14,752 voxels per subject), with the model

```python
model = bmb.Model("value ~ (1|subject) + (1|voxel)", filtered_data_frame)
```

When I convert the data type to float32 I can add more subjects and still fit the object into memory in Jupyter, but obviously there are still limits before Jupyter crashes, not counting inference time. What would be the best practice, if any, for working with such big models on a CPU (leaving aside getting good GPUs and enough RAM)?
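A minimal sketch of the kind of downcasting described above, assuming a pandas DataFrame with the columns used in the formula (`value`, `subject`, `voxel`); the toy frame and its sizes here stand in for `filtered_data_frame` and are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for filtered_data_frame; column names match the formula.
df = pd.DataFrame({
    "value": np.random.randn(1000),           # float64 by default
    "subject": np.repeat(np.arange(10), 100),
    "voxel": np.tile(np.arange(100), 10),
})

print(df.memory_usage(deep=True).sum())       # bytes before downcasting

# Halve the response column and store the grouping variables as
# categoricals, which hold small integer codes instead of int64 values.
df["value"] = df["value"].astype(np.float32)
df["subject"] = df["subject"].astype("category")
df["voxel"] = df["voxel"].astype("category")

print(df.memory_usage(deep=True).sum())       # bytes after downcasting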
-
Hi @danieltomasz, which family are you using? That seems like a good case for plain PyMC and perhaps sparse data structures. The reason I suggest PyMC is that Bambi almost always creates dense matrices (which can be very big in your case).
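For reference, here is a sketch of what the same varying-intercepts model could look like in plain PyMC, looking group effects up by per-observation integer index so that no dense dummy-coded matrix is ever built; the synthetic data and the priors are placeholders, not code from this thread:

```python
import numpy as np
import pandas as pd
import pymc as pm

# Toy stand-in for the real data; column names match the Bambi formula.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.normal(size=1000).astype(np.float32),
    "subject": np.repeat(np.arange(10), 100),
    "voxel": np.tile(np.arange(100), 10),
})

# Integer codes for each grouping factor.
subject_idx, subjects = pd.factorize(df["subject"])
voxel_idx, voxels = pd.factorize(df["voxel"])

with pm.Model(coords={"subject": subjects, "voxel": voxels}) as model:
    intercept = pm.Normal("intercept", 0.0, 1.0)

    # Non-centered varying intercepts, indexed per observation instead
    # of multiplied through a dense one-hot design matrix.
    sigma_subject = pm.HalfNormal("sigma_subject", 1.0)
    sigma_voxel = pm.HalfNormal("sigma_voxel", 1.0)
    z_subject = pm.Normal("z_subject", 0.0, 1.0, dims="subject")
    z_voxel = pm.Normal("z_voxel", 0.0, 1.0, dims="voxel")

    mu = (
        intercept
        + sigma_subject * z_subject[subject_idx]
        + sigma_voxel * z_voxel[voxel_idx]
    )

    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("value", mu=mu, sigma=sigma, observed=df["value"].to_numpy())

    idata = pm.sample()
```

With the indexing approach the memory cost of the group effects stays proportional to the number of observations plus the number of levels, whereas a dense dummy-coded matrix for the voxel term alone would be on the order of n_observations × 14,752 floats.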