Optimizing MIDAS on very large/complex datasets #20

Open

neuro30 opened this issue Mar 26, 2022 · 0 comments

neuro30 commented Mar 26, 2022

On very large datasets (~30,000 samples x 1,000,000 features) with complex relationships (e.g. cancer omics data), MIDAS can take a very long time to run (days?), even on a single GPU. I would still like to take advantage of the 'overimpute' feature for hyperparameter tuning, but this becomes prohibitive because this very useful feature retrains the model multiple times to evaluate different settings.

Would random downsampling of samples (rows) and/or features (columns) allow the optimal hyperparameters to generalize to the larger dataset? For instance, a random subset of 500-1,000 samples with 5,000-10,000 features. The goal would specifically be to determine the optimal number of nodes and layers, the learning rate, and the number of training epochs. I would think batch size (which can speed up training) is a function of the dataset size, so that setting would not generalize.
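
For concreteness, here is a minimal sketch of the subsampling idea, assuming the Python implementation (MIDASpy) and a pandas data frame with samples as rows. The file name is hypothetical, and the parameter names (layer_structure, learn_rate, input_drop, spikein, training_epochs, report_ival) are written from memory of the MIDASpy documentation, so please check them against your installed version:

```python
import numpy as np
import pandas as pd
import MIDASpy as md

rng = np.random.default_rng(42)

# Full matrix: ~30,000 samples (rows) x ~1,000,000 features (columns).
# "omics_matrix.csv" is a placeholder; continuous features are assumed
# to be pre-scaled as MIDAS expects.
full_data = pd.read_csv("omics_matrix.csv", index_col=0)

# Randomly subsample ~1,000 samples and ~10,000 features for tuning.
row_labels = rng.choice(full_data.index, size=1_000, replace=False)
col_labels = rng.choice(full_data.columns, size=10_000, replace=False)
subset = full_data.loc[row_labels, col_labels]

# Small grid over architecture and learning rate. overimpute() spikes in
# extra missingness and reports reconstruction error, which serves as the
# tuning criterion for each setting.
for layers in ([128, 128], [256, 256], [256, 256, 256]):
    for lr in (1e-3, 1e-4):
        imputer = md.Midas(layer_structure=layers,
                           learn_rate=lr,
                           input_drop=0.75,
                           seed=89)
        imputer.build_model(subset)
        imputer.overimpute(spikein=0.1,
                           training_epochs=50,
                           report_ival=10,
                           plot_vars=False)
```

Whichever setting gives the lowest overimputation error on the subset would then be re-checked on the full data, with the batch size chosen separately for the full dataset size, since (as noted above) batch size is unlikely to transfer.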

Any help would be greatly appreciated.
