Convergence failure in VAMPE_score #1
Hi,
Looking forward to hearing back from you!
Best,
Hi Andreas,
I seem to have fixed it, though I'm not sure the solution is correct. I tried changing the layer width, thinking the issue was the comparatively large input dimension. Increasing it resulted in even faster failures, so I tried a layer width of 50 instead and haven't had any failures since, though only through two "steps" so far (initial mask noise and noise=5). I'm running these on an HPC, so I don't have the plot for the above run, but I did a few runs using try/except before. The parameters are the same, just a different set of data for this one. This one failed at 65/100.
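For reference, the try/except guard mentioned above might look like the sketch below. `fit_once` is a hypothetical stand-in for the actual `ivampnet.fit(...).fetch_model()` call, which can raise a `RuntimeError` from `torch.svd` when the estimated matrix is ill-conditioned; here two failures are simulated before a success.

```python
def fit_once(attempt):
    """Hypothetical stand-in for ivampnet.fit(...).fetch_model(); the real
    call can raise RuntimeError from torch.svd on an ill-conditioned matrix."""
    if attempt < 2:  # simulate two SVD convergence failures
        raise RuntimeError("svd_cuda: The algorithm failed to converge (error code: 5).")
    return {"attempt": attempt}  # stand-in for the fetched model

def fit_with_retries(max_tries=5):
    """Retry the fit after a convergence failure (e.g. with fresh weights)."""
    for attempt in range(max_tries):
        try:
            return fit_once(attempt)
        except RuntimeError as err:
            print(f"attempt {attempt} failed: {err}")
    raise RuntimeError(f"no successful fit in {max_tries} attempts")

model = fit_with_retries()
print(model)  # {'attempt': 2}
```

In practice you would re-initialize the network (or reduce the learning rate) inside the `except` branch rather than simply calling the same fit again.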
So I've run into the same error again, but I can't work out why. Here is what I've done:

Parameters: output_sizes = [5, 5]

Workflow:
model = ivampnet.fit(loader_train, n_epochs=50, validation_loader=loader_val, mask=True, lam_decomp=20., lam_trace=1., start_mask=0, end_trace=20, tb_writer=writer, clip=False).fetch_model()

I'm running multiple variants; some are still running, and a few have finished successfully, but I can't work out why this one has failed. I received the same error as before in the 6th "step" of the workflow. I'm attaching the VAMPE trace from tensorboard as well as the Pen_scores plot (MR from the manuscript, I think?). C00, C11, and C01 don't have the same uptick near the end that Pen_scores does, so I'm not sure if it's relevant. Thank you for any assistance!
Hi,
Would love to hear if these suggestions help you!
Best
Thank you so much! Would using a GNN substantially change the parameter set? I hope that using ~25-30 nearest neighbors would be sufficient for constructing the dataset, but I don't know how that would impact some of the new parameters when compared to "traditional" VAMPnets.
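As a side note, the nearest-neighbor selection mentioned above is straightforward to derive from coordinates. This is a generic NumPy sketch, not code from the ivampnets repository; the shapes and k value are illustrative assumptions.

```python
import numpy as np

def knn_indices(coords, k):
    """For each point, return the indices of its k nearest neighbors
    (self excluded), using plain Euclidean distance."""
    # Pairwise squared distances, shape (n, n).
    diff = coords[:, None, :] - coords[None, :, :]
    d2 = (diff ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude each point itself
    return np.argsort(d2, axis=1)[:, :k]  # (n, k) neighbor indices

rng = np.random.default_rng(0)
coords = rng.normal(size=(100, 3))  # e.g. 100 residue positions in 3D
neigh = knn_indices(coords, k=25)
print(neigh.shape)  # (100, 25)
```

For a few hundred residues the dense (n, n) distance matrix is cheap; for much larger systems a tree-based lookup (e.g. `scipy.spatial.cKDTree`) would be the usual choice.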
Hey,
Hello,
I am a novice at using ML techniques like this, so forgive me for the simplistic question. I have been trying to apply the parameters given in your Jupyter notebooks to my system and am running into some issues. For background, I am working with a dataset of 690 × 1 µs simulations of a 299-aa protein (3 aa trimmed per terminus), so my input feature count is 41328. I am testing for a proper output space, but generally receive the following after a varying number of epochs:
Traceback (most recent call last):
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/testing.py", line 178, in <module>
    model = ivampnet.fit(loader_train, n_epochs=epochs, validation_loader=loader_val, mask=True, lam_decomp=20., lam_trace=1., start_mask=0, end_trace=20, tb_writer=writer, clip=False).fetch_model()
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/ivampnets.py", line 1140, in fit
    self.partial_fit((batch_0, batch_t), lam_decomp=lam_decomp, mask=train_mask,
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/ivampnets.py", line 979, in partial_fit
    scores_single, S_single, u_single, v_single, trace_single = score_all_systems(chi_t_list, chi_tau_list,
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/ivampnets.py", line 407, in score_all_systems
    score_i, S_i, u_i, v_i, trace_i = VAMPE_score(chi_i_t, chi_i_tau, epsilon=epsilon, mode=mode)
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/ivampnets.py", line 247, in VAMPE_score
    a, sing_values, b = torch.svd(K, compute_uv=True)
RuntimeError: svd_cuda: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 5).
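The error comes from `torch.svd` applied to the estimated matrix K: when the underlying covariances are nearly singular, the SVD can fail to converge. A NumPy illustration of the conditioning issue (not the library's code): adding a small `epsilon` to the diagonal, in the spirit of the `epsilon=1e-6` parameter, drastically improves the condition number.

```python
import numpy as np

# A nearly singular covariance-like matrix: two almost identical columns.
x = np.random.default_rng(1).normal(size=(1000, 1))
X = np.hstack([x, x + 1e-9 * np.random.default_rng(2).normal(size=(1000, 1))])
C = X.T @ X / len(X)

eps = 1e-6
C_reg = C + eps * np.eye(2)  # diagonal regularization, as with epsilon=1e-6

print(np.linalg.cond(C))      # enormous (near-singular matrix)
print(np.linalg.cond(C_reg))  # orders of magnitude smaller
```

This is why increasing `epsilon`, reducing correlated/redundant input features, or shrinking the output dimension are the usual first things to try when this SVD error appears.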
This particular test is with an output size of [4, 4]. My assumption is that this is caused by my output size, but I'm not certain.
My parameter list:
batch_size = 1000
valid_ratio = 0.15
test_ratio = 0.0001
network_depth = 4
layer_width = 100
nodes = [layer_width]*network_depth
skip_res = 6
patchsize = 8
skip_over = 4
factor_fake = 2.
noise = 2.
cutoff=0.9
learning_rate=0.0005
epsilon=1e-6
score_mode='regularize' # one of ('trunc', 'regularize', 'clamp', 'old')
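For concreteness, here is a small sketch of what a couple of these settings imply for sizes. The total frame count and stride below are illustrative assumptions, not figures from this issue:

```python
network_depth = 4
layer_width = 100
nodes = [layer_width] * network_depth  # hidden layer widths per lobe
print(nodes)  # [100, 100, 100, 100]

valid_ratio = 0.15
test_ratio = 0.0001
# Assumed for illustration: 690 trajectories at 1000 frames each.
n_frames = 690 * 1000
n_val = round(n_frames * valid_ratio)
n_test = round(n_frames * test_ratio)
n_train = n_frames - n_val - n_test
print(n_train, n_val, n_test)  # 586431 103500 69
```

With `test_ratio = 0.0001` the test set is only a few dozen frames here, which is worth keeping in mind when interpreting any held-out scores.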