Add a VASP handler (warning) for too high NBANDS #324
Thanks, @Andrew-S-Rosen! There are a few things I would like to comment on:
Thanks for your input, @yang-ruoxi! All of your points are valid. That said, I'm still trying to figure out the best way to handle this in Custodian. Custodian can't handle anything about compute architecture, but we still want to make sure users are at the very least notified of potentially spurious results. I elaborate a bit more below.
I will take some time to reproduce what I observed many months ago, but in short, I did observe erroneous behavior at times. In the CO example, there would be many spurious states in the DOS that would appear if NBANDS was >> NELECT. I believe @mkhorton had also observed some odd behavior in the past, but he'll have to chime in about that. @esoteric-ephemera mentioned it might be a combination of too high NBANDS with not a high enough ENCUT. I'll explore that avenue as well.
Yes, it is clear in this scenario that such calculations would be overly parallelized. Unfortunately, this is pretty common in high-throughput campaigns: if you blindly run a batch of calculations on a full node, you may inadvertently do so on a small system. Inefficiency is really a user problem though, so I'm not concerned about that from Custodian's perspective. Ignoring the NBANDS business, I don't think running with too many cores influences the DOS at all. It just makes the calculation slow.
Yes. This is why I have checked the OUTCAR for the reported NBANDS value, which is always the one used by VASP (to the best of my knowledge). This should inherently take into account all of the factors you describe above. @mkhorton proposed doing a check to see if NBANDS > 2*NELECT simply as a rule-of-thumb for being "too high", especially in the scenario where VASP has automatically modified NBANDS.
I agree that the ideal approach would be for the user to run with fewer cores. However, Custodian doesn't have the ability to modify this in any clean way. That's why I have simply raised a warning.
@yang-ruoxi: Do you think raising NCORE in such a scenario might be an appropriate workaround? We could fetch the number of compute cores by parsing the OUTCAR file.
@Andrew-S-Rosen, I see. It would be good to know what triggers the odd behaviors with high NBANDS to know what exactly needs to be done. But in reality it's hard to exhaust all possible scenarios, so a warning is fine before it is pinned down.
Thanks for the input, @yang-ruoxi! That all makes sense. I'll go ahead and try to reproduce the spurious behavior with too high NBANDS, and we can take it from there. You're right that it'll be worthwhile to dig into the underlying cause.
I have also seen problems with this: convergence issues in general and also weird VASP failures.
Since it's not immediately clear to me what the best way to correct such a calculation is, and since @yang-ruoxi has given me the blessing of "a warning is fine before it is pinned down", I'll call this PR ready to go. It doesn't do anything other than raise a warning if your number of bands is more than 2x your number of electrons (unlikely to happen intentionally; usually only happens when running a small system with a large number of cores in a high-throughput campaign). If people see spurious warnings in their calculations, we can easily adjust this. In my view, this PR will basically do no harm and may potentially save someone a bit of headache.
It might be possible to correct this with custodian but a warning is probably more than enough. You'd have to read off the system info from OUTCAR (whether it's CPU or GPU, how many MPI ranks are used, how many OpenMP threads are used), and then try to modify the combination of {NCORE or NPAR, KPAR, NBANDS} on CPU builds and {NSIM, KPAR, NBANDS} on GPU builds. But playing around with parallelization can introduce other unexpected errors. A warning might be too low visibility but we can also add this to something like the
Thanks for the input, @esoteric-ephemera. That was the exact conclusion I came to. It probably can be done, but I did not want to bother dealing with the messiness of parallelization. If someone cares more about this, I encourage them to give it a go 😅 Perhaps it will be too low visibility, but low is better than none! Not a super satisfying conclusion, but I unfortunately don't have the time to put into giving this a true fix. External validation could be interesting too.
@janosh: While we're on it, this one is ready to go. It is a very painless PR. It just raises a warning.
Summary
Closes #224.
If NBANDS is set to an unphysically high value, the resulting electronic structure properties can be completely erroneous. This is becoming more of a problem lately because VASP will automatically set NBANDS to the number of cores on a machine for small systems, and machines nowadays can have many cores (Perlmutter has 128 cores per node). Running a molecule like CO with NBANDS = 128 is a major problem.
This handler checks to see if VASP automatically changed NBANDS due to parallelization. If it did, then it checks if NBANDS > 2 times the NELECT value. If so, we raise a warning. Unfortunately, we can't do much more than that because VASP will override the INCAR even if the user manually specifies NBANDS. The solution is for the user to rerun with fewer cores.
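As a rough illustration of the check described above, here is a minimal standalone sketch. It is not the code in this PR: the function name `nbands_too_high`, the regular expressions, and the exact OUTCAR line formats and message text it looks for are assumptions made for illustration, and they may differ between VASP versions.

```python
import re
import warnings

# ASSUMPTION: OUTCAR reports these quantities on lines such as
#   "NELECT =      10.0000    total number of electrons"
#   "number of bands    NBANDS=    128"
# and prints the message below when it overrides NBANDS for parallelization.
_NBANDS_RE = re.compile(r"NBANDS\s*=\s*(\d+)")
_NELECT_RE = re.compile(r"NELECT\s*=\s*([0-9.]+)")
_AUTO_NBANDS_MSG = "The number of bands has been changed"


def nbands_too_high(outcar_text: str, factor: float = 2.0) -> bool:
    """Return True (and warn) if VASP changed NBANDS and NBANDS > factor * NELECT.

    Hypothetical helper: it only inspects text, it does not modify any input.
    """
    # Step 1: only act if VASP reports that it changed NBANDS itself.
    if _AUTO_NBANDS_MSG not in outcar_text:
        return False

    # Step 2: read the NBANDS actually used and the electron count.
    nbands_match = _NBANDS_RE.search(outcar_text)
    nelect_match = _NELECT_RE.search(outcar_text)
    if nbands_match is None or nelect_match is None:
        return False  # cannot decide without both values

    nbands = int(nbands_match.group(1))
    nelect = float(nelect_match.group(1))

    # Step 3: rule-of-thumb threshold from the discussion (NBANDS > 2 * NELECT).
    if nbands > factor * nelect:
        warnings.warn(
            f"NBANDS ({nbands}) is more than {factor:g}x NELECT ({nelect:g}); "
            "results may be unreliable. Consider rerunning with fewer cores.",
            UserWarning,
        )
        return True
    return False
```

For example, if a CO calculation reported NELECT = 10 (an illustrative value) and VASP raised NBANDS to 128 on a 128-core node, the threshold would be 2 × 10 = 20 and the sketch would warn. The `factor` parameter is included only so the rule of thumb can be adjusted if spurious warnings show up.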