Poor sampling performance with some complex posteriors compared to HMC #295
Thanks William, I'm excited to see what happens here and hopefully make improvements based on what we observe! So far the focus has been on problems where HMC performs poorly, but I agree that having reasonable performance on problems where HMC works well is very useful too. I would have expected autoMALA to do well here since there are some funnel-like shapes in the pair plot. It could be the round-based initial step size adaptation running wild; that part of autoMALA is relatively less understood compared to the per-step auto step. Another possibility is that long trajectories are really needed for this one. With @tgliu0406 we recently prototyped an auto HMC; it would be interesting to see if it performs better than autoMALA on that problem. Since multiple chains are not needed here, it might be worth focusing initially on running Pigeons with 1 chain (sketched below). I'll start by trying the reproducibility script. Thanks again!
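For anyone wanting to try the single-chain idea, a minimal sketch (untested, with a toy target standing in for the real Octofitter log potential from the reproducibility script):

```julia
using Pigeons

# Toy stand-in target; swap in the actual orbit-fit log potential.
pt = pigeons(
    target = Pigeons.toy_mvn_target(10),
    n_chains = 1,    # single chain: no tempering, the explorer does all the work
    n_rounds = 10,
)
```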
For reference, here are the stats from the HMC run: …
One other guess is that the preconditioner might not be off...
Some stats based off …
Some questions/comments …
Hi @miguelbiron, thanks for taking a look. Yes, I find that increasing the maximum number of steps is important for getting acceptable performance from HMC on these problems. Sometimes the data are not that constraining, in which case NUTS cuts the trajectories short and we average around 2^7 leapfrog steps. The DenseEuclideanMetric is critical; HMC effectively does not work on these problems without it. I'll note that while surely there are some reparameterizations that can help with this specific dataset, it is very hard to find parameterizations that solve the problem for all datasets.
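For concreteness, the kind of AdvancedHMC.jl setup I mean looks roughly like this (a sketch following the AdvancedHMC README, with a toy log density standing in for the Octofitter model; the maximum tree depth knob lives in the NUTS/trajectory options and is worth checking against the docs for the installed version):

```julia
using AdvancedHMC, ForwardDiff, LinearAlgebra

# Toy stand-in for the orbit-fit log density.
D = 10
ℓπ(θ) = -0.5 * dot(θ, θ)
initial_θ = randn(D)

n_samples, n_adapts = 2_000, 1_000
metric = DenseEuclideanMetric(D)       # dense mass matrix: critical per the discussion above
hamiltonian = Hamiltonian(metric, ℓπ, ForwardDiff)
initial_ϵ = find_good_stepsize(hamiltonian, initial_θ)
integrator = Leapfrog(initial_ϵ)
kernel = HMCKernel(Trajectory{MultinomialTS}(integrator, GeneralisedNoUTurn()))
adaptor = StanHMCAdaptor(MassMatrixAdaptor(metric), StepSizeAdaptor(0.8, integrator))
samples, stats = sample(hamiltonian, kernel, initial_θ, n_samples, adaptor, n_adapts)
```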
No problem and thanks for clarifying! It's good to know at least that NUTS works here because you're taking it to extreme tree sizes (and even then, I see it goes beyond the limit 10% of the time), and also using the dense preconditioner. From a Pigeons perspective, we would instead prefer to stick with a cheaper explorer and crank up the number of chains. But it's weird that that doesn't work. So I'll be doing some tests now to see if I can find a combination of a dumb explorer with more chains that gives as good performance as NUTS in the same wallclock time. I'll keep you posted.
Thanks @miguelbiron, I would say this matches my experience with Pigeons too. The SliceSampler combined with ~40 chains and ~40 variational chains (as in the sketch below) can usually power through just about anything. Unfortunately there are some challenging cases where even that leads to some weird results. For example, a figure from a different but related posterior shows how sometimes PT leaves gaps along the posterior that ought to be connected (see e.g. inclination vs eccentricity in the attached plot).
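For reference, the kind of setup I mean, sketched on a toy target (the variational-leg keywords follow the Pigeons variational PT docs and are worth double-checking):

```julia
using Pigeons

pt = pigeons(
    target = Pigeons.toy_mvn_target(10),   # stand-in for the real posterior
    explorer = SliceSampler(),
    n_chains = 40,                         # chains on the fixed-reference leg
    n_chains_variational = 40,             # chains on the variational leg
    variational = GaussianReference(first_tuning_round = 5),
)
```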
The axis-aligned gap kinda makes sense for a slice sampler failure mode. That could be fixed by substituting a Hit-and-Run wrapper for the slicer that we implemented for the AutoStep paper with @tgliu0406. Or even fully switching to autoRWMH, which showed promising results. We should be releasing a package with these additional explorers in the short term. I'll go back to the other reproducible example now.
Good point... I think this would be a good test bed for the auto step methods + Hit and Run...
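To make the idea concrete, here is a minimal textbook sketch of a hit-and-run slice move (not the AutoStep-paper implementation): a uniform random direction removes the axis alignment of the standard slice sampler, which is exactly the failure mode in the plot above.

```julia
using Random, LinearAlgebra

# One hit-and-run slice move: draw a uniform random direction, then do
# standard stepping-out + shrinkage slice sampling on the 1-D restriction.
function hit_and_run_slice(rng, logp, x; w = 1.0, max_steps = 100)
    d = randn(rng, length(x))
    d ./= norm(d)                        # uniform direction on the sphere
    g(t) = logp(x .+ t .* d)             # log density along the line
    logy = g(0.0) + log(rand(rng))       # log of the slice height
    L = -w * rand(rng)                   # randomly positioned initial bracket
    R = L + w
    while g(L) > logy && (max_steps -= 1) > 0
        L -= w                           # step out to the left
    end
    while g(R) > logy && (max_steps -= 1) > 0
        R += w                           # step out to the right
    end
    while true                           # shrinkage: always terminates
        t = L + rand(rng) * (R - L)
        g(t) > logy && return x .+ t .* d
        t < 0 ? (L = t) : (R = t)        # shrink towards the current point (t = 0)
    end
end

# Usage on a toy 2-D Gaussian:
logp(x) = -0.5 * sum(abs2, x)
x_new = hit_and_run_slice(Random.default_rng(), logp, zeros(2))
```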
I noticed that the script does not give exactly the same result when I rerun it. Looking at the output of the initializer, I think it might be the cause. In my first run I got …

vs for the second: …
Hi @alexandrebouchard, my bad. If the starting points aren't given explicitly, they are sampled from the default RNG. You should be able to do:

```julia
model.starting_points = [
    model.link([1, 2, 3, …]),  # start values for each parameter
    # repeat for each chain, or use fill(…)
]
```
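Alternatively (my assumption about where the randomness enters), seeding the default RNG before constructing the model should also make runs repeatable:

```julia
using Random
Random.seed!(1)  # fix the global default RNG that the starting points are drawn from
```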
I also noticed the non-determinism, which is more annoying in that the HMC samples themselves are completely different from one run to the other. I've never seen on my runs the sinusoidal shapes you show above, so unless I f'd up something on my end, I wouldn't trust those samples too much either. BTW, I'm doing a 20-round run with 60 chains = 4Λ, autoRWMH with n_refresh = 32 (run sketched below). I did this following my heuristic advice from the NRST paper, and I actually get …

Main takeaway: you haven't reached adaptation convergence if you don't see …
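Roughly, the run looks like this (a sketch with a stand-in target and explorer, since the autoRWMH explorer isn't in a released package yet; the recording and diagnostics names follow the Pigeons docs):

```julia
using Pigeons

pt = pigeons(
    target = Pigeons.toy_mvn_target(10),  # stand-in for the orbit-fit target
    n_chains = 60,
    n_rounds = 20,
    explorer = SliceSampler(),            # stand-in for autoRWMH with n_refresh = 32
    record = [round_trip],
)
n_tempered_restarts(pt)   # restarts only appear once adaptation has converged
stepping_stone(pt)        # log-normalization (logZ) estimate
```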
Okay, I can reproduce the non-determinism of the Pathfinder initialization. I'll try to track that down.
Wow, this is frustrating. I think I found non-determinism issues with both Pathfinder.jl and AdvancedHMC.jl. For Pathfinder, it is non-deterministic when using multiple threads (can be worked around by passing …). For AdvancedHMC, I am passing an …
I pushed an update to the script that works around those issues for now.
It's crazy that the first restart is in round 13; I usually see restarts beginning in rounds 8-10. Is that related to the use of autoRWMH, do you think?
I don't think so; I tried many combinations of samplers and all take about 12-13 rounds to give restarts on this problem. I think the issue is that the model is very challenging for our adaptation algorithm. Basically, the restarts begin here at the same round that the logZ estimate approaches the true value.
Some thoughts: …
Re Hessian-vector products: it seems like SparseDiffTools.jl is the one package offering them. Despite the name, it seems to still work for arbitrary dense Hessians.
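Something like this, if I'm reading the SparseDiffTools README right (a sketch; the function name `numauto_hesvec` is worth verifying against the installed version):

```julia
using SparseDiffTools

f(x) = sum(abs2, x) / 2        # toy log-density; its Hessian is the identity
x, v = randn(5), randn(5)
# Hessian-vector product H(f)(x) * v without materializing H
# (ForwardDiff gradients plus finite differencing); should ≈ v here.
Hv = numauto_hesvec(f, x, v)
```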
Finished my 20-round run (without running a similarly long chain for HMC). The way I understand it is that the problem is fully unidentifiable: there are several (uncountably many, perhaps?) strict submanifolds where the distribution of the logpotential restricted to that region is exactly equal to the unrestricted one. Our auto-XYZ samplers tend to be content with staying in those regions -- they basically only look at the problem through the distribution of the logpotential. Therefore, the only way you can fulfill the ergodicity guarantees of PT is via restarts. Hence, the splatter patterns.
@miguelbiron & @alexandrebouchard , thank you for looking into this so thoroughly.
I admit I’m a bit out of my depth here, but I’m glad there is an explanation for the splatter patterns. Actually, we’ve seen this before with ptemcee.py too. Maybe we could discuss tomorrow?
I think I have done this before using ForwardDiff without issue.
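The pattern I have in mind is the directional-derivative trick: H(f)(x) * v is the derivative of the gradient along v, which nested ForwardDiff handles directly.

```julia
using ForwardDiff

# H(f)(x) * v = d/dt ∇f(x + t*v) at t = 0, via ForwardDiff over ForwardDiff
hvp(f, x, v) = ForwardDiff.derivative(t -> ForwardDiff.gradient(f, x .+ t .* v), 0.0)

f(x) = sum(abs2, x) / 2            # toy check: Hessian is I, so the result ≈ v
hvp(f, randn(3), [1.0, 2.0, 3.0])
```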
Yes, happy to discuss! Just to finish the thought about the difference with NUTS: in contrast to our samplers, the no-U-turn condition actually looks at the state vector (x,p), forcing the sampler to go far in phase space -- not just in logdensity space. It would be interesting to think about what sort of modification the autoXYZ approach would require to force movement in phase space too. Edit: BTW, if this interpretation is correct, then a Riemannian approach would not be sufficient to match NUTS' performance. At most, it would reduce the required n_refresh back to 1-3 instead of the 32 or higher needed to get good mixing in the logpotential. But hopefully we can prove this theory is wrong :D
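For reference, the original no-U-turn termination criterion from Hoffman & Gelman (2014) stops doubling the trajectory when

$$
(\theta^{+} - \theta^{-}) \cdot p^{-} < 0
\quad\text{or}\quad
(\theta^{+} - \theta^{-}) \cdot p^{+} < 0,
$$

where $\theta^{\pm}, p^{\pm}$ are the position and momentum at the two trajectory endpoints. The displacement $\theta^{+} - \theta^{-}$ is measured in position space, which is exactly the ingredient a logpotential-only view lacks.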
@miguelbiron I wonder if the samples have some transformations applied that way too. Might be safest to try: …; the last row will be the log posterior density.
Hmm, I lost the pt object (had to turn off my PC). How would you do that with the HMC samples? That should run faster.
That should be possible too if the HMC chains object is in memory (it isn't saved to disk via savechain / loadchain). The following should plot the HMC samples in the unconstrained space: …
Re: autodiff for Hessians and Hessian-vector products. I tested SparseDiff's …
Beautiful!
To add one observation to this discussion, I noticed that another orbit-fitting code often produces similar splatter patterns. That code uses "ptemcee", a (reversible) parallel-tempered affine-invariant ensemble sampler. Do you think this symptom could have a common cause? This is from the paper: https://iopscience.iop.org/article/10.3847/1538-3881/ac042e
Maybe we could see if we can reproduce the splatter pattern with a simple synthetic example, e.g. like Example 7.2 in @nikola-sur's paper: https://arxiv.org/pdf/2405.11384
What is pi0 in that example? |
Hi all, I have isolated a relatively lightweight example where I find that HMC significantly outperforms PT. I tested most of the variations supported in Pigeons (SliceSampler & AutoMALA; fixed, variational, and stabilized-variational).
A corner & trace plot are attached below with HMC in blue and PT in gold.
A script to produce this plot is available here: https://github.com/sefffal/OrbitPosteriorDB/blob/main/models/astrom-GL229A.jl
Use the latest #main commit of Octofitter (e.g. `] add Octofitter#main`). I would be curious to understand better why PT is struggling so much on this target, and if there is a way to improve performance to be at least within the same ballpark.
The HMC series in this plot is not exactly converged since there's pretty high correlation between samples, but it nonetheless successfully explores the posterior while PT is stuck in a much smaller region.