Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[float32] CG/Newton solver terminates early due to improvement overshoots #2313

Open
2 tasks done
n3b opened this issue Dec 27, 2024 · 9 comments
Open
2 tasks done

[float32] CG/Newton solver terminates early due to improvement overshoots #2313

n3b opened this issue Dec 27, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@n3b
Copy link

n3b commented Dec 27, 2024

Intro

My setup

What's happening? What did you expect?

I believe an abs() is missing here

mjtNum improvement = scale * (oldcost - ctx.cost);

There are serious stability issues otherwise that can be seen in cases like throwing 10+ perfectly aligned cubes on top of each other.

Steps for reproduction

Minimal model for reproduction

No response

Code required for reproduction

No response

Confirmations

@n3b n3b added the bug Something isn't working label Dec 27, 2024
@yuvaltassa
Copy link
Collaborator

An abs is certainly not missing on that line. This is a signed quantity. If it is negative then there is no improvement.

If you can provide a MRE that would be helpful.

@yuvaltassa
Copy link
Collaborator

Ah but I think I see what you mean. On line 1833, a negative improvement would would lead to termination. I'll investigate. Thanks.

@yuvaltassa
Copy link
Collaborator

I think the correct thing is not an abs but rather max(0, ):

mjtNum improvement = scale * mju_max(oldcost - ctx.cost, 0);

Right?

Right now any increase in cost leads to termination. Perhaps this is by design? @emotodorov WDYT?

@yuvaltassa
Copy link
Collaborator

Ya I think a negative improvement is not possible since in this case we will have already terminated here:

// no improvement: done

I'm closing this issue as you provided no way to reproduce, but I'm quite interested if you were to provide an example.

@n3b
Copy link
Author

n3b commented Dec 31, 2024

@yuvaltassa sorry, I should have mentioned I'm using single precision. You can see negative improvement in something like the xml below.

I also noticed that sometimes one-sided search can stuck at local minima and exhaust iterations with no improvement, while deriv[0] is above the tolerance. Would it make sense to check for delta in this loop as well?

while (p1.deriv[0]*dir <= -gtol && ctx->LSiter < m->opt.ls_iterations) {

<mujoco>
    <option timestep="0.015"/>
    <option gravity="0 0 -9.81"/>
    <size memory="100M"/>

    <default>
        <geom solimp=".9 .99 .005"/>
        <default class="box">
            <geom type="box" material="box" size="0.5 0.5 0.5"/>
        </default>
    </default>

    <asset>
        <texture name="texgeom" type="cube" builtin="flat" mark="cross" width="128" height="128"
                 rgb1="0.6 0.6 0.6" rgb2="0.6 0.6 0.6" markrgb="1 1 1"/>
        <material name="box" texture="texgeom" texuniform="true" rgba=".4 .9 .9 1"/>
    </asset>

    <visual>
        <quality shadowsize="4096" offsamples="8"/>
        <map znear="0.1" force="0.05"/>
    </visual>

    <statistic extent="4"/>

    <worldbody>
        <light directional="true" diffuse=".8 .8 .8" pos="0 0 10" dir="0 0 -10"/>
        <geom type="plane" size="10 10 .01"/>

        <body euler="0 0 0" pos="3 0 10">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 10">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 13">
            <freejoint/>
            <geom class="box"/>
            
        </body>

        <body euler="0 0 0" pos="0 0 15">
            <freejoint/>
            <geom class="box"/>
            
        </body>

        <body euler="0 0 0" pos="0 0 17">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 19">
            <freejoint/>
            <geom class="box"/>
            
        </body>

        <body euler="0 0 0" pos="0 0 21">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 23">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 25">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 27">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 29">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 31">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 33">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 35">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 37">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 39">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 41">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 43">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 45">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 47">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 49">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 51">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 53">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 55">
            <freejoint/>
            <geom class="box"/>
        </body>

        <body euler="0 0 0" pos="0 0 57">
            <freejoint/>
            <geom class="box"/>
        </body>

    </worldbody>

</mujoco>

@emotodorov
Copy link
Contributor

emotodorov commented Dec 31, 2024 via email

@emotodorov
Copy link
Contributor

emotodorov commented Dec 31, 2024 via email

@yuvaltassa yuvaltassa reopened this Dec 31, 2024
@yuvaltassa
Copy link
Collaborator

Reopening since we now have an MRE, thanks @n3b !

If you have capacity:

  1. Could you verify if the issues that you're seeing are limited to float32 or if you also see them with float64?
  2. If you've made changes locally that you think are improvements, feel free to submit a PR. Even if you are not sure that what you did is a strict improvement, this would be a good starting point for us to investigate.

Thanks again!

@n3b
Copy link
Author

n3b commented Dec 31, 2024

@emotodorov thank you, that makes so much sense now. Unfortunately my math background is way below the necessary level to make such assumptions. I never thought of precision issues in that context because I've been seeing unscaled delta cost even below -1e5.

@yuvaltassa thank you,

  1. I tested the above example yesterday and it indeed does not produce negative improvement with doubles (there is some noise but it's very close to zero), but I can't tell for more complex models because:
  2. I made changes indeed if one can call it that. I'm going to use Newton/CG solvers in a game engine and was evaluating their limits for a while now. So even though the solver algorithms are the same with slight differences in explicit vectorization and parallelism, the engine is different and it's rather hard to create an equal reproduction.

I do have some thoughts about improving Hessian updates besides what I already did, and about CG performance as well, but am not sure if it's the right place for it. Neither I'm confident it would make any sense from the math POV, so please let me know if you are up to discuss it.

Oh, and another thing I just remembered. The R dimensions for contacts are being overridden by the first one (using translation diag only). Is that intentional?

R[i+j] = mju_max(mjMINVAL, (1-imp)*d->efc_diagApprox[i+j]/imp);

R[i+j+1] = R[i+1]*friction[0]*friction[0]/(friction[j]*friction[j]);

@saran-t saran-t changed the title CG/Newton solver terminates early due to improvement overshoots [float32] CG/Newton solver terminates early due to improvement overshoots Jan 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants