Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debug flakes #1324

Closed
wants to merge 4 commits into from
Closed

Debug flakes #1324

wants to merge 4 commits into from

Conversation

martinpitt
Copy link
Member

@martinpitt martinpitt commented Jul 3, 2023

I got this testCreatePodUser error locally once when I was investigating something else. But of course since then I could never reproduce it again. The pod container is in state "Created", while it's supposed to be "Running". That isn't just a timeout question, it stays in "Created" forever.

another example

@martinpitt martinpitt added the flake unstable test label Jul 3, 2023
@martinpitt

This comment was marked as resolved.

@martinpitt martinpitt force-pushed the flakes branch 2 times, most recently from 23a4151 to b29afab Compare July 3, 2023 10:47
@martinpitt

This comment was marked as resolved.

@martinpitt martinpitt force-pushed the flakes branch 3 times, most recently from 67b1411 to 37092ff Compare July 3, 2023 14:14
@martinpitt martinpitt changed the title Debug testCreatePodUser flake Debug flakes Jul 4, 2023
@martinpitt

This comment was marked as resolved.

@martinpitt

This comment was marked as resolved.

@martinpitt martinpitt force-pushed the flakes branch 2 times, most recently from 4dec5bb to 0607a90 Compare July 4, 2023 15:24
@martinpitt

This comment was marked as resolved.

@martinpitt martinpitt force-pushed the flakes branch 6 times, most recently from 09e6d96 to 5a28f0d Compare July 5, 2023 04:58
@martinpitt

This comment was marked as resolved.

@martinpitt

This comment was marked as resolved.

@martinpitt
Copy link
Member Author

martinpitt commented Jul 13, 2023

I've seen TestApplication.testHealthcheckUser fail twice like this by now. example 2, example 3

I can reproduce it locally after a few iterations. This bit is interesting:

Unhandled event type container health_status

That does seem related! It seems that previously it only worked by chance if there was another event after triggering a health check run.

However, the flake still happens when adding health_status to the events that refresh the container status.

But it may be related to this failure. it's not, these are independent.

Enabled debug logging here.


variant 1 misses the transition from "checking health" to "healthy". The log shows that there is just a single monitor event for the starting health check:

> debug: podman user monitor {"status":"health_status","id":"c625b57c1a...","from":"localhost/test-busybox:latest","Type":"container","Action":"health_status","Actor": "ID":"c625b57c1a3...","Attributes":{...}},"scope":"local",","HealthStatus":"starting"}

and the corresponding inspect call also says "Health":{"Status":"starting"}. So this feels like a genuine podman bug -- podman ps clearly shows "Up 16 seconds (healthy)", so the check did run, but the monitor event is sometimes missing. In a successful log, it does happen:

> debug: podman user monitor {"status":"health_status","id":"7ee2f963...","from":"localhost/test-busybox:latest","Type":"container","Action":"health_status","Actor":{"ID":"7ee2f963a9fd...","Attributes":{...}},"scope":"local","HealthStatus":"healthy"}

I reported this as containers/podman#19237, and we have a previously implicit, and now explicit workaround, see commit 92398c3.


variant 2 misses the new healthcheck result after running the "health check" container action.

This is missing a health_status event similar to variant 1. The only thing that happens for a container health check run is the exec_died monitor event, which we started to ignore. So let's revert that. But even after reverting it still happens -- we get a health_status event (ignored) and an exec_died after clicking "Run health check", and the API query still only has one Log entry:

    "State": {    
        "OciVersion": "1.1.0-rc.1",    
        "Status": "running",    
        "Running": true,        
        "Paused": false,    
        "Restarting": false,    
        "OOMKilled": false,    
        "Dead": false,    
        "Pid": 15520,    
        "ConmonPid": 15518,    
        "ExitCode": 0,     
        "Error": "",    
        "StartedAt": "2023-07-14T10:07:14.064090116Z",    
        "FinishedAt": "0001-01-01T00:00:00Z",    
        "Health": {    
            "Status": "healthy",      
            "FailingStreak": 0,        
            "Log": [       
                {    
                    "Start": "2023-07-14T10:07:14.151866216Z",    
                    "End": "2023-07-14T10:07:14.261819433Z",    
                    "ExitCode": 0,    
                    "Output": ""    
                }    
            ]    
        },    

so that'll be another naughty. Sent as cockpit-project/bots#5004

These events change the state in a predictable way. Avoid the expensive
updateContainer() call for these, to avoid multiple calls overlapping
each other. See containers/podman#19124

The exact `StartedAt` value for the "start" event is unfortunately not
part of the event data, but we can approximate it very well with the
event time stamp. This doesn't have to be 100% correct: The only place
where we use that is the restart detection in testLifecycleOperations.
Adjust that to not require the data-started-at to be identical to the
CLI output, just that it changes.
This reverts commit d3cb07d.

Let's debug healthcheck first
For some reason, .focus() sometimes does not take effect, and the
element immediately loses the focus again. This causes set_input_text()
to select the whole page text instead of the input element, and failing
to set the intended value.
@martinpitt
Copy link
Member Author

Shelving this for now. I'll make another attempt at reducing API queries at some point, but these days tests are remarkably stable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flake unstable test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant