Debug flakes #1324

martinpitt · 2023-07-03T08:21:53Z

I got this testCreatePodUser error locally once when I was investigating something else. But of course since then I could never reproduce it again. The pod container is in state "Created", while it's supposed to be "Running". That isn't just a timeout question, it stays in "Created" forever.

another example

martinpitt · 2023-07-13T07:16:30Z

I've seen TestApplication.testHealthcheckUser fail twice like this by now. example 2, example 3

I can reproduce it locally after a few iterations. This bit is interesting:

Unhandled event type container health_status

That does seem related! It seems that previously it only worked by chance if there was another event after triggering a health check run.

However, the flake still happens when adding health_status to the events that refresh the container status.

~~But it may be related to this failure.~~ it's not, these are independent.

Enabled debug logging here.

variant 1 misses the transition from "checking health" to "healthy". The log shows that there is just a single monitor event for the starting health check:

> debug: podman user monitor {"status":"health_status","id":"c625b57c1a...","from":"localhost/test-busybox:latest","Type":"container","Action":"health_status","Actor": "ID":"c625b57c1a3...","Attributes":{...}},"scope":"local",","HealthStatus":"starting"}

and the corresponding inspect call also says "Health":{"Status":"starting"}. So this feels like a genuine podman bug -- podman ps clearly shows "Up 16 seconds (healthy)", so the check did run, but the monitor event is sometimes missing. In a successful log, it does happen:

> debug: podman user monitor {"status":"health_status","id":"7ee2f963...","from":"localhost/test-busybox:latest","Type":"container","Action":"health_status","Actor":{"ID":"7ee2f963a9fd...","Attributes":{...}},"scope":"local","HealthStatus":"healthy"}

I reported this as containers/podman#19237, and we have a previously implicit, and now explicit workaround, see commit 92398c3.

variant 2 misses the new healthcheck result after running the "health check" container action.

This is missing a health_status event similar to variant 1. The only thing that happens for a container health check run is the exec_died monitor event, which we started to ignore. So let's revert that. But even after reverting it still happens -- we get a health_status event (ignored) and an exec_died after clicking "Run health check", and the API query still only has one Log entry:

    "State": {    
        "OciVersion": "1.1.0-rc.1",    
        "Status": "running",    
        "Running": true,        
        "Paused": false,    
        "Restarting": false,    
        "OOMKilled": false,    
        "Dead": false,    
        "Pid": 15520,    
        "ConmonPid": 15518,    
        "ExitCode": 0,     
        "Error": "",    
        "StartedAt": "2023-07-14T10:07:14.064090116Z",    
        "FinishedAt": "0001-01-01T00:00:00Z",    
        "Health": {    
            "Status": "healthy",      
            "FailingStreak": 0,        
            "Log": [       
                {    
                    "Start": "2023-07-14T10:07:14.151866216Z",    
                    "End": "2023-07-14T10:07:14.261819433Z",    
                    "ExitCode": 0,    
                    "Output": ""    
                }    
            ]    
        },

so that'll be another naughty. Sent as cockpit-project/bots#5004

These events change the state in a predictable way. Avoid the expensive updateContainer() call for these, to avoid multiple calls overlapping each other. See containers/podman#19124 The exact `StartedAt` value for the "start" event is unfortunately not part of the event data, but we can approximate it very well with the event time stamp. This doesn't have to be 100% correct: The only place where we use that is the restart detection in testLifecycleOperations. Adjust that to not require the data-started-at to be identical to the CLI output, just that it changes.

This reverts commit d3cb07d. Let's debug healthcheck first

For some reason, .focus() sometimes does not take effect, and the element immediately loses the focus again. This causes set_input_text() to select the whole page text instead of the input element, and failing to set the intended value.

martinpitt · 2023-08-30T05:48:58Z

Shelving this for now. I'll make another attempt at reducing API queries at some point, but these days tests are remarkably stable.

martinpitt added the flake unstable test label Jul 3, 2023