Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve logging and handling of transient 500 errors #1563

Open
melange396 opened this issue Dec 3, 2024 · 0 comments
Open

Improve logging and handling of transient 500 errors #1563

melange396 opened this issue Dec 3, 2024 · 0 comments
Labels
bug devops building, running, deploying, environment stuff, handy utils, repository-related, engineer QoL, etc logs and monitoring logging, monitoring, alerting mysql mysql database related

Comments

@melange396
Copy link
Collaborator

From time to time, API server processes can to get into a strange state where their db connections seem to be closed or otherwise invalid in some way, and this causes 500s. Thankfully this can be remedied by restarting the API server jobs, but it would be better if we could handle it automatically.

We cannot (yet!) explain what gets us into this condition, but we can identify it through log entries emitted at

get_structured_logger("server_error").error("Received DatabaseErrorException", exception=str(e), exc_info=True)

... Those log entries include a long stack trace that boils down to

  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 1083, in cursor
    return self.dbapi_connection.cursor(*args, **kwargs)
AttributeError: 'NoneType' object has no attribute 'cursor'

We should do a couple additional things where we currently log that:

  1. Keep logging the full stack trace, but also log another shorter field to aggregate/filter on, like the root-cause exception (which would be "AttributeError: 'NoneType' object has no attribute 'cursor'" here)
  2. Create another log entry for the status of the database connection, g.db, that includes fields for its .closed and .invalidated attributes. The .info and .connection attributes might prove useful too.

If it turns out that the db connections in these cases are indeed closed (and perhaps even if not), we can check for a closed connection in the _get_db() convenience method and then re-open it.

Also, almost all of the instances where DatabaseException is raised are in small try blocks around .run_query() calls -- we should just have run_query raise the exception itself (here) and remove the try blocks in those instances.

@melange396 melange396 added bug logs and monitoring logging, monitoring, alerting devops building, running, deploying, environment stuff, handy utils, repository-related, engineer QoL, etc mysql mysql database related labels Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug devops building, running, deploying, environment stuff, handy utils, repository-related, engineer QoL, etc logs and monitoring logging, monitoring, alerting mysql mysql database related
Projects
None yet
Development

No branches or pull requests

1 participant