All builds stuck "discovering any new versions", seems to be NATS cert expiry underneath, recover instructions not worked #334

Open
RichardBradley opened this issue May 24, 2023 · 7 comments


@RichardBradley
Contributor

RichardBradley commented May 24, 2023

Summary

All my builds got stuck saying "discovering any new versions" for ages. I looked at concourse/#844 and its linked issues for a while.

As part of that red herring, I found the following issue:

sh-4.2$ fly -t xxx check-resource -r x/x
checking x/x in build 42946363
initializing check: x
resource config creds evaluation: Get "https://xxx:8844/info": x509: certificate has expired or is not yet valid: current time 2023-05-23T15:09:56Z is after 2023-05-23T10:03:25Z
errored

Which looks a lot like https://github.com/EngineerBetter/control-tower/blob/master/docs/troubleshooting.md#bosh-director-certificate-has-expired
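
The expiry reported in that error can be confirmed independently with openssl. This is a minimal sketch and not part of the Control Tower docs: `cert_end_date` and `cert_expires_within` are hypothetical helper names, and the PEM file is assumed to be whichever certificate you have extracted locally.

```shell
# Print the notAfter (expiry) date of a PEM-encoded certificate file ($1).
cert_end_date() {
  openssl x509 -noout -enddate -in "$1"
}

# Succeed (exit 0) if the certificate in $1 expires within the next $2 seconds.
# `openssl x509 -checkend N` exits 0 when the certificate is still valid
# N seconds from now, so the result is inverted here.
cert_expires_within() {
  ! openssl x509 -noout -checkend "$2" -in "$1" >/dev/null
}

# To inspect a live endpoint instead of a file (host and port as in the
# error above; "xxx" is the redacted hostname):
#   openssl s_client -connect xxx:8844 </dev/null 2>/dev/null \
#     | openssl x509 -noout -enddate
```

Running something like `cert_expires_within cert.pem 2592000` (30 days) on a schedule would be one way to get a warning before the yearly expiry, rather than finding out from stuck builds.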

We have had similar issues before and had followed the NATS cert renewal instructions last week in an attempt to avoid this.

I followed those instructions but then got:

Deploying:
  Creating instance 'bosh/0':
    Waiting until instance is ready:
      Post https://mbus/:<redacted>@54.77.80.216:6868/agent: x509: certificate has expired or is not yet valid

I then tried to follow https://github.com/EngineerBetter/control-tower/blob/master/docs/troubleshooting.md#nats-certificate-is-expired

but I currently have:

Task 8269 | 16:00:19 | Error: Failed to acquire lock for lock:deployment:concourse uid: 04075eac-579a-4839-98d9-2b4d840de459. Locking taskid is 8264, description: 'scan and fix'

Task 8269 Started  Tue May 23 16:00:19 UTC 2023
Task 8269 Finished Tue May 23 16:00:19 UTC 2023
Task 8269 Duration 00:00:00
Task 8269 error

Updating deployment:
  Expected task '8269' to succeed but state is 'error'

Exit code 1
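
The lock error names the task that currently holds the deployment lock (8264, 'scan and fix'). Below is a hedged sketch of how that ID could be pulled out of the message and the task inspected. `locking_task_id` is a hypothetical helper, and whether it is safe to cancel the locking task rather than wait for it is an assumption that depends on the deployment.

```shell
# Read BOSH output on stdin and print the ID of the task holding the lock.
locking_task_id() {
  sed -n 's/.*Locking taskid is \([0-9][0-9]*\).*/\1/p'
}

# Possible follow-up, assuming the bosh CLI is already targeted at the director:
#   id="$(echo "$error_line" | locking_task_id)"
#   bosh task "$id"          # follow the task that holds the lock
#   bosh cancel-task "$id"   # only if it appears stuck (assumption: safe to interrupt)
```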

In step 6 of the above, where it says "Run bosh deploy --recreate --fix <(bosh manifest)", what should I use for "bosh manifest"?

Steps to reproduce

Run Concourse for more than one year.

Expected results

Concourse should continue to work, or be easily recoverable.

If there are any errors they should be clear and suggest fixes.

Actual results

Concourse fails with all builds stuck on "discovering any new versions"

Additional context

Triaging info

  • Concourse version: current
  • Browser (if applicable):
  • Did this used to work? no, this happens every year

Any help or advice would be gratefully received.

@RichardBradley
Contributor Author

In step 6 of the above, where it says "Run bosh deploy --recreate --fix <(bosh manifest)", what should I use for "bosh manifest"?

Figured this out: I was using "sh" when I needed "bash" for this to be a valid command, because the <(...) process substitution syntax is a bash feature.
I had thought "bosh manifest" was a placeholder for some file I couldn't find.
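
For anyone who only has plain sh available, the same step can be done without `<(...)` by writing the manifest to a temporary file first. This is a sketch under assumptions: `redeploy_with_tempfile` is a hypothetical name, and the bosh CLI is assumed to be already pointed at the right director and deployment (for example through the `BOSH_ENVIRONMENT` and `BOSH_DEPLOYMENT` variables).

```shell
# POSIX sh equivalent of: bosh deploy --recreate --fix <(bosh manifest)
# `bosh manifest` prints the current deployment manifest; it is a real
# command, not a placeholder for a file name.
redeploy_with_tempfile() {
  manifest_file="$(mktemp)" || return 1
  # Save the current manifest; clean up and stop if it cannot be fetched.
  if ! bosh manifest > "$manifest_file"; then
    rm -f "$manifest_file"
    return 1
  fi
  # Pass the file path where bash would have substituted <(bosh manifest).
  bosh deploy --recreate --fix "$manifest_file"
  status=$?
  rm -f "$manifest_file"
  return "$status"
}
```

Alternatively, running the documented command through bash explicitly, as in `bash -c 'bosh deploy --recreate --fix <(bosh manifest)'`, avoids the temporary file.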

@RichardBradley
Contributor Author

I think I'm on step 6: "bosh deploy --recreate --fix <(bosh manifest)"

It gives this error, which is the same error that "control-tower deploy" gives me:


Task 8281 | 10:33:14 | Updating instance web: web/e81d69b1-743b-4351-a873-16543a8c3055 (0) (canary) (00:18:11)
                     L Error: 'web/e81d69b1-743b-4351-a873-16543a8c3055 (0)' is not running after update. Review logs for failed jobs: bosh-dns
Task 8281 | 10:33:14 | Error: 'web/e81d69b1-743b-4351-a873-16543a8c3055 (0)' is not running after update. Review logs for failed jobs: bosh-dns

Task 8281 Started  Wed May 24 10:14:59 UTC 2023
Task 8281 Finished Wed May 24 10:33:14 UTC 2023
Task 8281 Duration 00:18:15
Task 8281 error

Updating deployment:
  Expected task '8281' to succeed but state is 'error'

Exit code 1

Any suggestions on how to debug or fix?
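
One possible starting point is to take the failed job names from the error and look at their logs. This is only a sketch: `failed_jobs` is a hypothetical helper, it assumes multiple job names would be comma-separated, and the follow-up commands assume the director can still reach the VM, which may not hold while the NATS certificate is expired.

```shell
# Read BOSH output on stdin and print each failed job name once, one per line.
failed_jobs() {
  sed -n 's/.*Review logs for failed jobs: \(.*\)$/\1/p' \
    | tr -d ' ' | tr ',' '\n' | sort -u
}

# Possible follow-up for the job named in the output above:
#   bosh logs web --job=bosh-dns    # download that job's logs
#   bosh ssh web                    # then look under /var/vcap/sys/log/bosh-dns/
```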

@RichardBradley
Contributor Author

I have deleted and recreated my Concourse, which has mostly worked but was massively disruptive.

@RichardBradley
Contributor Author

RichardBradley commented May 25, 2023

Lots of my builds are failing with "Docker failed to start within 120 seconds." and lots are just hanging. I think the worker is overloaded because I'm starting from scratch and building so much in parallel, but this isn't great failure behaviour.

EDIT: this is happening even when the server is not busy, so something else is wrong. Any suggestions gratefully received.

@RichardBradley
Contributor Author

I have updated to the latest version and this seems to be settling down.

@RichardBradley
Contributor Author

This has happened again, one year later.
(I forgot to set a reminder to renew the certs. I guess I was so traumatised by the above that I just tried to forget it.)

I'm following the same instructions again, and failing again on step 6 with the same error.

@RichardBradley
Contributor Author

I might have fixed this by trying lots of different things, including running step 6 ("bosh deploy --recreate --fix <(bosh manifest)") multiple times, then deleting both the worker and web VMs, then (after bosh failed to recreate them with NATS cert errors) re-running the "control-tower deploy" command.
