System unavailable: Multiple windows machines currently offline #2493

Closed
sxa opened this issue Mar 1, 2022 · 28 comments

Comments

@sxa
Member

sxa commented Mar 1, 2022

@sxa sxa added the systemdown label Mar 1, 2022
@sxa sxa added this to the 2022-03 (March) milestone Mar 1, 2022
@sxa
Member Author

sxa commented Mar 22, 2022

Note: The jenkins logs have repeated messages regarding a failed JNLP connection attempt from several of the Windows Azure systems:

2022-03-22 09:18:12.844+0000 [id=43063]	SEVERE	o.j.r.p.impl.NIONetworkLayer#ready: [JNLP4-connect connection from 13.68.134.204/13.68.134.204:56201] Uncaught NullPointerException
2022-03-22 09:18:40.538+0000 [id=44044]	SEVERE	o.j.r.p.impl.NIONetworkLayer#ready: [JNLP4-connect connection from 52.149.211.210/52.149.211.210:60099] Uncaught NullPointerException
2022-03-22 09:18:46.690+0000 [id=43104]	SEVERE	o.j.r.p.impl.NIONetworkLayer#ready: [JNLP4-connect connection from 20.185.182.137/20.185.182.137:50653] Uncaught NullPointerException
2022-03-22 09:18:55.367+0000 [id=43104]	SEVERE	o.j.r.p.impl.NIONetworkLayer#ready: [JNLP4-connect connection from 13.68.219.237/13.68.219.237:51080] Uncaught NullPointerException

This is potentially caused by Jenkins thinking the agent is already connected, although it is not clear why the agent is trying to connect again. The one here was listed as offline after a disconnect but could be brought back online, so we'll see if this continues.

2022-03-22 09:18:54.929+0000 [id=44175] INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #24,372 from /13.68.134.204:56300
2022-03-22 09:18:54.973+0000 [id=44176] INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #24,373 failed: java.io.EOFException
2022-03-22 09:18:55.067+0000 [id=44177] INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #24,374 from /13.68.219.237:51080
2022-03-22 09:18:55.238+0000 [id=43101] INFO    o.j.r.p.i.ConnectionHeadersFilterLayer#onRecv: [JNLP4-connect connection from 13.68.134.204/13.68.134.204:56300] Refusing headers from remote: test-azure-win2012r2-x64-1 is already connected to this master. Rejecting this c
2022-03-22 09:18:55.367+0000 [id=43104] WARNING o.j.r.p.i.SSLEngineFilterLayer#onRecv: [JNLP4-connect connection from 13.68.219.237/13.68.219.237:51080] 
java.lang.NullPointerException
2022-03-22 09:18:55.367+0000 [id=43104] SEVERE  o.j.r.p.impl.NIONetworkLayer#ready: [JNLP4-connect connection from 13.68.219.237/13.68.219.237:51080] Uncaught NullPointerException
java.lang.NullPointerException

@sxa
Member Author

sxa commented Mar 30, 2022

Still happening on these three systems:

   3109 JNLP4-connect connection from 13.68.134.204/13.68.134.204:000000] Refusing headers from remote: test-azure-win2012r2-x64-1 is already connected to this master. Rejecting this connection.
   3263 JNLP4-connect connection from 13.68.219.237/13.68.219.237:000000] Refusing headers from remote: test-azure-win2012r2-x64-3 is already connected to this master. Rejecting this connection.
   2835 JNLP4-connect connection from 20.185.182.137/20.185.182.137:000000] Refusing headers from remote: test-azure-win2019-x64-1 is already connected to this master. Rejecting this connection.
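
The per-agent counts above look like the output of a `sort | uniq -c` style pipeline, but the exact command was not posted. A minimal sketch of one way to get similar counts, assuming the Jenkins log has been saved to a file. The function name and the `jenkins.log` path are hypothetical, and this version prints only the agent name rather than the full message:

```shell
# Sketch only: count "already connected" rejections per agent in a Jenkins log.
# Assumption: $1 is a path to a saved copy of the Jenkins log.
count_already_connected() {
    # Keep only the rejection lines and reduce each one to the agent name
    # that appears between "remote: " and " is already connected".
    sed -n 's/.*Refusing headers from remote: \([^ ]*\) is already connected to this master.*/\1/p' "$1" |
        sort |      # group identical agent names together
        uniq -c |   # prefix each name with its occurrence count
        sort -rn    # most frequently rejected agent first
}

# Example usage (hypothetical path):
#   count_already_connected jenkins.log
```

Reducing each line to the agent name avoids having to normalise the source port, which differs on every connection attempt.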

@Haroon-Khel
Contributor

For test-azure-win2012r2-x64-1, I have changed its work directory to D:\jenkins as there is 32G of free space on the D drive (the machine was offline due to a lack of disk space in C:\Users\jenkins and I could not find anything significant to delete).
Running a sanity job on it to see if it lives
https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/4441/console

@Haroon-Khel
Contributor

test-azure-win2012r2-x64-3 is fine, but is low on disk space. I've changed its Jenkins workspace to E:\jenkins, which has 127G free.
Running a sanity grinder on it
https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/4442/console

@Haroon-Khel
Contributor

Unable to connect to the following:

test-ibmcloud-win2012r2-x64-1
test-equinix-win2012r2-x64-1 (formerly known as test-packet-win2012r2-x64-1)
build-ibmcloud-win2012r2-x64-2

They're unpingable too. Might have to be rebooted via their respective vendor consoles @sxa

@Haroon-Khel
Contributor

test-azure-win2016-x64-1 - low on disk space. I cannot find anything in its workspace folder of significant size to delete. It has a D: drive, but it only has 16G on it. Not enough to be used as a workspace imo

@sxa
Member Author

sxa commented Apr 12, 2022

test-azure-win2016-x64-1 - low on disk space.

If it's low on space keep it offline for now

@sxa
Member Author

sxa commented Apr 12, 2022

test-ibmcloud-win2012r2-x64-1
build-ibmcloud-win2012r2-x64-2

@AdamBrousseau Are you able to initiate a restart of these two systems for us?

@sxa
Member Author

sxa commented Apr 12, 2022

I've triggered a restart of test-equinix-win2012r2-x64-1
EDIT: It's not looking too promising - we may have to give up / reinstall that one ...

@AdamBrousseau
Contributor

rebooted, online.

test-ibmcloud-win2012r2-x64-1
build-ibmcloud-win2012r2-x64-2

@sxa555

sxa555 commented Apr 12, 2022 via email

@Haroon-Khel
Contributor

test-ibmcloud-win2012r2-x64-1 is down again, and unpingable. build-ibmcloud-win2012r2-x64-2 is still up

@sxa
Member Author

sxa commented Apr 13, 2022

FYI I'm no longer seeing the "already connected" messages, so at least for now that seems to have been resolved by things that have changed in the last day.

@sxa
Member Author

sxa commented Apr 14, 2022

@AdamBrousseau Can you see if test-ibmcloud-win2012r2-x64-1 can be recovered again please? We might need to keep an eye on that one - bring it up, look in the event log, then try a reboot and verify whether it comes back again. Ping me on slack when you want to do it so I can try and catch it in case it disappears of its own accord again.

@AdamBrousseau
Contributor

rebooted, online.

test-ibmcloud-win2012r2-x64-1

@sxa
Member Author

sxa commented Apr 14, 2022

It's back and has survived a subsequent reboot to apply security updates 👍
It's running https://ci.adoptopenjdk.net/job/Test_openjdk8_hs_sanity.perf_x86-64_windows/323/console just now

@sxa
Member Author

sxa commented Apr 14, 2022

Although build-ibmcloud-win2012r2-x64-2 seems to have gone offline again :-( Have pinged Adam who will try and recover it again. We should take a close look at the event log when it's back to see if there's any obvious cause

@zdtsw
Contributor

zdtsw commented May 19, 2022

both build-azure-win2012r2-x64-1 and build-ibmcloud-win2012r2-x64-1 went offline
maybe @AdamBrousseau can help restart the 2nd one?

@AdamBrousseau
Contributor

Took a few tries but it's back online.

Feel free to ping me on slack if I don't respond fast enough here.

@sxa sxa removed this from the 2022-04 (April) milestone May 24, 2022
@sxa sxa added this to the 2022-05 (May) milestone May 24, 2022
@sxa sxa pinned this issue Jun 7, 2022
@Haroon-Khel
Contributor

All but test-equinix-win2012r2-x64-1 are back online. The ibmcloud machines need to be restarted every now and then. Thankfully we have @AdamBrousseau to do it, but it would be nice to understand why they are unstable and find a solution.

@sxa
Member Author

sxa commented Jun 7, 2022

@Haroon-Khel Can you see anything in the event log on the ibmcloud systems? Those machines have historically been stable enough for us.

@sxa
Member Author

sxa commented Jul 15, 2022

Current offline systems:

@sxa sxa unpinned this issue Jul 15, 2022
@Haroon-Khel
Contributor

Haroon-Khel commented Jul 15, 2022

In test-azure-win2012r2-x64-1, its C:\Users\jenkins.test-2012r2-1\AppData\Local\Temp directory contains some beefy data files. @sxa Have you seen files like these before?

adoptopenjdk@test-2012r2-1 /cygdrive/c/Users/jenkins.test-2012r2-1/AppData/Local/Temp
$ du -sh * | grep G
2.1G    dst12191381147079577745.dat
2.1G    dst8808757851773313999.dat
2.1G    src10083073893585810729.dat
2.1G    src17662931351367789633.dat
2.1G    src17862466297894479515.dat
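
Note that `du -sh * | grep G` matches any output line containing a capital G, including one where the G is only in the file name. A minimal sketch of a size-based alternative using `find`, assuming a `find` that supports `-maxdepth` (GNU find, as shipped with Cygwin, does). The function name and the example threshold are hypothetical:

```shell
# Sketch only: list regular files directly inside DIR that are larger than
# MIN_BYTES bytes. Does not descend into subdirectories.
list_large_files() {
    dir=$1        # directory to scan, e.g. the Temp directory shown above
    min_bytes=$2  # size threshold in bytes
    # "-size +Nc" selects files strictly larger than N bytes.
    find "$dir" -maxdepth 1 -type f -size +"${min_bytes}c" -print
}

# Example usage (files over 1 GiB in the current directory):
#   list_large_files . 1073741824
```

Filtering on the actual size rather than on the formatted `du` output keeps the result independent of file names and of how `du` rounds.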

@Haroon-Khel
Contributor

They were not created recently

adoptopenjdk@test-2012r2-1 /cygdrive/c/Users/jenkins.test-2012r2-1/AppData/Local/Temp
$ ls -la *.dat
-rwx------+ 1 jenkins None 2147484671 Oct 10  2021 dst12191381147079577745.dat
-rwx------+ 1 jenkins None 2147484671 May  8 03:13 dst8808757851773313999.dat
-rwx------+ 1 jenkins None 2147484671 Oct 10  2021 src10083073893585810729.dat
-rwx------+ 1 jenkins None          0 Jun 19 01:10 src12523888732483928706.dat
-rwx------+ 1 jenkins None          0 Jun 26 01:35 src14450829699699143094.dat
-rwx------+ 1 jenkins None          0 Jul  3 02:00 src16185337324852676813.dat
-rwx------+ 1 jenkins None          0 Jul 10 22:30 src17095945012730825308.dat
-rwx------+ 1 jenkins None 2147484671 May  8 03:13 src17662931351367789633.dat
-rwx------+ 1 jenkins None 2147484671 May  8 04:37 src17862466297894479515.dat
-rwx------+ 1 jenkins None          0 Jul 10 02:01 src17894327494747906074.dat
-rwx------+ 1 jenkins None          0 Jun 26 05:03 src5264114818175128872.dat
-rwx------+ 1 jenkins None          0 Jul 10 04:11 src7319034148243782179.dat

@Haroon-Khel
Contributor

Same on test-azure-win2012r2-x64-3. Some of these files are more recent than the others

AdoptopenJDK@test-2012r2-3 /cygdrive/c/Users/jenkins/AppData/Local/Temp
$ ls -la *.dat
-rwx------+ 1 jenkins None 2147484671 Jul  3 18:26 dst10525283875402494967.dat
-rwx------+ 1 jenkins None 2147484671 Jan  9  2022 dst1090982662380210025.dat
-rwx------+ 1 jenkins None 2147484671 Jan  9  2022 dst14090557438999081470.dat
-rwx------+ 1 jenkins None 2147484671 Oct 17  2021 dst15451742831967023501.dat
-rwx------+ 1 jenkins None 2147484671 May  8 02:44 dst16767680786618213670.dat
-rwx------+ 1 jenkins None 2147484671 Jul  3 00:59 dst5560256765629115735.dat
-rwx------+ 1 jenkins None 2147484671 Jul  4 00:14 src11220301557628620189.dat
-rwx------+ 1 jenkins None 2147484671 Jul  3 18:26 src12371769305297563754.dat
-rwx------+ 1 jenkins None 2147484671 Jul  3 00:59 src1798870126431615614.dat
-rwx------+ 1 jenkins None 2147484671 Jul 10 21:06 src5646674978054390026.dat

@Haroon-Khel
Contributor

Haroon-Khel commented Jul 15, 2022

And on test-azure-win2019-x64-1

adoptopenjdk@test-win2019-1 /cygdrive/c/Users/jenkins/AppData/Local/Temp
$ ls -la *.dat
-rwx------+ 1 jenkins None 5368709122 Feb  7 01:57 LargeGatheringWrite3322105129955012357.dat
-rwx------+ 1 jenkins None 5368709122 Feb 20 08:19 LargeGatheringWrite9933378702831300867.dat
-rwx------+ 1 jenkins None 2147484671 Mar 14 09:43 dst12651648640093188310.dat
-rwx------+ 1 jenkins None 2147484671 Feb  6 10:31 dst183570549372488362.dat
-rwx------+ 1 jenkins None 2147484671 Jun 19 21:07 dst4338419869657228497.dat
-rwx------+ 1 jenkins None 2147484671 Feb 20 08:22 dst5702335999286369047.dat
-rwx------+ 1 jenkins None 2147484671 Jun 19 05:08 dst7728209470019621345.dat
-rwx------+ 1 jenkins None 2147484671 Feb  7 02:00 dst9261951980099793202.dat
-rwx------+ 1 jenkins None 2147484671 Jun 19 21:06 src14810603928291401966.dat
-rwx------+ 1 jenkins None 2147484671 Jun 19 05:06 src2088681814118365386.dat

@sxa
Member Author

sxa commented Jul 15, 2022

Almost certainly from test cases. There are definitely ones that create files of that sort of size outside the workspace (see #2448 for an issue with that on AIX when /tmp was too small - the directory you're looking at is the equivalent of /tmp for Windows).
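
If these are leftovers from finished test runs, one way to find candidates for cleanup is to list the `.dat` files that have not been modified for some number of days. This is a sketch under that assumption only: the thread does not say which files are safe to delete, the function name and the age threshold are hypothetical, and the function lists files without removing anything:

```shell
# Sketch only: list *.dat files directly inside DIR whose last modification
# is more than DAYS days old. Nothing is deleted.
list_stale_dat_files() {
    dir=$1   # directory to scan, e.g. the jenkins user's Temp directory
    days=$2  # minimum age in days
    # "-mtime +N" matches files last modified more than N whole days ago.
    find "$dir" -maxdepth 1 -type f -name '*.dat' -mtime +"$days" -print
}

# Example usage (hypothetical threshold of 30 days):
#   list_stale_dat_files /cygdrive/c/Users/jenkins/AppData/Local/Temp 30
```

Keeping this as a listing step lets someone review the candidates against any running test jobs before deciding what to remove.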

@sxa
Member Author

sxa commented Sep 5, 2022

.dat files covered in the separate issue now. The Packet machine will not be re-enabled and will soon be decommissioned. Closing as the others all seem to be live now.

@sxa sxa closed this as completed Sep 5, 2022