140.163.4.200

Moderators: Site Moderators, FAHC Science Team

Re: 140.163.4.200

Postby JohnChodera » Fri Sep 11, 2020 4:22 pm

Folks: The new work server (pllwskifah1.mskcc.org) ended up in a weird state in which it was not receiving WUs even though the WS appeared to be running normally. We've restarted it, and it's now receiving the backlog of results.

Please let us know if you notice this happening again! We'll also try to keep a close eye on it and try to figure out what went wrong here.

Apologies for this---it might be the new big NFS storage we mounted on the WS to attempt to avoid out-of-space issues.

~ John Chodera // MSKCC
JohnChodera
Pande Group Member
 
Posts: 466
Joined: Fri Feb 22, 2013 10:59 pm

Re: 140.163.4.200

Postby rickoic » Fri Sep 11, 2020 4:38 pm

My backlog is slowly disappearing. I had 7 and now it's down to 3, so progress is being made. Thanks a lot for the fix.
I'm folding because Dec 2005 I had radical prostate surgery.
Lost brother to spinal cancer, brother-in-law to prostate cancer.
Several 1st cousins lost and a few who have survived.
rickoic
 
Posts: 303
Joined: Sat May 23, 2009 5:49 pm
Location: Mississippi near Memphis, Tn

Re: 140.163.4.200

Postby mgetz » Fri Sep 11, 2020 4:44 pm

JohnChodera wrote: Please let us know if you notice this happening again! We'll also try to keep a close eye on it and try to figure out what went wrong here.
~ John Chodera // MSKCC


Can we keep it at zero weight through the weekend unless someone is going to actively keep an eye on it? I'd rather not have my GPUs idled for two days if possible (the science must compute!).
mgetz
 
Posts: 57
Joined: Tue Aug 11, 2020 7:23 pm

Re: 140.163.4.200

Postby rickoic » Fri Sep 11, 2020 4:48 pm

Spoke too soon. This just happened a few minutes ago.

Edit: this problem resolved itself a few minutes later. Just slow.

Code:
15:40:05:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:40:05:WU04:FS01:Assigned to work server 140.163.4.200
15:40:05:WU04:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:40:05:WU04:FS01:Connecting to 140.163.4.200:8080
15:40:26:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:40:26:WU04:FS01:Connecting to 140.163.4.200:80
15:40:48:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
15:40:48:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:40:48:WU04:FS01:Assigned to work server 140.163.4.200
15:40:48:WU04:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:40:48:WU04:FS01:Connecting to 140.163.4.200:8080
15:41:09:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:41:09:WU04:FS01:Connecting to 140.163.4.200:80
15:41:31:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
15:41:47:WU02:FS01:0x22:Completed 1000000 out of 1000000 steps (100%)
15:41:47:WU02:FS01:0x22:Average performance: 83.8835 ns/day
15:41:48:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:41:48:WU04:FS01:Assigned to work server 140.163.4.200
15:41:48:WU04:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:41:48:WU04:FS01:Connecting to 140.163.4.200:8080
15:41:54:WU02:FS01:0x22:Saving result file ..\logfile_01.txt
15:41:54:WU02:FS01:0x22:Saving result file checkpointState.xml.bz2
15:41:55:WU02:FS01:0x22:Saving result file globals.csv
15:41:55:WU02:FS01:0x22:Saving result file positions.xtc
15:41:55:WU02:FS01:0x22:Saving result file science.log
15:41:55:WU02:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
15:41:56:WU02:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
15:41:56:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:13426 run:1456 clone:20 gen:4 core:0x22 unit:0x0000000812bc7d9a5f57207fe28d1881
15:41:56:WU02:FS01:Uploading 5.70MiB to 18.188.125.154
15:41:56:WU02:FS01:Connecting to 18.188.125.154:8080
15:42:02:WU02:FS01:Upload 55.94%
15:42:07:WU02:FS01:Upload complete
15:42:07:WU02:FS01:Server responded WORK_ACK (400)
15:42:07:WU02:FS01:Final credit estimate, 176071.00 points
15:42:07:WU02:FS01:Cleaning up
15:42:09:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:42:09:WU04:FS01:Connecting to 140.163.4.200:80
15:42:31:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
15:43:25:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:43:26:WU04:FS01:Assigned to work server 140.163.4.200
15:43:26:WU04:FS01:Requesting new work unit for slot 01: READY gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:43:26:WU04:FS01:Connecting to 140.163.4.200:8080
15:43:47:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:43:47:WU04:FS01:Connecting to 140.163.4.200:80
15:44:08:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
15:46:02:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:46:02:WU04:FS01:Assigned to work server 140.163.4.200
15:46:03:WU04:FS01:Requesting new work unit for slot 01: READY gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:46:03:WU04:FS01:Connecting to 140.163.4.200:8080
15:46:24:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:46:24:WU04:FS01:Connecting to 140.163.4.200:80
15:46:45:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
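For anyone who wants to tally how often a client failed to reach a server in a log like the one above, here is a minimal Python sketch. The function name `count_connect_failures` and the regular expression are illustrative assumptions based only on the line format shown in this post; they are not part of the Folding@home client.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log excerpt in this post:
# 15:40:48:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: ...
FAILED = re.compile(
    r"^\d\d:\d\d:\d\d:ERROR:WU\d+:FS\d+:Exception: Failed to connect to ([\d.]+):(\d+)"
)

def count_connect_failures(log_text):
    """Return a Counter mapping (host, port) to the number of failed connects."""
    failures = Counter()
    for line in log_text.splitlines():
        match = FAILED.match(line.strip())
        if match:
            host, port = match.groups()
            failures[(host, int(port))] += 1
    return failures
```

Only the ERROR lines are counted. The WARNING lines about port 8080 mark the fallback to port 80 and would need a second pattern if you want to tally those too.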
rickoic
 
Posts: 303
Joined: Sat May 23, 2009 5:49 pm
Location: Mississippi near Memphis, Tn

Re: 140.163.4.200

Postby JohnChodera » Fri Sep 11, 2020 5:14 pm

Looks like the server ended up not accepting 80/8080 again. We're going to keep it on weight 0 for a while to monitor.

~ John Chodera // MSKCC
JohnChodera
Pande Group Member
 
Posts: 466
Joined: Fri Feb 22, 2013 10:59 pm

Re: 140.163.4.200

Postby mgetz » Fri Sep 11, 2020 5:23 pm

JohnChodera wrote: Looks like the server ended up not accepting 80/8080 again. We're going to keep it on weight 0 for a while to monitor.

~ John Chodera // MSKCC

I have two WUs from it right now:
13436 (22, 5, 2)
13433 (63, 0, 2) completed successfully with no retries 157.664 ns/day

I'll report back in when they finish if they upload or not.
mgetz
 
Posts: 57
Joined: Tue Aug 11, 2020 7:23 pm

Re: 140.163.4.200

Postby LazyDev » Fri Sep 11, 2020 9:15 pm

My two work units have since been uploaded. Thanks for fixing this.
LazyDev
 
Posts: 13
Joined: Tue Aug 30, 2016 8:28 pm

Re: 140.163.4.200

Postby mgetz » Fri Sep 11, 2020 9:58 pm

project:13436 run:22 clone:5 gen:2 core:0x22 did upload... but it took forever. Something is seriously messed up with that server.
mgetz
 
Posts: 57
Joined: Tue Aug 11, 2020 7:23 pm

Re: 140.163.4.200

Postby JohnChodera » Sat Sep 12, 2020 3:53 am

Update: it looks like the issue is with an underperforming NFS mount. We're investigating.

Thanks for your patience!

~ John Chodera // MSKCC
JohnChodera
Pande Group Member
 
Posts: 466
Joined: Fri Feb 22, 2013 10:59 pm

Re: 140.163.4.200

Postby hhherby » Mon Dec 28, 2020 4:50 pm

I'm noticing that this is a super slow connection that keeps timing out.
hhherby
 
Posts: 14
Joined: Thu Jan 05, 2017 10:30 pm

Re: 140.163.4.200

Postby hhherby » Sun Jan 03, 2021 8:46 pm

Can anyone even ping this server?

Pinging 140.163.4.200 with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.
hhherby
 
Posts: 14
Joined: Thu Jan 05, 2017 10:30 pm

Re: 140.163.4.200

Postby Joe_H » Sun Jan 03, 2021 9:07 pm

The server is behind the MSKCC firewall, which blocks pings. If you want to check whether the server is up, just enter its IP address into a browser window.
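Since ping is blocked, a TCP connection attempt is a closer match to what the client does. The log earlier in this thread shows the client trying port 8080 first and then port 80. Below is a minimal Python sketch of that check. The helper name `port_is_open` and the 5-second timeout are assumptions for illustration, not anything the client itself uses.

```python
import socket

def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("140.163.4.200", 8080)` and `port_is_open("140.163.4.200", 80)` test the two ports seen in the log. A True result only means the port accepted a connection. It does not show that the server is handing out or accepting work.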

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Joe_H
Site Admin
 
Posts: 7193
Joined: Tue Apr 21, 2009 5:41 pm
Location: W. MA

Re: 140.163.4.200

Postby TristanChen » Mon Jan 04, 2021 8:17 pm

Going to vent here a bit. The collection server (140.163.4.210) tied to this work server has been barely functional for half of December and is still 90% dead today.

I've got no less than 20 completed work units, some days old with 100+ retries, still waiting for the damned server to fix itself.

Can't admins at least set up some sort of redirect?! If 30% of my daily output is just going to be flushed down the drain anyway, then I might as well be running Nicehash...
TristanChen
 
Posts: 21
Joined: Tue May 30, 2017 5:55 am

Re: 140.163.4.200

Postby Neil-B » Mon Jan 04, 2021 9:38 pm

This still happens a bit .. but it is better than April to June last year .. it is worth posting here, as the core team can get the message to the people who look after each impacted server .. over weekends/holidays issues can be more noticeable, and some of the servers are in different timezones where getting responses can be trickier
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
Neil-B
 
Posts: 1980
Joined: Sun Mar 22, 2020 6:52 pm
Location: UK

Re: 140.163.4.200

Postby PantherX » Tue Jan 05, 2021 3:48 am

FYI, the CS 140.163.4.210 has an uptime of about 1 hour, so it was recently rebooted. I am aware that work is being done on it to improve certain aspects.

BTW, redirection will not work with the current setup. The WU will try to reach either the WS or the CS (if one is defined), and both are fixed when the client downloads the WU. There's no way to dynamically update that information on the WU end.
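To illustrate why a redirect cannot help, here is a rough Python sketch of the idea. It is a hypothetical model, not the client's actual data structures: the `WorkUnit` class, its field names, and the WS-then-CS ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class WorkUnit:
    """Hypothetical model: server addresses are fixed when the WU is downloaded."""
    work_server: str
    collection_server: Optional[str] = None  # only set if a CS is defined

def upload_targets(wu: WorkUnit) -> List[str]:
    """Return the only servers this WU can send its results to."""
    targets = [wu.work_server]  # assumed order: WS first, then CS
    if wu.collection_server is not None:
        targets.append(wu.collection_server)
    return targets
```

With `WorkUnit("140.163.4.200", "140.163.4.210")`, the only possible targets are those two addresses. Because the record is frozen in this sketch, nothing done on the server side later can add a third target to a WU that is already on a client.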
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
PantherX
Site Moderator
 
Posts: 7020
Joined: Wed Dec 23, 2009 10:33 am
Location: Land Of The Long White Cloud
