SteveWillis wrote: Mine has been running all day without missing a beat, and the script hasn't triggered the pause/unpause even once. I'm going to show you my firewall settings. I messed around with them some, and maybe they'll be of some help.
Status: active

To                  Action      From
--                  ------      ----
Anywhere            REJECT      171.67.108.105
Anywhere            ALLOW       171.67.108.102
Anywhere            REJECT OUT  171.67.108.105
Anywhere            ALLOW OUT   171.67.108.102
171.67.108.102      ALLOW OUT   Anywhere
171.67.108.105      REJECT OUT  Anywhere
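For anyone wanting to reproduce rules like the ones above, here is a rough sketch of the corresponding ufw commands (assumptions: a Linux host with ufw installed, the *.105 address being the problem work server and *.102 a working one; run as root and verify against your own setup):

```shell
# Block the misbehaving work server in both directions (assumes ufw).
sudo ufw reject out to 171.67.108.105
sudo ufw reject in from 171.67.108.105

# Explicitly allow the working server. ufw usually allows outbound
# traffic by default, so these lines are belt-and-braces.
sudo ufw allow out to 171.67.108.102
sudo ufw allow in from 171.67.108.102

# Verify the resulting rule table.
sudo ufw status
```

Once the server issue is fixed, the reject rules can be removed with `sudo ufw delete reject out to 171.67.108.105` and the matching inbound rule.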
I see you also reject outgoing traffic to 171.67.108.105?
@foldy, thank you for the workaround. I think I'll keep this rule in place even after the server issue is fixed. Why? Because if the researcher(s) who own the project(s) on this server can't be bothered to fix the issue on a weekend, then all of my 41 slots of high-performance GPUs will go to other projects.
In the real world outside of academia, the individuals responsible for wasting untold thousands of hours of unrecoverable donor time would be unceremoniously fired. Sure, there is tremendous demand for computational biologists and data scientists, but two questions any prospective employer should ask a candidate coming from Stanford PL/FAH are these: "In your research work, did you ever have an instance where one of your computational projects became disabled? After being notified, what specifically did you do to remedy the situation, and how long did it take to get your project restored?" If the answer has any tinge of procrastination where they weren't on it like white-on-rice, and/or (as is the case here) crickets when it comes to notifying donors of the issue and what's being done to fix it, then move on to the next candidate.
Spending the better part of the weekend pausing and unpausing slots hoping to get a different server is a Whiskey Tango Foxtrot situation.
I'm not using any form of workaround and I've had no problem with assignments since yesterday ... a few "Server did not assign ..." but only one or two at a time and work continued pretty much immediately. I have not used the Pause/fold sequence once today.
Just to echo PS3EdOlkkola, there is a serious lack of respect and responsibility being shown in the handling of this issue.
I was ill at the weekend and literally crawling from room to room to sort out stuck clients and firewall rules - that's how seriously I take my 'responsibility' to donate to this project - I just wish there was some evidence of likewise at Stanford.
They have an amazing petaflop scale resource at their disposal, but just because they aren't paying for it doesn't mean it should be taken for granted.
Hardware configuration:

PC 1: Linux Mint 17.3
- Three GTX 1080 GPUs, one on a powered header
- Motherboard: Asus Sabertooth 990FX [MB-AM3-AS-SB-990FXR2] (+$59.99)
- CPU: AMD FX-8320 eight-core 3.5 GHz [CPU-AM3-FX-8320BR] (+$41.99)

PC 2: Linux Mint 18, open-air case
- Motherboard: ASUS Crosshair V Formula-Z AM3+ AMD 990FX SATA 6Gb/s USB 3.0 ATX
- CPU: AMD FX-6300 six-core Black Edition (FD6300WMHKBOX) with Cooler Master Hyper 212 EVO cooler (120mm PWM fan)
- Three GTX 1080s plus one GTX 1080 Ti on a powered header
I should mention that my older machine has also not had any problem at all. Only my newer machine had the problem. I mentioned it earlier but didn't bother to include my log.
I have been having trouble with the *.105 server for quite some time now. The server shows as not online, as in down, missing, useless, and more than just a little annoying, as this has been going on for over a day, maybe even longer. I have 6 computers with lots of GTX 1080s on them that keep running into this. I'm on #3 right now, and here is the log. Oh, well. I guess for this one, never mind: all the GPUs are working now. It's just the CPU slot that is stuck. It was stuck on that server before, but now it's reading all zeros. I just installed the newest version of V7 on this machine, because it was having horrible problems. I'll have to go through the other 5 and see, but I'll still show you the log.
03:46:50:Adding folding slot 03: READY gpu:4:GP104 [GeForce GTX 1080] 8873
03:46:50:Saving configuration to /etc/fahclient/config.xml
03:46:50:<config>
03:46:50: <!-- Network -->
03:46:50: <proxy v=':8080'/>
03:46:50:
03:46:50: <!-- Slot Control -->
03:46:50: <power v='full'/>
03:46:50:
03:46:50: <!-- User Information -->
03:46:50: <passkey v='********************************'/>
03:46:50: <user v='RABishop'/>
03:46:50:
03:46:50: <!-- Folding Slots -->
03:46:50: <slot id='0' type='CPU'/>
03:46:50: <slot id='1' type='GPU'/>
03:46:50: <slot id='2' type='GPU'/>
03:46:50: <slot id='3' type='GPU'/>
03:46:50:</config>
03:46:51:WU00:FS03:Connecting to 171.67.108.45:80
03:46:51:WU00:FS03:Assigned to work server 171.67.108.105
03:46:51:WU00:FS03:Requesting new work unit for slot 03: READY gpu:4:GP104 [GeForce GTX 1080] 8873 from 171.67.108.105
03:46:51:WU00:FS03:Connecting to 171.67.108.105:8080
03:46:51:ERROR:WU00:FS03:Exception: Server did not assign work unit
03:46:51:WU00:FS03:Connecting to 171.67.108.45:80
03:46:51:WU00:FS03:Assigned to work server 171.67.108.105
03:46:51:WU00:FS03:Requesting new work unit for slot 03: READY gpu:4:GP104 [GeForce GTX 1080] 8873 from 171.67.108.105
03:46:51:WU00:FS03:Connecting to 171.67.108.105:8080
03:46:52:ERROR:WU00:FS03:Exception: Server did not assign work unit
03:47:02:WU01:FS01:0x21:Completed 100000 out of 2500000 steps (4%)
03:47:13:WU02:FS02:0x21:Completed 72000 out of 2400000 steps (3%)
03:47:46:WU01:FS01:0x21:Completed 125000 out of 2500000 steps (5%)
03:47:47:Saving configuration to /etc/fahclient/config.xml
03:47:47:<config>
03:47:47: <!-- Network -->
03:47:47: <proxy v=':8080'/>
03:47:47:
03:47:47: <!-- Slot Control -->
03:47:47: <power v='full'/>
03:47:47:
03:47:47: <!-- User Information -->
03:47:47: <passkey v='********************************'/>
03:47:47: <user v='RABishop'/>
03:47:47:
03:47:47: <!-- Folding Slots -->
03:47:47: <slot id='0' type='CPU'/>
03:47:47: <slot id='1' type='GPU'/>
03:47:47: <slot id='2' type='GPU'/>
03:47:47: <slot id='3' type='GPU'/>
03:47:47:</config>
03:47:51:WU00:FS03:Connecting to 171.67.108.45:80
03:47:51:WU00:FS03:Assigned to work server 171.67.108.105
03:47:51:WU00:FS03:Requesting new work unit for slot 03: READY gpu:4:GP104 [GeForce GTX 1080] 8873 from 171.67.108.105
03:47:51:WU00:FS03:Connecting to 171.67.108.105:8080
03:47:52:ERROR:WU00:FS03:Exception: Server did not assign work unit
03:47:52:WU02:FS02:0x21:Completed 96000 out of 2400000 steps (4%)
03:48:12:WU03:FS00:Connecting to 171.67.108.45:8080
03:48:12:WARNING:WU03:FS00:Failed to get assignment from '171.67.108.45:8080': Empty work server assignment
03:48:12:WU03:FS00:Connecting to 171.64.65.35:80
03:48:12:WARNING:WU03:FS00:Failed to get assignment from '171.64.65.35:80': Empty work server assignment
03:48:12:ERROR:WU03:FS00:Exception: Could not get an assignment
03:48:29:WU01:FS01:0x21:Completed 150000 out of 2500000 steps (6%)
03:48:31:WU02:FS02:0x21:Completed 120000 out of 2400000 steps (5%)
03:49:09:WU02:FS02:0x21:Completed 144000 out of 2400000 steps (6%)
03:49:13:WU01:FS01:0x21:Completed 175000 out of 2500000 steps (7%)
03:49:28:WU00:FS03:Connecting to 171.67.108.45:80
03:49:29:WU00:FS03:Assigned to work server 171.67.108.105
03:49:29:WU00:FS03:Requesting new work unit for slot 03: READY gpu:4:GP104 [GeForce GTX 1080] 8873 from 171.67.108.105
03:49:29:WU00:FS03:Connecting to 171.67.108.105:8080
03:49:29:ERROR:WU00:FS03:Exception: Server did not assign work unit
03:49:48:WU02:FS02:0x21:Completed 168000 out of 2400000 steps (7%)
03:49:56:WU01:FS01:0x21:Completed 200000 out of 2500000 steps (8%)
03:50:27:WU02:FS02:0x21:Completed 192000 out of 2400000 steps (8%)
03:50:41:WU01:FS01:0x21:Completed 225000 out of 2500000 steps (9%)
03:50:49:WU03:FS00:Connecting to 171.67.108.45:8080
03:50:49:WARNING:WU03:FS00:Failed to get assignment from '171.67.108.45:8080': Empty work server assignment
03:50:49:WU03:FS00:Connecting to 171.64.65.35:80
03:50:49:WARNING:WU03:FS00:Failed to get assignment from '171.64.65.35:80': Empty work server assignment
03:50:49:ERROR:WU03:FS00:Exception: Could not get an assignment
03:51:09:WU02:FS02:0x21:Completed 216000 out of 2400000 steps (9%)
03:51:25:WU01:FS01:0x21:Completed 250000 out of 2500000 steps (10%)
03:51:49:WU02:FS02:0x21:Completed 240000 out of 2400000 steps (10%)
03:52:06:WU00:FS03:Connecting to 171.67.108.45:80
03:52:06:WU00:FS03:Assigned to work server 171.67.108.105
03:52:06:WU00:FS03:Requesting new work unit for slot 03: READY gpu:4:GP104 [GeForce GTX 1080] 8873 from 171.67.108.105
03:52:06:WU00:FS03:Connecting to 171.67.108.105:8080
03:52:06:ERROR:WU00:FS03:Exception: Server did not assign work unit
03:52:09:WU01:FS01:0x21:Completed 275000 out of 2500000 steps (11%)
03:52:28:WU02:FS02:0x21:Completed 264000 out of 2400000 steps (11%)
03:52:53:WU01:FS01:0x21:Completed 300000 out of 2500000 steps (12%)
03:53:08:WU02:FS02:0x21:Completed 288000 out of 2400000 steps (12%)
03:53:37:WU01:FS01:0x21:Completed 325000 out of 2500000 steps (13%)
03:53:47:WU02:FS02:0x21:Completed 312000 out of 2400000 steps (13%)
03:54:21:WU01:FS01:0x21:Completed 350000 out of 2500000 steps (14%)
03:54:26:WU02:FS02:0x21:Completed 336000 out of 2400000 steps (14%)
03:55:03:WU03:FS00:Connecting to 171.67.108.45:8080
03:55:03:WARNING:WU03:FS00:Failed to get assignment from '171.67.108.45:8080': Empty work server assignment
03:55:03:WU03:FS00:Connecting to 171.64.65.35:80
03:55:04:WARNING:WU03:FS00:Failed to get assignment from '171.64.65.35:80': Empty work server assignment
03:55:04:ERROR:WU03:FS00:Exception: Could not get an assignment
03:55:05:WU02:FS02:0x21:Completed 360000 out of 2400000 steps (15%)
03:55:05:WU01:FS01:0x21:Completed 375000 out of 2500000 steps (15%)
03:55:44:WU02:FS02:0x21:Completed 384000 out of 2400000 steps (16%)
03:55:49:WU01:FS01:0x21:Completed 400000 out of 2500000 steps (16%)
03:56:20:WU00:FS03:Connecting to 171.67.108.45:80
03:56:20:WU00:FS03:Assigned to work server 171.67.108.160
03:56:20:WU00:FS03:Requesting new work unit for slot 03: READY gpu:4:GP104 [GeForce GTX 1080] 8873 from 171.67.108.160
03:56:20:WU00:FS03:Connecting to 171.67.108.160:8080
03:56:23:WU00:FS03:Downloading 2.02MiB
03:56:23:WU00:FS03:Download complete
03:56:23:WU00:FS03:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:9839 run:0 clone:26 gen:205 core:0x21 unit:0x000000eeab436ca05890cac62307f54f
03:56:23:WU00:FS03:Starting
03:56:23:WU00:FS03:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/fahwebx.stanford.edu/cores/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 00 -suffix 01 -version 704 -lifeline 1459 -checkpoint 15 -gpu 2 -gpu-vendor nvidia
03:56:23:WU00:FS03:Started FahCore on PID 4042
03:56:23:WU00:FS03:Core PID:4046
03:56:23:WU00:FS03:FahCore 0x21 started
03:56:24:WU02:FS02:0x21:Completed 408000 out of 2400000 steps (17%)
03:56:24:WU00:FS03:0x21:*********************** Log Started 2017-05-30T03:56:23Z ***********************
03:56:24:WU00:FS03:0x21:Project: 9839 (Run 0, Clone 26, Gen 205)
03:56:24:WU00:FS03:0x21:Unit: 0x000000eeab436ca05890cac62307f54f
03:56:24:WU00:FS03:0x21:CPU: 0x00000000000000000000000000000000
03:56:24:WU00:FS03:0x21:Machine: 3
03:56:24:WU00:FS03:0x21:Reading tar file core.xml
03:56:24:WU00:FS03:0x21:Reading tar file integrator.xml
03:56:24:WU00:FS03:0x21:Reading tar file state.xml
03:56:24:WU00:FS03:0x21:Reading tar file system.xml
03:56:24:WU00:FS03:0x21:Digital signatures verified
03:56:24:WU00:FS03:0x21:Folding@home GPU Core21 Folding@home Core
03:56:24:WU00:FS03:0x21:Version 0.0.18
03:56:27:WU00:FS03:0x21:Completed 0 out of 2400000 steps (0%)
03:56:27:WU00:FS03:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
03:56:33:WU01:FS01:0x21:Completed 425000 out of 2500000 steps (17%)
03:57:03:WU02:FS02:0x21:Completed 432000 out of 2400000 steps (18%)
03:57:03:WU00:FS03:0x21:Completed 24000 out of 2400000 steps (1%)
03:57:17:WU01:FS01:0x21:Completed 450000 out of 2500000 steps (18%)
03:57:39:WU00:FS03:0x21:Completed 48000 out of 2400000 steps (2%)
03:57:41:WU02:FS02:0x21:Completed 456000 out of 2400000 steps (19%)
03:58:01:WU01:FS01:0x21:Completed 475000 out of 2500000 steps (19%)
03:58:15:WU00:FS03:0x21:Completed 72000 out of 2400000 steps (3%)
03:58:20:WU02:FS02:0x21:Completed 480000 out of 2400000 steps (20%)
03:58:45:WU01:FS01:0x21:Completed 500000 out of 2500000 steps (20%)
03:58:51:WU00:FS03:0x21:Completed 96000 out of 2400000 steps (4%)
03:58:59:WU02:FS02:0x21:Completed 504000 out of 2400000 steps (21%)
03:59:28:WU00:FS03:0x21:Completed 120000 out of 2400000 steps (5%)
03:59:30:WU01:FS01:0x21:Completed 525000 out of 2500000 steps (21%)
03:59:38:WU02:FS02:0x21:Completed 528000 out of 2400000 steps (22%)
04:00:04:WU00:FS03:0x21:Completed 144000 out of 2400000 steps (6%)
04:00:13:WU01:FS01:0x21:Completed 550000 out of 2500000 steps (22%)
04:00:17:WU02:FS02:0x21:Completed 552000 out of 2400000 steps (23%)
04:00:40:WU00:FS03:0x21:Completed 168000 out of 2400000 steps (7%)
04:00:55:WU02:FS02:0x21:Completed 576000 out of 2400000 steps (24%)
04:00:57:WU01:FS01:0x21:Completed 575000 out of 2500000 steps (23%)
04:01:17:WU00:FS03:0x21:Completed 192000 out of 2400000 steps (8%)
04:01:34:WU02:FS02:0x21:Completed 600000 out of 2400000 steps (25%)
04:01:41:WU01:FS01:0x21:Completed 600000 out of 2500000 steps (24%)
04:01:53:WU00:FS03:0x21:Completed 216000 out of 2400000 steps (9%)
04:01:55:WU03:FS00:Connecting to 171.67.108.45:8080
04:01:55:WARNING:WU03:FS00:Failed to get assignment from '171.67.108.45:8080': Empty work server assignment
04:01:55:WU03:FS00:Connecting to 171.64.65.35:80
04:01:55:WARNING:WU03:FS00:Failed to get assignment from '171.64.65.35:80': Empty work server assignment
04:01:55:ERROR:WU03:FS00:Exception: Could not get an assignment
04:02:13:WU02:FS02:0x21:Completed 624000 out of 2400000 steps (26%)
04:02:26:WU01:FS01:0x21:Completed 625000 out of 2500000 steps (25%)
04:02:29:WU00:FS03:0x21:Completed 240000 out of 2400000 steps (10%)
PS3EdOlkkola's sentiments are spot on. Letting scores of petaflops of research computing power go to waste is a damn shame in any setting. Savvy donors will take their computing power elsewhere. The guys who keep coming back to prod for action are the guys who care... If you look at the top 10 folders in the world, nearly all of them have had their output cut by a third to a half. Some (like the #2 folder in the world, msi_TW) have quit entirely since the issue began.
Judging from my 24 GPUs, the 171.67.108.45/171.67.108.105 pair is causing pretty much 100% of the issues, and has been for the past weekend. Even after implementing the firewall rules suggested on this forum, FAH still calls the server and stalls. I've had to manually shut down the client and restart multiple times (sometimes as many as 10 times) until the client stops calling 171.67.108.45. As of an hour ago, 80% of WU requests from my computers at home are still pointing to 171.67.108.45, and this appears to be especially the case for the high-end Nvidia 10x-series cards on my most productive machines. Over an entire unsupervised day, the failure rate of these cards is basically 100%.
My father died of cancer 3 years ago. Since then, I've dedicated myself to becoming a FAH billionaire. Currently running 8 1080Tis, 2 1080s, 9 1070s, 1 1060, 1 690, 1 7970, 1 670, and baby r7-260 through several different accounts... And yes, to echo PS3EdOlkkola, this past weekend has been a whiskey-tango-foxtrot episode from the twilight zone.
Just reconfirming that the 171.67.108.45/171.67.108.105 combo is still causing nearly 100% of my Nvidia 10x-series cards to stall this morning. It seems to be the default server for these cards and is called 8 out of 10 times, meaning that over 24 unsupervised hours, folding slots with these newer Nvidia cards will stall nearly 100% of the time. On Windows 10, the previously suggested firewall settings did not fix the issue for me, so I've had to manually shut down and relaunch FAHClient sometimes as many as 10 times before a different, working server is called. Really, really frustrating. It looks like some of our top folders have already quit (e.g. msi_TW), and the output of nearly everyone in the top 10 worldwide has been cut by a third to a half. Quite the fiasco.
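For reference, the Windows 10 equivalent of the Linux ufw block would be netsh advfirewall rules like the following (a sketch under assumptions: run from an Administrator command prompt, and the rule names are my own invention; as noted above, a firewall block did not stop the client stalling for everyone):

```shell
:: Block the stalled work server in both directions.
netsh advfirewall firewall add rule name="FAH block 105 out" dir=out action=block remoteip=171.67.108.105
netsh advfirewall firewall add rule name="FAH block 105 in" dir=in action=block remoteip=171.67.108.105

:: Remove the rules once the server is fixed:
netsh advfirewall firewall delete rule name="FAH block 105 out"
netsh advfirewall firewall delete rule name="FAH block 105 in"
```

Note that blocking the server only stops the client from reaching it; the assignment server may keep handing out the same address, which matches the stalling behavior described above.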
In my case, foldy's workaround initially worked for a full day yesterday (thank you, foldy!).
However, now it's not working.
I tried it over and over again and even restarted my system.
I did the pause-and-unpause method and am still being directed to that *.105 server.
I hope someone from Stanford will be able to rectify this issue.
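For anyone doing the pause/unpause cycle by hand across many slots, it can be scripted against FAHClient's remote-control interface. This is a minimal sketch under assumptions: the client is running locally with its command socket enabled on the default port 36330, access is allowed from 127.0.0.1 without a password, and plain `pause`/`unpause` commands are accepted; adjust host, port, and authentication to your own setup.

```python
import socket
import time

FAH_HOST = "127.0.0.1"  # assumption: FAHClient runs on this machine
FAH_PORT = 36330        # FAHClient's default remote-command port

def cycle_commands():
    """The two commands sent per cycle, in order."""
    return ["pause", "unpause"]

def pause_unpause(host=FAH_HOST, port=FAH_PORT, delay_s=10):
    """Send 'pause', wait, then 'unpause', so the client asks the
    assignment server again and (with luck) lands on a working server."""
    pause_cmd, unpause_cmd = cycle_commands()
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall((pause_cmd + "\n").encode())
        time.sleep(delay_s)  # give the client time to settle
        conn.sendall((unpause_cmd + "\n").encode())

# usage (requires a running FAHClient with the command socket enabled):
#   pause_unpause()
```

Run it from cron or a loop to avoid crawling from machine to machine by hand; there is no guarantee the re-assignment avoids the broken server, it just automates the retry.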
@Adam A. Wanderer: 41 GPU folding slots plus 1 Intel Phi 7210 (256 CPUs) deliver between 34 and 35 million points per day when there are no issues. The last two days have been between 40% and 75% of the normal total, and only that high because I've had to stay on top of pausing/unpausing slots every couple of hours since Friday afternoon.
This whole episode is sad and disrespectful to the people that contribute to this project. Two questions come to mind:
1) Why do the volunteer contributors have a sense of urgency while those responsible for the project do not?
2) These problems ALWAYS seem to happen over holiday weekends. Is everything so fragile that it fails whenever no one is there to hand-hold the systems and keep them running?