
171.67.108.*** Problems getting workunits

Posted: Sat Mar 19, 2016 9:54 am
by lysistrata
I'm looking for a little help with getting work units. Apologies if I'm in the wrong place - if so could Mods please move it?

I'm running a machine as spec'd below:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 4
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 16
Model: 9
Model name: AMD Opteron(tm) Processor 6128
Stepping: 1
CPU MHz: 1999.958
BogoMIPS: 4000.10
Virtualization: AMD-V
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 5118K
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s): 4-7
NUMA node2 CPU(s): 8-11
NUMA node3 CPU(s): 12-15
NUMA node4 CPU(s): 16-19
NUMA node5 CPU(s): 20-23
NUMA node6 CPU(s): 24-27
NUMA node7 CPU(s): 28-31

My machine has recently had trouble with getting work units (snippet of log below).
******************************* Date: 2016-03-11 *******************************
******************************* Date: 2016-03-12 *******************************
******************************* Date: 2016-03-12 *******************************
******************************* Date: 2016-03-12 *******************************
******************************* Date: 2016-03-12 *******************************
******************************* Date: 2016-03-13 *******************************
******************************* Date: 2016-03-13 *******************************
******************************* Date: 2016-03-13 *******************************
20:47:23:WARNING:WU00:FS00:Server did not like results, dumping
******************************* Date: 2016-03-13 *******************************
22:06:39:WARNING:WU01:FS00:Server did not like results, dumping
23:18:13:WARNING:WU00:FS00:Server did not like results, dumping
00:25:10:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.45:8080': Empty work server assignment
00:25:11:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment
00:25:11:ERROR:WU00:FS00:Exception: Could not get an assignment
00:25:12:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.45:8080': Empty work server assignment
00:25:12:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment
00:25:12:ERROR:WU00:FS00:Exception: Could not get an assignment
00:26:12:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.45:8080': Empty work server assignment
00:26:13:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment
00:26:13:ERROR:WU00:FS00:Exception: Could not get an assignment
00:33:33:WARNING:WU01:FS00:Server did not like results, dumping
01:49:09:WARNING:WU00:FS00:Server did not like results, dumping
02:52:02:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.45:8080': Empty work server assignment
02:52:02:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment
02:52:02:ERROR:WU00:FS00:Exception: Could not get an assignment


...etc etc...


******************************* Date: 2016-03-15 *******************************
06:29:52:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.45:8080': Empty work server assignment
06:29:52:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment
06:29:52:ERROR:WU00:FS00:Exception: Could not get an assignment
******************************* Date: 2016-03-15 *******************************
12:29:52:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.45:8080': Empty work server assignment
12:29:52:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment
12:29:52:ERROR:WU00:FS00:Exception: Could not get an assignment
******************************* Date: 2016-03-15 *******************************
******************************* Date: 2016-03-16 *******************************
******************************* Date: 2016-03-16 *******************************
******************************* Date: 2016-03-16 *******************************
******************************* Date: 2016-03-16 *******************************
******************************* Date: 2016-03-17 *******************************
14:19:09:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.45:8080': Empty work server assignment
14:19:09:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment
14:19:09:ERROR:WU00:FS00:Exception: Could not get an assignment
14:19:11:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.45:8080': Empty work server assignment
14:19:12:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment
14:19:12:ERROR:WU00:FS00:Exception: Could not get an assignment
14:20:11:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.45:8080': Empty work server assignment
14:20:12:WARNING:WU00:FS00:Failed to get assignment from '171.67.108.204:80': Empty work server assignment
14:20:12:ERROR:WU00:FS00:Exception: Could not get an assignment
******************************* Date: 2016-03-17 *******************************
******************************* Date: 2016-03-18 *******************************
******************************* Date: 2016-03-18 *******************************
******************************* Date: 2016-03-18 *******************************
******************************* Date: 2016-03-18 *******************************
******************************* Date: 2016-03-19 *******************************
******************************* Date: 2016-03-19 *******************************
This is typical of performance before the work units stopped:
14:34:07:WU01:FS00:0xa4:Completed 80000 out of 80000 steps (100%)
14:34:09:WU01:FS00:0xa4:DynamicWrapper: Finished Work Unit: sleep=10000
14:34:13:WU00:FS00:Download 69.15%
14:34:19:WU00:FS00:Download 74.16%
14:34:19:WU01:FS00:0xa4:
14:34:19:WU01:FS00:0xa4:Finished Work Unit:
14:34:19:WU01:FS00:0xa4:- Reading up to 4117896 from "01/wudata_01.trr": Read 4117896
14:34:19:WU01:FS00:0xa4:trr file hash check passed.
14:34:19:WU01:FS00:0xa4:- Reading up to 3189560 from "01/wudata_01.xtc": Read 3189560
14:34:19:WU01:FS00:0xa4:xtc file hash check passed.
14:34:19:WU01:FS00:0xa4:edr file hash check passed.
14:34:19:WU01:FS00:0xa4:logfile size: 19947
14:34:19:WU01:FS00:0xa4:Leaving Run
14:34:20:WU01:FS00:0xa4:- Writing 7329795 bytes of core data to disk...
14:34:22:WU01:FS00:0xa4:Done: 7329283 -> 7058738 (compressed to 96.3 percent)
14:34:22:WU01:FS00:0xa4: ... Done.
14:34:26:WU00:FS00:Download 81.17%
14:34:32:WU00:FS00:Download 89.19%
14:34:37:WU00:FS00:Download complete
14:34:37:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:9752 run:1275 clone:0 gen:720 core:0xa4 unit:0x00000389ab40416355417363f2a22f96
14:40:36:WU01:FS00:0xa4:- Shutting down core
14:40:36:WU01:FS00:0xa4:
14:40:36:WU01:FS00:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
14:41:31:WU01:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
14:41:31:WU01:FS00:Sending unit results: id:01 state:SEND error:NO_ERROR project:9752 run:572 clone:0 gen:808 core:0xa4 unit:0x0000041aab4041635541726963e76164
14:41:31:WU01:FS00:Uploading 6.73MiB to 171.64.65.99
14:41:31:WU01:FS00:Connecting to 171.64.65.99:8080
14:41:31:WU00:FS00:Starting
14:41:31:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 00 -suffix 01 -version 704 -lifeline 1530 -checkpoint 15 -np 32
14:41:31:WU00:FS00:Started FahCore on PID 47248
14:41:31:WU00:FS00:Core PID:47252
14:41:31:WU00:FS00:FahCore 0xa4 started
14:41:31:WU00:FS00:0xa4:
14:41:31:WU00:FS00:0xa4:*------------------------------*
14:41:31:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
14:41:31:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
14:41:31:WU00:FS00:0xa4:
14:41:31:WU00:FS00:0xa4:Preparing to commence simulation
14:41:31:WU00:FS00:0xa4:- Looking at optimizations...
14:41:31:WU00:FS00:0xa4:- Created dyn
14:41:31:WU00:FS00:0xa4:- Files status OK
14:41:32:WU00:FS00:0xa4:- Expanded 6539191 -> 22431316 (decompressed 343.0 percent)
14:41:32:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=6539191 data_size=22431316, decompressed_data_size=22431316 diff=0
14:41:32:WU00:FS00:0xa4:- Digital signature verified
14:41:32:WU00:FS00:0xa4:
14:41:32:WU00:FS00:0xa4:Project: 9752 (Run 1275, Clone 0, Gen 720)
14:41:32:WU00:FS00:0xa4:

I wondered if there might be a lack of units for 32 cores so I implemented multiple slots of 12, 12, 8, leaving the initial slot set to -1. I immediately got work units, however the PPD appears much less than before.

Could somebody please help me work out what's going on and, if possible, how to resolve it. This machine, unlike my others, is dedicated and so I'd like to use it for the maximum benefit for the folding project. I'd appreciate any input on config.

Thanks very much.

L

Re: 171.67.108.*** Problems getting workunits

Posted: Sat Mar 19, 2016 2:46 pm
by bruce
From time to time, servers do run out of projects for a particular hardware configuration (in your case, CPU:32). The project owners should be aware of the problem and fix it, but sometimes they're not paying close attention.

The messages from servers 171.67.108.45 and 171.67.108.204: "Empty work server assignment" do indicate that the Assignment Servers could not find any Work Server that had work that could be assigned to you. Unfortunately that's not something you can solve yourself, except as you have already done: reduce the number of CPUs until you do get assignments -- and post here to get help.

In this case neither you nor I have access to any information about which projects can be assigned to a particular type of hardware (in this case, CPU:32). I can often track down somebody who can fix the problem but I have to get a lot of my information from posts like yours. [THANK YOU FOR A VERY DETAILED REPORT, including what your log looked like when it was working.]

I'll dig into my list of contacts and see if I can find somebody who can fix it.
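For anyone who wants to quantify how often this is happening before posting, here is a minimal Python sketch that tallies the "Empty work server assignment" warnings per date and assignment server. It assumes the v7 log layout shown in the first post (a `Date:` banner line followed by `HH:MM:SS:WARNING:...` lines); the function name `tally_empty_assignments` is made up for illustration and is not part of FAHClient.

```python
import re
from collections import Counter

# Banner lines look like: "******* Date: 2016-03-15 *******"
DATE_RE = re.compile(r"\*+ Date: (\d{4}-\d{2}-\d{2}) \*+")
# Warning lines as they appear in the log quoted in this thread.
FAIL_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:WARNING:WU\d+:FS\d+:"
    r"Failed to get assignment from '([^']+)': Empty work server assignment"
)

def tally_empty_assignments(lines):
    """Count empty-assignment warnings per (date, server).

    Assumption: each warning belongs to the most recent Date banner.
    The banner is only printed now and then, so entries just after
    midnight may be counted under the previous date.
    """
    counts = Counter()
    current_date = None
    for raw in lines:
        line = raw.strip()
        date_match = DATE_RE.search(line)
        if date_match:
            current_date = date_match.group(1)
            continue
        fail_match = FAIL_RE.match(line)
        if fail_match:
            counts[(current_date, fail_match.group(1))] += 1
    return counts
```

Feeding it the lines of log.txt (for example `tally_empty_assignments(open("log.txt"))`) gives a Counter keyed by date and server address, which is easier to paste into a report than pages of raw log.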

Re: 171.67.108.*** Problems getting workunits

Posted: Mon Mar 21, 2016 9:09 am
by lysistrata
That's great, Bruce. Thanks very much for your support and for responding so quickly.

L :)

Re: 171.67.108.*** Problems getting workunits

Posted: Fri Apr 01, 2016 11:34 pm
by donkom
I'm having the same issue, same servers, and it's been happening for a while. I occasionally get a WU, but they are few and far between.

Any love for the CPU:32 Windows folks?

Re: 171.67.108.*** Problems getting workunits

Posted: Sat Apr 02, 2016 9:30 am
by Ricky
I think the last project (7528) that could be run with more than 24 threads just ran out. I went 9 hours without an assignment. So, I just gave up and went to 24 threads.

Re: 171.67.108.*** Problems getting workunits

Posted: Sat Apr 02, 2016 7:41 pm
by toTOW
donkom wrote:Any love for the CPU:32 Windows folks?
You could still try to set up multiple slots with fewer threads ... maybe two 16-thread slots, or four 8-thread slots (or any other combination you like).
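As a rough illustration of that splitting, the Python sketch below divides a machine's threads into the fewest slots that stay at or under a chosen cap, keeping the slots as even as possible. The cap itself is an assumption: 24 is only what posters in this thread reported as still getting work, and the helper name `split_threads` is hypothetical. The sketch does not check whether a given thread count is actually assignable.

```python
def split_threads(total, max_per_slot):
    """Split `total` CPU threads into the fewest slots of at most
    `max_per_slot` threads each, with sizes as even as possible."""
    if total <= 0 or max_per_slot <= 0:
        raise ValueError("total and max_per_slot must be positive")
    n_slots = -(-total // max_per_slot)      # ceiling division
    base, extra = divmod(total, n_slots)     # `extra` slots get one more thread
    return [base + 1] * extra + [base] * (n_slots - extra)

# Cap values below are assumptions taken from reports in this thread.
print(split_threads(32, 24))  # [16, 16]
print(split_threads(32, 8))   # [8, 8, 8, 8]
```

Each number in the result would then be entered as the CPU count of one slot in the client configuration.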

Re: 171.67.108.*** Problems getting workunits

Posted: Sat Apr 16, 2016 10:56 am
by WhitehawkEQ
Try 64 cores :) I have 2 systems with 4 6276 Opterons each. I'm in the same boat, hoping to see 810x WUs soon.

Re: 171.67.108.*** Problems getting workunits

Posted: Sat Apr 16, 2016 5:15 pm
by DocJonz
A couple of us on our team have had the same issue and have reset the CPU count to 24 or less to be able to pick something up.
Looking forward to the return of some more Core A5 based WU's to make the most of high core count Opteron machines.

http://128.143.231.201

Posted: Sat Apr 16, 2016 10:07 pm
by Nathan_P
The "many core WU" aka the bigadv server doesn't seem to be issuing any work - certainly not to v6 clients. I've had nearly 3 days of this in my logs


[13:51:17] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 06 -np 48 -checkpoint 15 -verbose -lifeline 3020 -version 634'

[13:51:17] 
[13:51:17] *------------------------------*
[13:51:17] Folding@Home Gromacs SMP Core
[13:51:17] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[13:51:17] 
[13:51:17] Preparing to commence simulation
[13:51:17] - Looking at optimizations...
[13:51:17] - Created dyn
[13:51:17] - Files status OK
[13:51:17] Couldn't Decompress
[13:51:17] Called DecompressByteArray: compressed_data_size=0 data_size=0, decompressed_data_size=0 diff=0
[13:51:17] -Error: Couldn't update checksum variables
[13:51:17] Error: Could not open work file
[13:51:17] 
[13:51:17] Folding@home Core Shutdown: FILE_IO_ERROR
[13:51:17] CoreStatus = 75 (117)
[13:51:17] Error opening or reading from a file.
[13:51:17] Deleting current work unit & continuing...
[13:51:17] Trying to send all finished work units
[13:51:17] + No unsent completed units remaining.
[13:51:17] - Preparing to get new work unit...
[13:51:17] Cleaning up work directory
[13:51:17] + Attempting to get work packet
[13:51:17] Passkey found
[13:51:17] - Will indicate memory of 15999 MB
[13:51:17] - Connecting to assignment server
[13:51:17] Connecting to http://assign.stanford.edu:8080/
[13:51:17] Posted data.
[13:51:17] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[13:51:17] + News From Folding@Home: 
[13:51:18] Loaded queue successfully.
[13:51:18] Sent data
[13:51:18] Connecting to http://128.143.231.201:8080/
[13:51:18] Posted data.
[13:51:18] Initial: 0000; - Receiving payload (expected size: 512)
[13:51:18] Conversation time very short, giving reduced weight in bandwidth avg
[13:51:18] - Downloaded at ~1 kB/s
[13:51:18] - Averaged speed for that direction ~1 kB/s
[13:51:18] + Received work.
[13:51:18] + Closed connections
[13:51:23] 
[13:51:23] + Processing work unit
[13:51:23] Core required: FahCore_a5.exe
[13:51:23] Core found.
[13:51:23] Working on queue slot 07 [April 14 13:51:23 UTC]
[13:51:23] + Working ...
[13:51:23] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 07 -np 48 -checkpoint 15 -verbose -lifeline 3020 -version 634'

[13:51:23] 
[13:51:23] *------------------------------*
[13:51:23] Folding@Home Gromacs SMP Core
[13:51:23] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[13:51:23] 
[13:51:23] Preparing to commence simulation
[13:51:23] - Looking at optimizations...
[13:51:23] - Created dyn
[13:51:23] - Files status OK
[13:51:23] Couldn't Decompress
[13:51:23] Called DecompressByteArray: compressed_data_size=0 data_size=0, decompressed_data_size=0 diff=0
[13:51:23] -Error: Couldn't update checksum variables
[13:51:23] Error: Could not open work file
[13:51:23] 
[13:51:23] Folding@home Core Shutdown: FILE_IO_ERROR
[13:51:23] CoreStatus = 75 (117)
[13:51:23] Error opening or reading from a file.
[13:51:23] Deleting current work unit & continuing...
[13:51:23] Trying to send all finished work units
[13:51:23] + No unsent completed units remaining.
[13:51:23] - Preparing to get new work unit...
[13:51:23] Cleaning up work directory
[13:51:23] + Attempting to get work packet
[13:51:23] Passkey found
[13:51:23] - Will indicate memory of 15999 MB
[13:51:23] - Connecting to assignment server
[13:51:23] Connecting to http://assign.stanford.edu:8080/
[13:51:24] Posted data.
[13:51:24] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[13:51:24] + News From Folding@Home: 
[13:51:24] Loaded queue successfully.
[13:51:24] Sent data
[13:51:24] Connecting to http://128.143.231.201:8080/
[13:51:25] Posted data.
[13:51:25] Initial: 0000; - Receiving payload (expected size: 512)
[13:51:25] Conversation time very short, giving reduced weight in bandwidth avg
[13:51:25] - Downloaded at ~1 kB/s
[13:51:25] - Averaged speed for that direction ~1 kB/s
[13:51:25] + Received work.
[13:51:25] + Closed connections
[13:51:30] 
[13:51:30] + Processing work unit
[13:51:30] Core required: FahCore_a5.exe
[13:51:30] Core found.
[13:51:30] Working on queue slot 08 [April 14 13:51:30 UTC]
[13:51:30] + Working ...
[13:51:30] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 08 -np 48 -checkpoint 15 -verbose -lifeline 3020 -version 634'

[13:51:30] 
[13:51:30] *------------------------------*
[13:51:30] Folding@Home Gromacs SMP Core
[13:51:30] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[13:51:30] 
[13:51:30] Preparing to commence simulation
[13:51:30] - Looking at optimizations...
[13:51:30] - Created dyn
[13:51:30] - Files status OK
[13:51:30] Couldn't Decompress
[13:51:30] Called DecompressByteArray: compressed_data_size=0 data_size=0, decompressed_data_size=0 diff=0
[13:51:30] -Error: Couldn't update checksum variables
[13:51:30] Error: Could not open work file
[13:51:30] 
[13:51:30] Folding@home Core Shutdown: FILE_IO_ERROR
[13:51:30] CoreStatus = 75 (117)
[13:51:30] Error opening or reading from a file.
[13:51:30] Deleting current work unit & continuing...
[13:51:30] Trying to send all finished work units
[13:51:30] + No unsent completed units remaining.
[13:51:30] - Preparing to get new work unit...
[13:51:30] Cleaning up work directory
[13:51:30] + Attempting to get work packet
[13:51:30] Passkey found
[13:51:30] - Will indicate memory of 15999 MB
[13:51:30] - Connecting to assignment server
[13:51:30] Connecting to http://assign.stanford.edu:8080/
[13:51:31] Posted data.
[13:51:31] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[13:51:31] + News From Folding@Home: 
[13:51:31] Loaded queue successfully.
[13:51:31] Sent data
[13:51:31] Connecting to http://128.143.231.201:8080/
[13:51:31] Posted data.
[13:51:31] Initial: 0000; - Receiving payload (expected size: 512)
[13:51:31] Conversation time very short, giving reduced weight in bandwidth avg
[13:51:31] - Downloaded at ~1 kB/s
[13:51:31] - Averaged speed for that direction ~1 kB/s
[13:51:31] + Received work.
[13:51:31] + Closed connections
[13:51:36] 
[13:51:36] + Processing work unit
[13:51:36] Core required: FahCore_a5.exe
[13:51:36] Core found.
[13:51:36] Working on queue slot 09 [April 14 13:51:36 UTC]
[13:51:36] + Working ...
[13:51:36] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 09 -np 48 -checkpoint 15 -verbose -lifeline 3020 -version 634'

It uploaded an 8107 WU at 04:11 on the 14th, and since then the log has been a constant stream of the above. It managed to get a 7527 at 12pm on the 14th and a 7526 at 06:30 on the 15th, but nothing else. Is there a problem with the server and/or WU, a lack of WUs for machines with more than 24 cores, or has v6 finally been shut down?

Now I know that the short response has been an indicator of a bad WU in the past, but there has usually been something else to fall back on.
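To put a number on how often that loop occurs, here is a small Python sketch that counts download-then-fail cycles in a v6 log. It assumes the line formats shown in the log above ("Receiving payload (expected size: N)" and "Folding@home Core Shutdown: STATUS"). The 512-byte threshold is simply the value seen in this log, not a documented limit, and `count_failed_downloads` is a made-up name.

```python
import re

PAYLOAD_RE = re.compile(r"Receiving payload \(expected size: (\d+)\)")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")

def count_failed_downloads(lines, max_size=512):
    """Count cycles where a payload of at most `max_size` bytes is
    followed by a FILE_IO_ERROR core shutdown.

    Assumption: a payload this small is not a usable work unit.
    """
    failures = 0
    last_payload = None
    for line in lines:
        payload = PAYLOAD_RE.search(line)
        if payload:
            last_payload = int(payload.group(1))
            continue
        shutdown = SHUTDOWN_RE.search(line)
        if shutdown:
            if (shutdown.group(1) == "FILE_IO_ERROR"
                    and last_payload is not None
                    and last_payload <= max_size):
                failures += 1
            last_payload = None  # each payload is judged at most once
    return failures
```

A core error with no preceding payload line (such as the first one in the excerpt above) is not counted, since the download that caused it is outside the excerpt.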

Re: 171.67.108.*** Problems getting workunits

Posted: Sat Apr 16, 2016 10:14 pm
by bruce
We're talking about WUs for FahCore_a5. Right? And IIRC, Core_a5 only runs on 86 bit Linux. Right?

I think Core_a4 can go up to 24 cores on either 32 or 64-bit Windows. Right?

We need to be sure the PIs know what server settings should be set. (I think those who supported 32 CPUs have moved on or may have forgotten.)

Re: http://128.143.231.201

Posted: Sat Apr 16, 2016 10:29 pm
by bruce
This is probably the same problem being discussed here
viewtopic.php?f=72&t=28684

... or there may be two problems.

Either way, most of the new analyses of high-atom-count projects seem to have been customized for FahCore_21 on GPUs. For CPUs, you need a certain minimum number of atoms per CPU, which means many CPU projects will not run on larger servers. I've been talking to PG about the issue and they're investigating possible solutions.
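The atoms-per-CPU point can be shown with simple arithmetic. The Python sketch below is illustrative only: the real minimum is not given in this thread, so both the atom count and the per-thread minimum in the example are made-up numbers, and the function names are hypothetical.

```python
def max_useful_threads(atom_count, min_atoms_per_thread):
    """Largest thread count that still leaves every thread at least
    `min_atoms_per_thread` atoms to work on."""
    if atom_count < 0 or min_atoms_per_thread <= 0:
        raise ValueError("atom_count must be >= 0 and the minimum > 0")
    return atom_count // min_atoms_per_thread

def can_assign(atom_count, requested_threads, min_atoms_per_thread):
    """True if a project of this size could be spread over the slot."""
    return requested_threads <= max_useful_threads(
        atom_count, min_atoms_per_thread
    )

# Made-up numbers: a 50,000-atom project with a 2,500-atom minimum.
print(max_useful_threads(50_000, 2_500))   # 20
print(can_assign(50_000, 32, 2_500))       # False: a 32-thread slot is too big
print(can_assign(50_000, 16, 2_500))       # True
```

Under these assumed values a CPU:32 slot would be skipped while a 16-thread slot would qualify, which matches the pattern people report here when they reduce their thread count.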

Re: 171.67.108.*** Problems getting workunits

Posted: Sun Apr 17, 2016 7:24 am
by DocJonz
bruce wrote:We're talking about WUs for FahCore_a5. Right? And IIRC, Core_a5 only runs on 86 bit Linux. Right?

I think Core_a4 can go up to 24 cores on either 32 or 64-bit Windows. Right?

We need to be sure the PIs know what server settings should be set. (I think those who supported 32 CPUs have moved on or may have forgotten.)
I think you mean 64-bit Linux and FahCore_a5 - yes, this is what a number of us have been running, up to 64 cores (48 cores in my case), until the WU's dried up recently. It is also worth noting that, if an A5-based WU wasn't picked up for whatever reason, an A4-based WU would have been downloaded - this isn't happening at the moment, the client just sits waiting.

Re: http://128.143.231.201

Posted: Sun Apr 17, 2016 7:34 am
by Nathan_P
It looks like it is the same problem. I didn't catch that thread because the title is talking about a different server.

The threads can be merged if it makes things easier.
Edit by Mod: Done.

Re: 171.67.108.*** Problems getting workunits

Posted: Sun Apr 17, 2016 7:40 am
by Nathan_P
Most of the 24+ core WUs were generated by Kasson or one of his associates out of his lab. He has lots of projects listed on the project summary using cores A3 and A5 that run great on our multi-CPU machines, but none of his servers currently appear on the server status page.

Re: 171.67.108.*** Problems getting workunits

Posted: Sun Apr 17, 2016 10:56 pm
by bruce
DocJonz wrote:
bruce wrote:We're talking about WUs for FahCore_a5. Right? And IIRC, Core_a5 only runs on 86 bit Linux. Right?
I think you mean 64-bit Linux and FahCore_a5 - yes,
Right. 64-bit Linux
(That's what I get for trying to balance the keyboard on my lap)