#882 Server did not like results, dumping...?


Re: Server did not like results, dumping...?

Postby bollix47 » Tue Sep 25, 2012 10:20 pm

Thanks for your report.

There's a similar thread here but with different servers.

I'm not sure the problem was ever resolved, but if we keep reporting examples where a CS dumps work, then maybe someone can figure out what's going on and offer a solution. ;)

Fortunately the WS doesn't often fail to the point where the CS has to get involved.
bollix47
 
Posts: 2877
Joined: Sun Dec 02, 2007 6:04 am
Location: Canada

Re: Server did not like results, dumping...?

Postby bollix47 » Wed Sep 26, 2012 1:19 am

FWIW

Here's another example that occurred to me today:

Code:
16:30:15:WU00:FS00:0xa4:Completed 10000000 out of 10000000 steps  (100%)
16:30:15:WU00:FS00:0xa4:DynamicWrapper: Finished Work Unit: sleep=10000
16:30:16:WU01:FS00:Connecting to assign3.stanford.edu:8080
16:30:16:WU01:FS00:News: Welcome to Folding@Home
16:30:16:WU01:FS00:Assigned to work server 129.74.85.15
16:30:16:WU01:FS00:Requesting new work unit for slot 00: RUNNING smp:12 from 129.74.85.15
16:30:16:WU01:FS00:Connecting to 129.74.85.15:8080
16:30:25:WU00:FS00:0xa4:
16:30:25:WU00:FS00:0xa4:Finished Work Unit:
16:30:25:WU00:FS00:0xa4:- Reading up to 2011920 from "00/wudata_01.trr": Read 2011920
16:30:25:WU00:FS00:0xa4:trr file hash check passed.
16:30:25:WU00:FS00:0xa4:- Reading up to 209244 from "00/wudata_01.xtc": Read 209244
16:30:25:WU00:FS00:0xa4:xtc file hash check passed.
16:30:25:WU00:FS00:0xa4:edr file hash check passed.
16:30:25:WU00:FS00:0xa4:logfile size: 79524
16:30:25:WU00:FS00:0xa4:Leaving Run
16:30:28:WU00:FS00:0xa4:- Writing 2325160 bytes of core data to disk...
16:30:29:WU00:FS00:0xa4:Done: 2324648 -> 1693397 (compressed to 72.8 percent)
16:30:29:WU00:FS00:0xa4:  ... Done.
16:30:29:WU00:FS00:0xa4:- Shutting down core
16:30:29:WU00:FS00:0xa4:
16:30:29:WU00:FS00:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
16:30:29:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
16:30:29:WU00:FS00:Sending unit results: id:00 state:SEND error:OK project:7025 run:2 clone:85 gen:37 core:0xa4 unit:0x000000570001329c4dfbad0cb27fd6b3
16:30:29:WU00:FS00:Uploading 1.62MiB to 129.74.85.15
16:30:29:WU00:FS00:Connecting to 129.74.85.15:8080
16:32:02:WU00:FS00:Upload 3.87%
16:32:02:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
16:32:02:WU00:FS00:Trying to send results to collection server
16:32:02:WU00:FS00:Uploading 1.62MiB to 129.74.85.16
16:32:02:WU00:FS00:Connecting to 129.74.85.16:8080
16:35:26:ERROR:WU01:FS00:Exception: 10002: Received short response, expected 512 bytes, got 0
16:35:26:WU01:FS00:Connecting to assign3.stanford.edu:8080
16:35:27:WU01:FS00:News: Welcome to Folding@Home
16:35:27:WU01:FS00:Assigned to work server 129.74.85.15
16:35:27:WU01:FS00:Requesting new work unit for slot 00: READY smp:12 from 129.74.85.15
16:35:27:WU01:FS00:Connecting to 129.74.85.15:8080
16:35:27:WU01:FS00:Downloading 131.31KiB
16:35:28:WU01:FS00:Download complete
16:35:28:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:OK project:7039 run:0 clone:180 gen:1 core:0xa4 unit:0x000000030001329c4f394f006742c08b
16:35:28:WU01:FS00:Starting
16:35:28:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /home/bollix/fah/cores/www.stanford.edu/~pande/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 01 -suffix 01 -version 701 -lifeline 2035 -checkpoint 15 -np 12
16:35:28:WU01:FS00:Started FahCore on PID 3287
16:35:28:WU01:FS00:Core PID:3291
16:35:28:WU01:FS00:FahCore 0xa4 started
16:35:28:WU01:FS00:Downloading project 7039 description
16:35:28:WU01:FS00:Connecting to fah-web.stanford.edu:80
16:35:28:WU01:FS00:0xa4:
16:35:28:WU01:FS00:0xa4:*------------------------------*
16:35:28:WU01:FS00:0xa4:Folding@Home Gromacs GB Core
16:35:28:WU01:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
16:35:28:WU01:FS00:0xa4:
16:35:28:WU01:FS00:0xa4:Preparing to commence simulation
16:35:28:WU01:FS00:0xa4:- Looking at optimizations...
16:35:28:WU01:FS00:0xa4:- Created dyn
16:35:28:WU01:FS00:0xa4:- Files status OK
16:35:28:WU01:FS00:0xa4:- Expanded 133948 -> 308460 (decompressed 230.2 percent)
16:35:28:WU01:FS00:0xa4:Called DecompressByteArray: compressed_data_size=133948 data_size=308460, decompressed_data_size=308460 diff=0
16:35:28:WU01:FS00:0xa4:- Digital signature verified
16:35:28:WU01:FS00:0xa4:
16:35:28:WU01:FS00:0xa4:Project: 7039 (Run 0, Clone 180, Gen 1)
16:35:28:WU01:FS00:0xa4:
16:35:28:WU01:FS00:0xa4:Assembly optimizations on if available.
16:35:28:WU01:FS00:0xa4:Entering M.D.
16:35:28:WU01:FS00:Project 7039 description downloaded successfully
16:35:34:WU01:FS00:0xa4:Completed 0 out of 25000000 steps  (0%)
16:39:13:WU00:FS00:Upload 3.87%
16:39:13:ERROR:WU00:FS00:Exception: Transfer failed
16:39:13:WU00:FS00:Sending unit results: id:00 state:SEND error:OK project:7025 run:2 clone:85 gen:37 core:0xa4 unit:0x000000570001329c4dfbad0cb27fd6b3
16:39:13:WU00:FS00:Uploading 1.62MiB to 129.74.85.15
16:39:13:WU00:FS00:Connecting to 129.74.85.15:8080
16:42:15:WU01:FS00:0xa4:Completed 250000 out of 25000000 steps  (1%)
16:48:55:WU01:FS00:0xa4:Completed 500000 out of 25000000 steps  (2%)
16:52:43:WU00:FS00:Upload 3.87%
16:52:43:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
16:52:43:WU00:FS00:Trying to send results to collection server
16:52:43:WU00:FS00:Uploading 1.62MiB to 129.74.85.16
16:52:43:WU00:FS00:Connecting to 129.74.85.16:8080
16:52:49:WU00:FS00:Upload 58.03%
16:52:57:WU00:FS00:Upload complete
16:52:57:WU00:FS00:Server responded WORK_QUIT (404)
16:52:57:WARNING:WU00:FS00:Server did not like results, dumping
16:52:57:WU00:FS00:Cleaning up


There appear to be two problems here.

1. Why is the transfer failing for the Work Server (129.74.85.15)?
2. Why is the Collection Server (129.74.85.16) dumping the results?

P.S. I just uploaded another WU to 129.74.85.15 and it worked fine.

Perhaps ticket #882 needs to be reopened?

Mod edit: Reopened.

Re: #882 Server did not like results, dumping...?

Postby bollix47 » Wed Sep 26, 2012 2:59 am

Thank you for reopening.

FWIW

I'd like to add that the original link for creating that ticket shows my 3 examples and a fourth report from another folder who never did supply their log. So it never was a "one off".

Re: #882 Server did not like results, dumping...?

Postby 7im » Wed Sep 26, 2012 3:47 am

The ticket has been classified as "needing more information" and progress won't happen without more info. And I can't tell what Joe is looking for...

The other problem, and maybe we need another ticket for this, is that the server provides no specific reason for why the data was rejected. CRC error, timeout error, NOTHING. That needs to be improved a lot.
7im
 
Posts: 10189
Joined: Thu Nov 29, 2007 5:30 pm
Location: Arizona

Re: Server did not like results, dumping...?

Postby Amaruk » Wed Sep 26, 2012 9:10 am

bollix47 wrote: There appear to be two problems here.

1. Why is the transfer failing for the Work Server (129.74.85.15)?
2. Why is the Collection Server (129.74.85.16) dumping the results?

P.S. I just uploaded another WU to 129.74.85.15 and it worked fine.


1. Why is the transfer failing for the Work Server (129.74.85.15)?

129.74.85.15 (fahnd03) was in reject mode sometime between 09:30 and 10:00 PDT.
Code:
Tue Sep 25 09:30:01 PDT 2012   129.74.85.15   fahnd03   izaguirre   SMP   full   Accepting
Tue Sep 25 09:40:00 PDT 2012   129.74.85.15   fahnd03   izaguirre   SMP   full   Reject
Tue Sep 25 09:50:00 PDT 2012   129.74.85.15   fahnd03   izaguirre   SMP   full   Reject
Tue Sep 25 10:00:01 PDT 2012   129.74.85.15   fahnd03   izaguirre   SMP   full   Accepting
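
For anyone who wants to check these snapshots against their own log, here is a minimal Python sketch of how the Reject period can be bracketed. It assumes the status lines keep the whitespace-separated layout shown above, with the state as the last field; the function names `parse_snapshots` and `reject_bounds` are made up for this example and are not part of any FAH tool. The snapshots are about ten minutes apart, so the result is only the widest interval in which the server could have been rejecting, not the exact start and end.

```python
from datetime import datetime

# Status snapshots copied from the excerpt above (times are PDT).
SNAPSHOTS = """\
Tue Sep 25 09:30:01 PDT 2012   129.74.85.15   fahnd03   izaguirre   SMP   full   Accepting
Tue Sep 25 09:40:00 PDT 2012   129.74.85.15   fahnd03   izaguirre   SMP   full   Reject
Tue Sep 25 09:50:00 PDT 2012   129.74.85.15   fahnd03   izaguirre   SMP   full   Reject
Tue Sep 25 10:00:01 PDT 2012   129.74.85.15   fahnd03   izaguirre   SMP   full   Accepting
"""


def parse_snapshots(text):
    """Return (naive local timestamp, state) pairs, one per non-empty line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # fields: weekday, month, day, time, zone, year, ip, host, ..., state
        # The zone name is skipped; all times are treated as local PDT.
        stamp = datetime.strptime(
            " ".join(fields[1:4] + [fields[5]]), "%b %d %H:%M:%S %Y"
        )
        rows.append((stamp, fields[-1]))
    return rows


def reject_bounds(rows):
    """Return (earliest possible start, latest possible end) of the Reject period."""
    rejects = [i for i, (_, state) in enumerate(rows) if state == "Reject"]
    if not rejects:
        return None
    first, last = rejects[0], rejects[-1]
    # The server may have switched state any time after the previous snapshot
    # and any time before the next one.
    start = rows[first - 1][0] if first > 0 else rows[first][0]
    end = rows[last + 1][0] if last + 1 < len(rows) else rows[last][0]
    return start, end


start, end = reject_bounds(parse_snapshots(SNAPSHOTS))
print("Reject mode at some point between", start.time(), "and", end.time(), "PDT")
```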


To make things a little easier, I filtered your log:
Code:
16:30:15:WU00:FS00:0xa4:Completed 10000000 out of 10000000 steps  (100%)
16:30:15:WU00:FS00:0xa4:DynamicWrapper: Finished Work Unit: sleep=10000
16:30:25:WU00:FS00:0xa4:
16:30:25:WU00:FS00:0xa4:Finished Work Unit:
16:30:25:WU00:FS00:0xa4:- Reading up to 2011920 from "00/wudata_01.trr": Read 2011920
16:30:25:WU00:FS00:0xa4:trr file hash check passed.
16:30:25:WU00:FS00:0xa4:- Reading up to 209244 from "00/wudata_01.xtc": Read 209244
16:30:25:WU00:FS00:0xa4:xtc file hash check passed.
16:30:25:WU00:FS00:0xa4:edr file hash check passed.
16:30:25:WU00:FS00:0xa4:logfile size: 79524
16:30:25:WU00:FS00:0xa4:Leaving Run
16:30:28:WU00:FS00:0xa4:- Writing 2325160 bytes of core data to disk...
16:30:29:WU00:FS00:0xa4:Done: 2324648 -> 1693397 (compressed to 72.8 percent)
16:30:29:WU00:FS00:0xa4:  ... Done.
16:30:29:WU00:FS00:0xa4:- Shutting down core
16:30:29:WU00:FS00:0xa4:
16:30:29:WU00:FS00:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
16:30:29:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
16:30:29:WU00:FS00:Sending unit results: id:00 state:SEND error:OK project:7025 run:2 clone:85 gen:37 core:0xa4 unit:0x000000570001329c4dfbad0cb27fd6b3
16:30:29:WU00:FS00:Uploading 1.62MiB to 129.74.85.15
16:30:29:WU00:FS00:Connecting to 129.74.85.15:8080
16:32:02:WU00:FS00:Upload 3.87%
16:32:02:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
16:32:02:WU00:FS00:Trying to send results to collection server
16:32:02:WU00:FS00:Uploading 1.62MiB to 129.74.85.16
16:32:02:WU00:FS00:Connecting to 129.74.85.16:8080
16:39:13:WU00:FS00:Upload 3.87%
16:39:13:ERROR:WU00:FS00:Exception: Transfer failed
16:39:13:WU00:FS00:Sending unit results: id:00 state:SEND error:OK project:7025 run:2 clone:85 gen:37 core:0xa4 unit:0x000000570001329c4dfbad0cb27fd6b3
16:39:13:WU00:FS00:Uploading 1.62MiB to 129.74.85.15
16:39:13:WU00:FS00:Connecting to 129.74.85.15:8080
16:52:43:WU00:FS00:Upload 3.87%
16:52:43:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
16:52:43:WU00:FS00:Trying to send results to collection server
16:52:43:WU00:FS00:Uploading 1.62MiB to 129.74.85.16
16:52:43:WU00:FS00:Connecting to 129.74.85.16:8080
16:52:49:WU00:FS00:Upload 58.03%
16:52:57:WU00:FS00:Upload complete
16:52:57:WU00:FS00:Server responded WORK_QUIT (404)
16:52:57:WARNING:WU00:FS00:Server did not like results, dumping
16:52:57:WU00:FS00:Cleaning up
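
The filtering above amounts to keeping only the lines tagged WU00. A rough Python sketch of that step is below, in case others want to trim their own logs before posting. It assumes every client line starts with `HH:MM:SS:`, optionally followed by `WARNING:` or `ERROR:` (the only level prefixes that appear in this log), then the `WUnn` tag; `filter_by_wu` is a hypothetical helper name, not something the client provides.

```python
import re

# Client log lines look like "HH:MM:SS:[LEVEL:]WUnn:FSnn:message".
# Only the WARNING and ERROR prefixes seen in this thread are handled.
LINE_RE = re.compile(r"^\d{2}:\d{2}:\d{2}:(?:(?:WARNING|ERROR):)?(WU\d+):")


def filter_by_wu(lines, wu="WU00"):
    """Keep only the lines tagged with the given work unit queue id."""
    kept = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(1) == wu:
            kept.append(line)
    return kept


# Four lines taken from the log in this thread.
sample = [
    "16:32:02:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed",
    "16:35:26:ERROR:WU01:FS00:Exception: 10002: Received short response, expected 512 bytes, got 0",
    "16:35:28:WU01:FS00:Starting",
    "16:39:13:WU00:FS00:Upload 3.87%",
]

for line in filter_by_wu(sample):
    print(line)
```

To run it on a whole log, read the file into a list of lines and pass that list in place of `sample`.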


Your initial upload attempt (129.74.85.15) failed at 16:32:02 UTC. (09:32:02 PDT)

Second attempt (129.74.85.16) failed at 16:39:13 UTC. (09:39:13 PDT)

Third attempt (129.74.85.15 again) failed at 16:52:43 UTC. (09:52:43 PDT)

Fourth attempt (back to 129.74.85.16) was completed, then rejected.

It seems both attempted uploads to the WS (129.74.85.15) occurred during the short time it was in reject mode. :(


JuanPabloCuervo's 3 failed WS uploads were also during this timeframe. (16:51:13, 16:51:20 and 16:54:18 UTC = 09:51:13, 09:51:20, and 09:54:18 PDT)
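
The UTC to PDT conversions above can be double-checked with a short Python sketch. It assumes the client log times are UTC on Sep 25, 2012 (the log itself carries no date) and uses a fixed UTC-7 offset for PDT. The window bounds are the two "Accepting" snapshots from the server status log (09:30:01 and 10:00:01 PDT), so a time inside the window is only consistent with reject mode, not proof of it. The names `utc_to_pdt` and `in_reject_window` are invented for this example.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Pacific Daylight Time; fine for a single date in September.
PDT = timezone(timedelta(hours=-7), "PDT")

# Assumption: the log was written on this date (it matches the status snapshots).
LOG_DAY = (2012, 9, 25)

# Widest interval in which the WS could have been in Reject mode.
WINDOW_START = datetime(2012, 9, 25, 9, 30, 1, tzinfo=PDT)
WINDOW_END = datetime(2012, 9, 25, 10, 0, 1, tzinfo=PDT)


def utc_to_pdt(hms, day=LOG_DAY):
    """Convert an 'HH:MM:SS' UTC log time on the given day to PDT."""
    hour, minute, second = (int(part) for part in hms.split(":"))
    stamp = datetime(*day, hour, minute, second, tzinfo=timezone.utc)
    return stamp.astimezone(PDT)


def in_reject_window(hms):
    """True if the UTC log time falls between the two Accepting snapshots."""
    return WINDOW_START <= utc_to_pdt(hms) <= WINDOW_END


# Failed WS uploads quoted in this post (UTC).
for hms in ("16:32:02", "16:52:43", "16:51:13", "16:51:20", "16:54:18"):
    local = utc_to_pdt(hms).strftime("%H:%M:%S")
    print(hms, "UTC =", local, "PDT, inside window:", in_reject_window(hms))
```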



2. Why is the Collection Server (129.74.85.16) dumping the results?

Is 129.74.85.16 (fahnd04) actually a CS? Server status lists it as classic...

And if it really is another (dysfunctional) CS, perhaps it should be on standby like the rest. ;)
Amaruk
 
Posts: 254
Joined: Fri Jun 20, 2008 4:57 am
Location: Watching from the Woods

Re: #882 Server did not like results, dumping...?

Postby bollix47 » Wed Sep 26, 2012 11:41 am

Thank you, Amaruk, for catching that Reject status. I did look at the log but obviously didn't look closely enough and missed those two entries.

As you say, that explains the failed file transfer, which makes this situation more like the one I linked to in the second post (although that involved a different WS). It still leaves question marks around 129.74.85.16.

Is that server configured incorrectly? Is the wrong IP being used for the CS? Should the server be on Standby? Or did the WS go into Reject after the initial upload had started, leaving something in the wrong state?

The main focus of this thread, the other thread, and the ticket should be the CS behavior: why are perfectly good results being dumped, slowing down the projects involved and wasting folders' time and electricity?

Re: #882 Server did not like results, dumping...?

Postby Joe_H » Wed Sep 26, 2012 4:05 pm

Amaruk wrote:Is 129.74.85.16 (fahnd04) actually a CS? Server status lists it as classic...

Being listed as "classic" does not seem to be an issue. Another CS listed as "classic" is 171.65.103.160 (VSPMF93), and it works just fine as a collection server for the projects assigned to it. It appears that the servers listed as "CS" are the older ones still waiting on software upgrades to become functional, while the newer, functioning ones have not been marked as CS on the server status page.

There may be an issue with how 129.74.85.16 is configured that causes it to reject some WUs. If that is the problem, perhaps reports like this one can lead to it being fixed.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Joe_H
Site Admin
 
Posts: 6902
Joined: Tue Apr 21, 2009 5:41 pm
Location: W. MA

Re: #882 Server did not like results, dumping...?

Postby 7im » Wed Sep 26, 2012 5:29 pm

I asked PG about this a while ago. With server virtualization and updated server code, a single server can be both a Work Server and a Collection Server for other Work Servers.

