128.143.48.226 : server reports problem with unit

Moderators: Site Moderators, FAHC Science Team

farmpuma
Posts: 25
Joined: Sat Mar 21, 2009 12:50 pm
Location: Soybean field, IN, USA

Re: 128.143.48.226 : server reports problem with unit

Post by farmpuma »

I have three different machines with the same failure. Each machine had successfully finished hundreds of work units, including a few from server 128.143.48.226, when the failures happened. No sneaker-netting was involved.

edit - All three machines continue to finish and upload other, non-128.143.48.226 work units with no issues.
I'm the same farmpuma from years gone by, but it appears my account went away when the passwords changed to six characters minimum.

bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 128.143.48.226 : server reports problem with unit

Post by bruce »

Mactin wrote: The question now is:
Which servers are using the new server code so that we can avoid and/or manage the problem/feature/bug?
Soon all servers will be running the new code, so everyone who sneakernets must figure out a way to manage MachineID/UserID so that WUs always get uploaded with the same values that were being used when the WU was downloaded. Stanford has always assumed one client per MachineID/UserID but hasn't always validated that data. Now that they are (or soon will be, depending on the server) you need to convince the servers that you follow the one-client=one-machineID+UserID policy.

Since you specifically said that no sneakernetting was involved, you probably have a different problem (bug?). Sneakernetting isn't the only cause of the message "server reports a problem with unit".

Look near the top of FAHlog.txt for the information about MachineID/UserID/UserName/TeamNumber. Post that information for each of your machines.

Also, please post segments of FAHlog.txt showing some of the failed upload attempts where you got that message.
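
For anyone who wants to pull that information out of a log automatically, here is a minimal Python sketch. It assumes the log lines look like the ones quoted in this thread (`- User name: ... (Team N)`, `- User ID: ...`, `- Machine ID: ...`, and `- Server reports problem with unit.`); other client versions may format them differently. The function name `summarize_log` is made up for illustration and is not part of any Folding@Home tool.

```python
import re

# Patterns are assumptions based on the FAHlog.txt excerpts in this thread.
USER_RE = re.compile(r"- User name: (.+?) \(Team (\d+)\)")
USER_ID_RE = re.compile(r"- User ID: ([0-9A-Fa-f]+)")
MACHINE_RE = re.compile(r"- Machine ID: (\d+)")
FAILURE_TEXT = "Server reports problem with unit"


def summarize_log(log_text):
    """Return (identity, failures) for the text of one FAHlog.txt.

    identity holds the first user name, team, User ID and Machine ID found;
    failures lists every line that contains the upload failure message.
    """
    identity = {}
    failures = []
    for line in log_text.splitlines():
        match = USER_RE.search(line)
        if match:
            # setdefault keeps the first value seen near the top of the log
            identity.setdefault("user_name", match.group(1))
            identity.setdefault("team", match.group(2))
        match = USER_ID_RE.search(line)
        if match:
            identity.setdefault("user_id", match.group(1))
        match = MACHINE_RE.search(line)
        if match:
            identity.setdefault("machine_id", match.group(1))
        if FAILURE_TEXT in line:
            failures.append(line.strip())
    return identity, failures
```

Calling `summarize_log(open("FAHlog.txt").read())` on each machine would give the identity block and the failing lines to post here.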
stevei
Posts: 5
Joined: Fri Jan 25, 2008 12:08 am

Re: 128.143.48.226 : server reports problem with unit

Post by stevei »

stevei wrote:I've found 4 units completed but unsent with the same error.

When I run qfix, it detects that the results do not match the queue entry, reporting proj 0 run 0 clone 0 for all 4 finished WUs...

Should I just delete them and move on, or is there any way to fix this?

The error is "server reports problem with unit" and the server is 171.64.122.139 for all 4 units.
I have at least 5 sitting like this now. A couple can be traced back to different User IDs. The rest were on the same machine from beginning to end.

For all units, qfix gives the same information, and even for the User ID issue, taking the unit back to the originating machine does not work. The queue problem described above seems to flag the unit so that FAH does not attempt to upload it. Is there a way to fix it so it will re-attempt the upload? Qfix sees the issue but does not attempt to fix it.

I would hate to delete units, points or no points.
stevei
Posts: 5
Joined: Fri Jan 25, 2008 12:08 am

Re: 128.143.48.226 : server reports problem with unit

Post by stevei »

Here is an example of start to finish on the same machine:

Code:

download:
[18:07:51] - User name: SteveI (Team 45)
[18:07:51] - User ID: 755F5E954D0E93FA
[18:07:51] - Machine ID: 4
[18:07:51] 
[18:07:51] Loaded queue successfully.
[18:07:51] 
[18:07:51] + Processing work unit
[18:07:51] Core required: FahCore_78.exe
[18:07:51] Core found.
[18:07:51] Working on queue slot 06 [March 18 18:07:51 UTC]
[18:07:51] + Working ...
[18:07:51] 
[18:07:51] *------------------------------*
[18:07:51] Folding@Home Gromacs Core
[18:07:51] Version 1.90 (March 8, 2006)
[18:07:51] 
[18:07:51] Preparing to commence simulation
[18:07:51] - Assembly optimizations manually forced on.
[18:07:51] - Not checking prior termination.
[18:07:51] - Expanded 238594 -> 1167708 (decompressed 489.4 percent)
[18:07:51] - Starting from initial work packet
[18:07:51] 
[18:07:51] Project: 3798 (Run 28, Clone 4, Gen 35)

Completed normally, then:
[20:07:56] - User name: SteveI (Team 45)
[20:07:56] - User ID: 755F5E954D0E93FA
[20:07:56] - Machine ID: 4
[20:07:56] 
[20:07:56] Loaded queue successfully.
[20:07:56] - Preparing to get new work unit...
[20:07:56] Project: 3798 (Run 28, Clone 4, Gen 35)


[20:07:56] + Attempting to send results [March 18 20:07:56 UTC]
[20:07:56] - Presenting message box asking to network.
[20:07:56] - Presenting message box asking to network.
[20:07:58] + Attempting to get work packet
[20:07:58] - Connecting to assignment server
[20:07:59] - Successful: assigned to (171.64.122.139).
[20:07:59] + News From Folding@Home: Welcome to Folding@Home
[20:07:59] Loaded queue successfully.
[20:08:01] - Server reports problem with unit.
[20:08:16] + Connections closed: You may now disconnect
[20:08:16] 
[20:08:16] + Processing work unit
[20:08:16] Core required: FahCore_78.exe
[20:08:16] Core found.
[20:08:16] Working on queue slot 07 [March 18 20:08:16 UTC]
[20:08:16] + Working ...
[20:08:16] 
[20:08:16] *------------------------------*
[20:08:16] Folding@Home Gromacs Core
[20:08:16] Version 1.90 (March 8, 2006)
[20:08:16] 
[20:08:16] Preparing to commence simulation
[20:08:16] - Assembly optimizations manually forced on.
[20:08:16] - Not checking prior termination.
[20:08:16] - Expanded 238700 -> 1167708 (decompressed 489.1 percent)
[20:08:16] - Starting from initial work packet
[20:08:16] 
[20:08:16] Project: 3798 (Run 91, Clone 2, Gen 35)
[20:08:16] 
[20:08:16] Assembly optimizations on if available.
[20:08:16] Entering M.D.
[20:08:23] Protein: p3798
jcoffland
Site Admin
Posts: 1019
Joined: Fri Oct 10, 2008 6:42 pm
Location: Helsinki, Finland

Re: 128.143.48.226 : server reports problem with unit

Post by jcoffland »

I looked into this. The reason your WUs are not being accepted is because data files on your client have been tampered with or removed since the WU was checked out. Now your client does not have the correct acceptance codes to return the unit. This may have worked in the past but the new servers are stricter.

You can delete these WUs without worry as they have been completed elsewhere.

In the future you should avoid moving WUs or deleting and reinstalling your client when you have unreturned work units. Upgrading your client should not pose a problem.

If you have a need that our system is not addressing please explain and we can provide a supported solution.
Cauldron Development LLC
http://cauldrondevelopment.com/
^w^ing
Posts: 136
Joined: Fri Mar 07, 2008 7:29 pm
Hardware configuration: C2D E6400 2.13 GHz @ 3.2 GHz
Asus EN8800GTS 640 (G80) @ 660/792/1700 running the 6.23 w/ core11 v1.19
forceware 260.89
Asus P5N-E SLi
2GB 800MHz DDRII (2xCorsair TwinX 512MB)
WinXP 32 SP3
Location: Prague

Re: 128.143.48.226 : server reports problem with unit

Post by ^w^ing »

Does this mean that WUs fixed with qfix won't be accepted by the server anymore?
stevei
Posts: 5
Joined: Fri Jan 25, 2008 12:08 am

Re: 128.143.48.226 : server reports problem with unit

Post by stevei »

jcoffland wrote:I looked into this. The reason your WUs are not being accepted is because data files on your client have been tampered with or removed since the WU was checked out. Now your client does not have the correct acceptance codes to return the unit. This may have worked in the past but the new servers are stricter.

You can delete these WUs without worry as they have been completed elsewhere.

In the future you should avoid moving WUs or deleting and reinstalling your client when you have unreturned work units. Upgrading your client should not pose a problem.

If you have a need that our system is not addressing please explain and we can provide a supported solution.
Thank you for your assistance. I will go ahead and delete the results that are not uploaded.

My configuration is not unusual. I sneakernet for a machine that is unable to access the internet, an isolated test machine. My simple process is to move the entire folding folder, which includes the work folder. I run the executable from the same path as on the download machine, with the same flags, and move the folder back to the originating machine to upload. A wireless issue caused me to deviate temporarily to a different machine (which is why a couple of the WUs did not match machine IDs).

Because I am using the simple console version, I don't install, I just copy back and forth.

While I have been folding for many years, I would not call myself an expert, and would not intentionally tamper with or remove data files. I'm not sure what caused this. I understand that sneakernet is not a supported procedure. I guess I will keep an eye on this for a while.

Steve.

EDIT: I only tried Qfix as a result of the WUs not uploading.... It couldn't have caused the problem.
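
As a rough self-check before carrying a folder back for upload, the User ID/Machine ID comparison can be sketched in Python. This is only an illustration, assuming the log line format quoted in this thread; `identities_seen` and `same_identity` are hypothetical helper names, not part of the client or qfix. Matching IDs do not guarantee acceptance, since the server may also check other client data files.

```python
import re

# Line formats are assumptions taken from the FAHlog.txt excerpts in this thread.
USER_ID_RE = re.compile(r"- User ID: ([0-9A-Fa-f]+)")
MACHINE_RE = re.compile(r"- Machine ID: (\d+)")


def identities_seen(log_text):
    """Return the distinct (User ID, Machine ID) pairs in order of appearance."""
    pairs = []
    user_id = None
    for line in log_text.splitlines():
        match = USER_ID_RE.search(line)
        if match:
            user_id = match.group(1)
            continue
        match = MACHINE_RE.search(line)
        if match and user_id is not None:
            pair = (user_id, match.group(1))
            if pair not in pairs:
                pairs.append(pair)
            user_id = None  # wait for the next User ID line
    return pairs


def same_identity(download_log, upload_log):
    """True only if both logs show exactly one identity and it is the same one."""
    before = identities_seen(download_log)
    after = identities_seen(upload_log)
    return len(before) == 1 and before == after
```

If `same_identity` returns False for the log from the download session and the log from the upload session, the unit was probably started and returned under different IDs, which is the situation with the couple of WUs I ran on the other machine.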
farmpuma
Posts: 25
Joined: Sat Mar 21, 2009 12:50 pm
Location: Soybean field, IN, USA

Re: 128.143.48.226 : server reports problem with unit

Post by farmpuma »

bruce wrote:Since you specifically said that no sneakernetting was involved, you probably have a different problem (bug?). This isn't the only cause for the message "server reports a problem with unit"

Look near the top of FAHlog.txt for the information about MachineID/UserID/UserName/TeamNumber. Post that information for each of your machines.

Also, please post segments of FAHlog.txt showing some of the failed upload attempts where you got that message.
One of the machines is down for cleaning and mods, and may still have the FAHlog.txt file, but the other two have long ago dumped the FAHlog.txt info on these work units. Perhaps the work folder logfile will help.

from the current work unit on machine #1 -
[18:13:34] - Ask before connecting: Yes
[18:13:34] - User name: farmpuma (Team 2630)
[18:13:34] - User ID: 50454F1144A71857
[18:13:34] - Machine ID: 1

Code:

*------------------------------*
Folding@Home Double Gromacs Core C
Version 1.00 (Thu Apr 24 19:12:09 PDT 2008)

Preparing to commence simulation
- Assembly optimizations manually forced on.
- Not checking prior termination.
- Expanded 231244 -> 637080 (decompressed 275.5 percent)

Project: 3863 (Run 561, Clone 0, Gen 1)

Assembly optimizations on if available.
Entering M.D.
Will resume from checkpoint file
Working on p3863_fkbprelative_ligand
Completed 0 out of 1500000 steps  (0%)
Extra SSE2 boost OK
Resuming from checkpoint
Verified work/wudata_01.log
Verified work/wudata_01.edr
Verified work/wudata_01.xvg
Verified work/wudata_01.trr
Verified work/wudata_01.xtc
Completed 1260001 out of 1500000 steps  (84%)
Timer requesting checkpoint
Completed 1275000 out of 1500000 steps  (85%)
Timer requesting checkpoint
Completed 1290000 out of 1500000 steps  (86%)
Timer requesting checkpoint
Completed 1305000 out of 1500000 steps  (87%)
Timer requesting checkpoint
Completed 1320000 out of 1500000 steps  (88%)
Timer requesting checkpoint
Completed 1335000 out of 1500000 steps  (89%)
Timer requesting checkpoint
Completed 1350000 out of 1500000 steps  (90%)
Timer requesting checkpoint
Completed 1365000 out of 1500000 steps  (91%)
Timer requesting checkpoint
Completed 1380000 out of 1500000 steps  (92%)
Timer requesting checkpoint
Completed 1395000 out of 1500000 steps  (93%)
Timer requesting checkpoint
Completed 1410000 out of 1500000 steps  (94%)
Timer requesting checkpoint
Completed 1425000 out of 1500000 steps  (95%)
Timer requesting checkpoint
Completed 1440000 out of 1500000 steps  (96%)
Timer requesting checkpoint
Completed 1455000 out of 1500000 steps  (97%)
Timer requesting checkpoint
Completed 1470000 out of 1500000 steps  (98%)
Timer requesting checkpoint
Completed 1485000 out of 1500000 steps  (99%)
Timer requesting checkpoint
Completed 1500000 out of 1500000 steps  (100%)

Finished Work Unit:
- Reading up to 414544 from "work/wudata_01.trr": Read 414544
- Reading up to 22980 from "work/wudata_01.xtc": Read 22980
xvg file size: 1003788
logfile size: 67530
Leaving Run
- Writing 1611862 bytes of core data to disk...
Done: 1611350 -> 802060 (compressed to 49.7 percent)
  ... Done.
- Shutting down core

Folding@home Core Shutdown: FINISHED_UNIT
---- edit ----

same info from machine #2 -

[14:49:54] - Ask before connecting: Yes
[14:49:54] - User name: farmpuma (Team 2630)
[14:49:54] - User ID: 34B35B15250BEAC1
[14:49:54] - Machine ID: 1

Code:

*------------------------------*
Folding@Home Double Gromacs Core C
Version 1.00 (Thu Apr 24 19:12:09 PDT 2008)

Preparing to commence simulation
- Assembly optimizations manually forced on.
- Not checking prior termination.
- Expanded 1141474 -> 3288656 (decompressed 288.1 percent)

Project: 3860 (Run 546, Clone 5, Gen 1)

Assembly optimizations on if available.
Entering M.D.
Will resume from checkpoint file
Working on p3860_fkbprelative_complex
Completed 0 out of 300000 steps  (0%)
Extra SSE2 boost OK
Resuming from checkpoint
Verified work/wudata_01.log
Verified work/wudata_01.edr
Verified work/wudata_01.xvg
Verified work/wudata_01.trr
Verified work/wudata_01.xtc
Completed 291733 out of 300000 steps  (97%)
Timer requesting checkpoint
Timer requesting checkpoint
Timer requesting checkpoint
Completed 294000 out of 300000 steps  (98%)
Timer requesting checkpoint
Timer requesting checkpoint
Timer requesting checkpoint
Timer requesting checkpoint
Completed 297000 out of 300000 steps  (99%)
Timer requesting checkpoint
Timer requesting checkpoint
Timer requesting checkpoint
Timer requesting checkpoint
Completed 300000 out of 300000 steps  (100%)

Finished Work Unit:
- Reading up to 2077168 from "work/wudata_01.trr": Read 2077168
- Reading up to 182488 from "work/wudata_01.xtc": Read 182488
xvg file size: 613418
logfile size: 118795
Leaving Run
- Writing 3094889 bytes of core data to disk...
Done: 3094377 -> 2434954 (compressed to 78.6 percent)
  ... Done.
- Shutting down core

Folding@home Core Shutdown: FINISHED_UNIT
---- edit 16 June 2009 ----
I can send you the wuresults_01.dat files on these two failures as well.

---- edit 22 June 2009 ----
corrected grammar
Last edited by farmpuma on Mon Jun 22, 2009 9:04 am, edited 2 times in total.
farmpuma
Posts: 25
Joined: Sat Mar 21, 2009 12:50 pm
Location: Soybean field, IN, USA

Re: 128.143.48.226 : server reports problem with unit

Post by farmpuma »

Well, I may have been mistaken about one of those work units listed above not being moved. But I have two new failures which I am absolutely certain have NOT been moved from the machine that downloaded them.

On the first machine I only have the work unit logfile. The system is an extremely stable AMD socket 754 Athlon 64 2800+ CPU on an Asus K8N-E motherboard. It has finished hundreds of work units, including many big-packet, bonus-point Gromacs units, which in my experience reveal even the slightest instability.
[06:07:00] - Ask before connecting: Yes
[06:07:00] - User name: farmpuma (Team 2630)
[06:07:00] - User ID: 50454F1144A71857
[06:07:00] - Machine ID: 1

Code:

*------------------------------*
Folding@Home Double Gromacs Core C
Version 1.00 (Thu Apr 24 19:12:09 PDT 2008)

Preparing to commence simulation
- Files status OK
- Expanded 1210165 -> 3423324 (decompressed 282.8 percent)

Project: 3864 (Run 401, Clone 14, Gen 4)

Assembly optimizations on if available.
Entering M.D.
Will resume from checkpoint file
Working on p3862_fkbprelative_complex
Completed 0 out of 300000 steps  (0%)
Extra SSE2 boost OK
Resuming from checkpoint
Verified work/wudata_01.log
Verified work/wudata_01.edr
Verified work/wudata_01.xvg
Verified work/wudata_01.trr
Verified work/wudata_01.xtc
Completed 294001 out of 300000 steps  (98%)
Timer requesting checkpoint
Completed 297000 out of 300000 steps  (99%)
Timer requesting checkpoint
Timer requesting checkpoint
Completed 300000 out of 300000 steps  (100%)
Timer requesting checkpoint

Finished Work Unit:
- Reading up to 2169040 from "work/wudata_01.trr": Read 2169040
- Reading up to 173040 from "work/wudata_01.xtc": Read 173040
xvg file size: 303788
logfile size: 67685
Leaving Run
- Writing 2816573 bytes of core data to disk...
Done: 2816061 -> 2469250 (compressed to 87.6 percent)
  ... Done.
- Shutting down core

Folding@home Core Shutdown: FINISHED_UNIT
I can send you the wuresults_01.dat file if you wish.


---- edit ----

The second machine is an extremely stable AMD socket A (462) Sempron 2400+ CPU on an Abit NF7-S motherboard. It has finished hundreds of work units, including many big-packet, bonus-point Gromacs units, which in my experience reveal even the slightest instability.

Code:

--- Opening Log file [June 4 10:43:52] 


# Windows Console Edition #####################################################
###############################################################################

                       Folding@Home Client Version 5.04beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\FAH504
Executable: C:\Program Files\FAH504\FAH504-Console.exe
Arguments: -advmethods 

[10:43:52] - Ask before connecting: Yes
[10:43:52] - User name: farmpuma (Team 2630)
[10:43:52] - User ID: 3CAE18179A2206A
[10:43:52] - Machine ID: 1
[10:43:52] 
[10:43:52] Loaded queue successfully.
[10:43:52] + Benchmarking ...
[10:43:55] - Preparing to get new work unit...
[10:43:55] - Presenting message box asking to network.
[10:43:56] + Attempting to get work packet
[10:43:56] - Connecting to assignment server
[10:43:58] - Successful: assigned to (128.143.48.226).
[10:43:58] + News From Folding@Home: Welcome to Folding@Home
[10:43:58] Loaded queue successfully.
[10:45:25] + Connections closed: You may now disconnect
[10:45:25] 
[10:45:25] + Processing work unit
[10:45:25] Core required: FahCore_7c.exe
[10:45:25] Core found.
[10:45:25] Working on Unit 01 [June 4 10:45:25]
[10:45:25] + Working ...
[10:45:26] 
[10:45:26] *------------------------------*
[10:45:26] Folding@Home Double Gromacs Core C
[10:45:26] Version 1.00 (Thu Apr 24 19:12:09 PDT 2008)
[10:45:26] 
[10:45:26] Preparing to commence simulation
[10:45:26] - Files status OK
[10:45:26] - Expanded 245377 -> 657056 (decompressed 267.7 percent)
[10:45:26] 
[10:45:26] Project: 3861 (Run 491, Clone 3, Gen 8)
[10:45:26] 
[10:45:26] Assembly optimizations on if available.
[10:45:26] Entering M.D.
[10:45:32] Working on p3861_fkbpabsolute_ligand
[10:45:32] Completed 0 out of 1500000 steps  (0)
[10:56:34] Timer requesting checkpoint
[11:07:34] Timer requesting checkpoint
[11:18:34] Timer requesting checkpoint
[11:29:34] Timer requesting checkpoint
[11:40:34] Timer requesting checkpoint
[11:43:45] Completed 15000 out of 1500000 steps  (1)
[11:54:47] Timer requesting checkpoint
[12:05:48] Timer requesting checkpoint
[12:16:48] Timer requesting checkpoint
[12:27:49] Timer requesting checkpoint
[12:38:49] Timer requesting checkpoint
[12:41:41] Completed 30000 out of 1500000 steps  (2)
[12:52:44] Timer requesting checkpoint
[13:03:44] Timer requesting checkpoint
[13:14:44] Timer requesting checkpoint
[13:25:44] Timer requesting checkpoint
[13:36:44] Timer requesting checkpoint
[13:39:31] Completed 45000 out of 1500000 steps  (3)
[13:50:33] Timer requesting checkpoint
[14:01:33] Timer requesting checkpoint
[14:12:33] Timer requesting checkpoint
[14:23:33] Timer requesting checkpoint
[14:34:33] Timer requesting checkpoint
[14:37:15] Completed 60000 out of 1500000 steps  (4)
[14:48:17] Timer requesting checkpoint
[14:59:17] Timer requesting checkpoint
[15:10:17] Timer requesting checkpoint
[15:21:17] Timer requesting checkpoint
[15:32:17] Timer requesting checkpoint
[15:35:08] Completed 75000 out of 1500000 steps  (5)
[15:46:10] Timer requesting checkpoint
[15:57:10] Timer requesting checkpoint
[16:08:10] Timer requesting checkpoint
[16:19:10] Timer requesting checkpoint
[16:30:10] Timer requesting checkpoint
[16:33:00] Completed 90000 out of 1500000 steps  (6)
[16:44:02] Timer requesting checkpoint
[16:55:02] Timer requesting checkpoint
[17:06:03] Timer requesting checkpoint
[17:17:05] Timer requesting checkpoint
[17:28:06] Timer requesting checkpoint
[17:30:57] Completed 105000 out of 1500000 steps  (7)
[17:41:59] Timer requesting checkpoint
[17:52:59] Timer requesting checkpoint
[18:03:59] Timer requesting checkpoint
[18:14:59] Timer requesting checkpoint
[18:25:59] Timer requesting checkpoint
[18:28:57] Completed 120000 out of 1500000 steps  (8)
[18:39:59] Timer requesting checkpoint
[18:50:59] Timer requesting checkpoint
[19:01:59] Timer requesting checkpoint
[19:12:59] Timer requesting checkpoint
[19:23:59] Timer requesting checkpoint
[19:28:03] Completed 135000 out of 1500000 steps  (9)
[19:39:05] Timer requesting checkpoint
[19:50:05] Timer requesting checkpoint
[20:01:06] Timer requesting checkpoint
[20:12:06] Timer requesting checkpoint
[20:23:06] Timer requesting checkpoint
[20:26:55] Completed 150000 out of 1500000 steps  (10)
[20:37:58] Timer requesting checkpoint
[20:48:58] Timer requesting checkpoint
[20:59:58] Timer requesting checkpoint

Folding@Home Client Shutdown.


**** numerous stops and restarts due to night folding only ****


--- Opening Log file [June 15 06:52:13] 


# Windows Console Edition #####################################################
###############################################################################

                       Folding@Home Client Version 5.04beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\FAH504
Executable: C:\Program Files\FAH504\FAH504-Console.exe
Arguments: -advmethods 

[06:52:13] - Ask before connecting: Yes
[06:52:13] - User name: farmpuma (Team 2630)
[06:52:13] - User ID: 3CAE18179A2206A
[06:52:13] - Machine ID: 1
[06:52:13] 
[06:52:13] Loaded queue successfully.
[06:52:13] + Benchmarking ...
[06:52:15] 
[06:52:15] + Processing work unit
[06:52:15] Core required: FahCore_7c.exe
[06:52:15] Core found.
[06:52:15] Working on Unit 01 [June 15 06:52:15]
[06:52:15] + Working ...
[06:52:16] 
[06:52:16] *------------------------------*
[06:52:16] Folding@Home Double Gromacs Core C
[06:52:16] Version 1.00 (Thu Apr 24 19:12:09 PDT 2008)
[06:52:16] 
[06:52:16] Preparing to commence simulation
[06:52:16] - Files status OK
[06:52:16] - Expanded 245377 -> 657056 (decompressed 267.7 percent)
[06:52:16] 
[06:52:16] Project: 3861 (Run 491, Clone 3, Gen 8)
[06:52:16] 
[06:52:17] Assembly optimizations on if available.
[06:52:17] Entering M.D.
[06:52:23] Will resume from checkpoint file
[06:52:23] Working on p3861_fkbpabsolute_ligand
[06:52:23] Completed 0 out of 1500000 steps  (0)
[06:52:25] Resuming from checkpoint
[06:52:26] Verified work/wudata_01.log
[06:52:26] Verified work/wudata_01.edr
[06:52:26] Verified work/wudata_01.xvg
[06:52:26] Verified work/wudata_01.trr
[06:52:26] Verified work/wudata_01.xtc
[06:52:26] Completed 1445568 out of 1500000 steps  (96)
[07:03:27] Timer requesting checkpoint
[07:14:27] Timer requesting checkpoint
[07:25:27] Timer requesting checkpoint
[07:29:50] Completed 1455000 out of 1500000 steps  (97)
[07:40:52] Timer requesting checkpoint
[07:51:52] Timer requesting checkpoint
[08:02:52] Timer requesting checkpoint
[08:13:52] Timer requesting checkpoint
[08:24:52] Timer requesting checkpoint
[08:29:15] Completed 1470000 out of 1500000 steps  (98)
[08:40:17] Timer requesting checkpoint
[08:51:17] Timer requesting checkpoint
[09:02:17] Timer requesting checkpoint
[09:13:17] Timer requesting checkpoint
[09:24:17] Timer requesting checkpoint
[09:28:02] Completed 1485000 out of 1500000 steps  (99)
[09:39:04] Timer requesting checkpoint
[09:50:04] Timer requesting checkpoint
[10:01:04] Timer requesting checkpoint
[10:12:04] Timer requesting checkpoint
[10:23:04] Timer requesting checkpoint
[10:26:12] Completed 1500000 out of 1500000 steps  (100)
[10:27:14] 
[10:27:14] Finished Work Unit:
[10:27:14] - Reading up to 435712 from "work/wudata_01.trr": Read 435712
[10:27:14] - Reading up to 27716 from "work/wudata_01.xtc": Read 27716
[10:27:14] xvg file size: 2027418
[10:27:14] logfile size: 118267
[10:27:14] Leaving Run
[10:27:16] - Writing 2712133 bytes of core data to disk...
[10:27:17] Done: 2711621 -> 1094073 (compressed to 40.3 percent)
[10:27:17]   ... Done.
[10:27:17] - Shutting down core
[10:27:17] 
[10:27:17] Folding@home Core Shutdown: FINISHED_UNIT
[10:27:20] CoreStatus = 64 (100)
[10:27:20] Sending work to server


[10:27:20] + Attempting to send results
[10:27:20] - Presenting message box asking to network.

Folding@Home Client Shutdown.


--- Opening Log file [June 15 10:31:13] 


# Windows Console Edition #####################################################
###############################################################################

                       Folding@Home Client Version 5.04beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\FAH504
Executable: C:\Program Files\FAH504\FAH504-Console.exe
Arguments: -send all 

[10:31:13] - Ask before connecting: Yes
[10:31:13] - User name: farmpuma (Team 2630)
[10:31:13] - User ID: 3CAE18179A2206A
[10:31:13] - Machine ID: 1
[10:31:13] 
[10:31:13] Loaded queue successfully.
[10:31:13] Attempting to return result(s) to server...


[10:31:13] + Attempting to send results
[10:31:13] - Presenting message box asking to network.
[10:37:43] - Server reports problem with unit.
[10:37:43] - Failed to send all units to server

Folding@Home Client Shutdown.

So there you have it. After slogging through eleven nights of standard loops the work unit was thrown in the trash and given zero credit even though the finish looks perfectly normal.

Which brings me to another question. Why is it again that you are doing your best to discourage sneaker netting when it can optimize the use of our equipment and electricity, and return the results as quickly as possible? One would think that the assignment server would only assign these work units to machines with SSE2 capable cpus to optimize your/our resources. But given that it doesn't do this, why again are you determined to keep us from doing it?
Last edited by farmpuma on Sat Oct 17, 2009 11:26 am, edited 1 time in total.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 128.143.48.226 : server reports problem with unit

Post by bruce »

farmpuma wrote:Why is it again that you are doing your best to discourage sneaker netting when it can optimize the use of our equipment and electricity, and return the results as quickly as possible? One would think that the assignment server would only assign these work units to machines with SSE2 capable cpus to optimize your/our resources. But given that it doesn't do this, why again are you determined to keep us from doing it?
If you're talking to me, I'm not discouraging sneakernetting -- nor am I determined to keep you from doing it. In fact, I've been responsible for much of the "how-to" documentation for sneakernetting. Even so, it has never been "supported" by the Pande Group.

I'm simply reporting that recent changes to the servers have made it more difficult. The Pande Group sets the security policies for their servers, not me. They've decided to validate that each WU is returned from the same machine that downloaded it.

Having an Assignment Server that confirms the presence of SSE2 and customizes the assignments is an excellent idea. I suggested the same thing a number of years ago. Unfortunately the list of suggestions is much longer than their programming resources and a lot of good ideas never get implemented. (Eventually, the non-SSE2 hardware will be retired, and then it will no longer be an issue.)
farmpuma
Posts: 25
Joined: Sat Mar 21, 2009 12:50 pm
Location: Soybean field, IN, USA

Re: 128.143.48.226 : server reports problem with unit

Post by farmpuma »

Actually I was speaking to jcoffland and any other member of the Pande group who might read this thread. I sincerely believe there is a serious issue here that is being hand waved away and I no longer have any confidence in any work unit from server .226 or any other double gromacs work unit.

Also, the new sneaker netting restrictions make it impossible for me to use my cousin's broadband internet to return work units which are too large to upload via my dial-up connection, hence the rant. I apologize for making it general and not personally directed to the Pande group.

You are probably correct about non-SSE2 hardware going the way of the horse and buggy, but I am a Luddite and quite fond of my last Win98SE system, even though it prevents me from folding bonus Gromacs units.
farmpuma
Posts: 25
Joined: Sat Mar 21, 2009 12:50 pm
Location: Soybean field, IN, USA

Re: 128.143.48.226 : server reports problem with unit

Post by farmpuma »

Curiouser and curiouser... I have two identical (as far as I can tell, except for the download date) 3863 (Run 288, Clone 6, Gen 25) work units, both of which appeared to upload the correct file size and then failed with this error.

Code:

--- Opening Log file [October 12 07:56:11] 


# Windows Console Edition #####################################################
###############################################################################

                       Folding@Home Client Version 5.04beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\FAH
Executable: C:\Program Files\FAH\FAH504-Console.exe
Arguments: -advmethods -forceasm 

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[07:56:11] - Ask before connecting: Yes
[07:56:11] - User name: farmpuma (Team 2630)
[07:56:11] - User ID: 50454F1144A71857
[07:56:11] - Machine ID: 1
[07:56:11] 
[07:56:11] Loaded queue successfully.
[07:56:11] + Benchmarking ...
[07:56:14] - Preparing to get new work unit...
[07:56:14] - Presenting message box asking to network.
[07:56:16] + Attempting to get work packet
[07:56:16] - Connecting to assignment server
[07:56:18] - Successful: assigned to (128.143.48.226).
[07:56:18] + News From Folding@Home: Welcome to Folding@Home
[07:56:18] Loaded queue successfully.
[07:57:46] + Connections closed: You may now disconnect
[07:57:46] 
[07:57:46] + Processing work unit
[07:57:46] Core required: FahCore_7c.exe
[07:57:46] Core found.
[07:57:46] Working on Unit 01 [October 12 07:57:46]
[07:57:46] + Working ...
[07:57:46] 
[07:57:46] *------------------------------*
[07:57:46] Folding@Home Double Gromacs Core C
[07:57:46] Version 1.00 (Thu Apr 24 19:12:09 PDT 2008)
[07:57:46] 
[07:57:46] Preparing to commence simulation
[07:57:46] - Assembly optimizations manually forced on.
[07:57:46] - Not checking prior termination.
[07:57:47] - Expanded 231068 -> 635404 (decompressed 274.9 percent)
[07:57:47] 
[07:57:47] Project: 3863 (Run 288, Clone 6, Gen 25)

and
--- Opening Log file [October 13 06:41:39] 


# Windows Console Edition #####################################################
###############################################################################

                       Folding@Home Client Version 5.04beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\FAH
Executable: C:\Program Files\FAH\FAH504-Console.exe
Arguments: -advmethods -forceasm 

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[06:41:39] - Ask before connecting: Yes
[06:41:39] - User name: farmpuma (Team 2630)
[06:41:39] - User ID: 50454F1144A71857
[06:41:39] - Machine ID: 1
[06:41:39] 
[06:41:39] Loaded queue successfully.
[06:41:39] + Benchmarking ...
[06:41:42] - Preparing to get new work unit...
[06:41:42] - Presenting message box asking to network.
[06:41:44] + Attempting to get work packet
[06:41:44] - Connecting to assignment server
[06:41:45] - Successful: assigned to (128.143.48.226).
[06:41:45] + News From Folding@Home: Welcome to Folding@Home
[06:41:45] Loaded queue successfully.
[06:42:57] + Connections closed: You may now disconnect
[06:42:57] 
[06:42:57] + Processing work unit
[06:42:57] Core required: FahCore_7c.exe
[06:42:57] Core found.
[06:42:57] Working on Unit 01 [October 13 06:42:57]
[06:42:57] + Working ...
[06:42:57] 
[06:42:57] *------------------------------*
[06:42:57] Folding@Home Double Gromacs Core C
[06:42:57] Version 1.00 (Thu Apr 24 19:12:09 PDT 2008)
[06:42:57] 
[06:42:57] Preparing to commence simulation
[06:42:57] - Assembly optimizations manually forced on.
[06:42:57] - Not checking prior termination.
[06:42:57] - Expanded 231068 -> 635404 (decompressed 274.9 percent)
[06:42:57] 
[06:42:57] Project: 3863 (Run 288, Clone 6, Gen 25)
What the heck is going on here?

--- edit ---

It seems to me to be a failure in the duplicate work unit detection routine, which is precisely what we the willing have been trying to say since the beginning of this thread.
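
For anyone who wants to check their own FAHlog.txt for the same pattern, here is a rough sketch in Python. It is not part of the client or of any Stanford tool; the function name `find_repeat_downloads` is made up for illustration, and the patterns assume the v5.04 log wording quoted in this thread (other client versions may word these lines differently). It only counts a PRCG when the session actually fetched new work, so restarts that resume an existing WU are ignored.

```python
import re
from collections import defaultdict

# Patterns follow the FAHlog.txt lines quoted in this thread (v5.04 client).
# Assumption: other client versions may format these lines differently.
SESSION_RE = re.compile(r"^--- Opening Log file \[(.+?)\]")
DOWNLOAD_RE = re.compile(r"\+ Attempting to get work packet")
PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def find_repeat_downloads(log_text):
    """Return {PRCG tuple: [session labels]} for PRCGs downloaded more than once."""
    downloads = defaultdict(list)
    session = "(unknown session)"
    fetching = False  # True between a download attempt and the next Project line
    for line in log_text.splitlines():
        m = SESSION_RE.match(line)
        if m:
            session = m.group(1)
            fetching = False
            continue
        if DOWNLOAD_RE.search(line):
            fetching = True
            continue
        m = PRCG_RE.search(line)
        if m and fetching:
            # Only a Project line that follows a download counts as a new assignment.
            prcg = tuple(int(g) for g in m.groups())
            downloads[prcg].append(session)
            fetching = False
    return {prcg: seen for prcg, seen in downloads.items() if len(seen) > 1}
```

Usage would be something like `find_repeat_downloads(open("FAHlog.txt").read())`. A non-empty result lists each (Project, Run, Clone, Gen) that was assigned more than once, with the log-open stamps of the sessions that fetched it. It cannot flag a repeat if the log covering one of the downloads is missing.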

P.S. Thank you, toTOW, for making my post prettier.
Last edited by toTOW on Fri Oct 16, 2009 11:31 am, edited 1 time in total.
Reason: Added code tags.
codysluder
Posts: 1024
Joined: Sun Dec 02, 2007 12:43 pm

Re: 128.143.48.226 : server reports problem with unit

Post by codysluder »

The project is no longer listed on the project summary, so I'm not sure how long the final deadline is, but judging from several of the logs posted, many of you seem to have downloaded the WU a long time ago and have taken a long time to finish it. I'm not sure what message you get when you try to upload a WU that has passed the final deadline, but "server has a problem with this unit" is probably one of the possible messages.
farmpuma
Posts: 25
Joined: Sat Mar 21, 2009 12:50 pm
Location: Soybean field, IN, USA

Re: 128.143.48.226 : server reports problem with unit

Post by farmpuma »

It is most definitely still on the project summary with 3860, 3861, 3863, and 3864 listed. It is also currently green on the server status page with 4,183 WUs available and 19% served.

--- edit ---

And it is still having issues. The following log files are from a different machine than the one listed in the previous failure post. This system is an Intel E7501 based server motherboard from Gigabyte running a pair of 2.4GHz socket 604 Xeons and ECC memory. It is bone stock and extremely stable. It has successfully crunched hundreds of work units including quite a few SMP 2653s. The CPUs run on a 100MHz DDR FSB and thus crunch slowly for 2.4GHz.


Due to a considerable number of 0.0.0.0 server failures I do not have the download log file for the work unit which is currently failing.  But this is the end of its run and the failure message .. 

--- Opening Log file [October 16 23:16:04] 


# Windows Console Edition #####################################################
###############################################################################

                       Folding@Home Client Version 5.04beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\fah504-a
Executable: C:\Program Files\fah504-a\FAH504-Console.exe
Arguments: -local -forceasm -advmethods 

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[23:16:04] - Ask before connecting: Yes
[23:16:04] - User name: TRFrankenbot (Team 2630)
[23:16:04] - User ID: 79C8BD2B776D045F
[23:16:04] - Machine ID: 1
[23:16:04] 
[23:16:05] Loaded queue successfully.
[23:16:05] + Benchmarking ...
[23:16:07] 
[23:16:07] + Processing work unit
[23:16:07] Core required: FahCore_7c.exe
[23:16:07] Core found.
[23:16:07] Working on Unit 01 [October 16 23:16:07]
[23:16:07] + Working ...
[23:16:07] 
[23:16:07] *------------------------------*
[23:16:07] Folding@Home Double Gromacs Core C
[23:16:07] Version 1.00 (Thu Apr 24 19:12:09 PDT 2008)
[23:16:07] 
[23:16:07] Preparing to commence simulation
[23:16:07] - Assembly optimizations manually forced on.
[23:16:07] - Not checking prior termination.
[23:16:07] - Expanded 246491 -> 661416 (decompressed 268.3 percent)
[23:16:07] 
[23:16:07] Project: 3861 (Run 471, Clone 10, Gen 23)
[23:16:07] 
[23:16:08] Assembly optimizations on if available.
[23:16:08] Entering M.D.
[23:16:13] Will resume from checkpoint file
[23:16:14] Working on p3861_fkbpabsolute_ligand
[23:16:14] Completed 0 out of 1500000 steps  (0)
[23:16:14] Extra SSE2 boost OK
[23:16:19] Resuming from checkpoint
[23:16:19] Verified work/wudata_01.log
[23:16:19] Verified work/wudata_01.edr
[23:16:19] Verified work/wudata_01.xvg
[23:16:19] Verified work/wudata_01.trr
[23:16:19] Verified work/wudata_01.xtc
[23:16:19] Completed 1395001 out of 1500000 steps  (93)
[23:31:20] Timer requesting checkpoint
[23:46:22] Timer requesting checkpoint
[00:01:24] Timer requesting checkpoint
[00:02:58] Completed 1410000 out of 1500000 steps  (94)
[00:18:04] Timer requesting checkpoint
[00:33:04] Timer requesting checkpoint
[00:47:51] Completed 1425000 out of 1500000 steps  (95)
[01:02:54] Timer requesting checkpoint
[01:17:53] Timer requesting checkpoint
[01:31:40] Completed 1440000 out of 1500000 steps  (96)
[01:46:43] Timer requesting checkpoint
[02:01:43] Timer requesting checkpoint
[02:16:42] Completed 1455000 out of 1500000 steps  (97)
[02:16:43] Timer requesting checkpoint
[02:31:47] Timer requesting checkpoint
[02:46:48] Timer requesting checkpoint
[03:01:48] Timer requesting checkpoint
[03:01:48] Completed 1470000 out of 1500000 steps  (98)
[03:16:50] Timer requesting checkpoint
[03:31:52] Timer requesting checkpoint
[03:46:52] Completed 1485000 out of 1500000 steps  (99)
[03:46:52] Timer requesting checkpoint
[04:01:58] Timer requesting checkpoint
[04:16:59] Timer requesting checkpoint
[04:31:53] Completed 1500000 out of 1500000 steps  (100)
[04:31:59] Timer requesting checkpoint
[04:32:55] 
[04:32:55] Finished Work Unit:
[04:32:55] - Reading up to 436144 from "work/wudata_01.trr": Read 436144
[04:32:55] - Reading up to 32268 from "work/wudata_01.xtc": Read 32268
[04:32:55] xvg file size: 2027418
[04:32:55] logfile size: 118320
[04:32:55] Leaving Run
[04:32:58] - Writing 2717170 bytes of core data to disk...
[04:32:59] Done: 2716658 -> 1093849 (compressed to 40.2 percent)
[04:32:59]   ... Done.
[04:32:59] - Shutting down core
[04:32:59] 
[04:32:59] Folding@home Core Shutdown: FINISHED_UNIT
[04:33:02] CoreStatus = 64 (100)
[04:33:02] Sending work to server


[04:33:02] + Attempting to send results
[04:33:02] - Presenting message box asking to network.

Folding@Home Client Shutdown.

*** Due to my dial-up connection and the intermittent nature of the server I was not able to immediately try to upload the work unit. ***

--- Opening Log file [October 17 07:47:32] 


# Windows Console Edition #####################################################
###############################################################################

                       Folding@Home Client Version 5.04beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\fah504-a
Executable: C:\Program Files\fah504-a\FAH504-Console.exe
Arguments: -local -send all 

[07:47:32] - Ask before connecting: Yes
[07:47:32] - User name: TRFrankenbot (Team 2630)
[07:47:32] - User ID: 79C8BD2B776D045F
[07:47:32] - Machine ID: 1
[07:47:32] 
[07:47:32] Loaded queue successfully.
[07:47:32] Attempting to return result(s) to server...


[07:47:32] + Attempting to send results
[07:47:32] - Presenting message box asking to network.
[07:55:32] - Server reports problem with unit.
[07:55:32] - Failed to send all units to server

Folding@Home Client Shutdown.

*** The following work unit reached 23% before I realized it was identical to the one which failed to upload and it is currently sitting in a "hold_dup" folder.  I estimate the time between downloads to be about two days. ***

--- Opening Log file [October 16 09:37:06] 


# Windows Console Edition #####################################################
###############################################################################

                       Folding@Home Client Version 5.04beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\fah504-a
Executable: C:\Program Files\fah504-a\FAH504-Console.exe
Arguments: -local -forceasm -advmethods 

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[09:37:06] - Ask before connecting: Yes
[09:37:06] - User name: TRFrankenbot (Team 2630)
[09:37:06] - User ID: 79C8BD2B776D045F
[09:37:06] - Machine ID: 1
[09:37:06] 
[09:37:06] Loaded queue successfully.
[09:37:06] + Benchmarking ...
[09:37:09] 
[09:37:09] + Processing work unit
[09:37:09] Core required: FahCore_7c.exe
[09:37:09] Core found.
[09:37:09] Working on Unit 01 [October 16 09:37:09]
[09:37:09] + Working ...
[09:37:09] 
[09:37:09] *------------------------------*
[09:37:09] Folding@Home Double Gromacs Core C
[09:37:09] Version 1.00 (Thu Apr 24 19:12:09 PDT 2008)
[09:37:09] 
[09:37:09] Preparing to commence simulation
[09:37:09] - Assembly optimizations manually forced on.
[09:37:09] - Not checking prior termination.
[09:37:09] - Expanded 246491 -> 661416 (decompressed 268.3 percent)
[09:37:09] 
[09:37:09] Project: 3861 (Run 471, Clone 10, Gen 23)
[09:37:09] 
[09:37:10] Assembly optimizations on if available.
[09:37:10] Entering M.D.
[09:37:16] Will resume from checkpoint file
[09:37:16] Working on p3861_fkbpabsolute_ligand
[09:37:16] Completed 0 out of 1500000 steps  (0)
[09:37:16] Extra SSE2 boost OK
[09:37:18] Resuming from checkpoint
[09:37:18] Verified work/wudata_01.log
[09:37:18] Verified work/wudata_01.edr
[09:37:18] Verified work/wudata_01.xvg
[09:37:18] Verified work/wudata_01.trr
[09:37:18] Verified work/wudata_01.xtc
[09:52:18] Timer requesting checkpoint
[10:07:19] Timer requesting checkpoint
[10:19:45] Completed 15000 out of 1500000 steps  (1)
[10:34:48] Timer requesting checkpoint
[10:49:48] Timer requesting checkpoint
[11:02:55] Completed 30000 out of 1500000 steps  (2)
[11:17:57] Timer requesting checkpoint
[11:32:59] Timer requesting checkpoint
[11:45:10] Completed 45000 out of 1500000 steps  (3)
[12:00:14] Timer requesting checkpoint
[12:15:13] Timer requesting checkpoint
[12:27:52] Completed 60000 out of 1500000 steps  (4)
[12:42:54] Timer requesting checkpoint
[12:57:55] Timer requesting checkpoint
[13:10:19] Completed 75000 out of 1500000 steps  (5)
[13:25:23] Timer requesting checkpoint
[13:40:25] Timer requesting checkpoint
[13:52:54] Completed 90000 out of 1500000 steps  (6)
[14:07:57] Timer requesting checkpoint
[14:22:56] Timer requesting checkpoint
[14:36:02] Completed 105000 out of 1500000 steps  (7)
[14:51:04] Timer requesting checkpoint
[15:06:05] Timer requesting checkpoint
[15:20:09] Completed 120000 out of 1500000 steps  (8)
[15:35:12] Timer requesting checkpoint
[15:50:14] Timer requesting checkpoint
[16:03:24] Completed 135000 out of 1500000 steps  (9)
[16:18:27] Timer requesting checkpoint
[16:33:31] Timer requesting checkpoint
[16:45:38] Completed 150000 out of 1500000 steps  (10)
[17:00:40] Timer requesting checkpoint
[17:15:41] Timer requesting checkpoint
[17:28:19] Completed 165000 out of 1500000 steps  (11)
[17:43:24] Timer requesting checkpoint
[17:58:23] Timer requesting checkpoint
[18:11:21] Completed 180000 out of 1500000 steps  (12)
[18:26:23] Timer requesting checkpoint
[18:41:24] Timer requesting checkpoint
[18:54:42] Completed 195000 out of 1500000 steps  (13)
[19:09:47] Timer requesting checkpoint
[19:24:49] Timer requesting checkpoint
[19:39:16] Completed 210000 out of 1500000 steps  (14)
[19:54:18] Timer requesting checkpoint
[20:09:19] Timer requesting checkpoint
[20:23:44] Completed 225000 out of 1500000 steps  (15)
[20:38:46] Timer requesting checkpoint
[20:53:46] Timer requesting checkpoint
[21:06:04] Completed 240000 out of 1500000 steps  (16)
[21:21:07] Timer requesting checkpoint
[21:36:07] Timer requesting checkpoint
[21:47:34] Completed 255000 out of 1500000 steps  (17)
[22:02:38] Timer requesting checkpoint
[22:17:38] Timer requesting checkpoint
[22:32:06] Completed 270000 out of 1500000 steps  (18)
[22:47:08] Timer requesting checkpoint
[23:02:08] Timer requesting checkpoint
[23:14:38] Completed 285000 out of 1500000 steps  (19)

Folding@Home Client Shutdown.

*** 20 to 23 percent looks the same as above. *** 
I suspect when the server code was recently changed the duplicate time frame was either left undefined or improperly defined.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 128.143.48.226 : server reports problem with unit

Post by bruce »

What do you mean by "the duplicate time frame"?

Projects 3860-3864 have a preferred deadline of either 6 or 8 days and a final deadline of 60 days.
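
To make the deadline arithmetic concrete, here is a small Python sketch. The function name `deadline_status` and the dates in the usage note are hypothetical, and it only compares a WU's age against the figures quoted above; how the server actually treats a late WU is not shown here.

```python
from datetime import datetime, timedelta


def deadline_status(downloaded, now, preferred_days, final_days=60):
    """Classify a WU's age against its preferred and final deadlines (in days)."""
    age = now - downloaded
    if age > timedelta(days=final_days):
        return "past final deadline"
    if age > timedelta(days=preferred_days):
        return "past preferred deadline"
    return "within preferred deadline"
```

For example, `deadline_status(datetime(2009, 10, 1), datetime(2009, 10, 17), preferred_days=8)` describes a WU that is 16 days old: past the 8-day preferred deadline but well inside the 60-day final deadline. The FAHlog.txt timestamps carry no year, so the download date has to be supplied by hand.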

Please let us know the PRCG of each WU with a problem. Also, when was each WU downloaded, and when did each one start getting the error message?