Project: 2619 (Run 1, Clone 870, Gen 0)

Ren02
Posts: 98
Joined: Tue Dec 11, 2007 1:16 am
Location: Estonia

Project: 2619 (Run 1, Clone 870, Gen 0)

Post by Ren02 »

I was issued this WU on April 6.

Code: Select all

[02:39:01] Core required: FahCore_a2.exe
[02:39:01] Core found.
[02:39:01] Working on Unit 06 [April 6 02:39:01]
[02:39:01] + Working ...
[02:39:01] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 06 -checkpoint 15 -forceasm -verbose -lifeline 3951 -version 602'

[02:39:02] 
[02:39:02] *------------------------------*
[02:39:02] Folding@Home Gromacs SMP Core
[02:39:02] Version 1.91 (2007)
[02:39:02] 
[02:39:02] Preparing to commence simulation
[02:39:02] - Ensuring status. Please wait.
[02:39:19] - Assembly optimizations manually forced on.
[02:39:19] - Not checking prior termination.
[02:39:19] Error: Work unit read from disk is invalid
[02:39:19] Finalizing output
[02:39:24] - Expanded 7865616 -> 48331685 (decompressed 68.4 percent)
[02:39:26] 
[02:39:26] Project: 2619 (Run 1, Clone 870, Gen 0)

snip

[10:26:58] Completed 114380 out of 125000 steps  (92%)
[10:42:07] Timer requesting checkpoint
[10:47:51] Completed 115630 out of 125000 steps  (93%)
[11:03:01] Timer requesting checkpoint
[11:08:44] Completed 116880 out of 125000 steps  (94%)
[11:23:54] Timer requesting checkpoint
[11:26:49] ***** Got a SIGTERM signal (15)
[11:26:49] Killing all core threads

Folding@Home Client Shutdown.
This was processing on my office PC at work. I was about to take two weeks off, so I decided to shut the PC down for that period. I added -oneunit to fah in the hope that it would complete the unit and then shut down. Unfortunately it crashed:

Code: Select all

--- Opening Log file [April 7 11:31:20] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.02beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/r3d/foldingathome/CPU1
Executable: ./fah6
Arguments: -smp -oneunit -verbosity 9 

[11:31:20] - Ask before connecting: No
[11:31:20] - User name: Ren02 (Team 385)
[11:31:20] - User ID: 3A14A43C1E9D8348
[11:31:20] - Machine ID: 1
[11:31:20] 
[11:31:21] Loaded queue successfully.
[11:31:21] 
[11:31:21] + Processing work unit
[11:31:21] Core required: FahCore_a2.exe
[11:31:21] Core found.
[11:31:21] - Autosending finished units...
[11:31:21] Trying to send all finished work units
[11:31:21] + No unsent completed units remaining.
[11:31:21] - Autosend completed
[11:31:21] Working on Unit 06 [April 7 11:31:21]
[11:31:21] + Working ...
[11:31:21] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 879 -version 602'

[11:31:21] 
[11:31:21] *------------------------------*
[11:31:21] Folding@Home Gromacs SMP Core
[11:31:21] Version 1.91 (2007)
[11:31:21] 
[11:31:21] Preparing to commence simulation
[11:31:21] - Ensuring status. Please wait.
[11:31:38] - Looking at optimizations...
[11:31:38] - Working with standard loops on this execution.
[11:31:38] - Created dyn
[11:31:38] - Files status OK
[11:31:38] Error: Work unit read from disk is invalid
[11:31:38] Finalizing output
[11:31:41] - Expanded 7865616 -> 48331685 (decompressed 68.4 percent)
[11:31:42] 
[11:31:42] Project: 2619 (Run 1, Clone 870, Gen 0)
[11:31:42] 
[11:31:42] Entering M.D.
[11:31:49] Will resume from checkpoint file
[11:32:20]  (0%)
[11:32:25] CoreStatus = FF (255)
[11:32:25] Client-core communications error: ERROR 0xff
[11:32:25] Deleting current work unit & continuing...
[11:34:02] ***** Got an Activate signal (2)
[11:34:02] Killing all core threads

Folding@Home Client Shutdown.
[11:34:02] - Warning: Could not delete all work unit files (6): Core file absent
[11:34:02] Trying to send all finished work units
[11:34:02] + No unsent completed units remaining.
[11:34:02] - Preparing to get new work unit...
[11:34:02] + Attempting to get work packet
[11:34:02] - Will indicate memory of 1003 MB
[11:34:02] - Detect CPU. Vendor: AuthenticAMD, Family: 15, Model: 11, Stepping: 2
[11:34:02] - Connecting to assignment server
[11:34:02] Connecting to http://assign.stanford.edu:8080/
It looked like it was about to download a new unit, so I quickly shut it down.
Two weeks later:

Code: Select all

--- Opening Log file [April 17 11:28:09] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.02beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/r3d/foldingathome/CPU1
Executable: /home/r3d/foldingathome/CPU1/fah6
Arguments: -local -forceasm -smp -verbosity 9 

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[11:28:09] - Ask before connecting: No
[11:28:09] - User name: Ren02 (Team 385)
[11:28:09] - User ID: 3A14A43C1E9D8348
[11:28:09] - Machine ID: 1
[11:28:09] 
[11:28:09] Loaded queue successfully.
[11:28:09] - Autosending finished units...
[11:28:09] Trying to send all finished work units
[11:28:09] + No unsent completed units remaining.
[11:28:09] - Autosend completed
[11:28:09] - Preparing to get new work unit...
[11:28:09] + Attempting to get work packet
[11:28:09] - Will indicate memory of 1003 MB
[11:28:09] - Detect CPU. Vendor: AuthenticAMD, Family: 15, Model: 11, Stepping: 2
[11:28:09] - Connecting to assignment server
[11:28:09] Connecting to http://assign.stanford.edu:8080/
[11:28:10] Posted data.
[11:28:10] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[11:28:10] + News From Folding@Home: Welcome to Folding@Home
[11:28:10] Loaded queue successfully.
[11:28:10] Connecting to http://171.64.65.56:8080/
[11:28:19] Posted data.
[11:28:19] Initial: 0000; - Receiving payload (expected size: 7866128)
[11:28:34] - Downloaded at ~512 kB/s
[11:28:34] - Averaged speed for that direction ~559 kB/s
[11:28:34] + Received work.
[11:28:34] + Closed connections
[11:28:34] 
[11:28:34] + Processing work unit
[11:28:34] Core required: FahCore_a2.exe
[11:28:34] Core found.
[11:28:34] Working on Unit 07 [April 17 11:28:34]
[11:28:34] + Working ...
[11:28:34] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 07 -checkpoint 15 -forceasm -verbose -lifeline 4023 -version 602'

[11:28:35] 
[11:28:35] *------------------------------*
[11:28:35] Folding@Home Gromacs SMP Core
[11:28:35] Version 1.91 (2007)
[11:28:35] 
[11:28:35] Preparing to commence simulation
[11:28:35] - Ensuring status. Please wait.
[11:28:52] - Assembly optimizations manually forced on.
[11:28:52] - Not checking prior termination.
[11:28:52] Error: Work unit read from disk is invalid
[11:28:52] Finalizing output
[11:28:57] - Expanded 7865616 -> 48331685 (decompressed 68.4 percent)
[11:29:00] 
[11:29:00] Project: 2619 (Run 1, Clone 870, Gen 0)

snip

[22:52:46] Completed 123130 out of 125000 steps  (99%)
[23:07:56] Timer requesting checkpoint
[23:14:20] Completed 124380 out of 125000 steps  (100%)
[23:26:08] 
[23:26:08] Finished Work Unit:
[23:26:08] - Reading up to 7393252 from "work/wudata_07.trr": Read 7393252
[23:26:09] - Reading up to 10364620 from "work/wudata_07.xtc": Read 10364620
[23:26:10] logfile size: 63884
[23:26:10] Leaving Run
[23:26:14] - Writing 17907124 bytes of core data to disk...
[23:26:24] Done: 17906612 -> 17383643 (compressed to 97.0 percent)
[23:26:24]   ... Done.
[23:26:28] - Shutting down core
[23:28:08] - Autosending finished units...
[23:28:08] Trying to send all finished work units
[23:28:08] + No unsent completed units remaining.
[23:28:08] - Autosend completed
[23:28:28] 
[23:28:28] Folding@home Core Shutdown: FINISHED_UNIT
[23:31:29] CoreStatus = 64 (100)
[23:31:29] Unit 7 finished with 62 percent of time to deadline remaining.
[23:31:29] Updated performance fraction: 0.582747
[23:31:29] Sending work to server


[23:31:29] + Attempting to send results
[23:31:29] - Reading file work/wuresults_07.dat from core
[23:31:30]   (Read 17384155 bytes from disk)
[23:31:30] Connecting to http://171.64.65.56:8080/
[23:31:53] Posted data.
[23:31:54] Initial: 0000; - Uploaded at ~707 kB/s
[23:31:54] - Averaged speed for that direction ~808 kB/s
[23:31:54] + Results successfully sent
[23:31:54] Thank you for your contribution to Folding@Home.
[23:31:54] + Number of Units Completed: 16
The same WU! This time it finished just fine and I got credited as well. My question is: what happened to this WU during the 10 days my PC was off? Since I was unable to complete it the first time around, it should have reached the preferred deadline (April 10) and been assigned to someone else. Yet it was waiting for me on April 17. How is this possible?
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Project: 2619 (Run 1, Clone 870, Gen 0)

Post by 7im »

Each fah client is responsible for completing a work unit until the final deadline for that work unit has passed.

When a work unit errors out in such a way that it sends back no data and no error message, the work server will reassign the same work unit, on the assumption that it was lost to an accidental deletion, an error while downloading, or the like.

It's part of the redundancy built into the project for reliability.
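
Roughly, the idea is something like this (just an illustrative sketch; pick_work, outstanding_units and the other names are made up, not the actual work server code):

Code: Select all

# Illustrative sketch only -- invented names, not the actual work server code.
# A WU that went out but never produced results or an error report is treated
# as lost and re-issued on the client's next work request.

def pick_work(server, client):
    """Prefer re-issuing a unit this client apparently lost."""
    for wu in server.outstanding_units:
        if wu.assigned_to == client.user_id and not wu.anything_returned:
            return wu                     # assume it was lost; send the same WU again
    return server.next_fresh_unit()       # otherwise hand out new work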
Ren02
Posts: 98
Joined: Tue Dec 11, 2007 1:16 am
Location: Estonia

Re: Project: 2619 (Run 1, Clone 870, Gen 0)

Post by Ren02 »

7im wrote:Each fah client is responsible for completing a work unit until the final deadline for that work unit has passed.

When a work unit errors out in such a way that it sends back no data and no error message, the work server will reassign the same work unit, on the assumption that it was lost to an accidental deletion, an error while downloading, or the like.

It's part of the redundancy built into the project for reliability.
Indeed, but in the example above this responsibility seems to extend considerably beyond the final deadline. I was gone 10 days, and when I resumed folding I received the same unit. There'd be no problem if this had occurred within the 4-day deadline. But since it happened 7 days AFTER the final deadline, it makes me wonder if there is a bug in the system. After all, some donors do quit. Does that mean the last WU assigned to them will simply be lost?
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Project: 2619 (Run 1, Clone 870, Gen 0)

Post by 7im »

I forgot to mention Bug #21 from the SMP Known Bugs list.

http://foldingforum.org/viewtopic.php?f=8&t=50
21) If a WU expires, it continues to process rather than aborting.
Looks like I will have to add the problem you encountered to that bug description: an expired WU doesn't abort, even two weeks later.

I also added this info for the developers.
Ren02
Posts: 98
Joined: Tue Dec 11, 2007 1:16 am
Location: Estonia

Re: Project: 2619 (Run 1, Clone 870, Gen 0)

Post by Ren02 »

No matter, since that bug is completely unrelated.
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Project: 2619 (Run 1, Clone 870, Gen 0)

Post by 7im »

We don't know that for sure. It could be related.
Ren02
Posts: 98
Joined: Tue Dec 11, 2007 1:16 am
Location: Estonia

Re: Project: 2619 (Run 1, Clone 870, Gen 0)

Post by Ren02 »

7im wrote:
21) If a WU expires, it continues to process rather than aborting.
Looks like I will have to add the problem you encountered to that bug description: an expired WU doesn't abort, even two weeks later.
No. My client downloaded a new WU. It just happened to be the same WU I had been processing two weeks earlier, which leads me to conclude that the FAH assignment server DOES NOT REASSIGN a WU to someone else once it is past the final deadline.
Ren02
Posts: 98
Joined: Tue Dec 11, 2007 1:16 am
Location: Estonia

Re: Project: 2619 (Run 1, Clone 870, Gen 0)

Post by Ren02 »

7im wrote:We don't know that for sure. It could be related.
Hmmm...
You mean the client could be stubborn and request the same WU it was processing? ;)
Has someone else completed Project 2619 (Run 1, Clone 870, Gen 0)?
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Project: 2619 (Run 1, Clone 870, Gen 0)

Post by 7im »

Hi ChasR (team 32),
Your WU (P2619 R1 C870 G0) was added to the stats database on 2008-04-11 14:37:34 for 1620 points of credit.

Hi Ren02 (team 385),
Your WU (P2619 R1 C870 G0) was added to the stats database on 2008-04-18 16:42:31 for 1620 points of credit.


So now there are so many things wrong with this scenario, I don't even know where to begin.
Ren02
Posts: 98
Joined: Tue Dec 11, 2007 1:16 am
Location: Estonia

Re: Project: 2619 (Run 1, Clone 870, Gen 0)

Post by Ren02 »

Ah. But at least the WU wasn't stalled. :) So after my preferred/final deadline of 2008-04-10 02:39 it was reassigned, and 1.5 days later it was completed. The project must go on. ;)
Now the only question is whether it is a client-side bug, as you said earlier, or whether there are some issues on the server side as well.
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Project: 2619 (Run 1, Clone 870, Gen 0)

Post by 7im »

I don't suppose we can rule out complete coincidence either. :twisted:
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2619 (Run 1, Clone 870, Gen 0)

Post by bruce »

Ren02 wrote:No. My client downloaded a new WU. It just happened to be the same WU I had been processing two weeks earlier, which leads me to conclude that the FAH assignment server DOES NOT REASSIGN a WU to someone else once it is past the final deadline.
It depends on what you mean by "REASSIGN to someone else".

1) If user1 starts a WU and then requests a new WU without uploading anything, they are generally reassigned the same WU.
2) If the Preferred deadline passes, the same WU is assigned to someone else -- user2.
3) If (as in 1, above) user1 requests a new WU again and the WU has not been completed by EITHER user1 or user2, the same WU may still be assigned to user1.
4) Once the WU is completed by either user1 or user2 -- or, in fact, by both user1 AND user2 -- the WU should no longer be reassigned to anyone. This completion can happen after the Preferred Deadline.
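
To put those four cases together, a rough sketch (the function and field names here -- choose_wu, preferred_deadline, etc. -- are invented; this is not the real assignment-server code):

Code: Select all

# Rough sketch of cases 1-4 above; all names are invented.

def choose_wu(server, user, now):
    for wu in server.units:
        if wu.completed:                    # case 4: already returned by someone,
            continue                        #         so never reassign it
        if user in wu.assigned_to:          # cases 1 and 3: the user's own
            return wu                       #   outstanding WU is simply re-issued
        if now > wu.preferred_deadline:     # case 2: preferred deadline has passed,
            return wu                       #   so offer it to a different user
    return server.new_unit()                # nothing outstanding fits -> fresh work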
anandhanju
Posts: 526
Joined: Mon Dec 03, 2007 4:33 am
Location: Australia

Re: Project: 2619 (Run 1, Clone 870, Gen 0)

Post by anandhanju »

My two cents here... I understand that the researchers at FAH have built in some sort of anti-dumping measure that reassigns the same SMP WU if it is dumped. Could it be that the servers knew about the assigned but "dumped" (no results returned) WU and reassigned it, without checking whether it had been assigned to and returned by someone else? To give a timeline of things (UTC):

April 6th: Ren02 gets Project: 2619 (Run 1, Clone 870, Gen 0)
April 7th: Ren02 decides to stop the client to add the oneunit flag. Client fails to recover from checkpoint and trashes WU.
April 10th: WU crosses preferred deadline, is flagged for reassignment and is assigned to ChasR.
April 11th: ChasR returns WU for full credit.
April 17th: Ren02 starts client after 10 days (with an empty queue, as indicated by "[11:28:09] Preparing to get new work unit..."). Ren02 gets the same WU that was returned by ChasR. <-- Shouldn't have happened.
April 18th: Ren02 finishes and returns WU for full credit.

My initial thought was that the WU was queued up and the bug (inability to recognize and delete expired WUs) caused Ren02's client to crunch it again. But the fact that the server assigned this to Ren02 suggests that the logic in the Work Server's assignment module is a bit funny. Did it not realize that although this WU "belonged" to Ren02, ChasR had been assigned to it and had submitted results? I suspect it's either slow feedback from the collection module or a missing finished-WU check during reassignment.
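
If that guess is right, the dumped-WU path is skipping the one check that would have stopped this. Something like the purely speculative sketch below (all names invented; not actual server code):

Code: Select all

# Purely speculative sketch of the dumped-WU path; all names are invented.

def reassign_dumped(server, user, wu):
    # The server remembers that this user dumped the WU (no results came back)...
    if wu.dumped_by == user:
        # ...but apparently never asks whether ANYONE has returned it since.
        if server.results_received(wu):     # <-- the finished-WU check that looks
            return server.new_unit()        #     missing; ChasR's result would
                                            #     divert Ren02 to fresh work here
        return wu                           # otherwise re-issue the dumped WU
    return server.new_unit()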

Dang, that coffee was killer! I hope it isn't a "count a million sheep to fall asleep" night.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2619 (Run 1, Clone 870, Gen 0)

Post by bruce »

anandhanju wrote:My two cents here... I understand that the researchers at FAH have built in some sort of anti-dumping measure that reassigns the same SMP WU if it is dumped. Could it be that the servers knew about the assigned but "dumped" (no results returned) WU and reassigned it, without checking whether it had been assigned to and returned by someone else?

My initial thought was that the WU was queued up and the bug (inability to recognize and delete expired WUs) caused Ren02's client to crunch it again. But the fact that the server assigned this to Ren02 suggests that the logic in the Work Server's assignment module is a bit funny. Did it not realize that although this WU "belonged" to Ren02, ChasR had been assigned to it and had submitted results? I suspect it's either slow feedback from the collection module or a missing finished-WU check during reassignment.
I really doubt it is the anti-dumping measures -- slow feedback from the CS or a missing check for a finished WU sounds more likely. I can confirm your timeline, though (in PDT):

Donator: ChasR
Credit: 1620 Credit Time: 2008-04-11 14:37:34

Donator: Ren02
WU assigned to donor at: 2008-04-17 04:20:48


That certainly should not be happening, and I'll make sure it's reported as a bug in the assignment logic.
Ren02
Posts: 98
Joined: Tue Dec 11, 2007 1:16 am
Location: Estonia

Re: Project: 2619 (Run 1, Clone 870, Gen 0)

Post by Ren02 »

Luckily a scenario like this should be a rather rare occurrence, so it's understandable that it might have slipped the programmers' minds. I was worried that the WU wouldn't get reassigned and would just sit waiting until the donor returned, but that isn't happening. It's just that the first WU assigned to a returning donor is kind of useless... Well, it's definitely worth fixing, though.

PS. Loved the timeline. It gives the case a news-coverage feel. :D