Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Sun Dec 23, 2007 11:03 am
by _ikki_
It crashed 3 times at the same point (59%).

I made hourly backups of the fah folder, so if anyone is interested, I can provide the backup from just before the crash.
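
For readers who want to keep similar snapshots, here is a minimal sketch of a timestamped folder copy. The function name `backup_folder` and the `fah-YYYYMMDD-HHMMSS` naming scheme are illustrative assumptions, not something the client provides, and scheduling it hourly (cron, Task Scheduler, etc.) is left out. Copying while the client is writing may catch files mid-write, so treat a snapshot as best-effort.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_folder(src, backup_root, now=None):
    """Copy the folder `src` into `backup_root` under a timestamped name.

    Hypothetical helper: the names and layout are assumptions for illustration.
    Returns the path of the new snapshot directory.
    """
    now = now or datetime.now()
    # e.g. fah-20071223-110000 for a source folder named "fah"
    dest = Path(backup_root) / f"{Path(src).name}-{now:%Y%m%d-%H%M%S}"
    # copytree creates the destination (and missing parents) and fails
    # if the destination already exists, so snapshots are never overwritten.
    shutil.copytree(src, dest)
    return dest


# Example call (paths are placeholders):
# backup_folder("fah", "fah_backups")
```

Restoring a snapshot then amounts to stopping the client and copying the chosen snapshot back over the client folder.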

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Sun Dec 23, 2007 1:18 pm
by Flathead74
_ikki_, have you tried restarting the WU from a point just before the usual crash-point, say at 56 - 57% or so?

Some folks have found that the WU will complete if done this way, though not all.

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Thu Dec 27, 2007 10:06 am
by _ikki_
No, I didn't try restarting the WU from a point before the crash, but it's worth testing, thanks to my backups ;)

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Thu Dec 27, 2007 12:15 pm
by gwildperson
_ikki_ wrote: It crashed 3 times at the same point (59%)
What was the error message? We might be able to help a bit more if you posted FAHlog.txt. There are lots of possible meanings for "crashed" and I don't want to guess. ;)

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Sun Dec 30, 2007 3:16 pm
by _ikki_
The error looks like this:
[10:51:13] Completed 2900000 out of 5000000 steps (58 percent)
[11:00:31] Writing local files
[11:00:31] Completed 2950000 out of 5000000 steps (59 percent)
[11:08:25] Warning: long 1-4 interactions
[11:08:29] CoreStatus = 1 (1)
[11:08:29] Client-core communications error: ERROR 0x1
[11:08:29] Deleting current work unit & continuing...
[11:12:51] - Warning: Could not delete all work unit files (7): Core returned invalid code
[11:12:51] Trying to send all finished work units
[11:12:51] + No unsent completed units remaining.
[11:12:51] - Preparing to get new work unit...
[11:12:51] + Attempting to get work packet
[11:12:51] - Will indicate memory of 2014 MB
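
When comparing several failed runs, it can help to pull the last completed percentage and the core status out of FAHlog.txt automatically. The sketch below assumes the line formats shown in the excerpt above; other client or core versions may print these lines differently, and `summarize_log` is a hypothetical helper name.

```python
import re

# Patterns modelled on the log lines quoted in this thread (assumption:
# other client versions may format them differently).
PROGRESS_RE = re.compile(r"Completed (\d+) out of (\d+) steps\s+\((\d+) percent\)")
CORE_STATUS_RE = re.compile(r"CoreStatus = (\w+)")


def summarize_log(lines):
    """Return (last_percent, core_status) found in an iterable of log lines.

    Either value is None if no matching line was seen.
    """
    last_percent = None
    core_status = None
    for line in lines:
        progress = PROGRESS_RE.search(line)
        if progress:
            last_percent = int(progress.group(3))
        status = CORE_STATUS_RE.search(line)
        if status:
            core_status = status.group(1)
    return last_percent, core_status


sample = [
    "[10:51:13] Completed 2900000 out of 5000000 steps (58 percent)",
    "[11:00:31] Writing local files",
    "[11:00:31] Completed 2950000 out of 5000000 steps (59 percent)",
    "[11:08:25] Warning: long 1-4 interactions",
    "[11:08:29] CoreStatus = 1 (1)",
]
print(summarize_log(sample))  # prints (59, '1')
```

To run it on a real log, pass an open file instead of the sample list, for example `summarize_log(open("FAHlog.txt"))`.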

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Sun Dec 30, 2007 4:29 pm
by ChelseaOilman
This WU hasn't been submitted by anyone yet. If it's been issued to other people they may be having the same issues as you.

Look in the work folder to see if there are any wuresults_0x.dat files. If there are, I would try running qfix to get credit for what you did and get this WU entered into the WU database.
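
The check for result files can be scripted. This is a minimal sketch that only lists matching files; it does not run qfix, and both the helper name `find_wuresults` and the glob pattern `wuresults_*.dat` are assumptions based on the file name mentioned above.

```python
from pathlib import Path


def find_wuresults(work_dir):
    """Return the sorted names of wuresults_*.dat files in work_dir.

    Returns an empty list if the folder does not exist or has no such files.
    """
    folder = Path(work_dir)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.glob("wuresults_*.dat"))


# Example call (the "work" path is a placeholder for the client's work folder):
# print(find_wuresults("work"))
```

An empty result means there is nothing for qfix to recover from that folder.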

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Mon Dec 31, 2007 5:35 pm
by _ikki_
There is no wuresults_0x.dat file in the work folder.

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Tue Jan 01, 2008 10:29 am
by _ikki_
For debugging, here is the data from one frame before the crash (about 10 minutes earlier):

http://rapidshare.com/files/80698307/fa ... 1.tgz.html

Happy new year :d

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Wed Jan 02, 2008 1:22 am
by VijayPande
Thanks! We'll take a look once the full team gets back from the Stanford holiday break.

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Wed Jan 02, 2008 2:04 pm
by _ikki_
Okay, I hope this data will be useful.

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Wed Jan 02, 2008 6:32 pm
by ChelseaOilman
Someone else has successfully finished this WU.

Your WU (P3062 R3 C26 G1) was added to the stats database on 2007-12-31 22:59:05 for 1324 points of credit.

After changing the date on my computer, because the deadline had passed, I was able to run the WU past frame 59 as well. If someone else hadn't already submitted it for credit I would keep going to the end.
[17:53:44] Working on Unit 07 [December 23 17:53:44]
[17:53:44] + Working ...
[17:53:44] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 07 -checkpoint 15 -forceasm -verbose -lifeline 20217 -version 600'

[17:53:44]
[17:53:44] *------------------------------*
[17:53:44] Folding@Home Gromacs SMP Core
[17:53:44] Version 1.74 (November 27, 2006)
[17:53:44]
[17:53:44] Preparing to commence simulation
[17:53:44] - Ensuring status. Please wait.
[17:54:01] - Assembly optimizations manually forced on.
[17:54:01] - Not checking prior termination.
[17:54:01] - Expanded 607662 -> 3257309 (decompressed 536.0 percent)
[17:54:01]
[17:54:01] Project: 3062 (Run 3, Clone 26, Gen 1)
[17:54:01]
[17:54:01] Assembly optimizations on if available.
[17:54:01] Entering M.D.
[17:54:07] Calling FAH init
[17:54:07] mbda5_99sb
[17:54:07] Writing local files
[17:54:07] Completed 2900000 out of 5000000 steps (58 percent)
[17:54:07] Extra SSE boost OK.
[17:54:07]
[17:54:07] Completed 2900000 out of 5000000 steps (58 percent)
[17:54:07] Extra SSE boost OK.
[18:02:07] Writing local files
[18:02:08] Completed 2950000 out of 5000000 steps (59 percent)
[18:10:16] Writing local files
[18:10:16] Completed 3000000 out of 5000000 steps (60 percent)
[18:18:21] Writing local files
[18:18:21] Completed 3050000 out of 5000000 steps (61 percent)
[18:26:20] Writing local files
[18:26:20] Completed 3100000 out of 5000000 steps (62 percent)

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Thu Jan 03, 2008 1:13 pm
by _ikki_
Someone else has successfully finished this WU.
... since the last time you checked this WU?

What does that mean?

Did the donor who finished the WU restart the client before the crash, or did the WU finish without any intervention? (The only one who can answer is the donor himself ;) )

Next time I'll try restarting the client if the deadline hasn't passed, but that means inspecting the log file regularly...

Let me ask a couple of questions to conclude:
- Should we warn Stanford if a WU has crashed and we then manage to complete it (after restarting the client)?
- What should we do if the WU crashed and the deadline has passed? Should we report it?

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Thu Jan 03, 2008 1:32 pm
by bruce
_ikki_ wrote: Next time I'll try restarting the client if the deadline hasn't passed, but that means inspecting the log file regularly...

Let me ask a couple of questions to conclude:
- Should we warn Stanford if a WU has crashed and we then manage to complete it (after restarting the client)?
- What should we do if the WU crashed and the deadline has passed? Should we report it?
At some point I'm sure that Stanford will figure out how to prevent this problem before it happens, but at this point the only thing we can do is continue to gather data that may help them find the problem. I'd recommend that we do report cases where a WU failed without a restart but was able to proceed if it was stopped/restarted. That may be the only way we can help.

The fact that ChelseaOilman was able to resume processing and continue past the point of the original error is important. For debugging purposes, what I'd like to see is a captured WU that will fail in the next frame, not one that can continue.
ChelseaOilman wrote:After changing the date on my computer, because the deadline had passed, I was able to run the WU past frame 59 as well. If someone else hadn't already submitted it for credit I would keep going to the end.
For the purposes of learning more about the protein itself, finishing the project is important. For the purposes of debugging, finding a repeatable error is important.

If you want to help with the debugging, it's probably better to use the advanced setting to ignore local deadlines than it is to change your system clock, but both work.

@ _ikki_
If you disable local deadlines and restart the WU that you published, does it fail for you (indicating a possible hardware issue) or does it continue like it did for ChelseaOilman (indicating that restarting changed something) :?:

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Thu Jan 03, 2008 7:32 pm
by _ikki_
bruce wrote:
@ _ikki_
If you disable local deadlines and restart the WU that you published, does it fail for you (indicating a possible hardware issue) or does it continue like it did for ChelseaOilman (indicating that restarting changed something) :?:
How do I disable local deadlines without changing the date? :?:

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Posted: Thu Jan 03, 2008 8:29 pm
by bruce
_ikki_ wrote: How do I disable local deadlines without changing the date? :?:
With the console client, restart with -config or with -configonly.
. . .
Change advanced options (yes/no) [no]? y
. . .
Ignore any deadline information (mainly useful if
system clock frequently has errors) (no/yes) [no]? y
. . .