Project: 3062 (Run 3, Clone 26, Gen 1)

Moderators: Site Moderators, FAHC Science Team

Postby _ikki_ » Sun Dec 23, 2007 12:03 pm

It crashed 3 times at the same point (59%).

I made hourly backups of the fah folder, so if anyone is interested, I can provide the backup from just before the crash.
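For anyone who wants to keep similar backups, here is a minimal sketch of a timestamped folder copy. It is not the script used for the backups above; the function name `backup_folder` and the directory naming scheme are assumptions made for illustration, and the client should ideally be stopped while the copy runs so the files are consistent.

```python
import shutil
import time
from pathlib import Path


def backup_folder(src: Path, dest_root: Path) -> Path:
    """Copy the whole src folder into dest_root under a timestamped name.

    Hypothetical helper: the '<name>-YYYYmmdd-HHMMSS' naming is an
    assumption, not something the Folding@Home client requires.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest_root / f"{src.name}-{stamp}"
    # copytree raises FileExistsError if target already exists, so two
    # backups started within the same second will not overwrite each other.
    shutil.copytree(src, target)
    return target
```

Calling this once an hour from cron or a scheduled task would give the kind of hourly snapshots described above; restoring is then a matter of copying the chosen snapshot back over the fah folder.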
Team #35819 P2P-Community
_ikki_
 
Posts: 27
Joined: Wed Dec 05, 2007 9:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Postby Flathead74 » Sun Dec 23, 2007 2:18 pm

_ikki_, have you tried restarting the WU from a point just before the usual crash-point, say at 56 - 57% or so?

Some folks have found that the WU will complete if done this way, though not all.
Flathead74
 
Posts: 266
Joined: Sun Dec 02, 2007 7:08 pm
Location: Central New York

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Postby _ikki_ » Thu Dec 27, 2007 11:06 am

No, I didn't try restarting the WU before it crashed, but it's worth testing, thanks to my backups ;)
_ikki_
 
Posts: 27
Joined: Wed Dec 05, 2007 9:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Postby gwildperson » Thu Dec 27, 2007 1:15 pm

_ikki_ wrote:It crashed 3 times at the same point (59%).


What was the error message? We might be able to help a bit more if you posted FAHlog.txt. There are lots of possible meanings for "crashed" and I don't want to guess. ;)
gwildperson
 
Posts: 450
Joined: Tue Dec 04, 2007 9:36 pm

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Postby _ikki_ » Sun Dec 30, 2007 4:16 pm

It's an error like this:

[10:51:13] Completed 2900000 out of 5000000 steps (58 percent)
[11:00:31] Writing local files
[11:00:31] Completed 2950000 out of 5000000 steps (59 percent)
[11:08:25] Warning: long 1-4 interactions
[11:08:29] CoreStatus = 1 (1)
[11:08:29] Client-core communications error: ERROR 0x1
[11:08:29] Deleting current work unit & continuing...
[11:12:51] - Warning: Could not delete all work unit files (7): Core returned invalid code
[11:12:51] Trying to send all finished work units
[11:12:51] + No unsent completed units remaining.
[11:12:51] - Preparing to get new work unit...
[11:12:51] + Attempting to get work packet
[11:12:51] - Will indicate memory of 2014 MB
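
To avoid reading the log by eye, the last completed percentage and the core status can be pulled out with a short script. This is only a sketch: the regular expressions are inferred from the lines in the excerpt above, not from any documented log format, and `summarize_log` is a hypothetical name.

```python
import re

# Patterns inferred from the log excerpt in this thread (an assumption,
# not an official description of the FAHlog.txt format).
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps\s+\((\d+) percent\)")
CORE_STATUS = re.compile(r"CoreStatus = (\w+)")

SAMPLE = """\
[10:51:13] Completed 2900000 out of 5000000 steps (58 percent)
[11:00:31] Writing local files
[11:00:31] Completed 2950000 out of 5000000 steps (59 percent)
[11:08:25] Warning: long 1-4 interactions
[11:08:29] CoreStatus = 1 (1)
[11:08:29] Client-core communications error: ERROR 0x1
"""


def summarize_log(text):
    """Return (last completed percent, last CoreStatus code) or None for each."""
    last_percent = None
    core_status = None
    for line in text.splitlines():
        match = PROGRESS.search(line)
        if match:
            last_percent = int(match.group(3))
        match = CORE_STATUS.search(line)
        if match:
            core_status = match.group(1)
    return last_percent, core_status
```

On the excerpt above, `summarize_log(SAMPLE)` gives 59 as the last completed percent and "1" as the core status, which matches the crash point reported at the start of the thread.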
_ikki_
 
Posts: 27
Joined: Wed Dec 05, 2007 9:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Postby ChelseaOilman » Sun Dec 30, 2007 5:29 pm

This WU hasn't been submitted by anyone yet. If it's been issued to other people they may be having the same issues as you.

Look in the work folder to see if there are any wuresults_0x.dat files. If there are, I would try running qfix to get credit for what you did and get this WU entered into the WU database.
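
A quick way to check, sketched in Python: the glob pattern `wuresults_*.dat` is an assumption that generalizes the `wuresults_0x.dat` name above, and `find_result_files` is a hypothetical helper. It only lists files; it does not run qfix.

```python
from pathlib import Path
from typing import List


def find_result_files(work_dir: Path) -> List[Path]:
    """List result files in the client's work folder, sorted by name.

    The 'wuresults_*.dat' pattern is an assumption based on the
    'wuresults_0x.dat' naming mentioned in this thread.
    """
    return sorted(work_dir.glob("wuresults_*.dat"))
```

An empty list means there is nothing for qfix to recover from that work folder.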
ChelseaOilman
 
Posts: 1037
Joined: Sun Dec 02, 2007 4:47 pm
Location: Colorado @ 10,000 feet

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Postby _ikki_ » Mon Dec 31, 2007 6:35 pm

There is no wuresults_0x.dat file in the work folder.
_ikki_
 
Posts: 27
Joined: Wed Dec 05, 2007 9:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Postby _ikki_ » Tue Jan 01, 2008 11:29 am

For debugging, here is the data from one frame (about 10 minutes) before the crash:

http://rapidshare.com/files/80698307/fa ... 1.tgz.html

Happy new year :d
Last edited by _ikki_ on Wed Jan 02, 2008 3:00 pm, edited 1 time in total.
_ikki_
 
Posts: 27
Joined: Wed Dec 05, 2007 9:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Postby VijayPande » Wed Jan 02, 2008 2:22 am

Thanks! We'll take a look once the full team gets back from the Stanford holiday break.
VijayPande
Pande Group Member
 
Posts: 2058
Joined: Fri Nov 30, 2007 7:25 am
Location: Stanford

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Postby _ikki_ » Wed Jan 02, 2008 3:04 pm

Okay, I hope this data will be useful.
_ikki_
 
Posts: 27
Joined: Wed Dec 05, 2007 9:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Postby ChelseaOilman » Wed Jan 02, 2008 7:32 pm

Someone else has successfully finished this WU.

Your WU (P3062 R3 C26 G1) was added to the stats database on 2007-12-31 22:59:05 for 1324 points of credit.

After changing the date on my computer, because the deadline had passed, I was able to run the WU past frame 59 as well. If someone else hadn't already submitted it for credit I would keep going to the end.

[17:53:44] Working on Unit 07 [December 23 17:53:44]
[17:53:44] + Working ...
[17:53:44] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 07 -checkpoint 15 -forceasm -verbose -lifeline 20217 -version 600'

[17:53:44]
[17:53:44] *------------------------------*
[17:53:44] Folding@Home Gromacs SMP Core
[17:53:44] Version 1.74 (November 27, 2006)
[17:53:44]
[17:53:44] Preparing to commence simulation
[17:53:44] - Ensuring status. Please wait.
[17:54:01] - Assembly optimizations manually forced on.
[17:54:01] - Not checking prior termination.
[17:54:01] - Expanded 607662 -> 3257309 (decompressed 536.0 percent)
[17:54:01]
[17:54:01] Project: 3062 (Run 3, Clone 26, Gen 1)
[17:54:01]
[17:54:01] Assembly optimizations on if available.
[17:54:01] Entering M.D.
[17:54:07] Calling FAH init
[17:54:07] mbda5_99sb
[17:54:07] Writing local files
[17:54:07] Completed 2900000 out of 5000000 steps (58 percent)
[17:54:07] Extra SSE boost OK.
[17:54:07]
[17:54:07] Completed 2900000 out of 5000000 steps (58 percent)
[17:54:07] Extra SSE boost OK.
[18:02:07] Writing local files
[18:02:08] Completed 2950000 out of 5000000 steps (59 percent)
[18:10:16] Writing local files
[18:10:16] Completed 3000000 out of 5000000 steps (60 percent)
[18:18:21] Writing local files
[18:18:21] Completed 3050000 out of 5000000 steps (61 percent)
[18:26:20] Writing local files
[18:26:20] Completed 3100000 out of 5000000 steps (62 percent)
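
The timestamps in a log like the one above also give the time per frame. The following is a rough sketch, assuming the "Completed ... steps (N percent)" line format shown in this thread; `frame_times` is a hypothetical name, and repeated lines for the same percent (as at 58 percent after the restart) are skipped.

```python
import re
from datetime import datetime, timedelta

# Line format inferred from the log in this thread (an assumption).
FRAME = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] Completed \d+ out of \d+ steps\s+\((\d+) percent\)"
)


def frame_times(text):
    """Return (percent, seconds since the previous frame line) pairs."""
    results = []
    prev_stamp = None
    prev_percent = None
    for line in text.splitlines():
        match = FRAME.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        percent = int(match.group(2))
        if percent == prev_percent:
            continue  # same frame reported twice, e.g. right after a restart
        if prev_stamp is not None:
            # The log has no date, so wrap around midnight with a modulo.
            delta = (stamp - prev_stamp) % timedelta(days=1)
            results.append((percent, int(delta.total_seconds())))
        prev_stamp, prev_percent = stamp, percent
    return results
```

For the frames from 59 to 61 percent in the log above this works out to a little over 8 minutes per frame. The first interval after a restart is not a full frame, since the core resumes from a checkpoint.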
ChelseaOilman
 
Posts: 1037
Joined: Sun Dec 02, 2007 4:47 pm
Location: Colorado @ 10,000 feet

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Postby _ikki_ » Thu Jan 03, 2008 2:13 pm

Someone else has successfully finished this WU.


... since the last time you checked this WU?

What does that mean?

Did the donor who finished the WU restart the client before the crash, or did the WU finish without any intervention? (The only one who can answer is the donor himself ;) )

Next time I'll try restarting the client if the deadline hasn't passed, but that means inspecting the log file regularly...

Let me ask a few questions to conclude:
- Should we notify Stanford if a WU crashed but we managed to complete it after restarting the client?
- What should we do if the WU crashed and the deadline has passed? Should we report it?
_ikki_
 
Posts: 27
Joined: Wed Dec 05, 2007 9:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Postby bruce » Thu Jan 03, 2008 2:32 pm

_ikki_ wrote:Next time I'll try restarting the client if the deadline hasn't passed, but that means inspecting the log file regularly...

Let me ask a few questions to conclude:
- Should we notify Stanford if a WU crashed but we managed to complete it after restarting the client?
- What should we do if the WU crashed and the deadline has passed? Should we report it?


At some point I'm sure that Stanford will figure out how to prevent this problem before it happens, but at this point the only thing we can do is continue to gather data that may help them find the problem. I'd recommend that we do report cases where a WU failed without a restart but was able to proceed if it was stopped/restarted. That may be the only way we can help.

The fact that ChelseaOilman was able to resume processing and continue past the point of the original error is important. For debugging purposes, what I'd like to see is a captured WU that will fail in the next frame, not one that can continue.

ChelseaOilman wrote:After changing the date on my computer, because the deadline had passed, I was able to run the WU past frame 59 as well. If someone else hadn't already submitted it for credit I would keep going to the end.

For the purposes of learning more about the protein itself, finishing the project is important. For the purposes of debugging, finding a repeatable error is important.

If you want to help with the debugging, it's probably better to use the advanced setting to ignore local deadlines than it is to change your system clock, but both work.

@ _ikki_
If you disable local deadlines and restart the WU that you published, does it fail for you (indicating a possible hardware issue) or does it continue like it did for ChelseaOilman (indicating that restarting changed something) :?:
bruce
 
Posts: 20119
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Postby _ikki_ » Thu Jan 03, 2008 8:32 pm

bruce wrote:
@ _ikki_
If you disable local deadlines and restart the WU that you published, does it fail for you (indicating a possible hardware issue) or does it continue like it did for ChelseaOilman (indicating that restarting changed something) :?:


How do I disable local deadlines without changing the date? :?:
_ikki_
 
Posts: 27
Joined: Wed Dec 05, 2007 9:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Postby bruce » Thu Jan 03, 2008 9:29 pm

_ikki_ wrote:How do I disable local deadlines without changing the date? :?:


With the console client, restart with -config or with -configonly.
. . .
Change advanced options (yes/no) [no]? y
. . .
Ignore any deadline information (mainly useful if
system clock frequently has errors) (no/yes) [no]? y
. . .
bruce
 
Posts: 20119
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.
