
Project: 2652 (Run 0 Clone 236 Gen 23) Again

Posted: Sun Dec 30, 2007 9:31 am
by ArVee
2652(0,236,23) is a bad WU. It EUE's ("Gromacs cannot continue further") at the identical spot just after 13 frames complete, even after changing to a sharply reduced OC for the third attempt. Further, qfix doesn't seem to serve the purpose of getting the partial results u/l'd so that the unit can be identified and removed and the next person doesn't have to sit through three more instances of the same thing. It SAID it had done some fixing, but if it did, nothing reflected in the points, so I don't think anything made it in to Stanford.

There seems to be a lot of this with 2652. I know it taxes the system, but identical failure points say to me at least that it's another bad WU. If correct, why so many on 2652?

Re: 2652 Again

Posted: Sun Dec 30, 2007 5:05 pm
by ChelseaOilman
It does seem to be a bad WU. Multiple people have received partial credit for their effort.

You're among them:

Hi ArVee (team 328),
Your WU (P2652 R0 C236 G23) was added to the stats database on 2007-12-30 00:57:55 for 668.2 points of credit.

Re: 2652 Again

Posted: Sun Dec 30, 2007 5:14 pm
by ArVee
Thank you for checking that, I just noticed the points showing up late at EOC before I checked back here. It's good they made it in, so at least there's a record of the problem WU.

Re: Project: 2652 (Run 0 Clone 236 Gen 23) Again

Posted: Mon Dec 31, 2007 4:41 pm
by Qeldroma
I'd like to confirm this. It's happening to our team too: the WUs fail three times at the same point, but where it fails is different for each of us. Team Link

Thanks for the Qfix info-

Re: Project: 2652 (Run 0 Clone 236 Gen 23) Again

Posted: Tue Jan 01, 2008 10:15 pm
by al2
Well, I've just had Project: 2652 *but* Run 0, Clone 430, Gen 44, and my system is all stock and likely very stable with respect to the Win SMP client, since I've never had any issues like this before (that I can remember) since I started folding last summer. (I occasionally get hanging clients associated with the net connection, I think, but this isn't a problem with regular monitoring.)

Here's my FAHlog, should it be of any use:
[14:08:08] Completed 550000 out of 1000000 steps (55 percent)
[14:23:49] Writing local files
[14:23:49] Completed 560000 out of 1000000 steps (56 percent)
[14:39:29] Writing local files
[14:39:29] Completed 570000 out of 1000000 steps (57 percent)
[14:51:10] Warning: long 1-4 interactions
[14:51:10] Gromacs cannot continue further.
[14:51:10] Going to send back what have done.
[14:51:10] logfile size: 353037
[14:51:10] - Writing 353573 bytes of core data to disk...
[14:51:11] ... Done.
[14:51:11] - Failed to delete work/wudata_06.arc
[14:51:11] No C.P. to delete.
[14:51:11] - Failed to delete work/wudata_06.dyn
[14:51:11] - Failed to delete work/wudata_06.chk
[14:51:11] - Failed to delete work/wudata_06.sas
[14:51:11] - Failed to delete work/wudata_06.goe
[14:51:11] - Failed to delete work/wudata_06.xvg
[14:51:11] Warning: check for stray files
[14:51:11]
[14:51:11] Folding@home Core Shutdown: EARLY_UNIT_END
[14:51:11]
[14:51:11] Folding@home Core Shutdown: EARLY_UNIT_END
[14:51:17] CoreStatus = 7B (123)
[14:51:17] Client-core communications error: ERROR 0x7b
[14:51:17] Deleting current work unit & continuing...
[14:53:21] - Preparing to get new work unit...
[14:53:21] + Attempting to get work packet
[14:53:21] - Connecting to assignment server
[14:53:22] - Successful: assigned to (171.64.65.64).
[14:53:22] + News From Folding@Home: Welcome to Folding@Home
[14:53:22] Loaded queue successfully.
[14:53:27] + Closed connections
[14:53:32]
[14:53:32] + Processing work unit
[14:53:32] Core required: FahCore_a1.exe
[14:53:32] Core found.
[14:53:32] Working on Unit 07 [January 1 14:53:32]
[14:53:32] + Working ...
[14:53:33]
[14:53:33] *------------------------------*
[14:53:33] Folding@Home Gromacs SMP Core
[14:53:33] Version 1.74 (March 10, 2007)
[14:53:33]
[14:53:33] Preparing to commence simulation
[14:53:33] - Ensuring status. Please wait.
[14:53:33] Created dyn
[14:53:33] - Files status OK
[14:53:33] this execution.
[14:53:33] - Files status OK
[14:53:34] mpressed 507.5 percent)
[14:53:34] - Starting from initial work packet
[14:53:34]
[14:53:34] Project: 2652 (Run 0, Clone 430, Gen 44)
[14:53:34]
[14:53:34] : 2652 (Run 0, Clone 430, Gen 44)
[14:53:34]
[14:53:34] ble.
[14:53:34] Entering M.D.
[14:53:51] al work pa- Starting from initial work packet
[14:53:51]
[14:53:51] Project: 2652 (Run 0, Clone 430, Gen 44)
[14:53:51]
[14:53:51] Entering M.D.
[14:53:58] rotein
[14:53:58] Writing local files
[14:53:58] cal files
[14:53:58] boost OK.
[14:53:58] Writing local files
[14:53:58] Completed 0 out of 1000000 steps (0 percent)
[15:09:39] Writing local files
[15:09:39] Completed 10000 out of 1000000 steps (1 percent)
[15:28:27] Writing local files
[15:28:27] Completed 20000 out of 1000000 steps (2 percent)
[15:49:30] Writing local files
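
For anyone comparing failure points across machines, here is a minimal Python sketch that pulls the last completed step, the preceding warning, and the CoreStatus code out of a log like the one above. The function name `find_early_unit_ends` and the record fields are made up for illustration; the regular expressions assume only the line formats visible in this excerpt, and garbled SMP lines are simply ignored.

```python
import re

# Patterns based only on the line formats visible in the log excerpt above.
TIMESTAMP_RE = re.compile(r"^\[\d{2}:\d{2}:\d{2}\]\s*")
STEP_RE = re.compile(r"Completed (\d+) out of (\d+) steps")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def find_early_unit_ends(log_text):
    """Return one record per 'Gromacs cannot continue further' event.

    Hypothetical helper: each record holds the last completed step seen
    before the failure, the total step count, the most recent warning
    before the failure, and the CoreStatus code reported afterwards.
    """
    events = []
    last_step = total_steps = last_warning = pending = None
    for line in log_text.splitlines():
        msg = TIMESTAMP_RE.sub("", line).strip()

        step = STEP_RE.search(msg)
        if step:
            last_step, total_steps = int(step.group(1)), int(step.group(2))
            continue

        if msg.startswith("Warning:"):
            last_warning = msg

        if "Gromacs cannot continue further" in msg:
            pending = {
                "last_step": last_step,
                "total_steps": total_steps,
                "warning": last_warning,
                "core_status": None,
            }
            events.append(pending)
            continue

        status = STATUS_RE.search(msg)
        if status and pending is not None:
            # The hex code before the parentheses, e.g. 7B == 123 decimal.
            pending["core_status"] = int(status.group(1), 16)
            last_step = total_steps = last_warning = pending = None
    return events


# Lines copied from the log above.
SAMPLE_LOG = """\
[14:39:29] Writing local files
[14:39:29] Completed 570000 out of 1000000 steps (57 percent)
[14:51:10] Warning: long 1-4 interactions
[14:51:10] Gromacs cannot continue further.
[14:51:10] Going to send back what have done.
[14:51:11] Warning: check for stray files
[14:51:11] Folding@home Core Shutdown: EARLY_UNIT_END
[14:51:17] CoreStatus = 7B (123)
"""

events = find_early_unit_ends(SAMPLE_LOG)
for event in events:
    print(event)
```

Running it over several attempts at the same WU would show quickly whether each attempt stopped after the same step.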

Re: Project: 2652 (Run 0 Clone 236 Gen 23) Again

Posted: Tue Jan 01, 2008 10:54 pm
by bruce
Please see this recent post viewtopic.php?f=8&t=571&start=0&st=0&sk=t&sd=a

I only see one log posted. Are all the errors 0x7b?

Can anyone explain why the WU fails at the same point for user A but repeatedly fails at a different point for User B?

Re: Project: 2652 (Run 0 Clone 236 Gen 23) Again

Posted: Wed Jan 02, 2008 6:11 am
by 7im
bruce wrote: Can anyone explain why the WU fails at the same point for user A but repeatedly fails at a different point for User B?
A fails at same point, and B fails at same "other" point?

Re: Project: 2652 (Run 0 Clone 236 Gen 23) Again

Posted: Wed Jan 02, 2008 7:00 pm
by Qeldroma
7im wrote:
bruce wrote:Can anyone explain why the WU fails at the same point for user A but repeatedly fails at a different point for User B?
A fails at same point, and B fails at same "other" point?
On the same step. Some crap out 3 times at step 0; another I know of and I both hung 3 times on step 3; another reported step 5. Each fails consistently 3 times on the same step, but not always at the same step number.

The client finally gave up on it, reloaded the executable, then swimmingly started and finished a 2653. Haven't seen it again in a while, but I'll post a log if it happens again. And yes, Bruce, they're all 7Bs.

Re: Project: 2652 (Run 0 Clone 236 Gen 23) Again

Posted: Wed Jan 02, 2008 8:52 pm
by bruce
Qeldroma wrote: And yes, Bruce, they're all 7Bs.
I'm really disturbed about this whole 0x7b situation. It appears to be a catch-all category with multiple causes and I don't know any good way to isolate them in a way that allows them to be fixed. Some are certainly issues that could be handled by the software; some are not. Virtually none of them are reproducible on different hardware.

The other overlapping issue that bothers me is the situation where many people have an EUE (often 3x or 4x) and then somebody manages to complete the WU successfully, taking all the pressure off of fixing whatever was wrong (and providing the all-too easy answer: It was just due to unstable hardware - - even though that may also be one of the possible causes.)

I wrote a response to this thread where it looked like we had captured a repeatable EUE and then somebody managed to complete it. Unfortunately due to a sql error, the 20 minutes I spent composing that response was wasted when the data was lost somewhere between my machine and the forum. (I'll probably write it again soon, but not now.)

I'd welcome any input on how to debug these issues and divide them into manageable categories. Unfortunately both overlapping issues are too big to be tackled without a realistic plan that involves a list of specific known issues plus a way to divide the issues into categories small enough that a single good programmer can attack them. (I know how difficult this can be. Several of us worked hard for many, many months to isolate an erratum on certain AMD CPUs, and once it was documented and fixed, several of us got nice certificates, and one still hangs on my wall.)
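
As one possible starting point for that kind of sorting, here is a rough Python sketch that groups EUE reports by work unit and by machine, then labels each WU by how consistent the failing step is. Everything in it is an assumption for illustration: the function name `classify_reports`, the three labels, and the sample reports (which are invented, not taken from any stats database). The labels describe the pattern only; they do not prove a cause.

```python
from collections import defaultdict


def classify_reports(reports):
    """Label each WU by how consistent its failing step is.

    reports: iterable of (wu, machine, failed_step) tuples, where wu is
    (project, run, clone, gen). Hypothetical heuristic, not an official
    Folding@home procedure.
    """
    by_wu = defaultdict(lambda: defaultdict(list))
    for wu, machine, failed_step in reports:
        by_wu[wu][machine].append(failed_step)

    labels = {}
    for wu, machines in by_wu.items():
        # Does every machine fail at one single step across its attempts?
        stable_per_machine = all(len(set(steps)) == 1 for steps in machines.values())
        all_steps = {step for steps in machines.values() for step in steps}

        if not stable_per_machine:
            labels[wu] = "step varies between attempts on one machine"
        elif len(all_steps) == 1:
            labels[wu] = "same step on every machine"
        else:
            labels[wu] = "same step per machine, different across machines"
    return labels


# Invented sample data, for illustration only.
sample_reports = [
    ((2652, 0, 1, 1), "A", 13), ((2652, 0, 1, 1), "A", 13), ((2652, 0, 1, 1), "B", 13),
    ((2652, 0, 2, 2), "A", 0), ((2652, 0, 2, 2), "A", 0), ((2652, 0, 2, 2), "B", 3),
    ((2652, 0, 3, 3), "A", 5), ((2652, 0, 3, 3), "A", 9),
]

labels = classify_reports(sample_reports)
for wu, label in sorted(labels.items()):
    print(wu, "->", label)
```

The second label matches the pattern Qeldroma describes above, where each person fails three times at one step but the step differs from person to person.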