List of SMP WUs with the "1 core usage" issue

Moderators: Site Moderators, FAHC Science Team

toTOW
Site Moderator
Posts: 6296
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: List of SMP WUs with the "1 core usage" issue

Post by toTOW »

List edited ... it's beginning to be a long list :?

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
BrokenWolf
Posts: 126
Joined: Sat Aug 02, 2008 3:08 am

Re: List of SMP WUs with the "1 core usage" issue

Post by BrokenWolf »

Got one this morning as well: 2677 R3/C78/G28. Client is 6.24beta; FahCore_a2.exe was 2.07.

Code:

Reading file work/wudata_06.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun 'IBX in water'
7250000 steps,  14500.0 ps (continuing from step 7000000,  14000.0 ps).

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483269. It should have been within [ 0 .. 9464 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483611. It should have been within [ 0 .. 256 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
ChasR
Posts: 402
Joined: Sun Dec 02, 2007 5:36 am
Location: Atlanta, GA

Re: List of SMP WUs with the "1 core usage" issue

Post by ChasR »

Project: 2677 (Run 3, Clone 79, Gen 33)
Project: 2677 (Run 23, Clone 74, Gen 28)
ChasR
Posts: 402
Joined: Sun Dec 02, 2007 5:36 am
Location: Atlanta, GA

Re: List of SMP WUs with the "1 core usage" issue

Post by ChasR »

I'm also seeing a rash of A2 core WUs that fail to proceed after "Entering M.D." on core 2.07, as BrokenWolf posted above. I have yet to find a WU that runs on only one core with core 2.07.
BrokenWolf
Posts: 126
Joined: Sat Aug 02, 2008 3:08 am

Re: List of SMP WUs with the "1 core usage" issue

Post by BrokenWolf »

Got another one: 2677 R5/C21/G30. It does not appear to be on the list yet.

@ChasR> I think the 2.07 core cannot handle the busted WU at all, so it barfs. The 2.08 core appears to handle WU funniness better: it tries to run, but can only process the WU on one core. At least that is my take on it.

BW
ChasR
Posts: 402
Joined: Sun Dec 02, 2007 5:36 am
Location: Atlanta, GA

Re: List of SMP WUs with the "1 core usage" issue

Post by ChasR »

I've been doing some rudimentary checking to see whether the same WUs that run on one core on 2.08 also fail to proceed on 2.07. I haven't found a duplicate yet, but that may not mean much: on every 2.07 machine that had a bad WU, I deleted core 2.07 and got 2.08. I also deleted the core on the problem 2.08 WUs, so most of the logs with the hung 2.07 WUs have been overwritten and are gone. While you are probably correct about the relationship between the failures seen on core 2.07 and 2.08, strictly speaking we don't know that they are related. If you find a WU that hangs on 2.07 and runs on one core on 2.08, then you will have convinced me. I'll continue to look, though my count of core 2.07 machines is way down. If the failures do turn out to be related, the problem WUs have been out there for some time.
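
For anyone who wants to do the same cross-check less manually, here is a rough Python sketch. It assumes WUs are reported in one of the notations already used in this thread ("Project: 2677 (Run 3, Clone 79, Gen 33)", "2677 R3/C78/G28", "P2671R37C79G78"). The names `parse_wu` and `common_wus` are made up for illustration and are not part of any FAH tool.

```python
import re

# Long form, as printed by the client: "Project: 2677 (Run 3, Clone 79, Gen 33)"
LONG_FORM = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s+(\d+),\s*Clone\s+(\d+),\s*Gen\s+(\d+)\)"
)
# Short forms seen in posts: "2677 R3/C78/G28", "p2669 R13/C29/G178", "P2671R37C79G78"
SHORT_FORM = re.compile(r"[pP]?(\d+)\s*R(\d+)/?C(\d+)/?G(\d+)")


def parse_wu(text):
    """Return (project, run, clone, gen) as ints, or None if nothing matches."""
    match = LONG_FORM.search(text) or SHORT_FORM.search(text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def common_wus(reports_a, reports_b):
    """Return the set of WUs that appear in both collections of report strings."""
    set_a = {parse_wu(report) for report in reports_a} - {None}
    set_b = {parse_wu(report) for report in reports_b} - {None}
    return set_a & set_b


# Report strings taken from posts in this thread; the grouping is only an example.
hung_on_207 = ["2677 R3/C78/G28"]
one_core_on_208 = ["Project: 2669 (Run 13, Clone 29, Gen 178)", "P2671R37C79G78"]
print(common_wus(hung_on_207, one_core_on_208))
```

Since every notation is reduced to the same tuple, a WU reported as "P2671R37C79G78" by one person and as "Project: 2671 (Run 37, Clone 79, Gen 78)" by another counts as the same entry.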
ChasR
Posts: 402
Joined: Sun Dec 02, 2007 5:36 am
Location: Atlanta, GA

Re: List of SMP WUs with the "1 core usage" issue

Post by ChasR »

Project: 2669 (Run 0, Clone 32, Gen 188)
parkut
Posts: 364
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Re: List of SMP WUs with the "1 core usage" issue

Post by parkut »

This is the first time for this WU series, and the first time on a quad-core machine. All prior instances (for me) were on Conroe Core 2s.

Project: 2671 (Run 37, Clone 79, Gen 78) 1920.00 pts (17.678 pt/hr)

compressed_data_size=1513330

Code:

quad8.parkut.com
 14:24:01 up 11 days, 11:25,  0 users,  load average: 1.00, 1.00, 1.00
20077 99.6 20077 S ?        01:25:05 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 06 -checkpoint 15 -verbose -lifeline 3086 -version 624
20080  0.3 20080 S ?        00:00:18 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 06 -checkpoint 15 -verbose -lifeline 3086 -version 624
20078  0.0 20078 S ?        00:00:04 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 06 -checkpoint 15 -verbose -lifeline 3086 -version 624
20079  0.0 20079 S ?        00:00:04 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 06 -checkpoint 15 -verbose -lifeline 3086 -version 624
...
model name	: Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
cpu MHz		: 3006.932
cache size	: 4096 KB
Memory: 1.96 GB physical, 1.94 GB virtual
...
Client Version 6.24R3  
Core: FahCore_a2.exe
Core Version 2.08 (Mon May 18 14:47:42 PDT 2009)
Current Work Unit
-----------------
Name: p2671_IBX in water
Tag: P2671R37C79G78
Download time: August 20 16:58:37
Due time: August 23 16:58:37
Progress: 1%  [__________]
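
The process listing above shows the symptom directly: one FahCore_a2.exe process at about 100% CPU and the other three nearly idle, with a load average of 1.00 on a quad. A minimal Python sketch for spotting that pattern automatically might look like the following. It assumes the column layout of the listing above (PID first, %CPU second); the function names and the `busy`/`idle` thresholds are arbitrary choices for illustration.

```python
def parse_cpu_percentages(ps_text, process_name="FahCore_a2.exe"):
    """Return the %CPU value of every line that mentions process_name.

    Assumes the layout of the listing above: PID, %CPU, then other fields.
    """
    percentages = []
    for line in ps_text.splitlines():
        if process_name not in line:
            continue
        fields = line.split()
        try:
            percentages.append(float(fields[1]))
        except (IndexError, ValueError):
            continue  # not a process line in the expected layout
    return percentages


def looks_like_one_core(percentages, busy=50.0, idle=5.0):
    """True when exactly one process is busy and all the others are nearly idle."""
    if len(percentages) < 2:
        return False
    busy_count = sum(1 for p in percentages if p >= busy)
    idle_count = sum(1 for p in percentages if p <= idle)
    return busy_count == 1 and idle_count == len(percentages) - 1


# The four process lines from the listing above, command lines trimmed for brevity.
SAMPLE = """\
20077 99.6 20077 S ?        01:25:05 ./FahCore_a2.exe -dir work/
20080  0.3 20080 S ?        00:00:18 ./FahCore_a2.exe -dir work/
20078  0.0 20078 S ?        00:00:04 ./FahCore_a2.exe -dir work/
20079  0.0 20079 S ?        00:00:04 ./FahCore_a2.exe -dir work/
"""

print(looks_like_one_core(parse_cpu_percentages(SAMPLE)))
```

A healthy SMP WU would show all four processes busy, so the check returns False in that case.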
Oldhat
Posts: 30
Joined: Mon Dec 03, 2007 11:42 am
Location: Auckland

Re: List of SMP WUs with the "1 core usage" issue

Post by Oldhat »

Just got another one. :)

Project: 2669 (Run 7, Clone 51, Gen 110)

Cheers
toTOW
Site Moderator
Posts: 6296
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: List of SMP WUs with the "1 core usage" issue

Post by toTOW »

List updated. Thanks for your reports.
^w^ing
Posts: 136
Joined: Fri Mar 07, 2008 7:29 pm
Hardware configuration: C2D E6400 2.13 GHz @ 3.2 GHz
Asus EN8800GTS 640 (G80) @ 660/792/1700 running the 6.23 w/ core11 v1.19
forceware 260.89
Asus P5N-E SLi
2GB 800MHz DDRII (2xCorsair TwinX 512MB)
WinXP 32 SP3
Location: Prague

Re: List of SMP WUs with the "1 core usage" issue

Post by ^w^ing »

ChasR wrote: ... If you find a WU that hangs on 2.07 and runs on one core on 2.08, then you will have convinced me.
It's true. When one of these WUs broke my client, which was running 2.07, I deleted the core; the client then downloaded 2.08 and restarted the same WU. It did start, but it ran on only one core.
BrokenWolf
Posts: 126
Joined: Sat Aug 02, 2008 3:08 am

Re: List of SMP WUs with the "1 core usage" issue

Post by BrokenWolf »

Got a repeat here: p2669 R13/C29/G178. Are these not being marked as bad so that they do not get sent out again?

Code:

[02:02:56] Connecting to http://171.64.65.56:8080/
[02:03:03] Posted data.
[02:03:03] Initial: 0000; - Receiving payload (expected size: 1508832)
[02:03:04] - Downloaded at ~1473 kB/s
[02:03:04] - Averaged speed for that direction ~1230 kB/s
[02:03:04] + Received work.
[02:03:04] Trying to send all finished work units
[02:03:04] + No unsent completed units remaining.
[02:03:04] + Closed connections
[02:03:04] 
[02:03:04] + Processing work unit
[02:03:04] At least 4 processors must be requested.Core required: FahCore_a2.exe
[02:03:04] Core found.
[02:03:05] Working on queue slot 02 [August 21 02:03:05 UTC]
[02:03:05] + Working ...
[02:03:05] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -priority 96 -checkpoint 10 -verbose -lifeline 28432 -version 624'

[02:03:05] 
[02:03:05] *------------------------------*
[02:03:05] Folding@Home Gromacs SMP Core
[02:03:05] Version 2.08 (Mon May 18 14:47:42 PDT 2009)
[02:03:05] 
[02:03:05] Preparing to commence simulation
[02:03:05] - Ensuring status. Please wait.
[02:03:06] Called DecompressByteArray: compressed_data_size=1508320 data_size=23973757, decompressed_data_size=23973757 diff=0
[02:03:06] - Digital signature verified
[02:03:06] 
[02:03:06] Project: 2669 (Run 13, Clone 29, Gen 178)
[02:03:06] 
[02:03:06] Assembly optimizations on if available.
[02:03:06] Entering M.D.
[02:03:16] un 13, Clone 29, Gen 178)
[02:03:16] 
[02:03:16] Entering M.D.
[02:03:53] Completed 0 out of 250000 steps  (0%)
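
Since reports in this thread are most useful when they include both the core version and the full Project/Run/Clone/Gen, here is a small Python sketch that pulls those two items out of a client log like the one above. It assumes the log lines look like this excerpt; `summarize_log` is a hypothetical helper, not something shipped with the client.

```python
import re

# "[02:03:05] Version 2.08 (Mon May 18 14:47:42 PDT 2009)"
CORE_VERSION = re.compile(r"\]\s*Version\s+(\d+\.\d+)\s+\(")
# "[02:03:06] Project: 2669 (Run 13, Clone 29, Gen 178)"
PROJECT = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s+(\d+),\s*Clone\s+(\d+),\s*Gen\s+(\d+)\)"
)


def summarize_log(log_text):
    """Return the first core version and the first work unit found in log_text."""
    core = CORE_VERSION.search(log_text)
    project = PROJECT.search(log_text)
    return {
        "core_version": core.group(1) if core else None,
        "work_unit": tuple(int(g) for g in project.groups()) if project else None,
    }


# Three lines copied from the log excerpt above.
SAMPLE_LOG = """\
[02:03:05] Folding@Home Gromacs SMP Core
[02:03:05] Version 2.08 (Mon May 18 14:47:42 PDT 2009)
[02:03:06] Project: 2669 (Run 13, Clone 29, Gen 178)
"""

print(summarize_log(SAMPLE_LOG))
```

Only the first complete "Project:" line is used, so a truncated repeat such as "un 13, Clone 29, Gen 178)" does not affect the result.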
BrokenWolf
Posts: 126
Joined: Sat Aug 02, 2008 3:08 am

Re: List of SMP WUs with the "1 core usage" issue

Post by BrokenWolf »

And another one this morning: 2677 R35/C76/G35.

BW
ChasR
Posts: 402
Joined: Sun Dec 02, 2007 5:36 am
Location: Atlanta, GA

Re: List of SMP WUs with the "1 core usage" issue

Post by ChasR »

Project: 2675 (Run 3, Clone 182, Gen 153)
Gary480six
Posts: 91
Joined: Mon Jan 21, 2008 6:42 pm

Re: List of SMP WUs with the "1 core usage" issue

Post by Gary480six »

Project: 2677 (Run 35, Clone 54, Gen 25)

How do I make it go away so I can get different work? I deleted the work folder and queue.dat and got the same project again.