Project: 2684 (Run 2, Clone 5, Gen 10)

Postby DrSpalding » Mon Jul 19, 2010 4:59 pm

Hi, my -bigadv machine just turned in (or tried to turn in) early results on the above project at 44% complete. The core status was 0xC0000005 (STATUS_ACCESS_VIOLATION), followed by a client-core communications error. I suspect that the core terminated without telling the client what was happening. Is there any info on this WU's stability, or should I look into the machine for hardware issues?

Here is the relevant log info:

[05:39:55] - Preparing to get new work unit...
[05:39:55] Cleaning up work directory
[05:39:55] + Attempting to get work packet
[05:39:55] Passkey found
[05:39:55] - Connecting to assignment server
[05:39:55] - Successful: assigned to (171.67.108.22).
[05:39:55] + News From Folding@Home: Welcome to Folding@Home
[05:39:56] Loaded queue successfully.
[05:40:54] + Closed connections
[05:40:54]
[05:40:54] + Processing work unit
[05:40:54] Core required: FahCore_a3.exe
[05:40:54] Core found.
[05:40:54] Working on queue slot 06 [July 18 05:40:54 UTC]
[05:40:54] + Working ...
[05:40:54]
[05:40:54] *------------------------------*
[05:40:54] Folding@Home Gromacs SMP Core
[05:40:54] Version 2.22 (Mar 12, 2010)
[05:40:54]
[05:40:54] Preparing to commence simulation
[05:40:54] - Looking at optimizations...
[05:40:54] - Created dyn
[05:40:54] - Files status OK
[05:40:58] - Expanded 24821153 -> 30791309 (decompressed 124.0 percent)
[05:40:58] Called DecompressByteArray: compressed_data_size=24821153 data_size=30791309, decompressed_data_size=30791309 diff=0
[05:40:59] - Digital signature verified
[05:40:59]
[05:40:59] Project: 2684 (Run 2, Clone 5, Gen 10)
[05:40:59]
[05:40:59] Assembly optimizations on if available.
[05:40:59] Entering M.D.
[05:41:09] Completed 0 out of 250000 steps  (0%)
[06:30:30] Completed 2500 out of 250000 steps  (1%)
[07:15:45] Completed 5000 out of 250000 steps  (2%)
...
[14:30:35] Completed 107500 out of 250000 steps  (43%)
[15:15:24] Completed 110000 out of 250000 steps  (44%)
[15:19:19] Gromacs cannot continue further.
[15:19:19] Going to send back what have done -- stepsTotalG=250000
[15:19:19] Work fraction=-1.#IND steps=250000.
[15:19:49] logfile size=97434 infoLength=97434 edr=0 trr=23
[15:19:49] logfile size: 97434 info=97434 bed=0 hdr=23
[15:19:49] - Writing 97970 bytes of core data to disk...
[15:19:52] CoreStatus = C0000005 (-1073741819)
[15:19:52] Client-core communications error: ERROR 0xc0000005
[15:19:52] Deleting current work unit & continuing...
[15:20:34] - Preparing to get new work unit...
[15:20:34] Cleaning up work directory
[15:20:34] + Attempting to get work packet
[15:20:34] Passkey found
[15:20:34] - Connecting to assignment server
[15:20:34] - Successful: assigned to (171.67.108.22).
[15:20:34] + News From Folding@Home: Welcome to Folding@Home
[15:20:35] Loaded queue successfully.
[15:21:05] + Closed connections
[15:21:10]
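The log reports the same status twice: `CoreStatus = C0000005 (-1073741819)` is one 32-bit value printed first in hex and then as a signed integer. As a small illustration (in Python, chosen arbitrarily since the thread has no code of its own), the sketch below reinterprets the value as signed and splits it into the standard NTSTATUS bit fields. `to_signed32` and `decode_ntstatus` are hypothetical helpers, not part of the F@H client or core.

```python
def to_signed32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def decode_ntstatus(status: int) -> dict:
    """Split a 32-bit NTSTATUS into its bit fields."""
    status &= 0xFFFFFFFF
    return {
        "severity": status >> 30,            # bits 30-31: 0=success, 1=info, 2=warning, 3=error
        "customer": (status >> 29) & 0x1,    # bit 29: set for customer (non-Microsoft) codes
        "facility": (status >> 16) & 0xFFF,  # bits 16-27
        "code": status & 0xFFFF,             # bits 0-15
    }


status = 0xC0000005  # STATUS_ACCESS_VIOLATION, as reported by the core
print(to_signed32(status))      # the decimal value shown in the log
print(decode_ntstatus(status))  # error-class severity, facility 0, code 5
```

The decode only says this is an error-class status raised by the process itself; it does not identify the root cause.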
Not a real doctor, I just play one on the 'net!
DrSpalding
 
Posts: 106
Joined: Wed May 27, 2009 5:48 pm

Re: Project: 2684 (Run 2, Clone 5, Gen 10)

Postby toTOW » Mon Jul 19, 2010 5:11 pm

No data for this WU in the DB yet ...
Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.

FAH-Addict : latest news, tests and reviews about Folding@Home project.

toTOW
Super Moderator
 
Posts: 9387
Joined: Sun Dec 02, 2007 11:38 am
Location: Bordeaux, France

Re: Project: 2684 (Run 2, Clone 5, Gen 10)

Postby DrSpalding » Mon Jul 19, 2010 5:41 pm

The assignment server gave the same WU back to the machine, so the results must not have gotten uploaded. We'll see in another 33 hours if it does the same thing again and if so, I'll have to get the machine to move on to another WU manually.
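The "another 33 hours" figure can be checked against the log above. A rough sketch, assuming the progress lines look exactly like the ones quoted (timestamps without dates, one "Completed" line per 1% frame): it measures the time between adjacent frames, skips the first frame (which includes setup), wraps across midnight with a modulo, and projects the time needed to reach the step where the core failed. The function names are hypothetical.

```python
import re
from statistics import mean

PROGRESS = re.compile(r"\[(\d\d):(\d\d):(\d\d)\] Completed (\d+) out of (\d+) steps")

# Progress lines copied from the log above; the "..." gap is kept on purpose.
LOG = """\
[05:41:09] Completed 0 out of 250000 steps  (0%)
[06:30:30] Completed 2500 out of 250000 steps  (1%)
[07:15:45] Completed 5000 out of 250000 steps  (2%)
...
[14:30:35] Completed 107500 out of 250000 steps  (43%)
[15:15:24] Completed 110000 out of 250000 steps  (44%)
"""


def parse_progress(text):
    """Return (seconds_since_midnight, steps_done, steps_total) per progress line."""
    points = []
    for line in text.splitlines():
        m = PROGRESS.search(line)
        if m:
            h, mi, s, done, total = map(int, m.groups())
            points.append((h * 3600 + mi * 60 + s, done, total))
    return points


def steady_frame_times(points):
    """Seconds per frame for adjacent lines exactly one frame apart, skipping frame 0."""
    times = []
    for (t0, d0, total), (t1, d1, _) in zip(points, points[1:]):
        frame = total // 100
        if d0 > 0 and d1 - d0 == frame:
            times.append((t1 - t0) % 86400)  # no dates in the log: wrap at midnight
    return times


points = parse_progress(LOG)
frame_steps = points[0][2] // 100                 # 1% of this WU = 2500 steps
tpf = mean(steady_frame_times(points))            # average seconds per frame
frames_to_failure = points[-1][1] // frame_steps  # frames completed before the crash
hours = frames_to_failure * tpf / 3600
print(f"TPF {tpf / 60:.1f} min, about {hours:.1f} h to reach step {points[-1][1]}")
```

The midnight wrap only works because a single frame takes far less than a day; the jump across the "..." gap spans more than 24 hours, which is why the one-frame-apart check discards it.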

Re: Project: 2684 (Run 2, Clone 5, Gen 10)

Postby bruce » Mon Jul 19, 2010 7:54 pm

Please do a thorough memory test on your system at the actual temperatures your system sees when folding. There is no single cause for 0xC0000005, but the most common one is memory errors (including memory timings that are just a bit too tight for your memory, as well as chipset errors that result in memory errors).
bruce
Site Admin
 
Posts: 8996
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: Project: 2684 (Run 2, Clone 5, Gen 10)

Postby Grandpa_01 » Mon Jul 19, 2010 8:02 pm

bruce wrote:Please do a thorough memory test on your system at the actual temperatures that your system sees when folding. There is no single cause for 0xC0000005 but the most common one is memory errors (including memory timing settings that are just a bit too tight for your memory as well as chipset errors that result in memory errors).


Good advice, bruce, especially this part:
Please do a thorough memory test on your system at the actual temperatures that your system sees when folding.
That is very hard to do.

Do you know of a memory test that will use 100% of the CPU and create the heat that folding does? Or a memory test that can be run while folding?
Chart of Frame Times on -bigadv a3 Units viewtopic.php?f=55&t=14757#p145568

Grandpa_01
 
Posts: 805
Joined: Wed Mar 04, 2009 8:36 am

Re: Project: 2684 (Run 2, Clone 5, Gen 10)

Postby PantherX » Mon Jul 19, 2010 8:06 pm

Grandpa_01 wrote:...Do you know of a memory test that will use 100% of the CPU and create the heat that folding does...

I use IntelBurnTest configured for Maximum RAM, and it generates more heat than F@H does. It stresses the RAM and CPU at the same time, so I really like it. After that, I run StressCPU to make sure the system is not just stable but also scientifically stable (producing correct results), as F@H requires.
PantherX
 
Posts: 1397
Joined: Wed Dec 23, 2009 10:33 am
Location: Jeddah, Kingdom of Saudi Arabia

Re: Project: 2684 (Run 2, Clone 5, Gen 10)

Postby DrSpalding » Mon Jul 19, 2010 8:15 pm

What do you consider a suitable memory test? Prime95 with a 10GB data set while running the GPU client on the GTX275 as well? The problem with a standalone memory test (à la Memtest86 or the Win7 memory diagnostic) is that I can't load the machine further with the bandwidth and heat from the GPU at the same time.

The memory is rated 7-7-7-24 @ 1333 MHz, but according to the BIOS it is running slower, at 7-7-7-24 @ ~1146 MHz.
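One way to compare the rated and actual settings is to convert the CAS latency from clock cycles to nanoseconds. A minimal sketch, assuming the BIOS figures (1333 and ~1146) are effective DDR data rates in MT/s, so the real memory clock is half of each: with the same 7-cycle CAS latency, the slower setting gives each access more wall-clock time, so it is looser than the rated spec, not tighter. This covers only the cycle-count timings; it says nothing about voltage or other settings. `cas_latency_ns` is a hypothetical helper.

```python
def cas_latency_ns(cas_cycles: int, data_rate_mts: float) -> float:
    """Convert a CAS latency in clock cycles to nanoseconds.

    DDR memory transfers data twice per clock, so the memory clock in MHz
    is half the effective data rate in MT/s.
    """
    clock_mhz = data_rate_mts / 2
    return cas_cycles / clock_mhz * 1000  # cycles / MHz gives microseconds; x1000 -> ns


rated = cas_latency_ns(7, 1333)   # rated 7-7-7-24 @ 1333
actual = cas_latency_ns(7, 1146)  # BIOS-reported 7-7-7-24 @ ~1146
print(f"rated: {rated:.1f} ns, actual: {actual:.1f} ns")
```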

Re: Project: 2684 (Run 2, Clone 5, Gen 10)

Postby 7im » Mon Jul 19, 2010 8:29 pm

Prime is lame; it's not up to the task, hence your problem with not being able to saturate the system.

You could also use MemtestG80, which is a memory tester for NVIDIA cards. Then combine it with whichever of the tools recommended above you like.

IBT and OCCT (mixed test) are about as good as they get for maxing both CPU and Memory. Throw in the memtestG80 and you're all set.
7im
 
Posts: 7379
Joined: Thu Nov 29, 2007 5:30 pm

Re: Project: 2684 (Run 2, Clone 5, Gen 10)

Postby PantherX » Mon Jul 19, 2010 8:33 pm

@DrSpalding: (Based on my experience) Prime95 produces less stress than F@H, and IntelBurnTest is more stressful than F@H if configured properly. My procedure: first I run IBT at the Maximum setting for 10 iterations. If that passes, I set IBT to 6 threads with 3 GB of RAM, run FurMark at maximum settings, and run Hyper PI 0.99 Beta at 32M on the last free thread. That stresses both my CPU and GPU. I watch HWMonitor, and if any temperature goes above 90C, I stop everything, downclock, and repeat until the maximum temperature stays at or below 90C.
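The procedure above is essentially a loop: stress, read the peak temperature, lower the clock, repeat. A minimal sketch of that loop, assuming a hypothetical `measure_peak_c(clock_mhz)` callback that stands in for the manual part (running IBT, FurMark and Hyper PI at a given clock and reading the maximum from HWMonitor); none of those tools is driven programmatically here. The 90C limit comes from the post; the starting clock, step and floor are placeholders, and `fake_peak` is a toy model, not real data.

```python
from typing import Callable, Optional


def find_safe_clock(
    measure_peak_c: Callable[[int], float],
    start_mhz: int,
    step_mhz: int,
    floor_mhz: int,
    limit_c: float = 90.0,
) -> Optional[int]:
    """Lower the clock in fixed steps until the stress-test peak is <= limit_c.

    Returns the first clock that stays at or under the limit, or None if even
    floor_mhz runs too hot.
    """
    clock = start_mhz
    while clock >= floor_mhz:
        peak = measure_peak_c(clock)  # one full stress run at this clock
        if peak <= limit_c:
            return clock
        clock -= step_mhz  # too hot: downclock and repeat
    return None


def fake_peak(clock_mhz: int) -> float:
    """Toy stand-in for a real measurement: hotter at higher clocks."""
    return 30 + clock_mhz // 60


print(find_safe_clock(fake_peak, start_mhz=4000, step_mhz=100, floor_mhz=3000))
```

A clock that passes this temperature check still has to pass the stress tests without errors; heat is only one of the failure modes.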

Re: Project: 2684 (Run 2, Clone 5, Gen 10)

Postby DrSpalding » Mon Jul 19, 2010 8:43 pm

PantherX wrote:@DrSpalding: (Based on my experience) Prime95 produces less stress than F@H, and IntelBurnTest is more stressful than F@H if configured properly. My procedure: first I run IBT at the Maximum setting for 10 iterations. If that passes, I set IBT to 6 threads with 3 GB of RAM, run FurMark at maximum settings, and run Hyper PI 0.99 Beta at 32M on the last free thread. That stresses both my CPU and GPU. I watch HWMonitor, and if any temperature goes above 90C, I stop everything, downclock, and repeat until the maximum temperature stays at or below 90C.

1. Where do you get IBT and/or OCCT? I found downloads for both (IBT v2.3 and OCCT v3) on guru3d.com, but I don't know which versions are the most up to date.
2. I have noted that Prime95 gets the cpu cores a couple of degrees hotter than F@H seems to. It seems to hold that high temperature more stably than F@H does too, FWIW.
3. Is running a GPU client sufficient to test the GPU + CPU at the same time when running IBT or OCCT?

Thanks,
Dan

Re: Project: 2684 (Run 2, Clone 5, Gen 10)

Postby PantherX » Mon Jul 19, 2010 8:49 pm

1) Check the tools list; in most cases it contains links to the latest software (FYI, IBT 2.4 is the latest).
2) YMMV, but on my system Prime95 took longer to reach F@H temperatures and never exceeded them. IBT, on the other hand, overshot the F@H temps in under 5 minutes.
3) IBT only stresses the CPU (and RAM), not the GPU. I have heard that OCCT includes Linpack (the same engine used by IBT and LinX) and also has its own GPU stress test, but I haven't used OCCT, so I can't be specific.

Re: Project: 2684 (Run 2, Clone 5, Gen 10)

Postby bruce » Mon Jul 19, 2010 9:14 pm

DrSpalding wrote:Is running a GPU client sufficient to test the GPU + CPU at the same time when running IBT or OCCT?


Probably not.

First, running a GPU client means different things to the CPU, depending on whether you have ATi, Fermi, or a G80.

Second (as noted earlier), it's almost impossible to test everything simultaneously. For simplicity's sake I'm going to divide a system into RAM, GPU, and CPU, and further divide the CPU into ALU and FPU. A single test can maximize the use of any one of them, but not all of them simultaneously. Picking a test that deals with each one separately is fairly easy, but finding something that comes close to maximizing all of them simultaneously is next to impossible. FAH will also be limited by whichever one of them it maxes out, and will use the others at somewhat less than maximum, so finding something close to the way your system runs FAH means you'll probably have to run more than one test. That's one reason why you always have to back down from whatever settings seem to be stable.

Prime probably maximizes the use of the ALU but doesn't maximize the FPU or RAM and certainly not the GPU.
Memtest86 probably maximizes RAM but doesn't use much ALU or FPU or GPU.
The GPU client probably comes close to maximizing the use of the GPU but doesn't saturate the ALU and uses virtually no FPU. (Then, too, the various GPU benchmark tests may maximize different aspects of the GPU, but let's not go into that.)
StressCPU2 probably maximizes the FPU similar to FAH's SMP client but may not catch errors in other components.

Integrated tests do a better job of balancing the use of ALU/FPU/RAM, so adding a GPU client or benchmark on top helps find heating issues, but then we can debate the relative priority of the two tasks.

No matter what tests you run, you'll probably need more than one and you'll still need to add additional margin.
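The argument above can be phrased as a coverage problem: each tool "probably maximizes" one subsystem, so covering RAM, GPU, ALU and FPU takes several tools. A rough sketch using the qualitative list above as data; the extra "IBT" entry (FPU and RAM) is an assumption loosely based on the earlier description of IBT/OCCT as maxing both CPU and memory, not a measurement. The search is a brute-force smallest set cover, which is fine for a handful of tools.

```python
from itertools import combinations
from typing import Dict, Optional, Set, Tuple

COMPONENTS = {"RAM", "GPU", "ALU", "FPU"}

# Which subsystem each tool "probably maximizes", per the list above.
# The IBT entry is an assumption (see lead-in), not part of that list.
MAXIMIZES: Dict[str, Set[str]] = {
    "Prime95": {"ALU"},
    "Memtest86": {"RAM"},
    "GPU client": {"GPU"},
    "StressCPU2": {"FPU"},
    "IBT": {"FPU", "RAM"},
}


def smallest_cover(tools: Dict[str, Set[str]]) -> Optional[Tuple[str, ...]]:
    """Brute-force the fewest tools whose maximized subsystems cover COMPONENTS."""
    names = sorted(tools)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            covered = set().union(*(tools[n] for n in combo))
            if covered >= COMPONENTS:
                return combo
    return None


print(smallest_cover(MAXIMIZES))
```

Even a covering set only maximizes each subsystem in its own run (or, when run together, each tool gets less than the whole machine), which is why backing off from settings that merely appear stable is still needed.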

Re: Project: 2684 (Run 2, Clone 5, Gen 10)

Postby DrSpalding » Tue Jul 20, 2010 7:06 pm

I ran IBT several times standalone, including a set of 10 iterations with 10+ GB of memory allocated, and it worked flawlessly, finishing in about 3 hours. However, on the next set I also ran the GPU client, with its priority set to normal to make sure it would actually run. Within about 11 minutes, the machine bug-checked on me with the nebulous MACHINE_CHECK_EXCEPTION (0x9C). That one is a catch-all for various MCA exceptions from the CPU, and without a debugger attached to the machine it is hard to learn anything more from it. I suspect memory or bus issues, since the nVidia GTX275 running an F@H client really only intersects the rest of the machine at the bus and memory. If anyone has an idea what I should tweak first (CPU vCore, memory timings, etc.), please feel free to drop me a message.

For now, I am running w/o the GPU client until I get it sorted out.

Re: Project: 2684 (Run 2, Clone 5, Gen 10)

Postby PantherX » Tue Jul 20, 2010 7:40 pm

@DrSpalding: Have you overclocked the system? If yes, return everything to stock and see whether the error still occurs.
Have you changed any settings on the motherboard? If yes, return everything to stock and then try again.
Have you tried moving the GPU to a different PCI-E slot and repeating the test? If so, was the error the same?
Can you run MemtestG80 on the GPU without any problems? (more details in my guide; link in sig)
Can your PSU reliably provide enough power to both the CPU and GPU when both are at 100% load?
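The PSU question can be answered with a quick back-of-the-envelope budget. A minimal sketch in which every number is a placeholder, not a measurement from this thread: substitute the TDP or board power of your own CPU and GPU and your PSU's rating. The 80% derating is a common rule of thumb for sustained load, not a specification, and in practice the 12 V rail rating matters more than the total, since the CPU and GPU draw mostly from 12 V. Both helpers are hypothetical.

```python
from typing import Dict, Tuple


def psu_budget(psu_watts: int, loads_watts: Dict[str, int], derate_pct: int = 80) -> Tuple[int, int, bool]:
    """Return (total_load, usable_capacity, fits) for a sustained full load."""
    total = sum(loads_watts.values())
    usable = psu_watts * derate_pct // 100  # rule-of-thumb headroom, an assumption
    return total, usable, total <= usable


def amps_at_12v(watts: float) -> float:
    """Current the 12 V rail(s) must supply for a given load."""
    return watts / 12


# Placeholder figures only; replace with your own parts' numbers.
LOADS = {"cpu": 130, "gpu": 220, "rest_of_system": 80}

print(psu_budget(650, LOADS))
print(f"{amps_at_12v(LOADS['cpu'] + LOADS['gpu']):.1f} A on 12 V for CPU + GPU")
```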

Re: Project: 2684 (Run 2, Clone 5, Gen 10)

Postby B2K24 » Tue Jul 20, 2010 8:07 pm

I got errors with -bigadv when I manually set the BIOS timings to 7-7-7-20, as the sticker on my Corsair Dominator C7s reads. When I set all timings to AUTO, the BIOS gives them 9-9-9-24, and I have had no stability issues running with auto timings.
B2K24
 
Posts: 52
Joined: Wed May 19, 2010 4:44 pm
