Hardware fault - has us stumped

Post by Nathan_P » Tue Nov 27, 2012 12:17 pm

A teammate is having problems with one of his machines. Here is the log:

Code:
[18:00:03] Completed 142500 out of 250000 steps  (57%)
[18:11:01] Completed 145000 out of 250000 steps  (58%)
[18:12:15] mdrun returned 255
[18:12:15] Going to send back what have done -- stepsTotalG=250000
[18:12:15] Work fraction=175.5183 steps=250000.
[18:12:19] logfile size=124477 infoLength=124477 edr=25 trr=1
[18:12:19] logfile size: 124477 info=124477 bed=25 hdr=1
[18:12:19] - Writing 125015 bytes of core data to disk...
[18:12:19] Done: 124503 -> 14565 (compressed to 11.6 percent)
[18:12:19]   ... Done.
[18:12:22]
[18:12:22] Folding@home Core Shutdown: UNSTABLE_MACHINE
[18:12:23] CoreStatus = 7A (122)
[18:12:23] Sending work to server
[18:12:23] Project: 8101 (Run 5, Clone 1, Gen 102)


[18:12:23] + Attempting to send results [November 26 18:12:23 UTC]
[18:12:23] - Reading file work/wuresults_07.dat from core
[18:12:23]   (Read 15077 bytes from disk)
[18:12:23] Connecting to http://128.143.231.201:8080/
[18:12:23] Posted data.
[18:12:23] Initial: 0000; Conversation time very short, giving reduced weight in bandwidth avg
[18:12:23] - Uploaded at ~31 kB/s
[18:12:23] - Averaged speed for that direction ~295 kB/s
[18:12:23] + Results successfully sent
[18:12:23] Thank you for your contribution to Folding@Home.
[18:12:23] Trying to send all finished work units
[18:12:23] + No unsent completed units remaining.
[18:12:23] - Preparing to get new work unit...
[18:12:23] Cleaning up work directory
[18:12:23] + Attempting to get work packet
[18:12:23] Passkey found
[18:12:23] - Will indicate memory of 64537 MB
[18:12:23] - Connecting to assignment server
[18:12:23] Connecting to http://assign.stanford.edu:8080/
[18:12:25] Posted data.
[18:12:25] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[18:12:25] + News From Folding@Home: Welcome to Folding@Home
[18:12:25] Loaded queue successfully.
[18:12:25] Sent data
[18:12:25] Connecting to http://128.143.231.201:8080/
[18:12:33] Posted data.
[18:12:33] Initial: 0000; - Receiving payload (expected size: 30310898)
[18:14:17] - Downloaded at ~284 kB/s
[18:14:17] - Averaged speed for that direction ~543 kB/s
[18:14:17] + Received work.
[18:14:17] Trying to send all finished work units
[18:14:17] + No unsent completed units remaining.
[18:14:17] + Closed connections
[18:14:22]
[18:14:22] + Processing work unit
[18:14:23] Core required: FahCore_a5.exe
[18:14:23] Core found.
[18:14:23] Working on queue slot 08 [November 26 18:14:23 UTC]
[18:14:23] + Working ...
[18:14:23] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 08 -np 64 -checkpoint 3 -verbose -lifeline 2204 -version 634'

[18:14:23]
[18:14:23] *------------------------------*
[18:14:23] Folding@Home Gromacs SMP Core
[18:14:23] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[18:14:23]
[18:14:23] Preparing to commence simulation
[18:14:23] - Looking at optimizations...
[18:14:23] - Created dyn
[18:14:23] - Files status OK
[18:14:26] - Expanded 30310386 -> 33158020 (decompressed 109.3 percent)
[18:14:26] Called DecompressByteArray: compressed_data_size=30310386 data_size=33158020, decompressed_data_size=33158020 diff=0
[18:14:26] - Digital signature verified
[18:14:26]
[18:14:26] Project: 8101 (Run 25, Clone 7, Gen 64)
[18:14:26]
[18:14:27] Assembly optimizations on if available.
[18:14:27] Entering M.D.
[18:14:33] Mapping NT from 64 to 64
[18:14:39] Completed 0 out of 250000 steps  (0%)
[18:26:21] Completed 2500 out of 250000 steps  (1%)
[18:37:51] Completed 5000 out of 250000 steps  (2%)
[18:42:43] mdrun returned 255
[18:42:43] Going to send back what have done -- stepsTotalG=250000
[18:42:43] Work fraction=2642.4443 steps=250000.
[18:42:47] logfile size=17953 infoLength=17953 edr=25 trr=1
[18:42:47] logfile size: 17953 info=17953 bed=25 hdr=1
[18:42:47] - Writing 18491 bytes of core data to disk...
[18:42:47] Done: 17979 -> 5493 (compressed to 30.5 percent)
[18:42:47]   ... Done.
[18:42:50]
[18:42:50] Folding@home Core Shutdown: UNSTABLE_MACHINE
[18:42:50] CoreStatus = 7A (122)
[18:42:50] Sending work to server
[18:42:50] Project: 8101 (Run 25, Clone 7, Gen 64)



We've suggested the obvious (back to stock clocks, etc.), but he is still having issues. Three things to note:

There was a power spike followed by a power cut just before this started.
A VRM on the motherboard has been damaged by severe overclocking.
The board will complete standard SMP units with a mild overclock; in this case it is a 4P Interlagos rig clocked to 3 GHz:

Code:
[09:16:48] Leaving Run
[09:16:48] - Writing 3810282 bytes of core data to disk...
[09:16:49] Done: 3809770 -> 3532359 (compressed to 92.7 percent)
[09:16:49]   ... Done.
[09:16:50] - Shutting down core
[09:16:50]
[09:16:50] Folding@home Core Shutdown: FINISHED_UNIT
[09:16:50] CoreStatus = 64 (100)
[09:16:50] Unit 1 finished with 99 percent of time to deadline remaining.
[09:16:50] Updated performance fraction: 0.985548
[09:16:50] Sending work to server
[09:16:50] Project: 7165 (Run 0, Clone 82, Gen 313)


[09:16:50] + Attempting to send results [November 27 09:16:50 UTC]
[09:16:50] - Reading file work/wuresults_01.dat from core
[09:16:50]   (Read 3532871 bytes from disk)
[09:16:50] Connecting to http://128.143.199.96:8080/
[09:16:57] Posted data.
[09:16:57] Initial: 0000; - Uploaded at ~493 kB/s
[09:16:57] - Averaged speed for that direction ~493 kB/s
[09:16:57] + Results successfully sent
[09:16:57] Thank you for your contribution to Folding@Home.
[09:16:57] + Number of Units Completed: 81

[09:16:57] Trying to send all finished work units
[09:16:57] + No unsent completed units remaining.
[09:16:57] - Preparing to get new work unit...
[09:16:57] Cleaning up work directory
[09:16:57] + Attempting to get work packet
[09:16:57] Passkey found
[09:16:57] - Will indicate memory of 64537 MB
[09:16:57] - Connecting to assignment server
[09:16:57] Connecting to http://assign.stanford.edu:8080/
[09:16:58] Posted data.
[09:16:58] Initial: 8F80; - Successful: assigned to (128.143.199.96).
[09:16:58] + News From Folding@Home: Welcome to Folding@Home
[09:16:58] Loaded queue successfully.
[09:16:58] Sent data
[09:16:58] Connecting to http://128.143.199.96:8080/
[09:17:00] Posted data.
[09:17:00] Initial: 0000; - Receiving payload (expected size: 1796001)
[09:17:05] - Downloaded at ~350 kB/s
[09:17:05] - Averaged speed for that direction ~394 kB/s
[09:17:05] + Received work.
[09:17:05] Trying to send all finished work units
[09:17:05] + No unsent completed units remaining.
[09:17:05] + Closed connections
[09:17:05]
[09:17:05] + Processing work unit
[09:17:05] Core required: FahCore_a3.exe
[09:17:05] Core found.
[09:17:05] Working on queue slot 02 [November 27 09:17:05 UTC]
[09:17:05] + Working ...
[09:17:05] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 02 -np 64 -checkpoint 3 -verbose -lifeline 3250 -version 634'

[09:17:05]
[09:17:05] *------------------------------*
[09:17:05] Folding@Home Gromacs SMP Core
[09:17:05] Version 2.27 (Dec. 15, 2010)
[09:17:05]
[09:17:05] Preparing to commence simulation
[09:17:05] - Looking at optimizations...
[09:17:05] - Created dyn
[09:17:05] - Files status OK
[09:17:05] - Expanded 1795489 -> 2108980 (decompressed 117.4 percent)
[09:17:05] Called DecompressByteArray: compressed_data_size=1795489 data_size=2108980, decompressed_data_size=2108980 diff=0
[09:17:05] - Digital signature verified
[09:17:05]
[09:17:05] Project: 7165 (Run 0, Clone 82, Gen 314)
[09:17:05]
[09:17:05] Assembly optimizations on if available.
[09:17:05] Entering M.D.
[09:17:11] Mapping NT from 64 to 64
[09:17:12] Completed 0 out of 500000 steps  (0%)

[11:08:23] Completed 495000 out of 500000 steps  (99%)
[11:09:11] Completed 500000 out of 500000 steps  (100%)
[11:09:11] DynamicWrapper: Finished Work Unit: sleep=10000
[11:09:21]
[11:09:21] Finished Work Unit:
[11:09:21] - Reading up to 3715440 from "work/wudata_02.trr": Read 3715440
[11:09:21] trr file hash check passed.
[11:09:21] edr file hash check passed.
[11:09:21] logfile size: 66760
[11:09:21] Leaving Run
[11:09:26] - Writing 3817760 bytes of core data to disk...
[11:09:27] Done: 3817248 -> 3533109 (compressed to 92.5 percent)
[11:09:27]   ... Done.
[11:09:27] - Shutting down core
[11:09:27]
[11:09:27] Folding@home Core Shutdown: FINISHED_UNIT
[11:09:27] CoreStatus = 64 (100)
[11:09:27] Unit 2 finished with 98 percent of time to deadline remaining.
[11:09:27] Updated performance fraction: 0.983907
[11:09:27] Sending work to server
[11:09:27] Project: 7165 (Run 0, Clone 82, Gen 314)


[11:09:27] + Attempting to send results [November 27 11:09:27 UTC]
[11:09:27] - Reading file work/wuresults_02.dat from core
[11:09:27]   (Read 3533621 bytes from disk)
[11:09:27] Connecting to http://128.143.199.96:8080/
[11:09:35] Posted data.
[11:09:35] Initial: 0000; - Uploaded at ~431 kB/s
[11:09:35] - Averaged speed for that direction ~462 kB/s
[11:09:35] + Results successfully sent
[11:09:35] Thank you for your contribution to Folding@Home.
[11:09:35] + Number of Units Completed: 82

[11:09:35] Trying to send all finished work units
[11:09:35] + No unsent completed units remaining.
[11:09:35] - Preparing to get new work unit...
[11:09:35] Cleaning up work directory
[11:09:35] + Attempting to get work packet
[11:09:35] Passkey found
[11:09:35] - Will indicate memory of 64428 MB
[11:09:35] - Connecting to assignment server
[11:09:35] Connecting to http://assign.stanford.edu:8080/
[11:09:36] Posted data.
[11:09:36] Initial: 8F80; - Successful: assigned to (128.143.199.96).
[11:09:36] + News From Folding@Home: Welcome to Folding@Home
[11:09:36] Loaded queue successfully.
[11:09:36] Sent data
[11:09:36] Connecting to http://128.143.199.96:8080/
[11:09:38] Posted data.
[11:09:38] Initial: 0000; - Receiving payload (expected size: 1796165)
[11:09:42] - Downloaded at ~438 kB/s
[11:09:42] - Averaged speed for that direction ~409 kB/s
[11:09:42] + Received work.
[11:09:42] Trying to send all finished work units
[11:09:42] + No unsent completed units remaining.
[11:09:42] + Closed connections
[11:09:42]
[11:09:42] + Processing work unit
[11:09:42] Core required: FahCore_a3.exe
[11:09:42] Core found.
[11:09:42] Working on queue slot 03 [November 27 11:09:42 UTC]
[11:09:42] + Working ...
[11:09:42] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 03 -np 64 -checkpoint 3 -verbose -lifeline 1982 -version 634'

[11:09:43]
[11:09:43] *------------------------------*
[11:09:43] Folding@Home Gromacs SMP Core
[11:09:43] Version 2.27 (Dec. 15, 2010)
[11:09:43]
[11:09:43] Preparing to commence simulation
[11:09:43] - Looking at optimizations...
[11:09:43] - Created dyn
[11:09:43] - Files status OK
[11:09:43] - Expanded 1795653 -> 2108980 (decompressed 117.4 percent)
[11:09:43] Called DecompressByteArray: compressed_data_size=1795653 data_size=2108980, decompressed_data_size=2108980 diff=0
[11:09:43] - Digital signature verified
[11:09:43]
[11:09:43] Project: 7165 (Run 0, Clone 82, Gen 315)
[11:09:43]
[11:09:43] Assembly optimizations on if available.
[11:09:43] Entering M.D.
[11:09:49] Mapping NT from 64 to 64
[11:09:50] Completed 0 out of 500000 steps  (0%)
[11:10:53] Completed 5000 out of 500000 steps  (1%)


The error codes are not being very helpful: 7A (122) usually occurs on GPUs, and I can't find any mention of 114 anywhere on the wiki. Any ideas?
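For anyone comparing the two logs, the number in brackets after CoreStatus is simply the hex code written in decimal (0x7A = 122, 0x64 = 100). Below is a minimal sketch for pulling the shutdown reason and status code out of a client log, assuming the line format seen in the logs above. The function name summarize_shutdowns is made up for illustration and is not part of the Folding@home client.

```python
import re

# Matches lines such as "[18:12:22] Folding@home Core Shutdown: UNSTABLE_MACHINE".
SHUTDOWN_RE = re.compile(r"Core Shutdown: (\w+)")
# Matches lines such as "[18:12:23] CoreStatus = 7A (122)".
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def summarize_shutdowns(log_text):
    """Return (shutdown_name, hex_code, decimal_code) for each CoreStatus line.

    Hypothetical helper: it pairs each CoreStatus line with the most recent
    "Core Shutdown" line seen before it, or None if there was none.
    """
    results = []
    last_name = None
    for line in log_text.splitlines():
        shutdown = SHUTDOWN_RE.search(line)
        if shutdown:
            last_name = shutdown.group(1)
            continue
        status = STATUS_RE.search(line)
        if status:
            results.append((last_name, status.group(1), int(status.group(2))))
            last_name = None
    return results


# Sample lines copied from the logs in this post.
sample = """\
[18:12:22] Folding@home Core Shutdown: UNSTABLE_MACHINE
[18:12:23] CoreStatus = 7A (122)
[09:16:50] Folding@home Core Shutdown: FINISHED_UNIT
[09:16:50] CoreStatus = 64 (100)
"""

for name, hex_code, dec_code in summarize_shutdowns(sample):
    # int(hex_code, 16) converts the hex string to its decimal value.
    print(name, hex_code, dec_code, int(hex_code, 16) == dec_code)
```

Running this over a full FAHlog would give a quick list of which units ended in UNSTABLE_MACHINE and which finished cleanly.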
Nathan_P
 
Posts: 1442
Joined: Wed Apr 01, 2009 9:22 pm
Location: Jersey, Channel islands

Re: Hardware fault - has us stumped

Post by P5-133XL » Tue Nov 27, 2012 12:36 pm

Check the RAM using Memtest86+.

One main difference between SMP and bigadv WUs is memory usage. It is possible that RAM outside the range used by standard SMP is flawed, which would leave SMP WUs unaffected while still corrupting bigadv WUs. Using more RAM also places slightly more stress on flaky VRMs, and RAM is sensitive to power spikes as well.
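The logs posted above give a rough feel for the size difference. Here is a small sketch that compares the decompressed input sizes from the "Expanded" lines of the failing Project 8101 unit and the working Project 7165 unit. Input file size is only a loose proxy for RAM usage, not a measurement of it, and the function name expanded_sizes is made up for illustration.

```python
import re

# Matches lines such as
# "[18:14:26] - Expanded 30310386 -> 33158020 (decompressed 109.3 percent)".
EXPANDED_RE = re.compile(r"Expanded (\d+) -> (\d+)")


def expanded_sizes(log_text):
    """Return the decompressed size in bytes for each 'Expanded' line."""
    return [int(match.group(2)) for match in EXPANDED_RE.finditer(log_text)]


# Lines copied from the two logs in the first post:
# the first is the Project 8101 unit, the second the Project 7165 unit.
sample = """\
[18:14:26] - Expanded 30310386 -> 33158020 (decompressed 109.3 percent)
[09:17:05] - Expanded 1795489 -> 2108980 (decompressed 117.4 percent)
"""

p8101_bytes, p7165_bytes = expanded_sizes(sample)
ratio = p8101_bytes / p7165_bytes
print(f"P8101 input is about {ratio:.1f}x the size of the P7165 input")
```

A much larger input does not prove the fault is in RAM, but it fits the idea that the bigger units touch memory the standard SMP units never reach.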
P5-133XL
 
Posts: 4034
Joined: Sun Dec 02, 2007 4:36 am
Location: Salem. OR USA

Re: Hardware fault - has us stumped

Post by Punchy » Tue Nov 27, 2012 1:46 pm

Has a different set of CPUs been tried? Otherwise, if P5's idea doesn't work, try dropping P-states one at a time until it is bigadv-stable.
Punchy
 
Posts: 218
Joined: Fri Feb 19, 2010 1:49 am

Re: Hardware fault - has us stumped

Post by Nathan_P » Tue Nov 27, 2012 7:15 pm

P5-133XL wrote: Check the RAM using Memtest86+.

One main difference between SMP and bigadv WUs is memory usage. It is possible that RAM outside the range used by standard SMP is flawed, which would leave SMP WUs unaffected while still corrupting bigadv WUs. Using more RAM also places slightly more stress on flaky VRMs, and RAM is sensitive to power spikes as well.


Oh, it's a very flaky VRM, the damn thing burned :lol: He is currently running memtest for 8 hours to see what that shows up, and yes, he has set it back to stock clocks.

Thanks for the tips

