Good WU gone bad?

Moderators: Site Moderators, FAHC Science Team

Post Reply
anko1
Posts: 438
Joined: Mon Dec 03, 2007 1:31 am
Hardware configuration: Old Faithful CPU: Windows Graphical 5.03; Intel Pentium 4 Processor 540
(3.2GHz) HT;Windows XP
Big Red: Windows SMP Console 6.29; Windows GPU console 6.20r1; Intel Q9450 2.66G; ASUS P5Q 775 P45; [BFG 9800GTX+ old graphics card] NVidia GeForce 8800 GTX [as of 5/9/09]; Windows XP Pro SP3
Lenovo Think Pad: Windows 6.29 w/ SMP; Windows GPU Console 6.20r1 systray; Intel QX9300; NVIDIA Quadro FX-3700M; Windows XP Professional
Location: SF Peninsula

Good WU gone bad?

Post by anko1 »

I've never seen this before, where the results of a good WU that hasn't been sent suddenly turn bad. There was a system failure on my laptop (the Lenovo), and I had to shut down the machine, but the WU results had already been saved. When I restarted it, I got a message, "Bad Index stored in Work Queue. Has been reset to zero.", but I assumed that referred to a bad wu [which may have been bad due the error on the machine, since it is now reprocessing (so far successfully @45%) the same wu.

Code: Select all

# Windows SMP Console Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.29

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\SMP
Executable: C:\SMP\SMP.exe
Arguments: -smp -deino -verbosity 9 

[23:04:21] - Ask before connecting: No
[23:04:21] - User name: anko1 (Team 47815)
[23:04:21] - User ID: 74362D0123xxxx
[23:04:21] - Machine ID: 1
[23:04:21] 
[23:04:21] Loaded queue successfully.
[23:04:21] 
[23:04:21] - Autosending finished units... [April 2 23:04:21 UTC]
[23:04:21] + Processing work unit
[23:04:21] Trying to send all finished work units
[23:04:21] Core required: FahCore_a3.exe
[23:04:21] Project: 6013 (Run 0, Clone 31, Gen 134)
[23:04:21] Core found.


[23:04:21] + Attempting to send results [April 2 23:04:21 UTC]
[23:04:21] - Reading file work/wuresults_07.dat from core
[23:04:21] Working on queue slot 08 [April 2 23:04:21 UTC]                                             <---------Start WU 08
[23:04:25] + Working ...
[23:04:25]   (Read 25887675 bytes from disk)
[23:04:25] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 08 -np 4 -checkpoint 15 -verbose -lifeline 2440 -version 629'

[23:04:25] Connecting to http://130.237.232.140:8080/
[23:04:25] - Couldn't send HTTP request to server
[23:04:25] + Could not connect to Work Server (results)
[23:04:25]     (130.237.232.140:8080)
[23:04:25] + Retrying using alternative port
[23:04:25] Connecting to http://130.237.232.140:80/
[23:04:25] - Couldn't send HTTP request to server
[23:04:25] + Could not connect to Work Server (results)
[23:04:25]     (130.237.232.140:80)
[23:04:25] - Error: Could not transmit unit 07 (completed April 2) to work server.
[23:04:25] - 4 failed uploads of this unit.


[23:04:25] + Attempting to send results [April 2 23:04:25 UTC]
[23:04:25] - Reading file work/wuresults_07.dat from core
[23:04:25]   (Read 25887675 bytes from disk)
[23:04:25] Connecting to http://130.237.162.125:8080/
[23:04:25] - Couldn't send HTTP request to server
[23:04:25] + Could not connect to Work Server (results)
[23:04:25]     (130.237.162.125:8080)
[23:04:25] + Retrying using alternative port
[23:04:25] Connecting to http://130.237.162.125:80/
[23:04:25] - Couldn't send HTTP request to server
[23:04:25] + Could not connect to Work Server (results)
[23:04:25]     (130.237.162.125:80)
[23:04:25]   Could not transmit unit 07 to Collection server; keeping in queue.
[23:04:25] + Sent 0 of 1 completed units to the server
[23:04:25] - Autosend completed
[23:04:27] 
[23:04:28] *------------------------------*
[23:04:28] Folding@Home Gromacs SMP Core
[23:04:28] Version 2.17 (Mar 12, 2010)
[23:04:28] 
[23:04:28] Preparing to commence simulation
[23:04:28] - Ensuring status. Please wait.
[23:04:36] - Looking at optimizations...
[23:04:36] - Working with standard loops on this execution.
[23:04:36] - Previous termination of core was improper.
[23:04:36] - Files status OK
[23:04:37] - Expanded 1768768 -> 1971489 (decompressed 111.4 percent)
[23:04:37] Called DecompressByteArray: compressed_data_size=1768768 data_size=1971489, decompressed_data_size=1971489 diff=0
[23:04:37] - Digital signature verified
[23:04:38] 
[23:04:38] Project: 6020 (Run 0, Clone 1, Gen 112)
[23:04:38] 
[23:04:38] Entering M.D.
[23:04:46] Completed 0 out of 500000 steps  (0%)
[23:12:06] Completed 5000 out of 500000 steps  (1%)
                      [snip}

[05:04:25] + Attempting to send results [April 3 05:04:25 UTC]
[05:04:25] - Reading file work/wuresults_07.dat from core                                               <----------WU 07 sent
[05:04:25]   (Read 25887675 bytes from disk)
[05:04:25] Connecting to http://130.237.232.140:8080/
[05:06:04] Completed 220000 out of 500000 steps  (44%)
[05:11:04] Posted data.
[05:11:04] Initial: 0000; - Uploaded at ~63 kB/s
[05:11:04] - Averaged speed for that direction ~195 kB/s
[05:11:04] + Results successfully sent
                     {snip}
[13:31:53] Completed 500000 out of 500000 steps  (100%)
[13:31:54] DynamicWrapper: Finished Work Unit: sleep=10000
[13:32:04] 
[13:32:04] Finished Work Unit:
[13:32:04] - Reading up to 20302920 from "work/wudata_08.trr": Read 20302920
[13:32:04] trr file hash check passed.
[13:32:04] edr file hash check passed.
[13:32:04] logfile size: 58828
[13:32:04] Leaving Run
[13:32:07] - Writing 20396024 bytes of core data to disk...
[13:32:08]   ... Done.
[13:32:13] - Shutting down core
[13:32:13] 
[13:32:13] Folding@home Core Shutdown: FINISHED_UNIT
[13:32:15] CoreStatus = 64 (100)                                                                                     <------Finish WU 08 [Good]
[13:32:16] Unit 8 finished with 87 percent of time to deadline remaining.
[13:32:16] Updated performance fraction: 0.842166
[13:32:16] Sending work to server
[13:32:16] Project: 6020 (Run 0, Clone 1, Gen 112)


[13:32:16] + Attempting to send results [April 3 13:32:16 UTC]
[13:32:16] - Reading file work/wuresults_08.dat from core
[13:32:16]   (Read 20396024 bytes from disk)
[13:32:16] Connecting to http://171.64.65.54:8080/
[13:32:16] - Couldn't send HTTP request to server
[13:32:16]   (Got status 503)
[13:32:16] + Could not connect to Work Server (results)
[13:32:16]     (171.64.65.54:8080)
[13:32:16] + Retrying using alternative port

{snip}

[15:00:39] + Processing work unit
[15:00:39] Core required: FahCore_a3.exe
[15:00:39] Core found.
[15:00:39] Working on queue slot 09 [April 3 15:00:39 UTC]
[15:00:39] + Working ...
[15:00:39] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 09 -np 4 -checkpoint 15 -verbose -lifeline 2440 -version 629'

[15:00:39] 
[15:00:39] *------------------------------*
[15:00:39] Folding@Home Gromacs SMP Core
[15:00:39] Version 2.17 (Mar 12, 2010)
[15:00:39] 
[15:00:39] Preparing to commence simulation
[15:00:39] - Looking at optimizations...
[15:00:39] - Created dyn
[15:00:39] - Files status OK
[15:00:40] - Expanded 1797187 -> 2078149 (decompressed 115.6 percent)
[15:00:40] Called DecompressByteArray: compressed_data_size=1797187 data_size=2078149, decompressed_data_size=2078149 diff=0
[15:00:40] - Digital signature verified
[15:00:40] 
[15:00:40] Project: 6012 (Run 1, Clone 4, Gen 42)
[15:00:40] 
[15:00:40] Assembly optimizations on if available.
[15:00:40] Entering M.D.
[15:00:46] Completed 0 out of 500000 steps  (0%)
[15:08:49] Completed 5000 out of 500000 steps  (1%)
[15:16:57] Completed 10000 out of 500000 steps  (2%)
[15:25:03] Completed 15000 out of 500000 steps  (3%)
[15:33:29] Completed 20000 out of 500000 steps  (4%)
[15:44:30] Completed 25000 out of 500000 steps  (5%)
[15:45:47] Caught exception!
[15:45:47] mdrun returned 255
[15:45:47] Going to send back what have done -- stepsTotalG=500000
[15:45:47] Work fraction=0.0512 steps=500000.
[15:45:51] logfile size=13919 infoLength=13919 edr=0 trr=25
[15:45:51] logfile size: 13919 info=13919 bed=0 hdr=25
[15:45:51] Warning: Core could not open logfile work/wudata_09.log; steps completed=500000.
[15:45:51] - Could not open results file
[15:45:51] 
[15:45:51] Folding@home Core Shutdown: EARLY_UNIT_END
[15:45:51] Couldn't delete work/core78.sta.
[15:45:54] CoreStatus = 72 (114)
[15:45:54] Sending work to server
[15:45:54] Project: 6012 (Run 1, Clone 4, Gen 42)
[15:45:54] - Error: Could not get length of results file work/wuresults_09.dat
[15:45:54] - Error: Could not read unit 09 file. Removing from queue.
[15:45:54] Trying to send all finished work units
[15:45:54] Project: 6020 (Run 0, Clone 1, Gen 112)
[15:45:54] - Error: Could not get length of results file work/wuresults_08.dat                     <---- WU 08 gone bad?
[15:45:54] - Error: Could not read unit 08 file. Removing from queue.
[15:45:54] + Sent 0 of 1 completed units to the server
[15:45:54] - Preparing to get new work unit...
[15:45:54] Cleaning up work directory
[15:45:54] + Attempting to get work packet
                 {snip}
[15:57:18] - Attempt #8  to get work failed, and no other work to do.
Waiting before retry.                                                                                                         <------ Hard stop


--- Opening Log file [April 3 17:33:22 UTC] 


# Windows SMP Console Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.29

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\SMP
Executable: C:\SMP\SMP.exe
Arguments: -smp -deino -verbosity 9 

[17:33:22] - Ask before connecting: No
[17:33:22] - User name: anko1 (Team 47815)
[17:33:22] - User ID: 74362D012xxxxx
[17:33:22] - Machine ID: 1
[17:33:22] 
[17:33:22] Loaded queue successfully.
[17:33:22] - Bad Index stored in Work Queue. Has been reset to zero.                                   <---------- Is this a clue?
[17:33:22] - Preparing to get new work unit...
[17:33:22] - Autosending finished units... [April 3 17:33:22 UTC]
[17:33:22] Cleaning up work directory
[17:33:22] Trying to send all finished work units
[17:33:22] + No unsent completed units remaining.
[17:33:22] - Autosend completed
[17:33:22] + Attempting to get work packet
[17:33:22] Passkey found
[17:33:22] - Will indicate memory of 2553 MB
[17:33:22] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 7, Stepping: 10
[17:33:22] - Connecting to assignment server
[17:33:22] Connecting to http://assign.stanford.edu:8080/
[17:33:22] Posted data.
[17:33:22] Initial: ED82; - Successful: assigned to (130.237.232.140).
[17:33:22] + News From Folding@Home: Welcome to Folding@Home
[17:33:23] Loaded queue successfully.
[17:33:23] Connecting to http://130.237.232.140:8080/
[17:33:23] Posted data.
[17:33:23] Initial: 0000; - Receiving payload (expected size: 1797699)
[17:33:28] - Downloaded at ~351 kB/s
[17:33:28] - Averaged speed for that direction ~351 kB/s
[17:33:28] + Received work.
[17:33:28] + Closed connections
[17:33:28] 
[17:33:28] + Processing work unit
[17:33:28] Core required: FahCore_a3.exe
[17:33:28] Core found.
[17:33:28] Working on queue slot 01 [April 3 17:33:28 UTC]
[17:33:28] + Working ...
[17:33:28] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 4 -checkpoint 15 -verbose -lifeline 2448 -version 629'

[17:33:28] 
[17:33:28] *------------------------------*
[17:33:28] Folding@Home Gromacs SMP Core
[17:33:28] Version 2.17 (Mar 12, 2010)
[17:33:28] 
[17:33:28] Preparing to commence simulation
[17:33:28] - Ensuring status. Please wait.
[17:33:38] - Looking at optimizations...
[17:33:38] - Working with standard loops on this execution.
[17:33:38] - Created dyn
[17:33:38] - Files status OK
[17:33:38] - Expanded 1797187 -> 2078149 (decompressed 115.6 percent)
[17:33:38] Called DecompressByteArray: compressed_data_size=1797187 data_size=2078149, decompressed_data_size=2078149 diff=0
[17:33:38] - Digital signature verified
[17:33:38] 
[17:33:38] Project: 6012 (Run 1, Clone 4, Gen 42)
[17:33:38] 
[17:33:38] Entering M.D.
[17:33:45] Completed 0 out of 500000 steps  (0%)
[17:40:30] Completed 5000 out of 500000 steps  (1%)

Don't know why it skipped slot 00. I checked the work folder and there's no wuresults_00 file. However, I still have wuresults_07 [sent] and _08. Also the files for WU 09 are still hanging around. Should I delete them, or will they get cleaned up?
Last edited by anko1 on Sat Apr 03, 2010 11:16 pm, edited 2 times in total.
Wrish
Posts: 74
Joined: Thu Jan 28, 2010 5:09 am

Re: Good WU gone bad?

Post by Wrish »

Work unit 08 ran into an error, the results could not be uploaded, unit was deleted on your end, and therefore the server gave you a fresh copy of the same work unit. Look for system instability.

Try not to delete work files. The client should eventually get around to erasing them.
anko1
Posts: 438
Joined: Mon Dec 03, 2007 1:31 am
Hardware configuration: Old Faithful CPU: Windows Graphical 5.03; Intel Pentium 4 Processor 540
(3.2GHz) HT;Windows XP
Big Red: Windows SMP Console 6.29; Windows GPU console 6.20r1; Intel Q9450 2.66G; ASUS P5Q 775 P45; [BFG 9800GTX+ old graphics card] NVidia GeForce 8800 GTX [as of 5/9/09]; Windows XP Pro SP3
Lenovo Think Pad: Windows 6.29 w/ SMP; Windows GPU Console 6.20r1 systray; Intel QX9300; NVIDIA Quadro FX-3700M; Windows XP Professional
Location: SF Peninsula

Re: Good WU gone bad?

Post by anko1 »

Actually, I think WU 08 finished OK, but WU 09 went bad and then 08 turned bad, or something:

Code: Select all

[13:32:13] Folding@home Core Shutdown: FINISHED_UNIT
[13:32:15] CoreStatus = 64 (100)
[13:32:16] Unit 8 finished with 87 percent of time to deadline remaining.
[13:32:16] Updated performance fraction: 0.842166
[13:32:16] Sending work to server
[13:32:16] Project: 6020 (Run 0, Clone 1, Gen 112)

Code: Select all

[15:45:47] Caught exception!
[15:45:47] mdrun returned 255
[15:45:47] Going to send back what have done -- stepsTotalG=500000
[15:45:47] Work fraction=0.0512 steps=500000.
[15:45:51] logfile size=13919 infoLength=13919 edr=0 trr=25
[15:45:51] logfile size: 13919 info=13919 bed=0 hdr=25
[15:45:51] Warning: Core could not open logfile work/wudata_09.log; steps completed=500000.
[15:45:51] - Could not open results file
[15:45:51] 
[15:45:51] Folding@home Core Shutdown: EARLY_UNIT_END
[15:45:51] Couldn't delete work/core78.sta.
[15:45:54] CoreStatus = 72 (114)
[15:45:54] Sending work to server
[15:45:54] Project: 6012 (Run 1, Clone 4, Gen 42)
[15:45:54] - Error: Could not get length of results file work/wuresults_09.dat
[15:45:54] - Error: Could not read unit 09 file. Removing from queue.

Code: Select all

[15:45:54] Trying to send all finished work units
[15:45:54] Project: 6020 (Run 0, Clone 1, Gen 112)
[15:45:54] - Error: Could not get length of results file work/wuresults_08.dat
[15:45:54] - Error: Could not read unit 08 file. Removing from queue.
[15:45:54] + Sent 0 of 1 completed units to the server
[15:45:54] - Preparing to get new work unit...
Wrish
Posts: 74
Joined: Thu Jan 28, 2010 5:09 am

Re: Good WU gone bad?

Post by Wrish »

Oh, I'm sorry, the snips confused me. Those of us who interpret logs can see uploads, early ends, and client crashes. :)

Work unit 09 was the one that crashed out and was handed back to you as work unit 01. There is a snip between the first and second attempts to upload work unit 08 such that I can't tell what the client recorded. There are also errors deleting files that have me wondering whether an antivirus or permissions problem may be affecting the client.

When you get a Bad Index error, though, you should -oneunit whatever work unit you're working on (completing and uploading it), and then delete queue.dat and, optionally, the entire work folder.

If you don't have a network connection at the time of completing a work unit, you can later run the client with the flag "-send all" and it will upload whatever it has and then exit.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for aprox. 70K PPD
Location: Salem. OR USA

Re: Good WU gone bad?

Post by P5-133XL »

If the WU hasn't actually been successfully uploaded to the server, then yes it is possible for an external event to kill it and thereby fail to get credit. That is a very rare issue however.
Image
anko1
Posts: 438
Joined: Mon Dec 03, 2007 1:31 am
Hardware configuration: Old Faithful CPU: Windows Graphical 5.03; Intel Pentium 4 Processor 540
(3.2GHz) HT;Windows XP
Big Red: Windows SMP Console 6.29; Windows GPU console 6.20r1; Intel Q9450 2.66G; ASUS P5Q 775 P45; [BFG 9800GTX+ old graphics card] NVidia GeForce 8800 GTX [as of 5/9/09]; Windows XP Pro SP3
Lenovo Think Pad: Windows 6.29 w/ SMP; Windows GPU Console 6.20r1 systray; Intel QX9300; NVIDIA Quadro FX-3700M; Windows XP Professional
Location: SF Peninsula

Re: Good WU gone bad?

Post by anko1 »

@Wrish and P5-133XL - thanks for the information.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for aprox. 70K PPD
Location: Salem. OR USA

Re: Good WU gone bad?

Post by P5-133XL »

If the data files are still there, you might try using the qfix utility to see if it can be recovered.
Image
anko1
Posts: 438
Joined: Mon Dec 03, 2007 1:31 am
Hardware configuration: Old Faithful CPU: Windows Graphical 5.03; Intel Pentium 4 Processor 540
(3.2GHz) HT;Windows XP
Big Red: Windows SMP Console 6.29; Windows GPU console 6.20r1; Intel Q9450 2.66G; ASUS P5Q 775 P45; [BFG 9800GTX+ old graphics card] NVidia GeForce 8800 GTX [as of 5/9/09]; Windows XP Pro SP3
Lenovo Think Pad: Windows 6.29 w/ SMP; Windows GPU Console 6.20r1 systray; Intel QX9300; NVIDIA Quadro FX-3700M; Windows XP Professional
Location: SF Peninsula

Re: Good WU gone bad?

Post by anko1 »

Ahh, I'd forgotten about qfix. Thanks for the reminder.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for aprox. 70K PPD
Location: Salem. OR USA

Re: Good WU gone bad?

Post by P5-133XL »

It's alright, I had forgotten about it too, otherwise I would have suggested it earlier when I posted. But then I remembered...
Image
anko1
Posts: 438
Joined: Mon Dec 03, 2007 1:31 am
Hardware configuration: Old Faithful CPU: Windows Graphical 5.03; Intel Pentium 4 Processor 540
(3.2GHz) HT;Windows XP
Big Red: Windows SMP Console 6.29; Windows GPU console 6.20r1; Intel Q9450 2.66G; ASUS P5Q 775 P45; [BFG 9800GTX+ old graphics card] NVidia GeForce 8800 GTX [as of 5/9/09]; Windows XP Pro SP3
Lenovo Think Pad: Windows 6.29 w/ SMP; Windows GPU Console 6.20r1 systray; Intel QX9300; NVIDIA Quadro FX-3700M; Windows XP Professional
Location: SF Peninsula

Re: Good WU gone bad?

Post by anko1 »

I'm using Windows XP, and tried running both Windows/x86 qfix-native and qfix.exe from the link in toTOW's post. I keep getting an "Unknown version of <queue.dat> file" message. Any idea what I'm doing wrong? I noticed toTOW's message about an update for v.6, but don't see anything more recent there. Thanks for your help.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for aprox. 70K PPD
Location: Salem. OR USA

Re: Good WU gone bad?

Post by P5-133XL »

Dick Howell, the original developer died and is no longer maintaining it. Another person took over development and this is his web site: LinuxMinded. It might have a more recent version than the one here and failing all else, there is a link to E-mail him from there.
Image
Post Reply