9712 (Run 55, Clone 20, Gen 64), bad state

Moderators: Site Moderators, FAHC Science Team

ChristianVirtual
Posts: 1596
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

9712 (Run 55, Clone 20, Gen 64), bad state

Post by ChristianVirtual »

On CentOS 7 with a GTX 970 and NVIDIA driver 355.11, I got a bad-state WU.

Interesting: it failed twice at 6%.

Code:

15:42:15:WU00:FS02:0x21:Project: 9712 (Run 55, Clone 20, Gen 64)
15:42:15:WU00:FS02:0x21:Unit: 0x0000012eab40416255b9ae993dfab3a9
15:42:15:WU00:FS02:0x21:CPU: 0x00000000000000000000000000000000
15:42:15:WU00:FS02:0x21:Machine: 2
15:42:15:WU00:FS02:0x21:Reading tar file core.xml
15:42:15:WU00:FS02:0x21:Reading tar file integrator.xml
15:42:15:WU00:FS02:0x21:Reading tar file system.xml
15:42:15:WU00:FS02:0x21:Reading tar file state.xml
15:42:16:WU00:FS02:0x21:Digital signatures verified
15:42:16:WU00:FS02:0x21:Folding@home GPU Core21 Folding@home Core
15:42:16:WU00:FS02:0x21:Version 0.0.11
15:42:41:WU00:FS02:0x21:Completed 0 out of 1280000 steps (0%)
15:42:41:WU00:FS02:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
15:44:54:WU00:FS02:0x21:Completed 12800 out of 1280000 steps (1%)
15:47:01:WU00:FS02:0x21:Completed 25600 out of 1280000 steps (2%)
...
15:55:28:WU00:FS02:0x21:Completed 76800 out of 1280000 steps (6%)
15:56:05:WU00:FS02:0x21:Bad State detected... attempting to resume from last good checkpoint
15:58:12:WU00:FS02:0x21:Completed 12800 out of 1280000 steps (1%)
16:00:18:WU00:FS02:0x21:Completed 25600 out of 1280000 steps (2%)
16:02:25:WU00:FS02:0x21:Completed 38400 out of 1280000 steps (3%)
16:04:32:WU00:FS02:0x21:Completed 51200 out of 1280000 steps (4%)
16:06:39:WU00:FS02:0x21:Completed 64000 out of 1280000 steps (5%)
16:08:45:WU00:FS02:0x21:Completed 76800 out of 1280000 steps (6%)
16:09:22:WU00:FS02:0x21:Bad State detected... attempting to resume from last good checkpoint
16:11:29:WU00:FS02:0x21:Completed 12800 out of 1280000 steps (1%)
16:13:35:WU00:FS02:0x21:Completed 25600 out of 1280000 steps (2%)
16:15:42:WU00:FS02:0x21:Completed 38400 out of 1280000 steps (3%)
16:17:49:WU00:FS02:0x21:Completed 51200 out of 1280000 steps (4%)
16:19:56:WU00:FS02:0x21:Completed 64000 out of 1280000 steps (5%)
16:22:02:WU00:FS02:0x21:Completed 76800 out of 1280000 steps (6%)
16:24:15:WU00:FS02:0x21:Completed 89600 out of 1280000 steps (7%)
...
17:40:49:WU00:FS02:0x21:Completed 550400 out of 1280000 steps (43%)
17:42:29:WU00:FS02:0x21:Bad State detected... attempting to resume from last good checkpoint
17:42:29:WU00:FS02:0x21:Max number of retries reached. Aborting.
17:42:29:WU00:FS02:0x21:ERROR:Max Retries Reached
17:42:29:WU00:FS02:0x21:Saving result file logfile_01.txt
17:42:29:WU00:FS02:0x21:Saving result file log.txt
17:42:30:WU00:FS02:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
17:42:30:WARNING:WU00:FS02:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
17:42:30:WU00:FS02:Sending unit results: id:00 state:SEND error:FAULTY project:9712 run:55 clone:20 gen:64 core:0x21 unit:0x0000012eab40416255b9ae993dfab3a9
17:42:30:WU00:FS02:Uploading 2.97KiB to 171.64.65.98
17:42:30:WU00:FS02:Connecting to 171.64.65.98:8080
17:42:31:WU00:FS02:Upload complete
17:42:31:WU02:FS02:Connecting to 171.67.108.45:80
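For anyone who wants to pull the failure points out of a log like this one, here is a minimal Python sketch. It assumes the line shapes shown in the log above; the function name `bad_state_points` is made up for illustration and is not part of the FAH client.

```python
import re

# Matches progress lines such as "Completed 76800 out of 1280000 steps (6%)".
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps \((\d+)%\)")
BAD_STATE = "Bad State detected"

def bad_state_points(log_lines):
    """Return the last reported percentage before each bad-state message."""
    last_pct = None
    points = []
    for line in log_lines:
        match = PROGRESS.search(line)
        if match:
            last_pct = int(match.group(3))
        elif BAD_STATE in line:
            points.append(last_pct)
    return points

# A few lines copied from the log above.
sample = [
    "15:55:28:WU00:FS02:0x21:Completed 76800 out of 1280000 steps (6%)",
    "15:56:05:WU00:FS02:0x21:Bad State detected... attempting to resume from last good checkpoint",
    "15:58:12:WU00:FS02:0x21:Completed 12800 out of 1280000 steps (1%)",
    "16:08:45:WU00:FS02:0x21:Completed 76800 out of 1280000 steps (6%)",
    "16:09:22:WU00:FS02:0x21:Bad State detected... attempting to resume from last good checkpoint",
    "17:40:49:WU00:FS02:0x21:Completed 550400 out of 1280000 steps (43%)",
    "17:42:29:WU00:FS02:0x21:Bad State detected... attempting to resume from last good checkpoint",
]
print(bad_state_points(sample))  # [6, 6, 43]
```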
Please contribute your logs to http://ppd.fahmm.net
foldy
Posts: 2061
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: 9712 (Run 55, Clone 20, Gen 64), bad state

Post by foldy »

Why are there only 3 retries to resume from a checkpoint when a Bad State is detected?
The work unit failed twice at 6% and a third time at 43%.
Maybe with more retries the work unit would have finished?
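To make the question concrete, here is a rough Python sketch of a fixed limit like the one the log suggests. The limit of 3 is only inferred from the log (the abort came on the third bad state); how the core really counts is an assumption, and `run_with_fixed_limit` is a made-up name.

```python
def run_with_fixed_limit(bad_state_percents, max_bad_states=3):
    """Return the percentage at which the WU is aborted, or None if it survives.

    max_bad_states=3 is inferred from the log above, not taken from the core.
    """
    count = 0
    for pct in bad_state_percents:
        count += 1  # a fixed limit never forgets earlier bad states
        if count >= max_bad_states:
            return pct
    return None

print(run_with_fixed_limit([6, 6, 43]))                    # 43
print(run_with_fixed_limit([6, 6, 43], max_bad_states=4))  # None
```

With one more allowed bad state, this particular sequence would not have been aborted at 43%, although nothing says it would not have hit another bad state later.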
ChristianVirtual
Posts: 1596
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: 9712 (Run 55, Clone 20, Gen 64), bad state

Post by ChristianVirtual »

You have to set a limit somewhere ... I'm fine with that. I just hope the developers identify and remove the root cause.
toTOW
Site Moderator
Posts: 6312
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: 9712 (Run 55, Clone 20, Gen 64), bad state

Post by toTOW »

Another example of the random bad states with core 21 :(

In the WU history, I see 5 failures, and finally someone completed it. Unfortunately, I don't have access to the hardware used in each report.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
billford
Posts: 1005
Joined: Thu May 02, 2013 8:46 pm
Hardware configuration: Full Time:

2x NVidia GTX 980
1x NVidia GTX 780 Ti
2x 3GHz Core i5 PC (Linux)

Retired:

3.2GHz Core i5 PC (Linux)
3.2GHz Core i5 iMac
2.8GHz Core i5 iMac
2.16GHz Core 2 Duo iMac
2GHz Core 2 Duo MacBook
1.6GHz Core 2 Duo Acer laptop
Location: Near Oxford, United Kingdom

Re: 9712 (Run 55, Clone 20, Gen 64), bad state

Post by billford »

foldy wrote: Why are there only 3 retries to resume from checkpoint on Bad State detected?
I was just thinking the same relating to one of my current WUs...
ChristianVirtual wrote:Somewhere you have to set a limit ...
I agree with that too, but perhaps some sort of "proportionality" could be incorporated to make it more forgiving?

Eg, if "Bad states" have occurred at 25% and 62%, then another occurs at 97% it seems worth having another try rather than simply quitting.

Not sure how it would be coded (maybe decrement the count after 15 completed frames?), but I'm sure it could be done and might well be worth the effort when processing the very large molecules that Core_21 is intended for.
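A rough Python sketch of the "decrement the count after 15 completed frames" idea, just to show it is not much code. Everything here is hypothetical: the class and function names are invented, the limit of 3 is a guess from the log earlier in the thread, and it ignores progress lost when resuming from a checkpoint. It is not how Core_21 actually works.

```python
class ForgivingRetryCounter:
    """Bad-state counter that forgives one bad state per run of good frames."""

    def __init__(self, max_bad_states=3, frames_per_credit=15):
        self.max_bad_states = max_bad_states        # assumed limit
        self.frames_per_credit = frames_per_credit  # the "15 frames" suggestion
        self.count = 0
        self.good_frames = 0

    def frame_completed(self):
        """Call once per completed frame without a bad state."""
        self.good_frames += 1
        if self.good_frames >= self.frames_per_credit:
            self.good_frames = 0
            if self.count > 0:
                self.count -= 1  # forgive one earlier bad state

    def bad_state(self):
        """Record a bad state; return True if the WU should be aborted."""
        self.count += 1
        self.good_frames = 0
        return self.count >= self.max_bad_states


def survives(good_runs, counter):
    """good_runs: frames completed before each bad state. True if never aborted."""
    for frames in good_runs:
        for _ in range(frames):
            counter.frame_completed()
        if counter.bad_state():
            return False
    return True


# Bad states at 25%, 62% and 97%: 25, 37 and 35 good frames before each one.
print(survives([25, 37, 35], ForgivingRetryCounter()))                         # True
# With forgiveness effectively switched off, the third bad state aborts.
print(survives([25, 37, 35], ForgivingRetryCounter(frames_per_credit=10**9)))  # False
```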
ChristianVirtual
Posts: 1596
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: 9712 (Run 55, Clone 20, Gen 64), bad state

Post by ChristianVirtual »

billford wrote: Not sure how it would be coded (maybe decrement the count after 15 completed frames?), but I'm sure it could be done and might well be worth the effort when processing the very large molecules that Core_21 is intended for.
That could be rather easy, like depreciation ... after x good % reduce the retry counter again :eugeek:

[facepalm]Obviously I did not read what you wrote before I posted :oops: [/facepalm]
Last edited by ChristianVirtual on Sun Oct 11, 2015 12:34 pm, edited 1 time in total.
billford
Posts: 1005
Joined: Thu May 02, 2013 8:46 pm

Re: 9712 (Run 55, Clone 20, Gen 64), bad state

Post by billford »

It seems easy to me too, at least on the surface, but I know absolutely nothing about how the cores work!
billford
Posts: 1005
Joined: Thu May 02, 2013 8:46 pm

Re: 9712 (Run 55, Clone 20, Gen 64), bad state

Post by billford »

Suggestion submitted on reddit... now to see what excuses PG come up with for not implementing it.