BAD_STATE sometimes when OCed

Postby Stephen1R2 » Wed Apr 25, 2018 2:09 am

Hello

I know the solution is not to overclock (OC) the GPU while folding, but sometimes I forget to switch it back after a game. Even with the card underclocked and the power target lowered, it runs solid and cool but still hits BAD_STATE if I start a YouTube video.

So could the client auto-pause the slot after the first BAD_STATE error, so you don't come back to the machine and find 10-20 failed units?
Code:
*********************** Log Started 2018-04-25T01:10:29Z ***********************
01:10:29:************************* Folding@home Client *************************
01:10:29:      Website: http://folding.stanford.edu/
01:10:29:    Copyright: (c) 2009-2014 Stanford University
01:10:29:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
01:10:29:         Args: --open-web-control
01:10:29:       Config: D:/common/FAHClient/config.xml
01:10:29:******************************** Build ********************************
01:10:29:      Version: 7.4.4
01:10:29:         Date: Mar 4 2014
01:10:29:         Time: 20:26:54
01:10:29:      SVN Rev: 4130
01:10:29:       Branch: fah/trunk/client
01:10:29:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
01:10:29:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
01:10:29:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
01:10:29:     Platform: win32 XP
01:10:29:         Bits: 32
01:10:29:         Mode: Release
01:10:29:******************************* System ********************************
01:10:29:          CPU: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
01:10:29:       CPU ID: GenuineIntel Family 6 Model 45 Stepping 7
01:10:29:         CPUs: 32
01:10:29:       Memory: 63.97GiB
01:10:29:  Free Memory: 52.97GiB
01:10:29:      Threads: WINDOWS_THREADS
01:10:29:   OS Version: 6.2
01:10:29:  Has Battery: false
01:10:29:   On Battery: false
01:10:29:   UTC Offset: -4
01:10:29:          PID: 1636
01:10:29:          CWD: D:/common/FAHClient
01:10:29:           OS: Windows 10 Pro
01:10:29:      OS Arch: AMD64
01:10:29:         GPUs: 1
01:10:29:        GPU 0: ATI:5 Hawaii [Radeon R9 200/300 Series]
01:10:29:         CUDA: Not detected
01:10:29:Win32 Service: false
01:10:29:***********************************************************************
01:10:29:<config>
01:10:29:  <!-- Network -->
01:10:29:  <proxy v=':8080'/>
01:10:29:
01:10:29:  <!-- Slot Control -->
01:10:29:  <pause-on-battery v='false'/>
01:10:29:
01:10:29:  <!-- User Information -->
01:10:29:  <passkey v='********************************'/>
01:10:29:  <user v='Stephen1R2'/>
01:10:29:
01:10:29:  <!-- Folding Slots -->
01:10:29:  <slot id='1' type='GPU'/>
01:10:29:  <slot id='2' type='CPU'>
01:10:29:    <cpus v='24'/>
01:10:29:    <paused v='true'/>
01:10:29:  </slot>
01:10:29:</config>
01:10:29:Trying to access database...
01:10:29:Successfully acquired database lock
01:10:29:Enabled folding slot 01: READY gpu:0:Hawaii [Radeon R9 200/300 Series]
01:10:29:Enabled folding slot 02: PAUSED cpu:24 (by user)
01:10:30:WU00:FS01:Connecting to 171.67.108.45:80
01:10:31:WU00:FS01:Assigned to work server 155.247.166.219
01:10:31:WU00:FS01:Requesting new work unit for slot 01: READY gpu:0:Hawaii [Radeon R9 200/300 Series] from 155.247.166.219
01:10:31:WU00:FS01:Connecting to 155.247.166.219:8080
01:10:33:WU00:FS01:Downloading 904.00KiB
01:10:33:WU00:FS01:Download complete
01:10:33:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:13782 run:195 clone:2 gen:178 core:0x21 unit:0x000000bf0002894b5a78cee742bffd8f
Stephen1R2
 
Posts: 7
Joined: Sun Nov 06, 2016 8:58 pm

Re: BAD_STATE sometimes when OCed

Postby bruce » Wed Apr 25, 2018 3:42 am

FAH is a scientific project, not a game. Science is VERY strict about discarding work that contains calculation errors whereas if there's a minor calculation error in a game, it's often as simple as a few pixels that are the wrong color or are offset slightly from where they should be: no foul, no harm.

FAH will discard the bad information and move on. If it detects a few consecutive discards, it will shut down to avoid draining the server(s) of WUs that must then be reassigned to someone else who can complete them successfully. "A few" is not precisely defined, but it is probably something like the 10-ish that you've suggested. The log you posted doesn't show that information.
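
As a rough illustration of that "stop after a few consecutive discards" behavior, here is a minimal Python sketch. It is not the client's actual logic, which is not shown in this thread. The class name FailureGuard and the default limit of 10 are assumptions (the limit is taken from the "10-ish" guess above), and the result strings are the ones FahCore prints in the log, such as FINISHED_UNIT and BAD_WORK_UNIT.

```python
class FailureGuard:
    """Hypothetical helper: flag a slot for pausing after `limit` failures in a row."""

    def __init__(self, limit: int = 10):
        # `limit` is an assumed value; the real threshold is not documented here.
        self.limit = limit
        self.consecutive = 0
        self.paused = False

    def record(self, result: str) -> bool:
        """Feed one FahCore result string; return True once the slot should pause."""
        if result == "FINISHED_UNIT":
            # A successful unit resets the run of failures.
            self.consecutive = 0
        else:
            self.consecutive += 1
            if self.consecutive >= self.limit:
                self.paused = True
        return self.paused
```

With limit=1 the same sketch would behave like the "pause on the first failure" request, which is why the threshold is kept as a parameter.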
bruce
 
Posts: 22738
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: BAD_STATE sometimes when OCed

Postby Stephen1R2 » Wed Apr 25, 2018 4:04 am

Hi and thanks for the quick reply.

The BAD_STATE doesn't happen all the time, and for the first two or three occurrences the log states that it will resume from a checkpoint. I was wondering whether there is a way for the client to pause the slot the first time it receives that signal, so that the progress up to that point is not lost just because I was away from the machine and did not notice.
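
To make the request concrete, here is a minimal Python sketch of what such a watcher could look like if it ran outside the client. It only assumes the log line format quoted in this thread. The function watch and the pause_slot callback are hypothetical names; how the pause would actually be delivered to the client is left out and would need to be checked against the client documentation.

```python
import re
from typing import Callable, Iterable, List

# Matches core log lines such as:
# 18:54:41:WU00:FS01:0x21:Bad State detected... attempting to resume ...
BAD_STATE = re.compile(
    r"^\d\d:\d\d:\d\d:WU\d+:FS(\d+):0x[0-9a-f]+:Bad State detected"
)


def watch(lines: Iterable[str], pause_slot: Callable[[int], None]) -> List[int]:
    """Call pause_slot(slot) the first time each folding slot reports a Bad State.

    Returns the slot numbers that were paused, in the order they were seen.
    """
    paused: List[int] = []
    for line in lines:
        match = BAD_STATE.match(line)
        if match:
            slot = int(match.group(1))
            if slot not in paused:
                pause_slot(slot)  # hypothetical hook, supplied by the caller
                paused.append(slot)
    return paused
```

Feeding it the lines of log.txt as they are appended, with pause_slot wired to whatever mechanism pauses slot 01, would stop the slot at the first warning instead of after the third retry.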

This is also the case for the odd YouTube-caused failure. One relevant log portion is below.

Code:
18:43:54:WU00:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:Hawaii [Radeon R9 200/300 Series] from 155.247.166.219
18:43:54:WU00:FS01:Connecting to 155.247.166.219:8080
18:43:56:WU00:FS01:Downloading 901.44KiB
18:43:56:WU00:FS01:Download complete
18:43:56:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:13781 run:440 clone:1 gen:39 core:0x21 unit:0x000000280002894b5a8264b425b84082
18:44:28:WU02:FS02:0xa4:Completed 122500 out of 250000 steps  (49%)
18:45:04:WU02:FS02:0xa4:Completed 125000 out of 250000 steps  (50%)
18:45:20:WU01:FS01:0x21:Completed 5000000 out of 5000000 steps (100%)
18:45:21:WU01:FS01:0x21:Saving result file logfile_01.txt
18:45:21:WU01:FS01:0x21:Saving result file checkpointState.xml
18:45:21:WU01:FS01:0x21:Saving result file checkpt.crc
18:45:21:WU01:FS01:0x21:Saving result file log.txt
18:45:21:WU01:FS01:0x21:Saving result file positions.xtc
18:45:22:WU01:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT
18:45:22:WU01:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
18:45:22:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:13781 run:609 clone:1 gen:30 core:0x21 unit:0x0000001e0002894b5a8264c9db7219db
18:45:22:WU01:FS01:Uploading 1.77MiB to 155.247.166.219
18:45:22:WU01:FS01:Connecting to 155.247.166.219:8080
18:45:22:WU00:FS01:Starting
18:45:22:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" D:/common/FAHClient/cores/fahwebx.stanford.edu/cores/Win32/AMD64/ATI/R600/Core_21.fah/FahCore_21.exe -dir 00 -suffix 01 -version 704 -lifeline 13504 -checkpoint 15 -gpu 0 -gpu-vendor ati
18:45:22:WU00:FS01:Started FahCore on PID 3532
18:45:22:WU00:FS01:Core PID:7252
18:45:22:WU00:FS01:FahCore 0x21 started
18:45:23:WU00:FS01:0x21:*********************** Log Started 2018-03-18T18:45:23Z ***********************
18:45:23:WU00:FS01:0x21:Project: 13781 (Run 440, Clone 1, Gen 39)
18:45:23:WU00:FS01:0x21:Unit: 0x000000280002894b5a8264b425b84082
18:45:23:WU00:FS01:0x21:CPU: 0x00000000000000000000000000000000
18:45:23:WU00:FS01:0x21:Machine: 1
18:45:23:WU00:FS01:0x21:Reading tar file core.xml
18:45:23:WU00:FS01:0x21:Reading tar file integrator.xml
18:45:23:WU00:FS01:0x21:Reading tar file state.xml
18:45:23:WU00:FS01:0x21:Reading tar file system.xml
18:45:23:WU00:FS01:0x21:Digital signatures verified
18:45:23:WU00:FS01:0x21:Folding@home GPU Core21 Folding@home Core
18:45:23:WU00:FS01:0x21:Version 0.0.18
18:45:23:WU01:FS01:Upload complete
18:45:23:WU01:FS01:Server responded WORK_ACK (400)
18:45:23:WU01:FS01:Final credit estimate, 22648.00 points
18:45:24:WU01:FS01:Cleaning up
18:45:29:WU00:FS01:0x21:Completed 0 out of 5000000 steps (0%)
18:45:29:WU00:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
18:45:40:WU02:FS02:0xa4:Completed 127500 out of 250000 steps  (51%)
18:46:16:WU02:FS02:0xa4:Completed 130000 out of 250000 steps  (52%)
18:46:52:WU02:FS02:0xa4:Completed 132500 out of 250000 steps  (53%)
18:46:57:WU00:FS01:0x21:Completed 50000 out of 5000000 steps (1%)
18:47:27:WU02:FS02:0xa4:Completed 135000 out of 250000 steps  (54%)
18:48:03:WU02:FS02:0xa4:Completed 137500 out of 250000 steps  (55%)
18:48:26:WU00:FS01:0x21:Completed 100000 out of 5000000 steps (2%)
18:48:39:WU02:FS02:0xa4:Completed 140000 out of 250000 steps  (56%)
18:49:15:WU02:FS02:0xa4:Completed 142500 out of 250000 steps  (57%)
18:49:51:WU02:FS02:0xa4:Completed 145000 out of 250000 steps  (58%)
18:49:54:WU00:FS01:0x21:Completed 150000 out of 5000000 steps (3%)
18:50:27:WU02:FS02:0xa4:Completed 147500 out of 250000 steps  (59%)
18:51:02:WU02:FS02:0xa4:Completed 150000 out of 250000 steps  (60%)
18:51:23:WU00:FS01:0x21:Completed 200000 out of 5000000 steps (4%)
18:51:38:WU02:FS02:0xa4:Completed 152500 out of 250000 steps  (61%)
18:52:14:WU02:FS02:0xa4:Completed 155000 out of 250000 steps  (62%)
18:52:49:WU02:FS02:0xa4:Completed 157500 out of 250000 steps  (63%)
18:52:51:WU00:FS01:0x21:Completed 250000 out of 5000000 steps (5%)
18:53:25:WU02:FS02:0xa4:Completed 160000 out of 250000 steps  (64%)
18:54:02:WU02:FS02:0xa4:Completed 162500 out of 250000 steps  (65%)
18:54:27:WU00:FS01:0x21:Completed 300000 out of 5000000 steps (6%)
18:54:40:WU02:FS02:0xa4:Completed 165000 out of 250000 steps  (66%)
18:54:41:WU00:FS01:0x21:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
18:55:17:WU00:FS01:0x21:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
18:55:17:WU02:FS02:0xa4:Completed 167500 out of 250000 steps  (67%)
18:55:55:WU02:FS02:0xa4:Completed 170000 out of 250000 steps  (68%)
18:56:32:WU02:FS02:0xa4:Completed 172500 out of 250000 steps  (69%)
18:57:04:WU00:FS01:0x21:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
18:57:04:WU00:FS01:0x21:ERROR:114: Max Retries Reached
18:57:04:WU00:FS01:0x21:Saving result file logfile_01.txt
18:57:04:WU00:FS01:0x21:Saving result file badstate-0.xml
18:57:05:WU00:FS01:0x21:Saving result file badstate-1.xml
18:57:05:WU00:FS01:0x21:Saving result file badstate-2.xml
18:57:05:WU00:FS01:0x21:Saving result file log.txt
18:57:05:WU00:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
18:57:06:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:57:06:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13781 run:440 clone:1 gen:39 core:0x21 unit:0x000000280002894b5a8264b425b84082
18:57:06:WU00:FS01:Uploading 6.41KiB to 155.247.166.219
18:57:06:WU00:FS01:Connecting to 155.247.166.219:8080
18:57:06:WU00:FS01:Upload complete
18:57:07:WU00:FS01:Server responded WORK_ACK (400)
18:57:07:WU00:FS01:Cleaning up
18:57:07:WU01:FS01:Connecting to 171.67.108.45:80
18:57:07:WU01:FS01:Assigned to work server 155.247.166.219
18:57:07:WU01:FS01:Requesting new work unit for slot 01: READY gpu:0:Hawaii [Radeon R9 200/300 Series] from 155.247.166.219
18:57:07:WU01:FS01:Connecting to 155.247.166.219:8080
18:57:09:WU01:FS01:Downloading 901.71KiB
18:57:09:WU01:FS01:Download complete
18:57:09:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13783 run:648 clone:1 gen:37 core:0x21 unit:0x000000260002894b5a833fc9144f027c
18:57:09:WU01:FS01:Starting
18:57:09:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" D:/common/FAHClient/cores/fahwebx.stanford.edu/cores/Win32/AMD64/ATI/R600/Core_21.fah/FahCore_21.exe -dir 01 -suffix 01 -version 704 -lifeline 13504 -checkpoint 15 -gpu 0 -gpu-vendor ati
18:57:09:WU01:FS01:Started FahCore on PID 5552
18:57:09:WU01:FS01:Core PID:16016
18:57:09:WU01:FS01:FahCore 0x21 started
18:57:10:WU02:FS02:0xa4:Completed 175000 out of 250000 steps  (70%)
18:57:10:WU01:FS01:0x21:*********************** Log Started 2018-03-18T18:57:09Z ***********************
18:57:10:WU01:FS01:0x21:Project: 13783 (Run 648, Clone 1, Gen 37)
18:57:10:WU01:FS01:0x21:Unit: 0x000000260002894b5a833fc9144f027c
18:57:10:WU01:FS01:0x21:CPU: 0x00000000000000000000000000000000
18:57:10:WU01:FS01:0x21:Machine: 1
18:57:10:WU01:FS01:0x21:Reading tar file core.xml
18:57:10:WU01:FS01:0x21:Reading tar file integrator.xml
18:57:10:WU01:FS01:0x21:Reading tar file state.xml
18:57:10:WU01:FS01:0x21:Reading tar file system.xml
18:57:10:WU01:FS01:0x21:Digital signatures verified
18:57:10:WU01:FS01:0x21:Folding@home GPU Core21 Folding@home Core
18:57:10:WU01:FS01:0x21:Version 0.0.18
18:57:17:WU01:FS01:0x21:Completed 0 out of 5000000 steps (0%)
18:57:17:WU01:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
18:57:39:WU01:FS01:0x21:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
18:57:47:WU02:FS02:0xa4:Completed 177500 out of 250000 steps  (71%)
18:58:25:WU02:FS02:0xa4:Completed 180000 out of 250000 steps  (72%)
18:58:57:WU01:FS01:0x21:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
18:59:02:WU02:FS02:0xa4:Completed 182500 out of 250000 steps  (73%)
18:59:40:WU02:FS02:0xa4:Completed 185000 out of 250000 steps  (74%)
19:00:17:WU02:FS02:0xa4:Completed 187500 out of 250000 steps  (75%)
19:00:42:WU01:FS01:0x21:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
19:00:42:WU01:FS01:0x21:ERROR:114: Max Retries Reached
19:00:42:WU01:FS01:0x21:Saving result file logfile_01.txt
19:00:42:WU01:FS01:0x21:Saving result file badstate-0.xml
19:00:43:WU01:FS01:0x21:Saving result file badstate-1.xml
19:00:43:WU01:FS01:0x21:Saving result file badstate-2.xml
19:00:44:WU01:FS01:0x21:Saving result file log.txt
19:00:44:WU01:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
19:00:44:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
19:00:44:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13783 run:648 clone:1 gen:37 core:0x21 unit:0x000000260002894b5a833fc9144f027c

Stephen1R2
 
Posts: 7
Joined: Sun Nov 06, 2016 8:58 pm

Re: BAD_STATE sometimes when OCed

Postby bruce » Wed Apr 25, 2018 4:17 am

Your client discarded two WUs (after 3 attempts on each one). After (probably) 5 times that long, the client will pause.
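
That count can be checked mechanically. The Python sketch below assumes only the two line formats visible in the posted log ("Bad State detected" and "FahCore returned: ..."). The function name summarize is hypothetical. Because queue ids such as WU01 are reused for successive units, the retry count is reset each time a unit returns.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

BAD_STATE = re.compile(r":(WU\d+):FS\d+:0x[0-9a-f]+:Bad State detected")
RETURNED = re.compile(r":(WU\d+):FS\d+:FahCore returned: (\w+)")


def summarize(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Return one (queue id, FahCore result, bad-state retries) tuple per finished unit."""
    pending: Counter = Counter()  # bad-state retries seen since the unit started
    finished: List[Tuple[str, str, int]] = []
    for line in lines:
        match = BAD_STATE.search(line)
        if match:
            pending[match.group(1)] += 1
            continue
        match = RETURNED.search(line)
        if match:
            wu, result = match.groups()
            # Reset the count so a later unit in the same queue slot starts at 0.
            finished.append((wu, result, pending.pop(wu, 0)))
    return finished
```

Run over the log portion above, this should report one FINISHED_UNIT with no retries, followed by two BAD_WORK_UNIT results with three retries each.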

A paused client also wastes useful resources ... just not in the same way ... and many people would object to that happening on their unattended machine.

It's your responsibility to donate resources from stable hardware.
bruce
 
Posts: 22738
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: BAD_STATE sometimes when OCed

Postby Stephen1R2 » Wed Apr 25, 2018 4:31 am

I know.

Everything chugs along just fine when I have the card on standard settings with the reduced power target -- quite stable while unattended, and it stays at about 60 °C.

It just explodes (and loses 2 hours of progress) if I sit down and watch a YouTube video without first pausing the GPU slot. That is the situation the log above comes from.

This is just a convenience issue and a sort of feature request; the pause behavior should be optional.
Stephen1R2
 
Posts: 7
Joined: Sun Nov 06, 2016 8:58 pm

Re: BAD_STATE sometimes when OCed

Postby bruce » Wed Apr 25, 2018 7:10 am

Most applications have a setting that lets them use your CPU for rendering rather than the GPU. If you can find the setting that tells YouTube NOT to use accelerated rendering, you MIGHT or MIGHT NOT experience pauses in the video stream, but you wouldn't want to do that with a graphics-intensive game. The same type of setting is buried somewhere deep in Windows, and the rendering of the desktop will probably be BETTER than if Windows is competing with FAH ... or, if your system has room for a second GPU, that might work, too.
bruce
 
Posts: 22738
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

