Re: Core 21 failures on GTX970

Posted: Sat Dec 05, 2015 12:56 pm
by toTOW
JohnChodera wrote:> This is the TDR issue ... now, is it triggered by faulty hardware or by software?

Are you sure this is the same as the TDR? Project 11411 is not very large, so it would be surprising if one of the GPU kernels was exceeding the Windows timeout on a GTX 970.

A clEnqueueNDRangeKernel (-5) error together with a GPU reset logged in the Windows Event Log is typically the TDR issue.

But the TDR is not necessarily triggered by software misbehaviour. The purpose of TDR (Timeout Detection and Recovery) is to detect an unresponsive GPU and avoid a frozen screen, like we used to have on previous versions of Windows. There are two sources of freezes:

- when software sends too much work to the GPU. The first symptom is a sluggish UI; as the work gets even bigger, it leads to longer freezes. This is the form of TDR triggered by large WUs that you tried to work around in core21.
- when an unrecoverable error occurs in the GPU or VRAM. This one is usually triggered by too much overclocking, or by faulty/dying hardware. It is similar to the freezes that can happen on CPUs with an unstable overclock ...
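
For reference, the -5 reported by clEnqueueNDRangeKernel is the OpenCL status code CL_OUT_OF_RESOURCES. The following is only a minimal Python sketch for decoding such codes when reading logs: the helper name `describe_cl_error` is made up for illustration, and the table lists just the first few runtime status codes from the OpenCL headers. The code by itself does not tell you which of the two causes above triggered the reset.

```python
# Subset of the OpenCL runtime status codes defined in CL/cl.h.
# Only the first few are listed; extend the table as needed.
CL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}


def describe_cl_error(code: int) -> str:
    """Return the symbolic name for an OpenCL status code, if it is in the table."""
    # describe_cl_error is a hypothetical helper, not part of any FAH tool.
    return CL_ERRORS.get(code, f"unknown OpenCL status ({code})")


print(describe_cl_error(-5))
```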

Re: Core 21 failures on GTX970

Posted: Sat Dec 05, 2015 3:24 pm
by foldy
@Duce H_K_: Maybe you can use FurMark to test whether your GPU overclock is stable?

Re: Core 21 failures on GTX970

Posted: Sat Dec 05, 2015 7:43 pm
by artoar_11
@Duce H_K_:
GPU-Z shows "Boost 1443 MHz" (see your screenshot). If you press "Sensors" you will see the real GPU frequency. On your second screenshot MSI AB shows 1493 MHz. This is the real GPU frequency. 1493 MHz is a high OC.
My GTX 970@1430 MHz/1.1870V, vMEM@6010 MHz works well for now with core_21 v 0.0.14.

Re: Core 21 failures on GTX970

Posted: Sat Dec 05, 2015 10:25 pm
by bruce
When running FAH, overclocking VRAM is usually a bad idea. More power, more heat, almost no change to performance.

Use default memory speed and push the shaders a bit more if that leaves more margin.

Re: Core 21 failures on GTX970

Posted: Thu Dec 10, 2015 4:25 am
by Kebast
Just a quick note. I've had zero failures since the core 21 WUs were removed for this card. The core 21s are working just fine on my Ubuntu system with the 750 Ti. Zero failures there as well.

Re: Core 21 failures on GTX970

Posted: Thu Dec 10, 2015 5:50 pm
by foldy
@Kebast: There is a new version of Core_21 to be released which will fix many of the bad states.

Re: Core 21 failures on GTX970

Posted: Wed Jan 06, 2016 5:43 am
by Nert
I'm not sure if I should tack on to the end of this thread or not, but I believe I had a core 21 failure on my GTX 970 this evening. It's kind of hard for me to tell for sure if it's the 970 or the 750 Ti that failed, since they are not identified properly in Advanced Control. But I believe it was the 970.

Here's the beginning of the log:

*********************** Log Started 2016-01-01T00:10:18Z ***********************
00:10:18:************************* Folding@home Client *************************
00:10:18:      Website: http://folding.stanford.edu/
00:10:18:    Copyright: (c) 2009-2014 Stanford University
00:10:18:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
00:10:18:         Args: 
00:10:18:       Config: C:/Users/roger/AppData/Roaming/FAHClient/config.xml
00:10:18:******************************** Build ********************************
00:10:18:      Version: 7.4.4
00:10:18:         Date: Mar 4 2014
00:10:18:         Time: 20:26:54
00:10:18:      SVN Rev: 4130
00:10:18:       Branch: fah/trunk/client
00:10:18:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
00:10:18:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
00:10:18:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
00:10:18:     Platform: win32 XP
00:10:18:         Bits: 32
00:10:18:         Mode: Release
00:10:18:******************************* System ********************************
00:10:18:          CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
00:10:18:       CPU ID: GenuineIntel Family 6 Model 60 Stepping 3
00:10:18:         CPUs: 8
00:10:18:       Memory: 15.94GiB
00:10:18:  Free Memory: 14.02GiB
00:10:18:      Threads: WINDOWS_THREADS
00:10:18:   OS Version: 6.2
00:10:18:  Has Battery: false
00:10:18:   On Battery: false
00:10:18:   UTC Offset: -7
00:10:18:          PID: 7092
00:10:18:          CWD: C:/Users/roger/AppData/Roaming/FAHClient
00:10:18:           OS: Windows 10 Home
00:10:18:      OS Arch: AMD64
00:10:18:         GPUs: 2
00:10:18:        GPU 0: NVIDIA:4 GM107 [GeForce GTX 750 Ti]
00:10:18:        GPU 1: NVIDIA:5 GM204 [GeForce GTX 970]
00:10:18:         CUDA: 5.2
00:10:18:  CUDA Driver: 7050
00:10:18:Win32 Service: false
00:10:18:***********************************************************************
00:10:18:<config>
00:10:18:  <!-- Folding Slot Configuration -->
00:10:18:  <cause v='PARKINSONS'/>
00:10:18:
00:10:18:  <!-- Network -->
00:10:18:  <proxy v=':8080'/>
00:10:18:
00:10:18:  <!-- Slot Control -->
00:10:18:  <power v='FULL'/>
00:10:18:
00:10:18:  <!-- User Information -->
00:10:18:  <passkey v='********************************'/>
00:10:18:  <team v='165780'/>
00:10:18:  <user v='nert'/>
00:10:18:
00:10:18:  <!-- Folding Slots -->
00:10:18:  <slot id='1' type='GPU'>
00:10:18:    <max-packet-size v='big'/>
00:10:18:    <paused v='true'/>
00:10:18:  </slot>
00:10:18:  <slot id='2' type='GPU'>
00:10:18:    <paused v='true'/>
00:10:18:  </slot>
00:10:18:  <slot id='0' type='CPU'>
00:10:18:    <paused v='true'/>
00:10:18:  </slot>
00:10:18:</config>
00:10:18:Trying to access database...
00:10:18:Successfully acquired database lock

And the error condition:

05:28:09:WU02:FS02:Connecting to 171.67.108.45:80
05:28:10:WU02:FS02:Assigned to work server 140.163.4.242
05:28:10:WU02:FS02:Requesting new work unit for slot 02: RUNNING gpu:1:GM204 [GeForce GTX 970] from 140.163.4.242
05:28:10:WU02:FS02:Connecting to 140.163.4.242:8080
05:28:11:WU02:FS02:Downloading 4.14MiB
05:28:13:WU02:FS02:Download complete
05:28:13:WU02:FS02:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:11402 run:0 clone:20 gen:103 core:0x21 unit:0x0000009e8ca304f255ed4e3f1c299dcf
05:28:20:WU00:FS02:0x18:Saving result file logfile_01.txt
05:28:20:WU00:FS02:0x18:Saving result file checkpointState.xml
05:28:22:WU00:FS02:0x18:Saving result file checkpt.crc
05:28:22:WU00:FS02:0x18:Saving result file log.txt
05:28:22:WU00:FS02:0x18:Saving result file positions.xtc
05:28:23:WU00:FS02:0x18:Folding@home Core Shutdown: FINISHED_UNIT
05:28:24:WU00:FS02:FahCore returned: FINISHED_UNIT (100 = 0x64)
05:28:24:WU00:FS02:Sending unit results: id:00 state:SEND error:NO_ERROR project:9141 run:43 clone:0 gen:260 core:0x18 unit:0x000001330a3b1e6155664fbdaa904638
05:28:24:WU00:FS02:Uploading 6.00MiB to 171.64.65.61
05:28:24:WU02:FS02:Starting
05:28:24:WU00:FS02:Connecting to 171.64.65.61:8080
05:28:24:WU02:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/roger/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 02 -suffix 01 -version 704 -lifeline 7092 -checkpoint 15 -gpu 1 -gpu-vendor nvidia
05:28:24:WU02:FS02:Started FahCore on PID 1884
05:28:24:WU02:FS02:Core PID:11436
05:28:24:WU02:FS02:FahCore 0x21 started
05:28:24:WU02:FS02:0x21:*********************** Log Started 2016-01-06T05:28:24Z ***********************
05:28:24:WU02:FS02:0x21:Project: 11402 (Run 0, Clone 20, Gen 103)
05:28:24:WU02:FS02:0x21:Unit: 0x0000009e8ca304f255ed4e3f1c299dcf
05:28:24:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
05:28:24:WU02:FS02:0x21:Machine: 2
05:28:24:WU02:FS02:0x21:Reading tar file core.xml
05:28:24:WU02:FS02:0x21:Reading tar file system.xml
05:28:24:WU02:FS02:0x21:Reading tar file integrator.xml
05:28:24:WU02:FS02:0x21:Reading tar file state.xml
05:28:25:WU02:FS02:0x21:Digital signatures verified
05:28:25:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
05:28:25:WU02:FS02:0x21:Version 0.0.17
05:28:30:WU00:FS02:Upload 14.59%
05:28:36:WU00:FS02:Upload 22.92%
05:28:42:WU00:FS02:Upload 32.30%
05:28:48:WU00:FS02:Upload 40.63%
05:28:53:WARNING:WU02:FS02:FahCore returned: FAILED_3 (255 = 0xff)
05:28:53:WU02:FS02:Starting
05:28:53:WU02:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/roger/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 02 -suffix 01 -version 704 -lifeline 7092 -checkpoint 15 -gpu 1 -gpu-vendor nvidia
05:28:53:WU02:FS02:Started FahCore on PID 3968
05:28:53:WU02:FS02:Core PID:11252
05:28:53:WU02:FS02:FahCore 0x21 started
05:28:54:WU02:FS02:0x21:*********************** Log Started 2016-01-06T05:28:54Z ***********************
05:28:54:WU02:FS02:0x21:Project: 11402 (Run 0, Clone 20, Gen 103)
05:28:54:WU02:FS02:0x21:Unit: 0x0000009e8ca304f255ed4e3f1c299dcf
05:28:54:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
05:28:54:WU02:FS02:0x21:Machine: 2
05:28:54:WU02:FS02:0x21:Reading tar file core.xml
05:28:54:WU02:FS02:0x21:Reading tar file system.xml
05:28:54:WU00:FS02:Upload 50.01%
05:28:54:WU02:FS02:0x21:Reading tar file integrator.xml
05:28:54:WU02:FS02:0x21:Reading tar file state.xml
05:28:55:WU02:FS02:0x21:Digital signatures verified
05:28:55:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
05:28:55:WU02:FS02:0x21:Version 0.0.17
05:29:00:WU00:FS02:Upload 58.35%
05:29:06:WU00:FS02:Upload 66.68%
05:29:07:WU02:FS02:0x21:Completed 0 out of 5000000 steps (0%)
05:29:07:WU02:FS02:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
05:29:12:WU00:FS02:Upload 76.06%
05:29:18:WU00:FS02:Upload 84.39%
05:29:24:WU00:FS02:Upload 93.77%
05:29:37:WU00:FS02:Upload complete
05:29:37:WU00:FS02:Server responded WORK_ACK (400)
05:29:37:WU00:FS02:Final credit estimate, 22017.00 points
05:29:37:WU00:FS02:Cleaning up
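
When the cards are hard to tell apart in Advanced Control, the log itself ties each folding slot to a GPU: the "Requesting new work unit for slot ..." line names the GPU index and model. The following Python sketch pulls that mapping out of log lines. It assumes the line format seen in this log (client 7.4.4); the function name `slot_to_gpu` and the regular expression are illustrative only, not part of the FAH client.

```python
import re

# Matches lines such as:
# 05:28:10:WU02:FS02:Requesting new work unit for slot 02: RUNNING gpu:1:GM204 [GeForce GTX 970] from ...
# The pattern is an assumption based on the log format shown in this thread.
SLOT_GPU = re.compile(
    r":FS(?P<slot>\d+):Requesting new work unit for slot \d+: "
    r"\w+ gpu:(?P<index>\d+):(?P<chip>\S+) \[(?P<name>[^\]]+)\]"
)


def slot_to_gpu(lines):
    """Map folding slot id -> (GPU index, GPU name) from FAHClient log lines."""
    mapping = {}
    for line in lines:
        m = SLOT_GPU.search(line)
        if m:
            mapping[m["slot"]] = (int(m["index"]), m["name"])
    return mapping


sample = [
    "05:28:10:WU02:FS02:Assigned to work server 140.163.4.242",
    "05:28:10:WU02:FS02:Requesting new work unit for slot 02: "
    "RUNNING gpu:1:GM204 [GeForce GTX 970] from 140.163.4.242",
]
print(slot_to_gpu(sample))
```

For the log above this ties slot 02 to GPU 1, the GTX 970, which matches the `-gpu 1` argument on the FahCore command line.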

Re: Core 21 failures on GTX970

Posted: Wed Jan 06, 2016 10:23 am
by foldy
This is the FahCore_21 warning on first start:
05:28:53:WARNING:WU02:FS02:FahCore returned: FAILED_3 (255 = 0xff)
After that FahCore_21 started again without a problem.
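
To spot this pattern in a longer log without reading it line by line, something like the Python sketch below could be used. It is only an illustration: the function name `failed_returns` and both regular expressions are assumptions based on the line formats in the log above. It lists every FahCore return other than FINISHED_UNIT and notes whether the same work unit and slot later logged a "Completed ... steps" line.

```python
import re

# Patterns assumed from the log lines quoted in this thread.
RETURNED = re.compile(
    r":(WU\d+):(FS\d+):FahCore returned: (\w+) \((\d+) = 0x[0-9a-fA-F]+\)"
)
PROGRESS = re.compile(r":(WU\d+):(FS\d+):0x\w+:Completed \d+ out of \d+ steps")


def failed_returns(lines):
    """Return (wu, slot, status, restarted_ok) for each non-FINISHED_UNIT return.

    restarted_ok is True when the same WU/slot logs a progress line later on,
    which only shows the core reached the simulation loop again.
    """
    results = []
    for i, line in enumerate(lines):
        m = RETURNED.search(line)
        if m is None or m[3] == "FINISHED_UNIT":
            continue
        wu, slot, status = m[1], m[2], m[3]
        restarted_ok = False
        for later in lines[i + 1:]:
            p = PROGRESS.search(later)
            if p and p[1] == wu and p[2] == slot:
                restarted_ok = True
                break
        results.append((wu, slot, status, restarted_ok))
    return results


sample = [
    "05:28:24:WU00:FS02:FahCore returned: FINISHED_UNIT (100 = 0x64)",
    "05:28:53:WARNING:WU02:FS02:FahCore returned: FAILED_3 (255 = 0xff)",
    "05:28:53:WU02:FS02:Starting",
    "05:29:07:WU02:FS02:0x21:Completed 0 out of 5000000 steps (0%)",
]
print(failed_returns(sample))
```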

Re: Core 21 failures on GTX970

Posted: Thu Jan 07, 2016 5:12 am
by bruce
Exactly.

I don't remember seeing that happen before.
05:28:13:WU02:FS02:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:11402 run:0 clone:20 gen:103 core:0x21 unit:0x0000009e8ca304f255ed4e3f1c299dcf
05:28:24:WU02:FS02:Starting
05:28:24:WU02:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/roger/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 02 -suffix 01 -version 704 -lifeline 7092 -checkpoint 15 -gpu 1 -gpu-vendor nvidia
05:28:24:WU02:FS02:Started FahCore on PID 1884
05:28:24:WU02:FS02:Core PID:11436
05:28:24:WU02:FS02:FahCore 0x21 started
05:28:24:WU02:FS02:0x21:*********************** Log Started 2016-01-06T05:28:24Z ***********************
05:28:24:WU02:FS02:0x21:Project: 11402 (Run 0, Clone 20, Gen 103)
05:28:24:WU02:FS02:0x21:Unit: 0x0000009e8ca304f255ed4e3f1c299dcf
05:28:24:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
05:28:24:WU02:FS02:0x21:Machine: 2
05:28:24:WU02:FS02:0x21:Reading tar file core.xml
05:28:24:WU02:FS02:0x21:Reading tar file system.xml
05:28:24:WU02:FS02:0x21:Reading tar file integrator.xml
05:28:24:WU02:FS02:0x21:Reading tar file state.xml
05:28:25:WU02:FS02:0x21:Digital signatures verified
05:28:25:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
05:28:25:WU02:FS02:0x21:Version 0.0.17
05:28:53:WARNING:WU02:FS02:FahCore returned: FAILED_3 (255 = 0xff)
05:28:53:WU02:FS02:Starting
05:28:53:WU02:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/roger/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 02 -suffix 01 -version 704 -lifeline 7092 -checkpoint 15 -gpu 1 -gpu-vendor nvidia
05:28:53:WU02:FS02:Started FahCore on PID 3968
05:28:53:WU02:FS02:Core PID:11252
05:28:53:WU02:FS02:FahCore 0x21 started
05:28:54:WU02:FS02:0x21:*********************** Log Started 2016-01-06T05:28:54Z ***********************
05:28:54:WU02:FS02:0x21:Project: 11402 (Run 0, Clone 20, Gen 103)
05:28:54:WU02:FS02:0x21:Unit: 0x0000009e8ca304f255ed4e3f1c299dcf
05:28:54:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
05:28:54:WU02:FS02:0x21:Machine: 2
05:28:54:WU02:FS02:0x21:Reading tar file core.xml
05:28:54:WU02:FS02:0x21:Reading tar file system.xml
05:28:54:WU02:FS02:0x21:Reading tar file integrator.xml
05:28:54:WU02:FS02:0x21:Reading tar file state.xml
05:28:55:WU02:FS02:0x21:Digital signatures verified
05:28:55:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
05:28:55:WU02:FS02:0x21:Version 0.0.17
05:29:07:WU02:FS02:0x21:Completed 0 out of 5000000 steps (0%)