Deadline/ETA mismatch with GPU WUs (Solved)

Trekkie
Posts: 11
Joined: Thu Jul 31, 2014 4:41 pm

Post by Trekkie »

Hello,

I've only started folding recently, and my CPU seems to be working away merrily, but my GPU is currently getting Work Units that it will never complete within the deadline. I'm trying to determine if it's my card that's not working as hard as it could, or if it's the WUs themselves.

Details, yummy yummy details:
  • CPU: AMD Athlon X4 740 Quad-Core @ 3.2GHz
  • GPU: AMD Radeon R7 250X, manufactured by MSI, with 1GB of dedicated RAM
  • GPU Drivers: Catalyst 14.4
  • Operating System: Windows 7 Professional 64-bit, all updates applied
  • Overclocked?: No
  • Stable?: Yes
  • Software: Client is v7.4.4, don't know how to get the core(s) version
  • WU Details: For the CPU, it's currently on #8575 (1, 2, 452) with a deadline of 12.86 days and an ETA of 2.20 days. Therefore, working fine.
    For the GPU, it's on #10468 (0, 271, 22) with a deadline of 10.64 days and an ETA of 14.34 days. Therefore, b0rken.
    What's a Gromacs?
Do you need the log.txt in this case?
Joe_H
Site Admin
Posts: 7883
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Deadline/ETA mismatch with GPU WUs

Post by Joe_H »

Welcome to the forum.

First, the folding client can need several percent of progress on a GPU WU before it gives an accurate ETA. This especially applies if the WU is the first the client has processed from a particular project and no history is available in the client's WU database.

As for your other questions, the core versions are reported in the log file. Posting the beginning of the log, where the client and system information is shown, through the first few percent of processing of the WUs gives us additional information to see what is happening.

Gromacs is the name of the code used by PG (the Pande Group) to create some of the processing cores. Additional information can be found on the PG site via the Software link at the top of the forum. There are also FAQ pages that cover a number of folding-related subjects.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Trekkie
Posts: 11
Joined: Thu Jul 31, 2014 4:41 pm

Re: Deadline/ETA mismatch with GPU WUs

Post by Trekkie »

OK, here's the log file:

*********************** Log Started 2014-07-30T21:00:09Z ***********************
21:00:09:************************* Folding@home Client *************************
21:00:09:      Website: http://folding.stanford.edu/
21:00:09:    Copyright: (c) 2009-2014 Stanford University
21:00:09:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
21:00:09:         Args: 
21:00:09:       Config: C:/Users/<UserName>/AppData/Roaming/FAHClient/config.xml
21:00:09:******************************** Build ********************************
21:00:09:      Version: 7.4.4
21:00:09:         Date: Mar 4 2014
21:00:09:         Time: 20:26:54
21:00:09:      SVN Rev: 4130
21:00:09:       Branch: fah/trunk/client
21:00:09:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
21:00:09:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
21:00:09:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
21:00:09:     Platform: win32 XP
21:00:09:         Bits: 32
21:00:09:         Mode: Release
21:00:09:******************************* System ********************************
21:00:09:          CPU: AMD Athlon(tm) X4 740 Quad Core Processor
21:00:09:       CPU ID: AuthenticAMD Family 21 Model 16 Stepping 1
21:00:09:         CPUs: 4
21:00:09:       Memory: 7.94GiB
21:00:09:  Free Memory: 6.55GiB
21:00:09:      Threads: WINDOWS_THREADS
21:00:09:   OS Version: 6.1
21:00:09:  Has Battery: false
21:00:09:   On Battery: false
21:00:09:   UTC Offset: 1
21:00:09:          PID: 4264
21:00:09:          CWD: C:/Users/<UserName>/AppData/Roaming/FAHClient
21:00:09:           OS: Windows 7 Professional
21:00:09:      OS Arch: AMD64
21:00:09:         GPUs: 1
21:00:09:        GPU 0: ATI:5 R575A [Radeon R7 250X/HD 7700/8760]
21:00:09:         CUDA: Not detected
21:00:09:Win32 Service: false
21:00:09:***********************************************************************
21:00:09:<config>
21:00:09:  <!-- Network -->
21:00:09:  <proxy v=':8080'/>
21:00:09:
21:00:09:  <!-- Slot Control -->
21:00:09:  <pause-on-battery v='false'/>
21:00:09:  <power v='FULL'/>
21:00:09:
21:00:09:  <!-- User Information -->
21:00:09:  <passkey v='********************************'/>
21:00:09:  <team v='198'/>
21:00:09:  <user v=(UserName)/>
21:00:09:
21:00:09:  <!-- Folding Slots -->
21:00:09:  <slot id='0' type='CPU'/>
21:00:09:  <slot id='1' type='GPU'/>
21:00:09:</config>
21:00:09:Trying to access database...
21:00:09:Successfully acquired database lock
21:00:09:Enabled folding slot 00: READY cpu:3
21:00:09:Enabled folding slot 01: READY gpu:0:R575A [Radeon R7 250X/HD 7700/8760]
21:00:09:WARNING:WU01:FS01:Past final deadline 2014-07-30T12:28:48Z, dumping
21:00:10:WU01:FS01:Cleaning up
21:00:11:WU00:FS00:Starting
21:00:11:WU00:FS00:Running FahCore: "D:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/<UserName>/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 00 -suffix 01 -version 704 -lifeline 4264 -checkpoint 15 -np 3
21:00:13:WU00:FS00:Started FahCore on PID 4656
21:00:15:WU00:FS00:Core PID:4848
21:00:15:WU00:FS00:FahCore 0xa4 started
21:00:17:WU00:FS00:0xa4:
21:00:17:WU00:FS00:0xa4:*------------------------------*
21:00:17:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
21:00:17:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
21:00:17:WU00:FS00:0xa4:
21:00:17:WU00:FS00:0xa4:Preparing to commence simulation
21:00:17:WU00:FS00:0xa4:- Ensuring status. Please wait.
21:00:18:WU01:FS01:Connecting to 171.67.108.201:80
21:00:19:WU01:FS01:Assigned to work server 140.163.4.233
21:00:19:WU01:FS01:Requesting new work unit for slot 01: READY gpu:0:R575A [Radeon R7 250X/HD 7700/8760] from 140.163.4.233
21:00:19:WU01:FS01:Connecting to 140.163.4.233:8080
21:00:20:WU01:FS01:Downloading 4.31MiB
21:00:24:WU01:FS01:Download complete
21:00:24:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:10468 run:0 clone:271 gen:22 core:0x17 unit:0x00000022538b3db9538cb166ae48791d
21:00:25:WU01:FS01:Starting
21:00:25:WU01:FS01:Running FahCore: "D:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/<UserName>/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe -dir 01 -suffix 01 -version 704 -lifeline 4264 -checkpoint 15 -gpu 0 -gpu-vendor ati
21:00:25:WU01:FS01:Started FahCore on PID 4684
21:00:26:WU00:FS00:0xa4:- Looking at optimizations...
21:00:26:WU00:FS00:0xa4:- Working with standard loops on this execution.
21:00:27:WU00:FS00:0xa4:- Previous termination of core was improper.
21:00:27:WU00:FS00:0xa4:- Going to use standard loops.
21:00:27:WU00:FS00:0xa4:- Files status OK
21:00:27:WU00:FS00:0xa4:- Expanded 117557 -> 266072 (decompressed 226.3 percent)
21:00:27:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=117557 data_size=266072, decompressed_data_size=266072 diff=0
21:00:27:WU00:FS00:0xa4:- Digital signature verified
21:00:27:WU00:FS00:0xa4:
21:00:27:WU00:FS00:0xa4:Project: 6369 (Run 13, Clone 70, Gen 44)
21:00:27:WU00:FS00:0xa4:
21:00:27:WU00:FS00:0xa4:Entering M.D.
21:00:28:WU01:FS01:Core PID:4124
21:00:28:WU01:FS01:FahCore 0x17 started
21:00:31:WU01:FS01:0x17:*********************** Log Started 2014-07-30T21:00:30Z ***********************
21:00:31:WU01:FS01:0x17:Project: 10468 (Run 0, Clone 271, Gen 22)
21:00:31:WU01:FS01:0x17:Unit: 0x00000022538b3db9538cb166ae48791d
21:00:31:WU01:FS01:0x17:CPU: 0x00000000000000000000000000000000
21:00:31:WU01:FS01:0x17:Machine: 1
21:00:31:WU01:FS01:0x17:Reading tar file state.xml
21:00:31:WU01:FS01:0x17:Reading tar file system.xml
21:00:32:WU01:FS01:0x17:Reading tar file integrator.xml
21:00:32:WU01:FS01:0x17:Reading tar file core.xml
21:00:32:WU01:FS01:0x17:Digital signatures verified
21:00:32:WU01:FS01:0x17:Folding@home GPU core17
21:00:32:WU01:FS01:0x17:Version 0.0.52
21:00:33:WU00:FS00:0xa4:Using Gromacs checkpoints
21:00:33:WU00:FS00:0xa4:Mapping NT from 3 to 3 
21:00:33:WU00:FS00:0xa4:Resuming from checkpoint
21:00:33:WU00:FS00:0xa4:Verified 00/wudata_01.log
21:00:34:WU00:FS00:0xa4:Verified 00/wudata_01.trr
21:00:34:WU00:FS00:0xa4:Verified 00/wudata_01.xtc
21:00:34:WU00:FS00:0xa4:Verified 00/wudata_01.edr
21:00:34:WU00:FS00:0xa4:Completed 3464430 out of 5000000 steps  (69%)
21:01:50:20:127.0.0.1:New Web connection
21:05:15:WU01:FS01:0x17:Completed 0 out of 5000000 steps (0%)
21:05:15:WU01:FS01:0x17:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
21:06:28:WU00:FS00:0xa4:Completed 3500000 out of 5000000 steps  (70%)
21:13:11:WU00:FS00:0xa4:Completed 3550000 out of 5000000 steps  (71%)
21:19:25:WU00:FS00:0xa4:Completed 3600000 out of 5000000 steps  (72%)
21:23:54:WU01:FS01:0x17:Completed 50000 out of 5000000 steps (1%)
21:25:23:WU00:FS00:0xa4:Completed 3650000 out of 5000000 steps  (73%)
21:31:22:WU00:FS00:0xa4:Completed 3700000 out of 5000000 steps  (74%)
21:37:27:WU00:FS00:0xa4:Completed 3750000 out of 5000000 steps  (75%)
21:42:21:WU01:FS01:0x17:Completed 100000 out of 5000000 steps (2%)
21:43:31:WU00:FS00:0xa4:Completed 3800000 out of 5000000 steps  (76%)
21:49:39:WU00:FS00:0xa4:Completed 3850000 out of 5000000 steps  (77%)
21:56:15:WU00:FS00:0xa4:Completed 3900000 out of 5000000 steps  (78%)
21:58:21:52:127.0.0.1:New Web connection
22:02:52:WU00:FS00:0xa4:Completed 3950000 out of 5000000 steps  (79%)
22:09:37:WU00:FS00:0xa4:Completed 4000000 out of 5000000 steps  (80%)
22:16:24:WU00:FS00:0xa4:Completed 4050000 out of 5000000 steps  (81%)
22:23:26:WU00:FS00:0xa4:Completed 4100000 out of 5000000 steps  (82%)
22:30:05:WU00:FS00:0xa4:Completed 4150000 out of 5000000 steps  (83%)
22:36:50:WU00:FS00:0xa4:Completed 4200000 out of 5000000 steps  (84%)
22:43:42:WU00:FS00:0xa4:Completed 4250000 out of 5000000 steps  (85%)
22:50:23:WU00:FS00:0xa4:Completed 4300000 out of 5000000 steps  (86%)
22:56:53:WU00:FS00:0xa4:Completed 4350000 out of 5000000 steps  (87%)
23:03:45:WU00:FS00:0xa4:Completed 4400000 out of 5000000 steps  (88%)
23:10:23:WU00:FS00:0xa4:Completed 4450000 out of 5000000 steps  (89%)
23:17:08:WU00:FS00:0xa4:Completed 4500000 out of 5000000 steps  (90%)
23:24:05:WU00:FS00:0xa4:Completed 4550000 out of 5000000 steps  (91%)
23:30:55:WU00:FS00:0xa4:Completed 4600000 out of 5000000 steps  (92%)
23:37:11:WU00:FS00:0xa4:Completed 4650000 out of 5000000 steps  (93%)
23:43:16:WU00:FS00:0xa4:Completed 4700000 out of 5000000 steps  (94%)
23:49:18:WU00:FS00:0xa4:Completed 4750000 out of 5000000 steps  (95%)
23:55:17:WU00:FS00:0xa4:Completed 4800000 out of 5000000 steps  (96%)
23:59:23:78:127.0.0.1:New Web connection
00:01:24:WU00:FS00:0xa4:Completed 4850000 out of 5000000 steps  (97%)
00:07:37:WU00:FS00:0xa4:Completed 4900000 out of 5000000 steps  (98%)
00:13:29:WU00:FS00:0xa4:Completed 4950000 out of 5000000 steps  (99%)
00:13:29:WU02:FS00:Connecting to 171.67.108.200:8080
00:13:30:WU02:FS00:Assigned to work server 128.143.231.202
00:13:30:WU02:FS00:Requesting new work unit for slot 00: RUNNING cpu:3 from 128.143.231.202
00:13:30:WU02:FS00:Connecting to 128.143.231.202:8080
00:13:31:WU02:FS00:Downloading 3.67MiB
00:13:37:WU02:FS00:Download 95.33%
00:13:37:WU02:FS00:Download complete
00:13:37:WU02:FS00:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:8575 run:1 clone:2 gen:452 core:0xa3 unit:0x0000543d0a3b1e59522885a89a386a4a
******************************* Date: 2014-07-31 *******************************
15:14:11:WARNING:WU00:FS00:Detected clock skew (14 hours 57 mins), adjusting time estimates
15:14:11:WARNING:WU01:FS01:Detected clock skew (14 hours 57 mins), adjusting time estimates
15:17:49:WU00:FS00:0xa4:Completed 5000000 out of 5000000 steps  (100%)
15:17:49:WU00:FS00:0xa4:DynamicWrapper: Finished Work Unit: sleep=10000
15:17:59:WU00:FS00:0xa4:
15:17:59:WU00:FS00:0xa4:Finished Work Unit:
15:17:59:WU00:FS00:0xa4:- Reading up to 1251624 from "00/wudata_01.trr": Read 1251624
15:17:59:WU00:FS00:0xa4:trr file hash check passed.
15:17:59:WU00:FS00:0xa4:- Reading up to 106384 from "00/wudata_01.xtc": Read 106384
15:17:59:WU00:FS00:0xa4:xtc file hash check passed.
15:17:59:WU00:FS00:0xa4:edr file hash check passed.
15:17:59:WU00:FS00:0xa4:logfile size: 94818
15:17:59:WU00:FS00:0xa4:Leaving Run
15:18:03:WU00:FS00:0xa4:- Writing 1524726 bytes of core data to disk...
15:18:04:WU00:FS00:0xa4:Done: 1524214 -> 1284414 (compressed to 84.2 percent)
15:18:04:WU00:FS00:0xa4:  ... Done.
15:18:04:WU00:FS00:0xa4:- Shutting down core
15:18:04:WU00:FS00:0xa4:
15:18:04:WU00:FS00:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
15:18:04:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
15:18:05:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:6369 run:13 clone:70 gen:44 core:0xa4 unit:0x000000330002894b53276707d0ea21d1
15:18:05:WU00:FS00:Uploading 1.23MiB to 155.247.166.219
15:18:05:WU00:FS00:Connecting to 155.247.166.219:8080
15:18:05:WU02:FS00:Starting
15:18:05:WU02:FS00:Running FahCore: "D:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/<UserName>/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/Core_a3.fah/FahCore_a3.exe -dir 02 -suffix 01 -version 704 -lifeline 4264 -checkpoint 15 -np 3
15:18:05:WU02:FS00:Started FahCore on PID 7780
15:18:06:WU02:FS00:Core PID:8432
15:18:06:WU02:FS00:FahCore 0xa3 started
15:18:06:WU02:FS00:0xa3:
15:18:06:WU02:FS00:0xa3:*------------------------------*
15:18:06:WU02:FS00:0xa3:Folding@Home Gromacs SMP Core
15:18:06:WU02:FS00:0xa3:Version 2.27 (Dec. 15, 2010)
15:18:06:WU02:FS00:0xa3:
15:18:06:WU02:FS00:0xa3:Preparing to commence simulation
15:18:06:WU02:FS00:0xa3:- Looking at optimizations...
15:18:06:WU02:FS00:0xa3:- Created dyn
15:18:06:WU02:FS00:0xa3:- Files status OK
15:18:06:WU02:FS00:0xa3:- Expanded 3849197 -> 4383200 (decompressed 113.8 percent)
15:18:06:WU02:FS00:0xa3:Called DecompressByteArray: compressed_data_size=3849197 data_size=4383200, decompressed_data_size=4383200 diff=0
15:18:07:WU02:FS00:0xa3:- Digital signature verified
15:18:07:WU02:FS00:0xa3:
15:18:07:WU02:FS00:0xa3:Project: 8575 (Run 1, Clone 2, Gen 452)
15:18:07:WU02:FS00:0xa3:
15:18:07:WU02:FS00:0xa3:Assembly optimizations on if available.
15:18:07:WU02:FS00:0xa3:Entering M.D.
15:18:11:WU00:FS00:Upload 45.90%
15:18:13:WU02:FS00:0xa3:Mapping NT from 3 to 3 
15:18:14:WU02:FS00:0xa3:Completed 0 out of 500000 steps  (0%)
15:18:17:WU00:FS00:Upload 96.91%
15:18:17:WU00:FS00:Upload complete
15:18:18:WU00:FS00:Server responded WORK_ACK (400)
15:18:18:WU00:FS00:Final credit estimate, 458.00 points
15:18:18:WU00:FS00:Cleaning up
15:51:04:WU02:FS00:0xa3:Completed 5000 out of 500000 steps  (1%)
16:24:01:WU02:FS00:0xa3:Completed 10000 out of 500000 steps  (2%)
16:39:04:107:127.0.0.1:New Web connection
16:57:04:WU02:FS00:0xa3:Completed 15000 out of 500000 steps  (3%)
17:30:44:WU02:FS00:0xa3:Completed 20000 out of 500000 steps  (4%)
18:03:54:WU02:FS00:0xa3:Completed 25000 out of 500000 steps  (5%)
Napoleon
Posts: 887
Joined: Wed May 26, 2010 2:31 pm
Hardware configuration: Atom330 (overclocked):
Windows 7 Ultimate 64bit
Intel Atom330 dualcore (4 HyperThreads)
NVidia GT430, core_15 work
2x2GB Kingston KVR1333D3N9K2/4G 1333MHz memory kit
Asus AT3IONT-I Deluxe motherboard
Location: Finland

Re: Deadline/ETA mismatch with GPU WUs

Post by Napoleon »

Looking at the R7 250X's time per frame (TPF), it shouldn't have any problems returning WUs before the timeout, even if you only folded for a few hours per day. However, there seems to be something weird going on with your system clock:

******************************* Date: 2014-07-31 *******************************
15:14:11:WARNING:WU00:FS00:Detected clock skew (14 hours 57 mins), adjusting time estimates
15:14:11:WARNING:WU01:FS01:Detected clock skew (14 hours 57 mins), adjusting time estimates
I don't know what is causing the skew, but changes in system time can make the client think that a WU has expired. In that case the client will dump the WU and you won't get any points for it.
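
For reference, the GPU progress lines in the posted log give a rough time per frame. Here is a minimal Python sketch (not part of the client; the function names are made up for illustration) that averages the time between the 0% and 2% lines and converts it to calendar days. It assumes the pace stays constant and that the timestamps do not cross midnight:

```python
import re
from datetime import datetime, timedelta

# Progress lines for the GPU slot (WU01), copied from the posted log.
LOG = """\
21:05:15:WU01:FS01:0x17:Completed 0 out of 5000000 steps (0%)
21:23:54:WU01:FS01:0x17:Completed 50000 out of 5000000 steps (1%)
21:42:21:WU01:FS01:0x17:Completed 100000 out of 5000000 steps (2%)
"""

PROGRESS = re.compile(r"^(\d\d:\d\d:\d\d):WU01:FS01:0x17:Completed .*\((\d+)%\)$")


def frame_times(log_text):
    """Return (percent, time-of-day) pairs, one per progress line."""
    points = []
    for line in log_text.splitlines():
        match = PROGRESS.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            points.append((int(match.group(2)), stamp))
    return points


def average_tpf(points):
    """Average time per frame (1% of the WU) between first and last point."""
    (first_pct, first_time), (last_pct, last_time) = points[0], points[-1]
    return (last_time - first_time) / (last_pct - first_pct)


def days_to_finish(tpf, hours_per_day=24.0, frames=100):
    """Calendar days for `frames` frames when folding `hours_per_day` hours a day."""
    folding_hours = tpf.total_seconds() * frames / 3600
    return folding_hours / hours_per_day


tpf = average_tpf(frame_times(LOG))
print("TPF:", tpf)
print("24 h/day:", round(days_to_finish(tpf), 2), "days")
print("12 h/day:", round(days_to_finish(tpf, 12), 2), "days")
```

That works out to a TPF of about 18.5 minutes, so roughly 1.3 days of continuous folding or about 2.6 days at 12 hours per day. Both are well inside the 10.64-day deadline quoted in the first post.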
bollix47
Posts: 2950
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Deadline/ETA mismatch with GPU WUs

Post by bollix47 »

Also, it appears your GPU slot progressed to 2% and then stopped. Are your power settings set to High performance? If not, they should be, with sleep and hibernation both disabled. Are you allowing Windows to shut down your display after a certain amount of time? This might also affect your GPU performance, so it's advised to set that to 'never' and just turn the monitor off using its power switch.
Napoleon
Posts: 887
Joined: Wed May 26, 2010 2:31 pm
Location: Finland

Re: Deadline/ETA mismatch with GPU WUs

Post by Napoleon »

bollix47 wrote:If not, they should be with sleeping and hibernating all disabled.
Good catch. Obviously sleep and hibernate will cause clock skew warnings in the client. More importantly, sleep/hibernate only preserves the memory and CPU state. Not so with GPUs: sleep/hibernate will more or less reset the GPU state mid-WU, causing the GPU WU to fail. IIRC, FAHClient will prevent automatic sleep/hibernate by default during folding, but engaging sleep/hibernate manually will still work.
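
The skew can also be cross-checked against the timestamps in the posted log: the last line before the date change is stamped 00:13:37 and the warnings 15:14:11, about 15 hours later, which is consistent with the machine having been asleep or hibernating in between. Below is a small Python sketch for pulling these warnings out of a log. The regular expression is an assumption based only on the two warning lines shown in this thread, so other wordings (for example a skew of less than an hour) may not match:

```python
import re
from datetime import timedelta

# Lines copied from the posted log around the date change.
LOG = """\
00:13:37:WU02:FS00:Download complete
******************************* Date: 2014-07-31 *******************************
15:14:11:WARNING:WU00:FS00:Detected clock skew (14 hours 57 mins), adjusting time estimates
15:14:11:WARNING:WU01:FS01:Detected clock skew (14 hours 57 mins), adjusting time estimates
15:17:49:WU00:FS00:0xa4:Completed 5000000 out of 5000000 steps  (100%)
"""

# Pattern inferred from the warnings above; not an official log format.
SKEW = re.compile(
    r"^(\d\d:\d\d:\d\d):WARNING:WU\d+:(FS\d+):"
    r"Detected clock skew \((\d+) hours (\d+) mins\)"
)


def clock_skews(log_text):
    """Return (timestamp, slot, skew) for every clock-skew warning found."""
    found = []
    for line in log_text.splitlines():
        match = SKEW.match(line)
        if match:
            when, slot, hours, mins = match.groups()
            skew = timedelta(hours=int(hours), minutes=int(mins))
            found.append((when, slot, skew))
    return found


skews = clock_skews(LOG)
for when, slot, skew in skews:
    print(when, slot, skew)
```

A warning on every slot at the same moment, as here, points at the whole machine being suspended rather than at a problem with one WU.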
Trekkie
Posts: 11
Joined: Thu Jul 31, 2014 4:41 pm

Re: Deadline/ETA mismatch with GPU WUs

Post by Trekkie »

Napoleon wrote:However, there seems to be something weird going on with your system clock:
Napoleon wrote:Obviously sleep and hibernate will cause clock skew warnings in the client. More importantly, sleep/hibernate only preserves the memory and CPU state. Not so with GPU(s), sleep/hibernate will more or less reset the GPU state mid-WU, causing the GPU WU to fail.
Ah, so F@H does not like hibernation. Good to know. Would pausing F@H before hibernating be a workaround?
bollix47 wrote:Are your power settings set to High performance?
Yep.
bollix47 wrote:Are you allowing Windows to shut down your Display after a certain amount of time? This might also affect your GPU performance...
Um...what? :e?: I've allowed Windows to turn off the displays on multiple computers of a variety of specs - laptops, desktops, integrated graphics, dedicated cards, SLI and Crossfire setups, with Windows XP, V...Vista :shudder: and 7 - and have never had any performance issues (not even with the Vista abomination). So, I'm gonna need a source for that. IMO, a graphics card that can't handle the monitor turning off :roll: should be RMA'd.
Napoleon wrote:Looking at the R250X Time Per Frame, it shouldn't have any problems returning WUs before Timeout even if you only folded for a few hours per day.
So what might be causing the deadline/ETA mismatch? Oh, and I do only fold part-time. 12-16 hours a day, but still not 24/7.
bollix47
Posts: 2950
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Deadline/ETA mismatch with GPU WUs

Post by bollix47 »

I have personally seen my TPFs double on a laptop while folding using an AMD mobile GPU when I let Windows shut down the display. Tested on numerous work units. I had to switch to using a blank screensaver to overcome the problem and the TPFs did return to normal. Different setups may not experience the same problem.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460s for approx. 70K PPD
Location: Salem. OR USA

Re: Deadline/ETA mismatch with GPU WUs

Post by P5-133XL »

The GPU state is not stored when a computer hibernates. When the computer comes back online, the folding application starts running again where it left off but the data in the GPU is gone and that invariably destroys the WU. Pausing the WU before hibernation is a reasonable work-around.

Another similar event is the video-driver reset. Folding is running, but the reset messes with the GPU state, resulting in a damaged WU. Often, when there is a discrepancy between the folding log and the client status (the log stops recording events but the client continues reporting progress), it can be attributed to a video-driver reset. A pause followed by an unpause, or restarting the folding client, will often restart the WU at the last working checkpoint, where the log stopped recording. This failure mode is commonly associated with overclocking (even factory OCs) or gaming while folding. Video-driver resets are normally recorded in the Windows system event logs, so that is a good place to start looking if you see a discrepancy between the log and the client in the percent done.
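
The event-log check could be scripted. The sketch below is in Python and shells out to Windows' built-in wevtutil tool. It assumes that driver recoveries appear in the System log under the provider name "Display" with Event ID 4101, which is what Windows 7 normally uses for the "Display driver stopped responding and has successfully recovered" message. Check a known reset on your own machine before relying on it; the function names are made up for illustration:

```python
import subprocess
import sys

# Assumption: display-driver recoveries are logged to the System log by
# provider "Display" with Event ID 4101. Verify this on your own machine.
XPATH = "*[System[Provider[@Name='Display'] and (EventID=4101)]]"


def build_query(max_events=5):
    """Build a wevtutil command listing the newest matching events as text."""
    return [
        "wevtutil", "qe", "System",
        "/q:" + XPATH,            # XPath filter on provider and event ID
        "/f:text",                # human-readable output
        "/rd:true",               # newest events first
        "/c:" + str(max_events),  # limit the number of events returned
    ]


def recent_driver_resets(max_events=5):
    """Run the query and return wevtutil's text output (Windows only)."""
    result = subprocess.run(build_query(max_events), capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    if sys.platform == "win32":
        print(recent_driver_resets() or "No matching events found.")
    else:
        print("Windows only:", " ".join(build_query()))
```

If the timestamps of any events it lists line up with the point where the folding log stopped recording progress, a driver reset is the likely culprit.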
Trekkie
Posts: 11
Joined: Thu Jul 31, 2014 4:41 pm

Re: Deadline/ETA mismatch with GPU WUs

Post by Trekkie »

bollix47 wrote:I have personally seen my TPFs double on a laptop while folding using an AMD mobile GPU when I let Windows shut down the display. Tested on numerous work units. I had to switch to using a blank screensaver to overcome the problem and the TPFs did return to 'normal'. Different setups may not experience the same problem.
All right, I'll give it a try, see how that works. Why do you have normal in air quotes?
Trekkie
Posts: 11
Joined: Thu Jul 31, 2014 4:41 pm

Re: Deadline/ETA mismatch with GPU WUs

Post by Trekkie »

P5-133XL wrote:Pausing the WU before hibernation is a reasonable work-around.
Great, good to know!
P5-133XL wrote:Another similar event is the video-driver reset. Folding is running, but the reset messes with the GPU state resulting in a damaged WU. ... Pause followed by an unpause or restarting the folding client will often restart the WU at the last working checkpoint where the log stopped recording. This failure mode is commonly associated with overclocking (even factory OC's) or gaming while folding.
Oh dear. :shock: So, if my monitor suddenly turns off and on again, and I see a message saying "Display driver stopped working and has successfully recovered" (this has happened to me a couple of times before, with AMD's previous driver), I should pause and un-pause F@H? Could the driver reset silently? Can I change the checkpointing frequency? Might that introduce its own problems? So many questions!
Rel25917
Posts: 303
Joined: Wed Aug 15, 2012 2:31 am

Re: Deadline/ETA mismatch with GPU WUs

Post by Rel25917 »

Trekkie wrote: So, if my monitor suddenly turns off and on again, and I see a message saying "Display driver stopped working and has successfully recovered" (this has happened to me a couple of times before, with AMD's previous driver), I should pause and un-pause F@H?
I would use Advanced Control (I think that's what it's called; I don't use the most recent version) and just pause/unpause the GPU slot.
Trekkie wrote: Could the driver reset silently?
It could do it when you're not looking and you wouldn't notice it had happened.
Trekkie wrote: Can I change the checkpointing frequency?
The checkpoint frequency setting only works on some cores. The GPU cores do not follow it; they use an interval set by whoever made the project you are working on, usually around 2-5%.

If the card is factory overclocked you may need to back it off. I'm not sure about AMD cards, but core 17 hates higher memory clocks on NVIDIA cards; you can raise the core clock after dropping the memory clock down.
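
To put the 2-5% checkpoint interval in perspective, here is a quick back-of-the-envelope calculation in Python. The time per frame is an assumption taken from the GPU progress lines in the log posted earlier in the thread (roughly 18 min 33 s per 1%); the function name is made up for illustration:

```python
def max_lost_minutes(tpf_minutes, checkpoint_percent):
    """Worst-case folding time lost when restarting from the last checkpoint."""
    return tpf_minutes * checkpoint_percent


# Assumed: about 18 min 33 s per 1%, read off the posted GPU log lines.
TPF_MINUTES = 18.55

for pct in (2, 5):
    lost = max_lost_minutes(TPF_MINUTES, pct)
    print(f"checkpoint every {pct}%: up to {lost:.0f} min lost per restart")
```

On this card a pause, hibernate or driver reset would cost somewhere between half an hour and an hour and a half in the worst case, which is small next to a deadline measured in days.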
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Location: Salem. OR USA

Re: Deadline/ETA mismatch with GPU WUs

Post by P5-133XL »

Yes, drivers can reset silently, but the reset will still be recorded in the system event log. Also, not all resets damage the WU.

Yes, pause and unpause is again a reasonable choice if the driver did reset; it will restart folding from the last checkpoint. The better long-term solution is to eliminate the cause of the resets. For example, lowering OCs, and pausing GPU folding slots before gaming and unpausing them when you are done, are both good avoidance strategies.

Yes, you can change the checkpoint frequency in the client to between 3 and 30 minutes, with 15 minutes as the default. In practice it is not so useful, because most of the current cores set the checkpoint frequency themselves, bypassing the client setting. Also, extreme frequencies (3 minutes) seem to be associated with failed or corrupted checkpoints. The default 15 minutes seems to work reasonably well.
davidcoton
Posts: 1102
Joined: Wed Nov 05, 2008 3:19 pm
Location: Cambridge, UK

Re: Deadline/ETA mismatch with GPU WUs

Post by davidcoton »

1.
Trekkie wrote: Oh dear. :shock: So, if my monitor suddenly turns off and on again, and I see a message saying "Display driver stopped working and has successfully recovered" (this has happened to me a couple of times before, with AMD's previous driver), I should pause and un-pause F@H?
Yes
2.
Trekkie wrote: Could the driver reset silently?
Yes
3.
Trekkie wrote:Can I change the checkpointing frequency?
Yes
4.
Trekkie wrote:Might that introduce its own problems?
Yes
5.
Trekkie wrote: So many questions!
Yes


To be a bit more helpful (at least I hope this will be!):
1. The pause/unpause will increase the chance of recovering a WU after a video reset. Since the chance is otherwise nil, that's not saying much. Cores may respond differently (statistically); it's always worth trying, but it may not always recover the WU.
2. There will usually be a Windows Log entry; the on-screen message may depend on the drivers and Windows version. So you may not know unless you look. If the progress on the "Status" tab and the progress shown in the "Log" tab disagree widely, a video reset (after a hibernate/sleep or after a fault detected by the OS) is the likely cause.
3. The trade-off is slightly reduced performance to allow writing the checkpoints. How often do you need to recover from a checkpoint (including every pause and every hibernate)? If you are not folding 24/7, and you occasionally pause GPU folding to improve video performance, increasing checkpoint frequency will be worthwhile. But see below.
4. Not all cores take notice of the set checkpoint frequency, since checkpoints cannot be written at any arbitrary time but only when all threads are in sync.
EDIT: syncing threads may not be the reason, but still some cores (including Core17) have predetermined checkpoint intervals.
5. That is entirely normal for people who take folding seriously. Keep asking! There will be others who want to know too, but haven't asked yet.
Last edited by davidcoton on Fri Aug 01, 2014 1:51 pm, edited 1 time in total.
Trekkie
Posts: 11
Joined: Thu Jul 31, 2014 4:41 pm

Re: Deadline/ETA mismatch with GPU WUs

Post by Trekkie »

Rel25917 wrote:
Trekkie wrote: Could the driver reset silently?
It could do it when you're not looking and you wouldn't notice it had happened.
So, could I get Windows to put up a notification that doesn't go away until I acknowledge it, if the driver resets? I do NOT want to have to check the Event Viewer every time I leave and return to the computer, just in case the driver reset while I wasn't looking.

Unfortunately, I work off GMT, and it's time I went comatose for a few hours, hallucinated vividly, and then maybe suffer amnesia about the whole experience. If you don't get that reference, your life is incomplete. :ewink: So, don't expect another reply from me for a while.