Bad work unit??

Moderators: Site Moderators, FAHC Science Team

gunnarre
Posts: 567
Joined: Sun May 24, 2020 7:23 pm
Location: Norway

Re: Bad work unit??

Post by gunnarre »

In case you want to test "--disable-cuda", you can add it in the "extra-core-args" option on the GPU slot. Reducing the clock would still be the better fix, though, because CUDA folds so much faster than OpenCL right now.
Online: GTX 1660 Super, GTX 1080, GTX 1050 Ti 4G OC, RX580 + occasional CPU folding in the cold.
Offline: Radeon HD 7770, GTX 960, GTX 950
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Particle coordinate is nan???

Post by bruce »

Depending on how long it has been used, you may have accumulated dust in the heatsink, which may be causing overheating ... or the thermal paste between the CPU and the heatsink may have degraded ... or maybe the factory overclock was never able to handle the full load imposed by serious processing (perhaps just games were expected).

FAH has recently distributed CUDA code for NVidia GPUs, which may load your GPU more fully than it has ever been loaded before. If you manually increased the overclocking yourself, back off.

I'd start by reducing the overclocking to original factory settings and see if that eliminates the problem ... but check the heatsink for dust first.
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: Particle coordinate is nan???

Post by PantherX »

In addition to what bruce mentioned, please note that GPU benchmarking tools such as AIDA64 Extreme, hashing apps, etc. do not replicate the load of folding. Generally speaking, folding is more intensive than those applications. Thus, see if stock settings or even underclocking your GPU a bit helps out.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
stratocastor
Posts: 20
Joined: Tue Aug 02, 2016 10:37 pm

Re: Particle coordinate is nan???

Post by stratocastor »

Thank you all for the suggestions! I have tried downclocking with similar results, still odd errors. Last week, I actually tore it apart, cleaned it and applied new thermal paste as well. What other program could I try that would emulate the sort of CUDA load that folding does? Literally everything else that I can throw at it runs fine.
stratocastor
Posts: 20
Joined: Tue Aug 02, 2016 10:37 pm

Re: Bad work unit??

Post by stratocastor »

JohnChodera wrote:> When did the cuda update in folding drop again?? T

core22 0.0.13 with CUDA support rolled out on Mon 28 Sep for most folks (though BETA users had it more than a week earlier).

Is just this project giving you trouble, or all of them?

~ John Chodera // MSKCC

This seems to align to when I started experiencing the problems. Will search through logs to confirm.

When I went back to air cooling, I used new thermal pads, as the old ones were toast. I'm using Thermal Grizzly Minus Pad 8s. I have tried stressing the GPU with every program imaginable and can get the temps up to 75C with the most demanding benchmarks, mining apps, AIDA64, etc. I tried downclocking with no change in the error frequency. Perhaps my GPU just doesn't like being that warm; when I was on water, it was maxing out at 45C. I'm currently waiting on a few parts from EKWB to rebuild my loop. I have CUDA disabled for now and will run overnight to see how it goes. So far it seems to be progressing past the point where I would have experienced errors. Will post back tomorrow with updates. The points difference!!!! :(
stratocastor
Posts: 20
Joined: Tue Aug 02, 2016 10:37 pm

Re: Bad work unit??

Post by stratocastor »

Well... that was short lived....

23:14:20:WU00:FS01:0x22:*********************** Log Started 2020-10-03T23:14:19Z ***********************
23:14:20:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
23:14:20:WU00:FS01:0x22:       Core: Core22
23:14:20:WU00:FS01:0x22:       Type: 0x22
23:14:20:WU00:FS01:0x22:    Version: 0.0.13
23:14:20:WU00:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
23:14:20:WU00:FS01:0x22:  Copyright: 2020 foldingathome.org
23:14:20:WU00:FS01:0x22:   Homepage: https://foldingathome.org/
23:14:20:WU00:FS01:0x22:       Date: Sep 19 2020
23:14:20:WU00:FS01:0x22:       Time: 02:35:58
23:14:20:WU00:FS01:0x22:   Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
23:14:20:WU00:FS01:0x22:     Branch: core22-0.0.13
23:14:20:WU00:FS01:0x22:   Compiler: Visual C++ 2015
23:14:20:WU00:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
23:14:20:WU00:FS01:0x22:             -DOPENMM_GIT_HASH="\"189320d0\""
23:14:20:WU00:FS01:0x22:   Platform: win32 10
23:14:20:WU00:FS01:0x22:       Bits: 64
23:14:20:WU00:FS01:0x22:       Mode: Release
23:14:20:WU00:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
23:14:20:WU00:FS01:0x22:             <peastman@stanford.edu>
23:14:20:WU00:FS01:0x22:       Args: -dir 00 -suffix 01 -version 706 -lifeline 16336 -checkpoint 15
23:14:20:WU00:FS01:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
23:14:20:WU00:FS01:0x22:             0 -gpu 0 --disable-cuda
23:14:20:WU00:FS01:0x22:************************************ libFAH ************************************
23:14:20:WU00:FS01:0x22:       Date: Sep 7 2020
23:14:20:WU00:FS01:0x22:       Time: 19:09:56
23:14:20:WU00:FS01:0x22:   Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
23:14:20:WU00:FS01:0x22:     Branch: HEAD
23:14:20:WU00:FS01:0x22:   Compiler: Visual C++ 2015
23:14:20:WU00:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
23:14:20:WU00:FS01:0x22:   Platform: win32 10
23:14:20:WU00:FS01:0x22:       Bits: 64
23:14:20:WU00:FS01:0x22:       Mode: Release
23:14:20:WU00:FS01:0x22:************************************ CBang *************************************
23:14:20:WU00:FS01:0x22:       Date: Sep 7 2020
23:14:20:WU00:FS01:0x22:       Time: 19:08:30
23:14:20:WU00:FS01:0x22:   Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
23:14:20:WU00:FS01:0x22:     Branch: HEAD
23:14:20:WU00:FS01:0x22:   Compiler: Visual C++ 2015
23:14:20:WU00:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
23:14:20:WU00:FS01:0x22:   Platform: win32 10
23:14:20:WU00:FS01:0x22:       Bits: 64
23:14:20:WU00:FS01:0x22:       Mode: Release
23:14:20:WU00:FS01:0x22:************************************ System ************************************
23:14:20:WU00:FS01:0x22:        CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
23:14:20:WU00:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 94 Stepping 3
23:14:20:WU00:FS01:0x22:       CPUs: 8
23:14:20:WU00:FS01:0x22:     Memory: 15.94GiB
23:14:20:WU00:FS01:0x22:Free Memory: 11.92GiB
23:14:20:WU00:FS01:0x22:    Threads: WINDOWS_THREADS
23:14:20:WU00:FS01:0x22: OS Version: 6.2
23:14:20:WU00:FS01:0x22:Has Battery: false
23:14:20:WU00:FS01:0x22: On Battery: false
23:14:20:WU00:FS01:0x22: UTC Offset: -6
23:14:20:WU00:FS01:0x22:        PID: 15640
23:14:20:WU00:FS01:0x22:        CWD: C:\Users\Beast\AppData\Roaming\FAHClient\work
23:14:20:WU00:FS01:0x22:************************************ OpenMM ************************************
23:14:20:WU00:FS01:0x22:   Revision: 189320d0
23:14:20:WU00:FS01:0x22:********************************************************************************
23:14:20:WU00:FS01:0x22:Project: 11751 (Run 0, Clone 15165, Gen 12)
23:14:20:WU00:FS01:0x22:Unit: 0x0000001e8ca304e75e6d6f7f5be4be71
23:14:20:WU00:FS01:0x22:Digital signatures verified
23:14:20:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
23:14:20:WU00:FS01:0x22:Version 0.0.13
23:14:20:WU00:FS01:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
23:14:20:WU00:FS01:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
23:14:20:WU00:FS01:0x22:  XTC frame write interval: 50000 steps (5%) [20 total]
23:14:20:WU00:FS01:0x22:  Global context and integrator variables write interval: disabled
23:14:20:WU00:FS01:0x22:There are 4 platforms available.
23:14:20:WU00:FS01:0x22:Platform 0: Reference
23:14:20:WU00:FS01:0x22:Platform 1: CPU
23:14:20:WU00:FS01:0x22:Platform 2: OpenCL
23:14:20:WU00:FS01:0x22:  opencl-device 0 specified
23:14:20:WU00:FS01:0x22:Platform 3: CUDA
23:14:20:WU00:FS01:0x22:  cuda-device 0 specified
23:14:20:WU00:FS01:0x22:Disabling CUDA platform because 'disable-cuda' argument was specified.
23:14:32:WU00:FS01:0x22:Attempting to create OpenCL context:
23:14:32:WU00:FS01:0x22:  Configuring platform OpenCL
23:14:45:WU00:FS01:0x22:  Using OpenCL on platformId 0 and gpu 0
23:14:45:WU00:FS01:0x22:Completed 50000 out of 1000000 steps (5%)
23:15:53:WU00:FS01:0x22:Completed 60000 out of 1000000 steps (6%)
23:16:59:WU00:FS01:0x22:Completed 70000 out of 1000000 steps (7%)
23:18:05:WU00:FS01:0x22:Completed 80000 out of 1000000 steps (8%)
23:19:10:WU00:FS01:0x22:Completed 90000 out of 1000000 steps (9%)
23:20:15:WU00:FS01:0x22:Completed 100000 out of 1000000 steps (10%)
23:20:16:WU00:FS01:0x22:Checkpoint completed at step 100000
23:21:21:WU00:FS01:0x22:Completed 110000 out of 1000000 steps (11%)
23:22:26:WU00:FS01:0x22:Completed 120000 out of 1000000 steps (12%)
23:23:32:WU00:FS01:0x22:Completed 130000 out of 1000000 steps (13%)
23:24:44:WU00:FS01:0x22:An exception occurred at step 139555: Particle coordinate is nan
23:24:44:WU00:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
23:24:44:WU00:FS01:0x22:Folding@home Core Shutdown: CORE_RESTART
23:24:45:WARNING:WU00:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
23:24:45:WU00:FS01:Starting
23:24:45:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Beast\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.13/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 12136 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0 --disable-cuda
23:24:45:WU00:FS01:Started FahCore on PID 17684
23:24:45:WU00:FS01:Core PID:14788
23:24:45:WU00:FS01:FahCore 0x22 started
Mod Edit: Added Code Tags - PantherX
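
For anyone who wants to check their own log for the same pattern, here is a minimal Python sketch. It assumes the log lines look like the ones pasted above ("Checkpoint completed at step N" and "An exception occurred at step N: Particle coordinate is nan"); the function name find_nan_failures and the SAMPLE text are made up for illustration and are not part of the FAH client.

```python
import re

# Patterns copied from the wording of the log lines in the post above.
CHECKPOINT_RE = re.compile(r"Checkpoint completed at step (\d+)")
NAN_RE = re.compile(r"An exception occurred at step (\d+): Particle coordinate is nan")


def find_nan_failures(log_text):
    """Return (failure_step, last_checkpoint_step) pairs found in a log.

    last_checkpoint_step is None when no checkpoint line precedes the failure.
    """
    failures = []
    last_checkpoint = None
    for line in log_text.splitlines():
        match = CHECKPOINT_RE.search(line)
        if match:
            last_checkpoint = int(match.group(1))
            continue
        match = NAN_RE.search(line)
        if match:
            failures.append((int(match.group(1)), last_checkpoint))
    return failures


# Three lines taken from the log above, used as sample input.
SAMPLE = """\
23:20:16:WU00:FS01:0x22:Checkpoint completed at step 100000
23:23:32:WU00:FS01:0x22:Completed 130000 out of 1000000 steps (13%)
23:24:44:WU00:FS01:0x22:An exception occurred at step 139555: Particle coordinate is nan
"""

for step, checkpoint in find_nan_failures(SAMPLE):
    lost = step - checkpoint if checkpoint is not None else None
    print(f"NaN at step {step}; last checkpoint {checkpoint}; steps to redo: {lost}")
```

On the sample lines this reports the NaN at step 139555 against the checkpoint at step 100000, so 39555 steps have to be redone after the core restarts.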
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am

Re: Particle coordinate is nan???

Post by PantherX »

stratocastor wrote:... What other program could I try that would emulate the sort of cuda load that folding does? Literally every thing else runs fine that I can throw at it.
Unfortunately, FAHBench hasn't been updated to FahCore_22 specifications. There are plans to do it but there's no ETA.
foldy
Posts: 2061
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Particle coordinate is nan???

Post by foldy »

@stratocastor: Maybe your GPU works better with OpenCL instead of CUDA for FAH? You can disable CUDA in the extra core options, but it's slower.

NormalDiffusion
Posts: 124
Joined: Sat Apr 18, 2020 1:50 pm

Re: Bad work unit??

Post by NormalDiffusion »

So even underclocked she won't fold anymore? Or were you trying with CUDA disabled and factory clocks?
gunnarre
Posts: 567
Joined: Sun May 24, 2020 7:23 pm
Location: Norway

Re: Particle coordinate is nan???

Post by gunnarre »

I think you need two dashes, so "--disable-cuda".
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am

Re: Particle coordinate is nan???

Post by PantherX »

gunnarre wrote:I think you need two dashes, so "--disable-cuda".
FYI, a single dash or a double dash would work fine.
gunnarre
Posts: 567
Joined: Sun May 24, 2020 7:23 pm
Location: Norway

Re: Particle coordinate is nan???

Post by gunnarre »

Thanks. Good to know. By convention, many *nix tools use a single dash for single-letter options and a double dash for long options, but it's more user friendly to allow both.
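
As an illustration of "allow both", here is a small Python sketch of a parser front end that treats "-disable-cuda" and "--disable-cuda" as the same option. This is not how FahCore_22 actually parses its arguments (that code is not shown in this thread); normalize_flag is a hypothetical helper.

```python
def normalize_flag(arg):
    """Map '-name' and '--name' to the same canonical '--name' form.

    Values that do not start with a dash are returned unchanged.
    Caveat: a bare negative number such as '-6' would also be rewritten,
    so a real parser would need to special-case numeric values.
    """
    if not arg.startswith("-"):
        return arg
    return "--" + arg.lstrip("-")


# Argument spellings as they appear in the log earlier in the thread.
args = ["-dir", "00", "-gpu", "0", "--disable-cuda"]
print([normalize_flag(a) for a in args])
```

After normalization, the rest of the parser only has to recognize one spelling of each option.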
stratocastor
Posts: 20
Joined: Tue Aug 02, 2016 10:37 pm

Re: Bad work unit??

Post by stratocastor »

It took reducing the power limit to 80%, with the core clock currently running at 1329 MHz. It has been stable on CUDA folding for the last 22 hours. The factory overclock on this 980 Ti is 1404 MHz.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Bad work unit??

Post by bruce »

For some people, overclocking is a way to increase throughput. For others, running CUDA accomplishes the same thing. Apparently you can't have both. Officially, FAH does not support overclocking. If you choose to take responsibility for your own overclock settings and your own cooling methodology, you're welcome to disable CUDA or not ... or figure out what is optimum for your kit. OpenCL is still a choice you can make, but we can't really help you with those decisions.

NaNs are a common result of unstable calculations, and work units that hit such errors do not produce results that are useful to science. There are lots and lots of people with non-overclocked systems who are very happy with CUDA's increase in productivity. The FAHCores are reportedly a more strenuous benchmark than other programs, but please don't waste production assignments.
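
To show how an unstable calculation ends up as a NaN, here is a toy Python sketch. It is not FahCore or OpenMM code and says nothing about the specific cause on this GPU; it just integrates a unit-mass spring with explicit Euler and an assumed, deliberately oversized timestep, and checks the coordinate after every step in the same spirit as the core's "Particle coordinate is nan" check.

```python
import math


def euler_step(x, v, dt, k=1.0):
    """One explicit-Euler step for a unit-mass spring (acceleration = -k * x)."""
    a = -k * x
    return x + v * dt, v + a * dt


def first_non_finite_step(dt, max_steps=10_000):
    """Return the first step whose coordinate is inf or nan, or None if all stay finite."""
    x, v = 1.0, 0.0
    for n in range(1, max_steps + 1):
        x, v = euler_step(x, v, dt)
        if not math.isfinite(x):
            return n
    return None


print("small timestep:", first_non_finite_step(0.01))
print("huge timestep: ", first_non_finite_step(100.0))
```

With the small timestep the coordinate stays finite for the whole run. With the huge one the values grow every step until they overflow to infinity, and subtracting infinities then yields NaN. Once a value is NaN, everything computed from it is NaN too, which is why the core restarts from the last good checkpoint instead of continuing.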
Joe_H
Site Admin
Posts: 7870
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Bad work unit??

Post by Joe_H »

stratocastor wrote:For the last 22 hours, it took reducing the power limit to 80%, core clock running at 1329 currently. Seems to be stable on cuda folding for the time being. The factory overclock on this 980ti is 1404mhz.
You are still at a higher clock than the reference design for the 980 Ti. The reference design has a base clock of 1000 MHz and a boost clock of 1075 MHz.
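
The clock figures in this exchange can be put side by side with a few lines of Python. This is only arithmetic on the numbers quoted above, and it assumes the "1329" in the quoted post is in MHz like the other figures.

```python
def percent_diff(value, baseline):
    """Relative difference of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


CURRENT_MHZ = 1329          # clock reported as stable (unit assumed to be MHz)
FACTORY_OC_MHZ = 1404       # factory overclock quoted for this card
REFERENCE_BASE_MHZ = 1000   # reference 980 Ti base clock
REFERENCE_BOOST_MHZ = 1075  # reference 980 Ti boost clock

print(f"vs factory overclock: {percent_diff(CURRENT_MHZ, FACTORY_OC_MHZ):+.1f}%")
print(f"vs reference boost:   {percent_diff(CURRENT_MHZ, REFERENCE_BOOST_MHZ):+.1f}%")
print(f"vs reference base:    {percent_diff(CURRENT_MHZ, REFERENCE_BASE_MHZ):+.1f}%")
```

So the stable setting is about 5.3% below the factory overclock, but still about 23.6% above the reference boost clock and 32.9% above the reference base clock.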

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3