13434/13435 Stalled Work Units

Posted: Mon Jan 11, 2021 8:48 pm
by Dotious
I have received three WUs from projects 13434 and 13435 over the last couple of days. All of them eventually completed, but each one threw a "WU_STALLED" error 2-3 times per simulation. I haven't seen this on any other work units over the last few months. These WUs were all assigned to my RTX 2070 Super.

13434 (248, 1, 28)
13434 (104, 3, 45)
13435 (348, 1, 21)

Code:

18:48:28:WU00:FS01:0x22:Completed 3650000 out of 5000000 steps (73%)
19:00:47:WU00:FS01:0x22:Watchdog triggered, requesting soft shutdown down
19:10:47:WU00:FS01:0x22:Watchdog shutdown failed, hard shutdown triggered
19:10:47:WARNING:WU00:FS01:FahCore returned an unknown error code which probably indicates that it crashed
19:10:47:WARNING:WU00:FS01:FahCore returned: WU_STALLED (127 = 0x7f)
19:10:47:WU00:FS01:Starting
19:10:47:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\user\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.13/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 3568 -checkpoint 30 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
19:10:47:WU00:FS01:Started FahCore on PID 11696
19:10:47:WU00:FS01:Core PID:3988
19:10:47:WU00:FS01:FahCore 0x22 started
19:10:48:WU00:FS01:0x22:*********************** Log Started 2021-01-11T19:10:47Z ***********************
19:10:48:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
19:10:48:WU00:FS01:0x22:       Core: Core22
19:10:48:WU00:FS01:0x22:       Type: 0x22
19:10:48:WU00:FS01:0x22:    Version: 0.0.13
19:10:48:WU00:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
19:10:48:WU00:FS01:0x22:  Copyright: 2020 foldingathome.org
19:10:48:WU00:FS01:0x22:   Homepage: https://foldingathome.org/
19:10:48:WU00:FS01:0x22:       Date: Sep 19 2020
19:10:48:WU00:FS01:0x22:       Time: 02:35:58
19:10:48:WU00:FS01:0x22:   Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
19:10:48:WU00:FS01:0x22:     Branch: core22-0.0.13
19:10:48:WU00:FS01:0x22:   Compiler: Visual C++ 2015
19:10:48:WU00:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
19:10:48:WU00:FS01:0x22:             -DOPENMM_GIT_HASH="\"189320d0\""
19:10:48:WU00:FS01:0x22:   Platform: win32 10
19:10:48:WU00:FS01:0x22:       Bits: 64
19:10:48:WU00:FS01:0x22:       Mode: Release
19:10:48:WU00:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
19:10:48:WU00:FS01:0x22:             <peastman@stanford.edu>
19:10:48:WU00:FS01:0x22:       Args: -dir 00 -suffix 01 -version 706 -lifeline 11696 -checkpoint 30
19:10:48:WU00:FS01:0x22:             -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor
19:10:48:WU00:FS01:0x22:             nvidia -gpu 0 -gpu-usage 100
19:10:48:WU00:FS01:0x22:************************************ libFAH ************************************
19:10:48:WU00:FS01:0x22:       Date: Sep 7 2020
19:10:48:WU00:FS01:0x22:       Time: 19:09:56
19:10:48:WU00:FS01:0x22:   Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
19:10:48:WU00:FS01:0x22:     Branch: HEAD
19:10:48:WU00:FS01:0x22:   Compiler: Visual C++ 2015
19:10:48:WU00:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
19:10:48:WU00:FS01:0x22:   Platform: win32 10
19:10:48:WU00:FS01:0x22:       Bits: 64
19:10:48:WU00:FS01:0x22:       Mode: Release
19:10:48:WU00:FS01:0x22:************************************ CBang *************************************
19:10:48:WU00:FS01:0x22:       Date: Sep 7 2020
19:10:48:WU00:FS01:0x22:       Time: 19:08:30
19:10:48:WU00:FS01:0x22:   Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
19:10:48:WU00:FS01:0x22:     Branch: HEAD
19:10:48:WU00:FS01:0x22:   Compiler: Visual C++ 2015
19:10:48:WU00:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
19:10:48:WU00:FS01:0x22:   Platform: win32 10
19:10:48:WU00:FS01:0x22:       Bits: 64
19:10:48:WU00:FS01:0x22:       Mode: Release
19:10:48:WU00:FS01:0x22:************************************ System ************************************
19:10:48:WU00:FS01:0x22:        CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
19:10:48:WU00:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
19:10:48:WU00:FS01:0x22:       CPUs: 12
19:10:48:WU00:FS01:0x22:     Memory: 15.94GiB
19:10:48:WU00:FS01:0x22:Free Memory: 9.41GiB
19:10:48:WU00:FS01:0x22:    Threads: WINDOWS_THREADS
19:10:48:WU00:FS01:0x22: OS Version: 6.2
19:10:48:WU00:FS01:0x22:Has Battery: false
19:10:48:WU00:FS01:0x22: On Battery: false
19:10:48:WU00:FS01:0x22: UTC Offset: -8
19:10:48:WU00:FS01:0x22:        PID: 3988
19:10:48:WU00:FS01:0x22:        CWD: C:\Users\user\AppData\Roaming\FAHClient\work
19:10:48:WU00:FS01:0x22:************************************ OpenMM ************************************
19:10:48:WU00:FS01:0x22:   Revision: 189320d0
19:10:48:WU00:FS01:0x22:********************************************************************************
19:10:48:WU00:FS01:0x22:Project: 13434 (Run 104, Clone 3, Gen 45)
19:10:48:WU00:FS01:0x22:Unit: 0x00000000000000000000000000000000
19:10:48:WU00:FS01:0x22:Digital signatures verified
19:10:48:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
19:10:48:WU00:FS01:0x22:Version 0.0.13
19:10:48:WU00:FS01:0x22:  Checkpoint write interval: 250000 steps (5%) [20 total]
19:10:48:WU00:FS01:0x22:  JSON viewer frame write interval: 50000 steps (1%) [100 total]
19:10:48:WU00:FS01:0x22:  XTC frame write interval: 250000 steps (5%) [20 total]
19:10:48:WU00:FS01:0x22:  Global context and integrator variables write interval: disabled
19:10:48:WU00:FS01:0x22:There are 4 platforms available.
19:10:48:WU00:FS01:0x22:Platform 0: Reference
19:10:48:WU00:FS01:0x22:Platform 1: CPU
19:10:48:WU00:FS01:0x22:Platform 2: OpenCL
19:10:48:WU00:FS01:0x22:  opencl-device 0 specified
19:10:48:WU00:FS01:0x22:Platform 3: CUDA
19:10:48:WU00:FS01:0x22:  cuda-device 0 specified
19:10:54:WU00:FS01:0x22:Attempting to create CUDA context:
19:10:54:WU00:FS01:0x22:  Configuring platform CUDA
19:10:56:WU00:FS01:0x22:  Using CUDA and gpu 0
19:10:56:WU00:FS01:0x22:Completed 3500000 out of 5000000 steps (70%)
19:13:35:WU00:FS01:0x22:Completed 3550000 out of 5000000 steps (71%)
19:16:15:WU00:FS01:0x22:Completed 3600000 out of 5000000 steps (72%)
19:18:54:WU00:FS01:0x22:Completed 3650000 out of 5000000 steps (73%)
19:21:34:WU00:FS01:0x22:Completed 3700000 out of 5000000 steps (74%)
19:24:13:WU00:FS01:0x22:Completed 3750000 out of 5000000 steps (75%)
19:24:14:WU00:FS01:0x22:Checkpoint completed at step 3750000

Code:

18:38:46:******************************* System ********************************
18:38:46:            CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
18:38:46:         CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
18:38:46:           CPUs: 12
18:38:46:         Memory: 15.94GiB
18:38:46:    Free Memory: 9.28GiB
18:38:46:        Threads: WINDOWS_THREADS
18:38:46:     OS Version: 6.2
18:38:46:    Has Battery: false
18:38:46:     On Battery: false
18:38:46:     UTC Offset: -8
18:38:46:            PID: 3568
18:38:46:            CWD: C:\Users\user\AppData\Roaming\FAHClient
18:38:46:  Win32 Service: false
18:38:46:             OS: Windows 10 Enterprise
18:38:46:        OS Arch: AMD64
18:38:46:           GPUs: 2
18:38:46:          GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:8 TU104 [GeForce RTX 2070 SUPER]
18:38:46:                 8218
18:38:46:          GPU 1: Bus:2 Slot:0 Func:0 NVIDIA:5 GM204 [GeForce GTX 970] 3494
18:38:46:  CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:7.5 Driver:11.2
18:38:46:  CUDA Device 1: Platform:0 Device:1 Bus:2 Slot:0 Compute:5.2 Driver:11.2
18:38:46:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:461.9
18:38:46:OpenCL Device 1: Platform:0 Device:1 Bus:2 Slot:0 Compute:1.2 Driver:461.9
18:38:46:***********************************************************************

Re: 13434/13435 Stalled Work Units

Posted: Tue Jan 12, 2021 7:10 am
by PantherX
Welcome to the F@H Forum Dotious,

Do you know what you (or your system) were doing just before you encountered the "WU_STALLED (127 = 0x7f)" error? Is this something that you can reproduce on demand? If yes, what exact steps do you take?

Re: 13434/13435 Stalled Work Units

Posted: Tue Jan 12, 2021 10:17 pm
by Dotious
Thanks for the welcome, PantherX,

These have all occurred while I was away from the computer, so I'm not sure whether something happened in the background. I haven't tried to reproduce it on demand. I run Folding@home at a throttled power setting (70%) using the EVGA X1 app, which keeps the clock near stock speeds. I've folded for a few months with this power setting and hadn't encountered these stalls until the last few days. These are larger WUs; do you think the throttling might affect them?

I haven't received any more WUs from these projects since I posted. All the WUs I have received in the last day have completed without issues. I'll keep an eye on the logs and see if it pops up again.
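For anyone who wants to keep an eye on the logs without reading them by hand, here is a minimal Python sketch that counts stall events per work-unit slot. It assumes the line format shown in the log excerpt in the first post; the function name `count_stalls` and the `SAMPLE` text are made up for illustration, and this is not an official F@H tool.

```python
import re
from collections import Counter

# Matches lines like:
# 19:10:47:WARNING:WU00:FS01:FahCore returned: WU_STALLED (127 = 0x7f)
STALL_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):WARNING:(WU\d+):(FS\d+):FahCore returned: WU_STALLED"
)

def count_stalls(log_text):
    """Return (Counter keyed by (WU, FS), list of stall timestamps)."""
    counts = Counter()
    times = []
    for line in log_text.splitlines():
        m = STALL_RE.match(line.strip())
        if m:
            timestamp, wu, slot = m.groups()
            counts[(wu, slot)] += 1
            times.append(timestamp)
    return counts, times

# A few lines copied from the excerpt in the first post (hypothetical test input).
SAMPLE = """\
18:48:28:WU00:FS01:0x22:Completed 3650000 out of 5000000 steps (73%)
19:00:47:WU00:FS01:0x22:Watchdog triggered, requesting soft shutdown down
19:10:47:WARNING:WU00:FS01:FahCore returned: WU_STALLED (127 = 0x7f)
19:10:47:WU00:FS01:Starting
"""

if __name__ == "__main__":
    counts, times = count_stalls(SAMPLE)
    for (wu, slot), n in counts.items():
        print(f"{wu}/{slot}: {n} stall(s) at {', '.join(times)}")
```

To run it against a real log, read the client's log file into a string and pass it to `count_stalls`; the location of that file depends on the installation.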

Re: 13434/13435 Stalled Work Units

Posted: Tue Jan 12, 2021 10:45 pm
by PantherX
Reducing the power target for the GPU shouldn't have any impact on WUs as long as the target stays within specifications. I think the GPU could become unstable if the target is set too low, but within the normal range it should be fine.

Re: 13434/13435 Stalled Work Units

Posted: Wed Jan 13, 2021 10:38 pm
by Dotious
Looks like I may have been a bit hasty in singling out these projects. Two work units from other projects gave the "WU_STALLED" error last night; I forgot to note which projects. Now I'm thinking it may be a driver issue with the latest release from Nvidia (461.09). I rolled back to 460.89 last night and haven't had any stalls since then. I'll keep an eye on my logs for the next few days.

The good news is that all WUs ended up finishing and returning results, even if they took an extra 30-60 minutes each.
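To put a rough number on the cost of a single stall, the excerpt in the first post can be used: the last progress line before the stall is 73% at 18:48:28, the core then restarts from the 70% checkpoint, and 73% is reported again at 19:18:54. A small Python sketch of that arithmetic follows; the timestamps are copied from that log and the helper name `elapsed_minutes` is made up.

```python
from datetime import datetime

def elapsed_minutes(start, end):
    """Minutes between two HH:MM:SS log timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60

# Timestamps taken from the log excerpt in the first post.
last_progress_before_stall = "18:48:28"  # 73% reported
watchdog_triggered = "19:00:47"          # soft shutdown requested
hard_shutdown = "19:10:47"               # soft shutdown failed, core killed
back_to_same_progress = "19:18:54"       # 73% reported again after restart

lost = elapsed_minutes(last_progress_before_stall, back_to_same_progress)
waiting = elapsed_minutes(watchdog_triggered, hard_shutdown)
print(f"Time lost to this stall: {lost:.1f} min")
print(f"Of which watchdog wait:  {waiting:.1f} min")
```

For this one stall that works out to about 30.4 minutes, 10 of which were spent waiting between the soft and hard shutdown. With 2-3 stalls per WU the total would vary depending on how far each stall was from the last checkpoint.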

Re: 13434/13435 Stalled Work Units

Posted: Thu Jan 14, 2021 12:16 pm
by PantherX
Hmm, I really want to update the Nvidia drivers to get the security fix, but they haven't updated the Studio Driver (460.89) yet, so I can't tell whether that's a driver issue or not :(

I haven't seen any issues reported with the Game Ready Driver (461.09 WHQL), so at this point I would say a driver issue is unlikely. However, you may want to reinstall the driver and select the clean-install option. Alternatively, you can use NVCleanstall (https://www.techpowerup.com/download/te ... leanstall/), which I haven't used myself but have heard good things about.