Re: faulty WU 13415, 2495,0,1

Posted: Sat Jun 27, 2020 12:57 pm
by mwroggenbuck
Hello all,

In the last 24 hours, I have had 8 failures. They were all 13415. I am also running on an RX 570.

I have been talking about this error in viewtopic.php?f=81&t=35482

From now on, I will only post to this thread.

I thought I had figured something out with CPU affinity, but apparently that was not enough.

From now on, I will let the system run as it normally would. If a job fails twice, I will save a copy and send it to John.

On another note, I want to thank everyone (and I mean everyone) associated with this project. I am enough of a geek to truly appreciate the amount of CPU power we are throwing at this virus. I also am enough of a physical chemist to appreciate the sheer magnitude of the variations of structures we are trying to examine. This is a very, very impressive effort. Everyone deserves a pat on the back. :D

Re: faulty WU 13415, 2495,0,1

Posted: Sat Jun 27, 2020 1:22 pm
by BobWilliams757
I haven't had many of these yet (only 6), but all have run without error and well above the average PPD for my system. Even on this meager Vega 11 onboard chip, GPU utilization only reaches the 80-85% range, with low memory use. CPU use is higher than for many WUs, with average utilization around 10%. There has been a little variance in WU speed with each RCG, but it is within the normal range given other use of the computer.

If there is anything I can do to help figure out the issues that other people are having, please let me know. I have to think that hardware architecture is in this picture somewhere, as there seems to be a trend that certain WUs just don't play well with certain hardware.

Re: faulty WU 13415, 2495,0,1

Posted: Sat Jun 27, 2020 1:43 pm
by cayenne187
Some complete OK. A week ago some were crashing in the middle and restarting; none since. As far as I know, only the RX570 does it. I am continuing to run 13415 on the GTX970 to see if it fails. Of note, all these WUs have 2500 points, and on my RX570 they run slowly, using only 20% power and delivering about 2500-5000 points. The current WU on the GTX970 is using 60% power and 14000 points. It's as if 13415 can't see how fast the RX570 is and make use of it.

Re: faulty WU 13415, 2495,0,1

Posted: Thu Jul 16, 2020 12:38 pm
by Jan
After a few hops I found my way to this thread; I think my issue fits:

I ran two WUs yesterday. The CPU WU worked fine. The GPU WU reached 100% with no indication of a problem, then crashed. Windows popped up "FAH Core stopped working", and the FAH log says "FAILED_3 (255 = 0xff)". However, the Windows event log indicates the same issue other posters had; see below. I am also running an RX570 (8GB, no OC), project 13416, core version 0.0.11. I trimmed redundant lines from the log:

Code:
*********************** Log Started 2020-07-14T10:29:00Z ***********************
10:29:00:Trying to access database...
10:29:00:Successfully acquired database lock
10:29:01:Downloading GPUs.txt from assign1.foldingathome.org:80
10:29:01:Connecting to assign1.foldingathome.org:80
10:29:01:Read GPUs.txt
10:29:01:Enabled folding slot 00: PAUSED cpu:7 (by user)
10:29:01:Enabled folding slot 01: PAUSED gpu:0:Ellesmere XT [Radeon RX 470/480/570/580/590] (by user)
10:29:02:****************************** FAHClient ******************************
10:29:02:        Version: 7.6.13
10:29:02:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
10:29:02:      Copyright: 2020 foldingathome.org
10:29:02:       Homepage: https://foldingathome.org/
10:29:02:           Date: Apr 27 2020
10:29:02:           Time: 21:21:01
10:29:02:       Revision: 5a652817f46116b6e135503af97f18e094414e3b
10:29:02:         Branch: master
10:29:02:       Compiler: Visual C++ 2008
10:29:02:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
10:29:02:       Platform: win32 10
10:29:02:           Bits: 32
10:29:02:           Mode: Release
10:29:02:         Config: D:\Folding\FAHClient\config.xml
10:29:02:******************************** CBang ********************************
10:29:02:           Date: Apr 24 2020
10:29:02:           Time: 17:07:55
10:29:02:       Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
10:29:02:         Branch: master
10:29:02:       Compiler: Visual C++ 2008
10:29:02:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
10:29:02:       Platform: win32 10
10:29:02:           Bits: 32
10:29:02:           Mode: Release
10:29:02:******************************* System ********************************
10:29:02:            CPU: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
10:29:02:         CPU ID: GenuineIntel Family 6 Model 58 Stepping 9
10:29:02:           CPUs: 8
10:29:02:         Memory: 15.96GiB
10:29:02:    Free Memory: 10.38GiB
10:29:02:        Threads: WINDOWS_THREADS
10:29:02:     OS Version: 6.2
10:29:02:    Has Battery: false
10:29:02:     On Battery: false
10:29:02:     UTC Offset: 2
10:29:02:            PID: 596
10:29:02:            CWD: D:\Folding\FAHClient
10:29:02:  Win32 Service: false
10:29:02:             OS: Windows 10 Enterprise
10:29:02:        OS Arch: AMD64
10:29:02:           GPUs: 1
10:29:02:          GPU 0: Bus:1 Slot:0 Func:0 AMD:5 Ellesmere XT [Radeon RX
10:29:02:                 470/480/570/580/590]
10:29:02:           CUDA: Not detected: Failed to open dynamic library 'nvcuda.dll': The
10:29:02:                 specified module could not be found.
10:29:02:
10:29:02:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:3075.12
10:29:02:******************************* libFAH ********************************
10:29:02:           Date: Apr 15 2020
10:29:02:           Time: 14:53:14
10:29:02:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
10:29:02:         Branch: master
10:29:02:       Compiler: Visual C++ 2008
10:29:02:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
10:29:02:       Platform: win32 10
10:29:02:           Bits: 32
10:29:02:           Mode: Release
10:29:02:***********************************************************************
10:29:02:<config>
10:29:02:  <!-- Folding Slot Configuration -->
10:29:02:  <client-type v='advanced'/>
10:29:02:
10:29:02:  <!-- Network -->
10:29:02:  <proxy v=':8080'/>
10:29:02:
10:29:02:  <!-- Slot Control -->
10:29:02:  <power v='full'/>
10:29:02:
10:29:02:  <!-- User Information -->
10:29:02:  <passkey v='*****'/>
10:29:02:  <team v='244976'/>
10:29:02:  <user v='SomeGuyinIlmenau'/>
10:29:02:
10:29:02:  <!-- Folding Slots -->
10:29:02:  <slot id='0' type='CPU'>
10:29:02:    <paused v='true'/>
10:29:02:  </slot>
10:29:02:  <slot id='1' type='GPU'>
10:29:02:    <paused v='true'/>
10:29:02:  </slot>
10:29:02:</config>
10:29:11:17:127.0.0.1:New Web session
10:43:02:FS00:Unpaused
10:43:02:FS01:Unpaused
[...]
10:43:16:WU00:FS00:FahCore a7: Download complete
10:43:16:WU00:FS00:Valid core signature
10:43:17:WU00:FS00:Unpacked 19.85MiB to cores/cores.foldingathome.org/win/64bit-avx-256/a7-0.0.19/Core_a7.fah/FahCore_a7.exe
10:43:17:WU00:FS00:Unpacked 2.64MiB to cores/cores.foldingathome.org/win/64bit-avx-256/a7-0.0.19/Core_a7.fah/libfftw3f-3.dll
[...]
10:43:18:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13416 run:470 clone:91 gen:2 core:0x22 unit:0x0000000712bc7d9a5f02afafa76f9215
10:43:18:WU01:FS01:Downloading core from http://cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah
10:43:18:WU01:FS01:Connecting to cores.foldingathome.org:80
10:43:18:WU01:FS01:FahCore 22: Downloading 4.39MiB
10:43:19:WU00:FS00:0xa7:Completed 1 out of 250000 steps (0%)
10:43:22:WU01:FS01:FahCore 22: Download complete
10:43:23:WU01:FS01:Valid core signature
10:43:23:WU01:FS01:Unpacked 14.46MiB to cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe
10:43:23:WU01:FS01:Starting
10:43:23:WU01:FS01:Running FahCore: D:\Folding\FAHClient/FAHCoreWrapper.exe D:\Folding\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 706 -lifeline 596 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
10:43:23:WU01:FS01:Started FahCore on PID 9728
10:43:23:WU01:FS01:Core PID:9864
10:43:23:WU01:FS01:FahCore 0x22 started
10:43:23:WU01:FS01:0x22:*********************** Log Started 2020-07-14T10:43:23Z ***********************
10:43:23:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
10:43:23:WU01:FS01:0x22:       Core: Core22
10:43:23:WU01:FS01:0x22:       Type: 0x22
10:43:23:WU01:FS01:0x22:    Version: 0.0.11
10:43:23:WU01:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
10:43:23:WU01:FS01:0x22:  Copyright: 2020 foldingathome.org
10:43:23:WU01:FS01:0x22:   Homepage: https://foldingathome.org/
10:43:23:WU01:FS01:0x22:       Date: Jun 26 2020
10:43:23:WU01:FS01:0x22:       Time: 19:49:16
10:43:23:WU01:FS01:0x22:   Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
10:43:23:WU01:FS01:0x22:     Branch: core22-0.0.11
10:43:23:WU01:FS01:0x22:   Compiler: Visual C++ 2015
10:43:23:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
10:43:23:WU01:FS01:0x22:   Platform: win32 10
10:43:23:WU01:FS01:0x22:       Bits: 64
10:43:23:WU01:FS01:0x22:       Mode: Release
10:43:23:WU01:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
10:43:23:WU01:FS01:0x22:             <peastman@stanford.edu>
10:43:23:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 706 -lifeline 9728 -checkpoint 15
10:43:23:WU01:FS01:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
10:43:23:WU01:FS01:0x22:************************************ libFAH ************************************
10:43:23:WU01:FS01:0x22:       Date: Jun 26 2020
10:43:23:WU01:FS01:0x22:       Time: 19:47:12
10:43:23:WU01:FS01:0x22:   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
10:43:23:WU01:FS01:0x22:     Branch: HEAD
10:43:23:WU01:FS01:0x22:   Compiler: Visual C++ 2015
10:43:23:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
10:43:23:WU01:FS01:0x22:   Platform: win32 10
10:43:23:WU01:FS01:0x22:       Bits: 64
10:43:23:WU01:FS01:0x22:       Mode: Release
10:43:23:WU01:FS01:0x22:************************************ CBang *************************************
10:43:23:WU01:FS01:0x22:       Date: Jun 26 2020
10:43:23:WU01:FS01:0x22:       Time: 19:46:11
10:43:23:WU01:FS01:0x22:   Revision: f8529962055b0e7bde23e429f5072ff758089dee
10:43:23:WU01:FS01:0x22:     Branch: master
10:43:23:WU01:FS01:0x22:   Compiler: Visual C++ 2015
10:43:23:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
10:43:23:WU01:FS01:0x22:   Platform: win32 10
10:43:23:WU01:FS01:0x22:       Bits: 64
10:43:23:WU01:FS01:0x22:       Mode: Release
10:43:23:WU01:FS01:0x22:************************************ System ************************************
10:43:23:WU01:FS01:0x22:        CPU: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
10:43:23:WU01:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 58 Stepping 9
10:43:23:WU01:FS01:0x22:       CPUs: 8
10:43:23:WU01:FS01:0x22:     Memory: 15.96GiB
10:43:23:WU01:FS01:0x22:Free Memory: 10.00GiB
10:43:23:WU01:FS01:0x22:    Threads: WINDOWS_THREADS
10:43:23:WU01:FS01:0x22: OS Version: 6.2
10:43:23:WU01:FS01:0x22:Has Battery: false
10:43:23:WU01:FS01:0x22: On Battery: false
10:43:23:WU01:FS01:0x22: UTC Offset: 2
10:43:23:WU01:FS01:0x22:        PID: 9864
10:43:23:WU01:FS01:0x22:        CWD: D:\Folding\FAHClient\work
10:43:23:WU01:FS01:0x22:********************************************************************************
10:43:23:WU01:FS01:0x22:Project: 13416 (Run 470, Clone 91, Gen 2)
10:43:23:WU01:FS01:0x22:Unit: 0x0000000712bc7d9a5f02afafa76f9215
10:43:23:WU01:FS01:0x22:Reading tar file core.xml
10:43:23:WU01:FS01:0x22:Reading tar file integrator.xml
10:43:23:WU01:FS01:0x22:Reading tar file state.xml.bz2
10:43:23:WU01:FS01:0x22:Reading tar file system.xml.bz2
10:43:23:WU01:FS01:0x22:Digital signatures verified
10:43:23:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
10:43:23:WU01:FS01:0x22:Version 0.0.11
10:43:23:WU01:FS01:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
10:43:23:WU01:FS01:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
10:43:23:WU01:FS01:0x22:  XTC frame write interval: 250000 steps (25%) [4 total]
10:43:23:WU01:FS01:0x22:  Global context and integrator variables write interval: 2500 steps (0.25%) [400 total]
10:43:43:FS00:Finishing
10:43:43:FS01:Finishing
10:43:57:WU01:FS01:0x22:Completed 0 out of 1000000 steps (0%)
10:45:51:WU00:FS00:0xa7:Completed 2500 out of 250000 steps (1%)
10:48:07:WU01:FS01:0x22:Completed 10000 out of 1000000 steps (1%)
[everything ran fine, CPU WU finished and uploaded correctly]
17:38:26:WU01:FS01:0x22:Completed 990000 out of 1000000 steps (99%)
17:42:50:WU01:FS01:0x22:Completed 1000000 out of 1000000 steps (100%)
17:42:50:WU01:FS01:0x22:Average performance: 65.5539 ns/day
17:42:55:WU01:FS01:0x22:Saving result file ..\logfile_01.txt
17:42:55:WU01:FS01:0x22:Saving result file checkpointState.xml.bz2
17:42:55:WU01:FS01:0x22:Saving result file globals.csv
17:42:55:WU01:FS01:0x22:Saving result file positions.xtc
17:42:55:WU01:FS01:0x22:Saving result file science.log
17:42:55:WU01:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
17:43:50:WARNING:WU01:FS01:FahCore returned: FAILED_3 (255 = 0xff)
17:43:50:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13416 run:470 clone:91 gen:2 core:0x22 unit:0x0000000712bc7d9a5f02afafa76f9215
17:43:50:WU01:FS01:Uploading 5.82MiB to 18.188.125.154
17:43:50:WU01:FS01:Connecting to 18.188.125.154:8080
17:43:56:WU01:FS01:Upload 35.43%
17:44:02:WU01:FS01:Upload 77.30%
17:44:05:WU01:FS01:Upload complete
17:44:05:WU01:FS01:Server responded WORK_ACK (400)
17:44:05:WU01:FS01:Cleaning up
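
The sequence in this log (the core reports FINISHED_UNIT, then the client logs a failing return code for the same WU and slot) can be spotted automatically. The following is only a minimal sketch: the function name `find_late_failures` and both regular expressions are illustrative assumptions derived from the line formats quoted above, not part of FAHClient.

```python
import re
from datetime import datetime, timedelta

# Line shapes copied from the log excerpt; the patterns themselves are assumptions.
FINISHED = re.compile(
    r"^(\d\d:\d\d:\d\d):(WU\d+:FS\d+):0x\w+:Folding@home Core Shutdown: FINISHED_UNIT"
)
RETURNED = re.compile(
    r"^(\d\d:\d\d:\d\d):WARNING:(WU\d+:FS\d+):FahCore returned: (\w+) \((\d+) = 0x[0-9a-f]+\)"
)


def _clock(text):
    """Parse the HH:MM:SS prefix of a log line (the log carries no date)."""
    return datetime.strptime(text, "%H:%M:%S")


def find_late_failures(lines):
    """Return (unit_and_slot, return_code_name, seconds) for every unit that
    logged FINISHED_UNIT and was afterwards flagged with a warning return code."""
    finished_at = {}
    hits = []
    for line in lines:
        m = FINISHED.match(line)
        if m:
            finished_at[m.group(2)] = _clock(m.group(1))
            continue
        m = RETURNED.match(line)
        if m and m.group(2) in finished_at:
            delta = _clock(m.group(1)) - finished_at.pop(m.group(2))
            if delta < timedelta(0):
                delta += timedelta(days=1)  # the warning was logged after midnight
            hits.append((m.group(2), m.group(3), int(delta.total_seconds())))
    return hits


sample = [
    "17:42:55:WU01:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT",
    "17:43:50:WARNING:WU01:FS01:FahCore returned: FAILED_3 (255 = 0xff)",
]
print(find_late_failures(sample))
```

Run against a full log file, this would list each unit that finished its steps but was still reported as faulty, together with the delay between the two log lines.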


Code:
FahCore_22.exe
   0.0.0.0
   5ef65146
   ntdll.dll
   10.0.19041.207
   cad89ab4
   c0000374
   00000000000fdec9
   2688
   01d659cb98f1d0c0
   D:\Folding\FAHClient\cores\cores.foldingathome.org\win\64bit\22-0.0.11\Core_22.fah\FahCore_22.exe
   C:\WINDOWS\SYSTEM32\ntdll.dll


To be honest, I'm a bit lost. The hardware is fairly new, and some threads I found here suggest the error might not be on my end. Any new ideas on this?

*edit* Just came in: Same WU, smells like the same error. Thread.

Re: faulty WU 13415, 2495,0,1

Posted: Sun Jul 19, 2020 10:27 am
by bruce
My theory is that the crash happens while the data is being compressed for upload, which takes a lot of virtual memory. Can you expand the paging file?

Re: faulty WU 13415, 2495,0,1

Posted: Mon Jul 20, 2020 9:25 am
by Jan
I have now set the paging file size to be managed by the OS. Will report back.

Re: faulty WU 13415, 2495,0,1

Posted: Mon Jul 20, 2020 4:49 pm
by Jan
Alright, I just had the EXACT SAME error with a 13418 WU. Neither physical nor virtual memory was anywhere close to fully used; I watched the last few percent as they ran. I don't think this is something on my end.

Code:
15:39:24:WU01:FS01:0x22:Completed 1000000 out of 1000000 steps (100%)
15:39:24:WU01:FS01:0x22:Average performance: 67.4473 ns/day
15:39:29:WU01:FS01:0x22:Saving result file ..\logfile_01.txt
15:39:29:WU01:FS01:0x22:Saving result file checkpointState.xml.bz2
15:39:29:WU01:FS01:0x22:Saving result file globals.csv
15:39:29:WU01:FS01:0x22:Saving result file positions.xtc
15:39:29:WU01:FS01:0x22:Saving result file science.log
15:39:29:WU01:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
15:48:11:WARNING:WU01:FS01:FahCore returned: FAILED_3 (255 = 0xff)
15:48:11:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13418 run:568 clone:87 gen:2 core:0x22 unit:0x0000000212bc7d9a5f12829162cf6612


Code:
FahCore_22.exe
   0.0.0.0
   5ef65146
   ntdll.dll
   10.0.19041.207
   cad89ab4
   c0000374
   00000000000fdec9
   16b8
   01d65e7177d8f8f3
   D:\Folding\FAHClient\cores\cores.foldingathome.org\win\64bit\22-0.0.11\Core_22.fah\FahCore_22.exe
   C:\WINDOWS\SYSTEM32\ntdll.dll
   8c977839-f889-4452-9fa2-499ab6d8e971

Re: faulty WU 13415, 2495,0,1

Posted: Tue Jul 21, 2020 3:19 pm
by bruce
Thanks for the report.

Re: faulty WU 13415, 2495,0,1

Posted: Sat Jul 25, 2020 3:37 pm
by cayenne187
No 13415 WUs lately. 13416 does this about once a week. The current WU (13416, 1207, 252, 1) is running for the second time after crashing during upload, on the RX570.

Re: faulty WU 13415, 2495,0,1

Posted: Sun Jul 26, 2020 2:17 am
by JohnChodera
Hi folks! We've shut down these older projects; 13418 is the latest active project in the 134xx series.

~ John Chodera // MSKCC