faulty WU 13415, 2495,0,1

Re: faulty WU 13415, 2495,0,1

Postby mwroggenbuck » Sat Jun 27, 2020 12:57 pm

Hello all,

In the last 24 hours, I have had 8 failures. They were all 13415. I am also running on an RX 570.

I have been talking about this error in viewtopic.php?f=81&t=35482

From now on, I will only post to this thread.

I thought I had figured something out with cpu affinity, but apparently that was not enough.

From now on, I will let the system run as it normally would. If a job fails twice, I will save it and send it to John.

On another note, I want to thank everyone (and I mean everyone) associated with this project. I am enough of a geek to truly appreciate the amount of CPU power we are throwing at this virus. I also am enough of a physical chemist to appreciate the sheer magnitude of the variations of structures we are trying to examine. This is a very, very impressive effort. Everyone deserves a pat on the back. :D
mwroggenbuck
 
Posts: 108
Joined: Tue Mar 24, 2020 1:47 pm

Re: faulty WU 13415, 2495,0,1

Postby BobWilliams757 » Sat Jun 27, 2020 1:22 pm

I haven't had many of these yet (only 6), but all have run without error and well above the average PPD for my system. Even on this meager Vega 11 onboard chip, I only get GPU utilization in the 80-85% range with low memory use. CPU use is higher than for many WUs, with average utilization around 10%. There has been a little variance in WU speed with each RCG, but it is within the normal range given other computer use.

If there is anything I can do to help figure out the issues that other people are having, please let me know. I have to think that hardware architecture is in this picture somewhere, as it seems to be a trend that certain WUs just don't play well with certain hardware.
BobWilliams757
 
Posts: 118
Joined: Fri Apr 03, 2020 3:22 pm

Re: faulty WU 13415, 2495,0,1

Postby cayenne187 » Sat Jun 27, 2020 1:43 pm

Some complete OK. A week ago some were crashing in the middle and restarting; none since. As far as I know, only the RX570 does it. I am continuing to run 13415 on the GTX970 to see if it fails. Of note: all these WUs have 2500 points, and on my RX570 they run slowly, using only 20% power and delivering something like 2500-5000 points. The current WU on the GTX970 is using 60% power and 14000 points. It's as if 13415 can't see how fast the RX570 is and make use of it.
cayenne187
 
Posts: 39
Joined: Thu May 14, 2020 8:56 pm

Re: faulty WU 13415, 2495,0,1

Postby Jan » Thu Jul 16, 2020 12:38 pm

After a few hops I found my way to this thread. I think my issue fits:

Ran two WUs yesterday. The CPU WU worked fine; the GPU WU reached 100% with no indication of a problem, then crashed. Windows popped up "FAH Core stopped working", and the FAH log says "FAILED_3 (255 = 0xff)". However, the Windows event log indicates the same issue other posters had, see below. I am also running an RX570 (8GB, no OC), project 13416, core version 0.0.11. I trimmed redundant lines from the log:

Code:
*********************** Log Started 2020-07-14T10:29:00Z ***********************
10:29:00:Trying to access database...
10:29:00:Successfully acquired database lock
10:29:01:Downloading GPUs.txt from assign1.foldingathome.org:80
10:29:01:Connecting to assign1.foldingathome.org:80
10:29:01:Read GPUs.txt
10:29:01:Enabled folding slot 00: PAUSED cpu:7 (by user)
10:29:01:Enabled folding slot 01: PAUSED gpu:0:Ellesmere XT [Radeon RX 470/480/570/580/590] (by user)
10:29:02:****************************** FAHClient ******************************
10:29:02:        Version: 7.6.13
10:29:02:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
10:29:02:      Copyright: 2020 foldingathome.org
10:29:02:       Homepage: https://foldingathome.org/
10:29:02:           Date: Apr 27 2020
10:29:02:           Time: 21:21:01
10:29:02:       Revision: 5a652817f46116b6e135503af97f18e094414e3b
10:29:02:         Branch: master
10:29:02:       Compiler: Visual C++ 2008
10:29:02:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
10:29:02:       Platform: win32 10
10:29:02:           Bits: 32
10:29:02:           Mode: Release
10:29:02:         Config: D:\Folding\FAHClient\config.xml
10:29:02:******************************** CBang ********************************
10:29:02:           Date: Apr 24 2020
10:29:02:           Time: 17:07:55
10:29:02:       Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
10:29:02:         Branch: master
10:29:02:       Compiler: Visual C++ 2008
10:29:02:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
10:29:02:       Platform: win32 10
10:29:02:           Bits: 32
10:29:02:           Mode: Release
10:29:02:******************************* System ********************************
10:29:02:            CPU: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
10:29:02:         CPU ID: GenuineIntel Family 6 Model 58 Stepping 9
10:29:02:           CPUs: 8
10:29:02:         Memory: 15.96GiB
10:29:02:    Free Memory: 10.38GiB
10:29:02:        Threads: WINDOWS_THREADS
10:29:02:     OS Version: 6.2
10:29:02:    Has Battery: false
10:29:02:     On Battery: false
10:29:02:     UTC Offset: 2
10:29:02:            PID: 596
10:29:02:            CWD: D:\Folding\FAHClient
10:29:02:  Win32 Service: false
10:29:02:             OS: Windows 10 Enterprise
10:29:02:        OS Arch: AMD64
10:29:02:           GPUs: 1
10:29:02:          GPU 0: Bus:1 Slot:0 Func:0 AMD:5 Ellesmere XT [Radeon RX
10:29:02:                 470/480/570/580/590]
10:29:02:           CUDA: Not detected: Failed to open dynamic library 'nvcuda.dll': The
10:29:02:                 specified module could not be found.
10:29:02:
10:29:02:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:3075.12
10:29:02:******************************* libFAH ********************************
10:29:02:           Date: Apr 15 2020
10:29:02:           Time: 14:53:14
10:29:02:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
10:29:02:         Branch: master
10:29:02:       Compiler: Visual C++ 2008
10:29:02:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
10:29:02:       Platform: win32 10
10:29:02:           Bits: 32
10:29:02:           Mode: Release
10:29:02:***********************************************************************
10:29:02:<config>
10:29:02:  <!-- Folding Slot Configuration -->
10:29:02:  <client-type v='advanced'/>
10:29:02:
10:29:02:  <!-- Network -->
10:29:02:  <proxy v=':8080'/>
10:29:02:
10:29:02:  <!-- Slot Control -->
10:29:02:  <power v='full'/>
10:29:02:
10:29:02:  <!-- User Information -->
10:29:02:  <passkey v='*****'/>
10:29:02:  <team v='244976'/>
10:29:02:  <user v='SomeGuyinIlmenau'/>
10:29:02:
10:29:02:  <!-- Folding Slots -->
10:29:02:  <slot id='0' type='CPU'>
10:29:02:    <paused v='true'/>
10:29:02:  </slot>
10:29:02:  <slot id='1' type='GPU'>
10:29:02:    <paused v='true'/>
10:29:02:  </slot>
10:29:02:</config>
10:29:11:17:127.0.0.1:New Web session
10:43:02:FS00:Unpaused
10:43:02:FS01:Unpaused
[...]
10:43:16:WU00:FS00:FahCore a7: Download complete
10:43:16:WU00:FS00:Valid core signature
10:43:17:WU00:FS00:Unpacked 19.85MiB to cores/cores.foldingathome.org/win/64bit-avx-256/a7-0.0.19/Core_a7.fah/FahCore_a7.exe
10:43:17:WU00:FS00:Unpacked 2.64MiB to cores/cores.foldingathome.org/win/64bit-avx-256/a7-0.0.19/Core_a7.fah/libfftw3f-3.dll
[...]
10:43:18:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13416 run:470 clone:91 gen:2 core:0x22 unit:0x0000000712bc7d9a5f02afafa76f9215
10:43:18:WU01:FS01:Downloading core from http://cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah
10:43:18:WU01:FS01:Connecting to cores.foldingathome.org:80
10:43:18:WU01:FS01:FahCore 22: Downloading 4.39MiB
10:43:19:WU00:FS00:0xa7:Completed 1 out of 250000 steps (0%)
10:43:22:WU01:FS01:FahCore 22: Download complete
10:43:23:WU01:FS01:Valid core signature
10:43:23:WU01:FS01:Unpacked 14.46MiB to cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe
10:43:23:WU01:FS01:Starting
10:43:23:WU01:FS01:Running FahCore: D:\Folding\FAHClient/FAHCoreWrapper.exe D:\Folding\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 706 -lifeline 596 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
10:43:23:WU01:FS01:Started FahCore on PID 9728
10:43:23:WU01:FS01:Core PID:9864
10:43:23:WU01:FS01:FahCore 0x22 started
10:43:23:WU01:FS01:0x22:*********************** Log Started 2020-07-14T10:43:23Z ***********************
10:43:23:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
10:43:23:WU01:FS01:0x22:       Core: Core22
10:43:23:WU01:FS01:0x22:       Type: 0x22
10:43:23:WU01:FS01:0x22:    Version: 0.0.11
10:43:23:WU01:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
10:43:23:WU01:FS01:0x22:  Copyright: 2020 foldingathome.org
10:43:23:WU01:FS01:0x22:   Homepage: https://foldingathome.org/
10:43:23:WU01:FS01:0x22:       Date: Jun 26 2020
10:43:23:WU01:FS01:0x22:       Time: 19:49:16
10:43:23:WU01:FS01:0x22:   Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
10:43:23:WU01:FS01:0x22:     Branch: core22-0.0.11
10:43:23:WU01:FS01:0x22:   Compiler: Visual C++ 2015
10:43:23:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
10:43:23:WU01:FS01:0x22:   Platform: win32 10
10:43:23:WU01:FS01:0x22:       Bits: 64
10:43:23:WU01:FS01:0x22:       Mode: Release
10:43:23:WU01:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
10:43:23:WU01:FS01:0x22:             <peastman@stanford.edu>
10:43:23:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 706 -lifeline 9728 -checkpoint 15
10:43:23:WU01:FS01:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
10:43:23:WU01:FS01:0x22:************************************ libFAH ************************************
10:43:23:WU01:FS01:0x22:       Date: Jun 26 2020
10:43:23:WU01:FS01:0x22:       Time: 19:47:12
10:43:23:WU01:FS01:0x22:   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
10:43:23:WU01:FS01:0x22:     Branch: HEAD
10:43:23:WU01:FS01:0x22:   Compiler: Visual C++ 2015
10:43:23:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
10:43:23:WU01:FS01:0x22:   Platform: win32 10
10:43:23:WU01:FS01:0x22:       Bits: 64
10:43:23:WU01:FS01:0x22:       Mode: Release
10:43:23:WU01:FS01:0x22:************************************ CBang *************************************
10:43:23:WU01:FS01:0x22:       Date: Jun 26 2020
10:43:23:WU01:FS01:0x22:       Time: 19:46:11
10:43:23:WU01:FS01:0x22:   Revision: f8529962055b0e7bde23e429f5072ff758089dee
10:43:23:WU01:FS01:0x22:     Branch: master
10:43:23:WU01:FS01:0x22:   Compiler: Visual C++ 2015
10:43:23:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
10:43:23:WU01:FS01:0x22:   Platform: win32 10
10:43:23:WU01:FS01:0x22:       Bits: 64
10:43:23:WU01:FS01:0x22:       Mode: Release
10:43:23:WU01:FS01:0x22:************************************ System ************************************
10:43:23:WU01:FS01:0x22:        CPU: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
10:43:23:WU01:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 58 Stepping 9
10:43:23:WU01:FS01:0x22:       CPUs: 8
10:43:23:WU01:FS01:0x22:     Memory: 15.96GiB
10:43:23:WU01:FS01:0x22:Free Memory: 10.00GiB
10:43:23:WU01:FS01:0x22:    Threads: WINDOWS_THREADS
10:43:23:WU01:FS01:0x22: OS Version: 6.2
10:43:23:WU01:FS01:0x22:Has Battery: false
10:43:23:WU01:FS01:0x22: On Battery: false
10:43:23:WU01:FS01:0x22: UTC Offset: 2
10:43:23:WU01:FS01:0x22:        PID: 9864
10:43:23:WU01:FS01:0x22:        CWD: D:\Folding\FAHClient\work
10:43:23:WU01:FS01:0x22:********************************************************************************
10:43:23:WU01:FS01:0x22:Project: 13416 (Run 470, Clone 91, Gen 2)
10:43:23:WU01:FS01:0x22:Unit: 0x0000000712bc7d9a5f02afafa76f9215
10:43:23:WU01:FS01:0x22:Reading tar file core.xml
10:43:23:WU01:FS01:0x22:Reading tar file integrator.xml
10:43:23:WU01:FS01:0x22:Reading tar file state.xml.bz2
10:43:23:WU01:FS01:0x22:Reading tar file system.xml.bz2
10:43:23:WU01:FS01:0x22:Digital signatures verified
10:43:23:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
10:43:23:WU01:FS01:0x22:Version 0.0.11
10:43:23:WU01:FS01:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
10:43:23:WU01:FS01:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
10:43:23:WU01:FS01:0x22:  XTC frame write interval: 250000 steps (25%) [4 total]
10:43:23:WU01:FS01:0x22:  Global context and integrator variables write interval: 2500 steps (0.25%) [400 total]
10:43:43:FS00:Finishing
10:43:43:FS01:Finishing
10:43:57:WU01:FS01:0x22:Completed 0 out of 1000000 steps (0%)
10:45:51:WU00:FS00:0xa7:Completed 2500 out of 250000 steps (1%)
10:48:07:WU01:FS01:0x22:Completed 10000 out of 1000000 steps (1%)
[everything ran fine, CPU WU finished and uploaded correctly]
17:38:26:WU01:FS01:0x22:Completed 990000 out of 1000000 steps (99%)
17:42:50:WU01:FS01:0x22:Completed 1000000 out of 1000000 steps (100%)
17:42:50:WU01:FS01:0x22:Average performance: 65.5539 ns/day
17:42:55:WU01:FS01:0x22:Saving result file ..\logfile_01.txt
17:42:55:WU01:FS01:0x22:Saving result file checkpointState.xml.bz2
17:42:55:WU01:FS01:0x22:Saving result file globals.csv
17:42:55:WU01:FS01:0x22:Saving result file positions.xtc
17:42:55:WU01:FS01:0x22:Saving result file science.log
17:42:55:WU01:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
17:43:50:WARNING:WU01:FS01:FahCore returned: FAILED_3 (255 = 0xff)
17:43:50:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13416 run:470 clone:91 gen:2 core:0x22 unit:0x0000000712bc7d9a5f02afafa76f9215
17:43:50:WU01:FS01:Uploading 5.82MiB to 18.188.125.154
17:43:50:WU01:FS01:Connecting to 18.188.125.154:8080
17:43:56:WU01:FS01:Upload 35.43%
17:44:02:WU01:FS01:Upload 77.30%
17:44:05:WU01:FS01:Upload complete
17:44:05:WU01:FS01:Server responded WORK_ACK (400)
17:44:05:WU01:FS01:Cleaning up
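
The pattern in the log above (a clean "FINISHED_UNIT" shutdown, then a "FAILED_3" return from the core) can be picked out of a longer log automatically. Below is a minimal sketch, assuming the line layout seen in this excerpt; the function name `find_late_failures` and the regular expressions are made up for illustration and are not part of FAHClient.

```python
import re
from datetime import datetime, timedelta

# Line layouts are assumed from the log excerpt in this post only,
# not from any official FAHClient log specification.
SHUTDOWN = re.compile(
    r"^(\d\d:\d\d:\d\d):(WU\d+):(FS\d+):0x\w+:Folding@home Core Shutdown: (\w+)"
)
RETURNED = re.compile(
    r"^(\d\d:\d\d:\d\d):WARNING:(WU\d+):(FS\d+):FahCore returned: (\w+)"
)


def _clock(text):
    """Parse the HH:MM:SS prefix of a log line (the date is not in the line)."""
    return datetime.strptime(text, "%H:%M:%S")


def find_late_failures(lines):
    """Return (wu, slot, status, gap_seconds) for units whose core logged
    FINISHED_UNIT but whose process then returned a FAILED_* status."""
    finished = {}  # (wu, slot) -> time of the FINISHED_UNIT line
    hits = []
    for line in lines:
        m = SHUTDOWN.match(line)
        if m:
            stamp, wu, slot, status = m.groups()
            if status == "FINISHED_UNIT":
                finished[(wu, slot)] = _clock(stamp)
            continue
        m = RETURNED.match(line)
        if m:
            stamp, wu, slot, status = m.groups()
            start = finished.pop((wu, slot), None)
            if start is not None and status.startswith("FAILED"):
                gap = _clock(stamp) - start
                if gap < timedelta(0):  # the log rolled past midnight
                    gap += timedelta(days=1)
                hits.append((wu, slot, status, int(gap.total_seconds())))
    return hits


# Four lines copied from the log in this post.
SAMPLE = """\
17:42:50:WU01:FS01:0x22:Completed 1000000 out of 1000000 steps (100%)
17:42:55:WU01:FS01:0x22:Saving result file science.log
17:42:55:WU01:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
17:43:50:WARNING:WU01:FS01:FahCore returned: FAILED_3 (255 = 0xff)
""".splitlines()

print(find_late_failures(SAMPLE))
```

The gap between the two lines (55 seconds here) is reported because the failure seems to happen after the science is done, so the delay may help whoever is debugging it.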


Code:
FahCore_22.exe
   0.0.0.0
   5ef65146
   ntdll.dll
   10.0.19041.207
   cad89ab4
   c0000374
   00000000000fdec9
   2688
   01d659cb98f1d0c0
   D:\Folding\FAHClient\cores\cores.foldingathome.org\win\64bit\22-0.0.11\Core_22.fah\FahCore_22.exe
   C:\WINDOWS\SYSTEM32\ntdll.dll
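
The raw values above are easier to read once decoded. The sketch below assumes they follow the usual field order of a Windows "Application Error" event record (application, version, timestamp, module, version, timestamp, exception code, offset, process ID, start time, paths); the field labels and the function name `decode_event` are my own and not official names. Exception code c0000374 is the NTSTATUS value STATUS_HEAP_CORRUPTION, the process ID 2688 is hexadecimal for 9864 (the "Core PID" in the log above), and the application timestamp decodes to 26 Jun 2020, which lines up with the core build date printed in the log.

```python
from datetime import datetime, timedelta, timezone

# Assumed field order of the event record; labels are descriptive only.
FIELDS = [
    "app_name", "app_version", "app_timestamp",
    "module_name", "module_version", "module_timestamp",
    "exception_code", "fault_offset", "process_id",
    "app_start_time", "app_path", "module_path", "report_id",
]

# A small hand-picked subset of NTSTATUS codes, not a complete table.
NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000374: "STATUS_HEAP_CORRUPTION",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def decode_event(values):
    """Turn the raw hex fields of the event record into readable values."""
    rec = dict(zip(FIELDS, (v.strip() for v in values)))
    code = int(rec["exception_code"], 16)
    return {
        "exception": NTSTATUS.get(code, hex(code)),
        "faulting_module": rec["module_name"],
        "pid": int(rec["process_id"], 16),
        # Assumed to be the executable's link time, in Unix seconds.
        "core_built": datetime.fromtimestamp(
            int(rec["app_timestamp"], 16), tz=timezone.utc
        ),
        # FILETIME: 100-nanosecond ticks since 1601-01-01 UTC.
        "started": FILETIME_EPOCH
        + timedelta(microseconds=int(rec["app_start_time"], 16) // 10),
    }


# Values copied from the event record in this post.
EVENT = [
    "FahCore_22.exe", "0.0.0.0", "5ef65146",
    "ntdll.dll", "10.0.19041.207", "cad89ab4",
    "c0000374", "00000000000fdec9", "2688",
    "01d659cb98f1d0c0",
]

info = decode_event(EVENT)
for key, value in info.items():
    print(f"{key}: {value}")
```

A heap corruption status is usually reported when the process heap is found damaged, so it may point at a memory-handling problem inside the process rather than at the machine running out of memory. That is an interpretation, though, not something confirmed in this thread.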


To be honest, I'm a bit lost. Hardware is pretty new and some threads I discovered here indicate to me the error might not be on my end. Any new ideas on this?

*edit* Just came in: Same WU, smells like the same error. Thread.
Jan
 
Posts: 80
Joined: Tue Mar 31, 2020 7:46 pm

Re: faulty WU 13415, 2495,0,1

Postby bruce » Sun Jul 19, 2020 10:27 am

My theory is that the crash happens while the data is being compressed for upload, which takes a lot of virtual memory. Can you expand the paging file?
bruce
 
Posts: 20152
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: faulty WU 13415, 2495,0,1

Postby Jan » Mon Jul 20, 2020 9:25 am

I have now set the OS to manage the size of the paging file on its own. Will report back.
Jan
 
Posts: 80
Joined: Tue Mar 31, 2020 7:46 pm

Re: faulty WU 13415, 2495,0,1

Postby Jan » Mon Jul 20, 2020 4:49 pm

Alright, just had the EXACT SAME error with a 13418 WU. RAM, both physical and virtual, was not even close to being fully used; I watched the last few percent closely. I don't think this is something on my end.

Code:
15:39:24:WU01:FS01:0x22:Completed 1000000 out of 1000000 steps (100%)
15:39:24:WU01:FS01:0x22:Average performance: 67.4473 ns/day
15:39:29:WU01:FS01:0x22:Saving result file ..\logfile_01.txt
15:39:29:WU01:FS01:0x22:Saving result file checkpointState.xml.bz2
15:39:29:WU01:FS01:0x22:Saving result file globals.csv
15:39:29:WU01:FS01:0x22:Saving result file positions.xtc
15:39:29:WU01:FS01:0x22:Saving result file science.log
15:39:29:WU01:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
15:48:11:WARNING:WU01:FS01:FahCore returned: FAILED_3 (255 = 0xff)
15:48:11:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13418 run:568 clone:87 gen:2 core:0x22 unit:0x0000000212bc7d9a5f12829162cf6612
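
Since reports in this thread are identified by project, run, clone and gen, it can help to pull those numbers straight out of the "Sending unit results" lines. A minimal sketch, assuming the `error:... project:... run:... clone:... gen:...` layout shown in these logs; `faulty_units` is a hypothetical helper name, not a FAHClient feature.

```python
import re
from collections import Counter

# Field layout assumed from the log lines quoted in this thread.
UNIT = re.compile(r"error:(\w+) project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)")


def faulty_units(lines):
    """Return (project, run, clone, gen) for every unit flagged FAULTY."""
    units = []
    for line in lines:
        m = UNIT.search(line)
        if m and m.group(1) == "FAULTY":
            units.append(tuple(int(g) for g in m.groups()[1:]))
    return units


# Three lines copied from the logs in this thread.
SAMPLE = [
    "10:43:18:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR "
    "project:13416 run:470 clone:91 gen:2 core:0x22 "
    "unit:0x0000000712bc7d9a5f02afafa76f9215",
    "17:43:50:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY "
    "project:13416 run:470 clone:91 gen:2 core:0x22 "
    "unit:0x0000000712bc7d9a5f02afafa76f9215",
    "15:48:11:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY "
    "project:13418 run:568 clone:87 gen:2 core:0x22 "
    "unit:0x0000000212bc7d9a5f12829162cf6612",
]

units = faulty_units(SAMPLE)
per_project = Counter(project for project, *_ in units)
print(units)
print(per_project)
```

Counting per project makes it easy to see at a glance whether the failures cluster on one project or spread across the 134xx series.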


Code:
FahCore_22.exe
   0.0.0.0
   5ef65146
   ntdll.dll
   10.0.19041.207
   cad89ab4
   c0000374
   00000000000fdec9
   16b8
   01d65e7177d8f8f3
   D:\Folding\FAHClient\cores\cores.foldingathome.org\win\64bit\22-0.0.11\Core_22.fah\FahCore_22.exe
   C:\WINDOWS\SYSTEM32\ntdll.dll
   8c977839-f889-4452-9fa2-499ab6d8e971
Jan
 
Posts: 80
Joined: Tue Mar 31, 2020 7:46 pm

Re: faulty WU 13415, 2495,0,1

Postby bruce » Tue Jul 21, 2020 3:19 pm

Thanks for the report.
bruce
 
Posts: 20152
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: faulty WU 13415, 2495,0,1

Postby cayenne187 » Sat Jul 25, 2020 3:37 pm

No 13415 WUs lately. 13416 does this once a week. The current WU, 13416 (1207, 252, 1), is running for the second time after crashing during upload, on the RX570.
cayenne187
 
Posts: 39
Joined: Thu May 14, 2020 8:56 pm

Re: faulty WU 13415, 2495,0,1

Postby JohnChodera » Sun Jul 26, 2020 2:17 am

Hi folks! We've shut down these older projects; 13418 is the latest active project in the 134xx series.

~ John Chodera // MSKCC
JohnChodera
Pande Group Member
 
Posts: 408
Joined: Fri Feb 22, 2013 10:59 pm
