project 13416 high wu failure rate

If you think it might be a driver problem, see viewforum.php?f=79

Moderators: Site Moderators, FAHC Science Team

Yeroon
Posts: 25
Joined: Tue Jul 07, 2020 11:09 pm

project 13416 high wu failure rate

Post by Yeroon »

I've had an issue over the last few days where I get a high rate of bad WUs, but only from project 13416. They fail in a multitude of ways, a few of which have been described here for 13415, so I don't think this is really a surprise.

The machine has two RX 470 cards and runs Ubuntu 20.04 with the 5.4 kernel and amdgpu 20.20 OpenCL.
Latest config:

Code:

<config>
  <!-- Client Control -->
  <fold-anon v='true'/>

  <!-- Folding Slot Configuration -->
  <cause v='COVID_19'/>

  <!-- Network -->
  <proxy v=':8080'/>

  <!-- User Information -->
  <passkey v='redacted'/>
  <team v='234771'/>
  <user v='Yeroon'/>

  <!-- Folding Slots -->
  <slot id='1' type='GPU'>
    <gpu-index v='0'/>
    <opencl-index v='0'/>
  </slot>
  <slot id='0' type='GPU'>
    <gpu-index v='1'/>
    <opencl-index v='1'/>
  </slot>
  <slot id='2' type='CPU'>
    <cpus v='8'/>
  </slot>
</config>
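For anyone checking which slot maps to which GPU, here is a minimal Python sketch that lists the slots in a config like the one above. It assumes the file keeps the same layout (`slot` elements with `id` and `type` attributes, and child options that carry a `v` attribute). `CONFIG_XML` and `list_slots` are names made up for this example, not part of FAHClient.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slot section from the config posted above.
CONFIG_XML = """<config>
  <slot id='1' type='GPU'>
    <gpu-index v='0'/>
    <opencl-index v='0'/>
  </slot>
  <slot id='0' type='GPU'>
    <gpu-index v='1'/>
    <opencl-index v='1'/>
  </slot>
  <slot id='2' type='CPU'>
    <cpus v='8'/>
  </slot>
</config>"""


def list_slots(xml_text):
    """Return (slot id, slot type, options) tuples sorted by slot id."""
    root = ET.fromstring(xml_text)
    slots = []
    for slot in root.findall("slot"):
        # Each child option stores its value in the 'v' attribute.
        options = {child.tag: child.get("v") for child in slot}
        slots.append((int(slot.get("id")), slot.get("type"), options))
    return sorted(slots, key=lambda entry: entry[0])


for slot_id, slot_type, options in list_slots(CONFIG_XML):
    print(slot_id, slot_type, options)
```

Running the same function on the config block echoed in the startup log makes it easy to see which slots were enabled at that point.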
First bit of the log from when I started:

Code:

*********************** Log Started 2020-07-06T23:25:03Z ***********************
23:25:03:Trying to access database...
23:25:03:Successfully acquired database lock
23:25:03:Read GPUs.txt
23:25:04:Enabled folding slot 01: READY gpu:0:Ellesmere XT [Radeon RX 470/480/570/580/590]
23:25:04:****************************** FAHClient ******************************
23:25:04:        Version: 7.6.13
23:25:04:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
23:25:04:      Copyright: 2020 foldingathome.org
23:25:04:       Homepage: https://foldingathome.org/
23:25:04:           Date: Apr 28 2020
23:25:04:           Time: 04:20:16
23:25:04:       Revision: 5a652817f46116b6e135503af97f18e094414e3b
23:25:04:         Branch: master
23:25:04:       Compiler: GNU 8.3.0
23:25:04:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
23:25:04:                 -funroll-loops -fno-pie
23:25:04:       Platform: linux2 4.19.0-5-amd64
23:25:04:           Bits: 64
23:25:04:           Mode: Release
23:25:04:           Args: --child /etc/fahclient/config.xml --run-as root
23:25:04:                 --pid-file=/var/run/fahclient.pid --daemon
23:25:04:         Config: /etc/fahclient/config.xml
23:25:04:******************************** CBang ********************************
23:25:04:           Date: Apr 25 2020
23:25:04:           Time: 00:07:53
23:25:04:       Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
23:25:04:         Branch: master
23:25:04:       Compiler: GNU 8.3.0
23:25:04:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
23:25:04:                 -funroll-loops -fno-pie -fPIC
23:25:04:       Platform: linux2 4.19.0-5-amd64
23:25:04:           Bits: 64
23:25:04:           Mode: Release
23:25:04:******************************* System ********************************
23:25:04:            CPU: AMD Ryzen 5 3600 6-Core Processor
23:25:04:         CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
23:25:04:           CPUs: 12
23:25:04:         Memory: 31.30GiB
23:25:04:    Free Memory: 29.93GiB
23:25:04:        Threads: POSIX_THREADS
23:25:04:     OS Version: 5.4
23:25:04:    Has Battery: false
23:25:04:     On Battery: false
23:25:04:     UTC Offset: -4
23:25:04:            PID: 2705
23:25:04:            CWD: /var/lib/fahclient
23:25:04:             OS: Linux 5.4.0-40-generic x86_64
23:25:04:        OS Arch: AMD64
23:25:04:           GPUs: 2
23:25:04:          GPU 0: Bus:5 Slot:0 Func:0 AMD:5 Ellesmere XT [Radeon RX
23:25:04:                 470/480/570/580/590]
23:25:04:          GPU 1: Bus:6 Slot:0 Func:0 AMD:5 Ellesmere XT [Radeon RX
23:25:04:                 470/480/570/580/590]
23:25:04:           CUDA: Not detected: Failed to open dynamic library 'libcuda.so':
23:25:04:                 libcuda.so: cannot open shared object file: No such file or
23:25:04:                 directory
23:25:04:OpenCL Device 0: Platform:0 Device:0 Bus:5 Slot:0 Compute:1.2 Driver:3110.6
23:25:04:OpenCL Device 1: Platform:0 Device:1 Bus:6 Slot:0 Compute:1.2 Driver:3110.6
23:25:04:******************************* libFAH ********************************
23:25:04:           Date: Apr 15 2020
23:25:04:           Time: 21:43:24
23:25:04:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
23:25:04:         Branch: master
23:25:04:       Compiler: GNU 8.3.0
23:25:04:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
23:25:04:                 -funroll-loops -fno-pie
23:25:04:       Platform: linux2 4.19.0-5-amd64
23:25:04:           Bits: 64
23:25:04:           Mode: Release
23:25:04:***********************************************************************
23:25:04:<config>
23:25:04:  <!-- Client Control -->
23:25:04:  <fold-anon v='true'/>
23:25:04:
23:25:04:  <!-- Network -->
23:25:04:  <proxy v=':8080'/>
23:25:04:
23:25:04:  <!-- User Information -->
23:25:04:  <passkey v='*****'/>
23:25:04:  <team v='234771'/>
23:25:04:  <user v='Yeroon'/>
23:25:04:
23:25:04:  <!-- Folding Slots -->
23:25:04:  <slot id='1' type='GPU'>
23:25:04:    <gpu-index v='0'/>
23:25:04:    <opencl-index v='0'/>
23:25:04:  </slot>
23:25:04:</config>
I got the following failures as I was getting the machine going:

Code:

WU00:FS01:0x22:ERROR:Discrepancy: Forces are blowing up! 0 0 
WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13416 run:1108 clone:124 gen:0 core:0x22 unit:0x0000000012bc7d9a5f02af7c6e30c25c

WU01:FS00:0x22:ERROR:Discrepancy: Forces are blowing up! 0 0
WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:13416 run:1154 clone:124 gen:0 core:0x22 unit:0x0000000012bc7d9a5f02af79269ad718

WU02:FS01:0x22:ERROR:Discrepancy: Forces are blowing up! 0 0
WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13416 run:1156 clone:124 gen:0 core:0x22 unit:0x0000000012bc7d9a5f02af798c4a5c5c

WU03:FS00:0x22:ERROR:Potential energy error of 246.816, threshold of 10
WU03:FS00:0x22:ERROR:Reference Potential Energy: -1.23702e+06 | Given Potential Energy: -1.23677e+06
WU03:FS00:Sending unit results: id:03 state:SEND error:FAULTY project:13416 run:1274 clone:124 gen:0 core:0x22 unit:0x0000000012bc7d9a5f02af716b44eddb
It finishes other WUs without issue. I get more failures when starting 13416 units throughout the day:

Code:

WU00:FS00:0x22:ERROR:Discrepancy: Forces are blowing up! 0 0
WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:13416 run:298 clone:23 gen:3 core:0x22 unit:0x0000000312bc7d9a5f00a7ee24fd8ddb

WU00:FS01:0x22:An exception occurred at step 250: Particle coordinate is nan
WU00:FS01:0x22:Max number of attempts to resume from last checkpoint (2) reached. Aborting.
WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13416 run:757 clone:39 gen:0 core:0x22 unit:0x0000000112bc7d9a5f02af9af0e6828b

WU02:FS01:0x22:ERROR:Force RMSE error of 41.1941 with threshold of 5
WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13416 run:652 clone:8 gen:0 core:0x22 unit:0x0000000212bc7d9a5f02afa3f80e66a3

WU03:FS01:0x22:ERROR:Force RMSE error of 13.4286 with threshold of 5
WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:13416 run:1122 clone:8 gen:0 core:0x22 unit:0x0000000212bc7d9a5f02af7c63ccc089

WU00:FS00:0x22:ERROR:Force RMSE error of 9.85244 with threshold of 5
WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:13416 run:1059 clone:102 gen:0 core:0x22 unit:0x0000000112bc7d9a5f02af809604a333
Once a WU gets past any initial failures, it goes to 100%. Some 13416 units have a lower PPD and lower power usage, but that's a different matter.
I can get full logs if it helps.
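To check whether the failures really do cluster on one project, the result lines can be tallied per project. A minimal Python sketch, assuming the log uses the same "Sending unit results" format as the excerpts above; `RESULT_RE` and `tally_results` are hypothetical names for this example, not FAHClient tooling.

```python
import re
from collections import Counter

# Pattern for the "Sending unit results" lines shown in the excerpts above.
RESULT_RE = re.compile(
    r"Sending unit results: id:\d+ state:\w+ error:(?P<error>\w+) "
    r"project:(?P<project>\d+) run:(?P<run>\d+) clone:(?P<clone>\d+) gen:(?P<gen>\d+)"
)


def tally_results(lines):
    """Count (project, error) pairs across the given log lines."""
    counts = Counter()
    for line in lines:
        match = RESULT_RE.search(line)
        if match:
            counts[(int(match["project"]), match["error"])] += 1
    return counts


# Three lines copied from the excerpts above; only two are result lines.
sample = [
    "WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13416 run:1108 clone:124 gen:0 core:0x22 unit:0x0000000012bc7d9a5f02af7c6e30c25c",
    "WU03:FS01:0x22:ERROR:Force RMSE error of 13.4286 with threshold of 5",
    "WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:13416 run:1122 clone:8 gen:0 core:0x22 unit:0x0000000212bc7d9a5f02af7c63ccc089",
]
counts = tally_results(sample)
for (project, error), count in sorted(counts.items()):
    print(project, error, count)
```

Feeding it every line of a full log file instead of `sample` would give the per-project failure counts for the whole run.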
psaam0001
Posts: 383
Joined: Mon May 18, 2020 2:02 am
Location: Ruckersville, Virginia, USA

Re: project 13416 high wu failure rate

Post by psaam0001 »

I am posting a log of one that is expected to take my Ryzen 3's integrated GPU another 5.22 days (as of the time I captured it).

-- Start of captured log information --

Code:

******************************* Date: 2020-07-07 *******************************
07:36:04:WU01:FS01:Connecting to assign1.foldingathome.org:80
07:36:04:WU01:FS01:Assigned to work server 18.188.125.154
07:36:04:WU01:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:raven [Radeon RX Vega gfx902] from 18.188.125.154
07:36:04:WU01:FS01:Connecting to 18.188.125.154:8080
07:36:05:WU01:FS01:Downloading 7.05MiB
07:36:06:WU01:FS01:Download complete
07:36:06:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13416 run:719 clone:187 gen:0 core:0x22 unit:0x0000000012bc7d9a5f02af9cd2f0a1bf
07:50:19:WU01:FS01:Starting
07:50:19:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\saam4\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 706 -lifeline 8312 -checkpoint 30 -gpu-vendor amd -opencl-platform 1 -opencl-device 0 -gpu 0
07:50:19:WU01:FS01:Started FahCore on PID 7552
07:50:19:WU01:FS01:Core PID:7944
07:50:19:WU01:FS01:FahCore 0x22 started
07:50:20:WU01:FS01:0x22:*********************** Log Started 2020-07-07T07:50:19Z ***********************
07:50:20:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
07:50:20:WU01:FS01:0x22:       Core: Core22
07:50:20:WU01:FS01:0x22:       Type: 0x22
07:50:20:WU01:FS01:0x22:    Version: 0.0.11
07:50:20:WU01:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
07:50:20:WU01:FS01:0x22:  Copyright: 2020 foldingathome.org
07:50:20:WU01:FS01:0x22:   Homepage: https://foldingathome.org/
07:50:20:WU01:FS01:0x22:       Date: Jun 26 2020
07:50:20:WU01:FS01:0x22:       Time: 19:49:16
07:50:20:WU01:FS01:0x22:   Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
07:50:20:WU01:FS01:0x22:     Branch: core22-0.0.11
07:50:20:WU01:FS01:0x22:   Compiler: Visual C++ 2015
07:50:20:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
07:50:20:WU01:FS01:0x22:   Platform: win32 10
07:50:20:WU01:FS01:0x22:       Bits: 64
07:50:20:WU01:FS01:0x22:       Mode: Release
07:50:20:WU01:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
07:50:20:WU01:FS01:0x22:             <peastman@stanford.edu>
07:50:20:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 706 -lifeline 7552 -checkpoint 30
07:50:20:WU01:FS01:0x22:             -gpu-vendor amd -opencl-platform 1 -opencl-device 0 -gpu 0
07:50:20:WU01:FS01:0x22:************************************ libFAH ************************************
07:50:20:WU01:FS01:0x22:       Date: Jun 26 2020
07:50:20:WU01:FS01:0x22:       Time: 19:47:12
07:50:20:WU01:FS01:0x22:   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
07:50:20:WU01:FS01:0x22:     Branch: HEAD
07:50:20:WU01:FS01:0x22:   Compiler: Visual C++ 2015
07:50:20:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
07:50:20:WU01:FS01:0x22:   Platform: win32 10
07:50:20:WU01:FS01:0x22:       Bits: 64
07:50:20:WU01:FS01:0x22:       Mode: Release
07:50:20:WU01:FS01:0x22:************************************ CBang *************************************
07:50:20:WU01:FS01:0x22:       Date: Jun 26 2020
07:50:20:WU01:FS01:0x22:       Time: 19:46:11
07:50:20:WU01:FS01:0x22:   Revision: f8529962055b0e7bde23e429f5072ff758089dee
07:50:20:WU01:FS01:0x22:     Branch: master
07:50:20:WU01:FS01:0x22:   Compiler: Visual C++ 2015
07:50:20:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
07:50:20:WU01:FS01:0x22:   Platform: win32 10
07:50:20:WU01:FS01:0x22:       Bits: 64
07:50:20:WU01:FS01:0x22:       Mode: Release
07:50:20:WU01:FS01:0x22:************************************ System ************************************
07:50:20:WU01:FS01:0x22:        CPU: AMD Ryzen 3 3200G with Radeon Vega Graphics
07:50:20:WU01:FS01:0x22:     CPU ID: AuthenticAMD Family 23 Model 24 Stepping 1
07:50:20:WU01:FS01:0x22:       CPUs: 4
07:50:20:WU01:FS01:0x22:     Memory: 13.93GiB
07:50:20:WU01:FS01:0x22:Free Memory: 9.75GiB
07:50:20:WU01:FS01:0x22:    Threads: WINDOWS_THREADS
07:50:20:WU01:FS01:0x22: OS Version: 6.2
07:50:20:WU01:FS01:0x22:Has Battery: false
07:50:20:WU01:FS01:0x22: On Battery: false
07:50:20:WU01:FS01:0x22: UTC Offset: -4
07:50:20:WU01:FS01:0x22:        PID: 7944
07:50:20:WU01:FS01:0x22:        CWD: C:\Users\saam4\AppData\Roaming\FAHClient\work
07:50:20:WU01:FS01:0x22:********************************************************************************
07:50:20:WU01:FS01:0x22:Project: 13416 (Run 719, Clone 187, Gen 0)
07:50:20:WU01:FS01:0x22:Unit: 0x0000000012bc7d9a5f02af9cd2f0a1bf
07:50:20:WU01:FS01:0x22:Reading tar file core.xml
07:50:20:WU01:FS01:0x22:Reading tar file integrator.xml
07:50:20:WU01:FS01:0x22:Reading tar file state.xml.bz2
07:50:20:WU01:FS01:0x22:Reading tar file system.xml.bz2
07:50:20:WU01:FS01:0x22:Digital signatures verified
07:50:20:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
07:50:20:WU01:FS01:0x22:Version 0.0.11
07:50:21:WU01:FS01:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
07:50:21:WU01:FS01:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
07:50:21:WU01:FS01:0x22:  XTC frame write interval: 250000 steps (25%) [4 total]
07:50:21:WU01:FS01:0x22:  Global context and integrator variables write interval: 2500 steps (0.25%) [400 total]
07:50:40:WU01:FS01:0x22:Completed 0 out of 1000000 steps (0%)
08:38:09:WU01:FS01:0x22:Completed 10000 out of 1000000 steps (1%)
10:07:57:WU01:FS01:0x22:Completed 20000 out of 1000000 steps (2%)
11:37:31:WU01:FS01:0x22:Completed 30000 out of 1000000 steps (3%)
******************************* Date: 2020-07-07 *******************************
13:07:39:WU01:FS01:0x22:Completed 40000 out of 1000000 steps (4%)
14:36:19:WU01:FS01:0x22:Completed 50000 out of 1000000 steps (5%)
16:04:38:WU01:FS01:0x22:Completed 60000 out of 1000000 steps (6%)
17:24:21:WU01:FS01:0x22:Completed 70000 out of 1000000 steps (7%)
18:42:31:WU01:FS01:0x22:Completed 80000 out of 1000000 steps (8%)
******************************* Date: 2020-07-07 *******************************
20:01:19:WU01:FS01:0x22:Completed 90000 out of 1000000 steps (9%)
21:19:40:WU01:FS01:0x22:Completed 100000 out of 1000000 steps (10%)
22:38:12:WU01:FS01:0x22:Completed 110000 out of 1000000 steps (11%)
00:05:10:WU01:FS01:0x22:Completed 120000 out of 1000000 steps (12%)
******************************* Date: 2020-07-08 *******************************
01:31:53:WU01:FS01:0x22:Completed 130000 out of 1000000 steps (13%)
-- End of captured log information --
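As a rough cross-check on that estimate, the pace between two progress lines can be extrapolated to the remaining steps. This is only a sketch: it assumes the pace of the latest 1% holds for the rest of the WU, and the client's own estimate may be computed differently. `parse_progress` and `remaining_days` are hypothetical helper names.

```python
import re

# Leading HH:MM:SS timestamp plus the "Completed N out of M steps" counter.
PROGRESS_RE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}):.*Completed (\d+) out of (\d+) steps"
)


def parse_progress(line):
    """Return (seconds_since_midnight, steps_done, steps_total), or None."""
    match = PROGRESS_RE.match(line)
    if match is None:
        return None
    hours, minutes, seconds, done, total = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds, done, total


def remaining_days(earlier_line, later_line):
    """Extrapolate the remaining run time from the pace between two lines."""
    t0, done0, _ = parse_progress(earlier_line)
    t1, done1, total = parse_progress(later_line)
    # The modulo handles a midnight rollover; it assumes the two lines are
    # less than 24 hours apart.
    elapsed = (t1 - t0) % 86400
    seconds_per_step = elapsed / (done1 - done0)
    return (total - done1) * seconds_per_step / 86400


# Last two progress lines from the log above.
earlier = "00:05:10:WU01:FS01:0x22:Completed 120000 out of 1000000 steps (12%)"
later = "01:31:53:WU01:FS01:0x22:Completed 130000 out of 1000000 steps (13%)"
print(f"{remaining_days(earlier, later):.1f} days remaining")
```

For the last two lines of the log this gives roughly 5.2 days, in line with the estimate quoted above.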

Paul

Mod Edit: Added Code Tags - PantherX
Knish
Posts: 232
Joined: Tue Mar 17, 2020 5:20 am

Re: project 13416 high wu failure rate

Post by Knish »

Yeroon, if you see in your logs that the sending of the faulty WU was successful, then you don't really need to report it here, unless there's something you think might be getting missed.

For whatever it's worth, your 2nd posted WU was successfully completed.
Yeroon
Posts: 25
Joined: Tue Jul 07, 2020 11:09 pm

Re: project 13416 high wu failure rate

Post by Yeroon »

They seem to be uploading properly after failing. I just wanted to raise the issue in case there are any known problems with my latest config (Ubuntu 20.04 + amdgpu 20.20) that would be fixable on my end, or to provide any relevant info that doesn't make it into the upload of a failed WU.
markdotgooley
Posts: 101
Joined: Tue Apr 21, 2020 11:46 am

Re: project 13416 high wu failure rate

Post by markdotgooley »

I'm getting insanely high estimated PPD from 13416 running on two RTX 2060 cards, around 3.5 million/day. Maybe there's a bonus given for the higher chance of failure?
psaam0001
Posts: 383
Joined: Mon May 18, 2020 2:02 am
Location: Ruckersville, Virginia, USA

Re: project 13416 high wu failure rate

Post by psaam0001 »

I ended up manually dumping the WU I posted that log for.

Paul
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: project 13416 high wu failure rate

Post by bruce »

After you manually dumped the WU, was a failure report uploaded for that WU?
psaam0001
Posts: 383
Joined: Mon May 18, 2020 2:02 am
Location: Ruckersville, Virginia, USA

Re: project 13416 high wu failure rate

Post by psaam0001 »

Could the expiration time frames be adjusted to 5-7 days for these GPU work units?

Paul
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: project 13416 high wu failure rate

Post by JohnChodera »

@Yeroon: We're aware of this issue on some driver/GPU combinations and are working to build more instrumentation into the 0.0.12 core22 release so we can debug further what is going on.

@markdotgooley: We've adjusted the base points to compensate for an unexpectedly wide RUN-to-RUN variance in the WU runtimes. We're trying to get to the bottom of this variance and hope to fix it in the upcoming 134xx projects.

@psaam0001: Because we need to turn these results around to the chemists in 24-48 hours, we've tried to keep the WUs short (1-2 hours) and the expiration times correspondingly short. That incentivizes completing the WUs in a time frame that is scientifically useful for advancing the open science COVID Moonshot program (http://covid.postera.ai/covid) toward useful therapies that will end the pandemic. We can make these a bit longer, but I'm worried about making them too long as a result.

Thanks so much for bearing with us, everyone. We're still getting tons of useful data here, but I realize there are many more issues than normal. We're working hard to get those sorted in both project setup and core improvements.

~ John Chodera // MSKCC
psaam0001
Posts: 383
Joined: Mon May 18, 2020 2:02 am
Location: Ruckersville, Virginia, USA

Re: project 13416 high wu failure rate

Post by psaam0001 »

Thanks John for your reply. I'm likely going to have to see what happens in the long run, as I know that we are all trying to pull off a miracle in a short period of time.

It seems like the Moonshot WUs are running OK on the NVIDIA GTX 1650 GPU that I have in the same system. So if I must temporarily disable folding on the Ryzen 3's integrated GPU, I will do that. I just want to make sure those Moonshot WUs can be completed successfully, within the time parameters that allow them to be useful.

Paul