Project 13414 2 Faulty WUs

Posted: Mon Jun 22, 2020 6:08 am
by pylgrym
In the last 2 or 3 days I've had 9 WUs from project 13414; 2 of them, both within the last 24 hours, were faulty. Here are the relevant logs:

Code:

12:06:07:WU01:FS01:Starting
12:06:07:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 1531 -checkpoint 15 -gpu-vendor nvidia -opencl-device 0 -cuda-device 0 -gpu 0
12:06:07:WU01:FS01:Started FahCore on PID 19874
12:06:07:WU01:FS01:Core PID:19878
12:06:07:WU01:FS01:FahCore 0x22 started
12:06:07:WU01:FS01:0x22:*********************** Log Started 2020-06-21T12:06:07Z ***********************
12:06:07:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
12:06:07:WU01:FS01:0x22:       Core: Core22
12:06:07:WU01:FS01:0x22:       Type: 0x22
12:06:07:WU01:FS01:0x22:    Version: 0.0.10
12:06:07:WU01:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
12:06:07:WU01:FS01:0x22:  Copyright: 2020 foldingathome.org
12:06:07:WU01:FS01:0x22:   Homepage: https://foldingathome.org/
12:06:07:WU01:FS01:0x22:       Date: Jun 16 2020
12:06:07:WU01:FS01:0x22:       Time: 15:55:31
12:06:07:WU01:FS01:0x22:   Revision: 147051aad40bcbec7d4b25105bbedfab425f1dc2
12:06:07:WU01:FS01:0x22:     Branch: core22-0.0.10
12:06:07:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
12:06:07:WU01:FS01:0x22:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
12:06:07:WU01:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
12:06:07:WU01:FS01:0x22:       Bits: 64
12:06:07:WU01:FS01:0x22:       Mode: Release
12:06:07:WU01:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
12:06:07:WU01:FS01:0x22:             <peastman@stanford.edu>
12:06:07:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 706 -lifeline 19874 -checkpoint 15
12:06:07:WU01:FS01:0x22:             -gpu-vendor nvidia -opencl-device 0 -cuda-device 0 -gpu 0
12:06:07:WU01:FS01:0x22:************************************ libFAH ************************************
12:06:07:WU01:FS01:0x22:       Date: Jun 2 2020
12:06:07:WU01:FS01:0x22:       Time: 00:07:31
12:06:07:WU01:FS01:0x22:   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
12:06:07:WU01:FS01:0x22:     Branch: HEAD
12:06:07:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
12:06:07:WU01:FS01:0x22:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
12:06:07:WU01:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
12:06:07:WU01:FS01:0x22:       Bits: 64
12:06:07:WU01:FS01:0x22:       Mode: Release
12:06:07:WU01:FS01:0x22:************************************ CBang *************************************
12:06:07:WU01:FS01:0x22:       Date: May 31 2020
12:06:07:WU01:FS01:0x22:       Time: 20:16:34
12:06:07:WU01:FS01:0x22:   Revision: 75fcee0b8e713cb47f5191a3689d5f4f07244c7f
12:06:07:WU01:FS01:0x22:     Branch: HEAD
12:06:07:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
12:06:07:WU01:FS01:0x22:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
12:06:07:WU01:FS01:0x22:             -fPIC
12:06:07:WU01:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
12:06:07:WU01:FS01:0x22:       Bits: 64
12:06:07:WU01:FS01:0x22:       Mode: Release
12:06:07:WU01:FS01:0x22:************************************ System ************************************
12:06:07:WU01:FS01:0x22:        CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
12:06:07:WU01:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
12:06:07:WU01:FS01:0x22:       CPUs: 12
12:06:07:WU01:FS01:0x22:     Memory: 31.13GiB
12:06:07:WU01:FS01:0x22:Free Memory: 12.19GiB
12:06:07:WU01:FS01:0x22:    Threads: POSIX_THREADS
12:06:07:WU01:FS01:0x22: OS Version: 4.15
12:06:07:WU01:FS01:0x22:Has Battery: false
12:06:07:WU01:FS01:0x22: On Battery: false
12:06:07:WU01:FS01:0x22: UTC Offset: 1
12:06:07:WU01:FS01:0x22:        PID: 19878
12:06:07:WU01:FS01:0x22:        CWD: /var/lib/fahclient/work
12:06:07:WU01:FS01:0x22:********************************************************************************
12:06:07:WU01:FS01:0x22:Project: 13414 (Run 92, Clone 49, Gen 0)
12:06:07:WU01:FS01:0x22:Unit: 0x0000000412bc7d9a5eed8c71067ebec5
12:06:07:WU01:FS01:0x22:Digital signatures verified
12:06:07:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
12:06:07:WU01:FS01:0x22:Version 0.0.10
12:06:07:WU01:FS01:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
12:06:07:WU01:FS01:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
12:06:07:WU01:FS01:0x22:  XTC frame write interval: 250000 steps (25%) [4 total]
12:06:07:WU01:FS01:0x22:  Global context and integrator variables write interval: 250 steps (0.025%) [4000 total]
12:06:11:WU01:FS01:0x22:Completed 0 out of 1000000 steps (0%)
12:06:34:WU01:FS01:0x22:An exception occurred at step 501: Particle coordinate is nan
12:06:34:WU01:FS01:0x22:Max number of attempts to resume from last checkpoint (2) reached. Aborting.
12:06:34:WU01:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
12:06:34:WU01:FS01:0x22:Saving result file ../logfile_01.txt
12:06:34:WU01:FS01:0x22:Saving result file globals.csv
12:06:34:WU01:FS01:0x22:Saving result file science.log
12:06:34:WU01:FS01:0x22:Saving result file state.xml
12:06:35:WU01:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
12:06:36:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
12:06:36:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13414 run:92 clone:49 gen:0 core:0x22 unit:0x0000000412bc7d9a5eed8c71067ebec5
and

Code:

01:09:30:WU01:FS01:Starting
01:09:30:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 1531 -checkpoint 15 -gpu-vendor nvidia -opencl-device 0 -cuda-device 0 -gpu 0
01:09:30:WU01:FS01:Started FahCore on PID 9691
01:09:30:WU01:FS01:Core PID:9695
01:09:30:WU01:FS01:FahCore 0x22 started
01:09:31:WU01:FS01:0x22:*********************** Log Started 2020-06-22T01:09:30Z ***********************
01:09:31:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
01:09:31:WU01:FS01:0x22:       Core: Core22
01:09:31:WU01:FS01:0x22:       Type: 0x22
01:09:31:WU01:FS01:0x22:    Version: 0.0.10
01:09:31:WU01:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
01:09:31:WU01:FS01:0x22:  Copyright: 2020 foldingathome.org
01:09:31:WU01:FS01:0x22:   Homepage: https://foldingathome.org/
01:09:31:WU01:FS01:0x22:       Date: Jun 16 2020
01:09:31:WU01:FS01:0x22:       Time: 15:55:31
01:09:31:WU01:FS01:0x22:   Revision: 147051aad40bcbec7d4b25105bbedfab425f1dc2
01:09:31:WU01:FS01:0x22:     Branch: core22-0.0.10
01:09:31:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
01:09:31:WU01:FS01:0x22:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
01:09:31:WU01:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
01:09:31:WU01:FS01:0x22:       Bits: 64
01:09:31:WU01:FS01:0x22:       Mode: Release
01:09:31:WU01:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
01:09:31:WU01:FS01:0x22:             <peastman@stanford.edu>
01:09:31:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 706 -lifeline 9691 -checkpoint 15
01:09:31:WU01:FS01:0x22:             -gpu-vendor nvidia -opencl-device 0 -cuda-device 0 -gpu 0
01:09:31:WU01:FS01:0x22:************************************ libFAH ************************************
01:09:31:WU01:FS01:0x22:       Date: Jun 2 2020
01:09:31:WU01:FS01:0x22:       Time: 00:07:31
01:09:31:WU01:FS01:0x22:   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
01:09:31:WU01:FS01:0x22:     Branch: HEAD
01:09:31:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
01:09:31:WU01:FS01:0x22:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
01:09:31:WU01:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
01:09:31:WU01:FS01:0x22:       Bits: 64
01:09:31:WU01:FS01:0x22:       Mode: Release
01:09:31:WU01:FS01:0x22:************************************ CBang *************************************
01:09:31:WU01:FS01:0x22:       Date: May 31 2020
01:09:31:WU01:FS01:0x22:       Time: 20:16:34
01:09:31:WU01:FS01:0x22:   Revision: 75fcee0b8e713cb47f5191a3689d5f4f07244c7f
01:09:31:WU01:FS01:0x22:     Branch: HEAD
01:09:31:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
01:09:31:WU01:FS01:0x22:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
01:09:31:WU01:FS01:0x22:             -fPIC
01:09:31:WU01:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
01:09:31:WU01:FS01:0x22:       Bits: 64
01:09:31:WU01:FS01:0x22:       Mode: Release
01:09:31:WU01:FS01:0x22:************************************ System ************************************
01:09:31:WU01:FS01:0x22:        CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
01:09:31:WU01:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
01:09:31:WU01:FS01:0x22:       CPUs: 12
01:09:31:WU01:FS01:0x22:     Memory: 31.13GiB
01:09:31:WU01:FS01:0x22:Free Memory: 15.40GiB
01:09:31:WU01:FS01:0x22:    Threads: POSIX_THREADS
01:09:31:WU01:FS01:0x22: OS Version: 4.15
01:09:31:WU01:FS01:0x22:Has Battery: false
01:09:31:WU01:FS01:0x22: On Battery: false
01:09:31:WU01:FS01:0x22: UTC Offset: 1
01:09:31:WU01:FS01:0x22:        PID: 9695
01:09:31:WU01:FS01:0x22:        CWD: /var/lib/fahclient/work
01:09:31:WU01:FS01:0x22:********************************************************************************
01:09:31:WU01:FS01:0x22:Project: 13414 (Run 141, Clone 69, Gen 0)
01:09:31:WU01:FS01:0x22:Unit: 0x0000000112bc7d9a5eed8c6fefe67d54
01:09:31:WU01:FS01:0x22:Digital signatures verified
01:09:31:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
01:09:31:WU01:FS01:0x22:Version 0.0.10
01:09:31:WU01:FS01:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
01:09:31:WU01:FS01:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
01:09:31:WU01:FS01:0x22:  XTC frame write interval: 250000 steps (25%) [4 total]
01:09:31:WU01:FS01:0x22:  Global context and integrator variables write interval: 250 steps (0.025%) [4000 total]
01:09:35:WU01:FS01:0x22:Completed 0 out of 1000000 steps (0%)
01:10:13:WU01:FS01:0x22:An exception occurred at step 2258: Particle coordinate is nan
01:10:13:WU01:FS01:0x22:Max number of attempts to resume from last checkpoint (2) reached. Aborting.
01:10:13:WU01:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
01:10:13:WU01:FS01:0x22:Saving result file ../logfile_01.txt
01:10:13:WU01:FS01:0x22:Saving result file globals.csv
01:10:13:WU01:FS01:0x22:Saving result file science.log
01:10:13:WU01:FS01:0x22:Saving result file state.xml
01:10:14:WU01:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
01:10:15:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
01:10:15:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13414 run:141 clone:69 gen:0 core:0x22 unit:0x0000000112bc7d9a5eed8c6fefe67d54
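
For anyone who wants to pull the same details out of a longer log, here is a minimal Python sketch. It assumes the log lines look exactly like the ones pasted above. The function name `summarize_faulty` and both regular expressions are made up for illustration and are not part of the FAH client. It pairs each "Sending unit results" line marked FAULTY with the last exception reported on the same WU/slot prefix.

```python
import re

# Core exception line, e.g.
# "12:06:34:WU01:FS01:0x22:An exception occurred at step 501: Particle coordinate is nan"
EXC_RE = re.compile(
    r"(?P<slot>WU\d+:FS\d+):0x[0-9A-Fa-f]+:"
    r"An exception occurred at step (?P<step>\d+): (?P<reason>.+)"
)

# Client result line, e.g.
# "...:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13414 run:92 ..."
RESULT_RE = re.compile(
    r"(?P<slot>WU\d+:FS\d+):Sending unit results: id:\d+ state:\w+ "
    r"error:(?P<error>\w+) project:(?P<project>\d+) run:(?P<run>\d+) "
    r"clone:(?P<clone>\d+) gen:(?P<gen>\d+)"
)


def summarize_faulty(log_lines):
    """Return one dict per FAULTY result, with the last exception seen on that slot."""
    last_exc = {}  # "WU01:FS01" -> (step, reason)
    faulty = []
    for line in log_lines:
        m = EXC_RE.search(line)
        if m:
            last_exc[m.group("slot")] = (int(m.group("step")), m.group("reason").strip())
            continue
        m = RESULT_RE.search(line)
        if not m:
            continue
        # Forget the stored exception once a result is sent for this slot.
        step, reason = last_exc.pop(m.group("slot"), (None, None))
        if m.group("error") == "FAULTY":
            faulty.append({
                "project": int(m.group("project")),
                "run": int(m.group("run")),
                "clone": int(m.group("clone")),
                "gen": int(m.group("gen")),
                "step": step,
                "reason": reason,
            })
    return faulty
```

Passing it the lines of the two logs above should give two entries (Run 92 / Clone 49 failing at step 501, and Run 141 / Clone 69 failing at step 2258). Both failures come well before the first checkpoint at 50000 steps.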


Re: Project 13414 2 Faulty WUs

Posted: Thu Jun 25, 2020 4:52 am
by bruce
Project 13414 WUs are experimental and tend to have more errors than most other projects. They are being monitored closely and the PI learns from each error report ... generally leading to a better analysis for future WUs.

Then, too, FAH tends to be a more stringent overclocking benchmark than the traditionally used versions. Your nan errors might just be because you're overclocking and provoking an instability. Try removing the overclocking if you have one set and see if it helps.

Re: Project 13414 2 Faulty WUs

Posted: Thu Jun 25, 2020 5:35 am
by pylgrym
bruce wrote: Project 13414 WUs are experimental and tend to have more errors than most other projects. They are being monitored closely and the PI learns from each error report ... generally leading to a better analysis for future WUs.

Then, too, FAH tends to be a more stringent overclocking benchmark than the traditionally used versions. Your nan errors might just be because you're overclocking and provoking an instability. Try removing the overclocking if you have one set and see if it helps.
Thanks. I'm not overclocking (or underclocking). I've since had another 8 WUs from this project with no apparent problems aside from some rather erratic points allocations. No big deal and good to know it's being monitored.

Re: Project 13414 2 Faulty WUs

Posted: Thu Jun 25, 2020 5:40 am
by bruce
closely monitored.

The methodology is brand new and is very challenging.