Project 17716 (Run 61, Clone 2, Gen 72) Repeatedly Stalled

Post by Whompithian

Head of log:

Code:

*********************** Log Started 2021-03-29T19:10:22Z ***********************
2021-03-29:19:10:22:************************** libFAH **************************
2021-03-29:19:10:22:           Date: Oct 20 2020
2021-03-29:19:10:22:           Time: 20:36:41
2021-03-29:19:10:22:       Revision: 5ca109d295a6245e2a2f590b3d0085ad5e567aeb
2021-03-29:19:10:22:         Branch: master
2021-03-29:19:10:22:       Compiler: GNU 4.9.4
2021-03-29:19:10:22:        Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections
2021-03-29:19:10:22:                 -O3 -funroll-loops
2021-03-29:19:10:22:       Platform: linux2 5.8.0-1-amd64
2021-03-29:19:10:22:           Bits: 64
2021-03-29:19:10:22:           Mode: Release
2021-03-29:19:10:22:************************ FAHClient *************************
2021-03-29:19:10:22:        Version: 7.6.21
2021-03-29:19:10:22:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
2021-03-29:19:10:22:      Copyright: 2020 foldingathome.org
2021-03-29:19:10:22:       Homepage: https://foldingathome.org/
2021-03-29:19:10:22:           Date: Oct 20 2020
2021-03-29:19:10:22:           Time: 20:38:59
2021-03-29:19:10:22:       Revision: 6efbf0e138e22d3963e6a291f78dcb9c6422a278
2021-03-29:19:10:22:         Branch: master
2021-03-29:19:10:22:       Compiler: GNU 4.9.4
2021-03-29:19:10:22:        Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections
2021-03-29:19:10:22:                 -O3 -funroll-loops
2021-03-29:19:10:22:       Platform: linux2 5.8.0-1-amd64
2021-03-29:19:10:22:           Bits: 64
2021-03-29:19:10:22:           Mode: Release
2021-03-29:19:10:22:           Args: --config=/etc/fahclient/config.xml --chdir=/var/lib/fahclient/
2021-03-29:19:10:22:         Config: /etc/fahclient/config.xml
2021-03-29:19:10:22:************************** CBang ***************************
2021-03-29:19:10:22:           Date: Oct 20 2020
2021-03-29:19:10:22:           Time: 18:38:01
2021-03-29:19:10:22:       Revision: 7e4ce85225d7eaeb775e87c31740181ca603de60
2021-03-29:19:10:22:         Branch: master
2021-03-29:19:10:22:       Compiler: GNU 4.9.4
2021-03-29:19:10:22:        Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections
2021-03-29:19:10:22:                 -O3 -funroll-loops -fPIC
2021-03-29:19:10:22:       Platform: linux2 5.8.0-1-amd64
2021-03-29:19:10:22:           Bits: 64
2021-03-29:19:10:22:           Mode: Release
2021-03-29:19:10:22:************************** System **************************
2021-03-29:19:10:22:            CPU: AMD Ryzen Threadripper 2990WX 32-Core Processor
2021-03-29:19:10:22:         CPU ID: AuthenticAMD Family 23 Model 8 Stepping 2
2021-03-29:19:10:22:           CPUs: 64
2021-03-29:19:10:22:         Memory: 125.51GiB
2021-03-29:19:10:22:    Free Memory: 124.62GiB
2021-03-29:19:10:22:        Threads: POSIX_THREADS
2021-03-29:19:10:22:     OS Version: 4.18
2021-03-29:19:10:22:    Has Battery: false
2021-03-29:19:10:22:     On Battery: false
2021-03-29:19:10:22:     UTC Offset: -7
2021-03-29:19:10:22:            PID: 1661
2021-03-29:19:10:22:            CWD: /var/lib/fahclient
2021-03-29:19:10:22:             OS: Linux 4.18.0-240.1.1.el8_3.x86_64 x86_64
2021-03-29:19:10:22:        OS Arch: AMD64
2021-03-29:19:10:22:           GPUs: 2
2021-03-29:19:10:22:          GPU 0: Bus:10 Slot:0 Func:0 AMD:5 Vega 10 XL/XT [Radeon RX Vega 56/64]
2021-03-29:19:10:22:          GPU 1: Bus:69 Slot:0 Func:0 AMD:5 Vega 10 XL/XT [Radeon RX Vega 56/64]
2021-03-29:19:10:22:           CUDA: Not detected: Failed to open dynamic library 'libcuda.so':
2021-03-29:19:10:22:                 libcuda.so: cannot open shared object file: No such file or
2021-03-29:19:10:22:                 directory
2021-03-29:19:10:22:OpenCL Device 0: Platform:0 Device:0 Bus:69 Slot:0 Compute:2.0 Driver:3004.6
2021-03-29:19:10:22:OpenCL Device 1: Platform:0 Device:1 Bus:10 Slot:0 Compute:2.0 Driver:3004.6
2021-03-29:19:10:22:************************************************************
2021-03-29:19:10:22:<config>
2021-03-29:19:10:22:  <!-- Error Handling -->
2021-03-29:19:10:22:  <max-slot-errors v='20'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Folding Core -->
2021-03-29:19:10:22:  <checkpoint v='5'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Folding Slot Configuration -->
2021-03-29:19:10:22:  <client-type v='advanced'/>
2021-03-29:19:10:22:  <cpus v='54'/>
2021-03-29:19:10:22:  <disable-viz v='true'/>
2021-03-29:19:10:22:  <max-packet-size v='big'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- GUI -->
2021-03-29:19:10:22:  <gui-enabled v='false'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- HTTP Server -->
2021-03-29:19:10:22:  <max-connect-time v='604800'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Logging -->
2021-03-29:19:10:22:  <log-date v='true'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Remote Command Server -->
2021-03-29:19:10:22:  <command-address v='127.0.0.1'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Slot Control -->
2021-03-29:19:10:22:  <pause-on-battery v='false'/>
2021-03-29:19:10:22:  <power v='FULL'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- User Information -->
2021-03-29:19:10:22:  <passkey v='*****'/>
2021-03-29:19:10:22:  <team v='40524'/>
2021-03-29:19:10:22:  <user v='Whompithian'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Web Server Sessions -->
2021-03-29:19:10:22:  <session-lifetime v='0'/>
2021-03-29:19:10:22:  <session-timeout v='0'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Work Unit Control -->
2021-03-29:19:10:22:  <stall-detection-enabled v='true'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Folding Slots -->
2021-03-29:19:10:22:  <slot id='0' type='CPU'/>
2021-03-29:19:10:22:  <slot id='1' type='GPU'>
2021-03-29:19:10:22:    <pci-bus v='10'/>
2021-03-29:19:10:22:    <pci-slot v='0'/>
2021-03-29:19:10:22:  </slot>
2021-03-29:19:10:22:  <slot id='2' type='GPU'>
2021-03-29:19:10:22:    <pci-bus v='69'/>
2021-03-29:19:10:22:    <pci-slot v='0'/>
2021-03-29:19:10:22:  </slot>
2021-03-29:19:10:22:</config>
2021-03-29:19:10:22:Trying to access database...
2021-03-29:19:10:22:Successfully acquired database lock
2021-03-29:19:10:22:FS00:Initialized folding slot 00: cpu:54
2021-03-29:19:10:22:FS01:Initialized folding slot 01: gpu:10:0 Vega 10 XL/XT [Radeon RX Vega 56/64]
2021-03-29:19:10:22:FS02:Initialized folding slot 02: gpu:69:0 Vega 10 XL/XT [Radeon RX Vega 56/64]

Work unit details:

Code:

2021-03-29:19:10:23:WU00:FS01:0x22:*********************** Log Started 2021-03-29T19:10:22Z ***********************
2021-03-29:19:10:23:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
2021-03-29:19:10:23:WU00:FS01:0x22:       Core: Core22
2021-03-29:19:10:23:WU00:FS01:0x22:       Type: 0x22
2021-03-29:19:10:23:WU00:FS01:0x22:    Version: 0.0.13
2021-03-29:19:10:23:WU00:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
2021-03-29:19:10:23:WU00:FS01:0x22:  Copyright: 2020 foldingathome.org
2021-03-29:19:10:23:WU00:FS01:0x22:   Homepage: https://foldingathome.org/
2021-03-29:19:10:23:WU00:FS01:0x22:       Date: Sep 19 2020
2021-03-29:19:10:23:WU00:FS01:0x22:       Time: 01:10:35
2021-03-29:19:10:23:WU00:FS01:0x22:   Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
2021-03-29:19:10:23:WU00:FS01:0x22:     Branch: core22-0.0.13
2021-03-29:19:10:23:WU00:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
2021-03-29:19:10:23:WU00:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
2021-03-29:19:10:23:WU00:FS01:0x22:             -funroll-loops -DOPENMM_GIT_HASH="\"189320d0\""
2021-03-29:19:10:23:WU00:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
2021-03-29:19:10:23:WU00:FS01:0x22:       Bits: 64
2021-03-29:19:10:23:WU00:FS01:0x22:       Mode: Release
2021-03-29:19:10:23:WU00:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
2021-03-29:19:10:23:WU00:FS01:0x22:             <peastman@stanford.edu>
2021-03-29:19:10:23:WU00:FS01:0x22:       Args: -dir 00 -suffix 01 -version 706 -lifeline 1748 -checkpoint 5
2021-03-29:19:10:23:WU00:FS01:0x22:             -opencl-platform 0 -opencl-device 1 -gpu-vendor amd -gpu 1
2021-03-29:19:10:23:WU00:FS01:0x22:             -gpu-usage 100
2021-03-29:19:10:23:WU00:FS01:0x22:************************************ libFAH ************************************
2021-03-29:19:10:23:WU00:FS01:0x22:       Date: Sep 15 2020
2021-03-29:19:10:23:WU00:FS01:0x22:       Time: 05:14:43
2021-03-29:19:10:23:WU00:FS01:0x22:   Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
2021-03-29:19:10:23:WU00:FS01:0x22:     Branch: HEAD
2021-03-29:19:10:23:WU00:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
2021-03-29:19:10:23:WU00:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
2021-03-29:19:10:23:WU00:FS01:0x22:             -funroll-loops
2021-03-29:19:10:23:WU00:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
2021-03-29:19:10:23:WU00:FS01:0x22:       Bits: 64
2021-03-29:19:10:23:WU00:FS01:0x22:       Mode: Release
2021-03-29:19:10:23:WU00:FS01:0x22:************************************ CBang *************************************
2021-03-29:19:10:23:WU00:FS01:0x22:       Date: Sep 15 2020
2021-03-29:19:10:23:WU00:FS01:0x22:       Time: 05:11:04
2021-03-29:19:10:23:WU00:FS01:0x22:   Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
2021-03-29:19:10:23:WU00:FS01:0x22:     Branch: HEAD
2021-03-29:19:10:23:WU00:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
2021-03-29:19:10:23:WU00:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
2021-03-29:19:10:23:WU00:FS01:0x22:             -funroll-loops -fPIC
2021-03-29:19:10:23:WU00:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
2021-03-29:19:10:23:WU00:FS01:0x22:       Bits: 64
2021-03-29:19:10:23:WU00:FS01:0x22:       Mode: Release
2021-03-29:19:10:23:WU00:FS01:0x22:************************************ System ************************************
2021-03-29:19:10:23:WU00:FS01:0x22:        CPU: AMD Ryzen Threadripper 2990WX 32-Core Processor
2021-03-29:19:10:23:WU00:FS01:0x22:     CPU ID: AuthenticAMD Family 23 Model 8 Stepping 2
2021-03-29:19:10:23:WU00:FS01:0x22:       CPUs: 64
2021-03-29:19:10:23:WU00:FS01:0x22:     Memory: 125.51GiB
2021-03-29:19:10:23:WU00:FS01:0x22:Free Memory: 124.41GiB
2021-03-29:19:10:23:WU00:FS01:0x22:    Threads: POSIX_THREADS
2021-03-29:19:10:23:WU00:FS01:0x22: OS Version: 4.18
2021-03-29:19:10:23:WU00:FS01:0x22:Has Battery: false
2021-03-29:19:10:23:WU00:FS01:0x22: On Battery: false
2021-03-29:19:10:23:WU00:FS01:0x22: UTC Offset: -7
2021-03-29:19:10:23:WU00:FS01:0x22:        PID: 1752
2021-03-29:19:10:23:WU00:FS01:0x22:        CWD: /var/lib/fahclient/work
2021-03-29:19:10:23:WU00:FS01:0x22:************************************ OpenMM ************************************
2021-03-29:19:10:23:WU00:FS01:0x22:   Revision: 189320d0
2021-03-29:19:10:23:WU00:FS01:0x22:********************************************************************************
2021-03-29:19:10:23:WU00:FS01:0x22:Project: 17716 (Run 61, Clone 2, Gen 72)
2021-03-29:19:10:23:WU00:FS01:0x22:Unit: 0x00000000000000000000000000000000
2021-03-29:19:10:23:WU00:FS01:0x22:Digital signatures verified
2021-03-29:19:10:23:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
2021-03-29:19:10:23:WU00:FS01:0x22:Version 0.0.13
2021-03-29:19:10:23:WU00:FS01:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
2021-03-29:19:10:23:WU00:FS01:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
2021-03-29:19:10:23:WU00:FS01:0x22:  XTC frame write interval: 25000 steps (2.5%) [40 total]
2021-03-29:19:10:23:WU00:FS01:0x22:  Global context and integrator variables write interval: disabled
2021-03-29:19:10:23:WU00:FS01:0x22:There are 3 platforms available.
2021-03-29:19:10:23:WU00:FS01:0x22:Platform 0: Reference
2021-03-29:19:10:23:WU00:FS01:0x22:Platform 1: CPU
2021-03-29:19:10:23:WU00:FS01:0x22:Platform 2: OpenCL
2021-03-29:19:10:23:WU00:FS01:0x22:  opencl-device 1 specified
2021-03-29:19:10:28:WU00:FS01:0x22:Attempting to create OpenCL context:
2021-03-29:19:10:28:WU00:FS01:0x22:  Configuring platform OpenCL
2021-03-29:19:10:36:WU00:FS01:0x22:  Using OpenCL on platformId 0 and gpu 1
2021-03-29:19:10:36:WU00:FS01:0x22:Completed 250000 out of 1000000 steps (25%)

Work unit stalls:

Code:

2021-03-29:19:18:11:WU00:FS01:0x22:Completed 330000 out of 1000000 steps (33%)
2021-03-29:19:18:59:WU02:FS00:0xa8:Completed 45000 out of 500000 steps (9%)
2021-03-29:19:19:16:WU03:FS02:0x22:Completed 40000 out of 1000000 steps (4%)
2021-03-29:19:19:48:WU02:FS00:0xa8:Completed 50000 out of 500000 steps (10%)
2021-03-29:19:20:37:WU02:FS00:0xa8:Completed 55000 out of 500000 steps (11%)
2021-03-29:19:21:25:WU03:FS02:0x22:Completed 50000 out of 1000000 steps (5%)
2021-03-29:19:21:26:WU02:FS00:0xa8:Completed 60000 out of 500000 steps (12%)
2021-03-29:19:21:26:WU03:FS02:0x22:Checkpoint completed at step 50000
2021-03-29:19:22:15:WU02:FS00:0xa8:Completed 65000 out of 500000 steps (13%)
2021-03-29:19:23:04:WU02:FS00:0xa8:Completed 70000 out of 500000 steps (14%)
2021-03-29:19:23:37:WU03:FS02:0x22:Completed 60000 out of 1000000 steps (6%)
2021-03-29:19:23:52:WU02:FS00:0xa8:Completed 75000 out of 500000 steps (15%)
2021-03-29:19:24:40:WU02:FS00:0xa8:Completed 80000 out of 500000 steps (16%)
2021-03-29:19:25:28:WU02:FS00:0xa8:Completed 85000 out of 500000 steps (17%)
2021-03-29:19:25:49:WU03:FS02:0x22:Completed 70000 out of 1000000 steps (7%)
2021-03-29:19:26:17:WU02:FS00:0xa8:Completed 90000 out of 500000 steps (18%)
2021-03-29:19:27:05:WU02:FS00:0xa8:Completed 95000 out of 500000 steps (19%)
2021-03-29:19:27:53:WU02:FS00:0xa8:Completed 100000 out of 500000 steps (20%)
2021-03-29:19:27:59:WU03:FS02:0x22:Completed 80000 out of 1000000 steps (8%)
2021-03-29:19:28:32:WU00:FS01:0x22:Watchdog triggered, requesting soft shutdown down
2021-03-29:19:28:41:WU02:FS00:0xa8:Completed 105000 out of 500000 steps (21%)
2021-03-29:19:29:29:WU02:FS00:0xa8:Completed 110000 out of 500000 steps (22%)
2021-03-29:19:30:08:WU03:FS02:0x22:Completed 90000 out of 1000000 steps (9%)
2021-03-29:19:30:17:WU02:FS00:0xa8:Completed 115000 out of 500000 steps (23%)
2021-03-29:19:31:04:WU02:FS00:0xa8:Completed 120000 out of 500000 steps (24%)
2021-03-29:19:31:54:WU02:FS00:0xa8:Completed 125000 out of 500000 steps (25%)
2021-03-29:19:32:19:WU03:FS02:0x22:Completed 100000 out of 1000000 steps (10%)
2021-03-29:19:32:20:WU03:FS02:0x22:Checkpoint completed at step 100000
2021-03-29:19:32:43:WU02:FS00:0xa8:Completed 130000 out of 500000 steps (26%)
2021-03-29:19:33:31:WU02:FS00:0xa8:Completed 135000 out of 500000 steps (27%)
2021-03-29:19:34:19:WU02:FS00:0xa8:Completed 140000 out of 500000 steps (28%)
2021-03-29:19:34:31:WU03:FS02:0x22:Completed 110000 out of 1000000 steps (11%)
2021-03-29:19:35:06:WU02:FS00:0xa8:Completed 145000 out of 500000 steps (29%)
2021-03-29:19:35:54:WU02:FS00:0xa8:Completed 150000 out of 500000 steps (30%)
2021-03-29:19:36:41:WU03:FS02:0x22:Completed 120000 out of 1000000 steps (12%)
2021-03-29:19:36:42:WU02:FS00:0xa8:Completed 155000 out of 500000 steps (31%)
2021-03-29:19:37:29:WU02:FS00:0xa8:Completed 160000 out of 500000 steps (32%)
2021-03-29:19:38:17:WU02:FS00:0xa8:Completed 165000 out of 500000 steps (33%)
2021-03-29:19:38:32:WU00:FS01:0x22:Watchdog shutdown failed, hard shutdown triggered
2021-03-29:19:38:54:WU03:FS02:0x22:Completed 130000 out of 1000000 steps (13%)
2021-03-29:19:39:04:WU02:FS00:0xa8:Completed 170000 out of 500000 steps (34%)
2021-03-29:19:39:52:WU02:FS00:0xa8:Completed 175000 out of 500000 steps (35%)
2021-03-29:19:40:39:WU02:FS00:0xa8:Completed 180000 out of 500000 steps (36%)
2021-03-29:19:41:01:WU03:FS02:0x22:Completed 140000 out of 1000000 steps (14%)
2021-03-29:19:41:26:WU02:FS00:0xa8:Completed 185000 out of 500000 steps (37%)
2021-03-29:19:42:16:WU02:FS00:0xa8:Completed 190000 out of 500000 steps (38%)
2021-03-29:19:43:05:WU02:FS00:0xa8:Completed 195000 out of 500000 steps (39%)
2021-03-29:19:43:13:WU03:FS02:0x22:Completed 150000 out of 1000000 steps (15%)
2021-03-29:19:43:14:WU03:FS02:0x22:Checkpoint completed at step 150000
2021-03-29:19:43:53:WU02:FS00:0xa8:Completed 200000 out of 500000 steps (40%)
2021-03-29:19:44:41:WU02:FS00:0xa8:Completed 205000 out of 500000 steps (41%)
2021-03-29:19:45:21:WU03:FS02:0x22:Completed 160000 out of 1000000 steps (16%)
2021-03-29:19:45:29:WU02:FS00:0xa8:Completed 210000 out of 500000 steps (42%)
2021-03-29:19:46:17:WU02:FS00:0xa8:Completed 215000 out of 500000 steps (43%)
2021-03-29:19:47:06:WU02:FS00:0xa8:Completed 220000 out of 500000 steps (44%)
2021-03-29:19:47:30:WU03:FS02:0x22:Completed 170000 out of 1000000 steps (17%)
2021-03-29:19:47:53:WU02:FS00:0xa8:Completed 225000 out of 500000 steps (45%)
2021-03-29:19:48:13:WARNING:FS01:WU appears to be stalled, ending run
2021-03-29:19:48:13:FS01:Shutting core down
2021-03-29:19:48:41:WU02:FS00:0xa8:Completed 230000 out of 500000 steps (46%)
2021-03-29:19:49:14:WARNING:FS01:Killing WU00
2021-03-29:19:49:14:WARNING:FS01:Killing WU00
...dozens of these...
2021-03-29:19:49:29:WARNING:FS01:Killing WU00
2021-03-29:19:49:29:WARNING:FS01:Killing WU00
2021-03-29:19:49:29:WU02:FS00:0xa8:Completed 235000 out of 500000 steps (47%)
2021-03-29:19:49:29:WARNING:FS01:Killing WU00
2021-03-29:19:49:29:WARNING:FS01:Killing WU00
...dozens of these...
2021-03-29:19:49:39:WARNING:FS01:Killing WU00
2021-03-29:19:49:39:WARNING:FS01:Killing WU00
2021-03-29:19:49:39:WU03:FS02:0x22:Completed 180000 out of 1000000 steps (18%)
2021-03-29:19:49:39:WARNING:FS01:Killing WU00
2021-03-29:19:49:40:WARNING:FS01:Killing WU00
...>100 of these...
2021-03-29:19:50:16:WARNING:FS01:Killing WU00
2021-03-29:19:50:17:WARNING:FS01:Killing WU00
2021-03-29:19:50:17:WU02:FS00:0xa8:Completed 240000 out of 500000 steps (48%)
2021-03-29:19:50:17:WARNING:FS01:Killing WU00
2021-03-29:19:50:17:WARNING:FS01:Killing WU00
...dozens of these...
2021-03-29:19:50:26:WARNING:FS01:Killing WU00
2021-03-29:19:50:26:WARNING:FS01:Killing WU00
2021-03-29:19:50:26:Caught signal SIGTERM(15) on PID 1661
2021-03-29:19:50:26:Exiting, please wait. . .
2021-03-29:19:50:26:WU02:FS00:0xa8:Caught signal SIGTERM(15) on PID 3351
2021-03-29:19:50:26:WU02:FS00:0xa8:Exiting, please wait. . .
2021-03-29:19:50:26:WU03:FS02:0x22:Caught signal SIGTERM(15) on PID 1764
2021-03-29:19:50:26:WU03:FS02:0x22:Exiting, please wait. . .
2021-03-29:19:50:26:WU03:FS02:0x22:Folding@home Core Shutdown: INTERRUPTED
2021-03-29:19:50:27:WU02:FS00:0xa8:Folding@home Core Shutdown: INTERRUPTED
2021-03-29:19:50:28:WARNING:FS01:Killing WU00
2021-03-29:19:50:28:FS02:Shutting core down
2021-03-29:19:50:28:WARNING:FS01:Killing WU00
2021-03-29:19:50:28:WARNING:FS01:Killing WU00
...>1,000 of these...
2021-03-29:19:51:56:WARNING:FS01:Killing WU00
2021-03-29:19:51:56:WARNING:FS02:Killing WU03

This work unit stalled overnight and prevented another GPU work unit on the same system from completing. Both work units kept the GPUs at full power after stalling, and the system had to be rebooted before FAHClient could be restarted. After the reboot, the work unit resumed at 25% and stalled again at 33%, as the log excerpts above show. A second reboot was required to restart the client.

The system is a stock CentOS 8.3 desktop running kernel 4.18.0-240.1.1.el8_3.x86_64 and libopencl-amdgpu-pro.x86_64 19.50. Both newer kernels and newer amdgpu-pro releases have resulted in complete failure to run Folding@home on this hardware. Other work units from project 17716 have shown similar behavior.

What can be done to guard against stalled work units that ignore a soft shutdown command and keep consuming resources indefinitely?
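
For anyone triaging a similar log, the stall pattern in the excerpts can be picked out mechanically. The following is only a minimal Python sketch, not part of FAHClient: it assumes log lines shaped exactly like the ones quoted in this post, and the names `summarize_stalls` and `parse_ts` are hypothetical helpers invented for illustration. It reports how long each slot had gone without a progress line when the watchdog or the client complained, and counts the repeated "Killing WU" warnings per slot.

```python
import re
from collections import Counter
from datetime import datetime

# Patterns are assumptions based only on the log lines quoted in this post.
TS = r"(?P<ts>\d{4}-\d{2}-\d{2}:\d{2}:\d{2}:\d{2})"
CORE = r":WU\d+:(?P<slot>FS\d+):0x[0-9a-f]+:"
PROGRESS = re.compile(TS + CORE + r"Completed \d+ out of \d+ steps")
WATCHDOG = re.compile(TS + CORE + r"Watchdog (?P<what>triggered|shutdown failed)")
STALLED = re.compile(TS + r":WARNING:(?P<slot>FS\d+):WU appears to be stalled")
KILLING = re.compile(TS + r":WARNING:(?P<slot>FS\d+):Killing WU\d+")


def parse_ts(text):
    """Parse the client's 'YYYY-MM-DD:HH:MM:SS' log timestamp."""
    return datetime.strptime(text, "%Y-%m-%d:%H:%M:%S")


def summarize_stalls(lines):
    """Return (events, kills).

    events maps slot -> list of (label, seconds since that slot's last
    progress line); kills counts "Killing WU" warnings per slot.
    """
    last_progress = {}
    events = {}
    kills = Counter()
    for raw in lines:
        line = raw.strip()
        m = PROGRESS.match(line)
        if m:
            last_progress[m["slot"]] = parse_ts(m["ts"])
            continue
        m = KILLING.match(line)
        if m:
            kills[m["slot"]] += 1
            continue
        label = None
        m = WATCHDOG.match(line)
        if m:
            label = "watchdog " + m["what"]
        else:
            m = STALLED.match(line)
            if m:
                label = "client reports stall"
        if m and m["slot"] in last_progress:
            idle = (parse_ts(m["ts"]) - last_progress[m["slot"]]).total_seconds()
            events.setdefault(m["slot"], []).append((label, int(idle)))
    return events, kills


# A few FS01 lines copied from the excerpt above.
SAMPLE = """\
2021-03-29:19:18:11:WU00:FS01:0x22:Completed 330000 out of 1000000 steps (33%)
2021-03-29:19:28:32:WU00:FS01:0x22:Watchdog triggered, requesting soft shutdown down
2021-03-29:19:38:32:WU00:FS01:0x22:Watchdog shutdown failed, hard shutdown triggered
2021-03-29:19:48:13:WARNING:FS01:WU appears to be stalled, ending run
2021-03-29:19:49:14:WARNING:FS01:Killing WU00
2021-03-29:19:49:14:WARNING:FS01:Killing WU00
""".splitlines()

events, kills = summarize_stalls(SAMPLE)
for slot, items in events.items():
    for label, idle in items:
        print(f"{slot}: {label} after {idle} s without progress")
print(dict(kills))
```

On the sample lines this reports the soft shutdown request 621 s after the last progress line, the hard shutdown 1221 s after it, and the client's own stall warning 1802 s after it, so each escalation step in this log came roughly ten minutes after the previous one. The sketch only reads the log. It does not stop the core or release the GPU, which in this case took a reboot.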