Project 17435 (Run 0, Clone 172, Gen 111) Stalled Before 0%

Postby Whompithian » Tue Mar 30, 2021 7:07 am

Head of log:

*********************** Log Started 2021-03-29T19:10:22Z ***********************
2021-03-29:19:10:22:************************** libFAH **************************
2021-03-29:19:10:22:           Date: Oct 20 2020
2021-03-29:19:10:22:           Time: 20:36:41
2021-03-29:19:10:22:       Revision: 5ca109d295a6245e2a2f590b3d0085ad5e567aeb
2021-03-29:19:10:22:         Branch: master
2021-03-29:19:10:22:       Compiler: GNU 4.9.4
2021-03-29:19:10:22:        Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections
2021-03-29:19:10:22:                 -O3 -funroll-loops
2021-03-29:19:10:22:       Platform: linux2 5.8.0-1-amd64
2021-03-29:19:10:22:           Bits: 64
2021-03-29:19:10:22:           Mode: Release
2021-03-29:19:10:22:************************ FAHClient *************************
2021-03-29:19:10:22:        Version: 7.6.21
2021-03-29:19:10:22:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
2021-03-29:19:10:22:      Copyright: 2020 foldingathome.org
2021-03-29:19:10:22:       Homepage: https://foldingathome.org/
2021-03-29:19:10:22:           Date: Oct 20 2020
2021-03-29:19:10:22:           Time: 20:38:59
2021-03-29:19:10:22:       Revision: 6efbf0e138e22d3963e6a291f78dcb9c6422a278
2021-03-29:19:10:22:         Branch: master
2021-03-29:19:10:22:       Compiler: GNU 4.9.4
2021-03-29:19:10:22:        Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections
2021-03-29:19:10:22:                 -O3 -funroll-loops
2021-03-29:19:10:22:       Platform: linux2 5.8.0-1-amd64
2021-03-29:19:10:22:           Bits: 64
2021-03-29:19:10:22:           Mode: Release
2021-03-29:19:10:22:           Args: --config=/etc/fahclient/config.xml --chdir=/var/lib/fahclient/
2021-03-29:19:10:22:         Config: /etc/fahclient/config.xml
2021-03-29:19:10:22:************************** CBang ***************************
2021-03-29:19:10:22:           Date: Oct 20 2020
2021-03-29:19:10:22:           Time: 18:38:01
2021-03-29:19:10:22:       Revision: 7e4ce85225d7eaeb775e87c31740181ca603de60
2021-03-29:19:10:22:         Branch: master
2021-03-29:19:10:22:       Compiler: GNU 4.9.4
2021-03-29:19:10:22:        Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections
2021-03-29:19:10:22:                 -O3 -funroll-loops -fPIC
2021-03-29:19:10:22:       Platform: linux2 5.8.0-1-amd64
2021-03-29:19:10:22:           Bits: 64
2021-03-29:19:10:22:           Mode: Release
2021-03-29:19:10:22:************************** System **************************
2021-03-29:19:10:22:            CPU: AMD Ryzen Threadripper 2990WX 32-Core Processor
2021-03-29:19:10:22:         CPU ID: AuthenticAMD Family 23 Model 8 Stepping 2
2021-03-29:19:10:22:           CPUs: 64
2021-03-29:19:10:22:         Memory: 125.51GiB
2021-03-29:19:10:22:    Free Memory: 124.62GiB
2021-03-29:19:10:22:        Threads: POSIX_THREADS
2021-03-29:19:10:22:     OS Version: 4.18
2021-03-29:19:10:22:    Has Battery: false
2021-03-29:19:10:22:     On Battery: false
2021-03-29:19:10:22:     UTC Offset: -7
2021-03-29:19:10:22:            PID: 1661
2021-03-29:19:10:22:            CWD: /var/lib/fahclient
2021-03-29:19:10:22:             OS: Linux 4.18.0-240.1.1.el8_3.x86_64 x86_64
2021-03-29:19:10:22:        OS Arch: AMD64
2021-03-29:19:10:22:           GPUs: 2
2021-03-29:19:10:22:          GPU 0: Bus:10 Slot:0 Func:0 AMD:5 Vega 10 XL/XT [Radeon RX Vega 56/64]
2021-03-29:19:10:22:          GPU 1: Bus:69 Slot:0 Func:0 AMD:5 Vega 10 XL/XT [Radeon RX Vega 56/64]
2021-03-29:19:10:22:           CUDA: Not detected: Failed to open dynamic library 'libcuda.so':
2021-03-29:19:10:22:                 libcuda.so: cannot open shared object file: No such file or
2021-03-29:19:10:22:                 directory
2021-03-29:19:10:22:OpenCL Device 0: Platform:0 Device:0 Bus:69 Slot:0 Compute:2.0 Driver:3004.6
2021-03-29:19:10:22:OpenCL Device 1: Platform:0 Device:1 Bus:10 Slot:0 Compute:2.0 Driver:3004.6
2021-03-29:19:10:22:************************************************************
2021-03-29:19:10:22:<config>
2021-03-29:19:10:22:  <!-- Error Handling -->
2021-03-29:19:10:22:  <max-slot-errors v='20'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Folding Core -->
2021-03-29:19:10:22:  <checkpoint v='5'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Folding Slot Configuration -->
2021-03-29:19:10:22:  <client-type v='advanced'/>
2021-03-29:19:10:22:  <cpus v='54'/>
2021-03-29:19:10:22:  <disable-viz v='true'/>
2021-03-29:19:10:22:  <max-packet-size v='big'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- GUI -->
2021-03-29:19:10:22:  <gui-enabled v='false'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- HTTP Server -->
2021-03-29:19:10:22:  <max-connect-time v='604800'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Logging -->
2021-03-29:19:10:22:  <log-date v='true'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Remote Command Server -->
2021-03-29:19:10:22:  <command-address v='127.0.0.1'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Slot Control -->
2021-03-29:19:10:22:  <pause-on-battery v='false'/>
2021-03-29:19:10:22:  <power v='FULL'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- User Information -->
2021-03-29:19:10:22:  <passkey v='*****'/>
2021-03-29:19:10:22:  <team v='40524'/>
2021-03-29:19:10:22:  <user v='Whompithian'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Web Server Sessions -->
2021-03-29:19:10:22:  <session-lifetime v='0'/>
2021-03-29:19:10:22:  <session-timeout v='0'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Work Unit Control -->
2021-03-29:19:10:22:  <stall-detection-enabled v='true'/>
2021-03-29:19:10:22:
2021-03-29:19:10:22:  <!-- Folding Slots -->
2021-03-29:19:10:22:  <slot id='0' type='CPU'/>
2021-03-29:19:10:22:  <slot id='1' type='GPU'>
2021-03-29:19:10:22:    <pci-bus v='10'/>
2021-03-29:19:10:22:    <pci-slot v='0'/>
2021-03-29:19:10:22:  </slot>
2021-03-29:19:10:22:  <slot id='2' type='GPU'>
2021-03-29:19:10:22:    <pci-bus v='69'/>
2021-03-29:19:10:22:    <pci-slot v='0'/>
2021-03-29:19:10:22:  </slot>
2021-03-29:19:10:22:</config>
2021-03-29:19:10:22:Trying to access database...
2021-03-29:19:10:22:Successfully acquired database lock
2021-03-29:19:10:22:FS00:Initialized folding slot 00: cpu:54
2021-03-29:19:10:22:FS01:Initialized folding slot 01: gpu:10:0 Vega 10 XL/XT [Radeon RX Vega 56/64]
2021-03-29:19:10:22:FS02:Initialized folding slot 02: gpu:69:0 Vega 10 XL/XT [Radeon RX Vega 56/64]

Work unit details and stall:

2021-03-29:23:05:23:WU01:FS02:0x22:*********************** Log Started 2021-03-29T23:05:22Z ***********************
2021-03-29:23:05:23:WU01:FS02:0x22:*************************** Core22 Folding@home Core ***************************
2021-03-29:23:05:23:WU01:FS02:0x22:       Core: Core22
2021-03-29:23:05:23:WU01:FS02:0x22:       Type: 0x22
2021-03-29:23:05:23:WU01:FS02:0x22:    Version: 0.0.13
2021-03-29:23:05:23:WU01:FS02:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
2021-03-29:23:05:23:WU01:FS02:0x22:  Copyright: 2020 foldingathome.org
2021-03-29:23:05:23:WU01:FS02:0x22:   Homepage: https://foldingathome.org/
2021-03-29:23:05:23:WU01:FS02:0x22:       Date: Sep 19 2020
2021-03-29:23:05:23:WU01:FS02:0x22:       Time: 01:10:35
2021-03-29:23:05:23:WU01:FS02:0x22:   Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
2021-03-29:23:05:23:WU01:FS02:0x22:     Branch: core22-0.0.13
2021-03-29:23:05:23:WU01:FS02:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
2021-03-29:23:05:23:WU01:FS02:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
2021-03-29:23:05:23:WU01:FS02:0x22:             -funroll-loops -DOPENMM_GIT_HASH="\"189320d0\""
2021-03-29:23:05:23:WU01:FS02:0x22:   Platform: linux2 4.19.76-linuxkit
2021-03-29:23:05:23:WU01:FS02:0x22:       Bits: 64
2021-03-29:23:05:23:WU01:FS02:0x22:       Mode: Release
2021-03-29:23:05:23:WU01:FS02:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
2021-03-29:23:05:23:WU01:FS02:0x22:             <peastman@stanford.edu>
2021-03-29:23:05:23:WU01:FS02:0x22:       Args: -dir 01 -suffix 01 -version 706 -lifeline 4216 -checkpoint 5
2021-03-29:23:05:23:WU01:FS02:0x22:             -opencl-platform 0 -opencl-device 0 -gpu-vendor amd -gpu 0
2021-03-29:23:05:23:WU01:FS02:0x22:             -gpu-usage 100
2021-03-29:23:05:23:WU01:FS02:0x22:************************************ libFAH ************************************
2021-03-29:23:05:23:WU01:FS02:0x22:       Date: Sep 15 2020
2021-03-29:23:05:23:WU01:FS02:0x22:       Time: 05:14:43
2021-03-29:23:05:23:WU01:FS02:0x22:   Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
2021-03-29:23:05:23:WU01:FS02:0x22:     Branch: HEAD
2021-03-29:23:05:23:WU01:FS02:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
2021-03-29:23:05:23:WU01:FS02:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
2021-03-29:23:05:23:WU01:FS02:0x22:             -funroll-loops
2021-03-29:23:05:23:WU01:FS02:0x22:   Platform: linux2 4.19.76-linuxkit
2021-03-29:23:05:23:WU01:FS02:0x22:       Bits: 64
2021-03-29:23:05:23:WU01:FS02:0x22:       Mode: Release
2021-03-29:23:05:23:WU01:FS02:0x22:************************************ CBang *************************************
2021-03-29:23:05:23:WU01:FS02:0x22:       Date: Sep 15 2020
2021-03-29:23:05:23:WU01:FS02:0x22:       Time: 05:11:04
2021-03-29:23:05:23:WU01:FS02:0x22:   Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
2021-03-29:23:05:23:WU01:FS02:0x22:     Branch: HEAD
2021-03-29:23:05:23:WU01:FS02:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
2021-03-29:23:05:23:WU01:FS02:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
2021-03-29:23:05:23:WU01:FS02:0x22:             -funroll-loops -fPIC
2021-03-29:23:05:23:WU01:FS02:0x22:   Platform: linux2 4.19.76-linuxkit
2021-03-29:23:05:23:WU01:FS02:0x22:       Bits: 64
2021-03-29:23:05:23:WU01:FS02:0x22:       Mode: Release
2021-03-29:23:05:23:WU01:FS02:0x22:************************************ System ************************************
2021-03-29:23:05:23:WU01:FS02:0x22:        CPU: AMD Ryzen Threadripper 2990WX 32-Core Processor
2021-03-29:23:05:23:WU01:FS02:0x22:     CPU ID: AuthenticAMD Family 23 Model 8 Stepping 2
2021-03-29:23:05:23:WU01:FS02:0x22:       CPUs: 64
2021-03-29:23:05:23:WU01:FS02:0x22:     Memory: 125.51GiB
2021-03-29:23:05:23:WU01:FS02:0x22:Free Memory: 122.25GiB
2021-03-29:23:05:23:WU01:FS02:0x22:    Threads: POSIX_THREADS
2021-03-29:23:05:23:WU01:FS02:0x22: OS Version: 4.18
2021-03-29:23:05:23:WU01:FS02:0x22:Has Battery: false
2021-03-29:23:05:23:WU01:FS02:0x22: On Battery: false
2021-03-29:23:05:23:WU01:FS02:0x22: UTC Offset: -7
2021-03-29:23:05:23:WU01:FS02:0x22:        PID: 4220
2021-03-29:23:05:23:WU01:FS02:0x22:        CWD: /var/lib/fahclient/work
2021-03-29:23:05:23:WU01:FS02:0x22:************************************ OpenMM ************************************
2021-03-29:23:05:23:WU01:FS02:0x22:   Revision: 189320d0
2021-03-29:23:05:23:WU01:FS02:0x22:********************************************************************************
2021-03-29:23:05:23:WU01:FS02:0x22:Project: 17435 (Run 0, Clone 172, Gen 111)
2021-03-29:23:05:23:WU01:FS02:0x22:Unit: 0x00000000000000000000000000000000
2021-03-29:23:05:23:WU01:FS02:0x22:Reading tar file core.xml
2021-03-29:23:05:23:WU01:FS02:0x22:Reading tar file integrator.xml.bz2
2021-03-29:23:05:23:WU01:FS02:0x22:Reading tar file state.xml.bz2
2021-03-29:23:05:23:WU01:FS02:0x22:Reading tar file system.xml.bz2
2021-03-29:23:05:23:WU01:FS02:0x22:Digital signatures verified
2021-03-29:23:05:23:WU01:FS02:0x22:Folding@home GPU Core22 Folding@home Core
2021-03-29:23:05:23:WU01:FS02:0x22:Version 0.0.13
2021-03-29:23:05:23:WU01:FS02:0x22:  Checkpoint write interval: 25000 steps (2%) [50 total]
2021-03-29:23:05:23:WU01:FS02:0x22:  JSON viewer frame write interval: 12500 steps (1%) [100 total]
2021-03-29:23:05:23:WU01:FS02:0x22:  XTC frame write interval: 10000 steps (0.8%) [125 total]
2021-03-29:23:05:23:WU01:FS02:0x22:  Global context and integrator variables write interval: disabled
2021-03-29:23:05:23:WU01:FS02:0x22:There are 3 platforms available.
2021-03-29:23:05:23:WU01:FS02:0x22:Platform 0: Reference
2021-03-29:23:05:23:WU01:FS02:0x22:Platform 1: CPU
2021-03-29:23:05:23:WU01:FS02:0x22:Platform 2: OpenCL
2021-03-29:23:05:23:WU01:FS02:0x22:  opencl-device 0 specified
2021-03-29:23:05:28:WU03:FS02:Upload 56.24%
2021-03-29:23:05:33:WU03:FS02:Upload complete
2021-03-29:23:05:33:WU03:FS02:Server responded WORK_ACK (400)
2021-03-29:23:05:33:WU03:FS02:Final credit estimate, 97604.00 points
2021-03-29:23:05:33:WU03:FS02:Cleaning up
2021-03-29:23:05:42:WU01:FS02:0x22:Attempting to create OpenCL context:
2021-03-29:23:05:42:WU01:FS02:0x22:  Configuring platform OpenCL
2021-03-29:23:06:03:WU00:FS00:0xa8:Completed 2300000 out of 5000000 steps (46%)
2021-03-29:23:06:33:WU02:FS01:0x22:Completed 540000 out of 1000000 steps (54%)
2021-03-29:23:07:40:WU00:FS00:0xa8:Completed 2350000 out of 5000000 steps (47%)
2021-03-29:23:08:37:WU02:FS01:0x22:Completed 550000 out of 1000000 steps (55%)
2021-03-29:23:08:37:WU02:FS01:0x22:Checkpoint completed at step 550000
2021-03-29:23:09:12:WU00:FS00:0xa8:Completed 2400000 out of 5000000 steps (48%)
2021-03-29:23:10:42:WU02:FS01:0x22:Completed 560000 out of 1000000 steps (56%)
2021-03-29:23:10:49:WU00:FS00:0xa8:Completed 2450000 out of 5000000 steps (49%)
2021-03-29:23:12:23:WU00:FS00:0xa8:Completed 2500000 out of 5000000 steps (50%)
2021-03-29:23:12:48:WU02:FS01:0x22:Completed 570000 out of 1000000 steps (57%)
2021-03-29:23:13:59:WU00:FS00:0xa8:Completed 2550000 out of 5000000 steps (51%)
2021-03-29:23:14:51:WU02:FS01:0x22:Completed 580000 out of 1000000 steps (58%)
2021-03-29:23:15:37:WU00:FS00:0xa8:Completed 2600000 out of 5000000 steps (52%)
2021-03-29:23:16:54:WU02:FS01:0x22:Completed 590000 out of 1000000 steps (59%)
2021-03-29:23:17:14:WU00:FS00:0xa8:Completed 2650000 out of 5000000 steps (53%)
2021-03-29:23:18:50:WU00:FS00:0xa8:Completed 2700000 out of 5000000 steps (54%)
2021-03-29:23:18:56:WU02:FS01:0x22:Completed 600000 out of 1000000 steps (60%)
2021-03-29:23:18:57:WU02:FS01:0x22:Checkpoint completed at step 600000

Work unit refuses to release GPU:

2021-03-30:00:55:05:Caught signal SIGTERM(15) on PID 1698
2021-03-30:00:55:05:Exiting, please wait. . .
2021-03-30:00:55:05:WU01:FS02:0x22:Caught signal SIGTERM(15) on PID 4220
2021-03-30:00:55:05:WU01:FS02:0x22:Exiting, please wait. . .
2021-03-30:00:55:05:WU00:FS01:0x22:Caught signal SIGTERM(15) on PID 4707
2021-03-30:00:55:05:WU00:FS01:0x22:Exiting, please wait. . .
2021-03-30:00:55:05:WU00:FS01:0x22:Folding@home Core Shutdown: INTERRUPTED
2021-03-30:00:55:05:WU03:FS00:0xa7:Caught signal SIGTERM(15) on PID 4643
2021-03-30:00:55:05:WU03:FS00:0xa7:Exiting, please wait. . .
2021-03-30:00:55:06:FS02:Shutting core down
2021-03-30:00:56:07:WARNING:FS02:Killing WU01
2021-03-30:00:56:07:WARNING:FS02:Killing WU01
...thousands of these...
2021-03-30:00:56:34:WARNING:FS02:Killing WU01
2021-03-30:00:56:35:WARNING:FS02:Killing WU01

systemd unable to stop FAHClient:

Mar 29 17:55:05 folding.home systemd[1]: Stopping Folding@home V7 Client...
Mar 29 17:56:35 folding.home systemd[1]: FAHClient.service: State 'stop-sigterm' timed out. Killing.
Mar 29 17:56:35 folding.home systemd[1]: FAHClient.service: Killing process 1698 (FAHClient) with signal SIGKILL.
Mar 29 17:56:35 folding.home systemd[1]: FAHClient.service: Killing process 4220 (FahCore_22) with signal SIGKILL.
Mar 29 17:58:05 folding.home systemd[1]: FAHClient.service: Processes still around after SIGKILL. Ignoring.
Mar 29 17:59:35 folding.home systemd[1]: FAHClient.service: State 'stop-final-sigterm' timed out. Killing.
Mar 29 17:59:35 folding.home systemd[1]: FAHClient.service: Killing process 1698 (FAHClient) with signal SIGKILL.
Mar 29 17:59:35 folding.home systemd[1]: FAHClient.service: Killing process 4220 (FahCore_22) with signal SIGKILL.
Mar 29 18:01:05 folding.home systemd[1]: FAHClient.service: Processes still around after final SIGKILL. Entering failed mode.
Mar 29 18:01:05 folding.home systemd[1]: FAHClient.service: Failed with result 'timeout'.
Mar 29 18:01:05 folding.home systemd[1]: Stopped Folding@home V7 Client.

This work unit stalled as soon as it had a lock on the GPU, which it ran to ~100% capacity. The GPU continued to be utilized and a shutdown of the system stalled for more than 30 minutes, at which time it was hard reset. The work unit had to be dumped manually before folding could resume. The system is a stock CentOS 8.3 desktop running kernel 4.18.0-240.1.1.el8_3.x86_64 and libopencl-amdgpu-pro.x86_64 19.50. Both newer kernels and newer amdgpu-pro have resulted in complete failure to run Folding@home on this hardware. Other work units from project 17435 and its series have demonstrated similar behavior. What can be done to guard against a work unit that misbehaves so badly as to consume 100% GPU - without doing any real work - while preventing a system from shutting down?
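One rough way to spot this kind of stall from the log is sketched below. It assumes the log line layout shown above (colon-separated, with a `WUxx:FSxx` pair in each core line); the function name `find_stalled` is hypothetical and not part of FAHClient. It lists every work unit that began creating an OpenCL context but has not logged a "Completed ... steps" line since.

```shell
#!/usr/bin/bash
# Hypothetical helper: print WU/slot pairs that logged
# "Attempting to create OpenCL context" with no later "Completed ... steps" line.
# Reads the named log files, or standard input when no file is given.
find_stalled() {
  awk -F: '
    {
      # Locate the WUxx:FSxx pair that identifies the work unit and slot.
      key = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^WU[0-9]+$/ && $(i + 1) ~ /^FS[0-9]+$/) {
          key = $i ":" $(i + 1)
          break
        }
      }
      if (key == "") next
    }
    /Attempting to create OpenCL context/   { started[key] = 1 }
    /Completed [0-9]+ out of [0-9]+ steps/  { delete started[key] }
    END { for (k in started) print k }
  ' "$@" | sort
}

# Example call (the log path is an assumption based on the CWD shown in the log):
#   find_stalled /var/lib/fahclient/log.txt
```

This only looks at message order, so a work unit that started a few seconds ago would also be listed. A real watchdog would have to compare timestamps against some threshold before acting.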

viewtopic.php?f=108&t=36871#p349658
bruce wrote:Show us the log of what conditions led to 100% utilization with zero progress after a day. Something is seriously wrong and we want to prevent that from happening to others.
Whompithian
 
Posts: 26
Joined: Thu Jun 25, 2020 1:40 am

Re: Project 17435 (Run 0, Clone 172, Gen 111) Stalled Before

Postby Whompithian » Wed Apr 07, 2021 6:54 am

No one seems bothered by the fact that what I just described is the behavior of malware: a program entices a user to run it with the promise of some benefit but, when run, it consumes system resources without providing the purported benefit and, instead, causes undesired system behavior. I am bothered by the prospect of the assignment servers returning malware to my system, so I have split each slot into its own running FAHClient instance and set up a systemd unit to run the following script:

#!/usr/bin/bash
# Follow one FAHClient instance's journal and dump any work unit that belongs
# to a project listed in DUMP_PROJECT.

TOKEN=":Project:"
SYSTEMD_UNIT="fahclient@${1}.service"
FAH_CONF="/etc/fahclient/${1}.xml"
FAH_DIR="/var/lib/fahclient/${1}/"
FAH_USER="fahclient"

FDELIM=":"
PDELIM=" "
PFIELD=2
WUDELIM="U"
WUFIELD=2
SLEEP_TIME=5

ERR_SYSTEMD_INSTANCE=1

declare -A VALID_INSTANCE
VALID_INSTANCE["gpu0"]="true"
VALID_INSTANCE["gpu1"]="true"

declare -A DUMP_PROJECT
DUMP_PROJECT["17433"]="true"
DUMP_PROJECT["17434"]="true"
DUMP_PROJECT["17435"]="true"

# Only run for a known instance name. The lookup is quoted so that an unknown
# name compares as an empty string instead of breaking the test.
[ -n "${1}" ] && [ "${VALID_INSTANCE[${1}]}" == "true" ] || exit ${ERR_SYSTEMD_INSTANCE}

/usr/bin/journalctl --follow --output=cat --unit="${SYSTEMD_UNIT}" --lines=1 | while read -r line
do
  # Only lines that announce a project assignment are of interest.
  if /usr/bin/grep --quiet "${TOKEN}" <<< "${line}"
  then
    project=$(/usr/bin/cut --delimiter="${PDELIM}" --fields=${PFIELD} <<< "${line}")
    if [ -n "${project}" ] && [ "${DUMP_PROJECT[${project}]}" == "true" ]
    then
      # The work unit ID is the text between "WU" and the next colon.
      wu=$(/usr/bin/cut --delimiter="${WUDELIM}" --fields=${WUFIELD} <<< "${line}" | /usr/bin/cut --delimiter="${FDELIM}" --fields=1)
      /usr/bin/systemctl stop "${SYSTEMD_UNIT}"
      # cd must be the shell builtin; /usr/bin/cd runs in a child process and
      # would leave this shell's working directory unchanged.
      cd "${FAH_DIR}"
      /usr/bin/sleep ${SLEEP_TIME}
      /usr/bin/sudo --user="${FAH_USER}" /usr/bin/FAHClient --config="${FAH_CONF}" --chdir="${FAH_DIR}" --dump "${wu}"
      /usr/bin/sleep ${SLEEP_TIME}
      /usr/bin/systemctl start "${SYSTEMD_UNIT}"
    fi
  fi
done

If anyone has a solution that isn't so blunt, especially if it saves me from an unnecessary download, I would be glad to try it out.
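The field extraction in the script above can be checked in isolation. The sketch below assumes the journal line has the same layout as the core log lines earlier in the thread; `parse_project` and `parse_wu` are hypothetical helper names that mirror the two `cut` pipelines.

```shell
#!/usr/bin/bash
# Hypothetical helpers mirroring the cut-based parsing in the script above.

# Print the project number: the second space-separated field of the line.
parse_project() {
  cut -d " " -f 2 <<< "${1}"
}

# Print the work unit ID: the text after the first "U", up to the next colon.
parse_wu() {
  cut -d "U" -f 2 <<< "${1}" | cut -d ":" -f 1
}

# Sample line copied from the log above, without the date prefix (an assumption
# about what journalctl --output=cat delivers).
sample='23:05:23:WU01:FS02:0x22:Project: 17435 (Run 0, Clone 172, Gen 111)'
echo "project=$(parse_project "${sample}") wu=$(parse_wu "${sample}")"
```

For the sample line this prints `project=17435 wu=01`. The parsing is fragile by design: an upper-case "U" anywhere before "WU", or an extra space before the project number, would shift the fields.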

Postby iero » Wed Apr 07, 2021 3:34 pm

I've been getting some WUs from the same project on my RX 480 lately. They seem to hit the GPU hard [99% usage, not typical of the RX 480] and they have a long TPF of 5 min 10 s. They have bumped my PPD to +20% levels, around 500K.
GPU only
Rig1: Ryzen 3 3100/RTX 3060 Gaming OC 12GB/16GB 3200MHz-Aiming for 24/7
Rig2: Ryzen 3 2200g/RX 480 Gaming X 4GB/8GB 3000MHz-Aiming for 12/7
Folding since 05/02/2021
iero
 
Posts: 105
Joined: Tue Feb 09, 2021 11:40 am

Postby BobWilliams757 » Thu Apr 08, 2021 2:47 am

Whompithian wrote:No one seems bothered by the fact that what I just described is the behavior of malware: a program entices a user to run it with the promise of some benefit but, when run, it consumes system resources without providing the purported benefit and, instead, causes undesired system behavior. I am bothered by the prospect of the assignment servers returning malware to my system, so I have split each slot into its own running FAHClient instance and set up a systemd unit to run the following script:

If anyone has a solution that isn't so blunt, especially if it saves me from an unnecessary download, I would be glad to try it out.


I highly doubt the servers have anything to do with the issues. If FAH servers were feeding us malware I'm sure the folding base would diminish quickly. With so many various types of hardware that can be used for folding, they can't guarantee stability in all systems, but they certainly aren't the cause of system issues either. I'm folding on a very limited resource APU, and it has done 50+ work units in that series with no issues and 100% completion rate. On this system, the vast majority of work units will push the GPU to 100%, so there is no change there.

I run a simple Windows system, so I can't help you with software. But often the details of drivers, versions, updates, etc. all come into the picture when folding, especially when there are a lot of versions floating around and still in use. There were offers of help in the previous thread you linked, but you didn't have the information they needed to help. But I agree.... if you have a system that won't run work units properly, the issue is with the system, and fixing that should take priority over making it easier to dump work units.

It seems you are running a rather unconventional setup, so for that reason the amount of help people might be able to provide is likely limited. Running multiple instances of the client is likely just making things harder IMHO. On my system the only time I've had any stability problems was when I realized that most benchmark/testing hardware will allow some instability if you push overclocks too far, where FAH will find those flaws quicker. Any needs to pause, finish, restart, or resume slots works every time without issue.
BobWilliams757
 
Posts: 137
Joined: Fri Apr 03, 2020 3:22 pm

Postby Whompithian » Thu Apr 08, 2021 6:56 am

BobWilliams757 wrote:It seems you are running a rather unconventional setup, so for that reason the amount of help people might be able to provide is likely limited.

A Threadripper with dual Vega 64 running CentOS is hardly unconventional for a folding system. I just want to track down the cause of the problem my system has with specific work units. Since I have not been able to do that and have not received any constructive guidance on where to begin, I resorted to this convoluted multi-client hack that, frankly, was not worth the amount of time I put into it. In fact, I have dedicated far too much of my time to trying to run a stable folding system over the past year, ever since it was decided that a working OpenCL implementation is not sufficient to allow a system to fold. Instead, it is up to the client's broken OpenCL detection to decide whether the system can fold, regardless of how well the cores that do the actual work are able to run on the system's OpenCL. Before that change, when I had an OpenCL implementation that ran well on my system despite FAHClient not detecting it, my system was rock solid and had to dump very few work units, none of which had to be dumped manually. Since that change, I have not been able to find a version of OpenCL that is both stable and visible to FAHClient, and I just want to find a solution to that artificially crafted problem.