
INTERRUPTED Problem, just started recently.

Posted: Fri Apr 29, 2022 9:20 pm
by markfw
I get the output below on multiple computers, from a 5950X running Linux to an EPYC 7B12, all with at least one from processor. This only started recently, with no changes; they run for months at a time before a reboot. Below is a sample of the log:

Code:

21:17:07:WU00:FS01:0x22:Please consider upgrading your client version.
21:17:07:WU00:FS01:0x22:There are 4 platforms available.
21:17:07:WU00:FS01:0x22:Platform 0: Reference
21:17:07:WU00:FS01:0x22:Platform 1: CPU
21:17:07:WU00:FS01:0x22:Platform 2: OpenCL
21:17:07:WU00:FS01:0x22:  opencl-device -1 specified
21:17:07:WU00:FS01:0x22:Platform 3: CUDA
21:17:07:WU00:FS01:0x22:  cuda-device 0 specified
21:17:16:WU00:FS01:0x22:Attempting to create CUDA context:
21:17:16:WU00:FS01:0x22:  Configuring platform CUDA
21:17:32:WU00:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
21:18:07:WU00:FS01:Starting
21:18:07:WU00:FS01:Removing old file 'work/00/logfile_01-20220429-201306.txt'
21:18:07:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit/22-0.0.20/Core_22.fah/FahCore_22 -dir 00 -suffix 01 -version 706 -lifeline 1805 -checkpoint 15 -cuda-device 0 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
21:18:07:WU00:FS01:Started FahCore on PID 61122
21:18:07:WU00:FS01:Core PID:61126
21:18:07:WU00:FS01:FahCore 0x22 started
21:18:07:WU00:FS01:0x22:*********************** Log Started 2022-04-29T21:18:07Z ***********************
21:18:07:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
21:18:07:WU00:FS01:0x22:       Core: Core22
21:18:07:WU00:FS01:0x22:       Type: 0x22
21:18:07:WU00:FS01:0x22:    Version: 0.0.20
21:18:07:WU00:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
21:18:07:WU00:FS01:0x22:  Copyright: 2020 foldingathome.org
21:18:07:WU00:FS01:0x22:   Homepage: https://foldingathome.org/
21:18:07:WU00:FS01:0x22:       Date: Jan 20 2022
21:18:07:WU00:FS01:0x22:       Time: 00:57:52
21:18:07:WU00:FS01:0x22:   Revision: 3f211b8a4346514edbff34e3cb1c0e0ec951373c
21:18:07:WU00:FS01:0x22:     Branch: HEAD
21:18:07:WU00:FS01:0x22:   Compiler: GNU 9.4.0
21:18:07:WU00:FS01:0x22:    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
21:18:07:WU00:FS01:0x22:             -fdata-sections -O3 -funroll-loops -fno-pie
21:18:07:WU00:FS01:0x22:             -DOPENMM_VERSION="\"7.7.0\""
21:18:07:WU00:FS01:0x22:   Platform: linux 5.11.0-1025-azure
21:18:07:WU00:FS01:0x22:       Bits: 64
21:18:07:WU00:FS01:0x22:       Mode: Release
21:18:07:WU00:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
21:18:07:WU00:FS01:0x22:             <peastman@stanford.edu>
21:18:07:WU00:FS01:0x22:       Args: -dir 00 -suffix 01 -version 706 -lifeline 61122 -checkpoint 15
21:18:07:WU00:FS01:0x22:             -cuda-device 0 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
21:18:07:WU00:FS01:0x22:************************************ libFAH ************************************
21:18:07:WU00:FS01:0x22:       Date: Jan 20 2022
21:18:07:WU00:FS01:0x22:       Time: 00:57:22
21:18:07:WU00:FS01:0x22:   Revision: 9f4ad694e75c2350d4bb6b8b5b769ba27e483a2f
21:18:07:WU00:FS01:0x22:     Branch: HEAD
21:18:07:WU00:FS01:0x22:   Compiler: GNU 9.4.0
21:18:07:WU00:FS01:0x22:    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
21:18:07:WU00:FS01:0x22:             -fdata-sections -O3 -funroll-loops -fno-pie
21:18:07:WU00:FS01:0x22:   Platform: linux 5.11.0-1025-azure
21:18:07:WU00:FS01:0x22:       Bits: 64
21:18:07:WU00:FS01:0x22:       Mode: Release
21:18:07:WU00:FS01:0x22:************************************ CBang *************************************
21:18:07:WU00:FS01:0x22:       Date: Jan 20 2022
21:18:07:WU00:FS01:0x22:       Time: 00:57:00
21:18:07:WU00:FS01:0x22:   Revision: ab023d155b446906d55b0f6c9a1eedeea04f7a1a
21:18:07:WU00:FS01:0x22:     Branch: HEAD
21:18:07:WU00:FS01:0x22:   Compiler: GNU 9.4.0
21:18:07:WU00:FS01:0x22:    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
21:18:07:WU00:FS01:0x22:             -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
21:18:07:WU00:FS01:0x22:   Platform: linux 5.11.0-1025-azure
21:18:07:WU00:FS01:0x22:       Bits: 64
21:18:07:WU00:FS01:0x22:       Mode: Release
21:18:07:WU00:FS01:0x22:************************************ System ************************************
21:18:07:WU00:FS01:0x22:        CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
21:18:07:WU00:FS01:0x22:     CPU ID: AuthenticAMD Family 23 Model 1 Stepping 1
21:18:07:WU00:FS01:0x22:       CPUs: 32
21:18:07:WU00:FS01:0x22:     Memory: 31.33GiB
21:18:07:WU00:FS01:0x22:Free Memory: 1.91GiB
21:18:07:WU00:FS01:0x22:    Threads: POSIX_THREADS
21:18:07:WU00:FS01:0x22: OS Version: 4.15
21:18:07:WU00:FS01:0x22:Has Battery: false
21:18:07:WU00:FS01:0x22: On Battery: false
21:18:07:WU00:FS01:0x22: UTC Offset: -7
21:18:07:WU00:FS01:0x22:        PID: 61126
21:18:07:WU00:FS01:0x22:        CWD: /var/lib/fahclient/work
21:18:07:WU00:FS01:0x22:************************************ OpenMM ************************************
21:18:07:WU00:FS01:0x22:    Version: 7.7.0
21:18:07:WU00:FS01:0x22:********************************************************************************
21:18:07:WU00:FS01:0x22:Project: 18201 (Run 7172, Clone 2, Gen 1)
21:18:07:WU00:FS01:0x22:Digital signatures verified
21:18:07:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
21:18:07:WU00:FS01:0x22:Version 0.0.20
21:18:07:WU00:FS01:0x22:  Checkpoint write interval: 25000 steps (2%) [50 total]
21:18:07:WU00:FS01:0x22:  JSON viewer frame write interval: 12500 steps (1%) [100 total]
21:18:07:WU00:FS01:0x22:  XTC frame write interval: 20000 steps (1.6%) [62 total]
21:18:07:WU00:FS01:0x22:  Global context and integrator variables write interval: disabled
21:18:07:WU00:FS01:0x22:No -opencl-device specified; using deprecated -gpu argument as an alias for -opencl-device.
21:18:07:WU00:FS01:0x22:Please consider upgrading your client version.
21:18:07:WU00:FS01:0x22:There are 4 platforms available.
21:18:07:WU00:FS01:0x22:Platform 0: Reference
21:18:07:WU00:FS01:0x22:Platform 1: CPU
21:18:07:WU00:FS01:0x22:Platform 2: OpenCL
21:18:07:WU00:FS01:0x22:  opencl-device -1 specified
21:18:07:WU00:FS01:0x22:Platform 3: CUDA
21:18:07:WU00:FS01:0x22:  cuda-device 0 specified
21:18:16:WU00:FS01:0x22:Attempting to create CUDA context:
21:18:16:WU00:FS01:0x22:  Configuring platform CUDA
21:18:33:WU00:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
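To see how often each slot hits this, the client log can be summarized with a small helper. This is only a sketch: the function name `count_interrupted` is made up for illustration, and the log path in the usage comment is an assumption that may differ on your install.

```shell
#!/usr/bin/env bash
# Count how often each folding slot's core exited with INTERRUPTED.
# Hypothetical helper; pass the path of the client log as the only argument.
count_interrupted() {
    # Log lines look like:
    #   HH:MM:SS:WUnn:FSnn:FahCore returned: INTERRUPTED (102 = 0x66)
    # Splitting on ":" makes field 5 the slot name (FSnn).
    awk -F: '/FahCore returned: INTERRUPTED/ { n[$5]++ }
             END { for (s in n) print s, n[s] }' "$1" | sort
}

# Usage (the path is an assumption, adjust it to your setup):
#   count_interrupted /var/lib/fahclient/log.txt
```

Run against the two attempts shown in the sample above, it would report slot FS01 with a count of 2.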

Re: INTERRUPTED Problem, just started recently.

Posted: Fri Apr 29, 2022 9:21 pm
by markfw
I can't find the edit button. All are running Linux Mint 19.2 Cinnamon, all with 2080 Ti video cards and 510 drivers.

Re: INTERRUPTED Problem, just started recently.

Posted: Fri Apr 29, 2022 11:39 pm
by Joe_H
markfw wrote: Fri Apr 29, 2022 9:21 pm I can't find the edit button. All are running Linux Mint 19.2 Cinnamon, all with 2080 Ti video cards and 510 drivers.
It should show up as the leftmost button in the group at the top right of the post; the symbol is supposed to be a pencil, I guess.

Re: INTERRUPTED Problem, just started recently.

Posted: Sat Apr 30, 2022 12:12 am
by markfw
I found it. And I may have found the problem. Based on my research, the cause is supposed to be "slow CPUs". Well, a 5950X is not slow, BUT only allowing one thread out of 32 to service the GPU is NOT good when the other 31 are running Rosetta@home. I have suspended all Rosetta work until I can verify that it's the problem, but then I have to find out how many threads the GPU needs (it's a 2080 Ti on all 6 boxes I have had an issue on).
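One way to check whether other work is crowding out the GPU core is to list every process that runs at a higher priority (a lower nice value) than it does. A rough sketch, assuming the `ps` from procps; the function name `higher_priority_than` is made up for illustration:

```shell
#!/usr/bin/env bash
# Print the processes whose nice value is below $1, i.e. that have a higher
# priority. Reads "PID NI COMMAND" lines on stdin, as produced by
#   ps -eo pid=,ni=,comm=
higher_priority_than() {
    # ps prints "-" as NI for some scheduling classes; skip those rows.
    awk -v limit="$1" '$2 ~ /^-?[0-9]+$/ && $2 + 0 < limit + 0 { print }'
}

# Usage: everything that outranks a process running at nice 19
#   ps -eo pid=,ni=,comm= | higher_priority_than 19
```

If the Rosetta tasks show up in that list while FahCore_22 does not, they win whenever both want the same CPU thread.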

I am number 28 worldwide on F@H!

And thanks, Joe! I don't know how I missed it.

Re: INTERRUPTED Problem, just started recently.

Posted: Sat Apr 30, 2022 8:46 am
by PaulTV
The FahCore_22 process is single-threaded, afaik, and in top it should show 100% CPU usage (so take a single thread). Aside from that, it'd be good to have a thread reserved for the OS itself. The FahCore_22 process will show a nice value of 19, which is the lowest priority. If the Rosetta processes have a lower nice value (higher priority), that may explain the behavior. On my Linux folding rig, I have a script in cron to re-nice the process.

In /etc/crontab:

Code:

*/5 * * * *   root    /usr/local/bin/renice_fah.sh > /dev/null 2>&1
The script /usr/local/bin/renice_fah.sh itself (don't forget to make it executable) gives FahCore_22 a higher priority than standard user processes, but not as high as the critical OS processes:

Code:

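#!/usr/bin/env bash
# Raise the priority of the running FahCore_22 (GPU core) process.
# Run as root: only root may set a negative nice value.

set -ue

# Target nice value: above normal user processes (0), below critical OS processes
nice_to="-10"

### Get running core 22 process including nice
### (ps -le columns: F S UID PID PPID C PRI NI ...; if several match, use the last)
psline="$(ps -le | grep -e FahCore_22 | grep -v grep | tail -1)"
if [ -z "${psline}" ]
then
        echo "GPU core process not found"
        exit 0
fi
### Column 8 of ps -le is NI, the current nice value
currentnice="$(echo "${psline}" | awk '{ print $8 }')"
if [ "${currentnice}" = "${nice_to}" ]
then
        echo "GPU core process already reniced to ${nice_to}"
        exit 0
fi
### Column 4 of ps -le is the PID
currentpid="$(echo "${psline}" | awk '{ print $4 }')"
echo "Re-nicing GPU Core process ${currentpid}"
renice "${nice_to}" -p "${currentpid}"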
(edited because of a rookie mistake: a nice value of 19 is the lowest priority and -20 is the highest, not the other way around)
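The script above only touches the last FahCore_22 line that ps prints, so on a box with more than one GPU the other cores keep their old nice value. A possible variation, sketched under the same `ps -le` column layout (the function name `pids_to_renice` is made up for illustration, and `xargs -r` assumes GNU xargs):

```shell
#!/usr/bin/env bash
# Print the PID of every FahCore_22 process whose nice value differs from $1.
# Reads `ps -le` output on stdin; its columns are
#   F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
# so $4 is the PID, $8 is NI and the last field is the command name.
pids_to_renice() {
    awk -v want="$1" '$NF ~ /^FahCore_22/ && $8 != want { print $4 }'
}

# Usage (as root); -r makes xargs do nothing when no PID is printed:
#   ps -le | pids_to_renice -10 | xargs -r renice -10 -p
```

Keeping the selection separate from the renice call makes it easy to check which PIDs would be changed before running it as root.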

Re: INTERRUPTED Problem, just started recently.

Posted: Mon May 09, 2022 5:20 pm
by toTOW
PaulTV wrote: Sat Apr 30, 2022 8:46 am The FahCore_22 process is single-threaded, afaik, and in top it should show 100% CPU usage (so take a single thread).
That's not entirely true: the code feeding the GPU is single-threaded, because there's nothing else it could do at the same time, but some parts of the core (the sanity checks and checkpoint writes) are multi-threaded.
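On Linux, the number of threads a core actually has at a given moment can be read from /proc. A minimal sketch; the function name `thread_count` is made up for illustration, and the PID in the usage line is whatever `pgrep` finds on your machine:

```shell
#!/usr/bin/env bash
# Print the current number of threads of the process with PID $1.
thread_count() {
    # The "Threads:" line of /proc/<pid>/status holds the thread count.
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Usage: thread count of the newest FahCore_22 process
#   thread_count "$(pgrep -n -x FahCore_22)"
```

Sampling this a few times during a run, for example around a checkpoint write, should show whether the count rises above one.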

Re: INTERRUPTED Problem, just started recently.

Posted: Wed May 11, 2022 7:57 am
by gunnarre
In any case, the core shouldn't crash just because the CPU is loaded with higher priority tasks. An impact to performance would be expected if that happens, but it shouldn't crash completely. This sounds to me like either a bug in the folding core or the OS - if hardware stability has been eliminated as the cause.