Linux CPU stall with corresponding "Clock Skew" warning.

emf
Posts: 16
Joined: Tue Apr 21, 2020 1:35 pm

Post by emf »

Hey folks,

Ever since core 22 0.0.13 dropped on my machine, some WUs (mostly project 13426) have been hitting some kind of weird bug. The GPUs (GTX 1060 6GB) go off into la-la land and the entire machine freezes. It _usually_ comes back with a message similar to:

Code:

kernel: [3036338.798684] watchdog: BUG: soft lockup - CPU#3 stuck for 92s! [FahCore_22:11492]
and a corresponding message in the fahclient log like:

Code:

WARNING:WU02:FS00:Detected clock skew (25 mins 00 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
Sometimes, if it's a short stall, it only prints a

Code:

ERROR:Receive error: 110: Connection timed out
from losing the fahcontrol app socket from a different system.

Dunno what's going on with these WU's, but it's a mess. This machine doesn't have any power management crap and _was_ working pretty well until the 26th.
Is there anything I can do here to debug? (also, the machine is not running any CPU slots; just the two GPU slots, as it's an otherwise wimpy machine and it's all it can do to keep the GPU's fed.)
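
For debugging, one cheap first step is to tally which CPU the kernel watchdog names in each soft lockup line; if the same core keeps showing up, that hints at the CPU rather than the GPUs or the WU. Below is a minimal Python sketch, assuming the kernel log lines look like the one quoted above. `count_soft_lockups` is a made-up helper name, not part of any FAH tooling.

```python
import re
from collections import Counter

# Matches kernel watchdog lines such as:
#   watchdog: BUG: soft lockup - CPU#3 stuck for 92s! [FahCore_22:11492]
SOFT_LOCKUP = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s!")


def count_soft_lockups(lines):
    """Return a Counter mapping CPU number -> number of soft lockup reports."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # The only real sample available is the line quoted in this post.
    sample = [
        "kernel: [3036338.798684] watchdog: BUG: soft lockup - CPU#3 "
        "stuck for 92s! [FahCore_22:11492]",
    ]
    for cpu, hits in count_soft_lockups(sample).most_common():
        print(f"CPU#{cpu}: {hits} soft lockup report(s)")
```

Feeding it the output of `dmesg` or `journalctl -k`, split into lines, gives a per-core count.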

System info:

Code:

12:37:45:****************************** FAHClient ******************************
12:37:45:        Version: 7.6.9
12:37:45:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
12:37:45:      Copyright: 2020 foldingathome.org
12:37:45:       Homepage: https://foldingathome.org/
12:37:45:           Date: Apr 17 2020
12:37:45:           Time: 18:11:26
12:37:45:       Revision: 398c2b17fa535e0cc6c9d10856b2154c32771646
12:37:45:         Branch: master
12:37:45:       Compiler: GNU 8.3.0
12:37:45:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
12:37:45:                 -funroll-loops -fno-pie
12:37:45:       Platform: linux2 4.19.0-5-amd64
12:37:45:           Bits: 64
12:37:45:           Mode: Release
12:37:45:           Args: --child /etc/fahclient/config.xml --run-as fahclient
12:37:45:                 --pid-file=/var/run/fahclient.pid --daemon
12:37:45:         Config: /etc/fahclient/config.xml
12:37:45:******************************** CBang ********************************
12:37:45:           Date: Apr 17 2020
12:37:45:           Time: 18:10:13
12:37:45:       Revision: 2fb0be7809c5e45287a122ca5fbc15b5ae859a3b
12:37:45:         Branch: master
12:37:45:       Compiler: GNU 8.3.0
12:37:45:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
12:37:45:                 -funroll-loops -fno-pie -fPIC
12:37:45:       Platform: linux2 4.19.0-5-amd64
12:37:45:           Bits: 64
12:37:45:           Mode: Release
12:37:45:******************************* System ********************************
12:37:45:            CPU: Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
12:37:45:         CPU ID: GenuineIntel Family 6 Model 23 Stepping 6
12:37:45:           CPUs: 4
12:37:45:         Memory: 31.41GiB
12:37:45:    Free Memory: 30.70GiB
12:37:45:        Threads: POSIX_THREADS
12:37:45:     OS Version: 4.15
12:37:45:    Has Battery: false
12:37:45:     On Battery: false
12:37:45:     UTC Offset: 0
12:37:45:            PID: 1551
12:37:45:            CWD: /var/lib/fahclient
12:37:45:             OS: Linux 4.15.0-109-generic x86_64
12:37:45:        OS Arch: AMD64
12:37:45:           GPUs: 2
12:37:45:          GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:7 GP106 [GeForce GTX 1060 6GB] 4372
12:37:45:          GPU 1: Bus:5 Slot:0 Func:0 NVIDIA:7 GP106 [GeForce GTX 1060 6GB] 4372
12:37:45:  CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:6.1 Driver:10.2
12:37:45:  CUDA Device 1: Platform:0 Device:1 Bus:5 Slot:0 Compute:6.1 Driver:10.2
12:37:45:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:440.100
12:37:45:OpenCL Device 1: Platform:0 Device:1 Bus:5 Slot:0 Compute:1.2 Driver:440.100
12:37:45:******************************* libFAH ********************************
12:37:45:           Date: Apr 15 2020
12:37:45:           Time: 21:43:24
12:37:45:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
12:37:45:         Branch: master
12:37:45:       Compiler: GNU 8.3.0
12:37:45:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
12:37:45:                 -funroll-loops -fno-pie
12:37:45:       Platform: linux2 4.19.0-5-amd64
12:37:45:           Bits: 64
12:37:45:           Mode: Release
12:37:45:***********************************************************************
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Post by Neil-B »

Are the drivers the latest from the vendor? (I'm not sure of version numbers for Linux drivers.) If not, it might be worth updating just to rule that out ... it might be related to an odd CUDA issue where the core tries to start CUDA folding and doesn't properly switch to OpenCL if CUDA isn't available, iirc ... two identical GPUs might be causing the issue, but I'll let those who are GPU specialists give you a better diagnosis ... or maybe the new core is pushing thermals/power draw just to the edge of stability?
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

psaam0001
Posts: 383
Joined: Mon May 18, 2020 2:02 am
Location: Ruckersville, Virginia, USA

Post by psaam0001 »

I think the latest *OFFICIAL NVidia* driver to support his card is 450.66.

I know that there are newer drivers out there, but the supported hardware notes suggest that they are specifically for the RTX 30xx series.

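From Terminal (as root/super user), he can enter a "dnf update" (w/o the quote marks) command to see if newer supported drivers are found. That is the Fedora command; on Ubuntu or Debian the equivalent is "apt update" followed by "apt upgrade".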

Paul
Last edited by psaam0001 on Wed Sep 30, 2020 2:25 pm, edited 1 time in total.
gunnarre
Posts: 567
Joined: Sun May 24, 2020 7:23 pm
Location: Norway

Post by gunnarre »

Try updating your FAHClient to version 7.6.13 and update the Nvidia drivers to the newest version (450 is the newest one). I'm successfully running dual GPUs on the Nvidia Server drivers, version 450.51.06-Ubuntu0.18.04.2. 450.66 is the newest regular driver.

Run the command "nvidia-smi" in the command line. You'll get something looking like this:

Code:

nvidia-smi
Wed Sep 30 15:58:06 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 52%   68C    P0    N/A /  75W |    177MiB /  4040MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:1C:00.0 Off |                  N/A |
| 55%   71C    P2   180W / 200W |    237MiB /  8118MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11013      C   ...13/Core_22.fah/FahCore_22      175MiB |
|    1   N/A  N/A      5337      C   ...13/Core_22.fah/FahCore_22      235MiB |
+-----------------------------------------------------------------------------+
You can use the Bus-Id identifiers in the list to control the power target of each GPU individually, and reduce power to e.g. 100 watts (replace with the IDs from your listing):

Code:

sudo nvidia-smi -i 00000000:1B:00.0 -pl 100
sudo nvidia-smi -i 00000000:1C:00.0 -pl 100
In case your stability was marginal before, CUDA folding might have pushed you over the edge into instability.
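
To avoid copying bus IDs by hand, the same commands can be generated from nvidia-smi's CSV query output. The Python sketch below only builds the command lines and does not run them. It assumes the text comes from `nvidia-smi --query-gpu=pci.bus_id,power.limit --format=csv,noheader,nounits`; `power_limit_commands` is a hypothetical helper, and the sample values are the bus IDs and power caps from the listing above, laid out the way that query is assumed to print them.

```python
import csv
import io


def power_limit_commands(query_csv, watts):
    """Build one `nvidia-smi -i <bus id> -pl <watts>` command per GPU.

    Only GPUs whose current limit is above `watts` get a command, so a card
    that is already capped lower is left alone.
    """
    commands = []
    for row in csv.reader(io.StringIO(query_csv), skipinitialspace=True):
        if len(row) != 2:
            continue  # blank or malformed line
        bus_id, limit_text = row[0].strip(), row[1].strip()
        try:
            current_limit = float(limit_text)
        except ValueError:
            continue  # non-numeric limit, e.g. a card that reports no value
        if current_limit > watts:
            commands.append(
                ["sudo", "nvidia-smi", "-i", bus_id, "-pl", str(watts)]
            )
    return commands


if __name__ == "__main__":
    # Assumed query output for the two cards in the listing above.
    sample = "00000000:1B:00.0, 75.00\n00000000:1C:00.0, 200.00\n"
    for command in power_limit_commands(sample, 100):
        print(" ".join(command))
```

A limit set with `-pl` does not survive a reboot, so it has to be reapplied after each restart.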

If that doesn't help, it might be a good idea to run a memory test on your system RAM, or check if the kernel needs updating.
Online: GTX 1660 Super, GTX 1080, GTX 1050 Ti 4G OC, RX580 + occasional CPU folding in the cold.
Offline: Radeon HD 7770, GTX 960, GTX 950

Post by emf »

Trying the GPU power limit idea now. Dropped to 100W from the 120W max.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Post by bruce »

Some debugging code has been updated in FAHCore_22 and this is one that Development will want to look at. I don't see the PRCG numbers in your post. They can probably grep for it but having the numbers can't hurt.

Post by emf »

So far with the power limit adjustment it hasn't griped, but I had a good >24h window between stalls yesterday, so I'm not 100% convinced. It hasn't hurt performance in any meaningful way, so that's good.

bruce: Due to having two cards in the system, it's hard to nail down which WU might be the one at fault, but I can take a guess at the most recent one this morning that caused me to hard powercycle the system.

first off i have three stalls ~9:00 UTC

Code:

09:25:58:WARNING:WU01:FS00:Detected clock skew (1 mins 05 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
09:49:37:WARNING:WU01:FS00:Detected clock skew (1 mins 16 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
09:49:37:WARNING:WU00:FS01:Detected clock skew (1 mins 16 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
At 9:25, the system was:
uploading project:17400 run:0 clone:1087 gen:3,
just starting project:13426 run:6316 clone:4 gen:1, (had not even completed the 0% checkpoint)
and was at ~84% on project:13426 run:6041 clone:16 gen:3.

At 9:50 it was at 91% on run 6041 and 7% on run 6316, and the last message in any log is at 10:00, when it froze hard. The system was power-cycled at 12:37 UTC, came back up, and finished both without further issue, restarting from the 90% and 5% checkpoints.

So, my guess would be that project:13426 run:6041 clone:16 gen:3 was the culprit in this event; assuming that it is actually a code problem and not a hardware stability problem as suggested above.

(I can track down other PRCGs in the same manner, or I can provide the whole fahclient log corpus if it helps.)
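
The correlation above was done by hand; a small script can pull the same warnings out of a fahclient log so they can be lined up against the WU progress lines. This is a rough Python sketch, assuming the warnings always use the "N mins NN secs" wording seen in this thread. `clock_skews` is a made-up name, and the script only reads log text; it does not talk to FAHClient.

```python
import re

# Matches lines such as:
#   09:25:58:WARNING:WU01:FS00:Detected clock skew (1 mins 05 secs), I/O delay, ...
# The leading HH:MM:SS timestamp is optional because some excerpts omit it.
CLOCK_SKEW = re.compile(
    r"^(?:(?P<time>\d{2}:\d{2}:\d{2}):)?WARNING:(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"Detected clock skew \((?P<mins>\d+) mins (?P<secs>\d+) secs\)"
)


def clock_skews(lines):
    """Return (log time, WU id, slot id, stall seconds) for each skew warning."""
    found = []
    for line in lines:
        match = CLOCK_SKEW.match(line.strip())
        if match:
            stall = int(match["mins"]) * 60 + int(match["secs"])
            found.append((match["time"], match["wu"], match["slot"], stall))
    return found


if __name__ == "__main__":
    tail = ", I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates"
    log = [
        "09:25:58:WARNING:WU01:FS00:Detected clock skew (1 mins 05 secs)" + tail,
        "09:49:37:WARNING:WU01:FS00:Detected clock skew (1 mins 16 secs)" + tail,
        "09:49:37:WARNING:WU00:FS01:Detected clock skew (1 mins 16 secs)" + tail,
    ]
    for when, wu, slot, stall in clock_skews(log):
        print(when, wu, slot, f"{stall}s")
```

Sorting the result by time and comparing it with the project/run/clone/gen lines for each WU id narrows down which work units were active during a stall.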
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Post by PantherX »

I think that the 440 driver base is old so do upgrade to the latest ones to be sure. GeForce GTX 1060 is a Pascal GPU so should work without major issues on your system.
ipkh
Posts: 175
Joined: Thu Jul 16, 2015 2:03 pm

Post by ipkh »

The clock skew is a known issue with laptop chips and power states. I thought it was confined to sleep/hibernation but maybe there's an edge case causing this.
I'd make sure you have the latest Ubuntu updates.

But it's also possible you have a power/heat issue due to the Cuda efficiency gains. So maybe dust out the vents and whatnot.

Post by gunnarre »

You'll get a clock skew warning if the core froze for some reason. I've seen this happen on CPUs, and it would make sense that the same would happen if the whole machine freezes but wakes up again without crashing completely.
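
The general idea can be sketched like this. It is only an illustration of spotting a stall from timestamps, not FAHClient's actual implementation, which this thread doesn't show; `find_stalls` and its parameters are made-up names.

```python
def find_stalls(tick_times, expected_interval, tolerance):
    """Return (index, excess seconds) for each gap that overran its interval.

    tick_times is a list of wall-clock readings (in seconds) taken by
    something that is supposed to run every `expected_interval` seconds.
    A frozen process or machine shows up as one gap that is far too long.
    """
    stalls = []
    for i in range(1, len(tick_times)):
        gap = tick_times[i] - tick_times[i - 1]
        excess = gap - expected_interval
        if excess > tolerance:
            stalls.append((i, excess))
    return stalls


if __name__ == "__main__":
    # Ticks every 10 s, with one 86 s gap where the process was frozen.
    ticks = [0, 10, 20, 106, 116]
    for index, excess in find_stalls(ticks, expected_interval=10, tolerance=5):
        print(f"tick {index} arrived {excess}s late")
```

Anything that delays the process shows up the same way, which fits the warning text lumping together clock skew, I/O delay and hibernation.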

Post by bruce »

...and the hardware can disable itself if it's getting too hot or it could be defective hardware, of course. Start by underclocking the candidate GPU and see if that stops the hangs.

Post by emf »

tl;dr - one of the CPU cores failed. Didn't have anything to do with the WU's or the GPU's or the drivers at all.

Post by bruce »

So "Defective Hardware."

Post by PantherX »

emf wrote:...one of the CPU cores failed...
Out of curiosity, how did you figure that out?