GPU Freezing Up?

If you think it might be a driver problem, see viewforum.php?f=79

Moderators: Site Moderators, FAHC Science Team

WTS
Posts: 97
Joined: Sun May 19, 2019 5:49 pm
Location: Arkansas

GPU Freezing Up?

Post by WTS »

I have version 7.5.1. For about a week, my two GPUs (both NVIDIA 1070s) have been freezing up at 99.99% while still showing "Running", and two other slots are showing "Ready" at 0.00%. Nothing I do (rebooting, downloading the software again) seems to unfreeze the two GPU slots. The one CPU slot continues to run normally. Please help if you know a solution. Has anyone else had this problem?
Joe_H
Site Admin
Posts: 7854
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: GPU Freezing Up?

Post by Joe_H »

Welcome to the folding support forum.

Most often this type of problem is caused by a video system reset. Checking the OS error logs for entries from around the time the freeze started may give you some information. These resets are usually caused by excessive overclocking or by the GPU overheating. Less often, the video driver crashes due to lack of resources.

This is a known issue where a video reset interrupts processing and is not detected by the client. The client keeps counting down towards completion, based on previous progress, as if the WU were still being worked on. But if you examine the log, no progress is shown for the WU. Stopping and restarting the folding slot should resume at the last checkpoint, but it is better to find the cause of the resets and correct that.
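
If it helps, here is a rough Python sketch (not part of FAH, just an illustration) of the log check described above: it scans log lines for the "Completed N out of M steps" entries and lists the slots whose last such entry is older than a threshold. The function names, the 15-minute threshold, and the assumption that all timestamps fall on the same day are mine, not anything the client defines.

```python
import re
from datetime import timedelta

# Matches progress lines such as:
# 05:04:09:WU00:FS01:0x21:Completed 200000 out of 5000000 steps (4%)
PROGRESS_RE = re.compile(
    r"^\s*(\d{2}):(\d{2}):(\d{2}):WU(\d+):FS(\d+):0x[0-9a-fA-F]+:"
    r"Completed (\d+) out of (\d+) steps"
)


def last_progress(lines):
    """Return {slot: (time_of_last_progress, steps_done, steps_total)}."""
    latest = {}
    for line in lines:
        m = PROGRESS_RE.match(line)
        if not m:
            continue  # not a progress line
        hh, mm, ss, _wu, slot, done, total = m.groups()
        # Assumption: the log only carries HH:MM:SS, so treat it as
        # time since midnight and ignore day rollover.
        stamp = timedelta(hours=int(hh), minutes=int(mm), seconds=int(ss))
        latest["FS" + slot] = (stamp, int(done), int(total))
    return latest


def stalled_slots(lines, now, max_gap=timedelta(minutes=15)):
    """List slots whose last progress line is older than max_gap.

    `now` is a timedelta since midnight on the same clock as the log.
    The 15-minute default is an arbitrary illustrative threshold.
    """
    return sorted(
        slot
        for slot, (stamp, _done, _total) in last_progress(lines).items()
        if now - stamp > max_gap
    )
```

A slot that the client still shows as "Running" but that turns up in `stalled_slots` is a candidate for the undetected-reset case; pick a threshold comfortably longer than the normal gap between progress lines for your hardware.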

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
WTS
Posts: 97
Joined: Sun May 19, 2019 5:49 pm
Location: Arkansas

Re: GPU Freezing Up?

Post by WTS »

Thanks! That gives me a partial solution, but I'm still searching for the root cause.
I'll keep y'all posted.
Reality is what you stumble over when you walk around with your eyes closed.
WTS
Posts: 97
Joined: Sun May 19, 2019 5:49 pm
Location: Arkansas

Re: GPU Freezing Up?

Post by WTS »

Pausing and restarting is only a partial solution. I still can't find the root cause or a way to repair the problem. Can anyone help or provide a full solution?
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: GPU Freezing Up?

Post by bruce »

It's a hardware problem, not a FAH problem. The root cause is the GPU failures, so you need to diagnose and fix the hardware. (Reduce or remove the overclocking, or fix the overheating.) FAH does not run on unreliable hardware.

We may be able to help you determine the specific problems you are encountering if you post segments of FAH's log.
WTS
Posts: 97
Joined: Sun May 19, 2019 5:49 pm
Location: Arkansas

Re: GPU Freezing Up?

Post by WTS »

I don't overclock and I have no overheating as far as I can tell. I'll post some log files below:

Code:

04:55:59:FS00:Unpaused
04:55:59:FS01:Unpaused
04:55:59:FS02:Unpaused
04:55:59:WU03:FS02:Starting
04:55:59:WU03:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\ProgramData\FAHClient\cores/cores.foldingathome.org/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 03 -suffix 01 -version 705 -lifeline 12180 -checkpoint 10 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
04:55:59:WU03:FS02:Started FahCore on PID 17900
04:55:59:WU03:FS02:Core PID:9608
04:55:59:WU03:FS02:FahCore 0x21 started
04:55:59:WU00:FS01:Starting
04:55:59:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\ProgramData\FAHClient\cores/cores.foldingathome.org/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 00 -suffix 01 -version 705 -lifeline 12180 -checkpoint 10 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 1 -cuda-device 1 -gpu 1
04:55:59:WU00:FS01:Started FahCore on PID 20424
04:55:59:WU00:FS01:Core PID:22300
04:55:59:WU00:FS01:FahCore 0x21 started
04:55:59:WU04:FS00:Starting
04:55:59:WU04:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\ProgramData\FAHClient\cores/cores.foldingathome.org/Win32/AMD64/AVX/Core_a7.fah/FahCore_a7.exe -dir 04 -suffix 01 -version 705 -lifeline 12180 -checkpoint 10 -np 6
04:55:59:WU04:FS00:Started FahCore on PID 19964
04:55:59:WU04:FS00:Core PID:21408
04:55:59:WU04:FS00:FahCore 0xa7 started
04:56:00:WU03:FS02:0x21:*********************** Log Started 2019-05-21T04:55:59Z ***********************
04:56:00:WU03:FS02:0x21:Project: 11726 (Run 0, Clone 758, Gen 330)
04:56:00:WU03:FS02:0x21:Unit: 0x000001d08ca304e75b8d9ee7ec3f9da5
04:56:00:WU03:FS02:0x21:CPU: 0x00000000000000000000000000000000
04:56:00:WU03:FS02:0x21:Machine: 2
04:56:00:WU03:FS02:0x21:Reading tar file core.xml
04:56:00:WU03:FS02:0x21:Reading tar file integrator.xml
04:56:00:WU03:FS02:0x21:Reading tar file state.xml
04:56:00:WU03:FS02:0x21:Reading tar file system.xml
04:56:00:WU03:FS02:0x21:Digital signatures verified
04:56:00:WU03:FS02:0x21:Folding@home GPU Core21 Folding@home Core
04:56:00:WU03:FS02:0x21:Version 0.0.18
04:56:00:WU00:FS01:0x21:*********************** Log Started 2019-05-21T04:56:00Z ***********************
04:56:00:WU00:FS01:0x21:Project: 11726 (Run 0, Clone 1252, Gen 312)
04:56:00:WU00:FS01:0x21:Unit: 0x000001c08ca304e75b95ba1d29edaf7d
04:56:00:WU00:FS01:0x21:CPU: 0x00000000000000000000000000000000
04:56:00:WU00:FS01:0x21:Machine: 1
04:56:00:WU00:FS01:0x21:Reading tar file core.xml
04:56:00:WU00:FS01:0x21:Reading tar file integrator.xml
04:56:00:WU00:FS01:0x21:Reading tar file state.xml
04:56:00:WU00:FS01:0x21:Reading tar file system.xml
04:56:00:WU00:FS01:0x21:Digital signatures verified
04:56:00:WU00:FS01:0x21:Folding@home GPU Core21 Folding@home Core
04:56:00:WU00:FS01:0x21:Version 0.0.18
04:56:00:WU04:FS00:0xa7:*********************** Log Started 2019-05-21T04:56:00Z ***********************
04:56:00:WU04:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
04:56:00:WU04:FS00:0xa7:       Type: 0xa7
04:56:00:WU04:FS00:0xa7:       Core: Gromacs
04:56:00:WU04:FS00:0xa7:    Website: https://foldingathome.org/
04:56:00:WU04:FS00:0xa7:  Copyright: (c) 2009-2018 foldingathome.org
04:56:00:WU04:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
04:56:00:WU04:FS00:0xa7:       Args: -dir 04 -suffix 01 -version 705 -lifeline 19964 -checkpoint 10 -np
04:56:00:WU04:FS00:0xa7:             6
04:56:00:WU04:FS00:0xa7:     Config: <none>
04:56:00:WU04:FS00:0xa7:************************************ Build *************************************
04:56:00:WU04:FS00:0xa7:    Version: 0.0.17
04:56:00:WU04:FS00:0xa7:       Date: Apr 27 2018
04:56:00:WU04:FS00:0xa7:       Time: 16:19:36
04:56:00:WU04:FS00:0xa7: Repository: Git
04:56:00:WU04:FS00:0xa7:   Revision: 21359963583d09ec2063ef946399441c4df4ccd7
04:56:00:WU04:FS00:0xa7:     Branch: master
04:56:00:WU04:FS00:0xa7:   Compiler: Visual C++ 2008
04:56:00:WU04:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
04:56:00:WU04:FS00:0xa7:   Platform: win32 10
04:56:00:WU04:FS00:0xa7:       Bits: 64
04:56:00:WU04:FS00:0xa7:       Mode: Release
04:56:00:WU04:FS00:0xa7:       SIMD: avx_256
04:56:00:WU04:FS00:0xa7:************************************ System ************************************
04:56:00:WU04:FS00:0xa7:        CPU: Unknown
04:56:00:WU04:FS00:0xa7:     CPU ID: 
04:56:00:WU04:FS00:0xa7:       CPUs: 8
04:56:00:WU04:FS00:0xa7:     Memory: 63.96GiB
04:56:00:WU04:FS00:0xa7:Free Memory: 45.80GiB
04:56:00:WU04:FS00:0xa7:    Threads: WINDOWS_THREADS
04:56:00:WU04:FS00:0xa7: OS Version: 6.2
04:56:00:WU04:FS00:0xa7:Has Battery: true
04:56:00:WU04:FS00:0xa7: On Battery: false
04:56:00:WU04:FS00:0xa7: UTC Offset: -5
04:56:00:WU04:FS00:0xa7:        PID: 21408
04:56:00:WU04:FS00:0xa7:        CWD: C:\ProgramData\FAHClient\work
04:56:00:WU04:FS00:0xa7:         OS: Windows 10 Pro
04:56:00:WU04:FS00:0xa7:    OS Arch: AMD64
04:56:00:WU04:FS00:0xa7:********************************************************************************
04:56:00:WU04:FS00:0xa7:Project: 14143 (Run 4, Clone 2, Gen 60)
04:56:00:WU04:FS00:0xa7:Unit: 0x000000440002894c5ca3a727c6bb9b2a
04:56:00:WU04:FS00:0xa7:Digital signatures verified
04:56:00:WU04:FS00:0xa7:Calling: mdrun -s frame60.tpr -o frame60.trr -cpi state.cpt -cpt 10 -nt 6
04:56:00:WU04:FS00:0xa7:Steps: first=150000000 total=2500000
04:56:00:WU04:FS00:0xa7:Completed 414792 out of 2500000 steps (16%)
04:56:07:WU03:FS02:0x21:Completed 0 out of 5000000 steps (0%)
04:56:07:WU03:FS02:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
04:56:07:WU00:FS01:0x21:Completed 0 out of 5000000 steps (0%)
04:56:07:WU00:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
04:56:33:Removing old file 'configs/config-20190519-182005.xml'
04:56:33:Saving configuration to config.xml
04:56:33:<config>
04:56:33:  <!-- Folding Core -->
04:56:33:  <checkpoint v='10'/>
04:56:33:  <core-priority v='low'/>
04:56:33:
04:56:33:  <!-- Network -->
04:56:33:  <proxy v=':8080'/>
04:56:33:
04:56:33:  <!-- Slot Control -->
04:56:33:  <power v='full'/>
04:56:33:
04:56:33:  <!-- User Information -->
04:56:33:  <passkey v='********************************'/>
04:56:33:  <user v='m14m@earthlink.net'/>
04:56:33:
04:56:33:  <!-- Folding Slots -->
04:56:33:  <slot id='0' type='CPU'/>
04:56:33:  <slot id='1' type='GPU'/>
04:56:33:  <slot id='2' type='GPU'/>
04:56:33:</config>
04:57:34:WU04:FS00:0xa7:Completed 425000 out of 2500000 steps (17%)
04:58:05:WU00:FS01:0x21:Completed 50000 out of 5000000 steps (1%)
04:58:08:WU03:FS02:0x21:Completed 50000 out of 5000000 steps (1%)
05:00:07:WU00:FS01:0x21:Completed 100000 out of 5000000 steps (2%)
05:00:16:WU03:FS02:0x21:Completed 100000 out of 5000000 steps (2%)
05:02:08:WU00:FS01:0x21:Completed 150000 out of 5000000 steps (3%)
05:02:24:WU03:FS02:0x21:Completed 150000 out of 5000000 steps (3%)
05:02:53:WU04:FS00:0xa7:Completed 450000 out of 2500000 steps (18%)
05:04:09:WU00:FS01:0x21:Completed 200000 out of 5000000 steps (4%)

This problem happened very suddenly, about a week ago, not too long after a GPU driver update. Is it possible the driver had something to do with this? I've made no other changes.

Also, this is affecting _both_ GPUs, not just one.

Thanks for any help you can provide!
Last edited by WTS on Tue May 21, 2019 10:57 am, edited 1 time in total.
WTS
Posts: 97
Joined: Sun May 19, 2019 5:49 pm
Location: Arkansas

Re: GPU Freezing Up?

Post by WTS »

P.S. Is there any program that'll "analyze" the GPUs, and is there any way I can post a screen shot of the program results to the forum? Thanks.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: GPU Freezing Up?

Post by bruce »

Yes, it might have been a driver update, but not necessarily. In the segments of FAH's log that you posted, both GPUs are folding normally.

Slot 01

Code:

04:55:59:FS01:Unpaused
    04:55:59:WU00:FS01:Starting
    04:55:59:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\ProgramData\FAHClient\cores/cores.foldingathome.org/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 00 -suffix 01 -version 705 -lifeline 12180 -checkpoint 10 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 1 -cuda-device 1 -gpu 1
    04:55:59:WU00:FS01:Started FahCore on PID 20424
    04:55:59:WU00:FS01:Core PID:22300
    04:55:59:WU00:FS01:FahCore 0x21 started
    04:56:00:WU00:FS01:0x21:*********************** Log Started 2019-05-21T04:56:00Z ***********************
    04:56:00:WU00:FS01:0x21:Project: 11726 (Run 0, Clone 1252, Gen 312)
    04:56:00:WU00:FS01:0x21:Unit: 0x000001c08ca304e75b95ba1d29edaf7d
    04:56:00:WU00:FS01:0x21:CPU: 0x00000000000000000000000000000000
    04:56:00:WU00:FS01:0x21:Machine: 1
    04:56:00:WU00:FS01:0x21:Reading tar file core.xml
    04:56:00:WU00:FS01:0x21:Reading tar file integrator.xml
    04:56:00:WU00:FS01:0x21:Reading tar file state.xml
    04:56:00:WU00:FS01:0x21:Reading tar file system.xml
    04:56:00:WU00:FS01:0x21:Digital signatures verified
    04:56:00:WU00:FS01:0x21:Folding@home GPU Core21 Folding@home Core
    04:56:00:WU00:FS01:0x21:Version 0.0.18
    04:56:07:WU00:FS01:0x21:Completed 0 out of 5000000 steps (0%)
    04:56:07:WU00:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
    04:58:05:WU00:FS01:0x21:Completed 50000 out of 5000000 steps (1%)
    05:00:07:WU00:FS01:0x21:Completed 100000 out of 5000000 steps (2%)
    05:02:08:WU00:FS01:0x21:Completed 150000 out of 5000000 steps (3%)
    05:04:09:WU00:FS01:0x21:Completed 200000 out of 5000000 steps (4%)

Slot 02

Code:

04:55:59:FS02:Unpaused
    04:55:59:WU03:FS02:Starting
    04:55:59:WU03:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\ProgramData\FAHClient\cores/cores.foldingathome.org/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 03 -suffix 01 -version 705 -lifeline 12180 -checkpoint 10 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
    04:55:59:WU03:FS02:Started FahCore on PID 17900
    04:55:59:WU03:FS02:Core PID:9608
    04:55:59:WU03:FS02:FahCore 0x21 started
    04:56:00:WU03:FS02:0x21:*********************** Log Started 2019-05-21T04:55:59Z ***********************
    04:56:00:WU03:FS02:0x21:Project: 11726 (Run 0, Clone 758, Gen 330)
    04:56:00:WU03:FS02:0x21:Unit: 0x000001d08ca304e75b8d9ee7ec3f9da5
    04:56:00:WU03:FS02:0x21:CPU: 0x00000000000000000000000000000000
    04:56:00:WU03:FS02:0x21:Machine: 2
    04:56:00:WU03:FS02:0x21:Reading tar file core.xml
    04:56:00:WU03:FS02:0x21:Reading tar file integrator.xml
    04:56:00:WU03:FS02:0x21:Reading tar file state.xml
    04:56:00:WU03:FS02:0x21:Reading tar file system.xml
    04:56:00:WU03:FS02:0x21:Digital signatures verified
    04:56:00:WU03:FS02:0x21:Folding@home GPU Core21 Folding@home Core
    04:56:00:WU03:FS02:0x21:Version 0.0.18
    04:56:07:WU03:FS02:0x21:Completed 0 out of 5000000 steps (0%)
    04:56:07:WU03:FS02:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
    04:58:08:WU03:FS02:0x21:Completed 50000 out of 5000000 steps (1%)
    05:00:16:WU03:FS02:0x21:Completed 100000 out of 5000000 steps (2%)
    05:02:24:WU03:FS02:0x21:Completed 150000 out of 5000000 steps (3%)
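
In case you want to repeat this per-slot split on a longer log, here is a small Python sketch of it. It keeps only the lines tagged with one folding slot and measures the seconds between consecutive "Completed" lines. In the segments above a healthy GPU slot shows roughly even gaps of about two minutes per 1%; a stall would show up as the list simply ending while the clock keeps running. The function names are hypothetical, and the sketch assumes all timestamps fall on the same day.

```python
import re

# Captures the HH:MM:SS stamp and the step count of a progress line.
COMPLETED_RE = re.compile(
    r"^\s*(\d{2}):(\d{2}):(\d{2}):.*:Completed (\d+) out of \d+ steps"
)


def slot_lines(log_lines, slot_id):
    """Keep only the lines tagged with the given folding slot, e.g. FS01."""
    tag = ":FS%02d:" % slot_id
    return [line.strip() for line in log_lines if tag in line]


def progress_intervals(log_lines, slot_id):
    """Seconds elapsed between consecutive 'Completed' lines for one slot."""
    stamps = []
    for line in slot_lines(log_lines, slot_id):
        m = COMPLETED_RE.match(line)
        if m:
            hh, mm, ss = (int(g) for g in m.groups()[:3])
            # Assumption: no midnight rollover inside the segment.
            stamps.append(hh * 3600 + mm * 60 + ss)
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

Fed the Slot 01 lines above, `progress_intervals(lines, 1)` returns `[118, 122, 121, 121]`, which matches a GPU that is folding steadily.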
WTS
Posts: 97
Joined: Sun May 19, 2019 5:49 pm
Location: Arkansas

Re: GPU Freezing Up?

Post by WTS »

Is there any way to post a screenshot/capture of the problem?
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: GPU Freezing Up?

Post by bruce »

Screenshots cannot be uploaded through the forum software, but you can store one in a cloud account and share its URL.
WTS
Posts: 97
Joined: Sun May 19, 2019 5:49 pm
Location: Arkansas

Re: GPU Freezing Up?

Post by WTS »

I reinstalled the last driver. The problem has been cleared. Something must've happened during the original driver installation.
In any case, the problem is fixed.