GPU slot continuously returns INTERRUPTED

If you think it might be a driver problem, see viewforum.php?f=79

Moderators: Site Moderators, FAHC Science Team

hiigaran
Posts: 134
Joined: Thu Nov 17, 2011 6:01 pm

GPU slot continuously returns INTERRUPTED

Post by hiigaran »

Code:

20:48:00:WU05:FS04:0x21:*********************** Log Started 2017-04-04T20:48:00Z ***********************
20:48:00:WU05:FS04:0x21:Project: 10496 (Run 162, Clone 18, Gen 42)
20:48:00:WU05:FS04:0x21:Unit: 0x0000003e8ca304f556bbb19331942678
20:48:00:WU05:FS04:0x21:CPU: 0x00000000000000000000000000000000
20:48:00:WU05:FS04:0x21:Machine: 4
20:48:00:WU05:FS04:0x21:Reading tar file core.xml
20:48:00:WU05:FS04:0x21:Reading tar file system.xml
20:48:01:WU05:FS04:0x21:Reading tar file integrator.xml
20:48:01:WU05:FS04:0x21:Reading tar file state.xml
20:48:01:WU05:FS04:0x21:Digital signatures verified
20:48:01:WU05:FS04:0x21:Folding@home GPU Core21 Folding@home Core
20:48:01:WU05:FS04:0x21:Version 0.0.18
20:48:16:WU05:FS04:FahCore returned: INTERRUPTED (102 = 0x66)

The above is the log from my GPU slot, which keeps repeating. The slot cycles between ready and running the whole time. I was under the impression that INTERRUPTED is caused by either the user or the client pausing the slot, but that is not the case here. This system has been running without any user interaction for several days, and a restart did not help. The other three GPU slots and the CPU slot are working fine. Just in case, I tried setting the priority higher and unchecking the pause-on-battery option. No change.

Any additional information required?

EDIT: Uhh...I think I might have posted this in the wrong forum

EDIT2: I've noticed that the PRCG is always 10496 (162,18,42). Wondering if perhaps it is this particular WU causing the issues. It's always at 0%, so I can't tell if it's the same WU, or if it is restarting the WU completely. Is there a way to force a download of a new WU?
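
One way to check whether the slot keeps retrying the same WU is to pull the relevant lines out of the client log. The following is a minimal sketch, assuming the log has been saved to a file; the function names are made up for illustration, and the patterns are based only on the log lines quoted in this post.

```shell
#!/bin/sh
# Hypothetical helpers for reading a saved FAHClient log; names are illustrative.

# Print how many times the given slot (e.g. FS04) logged
# "FahCore returned: INTERRUPTED".
count_interrupted() {
    log_file=$1
    slot=$2
    grep -c ":${slot}:FahCore returned: INTERRUPTED" "$log_file"
}

# Print each distinct Project/Run/Clone/Gen the slot started, with a count.
# A single line with a growing count suggests the same WU is being retried.
list_prcg() {
    log_file=$1
    slot=$2
    grep ":${slot}:0x[0-9a-f]*:Project: " "$log_file" \
        | sed 's/.*:Project: //' \
        | sort | uniq -c
}
```

For example, `list_prcg log.txt FS04` run against a saved copy of the log (the file name `log.txt` is an assumption) would show whether anything other than 10496 (Run 162, Clone 18, Gen 42) was ever started on that slot.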
Last edited by hiigaran on Tue Apr 04, 2017 9:09 pm, edited 1 time in total.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: GPU slot continuously returns INTERRUPTED

Post by bruce »

Which drivers are installed?
Also, please post the top 100 lines of the log, which show your system's characteristics.
hiigaran
Posts: 134
Joined: Thu Nov 17, 2011 6:01 pm

Re: GPU slot continuously returns INTERRUPTED

Post by hiigaran »

I'll assume this is everything. Drivers are 370.28

Code:

*********************** Log Started 2017-04-04T20:45:40Z ***********************
20:45:40:************************* Folding@home Client *************************
20:45:40:      Website: http://folding.stanford.edu/
20:45:40:    Copyright: (c) 2009-2016 Stanford University
20:45:40:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
20:45:40:         Args: --child --lifeline 1825 /etc/fahclient/config.xml --run-as
20:45:40:               fahclient --pid-file=/var/run/fahclient.pid --daemon
20:45:40:       Config: /etc/fahclient/config.xml
20:45:40:******************************** Build ********************************
20:45:40:      Version: 7.4.16
20:45:40:         Date: Jan 6 2017
20:45:40:         Time: 08:08:33
20:45:40:   Repository: Git
20:45:40:     Revision: e12187cbb0bd6937c067b9749af011374563b7b9
20:45:40:       Branch: master
20:45:40:     Compiler: GNU 4.9.2
20:45:40:      Options: -std=gnu++98 -O3 -funroll-loops -ffast-math -mfpmath=sse
20:45:40:               -fno-unsafe-math-optimizations -msse2
20:45:40:     Platform: linux2 4.8.0-2-amd64
20:45:40:         Bits: 64
20:45:40:         Mode: Release
20:45:40:******************************* System ********************************
20:45:40:          CPU: Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz
20:45:40:       CPU ID: GenuineIntel Family 6 Model 79 Stepping 1
20:45:40:         CPUs: 8
20:45:40:       Memory: 7.72GiB
20:45:40:  Free Memory: 7.28GiB
20:45:40:      Threads: POSIX_THREADS
20:45:40:   OS Version: 4.4
20:45:40:  Has Battery: false
20:45:40:   On Battery: false
20:45:40:   UTC Offset: 4
20:45:40:          PID: 1827
20:45:40:          CWD: /var/lib/fahclient
20:45:40:           OS: Linux 4.4.0-66-generic x86_64
20:45:40:      OS Arch: AMD64
20:45:40:         GPUs: 4
20:45:40:        GPU 0: Bus:4 Slot:0 Func:0 NVIDIA:5 GP104 [GeForce GTX 1080]
20:45:40:        GPU 1: Bus:5 Slot:0 Func:0 NVIDIA:5 GP104 [GeForce GTX 1080]
20:45:40:        GPU 2: Bus:8 Slot:0 Func:0 NVIDIA:5 GP104 [GeForce GTX 1080]
20:45:40:        GPU 3: Bus:9 Slot:0 Func:0 NVIDIA:5 GP104 [GeForce GTX 1080]
20:45:40:CUDA Device 0: Platform:0 Device:0 Bus:4 Slot:0 Compute:6.1 Driver:8.0
20:45:40:CUDA Device 1: Platform:0 Device:1 Bus:5 Slot:0 Compute:6.1 Driver:8.0
20:45:40:CUDA Device 2: Platform:0 Device:2 Bus:8 Slot:0 Compute:6.1 Driver:8.0
20:45:40:CUDA Device 3: Platform:0 Device:3 Bus:9 Slot:0 Compute:6.1 Driver:8.0
20:45:40:       OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
20:45:40:               libOpenCL.so: cannot open shared object file: No such file or
20:45:40:               directory
20:45:40:***********************************************************************
20:45:40:<config>
20:45:40:  <!-- Client Control -->
20:45:40:  <fold-anon v='true'/>
20:45:40:
20:45:40:  <!-- Folding Slot Configuration -->
20:45:40:  <gpu v='false'/>
20:45:40:
20:45:40:  <!-- Network -->
20:45:40:  <proxy v=':8080'/>
20:45:40:
20:45:40:  <!-- Slot Control -->
20:45:40:  <power v='full'/>
20:45:40:
20:45:40:  <!-- User Information -->
20:45:40:  <passkey v='********************************'/>
20:45:40:  <team v='212997'/>
20:45:40:  <user v='hiigaran'/>
20:45:40:
20:45:40:  <!-- Folding Slots -->
20:45:40:  <slot id='0' type='CPU'/>
20:45:40:  <slot id='1' type='GPU'>
20:45:40:    <opencl-index v='0'/>
20:45:40:  </slot>
20:45:40:  <slot id='2' type='GPU'>
20:45:40:    <cuda-index v='1'/>
20:45:40:    <opencl-index v='1'/>
20:45:40:  </slot>
20:45:40:  <slot id='3' type='GPU'>
20:45:40:    <cuda-index v='2'/>
20:45:40:    <opencl-index v='2'/>
20:45:40:  </slot>
20:45:40:  <slot id='4' type='GPU'>
20:45:40:    <cuda-index v='3'/>
20:45:40:    <opencl-index v='3'/>
20:45:40:  </slot>
20:45:40:</config>
20:45:40:Switching to user fahclient
20:45:40:Trying to access database...
20:45:41:Successfully acquired database lock
20:45:41:Enabled folding slot 00: READY cpu:4
20:45:41:Enabled folding slot 01: READY gpu:0:GP104 [GeForce GTX 1080]
20:45:41:Enabled folding slot 02: READY gpu:1:GP104 [GeForce GTX 1080]
20:45:41:Enabled folding slot 03: READY gpu:2:GP104 [GeForce GTX 1080]
20:45:41:Enabled folding slot 04: READY gpu:3:GP104 [GeForce GTX 1080]
20:45:41:WU00:FS00:Starting
20:45:41:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/fahwebx.stanford.edu/cores/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 00 -suffix 01 -version 704 -lifeline 1827 -checkpoint 15 -np 4
20:45:41:WU00:FS00:Started FahCore on PID 1837
20:45:41:WU00:FS00:Core PID:1841
20:45:41:WU00:FS00:FahCore 0xa4 started
20:45:42:WU02:FS03:Starting
20:45:42:WU02:FS03:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/fahwebx.stanford.edu/cores/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1827 -checkpoint 15 -gpu-vendor nvidia -opencl-device 2 -cuda-device 2 -gpu 2
20:45:42:WU02:FS03:Started FahCore on PID 1847
20:45:42:WU02:FS03:Core PID:1851
20:45:42:WU02:FS03:FahCore 0x21 started
20:45:42:WU00:FS00:0xa4:
20:45:42:WU00:FS00:0xa4:*------------------------------*
20:45:42:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
20:45:42:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
20:45:42:WU00:FS00:0xa4:
20:45:42:WU00:FS00:0xa4:Preparing to commence simulation
20:45:42:WU00:FS00:0xa4:- Looking at optimizations...
20:45:42:WU00:FS00:0xa4:- Files status OK
20:45:42:WU00:FS00:0xa4:- Expanded 887768 -> 2072336 (decompressed 233.4 percent)
20:45:42:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=887768 data_size=2072336, decompressed_data_size=2072336 diff=0
20:45:42:WU00:FS00:0xa4:- Digital signature verified
20:45:42:WU00:FS00:0xa4:
20:45:42:WU00:FS00:0xa4:Project: 8633 (Run 0, Clone 52, Gen 62)
20:45:42:WU00:FS00:0xa4:
20:45:42:WU00:FS00:0xa4:Assembly optimizations on if available.
20:45:42:WU00:FS00:0xa4:Entering M.D.
20:45:42:WU05:FS04:Starting
20:45:42:WU05:FS04:Removing old file './work/05/logfile_01-20170404-201242.txt'
20:45:42:WU05:FS04:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/fahwebx.stanford.edu/cores/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 05 -suffix 01 -version 704 -lifeline 1827 -checkpoint 15 -gpu-vendor nvidia -opencl-device 3 -cuda-device 3 -gpu 3
20:45:42:WU05:FS04:Started FahCore on PID 1855
20:45:42:WU05:FS04:Core PID:1859
20:45:42:WU05:FS04:FahCore 0x21 started
20:45:42:WU04:FS02:Starting
20:45:42:WU04:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/fahwebx.stanford.edu/cores/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 04 -suffix 01 -version 704 -lifeline 1827 -checkpoint 15 -gpu-vendor nvidia -opencl-device 1 -cuda-device 1 -gpu 1
20:45:42:WU04:FS02:Started FahCore on PID 1860
20:45:42:WU04:FS02:Core PID:1864
20:45:42:WU04:FS02:FahCore 0x21 started
20:45:43:WU01:FS01:Starting
20:45:43:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/fahwebx.stanford.edu/cores/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 01 -suffix 01 -version 704 -lifeline 1827 -checkpoint 15 -gpu-vendor nvidia -opencl-device 0 -cuda-device 0 -gpu 0
20:45:43:WU01:FS01:Started FahCore on PID 1865
20:45:43:WU01:FS01:Core PID:1869
20:45:43:WU01:FS01:FahCore 0x21 started
20:45:44:WU05:FS04:0x21:*********************** Log Started 2017-04-04T20:45:43Z ***********************
20:45:44:WU05:FS04:0x21:Project: 10496 (Run 162, Clone 18, Gen 42)
20:45:44:WU05:FS04:0x21:Unit: 0x0000003e8ca304f556bbb19331942678
20:45:44:WU05:FS04:0x21:CPU: 0x00000000000000000000000000000000
20:45:44:WU05:FS04:0x21:Machine: 4
20:45:44:WU05:FS04:0x21:Reading tar file core.xml
20:45:44:WU05:FS04:0x21:Reading tar file system.xml
20:45:44:WU02:FS03:0x21:*********************** Log Started 2017-04-04T20:45:44Z ***********************
20:45:44:WU02:FS03:0x21:Project: 9196 (Run 1, Clone 50, Gen 368)
20:45:44:WU02:FS03:0x21:Unit: 0x000001efab40415c57cb3f32bfc1f20e
20:45:44:WU02:FS03:0x21:CPU: 0x00000000000000000000000000000000
20:45:44:WU02:FS03:0x21:Machine: 3
20:45:44:WU02:FS03:0x21:Digital signatures verified
20:45:44:WU02:FS03:0x21:Folding@home GPU Core21 Folding@home Core
20:45:44:WU02:FS03:0x21:Version 0.0.18
20:45:44:WU01:FS01:0x21:*********************** Log Started 2017-04-04T20:45:44Z ***********************
20:45:44:WU01:FS01:0x21:Project: 9178 (Run 15, Clone 15, Gen 226)
20:45:44:WU01:FS01:0x21:Unit: 0x00000138ab436c6957b24c2a0ac9ed8f
20:45:44:WU01:FS01:0x21:CPU: 0x00000000000000000000000000000000
20:45:44:WU01:FS01:0x21:Machine: 1
20:45:44:WU01:FS01:0x21:Digital signatures verified
20:45:44:WU01:FS01:0x21:Folding@home GPU Core21 Folding@home Core
20:45:44:WU01:FS01:0x21:Version 0.0.18
20:45:44:WU04:FS02:0x21:*********************** Log Started 2017-04-04T20:45:44Z ***********************
20:45:44:WU04:FS02:0x21:Project: 10496 (Run 102, Clone 14, Gen 73)
20:45:44:WU04:FS02:0x21:Unit: 0x000000568ca304f556bbad604c30b42a
20:45:44:WU04:FS02:0x21:CPU: 0x00000000000000000000000000000000
20:45:44:WU04:FS02:0x21:Machine: 2
20:45:44:WU04:FS02:0x21:Digital signatures verified
20:45:44:WU04:FS02:0x21:Folding@home GPU Core21 Folding@home Core
20:45:44:WU04:FS02:0x21:Version 0.0.18
20:45:45:WU04:FS02:0x21:  Found a checkpoint file
20:45:46:WU02:FS03:0x21:  Found a checkpoint file
20:45:46:WU05:FS04:0x21:Reading tar file integrator.xml
20:45:46:WU05:FS04:0x21:Reading tar file state.xml
20:45:46:WU05:FS04:0x21:Digital signatures verified
20:45:46:WU05:FS04:0x21:Folding@home GPU Core21 Folding@home Core
20:45:46:WU05:FS04:0x21:Version 0.0.18
20:45:46:WU01:FS01:0x21:  Found a checkpoint file
20:45:48:WU00:FS00:0xa4:Using Gromacs checkpoints
20:45:49:WARNING:FS02:Size of positions 18948 does not match topology 18865
20:45:50:WARNING:FS02:Size of positions 18948 does not match topology 18865
20:45:50:WARNING:FS02:Size of positions 18948 does not match topology 18865
20:45:50:WARNING:FS02:Size of positions 18948 does not match topology 18865
20:45:50:WARNING:FS02:Size of positions 18948 does not match topology 18865
20:45:50:WU00:FS00:0xa4:Resuming from checkpoint
20:45:50:WU00:FS00:0xa4:Verified 00/wudata_01.log
20:45:51:WU00:FS00:0xa4:Verified 00/wudata_01.trr
20:45:51:WU00:FS00:0xa4:Verified 00/wudata_01.xtc
20:45:51:WU00:FS00:0xa4:Verified 00/wudata_01.edr
20:45:51:WU00:FS00:0xa4:Completed 856830 out of 1250000 steps  (68%)
20:45:59:WU05:FS04:FahCore returned: INTERRUPTED (102 = 0x66)

bollix47
Posts: 2941
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: GPU slot continuously returns INTERRUPTED

Post by bollix47 »

20:45:40: OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
20:45:40: libOpenCL.so: cannot open shared object file: No such file or
20:45:40: directory

How were the drivers installed? The ones from NVIDIA install OpenCL too, but drivers from other sources may not.
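
The log message names the unversioned file libOpenCL.so, so it may help to see which libOpenCL names the dynamic loader actually knows about. This is a rough sketch, not a definitive diagnosis: the function name is hypothetical, and the idea that a package might register only a versioned name such as libOpenCL.so.1 is an assumption, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: reads `ldconfig -p` output on stdin and reports which
# libOpenCL names the dynamic loader has in its cache.
check_opencl_names() {
    # Keep only the libOpenCL lines; an empty result is not an error here.
    entries=$(grep 'libOpenCL\.so' || true)
    # In `ldconfig -p` output the library name is followed by a space, so
    # "libOpenCL.so " matches only the unversioned entry.
    if printf '%s\n' "$entries" | grep -q 'libOpenCL\.so '; then
        echo "libOpenCL.so is registered"
    elif [ -n "$entries" ]; then
        echo "only versioned libOpenCL.so.N entries are registered"
    else
        echo "no libOpenCL entries are registered"
    fi
}
```

It would be used as `ldconfig -p | check_opencl_names`.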
hiigaran
Posts: 134
Joined: Thu Nov 17, 2011 6:01 pm

Re: GPU slot continuously returns INTERRUPTED

Post by hiigaran »

The instructions I followed had these commands:

Code:

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt-get install nvidia-370 nvidia-settings
sudo apt-get install mesa-common-dev
sudo apt-get install freeglut3-dev
That being said, the other three cards are working just fine. It's just this one card. I've got two identical systems, each with four cards. Just one system, with just one card, on just one slot is having this problem. So I don't know if drivers are the cause of this particular problem.
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: GPU slot continuously returns INTERRUPTED

Post by _r2w_ben »

Have you tried physically swapping two of the cards to see if the problem moves with the card or is dependent on the PCIe slot?
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: GPU slot continuously returns INTERRUPTED

Post by bruce »

1) Where is libOpenCL.so? Is it accessible through the path, or is it somewhere that FAH can see, such as the CWD listed in FAH's startup?

2) I suggest you discard Project: 10496 (Run 162, Clone 18, Gen 42) AKA WU05 on your system. Several people have attempted to run this same WU and all have failed. I'll mark it as a corrupt WU and it shouldn't be assigned again after 8am pacific time.

3) If you then have the same problem with another WU on that GPU, here are some other things you can try.

I'm not certain, but you MAY have to reinstall the drivers AFTER installing the last GPU and you MAY have to reinstall FAHClient after all of that. I'd first pause FAH, then go through the process of reinstalling drivers and software before reactivating FAH's folding function. Depending on how the installers are configured, you may also have to reboot once or twice, but Linux is a lot better about that than Windows.

Let us know if this helps.
hiigaran
Posts: 134
Joined: Thu Nov 17, 2011 6:01 pm

Re: GPU slot continuously returns INTERRUPTED

Post by hiigaran »

How do I discard the WU?
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: GPU slot continuously returns INTERRUPTED

Post by bruce »

The official method is to run FAHClient --dump 05
(note the double dash after the space)


The unofficial method is to pause that slot and delete the subdirectory 05 from inside of the work directory.
hiigaran
Posts: 134
Joined: Thu Nov 17, 2011 6:01 pm

Re: GPU slot continuously returns INTERRUPTED

Post by hiigaran »

Command didn't do anything. Am I supposed to stop FAHClient first? If so, how?

As for the second method, where is the directory? Haven't found it in /etc/fahclient, or in /home/user/.FAH. The former contains a single .xml file, and the latter a single .db file. ls -a shows no hidden files. I'm guessing there has to be another directory.
Joe_H
Site Admin
Posts: 7856
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: GPU slot continuously returns INTERRUPTED

Post by Joe_H »

Your current working directory is shown in the log you posted as /var/lib/fahclient, so that is the first place to check. Your config file is shown as being in /etc/fahclient; I forget what else, if anything, the client stores there by default.
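
As a small sketch of that lookup, the CWD line can be pulled straight out of a saved copy of the log. The function name is hypothetical, and the pattern is based only on the `CWD:` line in the log posted above.

```shell
#!/bin/sh
# Hypothetical helper: print the CWD recorded in the System section of a
# saved FAHClient log (the last one, if the log covers several startups).
fah_cwd() {
    sed -n 's/^[0-9:]* *CWD: *//p' "$1" | tail -n 1
}
```

Since the posted log refers to './work/05/logfile_01-20170404-201242.txt', something like `ls "$(fah_cwd saved-log.txt)/work"` should list the numbered WU subdirectories (the file name `saved-log.txt` is an assumption).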

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
SteveWillis
Posts: 409
Joined: Fri Apr 15, 2016 12:42 am
Hardware configuration: PC 1:
Linux Mint 17.3
three gtx 1080 GPUs One on a powered header
Motherboard = [MB-AM3-AS-SB-990FXR2] qty 1 Asus Sabertooth 990FX(+59.99)
CPU = [CPU-AM3-FX-8320BR] qty 1 AMD FX 8320 Eight Core 3.5GHz(+41.99)

PC2:
Linux Mint 18
Open air case
Motherboard: ASUS Crosshair V Formula-Z AM3+ AMD 990FX SATA 6Gb/s USB 3.0 ATX AMD
AMD FD6300WMHKBOX FX-6300 6-Core Processor Black Edition with Cooler Master Hyper 212 EVO - CPU Cooler with 120mm PWM Fan
three gtx 1080,
one gtx 1080 TI on a powered header

Re: GPU slot continuously returns INTERRUPTED

Post by SteveWillis »

for me it's in /var/lib/fahclient/work
after I delete the directory I restart the client
sudo /etc/init.d/FAHClient stop
sudo /etc/init.d/FAHClient start

1080 and 1080TI GPUs on Linux Mint
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: GPU slot continuously returns INTERRUPTED

Post by bruce »

SteveWillis wrote:for me it's in /var/lib/fahclient/work
after I delete the directory I restart the client
sudo /etc/init.d/FAHClient stop
sudo /etc/init.d/FAHClient start
(or just sudo /etc/init.d/FAHClient restart)

If you use the unofficial method, once the subdirectory is gone, you shouldn't need to restart ... just unpause the slot and it should recover.

If you did it before 8am, FAH may re-download the same WU.
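
Putting the unofficial method together, here is a cautious sketch of the delete step. The function name is hypothetical, and the two-digit check is an assumption based on the WU numbers seen in this thread's logs (WU00 to WU05); it is there only so that a typo cannot remove anything outside the work directory.

```shell
#!/bin/sh
# Hypothetical helper: remove one WU's subdirectory from the client's work
# directory. Pause the slot (or stop the client) before calling it.
discard_wu() {
    work_dir=$1   # e.g. /var/lib/fahclient/work
    wu_id=$2      # two-digit WU number from the log, e.g. 05

    # Accept only a two-digit ID so a typo cannot escape the work directory.
    case $wu_id in
        [0-9][0-9]) ;;
        *) echo "refusing: '$wu_id' is not a two-digit WU id" >&2; return 1 ;;
    esac

    if [ ! -d "$work_dir/$wu_id" ]; then
        echo "nothing to remove: $work_dir/$wu_id does not exist" >&2
        return 1
    fi

    rm -rf -- "$work_dir/$wu_id"
}
```

With the paths mentioned in this thread, that would mean pausing the slot, running `discard_wu /var/lib/fahclient/work 05` from a root shell, and then unpausing the slot.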
SteveWillis
Posts: 409
Joined: Fri Apr 15, 2016 12:42 am

Re: GPU slot continuously returns INTERRUPTED

Post by SteveWillis »

I was thinking that for some reason it didn't like "restart" but I might have been thinking about something else. Computers can be so finicky

1080 and 1080TI GPUs on Linux Mint
hiigaran
Posts: 134
Joined: Thu Nov 17, 2011 6:01 pm

Re: GPU slot continuously returns INTERRUPTED

Post by hiigaran »

Alrighty, everything is working as it should be after the delete. Thanks guys.