
GPU slot continuously returns INTERRUPTED

Posted: Tue Apr 04, 2017 8:56 pm
by hiigaran

Code:

20:48:00:WU05:FS04:0x21:*********************** Log Started 2017-04-04T20:48:00Z ***********************
20:48:00:WU05:FS04:0x21:Project: 10496 (Run 162, Clone 18, Gen 42)
20:48:00:WU05:FS04:0x21:Unit: 0x0000003e8ca304f556bbb19331942678
20:48:00:WU05:FS04:0x21:CPU: 0x00000000000000000000000000000000
20:48:00:WU05:FS04:0x21:Machine: 4
20:48:00:WU05:FS04:0x21:Reading tar file core.xml
20:48:00:WU05:FS04:0x21:Reading tar file system.xml
20:48:01:WU05:FS04:0x21:Reading tar file integrator.xml
20:48:01:WU05:FS04:0x21:Reading tar file state.xml
20:48:01:WU05:FS04:0x21:Digital signatures verified
20:48:01:WU05:FS04:0x21:Folding@home GPU Core21 Folding@home Core
20:48:01:WU05:FS04:0x21:Version 0.0.18
20:48:16:WU05:FS04:FahCore returned: INTERRUPTED (102 = 0x66)

The log above is from one of my GPU slots, and it repeats constantly. The slot keeps cycling between ready and running the whole time. I was under the impression that INTERRUPTED is caused by either the user or the client pausing the slot, but that is not the case here. The system has been running without any user interaction for several days, and a restart did not help. The other three GPU slots and the CPU slot are working fine. Just in case, I tried setting the priority higher and unchecking the pause-on-battery option. No change.

Any additional information required?

EDIT: Uhh...I think I might have posted this in the wrong forum

EDIT2: I've noticed that the PRCG is always 10496 (162,18,42). Wondering if perhaps it is this particular WU causing the issues. It's always at 0%, so I can't tell if it's the same WU, or if it is restarting the WU completely. Is there a way to force a download of a new WU?
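
One way to confirm that a single slot keeps failing on the same work unit is to count the INTERRUPTED returns per WU/slot pair in the client log. This is a minimal sketch, not part of FAHClient: `count_interrupted` is a hypothetical helper name, and the log path in the usage comment is an assumption that may differ on your system.

```shell
#!/bin/sh
# count_interrupted LOGFILE
# Prints "<count> <WUxx:FSxx>" for every work unit/slot pair whose core
# exited with INTERRUPTED, most frequent first.
# Relies on the log line layout seen in this thread:
#   HH:MM:SS:WUxx:FSxx:FahCore returned: INTERRUPTED (102 = 0x66)
count_interrupted() {
    awk -F: '/FahCore returned: INTERRUPTED/ { n[$4 ":" $5]++ }
             END { for (k in n) print n[k], k }' "$1" | sort -rn
}

# Usage (the log file name and location are assumptions):
#   count_interrupted /var/lib/fahclient/log.txt
```

If one WU/slot pair dominates the output while the other slots never appear, that points at the work unit or that one GPU rather than at the whole installation.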

Re: GPU slot continuously returns INTERRUPTED

Posted: Tue Apr 04, 2017 9:08 pm
by bruce
Which drivers are installed?
Also, please post the top 100 lines of the log, which show your system's characteristics.

Re: GPU slot continuously returns INTERRUPTED

Posted: Tue Apr 04, 2017 10:03 pm
by hiigaran
I'll assume this is everything. Drivers are 370.28

Code:

*********************** Log Started 2017-04-04T20:45:40Z ***********************
20:45:40:************************* Folding@home Client *************************
20:45:40:      Website: http://folding.stanford.edu/
20:45:40:    Copyright: (c) 2009-2016 Stanford University
20:45:40:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
20:45:40:         Args: --child --lifeline 1825 /etc/fahclient/config.xml --run-as
20:45:40:               fahclient --pid-file=/var/run/fahclient.pid --daemon
20:45:40:       Config: /etc/fahclient/config.xml
20:45:40:******************************** Build ********************************
20:45:40:      Version: 7.4.16
20:45:40:         Date: Jan 6 2017
20:45:40:         Time: 08:08:33
20:45:40:   Repository: Git
20:45:40:     Revision: e12187cbb0bd6937c067b9749af011374563b7b9
20:45:40:       Branch: master
20:45:40:     Compiler: GNU 4.9.2
20:45:40:      Options: -std=gnu++98 -O3 -funroll-loops -ffast-math -mfpmath=sse
20:45:40:               -fno-unsafe-math-optimizations -msse2
20:45:40:     Platform: linux2 4.8.0-2-amd64
20:45:40:         Bits: 64
20:45:40:         Mode: Release
20:45:40:******************************* System ********************************
20:45:40:          CPU: Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz
20:45:40:       CPU ID: GenuineIntel Family 6 Model 79 Stepping 1
20:45:40:         CPUs: 8
20:45:40:       Memory: 7.72GiB
20:45:40:  Free Memory: 7.28GiB
20:45:40:      Threads: POSIX_THREADS
20:45:40:   OS Version: 4.4
20:45:40:  Has Battery: false
20:45:40:   On Battery: false
20:45:40:   UTC Offset: 4
20:45:40:          PID: 1827
20:45:40:          CWD: /var/lib/fahclient
20:45:40:           OS: Linux 4.4.0-66-generic x86_64
20:45:40:      OS Arch: AMD64
20:45:40:         GPUs: 4
20:45:40:        GPU 0: Bus:4 Slot:0 Func:0 NVIDIA:5 GP104 [GeForce GTX 1080]
20:45:40:        GPU 1: Bus:5 Slot:0 Func:0 NVIDIA:5 GP104 [GeForce GTX 1080]
20:45:40:        GPU 2: Bus:8 Slot:0 Func:0 NVIDIA:5 GP104 [GeForce GTX 1080]
20:45:40:        GPU 3: Bus:9 Slot:0 Func:0 NVIDIA:5 GP104 [GeForce GTX 1080]
20:45:40:CUDA Device 0: Platform:0 Device:0 Bus:4 Slot:0 Compute:6.1 Driver:8.0
20:45:40:CUDA Device 1: Platform:0 Device:1 Bus:5 Slot:0 Compute:6.1 Driver:8.0
20:45:40:CUDA Device 2: Platform:0 Device:2 Bus:8 Slot:0 Compute:6.1 Driver:8.0
20:45:40:CUDA Device 3: Platform:0 Device:3 Bus:9 Slot:0 Compute:6.1 Driver:8.0
20:45:40:       OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
20:45:40:               libOpenCL.so: cannot open shared object file: No such file or
20:45:40:               directory
20:45:40:***********************************************************************
20:45:40:<config>
20:45:40:  <!-- Client Control -->
20:45:40:  <fold-anon v='true'/>
20:45:40:
20:45:40:  <!-- Folding Slot Configuration -->
20:45:40:  <gpu v='false'/>
20:45:40:
20:45:40:  <!-- Network -->
20:45:40:  <proxy v=':8080'/>
20:45:40:
20:45:40:  <!-- Slot Control -->
20:45:40:  <power v='full'/>
20:45:40:
20:45:40:  <!-- User Information -->
20:45:40:  <passkey v='********************************'/>
20:45:40:  <team v='212997'/>
20:45:40:  <user v='hiigaran'/>
20:45:40:
20:45:40:  <!-- Folding Slots -->
20:45:40:  <slot id='0' type='CPU'/>
20:45:40:  <slot id='1' type='GPU'>
20:45:40:    <opencl-index v='0'/>
20:45:40:  </slot>
20:45:40:  <slot id='2' type='GPU'>
20:45:40:    <cuda-index v='1'/>
20:45:40:    <opencl-index v='1'/>
20:45:40:  </slot>
20:45:40:  <slot id='3' type='GPU'>
20:45:40:    <cuda-index v='2'/>
20:45:40:    <opencl-index v='2'/>
20:45:40:  </slot>
20:45:40:  <slot id='4' type='GPU'>
20:45:40:    <cuda-index v='3'/>
20:45:40:    <opencl-index v='3'/>
20:45:40:  </slot>
20:45:40:</config>
20:45:40:Switching to user fahclient
20:45:40:Trying to access database...
20:45:41:Successfully acquired database lock
20:45:41:Enabled folding slot 00: READY cpu:4
20:45:41:Enabled folding slot 01: READY gpu:0:GP104 [GeForce GTX 1080]
20:45:41:Enabled folding slot 02: READY gpu:1:GP104 [GeForce GTX 1080]
20:45:41:Enabled folding slot 03: READY gpu:2:GP104 [GeForce GTX 1080]
20:45:41:Enabled folding slot 04: READY gpu:3:GP104 [GeForce GTX 1080]
20:45:41:WU00:FS00:Starting
20:45:41:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/fahwebx.stanford.edu/cores/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 00 -suffix 01 -version 704 -lifeline 1827 -checkpoint 15 -np 4
20:45:41:WU00:FS00:Started FahCore on PID 1837
20:45:41:WU00:FS00:Core PID:1841
20:45:41:WU00:FS00:FahCore 0xa4 started
20:45:42:WU02:FS03:Starting
20:45:42:WU02:FS03:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/fahwebx.stanford.edu/cores/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1827 -checkpoint 15 -gpu-vendor nvidia -opencl-device 2 -cuda-device 2 -gpu 2
20:45:42:WU02:FS03:Started FahCore on PID 1847
20:45:42:WU02:FS03:Core PID:1851
20:45:42:WU02:FS03:FahCore 0x21 started
20:45:42:WU00:FS00:0xa4:
20:45:42:WU00:FS00:0xa4:*------------------------------*
20:45:42:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
20:45:42:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
20:45:42:WU00:FS00:0xa4:
20:45:42:WU00:FS00:0xa4:Preparing to commence simulation
20:45:42:WU00:FS00:0xa4:- Looking at optimizations...
20:45:42:WU00:FS00:0xa4:- Files status OK
20:45:42:WU00:FS00:0xa4:- Expanded 887768 -> 2072336 (decompressed 233.4 percent)
20:45:42:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=887768 data_size=2072336, decompressed_data_size=2072336 diff=0
20:45:42:WU00:FS00:0xa4:- Digital signature verified
20:45:42:WU00:FS00:0xa4:
20:45:42:WU00:FS00:0xa4:Project: 8633 (Run 0, Clone 52, Gen 62)
20:45:42:WU00:FS00:0xa4:
20:45:42:WU00:FS00:0xa4:Assembly optimizations on if available.
20:45:42:WU00:FS00:0xa4:Entering M.D.
20:45:42:WU05:FS04:Starting
20:45:42:WU05:FS04:Removing old file './work/05/logfile_01-20170404-201242.txt'
20:45:42:WU05:FS04:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/fahwebx.stanford.edu/cores/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 05 -suffix 01 -version 704 -lifeline 1827 -checkpoint 15 -gpu-vendor nvidia -opencl-device 3 -cuda-device 3 -gpu 3
20:45:42:WU05:FS04:Started FahCore on PID 1855
20:45:42:WU05:FS04:Core PID:1859
20:45:42:WU05:FS04:FahCore 0x21 started
20:45:42:WU04:FS02:Starting
20:45:42:WU04:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/fahwebx.stanford.edu/cores/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 04 -suffix 01 -version 704 -lifeline 1827 -checkpoint 15 -gpu-vendor nvidia -opencl-device 1 -cuda-device 1 -gpu 1
20:45:42:WU04:FS02:Started FahCore on PID 1860
20:45:42:WU04:FS02:Core PID:1864
20:45:42:WU04:FS02:FahCore 0x21 started
20:45:43:WU01:FS01:Starting
20:45:43:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/fahwebx.stanford.edu/cores/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 01 -suffix 01 -version 704 -lifeline 1827 -checkpoint 15 -gpu-vendor nvidia -opencl-device 0 -cuda-device 0 -gpu 0
20:45:43:WU01:FS01:Started FahCore on PID 1865
20:45:43:WU01:FS01:Core PID:1869
20:45:43:WU01:FS01:FahCore 0x21 started
20:45:44:WU05:FS04:0x21:*********************** Log Started 2017-04-04T20:45:43Z ***********************
20:45:44:WU05:FS04:0x21:Project: 10496 (Run 162, Clone 18, Gen 42)
20:45:44:WU05:FS04:0x21:Unit: 0x0000003e8ca304f556bbb19331942678
20:45:44:WU05:FS04:0x21:CPU: 0x00000000000000000000000000000000
20:45:44:WU05:FS04:0x21:Machine: 4
20:45:44:WU05:FS04:0x21:Reading tar file core.xml
20:45:44:WU05:FS04:0x21:Reading tar file system.xml
20:45:44:WU02:FS03:0x21:*********************** Log Started 2017-04-04T20:45:44Z ***********************
20:45:44:WU02:FS03:0x21:Project: 9196 (Run 1, Clone 50, Gen 368)
20:45:44:WU02:FS03:0x21:Unit: 0x000001efab40415c57cb3f32bfc1f20e
20:45:44:WU02:FS03:0x21:CPU: 0x00000000000000000000000000000000
20:45:44:WU02:FS03:0x21:Machine: 3
20:45:44:WU02:FS03:0x21:Digital signatures verified
20:45:44:WU02:FS03:0x21:Folding@home GPU Core21 Folding@home Core
20:45:44:WU02:FS03:0x21:Version 0.0.18
20:45:44:WU01:FS01:0x21:*********************** Log Started 2017-04-04T20:45:44Z ***********************
20:45:44:WU01:FS01:0x21:Project: 9178 (Run 15, Clone 15, Gen 226)
20:45:44:WU01:FS01:0x21:Unit: 0x00000138ab436c6957b24c2a0ac9ed8f
20:45:44:WU01:FS01:0x21:CPU: 0x00000000000000000000000000000000
20:45:44:WU01:FS01:0x21:Machine: 1
20:45:44:WU01:FS01:0x21:Digital signatures verified
20:45:44:WU01:FS01:0x21:Folding@home GPU Core21 Folding@home Core
20:45:44:WU01:FS01:0x21:Version 0.0.18
20:45:44:WU04:FS02:0x21:*********************** Log Started 2017-04-04T20:45:44Z ***********************
20:45:44:WU04:FS02:0x21:Project: 10496 (Run 102, Clone 14, Gen 73)
20:45:44:WU04:FS02:0x21:Unit: 0x000000568ca304f556bbad604c30b42a
20:45:44:WU04:FS02:0x21:CPU: 0x00000000000000000000000000000000
20:45:44:WU04:FS02:0x21:Machine: 2
20:45:44:WU04:FS02:0x21:Digital signatures verified
20:45:44:WU04:FS02:0x21:Folding@home GPU Core21 Folding@home Core
20:45:44:WU04:FS02:0x21:Version 0.0.18
20:45:45:WU04:FS02:0x21:  Found a checkpoint file
20:45:46:WU02:FS03:0x21:  Found a checkpoint file
20:45:46:WU05:FS04:0x21:Reading tar file integrator.xml
20:45:46:WU05:FS04:0x21:Reading tar file state.xml
20:45:46:WU05:FS04:0x21:Digital signatures verified
20:45:46:WU05:FS04:0x21:Folding@home GPU Core21 Folding@home Core
20:45:46:WU05:FS04:0x21:Version 0.0.18
20:45:46:WU01:FS01:0x21:  Found a checkpoint file
20:45:48:WU00:FS00:0xa4:Using Gromacs checkpoints
20:45:49:WARNING:FS02:Size of positions 18948 does not match topology 18865
20:45:50:WARNING:FS02:Size of positions 18948 does not match topology 18865
20:45:50:WARNING:FS02:Size of positions 18948 does not match topology 18865
20:45:50:WARNING:FS02:Size of positions 18948 does not match topology 18865
20:45:50:WARNING:FS02:Size of positions 18948 does not match topology 18865
20:45:50:WU00:FS00:0xa4:Resuming from checkpoint
20:45:50:WU00:FS00:0xa4:Verified 00/wudata_01.log
20:45:51:WU00:FS00:0xa4:Verified 00/wudata_01.trr
20:45:51:WU00:FS00:0xa4:Verified 00/wudata_01.xtc
20:45:51:WU00:FS00:0xa4:Verified 00/wudata_01.edr
20:45:51:WU00:FS00:0xa4:Completed 856830 out of 1250000 steps  (68%)
20:45:59:WU05:FS04:FahCore returned: INTERRUPTED (102 = 0x66)


Re: GPU slot continuously returns INTERRUPTED

Posted: Tue Apr 04, 2017 10:52 pm
by bollix47
20:45:40: OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
20:45:40: libOpenCL.so: cannot open shared object file: No such file or
20:45:40: directory
How were the drivers installed? When using the ones from nvidia, opencl is installed too but other sources may not.
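
To check whether the dynamic linker knows about the library named in that error, one option is to look through the `ldconfig -p` cache listing. This is a rough sketch only: `has_unversioned_opencl` is a hypothetical helper, and it assumes the usual `ldconfig -p` layout with the library name in the first column. A versioned entry such as libOpenCL.so.1 without a plain libOpenCL.so entry would be consistent with the message in the log, though that alone does not prove it is the cause.

```shell
#!/bin/sh
# has_unversioned_opencl
# Reads `ldconfig -p`-style text on stdin and succeeds only if some entry
# is named exactly "libOpenCL.so", the name the client failed to open.
has_unversioned_opencl() {
    awk '$1 == "libOpenCL.so" { found = 1 } END { exit !found }'
}

# Usage:
#   ldconfig -p | grep -i opencl            # show every OpenCL-related entry
#   if ldconfig -p | has_unversioned_opencl; then
#       echo "libOpenCL.so is in the linker cache"
#   else
#       echo "libOpenCL.so is not in the linker cache"
#   fi
```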

Re: GPU slot continuously returns INTERRUPTED

Posted: Tue Apr 04, 2017 11:54 pm
by hiigaran
The instructions I followed had these commands:

Code:

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt-get install nvidia-370 nvidia-settings
sudo apt-get install mesa-common-dev
sudo apt-get install freeglut3-dev
That being said, the other three cards are working just fine. It's just this one card. I've got two identical systems, each with four cards. Just one system, with just one card, on just one slot is having this problem. So I don't know if drivers are the cause of this particular problem.

Re: GPU slot continuously returns INTERRUPTED

Posted: Wed Apr 05, 2017 2:05 am
by _r2w_ben
Have you tried physically swapping two of the cards to see if the problem moves with the card or is dependent on the PCIe slot?

Re: GPU slot continuously returns INTERRUPTED

Posted: Wed Apr 05, 2017 3:46 am
by bruce
1) Where is libOpenCL.so? Is it accessible through the path, or is it somewhere that FAH can see, such as the CWD listed in FAH's startup?

2) I suggest you discard Project: 10496 (Run 162, Clone 18, Gen 42), AKA WU05 on your system. Several people have attempted to run this same WU and all have failed. I'll mark it as a corrupt WU and it shouldn't be assigned again after 8 am Pacific time.

3) If you then have the same problem with another WU on that GPU, here are some other things you can try.

I'm not certain, but you MAY have to reinstall the drivers AFTER installing the last GPU and you MAY have to reinstall FAHClient after all of that. I'd first pause FAH, then go through the process of reinstalling drivers and software before reactivating FAH's folding function. Depending on how the installers are configured, you may also have to reboot once or twice, but Linux is a lot better about that than Windows.

Let us know if this helps.

Re: GPU slot continuously returns INTERRUPTED

Posted: Wed Apr 05, 2017 4:14 am
by hiigaran
How do I discard the WU?

Re: GPU slot continuously returns INTERRUPTED

Posted: Wed Apr 05, 2017 4:22 am
by bruce
The official method is to run FAHClient --dump 05
(note the double dash after the space)


The unofficial method is to pause that slot and delete the subdirectory 05 from inside of the work directory.

Re: GPU slot continuously returns INTERRUPTED

Posted: Wed Apr 05, 2017 6:26 am
by hiigaran
Command didn't do anything. Am I supposed to stop FAHClient first? If so, how?

As for the second method, where is the directory? Haven't found it in /etc/fahclient, or in /home/user/.FAH. The former contains a single .xml file, and the latter a single .db file. ls -a shows no hidden files. I'm guessing there has to be another directory.

Re: GPU slot continuously returns INTERRUPTED

Posted: Wed Apr 05, 2017 7:19 am
by Joe_H
Your current working directory is shown in the log you posted as /var/lib/fahclient; that is the first place to check. Your config file is shown as being in /etc/fahclient. I forget what else, if anything, the client stores there by default.

Re: GPU slot continuously returns INTERRUPTED

Posted: Wed Apr 05, 2017 11:21 am
by SteveWillis
for me it's in /var/lib/fahclient/work
after I delete the directory I restart the client
sudo /etc/init.d/FAHClient stop
sudo /etc/init.d/FAHClient start

Re: GPU slot continuously returns INTERRUPTED

Posted: Wed Apr 05, 2017 3:10 pm
by bruce
SteveWillis wrote:for me it's in /var/lib/fahclient/work
after I delete the directory I restart the client
sudo /etc/init.d/FAHClient stop
sudo /etc/init.d/FAHClient start
(or just sudo /etc/init.d/FAHClient restart)

If you use the unofficial method, once the subdirectory is gone, you shouldn't need to restart ... just unpause the slot and it should recover.

If you did it before 8am, FAH may re-download the same WU.
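
The unofficial method can be sketched as a small guarded helper. Treat this as an illustration under assumptions, not an official procedure: `remove_wu_dir` is a hypothetical function name, and the work directory in the usage comment is the location reported in this thread, which may differ on other installs. Pause the slot or stop the client before deleting anything.

```shell
#!/bin/sh
# remove_wu_dir WORK_DIR WU_ID
# Deletes WORK_DIR/WU_ID after basic sanity checks. WU_ID must be exactly
# two digits (e.g. 05) so a typo cannot turn into a broader path.
remove_wu_dir() {
    work_dir=$1
    wu_id=$2
    case $wu_id in
        [0-9][0-9]) ;;
        *) echo "WU id must be two digits, got '$wu_id'" >&2; return 2 ;;
    esac
    target=$work_dir/$wu_id
    if [ ! -d "$target" ]; then
        echo "no such directory: $target" >&2
        return 1
    fi
    rm -r -- "$target"
}

# Usage for the setup in this thread, from a root shell:
#   /etc/init.d/FAHClient stop
#   remove_wu_dir /var/lib/fahclient/work 05
#   /etc/init.d/FAHClient start
```

The two-digit check is only a guard against an empty or mistyped argument; it does not verify that the directory belongs to the failing work unit, so confirm the WU number in the log first.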

Re: GPU slot continuously returns INTERRUPTED

Posted: Wed Apr 05, 2017 3:19 pm
by SteveWillis
I was thinking that for some reason it didn't like "restart" but I might have been thinking about something else. Computers can be so finicky

Re: GPU slot continuously returns INTERRUPTED

Posted: Wed Apr 05, 2017 10:34 pm
by hiigaran
Alrighty, everything is working as it should be after the delete. Thanks guys.