Problem with 2 cards setup

Postby beer » Wed Oct 15, 2014 5:09 pm

I am seeing weird behavior on my system.
I have 3 slots:
slot 0: SMP
slot 1: GPU: 0:GK104
slot 2: GPU: 1: GM204
see:
https://www.dropbox.com/s/4iesbayu9upc3 ... d.png?dl=0

My problem is that when I start slot 2, I receive a core 15 WU and my GeForce GTX 660 Ti starts working, and when I start slot 1, I receive a core 17 WU and my GeForce GTX 970 starts working and crashes shortly afterwards (only core 16).

I am a bit confused by the log that I have attached.

This part says slot 2 should use GPU 0:
Code: Select all
15:47:33:  <slot id='2' type='GPU'>
15:47:33:    <gpu-index v='0'/>
15:47:33:  </slot>

and this part says:
Code: Select all
15:47:33:        GPU 0: NVIDIA:3 GK104 [GeForce GTX 660 Ti]
15:47:33:        GPU 1: NVIDIA:4 GM204 [GeForce GTX 970]

So GPU 0 is the GeForce GTX 660 Ti, but the graphical user interface says that slot 2 is for the GM204.

I am confused
Code: Select all
*********************** Log Started 2014-10-15T15:47:33Z ***********************
15:47:33:************************* Folding@home Client *************************
15:47:33:      Website: http://folding.stanford.edu/
15:47:33:    Copyright: (c) 2009-2014 Stanford University
15:47:33:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:47:33:         Args:
15:47:33:       Config: C:/Users/beer/AppData/Roaming/FAHClient/config.xml
15:47:33:******************************** Build ********************************
15:47:33:      Version: 7.4.4
15:47:33:         Date: Mar 4 2014
15:47:33:         Time: 20:26:54
15:47:33:      SVN Rev: 4130
15:47:33:       Branch: fah/trunk/client
15:47:33:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
15:47:33:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
15:47:33:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
15:47:33:     Platform: win32 XP
15:47:33:         Bits: 32
15:47:33:         Mode: Release
15:47:33:******************************* System ********************************
15:47:33:          CPU: Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz
15:47:33:       CPU ID: GenuineIntel Family 6 Model 60 Stepping 3
15:47:33:         CPUs: 8
15:47:33:       Memory: 7.94GiB
15:47:33:  Free Memory: 6.86GiB
15:47:33:      Threads: WINDOWS_THREADS
15:47:33:   OS Version: 6.1
15:47:33:  Has Battery: false
15:47:33:   On Battery: false
15:47:33:   UTC Offset: 2
15:47:33:          PID: 3060
15:47:33:          CWD: C:/Users/beer/AppData/Roaming/FAHClient
15:47:33:           OS: Windows 7 Professional
15:47:33:      OS Arch: AMD64
15:47:33:         GPUs: 2
15:47:33:        GPU 0: NVIDIA:3 GK104 [GeForce GTX 660 Ti]
15:47:33:        GPU 1: NVIDIA:4 GM204 [GeForce GTX 970]
15:47:33:         CUDA: 5.2
15:47:33:  CUDA Driver: 6050
15:47:33:Win32 Service: false
15:47:33:***********************************************************************
15:47:33:<config>
15:47:33:  <!-- Network -->
15:47:33:  <proxy v=':8080'/>
15:47:33:
15:47:33:  <!-- Slot Control -->
15:47:33:  <power v='full'/>
15:47:33:
15:47:33:  <!-- Folding Slots -->
15:47:33:  <slot id='0' type='CPU'>
15:47:33:    <paused v='true'/>
15:47:33:  </slot>
15:47:33:  <slot id='1' type='GPU'>
15:47:33:    <gpu-index v='1'/>
15:47:33:  </slot>
15:47:33:  <slot id='2' type='GPU'>
15:47:33:    <gpu-index v='0'/>
15:47:33:  </slot>

Postby 7im » Wed Oct 15, 2014 6:24 pm

Slot order has nothing to do with hardware order. That's why they each have their own separate ID numbering.

GPU 0 is gpu index 0, while also being Slot 2, as shown in the log.

15:47:33: <slot id='2' type='GPU'>
15:47:33: <gpu-index v='0'/>
15:47:33: </slot>
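
To make the distinction concrete, here is a minimal Python sketch (not part of the FAH client) that reads the startup section of a log like the one posted above and reports which detected GPU each GPU slot points at through its gpu-index. The function name slot_to_gpu and the sample string are made up for illustration, and a closing </config> tag is added to the sample so that the fragment parses as XML.

```python
import re
import xml.etree.ElementTree as ET

# Excerpt of the startup log from the first post.
# The closing </config> line is an assumption added so the XML is complete.
LOG = """\
15:47:33:        GPU 0: NVIDIA:3 GK104 [GeForce GTX 660 Ti]
15:47:33:        GPU 1: NVIDIA:4 GM204 [GeForce GTX 970]
15:47:33:<config>
15:47:33:  <slot id='0' type='CPU'>
15:47:33:    <paused v='true'/>
15:47:33:  </slot>
15:47:33:  <slot id='1' type='GPU'>
15:47:33:    <gpu-index v='1'/>
15:47:33:  </slot>
15:47:33:  <slot id='2' type='GPU'>
15:47:33:    <gpu-index v='0'/>
15:47:33:  </slot>
15:47:33:</config>
"""

TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}:")
GPU_LINE = re.compile(r"^\s*GPU (\d+): (.+)$")


def slot_to_gpu(log_text):
    """Return {slot id: description of the GPU its gpu-index refers to}."""
    gpus = {}       # hardware index -> description from the System section
    xml_lines = []  # config lines with the timestamp prefix removed
    for raw in log_text.splitlines():
        line = TIMESTAMP.sub("", raw, count=1)
        match = GPU_LINE.match(line)
        if match:
            gpus[int(match.group(1))] = match.group(2).strip()
        elif line.lstrip().startswith("<"):
            xml_lines.append(line)

    root = ET.fromstring("\n".join(xml_lines))
    mapping = {}
    for slot in root.iter("slot"):
        if slot.get("type") != "GPU":
            continue
        index = slot.find("gpu-index")
        if index is not None:
            mapping[int(slot.get("id"))] = gpus.get(int(index.get("v")))
    return mapping


for slot_id, gpu in sorted(slot_to_gpu(LOG).items()):
    print(f"slot {slot_id} -> {gpu}")
```

For the log above this reports slot 1 on the GM204 (GTX 970) and slot 2 on the GK104 (GTX 660 Ti), which is what the config says regardless of the order in which the slots are listed.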

Postby beer » Wed Oct 15, 2014 6:42 pm

Slot 2 has hardware 0, which should be the GK104 [GeForce GTX 660 Ti] according to the log, but the UI shows me that slot 2 is connected to the GM204 [GeForce GTX 970].

Is that why the WU is sent to the wrong card?

Postby Sn1ken » Wed Oct 15, 2014 6:53 pm

Try uninstalling FAHClient and then installing it again.

Maybe it mixes the cards somehow.

Postby 7im » Wed Oct 15, 2014 6:59 pm

beer wrote: Slot 2 has hardware 0, which should be the GK104 [GeForce GTX 660 Ti] according to the log, but the UI shows me that slot 2 is connected to the GM204 [GeForce GTX 970].

Is that why the WU is sent to the wrong card?


That's a different issue. If you haven't reinstalled the client since adding the GTX 970, that is the problem with the indexes.

If you reinstall, and still have the problem, set the indexes manually. https://foldingforum.org/viewtopic.php?p=199379#p199379

Postby beer » Thu Oct 16, 2014 5:53 am

Hi,
Reinstalling did not help, but after I followed the guide it seems to be running as it should.

Postby ChristianVirtual » Sat Nov 08, 2014 1:23 pm

This is the first time I have experienced the wrong "GPU in a slot".

History: Originally I had two GTX 780s on an ASUS board with CentOS 7. I removed the GTX 780 closer to the CPU and moved the second (slightly faster) GTX 780 into that slot. In the now empty slot I placed my shiny new GTX 970. The reason for the swap is mainly airflow, as I don't want the longer 780 to cover the shorter 970 in the vertical motherboard installation.

Before the swap I removed the GPU slots from the original FAH setup.

The system powered on and everything looked fine at first glance.

Code: Select all
[cl@Linuxpowered ~]$ lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GK110 [GeForce GTX 780] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation Device 13c2 (rev a1)


The script that sets the fans also worked (80% for the 780 and 65% for the 970), which means my xorg.conf, with two GPU cards addressed by PCI address, also still worked.

Code: Select all
[cl@Linuxpowered ~]$ nvidia-smi
Sat Nov  8 20:52:46 2014       
+------------------------------------------------------+                       
| NVIDIA-SMI 343.22     Driver Version: 343.22         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 780     Off  | 0000:01:00.0     N/A |                  N/A |
| 80%   58C    P0    N/A /  N/A |    316MiB /  3071MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 970     Off  | 0000:02:00.0     N/A |                  N/A |
| 65%   54C    P2    N/A /  N/A |    130MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
+-----------------------------------------------------------------------------+


For testing, here is also the output of FAHClient:
Code: Select all
[cl@Linuxpowered ~]$ FAHClient --lspci
VendorID:DeviceID:Vendor Name:Description
0x8086:0x0100:Intel Corporation:
0x8086:0x0101:Intel Corporation:
0x8086:0x0105:Intel Corporation:
0x8086:0x1e31:Intel Corporation:
0x8086:0x1e3a:Intel Corporation:
0x8086:0x1503:Intel Corporation:
0x8086:0x1e2d:Intel Corporation:
0x8086:0x1e20:Intel Corporation:
0x8086:0x1e10:Intel Corporation:
0x8086:0x1e14:Intel Corporation:
0x8086:0x244e:Intel Corporation:
0x8086:0x1e1e:Intel Corporation:
0x8086:0x1e26:Intel Corporation:
0x8086:0x1e44:Intel Corporation:
0x8086:0x1e02:Intel Corporation:
0x8086:0x1e22:Intel Corporation:
0x10de:0x1004:NVIDIA Corporation:GK110 [GeForce GTX 780]
0x10de:0x0e1a:NVIDIA Corporation:
0x10de:0x13c2:NVIDIA Corporation:
0x10de:0x0fbb:NVIDIA Corporation:
0x1b21:0x1042:ASMedia Technology Inc.:
0x1b21:0x1080:ASMedia Technology Inc.:
0x1b21:0x1042:ASMedia Technology Inc.:


After power-on and the start of FAHClient, I added the GPU slots back to the system. (As usual I configured client-type beta, but I'm sure the issue is not related to a specific WU, hence posting here. If that is not appropriate, please move this to the beta forum.)

Code: Select all
*********************** Log Started 2014-11-08T11:05:43Z ***********************
11:05:43:************************* Folding@home Client *************************
11:05:43:    Website: http://folding.stanford.edu/
11:05:43:  Copyright: (c) 2009-2014 Stanford University
11:05:43:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
11:05:43:       Args: --child --lifeline 2450 /etc/fahclient/config.xml --run-as
11:05:43:             fahclient --pid-file=/var/run/fahclient.pid --daemon
11:05:43:     Config: /etc/fahclient/config.xml
11:05:43:******************************** Build ********************************
11:05:43:    Version: 7.4.4
11:05:43:       Date: Mar 4 2014
11:05:43:       Time: 12:01:17
11:05:43:    SVN Rev: 4130
11:05:43:     Branch: fah/trunk/client
11:05:43:   Compiler: GNU 4.1.2 20080704 (Red Hat 4.1.2-46)
11:05:43:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
11:05:43:             -fno-unsafe-math-optimizations -msse2
11:05:43:   Platform: linux2 2.6.18-164.11.1.el5
11:05:43:       Bits: 64
11:05:43:       Mode: Release
11:05:43:******************************* System ********************************
11:05:43:        CPU: Intel(R) Core(TM) i7-2600S CPU @ 2.80GHz
11:05:43:     CPU ID: GenuineIntel Family 6 Model 42 Stepping 7
11:05:43:       CPUs: 8
11:05:43:     Memory: 7.59GiB
11:05:43:Free Memory: 7.05GiB
11:05:43:    Threads: POSIX_THREADS
11:05:43: OS Version: 3.10
11:05:43:Has Battery: false
11:05:43: On Battery: false
11:05:43: UTC Offset: 9
11:05:43:        PID: 2452
11:05:43:        CWD: /var/lib/fahclient
11:05:43:         OS: Linux 3.10.0-123.9.2.el7.x86_64 x86_64
11:05:43:    OS Arch: AMD64
11:05:43:       GPUs: 2
11:05:43:      GPU 0: NVIDIA:3 GK110 [GeForce GTX 780]
11:05:43:      GPU 1: NVIDIA:4 GM204 [GeForce GTX 970]
11:05:43:       CUDA: 5.2
11:05:43:CUDA Driver: 6050
11:05:43:***********************************************************************
11:05:43:<config>
11:05:43:  <!-- HTTP Server -->
11:05:43:
11:05:43:  <!-- Logging -->
11:05:43:  <log-rotate-max v='1024'/>
11:05:43:
11:05:43:  <!-- Network -->
11:05:43:  <proxy v=':8080'/>
11:05:43:
11:05:43:  <!-- Slot Control -->
11:05:43:  <power v='full'/>
11:05:43:
11:05:43:  <!-- User Information -->
11:05:43:  <team v='3446'/>
11:05:43:  <user v='ChristianFAH'/>
11:05:43:
11:05:43:  <!-- Folding Slots -->
11:05:43:  <slot id='0' type='CPU'>
11:05:43:    <client-type v='<beta>'/>
11:05:43:    <cpus v='6'/>
11:05:43:    <pause-on-start v='true'/>
11:05:43:  </slot>
11:05:43:</config>


11:05:43:Switching to user fahclient
11:05:43:Trying to access database...
11:05:43:Successfully acquired database lock
11:05:43:Enabled folding slot 00: PAUSED cpu:6 (by user)
11:10:15:Adding folding slot 01: PAUSED gpu:0:GK110 [GeForce GTX 780] (by user)


11:10:15:Saving configuration to /etc/fahclient/config.xml
11:10:15:<config>
11:10:15:  <!-- HTTP Server -->
11:10:15:
11:10:15:  <!-- Logging -->
11:10:15:  <log-rotate-max v='1024'/>
11:10:15:
11:10:15:  <!-- Network -->
11:10:15:  <proxy v=':8080'/>
11:10:15:
11:10:15:
11:10:15:  <!-- Slot Control -->
11:10:15:  <power v='full'/>
11:10:15:
11:10:15:  <!-- User Information -->
11:10:15:  <team v='3446'/>
11:10:15:  <user v='ChristianFAH'/>
11:10:15:
11:10:15:  <!-- Folding Slots -->
11:10:15:  <slot id='0' type='CPU'>
11:10:15:    <client-type v='beta'/>
11:10:15:    <cpus v='6'/>
11:10:15:    <pause-on-start v='true'/>
11:10:15:  </slot>
11:10:15:  <slot id='1' type='GPU'>
11:10:15:    <client-type v='beta'/>
11:10:15:    <pause-on-start v='true'/>
11:10:15:  </slot>
11:10:15:</config>
11:10:45:Adding folding slot 02: PAUSED gpu:1:GM204 [GeForce GTX 970] (by user)


11:10:45:Saving configuration to /etc/fahclient/config.xml
11:10:45:<config>
11:10:45:  <!-- HTTP Server -->
11:10:45:  <!-- Logging -->
11:10:45:  <log-rotate-max v='1024'/>
11:10:45:
11:10:45:  <!-- Network -->
11:10:45:  <proxy v=':8080'/>
11:10:45:
11:10:45:  <!-- Slot Control -->
11:10:45:  <power v='full'/>
11:10:45:
11:10:45:  <!-- User Information -->
11:10:45:  <team v='3446'/>
11:10:45:  <user v='ChristianFAH'/>
11:10:45:
11:10:45:  <!-- Folding Slots -->
11:10:45:  <slot id='0' type='CPU'>
11:10:45:    <client-type v='beta'/>
11:10:45:    <cpus v='6'/>
11:10:45:    <pause-on-start v='true'/>
11:10:45:  </slot>
11:10:45:  <slot id='1' type='GPU'>
11:10:45:    <client-type v='beta'/>
11:10:45:    <pause-on-start v='true'/>
11:10:45:  </slot>
11:10:45:  <slot id='2' type='GPU'>
11:10:45:    <client-type v='beta'/>
11:10:45:    <pause-on-start v='true'/>
11:10:45:  </slot>
11:10:45:</config>
11:10:48:Saving configuration to /etc/fahclient/config.xml
11:10:48:<config>
11:10:48:  <!-- Logging -->
11:10:48:  <log-rotate-max v='1024'/>
11:10:48:
11:10:48:  <!-- Network -->
11:10:48:  <proxy v=':8080'/>
11:10:48:
11:10:48:  <!-- Slot Control -->
11:10:48:  <power v='full'/>
11:10:48:
11:10:48:  <!-- User Information -->
11:10:48:  <team v='3446'/>
11:10:48:  <user v='ChristianFAH'/>
11:10:48:
11:10:48:  <!-- Folding Slots -->
11:10:48:  <slot id='0' type='CPU'>
11:10:48:    <client-type v='beta'/>
11:10:48:    <cpus v='6'/>
11:10:48:    <pause-on-start v='true'/>
11:10:48:  </slot>
11:10:48:  <slot id='1' type='GPU'>
11:10:48:    <client-type v='beta'/>
11:10:48:    <pause-on-start v='true'/>
11:10:48:  </slot>
11:10:48:  <slot id='2' type='GPU'>
11:10:48:    <client-type v='beta'/>
11:10:48:    <pause-on-start v='true'/>
11:10:48:  </slot>
11:10:48:</config>
11:10:50:FS00:Unpaused

11:10:51:WU00:FS00:Connecting to 171.67.108.200:8080
11:10:51:WU00:FS00:Assigned to work server 128.143.199.96
11:10:51:WU00:FS00:Requesting new work unit for slot 00: READY cpu:6 from 128.143.199.96
11:10:51:WU00:FS00:Connecting to 128.143.199.96:8080
11:10:54:WU00:FS00:Downloading 4.12MiB
11:10:58:WU00:FS00:Download complete
11:10:58:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:8515 run:0 clone:5 gen:27 core:0xa3 unit:0x0000001cfbcb017c50241db006c690f0
11:10:58:WU00:FS00:Starting
11:10:58:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/beta/Core_a3.fah/FahCore_a3 -dir 00 -suffix 01 -version 704 -lifeline 2452 -checkpoint 15 -np 6
11:10:58:WU00:FS00:Started FahCore on PID 3630
11:10:58:WU00:FS00:Core PID:3634
11:10:58:WU00:FS00:FahCore 0xa3 started
11:10:58:WU00:FS00:0xa3:
11:10:58:WU00:FS00:0xa3:*------------------------------*
11:10:58:WU00:FS00:0xa3:Folding@Home Gromacs SMP Core
11:10:58:WU00:FS00:0xa3:Version 2.27 (Dec. 15, 2010)
11:10:58:WU00:FS00:0xa3:
11:10:58:WU00:FS00:0xa3:Preparing to commence simulation
11:10:58:WU00:FS00:0xa3:- Looking at optimizations...
11:10:58:WU00:FS00:0xa3:- Created dyn
11:10:58:WU00:FS00:0xa3:- Files status OK
11:10:58:WU00:FS00:0xa3:- Expanded 4320694 -> 5389344 (decompressed 124.7 percent)
11:10:58:WU00:FS00:0xa3:Called DecompressByteArray: compressed_data_size=4320694 data_size=5389344, decompressed_data_size=5389344 diff=0
11:10:58:WU00:FS00:0xa3:- Digital signature verified
11:10:58:WU00:FS00:0xa3:
11:10:58:WU00:FS00:0xa3:Project: 8515 (Run 0, Clone 5, Gen 27)
11:10:58:WU00:FS00:0xa3:
11:10:58:WU00:FS00:0xa3:Assembly optimizations on if available.
11:10:58:WU00:FS00:0xa3:Entering M.D.
11:11:04:WU00:FS00:0xa3:Mapping NT from 6 to 6
11:11:05:WU00:FS00:0xa3:Completed 0 out of 500000 steps  (0%)



11:11:24:FS01:Unpaused
11:11:24:WU01:FS01:Connecting to 171.67.108.200:80
11:11:25:WU01:FS01:Assigned to work server 140.163.4.231
11:11:25:WU01:FS01:Requesting new work unit for slot 01: READY gpu:0:GK110 [GeForce GTX 780] from 140.163.4.231
11:11:25:WU01:FS01:Connecting to 140.163.4.231:8080
11:11:25:WU01:FS01:Downloading 4.84MiB
11:11:28:WU01:FS01:Download complete
11:11:28:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13000 run:564 clone:0 gen:88 core:0x17 unit:0x0000009d538b3db753103b07f0648cc2
11:11:28:WU01:FS01:Starting
11:11:28:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_17.fah/FahCore_17 -dir 01 -suffix 01 -version 704 -lifeline 2452 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
11:11:28:WU01:FS01:Started FahCore on PID 3646
11:11:28:WU01:FS01:Core PID:3650
11:11:28:WU01:FS01:FahCore 0x17 started
11:11:29:WU01:FS01:0x17:*********************** Log Started 2014-11-08T11:11:28Z ***********************
11:11:29:WU01:FS01:0x17:Project: 13000 (Run 564, Clone 0, Gen 88)
11:11:29:WU01:FS01:0x17:Unit: 0x0000009d538b3db753103b07f0648cc2
11:11:29:WU01:FS01:0x17:CPU: 0x00000000000000000000000000000000
11:11:29:WU01:FS01:0x17:Machine: 1
11:11:29:WU01:FS01:0x17:Reading tar file state.xml
11:11:29:WU01:FS01:0x17:Reading tar file system.xml
11:11:29:WU01:FS01:0x17:Reading tar file integrator.xml
11:11:29:WU01:FS01:0x17:Reading tar file core.xml
11:11:29:WU01:FS01:0x17:Digital signatures verified



11:12:23:FS02:Unpaused
11:12:23:WU02:FS02:Connecting to 171.67.108.200:80
11:12:24:WU02:FS02:Assigned to work server 171.67.108.52
11:12:24:WU02:FS02:Requesting new work unit for slot 02: READY gpu:1:GM204 [GeForce GTX 970] from 171.67.108.52
11:12:24:WU02:FS02:Connecting to 171.67.108.52:8080
11:12:24:WU02:FS02:Downloading 1.52MiB
11:12:26:WU02:FS02:Download complete
11:12:26:WU02:FS02:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:9201 run:916 clone:3 gen:5 core:0x17 unit:0x0000000f6652edc45399fa0d43c95842
11:12:26:WU02:FS02:Starting
11:12:26:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_17.fah/FahCore_17 -dir 02 -suffix 01 -version 704 -lifeline 2452 -checkpoint 15 -gpu 1 -gpu-vendor nvidia
11:12:26:WU02:FS02:Started FahCore on PID 3670
11:12:26:WU02:FS02:Core PID:3674
11:12:26:WU02:FS02:FahCore 0x17 started
11:12:27:WU02:FS02:0x17:*********************** Log Started 2014-11-08T11:12:26Z ***********************
11:12:27:WU02:FS02:0x17:Project: 9201 (Run 916, Clone 3, Gen 5)
11:12:27:WU02:FS02:0x17:Unit: 0x0000000f6652edc45399fa0d43c95842
11:12:27:WU02:FS02:0x17:CPU: 0x00000000000000000000000000000000
11:12:27:WU02:FS02:0x17:Machine: 2
11:12:27:WU02:FS02:0x17:Reading tar file state.xml
11:12:27:WU02:FS02:0x17:Reading tar file system.xml
11:12:27:WU02:FS02:0x17:Reading tar file integrator.xml
11:12:27:WU02:FS02:0x17:Reading tar file core.xml
11:12:27:WU02:FS02:0x17:Digital signatures verified
11:12:49:WU02:FS02:0x17:Completed 0 out of 5000000 steps (0%)


11:14:39:WU01:FS01:0x17:ERROR:exception: Force RMSE error of 455.495 with threshold of 5


11:14:39:WU01:FS01:0x17:Saving result file logfile_01.txt
11:14:39:WU01:FS01:0x17:Saving result file badStateCheckpoint_1958637770
11:14:40:WU01:FS01:0x17:Saving result file badStateForceGroup0_1958637770Core.xml
11:14:43:WU01:FS01:0x17:Saving result file badStateForceGroup0_1958637770Ref.xml
11:14:45:WU01:FS01:0x17:Saving result file badStateForceGroup1_1958637770Core.xml
11:14:48:WU01:FS01:0x17:Saving result file badStateForceGroup1_1958637770Ref.xml
11:14:50:WU01:FS01:0x17:Saving result file badStateForceGroup2_1958637770Core.xml
11:14:53:WU01:FS01:0x17:Saving result file badStateForceGroup2_1958637770Ref.xml
11:14:55:WU01:FS01:0x17:Saving result file log.txt
11:14:55:WU01:FS01:0x17:Folding@home Core Shutdown: BAD_WORK_UNIT
11:14:56:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
11:14:56:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13000 run:564 clone:0 gen:88 core:0x17 unit:0x0000009d538b3db753103b07f0648cc2
11:14:56:WU01:FS01:Uploading 24.59MiB to 140.163.4.231
11:14:56:WU01:FS01:Connecting to 140.163.4.231:8080
11:14:56:WU03:FS01:Connecting to 171.67.108.200:80
11:14:56:WU03:FS01:Assigned to work server 140.163.4.233
11:14:56:WU03:FS01:Requesting new work unit for slot 01: READY gpu:0:GK110 [GeForce GTX 780] from 140.163.4.233
11:14:56:WU03:FS01:Connecting to 140.163.4.233:8080
11:14:57:WU03:FS01:Downloading 4.31MiB
11:15:01:WU03:FS01:Download complete
11:15:01:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:10468 run:0 clone:378 gen:71 core:0x17 unit:0x0000007a538b3db9538cb4159990d0d2
11:15:01:WU03:FS01:Starting
11:15:01:WU03:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_17.fah/FahCore_17 -dir 03 -suffix 01 -version 704 -lifeline 2452 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
11:15:01:WU03:FS01:Started FahCore on PID 3712
11:15:01:WU03:FS01:Core PID:3716
11:15:01:WU03:FS01:FahCore 0x17 started
11:15:01:WU03:FS01:0x17:*********************** Log Started 2014-11-08T11:15:01Z ***********************
11:15:01:WU03:FS01:0x17:Project: 10468 (Run 0, Clone 378, Gen 71)
11:15:01:WU03:FS01:0x17:Unit: 0x0000007a538b3db9538cb4159990d0d2
11:15:01:WU03:FS01:0x17:CPU: 0x00000000000000000000000000000000
11:15:01:WU03:FS01:0x17:Machine: 1
11:15:01:WU03:FS01:0x17:Reading tar file state.xml
11:15:02:WU03:FS01:0x17:Reading tar file system.xml
11:15:02:WU01:FS01:Upload 23.64%
11:15:02:WU03:FS01:0x17:Reading tar file integrator.xml
11:15:02:WU03:FS01:0x17:Reading tar file core.xml
11:15:02:WU03:FS01:0x17:Digital signatures verified
11:15:08:WU01:FS01:Upload 44.99%
11:15:14:WU01:FS01:Upload 68.12%
11:15:17:WU01:FS01:Upload complete
11:15:17:WU01:FS01:Server responded WORK_ACK (400)
11:15:17:WU01:FS01:Cleaning up
11:15:31:WU02:FS02:0x17:Completed 50000 out of 5000000 steps (1%)


11:17:27:WU03:FS01:0x17:ERROR:exception: Force RMSE error of 414.231 with threshold of 5
11:17:27:WU03:FS01:0x17:Saving result file logfile_01.txt
11:17:27:WU03:FS01:0x17:Saving result file badStateCheckpoint_872351648
11:17:29:WU03:FS01:0x17:Saving result file badStateForceGroup0_872351648Core.xml
11:17:30:WU03:FS01:0x17:Saving result file badStateForceGroup0_872351648Ref.xml
11:17:33:WU03:FS01:0x17:Saving result file badStateForceGroup1_872351648Core.xml
11:17:35:WU03:FS01:0x17:Saving result file badStateForceGroup1_872351648Ref.xml
11:17:37:WU03:FS01:0x17:Saving result file badStateForceGroup2_872351648Core.xml
11:17:39:WU03:FS01:0x17:Saving result file badStateForceGroup2_872351648Ref.xml
11:17:42:WU03:FS01:0x17:Saving result file log.txt
11:17:42:WU03:FS01:0x17:Folding@home Core Shutdown: BAD_WORK_UNIT
11:17:42:WARNING:WU03:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
11:17:42:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:10468 run:0 clone:378 gen:71 core:0x17 unit:0x0000007a538b3db9538cb4159990d0d2
11:17:42:WU03:FS01:Uploading 22.89MiB to 140.163.4.233
11:17:42:WU03:FS01:Connecting to 140.163.4.233:8080
11:17:42:WU01:FS01:Connecting to 171.67.108.200:80
11:17:43:WU01:FS01:Assigned to work server 171.67.108.52
11:17:43:WU01:FS01:Requesting new work unit for slot 01: READY gpu:0:GK110 [GeForce GTX 780] from 171.67.108.52
11:17:43:WU01:FS01:Connecting to 171.67.108.52:8080
11:17:43:WU01:FS01:Downloading 1.52MiB
11:17:46:WU01:FS01:Download complete
11:17:46:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:9201 run:195 clone:3 gen:4 core:0x17 unit:0x0000000f6652edc45399ddb050e36b02
11:17:46:WU01:FS01:Starting
11:17:46:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_17.fah/FahCore_17 -dir 01 -suffix 01 -version 704 -lifeline 2452 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
11:17:46:WU01:FS01:Started FahCore on PID 3753
11:17:46:WU01:FS01:Core PID:3757
11:17:46:WU01:FS01:FahCore 0x17 started
11:17:46:WU01:FS01:0x17:*********************** Log Started 2014-11-08T11:17:46Z ***********************
11:17:46:WU01:FS01:0x17:Project: 9201 (Run 195, Clone 3, Gen 4)
11:17:46:WU01:FS01:0x17:Unit: 0x0000000f6652edc45399ddb050e36b02
11:17:46:WU01:FS01:0x17:CPU: 0x00000000000000000000000000000000
11:17:46:WU01:FS01:0x17:Machine: 1
11:17:46:WU01:FS01:0x17:Reading tar file state.xml
11:17:46:WU01:FS01:0x17:Reading tar file system.xml
11:17:46:WU01:FS01:0x17:Reading tar file integrator.xml
11:17:46:WU01:FS01:0x17:Reading tar file core.xml
11:17:46:WU01:FS01:0x17:Digital signatures verified
11:17:48:WU03:FS01:Upload 24.03%
11:17:54:WU03:FS01:Upload 62.53%
11:17:56:FS01:Finishing
11:17:59:WU03:FS01:Upload complete
11:17:59:WU03:FS01:Server responded WORK_ACK (400)
11:17:59:WU03:FS01:Cleaning up
11:18:08:WU01:FS01:0x17:Completed 0 out of 5000000 steps (0%)
11:18:15:WU02:FS02:0x17:Completed 100000 out of 5000000 steps (2%)
11:20:03:WU01:FS01:0x17:Completed 50000 out of 5000000 steps (1%)
11:20:57:WU02:FS02:0x17:Completed 150000 out of 5000000 steps (3%)
11:21:59:WU01:FS01:0x17:Completed 100000 out of 5000000 steps (2%)
11:23:40:WU02:FS02:0x17:Completed 200000 out of 5000000 steps (4%)
11:23:55:WU01:FS01:0x17:Completed 150000 out of 5000000 steps (3%)
11:24:09:WU00:FS00:0xa3:Completed 5000 out of 500000 steps  (1%)
11:25:50:WU01:FS01:0x17:Completed 200000 out of 5000000 steps (4%)
11:26:23:WU02:FS02:0x17:Completed 250000 out of 5000000 steps (5%)
11:27:46:WU01:FS01:0x17:Completed 250000 out of 5000000 steps (5%)
11:29:06:WU02:FS02:0x17:Completed 300000 out of 5000000 steps (6%)
11:29:41:WU01:FS01:0x17:Completed 300000 out of 5000000 steps (6%)
11:31:37:WU01:FS01:0x17:Completed 350000 out of 5000000 steps (7%)
11:31:49:WU02:FS02:0x17:Completed 350000 out of 5000000 steps (7%)
11:33:33:WU01:FS01:0x17:Completed 400000 out of 5000000 steps (8%)
11:34:31:WU02:FS02:0x17:Completed 400000 out of 5000000 steps (8%)
11:35:28:WU01:FS01:0x17:Completed 450000 out of 5000000 steps (9%)
11:37:08:WU00:FS00:0xa3:Completed 10000 out of 500000 steps  (2%)
11:37:14:WU02:FS02:0x17:Completed 450000 out of 5000000 steps (9%)
11:37:24:WU01:FS01:0x17:Completed 500000 out of 5000000 steps (10%)
11:39:20:WU01:FS01:0x17:Completed 550000 out of 5000000 steps (11%)
11:39:57:WU02:FS02:0x17:Completed 500000 out of 5000000 steps (10%)


Two WUs were returned with errors (one each from projects 13000 and 10468). At first I thought: oops, the cards might not be seated correctly. But nothing was wrong with either of them.
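
As an aside, failed units like these can be pulled out of a client log with a few lines of Python. This is only a sketch: the function name faulty_projects_by_slot is hypothetical, and it assumes the "Sending unit results ... error:FAULTY" line format seen in the log above.

```python
import re
from collections import defaultdict

# Three lines copied from the log above: two faulty results and one progress line.
SAMPLE = [
    "11:14:56:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13000 run:564 clone:0 gen:88 core:0x17 unit:0x0000009d538b3db753103b07f0648cc2",
    "11:15:31:WU02:FS02:0x17:Completed 50000 out of 5000000 steps (1%)",
    "11:17:42:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:10468 run:0 clone:378 gen:71 core:0x17 unit:0x0000007a538b3db9538cb4159990d0d2",
]

# Captures the folding slot number and the project number of a faulty result.
FAULTY = re.compile(
    r":WU\d+:FS(\d+):Sending unit results: .*error:FAULTY project:(\d+)"
)


def faulty_projects_by_slot(log_lines):
    """Return {slot number: [project numbers returned as FAULTY]}."""
    result = defaultdict(list)
    for line in log_lines:
        match = FAULTY.search(line)
        if match:
            result[int(match.group(1))].append(int(match.group(2)))
    return dict(result)


print(faulty_projects_by_slot(SAMPLE))
```

Both faulty results land on folding slot 01, while slot 02 keeps making progress, so the failures follow one slot rather than one project.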
Then I realised after some frames that FAHControl indicated a much higher speed for the 780 (TPF 1:56) and a much lower speed for the 970 (TPF 2:43), both running a 9201 WU.
Wait... that is exactly the opposite of what I would expect: the 780 at 2:43 and the 970 at 1:56.

So my conclusion is that the enumeration shows the wrong GPUs in the FAH environment.
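
The TPF comparison behind that conclusion is simple arithmetic, sketched here in Python. The helper names tpf_seconds and looks_swapped are made up for this example, and the claim that the 970 should be the faster card on 9201 is my own expectation, not something the client reports.

```python
def tpf_seconds(tpf):
    """Convert a 'minutes:seconds' TPF string such as '1:56' to seconds."""
    minutes, seconds = tpf.split(":")
    return int(minutes) * 60 + int(seconds)


def looks_swapped(observed, expected_faster, expected_slower):
    """True if the card expected to be faster shows the longer time per frame."""
    return tpf_seconds(observed[expected_faster]) > tpf_seconds(observed[expected_slower])


# TPF values as FAHControl labelled them, both cards running a 9201 WU.
observed = {"GeForce GTX 780": "1:56", "GeForce GTX 970": "2:43"}

print(tpf_seconds(observed["GeForce GTX 780"]))  # 116 seconds
print(tpf_seconds(observed["GeForce GTX 970"]))  # 163 seconds
print(looks_swapped(observed, "GeForce GTX 970", "GeForce GTX 780"))  # True
```

A lower TPF means a faster card, so 116 s labelled as the 780 and 163 s labelled as the 970 is the pattern you would see if the two labels were exchanged.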

Quick check with ocore_v20

Code: Select all
OpenCL compatible devices:
name: GeForce GTX 970 | platformId: 0 deviceId: 0
name: GeForce GTX 780 | platformId: 0 deviceId: 1
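
The comparison between this listing and the client's own enumeration can be sketched in Python as follows. This is an illustration under assumptions: the function names are hypothetical, the two text formats are taken from the outputs quoted in this post, and matching by card name only works when the cards have different names (two identical cards would need another way to tell them apart).

```python
import re

CLIENT_LOG = """\
GPU 0: NVIDIA:3 GK110 [GeForce GTX 780]
GPU 1: NVIDIA:4 GM204 [GeForce GTX 970]
"""

OCORE_OUTPUT = """\
OpenCL compatible devices:
name: GeForce GTX 970 | platformId: 0 deviceId: 0
name: GeForce GTX 780 | platformId: 0 deviceId: 1
"""

CLIENT = re.compile(r"GPU (\d+): \S+ \S+ \[(.+)\]")
OCORE = re.compile(r"name: (.+?) \| platformId: \d+ deviceId: (\d+)")


def suggest_opencl_indexes(slot_to_client_gpu, client_text, ocore_text):
    """Map each slot id to the OpenCL deviceId of the card the client shows for it.

    slot_to_client_gpu: {slot id: GPU index as enumerated by the client}
    """
    client_names = {int(m.group(1)): m.group(2) for m in CLIENT.finditer(client_text)}
    devices = [(m.group(1).strip(), int(m.group(2))) for m in OCORE.finditer(ocore_text)]
    names = [name for name, _ in devices]
    if len(set(names)) != len(names):
        raise ValueError("identical card names: cannot match by name")
    device_ids = dict(devices)
    return {
        slot: device_ids[client_names[gpu]]
        for slot, gpu in slot_to_client_gpu.items()
    }


# Slot 01 was added as gpu:0 and slot 02 as gpu:1 in the log above.
print(suggest_opencl_indexes({1: 0, 2: 1}, CLIENT_LOG, OCORE_OUTPUT))
```

For this machine the sketch suggests index 1 for slot 1 (the GTX 780) and index 0 for slot 2 (the GTX 970). It says nothing about the CUDA ordering, which would have to be checked separately.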


The question now is: do I really need to uninstall the whole client, or should deleting and recreating the slots be enough? Obviously it was not enough on the first try. If a reinstall is required, I have to wait another 20 hours, as the CPU is still folding a WU.
Or can I set the OpenCL index in the advanced settings based on what ocore reports, which is closer to what I see from the actual TPF?

Update: both 9201 WUs finally finished, and I changed the config by adding the OpenCL and CUDA indexes (both identical, as suggested by ocore):

Code: Select all
*********************** Log Started 2014-11-08T15:45:49Z ***********************
15:45:49:************************* Folding@home Client *************************
15:45:49:    Website: http://folding.stanford.edu/
15:45:49:  Copyright: (c) 2009-2014 Stanford University
15:45:49:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:45:49:       Args: --child --lifeline 13886 /etc/fahclient/config.xml --run-as
15:45:49:             fahclient --pid-file=/var/run/fahclient.pid --daemon
15:45:49:     Config: /etc/fahclient/config.xml
15:45:49:******************************** Build ********************************
15:45:49:    Version: 7.4.4
15:45:49:       Date: Mar 4 2014
15:45:49:       Time: 12:01:17
15:45:49:    SVN Rev: 4130
15:45:49:     Branch: fah/trunk/client
15:45:49:   Compiler: GNU 4.1.2 20080704 (Red Hat 4.1.2-46)
15:45:49:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
15:45:49:             -fno-unsafe-math-optimizations -msse2
15:45:49:   Platform: linux2 2.6.18-164.11.1.el5
15:45:49:       Bits: 64
15:45:49:       Mode: Release
15:45:49:******************************* System ********************************
15:45:49:        CPU: Intel(R) Core(TM) i7-2600S CPU @ 2.80GHz
15:45:49:     CPU ID: GenuineIntel Family 6 Model 42 Stepping 7
15:45:49:       CPUs: 8
15:45:49:     Memory: 7.59GiB
15:45:49:Free Memory: 5.91GiB
15:45:49:    Threads: POSIX_THREADS
15:45:49: OS Version: 3.10
15:45:49:Has Battery: false
15:45:49: On Battery: false
15:45:49: UTC Offset: 9
15:45:49:        PID: 13888
15:45:49:        CWD: /var/lib/fahclient
15:45:49:         OS: Linux 3.10.0-123.9.2.el7.x86_64 x86_64
15:45:49:    OS Arch: AMD64
15:45:49:       GPUs: 2
15:45:49:      GPU 0: NVIDIA:3 GK110 [GeForce GTX 780]
15:45:49:      GPU 1: NVIDIA:4 GM204 [GeForce GTX 970]
15:45:49:       CUDA: 5.2
15:45:49:CUDA Driver: 6050
15:45:49:***********************************************************************
15:45:49:<config>
15:45:49:  <!-- Logging -->
15:45:49:  <log-rotate-max v='1024'/>
15:45:49:
15:45:49:  <!-- Network -->
15:45:49:  <proxy v=':8080'/>
15:45:49:
15:45:49:  <!-- Slot Control -->
15:45:49:  <power v='full'/>
15:45:49:
15:45:49:  <!-- User Information -->
15:45:49:  <team v='3446'/>
15:45:49:  <user v='ChristianFAH'/>
15:45:49:
15:45:49:  <!-- Folding Slots -->
15:45:49:  <slot id='0' type='CPU'>
15:45:49:    <client-type v='beta'/>
15:45:49:    <cpus v='6'/>
15:45:49:    <pause-on-start v='true'/>
15:45:49:  </slot>
15:45:49:  <slot id='1' type='GPU'>
15:45:49:    <client-type v='beta'/>
15:45:49:    <cuda-index v='1'/>
15:45:49:    <opencl-index v='1'/>
15:45:49:    <pause-on-start v='true'/>
15:45:49:  </slot>
15:45:49:  <slot id='2' type='GPU'>
15:45:49:    <client-type v='beta'/>
15:45:49:    <cuda-index v='0'/>
15:45:49:    <opencl-index v='0'/>
15:45:49:    <pause-on-start v='true'/>
15:45:49:  </slot>
15:45:49:</config>
15:45:49:Switching to user fahclient
15:45:49:Trying to access database...
15:45:53:Successfully acquired database lock
15:45:53:Enabled folding slot 00: PAUSED cpu:6 (by user)
15:45:53:Enabled folding slot 01: PAUSED gpu:0:GK110 [GeForce GTX 780] (by user)
15:45:53:Enabled folding slot 02: PAUSED gpu:1:GM204 [GeForce GTX 970] (by user)
15:46:11:FS00:Unpaused
15:46:11:WU00:FS00:Starting


The 780 now gets 13000 and the 970 keeps getting 9201, which is the world as I have known it for the last week.

The frame times (TPF) still need to settle a bit, but it looks OK now: 1:55 on the 970 and 7:25 for 13000 on the 780. :)

The main problem is that with the wrong mapping the assignment server can't do its job, so it gave me WUs that are not possible on the 970 (and that messes up my statistics). Otherwise everything is back to normal.

Postby bruce » Sat Nov 08, 2014 6:04 pm

The basic architecture of a Kepler and a Maxwell GPU is different enough that the drivers AND FAHClient AND the Assignment Server AND any WU data that exists in the /work directory are all supposed to recognize the difference and to manage them differently. Installing the client performs certain initialization steps which don't happen if you just remove one GPU and replace it with another type of GPU, so reinstallation is recommended. Then, too, there are a couple of obscure bugs that have never been fixed, so getting the setup properly aligned can go wrong for either reason, and nobody has systematically tested all of the possibilities.

I would probably try removing both GPU slots plus any remnants of active WUs that exist in /work, which SHOULD reinitialize the slots more or less like a removal/re-installation [with data], but whether that works has not been systematically tested either. The bottom line is that you may have to set the values of the indexes manually, but you would also have to do that when there's no residual WU data in /work, so full re-installation is probably the simplest method.