
FAHClient does not support more than 10 GPUs

Posted: Mon Mar 23, 2020 5:59 am
by caseymdk
Hi all,

I set up an AWS p2.16xlarge EC2 instance, which has 16 Tesla K80s in it. I wanted to bang out some work units while also learning a bit about how AWS GPU instances work. I was able to get work units assigned to all slots, but once they started running, I saw reasonable ETAs of a few hours for 10 of the GPU slots, and an ETA of 3 days for the other 6. This didn't make sense, as the base credit was roughly the same for each work unit.

I used the "nvidia-smi" command to check the resource utilization of each GPU, and this is what I found. Notice how 7 FahCore processes are all piled onto GPU 1, while 6 GPUs (10-15) sit idle.

Code:
ubuntu@ip-172-31-33-143:/var/lib/fahclient/work/11/01$ nvidia-smi
Mon Mar 23 02:40:39 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:0F.0 Off |                    0 |
| N/A   79C    P0   147W / 149W |    159MiB / 11441MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:00:10.0 Off |                    0 |
| N/A   60C    P0   148W / 149W |   1001MiB / 11441MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:00:11.0 Off |                    0 |
| N/A   80C    P0   147W / 149W |    216MiB / 11441MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:00:12.0 Off |                    0 |
| N/A   63C    P0   145W / 149W |    121MiB / 11441MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:00:13.0 Off |                    0 |
| N/A   78C    P0   148W / 149W |    158MiB / 11441MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:00:14.0 Off |                    0 |
| N/A   62C    P0   148W / 149W |    158MiB / 11441MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:00:15.0 Off |                    0 |
| N/A   80C    P0   145W / 149W |    159MiB / 11441MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:00:16.0 Off |                    0 |
| N/A   63C    P0   149W / 149W |    121MiB / 11441MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   8  Tesla K80           On   | 00000000:00:17.0 Off |                    0 |
| N/A   81C    P0   150W / 149W |    159MiB / 11441MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   9  Tesla K80           On   | 00000000:00:18.0 Off |                    0 |
| N/A   63C    P0   151W / 149W |    205MiB / 11441MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|  10  Tesla K80           On   | 00000000:00:19.0 Off |                    0 |
| N/A   40C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  11  Tesla K80           On   | 00000000:00:1A.0 Off |                    0 |
| N/A   36C    P8    31W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  12  Tesla K80           On   | 00000000:00:1B.0 Off |                    0 |
| N/A   36C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  13  Tesla K80           On   | 00000000:00:1C.0 Off |                    0 |
| N/A   30C    P8    30W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  14  Tesla K80           On   | 00000000:00:1D.0 Off |                    0 |
| N/A   39C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  15  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8    31W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     15920      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   148MiB |
|    1     15934      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
|    1     15948      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
|    1     15955      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
|    1     15962      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
|    1     15969      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   194MiB |
|    1     15976      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   205MiB |
|    1     19333      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   147MiB |
|    2     16190      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   205MiB |
|    3     16843      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
|    4     15927      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   147MiB |
|    5     16985      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   147MiB |
|    6     17325      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   148MiB |
|    7     17462      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
|    8     17469      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   148MiB |
|    9     19100      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   194MiB |
+-----------------------------------------------------------------------------+


I tried explicitly setting the GPU indexes (0 through 15) on each slot, but after restarting the client with the new config, I still saw the same behaviour. I wondered whether the GPU indexes from Folding@home are being passed to CUDA/the drivers in decimal when they need to be in hex? No idea.

Let me know if anyone has any thoughts on this, or if you've seen similar behaviour.

Thanks!

Re: FAHClient does not support more than 10 GPUs

Posted: Mon Mar 23, 2020 7:39 am
by Jesse_V
Pretty cool setup, thanks and welcome to the forum.

The GPU indices should be in standard notation. Did you change the "opencl-index" or "cuda-index" option?

It might be helpful to post some of the log. Any clues in there?

Re: FAHClient does not support more than 10 GPUs

Posted: Mon Mar 23, 2020 7:49 am
by caseymdk
Looks like the indexes are being passed to FahCore correctly. The fact that multiple slots were all assigned to GPU 1 makes me think the core is only taking the first character of the argument, though I can't see why it would do that. This is the log from when the OpenCL, GPU, and CUDA indexes were left at -1. Everything looks correct, but this instance got assigned to GPU 1 instead of GPU 14.

Code:
02:08:06:WU15:FS15:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 15 -suffix 01 -version 705 -lifeline 15900 -checkpoint 3 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 14 -cuda-device 14 -gpu 14
02:08:06:WU15:FS15:Started FahCore on PID 15944
02:08:06:Started thread 16 on PID 15900
02:08:06:WU15:FS15:Core PID:15948
02:08:06:WU15:FS15:FahCore 0x22 started
02:08:06:WU14:FS14:0x22:*********************** Log Started 2020-03-23T02:08:06Z ***********************
02:08:06:WU14:FS14:0x22:*************************** Core22 Folding@home Core ***************************
02:08:06:WU14:FS14:0x22:       Type: 0x22
02:08:06:WU14:FS14:0x22:       Core: Core22
02:08:06:WU14:FS14:0x22:    Website: https://foldingathome.org/
02:08:06:WU14:FS14:0x22:  Copyright: (c) 2009-2018 foldingathome.org
02:08:06:WU14:FS14:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
02:08:06:WU14:FS14:0x22:             <rafal.wiewiora@choderalab.org>
02:08:06:WU14:FS14:0x22:       Args: -dir 14 -suffix 01 -version 705 -lifeline 15937 -checkpoint 3
02:08:06:WU14:FS14:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 13
02:08:06:WU14:FS14:0x22:             -cuda-device 13 -gpu 13
02:08:06:WU14:FS14:0x22:     Config: <none>

Re: FAHClient does not support more than 10 GPUs

Posted: Mon Mar 23, 2020 11:48 am
by _r2w_ben
Someone else came across this with 13 GPUs. As you've noticed, the core seems to only parse the first character of the argument into a digit.

Re: FAHClient does not support more than 10 GPUs

Posted: Mon Mar 23, 2020 12:07 pm
by foldy

Re: FAHClient does not support more than 10 GPUs

Posted: Thu Apr 23, 2020 5:01 pm
by cfhdev
Now that a new beta version is out, has anyone tried it with more than 10 GPUs? I just want to see if this is still an issue with 7.6.10.

Re: FAHClient does not support more than 10 GPUs

Posted: Tue Apr 28, 2020 12:02 am
by cfhdev
Well, since I received no answer, I tried on a 16-GPU system and got the same results: indexes 0-9 are fine, but 10 and up are treated as GPU 1.

Re: FAHClient does not support more than 10 GPUs

Posted: Tue Apr 28, 2020 2:38 am
by MeeLee
With modern hardware it's less feasible to hit 10 GPUs, especially the cards that are good for crunching: 1.5 kW from the outlet can drive about 6 or 7 GPUs, 8 if you tune them.
It makes more sense to run 2 powerful GPUs (e.g. RTX 2080 Ti) than 10 slower ones (e.g. GT 1030, GTX 1050, or older Kepler GPUs).
I believe the trend will only continue, meaning that in 5 years it'll make more sense to run one or two then-current GPUs than relatively 'older' RTX 2080 Ti cards.