1 core per GPU still holds true?

ProDigit · Post by **ProDigit** » Sat Jan 12, 2019 8:43 am

It has been said in the past that you should reserve 1 CPU core per graphics card.
Is there a ballpark for CPU speed relative to a card's performance? (e.g. does a 2GHz CPU bottleneck a GTX 1050 or so?)
I'm thinking of building a second system with fewer cores (quad core) but more powerful cards (RTX 2070 plus another, higher-end card in the future), and I want to know what kind of CPU I should be looking for: do I need high GHz, or will a 3GHz CPU do the job?

This matters to me because a 3GHz CPU, if it's fast enough, will usually consume less power than a 4GHz CPU and cost less.

My second question: my current CPU runs at only 1.89GHz, but it has 20 CPU threads (10 cores). Will FAH automatically assign a second core to stream data to a fast GPU when 1 core isn't enough, or will the one core bottleneck the card?
I have a faint impression that it does use more cores when CPU speed is low: with my 3 graphics cards connected, my system needs 4 threads reserved (the CPU runs at 99+% utilization with 16 of 20 CPU threads folding alongside 3 cards, and with 17 of 20 threads folding alongside only 2 graphics cards (1050 + 1060); that is 1 thread more than it should need).

foldy · Post by **foldy** » Sat Jan 12, 2019 9:18 am

FAH will only use one CPU core per GPU. But it makes sense to keep one CPU core free for the OS if you run CPU folding too, which matches your observation.

So with a quad-core CPU you can feed 4 GPUs with no CPU folding. Some say that with a hyperthreading CPU with 8 threads you can even drive 8 GPUs, but I have not tried that.

I have not heard of a CPU limit in GHz for GPU folding yet, but I am testing it now ...

ProDigit · Post by **ProDigit** » Sat Jan 12, 2019 9:32 am

Yeah, it would make sense that fast graphics cards may require more than just 1 core to feed them.
Or, if one is running 2 slow graphics cards (like 2x GT 1030), perhaps both can be run off 1 core without PPD loss (in case the load on the CPU core is lower than 50% per card)?

foldy · Post by **foldy** » Sat Jan 12, 2019 9:35 am

The CPU core usage comes from the Nvidia OpenCL driver, and it cannot use more than one CPU core. But sharing a CPU core between slow GPUs makes sense.

CPU clock test with an Intel 2500K (Sandy Bridge) from 1.6GHz to 4.0GHz, using FAHBench 2.3.1 64-bit (FahCore_21) on Windows 7 with a GTX 1080 Ti:

| CPU clock | Project 13782 | Project 9415 | Note |
|-----------|---------------|--------------|------|
| 1.6GHz | 222 | 183 (-20%) | BAD |
| 2.0GHz | 265 | 219 (-5%) | within measurement tolerance |
| 2.5GHz | 267 | 222 (-3%) | within measurement tolerance |
| 4.0GHz | 268 | 230 | |

Looks like you don't want to go below a 2GHz CPU on Windows; I guess Linux is the same. Maybe for newer-generation CPUs the limit is lower.
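For anyone who wants to redo the percentages from the clock test, here is a minimal Python sketch. It assumes (this is not stated in the post) that a higher FAHBench score is better and that the 4.0GHz row is the baseline. The names `scores` and `loss_vs_baseline` are made up for illustration.

```python
# FAHBench scores from the post, keyed by CPU clock in GHz.
# Each value is (project 13782, project 9415). Assumption: higher is better.
scores = {
    1.6: (222, 183),
    2.0: (265, 219),
    2.5: (267, 222),
    4.0: (268, 230),
}

def loss_vs_baseline(scores, baseline_clock=4.0):
    """Return {clock: (pct_13782, pct_9415)} relative to the baseline row."""
    base = scores[baseline_clock]
    return {
        clock: tuple(round(100.0 * (s - b) / b, 1) for s, b in zip(row, base))
        for clock, row in scores.items()
    }

if __name__ == "__main__":
    for clock, pct in sorted(loss_vs_baseline(scores).items()):
        print(f"{clock:.1f} GHz: 13782 {pct[0]:+.1f}%  9415 {pct[1]:+.1f}%")
```

The bracketed percentages in the post line up with the project 9415 column. Under the same assumptions, project 13782 loses noticeably only at 1.6GHz.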

There are also some work units which do not use a full CPU core like project 14156 - there CPU clock doesn't matter.

AMD GPUs have low CPU usage, so there you can maybe drive more GPUs on one CPU core or go lower than 2GHz.

I guess your 20-thread server CPU at 1.89GHz is fine.

ProDigit · Post by **ProDigit** » Sat Jan 12, 2019 10:49 am

The 1080ti should be close to the RTX 2070 in terms of performance.

Is there a way to test 2x slower graphics cards, and set affinity to 1 core only, and see what happens?
If 2GB equals ~1M PPD, it would be interesting to see if 2x 1050 or 1060 cards can run on one core.

Just curious if the 1 core per graphics card is a set rule, or more a guideline.

foldy · Post by **foldy** » Sat Jan 12, 2019 11:01 am

Good idea. There are some tools to lock a process to a physical CPU core for testing. You can run several instances of FAHBench and use a real work unit for testing. Or measure the TPF of FAH while it is running.

ProDigit · Post by **ProDigit** » Sat Jan 12, 2019 11:08 am

Will try once I'm at home with my 1050 and 1030.
Interested to see if I could lock both of them to CPU affinity 0 or so, and what would happen.
In Windows it can be done via taskmanager.
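On Linux the same kind of pinning can be scripted instead of clicked. The sketch below is only an illustration under assumptions: it uses Python's standard `os.sched_setaffinity`, which exists on Linux but not on Windows, and the helper name `pin_to_cpu` is made up. Finding the PID of the FAH core process is left out; the demo simply pins the script itself.

```python
import os

def pin_to_cpu(pid, cpu):
    """Restrict process `pid` (0 means the calling process) to one logical CPU.

    Returns the affinity set the kernel reports afterwards.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("os.sched_setaffinity is not available on this OS")
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)

if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)   # logical CPUs this process may use now
    target = min(allowed)               # pick one of them for the demo
    print("pinned to", pin_to_cpu(0, target))
    os.sched_setaffinity(0, allowed)    # restore the original mask
```

Pinning two GPU-feeding processes to the same logical CPU this way would be one way to try the "2 cards on 1 core" test. On Windows, Task Manager or a third-party library would be needed instead.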

ProDigit · Post by **ProDigit** » Mon Jan 14, 2019 4:07 am

The only information I could get is the kernel times.
Though each GPU uses 100% of one CPU thread, the kernel times for my 1060 hovered around 90%, the 1050 around 85%, and the 1030 around 50%.
This was at 1.89GHz.

When I combined the 1050 and 1030 onto 1 core (meaning the 1050 ran on core 2 and the 1030 on the hyperthreading core 3), my turbo boost kicked in and ran the CPU at 2GHz.
I ran the 1060 on core 0,
the 1050 on core 2,
the 1030 on core 3.

I then added CPU folding, and saw that for CPU folding one thread was also used at 100%, with only 29% kernel time.
I presume this is the 'missing core' I was referring to in other threads.
That is, I have 10 cores, 20 threads, and 3 graphics cards.
Logically I would have 17 threads left over for CPU folding, no?
But I saw the same CPU usage at 16 threads.
This explains it:
one CPU thread is occupied feeding the other 16 for folding, like one would be for a GPU.

So now I have 1 dedicated CPU core (2 threads) to feed the 1030 and 1050,
1 dedicated core to feed the 1060, with its hyperthread going to system overhead,
and 16 dedicated CPU threads for folding.
It's easy to set up, though every time a new work unit comes in (every 5 to 20 hours), I have to reset the thread affinity.
I just want to see whether assigning thread affinity actually improves performance by keeping CPU folding from interfering with the GPU threads.

The overhead thread (core 1, which shares the L-cache with the 1060's thread on core 0) uses about 20% CPU.
That leaves me around 80% of a single thread for the operating system, or for Windows doing its standard background shenanigans.

I also tried sharing Core 2 with the 1050 and 1030 graphics card (meaning, running 2 graphics cards from 1 core), and the system didn't crash, but I probably wouldn't recommend doing that, as I had the feeling performance dropped.
Kernel times on the shared thread (core 2) dropped from 85% on the 1050 to 75% with the 1050 and 1030 sharing the same thread.
I didn't run this test long enough to see how it would affect PPDs.

This might, however, work in case you have e.g. 2 graphics cards and 2 dedicated CPU threads, and want to add a third card to the mix. It should run, despite the '1 CPU slot per GPU' limitation.
I don't know how it would perform though...
If I ever did that, I'd share a CPU thread between the slowest cards for sure, and keep a dedicated CPU thread for the faster card.

foldy · Post by **foldy** » Mon Jan 14, 2019 6:26 pm

I ran an experiment using FAHBench with several GTX 1070s on Linux with 2 CPU cores.
FAH Project 14567
1 GPU = 229 ns/d
2 GPUs = 229 ns/d
3 GPUs = 219 ns/d
4 GPUs = 205 ns/d

You can see that for 1 or 2 GPUs there was no difference in speed, as 2 CPU cores are available.
But with more GPUs than CPU cores the speed goes down, by about 10% at a GPU-to-CPU ratio of 2:1.
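A small Python sketch of what those four numbers imply, under one assumption that the post does not spell out: each ns/d figure is the speed of a single GPU while N GPUs run at the same time. The function name `scaling_report` is invented for this example.

```python
# ns/day figures from the post (GTX 1070s, 2 CPU cores, project 14567).
# Assumption: each figure is the per-GPU speed while N GPUs run together.
ns_per_day = {1: 229, 2: 229, 3: 219, 4: 205}
CPU_CORES = 2

def scaling_report(ns_per_day, cpu_cores):
    """Return rows of (gpus, gpu_to_cpu_ratio, per_gpu_loss_pct, total_ns_per_day)."""
    base = ns_per_day[1]
    rows = []
    for gpus, speed in sorted(ns_per_day.items()):
        loss_pct = round(100.0 * (base - speed) / base, 1)
        rows.append((gpus, gpus / cpu_cores, loss_pct, gpus * speed))
    return rows

if __name__ == "__main__":
    for gpus, ratio, loss, total in scaling_report(ns_per_day, CPU_CORES):
        print(f"{gpus} GPU(s), ratio {ratio:.1f}:1, "
              f"per-GPU loss {loss:.1f}%, total {total} ns/d")
```

If that assumption holds, the per-GPU loss at 2:1 is about 10%, while the combined throughput of 4 GPUs is still far above that of 2.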

Now I need to do the same experiment using FAH

Tried running 2x RTX 2080 on 1 CPU core.
1 GPU = TPF 1:33 or TPF 2:29
2 GPUs = TPF 1:35 or TPF 2:33

Running 2 GPUs on 1 CPU core made TPF worse by only 2-4 seconds, so this is not a real problem.
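To put the TPF change in relative terms, here is a minimal Python sketch. It assumes the TPF values are written as minutes:seconds and that the two pairs belong to two different work units; the helper names are made up.

```python
def tpf_seconds(tpf):
    """Convert a 'minutes:seconds' TPF string such as '1:33' to seconds."""
    minutes, seconds = tpf.split(":")
    return int(minutes) * 60 + int(seconds)

def slowdown_pct(tpf_alone, tpf_shared):
    """Percent increase in TPF when two GPUs share one CPU core."""
    alone = tpf_seconds(tpf_alone)
    shared = tpf_seconds(tpf_shared)
    return round(100.0 * (shared - alone) / alone, 1)

if __name__ == "__main__":
    for alone, shared in (("1:33", "1:35"), ("2:29", "2:33")):
        print(f"TPF {alone} -> {shared}: +{slowdown_pct(alone, shared)}%")
```

Both cases come out below 3%, which is consistent with calling it not a real problem.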

GPU usage shows as 85% when GPU:CPU = 2:1. With equal numbers of CPU cores and GPUs, GPU usage shows 97%.
(All tests were done on PCIe 3.0 x1 risers, but even on PCIe 2.0 x1 the GPU usage still shows 97%.)

In general my feeling is that a 10% performance loss is not a problem. But running more than 4 fast GPUs on 1 CPU core may bottleneck worse. I cannot try that on Windows; these are Linux-only tests.

ProDigit · Post by **ProDigit** » Mon Jan 14, 2019 11:25 pm

I have to note, from my recent observations, that the GPU utilization reported by Task Manager in Windows (and perhaps by Linux tools as well) is not very accurate, at least for compute_0 (GPU utilization).
It's the closest thing we have to a record of GPU activity, but it's still very inaccurate.
When I dropped the VRAM speed to idle (from the 3.7GHz stock speed to the 800-850MHz idle VRAM speed), Task Manager showed my GPUs still folding at 91-95%.
I thought, great! My GPUs are folding at nearly 100% capacity, and my graphics cards are running so much cooler!
But my PPDs dropped by 50%.
There's no way PPDs drop to 50%, if the GPU is doing the same amount of work it did before.
If the GPU did the same job, with slower ram, I'd expect a small drop, but not 50%. Or, if setting RAM to idle lowered PPD by 50%, GPU activity (compute_0) in taskmanager should show close to 50% (which it didn't).

So I'd think that compute_0 in Task Manager is probably scaled to how much data is transferred to and from VRAM (VRAM-to-GPU bus speed) rather than to actual GPU activity, and depends on the slowest factor (either RAM or GPU).
When one of them bottlenecks, Task Manager will still show 100%, even if the RAM or the GPU is running at 20% of its capacity.

Another thing: I would presume that running a high-powered card like the RTX 2080 over a PCIe 1x slot with a riser would also lower the card's performance significantly compared with running it in a native 16x slot.
Perhaps it slows it down so much that the card is running at a fraction of its capacity?
The results you recorded are probably affected by the PCIe bottleneck.
However, it is still interesting to see how much CPU you need to saturate a PCIe 1x slot.
Running my GTX 1050 in a PCIe 16x slot gets me 133k PPD. Switched to a 1x slot with a riser, it runs at 100k PPD.
I don't know whether this indicates that a PCIe 1x slot is about as fast as an RX 570 or GTX 770 needs, and 33% too slow for a 1050. But correct me if I'm wrong on this.
I'm just trying to understand it all.
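As a side note on the 133k versus 100k PPD figures: the percentage depends on which number is taken as the base. A minimal Python sketch, using only the two PPD values from the post (the variable and function names are made up):

```python
def pct_change(old, new):
    """Percent change going from `old` to `new` (negative means a drop)."""
    return round(100.0 * (new - old) / old, 1)

ppd_x16 = 133_000  # GTX 1050 in a PCIe 16x slot (from the post)
ppd_x1 = 100_000   # same card on a 1x riser (from the post)

drop_on_x1 = pct_change(ppd_x16, ppd_x1)    # loss when moving 16x -> 1x
gain_on_x16 = pct_change(ppd_x1, ppd_x16)   # gain when moving 1x -> 16x

if __name__ == "__main__":
    print(f"16x -> 1x: {drop_on_x1}%   1x -> 16x: +{gain_on_x16}%")
```

So the card loses roughly a quarter of its PPD on the 1x riser, which is the same thing as the 16x slot being about a third faster.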

I presume the found results might be more accurate, when running the 2080 from a 16x slot, or compare results over PCIE 1x slot on slower graphics cards (I would presume a GTX 1050 or higher is bottlenecked by the 1x speeds).

Another thing about the CPU:
I have learned that a CPU thread that has been locked by a graphics card spends one portion of its time just holding the lock (mostly zero or idle data being fed into the CPU to keep it tied to that thread) and another portion doing the actual work.
That makes it easy to see how a lower-performing 1030 uses 1 core, while the 1060, which performs 3x faster, still shows the same CPU usage as the slower card, save for kernel times.
The 1060 is 3x faster, and has 3x higher kernel times. (It gets 300k PPD while the 1030 gets only 41k PPD, but most of that difference is bonus points; it is not actually performing 7x faster.)

From the kernel times, I can roughly deduce that for a Xeon to be fast enough to feed the card:
A GT 1030 needs a 600MHz CPU or higher to feed its data.
A GTX 1050 needs 1GHz or higher.
A GTX 1060 needs 1.7GHz or higher.
A GTX 1070 needs 2.5GHz or higher.
An RTX 2070 probably needs 3.4GHz or higher.
An RTX 2080 probably needs 3.8GHz or higher.
A Titan V should need 5GHz or more to get optimal results.

Numbers may be lower for Core i and Ryzen CPUs, as they are built differently from Xeons.

The fact that you were able to run the RTX 2080 down to 2GHz before serious throttling happened might indicate that you're running the RTX 2080 at PCIe 1x speeds, so essentially you're running it at 1/3rd of its performance?
Just trying to make sense of it all...

foldy · Post by **foldy** » Tue Jan 15, 2019 7:35 am

The PCIe limit bottlenecks a lot on Windows and less on Linux. I can run a GTX 1080 Ti (~= RTX 2070) at PCIe x16 with a 2GHz CPU without any loss on Windows, so your CPU MHz list is not correct. I still think that a 10-20% performance loss because of bottlenecks is not so bad, and the hardware is still worth running.

I found that on Windows, for fast Nvidia GPUs, PCIe 3.0 x4 and a ~1.5GHz CPU are the lower limit before serious performance losses like 50% occur.
On Linux, for fast GPUs, PCIe 2.0 x1 and a ~1.5GHz CPU are the lower limit before serious performance losses like 50% occur.

My theory is that the CPU loads data through PCIe to the GPU, and then the GPU processes the data without the CPU until it has a result. Then the CPU gets the result from the GPU through PCIe and sends the next data to the GPU, dependent on the result. So the CPU-to-GPU communication occurs in peaks, with the CPU spin-waiting in between.
If the PCIe bus is too slow, or the driver uses it badly like Nvidia OpenCL on Windows, then the CPU-to-GPU communication takes too long, which leaves the GPU idle for a short time and slows down FAH. If the CPU is too slow, the same thing occurs. So we have a ratio of CPU/PCIe time to GPU time: if it is CPU/PCIe 1 to GPU 99, that is great. If it is CPU/PCIe 10 to GPU 90, we have a light bottleneck. And if it is CPU/PCIe 50 to GPU 50, then we have bad performance.

Post by **bruce** » Tue Jan 15, 2019 5:34 pm

foldy wrote: My theory is that the CPU loads data through PCIe to the GPU, and then the GPU processes the data without the CPU until it has a result. Then the CPU gets the result from the GPU through PCIe and sends the next data to the GPU, dependent on the result. So the CPU-to-GPU communication occurs in peaks, with the CPU spin-waiting in between.
If the PCIe bus is too slow, or the driver uses it badly like Nvidia OpenCL on Windows, then the CPU-to-GPU communication takes too long, which leaves the GPU idle for a short time and slows down FAH. If the CPU is too slow, the same thing occurs. So we have a ratio of CPU/PCIe time to GPU time: if it is CPU/PCIe 1 to GPU 99, that is great. If it is CPU/PCIe 10 to GPU 90, we have a light bottleneck. And if it is CPU/PCIe 50 to GPU 50, then we have bad performance.

That's my theory, too. Certainly the PCIe data transfers occur in peaks, and the CPU spin-waits so that, for the sake of increased game frame rates, it is always ready to transfer data without the added delay of interrupt processing and task-switching overhead. (That's probably insignificant for FAHBench.)

When the GPU finishes the processing that has been assigned to it, it will pause until a new block of data is transferred so it's ready to be processed. It doesn't matter whether the PCIe bus is slow or the CPU is slow or the CPU is busy with some other task -- the result is the same -- nothing for the GPU to process yielding lower average performance. It's sort of like your browser opening a document over a slow internet connection. As soon as the first page of data is displayed, you start reading and it doesn't bother you if pages 2, 3, ... have been loaded into memory or not; they'll be ready to read before you're ready to look at them.

All of the atoms are divided into multiple pages. As soon as the positions of all nearby atoms are available for the current time, the GPU can begin summing the forces on each atom. As this process advances, I suspect that this data must be returned to main RAM. Once the last page is available to the CPU, updated positions of all atoms for the next time step can be computed and sending them to the GPU begins again.

What I do not know is how the software figures out how to allocate VRAM. (What constitutes a "page"?) Clearly, once the first page is transferred, the GPU can start running and the second page can start transferring. If multiple pages fit in VRAM, other shaders can process them in parallel, so the GPU doesn't need to wait again until the last page has been computed.

ProDigit · Post by **ProDigit** » Tue Jan 15, 2019 10:55 pm

Above, I looked at it from the perspective of running a card within 90% of efficiency, not 50%.
I'm already a bit disappointed if the card drops 10-20% vs a PCIE16x slot.
The drop off you're seeing on your faster card, seems to be in line with the drop off I'm seeing on my 1050, about 20%.
This would reinforce the idea that GPU WUs are sent from CPU to GPU in packets, and that continuous throughput isn't necessary.

I would, however, say that this kind of programming is still extremely inefficient.
Many folding cards have several gigabytes of VRAM that could have been used as a buffer, and it isn't (only 400MB is used on my 1060, and sometimes less than 120MB on my 1030).
And the performance drop between a 16x slot and a 1x slot is undeniable!
Perhaps the creators of FAH did not anticipate this problem with the cards of the past; but now that cards are much faster, they tend to sit waiting until a new packet is received, so the card runs at lower performance.
I've bought more hardware; a PCIE 1x slot splitter.
It splits the signal into 4 PCIE 1x slots. Will see how it might affect GPU performance.
Also bought a set of cheap GTX 1060 cards from China. They were priced about 1/3rd the new price, so let's see how they will perform on a PCIE splitter.
Estimated time of arrival is in 1 month....

I hope that the person writing the program will add the ability to increase VRAM usage. This might run the card a tad hotter, but it could potentially get rid of the PCIe bottleneck, and it might become possible to run many powerful cards on a single PCIe 1x slot in parallel.

I'm not really understanding what you mean with the part:

if it is CPU/pcie 1 to GPU 99 that is great. If it is CPU/pcie 10 to GPU 90 we have a light bottleneck. And if it is CPU/pcie 50 to GPU 50 then we have bad performance

I notice that the kernel times on the CPU show that the CPU is nearly constantly active, which could indicate a constant flow of data to the card.
Or at least the CPU is constantly active, perhaps prepping the packets.

foldy wrote: GPU usage shows as 85% when GPU:CPU = 2:1. With equal numbers of CPU cores and GPUs, GPU usage shows 97%.
(All tests were done on PCIe 3.0 x1 risers, but even on PCIe 2.0 x1 the GPU usage still shows 97%.)

In general my feeling is that a 10% performance loss is not a problem. But running more than 4 fast GPUs on 1 CPU core may bottleneck worse. I cannot try that on Windows; these are Linux-only tests.

Do you, by any chance, have a Kill A Watt meter to check whether running in this configuration saves you power?
I'd also be interested in seeing how PPDs compare.
If each card runs at 75% of its PPD when sharing a PCIe or CPU slot, but uses close to the same power, it might not be a good setup.

I'm sure running 2 graphics cards off 1 CPU core and 1 PCIe 1x slot will net higher PPD overall than running only a single graphics card off it.
However, it will definitely run less efficiently (PPD/watt-wise).

It might be a cheap solution to adding those lower performing graphics cards in parallel with one another, chewing up minimal CPU and PCIE resources.
For faster cards, I'd probably recommend to keep at least a dedicated PCIE 16x or 8x slot available.

foldy · Post by **foldy** » Wed Jan 16, 2019 12:41 pm

The kernel times of the CPU come from the Nvidia OpenCL driver. Additionally, FAH uses the CPU to put data onto the GPU in packets, I guess for 100 msec every 1 second. If you have 10 GPUs running on 1 CPU core, then there is a high chance that 2 GPUs want data transferred from the CPU at the same time, so one GPU is delayed. I too wish the developers could make it possible to queue more jobs on the GPU, but as I understand it the CPU needs to wait for the result data of the first GPU job in order to generate the follow-up data for the next GPU job.
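The "high chance" can be roughed out with a small Python sketch. It is only a back-of-the-envelope model under assumptions that are not from the post: each GPU needs the shared CPU core for a fixed fraction of the time (0.1, from the guessed 100 msec per second), and the GPUs' requests are independent of each other. The function name `p_contention` is made up.

```python
from math import comb

def p_contention(n_gpus, duty=0.1):
    """Probability that, at a random instant, 2 or more of `n_gpus` GPUs
    want the shared CPU core, if each needs it a fraction `duty` of the
    time and the requests are independent (a simplifying assumption)."""
    p_none = (1 - duty) ** n_gpus
    p_one = comb(n_gpus, 1) * duty * (1 - duty) ** (n_gpus - 1)
    return 1 - p_none - p_one

if __name__ == "__main__":
    for n in (2, 4, 10):
        print(f"{n} GPUs on 1 core: {p_contention(n):.1%} chance of overlap")
```

In this simple model the overlap is rare for 2 GPUs and happens about a quarter of the time for 10. Real requests are probably not independent, so treat the numbers as an order-of-magnitude guide only.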

On Linux I can run an RTX 2080 at PCIe x1 with only a small performance loss. I think it is better to have a small number of fast GPUs than a high number of slow GPUs.

I cannot test with a Kill A Watt because my servers run in the cloud. But I guess that if I have some 20% performance loss on the GPU because of a limited CPU or PCIe, then the GPU idles during that time, which reduces power usage.

Post by **bruce** » Fri Jan 18, 2019 2:25 am

ProDigit wrote: I would however, say that this kind of programming, is still extremely inefficient.

Right. The efficient way to program it is to connect your CPU directly to the same RAM that the data is already in and avoid the need for a PCIe channel to transfer data, i.e. fold with your CPU. Once the data exists in main RAM and has to be transferred to VRAM, it's going to be inefficient to a varying degree, depending on how much data has to be transferred, how fast it can be transferred, and how much faster the GPU is than the CPU. Data transfers happen at random times, and the chances of data collisions increase as the channel utilization grows.

Cloud computing is extremely inefficient for the same reason. Some part of the data has to flow through a very slow channel.

Suppose it takes 1 unit of time to transfer a kernel of work to a GPU across a 16x slot. It will take 16 units of time to transfer the same amount of data across a 1x slot, and during that time the GPU won't have anything to do. (Of course in either case, your OS might want to ask for resources or another FAH slot might interrupt the best of intentions.) Now suppose you have enough shaders to complete that processing in 1000/100/10/1 units of time. The fast GPU will complete that work before more data can arrive, and utilization will be low. (And results have to be transferred back, too.)
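Those numbers can be turned into a rough utilization table with a few lines of Python. This is a sketch under stated assumptions, not a model of what FAH really does: the 16-unit transfer is taken to be a 1x slot, the GPU is assumed to sit idle for the whole transfer (no overlap of transfer and compute), and the return transfer of results is ignored. The names are made up for illustration.

```python
def gpu_utilization(compute_time, transfer_time):
    """Fraction of each cycle the GPU is busy, assuming it idles during
    the whole transfer (no overlap of transfer and compute)."""
    return compute_time / (compute_time + transfer_time)

TRANSFER = {"16x": 1, "1x": 16}      # units of time per transfer (assumed slots)
COMPUTE_TIMES = (1000, 100, 10, 1)   # slow GPU ... fast GPU, as in the post

if __name__ == "__main__":
    for slot, transfer in TRANSFER.items():
        for compute in COMPUTE_TIMES:
            util = gpu_utilization(compute, transfer)
            print(f"{slot:>3} slot, compute={compute:>4}: GPU busy {util:6.1%}")
```

In this model the channel utilization is simply one minus the GPU utilization, so a slow GPU barely notices the narrow slot while a fast GPU spends most of its time waiting.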

Buffering work in VRAM also takes time, so that doesn't help at all. You can't transfer data through a busy channel.

You have to use two resources to process a unit of work ... the Shaders and the PCIe channel. You're always going to be waiting on at least one of them, no matter how you configure your system. If your GPU is slow compared to the channel, then the GPU appears busy most of the time and the channel utilization is near 0. If your channel is slow compared to the GPU, the channel utilization goes up and the GPU utilization goes down. Assuming efficient programming, both will be busy part (or most) of the time.

Is the communications channel half-duplex or full-duplex? How much loss of performance do you get when there are N GPUs all sharing a single channel? (Assume a 16x channel is shared by 4 GPUs, each seeing a 4x connection. Now assume it's in use by one of them: what has to happen for a second one to start transferring to/from its GPU?)
