Folding Forum

Posted: **Sat Jan 30, 2021 11:12 pm**

Starman157 wrote:...I've noticed that GPU processing WU 13444 has higher performance for the first 27%, then drops about 200,000 PPD for the next 25%, rises again to starting values for the following 25%, then drops 200,000 PPD again to the end. At least that is on the 6900xt. So it appears non uniformity isn't just a small scale issue, but can appear to be long term and vary depending on where in the WU processing is. What I was trying to figure out was whether simultaneous CPU processing was causing it. It wasn't. I guess it's just a feature of that WU...

You're right that it's a feature of the WU; specifically, it's a feature of the Moonshot Project (134XX series). Some of those Projects have four phases of workload, where the first and third have the same performance while the second and fourth have the same performance. Here's the sequence:
1. Equilibrium in molecule A
2. Nonequilibrium switch from A to B
3. Equilibrium in B
4. Nonequilibrium switch from B to A

Thus, you may notice the difference in the TPF (reflected in PPD) on your system. This is specific to those GPU Projects. Of course, if there's CPU contention between the GPU and other applications, it can impact that difference.
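
To put rough numbers on the four-phase pattern, here is a minimal Python sketch. It assumes a 100-frame WU split at roughly the boundaries reported in the quoted post (27/25/25/23 frames); the TPF values and the function name are made up for illustration and are not measurements from this thread.

```python
def wu_time_seconds(phases):
    """Total run time for a WU given (frames, tpf_seconds) for each phase."""
    return sum(frames * tpf for frames, tpf in phases)

# Hypothetical TPFs: equilibrium phases at 60 s, nonequilibrium phases at 66 s.
phases = [(27, 60.0), (25, 66.0), (25, 60.0), (23, 66.0)]

total = wu_time_seconds(phases)
frames = sum(f for f, _ in phases)
print(f"{frames} frames, {total / 3600:.2f} h total, average TPF {total / frames:.2f} s")
```

The point of the sketch is only that a PPD reading taken during one phase will over- or under-state the whole-WU average, so comparisons should cover the full WU.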

Posted: **Sun Jan 31, 2021 1:38 am**

@PantherX
Yeah. I've been trying to "nail jelly to the tree". Specifically maximizing efficiency in light of a rather shaky underpinning of ever changing CPU and GPU demands. Thanks for explaining the idiosyncrasies of the 13xxx WUs. Since I noticed that there was a distinct pattern to the changes in GPU performance, I assumed it was how the WU was "programmed".

As for contention, that's what I've been trying to minimize since I was lucky enough to "score" a 6900xt. The previous card was a 5700xt, and the performance difference between the two is quite shocking. At first blush, it appears the 6900xt is about 3x-5x faster than the 5700xt. As such, I assumed feeding the beastly 6900xt would put more pressure on CPU resources to keep it fed (ignoring that the power requirements are at most only a minor increase over the 5700xt). Tracking down these contention issues when one doesn't fully understand the background of WU processing involves a lot of guesswork. Anyway, it seems everything is running smoothly and I'm happy with the balance I've finally achieved.

Folding ON!

Posted: **Sun Jan 31, 2021 2:18 am**

@MeeLee
I suspect that the Windows scheduling issues are more of a core quality issue than cache issues. I'm assuming of course that the Windows scheduler is completely ignorant of the programming of a particular thread and any possible parallelism that could be achieved by stuffing another thread onto an already consumed core. As far as the scheduler goes, it only sees that there is a process with something to do. Ok. Look around as to "who" can do it. This is where I think the scheduler gets it wrong. The quality scores (as reported by CTR, "Clock Tuner for Ryzen") on my 3950x range from 193 down to 136. I suspect that Windows is taking these into account when loading up cores. As such, I'm "guessing" that it'd prefer to load up a second thread on a busy core rather than use one of the lower quality cores that isn't doing anything (or at least doing very little). All I'm doing now is forcing the issue by ensuring that 15 of my best cores are used, with the background Windows services et al being forced onto the last remaining core. I've run across a free program, Process Lasso, that allows me to prioritize the various processes and set their affinity, since FAH doesn't have affinity control in the client itself.

However, I am forcing the GPU threads onto the already busy CPU threads as a second thread since I suspect the workloads are quite different and shouldn't create an FP32 contention issue. Also, the GPU needs are fairly bursty at odd occasions (mainly at checkpoint times) so the overall impact of GPU interruptions should be minimal.
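
The core selection described above can be sketched in a few lines of Python. This is only an illustration: the per-core scores are invented to span the 193 to 136 range mentioned in the post, `split_cores` is a hypothetical helper, and the mapping of physical core `c` to logical CPUs `2c` and `2c+1` is an assumption about how the OS enumerates SMT siblings.

```python
def split_cores(quality, reserve=1, smt=2):
    """Return (folding_cpus, background_cpus) as lists of logical CPU indices.

    quality: per-physical-core scores, index = core number (higher is better).
    reserve: number of lowest-scoring cores left for background services.
    Assumes the logical CPUs of core c are c*smt .. c*smt+smt-1, which is a
    common layout but not guaranteed.
    """
    ranked = sorted(range(len(quality)), key=lambda c: quality[c], reverse=True)
    cut = len(quality) - reserve
    keep, rest = ranked[:cut], ranked[cut:]

    def logical(cores):
        return sorted(c * smt + t for c in cores for t in range(smt))

    return logical(keep), logical(rest)

# Hypothetical scores for the 16 cores of a 3950x.
scores = [193, 190, 188, 185, 181, 178, 175, 172,
          168, 165, 160, 155, 150, 145, 140, 136]
folding, background = split_cores(scores)
print("folding CPUs:   ", folding)
print("background CPUs:", background)
```

The resulting lists are what one would enter in an affinity tool such as Process Lasso; if the third-party psutil package is available, `psutil.Process(pid).cpu_affinity(folding)` should apply the same list programmatically.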

Yes, it's difficult measuring all this with so many moving "parts". As such, I'm only interested in maximizing efficiency with the least user input and "futzing". So running down your provided list:

1. Manual is not the way to go. PBO, set and forget. You don't get the same granular capability when going manual, and you only end up creating a LOT of heat. PBO does a much better job than I can and I've been overclocking CPUs since 1985.
2. BIOS automatics only when absolutely necessary. I've hand tuned memory timings along with IF timings too.
3. I don't have control over that other than to ensure it has the latest "production" BIOS. I've carefully selected the mobo for the 3950x believing that it's a more than adequate match (I'm using an Asus ROG Crosshair VIII Hero wifi). Power delivery stages, as determined by others, are more than adequate for stable overclocking a 3950x.
4. Power settings. Nope. Full power all the time.
5. Cooling the CPU. Thermaltake Water 3.0 360mm radiator using 3 120mm fans on full speed.
6. Never pause a WU unless absolutely necessary, after all, PPD is TIME calculated.
7. Already know about PPD differences.
8. Ah, the rest of the Windows crap. Turned off if I can do without, or relegated to the unused core if I cannot.

The case is a Lian Li O11 Dynamic XL with the sides off. It's really only meant to "hold" onto the components. Including the 3 fans for the rad, there are 9 total fans in the case moving air around to various areas. I learned early on that full time folding creates a lot of heat, so thermals are an important consideration in my builds (and always have been). Powering all this is a Seasonic Prime Titanium 850W, which also happens to be the lowest recommended power level for a Radeon 6900xt (which runs maxed at almost 2.7GHz at 60C, 80C Tjunc) presently consuming 241W (although I've seen it as low as 200W).

The 3950x runs at 70-75C (depending on WU), 4.2GHz (thanks PBO) at 1.3V or so.

I've taken many considerations into account for this Folding build. The only thing left to maximize the efficiency was the affinity control, hence my request, figuring that native control within the application needing control would be better than other solutions (Process Lasso). I still would like the FAHClient to do what is needed, since I'm brute forcing after the fact and there's a minor impact to performance on process startup (before Process Lasso gets its hands on things).

Posted: **Sun Jan 31, 2021 7:22 pm**

PantherX wrote:...it's a feature of the Moonshot Project (134XX series). Some of those Projects have four phases of workload, where the first and third have the same performance while the second and fourth have the same performance...

Interesting stuff. I've run a couple of that WU but never noticed. But then again on my system the variations are small because my PPD average is small.

Posted: **Mon Feb 01, 2021 12:03 am**

Starman157 wrote:@MeeLee
I suspect that the Windows scheduling issues are more of a core quality issue than cache issues...

1. Manual is not the way to go. PBO, set and forget...
6. Never pause a WU unless absolutely necessary, after all, PPD is TIME calculated...

I'd have to disagree with you on some points.
Windows is actually aware of which L-cache a program's data is buffered in. It is more aware than we think!
It even predicts data to be loaded into the L-cache before the program calls for it.

1- Manual overclocking on the Ryzen is a skill. So long as you have consistent data to crunch (e.g. CPU folding of one specific WU or project), manual overclocking is much better than PBO.
You can increase the CPU frequency by about 5-15% over PBO. This is because the core clocks are fixed, rather than constantly fluctuating.
On my 3900x, for instance, PBO runs around 3.85GHz, while I can bump it to 3.92GHz.
Other projects run at 3.58GHz, and I can bump them to 3.87GHz with a manual overclock.
The problem is that when a project uses a power-hungry extension in the CPU, like AVX or something, the CPU may end up undervolted and throw errors.
That's where PBO becomes interesting, especially if low CPU intensity projects are mixed with high intensity projects.

6- People pause WUs when they try to compare performance between hardware, so they can be sure to run the same project WU on both. For instance, to measure whether an Asus RTX 2060 is as fast as an MSI 2060 or other...
The small pause introduced (sometimes with a PC reset) will lower PPD and jinx the score.
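
Why a pause costs points can be shown with a short Python sketch. It assumes the quick-return bonus has its commonly cited form, `base_points * max(1, sqrt(k * deadline / elapsed))`; the base points, `k` and deadline below are hypothetical placeholders, since the real values are set per project.

```python
import math

def wu_points(base_points, k, deadline_days, elapsed_days):
    """Credit with quick-return bonus: base * max(1, sqrt(k * deadline / elapsed))."""
    return base_points * max(1.0, math.sqrt(k * deadline_days / elapsed_days))

def ppd(base_points, k, deadline_days, elapsed_days):
    """Points per day for one WU that took elapsed_days from download to upload."""
    return wu_points(base_points, k, deadline_days, elapsed_days) / elapsed_days

# Hypothetical project constants, for illustration only.
base, k, deadline = 10_000, 0.75, 3.0

no_pause = ppd(base, k, deadline, elapsed_days=0.25)
paused = ppd(base, k, deadline, elapsed_days=0.25 + 1 / 24)  # one hour paused
print(f"{no_pause:,.0f} PPD without a pause, {paused:,.0f} PPD with a 1 h pause")
```

Because elapsed time appears both in the bonus and in the per-day division, a pause lowers PPD by more than the lost time alone would suggest.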

Posted: **Mon Feb 01, 2021 5:16 pm**

@MeeLee
1 - Manual overclocking with the tools provided by AMD does not have the same voltage and speed granularity that PBO provides. As for my specific 3950x, my own experience has proven that PBO provides an additional 75-100MHz higher clocks than manual at a fixed speed and voltage. Also, PBO is far more dynamic in its settings and adjusts according to load. Faster when light loads, slower when heavy. Makes sense. Also, manual overclocks eliminate any possibility of reaching the max speed on the 3950x (specifically 4.7GHz), so say goodbye to your lovely single thread performance. The Ryzen Master software also forces any overclock to a specific speed (across an entire CCX) and an entirely fixed voltage (which is usually set to achieve stable operation at the speed chosen). Nothing dynamic about it, contrary to PBO. I have managed to pump about 185W through the 3950x under ALL CORE LOAD in the quest for the highest stable clocks. The end result? Massive amounts of heat. Fierce really. Performance? Less than what PBO achieves at much lower voltages and higher clocks. Funny thing about overclocking. YMMV (your mileage may vary). So my 3950x is currently running 15 threads of Folding, consuming between 120-130W (less than the 140W max) at 4.175GHz (although it does get up to 4.25GHz) at the same time I'm crunching on a 6900xt consuming 250W at 2.7GHz. Happy as a clam, as the saying goes.

As for point 6, sure, whatever tricks you need to do to "benchmark" a program that wasn't designed as one by the programmers. Just be aware that those tricks can impact your numbers (so why are you doing it if you want accurate numbers?). Also, since the WU calculation loads vary by work unit, it's kind of dubious as to what you're going to find. What may be good for one WU may not be for another. The WUs that you use for benchmarking aren't fixed, and neither is their programming. Just take a look at the nature of the changes for the COVID Moonshot WUs (13xxx) to see an example (look previously in this forum thread).

I'm looking for the most efficient use of my entire computing resources (overall system performance - CPU + GPU) given the input of electricity to achieve it, with waste heat (and the noise necessary to remove it) the byproduct (which is fine for winter here in the cold north - summer is a different story).
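
The efficiency goal above boils down to points per day per watt. A minimal Python sketch of that comparison follows; the wattages are the CPU figures quoted in this thread, while the PPD values are made-up placeholders used only to show the arithmetic.

```python
def ppd_per_watt(ppd, watts):
    """Efficiency metric: points per day delivered per watt drawn."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return ppd / watts

# Power draws come from the posts above; PPD numbers are hypothetical.
configs = {
    "PBO, 15 threads": {"ppd": 200_000, "watts": 125},
    "manual all-core": {"ppd": 195_000, "watts": 185},
}

for name, c in configs.items():
    print(f"{name}: {ppd_per_watt(c['ppd'], c['watts']):.0f} PPD/W")
```

For a whole-system figure, the same function would be fed the combined CPU and GPU PPD and the power measured at the wall rather than per-component estimates.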

Posted: **Mon Feb 22, 2021 2:18 am**

Hi everyone,

Sorry for the delay. I had to do more testing than planned due to the new core a8 work units messing things up (had to wait for a7 work units to be comparable to previous plots). I also got side-tracked by my main writing gig (sci-fi novels), but I wasn't going to leave you all hanging.

Here's part 4: https://greenfoldingathome.com/2021/02/ ... e-and-smt/

Key takeaways: The auto-overclocking on the Ryzen 9 (CPB) takes a huge chunk out of efficiency for only a modest performance improvement.

Side-Note: A8 work units are pretty great, but I didn't do much with them because I don't have a basis of comparison to the older tests.

Also, it looks like the reason there is a big nose-dive in performance and efficiency in the 17-25 thread region is Windows 10 itself, namely in how it chooses to keep a few physical CPU cores free and loads up logical processors with the work units. I have a few Ryzen Master screenshots in there showing this activity. Eventually, when you throw a hard enough problem at the processor with enough threads, Windows stops being silly and really cranks it up. It's now on my list to someday investigate this in Ubuntu to see if the Linux task scheduler does this or not.
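
The behaviour described above (threads doubled up on some physical cores while others sit idle) can be checked from a snapshot of busy logical CPUs. The Python sketch below assumes logical CPUs `2c` and `2c+1` share physical core `c`, which is a common enumeration but not guaranteed; the snapshot itself and the helper name are hypothetical.

```python
from collections import Counter

def core_occupancy(busy_logical, smt=2):
    """Return (physical cores in use, sorted list of cores running >1 thread).

    Assumes physical core = logical CPU index // smt.
    """
    per_core = Counter(cpu // smt for cpu in busy_logical)
    doubled = sorted(c for c, n in per_core.items() if n > 1)
    return len(per_core), doubled

# Hypothetical snapshot: 18 busy logical CPUs on a 16-core / 32-thread part.
busy = list(range(18))
used, doubled = core_occupancy(busy)
print(f"{used} of 16 physical cores in use, {len(doubled)} running two threads")
```

If the same 18 threads were spread one per core first, all 16 cores would be in use and only two would carry a second thread, which is the placement the post argues would perform better.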

Posted: **Fri Feb 26, 2021 8:35 pm**

Paragon wrote:...it looks like the reason there is a big nose-dive in performance and efficiency in the 17-25 thread region is Windows 10 itself, namely in how it chooses to keep a few physical CPU cores free and loads up logical processors with the work units...

I think this is more how AMD drivers work.
They don't utilize the CPU fully, until there's a >75-80% CPU load.
The same is true for Linux btw.

Posted: **Fri Feb 26, 2021 10:51 pm**

It depends on how the task scheduler allocates resources and what the programmer was thinking the last time he changed that code. For example, does the OS change its behavior when it encounters hardware that's running big.LITTLE ... and how does it handle a pair of HyperThreaded CPUs when both code segments are predominantly integer or predominantly Floating Point (or one of each). [Is your OS smart enough to handle FAHCore_a* differently than FAHCore_7* / _8* ?]

Ryzen 9 3950x Benchmark Machine: What should I test for you?
