debugging sudden low performance on RX5500

Posted: Sat Jul 11, 2020 5:27 am
by astrorob
greetings all,

i had been happily chugging along on an ubuntu 18.04 system with a GTX1060 and also an RX5500 for a while, when i suffered some kind of disk corruption and the machine stopped booting. when i investigated, the kernel was crashing while loading the AMD kernel modules. so, i removed the RX5500, deleted and reinstalled the amd driver (amd-gpu-pro-20.20-1089774-ubuntu-18.04), reinstalled the card and was able to boot again.

FAHClient seemed happy again, but on the RX5500 i'm seeing very low performance - on project 13416 i've got 18m26s per frame, while on another machine with a RTX 2060, i see 3m56s per frame. the RX5500 is only producing about 50k PPD now, whereas before it was several hundred thousand per day. so something has gone wrong, but i can't tell what.

i'm starting to wonder if the hardware has gone bad given the initial circumstances. is there any way on linux for me to debug whether or not the RX5500 is properly configured? meaning, the clock rates are as expected and the # of cores are correct? i know about nvtop and nvidia-smi but i'm not aware of similar utilities for ATI. i guess if push came to shove i could install the card in a windows box and see what happens over there, but that will disrupt a 2nd machine that is also currently folding.

any ideas on this? thanks.
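
For the clock-rate part of the question, one option on Linux is to read the files the amdgpu kernel driver exposes under sysfs. The following is a minimal sketch, not a tested diagnostic tool. It assumes the card shows up as `/sys/class/drm/card0` (with a second GPU in the box it may be `card1`), and that `pp_dpm_sclk`, `pp_dpm_mclk` and `gpu_busy_percent` are present for this card and driver version. The function names are made up for illustration.

```python
import re
from pathlib import Path

# pp_dpm_sclk / pp_dpm_mclk list one power level per line, e.g. "1: 1200Mhz *";
# the trailing "*" marks the level that is currently active.
LEVEL = re.compile(r"^\s*(\d+):\s*(\d+)\s*mhz\s*(\*)?", re.IGNORECASE)


def parse_dpm_levels(text):
    """Return (all_levels_in_mhz, active_level_in_mhz_or_None)."""
    levels, active = [], None
    for line in text.splitlines():
        m = LEVEL.match(line)
        if not m:
            continue
        mhz = int(m.group(2))
        levels.append(mhz)
        if m.group(3):
            active = mhz
    return levels, active


def read_gpu_state(card="card0"):
    """Collect clock levels and load for one card; missing files are skipped."""
    dev = Path("/sys/class/drm") / card / "device"
    state = {}
    for name in ("pp_dpm_sclk", "pp_dpm_mclk"):
        f = dev / name
        if f.exists():
            state[name] = parse_dpm_levels(f.read_text())
    busy = dev / "gpu_busy_percent"
    if busy.exists():
        state["gpu_busy_percent"] = int(busy.read_text().strip())
    return state


if __name__ == "__main__":
    # "card0" is an assumption; adjust to whichever card is the AMD GPU.
    print(read_gpu_state("card0"))
```

If the active shader clock stays at the lowest level while a WU is running, that points at the driver or power-management setup rather than at the project.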

Re: debugging sudden low performance on RX5500

Posted: Sat Jul 11, 2020 8:03 am
by ajm
The 13416 WUs are special, kind of experimental, and very irregular, with large variations from run to run, so much so that the researchers modified their overall points rating. You can thus have 13416's that perform at some 30% of the usual capacity and others that give stupendous results. It can be upsetting but it's worth bearing with the researchers. And your hardware looks fine. Just the usual hiccups with AMD drivers.

Re: debugging sudden low performance on RX5500

Posted: Sat Jul 11, 2020 2:46 pm
by astrorob
ok - thanks. i did wonder about the project itself but since i happened to find the same one running on nvidia "normally" it gave me pause. i guess it didn't help that this GPU kept pulling the same project over and over again as i was debugging.

Re: debugging sudden low performance on RX5500

Posted: Sat Jul 11, 2020 4:01 pm
by astrorob
i should also say that i was motivated to fix this because i can only run overnight (~8 hours) due to high electricity prices during the day. the 13416 project on the AMD GPU has a runtime of almost 48 hours. the project will time out days before i can complete it like this. so although i don't have any problem helping out with these experimental WUs, it turns out that without spending big $$$ i'll never finish one of these. which is kind of counterproductive.

Re: debugging sudden low performance on RX5500

Posted: Sat Jul 11, 2020 10:21 pm
by zookeeny
I've had the same problem with my AMD card (RX 5700 XT) and the 13416 WUs. Approximately 85% of the time they're flagged as FAULTY and returned within seconds, but the other 15% just hang my GPU forever. I don't know what's actually going on with that last 15% - no progress is indicated in the log, and I've waited as long as 24 hours for something to happen before dumping them. Most of the other GPU WUs do fine, so it's something specific about the 13416 family of WUs. Now I just dump them immediately - no sense in needlessly tying up my GPU and delaying FaH's progress.

Re: debugging sudden low performance on RX5500

Posted: Sat Jul 11, 2020 10:39 pm
by astrorob
i was reading another thread with complaints about 13416 and it made me look at the CPU utilization of the CPU thread associated with the RX5500 GPU thread. it's higher than i've ever seen - 70% - which is something new for an AMD/ATI graphics card as far as i have seen.

the machine in question happened to have 6/8 threads running rosetta@home, so i dropped that to 5 threads and the RX5500 performance has improved a lot - from 25m per frame to 8m30s per frame, and the runtime went from ~48h to about 16h. something is still messed up since 13416 is also running on a GTX 1060 and RTX 2060, and the TPF there is 3m39s and 2m12s. so there's something about 13416 that's causing problems with our AMD GPUs, making them CPU-bound instead of GPU-bound.

i haven't actually completed a 13416 WU on the RX5500 because of the problems i outlined above, but it looks like sometime later tonight it will finish. i'll have to see if it ends up being a bad WU or not.

Re: debugging sudden low performance on RX5500

Posted: Sat Jul 11, 2020 10:52 pm
by zookeeny
Interesting - I never checked to see if the CPU utilization went up while the 13416 GPU task was running. But I have 24 CPU threads available, only 15 of which are allocated for Fah CPU tasks, so there should be plenty to also handle the GPU.

Whatever it is, it seems to be specifically a Linux/AMD problem. I don't know what the 13416 WUs are doing differently from the others, but the amdgpu driver sure doesn't like it.

Re: debugging sudden low performance on RX5500

Posted: Sat Jul 11, 2020 11:01 pm
by astrorob
ah - my W10 box has a very old AMD GPU and so i can't draw any meaningful comparisons between windows and linux here. the drivers are definitely a problem and if this turns out to only be a linux problem i'll bet the devs don't care. given the difficulty of getting AMD GPUs running on linux there probably aren't many of us around.

it would be nice if they could blacklist certain types of GPU or Platform+GPU on a WU by WU basis. or maybe they can, idk.

Re: debugging sudden low performance on RX5500

Posted: Sat Jul 11, 2020 11:39 pm
by zookeeny
I agree 100%. I'm considering changing my FAHClient to run only non-Covid tasks to avoid these WUs. Or instead of that, maybe writing a script to auto-dump them if they hang the GPU for more than 5 minutes. It's a shame that's necessary, but like you said, AMD cards running on Linux are a rarity... I doubt they warrant much developer attention.
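
A script like that needs some way to decide that a slot has gone quiet. Here is a rough sketch of just the detection half, assuming FAHClient log lines start with `HH:MM:SS:WUxx:FSxx:` and that "hung" means no log line for the slot in the last 5 minutes. The function names and the 300-second default are illustrative, and what to do once a stall is found (pausing, dumping) is deliberately left out.

```python
import re

# Prefix of a FAHClient log line: time of day, work unit, folding slot.
LOG_LINE = re.compile(r"^(\d\d):(\d\d):(\d\d):WU\d+:FS(\d+):")


def seconds_since_last_activity(lines, slot, now_seconds):
    """Seconds between the slot's most recent log line and now_seconds.

    now_seconds is seconds since midnight on the same clock the log uses.
    Returns None if the slot never appears in the given lines.
    """
    last = None
    for line in lines:
        m = LOG_LINE.match(line)
        if m and int(m.group(4)) == slot:
            last = int(m.group(1)) * 3600 + int(m.group(2)) * 60 + int(m.group(3))
    if last is None:
        return None
    # Log timestamps carry no date, so wrap around midnight.
    return (now_seconds - last) % 86400


def is_stalled(lines, slot, now_seconds, limit=300):
    """True if the slot has logged nothing for more than `limit` seconds."""
    idle = seconds_since_last_activity(lines, slot, now_seconds)
    return idle is not None and idle > limit
```

Because the timestamps wrap at midnight, this only works for gaps shorter than 24 hours, which is plenty for a 5-minute threshold.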

Re: debugging sudden low performance on RX5500

Posted: Mon Jul 13, 2020 6:52 am
by JohnChodera
@astrorob: We are currently working internally by testing out a wide assortment of systems, which have a wide variety of short WUs for different kinds of workloads.

The high CPU load is concerning. Is this constant throughout, or periodic (at checkpoints)? If constant, it is likely that the OpenMM CustomIntegrator we use is somehow eating up more CPU time than it should when sequencing many kernel launches. We may be able to do something about that.

If it's periodic, this is because the CPU vs GPU sanity checks that happen at every checkpoint (now every 5%, rather than 25%) try to use a number of CPU threads equal to the number of cores. They'll automatically load-balance between threads, but if you've got other cores churning away, this could potentially slow things down a lot. I've been meaning to find a way to (1) let you control how many threads to use, and (2) split the CPU sanity checks off onto an asynchronous thread so the GPU can continue chugging along.

The best way to figure out what is going on is to watch the `science.log` that's generated (or point me to one) to see if the time per % between 5-6%, 9-10% is significantly slower, and if this improves if you stop CPU workloads on other threads. If you can help us focus in on the issue, we can quickly get this fixed in an updated core and have you folding happily again.

Thanks so much for the feedback, and for helping us with the COVID Moonshot (http://covid.postera.ai/covid)!

~ John Chodera // MSKCC

Re: debugging sudden low performance on RX5500

Posted: Tue Jul 14, 2020 1:03 pm
by mwroggenbuck
I am currently running the 13416 project. I am seeing continuous high CPU usage (2 to 4 times normal), and relatively low GPU usage (~75%). I also see this when the 22 core is the only significant CPU process on the system. AMD RX 570, windows 10 professional.

Re: debugging sudden low performance on RX5500

Posted: Tue Jul 14, 2020 1:53 pm
by Starman157
Thankfully, it appears that 13416 is special. I thought it was my rig(s) that were misbehaving.

I can help with the gremlin hunt John. I'm retired and have some time to do the hunting.

My folding rigs are a bit different than the original poster's. I'm running an AMD 3950x with a 5700XT. Also a 2600x with a 5700XT. I'm using 29 threads on the 3950x (not the -1). I am NOT using CPU folding on the 2600x because it hurts graphics card performance far too much (overall slower if I run with CPU folding as well).

I've checked the science.log files on both machines and don't see the "time per %" listed there. I see time step in PS and Performance since last checkpoint only.

Further direction?

Re: debugging sudden low performance on RX5500

Posted: Tue Jul 14, 2020 2:46 pm
by HaloJones
Starman157 wrote: I'm using 29 threads on the 3950x (not the -1). Further direction?

my understanding is that FAH hates primes over 5 so 29 is a really difficult number for FAH. I wonder if your logs show it stepping down to 27 threads.

Re: debugging sudden low performance on RX5500

Posted: Tue Jul 14, 2020 2:59 pm
by bruce
The science log doesn't show a time/%. That's in the regular public log. The GPU Science log will report time/time (ns/day). I don't remember if GROMACS reports ns/day for the CPU.
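
To compare the time per % around the checkpoints (5-6%, 9-10%) with the rest of the run, the timestamps in the regular log can be differenced. This is a sketch under one big assumption: that progress lines look like `HH:MM:SS:WUxx:FSxx:0x22:Completed 50000 out of 1000000 steps (5%)`. Only the `HH:MM:SS:WUxx:FSxx:0x..:` prefix actually appears in this thread, so adjust the pattern to whatever your log shows. The function name is made up.

```python
import re

# Assumed shape of a progress line in the public FAHClient log.
PROGRESS = re.compile(
    r"^(\d\d):(\d\d):(\d\d):WU\d+:FS(\d+):0x\w+:"
    r"Completed \d+ out of \d+ steps \((\d+)%\)"
)


def time_per_percent(lines, slot):
    """Return [(from_pct, to_pct, seconds_per_percent), ...] for one slot."""
    points = []
    for line in lines:
        m = PROGRESS.match(line)
        if not m or int(m.group(4)) != slot:
            continue
        t = int(m.group(1)) * 3600 + int(m.group(2)) * 60 + int(m.group(3))
        points.append((t, int(m.group(5))))

    result = []
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        if p1 <= p0:
            continue  # progress went backwards: a new WU started, or a restart
        elapsed = (t1 - t0) % 86400  # no date in the log, so wrap at midnight
        result.append((p0, p1, elapsed / (p1 - p0)))
    return result
```

If the 5-6% and 9-10% intervals stand out against their neighbours, the slowdown is periodic; if every interval is equally slow, it is constant.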

Re: debugging sudden low performance on RX5500

Posted: Tue Jul 14, 2020 3:42 pm
by Starman157
@HaloJones Yes, "15:04:22:WU01:FS02:0xa7:Reducing thread count from 29 to 28 to avoid domain decomposition by a prime number > 3"
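
The rule quoted in that log line can be modelled in a few lines. This is only a toy version of what the message describes (a thread count that is itself a prime greater than 3 gets reduced by one); it is not the core's actual code, which may apply further checks that this thread doesn't show. The function names are illustrative.

```python
def is_prime(n):
    """Trial-division primality test; fine for thread counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def reduce_thread_count(n):
    """Drop by one thread if n is a prime greater than 3, else keep n.

    Assumption: this mirrors only the single rule named in the log message.
    """
    if n > 3 and is_prime(n):
        return n - 1
    return n


print(reduce_thread_count(29))  # 28, matching the log line above
```

One step is always enough for this rule, since any prime above 3 is odd and the number below it is even.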