debugging sudden low performance on RX5500

It seems that a lot of GPU problems revolve around specific versions of drivers. Though AMD has their own support structure, you can often learn from information reported by others who fold.

Moderators: Site Moderators, FAHC Science Team

astrorob
Posts: 43
Joined: Sun Mar 15, 2020 7:59 pm

debugging sudden low performance on RX5500

Post by astrorob »

greetings all,

i had been happily chugging along on an ubuntu 18.04 system with a GTX1060 and also an RX5500 for a while, when i suffered some kind of disk corruption and the machine stopped booting. when i investigated, the kernel was crashing while loading the AMD kernel modules. so, i removed the RX5500, deleted and reinstalled the amd driver (amd-gpu-pro-20.20-1089774-ubuntu-18.04), reinstalled the card and was able to boot again.

FAHClient seemed happy again, but on the RX5500 i'm seeing very low performance - on project 13416 i've got 18m26s per frame, while on another machine with a RTX 2060, i see 3m56s per frame. the RX5500 is only producing about 50k PPD now, whereas before it was several hundred thousand per day. so something has gone wrong, but i can't tell what.

i'm starting to wonder if the hardware has gone bad given the initial circumstances. is there any way on linux for me to debug whether or not the RX5500 is properly configured? meaning, the clock rates are as expected and the # of cores are correct? i know about nvtop and nvidia-smi but i'm not aware of similar utilities for AMD/ATI. i guess if push came to shove i could install the card in a windows box and see what happens over there, but that will disrupt a 2nd machine that is also currently folding.

any ideas on this? thanks.
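
for reference, the amdgpu kernel driver exposes the GPU core clock states through sysfs, in a file called `pp_dpm_sclk` with one line per level and a `*` on the active one. a minimal sketch for reading it, assuming the card shows up as `card0` (the index differs between machines, and the sample text in the usage note is illustrative, not taken from this RX5500):

```python
import re
from pathlib import Path

# Assumed location; the card index (card0, card1, ...) varies per machine.
SCLK_PATH = Path("/sys/class/drm/card0/device/pp_dpm_sclk")

def parse_dpm_levels(text):
    """Return (levels, active) from pp_dpm_sclk-style text.

    levels is a list of clock speeds in MHz; active is the MHz value of the
    line marked with '*', or None if no line is marked.
    """
    levels, active = [], None
    for line in text.splitlines():
        m = re.match(r"\s*\d+:\s*(\d+)\s*mhz\s*(\*)?", line, re.IGNORECASE)
        if not m:
            continue  # skip anything that is not a "N: 1234Mhz" line
        mhz = int(m.group(1))
        levels.append(mhz)
        if m.group(2):
            active = mhz
    return levels, active

if __name__ == "__main__":
    if SCLK_PATH.exists():
        levels, active = parse_dpm_levels(SCLK_PATH.read_text())
        print("available MHz:", levels, "active MHz:", active)
    else:
        print("no amdgpu sysfs clock file at", SCLK_PATH)
```

if the active level sits at the lowest entry while a WU is running, that would point at a clocking or power-state problem rather than at the project itself.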
ajm
Posts: 754
Joined: Sat Mar 21, 2020 5:22 am
Location: Lucerne, Switzerland

Re: debugging sudden low performance on RX5500

Post by ajm »

The 13416 WUs are special, kind of experimental, and very irregular, with large variations from run to run, so much so that the researchers modified their overall points rating. You can thus have 13416's that perform at some 30% of the usual capacity and others that give stupendous results. It can be upsetting but it's worth bearing with the researchers. And your hardware looks fine. Just the usual hiccups with AMD drivers.
astrorob
Posts: 43
Joined: Sun Mar 15, 2020 7:59 pm

Re: debugging sudden low performance on RX5500

Post by astrorob »

ok - thanks. i did wonder about the project itself but since i happened to find the same one running on nvidia "normally" it gave me pause. i guess it didn't help that this GPU kept pulling the same project over and over again as i was debugging.
astrorob
Posts: 43
Joined: Sun Mar 15, 2020 7:59 pm

Re: debugging sudden low performance on RX5500

Post by astrorob »

i should also say that i was motivated to fix this because i can only run overnight (~8 hours) due to high electricity prices during the day. the 13416 project on the AMD GPU has a runtime of almost 48 hours. the project will time out days before i can complete it like this. so although i don't have any problem helping out with these experimental WUs, it turns out that without spending big $$$ i'll never finish one of these. which is kind of counterproductive.
zookeeny
Posts: 23
Joined: Fri Apr 03, 2020 1:32 pm

Re: debugging sudden low performance on RX5500

Post by zookeeny »

I've had the same problem with my AMD card (RX 5700 XT) and the 13416 WUs. Approximately 85% of the time they're flagged as FAULTY and returned within seconds, but the other 15% just hang my GPU forever. I don't know what's actually going on with that last 15% - no progress is indicated in the log, and I've waited as long as 24 hours for something to happen before dumping them. Most of the other GPU WUs do fine, so it's something specific about the 13416 family of WUs. Now I just dump them immediately - no sense in needlessly tying up my GPU and delaying FaH's progress.
astrorob
Posts: 43
Joined: Sun Mar 15, 2020 7:59 pm

Re: debugging sudden low performance on RX5500

Post by astrorob »

i was reading another thread with complaints about 13416 and it made me look at the CPU utilization of the CPU thread associated with the RX5500 GPU thread. it's higher than i've ever seen - 70% - which is something new for an AMD/ATI graphics card as far as i have seen.

the machine in question happened to have 6/8 threads running rosetta@home, so i dropped that to 5 threads and the RX5500 performance has improved a lot - from 25m per frame to 8m30s per frame, and the runtime went from ~48h to about 16h. something is still messed up since 13416 is also running on a GTX 1060 and RTX 2060, and the TPF there is 3m39s and 2m12s. so there's something about 13416 that's causing problems with our AMD GPUs, making them CPU-bound instead of GPU-bound.

i haven't actually completed a 13416 WU on the RX5500 because of the problems i outlined above, but it looks like sometime later tonight it will finish. i'll have to see if it ends up being a bad WU or not.
zookeeny
Posts: 23
Joined: Fri Apr 03, 2020 1:32 pm

Re: debugging sudden low performance on RX5500

Post by zookeeny »

Interesting - I never checked to see if the CPU utilization went up while the 13416 GPU task was running. But I have 24 CPU threads available, only 15 of which are allocated for FAH CPU tasks, so there should be plenty to also handle the GPU.

Whatever it is, it seems to be specifically a Linux/AMD problem. I don't know what the 13416 WUs are doing differently from the others, but the amdgpu driver sure doesn't like it.
astrorob
Posts: 43
Joined: Sun Mar 15, 2020 7:59 pm

Re: debugging sudden low performance on RX5500

Post by astrorob »

ah - my W10 box has a very old AMD GPU and so i can't draw any meaningful comparisons between windows and linux here. the drivers are definitely a problem and if this turns out to only be a linux problem i'll bet the devs don't care. given the difficulty of getting AMD GPUs running on linux there probably aren't many of us around.

it would be nice if they could blacklist certain types of GPU or Platform+GPU on a WU by WU basis. or maybe they can, idk.
zookeeny
Posts: 23
Joined: Fri Apr 03, 2020 1:32 pm

Re: debugging sudden low performance on RX5500

Post by zookeeny »

I agree 100%. I'm considering changing my FAHClient to run only non-Covid tasks to avoid these WUs. Or instead of that, maybe writing a script to auto-dump them if they hang the GPU for more than 5 minutes. It's a shame that's necessary, but like you said, AMD cards running on Linux are a rarity... I doubt they warrant much developer attention.
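
A rough sketch of the detection half of such a script, assuming the client log prefixes each line with `HH:MM:SS:WUxx:FSxx:` as in the log lines people usually quote from it. The function names are made up for illustration, and the step of actually dumping the WU is left out on purpose:

```python
import re

# Assumed line prefix: "15:04:22:WU01:FS02:..." (time, work unit, folding slot).
SLOT_LINE = re.compile(r"^(\d\d):(\d\d):(\d\d):WU\d+:FS(\d+):")

def seconds_since_last_line(log_lines, slot, now_hms):
    """Seconds between the slot's most recent log line and now_hms.

    now_hms is an (hour, minute, second) tuple. The timestamps carry no date,
    so the difference is taken modulo 24 hours. Returns None if the slot
    never appears in log_lines.
    """
    last = None
    for line in log_lines:
        m = SLOT_LINE.match(line)
        if m and int(m.group(4)) == slot:
            last = int(m.group(1)) * 3600 + int(m.group(2)) * 60 + int(m.group(3))
    if last is None:
        return None
    now = now_hms[0] * 3600 + now_hms[1] * 60 + now_hms[2]
    return (now - last) % 86400

def looks_hung(log_lines, slot, now_hms, limit_s=300):
    """True if the slot has logged nothing for more than limit_s seconds."""
    idle = seconds_since_last_line(log_lines, slot, now_hms)
    return idle is not None and idle > limit_s
```

One caveat: a healthy WU only logs once per frame, so with the 8m30s to 25m frame times reported earlier in this thread a 5 minute limit would also flag WUs that are merely slow. The limit needs to sit above the worst frame time you are willing to tolerate.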
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: debugging sudden low performance on RX5500

Post by JohnChodera »

@astrorob: We are currently working on this internally by testing a wide assortment of systems with a wide variety of short WUs for different kinds of workloads.

The high CPU load is concerning. Is this constant throughout, or periodic (at checkpoints)? If constant, it is likely that the OpenMM CustomIntegrator we use is somehow eating up more CPU time than it should when sequencing many kernel launches. We may be able to do something about that.

If it's periodic, this is because the CPU vs GPU sanity checks that happen at every checkpoint (now every 5%, rather than 25%) try to use a number of CPU threads equal to the number of cores. They'll automatically load-balance between threads, but if you've got other cores churning away, this could potentially slow things down a lot. I've been meaning to find a way to (1) let you control how many threads to use, and (2) split the CPU sanity checks off onto an asynchronous thread so the GPU can continue chugging along.

The best way to figure out what is going on is to watch the `science.log` that's generated (or point me to one) to see if the time per % between 5-6%, 9-10% is significantly slower, and if this improves if you stop CPU workloads on other threads. If you can help us focus in on the issue, we can quickly get this fixed in an updated core and have you folding happily again.

Thanks so much for the feedback, and for helping us with the COVID Moonshot (http://covid.postera.ai/covid)!

~ John Chodera // MSKCC
mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Re: debugging sudden low performance on RX5500

Post by mwroggenbuck »

I am currently running the 13416 project. I am seeing continuous high CPU usage (2 to 4 times normal), and relatively low GPU usage (~75%). I also see this when the 22 core is the only significant CPU process on the system. AMD RX 570, Windows 10 Professional.
Starman157
Posts: 30
Joined: Tue Jul 14, 2020 12:55 pm
Hardware configuration: 3950x/5700XT, 2600x/5700XT, 2500/1070ti, 1090T/7950, 3570K/NA

Re: debugging sudden low performance on RX5500

Post by Starman157 »

Thankfully, it appears that 13416 is special. I thought it was my rig(s) that were misbehaving.

I can help with the gremlin hunt John. I'm retired and have some time to do the hunting.

My folding rigs are a bit different than the original poster's. I'm running an AMD 3950x with a 5700XT. Also a 2600x with a 5700XT. I'm using 29 threads on the 3950x (not the -1). I am NOT using CPU folding on the 2600x because it hurts graphics card performance far too much (overall slower if I run with CPU folding as well).

I've checked the science.log files on both machines and don't see the "time per %" listed there. I see time step in PS and Performance since last checkpoint only.

Further direction?
HaloJones
Posts: 920
Joined: Thu Jul 24, 2008 10:16 am

Re: debugging sudden low performance on RX5500

Post by HaloJones »

Starman157 wrote: I'm using 29 threads on the 3950x (not the -1).

Further direction?
my understanding is that FAH hates primes over 5 so 29 is a really difficult number for FAH. I wonder if your logs show it stepping down to 27 threads.
single 1070

bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: debugging sudden low performance on RX5500

Post by bruce »

The science log doesn't show a time/%. That's in the regular public log. The GPU Science log will report time/time (ns/day). I don't remember if GROMACS reports ns/day for the CPU.
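
The per-percent comparison John asked for can be pulled out of the regular log with a few lines of Python. This is only a sketch: it assumes progress lines of the form `HH:MM:SS:WUxx:FSxx:...:Completed N out of M steps (P%)`, and the function names and the 1.5x threshold are arbitrary choices for illustration.

```python
import re
from statistics import median

# Assumed progress line: "10:08:00:WU01:FS01:0x22:Completed 50000 out of 1000000 steps (5%)"
STEP = re.compile(
    r"^(\d\d):(\d\d):(\d\d):WU\d+:FS(\d+):.*Completed \d+ out of \d+ steps \((\d+)%\)"
)

def seconds_per_percent(log_lines, slot):
    """Map each percent P to the seconds elapsed between (P-1)% and P%."""
    out, prev = {}, None
    for line in log_lines:
        m = STEP.match(line)
        if not m or int(m.group(4)) != slot:
            continue
        t = int(m.group(1)) * 3600 + int(m.group(2)) * 60 + int(m.group(3))
        pct = int(m.group(5))
        # Only count consecutive percents; modulo handles a midnight rollover.
        if prev is not None and pct == prev[1] + 1:
            out[pct] = (t - prev[0]) % 86400
        prev = (t, pct)
    return out

def slow_steps(per_pct, factor=1.5):
    """Percent steps that took more than factor times the median step."""
    if not per_pct:
        return []
    typical = median(per_pct.values())
    return sorted(p for p, s in per_pct.items() if s > factor * typical)
```

If the slow steps line up with the checkpoint percentages, that points at the periodic CPU sanity checks. If every step is uniformly slow, it looks more like the constant-load case.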
Starman157
Posts: 30
Joined: Tue Jul 14, 2020 12:55 pm
Hardware configuration: 3950x/5700XT, 2600x/5700XT, 2500/1070ti, 1090T/7950, 3570K/NA

Re: debugging sudden low performance on RX5500

Post by Starman157 »

@HaloJones Yes, "15:04:22:WU01:FS02:0xa7:Reducing thread count from 29 to 28 to avoid domain decomposition by a prime number > 3"
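
Read literally, that log line suggests the client steps down from a thread count that is a prime greater than 3. A toy sketch of that reading follows. It is not the client's actual code, which may well apply further rules about large prime factors that this one log line does not show:

```python
def is_prime(n):
    """Trial-division primality test; plenty for thread counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def reduce_thread_count(n):
    """Step down until n is no longer a prime greater than 3."""
    while n > 3 and is_prime(n):
        n -= 1
    return n

print(reduce_thread_count(29))  # 28, matching the log line above
```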