Continued trouble with MoonShot WUs 13414-13417

It seems that a lot of GPU problems revolve around specific versions of drivers. Though AMD has their own support structure, you can often learn from information reported by others who fold.

Moderators: Site Moderators, FAHC Science Team

ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Continued trouble with MoonShot WUs 13414-13417

Post by ThWuensche »

Most of these WUs fail on all of my Radeon VIIs. I have been in direct contact with John Chodera, but my last message to him seems to have gone unread. Since John may be busy or away, maybe somebody else can help along the path he suggested:
Third, thanks for bringing up these failures. The surprising thing is that the failures are mostly due to large energy discrepancies between the CPU and GPU, which suggests a bug with OpenMM in working with your card or drivers---which is quite surprising. I see you're on linux---would you be able to try out OpenMM (via conda) and run the quick installation test? First, you would install miniconda (https://docs.conda.io/en/latest/miniconda.html) and then OpenMM

    conda install -c conda-forge -c omnia openmm
    python -m simtk.testInstallation
If that checks out, I can give you a tarball (if you send your email) with more specific tests for 13417 to identify exactly what is going wrong.
I followed the proposal above: OpenMM is installed and the test completed without errors.
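
For readers wondering what a "large energy discrepancy between the CPU and GPU" amounts to in practice, here is a minimal sketch of that kind of check. It is not the FAH core's or OpenMM's actual test: the platform names, the sample energies, and the tolerance are all assumptions chosen for illustration.

```python
# Minimal sketch: flag platforms whose potential energy disagrees with a reference.
# Platform names, energies, and the tolerance are illustrative assumptions,
# not values taken from the FAH core or from OpenMM's installation test.

def relative_discrepancy(reference, value):
    """Return |value - reference| scaled by the magnitude of the reference."""
    scale = max(abs(reference), 1e-12)  # guard against division by zero
    return abs(value - reference) / scale


def find_suspect_platforms(energies, reference="Reference", tolerance=1e-4):
    """Return the sorted names of platforms whose energy is NaN or differs
    from the reference platform by more than `tolerance` (relative)."""
    ref = energies[reference]
    suspects = []
    for name, energy in energies.items():
        if name == reference:
            continue
        is_nan = energy != energy  # NaN is the only value not equal to itself
        if is_nan or relative_discrepancy(ref, energy) > tolerance:
            suspects.append(name)
    return sorted(suspects)


if __name__ == "__main__":
    # Hypothetical energies (kJ/mol) for the same system on three platforms.
    sample = {"Reference": -125000.0, "CPU": -125000.4, "OpenCL": -118500.0}
    print(find_suspect_platforms(sample))
```

With the hypothetical numbers above, only the OpenCL entry is flagged, which is the pattern John describes for these WUs.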

I would be happy to test such a tarball, but I need to get it first. Dozens of these WUs have failed on my GPUs lately, while other WUs are processed without trouble. If somebody can help, please let me know to whom I should PM my e-mail address.

Regards, Thomas
ajm
Posts: 754
Joined: Sat Mar 21, 2020 5:22 am
Location: Lucerne, Switzerland

Re: Continued trouble with MoonShot WUs 13414-13417

Post by ajm »

Just a few facts I gathered over the last few days. On my systems, only my AMD card (drivers 20.5.1) has had such failures with those WUs; I have seen none on Nvidia GPUs.
Someone else was (is) having recurring trouble on Windows with an AMD motherboard and a Radeon VII, which could be solved only by removing the GPU drivers (20.5.1) with DDU and reinstalling them.
Yet someone else then claimed to be able to fold without issues using a similar setup (AMD motherboard and Radeon VII) BUT with an earlier version of the AMD drivers.
psaam0001
Posts: 383
Joined: Mon May 18, 2020 2:02 am
Location: Ruckersville, Virginia, USA

Re: Continued trouble with MoonShot WUs 13414-13417

Post by psaam0001 »

I have made some adjustments to my AMD Video Driver Settings, and they will be tested when I get another Moonshot WU.

Should the performance be much better, I have screen shots that may help those of you who are having issues. I will post them when I can figure out how to get them on here from my local system, or where to save them so I can give a link.

Paul
aetch
Posts: 447
Joined: Thu Jun 25, 2020 3:04 pm
Location: Between chair and keyboard

Re: Continued trouble with MoonShot WUs 13414-13417

Post by aetch »

I wouldn't get too excited if you see an improvement.
I've had a run of units for Project 13416; some have run fine, others have really struggled. This is without changing anything about my config.
Folding Rigs - None (25-Jun-2022)

NormalDiffusion
Posts: 124
Joined: Sat Apr 18, 2020 1:50 pm

Re: Continued trouble with MoonShot WUs 13414-13417

Post by NormalDiffusion »

I had the same problem with those WUs under Windows 10. It was my undervolting: it had been running fine for weeks and months, but not with the new WUs. I raised the voltage a little, and everything is fine now on my RVII and my 290X (my other RVII didn't have problems with the WUs).
And for information, I'm running old drivers: 19.6 from June for the RVII/290X combo, and 19.12.1 for the single RVII (I will have to move back to 19.6, as it's a little bit faster (2%) on 13416).
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: Continued trouble with MoonShot WUs 13414-13417

Post by ThWuensche »

Thanks for the feedback. I see that for all of you the WUs are processed. I'm running on stock parameters; the only thing that has been tweaked is an increased fan speed to keep the GPUs cooler. One major difference is that, as it looks, all of you are running on Windows, while my machines run Linux with the ROCm package. So it is probably not a fault of OpenMM with the Radeon VII as such, but more likely a problem in the Linux driver/OpenCL implementation.

Anyhow, if somebody from the Chodera lab could provide the tarball mentioned by John, and the error could be found on that basis and maybe fixed (by the AMD ROCm guys), that would be great.
NormalDiffusion
Posts: 124
Joined: Sat Apr 18, 2020 1:50 pm

Re: Continued trouble with MoonShot WUs 13414-13417

Post by NormalDiffusion »

You could try to undervolt the cards a little bit. Do you know what voltage they are running at now?
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: Continued trouble with MoonShot WUs 13414-13417

Post by ThWuensche »

On the two GPUs I have access to right now, the voltage floats mostly in the range 1.08 - 1.12 V, but at times I also see 0.74 V. The 0.74 V seems to be selected when the GPU is idle.

Basically I wouldn't like to interfere with the standard settings; the manufacturer should know how to set them so that the GPU works. Tweaking may result in things like better energy efficiency, but the default settings should provide stable operation. Also, 4 of these GPUs show the same effect, so it does not look like hardware instability to me.
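
For anyone else on Linux who wants to take the same kind of reading, here is a minimal sketch that polls the voltage from sysfs. It assumes the amdgpu driver exposes the graphics voltage in millivolts as `in0_input` under each card's hwmon directory; the exact path layout may differ between kernels, and the function name `read_gpu_voltages` is made up for this example.

```python
# Minimal sketch: read the current GPU voltage for each card from sysfs.
# Assumption: the amdgpu hwmon interface provides `in0_input` in millivolts
# under /sys/class/drm/card*/device/hwmon/hwmon*/ (layout may vary by kernel).
from pathlib import Path


def read_gpu_voltages(drm_root="/sys/class/drm"):
    """Return {card_name: volts} for every card that exposes in0_input."""
    root = Path(drm_root)
    voltages = {}
    for path in sorted(root.glob("card*/device/hwmon/hwmon*/in0_input")):
        card = path.relative_to(root).parts[0]  # e.g. "card0"
        try:
            millivolts = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or malformed entries
        voltages[card] = millivolts / 1000.0
    return voltages


if __name__ == "__main__":
    for card, volts in read_gpu_voltages().items():
        print(f"{card}: {volts:.3f} V")
```

This only reads values and does not change any setting, so it leaves the stock configuration untouched.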
NormalDiffusion
Posts: 124
Joined: Sat Apr 18, 2020 1:50 pm

Re: Continued trouble with MoonShot WUs 13414-13417

Post by NormalDiffusion »

Yes, the voltage changes with the clock speed. I'm only speaking of the max value for the highest speed. Each RVII comes with a different value set by the manufacturer, and this value is too high. All my RVIIs are undervolted (2 of 3 are running FAH 24/7). All of them were unstable *in FAH* with the default max voltage. With the factory voltage, the card generates too much heat. That's not a big deal in games (I never had problems in hours of playing), but FAH gives NaN errors on a regular basis, even if temps are below the max of what the card can cope with. Undervolting solved it.
On my cards I have factory values ranging from 98x mV up to 13xx mV.
The RVII can be a wonderful card, but is at the same time a diva...
psaam0001
Posts: 383
Joined: Mon May 18, 2020 2:02 am
Location: Ruckersville, Virginia, USA

Re: Continued trouble with MoonShot WUs 13414-13417

Post by psaam0001 »

Updated Post:

Here are the links to my screen shots, for those of you who are using the current Windows version of the latest AMD Radeon Lite Control Panel. If you know how to access the GUI app for making these changes in Mac OS X or a current Linux distribution, give them a try as well (hopefully they will also help with other, non-Moonshot WUs that we know there are benchmarks for).

Disclaimer: I do not have access to a Mac, or a Linux PC with currently supported GPU hardware on it.

Display1: https://drive.google.com/file/d/1v4hBqU ... sp=sharing

Display2: https://drive.google.com/file/d/1LnVOxs ... sp=sharing

Display3: https://drive.google.com/file/d/15ypzmU ... sp=sharing

See if these settings help. To change the settings to what I show in Display1, you will need to click on Global Settings first.

Paul
Last edited by psaam0001 on Thu Jul 16, 2020 7:30 am, edited 2 times in total.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Continued trouble with MoonShot WUs 13414-13417

Post by bruce »

Multiple issues covered in a single topic.

* As was mentioned above, I've seen reports that older drivers work better than the latest. I don't have any of those devices so I can't really make any useful comments.
* As a general rule, the Covid Moonshot WUs are small and not an ideal fit for many GPUs. It's impossible to benchmark them accurately.
* Also, a variety of different proteins have been grouped under a single project number, making it even harder to benchmark them all to a single standard.
psaam0001
Posts: 383
Joined: Mon May 18, 2020 2:02 am
Location: Ruckersville, Virginia, USA

Re: Continued trouble with MoonShot WUs 13414-13417

Post by psaam0001 »

Ok... Moonshot jobs are running better on my Ryzen 3's integrated GPU. But I did go into the Windows 10 control panel and turn off all of those power saving settings first. :D

I'll pay the local electric co-op for a few more watts of power, if it helps me give COVID-19 that OSHA compliant steel toed shoe to the backside.

Paul