Continued trouble with MoonShot WUs 13414-13417

It seems that a lot of GPU problems revolve around specific versions of drivers. Though AMD has their own support structure, you can often learn from information reported by others who fold.

Moderators: Site Moderators, FAHC Science Team

Post by ThWuensche » Thu Jul 09, 2020 8:32 am

Most of these WUs fail on all of my Radeon VII cards. I have been in direct contact with John Chodera, but my last message to him seems to remain unread. As John may be busy or away, maybe somebody else can help along the path he suggested:

Third, thanks for bringing up these failures. The surprising thing is that the failures are mostly due to large energy discrepancies between the CPU and GPU, which suggests a bug with OpenMM in working with your card or drivers---which is quite surprising. I see you're on linux---would you be able to try out OpenMM (via conda) and run the quick installation test? First, you would install miniconda (https://docs.conda.io/en/latest/miniconda.html) and then OpenMM

Code:
    conda install -c conda-forge -c omnia openmm
    python -m simtk.testInstallation

If that checks out, I can give you a tarball (if you send your email) with more specific tests for 13417 to identify exactly what is going wrong.


I followed the proposal above: OpenMM is installed and the installation test completed without errors.
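For readers wondering what a "large energy discrepancy between the CPU and GPU" amounts to in practice, here is a minimal sketch in plain Python. It is not the check used by the FAH core or by OpenMM itself; the function names, the tolerance and the sample energies are all made up for illustration.

```python
import math


def relative_discrepancy(reference: float, candidate: float) -> float:
    """Relative difference of candidate with respect to reference.

    Returns infinity when the candidate is NaN or infinite.
    """
    if not math.isfinite(candidate):
        return math.inf
    scale = max(abs(reference), 1e-12)  # guard against division by zero
    return abs(candidate - reference) / scale


def check_platforms(energies, reference="CPU", tolerance=1e-4):
    """Return {platform: (discrepancy, ok)} for every non-reference platform.

    The tolerance is an arbitrary illustrative value, not an official one.
    """
    ref = energies[reference]
    report = {}
    for name, value in energies.items():
        if name == reference:
            continue
        diff = relative_discrepancy(ref, value)
        report[name] = (diff, diff <= tolerance)
    return report


if __name__ == "__main__":
    # Hypothetical potential energies in kJ/mol, not taken from any real WU.
    sample = {"CPU": -250000.0, "OpenCL": -249100.0}
    for platform, (diff, ok) in check_platforms(sample).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{platform}: relative discrepancy {diff:.2e} -> {status}")
```

The idea is simply that the same system evaluated on two platforms should give nearly the same energy, so a difference well above rounding noise points at the GPU code path (card, driver or OpenCL stack) rather than at the WU itself.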

I would be happy to test such a tarball, but I need to get it first. Dozens of these WUs have failed on my GPUs lately, while other WUs are processed without trouble. If somebody can help, please let me know to whom I should PM my e-mail address.

Regards, Thomas
ThWuensche
 
Posts: 34
Joined: Fri May 29, 2020 5:10 pm

Re: Continued trouble with MoonShot WUs 13414-13417

Post by ajm » Thu Jul 09, 2020 9:26 am

Just a few facts I gathered over the last few days. On my systems, only my AMD card (drivers 20.5.1) has had such failures with those WUs; I have seen none on Nvidia GPUs.
Someone else was (is) having recurring trouble on Windows with an AMD motherboard and a Radeon VII, which could be solved only by removing the GPU drivers (20.5.1) with DDU and reinstalling them.
Yet another person then claimed to be able to fold without issues using a similar setup (AMD motherboard and Radeon VII) BUT with an earlier version of the AMD drivers.
ajm
 
Posts: 495
Joined: Sat Mar 21, 2020 6:22 am
Location: Lucerne, Switzerland

Re: Continued trouble with MoonShot WUs 13414-13417

Post by psaam0001 » Thu Jul 09, 2020 10:20 am

I have made some adjustments to my AMD video driver settings, and they will be tested when I get another Moonshot WU.

Should the performance be much better, I have screenshots that may help those of you who are having issues. I will post them when I can figure out how to get them on here from my local system, or where to save them so I can give a link.

Paul
psaam0001
 
Posts: 37
Joined: Mon May 18, 2020 3:02 am

Re: Continued trouble with MoonShot WUs 13414-13417

Post by aetch » Thu Jul 09, 2020 7:36 pm

I wouldn't get too excited if you see an improvement.
I've had a run of units for Project 13416; some have run fine, others have really struggled. This is without changing anything about my config.
aetch
 
Posts: 32
Joined: Thu Jun 25, 2020 4:04 pm

Re: Continued trouble with MoonShot WUs 13414-13417

Post by NormalDiffusion » Thu Jul 09, 2020 8:08 pm

I had the same problem with those WUs under Windows 10. It was my undervolting: it had been running fine for weeks and months, but not with the new WUs. I raised the voltage a little bit, and everything is fine now on my Radeon VII and my 290X (my other Radeon VII didn't have problems with the WUs).
For information, I am running old drivers: 19.6 from June for the Radeon VII/290X combo, and 19.12.1 for the single Radeon VII (I will have to move back to 19.6 as it's a little bit faster (2%) on 13416).
NormalDiffusion
 
Posts: 62
Joined: Sat Apr 18, 2020 2:50 pm

Re: Continued trouble with MoonShot WUs 13414-13417

Post by ThWuensche » Thu Jul 09, 2020 9:13 pm

Thanks for the feedback. I see that the WUs are processed for all of you. I'm running on stock parameters; the only thing that has been tweaked is an increased fan speed to keep the GPUs cooler. One major difference is that, as it looks, all of you are running on Windows, while my machines run Linux with the ROCm package. So it is probably not a fault of OpenMM with the Radeon VII as such, but more likely a problem in the Linux driver/OpenCL implementation.

Anyhow, if somebody from the Chodera lab could provide the tarball mentioned by John, and the error could be found on that basis and maybe fixed (by the AMD ROCm guys), that would be great.
ThWuensche
 
Posts: 34
Joined: Fri May 29, 2020 5:10 pm

Re: Continued trouble with MoonShot WUs 13414-13417

Post by NormalDiffusion » Thu Jul 09, 2020 9:22 pm

You could try to undervolt the cards a little bit. Do you know what voltage they are running at now?
NormalDiffusion
 
Posts: 62
Joined: Sat Apr 18, 2020 2:50 pm

Re: Continued trouble with MoonShot WUs 13414-13417

Post by ThWuensche » Thu Jul 09, 2020 9:57 pm

On the two GPUs I have access to right now, the voltage fluctuates mostly in the range 1.08 to 1.12 V, but at times I also see 0.74 V. The 0.74 V seems to be selected when the GPU is idle.

Basically I wouldn't like to interfere with the standard settings; the manufacturer should know how to set them so that the GPU works. Tweaking may result in things like better energy efficiency, but the default settings should provide stable operation. Also, 4 of these GPUs show the same effect, so it does not look like hardware instability to me.
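If you want to log the voltage on Linux rather than watch a monitoring tool, a sketch like the following might do. It assumes the amdgpu kernel driver exposes the GPU core voltage in millivolts as `in0_input` under each card's hwmon directory in sysfs, which is my understanding of the driver documentation; the function name and the `sysfs_root` parameter are only for illustration, and on cards or kernels without that file the result is simply empty.

```python
import glob
import os


def read_gpu_voltages(sysfs_root="/sys/class/drm"):
    """Return {hwmon directory: core voltage in volts} for every card found.

    Assumes amdgpu's hwmon interface, where in0_input holds the GPU core
    voltage in millivolts. Unreadable or malformed files are skipped.
    """
    pattern = os.path.join(
        sysfs_root, "card*", "device", "hwmon", "hwmon*", "in0_input"
    )
    voltages = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                millivolts = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # skip files we cannot read or parse
        voltages[os.path.dirname(path)] = millivolts / 1000.0
    return voltages


if __name__ == "__main__":
    for hwmon_dir, volts in read_gpu_voltages().items():
        print(f"{hwmon_dir}: {volts:.3f} V")
```

Sampling this every few seconds while a WU is running would show whether the card really sits at its maximum voltage under load or only dips to the idle value between WUs.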
ThWuensche
 
Posts: 34
Joined: Fri May 29, 2020 5:10 pm

Re: Continued trouble with MoonShot WUs 13414-13417

Post by NormalDiffusion » Thu Jul 09, 2020 10:11 pm

Yes, the voltage changes with the clock speed. I'm only speaking of the max value for the highest speed. Each Radeon VII comes with a different value set by the manufacturer. This value is too high. All my Radeon VIIs are undervolted (2 of 3 are running FAH 24/7). All of them were unstable *in FAH* with the default max voltage. With the factory voltage, the card generates too much heat. That is not a big deal in games (I never had problems in hours of playing), but FAH was giving NaN errors on a regular basis, even when temperatures were below the maximum the card can cope with. Undervolting solved it.
On my cards I have factory values ranging from 98x mV up to 13xx mV.
The Radeon VII can be a wonderful card, but is at the same time a diva...
NormalDiffusion
 
Posts: 62
Joined: Sat Apr 18, 2020 2:50 pm

Re: Continued trouble with MoonShot WUs 13414-13417

Post by psaam0001 » Thu Jul 09, 2020 10:16 pm

Updated Post:

Here are the links to my screenshots, for those of you who are using the current Windows version of the latest AMD Radeon Lite Control Panel. If you know how to access the GUI app for making these changes in Mac OS X or a current Linux distribution, give them a try (hopefully they will also help with other non-Moonshot WUs for which we know there are benchmarks).

Disclaimer: I do not have access to a Mac, or a Linux PC with currently supported GPU hardware on it.

Display1: https://drive.google.com/file/d/1v4hBqUVbDeOay1mpQkScUR-xKozwydek/view?usp=sharing

Display2: https://drive.google.com/file/d/1LnVOxsTA_j5jhASeVIgiuTTg6vq3AIIO/view?usp=sharing

Display3: https://drive.google.com/file/d/15ypzmUD-nV5bZHSvQZVinok41n1gbMxL/view?usp=sharing

See if these settings help. To change the settings to what I show in Display1, you will need to click on Global Settings first.

Paul
Last edited by psaam0001 on Thu Jul 16, 2020 8:30 am, edited 2 times in total.
psaam0001
 
Posts: 37
Joined: Mon May 18, 2020 3:02 am

Re: Continued trouble with MoonShot WUs 13414-13417

Post by bruce » Fri Jul 10, 2020 12:02 am

Multiple issues covered in a single topic.

* As was mentioned above, I've seen reports that older drivers work better than the latest. I don't have any of those devices so I can't really make any useful comments.
* As a general rule, the Covid Moonshot WUs are small and not an ideal fit for many GPUs. It's impossible to benchmark them accurately.
* Also, a variety of different proteins have been grouped under a single project number, making it even harder to benchmark them all to a single standard.
bruce
 
Posts: 19653
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: Continued trouble with MoonShot WUs 13414-13417

Post by psaam0001 » Thu Jul 16, 2020 8:27 am

Ok... Moonshot jobs are running better on my Ryzen 3's integrated GPU. But I did go into the Windows 10 control panel and turn off all of those power saving settings first. :D

I'll pay the local electric co-op for a few more watts of power, if it helps me give COVID-19 that OSHA compliant steel toed shoe to the backside.

Paul
psaam0001
 
Posts: 37
Joined: Mon May 18, 2020 3:02 am

