Continued trouble with MoonShot WUs 13414-13417

Posted: Thu Jul 09, 2020 7:32 am
by ThWuensche
Most of these WUs fail on all of my Radeon VII cards. I have been in direct contact with John Chodera, but my last message to him seems to have gone unread. John may be busy or away, so maybe somebody else can help along the path he suggested:
Third, thanks for bringing up these failures. The surprising thing is that the failures are mostly due to large energy discrepancies between the CPU and GPU, which suggests a bug with OpenMM in working with your card or drivers---which is quite surprising. I see you're on linux---would you be able to try out OpenMM (via conda) and run the quick installation test? First, you would install miniconda (https://docs.conda.io/en/latest/miniconda.html) and then OpenMM

    conda install -c conda-forge -c omnia openmm
    python -m simtk.testInstallation
If that checks out, I can give you a tarball (if you send your email) with more specific tests for 13417 to identify exactly what is going wrong.
I followed the proposal above: OpenMM is installed and the test completed without errors.
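To make "large energy discrepancies between the CPU and GPU" a bit more concrete, here is a minimal sketch of such a comparison. It is not the Chodera lab's test and not what `simtk.testInstallation` does internally; the helper names, the example energies, and the 1e-4 relative tolerance are all assumptions chosen for illustration. Only `available_platforms()` touches OpenMM, and it returns an empty list if OpenMM cannot be imported.

```python
import math


def relative_energy_gap(reference, other):
    """Relative difference between two potential energies given in the same units."""
    # Guard against division by a near-zero energy (the floor of 1.0 is an assumption).
    scale = max(abs(reference), abs(other), 1.0)
    return abs(reference - other) / scale


def flag_discrepancies(energies, reference="Reference", tolerance=1e-4):
    """Return the platforms whose energy is non-finite or too far from the reference.

    `energies` maps a platform name to a potential energy (hypothetical input,
    e.g. collected by evaluating the same system once per platform).
    """
    ref = energies[reference]
    bad = []
    for name, value in energies.items():
        if name == reference:
            continue
        if not math.isfinite(value) or relative_energy_gap(ref, value) > tolerance:
            bad.append(name)
    return sorted(bad)


def available_platforms():
    """List the OpenMM platform names on this machine, or [] if OpenMM is missing."""
    try:
        import openmm as mm  # newer package name
    except ImportError:
        try:
            from simtk import openmm as mm  # package name used in 2020-era installs
        except ImportError:
            return []
    return [mm.Platform.getPlatform(i).getName()
            for i in range(mm.Platform.getNumPlatforms())]


if __name__ == "__main__":
    print("OpenMM platforms:", available_platforms())
    # Made-up numbers, only to show the shape of the check.
    sample = {"Reference": -1000.0, "CPU": -1000.01, "OpenCL": -850.0}
    print("Suspicious platforms:", flag_discrepancies(sample))
```

With the made-up numbers above, only the OpenCL entry exceeds the tolerance. A real test would fill the dictionary with energies computed for the same system and positions on each platform.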

I would be happy to test such a tarball, but I need to get it first. Dozens of these WUs have failed on my GPUs lately, while other WUs are processed without trouble. If somebody can help, please let me know to whom I should PM my e-mail address.

Regards, Thomas

Re: Continued trouble with MoonShot WUs 13414-13417

Posted: Thu Jul 09, 2020 8:26 am
by ajm
Just a few facts I gathered over the last few days. On my systems, only my AMD card (drivers 20.5.1) has had such failures with those WUs; I have seen none on Nvidia GPUs.
Someone else was (is) having recurring trouble on Windows with an AMD motherboard and a Radeon VII, which could be solved only by removing the GPU drivers (20.5.1) with DDU and reinstalling them.
Yet someone else then claimed to be able to fold without issues using a similar setup (AMD mobo and Radeon VII) BUT an earlier version of the AMD drivers.

Re: Continued trouble with MoonShot WUs 13414-13417

Posted: Thu Jul 09, 2020 9:20 am
by psaam0001
I have made some adjustments to my AMD Video Driver Settings, and they will be tested when I get another Moonshot WU.

Should the performance be much better, I have screen shots that may help those of you who are having issues. I will post them when I can figure out how to get them on here from my local system, or where to save them so I can give a link.

Paul

Re: Continued trouble with MoonShot WUs 13414-13417

Posted: Thu Jul 09, 2020 6:36 pm
by aetch
I wouldn't get too excited if you see an improvement.
I've had a run of units for Project 13416; some have run fine, others have really struggled. This is without changing anything about my config.

Re: Continued trouble with MoonShot WUs 13414-13417

Posted: Thu Jul 09, 2020 7:08 pm
by NormalDiffusion
I had the same problem with those WUs under Windows 10. It was my undervolting. It had run fine for weeks and months, but not with the new WUs. I raised the voltage a little bit and everything is fine now on my RVII and my 290X (my other RVII didn't have problems with the WUs).
For information, I am running old drivers: 19.6 from June for the RVII/290X combo, and 19.12.1 for the single RVII (I will have to move back to 19.6, as it's a little bit faster (2%) on 13416).

Re: Continued trouble with MoonShot WUs 13414-13417

Posted: Thu Jul 09, 2020 8:13 pm
by ThWuensche
Thanks for the feedback. I see that for all of you the WUs are processed. I'm running on stock parameters; the only thing that has been tweaked is an increased fan speed to keep the GPUs cooler. One major difference is that, as it looks, all of you are running on Windows, while my machines run Linux with the ROCm package. So it is probably not a fault of OpenMM with the Radeon VII as such, but more likely a problem in the Linux driver/OpenCL implementation.

Anyhow, if somebody from the Chodera lab could provide the tarball mentioned by John, and the error could then be found and maybe fixed (by the AMD ROCm guys), that would be great.

Re: Continued trouble with MoonShot WUs 13414-13417

Posted: Thu Jul 09, 2020 8:22 pm
by NormalDiffusion
You could try to undervolt the cards a little bit. Do you know what voltage they are running at now?

Re: Continued trouble with MoonShot WUs 13414-13417

Posted: Thu Jul 09, 2020 8:57 pm
by ThWuensche
On the two GPUs I have access to right now, the voltage floats mostly in the range 1.08 - 1.12 V, but at times I also see 0.74 V. The 0.74 V seems to be selected when the GPU is idle.

Basically I wouldn't like to interfere with the standard settings; the manufacturer should know how to set them so that the GPU works. Tweaking may result in things like better energy efficiency, but the default settings should provide stable operation. Also, 4 of these GPUs show the same effect, so it does not look like hardware instability to me.
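For others on Linux who want to report the same kind of voltage readings, a small sketch follows. It assumes the amdgpu driver exposes the graphics core voltage in millivolts as `in0_input` under each card's hwmon directory; the exact path can differ between kernel and driver versions, and the function name is made up for this example.

```python
import glob
import os


def read_gpu_voltages_mv(drm_root="/sys/class/drm"):
    """Return {card_name: millivolts} for every card exposing in0_input.

    Assumption: the driver publishes the core voltage at
    <drm_root>/cardN/device/hwmon/hwmonM/in0_input, in millivolts.
    """
    pattern = os.path.join(drm_root, "card*", "device", "hwmon", "hwmon*", "in0_input")
    readings = {}
    for path in sorted(glob.glob(pattern)):
        # First path component below drm_root is the card name, e.g. "card0".
        card = os.path.relpath(path, drm_root).split(os.sep)[0]
        try:
            with open(path) as handle:
                readings[card] = int(handle.read().strip())
        except (OSError, ValueError):
            # Skip cards whose sensor is unreadable or returns non-numeric text.
            continue
    return readings


if __name__ == "__main__":
    for card, millivolts in read_gpu_voltages_mv().items():
        print(f"{card}: {millivolts / 1000:.3f} V")
```

Sampling this once a second while a WU is running and again while the GPU is idle should show the two voltage levels described above, if the sensor is available on the system.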

Re: Continued trouble with MoonShot WUs 13414-13417

Posted: Thu Jul 09, 2020 9:11 pm
by NormalDiffusion
Yes, the voltage changes with the clock speed (MHz). I'm only speaking of the max value for the highest speed. Each RVII comes with a different value set by the manufacturer. This value is too high. All my RVIIs are undervolted (2 of 3 are running FAH 24/7). All of them were unstable *in FAH* with the default max voltage. With the factory voltage, the card generates too much heat. Not a big deal in games (never had problems in hours of playing), but FAH gives NaN errors on a regular basis, even if temps are below the max of what the card can cope with. Undervolting solved it.
On my cards I have factory values ranging from 98x mV up to 13xx mV.
The rvii can be a wonderful card, but is at the same time a diva...

Re: Continued trouble with MoonShot WUs 13414-13417

Posted: Thu Jul 09, 2020 9:16 pm
by psaam0001
Updated Post:

Here are the links to my screen shots, for those of you who are using the current Windows version of the latest AMD Radeon Lite Control Panel. Though, if you know how to access the GUI App for making these changes in Mac OS X or a current Linux distribution, give them a try (hopefully they will help w/other non-Moonshot WU's that we know there are benchmarks for).

Disclaimer: I do not have access to a Mac, or a Linux PC with currently supported GPU hardware on it.

Display1: https://drive.google.com/file/d/1v4hBqU ... sp=sharing

Display2: https://drive.google.com/file/d/1LnVOxs ... sp=sharing

Display3: https://drive.google.com/file/d/15ypzmU ... sp=sharing

See if these settings help. To change the settings to what I show in Display1, you will need to click on Global Settings first.

Paul

Re: Continued trouble with MoonShot WUs 13414-13417

Posted: Thu Jul 09, 2020 11:02 pm
by bruce
Multiple issues covered in a single topic.

* As was mentioned above, I've seen reports that older drivers work better than the latest. I don't have any of those devices so I can't really make any useful comments.
* As a general rule, the Covid Moonshot WUs are small and not an ideal fit for many GPUs. It's impossible to benchmark them accurately.
* Also, a variety of different proteins have been grouped under a single project number, making it even harder to benchmark them all to a single standard.

Re: Continued trouble with MoonShot WUs 13414-13417

Posted: Thu Jul 16, 2020 7:27 am
by psaam0001
Ok... Moonshot jobs are running better on my Ryzen 3's integrated GPU. But I did go into the Windows 10 control panel and turn off all of those power saving settings first. :D

I'll pay the local electric co-op for a few more watts of power, if it helps me give COVID-19 that OSHA compliant steel toed shoe to the backside.

Paul