13422 failing on RX 5700XT Linux

Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: 13422 failing on RX 5700XT Linux

Post by Neil-B »

It could be that the sanity checks/checkpoints come after steps 250 or 501 ... That may be why the errors always show up at that point: until the checks are done, the core doesn't know there is an issue?
JimF
Posts: 652
Joined: Thu Jan 21, 2010 2:03 pm

Re: 13422 failing on RX 5700XT Linux

Post by JimF »

If open science (for example science done at universities, financed by public money) will help us to have drugs approximately at the price of generics from the beginning, this will be a big advance for health and life of people, for which otherwise treatment will not be available.
Vijay Pande, the founder of FAH, is currently trying to make a lot of money for startup companies doing innovative work. I hope that includes anything they find useful from FAH. They need a lot of money from the successes to cover the failures, which are usually about 90% in risky fields that are being newly developed. They will use patents I expect to help them compete and bring products to market. If universities can do it better, that is fine with me. But I haven't seen it yet. (Universities get patents too by the way, and like to make as much money from them as they can.)
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: 13422 failing on RX 5700XT Linux

Post by ThWuensche »

JimF wrote:
If open science (for example science done at universities, financed by public money) will help us to have drugs approximately at the price of generics from the beginning, this will be a big advance for health and life of people, for which otherwise treatment will not be available.
Vijay Pande, the founder of FAH, is currently trying to make a lot of money for startup companies doing innovative work. I hope that includes anything they find useful from FAH. They need a lot of money from the successes to cover the failures, which are usually about 90% in risky fields that are being newly developed. They will use patents I expect to help them compete and bring products to market. If universities can do it better, that is fine with me. But I haven't seen it yet. (Universities get patents too by the way, and like to make as much money from them as they can.)
I hope that the results from FAH and project moonshot are published for unrestricted use instead of patented. At least that is what the promises look like and what should be self-evident for results based on "donors" electricity bills.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 13422 failing on RX 5700XT Linux

Post by bruce »

I'm not in the drug business ... and I'm not one to defend their high prices. When I went to work in aerospace engineering, it was very clear that our job was to design products that could be sold to our customers ... and if a product included an innovation that could be patented because it was a really novel idea, that was especially good for the engineer who invented it. I don't think it's any different in Big Pharma. Research performed for a public university cannot be patented, but the original work that happens to use that research can be.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 13422 failing on RX 5700XT Linux

Post by bruce »

Neil-B wrote: It could be that the sanity checks/checkpoints come after steps 250 or 501 ... That may be why the errors always show up at that point: until the checks are done, the core doesn't know there is an issue?
That could be right, but I don't think so. If a NaN occurs before reaching a sanity check, it should fail there. Another possibility is that the message being issued isn't an accurate description of the problem. Reproducing the error condition locally, whether in John's lab or ThWuensche's, should quickly settle that possibility.

The first part of the log lists events that happen periodically. Do any of them happen at multiples of 250 steps?
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: 13422 failing on RX 5700XT Linux

Post by JohnChodera »

> I hope that the results from FAH and project moonshot are published for unrestricted use instead of patented. At least that is what the promises look like and what should be self-evident for results based on "donors" electricity bills.

All of the COVID Moonshot data is being put into the public domain, and the project has committed to making all data public and keeping everything free of IP (no patents!): https://postera.ai/covid.

For now, the molecules we designed are all public and free of any IP protections, and all the data we collect will also be public---see here for more info: https://foldingathome.org/2020/08/24/co ... -sprint-3/

We have some philanthropic institutes lined up to help push to clinical trials so that we can deliver a drug that is as low-cost as possible that can be made by multiple manufacturers around the world.

We're working hard on automating the whole pipeline to bring you a real-time leaderboard for all compounds and compound statistics for each sprint so you can see the results of all calculations in real time.

We're also working to start automatically sweeping the raw simulation data (which is going to be much less exciting to the public, but useful to scientists a bit later) online, and will then be pushing it to the public server as soon as it is produced. It's just taking us a little while to get that infrastructure worked out, but it's coming very soon!

~ John Chodera // MSKCC
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: 13422 failing on RX 5700XT Linux

Post by JohnChodera »

> With the information in your post I have installed miniconda on one more computer, installed openmm, downloaded the zip file with the tests, and run RUN9. At first it broke with a particle coordinate NaN before reporting any iterations; after setting steps_per_iteration to one and increasing niterations, it breaks after "completed 250 steps". That is a first hint, since I verified in the FAH logs on that computer that in most cases it also breaks at step 250. This indicates that it might not be the result of a calculation running away, but something systematically linked to step 250. Besides the many occurrences at step 250 there are some at step 501. From the FAH logs I had assumed that step 250 might be a first verification point, which would explain the coincidence; but running openmm in single iterations leads to the same result (counting up all previous steps in the console output), so that coincidence raises questions.
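The single-step approach described in the quoted post can be sketched roughly as follows. This is only an illustration, not the actual test script from the zip file: first_nan_step and ToySimulation are made-up names, and the toy class simply stands in for a real simulation. With OpenMM, one would presumably pass something that calls integrator.step(1) as the advance callback and read the positions from context.getState(getPositions=True).

```python
import math


def first_nan_step(advance, get_coords, max_steps):
    """Advance one step at a time and return the first step count at which
    any coordinate is NaN, or None if no NaN shows up within max_steps."""
    for step in range(1, max_steps + 1):
        advance()
        if any(math.isnan(c) for xyz in get_coords() for c in xyz):
            return step
    return None


class ToySimulation:
    """Hypothetical stand-in for a real MD context (an assumption for this
    sketch): its coordinates turn into NaN at a fixed step."""

    def __init__(self, bad_step):
        self.bad_step = bad_step
        self.step_count = 0
        self.coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

    def advance(self):
        self.step_count += 1
        if self.step_count >= self.bad_step:
            # Simulate the failure: every coordinate becomes NaN.
            self.coords = [(float("nan"),) * 3 for _ in self.coords]
        else:
            # Otherwise drift slowly along x.
            self.coords = [(x + 0.001, y, z) for x, y, z in self.coords]

    def get_coords(self):
        return self.coords


sim = ToySimulation(bad_step=250)
print(first_nan_step(sim.advance, sim.get_coords, max_steps=1000))
```

Checking after every single step rules out the possibility that the reported step number is just the position of the next periodic check.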
@ThWuensche: This is great! We now have a test case that makes it very easy to reproduce the issue! Can you post this on the OpenMM issue tracker? Our lead OpenMM developer Peter Eastman can work with you there to try a few more things to debug:

http://github.com/openmm/openmm

Please tag me there as @jchodera and I will chime in with more information and the test scripts.

If you can also post more details of your configuration (a FAH science logs header will do!) there, it will help keep the information organized.

Thank you for helping us get to the bottom of this!

~ John Chodera // MSKCC
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: 13422 failing on RX 5700XT Linux

Post by ThWuensche »

The issue is already open: https://github.com/openmm/openmm/issues/2813. As mentioned, if run with precision double it does not break.

The captured WU is 13422,4371,95,2. Here is the science.log from the time I rsynced the directory:

*************************** Core22 Folding@home Core ***************************
       Core: Core22
       Type: 0x22
    Version: 0.0.11
     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
  Copyright: 2020 foldingathome.org
   Homepage: https://foldingathome.org/
       Date: Jun 27 2020
       Time: 22:50:00
   Revision: cfc2940c5dd1aa80f60daa6e28d4a2a417f74edb
     Branch: core22-0.0.11
   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
             -funroll-loops
   Platform: linux2 4.19.76-linuxkit
       Bits: 64
       Mode: Release
Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
             <peastman@stanford.edu>
       Args: -dir 03 -suffix 01 -version 706 -lifeline 12914 -checkpoint 15
             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
************************************ libFAH ************************************
       Date: Jun 27 2020
       Time: 22:11:04
   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
     Branch: HEAD
   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
             -funroll-loops
   Platform: linux2 4.19.76-linuxkit
       Bits: 64
       Mode: Release
************************************ CBang *************************************
       Date: Jun 27 2020
       Time: 22:10:11
   Revision: f8529962055b0e7bde23e429f5072ff758089dee
     Branch: HEAD
   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
             -funroll-loops -fPIC
   Platform: linux2 4.19.76-linuxkit
       Bits: 64
       Mode: Release
************************************ System ************************************
        CPU: AMD Ryzen 7 3700X 8-Core Processor
     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
       CPUs: 16
     Memory: 15.59GiB
Free Memory: 11.82GiB
    Threads: POSIX_THREADS
 OS Version: 5.7
Has Battery: false
 On Battery: false
 UTC Offset: 2
        PID: 12918
        CWD: /var/lib/fahclient/work
********************************************************************************
Folding@home GPU Core22 Folding@home Core
Version 0.0.11
[1] compatible platform(s):
  -- 0 --
  PROFILE = FULL_PROFILE
  VERSION = OpenCL 2.0 AMD-APP (3137.0)
  NAME = AMD Accelerated Parallel Processing
  VENDOR = Advanced Micro Devices, Inc.

(2) device(s) found on platform 0:
  -- 0 --
  DEVICE_NAME = gfx906+sram-ecc
  DEVICE_VENDOR = Advanced Micro Devices, Inc.
  DEVICE_VERSION = OpenCL 2.0 

  -- 1 --
  DEVICE_NAME = gfx906+sram-ecc
  DEVICE_VENDOR = Advanced Micro Devices, Inc.
  DEVICE_VERSION = OpenCL 2.0 

[ Entering Init ]
  Launch time: 2020-08-24T17:43:14Z
  Arguments passed: -dir 03 -suffix 01 -version 706 -lifeline 12914 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0 
  For testState comparison of CPU and GPU, will use:
    forceTolerance: 5 kJ/mol/nm
    energyTolerance: 10 kJ/mol
[ Leaving  Init ]
[ Entering Main ]
  Reading core settings...
  Total number of steps: 1000000
  Checkpoint write interval: 50000 steps (5%) [20 total]
  JSON viewer frame write interval: 10000 steps (1%) [100 total]
  XTC frame write interval: 250000 steps (25%) [4 total]
  Global context and integrator variables write interval: 25000 steps (2.5%) [40 total]
[ Initializing Core Contexts ]
  Using platform OpenCL
  Looking for vendor: amd...found on platformId 0
  Setting platform precision to mixed
  Setting DisablePmeStream to 1
    Checking for integrator.xml
    Found integrator.xml
Loading integrator from integrator.xml
Stream copied, deserializing...
    Checking for integrator.xml
    Found integrator.xml
Loading integrator from integrator.xml
Stream copied, deserializing...
    Checking for system.xml
    Checking for system.xml.gz
    Checking for system.xml.bz2
    Found system.xml.bz2
  Deserializing System...successful.
  Found 90551 atoms, 10 forces.
  Finding State XML file...
    Checking for state.xml
    Checking for state.xml.gz
    Checking for state.xml.bz2
    Found state.xml.bz2
  Deserializing State...successful.
    Ewald error tolerance in force 7 is 0.00025
    Ewald parameters: alpha 2.75697 nx 96 ny 96 nz 96
    Integrator Type: N6OpenMM16CustomIntegratorE
    Constraint Tolerance: 1e-08
    Time Step in PS: 0.004
    Using CPU platform for reference calculations.
  Performing initial sanity checks before starting work...
  Comparing forces and energies between initial State and CPU...
  Comparing forces and energies between GPU and CPU...
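
Whether any of the periodic events listed in the log above could coincide with the failing steps can be checked with a few lines of Python. This is a minimal sketch, assuming the write intervals printed in the log (50000, 10000, 250000 and 25000 steps) and the failure steps reported in this thread (250 and 501); the helper name events_at is made up for illustration.

```python
# Write intervals as printed in the "Entering Main" part of the log (in steps).
intervals = {
    "checkpoint": 50000,
    "JSON viewer frame": 10000,
    "XTC frame": 250000,
    "global context and integrator variables": 25000,
}

# Steps at which the failures were reported in this thread.
failing_steps = [250, 501]


def events_at(step, intervals):
    """Return the names of all periodic events that fire exactly at `step`."""
    return [name for name, every in intervals.items() if step % every == 0]


for step in failing_steps:
    print(step, events_at(step, intervals))
```

None of the listed write intervals divides 250 or 501, so these logged events do not line up with the failure steps. Periodic work inside the core that is not listed in the log is not covered by this check.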
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: 13422 failing on RX 5700XT Linux

Post by JohnChodera »

Thanks, @ThWuensche! We'll continue the investigation on the OpenMM issue tracker and update everyone here with what we find.

~ John Chodera // MSKCC
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: 13422 failing on RX 5700XT Linux

Post by JohnChodera »

There's another VEGA-specific issue reported in the OpenMM issue tracker that may be related to the issues we're seeing here:
https://github.com/openmm/openmm/issues/2817

I'll keep you folks updated with what we find.

~ John Chodera // MSKCC