13422 failing on RX 5700XT Linux

Moderators: Site Moderators, FAHC Science Team

ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: 13422 failing on RX 5700XT Linux

Post by ThWuensche »

PantherX wrote:
ThWuensche wrote:...Do these WUs provide additional insight or are they only scheduled to avoid people complaining about missing GPU WUs?
The Project series 134XX are generally released in "pairs" (13410/13411 followed by 13412/13413, etc.), and each pair is an improvement over the previous one. The most recent pairs are now sufficiently optimized that they are used for shortlisting potential drug compounds. This kind of simulation is a first for F@H, so these projects are considered highly experimental; they initially had a higher rate of errors, though it is lower nowadays. This is being closely monitored by John and optimizations are happening as soon as possible.

None of those Projects are "filler" projects to keep the GPUs busy. If there are no suitable WUs, your GPU will be idle.
Thanks for the answer. My question was related to the 1174x and 1175x projects, which my GPUs get after switching to "Cancer" to avoid the failing 1342x projects. I will keep it running that way until the issue with the 1342x projects is hopefully solved in one of the future Core 22 versions.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 13422 failing on RX 5700XT Linux

Post by bruce »

ThWuensche wrote:
PantherX wrote:None of those Projects are "filler" projects to keep the GPUs busy. If there are no suitable WUs, your GPU will be idle.
Thanks for the answer. My question was related to the 1174x and 1175x projects, which my GPUs get after switching to "Cancer" to avoid the failing 1342x projects. I will keep it running that way until the issue with the 1342x projects is hopefully solved in one of the future Core 22 versions.
All FAH projects serve the needs of valid scientific research. Cancer researchers still have many unanswered questions and they still have ongoing projects. It's just that COVID-19 became the top-priority focus for FAH, and cancer research lost a lot of its traditional support.
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: 13422 failing on RX 5700XT Linux

Post by ThWuensche »

bruce wrote:
ThWuensche wrote: Thanks for the answer. My question was related to the 1174x and 1175x projects, which my GPUs get after switching to "Cancer" to avoid the failing 1342x projects. I will keep it running that way until the issue with the 1342x projects is hopefully solved in one of the future Core 22 versions.
All FAH projects serve the needs of valid scientific research. Cancer researchers still have many unanswered questions and they still have ongoing projects. It's just that COVID-19 became the top-priority focus for FAH, and cancer research lost a lot of its traditional support.
Even selecting Cancer I get COVID projects, which is no issue, as I would like to support both COVID- and cancer-related research. The point of my post was that I need to avoid the 13422 WUs, as they fail on my setup (Linux, ROCm, Radeon VII) as well as for the initiator of this thread with his RX 5700XT. Projects 1174x and 1175x are COVID projects; they just seem to have been around for a long time, so I was unsure whether they still have high relevance. Actually it seems there are no cancer projects out there at the moment, otherwise I should not get exclusively COVID WUs with that selection. But that question is leading far away from the topic of this thread ...
Neil-B
Posts: 2027
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: 13422 failing on RX 5700XT Linux

Post by Neil-B »

I wouldn't normally suggest this behaviour, but you could look up which work server (WS) is serving the 134xx projects and block it with your firewall ... even if the assignment server (AS) tries to assign you to that server, the connection will fail, and I believe the client will then go back to the AS for a new assignment (not sure if this happens straight away or after a bit of a delay ... someone else may be able to clarify this).
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: 13422 failing on RX 5700XT Linux

Post by ThWuensche »

JohnChodera wrote:We're aware of some GPU/driver combinations that seem to produce large discrepancies or issues in forces between GPU and CPU, as reported here:

Code:

15:10:56:WU00:FS00:0x22:ERROR:Discrepancy: Forces are blowing up! 0 0
We're working on adding instrumentation to the core22 0.0.12 release to drop into debugging mode and send back a lot more information about what is going wrong in these cases. Hopefully we can get this out soon!

Thanks so much for bearing with us!

~ John Chodera // MSKCC
@JohnChodera: John, valuing your work, I had assumed that you need and can use any support in this area; that's why I have been pushing with proposals to help with the debugging and installed OpenMM according to your suggestion in a private message ... . But in the meantime I got the impression that external help is maybe not required, maybe not welcome, maybe not feasible. So I will stop bothering, stop placing comments all over the different posts, and wait for your solution; I will be what you call people who provide compute resources, a "donor" instead of an active contributor. Basically this will save me a lot of effort, since diving into debugging would cost me a lot of time that I can spend on other tasks. Should you come to the conclusion that you could use help from those who actually run the hardware/software combinations that see the problems, you're welcome to come back with a request for support. Otherwise I will just keep my computers (contributing about 11M PPD) running as a passive "donor", at least as long as I can provide the electricity from our solar panels during the summertime.
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud
Contact:

Re: 13422 failing on RX 5700XT Linux

Post by PantherX »

ThWuensche wrote:...But in the meantime I got the impression that external help is maybe not required, maybe not welcome, maybe not feasible. So I will stop bothering, stop placing comments all over the different posts, and wait for your solution; I will be what you call people who provide compute resources, a "donor" instead of an active contributor...
Mentioning issues in the forum is good, since there's a massive variety of hardware and software combinations that can't be tested by internal/beta donors. The latest version of FahCore_22, 0.0.11, does manage to capture more errors: when there's an error, it uploads the relevant files to the Server, where John is actively monitoring them. The plan is to tackle the errors with the highest counts and work down that (potentially very long) list. Thus, if you see in the log file that the "results" have been uploaded to the Server, there's no need to report it here, since the Server has the record. However, there are some cases where FahCore_22 is unable to upload results due to the nature of the error, and that is something we encourage donors to report here, since John will never "see" it on the Server.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 13422 failing on RX 5700XT Linux

Post by bruce »

I'm sure debugging remotely is no fun. V0.0.12 will capture more information, and then there will be a 0.0.13 which will narrow the spectrum of bugs. I can't even figure out which Project/Run pairs fail with which GPUs running with which drivers.

We did capture one AMD bug ("sortShortList") that Navi fixed, and we patched OpenMM for pre-Navi GPUs. We may have to do that again, but this time it's not isolated to a specific error message, so it may not be just one bug.
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: 13422 failing on RX 5700XT Linux

Post by ThWuensche »

PantherX, Bruce, thank you for your answers.

@PantherX:
where John is actively monitoring them
and
since John will never "see" it on the Server
This is where the problem starts. Of course it is good that John, as the "central controller", sees it and takes action, and I have no doubts that John Chodera and his team are extraordinarily qualified for that. Still, it is a centralized system that might not put resources to their full use. That approach may have been sufficient until a few months ago, when FAH was mostly contributing to fundamental research. But now we are in a pandemic, thousands of people die from it every day, and FAH has taken a serious role in the COVID Moonshot project, in which open science contributes open solutions to try to save lives.

I would like to make a somewhat more "graphic" analogy: consider somebody in a car far out in the middle of nowhere, and the car will move neither forward nor back. Let's further assume that it's not really a fault of the car, but that the car is just stuck in mud (i.e. it is a bug in the OpenCL implementation or driver, not in the FAH application). Now you can say that the driver should stay in the car, the quality department of the car manufacturer "sees" that the car does not move and will provide some more diagnostics in the next release of the firmware. Then, after some time, the new firmware is installed over some form of radio link, and now the manufacturer's quality department will see that the motor is running, the wheels are spinning, there is some power demand for that, and will conclude that the wheels are probably not spinning in free air but might be spinning in mud. However, this diagnosis still does not do anything to get the car out of the mud (fix the OpenCL implementation or driver). As opposed to that scenario, the car's driver could get out of the car, see that it's sunk in mud, collect a few stones or some wood and get the car out of the mud. Afterwards the driver could inform the responsible road maintenance department to put a truckload of stones into that mud hole.

That's where we are: the driver has to wait until the "central controller" does something to solve the problem. He can't get out of the car, find the cause of the trouble and fix it, since, unlike the whole open-science Moonshot project, the FAH debugging is a closed show. I think that among the many thousands of "donors" there are probably around 1000 people with experience in software development, and maybe around 10-50 who would be ready to actively help. This could be a serious help to the project in fixing bugs. That does not mean that the approach of additional diagnostics in the core is useless - the driver of the car could be somebody in high heels with no clue how to get a car out of mud. But currently help from qualified supporters is excluded from the project, and that may delay the efforts to save lives as the pandemic progresses.

@bruce: I agree that debugging remotely is no fun; furthermore, it greatly delays the process of fixing bugs. As you describe, it has to move through a number of core releases before any result can be expected. And it will tie up a lot of resources, since it is not the easiest way of debugging.

I think the project could do better by collecting all the available experience and resources in an open, efficient way. In my case, the first contact with John on this issue was on the 3rd of July, 1.5 months ago. I cannot give any guarantees, since I have never done GPU programming, but I have been involved in engineering software for about 40 years. There is a good chance that I could have found and fixed some of the problems occurring in my setup during that time. And I'm sure there are others with a similar background active in the project. So let them be contributors, not only "donors".
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 13422 failing on RX 5700XT Linux

Post by bruce »

You picked a favorable example. You're not wrong, but there's more to it than you let on.

Stuck-in-the-mud is visually very easy for a driver to diagnose. Now suppose it's a bug in an obscure part of the car's computer code. Feel free to walk around the car all you want. ;)

Today it even takes computerized diagnostic equipment and an internet connection to explain in a human-readable way why the "check engine" light is on.
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: 13422 failing on RX 5700XT Linux

Post by ThWuensche »

On this point the analogy is a little weak, as you point out. Stuck in the mud is easy to see, unlike the interaction between application and driver. For a better analogy we should take an autonomous car without a gas pedal ... a car where the driver has no influence on whether the wheels spin, and which decides from its own logic whether to drive the wheels. And if the wheels do not turn, because the logic of the car has decided that something is slippery and has stopped the wheels, it also becomes much harder to see that the reason it is not moving is that it's caught in mud.

Actually my company specializes in CAN bus solutions, the network used in cars. We have developed in-car protocol layers that are also used for diagnostics, so I'm well aware of these issues. And having followed how much trouble the car manufacturers had providing useful on-board diagnostics for their closed environment, I don't have a lot of hope that such diagnostics can provide a quick solution in an environment as diverse as the hardware and drivers FAH is running on.

If it were as easy to diagnose as "stuck in mud", I would not ask. But to find the cause in that more complicated environment, you need to understand where it fails. And that can most effectively be done by understanding the logic, with the source code, following its actions, putting in debug output, narrowing down the region of the problem ... . To return to the autonomous car: if you don't know that it will stop the wheels when it decides something is slippery, it will be difficult to understand why it is not moving.
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud
Contact:

Re: 13422 failing on RX 5700XT Linux

Post by PantherX »

ThWuensche wrote:...unlike the whole open-science Moonshot project, the FAH debugging is a closed show. I think that among the many thousands of "donors" there are probably around 1000 people with experience in software development, and maybe around 10-50 who would be ready to actively help. This could be a serious help to the project in fixing bugs...
I do understand your POV, ThWuensche. Prior to the pandemic, plans were in place to open-source all aspects of the F@H software: the client, the viewer, the FahCores, etc. The idea behind it was to tap into the F@H community and allow passionate developers to implement user features while the researchers work on scientific features and their research. Some progress was made (https://github.com/FoldingCommunity/Welcome) and then the pandemic arrived, which obviously threw a spanner in the works. While the original goal of moving F@H to open source is still on the table, it will take a while to get there now.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: 13422 failing on RX 5700XT Linux

Post by JohnChodera »

> I think the project could do better by collecting all the available experience and resources in an open, efficient way. In my case, the first contact with John on this issue was on the 3rd of July, 1.5 months ago. I cannot give any guarantees, since I have never done GPU programming, but I have been involved in engineering software for about 40 years. There is a good chance that I could have found and fixed some of the problems occurring in my setup during that time. And I'm sure there are others with a similar background active in the project. So let them be contributors, not only "donors".

@ThWuensche: I totally agree! My lab does everything in the open from the start by default. You can check out our GitHub page to see even projects in their very earliest stages, and we love interacting with other contributors (and contributing to other projects): http://github.com/choderalab.

The issue here is that we inherited a legacy codebase created by a single developer in an era when this was not the norm, that there are enormous barriers to transitioning it to the open-source model, and that this has not been a priority for those with the resources to fund the transition until recently. We had finally started to make a push in this direction late last year, and I think the pandemic will accelerate this transition as we now try to bring on more people to help.

The good news is that the underlying science core---OpenMM---is fully open source:
http://openmm.org

In principle, we just need you to capture a workload that fails and run it through OpenMM locally, then debug via the OpenMM issue tracker: http://github.com/openmm/openmm/issues.

To install OpenMM from conda (I think you've done this already; if not, use Miniconda: https://docs.conda.io/en/latest/miniconda.html):

Code:

conda install -c conda-forge -c omnia openmm
You can grab a ZIP archive of a variety of workloads here (222MB):
https://fah-ws3.s3.amazonaws.com/debug/ ... chmark.zip

To run each workload (in each subdirectory), you can use the short Python script here:

Code:

from simtk import openmm, unit

def read_xml(filename):
    """Deserialize OpenMM object from XML file"""
    print(f'Reading {filename}...')
    with open(filename, 'r') as infile:
        return openmm.XmlSerializer.deserialize(infile.read())

system = read_xml('system.xml')
state = read_xml('state.xml')
integrator = read_xml('integrator.xml')

print('Creating Context...')
platform = openmm.Platform.getPlatformByName('OpenCL')
platform.setPropertyDefaultValue('Precision', 'mixed')
context = openmm.Context(system, integrator, platform)
context.setState(state)

print('Running simulation...')
niterations = 100
nsteps_per_iteration = 500
for iteration in range(niterations):
    integrator.step(nsteps_per_iteration)
    newstate = context.getState()
    print(f'completed {(iteration+1)*nsteps_per_iteration} steps')

# Clean up
del context, integrator
I'd focus on the Moonshot workload examples, RUN8 and RUN9.
If the forces are blowing up early on FAH, this should run into those issues relatively quickly.
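
If you want to watch the blow-up happen rather than just catch the crash, the loop above can also pull back the potential energy and forces each iteration. The following is only a rough sketch on top of the script above (it reuses context, integrator, unit, niterations and nsteps_per_iteration from it and adds a numpy dependency); it is not what core22 itself does internally:

Code:

import numpy as np

# Rough sketch: same loop as above, but also report the potential energy and the
# largest force component each iteration, so a blow-up shows up as NaN/inf or an
# absurdly large value before the run dies.
for iteration in range(niterations):
    integrator.step(nsteps_per_iteration)
    state = context.getState(getEnergy=True, getForces=True)
    energy = state.getPotentialEnergy().value_in_unit(unit.kilojoules_per_mole)
    forces = state.getForces(asNumpy=True).value_in_unit(
        unit.kilojoules_per_mole / unit.nanometer)
    max_force = float(np.abs(forces).max())
    print(f'step {(iteration+1)*nsteps_per_iteration}: '
          f'potential {energy:.1f} kJ/mol, max |F| {max_force:.3e} kJ/mol/nm')
    if not np.isfinite(energy) or not np.isfinite(max_force):
        print('Energy or forces are no longer finite -- blowing up')
        break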

So we're almost to the situation you describe with the science cores---if you catch anything here, please open an issue in the OpenMM issue tracker and we can get more eyeballs from the open source community on this!
https://github.com/openmm/openmm/issues

Again, huge apologies for the delay---we've been working on automating the testing so we can do this at scale on all GPUs that encounter failures, since this was the most efficient use of our limited resources, but we've been sidetracked by automating the analysis infrastructure first so we could deliver scientific insights to the chemists as rapidly as possible. As you say, thousands of people are dying per day, and we're working flat-out to get to the point where we can go into clinical trials.

Thanks so much for sticking with us, and please do report what you find! In the meantime, we're working on getting the debug build put together over the next few days.

~ John Chodera // MSKCC
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: 13422 failing on RX 5700XT Linux

Post by ThWuensche »

@JohnChodera: John, thanks for the information. I know that you support open science wholeheartedly, and that is what mankind needs instead of the purely money-focused approaches of many drug-manufacturing companies. I also understand the pressure resulting from the need to advance the science and improve the infrastructure at the same time.

With the information in your post I have installed Miniconda on one more computer, installed OpenMM, downloaded the zip file with the tests and run RUN9. At first it would break with "Particle coordinate is nan" before reporting any iterations; after setting nsteps_per_iteration to one and increasing niterations, it breaks after "completed 250 steps". That is a first hint, since I verified in the FAH logs on that computer that in most cases it also breaks at step 250. This suggests it might not be the result of a calculation running away, but something systematically linked to step 250. Besides many occurrences of step 250 there are some at step 501. From the FAH logs I had assumed that 250 might be a first verification point and that this explained the coincidence, but since running OpenMM in single-step iterations leads to the same result (with every preceding step counted up in the console output), that coincidence raises questions.

The same holds for two captured 13422 WUs: WU 13422,4371,95,2 breaks with "Particle coordinate is nan" at (after) step 250, and WU 13422,4386,10,2 breaks with "Particle coordinate is nan" at (after) step 501.

For what it's worth: the failure occurs after step 250 with single and mixed precision; with double precision it seems to run without that problem.
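
For reference, the comparison looked roughly like the sketch below (not my exact code, just a reconstruction: it reuses read_xml() and the input files from John's script above, recreates the Context for each OpenCL precision mode, and steps one step at a time so the failing step is visible):

Code:

from simtk import openmm

# Rough sketch: compare the OpenCL precision modes on one workload, stepping a
# single step at a time so the step at which "Particle coordinate is nan"
# appears can be seen. Assumes read_xml() from John's script above.
for precision in ['single', 'mixed', 'double']:
    system = read_xml('system.xml')
    state = read_xml('state.xml')
    integrator = read_xml('integrator.xml')
    platform = openmm.Platform.getPlatformByName('OpenCL')
    platform.setPropertyDefaultValue('Precision', precision)
    context = openmm.Context(system, integrator, platform)
    context.setState(state)
    try:
        for step in range(1, 1001):
            integrator.step(1)
            context.getState(getPositions=True)  # should surface the nan error, if any
        print(f'{precision}: completed 1000 steps without error')
    except Exception as e:
        print(f'{precision}: failed after step {step}: {e}')
    del context, integrator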

I will let you know about further findings.
Last edited by ThWuensche on Mon Aug 24, 2020 7:54 pm, edited 2 times in total.
JimF
Posts: 652
Joined: Thu Jan 21, 2010 2:03 pm

Re: 13422 failing on RX 5700XT Linux

Post by JimF »

ThWuensche wrote:@JohnChodera: John, thanks for the information. I know that you support open science wholeheartedly, and that is what mankind needs instead of the purely money-focused approaches of many drug-manufacturing companies.
But there is a reason why we don't have non-profit drug companies. They don't make money.
Correspondingly, there is a reason all the existing ones are for profit. They do make money.

If you want to contribute to research for free, that is fine. But don't confuse that with getting a product on the market.
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: 13422 failing on RX 5700XT Linux

Post by ThWuensche »

JimF wrote:
ThWuensche wrote:@JohnChodera: John, thanks for the information. I know that you support open science wholeheartedly, and that is what mankind needs instead of the purely money-focused approaches of many drug-manufacturing companies.
But there is a reason why we don't have non-profit drug companies. They don't make money.
Correspondingly, there is a reason all the existing ones are for profit. They do make money.

If you want to contribute to research for free, that is fine. But don't confuse that with getting a product on the market.
I'm not saying non-profit; my company doesn't give its products away for free either. I mean the practice of exploiting patents - look at the price development of imatinib. Companies should make a profit, and producers of generics make a profit too. But if you put profit above humanity, knowing that people are desperate, it's anything but honest. If open science (for example, science done at universities and financed by public money) helps us get drugs at approximately the price of generics from the beginning, that will be a big advance for the health and lives of people for whom treatment would otherwise not be available.