Save failing WUs and run them on openMM locally - possible?

Moderators: Site Moderators, FAHC Science Team

Post Reply
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Save failing WUs and run them on openMM locally - possible?

Post by ThWuensche »

Is there a way to save failing WUs (ideally automatically) and convert them into a job that can be run locally on OpenMM? That would make it possible to sort out whether the failure is caused by the OpenCL implementation, by OpenMM, or by something else. Possible fixes could then be tested in a repeatable setup.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Save failing WUs and run them on openMM locally - possib

Post by bruce »

Yes, that's possible, but if you're successful, you'll be prohibited from returning the results to FAH's servers. The most you could contribute is a log showing the error message(s) describing the failure, and that's most likely already being returned to the server for a diagnosis and potentially a fix to the FAHCore or to the WU's construction.
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: Save failing WUs and run them on openMM locally - possib

Post by ThWuensche »

Bruce, thanks for your reply. What I am considering is debugging into the OpenMM/OpenCL implementation to find the cause of the trouble. That requires being able to repeat the job run, both for debugging and for testing possible fixes. If somebody from the OpenMM team can do it so that I don't have to, that would be best and would provide the quickest solution. But if they're too busy, I could see what I can do. Fortunately both OpenMM and AMD's ROCm software stack are open source.

Not returning the results to the FAH servers is not an issue in this case; my aim is to help with software stability and reliability.
Joe_H
Site Admin
Posts: 7868
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Save failing WUs and run them on openMM locally - possib

Post by Joe_H »

My understanding is that they are working on getting a version of the core out that includes debugging symbols. Once that is out, the logged material sent back to them will contain more information to dig further. JohnChodera should be posting once it is released.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Save failing WUs and run them on openMM locally - possib

Post by bruce »

Once the debugging symbols are made available, the error reports that get uploaded after a crash will be decipherable. (You really don't want to read a core dump without the symbol table produced when it was compiled.) John has been doing a pretty good job of peering into the errors and fixing them. Notice that the version number jumped from 0.0.2 to 0.0.5 to something approximating 0.0.13 in not very long.
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: Save failing WUs and run them on openMM locally - possib

Post by ThWuensche »

I see I have to learn more about how the software is organized. Is documentation available?

I understand the cores are executables that get the data to process from the work units. As far as I can see, the cores are stripped: debug symbols are not included, since including them would make the files a lot bigger. Of course debug symbols are needed to ease understanding, but they are probably not required as part of the distribution. Still, debug symbols only help with understanding core dumps, not with diagnosing malfunction. They are also extremely important for running the core in a debugger, but that is probably not what the typical "donor" does.

I consider it rather difficult to debug a malfunction in a concrete setup (combination of OS, GPU hardware, OpenCL implementation) remotely. If the program is not monitored with breakpoints/watchpoints directly in a debugger, the approach I would take is to place debug output at various places in the code. That is probably not very effective on a remotely running core, since modified versions of the core would have to be routed to exactly that test platform, along with the relevant WUs. Furthermore, debugging may involve not only the core but also components such as the OpenCL runtime.

Remote debugging with an interactive debugger is of course a possibility, but I don't think I would be comfortable with software on my system being controlled remotely for interactive debug sessions the way a remote-controlled debugger allows. That would create backdoors, which would be rather disastrous for a project like this, which largely builds on trust.

I would much prefer local interactive debugging, which was the intent of my initial question. That of course requires the active support of the "donor" and as such is limited. Since the core code in general is not open source, the possibility of running WUs in a local OpenMM instance, without delivering the results to the servers, seems a reasonable option.

Please correct me if I'm wrong or missed something.
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: Save failing WUs and run them on openMM locally - possib

Post by JohnChodera »

Hi folks! Thanks so much for taking the initiative to look into this.

OpenMM is indeed available to build and install on its own: http://openmm.org
It's also available as a conda-installable package for those who use Python: https://anaconda.org/omnia/openmm

When the client starts a core22 WU, it unpacks all the files you need into a directory like work/00/01. You could, in principle, run the project from Python with a simple script to investigate further, but in practice it's hard for us to figure out exactly what has gone wrong with a particular combination of drivers and hardware.
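As a rough illustration of what such a script could look like: the sketch below loads the serialized OpenMM objects from a work directory and steps the simulation on the OpenCL platform, printing energies so a NaN blow-up is easy to spot. This is not an official FAH tool; the file names (system.xml.bz2, state.xml.bz2, integrator.xml.bz2) and the work/00/01 layout are assumptions that may differ between projects, and on older OpenMM versions the package is imported as simtk.openmm rather than openmm.

```python
# Hedged sketch: re-run a core22 work unit's serialized OpenMM files locally.
# File names and directory layout are assumptions; adjust to what you find
# in your own work/ directory.
import bz2
import os


def opener_for(path):
    """Choose a file opener: core22 XML files are usually bz2-compressed."""
    return bz2.open if path.endswith(".bz2") else open


def load_xml(path):
    """Deserialize one OpenMM XML file (System, State, or Integrator)."""
    import openmm  # deferred so the sketch can be read without OpenMM installed
    with opener_for(path)(path, "rt") as f:
        return openmm.XmlSerializer.deserialize(f.read())


def rerun(workdir="work/00/01", chunks=50, steps_per_chunk=100):
    """Re-run the WU on the OpenCL platform, printing potential energies."""
    import openmm
    from openmm import unit

    system = load_xml(os.path.join(workdir, "system.xml.bz2"))
    state = load_xml(os.path.join(workdir, "state.xml.bz2"))
    integrator = load_xml(os.path.join(workdir, "integrator.xml.bz2"))

    # Use OpenCL to reproduce the conditions the FAH core runs under.
    platform = openmm.Platform.getPlatformByName("OpenCL")
    context = openmm.Context(system, integrator, platform)
    context.setState(state)

    for i in range(chunks):
        integrator.step(steps_per_chunk)
        e = context.getState(getEnergy=True).getPotentialEnergy()
        print(i, e.value_in_unit(unit.kilojoule_per_mole))


if __name__ == "__main__" and os.path.isdir("work/00/01"):
    rerun()
```

Running this under a debugger, or against a locally built OpenMM/ROCm stack, would give the repeatable setup discussed above.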

We are currently hard at work on a core22 0.0.12 build that, when it encounters a force or energy RMSE error or an unrecoverable NaN, will run a short battery of tests to further isolate the problem and identify which GPU kernels might be responsible.

Separately, there is a Windows segfault issue that seems to happen sporadically on successful core exit (even after 100% completion). We think we see sporadic segfaults on Linux as well, and are working on debugging this locally, since that is much easier. It may be associated with some sort of race condition or uninitialized memory in one of the libraries we use.

We should have 0.0.12 in testing in a few days, and this will bring back a great deal more useful information to our servers to help fix whatever might be going on when we see failures.

Thanks for your patience---it's been difficult to juggle both infrastructure improvements and delivering rapid scientific insights for COVID-19 drug discovery, but we've been working hard to bring a few more folks onboard to help out!

~ John Chodera // MSKCC
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: Save failing WUs and run them on openMM locally - possib

Post by ThWuensche »

Dear John,

I'm well aware of the workload you have:
- provide research work
- develop infrastructure
- keep the contributors with their zoo of hardware/software active

That's why I wanted to support you, at least in fixing the bugs visible in my setup. You had suggested that I install OpenMM via miniconda and proposed to send me a tarball for tests after that. Since I did not receive that tarball, I assumed you were just too busy. That was the reason for my question: trying to provide debugging help without stealing your time.

For me, however, that would clearly mean diving into software I'm not familiar with. So if you can find and fix the problems with the new release, without my help, that's even better.

Should you need my help in debugging, please let me know. In that case, either the tarball mentioned above or the script you describe would ease my start.

Thanks for that great contribution to open/public science and medicine!

Best regards, Thomas
Post Reply