Running F@H on HPC

loadnabox
Posts: 5
Joined: Sat Mar 14, 2020 2:56 pm

Running F@H on HPC

Post by loadnabox »

I was having trouble finding much information on the topic. Are there any suggestions for running this on HPC?

I have access to a decent-sized HPC cluster where I can run scientific applications for free during free cycles (at the risk of jobs getting canceled if a priority job comes in).

The problems are:
- It cannot run as a service.
- It must be installed to a shared drive, not the default locations.
- It cannot modify the default Python environment.

Additional considerations:
How to prevent multiple instances from clobbering each other's downloads (possibly on the shared drive; maybe set the download location to /tmp; still thinking this one through).

Thoughts?
foldy
Posts: 2061
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slot)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Running F@H on HPC

Post by foldy »

Can it run a Docker image?
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Running F@H on HPC

Post by bruce »

loadnabox wrote:I have access to a decent-sized HPC cluster where I can run scientific applications for free during free cycles (at the risk of jobs getting canceled if a priority job comes in).
...
thoughts?
Free is good. :shock: :mrgreen:

FAH assigns each WU to a single Donor's machine and expects it to be completed promptly. You probably can't do anything about it, but it's costly to the science if the assignment isn't returned. A canceled job does get reassigned, but only after it's been missing for a "long time".

You may not care about the credits for the work done, but the points policy reflects those concepts.
loadnabox
Posts: 5
Joined: Sat Mar 14, 2020 2:56 pm

Re: Running F@H on HPC

Post by loadnabox »

foldy wrote:Can it run a Docker image?

We can run Singularity; Docker directly is out of the question since it requires root.
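Roughly, I'd expect something like this to work (untested sketch; the image name and bind paths are placeholders, and it assumes the container's runscript invokes FAHClient):

```bash
# Untested sketch: run a containerized FAHClient without root.
# "docker://example/fahclient" and the bind paths are placeholders.
singularity pull fah.sif docker://example/fahclient

# --nv exposes the host's NVIDIA driver and GPUs inside the container;
# the bind mount keeps the work files on the shared drive.
singularity run --nv \
    --bind /shared/fah/work:/fah \
    fah.sif --config /fah/config.xml
```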
loadnabox
Posts: 5
Joined: Sat Mar 14, 2020 2:56 pm

Re: Running F@H on HPC

Post by loadnabox »

bruce wrote:
loadnabox wrote:I have access to a decent-sized HPC cluster where I can run scientific applications for free during free cycles (at the risk of jobs getting canceled if a priority job comes in).
...
thoughts?
Free is good. :shock: :mrgreen:

FAH assigns each WU to a single Donor's machine and expects it to be completed promptly. You probably can't do anything about it, but it's costly to the science if the assignment isn't returned. A canceled job does get reassigned, but only after it's been missing for a "long time".

You may not care about the credits for the work done, but the points policy reflects those concepts.
Understood. It would probably be a mix of K80s and V100s. If I ran them in 2-hour chunks, do you think that would be enough time to complete a full WU on those cards? Any general time frame on what a "long time" is?

The question, though, specifically pertains to the data files. It's probably best not to store them on the local drives, since there's no guarantee of getting the same system when resuming. (Can you resume GPU work?)

There are some cross-mounted directories, so I could resume from any system if a job is interrupted, but you wouldn't want two systems trying to work on the same WU at once.

I was thinking of a tiny Perl script, called from the batch script, that would insert a unique path of some type into the configuration file. The hard part is making sure it can not only work out from the resume files what it should be resuming, but also clean up after each run so that it doesn't try to re-run finished work.
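In bash rather than Perl, the shape of it might be something like this (untested sketch; all paths are placeholders, and it assumes the v7 client keeps its work files in the current working directory and accepts --config):

```bash
#!/bin/bash
# Untested sketch: give each client instance its own data directory on the
# shared mount, claim it with an exclusive lock so two nodes never work the
# same WU, and resume whatever is already there. All paths are placeholders.
SHARED=/shared/fah/slots            # cross-mounted base directory
mkdir -p "$SHARED"

SLOT=""
for dir in "$SHARED"/slot-*; do     # try to resume an interrupted slot first
    [ -d "$dir" ] || continue
    exec 9>"$dir/.lock"
    if flock -n 9; then
        SLOT="$dir"
        break
    fi
    exec 9>&-                       # held by another node; try the next one
done

if [ -z "$SLOT" ]; then             # nothing to resume: create a fresh slot
    SLOT="$SHARED/slot-$(date +%s)-$$"
    mkdir -p "$SLOT"
    exec 9>"$SLOT/.lock"
    flock -n 9
    cat > "$SLOT/config.xml" <<'EOF'
<config>
  <user v='loadnabox'/>
  <team v='0'/>
  <passkey v='PASTE_PASSKEY_HERE'/>
  <slot id='0' type='GPU'/>
</config>
EOF
fi

cd "$SLOT" || exit 1
exec FAHClient --config config.xml  # the lock (fd 9) stays held while it runs
```

Since the lock file descriptor is inherited across the exec, the lock is released automatically when the client exits or the node dies, so an interrupted slot becomes claimable again.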

I'm also a little concerned about not being guaranteed GPU work. During my testing I frequently see idle GPUs on Linux nodes (home server included), and I want to be sure I'm not consuming a GPU that could otherwise be doing useful work. Sure, my job can be canceled by a priority job, but other people may also be running in "free cycle" mode alongside me, and I wouldn't want to hold up their work with idle time. In general, while I'm not doing anything against any rules, it's just good form on my part to ensure the scientists who are the primary users of the system aren't impacted.
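A rough watchdog along these lines might cover that (untested; it assumes nvidia-smi is on PATH and that $FAH_PID holds the client's PID, both of which are assumptions):

```bash
# Untested sketch: free the node if the GPU sits idle too long, e.g.
# because no GPU WU was assigned.
IDLE_LIMIT=15                       # consecutive idle checks before giving up
idle=0
while sleep 60; do
    util=$(nvidia-smi --query-gpu=utilization.gpu \
                      --format=csv,noheader,nounits | head -n1)
    if [ "${util:-0}" -eq 0 ]; then
        idle=$((idle + 1))
        if [ "$idle" -ge "$IDLE_LIMIT" ]; then
            kill "$FAH_PID"         # let the client checkpoint and exit
            exit 0
        fi
    else
        idle=0
    fi
done
```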
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Running F@H on HPC

Post by bruce »

https://apps.foldingathome.org/psummary shows that the assigned Timeout/Deadline are a lot longer than really necessary for GPU projects. CPU projects may or may not be. Actual run-times will depend on the hardware characteristics and what resources are actually otherwise idle.

If you restart a canceled job and it had time to sync the checkpoint file, you may be able to resume and finish it.
X-Wing
Posts: 56
Joined: Sat Apr 27, 2019 11:43 pm

Re: Running F@H on HPC

Post by X-Wing »

Just out of curiosity, what do you mean by "decent sized" HPC? Given the level of new involvement in FAH recently, I am interested to see how much raw performance has been added to the project.
Rig: i3-8350K, GTX 1660Ti, GTX 750Ti, 16GB DDR4-3000MHz.
loadnabox
Posts: 5
Joined: Sat Mar 14, 2020 2:56 pm

Re: Running F@H on HPC

Post by loadnabox »

X-Wing wrote:Just out of curiosity, what do you mean by "decent sized" HPC? Given the level of new involvement in FAH recently, I am interested to see how much raw performance has been added to the project.

About 20,000 cores and around 500 GPUs.

I can usually find a dozen V100s and about 30-40 K80s idle at any given time.

Again, though, I will only be taking idle time; usage is easily over 80% continuous.
loadnabox
Posts: 5
Joined: Sat Mar 14, 2020 2:56 pm

Re: Running F@H on HPC

Post by loadnabox »

bruce wrote:https://apps.foldingathome.org/psummary shows that the assigned Timeout/Deadline are a lot longer than really necessary for GPU projects. CPU projects may or may not be. Actual run-times will depend on the hardware characteristics and what resources are actually otherwise idle.

If you restart a canceled job and it had time to sync the checkpoint file, you may be able to resume and finish it.

Good information, thanks. It looks like the timeout is days, so any somewhat recent GPU should finish well under that.

My home 2080 Tis take about 3.5 hours to complete a WU. K80s are older, so their single precision is a little slower (about half) and their double precision much faster (about 6x), so it will depend on the workload. I know GROMACS can use mixed precision, so it really depends on the protein model and data, but I'm guessing even in the worst case K80s will finish in a few hours. Supposing I can get a WU to resume within a day (quite likely), I don't think I should be holding up any research. V100s should straight up smoke through any WU in a couple of hours.

As long as I've got you here, do you have a link on specifying the beta libraries in config.xml? From what I read, that might get me more COVID-19 GPU WUs.

Also, do certain WUs specify what type of GPU they want/need?
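For the beta part, what I've seen suggested elsewhere is a client-type option in config.xml, though I haven't verified it myself:

```xml
<!-- Unverified: reportedly opts this client into advanced/beta work -->
<config>
  <client-type v='beta'/>
</config>
```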
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Running F@H on HPC

Post by bruce »

FAH has been optimized for GPUs on home computers. Most of the computation is single precision, plus a small but essential fraction of double precision.
felipeportella
Posts: 6
Joined: Sat Mar 28, 2020 6:06 pm

Re: Running F@H on HPC

Post by felipeportella »

loadnabox wrote: We can run Singularity; Docker directly is out of the question since it requires root.
We managed to support F@H more or less in an HPC environment by using a Singularity container. Our scripts are available on GitHub* with GPU support (we use V100s as well).

* As I'm a new user on the forum I can't post the link, but search for folding-at-home-docker-singularity-gpu (it's a public repo under my username).