Projects 13412-13415

Moderators: Site Moderators, FAHC Science Team

ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Projects 13412-13415

Post by ThWuensche »

I don't know whether this is the right place, but replying in the announcement forum is not possible.
> COVID (GPU, core22 0.0.10) projects 13412-13415 to FAH
>
> Post by JohnChodera » Fri Jun 19, 2020 9:48 pm
>
> Projects 13412-13415 are the final stages of testing core22 based free energy calculations (using the new core22 0.0.10 release) for supporting the COVID Moonshot (http://postera.ai/covid).
>
> 13412 and 13414 model the interaction of ligands with the target protein and are restricted to newer GPUs, while 13413 and 13415 model the interactions of ligands in solvent and are restricted to older GPUs.
>
> ~ John Chodera // MSKCC

What are older/newer GPUs? I got a WU 13413 for my Radeon VII, running at about 300,000 PPD instead of the average 1,200,000 PPD. Before that I had a set of 13415 WUs, which were all rejected as faulty.

Is the Radeon VII already considered an older GPU, capable of handling only about 4000 atoms, leaving it mostly unused? Or is there still something wrong in the assignment?

I assigned these GPUs for support of the moonshot project. I consider that project very important and valuable. But WU assignment / GPU handling seems to be in need of improvement for the large amount of work in that project.
Crawdaddy79
Posts: 73
Joined: Sat Mar 21, 2020 3:56 pm

Re: Projects 13412-13415

Post by Crawdaddy79 »

Vega 64 here. I still consider my video card new, but maybe the project doesn't see it that way.

I have been getting 13415 nonstop for the last two days. I have had two crashes right as a WU finishes and the core starts on the next WU; the results don't get sent until I start the client again after the crash, but so far 100% of everything is returning OK. With no CPU folding, I get about 275k PPD on these projects, whereas typical for my GPU is 1.0 - 1.2M PPD. There are a lot of threads about this run of projects and how they perform better on slower GPUs (relative to that GPU) but much worse on faster ones.

Make sure you have the latest drivers for your Radeon VII. Not sure why yours are all returning faulty.
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: Projects 13412-13415

Post by JohnChodera »

> What are older/newer GPUs?

Excellent point here. The "old/new" terminology is totally inaccurate. The key issue is whether the system being simulated is sufficiently large to keep all of the stream processors on the GPU die busy most of the time.

13414 is a large, complex system of ~80K atoms containing a small molecule interacting with a protein, while 13415 is a small system of ~4K atoms containing just the ligand in solvent. Both simulations are needed for us to estimate the binding free energy difference for transforming one ligand into a related ligand to see if the new design idea will be more potent than the reference ligand.
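
How the two simulations combine can be written as a thermodynamic cycle. The relation below is the standard relative binding free energy expression, not something taken from these projects' setup; the ligand labels A (reference) and B (new design) and the symbol names are notation assumed here for illustration.

```latex
\Delta\Delta G_{\mathrm{bind}}(A \to B)
  = \Delta G_{\mathrm{bind}}(B) - \Delta G_{\mathrm{bind}}(A)
  = \Delta G_{\mathrm{complex}}(A \to B) - \Delta G_{\mathrm{solvent}}(A \to B)
```

In this notation, the protein-ligand project (e.g. 13414) would supply the complex term and the ligand-in-solvent project (e.g. 13415) the solvent term; a negative result would suggest that B binds more tightly than A.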

13414 should be large enough to fill up GPUs with reasonably large numbers of stream processors (e.g. GTX 1080), while 13415 would not utilize all stream processors on that class of GPU. Instead, we're trying to find the right balance, allocating 13415 to GPUs with fewer stream processors. This is also difficult given (1) we only have the granularity of GPUSpecies assignments to work with, and (2) there is a surprising RUN-to-RUN variation (at least for 13415) that we weren't aware of earlier even though the number of atoms is nearly identical for all RUNs.
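
To get a feel for the size mismatch, here is a minimal Python sketch that divides the approximate atom counts quoted above by the shader counts of the cards mentioned in this thread. The "atoms per stream processor" ratio is only a crude proxy assumed here for illustration; it is not how FAH assigns work, and the shader counts come from public vendor specifications rather than from this thread.

```python
# Crude proxy only: average number of atoms per stream processor.
# This is an illustrative assumption, not the FAH assignment logic.

# Shader counts from public vendor specifications.
STREAM_PROCESSORS = {
    "GTX 1070": 1920,
    "GTX 1080": 2560,
    "GTX 1080 Ti": 3584,
    "Radeon VII": 3840,
    "Vega 64": 4096,
}

# Approximate system sizes quoted in the post (~80K and ~4K atoms).
SYSTEM_ATOMS = {"13414": 80_000, "13415": 4_000}


def atoms_per_processor(atoms: int, processors: int) -> float:
    """Return how many atoms each stream processor would handle on average."""
    return atoms / processors


if __name__ == "__main__":
    for gpu, processors in STREAM_PROCESSORS.items():
        for project, atoms in SYSTEM_ATOMS.items():
            ratio = atoms_per_processor(atoms, processors)
            print(f"{gpu:12s} P{project}: {ratio:6.1f} atoms per stream processor")
```

On the larger cards, the small system works out to roughly one atom per stream processor, which is consistent with the point that it cannot keep such a GPU busy.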

We're working on an extensive benchmarking analysis (project 17100) to help refine the granularity to address (1), but we're just now looking into the origin of (2).

These 134xx projects have very tight turnaround times (just a few days), but there will be many projects in this series to support the COVID Moonshot (http://postera.ai/covid), so our hope is that we can keep making systematic refinements with each project pair as we learn more from analyzing the data from the last batch. Between core22 0.0.10 and improvements in project preparation, we've managed to greatly reduce the NaN frequency, recover more usable data, and reduce the amount of work lost due to infrequent checkpoints when a NaN is encountered. I'm confident we can keep making improvements.

Thanks so much for bearing with us! These projects are providing an enormous amount of value to us.

~ John Chodera // MSKCC
foldy
Posts: 2061
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Projects 13412-13415

Post by foldy »

A temporary workaround for people with high shader count GPUs getting small work units could be to run 2 or more GPU slots concurrently on the same GPU. I got it working in Windows but had to set the GPU index and OpenCL index manually to the same values for all slots, because this is not supported by FAH. On Linux it failed with "gpu index already busy" after the first slot started.

Maybe FAH could also get a feature to run several FahCores in parallel on the same GPU automatically? But that only makes sense if there are many very small work units that all need to be completed within a few weeks, cannot be completed by slow GPUs alone, and so need the help of fast GPUs too, even if those cannot run at full performance.
HaloJones
Posts: 920
Joined: Thu Jul 24, 2008 10:16 am

Re: Projects 13412-13415

Post by HaloJones »

The issue, though, is that it appears to affect only some generations, not all.

e.g.
P13414 R825, C69, G1 TPF 04:11
P13414 R935, C78, G0 TPF 02:25


Same card, same project: one gives 400K PPD, the other gives 939K PPD.
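
As a rough cross-check of these two data points, here is a minimal Python sketch that converts the TPF values to seconds and estimates the PPD ratio they would imply. It assumes that both WUs carry the same base credit and that PPD scales as TPF to the power of -1.5 (points per day scale as 1/time, and a square-root quick-return bonus adds another factor of time to the power of -0.5). That scaling is an assumption made for illustration, not something stated in this thread.

```python
def tpf_seconds(tpf: str) -> int:
    """Convert a time-per-frame string 'MM:SS' to seconds."""
    minutes, seconds = tpf.split(":")
    return int(minutes) * 60 + int(seconds)


def expected_ppd_ratio(slow_tpf: str, fast_tpf: str, exponent: float = 1.5) -> float:
    """Estimate PPD(fast) / PPD(slow), assuming PPD ~ TPF ** (-exponent).

    exponent=1.5 is an assumption (1/time throughput combined with a
    square-root quick-return bonus), not a value taken from the thread.
    """
    return (tpf_seconds(slow_tpf) / tpf_seconds(fast_tpf)) ** exponent


if __name__ == "__main__":
    ratio = expected_ppd_ratio("04:11", "02:25")
    print(f"expected PPD ratio: {ratio:.2f}")
    print(f"reported PPD ratio: {939 / 400:.2f}")
```

Under these assumptions the expected ratio comes out at about 2.3, close to the reported 939K / 400K, so the PPD gap would be explained by the TPF difference alone.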

You can see why people are dumping.
Last edited by HaloJones on Mon Jun 22, 2020 12:26 pm, edited 1 time in total.
single 1070

TPL
Posts: 104
Joined: Sun Apr 19, 2020 11:37 am

Re: Projects 13412-13415

Post by TPL »

Over the last couple of days I successfully folded many 13413 and 13415 WUs on my slow GPU, but I'm not getting them any more. The server now always assigns some large WU. I have dumped about 10 of them so far. I see no sense in continuing with them when, for example, the timeout is 1 day and my ETA is 3.55 days. So it seems to be over for that GPU for now; I have removed the slot again.
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: Projects 13412-13415

Post by ThWuensche »

> What are older/newer GPUs?
>
> Excellent point here. The "old/new" terminology is totally inaccurate. The key issue is whether the system being simulated is sufficiently large to keep all of the stream processors on the GPU die busy most of the time.

That's basically what I expected, if the distinction is not based on some functional difference. But in that case devices like the Radeon VII should probably get the larger jobs 13412 and 13414, not 13413 and 13415. Maybe the assignment went the wrong way, with smaller cards getting big jobs and bigger cards getting small jobs?

> These projects are providing an enormous amount of value to us.

That's also my understanding and the reason why I added more GPUs. So it's extra important to get the most out of the stronger GPUs by loading them fully with large jobs. That was the reason for mentioning the issue; I hope it helps.
ThWuensche
Posts: 80
Joined: Fri May 29, 2020 4:10 pm

Re: Projects 13412-13415

Post by ThWuensche »

Btw, these are my results on the 13415 WUs:

Code:

20:14:35:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13415 run:633 clone:16 gen:0 core:0x22 unit:0x0000000012bc7d9a5eed8c3bb811baf0
20:16:37:WU05:FS01:Sending unit results: id:05 state:SEND error:FAULTY project:13415 run:689 clone:16 gen:0 core:0x22 unit:0x0000000012bc7d9a5eed8c3a4ae5579b
03:51:59:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13415 run:678 clone:20 gen:1 core:0x22 unit:0x0000000812bc7d9a5eed8c3a86253295
07:47:56:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13415 run:531 clone:25 gen:1 core:0x22 unit:0x0000000212bc7d9a5eed8c3eb5c10eea
10:39:26:WU02:FS02:Sending unit results: id:02 state:SEND error:FAULTY project:13415 run:73 clone:29 gen:0 core:0x22 unit:0x0000000312bc7d9a5eed8c514859cc91
10:39:36:WU03:FS02:Sending unit results: id:03 state:SEND error:FAULTY project:13415 run:619 clone:27 gen:1 core:0x22 unit:0x0000000512bc7d9a5eed8c3c623b2381
11:34:21:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13415 run:107 clone:28 gen:1 core:0x22 unit:0x0000000212bc7d9a5eed8c501cc2bbca
21:29:37:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13415 run:830 clone:40 gen:0 core:0x22 unit:0x0000000012bc7d9a5eed8c353a73be1b
22:30:53:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:13415 run:729 clone:40 gen:0 core:0x22 unit:0x0000000212bc7d9a5eed8c38a7022522
22:31:01:WU04:FS01:Sending unit results: id:04 state:SEND error:FAULTY project:13415 run:328 clone:42 gen:0 core:0x22 unit:0x0000000012bc7d9a5eed8c49dfd01b17
22:31:10:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13415 run:103 clone:40 gen:1 core:0x22 unit:0x0000000112bc7d9a5eed8c50d19c51cc
22:37:34:WU04:FS01:Sending unit results: id:04 state:SEND error:FAULTY project:13415 run:691 clone:40 gen:1 core:0x22 unit:0x0000000212bc7d9a5eed8c39b66fd6db
00:20:04:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13415 run:16 clone:43 gen:1 core:0x22 unit:0x0000000212bc7d9a5eed8c5265001211
00:20:29:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13415 run:937 clone:42 gen:1 core:0x22 unit:0x0000000112bc7d9a5eed8c31f4db5c5b
01:22:49:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:13415 run:523 clone:43 gen:0 core:0x22 unit:0x0000000112bc7d9a5eed8c3f0d2ce67a
01:22:57:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13415 run:890 clone:43 gen:1 core:0x22 unit:0x0000000112bc7d9a5eed8c3378cba8ee
06:09:04:WU04:FS02:Sending unit results: id:04 state:SEND error:NO_ERROR project:13415 run:523 clone:22 gen:1 core:0x22 unit:0x0000000612bc7d9a5eed8c3f34a1eb9e
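
Log excerpts like the one above can be tallied with a short script. The following is a minimal Python sketch, assuming only the "Sending unit results" line layout shown above; the function name `tally_results` is hypothetical and not part of any FAH tool.

```python
import re
from collections import Counter

# Matches the fields of a "Sending unit results" line as laid out in the log above.
RESULT_RE = re.compile(
    r"error:(?P<error>\w+) project:(?P<project>\d+) "
    r"run:(?P<run>\d+) clone:(?P<clone>\d+) gen:(?P<gen>\d+)"
)


def tally_results(log_lines):
    """Count (project, result code) pairs; lines that do not match are skipped."""
    counts = Counter()
    for line in log_lines:
        match = RESULT_RE.search(line)
        if match:
            counts[(match["project"], match["error"])] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the excerpt above.
    sample = [
        "20:14:35:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY "
        "project:13415 run:633 clone:16 gen:0 core:0x22 "
        "unit:0x0000000012bc7d9a5eed8c3bb811baf0",
        "22:30:53:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR "
        "project:13415 run:729 clone:40 gen:0 core:0x22 "
        "unit:0x0000000212bc7d9a5eed8c38a7022522",
    ]
    for (project, error), count in sorted(tally_results(sample).items()):
        print(f"project {project}: {error} x {count}")
```

Applied to the 17 lines posted above, this would count 14 FAULTY and 3 NO_ERROR results for project 13415.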
HaloJones
Posts: 920
Joined: Thu Jul 24, 2008 10:16 am

Re: Projects 13412-13415

Post by HaloJones »

Another one: 13414 (382, 85, 0) on my GTX 1070 gives 546,047 PPD.

These units are all over the place depending on the RCG (run, clone, gen).
single 1070

Crawdaddy79
Posts: 73
Joined: Sat Mar 21, 2020 3:56 pm

Re: Projects 13412-13415

Post by Crawdaddy79 »

The new batch of 13415 WUs is running at 25% of previous speeds. I got four today. I usually complete these in a little over an hour; now they're taking nearly four hours. Instead of 13k points each at 275k PPD, they're doing 8k points each at 60k PPD.
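
The figures above can be sanity-checked with a little arithmetic: the time per WU is its credit divided by the sustained PPD. This is a minimal Python sketch using only the rounded numbers from the post, so the results are approximate; the function name `hours_per_wu` is made up for illustration.

```python
def hours_per_wu(points_per_wu: float, ppd: float) -> float:
    """Hours needed for one work unit, given its credit and the sustained PPD."""
    return points_per_wu / ppd * 24.0


if __name__ == "__main__":
    before = hours_per_wu(13_000, 275_000)
    after = hours_per_wu(8_000, 60_000)
    print(f"before: {before:.2f} h per WU")
    print(f"after:  {after:.2f} h per WU")
    print(f"PPD now at {60_000 / 275_000:.0%} of the earlier rate")
```

With these rounded inputs the sketch gives a bit over one hour per WU before and a bit over three hours after, broadly in line with the times reported.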

Currently crunching (844, 49, 1). Last crunched: run:844 clone:47 gen:1