13421 WUs with abnormally long runtime

Postby Sparkly » Wed Jul 29, 2020 10:20 am

PRCG numbers for some abnormally long runtime WUs

In my case this is based on 1 x CPU core running 1 x FAHCore_22

13421 - 4444, 0, 0 - ETA 13 days
13421 - 3142, 11, 0 - ETA 11 days

I tested what gets them moving at a normal speed: giving each WU 2 x CPU cores makes them rather happy, and giving each 3 x CPU cores makes them happier still, which is rather grabby on CPU resources for a GPU WU.
Sparkly
 
Posts: 73
Joined: Sun Apr 19, 2020 12:01 pm

Re: 13421 WUs with abnormally long runtime

Postby bruce » Wed Jul 29, 2020 6:10 pm

Sparkly wrote: PRCG numbers for some abnormally long runtime WUs

In my case this is based on 1 x CPU core running 1 x FAHCore_22

Three questions:
* What GPU is doing the processing?
* How long were those assignments processing without interruption before you noted the estimated run-time?
* Are you running other projects that make heavy use of the GPU?
bruce
 
Posts: 19970
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: 13421 WUs with abnormally long runtime

Postby Sparkly » Wed Jul 29, 2020 10:06 pm

Each WU ran on its own RX580 GPU for about 8-10 hours, reaching about 3%, at which point I recorded the remaining ETA. After I gave each WU more cores to play with, the runtime became more normal, with an ETA of about 3 hours. These were the only two WUs running at the time; everything else was turned off.
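As a rough cross-check of those numbers, here is a minimal sketch that extrapolates the remaining time linearly from elapsed time and progress. The linear model and the function names are assumptions made for illustration; this is not how the client is known to compute its ETA, and the 8-10 hours and 3% are the approximate figures from the post.

```python
def projected_total_hours(elapsed_hours, fraction_done):
    """Linear extrapolation: total runtime if progress continues at the same rate."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


def remaining_days(elapsed_hours, fraction_done):
    """Remaining runtime in days under the same linear assumption."""
    total = projected_total_hours(elapsed_hours, fraction_done)
    return (total - elapsed_hours) / 24


# Approximate figures from the post: 8-10 hours elapsed at about 3% progress.
for hours in (8, 10):
    print(f"{hours} h at 3% -> about {remaining_days(hours, 0.03):.1f} days remaining")
```

Under that assumption, 8 to 10 hours at 3% works out to roughly 11 to 13.5 days remaining, which lines up with the 11-day and 13-day ETAs reported in the first post.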
Sparkly
 
Posts: 73
Joined: Sun Apr 19, 2020 12:01 pm

Re: 13421 WUs with abnormally long runtime

Postby JohnChodera » Thu Jul 30, 2020 6:18 am

Thanks for the heads-up on these. We're still working to figure out why some RUNs for these are exceptionally long. Our best hypothesis so far is that this has to do with how constraints are handled, but more investigation is necessary. Hopefully we can dig into that this week now that the sprints are running.

As always, huge thanks for sticking with us despite the suboptimal situation right now!

~ John Chodera // MSKCC
JohnChodera
Pande Group Member
 
Posts: 406
Joined: Fri Feb 22, 2013 10:59 pm

Re: 13421 WUs with abnormally long runtime

Postby Yeroon » Fri Jul 31, 2020 6:41 pm

How are you giving the GPU work units more cores? I have lowered the CPU slot to 6 of 12 cores and bumped a22 priority on a per-WU basis when I can, but I only see 2 process threads per WU.
I would like to try this as well, as 13421 performs pretty poorly for me (RX 470: PPD down to half, power use down 30% per card).
Yeroon
 
Posts: 10
Joined: Wed Jul 08, 2020 12:09 am

Re: 13421 WUs with abnormally long runtime

Postby Sparkly » Fri Jul 31, 2020 7:49 pm

Yeroon wrote: How are you giving the gpu work units more cores?

I am using a process manager

https://www.bill2-software.com/processmanager/download.shtml

to automatically reduce the number of cores a FAHCore_22 process gets access to in the first place. Unless you have already reduced the number of cores the process can access when it starts, it will just grab whatever it can from the available cores, so setting affinity might not make any difference for you, unless you do it for manual load balancing.

In Windows 10 you can set process affinity manually for each running process

https://thegeekpage.com/set-affinity-for-an-application-on-windows-10/
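For anyone who prefers scripting this, here is a minimal sketch of what "affinity" means underneath: a bitmask with one bit per logical core. The helper names `affinity_mask`, `cores_from_mask`, and `pin_process` are made up for illustration. `pin_process` only acts where Python exposes `os.sched_setaffinity` (Linux); on Windows it does nothing and returns False, so there you would still use Task Manager or a process manager as described above.

```python
import os


def affinity_mask(cores):
    """Return the bitmask with bit i set for each logical core index i."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core indices must be non-negative")
        mask |= 1 << core
    return mask


def cores_from_mask(mask):
    """Inverse of affinity_mask: list the core indices whose bits are set."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def pin_process(pid, cores):
    """Restrict pid to the given cores where os.sched_setaffinity exists (Linux).

    Returns True if the call was made, False if this platform lacks it.
    """
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(pid, set(cores))
        return True
    return False


# Example: allow a process to use logical cores 2, 3 and 4 only.
print(hex(affinity_mask([2, 3, 4])))
```

Giving a WU "3 cores" in this picture just means a mask with three bits set.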
Sparkly
 
Posts: 73
Joined: Sun Apr 19, 2020 12:01 pm

Re: 13421 WUs with abnormally long runtime

Postby bruce » Fri Jul 31, 2020 8:08 pm

FAHCore_22 has two functions that use a CPU thread. One moves data between main RAM and the GPU. For NVIDIA on Windows, this uses a spin-wait (rather than an interruptible sleep), so the CPU always appears to be busy. For AMD, it doesn't use a spin-wait.
The other (for either GPU) is a sanity check that periodically checks the current state of the WU. That generally uses CPU resources for a very short period of time at widely spaced intervals.

To the best of my knowledge, it doesn't add up to much (except for the spin-wait) so I don't really worry about it.
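To illustrate the difference between a spin-wait and an interruptible sleep, here is a generic sketch that waits the same wall-clock time both ways and compares the CPU time consumed. It is an analogy only: it is not FAHCore_22 code, it does not touch a GPU, and the function names are invented for the example.

```python
import time


def spin_wait(seconds):
    """Busy-wait: keep polling the clock until the deadline passes."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass


def sleep_wait(seconds):
    """Interruptible sleep: hand the CPU back to the OS while waiting."""
    time.sleep(seconds)


def cpu_seconds(wait, seconds):
    """CPU time this process consumed while running wait(seconds)."""
    start = time.process_time()
    wait(seconds)
    return time.process_time() - start


spin = cpu_seconds(spin_wait, 0.2)
slept = cpu_seconds(sleep_wait, 0.2)
print(f"spin-wait used {spin:.3f} s of CPU, sleep used {slept:.3f} s")
```

Both calls take about the same wall-clock time, but the spin-wait keeps a core busy for the whole interval, which is why a thread that waits this way shows up as fully loaded in a process monitor.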
bruce
 
Posts: 19970
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: 13421 WUs with abnormally long runtime

Postby BobWilliams757 » Sat Aug 01, 2020 3:40 pm

Of the 20 WUs I've run on this project, only 2 seemed a bit out of range.

RCG 876, 79, 0 was a bit quicker, with a 56 second frame time vs 1:11 average. 215k PPD

RCG 6262, 31, 0 was the slow one, with a frame time of 2:47 and PPD of 42k

All the others with the average frame time of 1:11 netted approx 150k PPD

All except the one slow one are on the high side of normal for the Vega 11 I'm using. But any project with atom counts below 15k or so is always on the fast side with this little onboard GPU, so really nothing unexpected.


Passed along only in case it's useful for troubleshooting. I'll fold 'em if they all drop down to slow return times. :D
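As a sanity check on the frame times and PPD figures above, here is a small sketch that scales PPD from the average WU (1:11 per frame, about 150k PPD). It assumes PPD varies as frame time to the power -1.5, which is what you would expect if the quick return bonus grows with the square root of return speed; that exponent is an assumption here, not something stated in this thread, and `predicted_ppd` is a made-up helper.

```python
def predicted_ppd(frame_seconds, ref_frame_seconds=71, ref_ppd=150_000, exponent=1.5):
    """Scale PPD from a reference WU, assuming PPD is proportional to frame_time ** -exponent."""
    return ref_ppd * (ref_frame_seconds / frame_seconds) ** exponent


# Frame times from the post: 0:56 (fast WU) and 2:47 = 167 s (slow WU).
for label, seconds in (("fast", 56), ("slow", 167)):
    print(f"{label}: {seconds} s/frame -> about {predicted_ppd(seconds) / 1000:.0f}k PPD")
```

Under that assumption the fast WU comes out near 214k and the slow one near 42k, close to the 215k and 42k reported, so the PPD drop on the slow WU looks like what its frame time alone would produce.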
BobWilliams757
 
Posts: 114
Joined: Fri Apr 03, 2020 3:22 pm

Re: 13421 WUs with abnormally long runtime

Postby bruce » Sat Aug 01, 2020 6:52 pm

Thanks.

Troubleshooting is focusing mostly on error reports (crashes, etc.). I don't think anybody (except donors like yourself) notices when a WU takes a longer or shorter time than "normal" unless you report it.

One or two have been extracted for special attention, though.

I suggest a concise report sort of like this one including the hardware involved.
bruce
 
Posts: 19970
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.
