GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback thread

If you think it might be a driver problem, see viewforum.php?f=79

Moderators: Site Moderators, FAHC Science Team

jpalpant
Posts: 10
Joined: Wed Mar 11, 2020 12:52 am

Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa

Post by jpalpant »

Sounds like a reasonable step, I'll give it a try! That said, I think the fault is likely in my Kubernetes environment, not my image. I just tried running the same image used for my Pod directly with docker run, and found it was able to start and run quite well. Interestingly, I then stopped the manually-run container and re-launched the Kubernetes-managed container - it was able to run successfully for a few thousand steps on the same WU, before being INTERRUPTED. Once interrupted, the work unit was discarded and the next one was not able to start without being immediately interrupted. Very weird, but not entirely surprising.
toTOW
Site Moderator
Posts: 6309
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa

Post by toTOW »

I looked for Project: 11741 (Run 0, Clone 2360, Gen 1) in the WU DB and found no entry. Could it be a bad WU?

Messing around with VMs and Docker is going to complicate debugging ... don't count on me.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
jpalpant
Posts: 10
Joined: Wed Mar 11, 2020 12:52 am

Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa

Post by jpalpant »

You know, I feel very dumb, but I do have my system running now with Core22. I had a memory limit on the Pod, and though the container itself wasn't getting OOMKilled, I think the core might have been getting interrupted for that reason. I've removed the limit for now; the container is pretty memory-hungry, so the limit was a bad idea in the first place. I was also using the --cpu-usage argument, which I've removed for now. I'll poke around and see if I can reproduce with those settings. AND of course, I have new args for the WU: 23:02:16:WU01:FS01:0x22:Project: 11747 (Run 0, Clone 159, Gen 1) (these have changed several times before without resolving my issue, though).
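For anyone who wants to pull the project/run/clone/gen values out of a log line like the one above before looking them up, here is a minimal Python sketch. The `PRCG` tuple and `parse_prcg` function are names made up for this example (they are not part of FAHClient), and the pattern assumes the log line contains the `Project: P (Run R, Clone C, Gen G)` text exactly as quoted in this post.

```python
import re
from typing import NamedTuple, Optional


class PRCG(NamedTuple):
    """Hypothetical container for a work unit's identifying numbers."""
    project: int
    run: int
    clone: int
    gen: int


# Matches the "Project: P (Run R, Clone C, Gen G)" part of a log line.
# Assumption: the layout is the one quoted in the post above.
_PRCG_RE = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s+(\d+),\s*Clone\s+(\d+),\s*Gen\s+(\d+)\)"
)


def parse_prcg(line: str) -> Optional[PRCG]:
    """Return the PRCG found in a log line, or None if there is none."""
    match = _PRCG_RE.search(line)
    if match is None:
        return None
    return PRCG(*(int(group) for group in match.groups()))


if __name__ == "__main__":
    line = "23:02:16:WU01:FS01:0x22:Project: 11747 (Run 0, Clone 159, Gen 1)"
    print(parse_prcg(line))
```

Lines without a PRCG return `None`, so the function can be run over a whole log file and the misses filtered out.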

@toTOW, is this work unit database accessible to users? I'd be curious, then I could at least do that step myself next time I mess something up :)

Update: TIL apparently Kubernetes is smart enough to try to kill individual processes within a container that are using too much memory first, rather than killing the whole container. Since FAH is well-behaved, the core process terminates when it gets interrupted, but FAHClient itself does not, and neither does my container's main process. You can only detect the OOM via 1) kernel logs and 2) node exporter, if you use it. And now I know that INTERRUPTED message means exactly what it says on the tin. https://github.com/kubernetes/kubernetes/issues/50632 (closed as working-as-designed)
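Since the kernel log is one of the two places the OOM shows up, here is a rough Python sketch of scanning kernel log text for killed processes. It assumes the kernel reports the victim with the phrase `Killed process <pid> (<name>)`; the exact wording varies between kernel versions, so treat the pattern as an assumption to adjust. The function name `find_oom_kills` and the sample lines are invented for illustration and are not real logs from my node.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: the OOM killer logs its victim as "Killed process <pid> (<name>)".
# The surrounding text differs between kernel versions, so only this part is matched.
_OOM_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (pid, process name) for every OOM kill found in the given lines."""
    kills = []
    for line in lines:
        match = _OOM_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


if __name__ == "__main__":
    # Made-up sample lines, written to resemble kernel log output.
    sample = [
        "[1234.5] Memory cgroup out of memory: Killed process 4321 (FahCore_22)",
        "[1235.0] usb 1-1: new high-speed USB device number 2",
    ]
    print(find_oom_kills(sample))
```

On a node you would feed it the lines from something like `dmesg` or `journalctl -k` rather than the sample list.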
toTOW
Site Moderator
Posts: 6309
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa

Post by toTOW »

Yes, it's available among other tools on this page: https://apps.foldingathome.org/

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
jpalpant
Posts: 10
Joined: Wed Mar 11, 2020 12:52 am

Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa

Post by jpalpant »

Awesome, thanks toTOW! Excited to get back to folding.

I'm currently working on 11747 (0, 159, 1) which also doesn't show up in the WU DB - https://apps.foldingathome.org/wu#proje ... =159&gen=1 - weird.
Joe_H
Site Admin
Posts: 7868
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa

Post by Joe_H »

jpalpant wrote:Awesome, thanks toTOW! Excited to get back to folding.

I'm currently working on 11747 (0, 159, 1) which also doesn't show up in the WU DB - https://apps.foldingathome.org/wu#proje ... =159&gen=1 - weird.
By working on, do you mean currently processing? If so, the WU will not show up in the database. That search is only for WUs that have been completed and turned in. It can take an hour or two from the time a WU is turned in before it shows up there.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa

Post by bruce »

Also, there is a setting for the number of retries for a failed WU (set by the project owner, so I don't know what it is).

FAH does its best to run every PRCG once and only once ... except when it fails. Once a WU is successfully returned, it isn't sent out again and the trajectory moves on from PRCG to PRC(G+1).
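As a toy illustration of that progression, here is a small Python sketch. It is not how the assignment servers are implemented: the `PRCG` class and `next_work_unit` function are invented for this example, and the retry limit is left out because its value is set per project.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PRCG:
    """Hypothetical record of Project, Run, Clone, and Gen."""
    project: int
    run: int
    clone: int
    gen: int


def next_work_unit(wu: PRCG, succeeded: bool) -> PRCG:
    """Return the WU a trajectory would hand out next in this toy model.

    A successful return advances the trajectory from Gen G to Gen G+1.
    A failure leaves the same PRCG to be sent out again (retry limit not modeled).
    """
    if succeeded:
        return replace(wu, gen=wu.gen + 1)
    return wu


if __name__ == "__main__":
    wu = PRCG(project=11747, run=0, clone=159, gen=1)
    print(next_work_unit(wu, succeeded=True))   # same P, R, C with gen 2
    print(next_work_unit(wu, succeeded=False))  # unchanged, so it is retried
```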
jpalpant
Posts: 10
Joined: Wed Mar 11, 2020 12:52 am

Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa

Post by jpalpant »

Joe_H wrote:By working on, do you mean currently processing? If so, the WU will not show up in the database.
bruce wrote:FAH does its best to run every PRCG once and only once ... except when it fails. Once a WU is successfully returned, it isn't sent out again and the trajectory moves on from PRCG to PRC(G+1)
Got it, I didn't realize that. Yes, I was looking at work units that were in-progress and not returned yet; I can see the WUs I've completed on that app. Very cool!
vnicolici
Posts: 15
Joined: Sun Mar 15, 2020 12:10 am

Post by vnicolici »

What exactly is the purpose of this project? I wanted to contribute to the COVID projects, and after an initial COVID work unit I got a unit from 11737 instead of COVID.
jima13
Posts: 29
Joined: Fri Dec 07, 2007 5:27 am
Location: La Grande, OR

Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa

Post by jima13 »

foldy wrote:0x22 is more demanding on HW than 0x21. So maybe downclock or power limit the failing gtx 1080ti could help.
About a day after this post I uninstalled and reinstalled FAH, and the system crashed twice while I was entering my passkey :) Then Windows 10 started telling me to update, so I did, and the system wouldn't reboot. I left it shut down until a few minutes ago, and now it boots, so I managed to get my passkey in and the GPU is back to 1.3k on 0x22. I'll check it again later to see if it's stable.