faulty WU 13415, 2495,0,1


JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: faulty WU 13415, 2495,0,1

Post by JohnChodera »

Also, to clarify a few things:

1. This doesn't seem to be due to bad WUs. If they reach 100% and you see this block, they have completed successfully:

Code:

17:59:35:WU02:FS02:0x22:Completed 1000000 out of 1000000 steps (100%)
17:59:35:WU02:FS02:0x22:Average performance: 251.895 ns/day
17:59:35:WU02:FS02:0x22:Saving result file ..\logfile_01.txt
17:59:35:WU02:FS02:0x22:Saving result file checkpointState.xml
17:59:35:WU02:FS02:0x22:Saving result file globals.csv
17:59:35:WU02:FS02:0x22:Saving result file positions.xtc
17:59:35:WU02:FS02:0x22:Saving result file science.log
17:59:35:WU02:FS02:0x22:Folding@home Core Shutdown: FINISHED_UNIT

2. The `UNKNOWN_ENUM` reports do not seem to be transmitted back to the WS, which worries me because it means we don't know how widespread the problem is.

3. This would appear to be some kind of bug either in the core code or libraries, the core build process, or how the client handles the data packet.
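The check in point 1 can be sketched as a small log scan. This is only an illustration: the two "success" patterns are copied from the log block above, while the function name `classify_wu_log` is hypothetical and matching the bare substring `UNKNOWN_ENUM` is an assumption about how the client logs that error.

```python
import re

# Patterns copied from the log excerpt in point 1.
STEPS_RE = re.compile(r"Completed (\d+) out of (\d+) steps")
FINISHED_MARKER = "Folding@home Core Shutdown: FINISHED_UNIT"
# Assumption: the error line contains this token somewhere in its text.
ERROR_MARKER = "UNKNOWN_ENUM"


def classify_wu_log(lines):
    """Report whether the core finished cleanly and whether UNKNOWN_ENUM appears."""
    reached_end = False
    finished = False
    unknown_enum = False
    for line in lines:
        match = STEPS_RE.search(line)
        # All steps done means the two step counts are equal.
        if match and int(match.group(1)) == int(match.group(2)):
            reached_end = True
        if FINISHED_MARKER in line:
            finished = True
        if ERROR_MARKER in line:
            unknown_enum = True
    return {
        "core_completed": reached_end and finished,
        "unknown_enum_seen": unknown_enum,
    }


sample = [
    "17:59:35:WU02:FS02:0x22:Completed 1000000 out of 1000000 steps (100%)",
    "17:59:35:WU02:FS02:0x22:Folding@home Core Shutdown: FINISHED_UNIT",
]
print(classify_wu_log(sample))
```

A log where `core_completed` and `unknown_enum_seen` are both true would match the pattern described here: the core finished, but the result was still rejected afterwards.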

Did these issues only start appearing with the core22 0.0.10 build? We recently upgraded our internal libcbang/libfah for core builds from 1.2.0 to 1.5.0, and I worry that one of these libraries may have introduced some instability.

My suspicion is that you're seeing this with 13415 in particular because the WUs are short, so some unstable library call is invoked more often and triggers failures more frequently.

Thanks so much for bearing with us, and for helping us get to the bottom of this! We're still getting a ton of useful data helping us with the COVID Moonshot work, but we're committed to improving the stability to make things better for everyone.

~ John Chodera // MSKCC
TPL
Posts: 104
Joined: Sun Apr 19, 2020 11:37 am

Re: faulty WU 13415, 2495,0,1

Post by TPL »

I stopped the CPU slot after its WU finished. GPU load has now risen to 70%, and the WU might be ready before the timeout. CPU load is still under 25% and varies a lot, but one thread out of four does not seem to be enough for the GPU.
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: faulty WU 13415, 2495,0,1

Post by JohnChodera »

Some thoughts on how to further debug this, for anyone who is experiencing this and is able to help:

1. When this happens, please save a copy of the wudata_01.dat file that corresponds to the WU that failed with UNKNOWN_ENUM.
2. To track down whether the failure occurs in the core or the client, you can run the core separately on the wudata_01.dat file. Put a copy of wudata_01.dat in a directory called 00 and run the core directly on that directory with:

Code:

C:\path\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/beta/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0

(The exact syntax may vary depending on whether you are using WSL, PowerShell, or cmd.exe, and which device you have.)
If this reliably fails with a non-zero exit code, we have a reproducible test case.
3. Please send us a copy of the wudata_01.dat file if it reliably triggers this failure!
4. If you have experience with the Windows debugger, running the core again through the debugger could yield a stack trace showing where the failure occurs. We can provide a debug core build with symbols to help produce a more meaningful stack trace if the first one doesn't reveal much.
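Steps 1 and 2 can also be scripted, which makes it easier to repeat the run and record the exit code. The following is a rough sketch, not an official tool: the function names are hypothetical, the arguments are copied from the command in step 2, and the device and platform indices of 0 are assumptions that may need changing for your hardware.

```python
import shutil
import subprocess
from pathlib import Path


def stage_work_unit(wudata_file, work_root):
    """Copy the saved WU data to <work_root>/00/wudata_01.dat."""
    run_dir = Path(work_root) / "00"
    run_dir.mkdir(parents=True, exist_ok=True)
    target = run_dir / "wudata_01.dat"
    shutil.copyfile(wudata_file, target)
    return target


def build_core_command(core_exe, gpu_vendor="nvidia"):
    """Assemble the argument list from step 2 (all indices left at 0)."""
    return [
        str(core_exe),
        "-dir", "00",
        "-suffix", "01",
        "-gpu-vendor", gpu_vendor,
        "-opencl-platform", "0",
        "-opencl-device", "0",
        "-cuda-device", "0",
        "-gpu", "0",
    ]


def run_core(core_exe, work_root, gpu_vendor="nvidia"):
    """Run the core with work_root as the working directory; return the exit code."""
    # Resolve to an absolute path because the working directory changes.
    command = build_core_command(Path(core_exe).resolve(), gpu_vendor)
    completed = subprocess.run(command, cwd=work_root)
    return completed.returncode
```

Call `stage_work_unit` with your saved copy of the data file, then `run_core` with the full path to FahCore_22.exe on your install. A non-zero return value on every run is the reproducible failure asked for above.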
JimF
Posts: 652
Joined: Thu Jan 21, 2010 2:03 pm

Re: faulty WU 13415, 2495,0,1

Post by JimF »

OK, I have a candidate or two. But that path does not work on the command line of Win7 64-bit.
This default installation of FAH put the data files in "C:\Users\(user)\AppData\Roaming\FAHClient"
Any suggestions?

Also, I don't see a way to attach the wudata_01.dat to an email even if I confirm it.
TPL
Posts: 104
Joined: Sun Apr 19, 2020 11:37 am

Re: faulty WU 13415, 2495,0,1

Post by TPL »

I guess you should use what you just wrote as the path, up to Roaming/...

Also adjust all the switches to match your configuration.
JimF
Posts: 652
Joined: Thu Jan 21, 2010 2:03 pm

Re: faulty WU 13415, 2495,0,1

Post by JimF »

Even that path does not work. It appears that among other things, the arguments for my RX 570 are not correct, even if I change "nvidia" to "amd".
I am willing to send the suspect file in and let them test it, if they can tell me how.
TPL
Posts: 104
Joined: Sun Apr 19, 2020 11:37 am

Re: faulty WU 13415, 2495,0,1

Post by TPL »

I'm not sure about this and could be wrong, but when I open cmd it starts in Users\my_user_name>, so the path would then be AppData\Roaming... But are you running beta? Does this instruction only work with the beta version? For me the path would be:

Image

I think you need to zip wudata_01.dat to send it by email.
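If zipping by hand is awkward, the archive can be made with a few lines of Python. This is just a sketch using the standard `zipfile` module; the function name `zip_work_unit` is hypothetical, and by default the archive is written next to the original file with a .zip extension.

```python
import zipfile
from pathlib import Path


def zip_work_unit(wudata_file, archive_path=None):
    """Compress one WU data file into a .zip archive and return its path."""
    source = Path(wudata_file)
    archive = Path(archive_path) if archive_path else source.with_suffix(".zip")
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname stores only the file name, not the local directory layout.
        zf.write(source, arcname=source.name)
    return archive
```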
JimF
Posts: 652
Joined: Thu Jan 21, 2010 2:03 pm

Re: faulty WU 13415, 2495,0,1

Post by JimF »

Thanks. I am using the same thing. It is not beta.

Code:

C:\Users\USERNAME\AppData\Roaming\FAHClient\cores\cores.foldingathome.org\v7\win\64bit\Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0

No more of my time. If they want it, I will send it to them.
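For anyone else unsure where the core sits on their install, searching for it avoids typing the path by hand. A minimal sketch, assuming the layout reported in this thread (a `cores` folder under the FAHClient data directory, itself under `%APPDATA%`); both function names are hypothetical and other installs may differ.

```python
import os
from pathlib import Path


def find_core_executables(fah_dir, exe_name="FahCore_22.exe"):
    """Return every copy of exe_name below <fah_dir>/cores, sorted by path."""
    cores_dir = Path(fah_dir) / "cores"
    if not cores_dir.is_dir():
        return []
    return sorted(cores_dir.rglob(exe_name))


def default_fah_dir():
    """Data directory reported in this thread: %APPDATA%\\FAHClient."""
    appdata = os.environ.get("APPDATA")
    return Path(appdata) / "FAHClient" if appdata else None
```

If `default_fah_dir()` returns a path, pass it to `find_core_executables` and use whichever result matches the build you are running (beta or not).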
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: faulty WU 13415, 2495,0,1

Post by JohnChodera »

> Also, I don't see a way to attach the wudata_01.dat to an email even if I confirm it.

@JimF: You can email me directly at john.chodera@choderalab.org!
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: faulty WU 13415, 2495,0,1

Post by JohnChodera »

Let me check on that command line with our Windows testers and post an updated version to test locally. Thanks again for giving it a try!

~ John Chodera // MSKCC
cayenne187
Posts: 39
Joined: Thu May 14, 2020 7:56 pm
Hardware configuration: intel i7-4790 3.6ghz
16gb ram
win 10 pro
GTX970 4gb
Radeon RX570 4gb

Re: faulty WU 13415, 2495,0,1

Post by cayenne187 »

Hi John, what you asked is too complicated for me. Only 13415 WUs on my RX570 crash. They crash at the end, remain in the queue as 100% complete and ready, and rerun themselves again later. Same enum error on all of them. I am dumping 13415 for now and moving on. Let me know how else I can help.
Date of last Work Unit 2020-08-04 22:17:53
Total score 63,300,083
Total WUs 1,013
Overall rank (if points are combined) 15,440 of 2,735,299
cayenne187
Posts: 39
Joined: Thu May 14, 2020 7:56 pm
Hardware configuration: intel i7-4790 3.6ghz
16gb ram
win 10 pro
GTX970 4gb
Radeon RX570 4gb

Re: faulty WU 13415, 2495,0,1

Post by cayenne187 »

It would be a nice feature to be able to reject certain projects that cause problems, and also to let users who are willing to take on some logging and feedback responsibilities accept beta projects that may have problems.
TPL
Posts: 104
Joined: Sun Apr 19, 2020 11:37 am

Re: faulty WU 13415, 2495,0,1

Post by TPL »

That is quite a contradictory statement.
mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Re: faulty WU 13415, 2495,0,1

Post by mwroggenbuck »

I have also had a couple of 13415 WUs fail in the same way. At least it is happening to more people than just me now.
JohnChodera
Pande Group Member
Posts: 470
Joined: Fri Feb 22, 2013 9:59 pm

Re: faulty WU 13415, 2495,0,1

Post by JohnChodera »

> Hi John, what you asked is too complicated for me. Only 13415 WUs on my RX570 crash. They crash at the end, remain in the queue as 100% complete and ready, and rerun themselves again later. Same enum error on all of them. I am dumping 13415 for now and moving on. Let me know how else I can help.

@cayenne187: Thanks for the extra clues! We're still investigating what might be going on. Do all 13415 WUs show this behavior, or just some of them?

~ John Chodera // MSKCC