
Re: faulty WU 13415, 2495,0,1

Posted: Thu Jun 25, 2020 9:19 pm
by JohnChodera
Also, to clarify a few things:

1. This doesn't seem to be due to bad WUs. If they reach 100% and you see this block, they have completed successfully:

Code:

17:59:35:WU02:FS02:0x22:Completed 1000000 out of 1000000 steps (100%)
17:59:35:WU02:FS02:0x22:Average performance: 251.895 ns/day
17:59:35:WU02:FS02:0x22:Saving result file ..\logfile_01.txt
17:59:35:WU02:FS02:0x22:Saving result file checkpointState.xml
17:59:35:WU02:FS02:0x22:Saving result file globals.csv
17:59:35:WU02:FS02:0x22:Saving result file positions.xtc
17:59:35:WU02:FS02:0x22:Saving result file science.log
17:59:35:WU02:FS02:0x22:Folding@home Core Shutdown: FINISHED_UNIT

2. The `UNKNOWN_ENUM` reports do not seem to be transmitted back to the WS, which worries me because that means we don't know how widespread the problem is.

3. This appears to be a bug in the core code or its libraries, in the core build process, or in how the client handles the data packet.
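
The check described in point 1 can be sketched in a few lines of Python. This is only an illustration, assuming the log lines look like the excerpt above; the function name `wu_completed` is hypothetical and this is not how the client itself decides whether a WU succeeded.

```python
import re

# Matches the progress line from the log excerpt, e.g.
# "Completed 1000000 out of 1000000 steps (100%)".
STEPS_RE = re.compile(r"Completed (\d+) out of (\d+) steps")


def wu_completed(log_lines):
    """Return True if the lines show the final step and FINISHED_UNIT."""
    reached_end = False
    finished = False
    for line in log_lines:
        match = STEPS_RE.search(line)
        if match and int(match.group(1)) == int(match.group(2)):
            reached_end = True
        if "Core Shutdown: FINISHED_UNIT" in line:
            finished = True
    return reached_end and finished


sample = [
    "17:59:35:WU02:FS02:0x22:Completed 1000000 out of 1000000 steps (100%)",
    "17:59:35:WU02:FS02:0x22:Folding@home Core Shutdown: FINISHED_UNIT",
]
print(wu_completed(sample))
```

A log that reaches 100% but never prints `FINISHED_UNIT` would be reported as not completed by this sketch.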

Did these issues only start appearing with the core22 0.0.10 build? We recently upgraded our internal libcbang/libfah for core builds from 1.2.0 to 1.5.0, and I worry that one of these libraries may have introduced some instability.

My suspicion is that you're seeing this with 13415 in particular because its WUs are short, so some unstable library call gets invoked more often and triggers failures more frequently.

Thanks so much for bearing with us, and for helping us get to the bottom of this! We're still getting a ton of useful data helping us with the COVID Moonshot work, but we're committed to improving the stability to make things better for everyone.

~ John Chodera // MSKCC

Re: faulty WU 13415, 2495,0,1

Posted: Thu Jun 25, 2020 9:41 pm
by TPL
I stopped the CPU slot after its WU finished. GPU load has now risen to 70%, and the WU might be ready before TO. CPU load is still under 25% and varies a lot, but one thread out of four does not seem to be enough for the GPU.

Re: faulty WU 13415, 2495,0,1

Posted: Thu Jun 25, 2020 10:08 pm
by JohnChodera
Some thoughts on how to further debug this, for anyone who is experiencing this and is able to help:

1. When this happens, please save a copy of the wudata_01.dat file that corresponds to the WU that failed with UNKNOWN_ENUM.
2. To track down whether the failure occurs in the core or the client, you can run the core on the wudata_01.dat file directly, outside the client. Put a copy of wudata_01.dat in a directory called 00 and run the core on that directory with:

Code:

C:\path\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/beta/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
(The exact syntax may vary depending on whether you are using WSL, PowerShell, or cmd.exe, and on which device you have.)
If you see this reliably fail with a non-zero exit code, we have a reproducible test case.
3. Please send us a copy of the wudata_01.dat file if it reliably triggers this failure!
4. If you have experience with the Windows debugger, running the core again through the debugger could yield a stack trace showing where the failure occurs. We can provide a debug core build with symbols to help produce a more meaningful stack trace if the first one doesn't reveal much.
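
Steps 1 and 2 above could be scripted roughly as follows. This is only a sketch in Python, not something supplied by the project: the helper names `stage_wu`, `exit_codes`, and `is_reproducible_failure` are hypothetical, the path to FahCore_22.exe is whatever your installation uses, and the flags are copied from the command above and must be adjusted for your device.

```python
import shutil
import subprocess
from pathlib import Path

# Flags copied from the command in the post (nvidia example); adjust as needed.
CORE_FLAGS = [
    "-dir", "00", "-suffix", "01", "-gpu-vendor", "nvidia",
    "-opencl-platform", "0", "-opencl-device", "0",
    "-cuda-device", "0", "-gpu", "0",
]


def stage_wu(wudata, workdir):
    """Copy a saved WU data file to <workdir>/00/wudata_01.dat."""
    target_dir = Path(workdir) / "00"
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(wudata, target_dir / "wudata_01.dat"))


def exit_codes(command, workdir, runs=3):
    """Run the command several times from workdir and collect exit codes."""
    codes = []
    for _ in range(runs):
        completed = subprocess.run(command, cwd=workdir)
        codes.append(completed.returncode)
    return codes


def is_reproducible_failure(codes):
    """True only when there was at least one run and every run exited non-zero."""
    return bool(codes) and all(code != 0 for code in codes)
```

With the core path stored in a variable of your own, say `core_exe`, the call would look like `exit_codes([core_exe] + CORE_FLAGS, workdir)`. The working directory matters because `-dir 00` is a relative path.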

Re: faulty WU 13415, 2495,0,1

Posted: Fri Jun 26, 2020 12:40 pm
by JimF
OK, I have a candidate or two. But that path does not work on the command line of Win7 64-bit.
This default installation of FAH put the data files in "C:\Users\(user)\AppData\Roaming\FAHClient"
Any suggestions?

Also, I don't see a way to attach the wudata_01.dat to an email even if I confirm it.

Re: faulty WU 13415, 2495,0,1

Posted: Fri Jun 26, 2020 12:52 pm
by TPL
I guess you should use the path you just wrote, up to Roaming\...

Also adjust all the switches to match your configuration.

Re: faulty WU 13415, 2495,0,1

Posted: Fri Jun 26, 2020 1:31 pm
by JimF
Even that path does not work. It appears that among other things, the arguments for my RX 570 are not correct, even if I change "nvidia" to "amd".
I am willing to send the suspect file in and let them test it, if they can tell me how.

Re: faulty WU 13415, 2495,0,1

Posted: Fri Jun 26, 2020 2:53 pm
by TPL
I'm not sure about this and could be wrong, but when I open cmd it starts directly in Users\my_user_name>, so the path would begin with AppData\Roaming... Are you running the beta, though? Does this instruction only work with the beta version? For me the path would be:

[Image]

I think you need to zip wudata_01.dat to send it by email.
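
If no zip tool is at hand, the file could be packed with a short Python script. This is a sketch only; the function name `zip_wudata` and the archive name in the usage note are made up for illustration.

```python
import zipfile
from pathlib import Path


def zip_wudata(wudata, archive):
    """Compress one file into a new .zip archive and return the archive path."""
    wudata = Path(wudata)
    archive = Path(archive)
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Store the file under its bare name, without the directory part.
        zf.write(wudata, arcname=wudata.name)
    return archive
```

For example, `zip_wudata("wudata_01.dat", "wudata_01.zip")` run from the directory that holds the file writes the archive next to it.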

Re: faulty WU 13415, 2495,0,1

Posted: Fri Jun 26, 2020 3:39 pm
by JimF
Thanks. I am using the same thing. It is not beta.

Code:

C:\Users\USERNAME\AppData\Roaming\FAHClient\cores\cores.foldingathome.org\v7\win\64bit\Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0

No more of my time. If they want it, I will send it to them.

Re: faulty WU 13415, 2495,0,1

Posted: Fri Jun 26, 2020 5:57 pm
by JohnChodera
> Also, I don't see a way to attach the wudata_01.dat to an email even if I confirm it.

@JimF: You can email me directly at john.chodera@choderalab.org!

Re: faulty WU 13415, 2495,0,1

Posted: Fri Jun 26, 2020 5:58 pm
by JohnChodera
Let me check on that command line with our Windows testers and post an updated version to test locally. Thanks again for giving it a try!

~ John Chodera // MSKCC

Re: faulty WU 13415, 2495,0,1

Posted: Fri Jun 26, 2020 6:48 pm
by cayenne187
Hi John, what you asked is too complicated for me. Only 13415 WUs crash on my RX 570. They crash at the end, remain in the queue as 100% complete and ready, and rerun themselves later. Same enum error on all of them. I am dumping 13415 for now and moving on. Let me know how else I can help.

Re: faulty WU 13415, 2495,0,1

Posted: Fri Jun 26, 2020 7:19 pm
by cayenne187
It would be a nice feature to be able to reject certain projects that cause problems, and also to opt in to beta projects that may have problems, for those willing to accept some logging and feedback responsibilities.

Re: faulty WU 13415, 2495,0,1

Posted: Fri Jun 26, 2020 8:13 pm
by TPL
That is quite a contradictory statement.

Re: faulty WU 13415, 2495,0,1

Posted: Sat Jun 27, 2020 12:19 am
by mwroggenbuck
I have also had a couple of 13415 WUs fail in the same way. At least it is happening to more people than just me now.

Re: faulty WU 13415, 2495,0,1

Posted: Sat Jun 27, 2020 5:00 am
by JohnChodera
> Hi John, what you asked is too complicated for me. Only 13415 WUs crash on my RX 570. They crash at the end, remain in the queue as 100% complete and ready, and rerun themselves later. Same enum error on all of them. I am dumping 13415 for now and moving on. Let me know how else I can help.

@cayenne187: Thanks for the extra clues! We're still investigating what might be going on. Do all 13415 WUs show this behavior, or just some of them?

~ John Chodera // MSKCC