faulty WU 13415, 2495,0,1

Moderators: Site Moderators, FAHC Science Team

Re: faulty WU 13415, 2495,0,1

Post by JohnChodera » Thu Jun 25, 2020 10:19 pm

Also, to clarify a few things:

1. This doesn't seem to be due to bad WUs. If they reach 100% and you see this block, they have completed successfully:
Code:
17:59:35:WU02:FS02:0x22:Completed 1000000 out of 1000000 steps (100%)
17:59:35:WU02:FS02:0x22:Average performance: 251.895 ns/day
17:59:35:WU02:FS02:0x22:Saving result file ..\logfile_01.txt
17:59:35:WU02:FS02:0x22:Saving result file checkpointState.xml
17:59:35:WU02:FS02:0x22:Saving result file globals.csv
17:59:35:WU02:FS02:0x22:Saving result file positions.xtc
17:59:35:WU02:FS02:0x22:Saving result file science.log
17:59:35:WU02:FS02:0x22:Folding@home Core Shutdown: FINISHED_UNIT


2. The `UNKNOWN_ENUM` reports do not seem to be transmitted back to the WS, which worries me because that means we don't know how widespread the problem is.

3. This would appear to be some kind of bug either in the core code or libraries, the core build process, or how the client handles the data packet.
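For anyone who wants to check their own logs against points 1 and 2, here is a minimal Python sketch. It assumes the line layout shown in the excerpt above (timestamp, WU id, slot id, then the message). It also assumes that the string UNKNOWN_ENUM shows up in a line carrying the same WU and slot prefix, which this thread does not actually show. The name `classify_work_units` is made up for illustration and is not part of the FAH client.

```python
import re

# Layout assumed from the excerpt above, e.g.
# "17:59:35:WU02:FS02:0x22:Folding@home Core Shutdown: FINISHED_UNIT"
LINE_RE = re.compile(r"^\d{2}:\d{2}:\d{2}:(WU\d+):(FS\d+):(.*)$")


def classify_work_units(log_lines):
    """Return {(wu, slot): {"finished": bool, "unknown_enum": bool}}.

    "finished" is set when a FINISHED_UNIT line was seen for that WU/slot,
    "unknown_enum" when any of its lines mentions UNKNOWN_ENUM.
    """
    status = {}
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a per-WU log line
        wu, slot, message = match.groups()
        flags = status.setdefault(
            (wu, slot), {"finished": False, "unknown_enum": False}
        )
        if "FINISHED_UNIT" in message:
            flags["finished"] = True
        if "UNKNOWN_ENUM" in message:
            flags["unknown_enum"] = True
    return status


if __name__ == "__main__":
    sample = [
        "17:59:35:WU02:FS02:0x22:Completed 1000000 out of 1000000 steps (100%)",
        "17:59:35:WU02:FS02:0x22:Folding@home Core Shutdown: FINISHED_UNIT",
    ]
    for (wu, slot), flags in classify_work_units(sample).items():
        print(wu, slot, flags)
```

A WU/slot pair with both flags set would be one where the core reported success but the result was still flagged, which is the case of interest here.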

Did these issues only start appearing with the core22 0.0.10 build? We recently upgraded our internal libcbang/libfah for core builds from 1.2.0 to 1.5.0, and I worry that one of these libraries may have introduced some instability.

My suspicion is that you're seeing this with 13415 in particular because the WUs are short, so some unstable library call gets invoked more often, which triggers more frequent failures.

Thanks so much for bearing with us, and for helping us get to the bottom of this! We're still getting a ton of useful data helping us with the COVID Moonshot work, but we're committed to improving the stability to make things better for everyone.

~ John Chodera // MSKCC
JohnChodera
Pande Group Member
 
Posts: 408
Joined: Fri Feb 22, 2013 10:59 pm

Re: faulty WU 13415, 2495,0,1

Post by TPL » Thu Jun 25, 2020 10:41 pm

I stopped the CPU slot after the WU on it finished. GPU load has now risen to 70%, and the WU might be ready before the timeout (TO). CPU load is still under 25% and varies a lot, but one of the four threads does not seem to be enough for the GPU.
TPL
 
Posts: 99
Joined: Sun Apr 19, 2020 12:37 pm

Re: faulty WU 13415, 2495,0,1

Post by JohnChodera » Thu Jun 25, 2020 11:08 pm

Some thoughts on how to further debug this, for anyone who is experiencing this and is able to help:

1. When this happens, please save a copy of the wudata_01.dat file that corresponds to the WU that failed with UNKNOWN_ENUM.
2. To track down whether the failure occurs in the core or the client, you can run the core on the wudata_01.dat file directly. Put a copy of wudata_01.dat in a directory called 00 and run the core on that directory with:
Code:
C:\path\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/beta/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0

(The exact syntax may vary depending on whether you are using WSL, PowerShell, or cmd.exe, and on which device you have.)
If this reliably fails with a non-zero exit code, we have a reproducible test case.
3. Please send us a copy of the wudata_01.dat file if it reliably triggers this failure!
4. If you have experience with the Windows debugger, running the core again under the debugger could yield a stack trace showing where the failure occurs. We can provide a debug core build with symbols to help produce a more meaningful stack trace if the first one doesn't reveal much.
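Steps 1 and 2 could be scripted roughly as follows. This is only a sketch: the switches are copied from the command line quoted above and may need adjusting for your GPU, and the helper names (`stage_work_unit`, `build_core_command`, `run_and_report`) are hypothetical, not part of any FAH tooling. It assumes the core is run with the parent of the 00 directory as its working directory.

```python
import shutil
import subprocess
from pathlib import Path


def stage_work_unit(wudata_file, work_root):
    """Copy the saved data file to <work_root>/00/wudata_01.dat and return that path."""
    target_dir = Path(work_root) / "00"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "wudata_01.dat"
    shutil.copyfile(wudata_file, target)
    return target


def build_core_command(core_exe, gpu_vendor="nvidia"):
    """Build the argument list using the switches from the post above."""
    return [
        str(core_exe),
        "-dir", "00",
        "-suffix", "01",
        "-gpu-vendor", gpu_vendor,
        "-opencl-platform", "0",
        "-opencl-device", "0",
        "-cuda-device", "0",
        "-gpu", "0",
    ]


def run_and_report(command, work_root):
    """Run the command with work_root as working directory; return its exit code."""
    completed = subprocess.run(command, cwd=work_root)
    return completed.returncode


# Example use (both paths are placeholders for your own installation):
#   root = "repro"
#   stage_work_unit("saved_copy_of_wudata_01.dat", root)
#   code = run_and_report(build_core_command("<path to FahCore_22.exe>"), root)
#   print("exit code:", code)
```

Passing the arguments as a list avoids shell quoting differences between WSL, PowerShell, and cmd.exe. A non-zero exit code on repeated runs would suggest the failure reproduces outside the client.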

Re: faulty WU 13415, 2495,0,1

Post by JimF » Fri Jun 26, 2020 1:40 pm

OK, I have a candidate or two. But that path does not work on the command line of Win7 64-bit.
This default installation of FAH put the data files in "C:\Users\(user)\AppData\Roaming\FAHClient"
Any suggestions?

Also, I don't see a way to attach the wudata_01.dat to an email even if I confirm it.
JimF
 
Posts: 551
Joined: Thu Jan 21, 2010 3:03 pm

Re: faulty WU 13415, 2495,0,1

Post by TPL » Fri Jun 26, 2020 1:52 pm

I guess you should use what you just wrote as the path, up to Roaming/...

Also adjust all the switches to match your configuration.

Re: faulty WU 13415, 2495,0,1

Post by JimF » Fri Jun 26, 2020 2:31 pm

Even that path does not work. It appears that among other things, the arguments for my RX 570 are not correct, even if I change "nvidia" to "amd".
I am willing to send the suspect file in and let them test it, if they can tell me how.

Re: faulty WU 13415, 2495,0,1

Post by TPL » Fri Jun 26, 2020 3:53 pm

I'm not sure about this and could be wrong, but when I open cmd it goes directly to Users\my_user_name>. The path would then start from AppData\Roaming... But are you running beta? Does this instruction only work with the beta version? For me the path would be:

[Image not preserved]

I think you need to zip wudata_01.dat to send it by email.
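If it helps, zipping the file can be done with a few lines of Python. This is just a sketch using the standard `zipfile` module; the function name `zip_work_unit` is made up for illustration.

```python
import zipfile
from pathlib import Path


def zip_work_unit(wudata_file, archive_path=None):
    """Compress one file into a .zip next to it (or at archive_path); return the archive path."""
    source = Path(wudata_file)
    # Default: wudata_01.dat -> wudata_01.zip in the same directory.
    archive = Path(archive_path) if archive_path else source.with_suffix(".zip")
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        # Store only the file name, not the full directory path.
        bundle.write(source, arcname=source.name)
    return archive


# Example: zip_work_unit("wudata_01.dat") writes wudata_01.zip alongside it.
```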

Re: faulty WU 13415, 2495,0,1

Post by JimF » Fri Jun 26, 2020 4:39 pm

Thanks. I am using the same thing. It is not beta.

Code:
C:\Users\USERNAME\AppData\Roaming\FAHClient\cores\cores.foldingathome.org\v7\win\64bit\Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0


No more of my time. If they want it, I will send it to them.

Re: faulty WU 13415, 2495,0,1

Post by JohnChodera » Fri Jun 26, 2020 6:57 pm

> Also, I don't see a way to attach the wudata_01.dat to an email even if I confirm it.

@JimF: You can email me directly at john.chodera@choderalab.org!

Re: faulty WU 13415, 2495,0,1

Post by JohnChodera » Fri Jun 26, 2020 6:58 pm

Let me check on that command line with our Windows testers and post an updated version to test locally. Thanks again for giving it a try!

~ John Chodera // MSKCC

Re: faulty WU 13415, 2495,0,1

Post by cayenne187 » Fri Jun 26, 2020 7:48 pm

Hi John, what you asked is too complicated for me. Only 13415 WUs crash on my RX570. They crash at the end, remain in the queue as 100% complete and ready, and rerun themselves later. Same enum error on all. I am dumping 13415 for now and moving on. Let me know how else I can help.
cayenne187
 
Posts: 39
Joined: Thu May 14, 2020 8:56 pm

Re: faulty WU 13415, 2495,0,1

Post by cayenne187 » Fri Jun 26, 2020 8:19 pm

It would be a nice feature to be able to reject certain projects that cause problems, and also to let users who are willing to take on some logging and feedback responsibilities accept beta projects that may have problems.

Re: faulty WU 13415, 2495,0,1

Post by TPL » Fri Jun 26, 2020 9:13 pm

That is quite a contradictory statement.

Re: faulty WU 13415, 2495,0,1

Post by mwroggenbuck » Sat Jun 27, 2020 1:19 am

I have also had a couple of 13415 WUs fail in the same way. At least it is happening to more people than just me now.
mwroggenbuck
 
Posts: 108
Joined: Tue Mar 24, 2020 1:47 pm

Re: faulty WU 13415, 2495,0,1

Post by JohnChodera » Sat Jun 27, 2020 6:00 am

> Hi John, what you asked is too complicated for me. Only 13415 WUs crash on my RX570. They crash at the end, remain in the queue as 100% complete and ready, and rerun themselves later. Same enum error on all. I am dumping 13415 for now and moving on. Let me know how else I can help.

@cayenne187: Thanks for the extra clues! We're still investigating what might be going on. Do all 13415 WUs show this behavior, or just some of them?

~ John Chodera // MSKCC
