A trio of bad WU (11713 & 11432)

Moderators: Site Moderators, FAHC Science Team

ikek
Posts: 13
Joined: Sun Jan 14, 2018 7:13 pm

A trio of bad WU (11713 & 11432)

Post by ikek »

Hello,

I have been folding for a couple of weeks and everything has been running more or less smoothly. Once yesterday and twice today I had a WU go bad, and I am wondering if there is any explanation for this peculiar behaviour. All of the bad WUs occurred during normal (non-GPU-intensive) computer use, like writing this post. It is somewhat annoying, as they failed at 43, 66 and 17 percent respectively. This hurts PPD.

Below are excerpts from the log. I have added the date and project number. I will link the full log when the forum permits me to do so.

Code:

13/01/18
19:39:43:WU01:FS01:0x21:Project: 11713 (Run 16, Clone 69, Gen 0)

20:49:53:WU01:FS01:0x21:Completed 3225000 out of 7500000 steps (43%)
20:50:22:WU01:FS01:0x21:ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-5)
20:50:22:WU01:FS01:0x21:Saving result file logfile_01.txt
20:50:22:WU01:FS01:0x21:Saving result file log.txt
20:50:22:WU01:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
20:50:24:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
20:50:24:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:11713 run:16 clone:69 gen:0 core:0x21 unit:0x000000008ca304e75a5a5225898be098

14/01/18
17:06:43:WU01:FS01:0x21:Project: 11432 (Run 0, Clone 907, Gen 3)

19:10:21:WU01:FS01:0x21:Completed 3300000 out of 5000000 steps (66%)
19:10:38:WU01:FS01:0x21:ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-5)
19:10:38:WU01:FS01:0x21:Saving result file logfile_01.txt
19:10:38:WU01:FS01:0x21:Saving result file log.txt
19:10:38:WU01:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
19:10:40:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
19:10:40:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:11432 run:0 clone:907 gen:3 core:0x21 unit:0x000000038ca304e85a5a6c408d14cabf

19:10:52:WU00:FS01:0x21:Project: 11432 (Run 1, Clone 628, Gen 1)
19:10:52:WU00:FS01:0x21:Unit: 0x000000028ca304e85a5a6c657bf2cf7c

19:42:55:WU00:FS01:0x21:Completed 850000 out of 5000000 steps (17%)
19:44:12:WU00:FS01:0x21:ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-5)
19:44:13:WU00:FS01:0x21:Saving result file logfile_01.txt
19:44:13:WU00:FS01:0x21:Saving result file log.txt
19:44:13:WU00:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
19:44:14:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
19:44:14:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:11432 run:1 clone:628 gen:1 core:0x21 unit:0x000000028ca304e85a5a6c657bf2cf7c
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: A trio of bad WU (11713 & 11432)

Post by bruce »

The clEnqueueReadBuffer message indicates that your GPU reported an error. The code -5 means CL_OUT_OF_RESOURCES. In practice, that code frequently means that the GPU was reset by the driver. (You'll probably find a report of the driver reset in the event log.)

Continuing, a driver reset generally means that the GPU is overheating or is unstable due to overclocking. Reducing the overclock settings generally cures either one, but adding a case fan or increasing the fan speed might also take care of it.
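
As a rough illustration of reading these codes, here is a minimal Python sketch. The table is only a small subset of the status codes defined in the Khronos cl.h header, and the helper name explain_cl_error and the regular expression are illustrative assumptions based on the log lines quoted in this thread; nothing here ships with FAH.

```python
import re

# A small subset of OpenCL status codes from the Khronos cl.h header.
# This is deliberately incomplete; see the OpenCL specification for the rest.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -36: "CL_INVALID_COMMAND_QUEUE",
}

# Matches the pattern seen in the log above, e.g. "clEnqueueReadBuffer (-5)".
CL_CALL = re.compile(r"(cl[A-Z]\w*) \((-?\d+)\)")


def explain_cl_error(log_line):
    """Return a readable description of the OpenCL failure in a log line.

    Hypothetical helper: returns None if the line has no "clXxx (code)" part.
    """
    match = CL_CALL.search(log_line)
    if match is None:
        return None
    call, code = match.group(1), int(match.group(2))
    name = OPENCL_ERRORS.get(code, "unknown OpenCL code")
    return f"{call} failed with {code} ({name})"


if __name__ == "__main__":
    line = ("20:50:22:WU01:FS01:0x21:ERROR:exception: Error downloading "
            "array energyBuffer: clEnqueueReadBuffer (-5)")
    print(explain_cl_error(line))
```

The name only says what OpenCL reported; the link from CL_OUT_OF_RESOURCES to a driver reset is an observation from experience, not something the code can confirm.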
Kuno
Posts: 31
Joined: Sat Sep 23, 2017 4:59 pm

Re: A trio of bad WU (11713 & 11432)

Post by Kuno »

Bruce really knows the error codes well. I wish there was a wiki somewhere that we could go to, to find out what these errors mean.
ikek
Posts: 13
Joined: Sun Jan 14, 2018 7:13 pm

Re: A trio of bad WU (11713 & 11432)

Post by ikek »

Kuno wrote:Bruce really knows the error codes well. I wish there was a wiki somewhere that we could go to, to find out what these errors mean.
I second that: there ought to be a readily available overview of typical errors and their common causes. I googled but could not find any reliable information, hence this thread.

Temps are not an issue; the GPU sits at 65-70C. The overclock is the more likely cause. Then again, it has been running at 2037 (core clock) and 2050 for a couple of weeks with no issues, both with the machine unattended and with the folding client at full load while the computer is used for light tasks.

I have downclocked the GPU and it seems stable in FAH. Time will tell if it needs further adjustment.
kiore
Posts: 931
Joined: Fri Jan 16, 2009 5:45 pm
Location: USA

Re: A trio of bad WU (11713 & 11432)

Post by kiore »

Some work units, in my experience, are less tolerant of overclocking, so even when a setting has been stable for months, a new work unit can fail on it.
i7 7800x RTX 3070 OS= win10. AMD 3700x RTX 2080ti OS= win10 .

Team page: http://www.rationalskepticism.org
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: A trio of bad WU (11713 & 11432)

Post by bruce »

OpenCL error codes can be found in several places including the official OpenCL site. https://www.khronos.org/registry/OpenCL ... rrors.html You have to dig a little to find what you're looking for in their manual.

Actually, this list is pretty useful: https://tersetalk.wordpress.com/2012/04 ... ror-codes/

Each software component used by FAH can issue its own error messages. They can come from FAH itself, from OpenCL, from your OS, etc. I have not found a comprehensive list. (I often have to google the error.)
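
Since the messages come from different components, a first step is usually just pulling the distinct ERROR and WARNING lines out of the log so you know what to search for. A minimal Python sketch follows; the function name error_messages is hypothetical, and the "HH:MM:SS:" prefix and the ":ERROR:" / ":WARNING:" markers are assumed from the excerpts earlier in this thread.

```python
import re

# Leading "HH:MM:SS:" timestamp, as seen in the log excerpts in this thread.
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}:")


def error_messages(log_lines):
    """Return distinct ERROR/WARNING messages in first-seen order.

    Timestamps are stripped so that the same message logged at different
    times is reported only once.
    """
    seen = []
    for line in log_lines:
        line = line.strip()
        if ":ERROR:" in line or ":WARNING:" in line:
            message = TIMESTAMP.sub("", line)
            if message not in seen:
                seen.append(message)
    return seen


if __name__ == "__main__":
    sample = [
        "20:50:22:WU01:FS01:0x21:ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-5)",
        "20:50:22:WU01:FS01:0x21:Saving result file log.txt",
        "20:50:24:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
    ]
    for message in error_messages(sample):
        print(message)
```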
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: A trio of bad WU (11713 & 11432)

Post by bruce »

kiore wrote:Some work units, in my experience, are less tolerant of over-clocking, so even when a setting has been stable for months, a new work unit can fail on it.
That's always true when you over-clock. First you have to find a group of benchmarks that use a maximum amount of each specific computing resource. Since there are additional variations possible, you have to add a safety margin so that in the worst case scenario stability is maintained.

In your case, your over-clock was stable for some WUs that use less than all of the resources but when you happen to get a more efficient project, it exceeds whatever margin you have allowed.
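
One way to see which projects exceed your margin is to tally the FAULTY returns per project from the log. The following is a minimal Python sketch, assuming the "Sending unit results" line format shown in the excerpts earlier in this thread; the function names faulty_units and failures_per_project are hypothetical.

```python
import re
from collections import Counter

# Matches the FAULTY result lines quoted in this thread, e.g.
# "... error:FAULTY project:11713 run:16 clone:69 gen:0 core:0x21 ..."
FAULTY = re.compile(r"error:FAULTY project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)")


def faulty_units(log_lines):
    """Return (project, run, clone, gen) tuples for every FAULTY result."""
    units = []
    for line in log_lines:
        match = FAULTY.search(line)
        if match:
            units.append(tuple(int(group) for group in match.groups()))
    return units


def failures_per_project(log_lines):
    """Count FAULTY results per project number."""
    return Counter(project for project, *_ in faulty_units(log_lines))


if __name__ == "__main__":
    sample = [
        "20:50:24:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:11713 run:16 clone:69 gen:0 core:0x21 unit:0x000000008ca304e75a5a5225898be098",
        "19:10:40:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:11432 run:0 clone:907 gen:3 core:0x21 unit:0x000000038ca304e85a5a6c408d14cabf",
        "19:44:14:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:11432 run:1 clone:628 gen:1 core:0x21 unit:0x000000028ca304e85a5a6c657bf2cf7c",
    ]
    for project, count in failures_per_project(sample).most_common():
        print(project, count)
```

A project that shows up repeatedly here while others complete normally is a hint that it loads the GPU harder than the work your overclock was tested against.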
ikek
Posts: 13
Joined: Sun Jan 14, 2018 7:13 pm

Re: A trio of bad WU (11713 & 11432)

Post by ikek »

I would like to thank everyone for their contributions.

After downclocking the core clock (by 37-50), everything seems stable. The odd thing is that, from memory, the system completed several of the heavy WUs without issue (nothing in the log) when running at 2037 while the computer was used in the same light manner. Had it hit a wall on the first of these, the cause would have been more apparent. Fortunately it was a user error, which is easily remedied.

I'll be daring enough to suggest that if the contents of bruce's post (two up), or something similar, were put into its own thread and stickied, it could prevent a thread or two like mine. It never even crossed my mind to look into OpenCL error codes, and I think many people will come here to look for information about errors reported in the FAHControl log.

with regards
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: A trio of bad WU (11713 & 11432)

Post by bruce »

It's evident that some people can complete these projects and are pleased with the results.
Subject: Why are projects 9415 and 9414 such low PPD?
Luscious wrote:Back with an update here and it's evident projects 11432 and 11713 are putting my cards back at their previous performance level.

No system changes made whatsoever.
I guess 1171x are just a bit too efficient for your overclock and 941x are just enough LESS efficient (with correspondingly lower PPD) to run on everybody's machine.
sticks435
Posts: 40
Joined: Thu Mar 03, 2011 8:29 am

Re: A trio of bad WU (11713 & 11432)

Post by sticks435 »

I'm having the same issue with 11432 and 11431. 9941x and 1171x will fold just fine on my standard folding overclock, but as soon as I hit a 1143x unit, it will fold to somewhere between 0 and 10% and then fail. I removed all manual overclocking on my 980Ti Hybrid and am using the Nvidia stock settings, and I'm pretty sure it still failed (I'm at work at the moment so can't remember/verify). The out-of-box EVGA boost clock is 1228 and mine runs at 1341 with default settings. I will have to inspect my logs when I get home and see what they say.

EDIT: Checked my logs, I have the exact same error as OP. Doesn't look like I have tried to fold one of these units since reverting to default settings.