Bad work unit/STATE indicator in FAHControl

Moderators: Site Moderators, FAHC Science Team

MeeLee
Posts: 1375
Joined: Tue Feb 19, 2019 10:16 pm

Bad work unit/STATE indicator in FAHControl

Post by MeeLee »

With the new feature that resumes a WU from the last saved state after it enters a bad state, it's sometimes hard to know which GPU has failed a work unit, which often lets a bad configuration run for a prolonged time.
Whether the cause is a hardware failure or, more commonly, a wrong overclocking setting, I think we can do something to help speed up the correction process.

To make things easier for FAH users, and to reduce the number of WUs that fail or slow down due to wrong settings, I would like to recommend the following feature:

An indicator 'light' of sorts in FAHControl, so that when I open FAHControl I can immediately and clearly see whether a GPU or CPU has triggered a 'bad work unit' since the last time I cleared the log or started FAHControl.

Rather than going through the entire log (the visible part, or the entries that are no longer displayed in the log window), or trying to locate the FAH log files and filtering out all the unnecessary data, it would be nice to have something as small as an indicator light in front of each WU entry that shows when a 'bad unit' has happened.

I thought of changing the color of the progress bar of each GPU/CPU, but in the current FAHControl for Linux the color of the bar depends on the installed desktop theme.
So that may not display reliably.
An extra column with an indicator light might therefore be a better idea.
And since modern operating systems now require 720p resolution at minimum (800p preferred), perhaps the layout can also be adjusted for these screens (the current FAHControl layout would fit an 800x600 screen).

The coding should be very minimal.
Perhaps as few as 10-15 lines of code.
Like: IF 'BAD STATE' has been recorded in the log, lower the LED state of the appropriate slot by 1 ('-1').
And as soon as the WU starts, IF LED state = 5, change it to 4.

LED state 5 = transparent/grey (inactive, pending download)
LED state 4 = Green (All Ok! Not been triggered).
LED state 3 = Yellow (One Bad WU recorded).
LED state 2 = Orange (Several BAD WUs triggered)
LED state 1 = Red (GPU failed/stopped).

Perhaps an extra line creating a popup window telling how many BAD STATEs have been recorded would also be nice.
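
The LED scheme above could be sketched roughly as follows. This is only an illustration in Python, not FAHControl code: `Led`, `SlotLeds` and the regular expressions are hypothetical names, and the assumption that a bad-state log entry contains the words "bad state" together with a slot tag such as `FS00` may not match the real log wording. Red is kept for a slot that has actually failed or stopped, so in this sketch repeated bad states stop at orange.

```python
import re
from collections import defaultdict
from enum import IntEnum


class Led(IntEnum):
    RED = 1      # GPU failed/stopped
    ORANGE = 2   # several bad WUs triggered
    YELLOW = 3   # one bad WU recorded
    GREEN = 4    # all OK, not triggered
    GREY = 5     # inactive, pending download


# Assumptions: the slot tag looks like ":FS00:" and a bad-state entry
# contains the words "bad state"; real log wording may differ.
SLOT_RE = re.compile(r":FS(\d+):")
BAD_STATE_RE = re.compile(r"bad state", re.IGNORECASE)
WU_START_RE = re.compile(r"Started FahCore")


class SlotLeds:
    """Tracks one LED state and one bad-state counter per slot."""

    def __init__(self):
        self.state = defaultdict(lambda: Led.GREY)
        self.bad_count = defaultdict(int)

    def feed(self, line):
        """Update the LED of the slot named in a single log line."""
        m = SLOT_RE.search(line)
        if not m:
            return
        slot = int(m.group(1))
        if BAD_STATE_RE.search(line):
            self.bad_count[slot] += 1
            # Step down by one from GREEN or below, but not past ORANGE,
            # and never move an already RED slot back up.
            stepped = max(min(self.state[slot], Led.GREEN) - 1, Led.ORANGE)
            self.state[slot] = Led(min(self.state[slot], stepped))
        elif WU_START_RE.search(line) and self.state[slot] == Led.GREY:
            self.state[slot] = Led.GREEN

    def clear(self):
        """Reset everything, e.g. when the user clears the log."""
        self.state.clear()
        self.bad_count.clear()
```

Feeding every new log line to `feed()` keeps `state[slot]` ready for the indicator column, and `bad_count[slot]` holds the number that a popup could show.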

Just a suggestion.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Bad work unit/STATE indicator in FAHControl

Post by bruce »

MeeLee wrote:With the new Feat implementation of continuing a WU from last savestate upon entering a bad state, it's sometimes hard to know which GPU has failed a work unit, often allowing a bad configuration to run for prolonged time.
I don't understand why you are having trouble knowing which GPU failed the WU. Your GPUs are enumerated near the end of the first page of the log. Every message in the log identifies which slot and which WU issued the message. The tag FS00 indicates that the message comes from the WU associated with slot 0.

Post the segment of a log where you are not able to tell, plus the GPU enumeration at the top of that log.
MeeLee
Posts: 1375
Joined: Tue Feb 19, 2019 10:16 pm

Re: Bad work unit/STATE indicator in FAHControl

Post by MeeLee »

The problem is that I often leave the server unattended for a day or two.
Log entries drop out of the log window after about half a day to a day.
Or sometimes the log file is so long that it takes a few minutes to find the problem.
Occasionally I overlook a BAD STATE log entry and am not even aware of any problem.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Bad work unit/STATE indicator in FAHControl

Post by bruce »

Open FAHControl. Select Log + Warnings&Errors. If necessary, select the slot & the WU and examine details around the error. Make note of the Date-Time of the last one you looked at so next time you can ignore the ones you have already examined.
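
For anyone who prefers to do the same filtering on a saved log file, here is a minimal sketch in Python. It is not part of FAHControl: the function name `warnings_and_errors` is made up for illustration, the line layout is taken from the `WARNING:FS00:` entries quoted later in this thread, and the optional `WUnn:` tag and the `ERROR` level are assumptions. Timestamps are compared as plain `HH:MM:SS` strings, so a log that runs past midnight is not handled.

```python
import re

# Matches lines such as "05:36:37:WARNING:FS00:Size of positions ...".
# The optional "WUnn:" tag and the "ERROR" level are assumptions.
LINE_RE = re.compile(
    r"^(?P<time>\d\d:\d\d:\d\d):(?P<level>WARNING|ERROR):"
    r"(?:WU\d+:)?FS(?P<slot>\d+):(?P<msg>.*)$"
)


def warnings_and_errors(lines, slot=None, after=None):
    """Yield (time, level, slot, message) for warning/error lines.

    slot  -- keep only this slot number, or None for all slots
    after -- keep only entries later than this "HH:MM:SS" string,
             i.e. the time of the last entry already examined
    """
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        if slot is not None and int(m["slot"]) != slot:
            continue
        if after is not None and m["time"] <= after:
            continue
        yield m["time"], m["level"], int(m["slot"]), m["msg"]
```

Calling `list(warnings_and_errors(open("log.txt"), slot=0, after="05:37:00"))` would return only the slot 0 entries newer than the one noted last time; `log.txt` is just a placeholder file name.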
MeeLee
Posts: 1375
Joined: Tue Feb 19, 2019 10:16 pm

Re: Bad work unit/STATE indicator in FAHControl

Post by MeeLee »

I think the 'topology' errors make those BAD STATE errors disappear from that log.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Bad work unit/STATE indicator in FAHControl

Post by bruce »

I don't think so. Topology messages will appear in the errors log filter (even though they shouldn't), and they're unrelated to messages about BAD STATE errors.
MeeLee
Posts: 1375
Joined: Tue Feb 19, 2019 10:16 pm

Re: Bad work unit/STATE indicator in FAHControl

Post by MeeLee »

What I mean is that there are so many topology errors that the bad state entries are pushed out of the log (each one followed by tens or hundreds of topology errors).
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Bad work unit/STATE indicator in FAHControl

Post by bruce »

You're not understanding what I said above. Nothing is "pushed out of the log". You can view each WU and each slot selectively to minimize the "clutter" in the log. The topology messages do not appear in every WU and certainly not in every FAHCore. Learn to use the log filters in FAHControl.
MeeLee
Posts: 1375
Joined: Tue Feb 19, 2019 10:16 pm

Re: Bad work unit/STATE indicator in FAHControl

Post by MeeLee »

Bruce, I understand exactly what you said, and the issue is still present.
With multiple cards, you can agree that this method of tracing back BAD STATEs is not very efficient unless you're near your PC all the time.
And I visit mine regularly.
The only other way to make this feasible is to reduce the verbosity level. But that would also stop FAH from displaying other information.
The current level is set to 3. I presume setting it to 2 or 1 will make the error log smaller and still show failed units, but it's not an efficient way to do things.
foldy
Posts: 2061
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Bad work unit/STATE indicator in FAHControl

Post by foldy »

Maybe it would be enough to add this info to the returned work units page? https://apps.foldingathome.org/cpu
So the work unit also sends the info if it had a bad state from which it recovered?
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Bad work unit/STATE indicator in FAHControl

Post by bruce »

Please report (1) the PRCG numbers of a protein that gives the topology error and (2) the version number of FAHCore_21 that you're running.

When a WU starts, you'll probably find something that looks like this in the log.

> .... 0x21:Version 0.0.20

If it says :0x21:Version 0.0.18, pause all GPU WUs, find FAHCore_21 in one or more branches of the cores subdirectory and delete them. A new version will download when you start the first slot that was using FAHCore_21. Let it finish downloading before you start another GPU slot.
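
If the log is long, the version check can also be scripted. The following Python sketch is only an illustration: the function names are hypothetical, the pattern is based on the `0x21:Version` line format described above, and the 0.0.20 threshold is simply the version mentioned in this post.

```python
import re

# Matches e.g. "05:34:10:WU01:FS00:0x21:Version 0.0.18"
VERSION_RE = re.compile(r":FS(\d+):0x21:Version (\d+(?:\.\d+)*)\s*$")


def core21_versions(lines):
    """Return {slot: version tuple}, keeping the last version line per slot."""
    found = {}
    for line in lines:
        m = VERSION_RE.search(line)
        if m:
            version = tuple(int(part) for part in m.group(2).split("."))
            found[int(m.group(1))] = version
    return found


def outdated_slots(lines, minimum=(0, 0, 20)):
    """Return the sorted slot numbers whose FAHCore_21 is older than minimum."""
    versions = core21_versions(lines)
    return sorted(slot for slot, v in versions.items() if v < minimum)
```

Comparing versions as integer tuples avoids the string-comparison trap where "0.0.9" would sort after "0.0.18".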
toTOW
Site Moderator
Posts: 6309
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Bad work unit/STATE indicator in FAHControl

Post by toTOW »

If you want to get back older entries from the log, hit the Refresh button; it should reload the whole data since the client started. Then you can filter on the slot you want to investigate and you'll get all the data filtered.

You can also use HFM that will do the history job for you automatically (Tools > Work unit history viewer) ... I have data since 2012 in this tool with all the WU I folded and the details (which machine, PRCG, PPD, TPF, status Finished/Failed, ...). You can do advanced queries (like in a database) to extract specific data (for instance, all failed WUs from a given slot).

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
prcowley
Posts: 23
Joined: Thu Jan 03, 2019 11:03 pm
Hardware configuration: Op Sys: Linux Ubuntu Studio 21.04 LTS
Kernel: 5.11.0-37-lowlatency
Proc: AMD Ryzen 7 1700 - 8-core
Mem: 32 GB
GPU: Nvidia GeForce GTX 1080Ti
Storage: 2 TB
Location: Gisborne, New Zealand

Re: Bad work unit/STATE indicator in FAHControl

Post by prcowley »

I too am having these from time to time.

The latest one just started and is only 5% complete. Here is the log from the FAH client:
05:34:10:WU01:FS00:Started FahCore on PID 15870
05:34:10:WU01:FS00:Core PID:15874
05:34:10:WU01:FS00:FahCore 0x21 started
05:34:10:WU01:FS00:0x21:*********************** Log Started 2019-05-31T05:34:10Z ***********************
05:34:10:WU01:FS00:0x21:Project: 14179 (Run 5, Clone 8, Gen 65)
05:34:10:WU01:FS00:0x21:Unit: 0x000000590002894c5cb3872708103860
05:34:10:WU01:FS00:0x21:CPU: 0x00000000000000000000000000000000
05:34:10:WU01:FS00:0x21:Machine: 0
05:34:10:WU01:FS00:0x21:Reading tar file core.xml
05:34:10:WU01:FS00:0x21:Reading tar file integrator.xml
05:34:10:WU01:FS00:0x21:Reading tar file state.xml
05:34:10:WU01:FS00:0x21:Reading tar file system.xml
05:34:10:WU01:FS00:0x21:Digital signatures verified
05:34:10:WU01:FS00:0x21:Folding@home GPU Core21 Folding@home Core
05:34:10:WU01:FS00:0x21:Version 0.0.18
05:34:12:WU01:FS00:0x21:Completed 0 out of 25000000 steps (0%)
05:34:12:WU01:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
05:34:14:WU00:FS00:Upload complete
05:34:14:WU00:FS00:Server responded WORK_ACK (400)
05:34:14:WU00:FS00:Final credit estimate, 62809.00 points
05:34:14:WU00:FS00:Cleaning up
05:36:32:WU01:FS00:0x21:Completed 250000 out of 25000000 steps (1%)
05:36:37:WARNING:FS00:Size of positions 286 does not match topology 16
05:38:51:WU01:FS00:0x21:Completed 500000 out of 25000000 steps (2%)
05:38:56:WARNING:FS00:Size of positions 286 does not match topology 16
05:41:10:WU01:FS00:0x21:Completed 750000 out of 25000000 steps (3%)
05:41:16:WARNING:FS00:Size of positions 286 does not match topology 16
05:43:32:WU01:FS00:0x21:Completed 1000000 out of 25000000 steps (4%)

Some of the earlier messages said that core 21 might be at 0.0.20, and I notice mine is 0.0.18. Should I delete core 21 and see if a newer version downloads?

Cheers
Pete
Pete Cowley, Gisborne, New Zealand. The first city to see the light of the new day. :D
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Bad work unit/STATE indicator in FAHControl

Post by bruce »

prcowley wrote:In some of the earlier messages it said that core 21 might be at 0.0.20 and I notice mine is 0.0.18 Should I delete core 21 and see if a newer version downloads?
Yes! And if you do get 0.0.20, please report that info here.
prcowley
Posts: 23
Joined: Thu Jan 03, 2019 11:03 pm
Hardware configuration: Op Sys: Linux Ubuntu Studio 21.04 LTS
Kernel: 5.11.0-37-lowlatency
Proc: AMD Ryzen 7 1700 - 8-core
Mem: 32 GB
GPU: Nvidia GeForce GTX 1080Ti
Storage: 2 TB
Location: Gisborne, New Zealand

Re: Bad work unit/STATE indicator in FAHControl

Post by prcowley »

Well, unfortunately, after deleting core 21 v0.0.18 several times and from several places, it seems to keep coming back without being downloaded. A neat trick that seems like magic!

I finally managed to force it to re-download Core 21 but it is still v0.0.18 so perhaps that is the latest for Linux AMD64 and NVIDIA? Is there a later version?

Cheers
Pete
Pete Cowley, Gisborne, New Zealand. The first city to see the light of the new day. :D