Troubleshooting "Bad WUs"

Moderators: Site Moderators, PandeGroup

Post by PantherX » Sun Oct 31, 2010 6:40 pm

While folding on your CPU/GPU you may have encountered a Work Unit (WU) that has given an error. When that happens, the WU falls into one of two categories and it is almost impossible to tell the difference:

1 - Problematic Hardware
Sometimes a WU assigned to your F@h Client is a good one, but due to some specific hardware problem on your system, it cannot be folded correctly and returns an error. This issue can be fixed on the donor's side (more details below).

2 - Bad WU
A certain percentage of WUs will encounter an error on perfectly good hardware. The Pande Group (PG) tries to keep a very low error rate (<1% of the Project for released projects; above 1% or 2% for advmethods; often higher for earlier testing) but you still may occasionally get a WU that will fail. Unfortunately they can't tell which WUs are going to be bad until somebody runs them. This issue might be fixed on PG's side (more details below).

The primary purpose of this topic is to help you differentiate between the two by eliminating those things that you can do something about (item 1), leaving only Bad WUs to be reported.

A list of common errors can be found in these FahWiki pages: CoreStatus and Common Errors, which may give you an idea of what has happened. If you have encountered an error, please post the relevant section of your FAHlog/FAHlog-Prev (Details) in this Forum and make sure that the thread title is the PRCG (Details) of that WU.
Note: The use of -advmethods flag increases your probability of getting a bad WU since you will have access to late-beta stage WUs.
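
As a rough illustration of pulling the PRCG out of a log for the thread title, here is a minimal Python sketch. It assumes the log contains lines of the form "Project: P (Run R, Clone C, Gen G)". That format is not quoted in this post, so check it against your own FAHlog before relying on it. The function names are made up for this example.

```python
import re

# ASSUMPTION: the log prints the PRCG as "Project: P (Run R, Clone C, Gen G)".
# This pattern is not taken from the post above; adjust it if your log differs.
PRCG_PATTERN = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s*(\d+),\s*Clone\s*(\d+),\s*Gen\s*(\d+)\)"
)


def find_last_prcg(log_text):
    """Return (project, run, clone, gen) for the last PRCG found, or None."""
    matches = PRCG_PATTERN.findall(log_text)
    if not matches:
        return None
    return tuple(int(value) for value in matches[-1])


def thread_title(prcg):
    """Format a PRCG tuple as text suitable for a thread title."""
    project, run, clone, gen = prcg
    return f"Project: {project} (Run {run}, Clone {clone}, Gen {gen})"
```

You would read FAHlog.txt (or FAHlog-Prev.txt) into a string, pass it to find_last_prcg, and use thread_title on the result. The last match is used because the most recent WU is normally the one that failed.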
PantherX
Site Moderator
 
Posts: 6614
Joined: Wed Dec 23, 2009 9:33 am

1 - Problematic Hardware

Post by PantherX » Mon Nov 01, 2010 8:10 am

  • Problematic Hardware
Due to the diversity of hardware, it is impractical to test the WUs on every setup. Before PG releases a Project to the public, they follow a certain protocol to ensure that their Project(s) run on as many systems as possible. Below is a brief summary of their protocol: (it assumes that the new Project passes each stage)
Stage 1: A new Project undergoes internal (PG Members only) testing to ensure that the new Project meets their standards on designated test hardware.
Stage 2: The new Project is made available to Beta Testers (Details) who ensure that the WUs can be processed without any errors on a range of available hardware.
Stage 3: The new Project is then made available to those F@h Donors that are using the -advmethods flag. The Project is then monitored to see if any anomaly occurs due to the diversified nature of hardware folding this new Project.
Stage 4: The new Project is then made available to all F@h Donors. Rare error reports may still be encountered (more details below).


    Troubleshooting The Classic Client & SMP2 BETA Client
Since both of the F@h Clients use the CPU, the troubleshooting techniques are common.

    Stock Systems
Running the F@h Clients on a stock system is very simple and there aren't many issues with these systems. The most common troubleshooting technique on these systems is:

1) If you are experiencing FILE_IO_ERROR, you should do the following:
A) Run CHKDSK to ensure that the hard disk drive isn't faulty.
B) Make sure that the folder isn't being "shared" by another Client. If you have multiple Clients, they must be running from separate folders.
C) Some Anti-Virus programs can interfere with the folding files. Adding the folding directory to the exception list will avoid this problem.
D) You may not have write permission for that folder so check your permission level.
E) The partition may be full so consider freeing up some space.
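
Items D and E above can be checked quickly with a small script. The following is only a sketch in Python. The function name and the 100 MB threshold are arbitrary choices for illustration, not values recommended by PG.

```python
import os
import shutil
import tempfile


def check_folding_folder(folder, min_free_mb=100):
    """Return a list of problems found with the folder (empty list = none found)."""
    if not os.path.isdir(folder):
        return [f"{folder} is not an existing directory"]

    problems = []

    # Item D: try to create (and automatically remove) a temporary file.
    try:
        with tempfile.NamedTemporaryFile(dir=folder):
            pass
    except OSError as exc:
        problems.append(f"cannot write to {folder}: {exc}")

    # Item E: check the free space on the partition holding the folder.
    # min_free_mb is a hypothetical threshold chosen for this example.
    free_mb = shutil.disk_usage(folder).free / (1024 * 1024)
    if free_mb < min_free_mb:
        problems.append(f"only {free_mb:.0f} MB free on the partition")

    return problems
```

Call it with the path of your F@h Client folder and print the returned list. It does not replace CHKDSK (item A) and cannot detect Anti-Virus interference (item C).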

    Overclocked Systems
Please note that Overclocking (OCing) your hardware will void your warranty and may damage your hardware. You're solely responsible for your action(s). OCing can increase the frequency of hardware errors, so PG Members are not supportive of it. Nonetheless, many F@h Donors OC safely (NOTE: Factory overclocking is still considered overclocking). This Forum has no official stance, but when solving a problem, we first focus on the OC simply because history has shown that it's often the root cause of reported problems.

This Forum doesn't assist people to OC their hardware. You are better off asking those questions in OCing Forums. F@h is a stressful application, and an unstable system that generates errors will probably cost you more points than the overclock gains. The errors also slow down the Science since duplicate WUs are sent out to determine if the WU is bad (more details below). Hence, if your system produces errors, you must adjust it to the point that it can process the most demanding WU without errors. If you are not sure of something, please avoid doing it or ask before doing it. The techniques are:

1) If you have OCed your CPU, please return it to stock settings and see if you still get errors. Folding is a CPU-intensive application and pushes the CPU to its limits in ways that may not be detected by common stability tests. Hence you may need to reduce your OC settings to make it "F@h Stable". Do note that the F@h developers have their own stress software called Stress CPU, which you can run to ensure that your CPU produces scientifically valid results. Stress CPU is more stressful than Prime95.

2) If you have OCed your RAM, please return it to stock settings and see if you still encounter an error. Common tips to mitigate RAM errors are:
A) Loosen the timings since too tight timings may cause an issue.
B) Check the voltage applied to the RAM.

3) Make sure that the temperatures while folding are within a safe range. There isn't any clear definition of "safe range" due to the variety of available CPUs, but the cooler the CPU runs while folding, the better. You can use free temperature monitoring applications such as Core Temp, HWMonitor or Real Temp to see the current system temperature. If you notice that the temperature isn't in the safe range, you can:
A) Reduce your OC.
B) Increase your fan speed.
C) Ensure that your system has proper ventilation.
D) Clean any dust build-ups from your system.


    Troubleshooting The GPU2 Client & GPU3 BETA Client
Since both of the F@h Clients use the GPU, the troubleshooting techniques are common.

    Stock Systems
Running the F@h GPU Client(s) on a stock system is very simple and there aren't many issues with these systems. The most common troubleshooting techniques on these systems are:

1) If a factory OCed GPU is giving an error, you should lower the frequencies to the GPU manufacturer's stock settings to check if the error appears or not. If no error appears, then the factory OC wasn't "F@h Stable" hence you have to tweak the OC to avoid any further errors.

2) If you are experiencing FILE_IO_ERROR, you should do the following:
A) Run CHKDSK to ensure that the hard disk drive isn't faulty.
B) Make sure that the folder isn't being "shared" by another Client. If you have multiple Clients, please use separate folders for each.
C) Some Anti-Virus programs can interfere with the folding files. Make sure you add the folding directory to the exception list.
D) Be sure that FAH recognizes your hardware. The incorrect use of the -forcegpu flag can cause failures and can adapt your FAH client for the wrong hardware.
E) You may not have write permission for that folder so check your permission level.
F) The partition may be full so consider freeing up some space.


3) When upgrading your drivers, please use the freeware application Driver Sweeper to remove any stubborn file(s) which might cause problems when you try to fold.

4) Please refrain from folding on BETA Drivers and from reporting any issues with them. If you are using BETA Drivers and a problem appears, F@h developers can't tell whether it is the Drivers' fault (GPU Vendors' bug) or something in their software (F@h bug). Hence please wait for the WHQL Driver release, which will be tested by the F@h developers. Once it is verified to produce scientifically valid results, you can update your Drivers if you wish (Details). You can also search the Forum to see which Drivers are used for folding.

    Overclocked Systems
Please note that Overclocking (OCing) your hardware will void your warranty and may damage your hardware. You're solely responsible for your action(s). Please remember that officially, OCing is discouraged (Details). Do note that this Forum doesn't specifically cater to OCing hardware, so if you need advice, you are better off asking those questions in a forum that caters to such an audience.

The reason to check one's hardware is that F@h is a stressful application and an unstable system will generate errors, which will cost you F@h Points. This also slows down the Science since duplicate WUs are sent out to determine whether the WU is bad (more details below). Hence if your system produces errors, you must adjust it to ensure that you can fold without any errors. The techniques are:

1) Return your GPU to its stock settings to see if the error reappears. If no error appears, then your OC wasn't "F@h Stable" hence you have to tweak the OC to avoid any further errors.

2) Test the memory on your GPU with MemtestCL, which uses OpenCL. OpenCL works on supported ATI/AMD GPUs as well as Nvidia GPUs.

OpenCL requirements: ATI needs Driver 9.12 or higher plus the ATI Stream SDK; Nvidia needs Driver 195 or higher. First check whether MemtestCL runs without any errors at its default settings. If it does, repeat the test with more runs and more memory using this method:
Code:
Step 1: Start up a command prompt (start -> run -> cmd OR Win key+R -> type cmd OR Windows 7 users can browse to the directory and Shift+Right Click -> "Open command window here")

Step 2: Change to the directory where the MemtestCL executable is located

Step 3: Type this:
memtestCL 128 100

Step 4: The first value is the amount of GPU RAM to be tested (in MB) while the second value is the number of times the test will run. Both can be changed to suit your GPU.

Step 5: Once it completes the test, it will show you the Final error count. 0 indicates that everything is fine, while a non-zero value indicates memory errors or instability.
If the error count is not zero (0), the GPU is not stable at its current settings. Consider returning your GPU frequencies to stock or even lower to see if the errors stop. If they don't, the GPU isn't producing scientifically valid data, so consider replacing the GPU or stopping the F@h GPU Client on that GPU. (Discussion Thread)
Note: If you have Nvidia GPUs, you can run MemtestG80 which uses CUDA. Just replace the "memtestCL" with "memtestG80" in the above instructions. (Discussion Thread)
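
If you want to repeat the test several times without reading the output by hand, the steps above can be wrapped in a small script. This Python sketch makes two assumptions that are not confirmed in this post: that the tool runs without asking for any keyboard input when given the two values from Step 3, and that its output contains a line mentioning "Final error count" with the error count as the last number on that line. Both function names are made up for this example.

```python
import re
import subprocess


def parse_final_error_count(output):
    """Return the last number on the last 'Final error count' line, or None."""
    for line in reversed(output.splitlines()):
        # ASSUMPTION: the summary line mentions "Final error count" and the
        # error count is the last integer on that line.
        if "final error count" in line.lower():
            numbers = re.findall(r"\d+", line)
            if numbers:
                return int(numbers[-1])
    return None


def run_memtest(executable="memtestCL", megabytes=128, iterations=100):
    """Run the tester as in Step 3 and return the parsed error count."""
    # ASSUMPTION: the tool needs no keyboard input with these arguments.
    result = subprocess.run(
        [executable, str(megabytes), str(iterations)],
        capture_output=True,
        text=True,
    )
    return parse_final_error_count(result.stdout)
```

Run it from the directory that holds the executable. A return value of 0 corresponds to the "everything is fine" case in Step 5, and None means the summary line was not found, so read the raw output yourself. For MemtestG80, pass executable="memtestG80".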

3) Make sure that the temperatures while folding are within a safe range. There isn't any clear definition of "safe range" due to the variety of GPUs available, but the cooler the GPU runs while folding, the better. You can use free temperature monitoring applications such as GPU-Z, HWMonitor or Real Temp to see the current system temperature.

Heat is more of a problem if you have multiple GPUs or if you have a single-slot GPU which exhausts the heat into your case. Also, various projects put different stresses on the hardware, so just because the previous project didn't overheat doesn't mean that the next project won't.

If you have noticed that the temperature isn't in the safe range, you can:
A) Reduce your OC.
B) Increase your fan speed.
C) Increase the ventilation of your system.
D) Clean any dust build-ups from your system.


    Using qfix To Submit A Complete/Partial wuresult (V6 only)
Under some rare circumstances, the F@h Client may not be able to send a wuresult. In this case, you can use a 3rd party application called qfix. Do note that the instructions below are for Windows so if you use Linux or OSX, please adapt these instructions according to your OS.

    Using Command Line Interface
Step 1: Stop your F@h Client by using CTRL + C (for console version) or the X Button (for Systray Version)
Step 2: Place qfix in your F@h Client folder
Step 3: Open a Command Line Interface window by going to Start -> Run -> type cmd -> click OK
Step 4: Change the directory to your F@h Client's folder by using cd command
Step 5: Run qfix
Note A: If you're folding the same WU as the one which failed (same PRCG), note the number of the slot you're currently running. When a WU starts processing, it will tell you which slot position it's in with a message like "[12:45:08] Working on queue slot X [October 31 12:45:08 UTC]"
Step 6: Run your client with the -send all flag (Method) to send the recovered partial/complete WU
Note B: If you're folding the same WU as the one which failed, delete your current WU by running the client with flag -delete X where X is the number displayed in Note A.
Step 7: Close the Command Line Interface window
Step 8: Restart your F@h Client like before (normal method)
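
For Note A, the slot number can also be read from the log with a short script instead of scrolling through it. This is a minimal Python sketch based on the "Working on queue slot X" message quoted in Note A. The function name is made up for this example.

```python
import re

# Matches the message quoted in Note A: "Working on queue slot X".
SLOT_PATTERN = re.compile(r"Working on queue slot (\d+)")


def current_queue_slot(log_text):
    """Return the slot number from the last matching log line, or None."""
    slots = SLOT_PATTERN.findall(log_text)
    return int(slots[-1]) if slots else None
```

Read your FAHlog into a string and pass it in. The last match is returned because the most recent message reflects the WU currently being processed, and the result is the X to use with the -delete flag in Note B.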

    Using Windows Shortcuts
Step 1: Download qfix and put it in the F@h Client folder
Step 2: Create a copy of qfix on the desktop as a shortcut
Step 3: Create a F@h shortcut with the -send all flag (Method)
Step 4: Stop your F@h Client by using CTRL + C (for console version) or the X Button (for Systray Version)
Step 5: Double click the qfix shortcut
Step 6: Double click the send all shortcut
Step 7: Restart your F@h Client like before (normal method)

Once you have submitted your complete/partial wuresult, please wait for the next update of the Official Stats before verifying that you have received credits.

2 - Bad WU

Post by PantherX » Mon Nov 01, 2010 8:38 am

  • Bad WU
It is statistically impossible to have a Project without a single bad WU. You may encounter them, although they are rare. If you happen to get a WU which may be bad, please report it in this Forum, stating the PRCG (Details) and the relevant section of the FAHlog/FAHlog-Prev (Details). Do note that for each error a WU gives, an additional copy is sent out to determine whether the WU itself is bad or the error was caused by faulty hardware. Once you have reported the WU, the Forum Administrators/Moderators will do a lookup on that WU. If there are three or more error reports, the WU is marked as bad. Please note that this isn't an instantaneous process, so if you happen to be reassigned to the same WU, please do the following:
Step 1: Stop the F@h Client
Step 2: Delete the Work folder
Step 3: Delete the queue.dat file
Step 4: Change the Machine ID to another unique value
Step 5: Start the F@h Client
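
Steps 2 and 3 can be scripted if you have to do this on several machines. The Python sketch below is only an illustration. It assumes the folder and file are named "work" and "queue.dat" inside the Client folder (adjust the names if yours differ), and the function name is made up. Stop the Client first (Step 1). Steps 4 and 5 still have to be done by hand.

```python
import os
import shutil


def reset_work(client_folder, work_name="work", queue_name="queue.dat", dry_run=True):
    """Delete the work folder and queue file (Steps 2 and 3).

    Returns the list of paths that were removed, or that would be removed
    when dry_run is True.
    """
    # ASSUMPTION: default names "work" and "queue.dat"; change them if needed.
    work_path = os.path.join(client_folder, work_name)
    queue_path = os.path.join(client_folder, queue_name)
    targets = []

    if os.path.isdir(work_path):
        targets.append(work_path)
        if not dry_run:
            shutil.rmtree(work_path)  # Step 2: delete the Work folder

    if os.path.isfile(queue_path):
        targets.append(queue_path)
        if not dry_run:
            os.remove(queue_path)  # Step 3: delete the queue.dat file

    return targets
```

By default nothing is deleted and the function only reports what it would remove. Pass dry_run=False once the reported paths look right.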

After that, you will be assigned another WU so you can continue folding. Do monitor the WU that you have reported. If it turns out to be bad, it is alright. If someone else completes it, then you need to check your system. Please remember that an occasional error can be expected and there isn't any definitive reason as to why it happened. It is generally okay to have an error or two once in a while but if the errors are frequent, then it is advised to look further to eliminate any cause of this problem so you can increase your contribution to F@h.

I wish to thank the following users who have contributed to this thread: (alphabetically)
7im
bruce
toTOW
uncle fuzzy

