9121 (Run 11, Clone 2, Gen 161) Bad State Error

Moderators: Site Moderators, FAHC Science Team

parkut
Posts: 364
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

9121 (Run 11, Clone 2, Gen 161) Bad State Error

Post by parkut »

A fellow teammate's system has seen a few errors of this type. It has an Nvidia GTX 960 GPU and an i7-4790K CPU using only 4 cores, runs Ubuntu 12.04 with FAHClient version 7.3.6, and folds 24x7.
The WUs before and after this one ran to completion with no problems.

14:02:00:WU01:FS01:Connecting to assign-GPU.stanford.edu:80
14:02:00:WU01:FS01:News: 
14:02:00:WU01:FS01:Assigned to work server 171.64.65.84
14:02:00:WU01:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GM206 [GeForce GTX 960] from 171.64.65.84
14:02:00:WU01:FS01:Connecting to 171.64.65.84:8080
14:02:01:WU01:FS01:Downloading 3.28MiB
14:02:01:WU01:FS01:Download complete
14:02:02:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:9121 run:11 clone:2 gen:161 core:0x18 unit:0x000000c00a3b1e78553ea21d10e428ee
14:02:04:WU01:FS01:Starting
14:02:04:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_18.fah/FahCore_18 -dir 01 -suffix 01 -version 703 -lifeline 7777 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
14:02:04:WU01:FS01:Started FahCore on PID 12813
14:02:04:WU01:FS01:Core PID:12817
14:02:04:WU01:FS01:FahCore 0x18 started
14:02:05:WU01:FS01:0x18:*********************** Log Started 2015-10-20T14:02:04Z ***********************
14:02:05:WU01:FS01:0x18:Project: 9121 (Run 11, Clone 2, Gen 161)
14:02:05:WU01:FS01:0x18:Unit: 0x000000c00a3b1e78553ea21d10e428ee
14:02:05:WU01:FS01:0x18:CPU: 0x00000000000000000000000000000000
14:02:05:WU01:FS01:0x18:Machine: 1
14:02:05:WU01:FS01:0x18:Reading tar file state.xml
14:02:05:WU01:FS01:0x18:Reading tar file system.xml
14:02:05:WU01:FS01:0x18:Reading tar file integrator.xml
14:02:05:WU01:FS01:0x18:Reading tar file core.xml
14:02:05:WU01:FS01:0x18:Digital signatures verified
14:02:05:WU01:FS01:0x18:Folding@home GPU core18
14:02:05:WU01:FS01:0x18:Version 0.0.4
14:02:12:WU01:FS01:0x18:Completed 0 out of 2500000 steps (0%)
14:02:12:WU01:FS01:0x18:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
14:05:01:WU01:FS01:0x18:Completed 25000 out of 2500000 steps (1%)
14:07:49:WU01:FS01:0x18:Completed 50000 out of 2500000 steps (2%)
14:10:36:WU01:FS01:0x18:Completed 75000 out of 2500000 steps (3%)
14:13:23:WU01:FS01:0x18:Completed 100000 out of 2500000 steps (4%)
14:16:14:WU01:FS01:0x18:Completed 125000 out of 2500000 steps (5%)
14:19:01:WU01:FS01:0x18:Completed 150000 out of 2500000 steps (6%)
14:21:49:WU01:FS01:0x18:Completed 175000 out of 2500000 steps (7%)
14:24:36:WU01:FS01:0x18:Completed 200000 out of 2500000 steps (8%)
14:27:27:WU01:FS01:0x18:Completed 225000 out of 2500000 steps (9%)
14:30:14:WU01:FS01:0x18:Completed 250000 out of 2500000 steps (10%)
14:33:02:WU01:FS01:0x18:Completed 275000 out of 2500000 steps (11%)
14:35:49:WU01:FS01:0x18:Completed 300000 out of 2500000 steps (12%)
14:38:39:WU01:FS01:0x18:Completed 325000 out of 2500000 steps (13%)
14:41:26:WU01:FS01:0x18:Completed 350000 out of 2500000 steps (14%)
14:44:14:WU01:FS01:0x18:Completed 375000 out of 2500000 steps (15%)
14:47:01:WU01:FS01:0x18:Completed 400000 out of 2500000 steps (16%)
14:47:03:WU01:FS01:0x18:Bad State detected... attempting to resume from last good checkpoint
14:49:50:WU01:FS01:0x18:Completed 325000 out of 2500000 steps (13%)
14:52:37:WU01:FS01:0x18:Completed 350000 out of 2500000 steps (14%)
14:55:25:WU01:FS01:0x18:Completed 375000 out of 2500000 steps (15%)
14:58:12:WU01:FS01:0x18:Completed 400000 out of 2500000 steps (16%)
14:58:14:WU01:FS01:0x18:Bad State detected... attempting to resume from last good checkpoint
15:01:01:WU01:FS01:0x18:Completed 325000 out of 2500000 steps (13%)
15:03:49:WU01:FS01:0x18:Completed 350000 out of 2500000 steps (14%)
15:06:36:WU01:FS01:0x18:Completed 375000 out of 2500000 steps (15%)
15:09:23:WU01:FS01:0x18:Completed 400000 out of 2500000 steps (16%)
15:12:13:WU01:FS01:0x18:Completed 425000 out of 2500000 steps (17%)
15:15:01:WU01:FS01:0x18:Completed 450000 out of 2500000 steps (18%)
15:17:48:WU01:FS01:0x18:Completed 475000 out of 2500000 steps (19%)
15:20:35:WU01:FS01:0x18:Completed 500000 out of 2500000 steps (20%)
15:23:25:WU01:FS01:0x18:Completed 525000 out of 2500000 steps (21%)
15:26:12:WU01:FS01:0x18:Completed 550000 out of 2500000 steps (22%)
15:29:00:WU01:FS01:0x18:Completed 575000 out of 2500000 steps (23%)
15:31:47:WU01:FS01:0x18:Completed 600000 out of 2500000 steps (24%)
15:31:49:WU01:FS01:0x18:Bad State detected... attempting to resume from last good checkpoint
15:31:49:WU01:FS01:0x18:Max number of retries reached. Aborting.
15:31:49:WU01:FS01:0x18:ERROR:exception: Max Retries Reached
15:31:49:WU01:FS01:0x18:Saving result file logfile_01.txt
15:31:49:WU01:FS01:0x18:Saving result file log.txt
15:31:49:WU01:FS01:0x18:Folding@home Core Shutdown: BAD_WORK_UNIT
15:31:49:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
15:31:49:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:9121 run:11 clone:2 gen:161 core:0x18 unit:0x000000c00a3b1e78553ea21d10e428ee
15:31:49:WU01:FS01:Uploading 2.78KiB to 171.64.65.84
15:31:49:WU01:FS01:Connecting to 171.64.65.84:8080
15:31:50:WU01:FS01:Upload complete
15:31:50:WU01:FS01:Server responded WORK_ACK (400)
15:31:50:WU01:FS01:Cleaning up
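
For anyone who wants to count these events in a long log, here is a minimal Python sketch. It assumes only the two line formats visible in the log above ("Completed N out of M steps" and "Bad State detected"); the function name `summarize_bad_states` is made up for illustration and is not part of FAHClient. For each bad state it reports the last step logged before the error and the first step logged after the resume.

```python
import re

# Matches progress lines such as "...0x18:Completed 400000 out of 2500000 steps (16%)".
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")
BAD_STATE = "Bad State detected"


def summarize_bad_states(lines):
    """Return one (last_step_before_error, first_step_logged_after_resume) pair per bad state.

    The second element is None when the core aborted before logging more progress.
    """
    events = []
    last_step = None  # most recent step count seen in a progress line
    waiting = False   # True while a bad state has no following progress line yet
    step_at_error = None
    for line in lines:
        match = PROGRESS.search(line)
        if match:
            step = int(match.group(1))
            if waiting:
                events.append((step_at_error, step))
                waiting = False
            last_step = step
        elif BAD_STATE in line:
            step_at_error = last_step
            waiting = True
    if waiting:
        # No progress line followed the final bad state (the WU was aborted).
        events.append((step_at_error, None))
    return events


sample = [
    "14:47:01:WU01:FS01:0x18:Completed 400000 out of 2500000 steps (16%)",
    "14:47:03:WU01:FS01:0x18:Bad State detected... attempting to resume from last good checkpoint",
    "14:49:50:WU01:FS01:0x18:Completed 325000 out of 2500000 steps (13%)",
]
print(summarize_bad_states(sample))  # prints [(400000, 325000)]
```

Run over the full log above, this should list three events: two at step 400000 and a final one at step 600000 with no progress logged afterwards.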
Joe_H
Site Admin
Posts: 7856
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 9121 (Run 11, Clone 2, Gen 161) Bad State Error

Post by Joe_H »

The WU has been successfully completed by another folder:
Hi ********(team ******),
Your WU (P9121 R11 C2 G161) was added to the stats database on 2015-10-20 11:07:50 for 46637.8 points of credit.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
suprleg
Posts: 57
Joined: Thu Apr 26, 2012 8:30 pm

Re: 9121 (Run 11, Clone 2, Gen 161) Bad State Error

Post by suprleg »

I read that the partially completed work is "sent" to another folder for finishing, but what is the cause of the initial error:

15:31:49:WU01:FS01:0x18:Bad State detected... attempting to resume from last good checkpoint
15:31:49:WU01:FS01:0x18:Max number of retries reached. Aborting.
15:31:49:WU01:FS01:0x18:ERROR:exception: Max Retries Reached
15:31:49:WU01:FS01:0x18:Saving result file logfile_01.txt
15:31:49:WU01:FS01:0x18:Saving result file log.txt
15:31:49:WU01:FS01:0x18:Folding@home Core Shutdown: BAD_WORK_UNIT
15:31:49:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
15:31:49:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:9121 run:11 clone:2 gen:161 core:0x18 unit:0x000000c00a3b1e78553ea21d10e428ee
15:31:49:WU01:FS01:Uploading 2.78KiB to 171.64.65.84
15:31:49:WU01:FS01:Connecting to 171.64.65.84:8080
15:31:50:WU01:FS01:Upload complete
15:31:50:WU01:FS01:Server responded WORK_ACK (400)
15:31:50:WU01:FS01:Cleaning up
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am
Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5Ghz, 96GB G.Skill DDR3 1333Mhz Ubuntu 10.10
2 - Asus P6X58D-E i7 980X 4.4Ghz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
1 - Asus Rampage Gene III 17 970 4.3Ghz DDR3 2000 2-500GB Segate 7200.11 0-Raid Ubuntu 10.10
1 - Asus G73JH Laptop i7 740QM 1.86Ghz ATI 5870M

Re: 9121 (Run 11, Clone 2, Gen 161) Bad State Error

Post by Grandpa_01 »

Is he overclocking? What driver is he using? Is the GPU getting hot? Can you provide more information? The most likely cause is an overclock that is not 100% stable, but there are other things that can cause the problem.
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9Ghz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15Ghz
2 - I7 980X 4.4Ghz 2-GTX680
1 - 2700k 4.4Ghz GTX680
Total = 464 cores folding
toTOW
Site Moderator
Posts: 6296
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: 9121 (Run 11, Clone 2, Gen 161) Bad State Error

Post by toTOW »

suprleg wrote:what is the cause of the initial error
We don't know yet. A new version of the core is expected soon™ that will return more information when a bad state occurs with the hope it will help debugging ...

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
parkut
Posts: 364
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Re: 9121 (Run 11, Clone 2, Gen 161) Bad State Error

Post by parkut »

This particular machine is not overclocked; it's a new build:
EVGA 02G-P4-2966-KR GeForce GTX 960 Gaming 2GB 128-Bit GDDR5 PCI Express 3.0

Model Name: NVIDIA:5 GM206 [GeForce GTX 960]
Driver Version: 352.41
Gpu temp: 68C
Running Client Version: 7.3.6

*edit to add full GPU model name
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 9121 (Run 11, Clone 2, Gen 161) Bad State Error

Post by bruce »

toTOW wrote:
suprleg wrote:what is the cause of the initial error
We don't know yet. A new version of the core is expected soon™ that will return more information when a bad state occurs with the hope it will help debugging ...
FAH has always had an occasional WU with a Bad State error. Most can be attributed to unstable hardware, and their frequency can be reduced by reducing the overclock. Occasionally that's not the cause, and in those cases the exact cause is unknown. Originally, these were called "bad WUs" and they aborted promptly. Later it was found that the WU could be recovered from the previous checkpoint and completed successfully, so code was added to do exactly that. That's fine as long as the statistics support NOT having another bad state in the same WU, so a limit of 3 was set; when the third bad state occurs, the WU is aborted. Apparently that's what happened in this case.
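
To make the recover-from-checkpoint behavior concrete, here is a rough Python sketch of that kind of retry loop. It is not the FahCore code: `run_work_unit`, `advance`, `BadState`, and `MaxRetriesReached` are hypothetical names, and the only detail taken from this thread is that the third bad state aborts the WU.

```python
class BadState(Exception):
    """Raised by the (hypothetical) step function when the simulation state is invalid."""


class MaxRetriesReached(Exception):
    """Raised when the bad-state limit is hit and the work unit is given up."""


def run_work_unit(advance, total_steps, checkpoint_every, max_bad_states=3):
    """Call advance(step) for each step, rolling back to the last checkpoint on BadState.

    Returns the number of bad states that were recovered from. Raises
    MaxRetriesReached as soon as the max_bad_states-th bad state occurs.
    """
    checkpoint = 0   # last step count known to be good
    bad_states = 0
    step = 0
    while step < total_steps:
        try:
            advance(step)
        except BadState:
            bad_states += 1
            if bad_states >= max_bad_states:
                raise MaxRetriesReached(f"{bad_states} bad states in one work unit")
            step = checkpoint  # resume from the last good checkpoint
            continue
        step += 1
        if step % checkpoint_every == 0:
            checkpoint = step
    return bad_states
```

A transient fault is absorbed by rerunning from the checkpoint, while a fault that keeps coming back ends the run instead of looping forever, which is the trade-off the limit is there for.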

When a WU is aborted, there's no way to know whether it's due to unstable hardware or to the unknown cause, so the WU is assigned to someone else, who will probably complete it successfully. If it fails on a few different machines, the WU and the trajectory it belongs to are a lost cause and are suspended. Most are successfully completed by someone else and the trajectory continues.

As far as your friend's machine is concerned, if this happens rarely, don't worry about it. If it happens frequently, his hardware is unstable.

This WU has already been reissued. One person (apparently your friend) had an error after 0.08 days. The next person to process the WU completed it successfully.