P9625-9643 and bad states- some observations

Moderators: Site Moderators, FAHC Science Team

billford
Posts: 1005
Joined: Thu May 02, 2013 8:46 pm
Hardware configuration: Full Time:

2x NVidia GTX 980
1x NVidia GTX 780 Ti
2x 3GHz Core i5 PC (Linux)

Retired:

3.2GHz Core i5 PC (Linux)
3.2GHz Core i5 iMac
2.8GHz Core i5 iMac
2.16GHz Core 2 Duo iMac
2GHz Core 2 Duo MacBook
1.6GHz Core 2 Duo Acer laptop
Location: Near Oxford, United Kingdom

Re: P9625-9643 and bad states- some observations

Post by billford »

mattifolder wrote: In my experience (GTX 970), Linux driver version 346.96 is the fastest and most stable version for folding (Linux Mint MATE 17.2).
I'd go along with that: 2x GTX 980s, same OS, a few % faster in terms of TPF, no noticeable change in BS frequency.

FWIW I'm only having problems with WUs from a few projects:

P9629/31/36/38: yet to have one run to completion (too many retries)

P9630/33/34/37: will sometimes run to completion, sometimes won't

All other Core_21's have run to completion (sometimes with one or two BS's), and no failures (or BS's afaict) on a GTX 780 Ti.

So far ;)

(One reservation about the failures: I can only find them by checking the client logs, and for various reasons unconnected with folding I've had an abnormally high number of system restarts over the last few weeks, so I don't have as many log files as I'd like.)
billford
Posts: 1005
Joined: Thu May 02, 2013 8:46 pm
Hardware configuration: Full Time:

2x NVidia GTX 980
1x NVidia GTX 780 Ti
2x 3GHz Core i5 PC (Linux)

Retired:

3.2GHz Core i5 PC (Linux)
3.2GHz Core i5 iMac
2.8GHz Core i5 iMac
2.16GHz Core 2 Duo iMac
2GHz Core 2 Duo MacBook
1.6GHz Core 2 Duo Acer laptop
Location: Near Oxford, United Kingdom

Re: P9625-9643 and bad states- some observations

Post by billford »

As I said in my OP, there appears to be a periodic/cumulative effect involved. From a WU that has just failed:

Code:

09:26:38:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:9634 run:1 clone:25 gen:31 core:0x21 unit:0x00000029ab436c9b5609bee39c568421
09:26:38:WU01:FS01:Starting
09:26:38:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 01 -suffix 01 -version 704 -lifeline 1351 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
09:26:38:WU01:FS01:Started FahCore on PID 17515
09:26:38:WU01:FS01:Core PID:17519
09:26:38:WU01:FS01:FahCore 0x21 started
09:26:38:WU01:FS01:0x21:*********************** Log Started 2015-11-04T09:26:38Z ***********************
09:26:38:WU01:FS01:0x21:Project: 9634 (Run 1, Clone 25, Gen 31)
09:26:38:WU01:FS01:0x21:Unit: 0x00000029ab436c9b5609bee39c568421
09:26:38:WU01:FS01:0x21:CPU: 0x00000000000000000000000000000000
09:26:38:WU01:FS01:0x21:Machine: 1
09:26:38:WU01:FS01:0x21:Reading tar file core.xml
09:26:38:WU01:FS01:0x21:Reading tar file integrator.xml
09:26:38:WU01:FS01:0x21:Reading tar file state.xml
09:26:38:WU01:FS01:0x21:Reading tar file system.xml
09:26:38:WU01:FS01:0x21:Digital signatures verified
09:26:38:WU01:FS01:0x21:Folding@home GPU Core21 Folding@home Core
09:26:38:WU01:FS01:0x21:Version 0.0.12
09:27:01:WU01:FS01:0x21:Completed 0 out of 2000000 steps (0%)
09:27:01:WU01:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
09:28:36:WU01:FS01:0x21:Completed 20000 out of 2000000 steps (1%)
09:30:06:WU01:FS01:0x21:Completed 40000 out of 2000000 steps (2%)
09:31:36:WU01:FS01:0x21:Completed 60000 out of 2000000 steps (3%)
09:33:06:WU01:FS01:0x21:Completed 80000 out of 2000000 steps (4%)
09:34:36:WU01:FS01:0x21:Completed 100000 out of 2000000 steps (5%)
09:36:12:WU01:FS01:0x21:Completed 120000 out of 2000000 steps (6%)
09:37:42:WU01:FS01:0x21:Completed 140000 out of 2000000 steps (7%)
09:39:12:WU01:FS01:0x21:Completed 160000 out of 2000000 steps (8%)
09:40:42:WU01:FS01:0x21:Completed 180000 out of 2000000 steps (9%)
09:42:12:WU01:FS01:0x21:Completed 200000 out of 2000000 steps (10%)
09:42:17:WU01:FS01:0x21:Bad State detected... attempting to resume from last good checkpoint
09:43:47:WU01:FS01:0x21:Completed 120000 out of 2000000 steps (6%)
09:45:17:WU01:FS01:0x21:Completed 140000 out of 2000000 steps (7%)
09:46:47:WU01:FS01:0x21:Completed 160000 out of 2000000 steps (8%)
09:48:17:WU01:FS01:0x21:Completed 180000 out of 2000000 steps (9%)
09:49:47:WU01:FS01:0x21:Completed 200000 out of 2000000 steps (10%)
09:51:23:WU01:FS01:0x21:Completed 220000 out of 2000000 steps (11%)
09:52:53:WU01:FS01:0x21:Completed 240000 out of 2000000 steps (12%)
09:54:23:WU01:FS01:0x21:Completed 260000 out of 2000000 steps (13%)
09:55:53:WU01:FS01:0x21:Completed 280000 out of 2000000 steps (14%)
09:57:23:WU01:FS01:0x21:Completed 300000 out of 2000000 steps (15%)
09:57:28:WU01:FS01:0x21:Bad State detected... attempting to resume from last good checkpoint
09:58:58:WU01:FS01:0x21:Completed 220000 out of 2000000 steps (11%)
10:00:28:WU01:FS01:0x21:Completed 240000 out of 2000000 steps (12%)
10:01:58:WU01:FS01:0x21:Completed 260000 out of 2000000 steps (13%)
10:03:28:WU01:FS01:0x21:Completed 280000 out of 2000000 steps (14%)
10:04:58:WU01:FS01:0x21:Completed 300000 out of 2000000 steps (15%)
10:06:34:WU01:FS01:0x21:Completed 320000 out of 2000000 steps (16%)
10:08:03:WU01:FS01:0x21:Completed 340000 out of 2000000 steps (17%)
10:09:33:WU01:FS01:0x21:Completed 360000 out of 2000000 steps (18%)
10:11:03:WU01:FS01:0x21:Completed 380000 out of 2000000 steps (19%)
10:12:33:WU01:FS01:0x21:Completed 400000 out of 2000000 steps (20%)
10:12:37:WU01:FS01:0x21:Bad State detected... attempting to resume from last good checkpoint
10:12:37:WU01:FS01:0x21:Max number of retries reached. Aborting.
10:12:37:WU01:FS01:0x21:ERROR:Max Retries Reached
10:12:37:WU01:FS01:0x21:Saving result file logfile_01.txt
10:12:37:WU01:FS01:0x21:Saving result file log.txt
10:12:37:WU01:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
10:12:38:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
10:12:38:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:9634 run:1 clone:25 gen:31 core:0x21 unit:0x00000029ab436c9b5609bee39c568421
10:12:38:WU01:FS01:Uploading 10.50KiB to 171.67.108.155
10:12:38:WU01:FS01:Connecting to 171.67.108.155:8080
10:12:38:WU01:FS01:Upload complete
10:12:38:WU01:FS01:Server responded WORK_ACK (400)
10:12:38:WU01:FS01:Cleaning up
This is fairly typical of the BS's I see with Core_21: if the first BS occurs later than ~40% the WU will probably complete, below ~20% it won't, and in between it's a maybe.

Does it mean anything? No idea, but if the client were a bit more adaptive to BS's I'm pretty sure the WU would eventually complete.

Easy to implement: when the client writes a checkpoint, decrement (or clear) the BS counter. Worst case (assuming fewer than 3 BS's between checkpoints) it would take 2-3 times as long as it should.
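
For illustration only, here is a minimal Python sketch of the policy being suggested. The class, event names and the MAX_RETRIES value are hypothetical; this is not the actual FAHCore/FAHClient logic.

Code:

# Sketch of the suggested retry policy (not the actual FAHClient code).
# The idea: instead of counting every bad state against a fixed limit,
# reset (or decrement) the counter whenever a checkpoint is written, so
# only bad states *since the last good checkpoint* can abort the WU.

MAX_RETRIES = 3  # hypothetical limit, mirrors the "Max Retries Reached" abort

class BadStateCounter:
    def __init__(self, max_retries=MAX_RETRIES):
        self.max_retries = max_retries
        self.retries_since_checkpoint = 0

    def on_checkpoint_written(self):
        # Progress was saved, so forget earlier bad states (the "clear"
        # variant; a gentler option would be to decrement instead).
        self.retries_since_checkpoint = 0

    def on_bad_state(self):
        # Returns True if the core should resume from the last checkpoint,
        # False if it should abort with BAD_WORK_UNIT.
        self.retries_since_checkpoint += 1
        return self.retries_since_checkpoint < self.max_retries

Under such a policy the WU in the log above would have carried on past 20% instead of aborting at the third bad state, since checkpoints were written between the bad states; the worst case is simply that it takes two to three times longer than it should.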
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am
Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5Ghz, 96GB G.Skill DDR3 1333Mhz Ubuntu 10.10
2 - Asus P6X58D-E i7 980X 4.4Ghz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
1 - Asus Rampage Gene III i7 970 4.3Ghz DDR3 2000 2-500GB Seagate 7200.11 0-Raid Ubuntu 10.10
1 - Asus G73JH Laptop i7 740QM 1.86Ghz ATI 5870M

Re: P9625-9643 and bad states- some observations

Post by Grandpa_01 »

bigblock990 wrote:
toTOW wrote: Did you change the clock for the right P-state? 7010 MHz (1752 MHz real) is the clock for P0, but most cards fold in the P2 state, whose default clock is 6000 MHz (1500 MHz real).
Yes, I created a modified BIOS so that P2 runs at 7010 MHz, same as P0. I then modified that BIOS so that P2 runs at 5500 MHz to test Grandpa_01's theory in Linux.
Just curious: how did you measure the running memory speed after you created the BIOS?
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9Ghz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15Ghz
2 - I7 980X 4.4Ghz 2-GTX680
1 - 2700k 4.4Ghz GTX680
Total = 464 cores folding
bigblock990
Posts: 20
Joined: Wed Sep 09, 2015 12:42 pm

Re: P9625-9643 and bad states- some observations

Post by bigblock990 »

Grandpa_01 wrote:
bigblock990 wrote:
toTOW wrote: Did you change the clock for the right P-state? 7010 MHz (1752 MHz real) is the clock for P0, but most cards fold in the P2 state, whose default clock is 6000 MHz (1500 MHz real).
Yes, I created a modified BIOS so that P2 runs at 7010 MHz, same as P0. I then modified that BIOS so that P2 runs at 5500 MHz to test Grandpa_01's theory in Linux.
Just curious: how did you measure the running memory speed after you created the BIOS?
In the X Server Settings GUI, if you click on the "PowerMizer" tab for each GPU, it shows the current core and memory clocks right at the top.
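
As a cross-check that doesn't depend on the GUI, the current clocks can also be read from the command line. A small Python sketch using nvidia-smi follows; the query fields are standard, but exactly how the memory clock is expressed depends on the driver.

Code:

# Sketch: query the *current* core and memory clocks from the command
# line instead of the PowerMizer tab. Assumes nvidia-smi is on PATH.
import subprocess

def current_clocks():
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,pstate,clocks.sm,clocks.mem",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        idx, pstate, core, mem = [f.strip() for f in line.split(",")]
        print(f"GPU {idx}: {pstate}, core {core}, memory {mem}")

if __name__ == "__main__":
    current_clocks()

Note that different tools report GDDR5 memory speed differently (the real command clock, double data rate, or the 4x effective transfer rate), which is why toTOW quotes both 7010 MHz and 1752 MHz real for the same setting.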
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am
Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5Ghz, 96GB G.Skill DDR3 1333Mhz Ubuntu 10.10
2 - Asus P6X58D-E i7 980X 4.4Ghz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
1 - Asus Rampage Gene III i7 970 4.3Ghz DDR3 2000 2-500GB Seagate 7200.11 0-Raid Ubuntu 10.10
1 - Asus G73JH Laptop i7 740QM 1.86Ghz ATI 5870M

Re: P9625-9643 and bad states- some observations

Post by Grandpa_01 »

And it showed something other than 6008 MHz for the P2 state, which in X Server Settings is Performance Level 2?

That is odd, because my understanding is that it is set by the drivers, and in every driver I know of for the Maxwells there is only one option for the P2 state: 6008 MHz. I will ask one of the NVIDIA developers whether it is possible to change the P2 memory speed by editing the BIOS. I have a feeling it may be limited by the driver options, but there is a good chance I misunderstood what I was told before.
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9Ghz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15Ghz
2 - I7 980X 4.4Ghz 2-GTX680
1 - 2700k 4.4Ghz GTX680
Total = 464 cores folding
mattifolder

Re: P9625-9643 and bad states- some observations

Post by mattifolder »

Version 0.0.14 of Core_21 is running with client-type=beta and the problems have gone. I copied the core (under Linux) to the standard core path, switched the client back to advanced, and have had no bad states so far with a moderate OC (somewhat lower than for Core_18).
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: P9625-9643 and bad states- some observations

Post by bruce »

mattifolder wrote: Version 0.0.14 of Core_21 is running with client-type=beta and the problems have gone. I copied the core (under Linux) to the standard core path, switched the client back to advanced, and have had no bad states so far with a moderate OC (somewhat lower than for Core_18).
Good to hear.

Overclocking might cause the same error if there's insufficient margin to assure stability under worst-case conditions, but a good Factory Overclock tends to leave enough margin.
mattifolder

Re: P9625-9643 and bad states- some observations

Post by mattifolder »

bruce wrote:Overclocking might cause the same error if there's insufficient margin to assure stability under worst-case conditions, but a good Factory Overclock tends to leave enough margin.
The extra OC over the factory OC for my GTX 970 is about 170 to 180 MHz, which isn't too much. I have a script watching the folding logs via inotifyd; on every "Bad State" it decreases the OC value for the current project and writes the lower value back into my script's configuration for the following projects. The script has a configuration table with values for every project type, matched by project number. The initial OC values are stability-tested with other applications and with already-finished projects. New projects are added to the configuration manually after reviewing the extra log file the script creates.
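
For anyone who wants to try something similar, here is a minimal Python sketch of that kind of watcher. It is not mattifolder's script: his uses inotifyd, whereas this one simply tails the log, and the log path, step size, per-project offset table and apply_offset() stub are illustrative assumptions.

Code:

# Sketch: back off a per-project clock offset every time a "Bad State"
# appears in the FAHClient log. Paths and values are assumptions.
import re
import time

LOG = "/var/lib/fahclient/log.txt"    # assumed FAHClient log location
STEP_MHZ = 10                         # how much to back off per bad state
offsets = {"9634": 170, "9636": 150}  # example per-project OC offsets (MHz)

def apply_offset(mhz):
    # Placeholder: in practice this would call nvidia-settings (with
    # Coolbits enabled) to set the clock offset for the folding P-state.
    print(f"applying core clock offset: +{mhz} MHz")

def watch(path=LOG):
    project = None
    with open(path) as fh:
        fh.seek(0, 2)                 # start at end of file, like tail -f
        while True:
            line = fh.readline()
            if not line:
                time.sleep(1)
                continue
            m = re.search(r"Project: (\d+)", line)
            if m:
                project = m.group(1)
                apply_offset(offsets.get(project, 0))
            elif "Bad State detected" in line and project in offsets:
                offsets[project] = max(0, offsets[project] - STEP_MHZ)
                apply_offset(offsets[project])

if __name__ == "__main__":
    watch()

The point of the sketch is only the "back off on every Bad State, per project" logic; persisting the table to disk and applying the offset to the GPU are left as stubs.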
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: P9625-9643 and bad states- some observations

Post by bruce »

A good factory overclock is supposed to remain stable under extreme conditions. "It's not much" may seem easy to believe, but even a small "I can get away with 170 to 180 MHz" that has never been tested under extreme conditions might very well push the card into marginal stability (i.e. unstable). This is especially true if you're overclocking VRAM.

The Developers are tracking this issue very carefully. The clincher will be if your system develops internal NaNs which are not reproduced when the identical WU is assigned to another Donor.