10138/10139/10140 EUE

Moderators: Site Moderators, FAHC Science Team

PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud
Contact:

Re: 10138/10139/10140 EUE

Post by PantherX »

AFAICT, SMP:7 and SMP:10 are problematic when it comes to the decomposition of small projects (thus, slots configured this way will no longer be getting these projects). I am unsure of what SMP:9 would do in this case. If you can test it and report back, that would be nice.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues

Re: 10138/10139/10140 EUE

Post by PantherX »

Humanoid1 wrote:...I was trusting this new 7.3.6 client to take/(work with) only WU's that were happy with an odd number of SMP cores and being a prime number at that....Obviously I am left with the conclusion that this newer 7.3.6 client has not been coded to fix this issue meaning anyone running with default settings could be failing all such odd/prime number sensitive WU's...
The FAHClient doesn't have this feature, nor will it in the foreseeable future. The reason is that the FAHClient doesn't do the folding; the FahCore does. Thus, the FahCore is responsible for spawning the "correct" number of threads once it has been told how the SMP slot is configured. Moreover, WUs aren't assigned by the FAHClient; that is done by the Assignment Server once it gets the required information from the FAHClient.
Humanoid1 wrote:...Other experienced long time folders from my home OCF forum confirmed this suspicion.
Now having restarted running SMP10 I just have to wait to receive another of these number sensitive WU's to be 100% sure...
Do note that while you have configured your slot as SMP:11, it is in fact running as SMP:10, since the FahCore automatically remaps 11 to 10, as shown in your log:
08:55:27:WU02:FS01:0xa3:Mapping NT from 11 to 10

The reason for the remapping is that a rather significant number of projects would fail on SMP:11, so it was blacklisted at the FahCore level. SMP:5 and SMP:7 would fail on a minority of projects, specifically those that are small in size (atom count).

As stated, SMP:6, SMP:8 and SMP:12 will work. SMP:7 and SMP:10 will fail. However, I am unsure of what SMP:9 would do (would be nice to test it as previously mentioned).
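
To check which thread count a slot actually ran with, the "Mapping NT" line quoted above can be pulled out of a log. Here is a minimal sketch, assuming the line format shown in this thread. The function names are made up for illustration, and the set of problem counts is just what has been reported in these posts, not an official list.

```python
import re

# Matches the FahCore line quoted in this thread, e.g.
# "08:55:27:WU02:FS01:0xa3:Mapping NT from 11 to 10"
MAPPING_RE = re.compile(r"Mapping NT from (\d+) to (\d+)")

# Assumption: thread counts reported in this thread as failing on small
# projects. Taken from the posts above, not from any official list.
REPORTED_PROBLEM_COUNTS = {7, 10}


def effective_threads(log_line):
    """Return (configured, effective) thread counts, or None if no match."""
    match = MAPPING_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_reported_problematic(log_line):
    """True if the effective thread count is one reported as failing here."""
    counts = effective_threads(log_line)
    return counts is not None and counts[1] in REPORTED_PROBLEM_COUNTS


line = "08:55:27:WU02:FS01:0xa3:Mapping NT from 11 to 10"
print(effective_threads(line))
print(is_reported_problematic(line))
```

The point is that the second number, not the configured one, is what matters when comparing against the counts people have reported trouble with.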
Humanoid1
Posts: 17
Joined: Mon Jun 04, 2012 1:33 pm
Hardware configuration: CPU: Xeon X5650 @ 4GHz
Gfx: Asus 7950 Direct CUII TOP @ 1.05GHz
RAM: Tri Channel 12GB Kingston LoVo 1,600MHz 8-9-8-21 1T
MB: Gigabyte X58A-OC
OS: Win 7 Pro 64bit

Re: 10138/10139/10140 EUE

Post by Humanoid1 »

Thanks for the responses (and redirect to this thread/forum section I had missed in my search ;))

Some great information there, cheers PantherX. I try to stay well informed, and it is good to get a few points clarified.

After I successfully clear a few SMP WU's and repair my % to ensure the QRB during this last day or so of Chimp Challenge I will give SMP9 a try and update the results in this thread.

I get the impression you don't need the details of the failed projects I had with SMP11(corrected to 10) now we know what is going on.
If you would like them anyways, I will dig them out and post them for you.

Cheers,

Humanoid1
Jesse_V
Site Moderator
Posts: 2851
Joined: Mon Jul 18, 2011 4:44 am
Hardware configuration: OS: Windows 10, Kubuntu 19.04
CPU: i7-6700k
GPU: GTX 970, GTX 1080 TI
RAM: 24 GB DDR4
Location: Western Washington

Re: 10138/10139/10140 EUE

Post by Jesse_V »

Just to clarify, it's not the client's fault. The problem has to do with how the WU is decomposed and spread across N cores. Some WUs are susceptible to this problem, others aren't. It'd take some more complex server logic to assign WUs such that this problem could be avoided. An easier workaround is to tweak the SMP:N setting client-side.

What you did there, I see it. Apollo 13 is a great movie. :)
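
To illustrate the decomposition point above, here is a toy model of why some values of N might be awkward for small WUs. It assumes the simulation box is split into an nx * ny * nz grid of cells and that every cell edge must stay above some minimum length. The box size, the minimum cell size and the function names are all hypothetical, and the real FahCore/GROMACS logic takes more into account than this, so treat it as a sketch of the idea and not a prediction for any particular SMP:N.

```python
from itertools import product


def grids(n):
    """Yield every (nx, ny, nz) with nx * ny * nz == n."""
    for nx, ny in product(range(1, n + 1), repeat=2):
        if n % (nx * ny) == 0:
            yield nx, ny, n // (nx * ny)


def can_decompose(n, box, min_cell):
    """True if some grid keeps every cell edge at least min_cell long."""
    return any(
        all(length / cells >= min_cell for length, cells in zip(box, grid))
        for grid in grids(n)
    )


# Hypothetical small cubic system: 4.0 nm edges, 1.0 nm minimum cell edge.
BOX = (4.0, 4.0, 4.0)
MIN_CELL = 1.0

for n in (6, 7, 8, 10, 12):
    print(n, can_decompose(n, BOX, MIN_CELL))
```

With these made-up numbers, 7 only factors as 7 x 1 x 1 and 10 needs a factor of 5 or 10 along one axis, so the cells come out too thin, while 6, 8 and 12 split into more even grids. A larger box removes the problem, which fits the observation that only small projects are affected.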
F@h is now the top computing platform on the planet and nothing unites people like a dedicated fight against a common enemy. This virus affects all of us. Lets end it together.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 10138/10139/10140 EUE

Post by bruce »

Also, there's really no way for the Pande Group to predict which projects will fail with, say, SMP:10, or even which WUs from a particular project. Once somebody reports problems with, say, 10, the assignments can be restricted so that the project is no longer given to machines configured with that number. But if nobody happened to run 10 during beta testing, or if the WUs that were beta tested didn't happen to encounter the problem, it might not be discovered until later. Specific numbers, like SMP:11, are known to fail with a much higher probability, so a setting of 11 is automatically mapped to 10 (by the FahCore, per the log line quoted earlier in the thread), but lots of projects are successful with 10, so it is not automatically remapped.

Has anybody tried smp:9?
Yasgur
Posts: 27
Joined: Sat Feb 23, 2008 4:55 am
Hardware configuration: Two systems.
System 1: ASRock X79 Extreme 6 mobo, Intel i7 3960X CPU, 2x Asus Radeon 6970 GPU's Crossfire, 32GB DDR3 1333 RAM, Intel 320 120GB SSD.

System 2: Asus Sabertooth X79 mobo, Intel i7 3930K, 2x EVGA GTX 680 GPU's SLI, 32GB DDR3 2133 RAM, Intel 520 180GB SSD, 2x Samsung 840 500GB SSD RAID 0.
Location: New Windsor, NY USA

Re: 10138/10139/10140 EUE

Post by Yasgur »

Had several failures on the 10140's with default settings on v7.3.6. Am running a 3960x and a pair of 6970's, and after setting SMP to 8, it seems to be okay.

Re: 10138/10139/10140 EUE

Post by bruce »

Yasgur wrote:Had several failures on the 10140's with default settings on v7.3.6. Am running a 3960x and a pair of 6970's, and after setting SMP to 8, it seems to be okay.
Default settings vary depending on your hardware. You need to describe it in more detail. Do you have an 8-way system that defaults to smp:7, or something else? (Your hardware configuration is not listed in your profile.)

Re: 10138/10139/10140 EUE

Post by Yasgur »

The default is SMP 12. It's a 6-core CPU (Intel 3960X) and two GPUs in CrossFire (Radeon 6970s). I'll update my profile.

Re: 10138/10139/10140 EUE

Post by PantherX »

With default settings, it might have been SMP:11, but the FahCore would have remapped it to SMP:10, which is problematic.

However, the assignment settings have been tweaked so SMP:8 (excluding SMP:7) is the maximum now.
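
Spelled out, the rule described above amounts to something like the following. This is only a sketch of the restriction as stated in this post; the names are made up, and the actual Assignment Server configuration is not shown anywhere in this thread.

```python
# Assumption: values taken from the post above, for projects
# 10138/10139/10140 only.
MAX_THREADS = 8          # "SMP:8 ... is the maximum now"
EXCLUDED_THREADS = {7}   # "(excluding SMP:7)"


def may_receive_project(threads):
    """True if a slot with this effective thread count is still eligible."""
    return threads <= MAX_THREADS and threads not in EXCLUDED_THREADS


for threads in (6, 7, 8, 10):
    print(threads, may_receive_project(threads))
```

Note that the check would apply to the effective thread count, so a slot configured as SMP:11 counts as 10 after the FahCore remapping.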

Re: 10138/10139/10140 EUE

Post by Yasgur »

Aye, you're right, SMP 11. I was looking at my nVidia folder running a 6 core 3930k on v7.2.9 and it shows SMP 12. Sorry about that.

Re: 10138/10139/10140 EUE

Post by Humanoid1 »

Thanks for the update PantherX

was good to get this sorted out so fast!
DexterThorphan
Posts: 1
Joined: Fri Oct 22, 2010 1:38 pm

Re: 10138/10139/10140 EUE

Post by DexterThorphan »

I did in fact experience problems with 10140 and core a3 just this past week as well. I was assigned it on a Core 2 Duo. The WU was Project 10140, Run 51, Clone 6, Gen 11, downloaded April 21, 2013.

Code:

17:25:17:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:10140 run:51 clone:6 gen:11 core:0xa3 unit:0x0000000f0a3b1e6f5149edd521856ba3
17:25:18:WU01:FS00:Downloading project 10140 description
17:25:18:WU01:FS00:Connecting to fah-web.stanford.edu:80
17:25:18:WU01:FS00:Project 10140 description downloaded successfully
Some time later this core starts as the last WU finishes and begins uploading.

Code:

18:25:22:WU01:FS00:Starting
18:25:22:WU01:FS00:Running FahCore: "(FAH_WRAPPER)" "(CORE_A3)" -dir 01 -suffix 01 -version 702 -lifeline 4080 -checkpoint 15 -np 2
18:25:23:WU01:FS00:Started FahCore on PID 3768
18:25:24:WU01:FS00:Core PID:3220
18:25:24:WU01:FS00:FahCore 0xa3 started
18:25:24:WU01:FS00:0xa3:
18:25:24:WU01:FS00:0xa3:*------------------------------*
18:25:24:WU01:FS00:0xa3:Folding@Home Gromacs SMP Core
18:25:24:WU01:FS00:0xa3:Version 2.27 (Dec. 15, 2010)
18:25:24:WU01:FS00:0xa3:
18:25:24:WU01:FS00:0xa3:Preparing to commence simulation
18:25:24:WU01:FS00:0xa3:- Looking at optimizations...
18:25:24:WU01:FS00:0xa3:- Created dyn
18:25:24:WU01:FS00:0xa3:- Files status OK
18:25:24:WU01:FS00:0xa3:- Expanded 969985 -> 2021624 (decompressed 208.4 percent)
18:25:24:WU01:FS00:0xa3:Called DecompressByteArray: compressed_data_size=969985 data_size=2021624, decompressed_data_size=2021624 diff=0
18:25:24:WU01:FS00:0xa3:- Digital signature verified
18:25:24:WU01:FS00:0xa3:
18:25:24:WU01:FS00:0xa3:Project: 10140 (Run 51, Clone 6, Gen 11)
18:25:24:WU01:FS00:0xa3:
18:25:24:WU01:FS00:0xa3:Assembly optimizations on if available.
18:25:24:WU01:FS00:0xa3:Entering M.D.
So far so good, I guess? But then...

Code:

18:25:30:WU01:FS00:0xa3:Mapping NT from 2 to 2 
18:25:31:WU01:FS00:0xa3:Completed 0 out of 2000000 steps  (0%)
19:29:53:WU01:FS00:0xa3:Completed 20000 out of 2000000 steps  (1%)
(Many progress iterations roughly every 60-63 min)
******************************** Date: 21/04/13 ********************************
******************************** Date: 22/04/13 ********************************
******************************** Date: 23/04/13 ********************************
******************************** Date: 24/04/13 ********************************
******************************** Date: 25/04/13 ********************************
00:54:35:WU01:FS00:0xa3:Completed 1500000 out of 2000000 steps  (75%)
01:58:22:WU01:FS00:0xa3:Completed 1520000 out of 2000000 steps  (76%)
03:02:23:WU01:FS00:0xa3:Completed 1540000 out of 2000000 steps  (77%)
04:06:52:WU01:FS00:0xa3:Completed 1560000 out of 2000000 steps  (78%)
05:12:10:WU01:FS00:0xa3:Completed 1580000 out of 2000000 steps  (79%)
06:18:06:WU01:FS00:0xa3:Completed 1600000 out of 2000000 steps  (80%)
******************************** Date: 25/04/13 ********************************
07:24:03:WU01:FS00:0xa3:Completed 1620000 out of 2000000 steps  (81%)
08:29:53:WU01:FS00:0xa3:Completed 1640000 out of 2000000 steps  (82%)
09:35:11:WU01:FS00:0xa3:Completed 1660000 out of 2000000 steps  (83%)
10:40:07:WU01:FS00:0xa3:Completed 1680000 out of 2000000 steps  (84%)
11:45:31:WU01:FS00:0xa3:Completed 1700000 out of 2000000 steps  (85%)
...when suddenly...

Code:

12:19:50:ERROR:Exception: Accessing './work/01/wuinfo_01.dat': Not enough quota is available to process this command.
.....
...
..
.
...every 1 to 5 seconds, for around 36 hours, generating a 10+ MB log file. Then disaster.

Code:

22:20:56:WARNING:WU01:FS00:FahCore returned an unknown error code which probably indicates that it crashed
22:20:56:WARNING:WU01:FS00:FahCore returned: UNKNOWN_ENUM (-1073740777 = 0xc0000417)
22:21:02:WU01:FS00:Starting
22:21:02:WU01:FS00:Running FahCore:  "(FAH_WRAPPER)" "(CORE_A3)" -dir 01 -suffix 01 -version 702 -lifeline 4080 -checkpoint 15 -np 2
22:21:19:WU01:FS00:Started FahCore on PID 2172
22:21:27:WU01:FS00:Core PID:3028
22:21:27:WU01:FS00:FahCore 0xa3 started
22:21:28:ERROR:WU01:FS00:
22:21:28:ERROR:WU01:FS00:-------------------------------------------------------
22:21:28:ERROR:WU01:FS00:Program Folding@home, VERSION 4.5.4
22:21:28:ERROR:WU01:FS00:Source code file: gromacs-4.5.4\src\gmxlib\gmxfio.c, line: 519
22:21:28:ERROR:WU01:FS00:
22:21:28:ERROR:WU01:FS00:Can not open file:
22:21:28:ERROR:WU01:FS00:./work/01/wudata_01.tpr
22:21:28:ERROR:WU01:FS00:For more information and tips for troubleshooting, please check the GROMACS
22:21:28:ERROR:WU01:FS00:website at http://www.gromacs.org/Documentation/Errors
22:21:28:ERROR:WU01:FS00:-------------------------------------------------------
22:21:28:ERROR:WU01:FS00:
22:21:28:ERROR:WU01:FS00:Thanx for Using GROMACS - Have a Nice Day
22:21:28:ERROR:Exception: Accessing './work/01/wuinfo_01.dat': Not enough quota is available to process this command.
22:21:28:Server connection id=2 ended
22:21:28:Server connection id=20 ended
22:21:28:Server connection id=21 ended
22:21:28:Server connection id=22 ended
22:21:31:WARNING:WU01:FS00:FahCore returned an unknown error code which probably indicates that it crashed
22:21:32:WARNING:WU01:FS00:FahCore returned: UNKNOWN_ENUM (-1073741502 = 0xc0000142)
22:22:02:WU01:FS00:Starting
22:22:02:WU01:FS00:Running FahCore:  "(FAH_WRAPPER)" "(CORE_A3)" -dir 01 -suffix 01 -version 702 -lifeline 4080 -checkpoint 15 -np 2
22:22:02:WU01:FS00:Started FahCore on PID 2896
22:22:03:WU01:FS00:Core PID:2900
22:22:03:WU01:FS00:FahCore 0xa3 started
22:22:05:WARNING:WU01:FS00:FahCore returned an unknown error code which probably indicates that it crashed
22:22:05:WARNING:WU01:FS00:FahCore returned: UNKNOWN_ENUM (-1073741502 = 0xc0000142)
22:23:40:WU01:FS00:Starting
22:23:40:WU01:FS00:Running FahCore:  "(FAH_WRAPPER)" "(CORE_A3)" -dir 01 -suffix 01 -version 702 -lifeline 4080 -checkpoint 15 -np 2
22:23:40:WU01:FS00:Started FahCore on PID 220
This all happened out of the blue, and only upon checking the machine did I see that FAH appeared to be crashing with Windows exceptions. Strange things were also going on with CoreTemp, which was running at the same time. What actually tipped me off was an overtemp alarm: for some reason one core sensor had pegged at 100C (TjMax for the processor). I attempted a couple of restarts and saw the same problems, with the temperature reading instantly spiking from 45-50 to "100C(?)" soon after the core started, which then crashed. I finally left that box shut down for a couple of days until I could address it, and found that the problem had "fixed itself": the WU was apparently dumped, and I am now running an 8089 with no problems from a4 or CoreTemp.

In summation WTF? Sorry for the long post but I hope the above can be of use in diagnosing a possible problem lurking in a3.
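
For anyone puzzling over the UNKNOWN_ENUM values in the log above: the client prints the FahCore's exit code as a signed 32-bit number next to its hex form. A minimal sketch of that conversion follows. The function names are made up for illustration, and the status names in the table are my reading of Microsoft's NTSTATUS listing, so treat them as a pointer for further searching and not as a diagnosis.

```python
# Assumption: names as I understand them from Microsoft's NTSTATUS listing.
KNOWN_STATUS = {
    0xC0000417: "STATUS_INVALID_CRUNTIME_PARAMETER",
    0xC0000142: "STATUS_DLL_INIT_FAILED",
}


def to_unsigned_hex(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


def describe(exit_code):
    """Return the hex form plus a status name if it is in the table."""
    name = KNOWN_STATUS.get(exit_code & 0xFFFFFFFF, "not in table")
    return f"{to_unsigned_hex(exit_code)} ({name})"


# The two exit codes that appear in the log above.
for code in (-1073740777, -1073741502):
    print(code, describe(code))
```

Both hex values match what the client already printed, so the log is at least internally consistent about which codes the FahCore returned.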

Re: 10138/10139/10140 EUE

Post by bruce »

DexterThorphan wrote:In summation WTF? Sorry for the long post but I hope the above can be of use in diagnosing a possible problem lurking in a3.
I'd say the probability that it's a problem lurking in a3 is terribly close to zero. A fast rise to excessive temperatures cannot be caused by software; it points to some kind of failure in the cooling system. Maybe the fan stalled. Maybe the heatsink came loose. Maybe the voltage regulators allowed the voltage to reach improper levels. There probably are other things that might have happened, but the cooling system has to be designed to keep the chip from overheating, even if the software manages to get the hardware to 99.99% utilization.