Folding strangles system

Post by thebluebumblebee » Wed Mar 20, 2013 8:46 am

It's late and I'm tired, so this may ramble.

It has taken me about a year to post about this problem! It's taken this long because I wanted to be sure of what I was seeing and to be able to give a problem report that's better than "Dude, Folding borks my system!"

On a system with 2 Nvidia Fermi cards, Folding can cause the system to become unresponsive. I'm willing to bet that many of you have had the same problem, assumed that your system had died, and simply rebooted. Easy fix, but it does not address the problem. If a person is patient enough, the system will come back, but it will take 5-15 minutes for EACH mouse click. At first I thought the problem was specific to the one system where I had 2 Fermi cards, but I now have a second system that has done it as well, and a teammate told me of the same thing happening on his system (all Fermi cards).

Tonight it happened on my i7-2600K with 2 GTX 560 Ti's. Below is the log from when it happened. Sorry that it's so huge, but I wanted to show that the system was running fine and then "stumbled". It took about 20 minutes to get the GUI to respond. I had Chrome open with two tabs (nothing fancy). I closed that, and the system became more responsive, but was still very slow. I launched EVGA Precision and it showed that both cards' workloads were bouncing around at about 30%. I've seen this GPU utilization pattern before; I even thought that I had posted about it here, but I can't find the post now. Once Folding was paused, the system returned to normal.

As you can also see, this went on for ~2 hours. Notice how it messes up the clock - I've even seen the system clock appear to get stuck. I had previously come to the conclusion that this could happen when 2 slots finished at roughly the same time, but that is not the only time it happens.
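
To put a number on the clock problem, the "Detected clock skew" warnings in the log can be tallied. This is only a rough sketch: the pattern is written against the warning lines as they appear in the log in this post, the function name is made up for illustration, and it assumes the log does not cross midnight. Each warning is logged once per running work unit, so duplicates are collapsed by timestamp.

```python
import re

# Matches FAHClient warnings of the form seen in this post's log, e.g.
# 05:10:39:WARNING:WU00:FS02:Detected clock skew (2 mins 11 secs), adjusting time estimates
SKEW_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):WARNING:WU\d+:FS\d+:"
    r"Detected clock skew \((?:(\d+) mins? )?(\d+) secs?\)"
)


def clock_skew_events(lines):
    """Return sorted (timestamp, skew_seconds) pairs, one per distinct timestamp.

    The same warning is repeated for every running work unit, so entries
    sharing a timestamp are treated as a single event.
    """
    events = {}
    for line in lines:
        match = SKEW_RE.match(line.strip())
        if match:
            stamp = match.group(1)
            minutes = int(match.group(2) or 0)
            seconds = int(match.group(3))
            events[stamp] = minutes * 60 + seconds
    # Plain string sort is chronological only within a single day (assumption).
    return sorted(events.items())
```

Feeding it the lines of a saved log file (for example `clock_skew_events(open("log.txt"))`, where `log.txt` is a placeholder name) gives a list that can be counted or summed.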

This is a serious problem that needs to be fixed. Of course, if this is a FAHCore_15 problem, then it's about to go away anyway.

I stated in another post that I want to "set and forget." I monitor my clients from one location but FAHControl does not detect a problem when this happens. HFM.net does, but not with beta.

Code:
04:23:56:WU00:FS02:0xa4:Completed 780000 out of 1500000 steps  (52%)
04:24:24:WU04:FS01:0x15:Completed  44500000 out of 50000000 steps (89%).
04:25:35:WU03:FS00:0x15:Completed  45500000 out of 50000000 steps (91%).
04:26:47:WU04:FS01:0x15:Completed  45000000 out of 50000000 steps (90%).
04:27:58:WU03:FS00:0x15:Completed  46000000 out of 50000000 steps (92%).
04:29:10:WU04:FS01:0x15:Completed  45500000 out of 50000000 steps (91%).
04:30:21:WU03:FS00:0x15:Completed  46500000 out of 50000000 steps (93%).
04:31:33:WU04:FS01:0x15:Completed  46000000 out of 50000000 steps (92%).
04:32:44:WU03:FS00:0x15:Completed  47000000 out of 50000000 steps (94%).
04:33:57:WU04:FS01:0x15:Completed  46500000 out of 50000000 steps (93%).
04:34:38:WU00:FS02:0xa4:Completed 795000 out of 1500000 steps  (53%)
04:35:08:WU03:FS00:0x15:Completed  47500000 out of 50000000 steps (95%).
04:36:20:WU04:FS01:0x15:Completed  47000000 out of 50000000 steps (94%).
04:37:31:WU03:FS00:0x15:Completed  48000000 out of 50000000 steps (96%).
04:38:43:WU04:FS01:0x15:Completed  47500000 out of 50000000 steps (95%).
04:39:54:WU03:FS00:0x15:Completed  48500000 out of 50000000 steps (97%).
04:41:06:WU04:FS01:0x15:Completed  48000000 out of 50000000 steps (96%).
04:42:17:WU03:FS00:0x15:Completed  49000000 out of 50000000 steps (98%).
04:42:18:WU01:FS00:Connecting to assign-GPU.stanford.edu:80
04:42:18:WU01:FS00:News: Welcome to Folding@Home
04:42:18:WU01:FS00:Assigned to work server 171.67.108.36
04:42:18:WU01:FS00:Requesting new work unit for slot 00: RUNNING gpu:0:GF114 [GeForce GTX 560 Ti] from 171.67.108.36
04:42:18:WU01:FS00:Connecting to 171.67.108.36:8080
04:42:18:WU01:FS00:Downloading 56.89KiB
04:42:18:WU01:FS00:Download complete
04:42:19:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:8070 run:280 clone:2 gen:81 core:0x15 unit:0x0000005a6652edb45122e08adf03847b
04:43:29:WU04:FS01:0x15:Completed  48500000 out of 50000000 steps (97%).
04:44:40:WU03:FS00:0x15:Completed  49500000 out of 50000000 steps (99%).
04:45:04:WU00:FS02:0xa4:Completed 810000 out of 1500000 steps  (54%)
04:45:53:WU04:FS01:0x15:Completed  49000000 out of 50000000 steps (98%).
04:45:54:WU02:FS01:Connecting to assign-GPU.stanford.edu:80
04:45:54:WU02:FS01:News: Welcome to Folding@Home
04:45:54:WU02:FS01:Assigned to work server 171.67.108.36
04:45:54:WU02:FS01:Requesting new work unit for slot 01: RUNNING gpu:1:GF114 [GeForce GTX 560 Ti] from 171.67.108.36
04:45:54:WU02:FS01:Connecting to 171.67.108.36:8080
04:45:54:WU02:FS01:Downloading 66.24KiB
04:45:55:WU02:FS01:Download complete
04:45:55:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:8071 run:82 clone:9 gen:0 core:0x15 unit:0x000000016652edb45122e1279645c4b4
04:47:04:WU03:FS00:0x15:Completed  50000000 out of 50000000 steps (100%).
04:47:04:WU03:FS00:0x15:Finished fah_main status=0
04:47:04:WU03:FS00:0x15:Successful run
04:47:04:WU03:FS00:0x15:DynamicWrapper: Finished Work Unit: sleep=10000
04:47:14:WU03:FS00:0x15:Reserved 324844 bytes for xtc file; Cosm status=0
04:47:14:WU03:FS00:0x15:Allocated 324844 bytes for xtc file
04:47:14:WU03:FS00:0x15:- Reading up to 324844 from "03/wudata_01.xtc": Read 324844
04:47:14:WU03:FS00:0x15:Read 324844 bytes from xtc file; available packet space=786105620
04:47:14:WU03:FS00:0x15:xtc file hash check passed.
04:47:14:WU03:FS00:0x15:Reserved 20256 20256 786105620 bytes for arc file=<03/wudata_01.trr> Cosm status=0
04:47:14:WU03:FS00:0x15:Allocated 20256 bytes for arc file
04:47:14:WU03:FS00:0x15:- Reading up to 20256 from "03/wudata_01.trr": Read 20256
04:47:14:WU03:FS00:0x15:Read 20256 bytes from arc file; available packet space=786085364
04:47:14:WU03:FS00:0x15:trr file hash check passed.
04:47:14:WU03:FS00:0x15:Allocated 544 bytes for edr file
04:47:14:WU03:FS00:0x15:Read bedfile
04:47:14:WU03:FS00:0x15:edr file hash check passed.
04:47:14:WU03:FS00:0x15:Allocated 36806 bytes for logfile
04:47:14:WU03:FS00:0x15:Read logfile
04:47:14:WU03:FS00:0x15:GuardedRun: success in DynamicWrapper
04:47:14:WU03:FS00:0x15:GuardedRun: done
04:47:14:WU03:FS00:0x15:Run: GuardedRun completed.
04:47:16:WU03:FS00:0x15:+ Opened results file
04:47:16:WU03:FS00:0x15:- Writing 382962 bytes of core data to disk...
04:47:16:WU03:FS00:0x15:Done: 382450 -> 351561 (compressed to 91.9 percent)
04:47:16:WU03:FS00:0x15:  ... Done.
04:47:16:WU03:FS00:0x15:DeleteFrameFiles: successfully deleted file=03/wudata_01.ckp
04:47:16:WU03:FS00:0x15:Shutting down core
04:47:16:WU03:FS00:0x15:
04:47:16:WU03:FS00:0x15:Folding@home Core Shutdown: FINISHED_UNIT
04:47:16:WU03:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
04:47:16:WU03:FS00:Sending unit results: id:03 state:SEND error:NO_ERROR project:8071 run:148 clone:5 gen:32 core:0x15 unit:0x000000236652edb45122e1908215290a
04:47:16:WU03:FS00:Uploading 343.82KiB to 171.67.108.36
04:47:16:WU01:FS00:Starting
04:47:16:WU03:FS00:Connecting to 171.67.108.36:8080
04:47:16:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/ProgramData/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_15.fah/FahCore_15.exe -dir 01 -suffix 01 -version 703 -lifeline 3480 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
04:47:16:WU01:FS00:Started FahCore on PID 2076
04:47:16:WU01:FS00:Core PID:1144
04:47:16:WU01:FS00:FahCore 0x15 started
04:47:17:WU01:FS00:0x15:
04:47:17:WU01:FS00:0x15:*------------------------------*
04:47:17:WU01:FS00:0x15:Folding@Home GPU Core
04:47:17:WU01:FS00:0x15:Version                2.25 (Wed May 9 17:03:01 EDT 2012)
04:47:17:WU01:FS00:0x15:Build host             AmoebaRemote
04:47:17:WU01:FS00:0x15:Board Type             NVIDIA/CUDA
04:47:17:WU01:FS00:0x15:Core                   15
04:47:17:WU01:FS00:0x15:
04:47:17:WU01:FS00:0x15:Window's signal control handler registered.
04:47:17:WU01:FS00:0x15:Preparing to commence simulation
04:47:17:WU01:FS00:0x15:- Looking at optimizations...
04:47:17:WU01:FS00:0x15:DeleteFrameFiles: successfully deleted file=01/wudata_01.ckp
04:47:17:WU01:FS00:0x15:- Created dyn
04:47:17:WU01:FS00:0x15:- Files status OK
04:47:17:WU01:FS00:0x15:sizeof(CORE_PACKET_HDR) = 512 file=<>
04:47:17:WU01:FS00:0x15:- Expanded 57741 -> 257358 (decompressed 445.7 percent)
04:47:17:WU01:FS00:0x15:Called DecompressByteArray: compressed_data_size=57741 data_size=257358, decompressed_data_size=257358 diff=0
04:47:17:WU01:FS00:0x15:- Digital signature verified
04:47:17:WU01:FS00:0x15:
04:47:17:WU01:FS00:0x15:Project: 8070 (Run 280, Clone 2, Gen 81)
04:47:17:WU01:FS00:0x15:
04:47:17:WU01:FS00:0x15:Assembly optimizations on if available.
04:47:17:WU01:FS00:0x15:Entering M.D.
04:47:18:WU03:FS00:Upload complete
04:47:19:WU03:FS00:Server responded WORK_ACK (400)
04:47:19:WU03:FS00:Final credit estimate, 3874.00 points
04:47:19:WU03:FS00:Cleaning up
04:47:19:WU01:FS00:0x15:Tpr hash 01/wudata_01.tpr:  523088428 2395562252 2328542420 506125749 2677203249
04:47:19:WU01:FS00:0x15:GPU device id=0
04:47:19:WU01:FS00:0x15:Working on Gallium Rubidium Oxygen Manganese Argon Carbon Silicon t= 280.00000
04:47:19:WU01:FS00:0x15:Client config unavailable.
04:47:19:WU01:FS00:0x15:Starting GUI Server
04:54:58:WU01:FS00:0x15:Setting checkpoint frequency: 500000
04:54:58:WU01:FS00:0x15:Completed         3 out of 50000000 steps (0%).
04:56:40:WU00:FS02:0xa4:Completed 825000 out of 1500000 steps  (55%)
05:08:35:WU00:FS02:0xa4:Completed 840000 out of 1500000 steps  (56%)
05:10:39:WARNING:WU00:FS02:Detected clock skew (2 mins 11 secs), adjusting time estimates
05:10:39:WARNING:WU04:FS01:Detected clock skew (2 mins 11 secs), adjusting time estimates
05:10:39:WARNING:WU01:FS00:Detected clock skew (2 mins 11 secs), adjusting time estimates
05:17:58:WARNING:WU00:FS02:Detected clock skew (4 mins 22 secs), adjusting time estimates
05:17:58:WARNING:WU04:FS01:Detected clock skew (4 mins 22 secs), adjusting time estimates
05:17:58:WARNING:WU01:FS00:Detected clock skew (4 mins 22 secs), adjusting time estimates
05:20:34:WU00:FS02:0xa4:Completed 855000 out of 1500000 steps  (57%)
05:32:26:WU00:FS02:0xa4:Completed 870000 out of 1500000 steps  (58%)
05:44:19:WU00:FS02:0xa4:Completed 885000 out of 1500000 steps  (59%)
05:45:02:WARNING:WU00:FS02:Detected clock skew (2 mins 12 secs), adjusting time estimates
05:45:02:WARNING:WU04:FS01:Detected clock skew (2 mins 12 secs), adjusting time estimates
05:45:02:WARNING:WU01:FS00:Detected clock skew (2 mins 12 secs), adjusting time estimates
05:52:38:WARNING:WU00:FS02:Detected clock skew (4 mins 21 secs), adjusting time estimates
05:52:38:WARNING:WU04:FS01:Detected clock skew (4 mins 21 secs), adjusting time estimates
05:52:38:WARNING:WU01:FS00:Detected clock skew (4 mins 21 secs), adjusting time estimates
05:56:10:WU00:FS02:0xa4:Completed 900000 out of 1500000 steps  (60%)
06:08:02:WU00:FS02:0xa4:Completed 915000 out of 1500000 steps  (61%)
06:12:25:WARNING:WU00:FS02:Detected clock skew (2 mins 12 secs), adjusting time estimates
06:12:25:WARNING:WU04:FS01:Detected clock skew (2 mins 12 secs), adjusting time estimates
06:12:25:WARNING:WU01:FS00:Detected clock skew (2 mins 12 secs), adjusting time estimates
06:14:35:WARNING:WU00:FS02:Detected clock skew (2 mins 10 secs), adjusting time estimates
06:14:35:WARNING:WU04:FS01:Detected clock skew (2 mins 10 secs), adjusting time estimates
06:14:35:WARNING:WU01:FS00:Detected clock skew (2 mins 10 secs), adjusting time estimates
06:19:49:WU00:FS02:0xa4:Completed 930000 out of 1500000 steps  (62%)
06:21:34:WARNING:WU00:FS02:Detected clock skew (2 mins 11 secs), adjusting time estimates
06:21:34:WARNING:WU04:FS01:Detected clock skew (2 mins 11 secs), adjusting time estimates
06:21:34:WARNING:WU01:FS00:Detected clock skew (2 mins 11 secs), adjusting time estimates
06:23:46:WARNING:WU00:FS02:Detected clock skew (2 mins 12 secs), adjusting time estimates
06:23:46:WARNING:WU04:FS01:Detected clock skew (2 mins 12 secs), adjusting time estimates
06:23:46:WARNING:WU01:FS00:Detected clock skew (2 mins 12 secs), adjusting time estimates
06:27:57:WARNING:WU00:FS02:Detected clock skew (1 mins 55 secs), adjusting time estimates
06:27:57:WARNING:WU04:FS01:Detected clock skew (1 mins 55 secs), adjusting time estimates
06:27:57:WARNING:WU01:FS00:Detected clock skew (1 mins 55 secs), adjusting time estimates
06:30:29:WARNING:WU00:FS02:Detected clock skew (2 mins 11 secs), adjusting time estimates
06:30:29:WARNING:WU04:FS01:Detected clock skew (2 mins 11 secs), adjusting time estimates
06:30:29:WARNING:WU01:FS00:Detected clock skew (2 mins 11 secs), adjusting time estimates
06:33:12:WU00:FS02:0xa4:Completed 945000 out of 1500000 steps  (63%)
06:37:32:WARNING:WU00:FS02:Detected clock skew (2 mins 02 secs), adjusting time estimates
06:37:32:WARNING:WU04:FS01:Detected clock skew (2 mins 02 secs), adjusting time estimates
06:37:32:WARNING:WU01:FS00:Detected clock skew (2 mins 02 secs), adjusting time estimates
06:45:11:WU00:FS02:0xa4:Completed 960000 out of 1500000 steps  (64%)
06:50:46:FS00:Paused
06:50:46:FS00:Shutting core down
06:50:53:WU01:FS00:0x15:Client no longer detected. Shutting down core
06:50:53:WU01:FS00:0x15:
06:50:53:WU01:FS00:0x15:Folding@home Core Shutdown: CLIENT_DIED
06:50:54:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
06:50:57:FS01:Paused
06:50:57:FS01:Shutting core down
06:50:59:WU04:FS01:0x15:Client no longer detected. Shutting down core
06:50:59:WU04:FS01:0x15:
06:50:59:WU04:FS01:0x15:Folding@home Core Shutdown: CLIENT_DIED
06:50:59:WU04:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
06:52:05:FS02:Paused
06:52:05:FS02:Shutting core down
06:52:09:WU00:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
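
Since FAHControl does not flag this condition, the stall in the log above could in principle be spotted by a small script that checks when each slot last reported progress. The following is a minimal sketch, not a tested monitoring tool: the patterns are based only on the "Completed ... out of ... steps" lines shown above, the function name and the 30-minute threshold are arbitrary choices, and it assumes the log is in chronological order and does not cross midnight.

```python
import re
from datetime import timedelta

# Any log line starts with an HH:MM:SS timestamp.
TIME_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}):")
# Progress lines as they appear in the log above, e.g.
# 04:54:58:WU01:FS00:0x15:Completed         3 out of 50000000 steps (0%).
PROGRESS_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:WU\d+:(FS\d+):0x[0-9a-fA-F]+:"
    r"Completed\s+\d+ out of \d+ steps"
)


def stalled_slots(lines, threshold=timedelta(minutes=30)):
    """Return {slot: time_since_last_progress} for slots that look stalled.

    A slot counts as stalled when its most recent progress line is older
    than `threshold`, measured against the last timestamp in the log.
    """
    last_progress = {}
    latest = None
    for line in lines:
        line = line.strip()
        stamp = TIME_RE.match(line)
        if not stamp:
            continue
        latest = timedelta(
            hours=int(stamp.group(1)),
            minutes=int(stamp.group(2)),
            seconds=int(stamp.group(3)),
        )
        progress = PROGRESS_RE.match(line)
        if progress:
            last_progress[progress.group(1)] = latest
    if latest is None:
        return {}
    return {
        slot: latest - seen
        for slot, seen in last_progress.items()
        if latest - seen > threshold
    }
```

Run against the log above, the two GPU slots (FS00 and FS01) stop reporting progress around 04:45-04:55 while the CPU slot (FS02) keeps going, so only the GPU slots would be expected to exceed the threshold.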
thebluebumblebee
 
Posts: 15
Joined: Sat Feb 28, 2009 6:17 pm

Re: Folding strangles system

Post by P5-133XL » Wed Mar 20, 2013 9:17 am

Try upgrading to v7.3.6 of the folding client. Its defaults are different (power level Medium) and are specifically designed not to interfere with your use of the machine: the GPU folds only when the machine is idle, and CPU folding does not use all of your cores. Do not set the power level to Full, or you will end up in the same situation, with folding using everything all the time.
P5-133XL
Site Moderator
 
Posts: 4006
Joined: Sun Dec 02, 2007 4:36 am
Location: Salem. OR USA

Re: Folding strangles system

Post by bruce » Wed Mar 20, 2013 6:35 pm

Nobody has clearly identified the cause of this issue, so without some additional problem-determination steps, Development will not be able to know what to fix.

I'm pretty confident that it's a GPU problem, perhaps a new problem, perhaps a serious manifestation of the so-called "screen-lag" problem. That means it might be fixed someday by a new version of the drivers. Unfortunately FAH doesn't really have any influence on what or when driver-developers offer new versions.

To isolate the problem, there are some things you can try.

Every application (including Windows itself) starts with the assumption that the GPU is much faster than the CPU when responding to a mouse click or when displaying something on the screen. They also assume that the GPU is idle most of the time, an assumption that is terribly false when you're folding. Windows, browsers, programs which display videos, etc. all have settings which allow you to disable GPU optimization, though each one must be configured separately. If setting Windows to use the CPU rather than the GPU fixes the problem, at least we'll know where to look for the root cause.

Another potential problem is that you have over-committed the memory on the GPU board. I would certainly find a utility that reports memory utilization and see what it tells you. The fact that you were able to clear the problem after some time seems to suggest that this might be the problem. If your GPUs have SLI enabled, I would disable that, since it limits the amount of VRAM that can be used by applications in each GPU and it does nothing for FAH.
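
As one way of checking memory utilization, here is a rough sketch that reads per-GPU memory figures from NVIDIA's `nvidia-smi` tool. Treat it as an assumption-laden example rather than a recommendation: it assumes `nvidia-smi` is installed and on the PATH, and that your driver reports `memory.used` and `memory.total` for these cards (older GeForce boards may return "[Not Supported]" for some fields, which the parser turns into `None`). The function names are made up for illustration.

```python
import subprocess


def _to_int(field):
    """Convert a CSV field to int, or None if the driver did not report it."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_memory_csv(text):
    """Parse 'index, used, total' rows (values in MiB) into a list of dicts."""
    gpus = []
    for row in text.strip().splitlines():
        index, used, total = (field.strip() for field in row.split(","))
        gpus.append(
            {
                "index": _to_int(index),
                "used_mib": _to_int(used),
                "total_mib": _to_int(total),
            }
        )
    return gpus


def query_gpu_memory():
    """Ask nvidia-smi for memory usage; assumes the tool is on the PATH."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout)
```

Calling `query_gpu_memory()` while folding is running, and again while the system is sluggish, would show whether used memory is close to the total on either card. A GUI monitoring utility that shows VRAM usage would serve the same purpose.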
bruce
Site Admin
 
Posts: 16882
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

