GTX 780 & Core 17 problems

It seems that a lot of GPU problems revolve around specific versions of drivers. Though NVidia has their own support structure, you can often learn from information reported by others who fold.


Re: GTX 780 & Core 17 problems

Postby DocJonz » Thu Jul 25, 2013 10:16 pm

Update:
I'm now running the two GTX 780s on different Win7 machines with driver 320.49.
On one, the machine seems to throw out a bad WU every 36 hours (it has done so 3 times now). I managed to be in the right place at the right time for the last one, and a Windows message read "display driver nVidia Windows kernel mode driver, version 320.49 stopped responding and has successfully recovered". Although Windows continued running, it seemed to corrupt the WU and carry on. Warnings/Errors below; the first WU error is different from the next two:

Code: Select all
*********************** Log Started 2013-07-21T08:33:04Z ***********************
08:33:36:WARNING:Exception: 9:127.0.0.1: Send error: 10053: An established connection was aborted by the software in your host machine.
08:33:38:WARNING:Exception: 10:127.0.0.1: Send error: 10053: An established connection was aborted by the software in your host machine.
******************************* Date: 2013-07-21 *******************************
******************************* Date: 2013-07-21 *******************************
******************************* Date: 2013-07-22 *******************************
******************************* Date: 2013-07-22 *******************************
******************************* Date: 2013-07-22 *******************************
20:22:48:WU01:FS00:0x17:ERROR:exception: Error invoking kernel finishSpreadCharge: clEnqueueNDRangeKernel (-5)
20:22:48:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
******************************* Date: 2013-07-22 *******************************
******************************* Date: 2013-07-23 *******************************
******************************* Date: 2013-07-23 *******************************
******************************* Date: 2013-07-23 *******************************
******************************* Date: 2013-07-23 *******************************
******************************* Date: 2013-07-24 *******************************
08:13:26:WU01:FS00:0x17:ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-36)
08:13:27:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
******************************* Date: 2013-07-24 *******************************
******************************* Date: 2013-07-24 *******************************
******************************* Date: 2013-07-24 *******************************
******************************* Date: 2013-07-25 *******************************
******************************* Date: 2013-07-25 *******************************
******************************* Date: 2013-07-25 *******************************
******************************* Date: 2013-07-25 *******************************
******************************* Date: 2013-07-25 *******************************
20:50:58:WU02:FS00:0x17:ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-36)
20:50:59:WARNING:WU02:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)


The other machine has dropped out once so far, with the same error as the first WU error on the other machine. Warnings/Errors below:

Code: Select all
*********************** Log Started 2013-07-23T06:29:01Z ***********************
06:30:18:WARNING:Exception: 8:127.0.0.1: Send error: 10053: An established connection was aborted by the software in your host machine.
06:30:20:WARNING:Exception: 9:127.0.0.1: Send error: 10053: An established connection was aborted by the software in your host machine.
******************************* Date: 2013-07-23 *******************************
******************************* Date: 2013-07-23 *******************************
******************************* Date: 2013-07-24 *******************************
******************************* Date: 2013-07-24 *******************************
******************************* Date: 2013-07-24 *******************************
18:19:10:WU01:FS00:0x17:ERROR:exception: Error invoking kernel finishSpreadCharge: clEnqueueNDRangeKernel (-5)
18:19:11:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
******************************* Date: 2013-07-24 *******************************
******************************* Date: 2013-07-25 *******************************
******************************* Date: 2013-07-25 *******************************
******************************* Date: 2013-07-25 *******************************
******************************* Date: 2013-07-25 *******************************
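The "every 36hrs" spacing can be checked directly from the log timestamps. Below is a minimal Python sketch, assuming the log layout shown in the excerpts above (date banners followed by `HH:MM:SS:`-prefixed lines). The function names `bad_wu_times` and `hours_between` are made up for illustration and are not part of FAHClient.

```python
import re
from datetime import datetime

# Date banners look like "*** Date: 2013-07-22 ***" or "*** Log Started 2013-07-21T08:33:04Z ***".
DATE_RE = re.compile(r"(?:Date: |Log Started )(\d{4}-\d{2}-\d{2})")
TIME_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}):")


def bad_wu_times(lines):
    """Return a datetime for every BAD_WORK_UNIT line, using the last date banner seen."""
    current_date = None
    times = []
    for line in lines:
        banner = DATE_RE.search(line)
        if banner:
            current_date = banner.group(1)
            continue
        stamp = TIME_RE.match(line)
        if stamp and current_date and "BAD_WORK_UNIT" in line:
            times.append(
                datetime.strptime(f"{current_date} {stamp.group(1)}", "%Y-%m-%d %H:%M:%S")
            )
    return times


def hours_between(times):
    """Hours between consecutive failures, rounded to one decimal place."""
    return [round((b - a).total_seconds() / 3600, 1) for a, b in zip(times, times[1:])]


# Lines copied from the first machine's log excerpt.
sample = [
    "*********************** Log Started 2013-07-21T08:33:04Z ***********************",
    "******************************* Date: 2013-07-22 *******************************",
    "20:22:48:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
    "******************************* Date: 2013-07-24 *******************************",
    "08:13:27:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
    "******************************* Date: 2013-07-25 *******************************",
    "20:50:59:WARNING:WU02:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
]

print(hours_between(bad_wu_times(sample)))
```

For the three failures on the first machine this gives gaps of roughly 35.8 and 36.6 hours, which matches the reported 36-hour pattern.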
DocJonz
 
Posts: 211
Joined: Thu Dec 06, 2007 7:31 pm
Location: United Kingdom

Re: GTX 780 & Core 17 problems

Postby ChristianVirtual » Thu Jul 25, 2013 10:55 pm

I changed the PSU to a Seasonic 1050 with a huge single 12V rail, but I still have an issue with the 780 after about two days of work, on Ubuntu 13.04 with the xorg-edgers 319 driver. The system becomes very laggy; the GUI does not respond at all, and even a shell via ssh is slow.

The TPF gets quite long, more than 5 min; it is still crunching, but very slowly.

Image

In the log file there is nothing special:

Code: Select all
16:33:34:WU00:FS00:Started FahCore on PID 8590
16:33:34:WU00:FS00:Core PID:8594
16:33:34:WU00:FS00:FahCore 0x17 started
16:33:34:WU00:FS00:0x17:*********************** Log Started 2013-07-25T16:33:34Z ***********************
16:33:34:WU00:FS00:0x17:Project: 7810 (Run 0, Clone 454, Gen 8)
16:33:34:WU00:FS00:0x17:Unit: 0x000000090a3b1e8651d34ac90a99b0f7
16:33:34:WU00:FS00:0x17:CPU: 0x00000000000000000000000000000000
16:33:34:WU00:FS00:0x17:Machine: 0
16:33:34:WU00:FS00:0x17:Reading tar file state.xml
16:33:34:WU00:FS00:0x17:Reading tar file system.xml
16:33:34:WU00:FS00:0x17:Reading tar file integrator.xml
16:33:34:WU00:FS00:0x17:Reading tar file core.xml
16:33:34:WU00:FS00:0x17:Digital signatures verified
16:33:41:WU02:FS00:Upload 84.97%
16:33:49:WU02:FS00:Upload complete
16:33:49:WU02:FS00:Server responded WORK_ACK (400)
16:33:49:WU02:FS00:Final credit estimate, 10404.00 points
16:33:49:WU02:FS00:Cleaning up
16:44:59:WU00:FS00:0x17:Completed 0 out of 2000000 steps (0%)
16:47:31:WU00:FS00:0x17:Completed 20000 out of 2000000 steps (1%)
16:49:39:WU00:FS00:0x17:Completed 40000 out of 2000000 steps (2%)
16:51:23:WU00:FS00:0x17:Completed 60000 out of 2000000 steps (3%)
16:53:30:WU00:FS00:0x17:Completed 80000 out of 2000000 steps (4%)

16:59:07:WU00:FS00:0x17:Completed 100000 out of 2000000 steps (5%)

17:00:51:WU00:FS00:0x17:Completed 120000 out of 2000000 steps (6%)
17:02:58:WU00:FS00:0x17:Completed 140000 out of 2000000 steps (7%)
17:04:42:WU00:FS00:0x17:Completed 160000 out of 2000000 steps (8%)
17:06:50:WU00:FS00:0x17:Completed 180000 out of 2000000 steps (9%)
17:08:27:WU00:FS00:0x17:Completed 200000 out of 2000000 steps (10%)
17:11:41:WU00:FS00:0x17:Completed 220000 out of 2000000 steps (11%)
17:13:18:WU00:FS00:0x17:Completed 240000 out of 2000000 steps (12%)
17:15:33:WU00:FS00:0x17:Completed 260000 out of 2000000 steps (13%)
17:17:40:WU00:FS00:0x17:Completed 280000 out of 2000000 steps (14%)
17:19:17:WU00:FS00:0x17:Completed 300000 out of 2000000 steps (15%)
17:21:31:WU00:FS00:0x17:Completed 320000 out of 2000000 steps (16%)
17:23:38:WU00:FS00:0x17:Completed 340000 out of 2000000 steps (17%)
17:25:52:WU00:FS00:0x17:Completed 360000 out of 2000000 steps (18%)

17:30:29:WU00:FS00:0x17:Completed 380000 out of 2000000 steps (19%)

17:32:07:WU00:FS00:0x17:Completed 400000 out of 2000000 steps (20%)
17:34:21:WU00:FS00:0x17:Completed 420000 out of 2000000 steps (21%)
17:35:58:WU00:FS00:0x17:Completed 440000 out of 2000000 steps (22%)
17:38:12:WU00:FS00:0x17:Completed 460000 out of 2000000 steps (23%)
17:39:49:WU00:FS00:0x17:Completed 480000 out of 2000000 steps (24%)
17:41:57:WU00:FS00:0x17:Completed 500000 out of 2000000 steps (25%)
17:44:10:WU00:FS00:0x17:Completed 520000 out of 2000000 steps (26%)
17:45:48:WU00:FS00:0x17:Completed 540000 out of 2000000 steps (27%)
17:48:02:WU00:FS00:0x17:Completed 560000 out of 2000000 steps (28%)
17:50:09:WU00:FS00:0x17:Completed 580000 out of 2000000 steps (29%)

17:55:16:WU00:FS00:0x17:Completed 600000 out of 2000000 steps (30%)

17:57:30:WU00:FS00:0x17:Completed 620000 out of 2000000 steps (31%)
17:59:37:WU00:FS00:0x17:Completed 640000 out of 2000000 steps (32%)
18:01:21:WU00:FS00:0x17:Completed 660000 out of 2000000 steps (33%)
18:03:28:WU00:FS00:0x17:Completed 680000 out of 2000000 steps (34%)
18:06:06:WU00:FS00:0x17:Completed 700000 out of 2000000 steps (35%)
18:07:49:WU00:FS00:0x17:Completed 720000 out of 2000000 steps (36%)
18:09:56:WU00:FS00:0x17:Completed 740000 out of 2000000 steps (37%)
18:11:40:WU00:FS00:0x17:Completed 760000 out of 2000000 steps (38%)
18:14:18:WU00:FS00:0x17:Completed 780000 out of 2000000 steps (39%)
18:16:55:WU00:FS00:0x17:Completed 800000 out of 2000000 steps (40%)
18:19:09:WU00:FS00:0x17:Completed 820000 out of 2000000 steps (41%)
18:20:46:WU00:FS00:0x17:Completed 840000 out of 2000000 steps (42%)
18:23:00:WU00:FS00:0x17:Completed 860000 out of 2000000 steps (43%)
18:25:07:WU00:FS00:0x17:Completed 880000 out of 2000000 steps (44%)
18:26:45:WU00:FS00:0x17:Completed 900000 out of 2000000 steps (45%)
18:28:58:WU00:FS00:0x17:Completed 920000 out of 2000000 steps (46%)
18:31:05:WU00:FS00:0x17:Completed 940000 out of 2000000 steps (47%)
18:33:49:WU00:FS00:0x17:Completed 960000 out of 2000000 steps (48%)
18:35:26:WU00:FS00:0x17:Completed 980000 out of 2000000 steps (49%)
18:37:34:WU00:FS00:0x17:Completed 1000000 out of 2000000 steps (50%)
18:39:48:WU00:FS00:0x17:Completed 1020000 out of 2000000 steps (51%)
18:41:55:WU00:FS00:0x17:Completed 1040000 out of 2000000 steps (52%)
18:43:39:WU00:FS00:0x17:Completed 1060000 out of 2000000 steps (53%)
18:45:46:WU00:FS00:0x17:Completed 1080000 out of 2000000 steps (54%)
18:47:54:WU00:FS00:0x17:Completed 1100000 out of 2000000 steps (55%)
18:50:07:WU00:FS00:0x17:Completed 1120000 out of 2000000 steps (56%)
18:52:15:WU00:FS00:0x17:Completed 1140000 out of 2000000 steps (57%)
18:53:59:WU00:FS00:0x17:Completed 1160000 out of 2000000 steps (58%)

18:58:36:WU00:FS00:0x17:Completed 1180000 out of 2000000 steps (59%)

19:00:13:WU00:FS00:0x17:Completed 1200000 out of 2000000 steps (60%)
19:02:27:WU00:FS00:0x17:Completed 1220000 out of 2000000 steps (61%)
19:04:34:WU00:FS00:0x17:Completed 1240000 out of 2000000 steps (62%)
19:06:18:WU00:FS00:0x17:Completed 1260000 out of 2000000 steps (63%)
19:08:26:WU00:FS00:0x17:Completed 1280000 out of 2000000 steps (64%)
19:10:33:WU00:FS00:0x17:Completed 1300000 out of 2000000 steps (65%)
19:12:47:WU00:FS00:0x17:Completed 1320000 out of 2000000 steps (66%)
19:15:55:WU00:FS00:0x17:Completed 1340000 out of 2000000 steps (67%)
19:17:38:WU00:FS00:0x17:Completed 1360000 out of 2000000 steps (68%)
19:19:46:WU00:FS00:0x17:Completed 1380000 out of 2000000 steps (69%)
19:21:53:WU00:FS00:0x17:Completed 1400000 out of 2000000 steps (70%)
19:24:07:WU00:FS00:0x17:Completed 1420000 out of 2000000 steps (71%)
19:24:36:FS01:Finishing
19:24:45:FS00:Finishing
19:25:44:WU00:FS00:0x17:Completed 1440000 out of 2000000 steps (72%)
19:27:58:WU00:FS00:0x17:Completed 1460000 out of 2000000 steps (73%)
19:30:35:WU00:FS00:0x17:Completed 1480000 out of 2000000 steps (74%)
19:32:13:WU00:FS00:0x17:Completed 1500000 out of 2000000 steps (75%)
19:34:27:WU00:FS00:0x17:Completed 1520000 out of 2000000 steps (76%)
19:36:34:WU00:FS00:0x17:Completed 1540000 out of 2000000 steps (77%)
19:38:47:WU00:FS00:0x17:Completed 1560000 out of 2000000 steps (78%)
19:40:25:WU00:FS00:0x17:Completed 1580000 out of 2000000 steps (79%)
19:43:02:WU00:FS00:0x17:Completed 1600000 out of 2000000 steps (80%)
19:45:46:WU00:FS00:0x17:Completed 1620000 out of 2000000 steps (81%)
19:47:53:WU00:FS00:0x17:Completed 1640000 out of 2000000 steps (82%)
19:50:07:WU00:FS00:0x17:Completed 1660000 out of 2000000 steps (83%)
19:53:14:WU00:FS00:0x17:Completed 1680000 out of 2000000 steps (84%)
19:54:52:WU00:FS00:0x17:Completed 1700000 out of 2000000 steps (85%)
19:57:05:WU00:FS00:0x17:Completed 1720000 out of 2000000 steps (86%)
19:58:43:WU00:FS00:0x17:Completed 1740000 out of 2000000 steps (87%)
20:00:56:WU00:FS00:0x17:Completed 1760000 out of 2000000 steps (88%)
20:03:04:WU00:FS00:0x17:Completed 1780000 out of 2000000 steps (89%)
20:04:41:WU00:FS00:0x17:Completed 1800000 out of 2000000 steps (90%)
20:07:25:WU00:FS00:0x17:Completed 1820000 out of 2000000 steps (91%)
20:09:33:WU00:FS00:0x17:Completed 1840000 out of 2000000 steps (92%)
20:11:16:WU00:FS00:0x17:Completed 1860000 out of 2000000 steps (93%)
20:13:24:WU00:FS00:0x17:Completed 1880000 out of 2000000 steps (94%)
20:15:31:WU00:FS00:0x17:Completed 1900000 out of 2000000 steps (95%)
20:18:45:WU00:FS00:0x17:Completed 1920000 out of 2000000 steps (96%)
20:20:52:WU00:FS00:0x17:Completed 1940000 out of 2000000 steps (97%)
20:22:35:WU00:FS00:0x17:Completed 1960000 out of 2000000 steps (98%)
20:24:43:WU00:FS00:0x17:Completed 1980000 out of 2000000 steps (99%)
20:26:50:WU00:FS00:0x17:Completed 2000000 out of 2000000 steps (100%)
20:26:57:WU00:FS00:0x17:Saving result file logfile_01.txt
20:26:57:WU00:FS00:0x17:Saving result file checkpointState.xml
20:26:58:WU00:FS00:0x17:Saving result file checkpt.crc
20:26:58:WU00:FS00:0x17:Saving result file log.txt
20:26:58:WU00:FS00:0x17:Saving result file positions.xtc
20:27:00:WU00:FS00:0x17:Folding@home Core Shutdown: FINISHED_UNIT
20:32:38:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
20:32:38:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:7810 run:0 clone:454 gen:8 core:0x17 unit:0x000000090a3b1e8651d34ac90a99b0f7
20:32:38:WU00:FS00:Uploading 5.78MiB to 171.64.65.98
20:32:38:WU00:FS00:Connecting to 171.64.65.98:8080
20:32:58:WU00:FS00:Upload 1.08%
20:33:06:WU00:FS00:Upload 87.63%
20:33:11:WU00:FS00:Upload complete
20:33:11:WU00:FS00:Server responded WORK_ACK (400)
20:33:11:WU00:FS00:Final credit estimate, 13169.00 points
20:33:12:WU00:FS00:Cleaning up
20:45:39:FS01:Paused
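The long frames can be picked out of a log like the one above automatically. The following is a rough Python sketch, assuming the `Completed N out of M steps (P%)` line format shown in the log. The names `frame_times` and `slow_frames` and the 300-second threshold are illustrative choices, not anything provided by the client.

```python
import re
from datetime import datetime

STEP_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):WU\d+:FS\d+:0x[0-9a-fA-F]+:"
    r"Completed (\d+) out of (\d+) steps \((\d+)%\)"
)


def frame_times(lines):
    """Return (percent, seconds since the previous progress line) pairs."""
    result = []
    previous = None
    for line in lines:
        match = STEP_RE.match(line)
        if not match:
            continue
        now = datetime.strptime(match.group(1), "%H:%M:%S")
        if previous is not None:
            delta = (now - previous).total_seconds()
            if delta < 0:
                delta += 24 * 3600  # the log only carries time of day, so handle midnight
            result.append((int(match.group(4)), int(delta)))
        previous = now
    return result


def slow_frames(lines, threshold_s=300):
    """Frames that took longer than threshold_s seconds."""
    return [(pct, secs) for pct, secs in frame_times(lines) if secs > threshold_s]


# Lines copied from the log above (1% to 6%).
sample = [
    "16:47:31:WU00:FS00:0x17:Completed 20000 out of 2000000 steps (1%)",
    "16:49:39:WU00:FS00:0x17:Completed 40000 out of 2000000 steps (2%)",
    "16:51:23:WU00:FS00:0x17:Completed 60000 out of 2000000 steps (3%)",
    "16:53:30:WU00:FS00:0x17:Completed 80000 out of 2000000 steps (4%)",
    "16:59:07:WU00:FS00:0x17:Completed 100000 out of 2000000 steps (5%)",
    "17:00:51:WU00:FS00:0x17:Completed 120000 out of 2000000 steps (6%)",
]

print(slow_frames(sample))
```

On these six lines, the only frame over five minutes is the one ending at 5% (337 seconds), while the neighbouring frames take about two minutes each. The sketch assumes only one WU is reporting progress in the lines it is given.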


In the system log file, however, I get a number of messages like this:

Code: Select all
Jul 26 00:12:18 linuxpowered kernel: [128842.721290] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 26 00:12:20 linuxpowered kernel: [128844.717402] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 26 00:12:22 linuxpowered kernel: [128846.713513] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context


Xorg.0.log contains the following:
Code: Select all
[129664.787] (WW) NVIDIA(0): WAIT (2, 8, 0x8000, 0x00003818, 0x00005384)
[129671.787] (WW) NVIDIA(0): WAIT (1, 8, 0x8000, 0x00003818, 0x00005384)
[129994.890] (WW) NVIDIA(0): WAIT (2, 8, 0x8000, 0x0000ee7c, 0x0000f47c)
[130001.890] (WW) NVIDIA(0): WAIT (1, 8, 0x8000, 0x0000ee7c, 0x0000f47c)



It even subsequently affects a CPU folding core on the same system:
Code: Select all
Jul 26 00:12:36 linuxpowered kernel: [128861.124503] BUG: soft lockup - CPU#1 stuck for 23s! [FahCore_a3:5078]
Jul 26 00:12:36 linuxpowered kernel: [128861.124506] Modules linked in: parport_pc(F) ppdev(F) rfcomm bnep bluetooth snd_hda_codec_hdmi coretemp snd_hda_codec_realtek kvm_intel kvm ghash_clmulni_intel(F) aesni_intel(F) aes_x86_64(F) xts(F) lrw(F) gf128mul(F) ablk_helper(F) cryptd(F) snd_hda_intel snd_hda_codec snd_hwdep(F) snd_pcm(F) gpio_ich snd_page_alloc(F) nvidia(POF) snd_seq_midi(F) snd_seq_midi_event(F) snd_rawmidi(F) snd_seq(F) snd_seq_device(F) snd_timer(F) drm snd(F) mac_hid lpc_ich psmouse(F) mei soundcore(F) lp(F) parport(F) microcode(F) serio_raw(F) hid_generic usbhid hid ahci(F) libahci(F) e1000e(F)
Jul 26 00:12:36 linuxpowered kernel: [128861.124541] CPU 1
Jul 26 00:12:36 linuxpowered kernel: [128861.124545] Pid: 5078, comm: FahCore_a3 Tainted: PF          O 3.8.0-19-generic #30-Ubuntu Supermicro C7Q67/C7Q67
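To see how often these kernel messages occur and which process gets stuck, the system log can be scanned with a few regular expressions. This is only a sketch in Python, assuming the message wording shown in the excerpts above; the function name `summarize` is hypothetical.

```python
import re
from collections import Counter

NVRM_RE = re.compile(r"NVRM: os_schedule: Attempted to yield the CPU")
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^\]:]+):(\d+)\]"
)


def summarize(lines):
    """Count NVRM yield warnings and collect details of each soft lockup."""
    counts = Counter()
    lockups = []
    for line in lines:
        if NVRM_RE.search(line):
            counts["nvrm_yield"] += 1
        match = LOCKUP_RE.search(line)
        if match:
            counts["soft_lockup"] += 1
            lockups.append(
                {
                    "cpu": int(match.group(1)),
                    "seconds": int(match.group(2)),
                    "comm": match.group(3),
                    "pid": int(match.group(4)),
                }
            )
    return counts, lockups


# Lines copied from the kernel log excerpts above.
sample = [
    "Jul 26 00:12:18 linuxpowered kernel: [128842.721290] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context",
    "Jul 26 00:12:20 linuxpowered kernel: [128844.717402] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context",
    "Jul 26 00:12:22 linuxpowered kernel: [128846.713513] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context",
    "Jul 26 00:12:36 linuxpowered kernel: [128861.124503] BUG: soft lockup - CPU#1 stuck for 23s! [FahCore_a3:5078]",
]

counts, lockups = summarize(sample)
print(counts["nvrm_yield"], lockups)
```

Run over a full log, the counts per day would show whether the NVRM warnings start before the slowdown or only once it is already under way.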


The 780 is from Gigabyte; no overclock. The only "deviation" from factory settings is fan control enabled with Coolbits 4, with the fan set to 75% to keep the GPU around 62C.

Any ideas? Any similar experiences?
ChristianVirtual
 
Posts: 1596
Joined: Tue May 28, 2013 1:14 pm
Location: Tokyo

Re: GTX 780 & Core 17 problems

Postby bollix47 » Thu Jul 25, 2013 11:03 pm

I've had a similar experience with my GTX 780 (no o/c ... temps/power fine). Using the 32x.xx drivers (including the latest beta), every 36-48 hours I'd notice a mild drop in PPD and, if I didn't reboot as soon as I noticed the drop, the computer would eventually freeze up and I had to do a hard reboot (never lost a WU). This behavior occurred in Windows 7 64-bit and Linux 13.04 64-bit. In Linux, nvidia-settings would not even load once the PPD started its drop, and if I tried to load it the mouse and keyboard would stop working and only a hard reboot solved the problem.

There are lots of threads on the internet about the gtx 7xx series of GPUs and the 32x.xx drivers. Some thought it was caused by a memory leak. There was a suggestion to use 314.xx, with a modified inf file so the cards are recognized, but I've been unsuccessful at getting that to work so far.

What I'm doing now to avoid the problem is to reboot the system daily. End of problems. No slow downs and no freezing. The consensus on the web appears to be that this is driver related and hopefully it will be fixed by Nvidia soon.
bollix47
 
Posts: 2871
Joined: Sun Dec 02, 2007 6:04 am
Location: Canada

Re: GTX 780 & Core 17 problems

Postby ChristianVirtual » Thu Jul 25, 2013 11:31 pm

@bollix47: Yes, I saw the deep freeze the other day. I lost the time spent on a WU; the client restarted at 0% and crunched it again. Bad luck.

It seems that a frequent reboot is the only way right now. Do you happen to know a way to set the fan via the shell? I tried with nvidia-settings and smi but can't really find the right parameter. As I would like to automate the reboot, I need a way to reset the fan speed afterwards.
ChristianVirtual
 
Posts: 1596
Joined: Tue May 28, 2013 1:14 pm
Location: Tokyo

Re: GTX 780 & Core 17 problems

Postby 7im » Thu Jul 25, 2013 11:31 pm

Fresh Win 7 install, fully patched, and GTX 760 with 320.xx drivers. Blue screens once a day. I set autologin, and it reboots and starts folding again.
7im
 
Posts: 10189
Joined: Thu Nov 29, 2007 5:30 pm
Location: Arizona

Re: GTX 780 & Core 17 problems

Postby bollix47 » Thu Jul 25, 2013 11:41 pm

@ChristianVirtual

Look at this post: about two thirds of the way down you'll see how I use a bash script to set the fan speed after a reboot (If you find your new fan setting does not survive a reboot ... ).
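The linked script is not reproduced in this thread, so the following is not bollix47's script, only a hedged Python sketch of the same idea. It assumes Coolbits 4 is already enabled (as mentioned earlier in the thread), that X runs on display `:0`, and that the driver accepts the `nvidia-settings` attributes `GPUFanControlState` and `GPUCurrentFanSpeed`. Newer drivers use `GPUTargetFanSpeed` instead, so the attribute name is a parameter. The helper names `fan_command` and `set_fan` are made up.

```python
import os
import subprocess


def fan_command(percent, gpu=0, fan=0, speed_attr="GPUCurrentFanSpeed"):
    """Build the nvidia-settings argument list for a fixed fan speed."""
    if not 0 <= percent <= 100:
        raise ValueError("fan speed must be between 0 and 100 percent")
    return [
        "nvidia-settings",
        "-a", f"[gpu:{gpu}]/GPUFanControlState=1",  # switch to manual fan control
        "-a", f"[fan:{fan}]/{speed_attr}={percent}",
    ]


def set_fan(percent, display=":0"):
    """Run nvidia-settings against the given X display (assumed to be :0)."""
    env = dict(os.environ, DISPLAY=display)
    subprocess.run(fan_command(percent), check=True, env=env)


print(" ".join(fan_command(75)))
```

Calling `set_fan(75)` from a login or startup script would restore the 75% setting after an automated reboot, provided an X session is already running for the user who calls it.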
bollix47
 
Posts: 2871
Joined: Sun Dec 02, 2007 6:04 am
Location: Canada

Re: GTX 780 & Core 17 problems

Postby bruce » Thu Jul 25, 2013 11:42 pm

Has anybody looked for a memory leak?
bruce
 
Posts: 20019
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: GTX 780 & Core 17 problems

Postby bollix47 » Thu Jul 25, 2013 11:47 pm

I've started to record memory usage (system, GPU, core_17) as well as temperatures. I will check it every day around the same time (when possible) before the current WU finishes (90% to 99%), reboot after the WU finishes, and check again at 2% to 4% of the next WU. I will report if I find anything of interest. The problem may not show up unless I let it run longer, but if it gets to the point where it freezes I won't be able to check the readings.
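On Linux, the system-memory part of such a record can be taken from `/proc/meminfo` and appended to a file, so the last reading before a freeze survives the hard reboot. This is a minimal Python sketch. GPU memory and temperatures are not covered, and the names `parse_meminfo`, `available_kib` and `append_sample` are illustrative. The 3.8 kernels mentioned in this thread predate the `MemAvailable` field, so the sketch falls back to free + buffers + cached.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_kib(info):
    """Use MemAvailable when present, else approximate with free + buffers + cached."""
    if "MemAvailable" in info:
        return info["MemAvailable"]
    return info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)


def append_sample(csv_path, meminfo_path="/proc/meminfo"):
    """Append 'unix_time,available_kib' to csv_path; run it from cron or a loop."""
    with open(meminfo_path) as source:
        info = parse_meminfo(source.read())
    with open(csv_path, "a") as out:
        out.write(f"{int(time.time())},{available_kib(info)}\n")


# Values taken from the first TOP snapshot later in this thread, in /proc/meminfo layout.
sample = """MemTotal:        8134468 kB
MemFree:         6052720 kB
Buffers:          173956 kB
Cached:           724956 kB
"""

print(available_kib(parse_meminfo(sample)))
```

Running `append_sample` once a minute gives a plain CSV file that can be plotted against WU progress later.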
bollix47
 
Posts: 2871
Joined: Sun Dec 02, 2007 6:04 am
Location: Canada

Re: GTX 780 & Core 17 problems

Postby PantherX » Fri Jul 26, 2013 1:09 am

Not sure if this would work, but if you set up GPU-Z to log to file and let it run, I think it just might be able to get the last data point before the crash. When you do a hard reset, you can look at the log file and see if there is any valuable data or not.
PantherX
Site Moderator
 
Posts: 6765
Joined: Wed Dec 23, 2009 10:33 am
Location: Land Of The Long White Cloud

Re: GTX 780 & Core 17 problems

Postby bollix47 » Fri Jul 26, 2013 1:17 am

Thanks for that suggestion PantherX. I'm running on Linux at the moment but will switch over in a few days and give that a try.
bollix47
 
Posts: 2871
Joined: Sun Dec 02, 2007 6:04 am
Location: Canada

Re: GTX 780 & Core 17 problems

Postby Grendel » Fri Jul 26, 2013 5:24 am

bruce wrote:Has anybody looked for a memory leak?

Under Windows you can see it in the task manager (or better Process Explorer) -- the system process' working set memory usage keeps growing. Here is an example.
Image
Grendel
 
Posts: 25
Joined: Mon Sep 22, 2008 8:16 pm
Location: OR, USA

Re: GTX 780 & Core 17 problems

Postby ChristianVirtual » Fri Jul 26, 2013 9:46 pm

After the first 12 hours of roughly monitoring available memory, I can't (yet) see a significant memory leak.
The red area is the available memory, while the blue line indicates the percentage of the WU done at that time.

Image

I need to tweak the recording to capture more detail.

The glitch and delay in the blue line around 2:00 am was the server outage.
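Whether a series of memory readings is trending downward can be judged with a simple least-squares slope rather than by eye. This is a Python sketch only: the function name `slope_per_hour` is made up, and the sample values are synthetic numbers chosen for illustration, not readings from this thread.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours, available_kib) pairs, in KiB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    denominator = sum((x - mean_x) ** 2 for x, _ in samples)
    if denominator == 0:
        raise ValueError("all samples share the same timestamp")
    return numerator / denominator


# Synthetic example: available memory falling by 100 KiB every hour.
synthetic = [(0, 1000), (1, 900), (2, 800)]
print(slope_per_hour(synthetic))
```

A slope that stays close to zero over a day or two would support the "no significant leak" reading. A steadily negative slope would point the other way, although page cache growth can also reduce free memory without any leak.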
ChristianVirtual
 
Posts: 1596
Joined: Tue May 28, 2013 1:14 pm
Location: Tokyo

Re: GTX 780 & Core 17 problems

Postby Nicolas_orleans » Sat Jul 27, 2013 8:21 am

@Christian

From the post quoted below (picture and log), it appears you are running the outdated 3.8.0-19 kernel. It may help to update to the latest 3.8.0-26 and reinstall 319.32 from the xorg-edgers repo, to be sure all modules are built against this specific kernel.

This combination of 3.8.0-26 with 319.32 is working for my GTX 770, though it's not the same chip.

ChristianVirtual wrote:
while in the system log file I get a number of message like this:

Code: Select all
Jul 26 00:12:18 linuxpowered kernel: [128842.721290] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 26 00:12:20 linuxpowered kernel: [128844.717402] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jul 26 00:12:22 linuxpowered kernel: [128846.713513] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context


Xorg.0.log contain the following
Code: Select all
[129664.787] (WW) NVIDIA(0): WAIT (2, 8, 0x8000, 0x00003818, 0x00005384)
[129671.787] (WW) NVIDIA(0): WAIT (1, 8, 0x8000, 0x00003818, 0x00005384)
[129994.890] (WW) NVIDIA(0): WAIT (2, 8, 0x8000, 0x0000ee7c, 0x0000f47c)
[130001.890] (WW) NVIDIA(0): WAIT (1, 8, 0x8000, 0x0000ee7c, 0x0000f47c)



And even subsequent impacting a CPU-folding core on the same system
Code: Select all
Jul 26 00:12:36 linuxpowered kernel: [128861.124503] BUG: soft lockup - CPU#1 stuck for 23s! [FahCore_a3:5078]
Jul 26 00:12:36 linuxpowered kernel: [128861.124506] Modules linked in: parport_pc(F) ppdev(F) rfcomm bnep bluetooth snd_hda_codec_hdmi coretemp snd_hda_codec_realtek kvm_intel kvm ghash_clmulni_intel(F) aesni_intel(F) aes_x86_64(F) xts(F) lrw(F) gf128mul(F) ablk_helper(F) cryptd(F) snd_hda_intel snd_hda_codec snd_hwdep(F) snd_pcm(F) gpio_ich snd_page_alloc(F) nvidia(POF) snd_seq_midi(F) snd_seq_midi_event(F) snd_rawmidi(F) snd_seq(F) snd_seq_device(F) snd_timer(F) drm snd(F) mac_hid lpc_ich psmouse(F) mei soundcore(F) lp(F) parport(F) microcode(F) serio_raw(F) hid_generic usbhid hid ahci(F) libahci(F) e1000e(F)
Jul 26 00:12:36 linuxpowered kernel: [128861.124541] CPU 1
Jul 26 00:12:36 linuxpowered kernel: [128861.124545] Pid: 5078, comm: FahCore_a3 Tainted: PF          O 3.8.0-19-generic #30-Ubuntu Supermicro C7Q67/C7Q67
Nicolas_orleans
 
Posts: 106
Joined: Wed Aug 08, 2012 4:08 am

Re: GTX 780 & Core 17 problems

Postby ChristianVirtual » Sat Jul 27, 2013 11:32 am

And it happened again ...

TOP also showed enough memory; interestingly, Xorg was using one full core.
Code: Select all
top - 18:55:21 up 1 day, 13:07,  2 users,  load average: 9.16, 9.18, 9.22
Tasks: 230 total,   4 running, 226 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us, 12.1 sy, 74.3 ni, 12.9 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem:   8134468 total,  2081748 used,  6052720 free,   173956 buffers
KiB Swap:  8343548 total,        0 used,  8343548 free,   724956 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
11483 fahclien  39  19  444m 186m 3120 S 499.4  2.3   5307:55 FahCore_a3
 1272 root      20   0  231m  80m  38m R  96.0  1.0  71:24.82 Xorg
    1 root      20   0 26936 2728 1432 S   0.0  0.0   0:00.78 init
...
11479 fahclien  39  19 96040 1788 1548 S   0.0  0.0   0:11.36 FAHCoreWrapper
15523 fahclien  39  19 96040 1792 1548 S   0.0  0.0   0:03.23 FAHCoreWrapper
15527 fahclien  39  19 20.6g 358m  20m D   0.0  4.5 218:04.72 FahCore_17
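The two things that stand out in this snapshot are FahCore_17 sitting in state `D` (uninterruptible sleep) at 0% CPU and Xorg in state `R` at 96%. A small Python sketch can flag both from `top` batch output, assuming the default 12-column layout shown above; the function name `parse_top_rows` and the 50% threshold are arbitrary choices for illustration.

```python
def parse_top_rows(lines):
    """Parse top process rows (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND)."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 12 or not fields[0].isdigit():
            continue  # skip headers, blank lines and the "..." placeholder
        rows.append(
            {
                "pid": int(fields[0]),
                "user": fields[1],
                "state": fields[7],
                "cpu": float(fields[8]),
                "command": fields[11],
            }
        )
    return rows


# Rows copied from the TOP snapshot above.
sample = [
    "  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND",
    "11483 fahclien  39  19  444m 186m 3120 S 499.4  2.3   5307:55 FahCore_a3",
    " 1272 root      20   0  231m  80m  38m R  96.0  1.0  71:24.82 Xorg",
    "    1 root      20   0 26936 2728 1432 S   0.0  0.0   0:00.78 init",
    "...",
    "15527 fahclien  39  19 20.6g 358m  20m D   0.0  4.5 218:04.72 FahCore_17",
]

rows = parse_top_rows(sample)
blocked = [r["command"] for r in rows if r["state"] == "D"]
busy_other = [
    r["command"]
    for r in rows
    if r["cpu"] > 50 and not r["command"].lower().startswith("fah")
]
print(blocked, busy_other)
```

Fed with the output of `top -b -n 1` every few minutes, the same check could record the moment FahCore_17 first drops into `D` state, which would help line the slowdown up against the kernel log messages.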



Memory still looked OK (as per Zabbix; different from TOP?)
Image

But system dynamics changed
Image


TOP after restart
Code: Select all
top - 19:40:38 up 9 min,  3 users,  load average: 5.54, 1.92, 0.72
Tasks: 226 total,   2 running, 224 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  5.8 sy, 81.7 ni, 11.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   8134468 total,  1475776 used,  6658692 free,    42372 buffers
KiB Swap:  8343548 total,        0 used,  8343548 free,   450728 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                                             
 2287 fahclien  39  19  447m 182m 3000 S 596.9  2.3   8:41.14 FahCore_a3                                           
 2292 fahclien  39  19 20.4g 277m  19m R  99.9  3.5   1:33.40 FahCore_17                                           
 2321 cl        20   0  626m  35m  17m S   1.7  0.4   0:02.18 FAHControl                                           
 1683 cl        20   0 1387m 115m  57m S   1.0  1.5   0:03.45 compiz                                               
  277 root      20   0     0    0    0 S   0.3  0.0   0:00.08 kworker/5:1                                         
 1114 root      20   0  186m  60m  30m S   0.3  0.8   0:03.67 Xorg                                                 
 1166 zabbix    20   0 92636 1988 1292 S   0.3  0.0   0:00.06 zabbix_agentd                                       
 1728 nobody    20   0 34044 1536 1288 S   0.3  0.0   0:00.05 dnsmasq                                             
 2276 fahclien  20   0 20.8g 9524 6448 S   0.3  0.1   0:00.42 FAHClient                                           
 2309 cl        20   0  548m  55m  24m S   0.3  0.7   0:02.79 nvidia-settings                                     
 2453 cl        20   0 94668 2084 1084 S   0.3  0.0   0:00.02 sshd                                                 
 2548 cl        20   0 25948 1796 1144 R   0.3  0.0   0:00.04 top               


My forensic skills are limited :oops:


@Nicolas_orleans, it might be a good idea to refresh the kernel, but I fear it will not help ... the same kernel previously drove a GTX 660 Ti on otherwise identical hardware.
ChristianVirtual
 
Posts: 1596
Joined: Tue May 28, 2013 1:14 pm
Location: Tokyo

Re: GTX 780 & Core 17 problems

Postby ChristianVirtual » Sat Jul 27, 2013 1:04 pm

For comparison the CPU load after restart ...

Image

Another TOP (more or less what I would expect):
Code: Select all
top - 21:41:48 up  2:10,  2 users,  load average: 7.06, 7.06, 7.05
Tasks: 223 total,   3 running, 220 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us,  8.2 sy, 79.4 ni, 11.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   8134468 total,  1585612 used,  6548856 free,    42676 buffers
KiB Swap:  8343548 total,        0 used,  8343548 free,   475540 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND           
 2287 fahclien  39  19  447m 184m 3036 S 598.1  2.3 732:40.45 FahCore_a3       
 2292 fahclien  39  19 20.6g 358m  20m R  99.7  4.5 122:09.05 FahCore_17       
 1114 root      20   0  218m  69m  39m S   2.0  0.9   3:36.92 Xorg             
 1683 cl        20   0 1411m 117m  57m S   1.3  1.5   1:41.41 compiz           
 2321 cl        20   0  626m  35m  17m S   1.0  0.4   2:03.13 FAHControl       
 2309 cl        20   0  548m  55m  24m S   0.7  0.7   0:41.92 nvidia-settings   
   80 root      20   0     0    0    0 S   0.3  0.0   0:06.72 kworker/1:1       
  277 root      20   0     0    0    0 S   0.3  0.0   0:05.77 kworker/5:1     
ChristianVirtual
 
Posts: 1596
Joined: Tue May 28, 2013 1:14 pm
Location: Tokyo
