Every once in a while...

Moderators: slegrand, Site Moderators, PandeGroup

Postby draeh » Wed Nov 10, 2010 1:38 pm

Every once in a while one of the 2 cards in my rig will stop working.

Here's the weird part...

When I restart the first card, the second will stop. If I restart the second card then the first stops again. The cores are still running, but no work takes place. Using nvclock shows that the card is idle (temps are all idle). I'm not sure what fixes this. I usually end up rebooting the box and starting fresh.


That being said, does Ubuntu Server run automated updates? If so, how can I stop them? IIRC, Ubuntu installs critical updates automatically by default, but I'm no expert on the subject.

Time for some digging.
draeh
 
Posts: 16
Joined: Fri May 02, 2008 12:51 pm

Re: Every once in a while...

Postby draeh » Wed Nov 10, 2010 1:44 pm

Well, I found the unattended-upgrades log file and nothing was installed last night when this latest occurrence happened.
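
For anyone checking the same thing: on Ubuntu the periodic upgrade behaviour is normally driven by APT::Periodic settings in a file under /etc/apt/apt.conf.d/ (often 20auto-upgrades or 10periodic; the exact filename varies by release, so treat the path as an assumption). A minimal Python sketch that reads text in that format and reports whether unattended upgrades are switched on; the function names and the sample text are made up for illustration:

```python
import re

# Matches lines such as: APT::Periodic::Unattended-Upgrade "1";
# Lines commented out with // do not match because of the leading anchor.
_SETTING = re.compile(
    r'^\s*APT::Periodic::([A-Za-z-]+)\s+"([^"]*)"\s*;', re.MULTILINE
)


def periodic_settings(conf_text):
    """Return the APT::Periodic keys found in apt.conf-style text as a dict."""
    return {key: value for key, value in _SETTING.findall(conf_text)}


def unattended_upgrades_enabled(conf_text):
    """True when Unattended-Upgrade is present and not set to "0"."""
    value = periodic_settings(conf_text).get("Unattended-Upgrade", "0")
    return value.strip() not in ("", "0")


# Hypothetical file contents, not copied from any machine in this thread.
SAMPLE = '''
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
'''

if __name__ == "__main__":
    print(unattended_upgrades_enabled(SAMPLE))
```

Setting Unattended-Upgrade to "0" in that file should switch the automatic installs off; running `sudo dpkg-reconfigure unattended-upgrades` is another way to toggle it.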
draeh
 
Posts: 16
Joined: Fri May 02, 2008 12:51 pm

Re: Every once in a while...

Postby bruce » Wed Nov 10, 2010 5:33 pm

draeh wrote:Every once in a while one of the 2 cards in my rig will stop working.
. . .
Time for some digging.

Just a suggestion (or a guess, really; no factual basis): what drivers are you using, and are there later ones?
bruce
Site Admin
 
Posts: 20912
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Every once in a while...

Postby draeh » Wed Nov 10, 2010 5:55 pm

I'm using CUDA v3.0 with driver 256.35, as described in the headless guide in this forum. There are newer versions of both CUDA and the driver, but I hesitate to change them for fear of killing work units.
draeh
 
Posts: 16
Joined: Fri May 02, 2008 12:51 pm

Re: Every once in a while...

Postby rwh202 » Mon Nov 15, 2010 9:14 pm

Hi, first post here, but I seem to be having the same issue.

This has happened twice recently on a pair of GT430s. In both instances the first card was folding a 6800 unit and stopped as soon as the second card picked up another 6800. Power draw and temps on the first card dropped to idle, but core15 was still running at 0% CPU.
After the first instance I switched off -advmethods and didn't see the problem again until this evening when both cards again tried to do 6800 WUs. It may be coincidence that these were all 6800 WUs, but I fear it may become a repeating occurrence now that they have come out of -advmethods.

Both cards are at stock speeds in a well-ventilated (Antec Skeleton!) case and are set to "prefer maximum performance", so they shouldn't have felt the need to downclock or anything.

I'm using the 260.19.12 drivers and CUDA 3.0. As far as I know, the later CUDA won't work with the wrappers, but I will give the latest drivers a whirl and see if this can magically improve things.

Rob
rwh202
 
Posts: 242
Joined: Mon Nov 15, 2010 8:51 pm
Location: South Coast, UK

Re: Every once in a while...

Postby draeh » Tue Nov 16, 2010 6:22 pm

In my case, I think that overheating may have been the culprit... On that note, I am having a problem with nvclock. When I read the temperature sensor on card #2 (a 9800 GTX) the value is negative. Anyone else have a similar issue with nvclock?
draeh
 
Posts: 16
Joined: Fri May 02, 2008 12:51 pm

Re: Every once in a while...

Postby draeh » Tue Nov 16, 2010 8:22 pm

Did some googling and found out that nvclock does not work correctly with some 9800-series cards (apparently including mine) and thus doesn't support fan control on them. nvidia-settings would work if the box weren't headless. The fan is definitely not running where it should. It's spinning rather slowly, and the GPU temp as reported by nvidia-smi is 89C (yikes), whereas my 8800GTX with the fan at 100% is at 74C. Both are currently folding, BTW.

If I can't find a headless way to get that 9800 fan to speed up, I'll have to power the fan manually.
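
Since nvidia-smi does report a temperature on a headless box, a small watcher could flag both symptoms seen so far (a hot card and a card that has gone idle). The sketch below is only that: it assumes a driver whose nvidia-smi supports the `--query-gpu` CSV output, which the 256.x/260.x drivers mentioned in this thread may not, and the thresholds and function names are arbitrary choices for illustration.

```python
import subprocess

# Assumes a driver new enough to support --query-gpu with CSV output.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def _to_int(text):
    """Convert a field to int; fields like '[Not Supported]' become None."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_gpu_status(csv_text):
    """Parse 'index, temp, util' rows into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip anything that is not a three-column row
        index, temp, util = (_to_int(field) for field in fields)
        rows.append({"index": index, "temp_c": temp, "util_pct": util})
    return rows


def flag_problems(rows, hot_c=85, idle_pct=5):
    """Return warnings for hot or idle GPUs (thresholds are guesses)."""
    warnings = []
    for row in rows:
        if row["temp_c"] is not None and row["temp_c"] >= hot_c:
            warnings.append(f"GPU {row['index']}: hot ({row['temp_c']}C)")
        if row["util_pct"] is not None and row["util_pct"] <= idle_pct:
            warnings.append(f"GPU {row['index']}: idle ({row['util_pct']}%)")
    return warnings


def read_gpu_status():
    """Run nvidia-smi and parse its output."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_status(result.stdout)
```

Run from cron every few minutes, `flag_problems(read_gpu_status())` would return an empty list while both cards are folding at sane temperatures and a non-empty one when a card overheats or drops to idle.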
draeh
 
Posts: 16
Joined: Fri May 02, 2008 12:51 pm

Re: Every once in a while...

Postby draeh » Mon Nov 22, 2010 2:59 pm

The plot sickens...

I think I had two problems. One was an overheat, the second is an error message.

Code:
Launch directory: Z:\home\me\fahgpu2-2
Executable: Z:\home\me\fahgpu2-2\Folding@home-Win32-GPU.exe
Arguments: -forcegpu nvidia_g80 -gpu 1

[13:04:18] - Ask before connecting: No
[13:04:18] - User name: draeh (Team 78445)
[13:04:18] - User ID: 7A8D75740383D6FA
[13:04:18] - Machine ID: 3
[13:04:18]

A potential conflict was detected:

Process 8 is currently running and may also be a client with Mach. ID 3.
The program will now exit. Upon restart, this check will not be done --
You may wish to check that no client is currently running in
Z:\home\me\fahgpu2-2 before restarting.


Which isn't true. The other process is using machine ID 2 with -gpu 0. When I restart this client, the error message does not appear, but as soon as it starts work, the other GPU goes idle. Its core is still running, but it does no more work.

On a lark, I tried the latest NVIDIA driver, 260.19.21, which still processes work units but does not cure this issue. My workaround is to get the second client running (ID 3/gpu 1) and then start the first (ID 2/gpu 0). Once both are running again, they will continue to work for about a week before it happens again.
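
The restart order could be scripted so it does not have to be done by hand. This is a rough Python sketch, not a tested fix: the `wine` launch command, the Linux-side paths (Z:\home\me mapped to /home/me), the 60-second pause, and the assumption that the first client takes the same `-forcegpu nvidia_g80` argument as the second are all guesses on top of what the log shows.

```python
import subprocess
import time

EXE = "Folding@home-Win32-GPU.exe"

# Order matters for the workaround: GPU 1 (machine ID 3) first,
# then GPU 0 (machine ID 2). Paths are assumed, not confirmed.
CLIENTS = [
    {"cwd": "/home/me/fahgpu2-2", "gpu": 1},
    {"cwd": "/home/me/fahgpu2-1", "gpu": 0},
]


def build_launch_plan(clients=CLIENTS):
    """Return (working_dir, argv) pairs in the order the clients should start."""
    return [
        (c["cwd"], ["wine", EXE, "-forcegpu", "nvidia_g80", "-gpu", str(c["gpu"])])
        for c in clients
    ]


def start_clients(plan, delay_s=60):
    """Start each client in turn, pausing between launches."""
    procs = []
    for position, (cwd, argv) in enumerate(plan):
        if position:
            time.sleep(delay_s)  # give the previous client time to pick up work
        procs.append(subprocess.Popen(argv, cwd=cwd))
    return procs


if __name__ == "__main__":
    start_clients(build_launch_plan())
```

Both existing clients would need to be stopped before running it, otherwise the machine ID conflict check may fire again.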
draeh
 
Posts: 16
Joined: Fri May 02, 2008 12:51 pm

Re: Every once in a while...

Postby bruce » Mon Nov 22, 2010 6:08 pm

That wouldn't be a driver problem. Either you've got two clients running with the same set of work files, or you've got a problem where the client dies without cleaning up the active locks. Are you SURE that each client is starting in a directory with its own copies of client.cfg, work, etc.?
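
One way to double-check that is a short script rather than eyeballing the directories. The Python sketch below assumes client.cfg is an INI-style file with a `machineid` key under a `[settings]` section; if your client writes a different layout, `machine_id` will simply return None. The function names are invented for this example.

```python
import configparser
import os


def machine_id(client_dir):
    """Read machineid from client.cfg (assumed INI-style with [settings])."""
    parser = configparser.ConfigParser(
        strict=False, interpolation=None, allow_no_value=True
    )
    parser.read(os.path.join(client_dir, "client.cfg"))
    return parser.get("settings", "machineid", fallback=None)


def check_clients(dirs):
    """Return a list of problems: shared dirs, missing cfg, duplicate IDs."""
    problems = []
    resolved = [os.path.realpath(d) for d in dirs]
    if len(set(resolved)) != len(resolved):
        problems.append("two clients resolve to the same directory")
    seen = {}  # machine ID -> first directory that used it
    for d in dirs:
        if not os.path.isfile(os.path.join(d, "client.cfg")):
            problems.append(f"{d}: no client.cfg")
            continue
        mid = machine_id(d)
        if mid is None:
            continue  # layout not recognised, nothing to compare
        if mid in seen:
            problems.append(f"{d}: machine ID {mid} also used by {seen[mid]}")
        else:
            seen[mid] = d
    return problems
```

Calling `check_clients` with the two client directories should return an empty list when each client has its own client.cfg and a distinct machine ID. It does not look for stale lock files, so a client that died without cleaning up would still need a manual check.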
bruce
Site Admin
 
Posts: 20912
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Every once in a while...

Postby draeh » Mon Nov 22, 2010 6:13 pm

bruce wrote:Are you SURE that each client is starting in a directory with its own copies of client.cfg, work, etc.?



For sure. One is located in directory fahgpu2-1 and the other is in fahgpu2-2. They will run continuously for about a week in between incidents.
draeh
 
Posts: 16
Joined: Fri May 02, 2008 12:51 pm

Re: Every once in a while...

Postby Sidicas » Mon Feb 21, 2011 4:47 pm

Sorry I'm late to the party, but are you using Auto-mator by any chance?
Have you tried using fixed sleepwaits?
Also, what CPU do you have?
Sidicas
 
Posts: 233
Joined: Sun Feb 17, 2008 4:46 pm

