Server reports problem with unit.

Moderators: slegrand, Site Moderators, PandeGroup

Re: Server reports problem with unit.

Postby ArVee » Wed Aug 17, 2011 10:10 pm

No, 7im, when that server isn't involved, the WU's are accepted just fine. That's why I was so blatant about saying it's a server problem. You'll see this in the next post; it breaks out pretty clearly. Not only that, the 7 gpu's that have been taken out for the past 12 hours are about to come back magically, at least that's my prediction, because they've switched off that one to 108.21. We'll see before too much longer. I'd be thrilled, but the larger issue will remain.

Edit: changed 9 gpu's to 7.
Last edited by ArVee on Wed Aug 17, 2011 10:27 pm, edited 1 time in total.
ArVee
 
Posts: 209
Joined: Sun Dec 02, 2007 9:25 am

Re: Server reports problem with unit.

Postby ArVee » Wed Aug 17, 2011 10:25 pm

523F5D2C 3 108.21
523F5D2C 4 108.21
06C8788E 3 108.21
06C8788E 4 108.21
6E3FB515 2 108.21
24F6552E 2 108.21
6BC11CDC 3 108.21

10F6BC8B 2 65.64
1C7C80DF 2 65.64
1C4A8DE7 3 108.54


Top pile are 8800 and 9800 class, the ones erroring out for the past 12 hours, bottom pile is gtx465's and a gtx550ti (doing just fine). I'm betting this whole problem appears to mysteriously go away again shortly because as I was pulling this info I noticed they'd finally switched off 108.21 to 65.106, thank goodness. This doesn't mean the problem's solved however. Either that or I'm misreading this whole thing.

I haven't bothered with the cpu processing id's. Let me know if you need these as well though!
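For anyone wanting to do the same tally on their own logs, here is a minimal sketch that groups the rows listed above by the server prefix in the last column. The function name `group_by_server` is made up for illustration, and the meaning of the middle column is not stated in this post, so the code simply ignores it.

```python
from collections import defaultdict

# Rows copied from the list in this post: id, an unexplained middle
# column, and the server prefix. Only the first and last are used.
ROWS = """\
523F5D2C 3 108.21
523F5D2C 4 108.21
06C8788E 3 108.21
06C8788E 4 108.21
6E3FB515 2 108.21
24F6552E 2 108.21
6BC11CDC 3 108.21
10F6BC8B 2 65.64
1C7C80DF 2 65.64
1C4A8DE7 3 108.54
"""


def group_by_server(text):
    """Return {server_prefix: [ids...]} for every well-formed row."""
    groups = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        unit_id, _middle, server = parts
        groups[server].append(unit_id)
    return dict(groups)


if __name__ == "__main__":
    for server, ids in group_by_server(ROWS).items():
        print(server, len(ids), "rows")
```

A count like this only shows which server prefix each row is associated with; it doesn't prove where the fault lies.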
ArVee
 
Posts: 209
Joined: Sun Dec 02, 2007 9:25 am

Re: Server reports problem with unit.

Postby MtM » Thu Aug 18, 2011 10:43 am

I've got two suggestions you could check as well:

Run memtest86. If your RAM is bad, it could corrupt the work units when they are written from RAM to HDD.

Check the SMART status of your HDDs. An extended butterfly test might take a long time but will show whether your drives are having problems.

Neither is directly folding related, and both could be a wild goose chase, but as noted the server rejects WUs which don't match their checksum.

You could also try stopping a client and restarting it, forcing it to resume from the last checkpoint (this way it's easier to see if there is a problem with resuming from checkpoint, which is another stage at which the client compares checksums).
MtM
 
Posts: 3054
Joined: Fri Jun 27, 2008 2:20 pm
Location: The Netherlands

Re: Server reports problem with unit.

Postby ArVee » Fri Aug 19, 2011 12:01 am

The first few suggestions, while good in general, don't realistically apply in this case because the failures are spread across so many machines. All of them erroring out in the same short timeframe by coincidence isn't really plausible. The start/restart to force checkpoints is a great idea though, and one that never occurred to me! I will try it if this ever crops up again, which it hasn't for the better part of a day now, thankfully. I'm on to different WU's being served from anything BUT that server, hopefully forever. Without fail it was involved in every error, and without fail all other instances behaved. I'm just hoping it's over. Thanks for the thoughts :-)
ArVee
 
Posts: 209
Joined: Sun Dec 02, 2007 9:25 am

Re: Server reports problem with unit.

Postby bruce » Sat Aug 20, 2011 5:30 pm

Start/restart does not force checkpoints. It reads the most recent checkpoint and confirms that it's valid but that's not a particularly important thing to do.
bruce
 
Posts: 22623
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Server reports problem with unit.

Postby MtM » Sat Aug 20, 2011 8:49 pm

I didn't mean forcing the creation of checkpoints which is why I said it forces resuming from the last checkpoint. If his system is corrupting results not during simulation ( which I hope would generate an EUE ) but during disk writes or reads it does seem like a good idea to check resuming from a checkpoint?

Tbh, the memtest86 suggestion was rather misplaced; memory errors should result in EUE's as well.
MtM
 
Posts: 3054
Joined: Fri Jun 27, 2008 2:20 pm
Location: The Netherlands

Re: Server reports problem with unit.

Postby Mactin » Sun Sep 04, 2011 4:40 pm

I've been out of town for a few weeks (backlog of 2400+ messages), but I had the same problem, around the same time with the same server.
Client Martin_i7-875k_GTS250
p10504-r370-c4-g1, completed 2011-08-16 11:58:50, Server reports problem with unit
p10504-r107-c0-g567, completed 2011-08-16 14:23:26, Server reports problem with unit
p10502-r119-c0-g589, completed 2011-08-16 16:48:00, Server reports problem with unit
p10502-r357-c4-g1, completed 2011-08-16 19:12:45, Server reports problem with unit
the next WU (p5790-r2-c939-g12), completed without problems or intervention from me
At this time of the year (summer) there is no OC on this machine (especially if I'm not there)
Mactin
 
Posts: 327
Joined: Sun Dec 02, 2007 1:08 pm
Location: Côte-des-Neiges, Montréal, Québec

Re: Server reports problem with unit.

Postby bruce » Sun Sep 04, 2011 6:55 pm

Project: 10502 run 357 clone 4 gen 1 and Project: 10504 run 370 clone 4 gen 1:
I have no way of knowing if you were the only person who had trouble with these WUs, but they were both reassigned several times and completed successfully each time. There is no record of the WU being completed by Mactin.

Project: 10502 run 119 clone 0 gen 589 and Project: 10504 run 107 clone 0 gen 567:
These were both reassigned and completed many times but not by Mactin. A few people did get zero credit.

I have no basis to guess why some people are reporting problems and others are completing the same WUs successfully.
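The short names in Mactin's log (e.g. `p10504-r370-c4-g1`) and the "Project: ... run ... clone ... gen ..." form used in this post describe the same four numbers. A minimal sketch for converting between them when comparing logs follows; the helper names `parse_prcg` and `describe` are hypothetical, and the pattern is assumed only from the examples in this thread.

```python
import re

# Pattern inferred from the names quoted in this thread, e.g.
# "p10504-r370-c4-g1"; other clients may format names differently.
PRCG = re.compile(r"^p(\d+)-r(\d+)-c(\d+)-g(\d+)$")


def parse_prcg(name):
    """Split a short WU name into project, run, clone and gen integers."""
    match = PRCG.match(name.strip())
    if match is None:
        raise ValueError(f"not a p-r-c-g style name: {name!r}")
    project, run, clone, gen = (int(part) for part in match.groups())
    return {"project": project, "run": run, "clone": clone, "gen": gen}


def describe(name):
    """Render a short WU name in the long form used in this post."""
    wu = parse_prcg(name)
    return (f"Project: {wu['project']} run {wu['run']} "
            f"clone {wu['clone']} gen {wu['gen']}")


if __name__ == "__main__":
    for short in ("p10504-r370-c4-g1", "p10502-r119-c0-g589"):
        print(short, "->", describe(short))
```

Quoting WUs in the long form makes it easier for someone with server access to look them up.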
bruce
 
Posts: 22623
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Server reports problem with unit.

Postby new08 » Sat Feb 11, 2012 6:40 pm

ArVee, did you ever get to the bottom of this problem?
I've had a spate of one core's output finishing and not uploading, with a non-specific error. It's stopped doing it with this last unit today. Nothing is untoward on the PC, but the units with problems [over 10 days] never got credited. Not worried about the points, but keeping the PC on for days for results to crash out is dispiriting, as you know.
The only thing I changed before the last one corrected was to de-synch the PCIe bus from the FSB in BIOS at boot. This is apparently a better way to run non-standard configs on MB and memory.
The only other non-connected [afaik] issue was a spate of the latest drivers throwing up a NV4 .dll error which I caught in Crash reports. That has not happened with this unit checksum though, and the drivers are still the same and running fine again.
There's a few unknowns here and very little to go on. The only other thing I had different was the inability of the unit to keep the checkpoint for a restart, so it was going back to zero. This meant I couldn't stop the client to troubleshoot or it would never finish at all, and they were taking 2 days a unit! The GPU and other core are fine, btw.
viewtopic.php?f=19&t=20712
new08
 
Posts: 310
Joined: Fri Jan 04, 2008 11:02 pm
Location: England

Re: Server reports problem with unit.

Postby ArVee » Sat Feb 11, 2012 8:38 pm

No, I didn't ever get to the bottom of the issue, new08, nor did I hear anything more on it, I was just happy enough that I got switched off that server and that the issue never arose again. There were never any permanent changes at this end, I tried a couple of the suggested ones and switched off them when they didn't help. My junk's all running fine here since, the odd EUE but c'est la vie. Hope your thing is temporary, it's frustrating.
ArVee
 
Posts: 209
Joined: Sun Dec 02, 2007 9:25 am

Re: Server reports problem with unit.

Postby new08 » Fri Feb 24, 2012 3:57 pm

Yeah, settled down now. No repeats of the problem mentioned and some big units going through OK.
Nothing changed on the hardware/config side, so if there's a few hints people can use in the discussions, that's not all wasted.
In many cases, it's not realistic to expect answers, as there are so many possible configs and glitches that 'getting by' is good enough.
The idea that invalid results occur doing max outputs doesn't stack up with me, as I think important issues/queries/faults can be re-run under controlled conditions.
new08
 
Posts: 310
Joined: Fri Jan 04, 2008 11:02 pm
Location: England
