Server reports problem with unit.

Postby ArVee » Mon Aug 15, 2011 1:33 am

I've received the above error on every attempted upload for the past three hours, across 10 GPUs. The clients then carry on, get a new WU, and hold the completed WU in the queue. What's up, and will they upload once whatever it is gets fixed?
ArVee
 
Posts: 215
Joined: Sun Dec 02, 2007 9:25 am

Re: Server reports problem with unit.

Postby bruce » Mon Aug 15, 2011 2:32 am

I'd like to help you but I don't have enough information. Are these all the same project? Are these all the same Server? Are the GPUs all the same model?
bruce
 
Posts: 21348
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Server reports problem with unit.

Postby Jesse_V » Mon Aug 15, 2011 2:45 am

Can you post the log that contains this? Use code tags so the post isn't miles long. Thanks.
Pen tester at Cigital/Synopsys
Jesse_V
 
Posts: 2894
Joined: Mon Jul 18, 2011 4:44 am
Location: USA

Re: Server reports problem with unit.

Postby ArVee » Mon Aug 15, 2011 3:09 am

Here's one of the log portions, and I'll gather the project, server and model info, and post it in a few minutes.

Code:
[01:47:35] Completed 98%
[01:49:08] Completed 99%
[01:50:40] Completed 100%
[01:50:40] Successful run
[01:50:40] DynamicWrapper: Finished Work Unit: sleep=10000
[01:50:50] Reserved 108932 bytes for xtc file; Cosm status=0
[01:50:50] Allocated 108932 bytes for xtc file
[01:50:50] - Reading up to 108932 from "work/wudata_01.xtc": Read 108932
[01:50:50] Read 108932 bytes from xtc file; available packet space=786321532
[01:50:50] xtc file hash check passed.
[01:50:50] Reserved 21912 21912 786321532 bytes for arc file=<work/wudata_01.trr> Cosm status=0
[01:50:50] Allocated 21912 bytes for arc file
[01:50:50] - Reading up to 21912 from "work/wudata_01.trr": Read 21912
[01:50:50] Read 21912 bytes from arc file; available packet space=786299620
[01:50:50] trr file hash check passed.
[01:50:50] Allocated 560 bytes for edr file
[01:50:50] Read bedfile
[01:50:50] edr file hash check passed.
[01:50:50] Logfile not read.
[01:50:50] GuardedRun: success in DynamicWrapper
[01:50:50] GuardedRun: done
[01:50:50] Run: GuardedRun completed.
[01:50:55] + Opened results file
[01:50:55] - Writing 131916 bytes of core data to disk...
[01:50:55] Done: 131404 -> 130315 (compressed to 99.1 percent)
[01:50:55]   ... Done.
[01:50:55] DeleteFrameFiles: successfully deleted file=work/wudata_01.ckp
[01:50:55] Shutting down core
[01:50:55]
[01:50:55] Folding@home Core Shutdown: FINISHED_UNIT
[01:50:59] CoreStatus = 64 (100)
[01:50:59] Sending work to server
[01:50:59] Project: 10504 (Run 62, Clone 4, Gen 1)


[01:50:59] + Attempting to send results [August 15 01:50:59 UTC]
[01:50:59] Gpu type=2 species=11.
[01:51:01] - Server reports problem with unit.
[01:51:01] - Preparing to get new work unit...
[01:51:01] Cleaning up work directory
[01:51:01] + Attempting to get work packet
[01:51:01] Passkey found
[01:51:01] Gpu type=2 species=11.
[01:51:01] - Connecting to assignment server
[01:51:01] - Successful: assigned to (171.67.108.21).
[01:51:01] + News From Folding@Home: Welcome to Folding@Home
[01:51:02] Loaded queue successfully.
[01:51:02] Gpu type=2 species=11.
[01:51:02] + Closed connections
[01:51:02]
[01:51:02] + Processing work unit
[01:51:02] Core required: FahCore_11.exe
[01:51:02] Core found.
[01:51:02] Working on queue slot 02 [August 15 01:51:02 UTC]
[01:51:02] + Working ...
[01:51:03]
[01:51:03] *------------------------------*
[01:51:03] Folding@Home GPU Core
[01:51:03] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[01:51:03]
[01:51:03] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[01:51:03] Build host: amoeba
[01:51:03] Board Type: Nvidia
[01:51:03] Core      :
[01:51:03] Preparing to commence simulation
[01:51:03] - Looking at optimizations...
[01:51:03] DeleteFrameFiles: successfully deleted file=work/wudata_02.ckp
[01:51:03] - Created dyn
[01:51:03] - Files status OK
[01:51:03] - Expanded 62832 -> 336799 (decompressed 536.0 percent)
[01:51:03] Called DecompressByteArray: compressed_data_size=62832 data_size=336799, decompressed_data_size=336799 diff=0
[01:51:03] - Digital signature verified
[01:51:03]
[01:51:03] Project: 10503 (Run 366, Clone 4, Gen 1)
[01:51:03]
[01:51:03] Assembly optimizations on if available.
[01:51:03] Entering M.D.
[01:51:09] Tpr hash work/wudata_02.tpr:  4208844038 19672731 3704029619 2577439479 693735541
[01:51:09]
[01:51:09] Calling fah_main args: 14 usage=100
[01:51:09]
[01:51:09] Working on Protein
[01:51:10] Client config found, loading data.
[01:51:10] Starting GUI Server
[01:52:43] Completed 1%
[01:54:16] Completed 2%
[01:55:49] Completed 3%
[01:57:22] Completed 4%
[01:58:55] Completed 5%
[02:00:28] Completed 6%
ArVee
 
Posts: 215
Joined: Sun Dec 02, 2007 9:25 am

Re: Server reports problem with unit.

Postby ArVee » Mon Aug 15, 2011 3:47 am

All the rejections are coming from multiple instances of 10501, 10502, 10503 and 10504, all from 8800 GT and 9800 GT GPUs spread across multiple PCs. The log doesn't show which server sends the error message on upload, but if it can be assumed the WUs are returned to the same server they came from, then it's 171.67.108.21. Notably, my GTX 550 Ti has shown none of this and is uploading without any error reports. My points over the past few hours are sharply down, which is understandable with the lion's share of my GPUs working and completing but with nothing to show for it.
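For anyone who wants to tally rejections like these across several client logs, here is a minimal Python sketch. It assumes the log lines look exactly like the excerpt posted above ("Sending work to server", then a "Project: ..." line, then "Server reports problem with unit"). The function names `rejected_units` and `tally_by_project` are made up for this example and are not part of the client, and other client versions or `-send all` runs may log differently.

```python
import re
import sys
from collections import Counter

# Patterns copied from the log excerpt in this thread; other client
# versions may word these lines differently (assumption).
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
SEND_TEXT = "Sending work to server"
REJECT_TEXT = "Server reports problem with unit"
NEXT_WU_TEXT = "Attempting to get work packet"


def rejected_units(lines):
    """Return (project, run, clone, gen) for each upload the server rejected."""
    rejected = []
    sending = False   # True while we are inside an upload attempt
    current = None    # unit named right after "Sending work to server"
    for line in lines:
        if SEND_TEXT in line:
            sending, current = True, None
            continue
        if not sending:
            continue
        match = PROJECT_RE.search(line)
        if match and current is None:
            current = tuple(int(g) for g in match.groups())
        elif REJECT_TEXT in line:
            if current is not None:
                rejected.append(current)
            sending = False
        elif NEXT_WU_TEXT in line:
            # The client moved on without logging a rejection.
            sending = False
    return rejected


def tally_by_project(units):
    """Count rejected units per project number."""
    return Counter(project for project, _run, _clone, _gen in units)


if __name__ == "__main__":
    # Hypothetical usage: pass one or more log file paths on the command line.
    all_units = []
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            all_units.extend(rejected_units(handle))
    for project, count in sorted(tally_by_project(all_units).items()):
        print(f"Project {project}: {count} rejected")
```

Run against the log from each client, this gives a per-project count that is easier to compare between the affected and unaffected cards than reading the logs by eye.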
ArVee
 
Posts: 215
Joined: Sun Dec 02, 2007 9:25 am

Re: Server reports problem with unit.

Postby 7im » Mon Aug 15, 2011 3:53 am

Are the GPUs overclocked or running stock speeds?
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
7im
 
Posts: 15237
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Server reports problem with unit.

Postby ArVee » Mon Aug 15, 2011 4:29 am

They're overclocked, to the same degree they've been for months now. Most haven't even been shut off for months. This has just started out of nowhere, across 8 machines, just this evening. And lo and behold, I just took a fast look before I went to submit this and they're being accepted again. The backlog is still in the queues, so if they don't go overnight I'll run -send all manually on each one, I guess. I believe that had to be a server glitch, but as long as it's OK now, that's the main thing.
ArVee
 
Posts: 215
Joined: Sun Dec 02, 2007 9:25 am

Re: Server reports problem with unit.

Postby 7im » Mon Aug 15, 2011 4:43 am

I'm glad they are folding again.

But remember, past performance is no guarantee of future performance, especially in OC stability. As WUs get larger in size, they will demand more resources from the GPUs, potentially heating them up more than before. Maybe even enough to become unstable. Just something to keep an eye on as time goes on.
7im
 
Posts: 15237
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Server reports problem with unit.

Postby bruce » Mon Aug 15, 2011 6:34 am

The log you posted shows Project: 10504 (Run 62, Clone 4, Gen 1) finishing and attempting to upload at [August 15 01:50:59 UTC] but failing. When was that WU downloaded?

The WU, itself, is not corrupt but the results you calculated appear to be.
bruce
 
Posts: 21348
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Server reports problem with unit.

Postby ArVee » Mon Aug 15, 2011 9:44 am

It was downloaded 2 1/2 hours earlier. Everything but the error message and the rejection is normal on all the machines. It's odd that all the PCs would start doing it last evening and then all stop doing it independently at roughly the same time. That indicates a server problem to me, but I'm just glad it's resolved itself. None of the WUs have sent themselves, so I'll send them manually with the -send all switch much later today when I have time.
ArVee
 
Posts: 215
Joined: Sun Dec 02, 2007 9:25 am

Re: Server reports problem with unit.

Postby ArVee » Wed Aug 17, 2011 2:44 pm

This has just started up again overnight EDT and I really wish someone would fix it. I get the error message exactly as worded in the subject title and things then carry on as normal, except no WU has been accepted. This is not one-off stuff, it's all at once across multiple 8800 and 9800 class gpu's and it's confined or appears to be confined to Server 171.67.108.21, at least that's the one if WU's are returned to the same server they came from.

This is not a result of some errant overclock; multiple machines in similar classes (my GTX 465s and 550 are NOT affected), all uploading to the same server and all doing similar work, don't just start doing bad work all at once. Would someone please look into this? I've seen what appears to be a similar report from around the time this first started, but it was under Ver 7, so I couldn't be sure. I'm surprised there hasn't been an outcry. I suppose the single line in the log file is easily missed.

Remember this went away as soon as it started the first time. It's very obviously at the server end, so something's going on in there.
ArVee
 
Posts: 215
Joined: Sun Dec 02, 2007 9:25 am

Re: Server reports problem with unit.

Postby bruce » Wed Aug 17, 2011 6:45 pm

The server is rejecting the WUs because the results are corrupted. It's certainly not obvious that it's a server issue. There's a long topic on problems that NV is having getting dependable drivers. Did you (or Microsoft) upgrade the drivers on multiple machines?

Also, please confirm that you have removed all overclocking and you've verified that your power supply isn't experiencing a problem with voltage droop.
bruce
 
Posts: 21348
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Server reports problem with unit.

Postby ArVee » Wed Aug 17, 2011 7:32 pm

Bruce, the reason I stated it was obviously a server issue is because it appears that way to me. I should be more careful to remember I might have more info than everybody else because they're not sitting here. It's just frustrating. It starts all at once and affects only the noted classes, but ALL of them, suddenly (doesn't show until the u/l is rejected), and does NOT affect 465's and 550's. These 9 gpu's are all in the same two rooms and there have been no driver updates at all precisely because I'm aware of the problem with NVidia drivers. The overclocks are still applied because I have a lot of trouble accepting that this would all start happening at once across all those gpu's but only those and it just happens to be the same server. I'm just reluctant to miss any more performance because I was removing historically perfectly sound overclocks. Having said that, I'll do exactly that if that's what is thought I should do, just say the word, I'm getting precisely zero out of all these things right now.

When you say power supply voltage droop, do you literally mean the outside electrical power source? I suppose that could happen without being noticed, but all appliances, lights, etc. are performing normally, and that includes the three higher-end GPUs. The ADSL lines are clean as well, now that I think of it.
ArVee
 
Posts: 215
Joined: Sun Dec 02, 2007 9:25 am

Re: Server reports problem with unit.

Postby 7im » Wed Aug 17, 2011 8:09 pm

Wait, if it's a server problem, why aren't ALL of your GPUs affected? That server is common to all your GPUs, is it not?

Again, past performance is no guarantee of future stability. New work units get bigger and more complex over time. These will push your systems harder, and if they are already at the edge of stability, would push them over the edge.

A rejected WU is a corrupted WU. Unstable overclocking is one of the causes. Eliminate the potential cause, and if the error happens again, you have proof that OC is not the problem. Until you remove that potential cause, it is still the most likely cause.
7im
 
Posts: 15237
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Server reports problem with unit.

Postby bruce » Wed Aug 17, 2011 9:14 pm

You have many clients running. Please post the MachineID and the right half of UserID for each of these clients.
bruce
 
Posts: 21348
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.
