Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Mon Feb 15, 2010 10:03 pm
by VijayPande
ok, thanks. Joe and I have been working to track this down.

Could people confirm the following:
- the problems are only seen in one server: 171.67.108.21
- the problems are only seen with configs with multiple GPUs in a single box

If either of the above aren't true for you, could you please post here?

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Mon Feb 15, 2010 10:04 pm
by VijayPande
PS: The 108.26 CS has been up (the FAIL was caused by a connection issue between serverstat and that CS, which we've now fixed).

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Mon Feb 15, 2010 10:10 pm
by chriskwarren
I can throw my hat into the group with single GPU rigs that could not upload.

How can someone go into their logs and provide proof of completed WUs? I can find 23 instances of "Server has already received unit." in the main FAHlog.txt files of my three WUs. I would hate to have donated my electricity for nothing.
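
One possible way to pull that evidence out of a log is sketched below in Python. It assumes the line shapes seen in the log excerpts in this thread ("Project: N (Run R, Clone C, Gen G)" followed later by "Server has already received unit."), and it assumes the most recent Project line before the message names the affected unit. That assumption can fail when output from a running core is interleaved with the upload messages. The function name `find_rejected_units` is hypothetical and not part of the client.

```python
import re

# Line shape taken from the log excerpts posted in this thread, e.g.
# "[23:26:30] Project: 10502 (Run 59, Clone 0, Gen 0)"
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
ALREADY_RECEIVED = "Server has already received unit."


def find_rejected_units(log_text):
    """Return (project, run, clone, gen) for each 'already received' message.

    Assumption: the last Project line seen before the message identifies
    the unit. If no Project line was seen yet, None is recorded instead.
    """
    rejected = []
    last_unit = None
    for line in log_text.splitlines():
        match = PROJECT_RE.search(line)
        if match:
            last_unit = tuple(int(group) for group in match.groups())
        elif ALREADY_RECEIVED in line:
            rejected.append(last_unit)
    return rejected
```

To use it, read the whole of FAHlog.txt into a string and pass it to `find_rejected_units`. The resulting list of project/run/clone/gen tuples can then be posted alongside the raw log lines.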

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Mon Feb 15, 2010 10:12 pm
by chriskwarren
VijayPande wrote:ok, thanks. Joe and I have been working to track this down.

Could people confirm the following:
- the problems are only seen in one server: 171.67.108.21
- the problems are only seen with configs with multiple GPUs in a single box

If either of the above aren't true for you, could you please post here?
I have a GTS 250 folding in a box all by itself that had the "Server has already received unit." error. There is a bigadv running in that box, and just the one GPU.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Mon Feb 15, 2010 10:28 pm
by bollix47
VijayPande wrote:- the problems are only seen with configs with multiple GPUs in a single box
As already stated in this thread by numerous contributors, this is not true.

viewtopic.php?p=131405#p131405

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Mon Feb 15, 2010 10:40 pm
by ONE-OF-THREE
bollix47 wrote:
VijayPande wrote:- the problems are only seen with configs with multiple GPUs in a single box
As already stated in this thread by numerous contributors, this is not true.

viewtopic.php?p=131405#p131405
Yes, as previously mentioned, just one GPU, and I can also confirm that this problem (at least in my case) was "only seen in one server: 171.67.108.21".

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Mon Feb 15, 2010 10:50 pm
by VijayPande
Thanks for the confirmation. We have been working on this trying to debug what's up. Right now, we're assuming (based on donor feedback) that this is only relevant for 171.67.108.21, but for single and multi GPU configs.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Mon Feb 15, 2010 10:53 pm
by Flathead74
VijayPande wrote:Could people confirm the following:
- the problems are only seen in one server: 171.67.108.21
- the problems are only seen with configs with multiple GPUs in a single box

If either of the above aren't true for you, could you please post here?
I have 119 completed WUs, both on systems with a single GPU and on systems with multiple GPUs.

- the problems are only seen in one server: 171.67.108.21

Server has already received unit

The Work folders still contain the wuresults_xx.dat files, _00 through _09,
along with the corresponding wudata_xx.log files, and logfile_xx.txt files.
I also have the queue.dat files and FAHlog.txt files.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Mon Feb 15, 2010 10:56 pm
by jaxawier
Here is a current one after a restart, using 2 GPU clients on that PC.

Code:

[22:28:42] Loaded queue successfully.
[22:28:42] 
[22:28:42] - Autosending finished units... [February 15 22:28:42 UTC]
[22:28:42] Trying to send all finished work units
[22:28:42] + Processing work unit
[22:28:42] Project: 5781 (Run 18, Clone 673, Gen 3)
[22:28:42] Core required: FahCore_11.exe
[22:28:42] - Read packet limit of 540015616... Set to 524286976.


[22:28:42] + Attempting to send results [February 15 22:28:42 UTC]
[22:28:42] Core found.[22:28:42] - Reading file work/wuresults_08.dat from core

[22:28:42]   (Read 168841 bytes from disk)
[22:28:42] Connecting to http://171.67.108.21:8080/
[22:28:42] Working on queue slot 09 [February 15 22:28:42 UTC]
[22:28:42] + Working ...
[22:28:42] - Calling '.\FahCore_11.exe -dir work/ -suffix 09 -priority 100 -checkpoint 27 -forceasm -verbose -lifeline 1072 -version 623'

[22:28:42] 
[22:28:42] *------------------------------*
[22:28:42] Folding@Home GPU Core
[22:28:42] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[22:28:42] 
[22:28:42] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[22:28:42] Build host: amoeba
[22:28:42] Board Type: Nvidia
[22:28:42] Core      : 
[22:28:42] Preparing to commence simulation
[22:28:42] - Assembly optimizations manually forced on.
[22:28:42] - Not checking prior termination.
[22:28:42] - Expanded 64996 -> 343707 (decompressed 528.8 percent)
[22:28:42] Called DecompressByteArray: compressed_data_size=64996 data_size=343707, decompressed_data_size=343707 diff=0
[22:28:42] - Digital signature verified
[22:28:42] 
[22:28:42] Project: 5783 (Run 10, Clone 37, Gen 28)
[22:28:42] 
[22:28:42] Assembly optimizations on if available.
[22:28:42] Entering M.D.
[22:28:48] Will resume from checkpoint file
[22:28:48] Tpr hash work/wudata_09.tpr:  3670348623 303977017 2684851848 1530148849 3932990047
[22:28:48] 
[22:28:48] Calling fah_main args: 14 usage=95
[22:28:48] 
[22:28:48] Working on GROwing Monsters And Cloning Shrimps
[22:28:49] Client config found, loading data.
[22:28:49] Starting GUI Server
[22:28:49] Resuming from checkpoint
[22:28:49] fcCheckPointResume: retreived and current tpr file hash:
[22:28:49]    0   3670348623   3670348623
[22:28:49]    1    303977017    303977017
[22:28:49]    2   2684851848   2684851848
[22:28:49]    3   1530148849   1530148849
[22:28:49]    4   3932990047   3932990047
[22:28:49] fcCheckPointResume: file hashes same.
[22:28:49] fcCheckPointResume: state restored.
[22:28:49] Verified work/wudata_09.log
[22:28:49] Verified work/wudata_09.edr
[22:28:49] Verified work/wudata_09.xtc
[22:28:49] Completed 8%
[22:28:50] - Couldn't send HTTP request to server
[22:28:50] + Could not connect to Work Server (results)
[22:28:50]     (171.67.108.21:8080)
[22:28:50] + Retrying using alternative port
[22:28:50] Connecting to http://171.67.108.21:80/
[22:29:15] - Couldn't send HTTP request to server
[22:29:15] + Could not connect to Work Server (results)
[22:29:15]     (171.67.108.21:80)
[22:29:15] - Error: Could not transmit unit 08 (completed February 15) to work server.
[22:29:15] - 4 failed uploads of this unit.
[22:29:15] - Read packet limit of 540015616... Set to 524286976.


[22:29:15] + Attempting to send results [February 15 22:29:15 UTC]
[22:29:15] - Reading file work/wuresults_08.dat from core
[22:29:15]   (Read 168841 bytes from disk)
[22:29:15] Connecting to http://171.67.108.26:8080/
[22:30:12] Completed 9%
[22:31:35] Completed 10%
[22:32:58] Completed 11%
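
For anyone trying to answer the "only one server?" question from their own logs, here is a rough Python sketch that tallies failed result uploads per server address. It assumes the two-line pattern visible in the log above: a "Could not connect to Work Server" line followed directly by a line containing "(address:port)". The helper name `count_failed_connections` is hypothetical, and interleaved core output between those two lines would cause a failure to be missed.

```python
import re
from collections import Counter

# Matches the "(171.67.108.21:8080)" style line that follows a failure message.
ADDR_RE = re.compile(r"\((\d+\.\d+\.\d+\.\d+):(\d+)\)")


def count_failed_connections(log_text):
    """Count 'Could not connect to Work Server' failures per server IP.

    Assumption: the address appears on the line right after the message.
    """
    failures = Counter()
    expecting_addr = False
    for line in log_text.splitlines():
        if "Could not connect to Work Server" in line:
            expecting_addr = True
            continue
        if expecting_addr:
            match = ADDR_RE.search(line)
            if match:
                failures[match.group(1)] += 1
            expecting_addr = False
    return failures
```

Ports 8080 and 80 are counted together under the same IP, since the client retries on the alternative port for the same server.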

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Mon Feb 15, 2010 11:25 pm
by weedacres
Flathead74 wrote:
VijayPande wrote:Could people confirm the following:
- the problems are only seen in one server: 171.67.108.21
- the problems are only seen with configs with multiple GPUs in a single box

If either of the above aren't true for you, could you please post here?
I have 119 completed WUs, both on systems with a single GPU and on systems with multiple GPUs.

- the problems are only seen in one server: 171.67.108.21

Server has already received unit

The Work folders still contain the wuresults_xx.dat files, _00 through _09,
along with the corresponding wudata_xx.log files, and logfile_xx.txt files.
I also have the queue.dat files and FAHlog.txt files.
I saw the same thing: only the one server, on both single- and multi-GPU boxes.
Currently I have approximately 150 completed work units that cannot be sent, but for a different reason.

Code:

- Warning: Asked to send unfinished unit to server

See: http://foldingforum.org/viewtopic.php?f=50&t=13464

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Mon Feb 15, 2010 11:38 pm
by Pette Broad
Here we go again, another unit lost...Thought this had been fixed.

Code:

[22:52:56] CoreStatus = 64 (100)
[22:52:56] Sending work to server
[22:52:56] Project: 10502 (Run 59, Clone 0, Gen 0)


[22:52:56] + Attempting to send results [February 15 22:52:56 UTC]
[23:26:30] - Unknown packet returned from server, expected ACK for results
[23:26:30] - Error: Could not transmit unit 09 (completed February 15) to work server.
[23:26:30]   Keeping unit 09 in queue.
[23:26:30] Project: 10502 (Run 59, Clone 0, Gen 0)


[23:26:30] + Attempting to send results [February 15 23:26:30 UTC]
[23:26:43] - Server has already received unit.
[23:26:43] - Preparing to get new work unit...
[23:26:43] + Attempting to get work packet
[23:26:43] - Connecting to assignment server
[23:26:44] - Successful: assigned to (171.64.65.20).
[23:26:44] + News From Folding@Home: Welcome to Folding@Home
[23:26:45] Loaded queue successfully.
[23:26:46] + Closed connections
[23:26:46] 
[23:26:46] + Processing work unit
[23:26:46] Core required: FahCore_14.exe
[23:26:46] Core found.
[23:26:46] Working on queue slot 00 [February 15 23:26:46 UTC]
[23:26:46] + Working ...
[23:26:47] 
[23:26:47] *------------------------------*
[23:26:47] Folding@Home GPU Core - Beta
[23:26:47] Version 1.26 (Wed Oct 14 13:09:26 PDT 2009)
Pete

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Mon Feb 15, 2010 11:58 pm
by VijayPande
From what we see here, this is definitely not fixed yet, although we have some ideas. We are still working on it. We hope to have a code update in a few hours, although I'm trying not to rush this, since rushing is a way to introduce new bugs and other problems.

Re: What do we do with all of the unsent workunits?

Posted: Tue Feb 16, 2010 1:43 am
by weedacres
Well, I guess I'll just wait to see if they get the server problems fixed.
I expect that I'll be flushing my completed work.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Tue Feb 16, 2010 1:44 am
by checka
Vijay, looking back through my posts and log files (Friday 2/12 in server has no record of this unit), I noticed at least two incidents with 171.64.65.71 as well as 171.67.108.21. Today the return of 10102 resulted in a freeze: my computer uploaded the result but then froze, unable to reach the assignment server. I just got a notice that it "could not transmit unit 03 (completed February 16 01:32:58 UT) to server 171.64.65.71:8080". At 01:35:41 I got the error message "Server does not have record of this unit. Will Try again later. Could not transmit unit 03 to Collection Server; keeping in queue." All I can say is good luck and good hunting.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Tue Feb 16, 2010 2:14 am
by patonb
Yeah, I had posted that it took 20 min for me to upload a unit, and then I got the message that the server already had the unit. ToTow said there was no record of my unit under my name either.