GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Moderators: Site Moderators, FAHC Science Team

VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Post by VijayPande »

OK, thanks. Joe and I have been working to track this down.

Could people confirm the following:
- the problems are only seen in one server: 171.67.108.21
- the problems are only seen with configs with multiple GPUs in a single box

If either of the above aren't true for you, could you please post here?
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Post by VijayPande »

PS: The 108.26 CS has been up (the FAIL was an issue in the connection between serverstat and that CS, which we've now fixed).
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
chriskwarren
Posts: 28
Joined: Sun Nov 30, 2008 2:13 am

Re: GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Post by chriskwarren »

I can throw my hat in with the group of single-GPU rigs that could not upload.

How can someone go into their logs and provide proof of completed WUs? I can find 23 instances of "Server has already received unit." in the main FAHlog.txt files for my three WUs. I would hate to have donated my electricity for nothing.
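For anyone who wants to run the same count, here is a minimal Python sketch; the client directories below are examples only, so point them at your own folders:

Code: Select all

# Count "Server has already received unit." lines in each client's FAHlog.txt.
# The directories below are examples -- substitute your own client paths.
from pathlib import Path

ERROR_TEXT = "Server has already received unit."
client_dirs = [Path("C:/FAH/gpu1"), Path("C:/FAH/gpu2"), Path("C:/FAH/gpu3")]

total = 0
for client in client_dirs:
    log = client / "FAHlog.txt"
    if not log.exists():
        continue
    hits = sum(ERROR_TEXT in line
               for line in log.read_text(errors="ignore").splitlines())
    total += hits
    print(f"{log}: {hits} occurrence(s)")
print(f"total: {total}")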
chriskwarren
Posts: 28
Joined: Sun Nov 30, 2008 2:13 am

Re: GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Post by chriskwarren »

VijayPande wrote: OK, thanks. Joe and I have been working to track this down.

Could people confirm the following:
- the problems are only seen in one server: 171.67.108.21
- the problems are only seen with configs with multiple GPUs in a single box

If either of the above aren't true for you, could you please post here?
I have a GTS 250 folding in a box all by itself that had the "Server has already received unit." error. There is a bigadv running in that box, and just the one GPU.
bollix47
Posts: 2941
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Post by bollix47 »

VijayPande wrote:- the problems are only seen with configs with multiple GPUs in a single box
As already stated in this thread by numerous contributors, this is not true.

viewtopic.php?p=131405#p131405
ONE-OF-THREE
Posts: 23
Joined: Thu Sep 04, 2008 4:42 pm
Hardware configuration: Playstation 3, 80gig (bundled with Metal Gear Solid 4 Guns of the Patriots) :)

Playstation 3, 40gig

Gateway FX6800-09H
Intel Core i7 920
2.67GHz
12 GB DDR3 RAM
nVidia GTX260
1 TB 7200 RPM
8MB L2
750 Watts
Windows Vista Home Premium 64-Bit
Location: Canada

Re: GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Post by ONE-OF-THREE »

bollix47 wrote:
VijayPande wrote:- the problems are only seen with configs with multiple GPUs in a single box
As already stated in this thread by numerous contributors, this is not true.

viewtopic.php?p=131405#p131405
Yes, as previously mentioned, just one GPU, and I can also confirm that this problem (at least in my case) was "only seen in one server: 171.67.108.21".
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Post by VijayPande »

Thanks for the confirmation. We have been working on this, trying to debug what's up. Right now, we're assuming (based on donor feedback) that this is only relevant for 171.67.108.21, but for both single- and multi-GPU configs.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
Flathead74
Posts: 266
Joined: Sun Dec 02, 2007 6:08 pm
Location: Central New York
Contact:

Re: GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Post by Flathead74 »

VijayPande wrote:Could people confirm the following:
- the problems are only seen in one server: 171.67.108.21
- the problems are only seen with configs with multiple GPUs in a single box

If either of the above aren't true for you, could you please post here?
I have 119 WUs completed, on both systems with a single GPU and systems with multiple GPUs.

- the problems are only seen in one server: 171.67.108.21

Server has already received unit

The Work folders still contain the wuresults_xx.dat files, _00 through _09,
along with the corresponding wudata_xx.log files, and logfile_xx.txt files.
I also have the queue.dat files and FAHlog.txt files.
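In case it helps the debugging, here is a rough Python sketch of how one could inventory those stranded results; the file names follow the layout above, but the work/ path is an assumption, so adjust it per client:

Code: Select all

# List stranded wuresults_xx.dat files (slots 00-09) and their sizes,
# and note whether the matching wudata_xx.log is still present.
# The work/ path is an example; point it at your client's work folder.
from pathlib import Path

work = Path("work")
for slot in range(10):
    dat = work / f"wuresults_{slot:02d}.dat"
    if dat.exists():
        has_log = (work / f"wudata_{slot:02d}.log").exists()
        print(f"slot {slot:02d}: {dat.stat().st_size} bytes, "
              f"wudata log present: {has_log}")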
jaxawier
Posts: 8
Joined: Fri Dec 18, 2009 7:42 pm

Re: GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Post by jaxawier »

Here is a current log after a restart, using 2 GPU clients on that PC.

Code: Select all

[22:28:42] Loaded queue successfully.
[22:28:42] 
[22:28:42] - Autosending finished units... [February 15 22:28:42 UTC]
[22:28:42] Trying to send all finished work units
[22:28:42] + Processing work unit
[22:28:42] Project: 5781 (Run 18, Clone 673, Gen 3)
[22:28:42] Core required: FahCore_11.exe
[22:28:42] - Read packet limit of 540015616... Set to 524286976.


[22:28:42] + Attempting to send results [February 15 22:28:42 UTC]
[22:28:42] Core found.
[22:28:42] - Reading file work/wuresults_08.dat from core
[22:28:42]   (Read 168841 bytes from disk)
[22:28:42] Connecting to http://171.67.108.21:8080/
[22:28:42] Working on queue slot 09 [February 15 22:28:42 UTC]
[22:28:42] + Working ...
[22:28:42] - Calling '.\FahCore_11.exe -dir work/ -suffix 09 -priority 100 -checkpoint 27 -forceasm -verbose -lifeline 1072 -version 623'

[22:28:42] 
[22:28:42] *------------------------------*
[22:28:42] Folding@Home GPU Core
[22:28:42] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[22:28:42] 
[22:28:42] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[22:28:42] Build host: amoeba
[22:28:42] Board Type: Nvidia
[22:28:42] Core      : 
[22:28:42] Preparing to commence simulation
[22:28:42] - Assembly optimizations manually forced on.
[22:28:42] - Not checking prior termination.
[22:28:42] - Expanded 64996 -> 343707 (decompressed 528.8 percent)
[22:28:42] Called DecompressByteArray: compressed_data_size=64996 data_size=343707, decompressed_data_size=343707 diff=0
[22:28:42] - Digital signature verified
[22:28:42] 
[22:28:42] Project: 5783 (Run 10, Clone 37, Gen 28)
[22:28:42] 
[22:28:42] Assembly optimizations on if available.
[22:28:42] Entering M.D.
[22:28:48] Will resume from checkpoint file
[22:28:48] Tpr hash work/wudata_09.tpr:  3670348623 303977017 2684851848 1530148849 3932990047
[22:28:48] 
[22:28:48] Calling fah_main args: 14 usage=95
[22:28:48] 
[22:28:48] Working on GROwing Monsters And Cloning Shrimps
[22:28:49] Client config found, loading data.
[22:28:49] Starting GUI Server
[22:28:49] Resuming from checkpoint
[22:28:49] fcCheckPointResume: retreived and current tpr file hash:
[22:28:49]    0   3670348623   3670348623
[22:28:49]    1    303977017    303977017
[22:28:49]    2   2684851848   2684851848
[22:28:49]    3   1530148849   1530148849
[22:28:49]    4   3932990047   3932990047
[22:28:49] fcCheckPointResume: file hashes same.
[22:28:49] fcCheckPointResume: state restored.
[22:28:49] Verified work/wudata_09.log
[22:28:49] Verified work/wudata_09.edr
[22:28:49] Verified work/wudata_09.xtc
[22:28:49] Completed 8%
[22:28:50] - Couldn't send HTTP request to server
[22:28:50] + Could not connect to Work Server (results)
[22:28:50]     (171.67.108.21:8080)
[22:28:50] + Retrying using alternative port
[22:28:50] Connecting to http://171.67.108.21:80/
[22:29:15] - Couldn't send HTTP request to server
[22:29:15] + Could not connect to Work Server (results)
[22:29:15]     (171.67.108.21:80)
[22:29:15] - Error: Could not transmit unit 08 (completed February 15) to work server.
[22:29:15] - 4 failed uploads of this unit.
[22:29:15] - Read packet limit of 540015616... Set to 524286976.


[22:29:15] + Attempting to send results [February 15 22:29:15 UTC]
[22:29:15] - Reading file work/wuresults_08.dat from core
[22:29:15]   (Read 168841 bytes from disk)
[22:29:15] Connecting to http://171.67.108.26:8080/
[22:30:12] Completed 9%
[22:31:35] Completed 10%
[22:32:58] Completed 11%
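For what it's worth, the fallback order visible in that log, work server port 8080, then port 80, then the collection server, spelled out as a rough Python sketch (the "/" endpoint and the helper function are assumptions for illustration, not the actual client code):

Code: Select all

# Fallback order seen in the log above: 171.67.108.21:8080 -> :80 ->
# collection server 171.67.108.26:8080. Illustrative sketch only.
import http.client

WORK_SERVER = "171.67.108.21"
COLLECTION_SERVER = "171.67.108.26"

def try_upload(host, port, payload):
    """One HTTP POST of the result payload; False on any failure.
    The "/" endpoint is an assumption, not the client's real URL."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=30)
        conn.request("POST", "/", payload)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        return False

def send_results(payload):
    for host, port in [(WORK_SERVER, 8080), (WORK_SERVER, 80),
                       (COLLECTION_SERVER, 8080)]:
        if try_upload(host, port, payload):
            return True
    return False  # unit stays in the queue for the next autosend pass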
weedacres
Posts: 138
Joined: Mon Dec 24, 2007 11:18 pm
Hardware configuration: UserNames: weedacres_gpu ...
Location: Eastern Washington

Re: GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Post by weedacres »

Flathead74 wrote:
VijayPande wrote:Could people confirm the following:
- the problems are only seen in one server: 171.67.108.21
- the problems are only seen with configs with multiple GPUs in a single box

If either of the above aren't true for you, could you please post here?
I have 119 WUs completed, on both systems with a single GPU and systems with multiple GPUs.

- the problems are only seen in one server: 171.67.108.21

Server has already received unit

The Work folders still contain the wuresults_xx.dat files, _00 through _09,
along with the corresponding wudata_xx.log files, and logfile_xx.txt files.
I also have the queue.dat files and FAHlog.txt files.
I also saw the same thing: only the one server, on both single- and multi-GPU boxes.
Currently I have approx. 150 completed work units that cannot be sent, but for a different reason.

Code: Select all

- Warning: Asked to send unfinished unit to server

See: http://foldingforum.org/viewtopic.php?f=50&t=13464
Pette Broad
Posts: 128
Joined: Mon Dec 03, 2007 9:38 pm
Hardware configuration: CPU folding on only one machine a laptop

GPU Hardware..
3 x 460
1 X 260
4 X 250

+ 1 X 9800GT (3 days a week)
Location: Chester U.K

Re: GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Post by Pette Broad »

Here we go again, another unit lost... I thought this had been fixed.

Code: Select all

[22:52:56] CoreStatus = 64 (100)
[22:52:56] Sending work to server
[22:52:56] Project: 10502 (Run 59, Clone 0, Gen 0)


[22:52:56] + Attempting to send results [February 15 22:52:56 UTC]
[23:26:30] - Unknown packet returned from server, expected ACK for results
[23:26:30] - Error: Could not transmit unit 09 (completed February 15) to work server.
[23:26:30]   Keeping unit 09 in queue.
[23:26:30] Project: 10502 (Run 59, Clone 0, Gen 0)


[23:26:30] + Attempting to send results [February 15 23:26:30 UTC]
[23:26:43] - Server has already received unit.
[23:26:43] - Preparing to get new work unit...
[23:26:43] + Attempting to get work packet
[23:26:43] - Connecting to assignment server
[23:26:44] - Successful: assigned to (171.64.65.20).
[23:26:44] + News From Folding@Home: Welcome to Folding@Home
[23:26:45] Loaded queue successfully.
[23:26:46] + Closed connections
[23:26:46] 
[23:26:46] + Processing work unit
[23:26:46] Core required: FahCore_14.exe
[23:26:46] Core found.
[23:26:46] Working on queue slot 00 [February 15 23:26:46 UTC]
[23:26:46] + Working ...
[23:26:47] 
[23:26:47] *------------------------------*
[23:26:47] Folding@Home GPU Core - Beta
[23:26:47] Version 1.26 (Wed Oct 14 13:09:26 PDT 2009)
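If I read that right, the result reached the server, the expected ACK never arrived intact, and the retry was answered with "already received", so the unit left the queue without credit. A rough sketch of that sequence (all names here are illustrative, not the real client internals):

Code: Select all

# Sequence suggested by the log above: result stored server-side, ACK
# lost or garbled, retry answered with "already received", unit dropped.
# Illustrative only; these names are not the actual client code.
def upload_with_retry(send, unit_id):
    reply = send(unit_id)      # first attempt; server stores the result
    if reply == "ACK":
        return "credited"
    # "Unknown packet returned from server, expected ACK for results"
    reply = send(unit_id)      # retry the same unit
    if reply == "ALREADY_RECEIVED":
        return "dequeued without credit"
    return "kept in queue"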
Pete
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Post by VijayPande »

From what we see here, this is definitely not fixed yet, although we have some ideas. We are still working on it. We hope to have a code update in a few hours, although I'm trying not to rush this, since that's a way to introduce new bugs and other problems.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
weedacres
Posts: 138
Joined: Mon Dec 24, 2007 11:18 pm
Hardware configuration: UserNames: weedacres_gpu ...
Location: Eastern Washington

Re: What do we do with all of the unsent workunits?

Post by weedacres »

Well, I guess I'll just wait to see if they get the server problems fixed.
I expect that I'll be flushing my completed work.
checka
Posts: 10
Joined: Mon Feb 18, 2008 3:23 am

Re: GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Post by checka »

Vijay, looking back through my posts and log files (Friday 2/12, "server has no record of this unit"), I noticed at least two incidents with 171.64.65.71 as well as 171.67.108.21. Today the return of a 10102 resulted in a freeze: my computer uploaded the result but then froze, unable to reach the assignment server. I just got a notice that it "could not transmit unit 03 (completed February 16 01:32:58 UT) to server 171.64.65.71:8080". At 01:35:41 I got the error message "Server does not have record of this unit. Will try again later. Could not transmit unit 03 to Collection Server; keeping in queue." All I can say is good luck and good hunting.
patonb
Posts: 348
Joined: Thu Oct 23, 2008 2:42 am
Hardware configuration: WooHoo= SR-2 -- L5639 @ ?? -- Evga 560ti FPB -- 12Gig Corsair XMS3 -- Corsair 1050hx -- Blackhawk Ultra

Foldie = @3.2Ghz -- Noctua NH-U12 -- BFG GTX 260-216 -- 6Gig OCZ Gold -- x58a-ud3r -- 6Gig OCZ Gold -- hx520

Re: GPU server status 171.67.108.21, 171.64.65.71, 171.67.108.26

Post by patonb »

Yeah, I had posted that it took 20 min for me to upload a unit, and then I got the message that it already had the unit. ToTow said there was no record of my unit under my name either.
WooHoo = L5639 @ 3.3Ghz Evga SR-2 6x2gb Corsair XMS3 CM 212+ Corsair 1050hx Blackhawk Ultra EVGA 560ti

Foldie = i7 950@ 4.0Ghz x58a-ud3r 216-216 @ 850/2000 3x2gb OCZ Gold NH-u12 Heatsink Corsair hx520 Antec 900