Problem with stats. [Lots of 0 point WUs]

Moderators: Site Moderators, FAHC Science Team

Tobit
Posts: 342
Joined: Thu Apr 17, 2008 2:35 pm
Location: Manchester, NH USA

Problem with stats. [Lots of 0 point WUs]

Post by Tobit »

Well, looks like we have another server issue already this weekend. The stats server appears to be spitting out garbage.

Over the last four updates, multiple teams have noted abnormally high numbers of completed WUs. This is not limited to one team. For example:

EVGA usually completes 1000 to 1200 work units per update. The last four updates show numbers of: 2,871 - 8,406 - 11,745 - 11,184

[H]ardOCP usually completes 400 to 600 units per update. The last four updates show numbers of: 2,198 - 4,604 - 5,246 - 5,117

Both teams have already completed more WUs today than they usually do in one day. However, the points are NORMAL and not affected by this abnormality. A quick look at other teams show this anomaly as well.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for aprox. 70K PPD
Location: Salem. OR USA

Re: Problem with stats server

Post by P5-133XL »

When stuff like this has happened before it was from re-credits. I have no idea about this specific hiccup though.
Image
Tobit
Posts: 342
Joined: Thu Apr 17, 2008 2:35 pm
Location: Manchester, NH USA

Re: Problem with stats server

Post by Tobit »

P5-133XL wrote:When stuff like this has happened before it was from re-credits. I have no idea about this specific hiccup though.
Yeah, I was thinking that but the points appear normal.. there is no apparent spike in points per update. Very odd.
Joe_H
Site Admin
Posts: 7870
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Problem with stats server

Post by Joe_H »

I am not seeing this in all teams, are they getting a large number of small WU's instead of fewer large ones recently? Perhaps just a hiccup, or more information may come out later.
Image

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Tobit
Posts: 342
Joined: Thu Apr 17, 2008 2:35 pm
Location: Manchester, NH USA

Re: Problem with stats server

Post by Tobit »

Joe_H wrote:I am not seeing this in all teams, are they getting a large number of small WU's instead of fewer large ones recently?
There would have to be a lot of small units suddenly assigned to see the spikes shown. It is more noticeable on larger teams but I've seen a few smaller teams with this anomaly as well. There are some SMP units that complete very quickly but, in the 5 years I've been with my team, I don't think I've ever seen a spike like this.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Problem with stats server

Post by bruce »

Thanks for the report. The Pande Group is aware of the information and will be looking into it.
Tobit wrote:
Joe_H wrote:I am not seeing this in all teams, are they getting a large number of small WU's instead of fewer large ones recently?
There would have to be a lot of small units suddenly assigned to see the spikes shown. It is more noticeable on larger teams but I've seen a few smaller teams with this anomaly as well. There are some SMP units that complete very quickly but, in the 5 years I've been with my team, I don't think I've ever seen a spike like this.
If you're one of the people who personally experienced this spike, what type of projects were involved (i.e.- what are your client's preference settings?)
Tobit
Posts: 342
Joined: Thu Apr 17, 2008 2:35 pm
Location: Manchester, NH USA

Re: Problem with stats server

Post by Tobit »

bruce wrote:If you're one of the people who personally experienced this spike, what type of projects were involved (i.e.- what are your client's preference settings?)
One of our users, Schro, says he's been getting assigned quite a few p8001 SMP units that fold really fast on his hardware. He's completed 21 units over the past four updates.
Tobit
Posts: 342
Joined: Thu Apr 17, 2008 2:35 pm
Location: Manchester, NH USA

Re: Problem with stats server

Post by Tobit »

Ahh, I have found a better example..

User kranium- I am trying to look at his logs.

235 WUs for 0 points
239 for 938 points
239 for 7889 points
176 for 26,926 points
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for aprox. 70K PPD
Location: Salem. OR USA

Re: Problem with stats server

Post by P5-133XL »

That is definitely not happening to me, but mine are all GPU WU's: p8018's. What Kranium is seeing is quite significant.
Image
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Problem with stats server

Post by bruce »

Tobit wrote:Ahh, I have found a better example..
The reason I said if YOU experienced the problem, please report. Please search your logs and explain whatever you can learn about the 235 WUs for 0 points. By itself, that information doesn't add anything that wasn't already known -- except perhaps that 'kranium' might have unstable hardware.
Tobit
Posts: 342
Joined: Thu Apr 17, 2008 2:35 pm
Location: Manchester, NH USA

Re: Problem with stats server

Post by Tobit »

Sorry Bruce. Just trying to help users on my team who can't/won't come here. I am trying to get my hands on Kranium's logs, he just posted another 236 units for 0 points. Trying to wake him up to also turn off his box.
Tobit
Posts: 342
Joined: Thu Apr 17, 2008 2:35 pm
Location: Manchester, NH USA

Re: Problem with stats server

Post by Tobit »

Looks like the problem is related to the issues with 128.143.199.96. Here is the log file from a fellow user, KMac, whose stats are all messed up as well:

Code: Select all

Arguments: -oneunit -forceasm -smp -bigadv -verbosity 9 

[02:58:57] - Ask before connecting: No
[02:58:57] - User name: KMac (Team 33)
[02:58:57] - User ID: 4XXXXXXXXXXX
[02:58:57] - Machine ID: 2
[02:58:57] 
[02:58:57] Work directory not found. Creating...
[02:58:57] Could not open work queue, generating new queue...
[02:58:57] - Preparing to get new work unit...
[02:58:57] - Autosending finished units... [October 6 02:58:57 UTC]
[02:58:57] Cleaning up work directory
[02:58:57] Trying to send all finished work units
[02:58:57] + No unsent completed units remaining.
[02:58:57] - Autosend completed
[02:58:57] + Attempting to get work packet
[02:58:57] Passkey found
[02:58:57] - Will indicate memory of 8183 MB
[02:58:57] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 12, Stepping: 2
[02:58:57] - Connecting to assignment server
[02:58:57] Connecting to http://assign.stanford.edu:8080/
[02:58:57] Posted data.
[02:58:57] Initial: 8F80; - Successful: assigned to (128.143.199.96).
[02:58:57] + News From Folding@Home: Welcome to Folding@Home
[02:58:58] Loaded queue successfully.
[02:58:58] Sent data
[02:58:58] Connecting to http://128.143.199.96:8080/
[02:58:59] Posted data.
[02:58:59] Initial: 0000; - Receiving payload (expected size: 1763751)
[02:59:03] - Downloaded at ~430 kB/s
[02:59:03] - Averaged speed for that direction ~430 kB/s
[02:59:03] + Received work.
[02:59:03] + Closed connections
[02:59:03] 
[02:59:03] + Processing work unit
[02:59:03] Core required: FahCore_a3.exe
[02:59:03] Core found.
[02:59:03] Working on queue slot 01 [October 6 02:59:03 UTC]
[02:59:03] + Working ...
[02:59:03] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 12 -nocpulock -checkpoint 3 -forceasm -verbose -lifeline 5016 -version 634'

[02:59:03] 
[02:59:03] *------------------------------*
[02:59:03] Folding@Home Gromacs SMP Core
[02:59:03] Version 2.27 (Dec. 15, 2010)
[02:59:03] 
[02:59:03] Preparing to commence simulation
[02:59:03] - Assembly optimizations manually forced on.
[02:59:03] - Not checking prior termination.
[02:59:04] - Expanded 1763239 -> 1944296 (decompressed 110.2 percent)
[02:59:04] Called DecompressByteArray: compressed_data_size=1763239 data_size=1944296, decompressed_data_size=1944296 diff=0
[02:59:04] - Digital signature verified
[02:59:04] 
[02:59:04] Project: 6947 (Run 0, Clone 143, Gen 189)
[02:59:04] 
[02:59:04] Assembly optimizations on if available.
[02:59:04] Entering M.D.
[02:59:10] Mapping NT from 12 to 12 
[02:59:10] Completed 0 out of 500000 steps  (0%)
[03:01:24] Completed 5000 out of 500000 steps  (1%)
--snip--
[06:42:53] Completed 500000 out of 500000 steps  (100%)
[06:42:54] DynamicWrapper: Finished Work Unit: sleep=10000
[06:43:04] 
[06:43:04] Finished Work Unit:
[06:43:04] - Reading up to 3694992 from "work/wudata_01.trr": Read 3694992
[06:43:04] trr file hash check passed.
[06:43:04] edr file hash check passed.
[06:43:04] logfile size: 61055
[06:43:04] Leaving Run
[06:43:07] - Writing 3790983 bytes of core data to disk...
[06:43:07] Done: 3790471 -> 3516790 (compressed to 92.7 percent)
[06:43:07]   ... Done.
[06:43:07] - Shutting down core
[06:43:07] 
[06:43:07] Folding@home Core Shutdown: FINISHED_UNIT
[06:43:11] CoreStatus = 64 (100)
[06:43:11] Unit 1 finished with 96 percent of time to deadline remaining.
[06:43:11] Updated performance fraction: 0.961088
[06:43:11] Sending work to server
[06:43:11] Project: 6947 (Run 0, Clone 143, Gen 189)


[06:43:11] + Attempting to send results [October 6 06:43:11 UTC]
[06:43:11] - Reading file work/wuresults_01.dat from core
[06:43:11]   (Read 3517302 bytes from disk)
[06:43:11] Connecting to http://128.143.199.96:8080/
[06:43:12] - Couldn't send HTTP request to server
[06:43:12] + Could not connect to Work Server (results)
[06:43:12]     (128.143.199.96:8080)
[06:43:12] + Retrying using alternative port
[06:43:12] Connecting to http://128.143.199.96:80/
[06:43:13] - Couldn't send HTTP request to server
[06:43:13] + Could not connect to Work Server (results)
[06:43:13]     (128.143.199.96:80)
[06:43:13] - Error: Could not transmit unit 01 (completed October 6) to work server.
[06:43:13] - 1 failed uploads of this unit.
[06:43:13]   Keeping unit 01 in queue.

[19:11:38] + Attempting to send results [October 6 19:11:38 UTC]
[19:11:38] - Reading file work/wuresults_01.dat from core
[19:11:38]   (Read 3517302 bytes from disk)
[19:11:38] Connecting to http://130.237.165.141:8080/
[19:12:07] Posted data.
[19:32:07] Initial: 0000; + Could not connect to Work Server (results)
[19:52:07]     (130.237.165.141:8080)
[19:52:07] + Retrying using alternative port
[19:52:07] Connecting to http://130.237.165.141:80/
[19:52:45] Posted data.
[19:52:45] Initial: 0000; - Server does not have record of this unit. Will try again later.
[19:52:45]   Could not transmit unit 01 to Collection server; keeping in queue.
[19:52:45] + Sent 0 of 1 completed units to the server
[19:53:15] Trying to send all finished work units
[19:53:15] Project: 6947 (Run 0, Clone 143, Gen 189)

[19:53:15] + Attempting to send results [October 6 19:53:15 UTC]
[19:53:15] - Reading file work/wuresults_01.dat from core
[19:53:15]   (Read 3517302 bytes from disk)
[19:53:15] Connecting to http://128.143.199.96:8080/
[19:53:16] - Couldn't send HTTP request to server
[19:53:16] + Could not connect to Work Server (results)
[19:53:16]     (128.143.199.96:8080)
[19:53:16] + Retrying using alternative port
[19:53:16] Connecting to http://128.143.199.96:80/
[19:53:18] - Couldn't send HTTP request to server
[19:53:18] + Could not connect to Work Server (results)
[19:53:18]     (128.143.199.96:80)
[19:53:18] - Error: Could not transmit unit 01 (completed October 6) to work server.
[19:53:18] - 485 failed uploads of this unit.
orion
Posts: 135
Joined: Sun Dec 02, 2007 12:45 pm
Hardware configuration: 4p/4 MC ES @ 3.0GHz/32GB
4p/4x6128 @ 2.47GHz/32GB
2p/2 IL ES @ 2.7GHz/16GB
1p/8150/8GB
1p/1090T/4GB
Location: neither here nor there

Re: Problem with stats server

Post by orion »

I haven't had a spike in WU's but I did get one WU for no points on my 3am update.

The only WU that I would have finished for that time period is p6958 (0,239,102) It has tried to upload 9 times but hasn't yet.

Here's the last upload attempt from about 1 hour ago as of this post.

Code: Select all

[19:57:05] + Attempting to send results [October 6 19:57:05 UTC]
[19:57:05] - Reading file work/wuresults_05.dat from core
[19:57:05]   (Read 3522087 bytes from disk)
[19:57:05] Connecting to http://128.143.199.96:8080/
[19:57:05] - Couldn't send HTTP request to server
[19:57:05] + Could not connect to Work Server (results)
[19:57:05]     (128.143.199.96:8080)
[19:57:05] + Retrying using alternative port
[19:57:05] Connecting to http://128.143.199.96:80/
[19:57:05] - Couldn't send HTTP request to server
[19:57:05] + Could not connect to Work Server (results)
[19:57:05]     (128.143.199.96:80)
[19:57:05] - Error: Could not transmit unit 05 (completed October 6) to work server.
[19:57:05] - 9 failed uploads of this unit.


[19:57:05] + Attempting to send results [October 6 19:57:05 UTC]
[19:57:05] - Reading file work/wuresults_05.dat from core
[19:57:05]   (Read 3522087 bytes from disk)
[19:57:05] Connecting to http://130.237.165.141:8080/
[19:58:33] Posted data.
[19:58:33] Initial: 0000; - Uploaded at ~39 kB/s
[19:58:33] - Averaged speed for that direction ~38 kB/s
[19:58:33] - Server does not have record of this unit. Will try again later.
[19:58:33]   Could not transmit unit 05 to Collection server; keeping in queue.
iustus quia...
Tobit
Posts: 342
Joined: Thu Apr 17, 2008 2:35 pm
Location: Manchester, NH USA

Re: Problem with stats server

Post by Tobit »

In the log file I posted, I suspect what is happening is:

Client tries to upload to 128.143.199.96
Server has been in and out of reject state for awhile
Client makes a long enough connection that the WS thinks it receives a WU and records 0 credit or partial credit
Client ultimately timesout transmitting results
Keeps in queue to try again
Rinse - Repeat
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Problem with stats server

Post by bruce »

The problem with 128.143.199.96 in REJECT and the follow-on problem of the Collection Server not having a record of the WU is being discussed here. Those units are NOT being credited, so I'm confident that it's an entirely different problem. Getting 0 points and counting one more WU means that data IS getting to the stats server, even if it's zero points.

Project: 6947 (Run 0, Clone 143, Gen 189) has not been credited, which is another reason to believe they're separate problems.

I've changed the title. It looks like the stats SERVER is properly crediting every stats record that's being generated by the Work Servers.
Post Reply