Several console clients no longer getting work


Torin3
Posts: 39
Joined: Mon Feb 11, 2008 11:40 am

Post by Torin3 »

This includes 5.04beta, 6.01beta4, and SMP 5.91beta6.

I get an OK from http://assign.stanford.edu:8080/
I get a connection refused from http://assign2.stanford.edu/

We are using an ISA server as a firewall for our network, and I double-checked that it isn't blocking anything in 171.xxx.xxx.xxx. These boxes were turning in and getting new work units before, so I'm not sure what happened. They are all on a shared T1 line. If I switch a box over to our backup DSL line, it does download a new work unit. However, it isn't practical for me to switch them over every time they are ready for a new work unit. When I ping assign2.stanford.edu, it resolves to 171.64.65.121.

Any suggestions?



--- Opening Log file [February 26 17:09:14] 


# Windows Console Edition #####################################################
###############################################################################

                       Folding@Home Client Version 5.04beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\fah1
Executable: C:\Program Files\fah1\FAH504-Console.exe
Arguments: -verbosity 9 -forceasm -advmethods 

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[17:09:29] - Ask before connecting: No
[17:09:30] - User name: Torin3 (Team 32)
[17:09:31] - User ID: 37E65BE7264C6495
[17:09:31] - Machine ID: 1
[17:09:31] 
[17:09:32] Loaded queue successfully.
[17:09:32] + Benchmarking ...
[17:09:35] The benchmark result is 3580
[17:09:37] 
[17:09:37] - Autosending finished units...
[17:09:37] + Processing work unit
[17:09:37] Trying to send all finished work units
[17:09:37] Core required: FahCore_81.exe
[17:09:37] + No unsent completed units remaining.
[17:09:37] Core found.
[17:09:38] - Autosend completed
[17:09:39] Working on Unit 05 [February 26 17:09:39]
[17:09:39] + Working ...
[17:09:39] - Calling 'FahCore_81.exe -dir work/ -suffix 05 -checkpoint 30 -forceasm -verbose -lifeline 3916 -version 504'

[17:09:41] 
[17:09:41] *------------------------------*
[17:09:42] Folding@Home Gromacs Simulated Tempering Core
[17:09:42] Version 1.10 (Oct 4, 2007)
[17:09:42] 
[17:09:42] Preparing to commence simulation
[17:09:43] - Assembly optimizations manually forced on.
[17:09:43] - Not checking prior termination.
[17:09:43] - Expanded 364133 -> 1799134 (decompressed 494.0 percent)
[17:09:47] 
[17:09:47] Project: 3620 (Run 101, Clone 2, Gen 10)
[17:09:47] 
[17:09:48] Assembly optimizations on if available.
[17:09:48] Entering M.D.
[17:10:08] (Starting from checkpoint)
[17:10:09] Protein: p3620_Seq26_Amber03_Extended
[17:10:09] 
[17:10:11] Writing local files
[17:11:05] Completed 1255857 out of 1500000 steps  (84)
[17:11:05] Extra SSE boost OK.
[17:24:49] Writing local files
[17:24:49] Completed 1260000 out of 1500000 steps  (84)
[17:55:48] Timered checkpoint triggered.
[18:09:09] Writing local files
[18:09:09] Completed 1275000 out of 1500000 steps  (85)
[18:40:08] Timered checkpoint triggered.
[18:54:09] Writing local files
[18:54:09] Completed 1290000 out of 1500000 steps  (86)
[19:25:09] Timered checkpoint triggered.
[19:38:28] Writing local files
[19:38:28] Completed 1305000 out of 1500000 steps  (87)
[20:09:28] Timered checkpoint triggered.
[20:22:02] Writing local files
[20:22:02] Completed 1320000 out of 1500000 steps  (88)
[20:53:01] Timered checkpoint triggered.
[21:06:08] Writing local files
[21:06:09] Completed 1335000 out of 1500000 steps  (89)
[21:37:08] Timered checkpoint triggered.
[21:50:10] Writing local files
[21:50:10] Completed 1350000 out of 1500000 steps  (90)
[22:21:10] Timered checkpoint triggered.
[22:34:31] Writing local files
[22:34:31] Completed 1365000 out of 1500000 steps  (91)
[23:05:32] Timered checkpoint triggered.
[23:09:39] - Autosending finished units...
[23:09:39] Trying to send all finished work units
[23:09:39] + No unsent completed units remaining.
[23:09:39] - Autosend completed
[23:18:36] Writing local files
[23:18:36] Completed 1380000 out of 1500000 steps  (92)
[23:49:36] Timered checkpoint triggered.
[00:02:37] Writing local files
[00:02:38] Completed 1395000 out of 1500000 steps  (93)
[00:33:38] Timered checkpoint triggered.
[00:47:03] Writing local files
[00:47:03] Completed 1410000 out of 1500000 steps  (94)
[01:18:02] Timered checkpoint triggered.
[01:31:12] Writing local files
[01:31:12] Completed 1425000 out of 1500000 steps  (95)
[02:02:13] Timered checkpoint triggered.
[02:15:03] Writing local files
[02:15:03] Completed 1440000 out of 1500000 steps  (96)
[02:46:03] Timered checkpoint triggered.
[02:59:25] Writing local files
[02:59:25] Completed 1455000 out of 1500000 steps  (97)
[03:30:26] Timered checkpoint triggered.
[03:43:49] Writing local files
[03:43:49] Completed 1470000 out of 1500000 steps  (98)
[04:14:49] Timered checkpoint triggered.
[04:27:45] Writing local files
[04:27:45] Completed 1485000 out of 1500000 steps  (99)
[04:58:46] Timered checkpoint triggered.
[05:09:39] - Autosending finished units...
[05:09:39] Trying to send all finished work units
[05:09:39] + No unsent completed units remaining.
[05:09:39] - Autosend completed
[05:12:05] Writing local files
[05:12:05] Completed 1500000 out of 1500000 steps  (100)
[05:12:05] Writing final coordinates.
[05:12:06] Past main M.D. loop
[05:13:06] 
[05:13:06] Finished Work Unit:
[05:13:06] - Reading up to 297504 from "work/wudata_05.arc": Read 297504
[05:13:06] - Reading up to 478844 from "work/wudata_05.xtc": Read 478844
[05:13:06] goefile size: 0
[05:13:06] logfile size: 59666
[05:13:06] Leaving Run
[05:13:08] - Writing 968672 bytes of core data to disk...
[05:13:09] Done: 968160 -> 781937 (compressed to 80.7 percent)
[05:13:09]   ... Done.
[05:13:09] - Shutting down core
[05:13:09] 
[05:13:09] Folding@home Core Shutdown: FINISHED_UNIT
[05:13:12] CoreStatus = 64 (100)
[05:13:12] Unit 5 finished with 97 percent of time to deadline remaining.
[05:13:12] Updated performance fraction: 0.943991
[05:13:12] Sending work to server


[05:13:12] + Attempting to send results
[05:13:12] - Reading file work/wuresults_05.dat from core
[05:13:12]   (Read 782449 bytes from disk)
[05:13:12] Connecting to http://171.64.122.82:80/
[05:13:24] Posted data.
[05:13:24] Initial: 0000; - Uploaded at ~63 kB/s
[05:13:24] - Averaged speed for that direction ~67 kB/s
[05:13:24] + Results successfully sent
[05:13:24] Thank you for your contribution to Folding@Home.
[05:13:24] + Number of Units Completed: 5

[05:13:28] Trying to send all finished work units
[05:13:28] + No unsent completed units remaining.
[05:13:28] - Preparing to get new work unit...
[05:13:28] + Attempting to get work packet
[05:13:28] - Will indicate memory of 2013 MB
[05:13:28] - Connecting to assignment server
[05:13:28] Connecting to http://assign.stanford.edu:8080/
[05:13:28] - Couldn't send HTTP request to server
[05:13:28] + Could not connect to Assignment Server
[05:13:28] Connecting to http://assign2.stanford.edu:80/
[05:13:30] - Couldn't send HTTP request to server
[05:13:30] + Could not connect to Assignment Server 2
[05:13:30] + Couldn't get work instructions.
[05:13:30] - Error: Attempt #1  to get work failed, and no other work to do.
             Waiting before retry.
[05:13:41] + Attempting to get work packet
[05:13:41] - Will indicate memory of 2013 MB
[05:13:41] - Connecting to assignment server
[05:13:41] Connecting to http://assign.stanford.edu:8080/
[05:13:41] - Couldn't send HTTP request to server
[05:13:41] + Could not connect to Assignment Server
[05:13:41] Connecting to http://assign2.stanford.edu:80/
[05:13:42] - Couldn't send HTTP request to server
[05:13:42] + Could not connect to Assignment Server 2
[05:13:42] + Couldn't get work instructions.
[05:13:42] - Error: Attempt #2  to get work failed, and no other work to do.
             Waiting before retry.


--- Opening Log file [February 25 22:09:20] 


# Windows Console Edition #####################################################
###############################################################################

                       Folding@Home Client Version 6.01beta4

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\fah
Service: C:\Program Files\fah\fah6-win-x86-console.exe
Arguments: -svcstart 

Launched as a service.
Entered C:\Program Files\fah to do work.

[22:09:20] - Ask before connecting: No
[22:09:20] - User name: Torin3 (Team 32)
[22:09:20] - User ID: 32A551543679AFAF
[22:09:20] - Machine ID: 1
[22:09:20] 
[22:09:20] Loaded queue successfully.
[22:09:20] 
[22:09:20] + Processing work unit
[22:09:20] Core required: FahCore_81.exe
[22:09:20] Core found.
[22:09:20] Working on Unit 01 [February 25 22:09:20]
[22:09:20] + Working ...
[22:09:20] 
[22:09:20] *------------------------------*
[22:09:20] Folding@Home Gromacs Simulated Tempering Core
[22:09:20] Version 1.10 (Oct 4, 2007)
[22:09:20] 
[22:09:20] Preparing to commence simulation
[22:09:20] - Looking at optimizations...
[22:09:20] - Files status OK
[22:09:21] - Expanded 365631 -> 1800420 (decompressed 492.4 percent)
[22:09:21] 
[22:09:21] Project: 3640 (Run 14, Clone 9, Gen 8)
[22:09:21] 
[22:09:21] Assembly optimizations on if available.
[22:09:21] Entering M.D.
[22:09:41] (Starting from checkpoint)
[22:09:41] Protein: p3640_Seq14_Amber03_Extended
[22:09:41] 
[22:09:41] Writing local files
[22:10:27] Completed 1365000 out of 1500000 steps  (91%)
[22:10:27] Extra SSE boost OK.
[22:49:54] Writing local files
[22:49:54] Completed 1380000 out of 1500000 steps  (92%)
[23:28:08] Writing local files
[23:28:08] Completed 1395000 out of 1500000 steps  (93%)
[00:06:25] Writing local files
[00:06:25] Completed 1410000 out of 1500000 steps  (94%)
[00:44:33] Writing local files
[00:44:33] Completed 1425000 out of 1500000 steps  (95%)
[01:22:41] Writing local files
[01:22:41] Completed 1440000 out of 1500000 steps  (96%)
[02:00:55] Writing local files
[02:00:55] Completed 1455000 out of 1500000 steps  (97%)
[02:39:03] Writing local files
[02:39:03] Completed 1470000 out of 1500000 steps  (98%)
[03:17:13] Writing local files
[03:17:13] Completed 1485000 out of 1500000 steps  (99%)
[03:55:26] Writing local files
[03:55:26] Completed 1500000 out of 1500000 steps  (100%)
[03:55:26] Writing final coordinates.
[03:55:26] Past main M.D. loop
[03:56:26] 
[03:56:26] Finished Work Unit:
[03:56:26] - Reading up to 297528 from "work/wudata_01.arc": Read 297528
[03:56:26] - Reading up to 507832 from "work/wudata_01.xtc": Read 507832
[03:56:26] goefile size: 0
[03:56:26] logfile size: 110214
[03:56:27] Leaving Run
[03:56:27] - Writing 1061231 bytes of core data to disk...
[03:56:28] Done: 1060719 -> 810249 (compressed to 76.3 percent)
[03:56:28]   ... Done.
[03:56:28] - Shutting down core
[03:56:28] 
[03:56:28] Folding@home Core Shutdown: FINISHED_UNIT
[03:56:31] CoreStatus = 64 (100)
[03:56:31] Sending work to server
[03:56:31] - Read packet limit of 540015616... Set to 524286976.


[03:56:31] + Attempting to send results
[03:56:33] - Couldn't send HTTP request to server
[03:56:33] + Could not connect to Work Server (results)
[03:56:33]     (171.64.122.82:80)
[03:56:33] - Error: Could not transmit unit 01 (completed February 26) to work server.
[03:56:33]   Keeping unit 01 in queue.
[03:56:33] - Read packet limit of 540015616... Set to 524286976.


[03:56:33] + Attempting to send results
[03:56:35] - Couldn't send HTTP request to server
[03:56:35] + Could not connect to Work Server (results)
[03:56:35]     (171.64.122.82:80)
[03:56:35] - Error: Could not transmit unit 01 (completed February 26) to work server.
[03:56:35] - Read packet limit of 540015616... Set to 524286976.


[03:56:35] + Attempting to send results
[03:56:36] - Couldn't send HTTP request to server
[03:56:36] + Could not connect to Work Server (results)
[03:56:36]     (171.64.122.76:80)
[03:56:36]   Could not transmit unit 01 to Collection server; keeping in queue.
[03:56:36] - Preparing to get new work unit...
[03:56:36] + Attempting to get work packet
[03:56:36] - Connecting to assignment server
[03:56:44] - Couldn't send HTTP request to server
[03:56:44] + Could not connect to Assignment Server
[03:56:51] - Couldn't send HTTP request to server
[03:56:51] + Could not connect to Assignment Server 2
[03:56:51] + Couldn't get work instructions.
[03:56:51] - Attempt #1  to get work failed, and no other work to do.
             Waiting before retry.
[03:57:03] + Attempting to get work packet
[03:57:03] - Connecting to assignment server
[03:57:04] - Couldn't send HTTP request to server
[03:57:04] + Could not connect to Assignment Server
[03:57:06] - Couldn't send HTTP request to server
[03:57:06] + Could not connect to Assignment Server 2
[03:57:06] + Couldn't get work instructions.
[03:57:06] - Attempt #2  to get work failed, and no other work to do.
             Waiting before retry.



--- Opening Log file [February 26 11:10:54] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 5.91beta6

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\fah
Executable: C:\Program Files\fah\fah.exe
Arguments: -verbosity 9 -forceasm -advmethods 

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[11:10:54] - Ask before connecting: No
[11:10:54] - User name: Torin3 (Team 32)
[11:10:54] - User ID: 4FC70C1768D014E8
[11:10:55] - Machine ID: 1
[11:10:55] 
[11:10:55] Loaded queue successfully.
[11:10:55] 
[11:10:55] - Autosending finished units...
[11:10:55] + Processing work unit
[11:10:55] Trying to send all finished work units
[11:10:55] Core required: FahCore_a1.exe
[11:10:56] + No unsent completed units remaining.
[11:10:56] Core found.
[11:10:56] - Autosend completed
[11:10:56] Working on Unit 07 [February 26 11:10:56]
[11:10:56] + Working ...
[11:10:56] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 07 -checkpoint 30 -forceasm -verbose -lifeline 2184 -version 591'

[11:10:57] 
[11:10:57] *------------------------------*
[11:10:57] Folding@Home Gromacs SMP Core
[11:10:57] Version 1.74 (March 10, 2007)
[11:10:57] 
[11:10:58] Preparing to commence simulation
[11:10:58] - Ensuring status. Please wait.
[11:11:14] - Assembly optimizations manually forced on.
[11:11:14] - Not checking prior termination.
[11:11:20] - Expanded 2965840 -> 15212615 (decompressed 512.9 percent)
[11:11:21] 
[11:11:21] Project: 2653 (Run 15, Clone 188, Gen 63)
[11:11:21] 
[11:11:24] Assembly optimizations on if available.
[11:11:24] Entering M.D.
[11:11:30] Calling FAH init
[11:11:31] in POPC
[11:11:31] Writing local files
[11:11:31]  checkpoint)
[11:11:31] Read checkpoint
[11:11:31] Protein: Protein in POPC
[11:11:31] ra SSE boost OK.
[11:11:31] es
[11:11:31] Completed 465000 out of 500000 steps  (93 percent)
[11:11:32] Extra SSE boost OK.
[11:27:25] Writing local files
[11:27:25] Completed 470000 out of 500000 steps  (94 percent)
[11:41:57] Writing local files
[11:41:57] Completed 475000 out of 500000 steps  (95 percent)
[11:57:34] Writing local files
[11:57:34] Completed 480000 out of 500000 steps  (96 percent)
[12:13:26] Writing local files
[12:13:26] Completed 485000 out of 500000 steps  (97 percent)
[12:33:50] Writing local files
[12:33:50] Completed 490000 out of 500000 steps  (98 percent)
[12:49:17] Writing local files
[12:49:17] Completed 495000 out of 500000 steps  (99 percent)
[13:04:39] Writing local files
[13:04:40] Completed 500000 out of 500000 steps  (100 percent)
[13:04:40] Writing final coordinates.
[13:04:41] Past main M.D. loop
[13:04:41] Will end MPI now
[13:05:41] 
[13:05:41] Finished Work Unit:
[13:05:41] - Reading up to 3724272 from "work/wudata_07.arc": Read 3724272
[13:05:42] - Reading up to 1780324 from "work/wudata_07.xtc": Read 1780324
[13:05:43] goefile size: 0
[13:05:43] logfile size: 0
[13:05:43] Warning: Core could not open logfile.
[13:05:43] Leaving Run
[13:05:47] - Writing 5508996 bytes of core data to disk...
[13:05:48]   ... Done.
[13:05:48] - Failed to delete work/wudata_07.sas
[13:05:48] - Failed to delete work/wudata_07.goe
[13:05:48] Warning:  check for stray files
[13:05:48] - Shutting down core
[13:07:48] 
[13:07:48] Folding@home Core Shutdown: FINISHED_UNIT
[13:07:48] 
[13:07:48] Folding@home Core Shutdown: FINISHED_UNIT
[13:07:51] CoreStatus = 64 (100)
[13:07:51] Unit 7 finished with 71 percent of time to deadline remaining.
[13:07:51] Updated performance fraction: 0.716120
[13:07:51] Sending work to server


[13:07:51] + Attempting to send results
[13:07:51] - Reading file work/wuresults_07.dat from core
[13:07:51]   (Read 5508996 bytes from disk)
[13:07:51] Connecting to http://171.64.65.64:80/
[13:09:11] Posted data.
[13:09:12] Initial: 0000; - Uploaded at ~65 kB/s
[13:09:13] - Averaged speed for that direction ~66 kB/s
[13:09:13] + Results successfully sent
[13:09:13] Thank you for your contribution to Folding@Home.
[13:09:13] + Number of Units Completed: 7

[13:11:59] - Warning: Could not delete all work unit files (7): Core returned invalid code
[13:11:59] Trying to send all finished work units
[13:11:59] + No unsent completed units remaining.
[13:11:59] - Preparing to get new work unit...
[13:11:59] + Attempting to get work packet
[13:11:59] - Will indicate memory of 2045 MB
[13:11:59] - Connecting to assignment server
[13:11:59] Connecting to http://assign.stanford.edu:8080/
[13:11:59] - Couldn't send HTTP request to server
[13:11:59] + Could not connect to Assignment Server
[13:11:59] Connecting to http://assign2.stanford.edu:80/
[13:12:00] - Couldn't send HTTP request to server
[13:12:00] + Could not connect to Assignment Server 2
[13:12:00] + Couldn't get work instructions.
[13:12:00] - Error: Attempt #1  to get work failed, and no other work to do.
             Waiting before retry.
[13:12:13] + Attempting to get work packet
[13:12:13] - Will indicate memory of 2045 MB
[13:12:13] - Connecting to assignment server
[13:12:13] Connecting to http://assign.stanford.edu:8080/
[13:12:13] - Couldn't send HTTP request to server
[13:12:13] + Could not connect to Assignment Server
[13:12:13] Connecting to http://assign2.stanford.edu:80/
[13:12:14] - Couldn't send HTTP request to server
[13:12:14] + Could not connect to Assignment Server 2
[13:12:14] + Couldn't get work instructions.
[13:12:14] - Error: Attempt #2  to get work failed, and no other work to do.
             Waiting before retry.
[13:12:26] + Attempting to get work packet
[13:12:26] - Will indicate memory of 2045 MB
[13:12:26] - Connecting to assignment server
[13:12:26] Connecting to http://assign.stanford.edu:8080/
[13:12:26] - Couldn't send HTTP request to server
[13:12:26] + Could not connect to Assignment Server
[13:12:26] Connecting to http://assign2.stanford.edu:80/
[13:12:27] - Couldn't send HTTP request to server
[13:12:27] + Could not connect to Assignment Server 2
[13:12:27] + Couldn't get work instructions.
[13:12:27] - Error: Attempt #3  to get work failed, and no other work to do.
             Waiting before retry.
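
For anyone comparing several logs like the three above, a short script can tally the failures instead of reading each file by eye. This is only a rough sketch: the patterns are copied from the log lines quoted in this post, `summarize_fahlog` is a made-up helper name, and other client versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns based on the FAHlog lines quoted in this thread (assumed wording).
FAIL_RE = re.compile(r"^\[(\d\d:\d\d:\d\d)\] \+ Could not connect to (.+)$")
ATTEMPT_RE = re.compile(r"Attempt #(\d+)\s+to get work failed")


def summarize_fahlog(lines):
    """Count connection failures per server label and find the highest retry number."""
    failures = Counter()
    last_attempt = 0
    for line in lines:
        line = line.rstrip("\n")
        m = FAIL_RE.match(line)
        if m:
            # Label is e.g. "Assignment Server" or "Work Server (results)".
            failures[m.group(2).strip()] += 1
            continue
        m = ATTEMPT_RE.search(line)
        if m:
            last_attempt = max(last_attempt, int(m.group(1)))
    return failures, last_attempt
```

Passing the lines of a FAHlog.txt (for example `summarize_fahlog(open("FAHlog.txt"))`) returns the per-server failure counts and the highest "Attempt #" seen, which makes it easier to tell whether a box is failing only on assignment or on uploads too.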
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Several console clients no longer getting work

Post by bruce »

Torin3 wrote:This includes 5.04beta, 6.01beta4, and SMP 5.91beta6.

I get an OK from http://assign.stanford.edu:8080/
I get a connection refused from http://assign2.stanford.edu/

We are using an ISA server as a firewall for our network, and I double-checked that it isn't blocking anything in 171.xxx.xxx.xxx. These boxes were turning in and getting new work units before, so I'm not sure what happened. They are all on a shared T1 line. If I switch a box over to our backup DSL line, it does download a new work unit. However, it isn't practical for me to switch them over every time they are ready for a new work unit. When I ping assign2.stanford.edu, it resolves to 171.64.65.121.

Any suggestions?

The connection refused from assign2 is being fixed. (Yes, the IP address has changed, but I also get connection refused from http://171.64.65.121/.) I don't think it's related to your primary problem, but it interferes with the "standard" methods of debugging a connection, so it matters to forum members who are trying to help with problems like yours.

You didn't specify which line (T1 or DSL) was used for the test of http://assign.stanford.edu:8080/. As far as I know, the FAH client tries to connect to the primary server in the same way that a browser would. If you can connect to it from the browser but not from the client, there must be a difference in how FAH connects through your primary proxy. This might include local security software, but I'd think that would be the same for either ISP connection.

Please post FAHlog.txt for a successful connection through the backup DSL line.
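
One way to separate a proxy or firewall difference from a server outage is to test the raw TCP connection from the affected box, with no browser in the path. Below is a minimal Python sketch, assuming Python 3 is available on the machine. `check_tcp` and `report` are hypothetical helper names, and the hosts and ports are the ones shown in the client logs. It only checks whether a TCP connection opens; it does not speak the FAH protocol.

```python
import socket


def check_tcp(host, port, timeout=5.0):
    """Try a direct TCP connection (no proxy) and return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # OSError covers refused connections, timeouts, and DNS failures.
        return False, f"{type(exc).__name__}: {exc}"


# Hosts and ports taken from the logs posted in this thread.
ASSIGNMENT_SERVERS = [
    ("assign.stanford.edu", 8080),
    ("assign2.stanford.edu", 80),
]


def report(servers=ASSIGNMENT_SERVERS):
    """Print one result line per server for the current network path."""
    for host, port in servers:
        ok, detail = check_tcp(host, port)
        status = "OK" if ok else "FAILED"
        print(f"{host}:{port} -> {status} ({detail})")
```

Calling `report()` once on the T1 and once on the DSL line gives a direct comparison. If the browser reaches the server through a configured proxy while this direct check fails on the same line, the difference is more likely in the proxy or firewall path than in the client.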
Torin3
Posts: 39
Joined: Mon Feb 11, 2008 11:40 am

Re: Several console clients no longer getting work

Post by Torin3 »

bruce wrote: The connection refused from assign2 is being fixed. (Yes, the IP address has changed, but I also get connection-refused from http://171.64.65.121/) I don't think it's related to your primary problem but it's an issue for the "standard" methods to debug a connection so it's important to forum members who are trying to help with problems like yours.

You didn't specify which line was used for the test of http://assign.stanford.edu:8080/

It worked from the T1 and the DSL lines.

bruce wrote: As far as I know, the FAH client tries to connect to the primary server in the same way that a browser would. If you can connect to it from the browser and not from the client, there must be a difference in how FAH connects to your primary proxy. This might include local security software, but I'd think that would be the same for either ISP connection.

Please post FAHlog.txt for a successful connection through the backup DSL line.

This covers from T1 to DSL and back to T1 after I got the next work unit.


[11:17:18] + Attempting to get work packet
[11:17:18] - Will indicate memory of 2045 MB
[11:17:18] - Connecting to assignment server
[11:17:18] Connecting to http://assign.stanford.edu:8080/
[11:17:30] - Could not v3HTTPOpen
[11:17:30] + Could not connect to Assignment Server
[11:17:30] Connecting to http://assign2.stanford.edu:80/
[11:17:30] - Couldn't send HTTP request to server
[11:17:30] + Could not connect to Assignment Server 2
[11:17:30] + Couldn't get work instructions.
[11:17:30] - Error: Attempt #8  to get work failed, and no other work to do.
             Waiting before retry.
[11:17:55] Killing all core threads

Folding@Home Client Shutdown at user request.
[11:17:55] ***** Got a SIGTERM signal (2)
[11:17:55] Killing all core threads

Folding@Home Client Shutdown.


--- Opening Log file [February 27 11:18:29] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 5.91beta6

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\program files\fah
Executable: C:\Program Files\fah\fah.exe
Arguments: -verbosity 9 -forceasm -advmethods 

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[11:18:29] - Ask before connecting: No
[11:18:29] - User name: Torin3 (Team 32)
[11:18:29] - User ID: 4FC70C1768D014E8
[11:18:29] - Machine ID: 1
[11:18:29] 
[11:18:30] Loaded queue successfully.
[11:18:30] - Preparing to get new work unit...
[11:18:30] - Autosending finished units...
[11:18:30] + Attempting to get work packet
[11:18:30] Trying to send all finished work units
[11:18:30] - Will indicate memory of 2045 MB
[11:18:30] + No unsent completed units remaining.
[11:18:30] - Autosend completed
[11:18:30] - Connecting to assignment server
[11:18:30] Connecting to http://assign.stanford.edu:8080/
[11:18:30] Posted data.
[11:18:30] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[11:18:30] + News From Folding@Home: Welcome to Folding@Home
[11:18:30] Loaded queue successfully.
[11:18:30] Connecting to http://171.64.65.64:8080/
[11:18:33] Posted data.
[11:18:33] Initial: 0000; - Receiving payload (expected size: 2964839)
[11:18:49] - Downloaded at ~180 kB/s
[11:18:49] - Averaged speed for that direction ~98 kB/s
[11:18:49] + Received work.
[11:18:50] + Closed connections
[11:18:50] 
[11:18:50] + Processing work unit
[11:18:50] Core required: FahCore_a1.exe
[11:18:50] Core found.
[11:18:50] Working on Unit 08 [February 27 11:18:50]
[11:18:50] + Working ...
[11:18:50] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 08 -checkpoint 30 -forceasm -verbose -lifeline 2972 -version 591'

[11:18:50] 
[11:18:50] *------------------------------*
[11:18:50] Folding@Home Gromacs SMP Core
[11:18:50] Version 1.74 (March 10, 2007)
[11:18:50] 
[11:18:50] Preparing to commence simulation
[11:18:50] - Ensuring status. Please wait.
[11:19:07] - Assembly optimizations manually forced on.
[11:19:07] - Not checking prior termination.
[11:19:13] - Expanded 2964327 -> 15205923 (decompressed 512.9 percent)
[11:19:13] - Starting from initial work packet
[11:19:13] 
[11:19:13] Project: 2653 (Run 20, Clone 153, Gen 65)
[11:19:13] 
[11:19:14] Assembly optimizations on if available.
[11:19:14] Entering M.D.
[11:19:20] Rejecting checkpoint
[11:19:21] ProtWriting local files
[11:19:21] Extra SSE boost OK.
[11:19:21] 
[11:19:22] Extra SSE boost OK.
[11:19:23] Writing local files
[11:19:23] Completed 0 out of 500000 steps  (0 percent)
[11:24:51] Killing all core threads
[11:24:51] Killing SMP core threads
[11:24:51] Killing 2 cores
[11:24:51] Killing core 0
[11:24:51] Killing core 1

Folding@Home Client Shutdown at user request.
[11:24:51] ***** Got a SIGTERM signal (2)
[11:24:51] Killing all core threads
[11:24:51] Killing SMP core threads
[11:24:51] Killing 2 cores
[11:24:51] Killing core 0
[11:24:51] Killing core 1

Folding@Home Client Shutdown.


--- Opening Log file [February 27 11:39:00] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 5.91beta6

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\fah
Executable: C:\Program Files\fah\fah.exe
Arguments: -verbosity 9 -forceasm -advmethods 

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[11:39:03] - Ask before connecting: No
[11:39:03] - User name: Torin3 (Team 32)
[11:39:03] - User ID: 4FC70C1768D014E8
[11:39:04] - Machine ID: 1
[11:39:04] 
[11:39:04] Loaded queue successfully.
[11:39:04] 
[11:39:04] - Autosending finished units...
[11:39:04] + Processing work unit
[11:39:04] Trying to send all finished work units
[11:39:05] Core required: FahCore_a1.exe
[11:39:05] + No unsent completed units remaining.
[11:39:05] Core found.
[11:39:05] - Autosend completed
[11:39:05] Working on Unit 08 [February 27 11:39:05]
[11:39:05] + Working ...
[11:39:05] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 08 -checkpoint 30 -forceasm -verbose -lifeline 1296 -version 591'

[11:39:06] 
[11:39:06] *------------------------------*
[11:39:06] Folding@Home Gromacs SMP Core
[11:39:07] Version 1.74 (March 10, 2007)
[11:39:07] 
[11:39:07] Preparing to commence simulation
[11:39:07] - Ensuring status. Please wait.
[11:39:23] - Assembly optimizations manually forced on.
[11:39:23] - Not checking prior termination.
[11:39:28] - Expanded 2964327 -> 15205923 (decompressed 512.9 percent)
[11:39:30] 
[11:39:30] Project: 2653 (Run 20, Clone 153, Gen 65)
[11:39:30] 
[11:39:30] Assembly optimizations on if available.
[11:39:30] Entering M.D.
[11:39:37] Calling FAH init
[11:39:38] in POPC
[11:39:38] Writing local files
[11:39:38]  checkpoint)
[11:39:38] Read checkpoint
[11:39:38] Protein: Protein in POPC
[11:39:38] Writing local files
[11:39:39] Extra SSE boost OK.
[11:39:39] Writing local files
[11:39:39] Completed 0 out of 500000 steps  (0 percent)
[11:56:40] Writing local files
[11:56:40] Completed 5000 out of 500000 steps  (1 percent)
[12:12:58] Writing local files
[12:12:58] Completed 10000 out of 500000 steps  (2 percent)
[12:27:45] Writing local files
[12:27:45] Completed 15000 out of 500000 steps  (3 percent)
[12:41:01] ***** Windows shutdown
[12:41:01] ***** Got a SIGTERM signal (2)
[12:41:01] Killing all core threads
[12:41:01] Killing SMP core threads
[12:41:01] Killing 2 cores
[12:41:01] Killing core 0
[12:41:01] Killing core 1

Folding@Home Client Shutdown.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Several console clients no longer getting work

Post by bruce »

OK, we can now see that the FAH client can connect to http://assign.stanford.edu:8080/ on the DSL but not on the T1. Is the same true when a browser tries to connect?

While this was all happening, the work server at 171.64.65.64, which had been off-line for several hours, came back on-line. I don't think this is related, but until we solve the real issue, we can't rule it out entirely.
Torin3
Posts: 39
Joined: Mon Feb 11, 2008 11:40 am

Re: Several console clients no longer getting work

Post by Torin3 »

bruce wrote:OK, we can now see that the FAH client can connect to http://assign.stanford.edu:8080/ on the DSL but not on the T1. Is the same true when a browser tries to connect?

While this was all happening, (171.64.65.64), which had been off-line for several hours came back on-line. I don't think this is related, but until we solve the real issue, we can't rule it out entirely.

The browser connects on the T1. Also, I went to double-check my other local machine that wasn't getting work, and I see it got a work unit about an hour ago. I'll check the other stalled machines to see whether they are getting work now too.
Torin3
Posts: 39
Joined: Mon Feb 11, 2008 11:40 am

Re: Several console clients no longer getting work

Post by Torin3 »

SMP got work.

5.04Beta got work.

6.01beta4 clients still haven't gotten work.

Here is the log from the 5.04Beta client as it got work:

Code: Select all

[14:41:48] + Attempting to get work packet
[14:41:48] - Will indicate memory of 2013 MB
[14:41:48] - Connecting to assignment server
[14:41:48] Connecting to http://assign.stanford.edu:8080/
[14:41:48] - Couldn't send HTTP request to server
[14:41:48] + Could not connect to Assignment Server
[14:41:48] Connecting to http://assign2.stanford.edu:80/
[14:41:49] - Couldn't send HTTP request to server
[14:41:49] + Could not connect to Assignment Server 2
[14:41:49] + Couldn't get work instructions.
[14:41:49] - Error: Attempt #21  to get work failed, and no other work to do.
             Waiting before retry.
[15:29:58] + Attempting to get work packet
[15:29:58] - Will indicate memory of 2013 MB
[15:29:58] - Connecting to assignment server
[15:29:58] Connecting to http://assign.stanford.edu:8080/
[15:29:58] - Couldn't send HTTP request to server
[15:29:58] + Could not connect to Assignment Server
[15:29:58] Connecting to http://assign2.stanford.edu:80/
[15:29:59] Posted data.
[15:29:59] Initial: 41AB; - Successful: assigned to (171.65.103.160).
[15:29:59] + News From Folding@Home: Welcome to Folding@Home
[15:29:59] Loaded queue successfully.
[15:29:59] Connecting to http://171.65.103.160:80/
[15:30:00] Posted data.
[15:30:00] Initial: 0000; - Receiving payload (expected size: 474540)
[15:30:06] - Downloaded at ~77 kB/s
[15:30:06] - Averaged speed for that direction ~75 kB/s
[15:30:06] + Received work.
[15:30:06] Trying to send all finished work units
[15:30:06] + No unsent completed units remaining.
[15:30:06] + Closed connections
[15:30:06] 
[15:30:06] + Processing work unit
[15:30:06] Core required: FahCore_78.exe
[15:30:06] Core found.
[15:30:06] Working on Unit 06 [February 27 15:30:06]
[15:30:06] + Working ...
[15:30:06] - Calling 'FahCore_78.exe -dir work/ -suffix 06 -checkpoint 30 -forceasm -verbose -lifeline 3916 -version 504'

[15:30:06] 
[15:30:06] *------------------------------*
[15:30:06] Folding@Home Gromacs Core
[15:30:06] Version 1.90 (March 8, 2006)
[15:30:06] 
[15:30:06] Preparing to commence simulation
[15:30:06] - Assembly optimizations manually forced on.
[15:30:06] - Not checking prior termination.
[15:30:07] - Expanded 474028 -> 2471321 (decompressed 521.3 percent)
[15:30:07] - Starting from initial work packet
[15:30:07] 
[15:30:07] Project: 2151 (Run 0, Clone 100, Gen 4)
[15:30:07] 
[15:30:07] Assembly optimizations on if available.
[15:30:07] Entering M.D.
[15:30:13] Protein: p2151_p2147_lambda_m2_expl_99p_373K
[15:30:13] 
[15:30:13] Writing local files
[15:30:14] Extra SSE boost OK.
[15:30:14] Writing local files
[15:30:14] Completed 0 out of 1000000 steps  (0)
[15:57:47] Writing local files
[15:57:47] Completed 10000 out of 1000000 steps  (1)
[16:24:41] Writing local files
[16:24:41] Completed 20000 out of 1000000 steps  (2)
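For comparing logs from several machines, the connection attempts in a log like the one above can be pulled out mechanically. The following is a minimal sketch, assuming the log format shown in this thread (a "Connecting to" line followed by either "Couldn't send HTTP request to server" or "Posted data."). The function name `connection_attempts` is hypothetical and not part of the FAH client.

```python
import re

# Matches lines such as "[15:29:58] Connecting to http://assign2.stanford.edu:80/"
CONNECT_RE = re.compile(r"^\[(\d\d:\d\d:\d\d)\] Connecting to (\S+)")
FAIL_TEXT = "Couldn't send HTTP request to server"
OK_TEXT = "Posted data."


def connection_attempts(log_text):
    """Return (time, url, succeeded) for each 'Connecting to' line.

    The outcome is taken from the first later line containing either the
    failure text or the success text; attempts with neither are skipped.
    """
    attempts = []
    pending = None  # (time, url) of the most recent unresolved attempt
    for line in log_text.splitlines():
        match = CONNECT_RE.match(line)
        if match:
            pending = match.groups()
        elif pending and FAIL_TEXT in line:
            attempts.append((*pending, False))
            pending = None
        elif pending and OK_TEXT in line:
            attempts.append((*pending, True))
            pending = None
    return attempts
```

Running this over the log of a stalled machine and the log of a working one shows which servers each machine tried and which attempts got through.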
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Re: Several console clients no longer getting work

Post by Ivoshiee »

All assignment servers seem to be up (I get "OK" from each).
For both the failed and the successful attempts, try to establish whether your computers are in fact connecting to the same servers.
Try running a trace route (tracert) to these servers at those times.
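One way to check this on each machine, as a rough sketch: look up both assignment-server hostnames and note which addresses the local resolver returns, then compare the output from a stalled machine with that from a working one. The name `resolve_all` and its injectable `resolver` parameter are illustrative assumptions; the hostnames are the ones that appear in the client logs.

```python
import socket

# Hostnames taken from the client logs in this thread.
ASSIGN_HOSTS = ["assign.stanford.edu", "assign2.stanford.edu"]


def resolve_all(hosts, resolver=socket.gethostbyname_ex):
    """Map each hostname to the sorted list of IPv4 addresses it resolves to.

    `resolver` is injectable (a hypothetical convenience) so the function
    can be exercised without network access. A failed lookup gives [].
    """
    results = {}
    for host in hosts:
        try:
            _name, _aliases, addresses = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            addresses = []
        results[host] = sorted(addresses)
    return results
```

Calling `print(resolve_all(ASSIGN_HOSTS))` on two machines and comparing the results shows whether they are being pointed at the same servers.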
Torin3
Posts: 39
Joined: Mon Feb 11, 2008 11:40 am

Re: Several console clients no longer getting work

Post by Torin3 »

Well, my two computers working on SMP show different IPs for assign2.stanford.edu. The one that got the work unit when it was hooked up to the DSL line and is currently on the T1 shows 171.64.65.121. The one that just recently got the work unit over the T1 line is showing an IP of 171.65.103.95. The 6.01beta4 computer shows an IP of 171.64.65.121. I tried putting 171.65.103.95 in the hosts file to see if that would let it get assigned a work unit. It did not. I verified the IP by running a tracert, and removed the line from the hosts file when I was done.
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Several console clients no longer getting work

Post by 7im »

Pande Group did upgrade the Assign2 server recently, and the IP changed. Maybe a case of a DNS table update not fully propagated yet?
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Re: Several console clients no longer getting work

Post by Ivoshiee »

That is what I am seeing:

Code: Select all

[ivo@sarmax ~]$ ping assign2.stanford.edu
PING vspg6-vz7.stanford.edu (171.64.65.121) 56(84) bytes of data.
64 bytes from vspg6-vz7.Stanford.EDU (171.64.65.121): icmp_seq=1 ttl=42 time=227 ms

--- vspg6-vz7.stanford.edu ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 227.712/227.712/227.712/0.000 ms
[ivo@sarmax ~]$ ping assign.stanford.edu
PING VSPMF26.stanford.edu (171.65.103.93) 56(84) bytes of data.
64 bytes from VSPMF26.Stanford.EDU (171.65.103.93): icmp_seq=1 ttl=42 time=231 ms

--- VSPMF26.stanford.edu ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 999ms
rtt min/avg/max/mdev = 231.997/231.997/231.997/0.000 ms
[ivo@sarmax ~]$ 
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Several console clients no longer getting work

Post by bruce »

7im wrote:Pande Group did upgrade the Assign2 server recently, and the IP changed. Maybe a case of a DNS table update not fully propagated yet?
That's quite likely; perhaps a DNS cache that had not yet been cleared out. The cache eventually takes care of itself, but there's also a command that clears the local cache manually. In Windows, it's ipconfig /flushdns.

The other thing that is going on at the same time: the server at the new IP for assign2.stanford.edu was down briefly.

A third thing was shown here:

Code: Select all

[15:29:58] + Could not connect to Assignment Server
[15:29:58] Connecting to http://assign2.stanford.edu:80/
I'm not sure why you couldn't get through to the primary assignment server, but since assign2 was working by this time, you did get through.

No matter how much redundancy there is in the system, it's possible (even though unlikely) that enough different things all go wrong at the same time that the connection will fail.

I think we all learned something from this.
Torin3
Posts: 39
Joined: Mon Feb 11, 2008 11:40 am

Re: Several console clients no longer getting work

Post by Torin3 »

Ok, still getting weirdness. I've got clients that, since I posted the original message, have finished work, sent work and received work. I've got some that have been idling ever since this started. I've got some that never had a problem. Now I've got an SMP client (5.91Beta6) that is still getting work, but can't send in the results.

When it tries to connect to 171.64.122.76:8080 and 171.64.65.64:8080, it can't upload. When I load those IPs in Internet Explorer, they come back OK.
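A browser can be configured to go through a proxy, so a page loading in the browser does not by itself show that a direct connection from the same machine works. A minimal sketch of a direct check, assuming plain TCP reachability is what matters here; the helper name `can_connect` is hypothetical:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened.

    This tests only that the port is reachable. It does not send an HTTP
    request and does not use any proxy settings.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or lookup failure
        return False
```

For example, `can_connect("171.64.65.64", 8080)` run on the stalled machine tests the first work server address from the log.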

Also, I noticed that the two clients that haven't been able to receive or send work since the 25th are both 6.01Beta4, and I had another client (6.01Beta2) that has sent and received work since the 25th. So I installed the Beta2 client on a fresh machine, but it was never able to get work (so much for that theory). I also ran the ipconfig /flushdns command on one of the stalled clients. No luck there; still stalled. Anyway, here is the log for the SMP client that can receive work but can't return work units, followed by the tracert.

Code: Select all


[code]--- Opening Log file [February 28 18:17:23] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 5.91beta6

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program Files\fah
Executable: fah
Arguments: -send 08 

[18:17:23] - Ask before connecting: No
[18:17:23] - User name: Torin3 (Team 32)
[18:17:23] - User ID: 4FC70C1768D014E8
[18:17:23] - Machine ID: 1
[18:17:23] 
[18:17:23] Loaded queue successfully.
[18:17:23] Attempting to return result(s) to server...


[18:17:23] + Attempting to send results
[18:17:23] - Couldn't send HTTP request to server
[18:17:23] + Could not connect to Work Server (results)
[18:17:23]     (171.64.65.64:8080)
[18:17:23] - Error: Could not transmit unit 08 (completed February 28) to work server.


[18:17:23] + Attempting to send results
[18:17:23] - Couldn't send HTTP request to server
[18:17:23] + Could not connect to Work Server (results)
[18:17:23]     (171.64.122.76:8080)
[18:17:23]   Could not transmit unit 08 to Collection server; keeping in queue.
[18:17:23] - Failed to send unit 08 to server

Folding@Home Client Shutdown.
C:\Program Files\fah>tracert 171.64.65.64

Tracing route to vspg2v.stanford.edu [171.64.65.64]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms BMC_FW2 [150.0.0.11]
2 <1 ms <1 ms <1 ms w065.z064000083.was-dc.dsl.cnc.net [64.0.83.65]
3 12 ms 12 ms 12 ms w029.z208037113.nyc-ny.dsl.cnc.net [208.37.113.29]
4 10 ms 10 ms 11 ms ge5-0-0.mar1.philadelphia-pa.us.xo.net [207.88.87.49]
5 256 ms 236 ms 71 ms p3-0-0d0.rar1.washington-dc.us.xo.net [65.106.3.233]
6 87 ms 87 ms 87 ms p1-0-0.rar1.sanjose-ca.us.xo.net [65.106.0.38]
7 103 ms 108 ms 87 ms 65.106.5.234.ptr.us.xo.net [65.106.5.234]
8 88 ms 88 ms 88 ms paix-px1--xo-ge.cenic.net [198.32.251.41]
9 90 ms 90 ms 90 ms dc-stan--svl-dc1-ge.cenic.net [137.164.23.38]
10 89 ms 89 ms 88 ms bbrb-rtr.stanford.edu [171.64.1.136]
11 * * * Request timed out.
12 1204 ms 1322 ms 1253 ms vspg2v.stanford.edu [171.64.65.64]

Trace complete.

C:\Program Files\fah>[/code]
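When collecting traces from several machines, the hops that stand out can be picked out automatically. The sketch below assumes Windows tracert output with each hop on a single line (wrapped lines would need rejoining first). The names `parse_hops` and `slow_hops` and the 500 ms threshold are illustrative assumptions. A timed-out intermediate hop is not necessarily a fault, since some routers simply do not answer these probes.

```python
import re

# One hop per line, e.g.
# "12 1204 ms 1322 ms 1253 ms vspg2v.stanford.edu [171.64.65.64]"
HOP_RE = re.compile(r"^\s*(\d+)\s+((?:(?:<?\d+\s+ms|\*)\s+){3})(.*)$")


def parse_hops(text):
    """Yield (hop_number, [rtt_ms or None, ...], target) for each hop line."""
    for line in text.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # header, blank, or wrapped continuation line
        hop, timings, target = match.groups()
        rtts = [None if t == "*" else int(t.lstrip("<"))
                for t in re.findall(r"<?\d+(?=\s+ms)|\*", timings)]
        yield int(hop), rtts, target.strip()


def slow_hops(text, threshold_ms=500):
    """Return hop numbers where every probe timed out or exceeded the threshold."""
    return [hop for hop, rtts, _ in parse_hops(text)
            if all(r is None or r > threshold_ms for r in rtts)]
```

Applied to a trace like the one above, `slow_hops` would flag the hops where all three probes were lost or very slow, which is a starting point for comparing the T1 and DSL paths.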
7im
Posts: 10189
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Several console clients no longer getting work

Post by 7im »

For reference, 171.64.122.76 is the backup collection server for some SMP work units, but it isn't accepting WUs at the moment.
Torin3
Posts: 39
Joined: Mon Feb 11, 2008 11:40 am

Re: Several console clients no longer getting work

Post by Torin3 »

Well, I'm still having a bit of this problem going on. I've got several computers on this network that have never had a problem. I've got some that stopped sending and receiving work units, then started again and are running fine now. And I've got some that stopped and never started again. When I hooked them up to the DSL line directly, they were able to send and receive, but once they were back on our T1-connected network, they finished up their work and couldn't send again. There seems to be no correlation between stopped and working computers; the same clients are on computers in both conditions. Some are XP, some are 2000. All have permission to connect to the internet. There are some I can hook up every few days in the morning before their users get here to send and receive work, but not all of them. I'm only allowed to farm them as long as it is transparent to the end users and nobody complains (and it doesn't take up too much of my time).

:(
Torin3
Posts: 39
Joined: Mon Feb 11, 2008 11:40 am

Re: Several console clients no longer getting work

Post by Torin3 »

I think this can probably be considered resolved. I finally noticed that all the stalled units were 6.01 clients running as a service. So I uninstalled one, removed the registry settings, and set it back up with the 5.04 client, running as a service. It got new work on the T1 that way. I'll confirm that it uploads and gets more work, and then switch all the other 6.01 clients running as a service to the 5.04 client.

Thanks!