Problem with 171.67.108.25 - CS5

Moderators: Site Moderators, FAHC Science Team

Ripper36
Posts: 60
Joined: Sun Sep 18, 2011 8:55 am

Problem with 171.67.108.25 - CS5

Post by Ripper36 »

I have a work unit that is trying to send and is getting an 'actively refused connection' message.

The server log shows the status as standby and 'connect' as 'not accept'.

I guess I will just have to wait, but is there anything more I should do?

JR :?
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: Problem with 171.67.108.25 - CS5

Post by PantherX »

Welcome to the F@H Forum, Ripper36,

Please read this post for details on the Collection Server (viewtopic.php?f=18&t=17794#p161539). If you can post the FAHlog, we can look into the Work Server that should be receiving the WU.
Ripper36
Posts: 60
Joined: Sun Sep 18, 2011 8:55 am

Re: Problem with 171.67.108.25 - CS5

Post by Ripper36 »

171.64.65.54:8080 appears to be the work server in the log.

Log follows:

Code:

09:48:54:Started core on PID 2644
09:48:54:FahCore 0xa4 started
09:48:54:Sending unit results: id:01 state:SEND project:6054 run:1 clone:55 gen:378 core:0xa3 unit:0x083c58514e7de082017a0037000117a6
09:48:54:Unit 01: Uploading 34.70KiB to 171.64.65.54
09:48:54:Connecting to 171.64.65.54:8080
09:48:54:Unit 00:
09:48:54:Unit 00:*------------------------------*
09:48:54:Unit 00:Folding@Home Gromacs GB Core
09:48:54:Unit 00:Version 2.27 (Dec. 15, 2010)
09:48:54:Unit 00:
09:48:54:Unit 00:Preparing to commence simulation
09:48:54:Unit 00:- Ensuring status. Please wait.
09:48:55:WARNING: Exception: Failed to send results to work server: Upload failed
09:48:55:Trying to send results to collection server
09:48:55:Unit 01: Uploading 34.70KiB to 171.67.108.25
09:48:55:Connecting to 171.67.108.25:8080
09:48:57:WARNING: WorkServer connection failed on port 8080 trying 80
09:48:57:Connecting to 171.67.108.25:80
09:48:57:Server connection id=1 on 0.0.0.0:36330 from 127.0.0.1
09:48:59:ERROR: Exception: Failed to connect to 171.67.108.25:80: No connection could be made because the target machine actively refused it.
09:48:59:Sending unit results: id:01 state:SEND project:6054 run:1 clone:55 gen:378 core:0xa3 unit:0x083c58514e7de082017a0037000117a6
09:48:59:Unit 01: Uploading 34.70KiB to 171.64.65.54
09:48:59:Connecting to 171.64.65.54:8080
09:49:00:WARNING: Exception: Failed to send results to work server: Upload failed
09:49:00:Trying to send results to collection server
09:49:00:Unit 01: Uploading 34.70KiB to 171.67.108.25
09:49:00:Connecting to 171.67.108.25:8080
09:49:03:Unit 00:- Looking at optimizations...
09:49:03:Unit 00:- Working with standard loops on this execution.
09:49:03:Unit 00:- Previous termination of core was improper.
09:49:03:Unit 00:- Going to use standard loops.
09:49:03:Unit 00:- Files status OK
09:49:03:Unit 00:- Expanded 467371 -> 1008860 (decompressed 215.8 percent)
09:49:03:Unit 00:Called DecompressByteArray: compressed_data_size=467371 data_size=1008860, decompressed_data_size=1008860 diff=0
09:49:04:Unit 00:- Digital signature verified
09:49:04:Unit 00:
09:49:04:Unit 00:Project: 7903 (Run 84, Clone 1, Gen 0)
09:49:04:Unit 00:
09:49:04:Unit 00:Entering M.D.
09:49:07:WARNING: WorkServer connection failed on port 8080 trying 80
09:49:07:Connecting to 171.67.108.25:80
09:49:10:Unit 00:Using Gromacs checkpoints
09:49:10:Unit 00:Mapping NT from 2 to 2 
09:49:10:Unit 00:Resuming from checkpoint
09:49:10:Unit 00:Verified 00/wudata_01.log
09:49:10:Unit 00:Verified 00/wudata_01.trr
09:49:10:Unit 00:Verified 00/wudata_01.edr
09:49:10:Unit 00:Completed 161270 out of 2500000 steps  (6%)
09:49:15:ERROR: Exception: Failed to connect to 171.67.108.25:80: No connection could be made because the target machine actively refused it.
09:49:59:Sending unit results: id:01 state:SEND project:6054 run:1 clone:55 gen:378 core:0xa3 unit:0x083c58514e7de082017a0037000117a6
09:49:59:Unit 01: Uploading 34.70KiB to 171.64.65.54
09:49:59:Connecting to 171.64.65.54:8080
09:50:09:WARNING: Exception: Failed to send results to work server: Upload failed
09:50:09:Trying to send results to collection server
09:50:10:Unit 01: Uploading 34.70KiB to 171.67.108.25
09:50:10:Connecting to 171.67.108.25:8080
09:50:12:WARNING: WorkServer connection failed on port 8080 trying 80
09:50:12:Connecting to 171.67.108.25:80
09:50:14:ERROR: Exception: Failed to connect to 171.67.108.25:80: No connection could be made because the target machine actively refused it.
09:51:37:Sending unit results: id:01 state:SEND project:6054 run:1 clone:55 gen:378 core:0xa3 unit:0x083c58514e7de082017a0037000117a6
09:51:37:Unit 01: Uploading 34.70KiB to 171.64.65.54
09:51:37:Connecting to 171.64.65.54:8080
09:51:41:WARNING: Exception: Failed to send results to work server: Upload failed
09:51:41:Trying to send results to collection server
09:51:41:Unit 01: Uploading 34.70KiB to 171.67.108.25
09:51:41:Connecting to 171.67.108.25:8080
09:51:43:WARNING: WorkServer connection failed on port 8080 trying 80
09:51:43:Connecting to 171.67.108.25:80
09:51:51:ERROR: Exception: Failed to connect to 171.67.108.25:80: No connection could be made because the target machine actively refused it.
09:54:14:Sending unit results: id:01 state:SEND project:6054 run:1 clone:55 gen:378 core:0xa3 unit:0x083c58514e7de082017a0037000117a6
09:54:14:Unit 01: Uploading 34.70KiB to 171.64.65.54
09:54:14:Connecting to 171.64.65.54:8080
09:54:15:WARNING: Exception: Failed to send results to work server: Upload failed
09:54:15:Trying to send results to collection server
09:54:15:Unit 01: Uploading 34.70KiB to 171.67.108.25
09:54:15:Connecting to 171.67.108.25:8080
09:54:17:WARNING: WorkServer connection failed on port 8080 trying 80
09:54:17:Connecting to 171.67.108.25:80
09:54:20:ERROR: Exception: Failed to connect to 171.67.108.25:80: No connection could be made because the target machine actively refused it.
09:58:28:Sending unit results: id:01 state:SEND project:6054 run:1 clone:55 gen:378 core:0xa3 unit:0x083c58514e7de082017a0037000117a6
09:58:28:Unit 01: Uploading 34.70KiB to 171.64.65.54
09:58:28:Connecting to 171.64.65.54:8080
09:58:29:WARNING: Exception: Failed to send results to work server: Upload failed
09:58:29:Trying to send results to collection server
09:58:29:Unit 01: Uploading 34.70KiB to 171.67.108.25
09:58:29:Connecting to 171.67.108.25:8080
09:58:48:WARNING: WorkServer connection failed on port 8080 trying 80
09:58:48:Connecting to 171.67.108.25:80
09:58:51:ERROR: Exception: Failed to connect to 171.67.108.25:80: No connection could be made because the target machine actively refused it.
10:05:19:Sending unit results: id:01 state:SEND project:6054 run:1 clone:55 gen:378 core:0xa3 unit:0x083c58514e7de082017a0037000117a6
10:05:20:Unit 01: Uploading 34.70KiB to 171.64.65.54
10:05:20:Connecting to 171.64.65.54:8080
10:05:20:WARNING: Exception: Failed to send results to work server: Upload failed
10:05:20:Trying to send results to collection server
10:05:20:Unit 01: Uploading 34.70KiB to 171.67.108.25
10:05:20:Connecting to 171.67.108.25:8080
10:05:28:WARNING: WorkServer connection failed on port 8080 trying 80
10:05:28:Connecting to 171.67.108.25:80
10:05:30:ERROR: Exception: Failed to connect to 171.67.108.25:80: No connection could be made because the target machine actively refused it.
10:16:25:Sending unit results: id:01 state:SEND project:6054 run:1 clone:55 gen:378 core:0xa3 unit:0x083c58514e7de082017a0037000117a6
10:16:25:Unit 01: Uploading 34.70KiB to 171.64.65.54
10:16:25:Connecting to 171.64.65.54:8080
10:16:26:WARNING: Exception: Failed to send results to work server: Upload failed
10:16:26:Trying to send results to collection server
10:16:26:Unit 01: Uploading 34.70KiB to 171.67.108.25
10:16:26:Connecting to 171.67.108.25:8080
10:16:28:WARNING: WorkServer connection failed on port 8080 trying 80
10:16:28:Connecting to 171.67.108.25:80
10:16:43:ERROR: Exception: Failed to connect to 171.67.108.25:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
10:27:47:Unit 00:Completed 175000 out of 2500000 steps  (7%)
10:34:22:Sending unit results: id:01 state:SEND project:6054 run:1 clone:55 gen:378 core:0xa3 unit:0x083c58514e7de082017a0037000117a6
10:34:22:Unit 01: Uploading 34.70KiB to 171.64.65.54
10:34:22:Connecting to 171.64.65.54:8080
10:34:23:WARNING: Exception: Failed to send results to work server: Upload failed
10:34:23:Trying to send results to collection server
10:34:23:Unit 01: Uploading 34.70KiB to 171.67.108.25
10:34:23:Connecting to 171.67.108.25:8080
10:34:26:WARNING: WorkServer connection failed on port 8080 trying 80
10:34:26:Connecting to 171.67.108.25:80
10:34:29:ERROR: Exception: Failed to connect to 171.67.108.25:80: No connection could be made because the target machine actively refused it.
Mod Edit: Added Code Tags - PantherX
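
The log above reports two different failures for the collection server: "actively refused" on most attempts and a "did not properly respond" timeout at 10:16:43. A minimal sketch for telling those two cases apart from the folding machine itself is given here, assuming plain TCP connects are a fair stand-in for what the client does. The function names `probe` and `probe_all` are made up for this example; the addresses, ports, and their order are copied from the log.

```python
import socket

# Endpoints in the order the log shows them being tried.
ENDPOINTS = [
    ("171.64.65.54", 8080),   # work server
    ("171.67.108.25", 8080),  # collection server
    ("171.67.108.25", 80),    # collection server, fallback port
]


def probe(host, port, timeout=5.0):
    """Attempt a plain TCP connect and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Matches the "target machine actively refused it" message.
        return "refused"
    except (socket.timeout, TimeoutError):
        # Matches the "did not properly respond after a period of time" message.
        return "timeout"
    except OSError as exc:
        # Anything else (no route, DNS failure, and so on).
        return f"error: {exc}"


def probe_all(endpoints, timeout=5.0):
    """Probe every (host, port) pair and return a dict keyed by 'host:port'."""
    return {f"{host}:{port}": probe(host, port, timeout) for host, port in endpoints}
```

Calling `probe_all(ENDPOINTS)` returns one result per endpoint. Note that "open" only means the TCP handshake succeeded; it says nothing about whether the server will accept the upload.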
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Problem with 171.67.108.25 - CS5

Post by bruce »

Please post the portion of the log immediately before the first attempt to upload project:6054 run:1 clone:55 gen:378 to 171.64.65.54. Did the WU have an error or did it complete successfully?
Ripper36
Posts: 60
Joined: Sun Sep 18, 2011 8:55 am

Re: Problem with 171.67.108.25 - CS5

Post by Ripper36 »

Found it! An unstable machine error, a dumped work unit, and then a network problem that interrupted communications with the server! Crap all around!

Code:

10:49:20:Unit 01:Project: 6054 (Run 1, Clone 55, Gen 378)
10:49:20:Unit 01:
10:49:20:Unit 01:Entering M.D.
10:49:27:Unit 01:Using Gromacs checkpoints
10:49:27:Unit 01:Mapping NT from 2 to 2 
10:49:31:Unit 01:Resuming from checkpoint
10:49:31:Unit 01:Verified 01/wudata_01.log
10:49:31:Unit 01:Verified 01/wudata_01.trr
10:49:31:Unit 01:Verified 01/wudata_01.edr
10:49:33:Unit 01:Completed 110914 out of 500000 steps  (22%)
11:20:08:Unit 01:Completed 115000 out of 500000 steps  (23%)
11:57:22:Unit 01:Completed 120000 out of 500000 steps  (24%)
12:34:40:Unit 01:Completed 125000 out of 500000 steps  (25%)
13:11:55:Unit 01:Completed 130000 out of 500000 steps  (26%)
13:49:08:Unit 01:Completed 135000 out of 500000 steps  (27%)
14:26:25:Unit 01:Completed 140000 out of 500000 steps  (28%)
15:03:42:Unit 01:Completed 145000 out of 500000 steps  (29%)
15:41:00:Unit 01:Completed 150000 out of 500000 steps  (30%)
22:06:41:FahCore, running Unit 01, returned: UNKNOWN_ENUM (-1073741783 = 0xc0000029)
22:06:41:Starting Unit 01
22:06:41:Running core: "C:/Documents and Settings/John/Application Data/FAHClient/cores/www.stanford.edu/~pande/Win32/x86/Core_a3.fah/FahCore_a3.exe" -dir 01 -suffix 01 -lifeline 1864 -version 701 -checkpoint 15 -np 2
22:06:41:Started core on PID 1480
22:06:41:FahCore 0xa3 started
22:06:41:Unit 01:
22:06:41:Unit 01:*------------------------------*
22:06:41:Unit 01:Folding@Home Gromacs SMP Core
22:06:41:Unit 01:Version 2.27 (Dec. 15, 2010)
22:06:41:Unit 01:
22:06:41:Unit 01:Preparing to commence simulation
22:06:41:Unit 01:- Ensuring status. Please wait.
22:06:51:Unit 01:- Looking at optimizations...
22:06:51:Unit 01:- Working with standard loops on this execution.
22:06:51:Unit 01:Examination of work files indicates 8 consecutive improper terminations of core.
22:06:51:Unit 01:- Expanded 1763231 -> 2251181 (decompressed 127.6 percent)
22:06:51:Unit 01:Called DecompressByteArray: compressed_data_size=1763231 data_size=2251181, decompressed_data_size=2251181 diff=0
22:06:51:Unit 01:- Digital signature verified
22:06:51:Unit 01:
22:06:51:Unit 01:Project: 6054 (Run 1, Clone 55, Gen 378)
22:06:51:Unit 01:
22:06:51:Unit 01:Entering M.D.
22:06:57:Unit 01:Using Gromacs checkpoints
22:06:57:Unit 01:Mapping NT from 2 to 2 
22:06:58:Unit 01:Resuming from checkpoint
22:06:58:Unit 01:Verified 01/wudata_01.log
22:06:58:Unit 01:Verified 01/wudata_01.trr
22:06:58:Unit 01:Verified 01/wudata_01.edr
22:06:59:Unit 01:Completed 151154 out of 500000 steps  (30%)
22:54:26:Unit 01:mdrun returned 255
22:54:26:Unit 01:Going to send back what have done -- stepsTotalG=500000
22:54:26:Unit 01:Work fraction=0.3080 steps=500000.
22:54:30:Unit 01:logfile size=34990 infoLength=34990 edr=0 trr=25
22:54:30:Unit 01:logfile size: 34990 info=34990 bed=0 hdr=25
22:54:30:Unit 01:- Writing 35528 bytes of core data to disk...
22:54:32:Unit 01:  ... Done.
22:54:33:Unit 01:
22:54:33:Unit 01:Folding@home Core Shutdown: UNSTABLE_MACHINE
22:54:33:FahCore, running Unit 01, returned: UNSTABLE_MACHINE (122 = 0x7a)
22:54:33:Starting Unit 01
22:54:33:Running core: "C:/Documents and Settings/John/Application Data/FAHClient/cores/www.stanford.edu/~pande/Win32/x86/Core_a3.fah/FahCore_a3.exe" -dir 01 -suffix 01 -lifeline 1864 -version 701 -checkpoint 15 -np 2
22:54:33:Started core on PID 2092
22:54:33:FahCore 0xa3 started
22:54:33:FahCore, running Unit 01, returned: MISSING_WORK_FILES (116 = 0x74)
22:54:33:WARNING: Unit 01 Fatal error, dumping
22:54:34:Sending unit results: id:01 state:SEND project:6054 run:1 clone:55 gen:378 core:0xa3 unit:0x083c58514e7de082017a0037000117a6
22:54:34:Unit 01: Uploading 34.70KiB to 171.64.65.54
22:54:34:Connecting to 171.64.65.54:8080
22:54:34:WARNING: WorkServer connection failed on port 8080 trying 80
22:54:34:Connecting to 171.64.65.54:80
22:54:34:Connecting to assign3.stanford.edu:8080
22:54:34:WARNING: Failed to get assignment from 'assign3.stanford.edu:8080': Could not get IP address for assign3.stanford.edu: No such host is known.
Mod Edit: Added Code Tags - PantherX
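
The log above prints each core return code in both decimal and hex, for example `-1073741783 = 0xc0000029`, `122 = 0x7a`, and `116 = 0x74`. The negative number is just the same 32-bit value read as signed. A small sketch of that conversion follows; the helper names are invented for this example. A value with the top bits set like 0xc0000029 usually means that Windows terminated the process rather than the core exiting with one of its own codes, but that reading is an assumption and is not stated in the log.

```python
def as_unsigned32(code):
    """Reinterpret a (possibly negative) 32-bit exit code as unsigned."""
    return code & 0xFFFFFFFF


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def describe(code):
    """Format a code the same way the client log does: decimal = hex."""
    return f"{code} = 0x{as_unsigned32(code):x}"


# Codes copied from the log above.
for code in (-1073741783, 122, 116):
    print(describe(code))
```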
jcoffland
Site Admin
Posts: 1019
Joined: Fri Oct 10, 2008 6:42 pm
Location: Helsinki, Finland

Re: Problem with 171.67.108.25 - CS5

Post by jcoffland »

171.64.65.54 is an old WS. The v7 client shouldn't send dump reports to these old WSs because they don't understand them. The v7 client tries to detect the WS version but gets confused in some situations because an old WS can send essentially random data for the WS version. I'm working on a solution to this problem right now. This part is not a big problem, though: the v7 client keeps trying to send the DUMP report but quits once the WU expires.
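
The repeated send attempts are visible in the first log posted in this thread. The sketch below takes the timestamps of the "Sending unit results" lines from that log and computes the wait between attempts. The helper names are made up for this example, and the idea that the client uses a fixed growth factor is only inferred from these numbers, not documented here.

```python
from datetime import datetime, timedelta

# Timestamps of the "Sending unit results" lines, copied from the first log.
ATTEMPTS = [
    "09:48:54", "09:48:59", "09:49:59", "09:51:37", "09:54:14",
    "09:58:28", "10:05:19", "10:16:25", "10:34:22",
]


def gaps_in_seconds(stamps):
    """Return the number of seconds between consecutive HH:MM:SS stamps."""
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    gaps = []
    for earlier, later in zip(times, times[1:]):
        delta = later - earlier
        if delta < timedelta(0):
            # The log carries no date, so assume a midnight rollover.
            delta += timedelta(days=1)
        gaps.append(int(delta.total_seconds()))
    return gaps


def growth_ratios(gaps):
    """Return each gap divided by the one before it, rounded to 2 places."""
    return [round(b / a, 2) for a, b in zip(gaps, gaps[1:])]


gaps = gaps_in_seconds(ATTEMPTS)
print(gaps)                 # seconds between attempts
print(growth_ratios(gaps))  # how fast the wait grows
```

After the first quick retry, each wait in this log is roughly 1.6 times the previous one, so the attempts thin out quickly and put little load on the machine while the WU waits to expire.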

The other problem is that the core crashed. The v7 client correctly tried to restart the WU where it left off. The core then reported UNSTABLE_MACHINE, which may also have been a correct response, but it did not produce a results packet. Without a results packet the client cannot request partial credit, so it dumps the WU, which leads to the problem described above.

Here is what we need to do to resolve these problems:

1) Upgrade the WS.
2) Apply the WS version detection fixes I'm working on to the v7 client.
3) Solve the core crash. Overclocked? Bad memory?
4) Get the core to create a WU result file so we can request partial credit. It should do this automatically, and I'm not sure why it doesn't in this case.

#2 will be improved very soon, which will alleviate the need for #1. Regarding #3, if you are getting UNSTABLE_MACHINE and crashes, that may be an indication of problems with your hardware. I'm not sure about #4 or when it will be fixed; that is up to the core developers.

So basically there are at least 4 separate problems here, which on their own might not be huge but are aggravating when they all come together.
Cauldron Development LLC
http://cauldrondevelopment.com/
Ripper36
Posts: 60
Joined: Sun Sep 18, 2011 8:55 am

Re: Problem with 171.67.108.25 - CS5

Post by Ripper36 »

Thanks for this explanation. This is the oldest machine in the farm. I think there is a memory fault, although the quick tests don't show it. I will 'un-fold' this machine at the end of the current work unit and run memtest. I have some spare DDR2 RAM lying around, so I will see if that fixes it at my end.