Project: 2677 - ERROR 0x0 and lost WU

Moderators: Site Moderators, FAHC Science Team

ethomaz
Posts: 7
Joined: Tue Aug 19, 2008 5:15 pm

Project: 2677 - ERROR 0x0 and lost WU

Post by ethomaz »

Hi,

Some friends from team Fórum PCs and I are getting the error below with project 2677 WUs. It does not occur with WUs from other projects (2665, 2669, 2675, 2653, etc.). In the last week alone the error occurred seven times.

Code:

[04:52:49] 
[04:52:49] Project: 2677 (Run 37, Clone 58, Gen 1) 
[04:52:49] 
[04:52:49] Assembly optimizations on if available. 
[04:52:49] Entering M.D. 
... 
[18:04:05] Completed 245000 out of 250000 steps (98%) 
[18:18:52] Completed 247500 out of 250000 steps (99%) 
[18:33:40] Completed 250000 out of 250000 steps (100%) 
[18:33:42] DynamicWrapper: Finished Work Unit: sleep=1000 
[18:33:43] 
[18:33:43] Finished Work Unit: 
[18:33:43] - Reading up to 21184704 from "work/wudata_09.trr": Read 21184704 
[18:33:43] trr file hash check passed. 
[18:33:43] - Reading up to 27668064 from "work/wudata_09.xtc": Read 27668064 
[18:33:43] xtc file hash check passed. 
[18:33:43] edr file hash check passed. 
[18:33:43] logfile size: 176850 
[18:33:43] Leaving Run 
[18:33:44] Done with run, master node 
[18:33:44] - Writing 49212962 bytes of core data to disk... 
[18:33:50] CoreStatus = 0 (0) 
[18:33:50] Client-core communications error: ERROR 0x0 
[18:33:50] Deleting current work unit & continuing... 
[18:34:04] - Preparing to get new work unit... 
[18:34:04] + Attempting to get work packet 
[18:34:04] - Connecting to assignment server 
[18:34:05] - Successful: assigned to (171.64.65.56). 
[18:34:05] + News From Folding@Home: Welcome to Folding@Home 
[18:34:05] Loaded queue successfully. 
[18:34:31] + Closed connections 
[18:34:36] 
[18:34:36] + Processing work unit 
[18:34:36] At least 4 processors must be requested.Core required: FahCore_a2.exe 
[18:34:36] Core found. 
[18:34:36] Working on Unit 00 [March 29 18:34:36] 
[18:34:36] + Working ... 
[18:34:36] 
[18:34:36] *------------------------------* 
[18:34:36] Folding@Home Gromacs SMP Core 
[18:34:36] Version 2.04 (Thu Jan 29 16:43:57 PST 2009) 
[18:34:36] 
[18:34:36] Preparing to commence simulation 
[18:34:36] - Ensuring status. Please wait. 
[18:34:46] - Assembly optimizations manually forced on. 
[18:34:46] - Not checking prior termination. 
[18:34:48] - Expanded 4839121 -> 24014637 (decompressed 496.2 percent) 
[18:34:49] Called DecompressByteArray: compressed_data_size=4839121 data_size=24014637, decompressed_data_size=24014637 diff=0 
[18:34:49] - Digital signature verified 
[18:34:49] 
[18:34:49] Project: 2677 (Run 2, Clone 52, Gen 2) 
[18:34:49] 
[18:34:50] Assembly optimizations on if available. 
[18:34:50] Entering M.D. 
[18:34:58] Multi-core optimizations on 
[18:49:44] Completed 2500 out of 250000 steps (1%)

What's happening? And how can I fix it? We keep getting only project 2677 WUs.

I searched the forum and found that ERROR 0x0 is usually attributed to hardware issues, but I don't overclock, and WUs from other projects run fine.

We are running the Linux SMP client in a VM.

P.S. Sorry for my bad English. I'm Brazilian.

Thanks.
toTOW
Site Moderator
Posts: 6293
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Project: 2677 - ERROR 0x0 and lost WU

Post by toTOW »

I don't know what happens on your machine, but someone else was able to complete it and to return results.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
ethomaz
Posts: 7
Joined: Tue Aug 19, 2008 5:15 pm

Re: Project: 2677 - ERROR 0x0 and lost WU

Post by ethomaz »

toTOW wrote: I don't know what happens on your machine, but someone else was able to complete it and to return results.

It's not only my machine; the other four machines have the same error. I recommended that the users disable AdvMethods so that they no longer receive project 2677 WUs.
Of the last 10 project 2677 WUs, seven ended with the error.

I have another 2677 WU at 70%. I will post the results after it finishes.
susato
Site Moderator
Posts: 511
Joined: Fri Nov 30, 2007 4:57 am
Location: Team MacOSX

Re: Project: 2677 - ERROR 0x0 and lost WU

Post by susato »

7 out of 10 is a lot of errors. But not all project 2677 WUs are the same.

If you are willing to list the project, run, clone, and gen data for the troublesome WUs, we will investigate each one.
All we need is this line from the FAHlog.txt of each problem WU:

Project: 2677 (Run 37, Clone 58, Gen 1)

Best wishes for success with your 70%-complete p2677 work unit.
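
For anyone collecting these lines from several machines, here is a minimal Python sketch. It assumes the log lines look exactly like the one quoted above. The function name `find_prcg` is made up for illustration and is not part of any Folding@Home tool.

```python
import re

# Matches lines such as:
# [04:52:49] Project: 2677 (Run 37, Clone 58, Gen 1)
PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def find_prcg(log_text):
    """Return the unique (project, run, clone, gen) tuples in first-seen order."""
    seen = []
    for match in PRCG_RE.finditer(log_text):
        prcg = tuple(int(group) for group in match.groups())
        if prcg not in seen:
            seen.append(prcg)
    return seen


if __name__ == "__main__":
    sample = (
        "[04:52:49] Project: 2677 (Run 37, Clone 58, Gen 1)\n"
        "[18:34:49] Project: 2677 (Run 2, Clone 52, Gen 2)\n"
    )
    for project, run, clone, gen in find_prcg(sample):
        print(f"Project: {project} (Run {run}, Clone {clone}, Gen {gen})")
```

Duplicates are dropped because the client prints the same line again when it uploads the results.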
ethomaz
Posts: 7
Joined: Tue Aug 19, 2008 5:15 pm

Re: Project: 2677 - ERROR 0x0 and lost WU

Post by ethomaz »

Error again...

Code:

[22:34:58] Loaded queue successfully.
[22:34:58] - Autosending finished units... [March 30 22:34:58 UTC]
[22:34:58] Trying to send all finished work units
[22:34:58] + No unsent completed units remaining.
[22:34:58] - Autosend completed
[22:34:58] 
[22:34:58] + Processing work unit
[22:34:58] At least 4 processors must be requested.Core required: FahCore_a2.exe
[22:34:58] Core found.
[22:34:58] Working on queue slot 09 [March 30 22:34:58 UTC]
[22:34:58] + Working ...
[22:34:58] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 09 -checkpoint 15 -forceasm -verbose -lifeline 866 -version 624'

[22:34:58] 
[22:34:58] *------------------------------*
[22:34:58] Folding@Home Gromacs SMP Core
[22:34:58] Version 2.04 (Thu Jan 29 16:43:57 PST 2009)
[22:34:58] 
[22:34:58] Preparing to commence simulation
[22:34:58] - Ensuring status. Please wait.
[22:35:07] - Assembly optimizations manually forced on.
[22:35:07] - Not checking prior termination.
[22:35:09] - Expanded 4826942 -> 24048413 (decompressed 498.2 percent)
[22:35:10] Called DecompressByteArray: compressed_data_size=4826942 data_size=24048413, decompressed_data_size=24048413 diff=0
[22:35:10] - Digital signature verified
[22:35:10] 
[22:35:10] Project: 2677 (Run 18, Clone 58, Gen 1)
[22:35:10] 
[22:35:10] Assembly optimizations on if available.
[22:35:10] Entering M.D.
[22:35:16] Will resume from checkpoint file
[22:35:21] Resuming from checkpoint
[22:35:21] Verified work/wudata_09.log
[22:35:21] Verified work/wudata_09.trr
[22:35:22] Verified work/wudata_09.xtc
[22:35:22] Verified work/wudata_09.edr
[22:35:22] Completed 245010 out of 250000 steps  (98%)
[22:51:16] Completed 247500 out of 250000 steps  (99%)
[23:07:18] Completed 250000 out of 250000 steps  (100%)
[23:07:20] DynamicWrapper: Finished Work Unit: sleep=1000
[23:07:21] 
[23:07:21] Finished Work Unit:
[23:07:21] - Reading up to 21198096 from "work/wudata_09.trr": Read 21198096
[23:07:22] trr file hash check passed.
[23:07:22] - Reading up to 27685532 from "work/wudata_09.xtc": Read 27685532
[23:07:22] xtc file hash check passed.
[23:07:22] edr file hash check passed.
[23:07:22] logfile size: 177589
[23:07:22] Leaving Run
[23:07:26] Done with run, master node
[23:07:26] - Writing 49245281 bytes of core data to disk...
[23:07:33] CoreStatus = 0 (0)
[23:07:33] Sending work to server
[23:07:33] Project: 2677 (Run 18, Clone 58, Gen 1)


[23:07:33] + Attempting to send results [March 30 23:07:33 UTC]
[23:07:33] - Reading file work/wuresults_09.dat from core
[23:07:34]   (Read 35950592 bytes from disk)
[23:07:34] Connecting to http://171.64.65.56:8080/
[23:19:39] Posted data.
[23:19:39] Initial: 0000; - Uploaded at ~48 kB/s
[23:19:39] - Averaged speed for that direction ~92 kB/s
[23:19:39] - Server reports digital signature does not match.
[23:19:39]   (May be due to corruption during network transmission or a corrupted file.)
[23:19:39] - Error: Could not transmit unit 09 (completed March 30) to work server.
[23:19:39] - 1 failed uploads of this unit.
[23:19:39]   Keeping unit 09 in queue.
[23:19:39] Trying to send all finished work units
[23:19:39] Project: 2677 (Run 18, Clone 58, Gen 1)

[23:19:39] + Attempting to send results [March 30 23:19:39 UTC]
[23:19:39] - Reading file work/wuresults_09.dat from core
[23:19:40]   (Read 35950592 bytes from disk)
[23:19:40] Connecting to http://171.64.65.56:8080/
[23:31:46] Posted data.
[23:31:47] Initial: 0000; - Uploaded at ~48 kB/s
[23:31:47] - Averaged speed for that direction ~83 kB/s
[23:31:47] - Server reports digital signature does not match.
[23:31:47]   (May be due to corruption during network transmission or a corrupted file.)
[23:31:47] - Error: Could not transmit unit 09 (completed March 30) to work server.
[23:31:47] - 2 failed uploads of this unit.

List of WUs that ended with an error:

1x Project: 2677 (Run 37, Clone 58, Gen 1)
2x Project: 2677 (Run 18, Clone 58, Gen 1)
1x Project: 2677 (Run 2, Clone 52, Gen 2)

Only four project 2677 WUs were sent successfully, but we lost the logs for those.
ethomaz
Posts: 7
Joined: Tue Aug 19, 2008 5:15 pm

Re: Project: 2677 - ERROR 0x0 and lost WU

Post by ethomaz »

Hi,

I think I have resolved the issue with this WU! Project 2677 WUs use more RAM than other WUs, and that causes the errors. My VM was running with 640 MB of RAM; after I increased it to 1024 MB, the WU finishes and uploads fine.

Now I will run tests to find the best amount of RAM for this WU.
Last edited by ethomaz on Tue Mar 31, 2009 2:49 am, edited 1 time in total.
susato
Site Moderator
Posts: 511
Joined: Fri Nov 30, 2007 4:57 am
Location: Team MacOSX

Re: Project: 2677 - ERROR 0x0 and lost WU

Post by susato »

Well done, ethomaz! I agree, you found the problem. If you need to restrict the amount of RAM allocated to your VM, you might need to reconfigure Folding to seek only "small" or "normal" work units from the servers. The "big" WUs really do occupy a lot of RAM.

Thanks for letting us know the problem is solved.
ethomaz
Posts: 7
Joined: Tue Aug 19, 2008 5:15 pm

Re: Project: 2677 - ERROR 0x0 and lost WU

Post by ethomaz »

Susato, I understand, but I read that for SMP clients all work units are currently considered "big" and that this parameter has no effect. Is that true?

By the way, for another test I made a backup of a WU at 99%. Finishing and sending it with 640 MB of VM RAM produced the error again, but with only 32 MB more (672 MB total) the error did not occur. To be safe, I will leave the VM at 768 MB of RAM.

Thanks.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2677 - ERROR 0x0 and lost WU

Post by bruce »

Added "VM too small" to the list of causes for error 0x0 in the wiki.

Question: When the VM is set to 640 MB, does that fact get reported in FAHlog when -verbosity 9 is set?
ethomaz
Posts: 7
Joined: Tue Aug 19, 2008 5:15 pm

Re: Project: 2677 - ERROR 0x0 and lost WU

Post by ethomaz »

bruce wrote: Question: When the VM is set to 640 MB, does that fact get reported in FAHlog when -verbosity 9 is set?

One time I received ERROR 0x0, and the other times I received a corrupted-file error when sending the WU.

Code:

[18:33:50] CoreStatus = 0 (0) 
[18:33:50] Client-core communications error: ERROR 0x0 
[18:33:50] Deleting current work unit & continuing...

Code:

[23:19:39] - Server reports digital signature does not match.
[23:19:39]   (May be due to corruption during network transmission or a corrupted file.)
[23:19:39] - Error: Could not transmit unit 09 (completed March 30) to work server.

The VM console also showed a lot of error messages in both cases.

The funny thing is that the errors only appeared when the WU reached 100%. If I back up the WU at 99% (having run the first 99% with 640 MB of VM RAM) and run only the last step with more RAM, the WU finishes and uploads successfully. But if I run the last step with 640 MB, the error appears.
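
To check a long FAHlog.txt for these two symptoms, a small Python sketch like the following could be used. It only looks for the two messages quoted in this post. The labels and the function name `count_failures` are made up for illustration.

```python
# The two failure messages quoted in this thread, keyed by a short label.
# The labels are hypothetical and chosen only for this example.
FAILURE_SIGNATURES = {
    "error_0x0": "Client-core communications error: ERROR 0x0",
    "bad_signature": "Server reports digital signature does not match",
}


def count_failures(log_text):
    """Count how many log lines contain each known failure message."""
    counts = {label: 0 for label in FAILURE_SIGNATURES}
    for line in log_text.splitlines():
        for label, message in FAILURE_SIGNATURES.items():
            if message in line:
                counts[label] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "[18:33:50] CoreStatus = 0 (0)\n"
        "[18:33:50] Client-core communications error: ERROR 0x0\n"
        "[23:19:39] - Server reports digital signature does not match.\n"
    )
    print(count_failures(sample))
```

Note that a failed upload is retried, so the signature message can appear more than once for the same WU.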
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2677 - ERROR 0x0 and lost WU

Post by bruce »

This is the part I was asking about:

[19:44:16] - Preparing to get new work unit...
[19:44:16] + Attempting to get work packet
[19:44:16] - Will indicate memory of 640 MB
[19:44:16] - Connecting to assignment server
[19:44:16] Connecting to http://assign.stanford.edu:8080/
[19:44:17] Posted data.
[19:44:17] - Successful: assigned to (xxx.xxx.xxx.xxx)
ethomaz
Posts: 7
Joined: Tue Aug 19, 2008 5:15 pm

Re: Project: 2677 - ERROR 0x0 and lost WU

Post by ethomaz »

Now I understand :P.

VM with 640MB RAM = "[12:08:21] - Will indicate memory of 620 MB" - Error
VM with 672MB RAM = "[12:20:56] - Will indicate memory of 650 MB" - OK

FAH detects the memory setting automatically. For processing project 2677 WUs, a VM with 672 MB of RAM is enough for now.

Thanks.
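
For reference, the check above can be sketched in Python. The 650 MB threshold is only the smallest reported value that worked in this thread (620 MB failed); it is an assumption drawn from these two data points, not an official requirement. The function names are made up for illustration.

```python
import re

# Matches the line printed when the client requests work, e.g.:
# [12:08:21] - Will indicate memory of 620 MB
MEMORY_RE = re.compile(r"Will indicate memory of (\d+) MB")

# Assumption: smallest reported value that worked in this thread.
OBSERVED_OK_MB = 650


def reported_memory_mb(log_text):
    """Return the last memory value the client reported, or None if absent."""
    values = MEMORY_RE.findall(log_text)
    return int(values[-1]) if values else None


def looks_too_small(log_text, threshold_mb=OBSERVED_OK_MB):
    """Return True if the last reported memory is below the threshold."""
    memory = reported_memory_mb(log_text)
    return memory is not None and memory < threshold_mb


if __name__ == "__main__":
    print(looks_too_small("[12:08:21] - Will indicate memory of 620 MB"))
    print(looks_too_small("[12:20:56] - Will indicate memory of 650 MB"))
```

The line only appears in the log at a high enough verbosity level, so a missing line is treated as "unknown" rather than as a failure.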