Can't send ok WU (?)

Moderators: Site Moderators, FAHC Science Team

él Mero
Posts: 49
Joined: Sun Dec 02, 2007 1:14 pm

Can't send ok WU (?)

Post by él Mero »

Hi,

I had a WU not doing anything for about 30 minutes, stuck on "Folding@home Core Shutdown: FINISHED_UNIT".
So I closed it and tried to send it. That didn't work. I ran qfix, which said the WU was OK. I also checked with the -queueinfo flag to see what was up, and it reported that the WU was ready. I tried to send it again, with no luck. Is there anything else I might try?

Code: Select all

Launch directory: /root/folding1
Executable: ./fah6
Arguments: -send 06 

[06:02:54] - Ask before connecting: No
[06:02:54] - User name: el_Mero (Team 37451)
[06:02:54] - User ID: 5968848105C326C3
[06:02:54] - Machine ID: 1
[06:02:54] 
[06:02:54] Loaded queue successfully.
[06:02:54] Attempting to return result(s) to server...
[06:02:54] - Failed to send unit 06 to server

Folding@Home Client Shutdown.


--- Opening Log file [February 1 06:03:00] 


# Linux Console Edition #######################################################
###############################################################################

                       Folding@Home Client Version 6.00beta1

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /root/folding1
Executable: ./fah6
Arguments: -queueinfo 

[06:03:00] - Ask before connecting: No
[06:03:00] - User name: el_Mero (Team 37451)
[06:03:00] - User ID: 5968848105C326C3
[06:03:00] - Machine ID: 1
[06:03:00] 
[06:03:00] Loaded queue successfully.
[06:03:00] Printing Queue Information
CURRENT QUEUE: 
00  EMPTY    
01  EMPTY    
02  EMPTY    
03  EMPTY    
04  EMPTY    
05  EMPTY    
06 *READY     a1 171.64.65.63:8080  January 31 09:23 | February 3 02:26
[    P3062R5C26G27 ]
07  EMPTY    
08  EMPTY    
09  EMPTY    

Folding@Home Client Shutdown.


--- Opening Log file [February 1 06:06:22] 


# Linux Console Edition #######################################################
###############################################################################

                       Folding@Home Client Version 6.00beta1

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /root/folding1
Executable: ./fah6
Arguments: -send all 

[06:06:22] - Ask before connecting: No
[06:06:22] - User name: el_Mero (Team 37451)
[06:06:22] - User ID: 5968848105C326C3
[06:06:22] - Machine ID: 1
[06:06:22] 
[06:06:22] Loaded queue successfully.
[06:06:22] Attempting to return result(s) to server...

Folding@Home Client Shutdown.
MstrBlstr
Posts: 578
Joined: Thu Nov 29, 2007 7:03 pm
Location: Texas

Re: Can't send ok WU (?)

Post by MstrBlstr »

Let the client run. See what happens (without the -send all).
-=MB=-
él Mero
Posts: 49
Joined: Sun Dec 02, 2007 1:14 pm

Re: Can't send ok WU (?)

Post by él Mero »

This happens:

Code: Select all

Launch directory: /root/folding1
Executable: ./fah6
Arguments: -local -smp -verbosity 9 -forceasm -advmethods 

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[09:56:04] - Ask before connecting: No
[09:56:04] - User name: el_Mero (Team 37451)
[09:56:04] - User ID: 5968848105C326C3
[09:56:04] - Machine ID: 1
[09:56:04] 
[09:56:05] Loaded queue successfully.
[09:56:05] - Autosending finished units...
[09:56:05] Trying to send all finished work units
[09:56:05] + No unsent completed units remaining.
[09:56:05] - Autosend completed
[09:56:05] 
[09:56:05] + Processing work unit
[09:56:05] Core required: FahCore_a1.exe
[09:56:05] Core found.
[09:56:05] Working on Unit 06 [February 1 09:56:05]
[09:56:05] + Working ...
[09:56:05] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 06 -priority 96 -checkpoint 3 -forceasm -verbose -lifeline 6316 -version 600'

[09:56:05] 
[09:56:05] *------------------------------*
[09:56:05] Folding@Home Gromacs SMP Core
[09:56:05] Version 1.74 (November 27, 2006)
[09:56:05] 
[09:56:05] Preparing to commence simulation
[09:56:05] - Ensuring status. Please wait.
[09:56:22] - Assembly optimizations manually forced on.
[09:56:22] - Not checking prior termination.
[09:56:22] Finalizing output
[09:56:22] - Starting from initial work packet
[09:56:22] 
[09:56:22] Project: 0 (Run 0, Clone 0, Gen 0)
[09:56:22] 
[09:56:22] Error: Could not write local file.  Exiting.
[09:56:22] - Shutting down core
[09:56:47] ***** Got an Activate signal (2)
[09:56:47] Killing all core threads

Folding@Home Client Shutdown.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Can't send ok WU (?)

Post by bruce »

él Mero wrote:[09:56:22] Error: Could not write local file. Exiting.
This means that your OS is preventing FAH from updating its files. You probably need to fix the permissions on /root/folding1 so that the user that's running FAH has full access to all the files -- including the /work directory.
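
One rough way to test this is to list everything under the client directory that the current user cannot write to. This is only a minimal sketch: list_unwritable is a made-up helper name, not part of FAH. Keep in mind that root typically passes the -w test regardless of the mode bits, so an empty result when running as root does not prove much.

```shell
# Hypothetical helper (not part of FAH): print every file or directory
# under "$1" that the current user cannot write to.
list_unwritable() {
    find "$1" -print | while IFS= read -r path; do
        [ -w "$path" ] || printf '%s\n' "$path"
    done
}

# Example call, using the launch directory from the log above:
#   list_unwritable /root/folding1
```

If this prints nothing, every file and directory passed the write test for the current user, and the cause probably lies elsewhere.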
él Mero
Posts: 49
Joined: Sun Dec 02, 2007 1:14 pm

Re: Can't send ok WU (?)

Post by él Mero »

bruce wrote:
él Mero wrote:[09:56:22] Error: Could not write local file. Exiting.
This means that your OS is preventing FAH from updating its files. You probably need to fix the permissions on /root/folding1 so that the user that's running FAH has full access to all the files -- including the /work directory.
But I'm running the client as root, and as far as I know root should have full permissions?

Anyway, when I run "ls -l" on the folding folder, it displays this for fah6:

Code: Select all

-rwx------ 1 8372 37 252100 2007-09-27 16:09 fah6

instead of something like this, for example for FahCore_a1.exe:

Code: Select all

-rwxr-x--- 1 root root 3625104 2008-01-28 04:34 FahCore_a1.exe
And why, when I start it normally, does it display "Project: 0 (Run 0, Clone 0, Gen 0)"? Is it because WU 6 is already finished?
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Can't send ok WU (?)

Post by bruce »

I'm not sure why you're running as root, but that's your business.

Please do a ls -l /root/folding1.

I suspect that a chmod +rwx -R /root/folding1 will fix it.
él Mero
Posts: 49
Joined: Sun Dec 02, 2007 1:14 pm

Re: Can't send ok WU (?)

Post by él Mero »

I'm running as root because I'm not too good at creating users on a Linux server. I know about the security issues, but for now, when I'm only running folding on that machine, I don't really care.

Setting the permissions to max didn't help much either; I still get the same errors as when I started this thread.

[all files in the folder are "-rwxr-xr-x" now]
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Can't send ok WU (?)

Post by bruce »

él Mero wrote:I'm running as root because I'm not too good at creating users on a Linux server. I know about the security issues, but for now, when I'm only running folding on that machine, I don't really care.

Setting the permissions to max didn't help much either; I still get the same errors as when I started this thread.

[all files in the folder are "-rwxr-xr-x" now]
Post a new log beginning after the chmod.

What are the permissions on "work"?
What files are in it?
él Mero
Posts: 49
Joined: Sun Dec 02, 2007 1:14 pm

Re: Can't send ok WU (?)

Post by él Mero »

Aw, man! Now I get a neat little message when running "./fah6 -send all":

Code: Select all

Folding@Home: This beta expired on February 2, 2008
just my luck :roll:

So how do I proceed now? Is this an impossible situation?


Edit: Solved it by pasting in the new fah6 and mpiexec. Here's all:
Image

Work folder:
Image

Still doesn't work though ^^
anandhanju
Posts: 526
Joined: Mon Dec 03, 2007 4:33 am
Location: Australia

Re: Can't send ok WU (?)

Post by anandhanju »

Project: 0 (Run 0, Clone 0, Gen 0)? That sounds weird. I haven't run FAH on *nix myself, so I'm not certain if this is expected. Can someone confirm whether this is the problem? Maybe clearing out queue.dat and the work dir might help, although I'd rather you wait for someone more knowledgeable than me to comment on this.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Can't send ok WU (?)

Post by bruce »

anandhanju wrote:Project: 0 (Run 0, Clone 0, Gen 0)? That sounds weird. I haven't run FAH on *nix myself, so I'm not certain if this is expected. Can someone confirm whether this is the problem? Maybe clearing out queue.dat and the work dir might help, although I'd rather you wait for someone more knowledgeable than me to comment on this.
I suspect it's the same problem. The client cannot update the file so the values do not get stored.

Since you don't have anything going, stop the client, remove queue.dat and the /work directory, and restart the client. It should recreate both files and we'll have to see if it can change them.
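
If you would rather not delete anything outright, the same reset can be sketched by moving the two items aside. This is a variation on the removal described above, not an official procedure: reset_fah_queue and the backup folder name are hypothetical, and it assumes the client has already been stopped. From the client's point of view the WU is still dumped.

```shell
# Hypothetical helper (not part of FAH): move queue.dat and the work
# directory of a stopped client into a timestamped backup folder, so the
# client has to recreate them on its next start.
reset_fah_queue() {
    fah_dir="$1"
    backup="$fah_dir/backup-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$backup" || return 1
    for item in queue.dat work; do
        if [ -e "$fah_dir/$item" ]; then
            mv "$fah_dir/$item" "$backup/" || return 1
        fi
    done
    # Print where the old files went
    printf '%s\n' "$backup"
}

# Example call, using the launch directory from the log above:
#   reset_fah_queue /root/folding1
```

Once the client has recreated queue.dat and the work directory and is folding again, the backup folder can be removed.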
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Re: Can't send ok WU (?)

Post by Ivoshiee »

bruce wrote:
él Mero wrote:[09:56:22] Error: Could not write local file. Exiting.
This means that your OS is preventing FAH from updating its files. You probably need to fix the permissions on /root/folding1 so that the user that's running FAH has full access to all the files -- including the /work directory.
That is not a file permission issue: queue.dat or some other FAH file got corrupted by the termination of the FAH SMP client.
What core was it running? I get this zombie "Project: 0 (Run 0, Clone 0, Gen 0)" at every user controlled termination of FAH Core A2, so I tend not to stop the FAH SMP client myself (Note: Rebooting the box does not give this error).

The only "fix" I've ever found has been deleting queue.dat and the /work directory, which dumps the WU. I do not use qfix, and from your post it appears it didn't help you at all. I do not know whether qfix will let you regenerate the queue.dat file from scratch. If it will, then delete queue.dat and rebuild it with qfix. If the SMP client still fails, then the error is not within queue.dat but in some /work file. At that point we have no tools to verify the content of any of those files, so dump the WU.
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Re: Can't send ok WU (?)

Post by Ivoshiee »

After some experimenting with the FAH SMP Linux client and the A2 core: you cannot restart the FAH SMP client for about 2 minutes after you have terminated it (e.g. with CTRL+C), because even though fah6 itself has been terminated, the FahCore_a2.exe processes and mpiexec continue to run for about that long.
If you restart the FAH SMP client before then, the FAH WU files are still locked by the running FAH cores and you will get that "Project 0..." error.
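
Rather than guessing how long to wait, one could poll until the leftover processes are gone before restarting. The following is only a sketch under assumptions: wait_for_exit is a hypothetical helper, it assumes pgrep (from procps) is installed, and the 5-second poll interval is arbitrary.

```shell
# Hypothetical helper (not part of FAH): wait until none of the named
# processes is running, or give up after "$1" seconds in total.
# Assumes pgrep (procps) is available.
wait_for_exit() {
    timeout="$1"
    shift
    waited=0
    for name in "$@"; do
        # pgrep -x matches the exact process name
        while pgrep -x "$name" >/dev/null 2>&1; do
            if [ "$waited" -ge "$timeout" ]; then
                return 1    # still running after the timeout
            fi
            sleep 5
            waited=$((waited + 5))
        done
    done
    return 0
}

# Example: wait up to 10 minutes, then start the client again.
#   wait_for_exit 600 FahCore_a2.exe mpiexec && ./fah6 -smp
```

The function returns 0 once the cores and mpiexec have exited, so the client only restarts when the WU files should no longer be locked.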
él Mero
Posts: 49
Joined: Sun Dec 02, 2007 1:14 pm

Re: Can't send ok WU (?)

Post by él Mero »

Ivoshiee wrote:What core was it running?
FahCore_a1
Ivoshiee wrote:I get this zombie "Project: 0 (Run 0, Clone 0, Gen 0)" at every user controlled termination of FAH Core A2, so I tend not to stop the FAH SMP client myself (Note: Rebooting the box does not give this error).
ok, will try the reboot-fix next time/if the client hangs
Ivoshiee wrote:I do not know if the qfix will let you regenerate the queue.dat file from scratch. If it will then delete the queue.dat and rebuild it with qfix.
no, it will not, error message: "Can't open <queue.dat> file"
Ivoshiee wrote:Only "fix" I've ever found has been deleting the queue.dat and /work directory - dumping the WU.
that started another WU, so the client still works though
Ivoshiee wrote:If the SMP client still fails, then the error is not within queue.dat but in some /work file. At that point we have no tools to verify the content of any of those files, so dump the WU.
dumped, now working on a new WU, better that the pc does some work instead of just idling, seems like I couldn't send the results anyway, want me to email it? ^^
Ivoshiee wrote:After some experimenting with the FAH SMP Linux client and the A2 core: you cannot restart the FAH SMP client for about 2 minutes after you have terminated it (e.g. with CTRL+C), because even though fah6 itself has been terminated, the FahCore_a2.exe processes and mpiexec continue to run for about that long. If you restart the FAH SMP client before then, the FAH WU files are still locked by the running FAH cores and you will get that "Project 0..." error.
I probably did this, will only happen once though ^^
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Re: Can't send ok WU (?)

Post by Ivoshiee »

Note:
The FahCore_a2.exe "terminating" does not always take about 2 minutes; in some tests it took as long as 6 minutes! I suspect it can take any number of minutes. Maybe "killall -9 FahCore_a2.exe mpiexec" will do the trick a bit sooner...