What to do when your SMP client hangs at 100% completion


Post by uncle_fungus » Fri Apr 04, 2008 5:07 pm

This is a post that was made on the old forum, but still applies.
(Edited to adapt from Linux to OSX context)

First, make sure that this is actually the problem you have. Compare your log with the one below:
[22:36:42] Completed 4800000 out of 5000000 steps  (96 percent)
[22:45:59] Writing local files
[22:45:59] Completed 4850000 out of 5000000 steps  (97 percent)
[22:55:14] Writing local files
[22:55:14] Completed 4900000 out of 5000000 steps  (98 percent)
[23:04:30] Writing local files
[23:04:30] Completed 4950000 out of 5000000 steps  (99 percent)
[23:13:45] Writing local files
[23:13:45] Completed 5000000 out of 5000000 steps  (100 percent)
[23:13:45] Writing final coordinates.
[23:13:45] Past main M.D. loop
[23:13:45] Will end MPI now
[23:14:45]
[23:14:45] Finished Work Unit:
[23:14:45] - Reading up to 232536 from "work/wudata_08.arc": Read 232536
[23:14:45] - Reading up to 6860708 from "work/wudata_08.xtc": Read 6860708
[23:14:45] goefile size: 0
[23:14:45] logfile size: 129438
[23:14:45] Leaving Run
[23:14:47] - Writing 7362098 bytes of core data to disk...
[23:14:47]   ... Done.
[23:14:50] - Shutting down core
[23:14:50]
[23:14:50] Folding@home Core Shutdown: FINISHED_UNIT

<<Nothing appears to be happening after this point apart from the automatic upload attempts>>

[02:24:50] - Autosending finished units...
[02:24:50] Trying to send all finished work units
[02:24:50] + No unsent completed units remaining.
[02:24:50] - Autosend completed
[08:24:50] - Autosending finished units...
[08:24:50] Trying to send all finished work units
[08:24:50] + No unsent completed units remaining.
[08:24:50] - Autosend completed

<<Requires the client to be manually killed>>

[14:21:12] ***** Got an Activate signal (2)
[14:21:12] Killing all core threads

Folding@Home Client Shutdown.
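If you would rather check your log mechanically than by eye, here is a minimal sketch in shell. It assumes the client log is a plain-text file in the format shown above; the function name looks_hung is made up for this example, and the file name FAHlog.txt in the usage comment is an assumption, so pass whatever your log is called. It reports a hang only when a FINISHED_UNIT line is followed by nothing but autosend messages that find nothing to send.

```shell
#!/bin/sh
# Sketch only: succeed (exit 0) if the log shows a finished unit followed
# solely by autosend attempts reporting that nothing is waiting to be sent.
looks_hung() {
    awk '
        # The core reported that the unit finished.
        /Core Shutdown: FINISHED_UNIT/ { finished = 1; next }
        # Ignore everything until a unit has finished.
        !finished { next }
        # Skip empty lines and lines holding only a timestamp.
        /^(\[[0-9:]*\])?[ \t]*$/ { next }
        # Routine autosend chatter does not count as activity.
        /Autosending finished units|Trying to send all finished work units|Autosend completed/ { next }
        # An autosend right after the finish found nothing to send.
        /No unsent completed units remaining/ { hung = 1; next }
        # Any other line means the client moved on to something else.
        { finished = 0 }
        END { exit (hung ? 0 : 1) }
    ' "$1"
}

# Usage (log file name is an assumption):
#   looks_hung FAHlog.txt && echo "Looks like the 100% hang"
```

A match only says the log resembles the pattern above; the qfix steps that follow are still what confirm and fix it.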


Stop your client.

Very important: Do NOT simply restart your client. Doing so will eventually trash the queue and cause your client to think it's processing Project: 0 (Run 0, Clone 0, Gen 0). If this happens, you'll have to use qgen to generate a new queue.

Download qfix from here: http://linuxminded.xs4all.nl/?target=so ... -tools.plc and place it in your SMP client folder. Be sure to get the Mac OSX binary for Intel Macs or the universal binary.

Open a Terminal window and make the SMP client folder the current directory. Type
 cd
(with a space after "cd"), drag the icon for your SMP client folder into the Terminal window, then press Return.

Give yourself permission to execute the qfix binary (either through your desktop environment or via this command):
chmod +x qfix
then press Return.

Run qfix from a terminal like this:

./qfix


Note its output, which will probably look a little like this:
entry 9, status 0, address 171.64.65.56:8080
entry 0, status 0, address 171.64.65.56:8080
entry 1, status 0, address 171.64.65.56:8080
entry 2, status 0, address 171.64.65.56:8080
entry 3, status 0, address 171.64.65.56:8080
entry 4, status 0, address 171.64.65.56:8080
entry 5, status 0, address 171.64.65.56:8080
entry 6, status 0, address 171.64.65.56:8080
entry 7, status 0, address 171.64.65.56:8080
entry 8, status 1, address 171.64.65.56:8080
  Found results <work/wuresults_08.dat>: proj 3027, run 1, clone 84, gen 18
   -- queue entry: proj 3027, run 1, clone 84, gen 18
   -- queue entry isn't empty
File is OK


The entry with results waiting may be different in your queue, but in this case it's entry 8 that has hung. Notice that at this point qfix doesn't think there is a problem. Unfortunately, there is. This is the "diagnostic" run of qfix.
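To pull the entry number out of the qfix output without reading it line by line, something like the following sketch may help. It assumes qfix prints a "Found results <work/wuresults_NN.dat>" line in the format shown above; the function name stuck_entries is made up for this example.

```shell
#!/bin/sh
# Sketch only: read qfix output on stdin and print the two-digit index of
# every queue entry for which qfix reports a waiting results file.
stuck_entries() {
    sed -n 's/.*Found results <work\/wuresults_\([0-9][0-9]\)\.dat>.*/\1/p'
}

# Usage, from the SMP client folder:
#   ./qfix | stuck_entries
```

The number it prints is the one to use in place of 08 in the next step.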

Now you need to run the SMP client from the Terminal using the -delete flag and the queue entry number you've just found (replace 08 with your queue entry):

./fah6 -local -delete 08


On Linux you needn't include the -local flag, but be SURE to use it on OSX.
This operation will take about 4 minutes and may produce an error saying it could not remove all items from the queue; ignore it.
The command deletes the erroneous entry from the queue but leaves the result files alone.

[16:06:50] Loaded queue successfully.
[16:06:50] Deleting work unit #8 from work queue...
[0]0:Return code = 18
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[16:11:11] - Failed to delete the requested work unit

Folding@Home Client Shutdown.
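Since the delete step is supposed to leave the result files alone, it can be reassuring to confirm that before going on. A minimal sketch, assuming you are still in the SMP client folder and that the results file is named as in the qfix output above (work/wuresults_NN.dat); the function name results_present is made up for this example.

```shell
#!/bin/sh
# Sketch only: succeed if the results file for the given two-digit queue
# entry still exists and is not empty.
results_present() {
    entry="$1"   # e.g. 08, the entry number reported by qfix
    [ -s "work/wuresults_${entry}.dat" ]
}

# Usage:
#   results_present 08 && echo "Results are still there"
```

If the file is missing, the second qfix run described next will have nothing to requeue.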


Now that the broken queue entry has been deleted, it's time to run qfix again. This time qfix will state that it has fixed a broken entry and requeued the result for upload.
entry 9, status 0, address 171.64.65.56:8080
entry 0, status 0, address 171.64.65.56:8080
entry 1, status 0, address 171.64.65.56:8080
entry 2, status 0, address 171.64.65.56:8080
entry 3, status 0, address 171.64.65.56:8080
entry 4, status 0, address 171.64.65.56:8080
entry 5, status 0, address 171.64.65.56:8080
entry 6, status 0, address 171.64.65.56:8080
entry 7, status 0, address 171.64.65.56:8080
entry 8, status 1, address 171.64.65.56:8080
  Found results <work/wuresults_08.dat>:  proj 3027, run 1, clone 84, gen 18
   -- queue entry: proj 3027, run 1, clone 84, gen 18
   -- requeued for upload
File needed repair. Errors fixed: 1.


It may not look exactly like the above, but it will be close; the "1" at the problem entry may be replaced by a "0".
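Because the exact output can vary, the line worth looking for is the "requeued for upload" one. A small sketch that checks for it, assuming qfix words that line as in the sample output above; the function name qfix_requeued is made up for this example.

```shell
#!/bin/sh
# Sketch only: read qfix output on stdin and succeed if it reports that a
# result was requeued for upload.
qfix_requeued() {
    # -e keeps grep from treating the leading dashes as an option.
    grep -q -e '-- requeued for upload'
}

# Usage, from the SMP client folder:
#   ./qfix | qfix_requeued && echo "Result requeued, ready to send"
```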
Finally, restart your client from the Terminal window, and the fixed queue will allow the results to be sent.
./fah6 -local -send all


The client will send the results, then shut down.
Now you can restart Folding with your usual flags from the Terminal window if you normally use the console client. If not, type exit at the prompt and close the Terminal window, then restart Folding from the System Preferences pane if you use the Universal (PrefPane) client, or from the InCrease window if you use InCrease to control and monitor F@H.

Fold on!

Edit: Added giving qfix execute permissions - 2007/02/26
Edited by susato to add "cd to smp client folder" and use of -local flag - 2009/01/31