What to do when your SMP client hangs at 100% completion

Post by uncle_fungus » Fri Apr 04, 2008 5:07 pm

This post was originally made on the old forum, but it still applies.

First, make sure that this is actually the problem you have. See the log below:
Code: Select all
[22:36:42] Completed 4800000 out of 5000000 steps  (96 percent)
[22:45:59] Writing local files
[22:45:59] Completed 4850000 out of 5000000 steps  (97 percent)
[22:55:14] Writing local files
[22:55:14] Completed 4900000 out of 5000000 steps  (98 percent)
[23:04:30] Writing local files
[23:04:30] Completed 4950000 out of 5000000 steps  (99 percent)
[23:13:45] Writing local files
[23:13:45] Completed 5000000 out of 5000000 steps  (100 percent)
[23:13:45] Writing final coordinates.
[23:13:45] Past main M.D. loop
[23:13:45] Will end MPI now
[23:14:45]
[23:14:45] Finished Work Unit:
[23:14:45] - Reading up to 232536 from "work/wudata_08.arc": Read 232536
[23:14:45] - Reading up to 6860708 from "work/wudata_08.xtc": Read 6860708
[23:14:45] goefile size: 0
[23:14:45] logfile size: 129438
[23:14:45] Leaving Run
[23:14:47] - Writing 7362098 bytes of core data to disk...
[23:14:47]   ... Done.
[23:14:50] - Shutting down core
[23:14:50]
[23:14:50] Folding@home Core Shutdown: FINISHED_UNIT

<<Nothing appears to be happening after this point apart from the automatic upload attempts>>

[02:24:50] - Autosending finished units...
[02:24:50] Trying to send all finished work units
[02:24:50] + No unsent completed units remaining.
[02:24:50] - Autosend completed
[08:24:50] - Autosending finished units...
[08:24:50] Trying to send all finished work units
[08:24:50] + No unsent completed units remaining.
[08:24:50] - Autosend completed

<<Requires the client to be manually killed>>

[14:21:12] ***** Got an Activate signal (2)
[14:21:12] Killing all core threads

Folding@Home Client Shutdown.
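
If you'd rather not eyeball the log, the pattern above can be checked with a small shell function. This is only a rough sketch: the function name smp_hang_detected is made up for illustration, and it assumes that a FINISHED_UNIT core shutdown followed by an autosend reporting nothing to send, with no new progress lines after it, is a reliable sign of the hang.

```shell
# Hypothetical helper: returns 0 (true) when the log given as $1 shows the
# hang signature from the log above, i.e. a FINISHED_UNIT core shutdown
# followed later by an autosend that finds nothing to send.
smp_hang_detected() {
    awk '
        # The core reported a finished unit: start watching for the symptom.
        /Core Shutdown: FINISHED_UNIT/ { finished = 1; stuck = 0; next }
        # Autosend found nothing although a unit had just finished.
        finished && /No unsent completed units remaining/ { stuck = 1 }
        # New progress lines mean a work unit is running again: reset.
        /Completed [0-9]+ out of [0-9]+ steps/ { finished = 0; stuck = 0 }
        END { exit (stuck ? 0 : 1) }
    ' "$1"
}
```

Usage, assuming your client log is called FAHlog.txt and sits in the client folder: smp_hang_detected FAHlog.txt && echo "client looks hung"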


Stop your client.

Very important: do NOT try to restart your client. Doing so will eventually trash the queue and cause your client to think it's processing Project: 0 (Run 0, Clone 0, Gen 0). If this happens, you'll have to use qgen to generate a new queue.

Download qfix from here: http://linuxminded.xs4all.nl/?target=so ... -tools.plc and place it in your SMP client folder.

Give yourself permission to execute the qfix binary (either through your desktop environment or via this command):
Code: Select all
chmod +x qfix


Run qfix from a terminal like this:

Code: Select all
./qfix


Note its output, which will probably look a little like this:
Code: Select all
entry 9, status 0, address 171.64.65.56:8080
entry 0, status 0, address 171.64.65.56:8080
entry 1, status 0, address 171.64.65.56:8080
entry 2, status 0, address 171.64.65.56:8080
entry 3, status 0, address 171.64.65.56:8080
entry 4, status 0, address 171.64.65.56:8080
entry 5, status 0, address 171.64.65.56:8080
entry 6, status 0, address 171.64.65.56:8080
entry 7, status 0, address 171.64.65.56:8080
entry 8, status 1, address 171.64.65.56:8080
  Found results <work/wuresults_08.dat>: proj 3027, run 1, clone 84, gen 18
   -- queue entry: proj 3027, run 1, clone 84, gen 18
   -- queue entry isn't empty
File is OK


The entry with results waiting may be different in your queue, but in this case it's entry 8 that has hung. Notice that at this point qfix doesn't think there is a problem. Unfortunately, there is.
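
If your queue is harder to read, the entry number can be pulled out of the qfix output with awk. This is a minimal sketch, not part of qfix: the function name hung_entries is hypothetical, and it assumes qfix prints the listing shown above to standard output, with each "Found results" line directly following the entry it belongs to.

```shell
# Hypothetical helper: reads qfix output on stdin and prints the
# zero-padded number of every queue entry that has results waiting.
hung_entries() {
    awk '
        # Remember the most recent entry number, e.g. "8," becomes "8".
        /^entry [0-9]+,/ { sub(/,/, "", $2); entry = $2 }
        # A results line belongs to the entry printed just before it.
        /Found results/  { printf "%02d\n", entry }
    '
}
```

Usage: ./qfix | hung_entries
With the output shown above, this prints 08, which is the number to use in the next step.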

Now you need to run the SMP client from a terminal using the -delete flag and the queue entry number you've just found (replace 08 with your queue entry):

Code: Select all
./fah6 -delete 08


This operation will take about 4 minutes and will end with an error saying it failed to delete the requested work unit. Ignore it.

Code: Select all
[16:06:50] Loaded queue successfully.
[16:06:50] Deleting work unit #8 from work queue...
[0]0:Return code = 18
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[16:11:11] - Failed to delete the requested work unit

Folding@Home Client Shutdown.


Now that the broken queue entry has been deleted, it's time to run qfix again. This time qfix will state that it has fixed a broken entry and requeued the result for upload.
Code: Select all
entry 9, status 0, address 171.64.65.56:8080
entry 0, status 0, address 171.64.65.56:8080
entry 1, status 0, address 171.64.65.56:8080
entry 2, status 0, address 171.64.65.56:8080
entry 3, status 0, address 171.64.65.56:8080
entry 4, status 0, address 171.64.65.56:8080
entry 5, status 0, address 171.64.65.56:8080
entry 6, status 0, address 171.64.65.56:8080
entry 7, status 0, address 171.64.65.56:8080
entry 8, status 1, address 171.64.65.56:8080
  Found results <work/wuresults_08.dat>:  proj 3027, run 1, clone 84, gen 18
   -- queue entry: proj 3027, run 1, clone 84, gen 18
   -- requeued for upload
File needed repair. Errors fixed: 1.


Lastly, you can restart your client, and the fixed queue will allow the results to be sent.

Edit: Added giving qfix execute permissions - 2007/02/26

Re: What to do when your SMP client hangs at 100% completion

Post by bruce » Sat Dec 01, 2012 6:28 am

Just an additional note from a recent event.
TheWolf wrote: Just wanted to let everyone know that ... [this] does still work.


In his case, he made a backup copy of FAH's files and wants to remind us to use "./fah6 -send all" when restarting folding. This lets the client close after the upload has finished and stops it from downloading another work unit on the (re-)start of F@H.

He had already dumped the problematic WU and started clean before he discovered this method, and a new WU had been downloaded and was at 25% by then. Restarting the old backup with just "./fah6" per the above instructions did successfully start the upload, but it also downloaded a new WU, leaving him with two new WUs downloaded into two copies of the queue. To avoid this, ALWAYS use "./fah6 -send all" when restarting on a backup copy, just in case you are in a situation like his.
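
For anyone who wants to take the same kind of backup before trying the fix, here is one way it could be done. This is only a sketch under assumptions: the function name backup_client_dir and both paths are hypothetical, and it assumes that copying the whole client folder (with the client stopped) is enough to preserve the queue and work files.

```shell
# Hypothetical helper: copy the whole client folder ($1) into a timestamped
# folder under $2 and print the path of the new backup.
backup_client_dir() {
    src=$1
    dest_root=$2
    dest="$dest_root/fah-backup-$(date +%Y%m%d-%H%M%S)"
    # cp -a keeps permissions and timestamps, so fah6 and qfix stay executable.
    mkdir -p "$dest_root" && cp -a "$src" "$dest" && printf '%s\n' "$dest"
}
```

Usage (paths are examples only): backup_client_dir ~/fah ~/fah-backups
If you later restart from that backup copy, change into the printed folder and run ./fah6 -send all rather than plain ./fah6, for the reason given above.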