First, make sure that this is actually the problem you have. Compare your client log with the one below:
- Code: Select all
[22:36:42] Completed 4800000 out of 5000000 steps (96 percent)
[22:45:59] Writing local files
[22:45:59] Completed 4850000 out of 5000000 steps (97 percent)
[22:55:14] Writing local files
[22:55:14] Completed 4900000 out of 5000000 steps (98 percent)
[23:04:30] Writing local files
[23:04:30] Completed 4950000 out of 5000000 steps (99 percent)
[23:13:45] Writing local files
[23:13:45] Completed 5000000 out of 5000000 steps (100 percent)
[23:13:45] Writing final coordinates.
[23:13:45] Past main M.D. loop
[23:13:45] Will end MPI now
[23:14:45]
[23:14:45] Finished Work Unit:
[23:14:45] - Reading up to 232536 from "work/wudata_08.arc": Read 232536
[23:14:45] - Reading up to 6860708 from "work/wudata_08.xtc": Read 6860708
[23:14:45] goefile size: 0
[23:14:45] logfile size: 129438
[23:14:45] Leaving Run
[23:14:47] - Writing 7362098 bytes of core data to disk...
[23:14:47] ... Done.
[23:14:50] - Shutting down core
[23:14:50]
[23:14:50] Folding@home Core Shutdown: FINISHED_UNIT
<<Nothing appears to be happening after this point apart from the automatic upload attempts>>
[02:24:50] - Autosending finished units...
[02:24:50] Trying to send all finished work units
[02:24:50] + No unsent completed units remaining.
[02:24:50] - Autosend completed
[08:24:50] - Autosending finished units...
[08:24:50] Trying to send all finished work units
[08:24:50] + No unsent completed units remaining.
[08:24:50] - Autosend completed
<<Requires the client to be manually killed>>
[14:21:12] ***** Got an Activate signal (2)
[14:21:12] Killing all core threads
Folding@Home Client Shutdown.
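If you'd rather not eyeball the log, the pattern above can be checked with a small script. This is only a rough sketch based on the excerpt above: `looks_hung` is a made-up helper name, and it assumes that a healthy client logs something else (not an autosend) right after the `FINISHED_UNIT` line. Pass it the path to your own client log.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the log shows a FINISHED_UNIT
# core shutdown that is followed directly by an autosend attempt, which is
# the pattern in the log excerpt above. Heuristic only.
looks_hung() {
    awk '
        # Ignore lines that carry nothing but a timestamp.
        /^\[[0-9:]+\][ \t]*$/ { next }
        /Core Shutdown: FINISHED_UNIT/ { after_finish = 1; next }
        after_finish && NF > 0 {
            if ($0 ~ /Autosending finished units/) hung = 1
            after_finish = 0
        }
        END { exit (hung ? 0 : 1) }
    ' "$1"
}

# Example (the log file name is an assumption, use your own):
#   looks_hung FAHlog.txt && echo "log matches the hung-upload pattern"
```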
Stop your client.
Very important: do NOT try to restart your client at this point. Doing so will eventually trash the queue and cause your client to think it's processing Project: 0 (Run 0, Clone 0, Gen 0). If this happens, you'll have to use qgen to regenerate a new queue.
Download qfix from here: http://linuxminded.xs4all.nl/?target=so ... -tools.plc and place it in your SMP client folder.
Give yourself permission to execute the qfix binary (either through your desktop environment or via this command):
- Code: Select all
chmod +x qfix
Run qfix from a terminal like this:
- Code: Select all
./qfix
Note its output, which will probably look a little like this:
- Code: Select all
entry 9, status 0, address 171.64.65.56:8080
entry 0, status 0, address 171.64.65.56:8080
entry 1, status 0, address 171.64.65.56:8080
entry 2, status 0, address 171.64.65.56:8080
entry 3, status 0, address 171.64.65.56:8080
entry 4, status 0, address 171.64.65.56:8080
entry 5, status 0, address 171.64.65.56:8080
entry 6, status 0, address 171.64.65.56:8080
entry 7, status 0, address 171.64.65.56:8080
entry 8, status 1, address 171.64.65.56:8080
Found results <work/wuresults_08.dat>: proj 3027, run 1, clone 84, gen 18
-- queue entry: proj 3027, run 1, clone 84, gen 18
-- queue entry isn't empty
File is OK
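If you want to pull the entry number out of that output instead of reading it by hand, something like the following should work. It is a minimal sketch that assumes the `Found results <work/wuresults_NN.dat>` line looks exactly like the one above; `hung_entries` is just an illustrative name.

```shell
#!/bin/sh
# Hypothetical helper: reads qfix output on stdin and prints the two-digit
# queue entry number from each "Found results <work/wuresults_NN.dat>" line.
hung_entries() {
    sed -n 's/^Found results <work\/wuresults_\([0-9][0-9]\)\.dat>.*/\1/p'
}

# Example:
#   ./qfix | hung_entries
```

For the sample output above this prints `08`.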
The entry with results waiting may be different in your queue, but in this case it's entry 8 that has hung. Notice that at this point qfix doesn't think there is a problem. Unfortunately, there is.
Now you need to run the SMP client from a terminal using the -delete flag and the queue entry number you've just found (replace 08 with your queue entry):
- Code: Select all
./fah6 -delete 08
This operation will take about 4 minutes and will produce an error saying it could not remove all items from the queue. Ignore the error.
- Code: Select all
[16:06:50] Loaded queue successfully.
[16:06:50] Deleting work unit #8 from work queue...
[0]0:Return code = 18
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[16:11:11] - Failed to delete the requested work unit
Folding@Home Client Shutdown.
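To avoid typos when substituting your own entry number into the -delete command, a small wrapper can build the command line for you. This is a sketch, not part of the client: `delete_command` is a made-up name, it only prints the command rather than running it, and it assumes the queue has entries 0 to 9 as in the qfix output above.

```shell
#!/bin/sh
# Hypothetical helper: validates a queue entry number (0-9, with or without
# a leading zero) and prints the matching -delete command without running it.
delete_command() {
    case "$1" in
        [0-9]|0[0-9]) ;;
        *) echo "queue entry must be between 0 and 9" >&2; return 1 ;;
    esac
    n=${1#0}    # drop a leading zero so printf does not read "08" as octal
    printf './fah6 -delete %02d\n' "${n:-0}"
}

# Example:
#   delete_command 8
```

Copy the printed line into your terminal once you are happy with it.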
Now that the broken queue entry has been deleted, it's time to run qfix again. This time qfix will state that it has fixed a broken entry and requeued the result for upload.
- Code: Select all
entry 9, status 0, address 171.64.65.56:8080
entry 0, status 0, address 171.64.65.56:8080
entry 1, status 0, address 171.64.65.56:8080
entry 2, status 0, address 171.64.65.56:8080
entry 3, status 0, address 171.64.65.56:8080
entry 4, status 0, address 171.64.65.56:8080
entry 5, status 0, address 171.64.65.56:8080
entry 6, status 0, address 171.64.65.56:8080
entry 7, status 0, address 171.64.65.56:8080
entry 8, status 1, address 171.64.65.56:8080
Found results <work/wuresults_08.dat>: proj 3027, run 1, clone 84, gen 18
-- queue entry: proj 3027, run 1, clone 84, gen 18
-- requeued for upload
File needed repair. Errors fixed: 1.
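Before restarting, you can confirm that this second qfix run really did requeue the result. A minimal sketch, assuming the wording matches the output above (`qfix_requeued` is an illustrative name):

```shell
#!/bin/sh
# Hypothetical helper: reads qfix output on stdin and succeeds (exit 0) only
# if it contains the "requeued for upload" line shown above.
qfix_requeued() {
    grep -q 'requeued for upload'
}

# Example:
#   ./qfix | qfix_requeued && echo "result requeued, safe to restart"
```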
Lastly, you can restart your client, and the fixed queue will allow the results to be sent.
Edit: Added giving qfix execute permissions - 2007/02/26

