
SMP unit just hangs

Posted: Tue Dec 04, 2007 6:22 pm
by dmearns
One of my Linux boxes is trying to process Project: 2605 (Run 7, Clone 506, Gen 8). I have tried twice, and each time it hangs right at the start of processing.

Code:

[16:10:10] + Processing work unit
[16:10:10] Core required: FahCore_a1.exe
[16:10:10] Core found.
[16:10:10] Working on Unit 07 [December 4 16:10:10]
[16:10:10] + Working ...
[16:10:10] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 07 -checkpoint 15 -forceasm -verbose -lifeline 21965 -version 600'

[16:10:10] 
[16:10:10] *------------------------------*
[16:10:10] Folding@Home Gromacs SMP Core
[16:10:10] Version 1.74 (November 27, 2006)
[16:10:10] 
[16:10:10] Preparing to commence simulation
[16:10:10] - Ensuring status. Please wait.
[16:10:27] - Assembly optimizations manually forced on.
[16:10:27] - Not checking prior termination.
[16:10:28] - Expanded 2419998 -> 12897049 (decompressed 532.9 percent)
[16:10:28] - Starting from initial work packet
[16:10:28] 
[16:10:28] Project: 2605 (Run 7, Clone 506, Gen 8)
[16:10:28] 
[16:10:28] Assembly optimizations on if available.
[16:10:28] Entering M.D.
[16:10:34] Finalizing output
I don't remember seeing that "Finalizing output" message before. Does anyone know what it means? This box has been folding perfectly for several weeks, so I really don't think this is a hardware issue. Should I just delete this work unit and move on?

Thanks.

- Dave

Posted: Tue Dec 04, 2007 7:10 pm
by toTOW
How long did you wait?

What happens if you restart the client?

If the WU didn't make any progress before, you can delete it. There is a good chance that it's a bad WU :(

Posted: Tue Dec 04, 2007 8:12 pm
by dmearns
The first time, I waited a little over 5 hours. The client tried to send unsent work units at the four-hour mark, so that part of the program was still alive. There was only one FahCore_a1.exe process running, which usually means trouble. The processes did not respond to normal kill commands, so I had to use kill -9. On a restart, it did exactly the same thing. That attempt has been running for about 4 hours now and has produced no output after the "Finalizing output" line.

I will kill it again, and delete the unit.

Thanks

- Dave
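
For anyone hitting the same symptom, the two signs described in this post (no new log lines for hours, and fewer FahCore_a1.exe processes than the `-np 4` in the log implies) can be checked with a small script. This is only a rough sketch, not part of the client: the function names, the one-hour `quiet_limit`, and the assumption that the process listing comes from something like `ps -eo pid,comm` are all illustrative choices. The current time passed in must be on the same clock as the log timestamps.

```python
import re

# Matches the [HH:MM:SS] prefix used in the client log shown above.
STAMP = re.compile(r"^\[(\d\d):(\d\d):(\d\d)\]")


def seconds_quiet(log_text, now_hms):
    """Seconds between the last timestamped log line and now_hms.

    now_hms is an (hour, minute, second) tuple on the same clock as the log.
    Returns None if the log has no timestamped lines. Assumes the gap is
    under 24 hours, since the log lines carry no date.
    """
    last = None
    for line in log_text.splitlines():
        m = STAMP.match(line)
        if m:
            h, mi, s = map(int, m.groups())
            last = h * 3600 + mi * 60 + s
    if last is None:
        return None
    h, mi, s = now_hms
    return (h * 3600 + mi * 60 + s - last) % 86400


def count_cores(ps_output, name="FahCore_a1.exe"):
    """Count lines of a process listing whose last column is the core name."""
    return sum(1 for line in ps_output.splitlines() if line.split()[-1:] == [name])


def looks_hung(log_text, now_hms, ps_output, expected=4, quiet_limit=3600):
    """True if the log has been silent too long AND core processes are missing.

    expected=4 mirrors the '-np 4' in the log; quiet_limit is a guess.
    """
    quiet = seconds_quiet(log_text, now_hms)
    if quiet is None:
        return False
    return quiet > quiet_limit and count_cores(ps_output) < expected
```

Requiring both conditions is deliberate: a quiet log alone could simply be a slow unit between checkpoints, and a low process count alone could be a unit that is just starting up.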

Posted: Tue Dec 04, 2007 10:32 pm
by toTOW
Yes, the WU seems to have a problem :(