Project: 2608 (Run 0, Clone 205, Gen 47)

Postby Xilikon » Mon Dec 17, 2007 2:00 am

I had this WU fail consistently at 70% under Linux SMP:

[16:22:07] Completed 490000 out of 500000 steps  (98 percent)
[16:35:51] Writing local files
[16:35:51] Completed 495000 out of 500000 steps  (99 percent)
[16:49:37] Writing local files
[16:49:37] Completed 500000 out of 500000 steps  (100 percent)
[16:49:37] Writing final coordinates.
[16:49:37] Past main M.D. loop
[16:49:37] Will end MPI now
[16:50:37]
[16:50:37] Finished Work Unit:
[16:50:37] - Reading up to 3721056 from "work/wudata_00.arc": Read 3721056
[16:50:37] - Reading up to 1778972 from "work/wudata_00.xtc": Read 1778972
[16:50:37] goefile size: 0
[16:50:37] logfile size: 32950
[16:50:37] Leaving Run
[16:50:41] - Writing 5537378 bytes of core data to disk...
[16:50:41]   ... Done.
[16:50:45] - Shutting down core
[16:50:45]
[16:50:45] Folding@home Core Shutdown: FINISHED_UNIT
[16:50:57] CoreStatus = 64 (100)
[16:50:57] Unit 0 finished with 31 percent of time to deadline remaining.
[16:50:57] Updated performance fraction: 0.636592
[16:50:57] Sending work to server


[16:50:57] + Attempting to send results
[16:50:57] - Reading file work/wuresults_00.dat from core
[16:50:57]   (Read 5537378 bytes from disk)
[16:50:57] Connecting to http://171.64.65.56:8080/
[16:52:08] Posted data.
[16:52:08] Initial: 0000; - Uploaded at ~75 kB/s
[16:52:09] - Averaged speed for that direction ~70 kB/s
[16:52:09] + Results successfully sent
[16:52:09] Thank you for your contribution to Folding@Home.
[16:52:09] + Number of Units Completed: 57

[16:56:20] - Warning: Could not delete all work unit files (0): Core returned invalid code
[16:56:20] Trying to send all finished work units
[16:56:20] + No unsent completed units remaining.
[16:56:20] - Preparing to get new work unit...
[16:56:20] + Attempting to get work packet
[16:56:20] - Will indicate memory of 751 MB
[16:56:20] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 15, Stepping: 8
[16:56:20] - Connecting to assignment server
[16:56:20] Connecting to http://assign.stanford.edu:8080/
[16:56:20] Posted data.
[16:56:20] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[16:56:20] + News From Folding@Home: Welcome to Folding@Home
[16:56:21] Loaded queue successfully.
[16:56:21] Connecting to http://171.64.65.56:8080/
[16:56:25] Posted data.
[16:56:25] Initial: 0000; - Receiving payload (expected size: 3141701)
[16:56:29] - Downloaded at ~767 kB/s
[16:56:29] - Averaged speed for that direction ~587 kB/s
[16:56:29] + Received work.
[16:56:29] Trying to send all finished work units
[16:56:29] + No unsent completed units remaining.
[16:56:29] + Closed connections
[16:56:29]
[16:56:29] + Processing work unit
[16:56:29] Core required: FahCore_a1.exe
[16:56:29] Core found.
[16:56:29] Working on Unit 01 [December 14 16:56:29]
[16:56:29] + Working ...
[16:56:29] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 01 -priority 96 -checkpoint 15 -verbose -lifeline 4758 -version 600'

[16:56:29]
[16:56:29] *------------------------------*
[16:56:29] Folding@Home Gromacs SMP Core
[16:56:29] Version 1.74 (November 27, 2006)
[16:56:29]
[16:56:29] Preparing to commence simulation
[16:56:29] - Ensuring status. Please wait.
[16:56:30] - Starting from initial work packet
[16:56:30]
[16:56:30] Project: 2608 (Run 0, Clone 205, Gen 47)
[16:56:30]
[16:56:30] Assembly optimizations on if available.
[16:56:30] Entering M.D.
[16:56:47]  percent)
[16:56:48] - Starting from initial work packet
[16:56:48]
[16:56:48] Project: 2608 (Run 0, Clone 205, Gen 47)
[16:56:48]
[16:56:48] Entering M.D.
[16:56:55] files
[16:56:56] n: Protein
[16:56:56] Writing local files
[16:56:57] Extra SSE boost OK.
[17:11:59] t triggered.
[17:12:43] Writing local files
[17:12:43] Completed 5000 out of 500000 steps  (1 percent)
[17:27:43] Timered checkpoint triggered.
[17:27:52] Writing local files
[17:27:52] Completed 10000 out of 500000 steps  (2 percent)
...
[09:21:41] Writing local files
[09:21:41] Completed 340000 out of 500000 steps  (68 percent)
[09:36:12] Writing local files
[09:36:12] Completed 345000 out of 500000 steps  (69 percent)
[09:50:41] Writing local files
[09:50:41] Completed 350000 out of 500000 steps  (70 percent)
[10:01:53] CoreStatus = 0 (0)
[10:01:53] Client-core communications error: ERROR 0x0
[10:01:53] Deleting current work unit & continuing...
[10:06:28] - Warning: Could not delete all work unit files (1): Core returned invalid code
[10:06:28] Trying to send all finished work units
[10:06:28] + No unsent completed units remaining.
[10:06:28] - Preparing to get new work unit...
[10:06:28] + Attempting to get work packet
[10:06:28] - Will indicate memory of 751 MB
[10:06:28] - Connecting to assignment server
[10:06:28] Connecting to http://assign.stanford.edu:8080/
[10:06:29] Posted data.
[10:06:29] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[10:06:29] + News From Folding@Home: Welcome to Folding@Home
[10:06:29] Loaded queue successfully.
[10:06:29] Connecting to http://171.64.65.56:8080/
[10:06:33] Posted data.
[10:06:33] Initial: 0000; - Receiving payload (expected size: 3141701)
[10:06:38] - Downloaded at ~613 kB/s
[10:06:38] - Averaged speed for that direction ~593 kB/s
[10:06:38] + Received work.
[10:06:38] + Closed connections
[10:06:43]
[10:06:43] + Processing work unit
[10:06:43] Core required: FahCore_a1.exe
[10:06:43] Core found.
[10:06:43] Working on Unit 02 [December 15 10:06:43]
[10:06:43] + Working ...
[10:06:43] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 02 -priority 96 -checkpoint 15 -verbose -lifeline 4758 -version 600'

[10:06:43]
[10:06:43] *------------------------------*
[10:06:43] Folding@Home Gromacs SMP Core
[10:06:43] Version 1.74 (November 27, 2006)
[10:06:43]
[10:06:43] Preparing to commence simulation
[10:06:43] - Ensuring status. Please wait.
[10:07:00] - Looking at optimizations...
[10:07:00] - Working with standard loops on this execution.
[10:07:00] - Previous termination of core was improper.
[10:07:00] - Going to use standard loops.
[10:07:00] - Files status OK
[10:07:02] (decompressed 563.9 percent)
[10:07:02] 9 (decompressed 563.9 percent)
[10:07:03] - Starting from initial work packet
[10:07:03]
[10:07:03] Project: 2608 (Run 0, Clone 205, Gen 47)
[10:07:03]
[10:07:03] Entering M.D.
[10:07:11] Protein: Protein
[10:07:11] Writing local files
[10:07:14] Extra SSE boost OK.
[10:07:14] 0000 steps  (0 percent)
[10:22:15] Timered checkpoint triggered.
[10:22:53] Writing local files
[10:22:53] Completed 5000 out of 500000 steps  (1 percent)
[10:37:54] Timered checkpoint triggered.
[10:37:56] Writing local files
[10:37:56] Completed 10000 out of 500000 steps  (2 percent)
[10:52:42] Writing local files
[10:52:42] Completed 15000 out of 500000 steps  (3 percent)
[11:07:04] Writing local files
[11:07:04] Completed 20000 out of 500000 steps  (4 percent)
[11:21:41] Writing local files
[11:21:42] Completed 25000 out of 500000 steps  (5 percent)
...
[02:51:15] Writing local files
[02:51:15] Completed 345000 out of 500000 steps  (69 percent)
[03:05:42] Writing local files
[03:05:42] Completed 350000 out of 500000 steps  (70 percent)
[03:16:56] CoreStatus = 0 (0)
[03:16:56] Client-core communications error: ERROR 0x0
[03:16:56] Deleting current work unit & continuing...
[03:21:30] - Warning: Could not delete all work unit files (2): Core returned invalid code
[03:21:30] Trying to send all finished work units
[03:21:30] + No unsent completed units remaining.
[03:21:30] - Preparing to get new work unit...
[03:21:30] + Attempting to get work packet
[03:21:30] - Will indicate memory of 751 MB
[03:21:30] - Connecting to assignment server
[03:21:30] Connecting to http://assign.stanford.edu:8080/
[03:21:30] Posted data.
[03:21:30] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[03:21:30] + News From Folding@Home: Welcome to Folding@Home
[03:21:30] Loaded queue successfully.
[03:21:30] Connecting to http://171.64.65.56:8080/
[03:21:35] Posted data.
[03:21:35] Initial: 0000; - Receiving payload (expected size: 3141701)
[03:21:40] - Downloaded at ~613 kB/s
[03:21:40] - Averaged speed for that direction ~597 kB/s
[03:21:40] + Received work.
[03:21:40] + Closed connections
[03:21:45]
[03:21:45] + Processing work unit
[03:21:45] Core required: FahCore_a1.exe
[03:21:45] Core found.
[03:21:45] Working on Unit 03 [December 16 03:21:45]
[03:21:45] + Working ...
[03:21:45] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 03 -priority 96 -checkpoint 15 -verbose -lifeline 4758 -version 600'

[03:21:45]
[03:21:45] *------------------------------*
[03:21:45] Folding@Home Gromacs SMP Core
[03:21:45] Version 1.74 (November 27, 2006)
[03:21:45]
[03:21:45] Preparing to commence simulation
[03:21:45] - Ensuring status. Please wait.
[03:22:02] - Looking at optimizations...
[03:22:02] - Working with standard loops on this execution.
[03:22:02] - Previous termination of core was improper.
[03:22:02] - Going to use standard loops.
[03:22:02] - Files status OK
[03:22:04] (decompressed 563.9 percent)
[03:22:04] 9 (decompressed 563.9 percent)
[03:22:04] - Starting from initial work packet
[03:22:04]
[03:22:04] Project: 2608 (Run 0, Clone 205, Gen 47)
[03:22:04]
[03:22:05] Entering M.D.
[03:22:12] Protein: Protein
[03:22:12] Writing local files
[03:22:15] Extra SSE boost OK.
[03:22:15] 0000 steps  (0 percent)
[03:37:15] Timered checkpoint triggered.
[03:37:58] Writing local files
[03:37:59] Completed 5000 out of 500000 steps  (1 percent)
[03:52:58] Timered checkpoint triggered.
[03:53:18] Writing local files
[03:53:18] Completed 10000 out of 500000 steps  (2 percent)
[04:08:14] Writing local files
[04:08:14] Completed 15000 out of 500000 steps  (3 percent)
[04:22:59] Writing local files
[04:22:59] Completed 20000 out of 500000 steps  (4 percent)
[04:37:41] Writing local files
[04:37:41] Completed 25000 out of 500000 steps  (5 percent)
[04:52:24] Writing local files
[04:52:24] Completed 30000 out of 500000 steps  (6 percent)
[05:07:03] Writing local files
[05:07:04] Completed 35000 out of 500000 steps  (7 percent)
[05:21:43] Writing local files
...
[19:42:33] Writing local files
[19:42:34] Completed 335000 out of 500000 steps  (67 percent)
[19:57:10] Writing local files
[19:57:10] Completed 340000 out of 500000 steps  (68 percent)
[20:11:46] Writing local files
[20:11:46] Completed 345000 out of 500000 steps  (69 percent)
[20:26:24] Writing local files
[20:26:24] Completed 350000 out of 500000 steps  (70 percent)
[20:34:43]
[20:34:43] Folding@home Core Shutdown: INTERRUPTED
[20:34:47] CoreStatus = 66 (102)
[20:34:47] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[20:34:47] Killing all core threads

Folding@Home Client Shutdown.


Has anyone else run into this error?
Xilikon
 
Posts: 155
Joined: Sun Dec 02, 2007 2:34 pm

Postby 7im » Mon Dec 17, 2007 3:19 am

When a WU dies in the same place 3 times, it's safe to say the WU is the problem. No need to check for others completing the WU.
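
A rough way to check the "same place" rule of thumb against a client log is sketched below. It is only an illustration, not part of the Folding@home client: the function names `failure_points` and `same_place` are made up here, and the patterns are assumptions based solely on the log lines quoted in this thread (`Project: ...`, `Completed N out of M steps`, and `CoreStatus = 0 (0)`). Other cores or client versions may log differently.

```python
import re
from collections import Counter

# Patterns assumed from the log lines quoted in this thread.
PROJECT = re.compile(r"Project: (\d+ \(Run \d+, Clone \d+, Gen \d+\))")
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")
FAILURE = re.compile(r"CoreStatus = 0 \(0\)")


def failure_points(log_text):
    """Return a (project, last_completed_step) pair for each failed attempt."""
    failures = []
    project = None
    last_step = None
    for line in log_text.splitlines():
        match = PROJECT.search(line)
        if match:
            # A new attempt starts: forget progress from the previous one.
            project, last_step = match.group(1), None
            continue
        match = PROGRESS.search(line)
        if match:
            last_step = int(match.group(1))
            continue
        if FAILURE.search(line):
            failures.append((project, last_step))
    return failures


def same_place(failures, threshold=3):
    """Return the (project, step) pairs that failed at least `threshold` times."""
    counts = Counter(failures)
    return [key for key, n in counts.items() if n >= threshold]
```

Calling `same_place(failure_points(open("FAHlog.txt").read()))` would list any WU that died at the same step three or more times (the file name is an assumption). In the log quoted above, only the first two attempts end with `CoreStatus = 0 (0)`; the third ends with a user shutdown (`CoreStatus = 66 (102)`), so this pattern would count two failures at step 350000, not three.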
7im
 
Posts: 10189
Joined: Thu Nov 29, 2007 5:30 pm
Location: Arizona

Postby Xilikon » Mon Dec 17, 2007 5:04 pm

Never mind, I kicked the Linux VM a bit and it finally completed :shock:

Something was stalling at this exact spot...
Xilikon
 
Posts: 155
Joined: Sun Dec 02, 2007 2:34 pm

