Cancel WUs reporting "symtab get_symtab_handle not found" ?

Postby wuffy68 » Thu Jun 05, 2014 4:25 am

Should I cancel the job, or let it continue running?


I saw the following in my logs before my job started folding

symtab get_symtab_handle 10790540 not found


03:07:45:WU00:FS00:Started FahCore on PID 1393
03:07:47:WU00:FS00:Core PID:1411
03:07:47:WU00:FS00:FahCore 0xa5 started
03:07:50:WU01:FS00:Connecting to 171.67.108.200:8080
03:07:50:WU00:FS00:0xa5:
03:07:50:WU00:FS00:0xa5:*------------------------------*
03:07:50:WU00:FS00:0xa5:Folding@Home Gromacs SMP Core
03:07:50:WU00:FS00:0xa5:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
03:07:50:WU00:FS00:0xa5:
03:07:50:WU00:FS00:0xa5:Preparing to commence simulation
03:07:50:WU00:FS00:0xa5:- Ensuring status. Please wait.
03:07:50:WU01:FS00:Assigned to work server 128.143.231.201
03:07:50:WU01:FS00:Requesting new work unit for slot 00: RUNNING cpu:32 from 128.143.231.201
03:07:50:WU01:FS00:Connecting to 128.143.231.201:8080
03:07:57:WU01:FS00:Downloading 28.94MiB
03:07:59:WU00:FS00:0xa5:- Looking at optimizations...
03:07:59:WU00:FS00:0xa5:- Working with standard loops on this execution.
03:07:59:WU00:FS00:0xa5:- Previous termination of core was improper.
03:07:59:WU00:FS00:0xa5:- Files status OK
03:08:01:WU01:FS00:Download complete
03:08:01:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:8102 run:0 clone:17 gen:650 core:0xa5 unit:0x000002a4088988e15011d34eac4b0774
03:08:06:WU00:FS00:0xa5:- Expanded 30339624 -> 33163648 (decompressed 109.3 percent)
03:08:06:WU00:FS00:0xa5:Called DecompressByteArray: compressed_data_size=30339624 data_size=33163648, decompressed_data_size=33163648 diff=0
03:08:06:WU00:FS00:0xa5:- Digital signature verified
03:08:06:WU00:FS00:0xa5:
03:08:06:WU00:FS00:0xa5:Project: 8103 (Run 0, Clone 17, Gen 384)
03:08:06:WU00:FS00:0xa5:
03:08:08:ERROR:FS00:
03:08:08:ERROR:FS00:-------------------------------------------------------
03:08:08:ERROR:FS00:Program Folding@home, VERSION 4.5.4
03:08:08:ERROR:FS00:Source code file: gromacs-4.5.4/src/gmxlib/symtab.c, line: 107
03:08:08:ERROR:FS00:
03:08:08:ERROR:FS00:Fatal error:
03:08:08:ERROR:FS00:symtab get_symtab_handle 10790540 not found
03:08:08:ERROR:FS00:For more information and tips for troubleshooting, please check the GROMACS
03:08:08:ERROR:FS00:website at http://www.gromacs.org/Documentation/Errors
03:08:08:ERROR:FS00:-------------------------------------------------------
03:08:08:ERROR:FS00:
03:08:08:ERROR:FS00:Thanx for Using GROMACS - Have a Nice Day
03:08:15:WU00:FS00:0xa5:Entering M.D.
03:08:26:WU00:FS00:0xa5:Mapping NT from 32 to 32
03:08:32:WU00:FS00:0xa5:Completed 0 out of 250000 steps  (0%)
03:21:12:WU00:FS00:0xa5:Completed 2500 out of 250000 steps  (1%)
03:33:54:WU00:FS00:0xa5:Completed 5000 out of 250000 steps  (2%)
03:46:34:WU00:FS00:0xa5:Completed 7500 out of 250000 steps  (3%)
03:59:14:WU00:FS00:0xa5:Completed 10000 out of 250000 steps  (4%)
04:11:59:WU00:FS00:0xa5:Completed 12500 out of 250000 steps  (5%)
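
For a rough idea of how long a WU like this will take, the "Completed N out of M steps" lines in the log above are enough. The following is a minimal Python sketch, not part of FAHClient: the function name and the regular expression are assumptions based only on the line format visible in this log, and the timestamps carry no date, so only a single midnight rollover is handled.

```python
import re
from datetime import datetime

# Matches lines such as:
# 03:21:12:WU00:FS00:0xa5:Completed 2500 out of 250000 steps  (1%)
# (pattern inferred from the log in this thread; an assumption, not an official format)
PROGRESS = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):WU\d+:FS\d+:0x[0-9a-f]+:Completed (\d+) out of (\d+) steps"
)


def minutes_per_percent(log_lines):
    """Average wall-clock minutes per 1% of steps, or None if there is too little data."""
    samples = []
    for line in log_lines:
        match = PROGRESS.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            samples.append((stamp, int(match.group(2)), int(match.group(3))))
    if len(samples) < 2:
        return None

    first_time, first_steps, total_steps = samples[0]
    last_time, last_steps, _ = samples[-1]

    elapsed = (last_time - first_time).total_seconds()
    if elapsed < 0:
        # The log crossed midnight once between the first and last sample.
        elapsed += 24 * 3600

    percent_done = 100.0 * (last_steps - first_steps) / total_steps
    if percent_done <= 0:
        return None
    return elapsed / 60.0 / percent_done
```

Fed the progress lines above (0 steps at 03:08:32 through 12500 steps at 04:11:59), this gives roughly 12.7 minutes per percent, which works out to about 21 hours for the whole WU on this 32-core slot.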


There isn't much on the GROMACS site about troubleshooting symtab problems (and not much recent info on the forum either):

http://www.gromacs.org/Developer_Zone/P ... ght=symtab

EDIT: Reordered some semantics in the description...
Last edited by wuffy68 on Thu Jun 05, 2014 5:02 am, edited 1 time in total.
1x nVidia 1070, 1x nVidia 1060 3g,
1x nVidia 970, 2x nVidia 960,
1x nVidia 555, 1x AMD R7, 2x AMD 295,
6x i5 CPU-only rigs
wuffy68
 
Posts: 150
Joined: Wed Jun 04, 2014 11:06 pm
Location: Roxborough, Colorado USA

Re: Cancel WUs reporting "symtab get_symtab_handle not found

Postby P5-133XL » Thu Jun 05, 2014 4:47 am

No idea, I haven't seen this error before.
P5-133XL
 
Posts: 4034
Joined: Sun Dec 02, 2007 4:36 am
Location: Salem. OR USA

Re: Cancel WUs reporting "symtab get_symtab_handle not found

Postby PantherX » Thu Jun 05, 2014 1:35 pm

While I haven't encountered this error before, I would suggest that you restart the FAHClient. If it resumes the WU, let it finish. If it errors out, the Server will automatically be notified of the failure (assuming that it uploads the report).
PantherX
Site Moderator
 
Posts: 6321
Joined: Wed Dec 23, 2009 9:33 am

Re: Cancel WUs reporting "symtab get_symtab_handle not found

Postby wuffy68 » Thu Jun 05, 2014 2:22 pm

49% complete now ... fingers crossed :-) I restarted the client a couple of times, no other errors were reported.
wuffy68
 
Posts: 150
Joined: Wed Jun 04, 2014 11:06 pm
Location: Roxborough, Colorado USA

Re: Cancel WUs reporting "symtab get_symtab_handle not found

Postby wuffy68 » Fri Jun 06, 2014 5:08 am

Looks like the job wound up being a dupe ... the results got kicked out. I verified that it takes nearly 10 minutes before a finished WU is actually sent to Stanford. That explains my other problem with WUs not being sent: I rebooted the system before it had a chance to transmit the results.

02:34:40:WU00:FS00:Upload complete
02:34:40:WU00:FS00:Server responded GOT_ALREADY (434)
02:34:40:WARNING:WU00:FS00:Server did not like results, dumping

More Log:
02:23:46:WU00:FS00:0xa5:DynamicWrapper: Finished Work Unit: sleep=10000
02:23:56:WU00:FS00:0xa5:
02:23:56:WU00:FS00:0xa5:Finished Work Unit:
02:23:56:WU00:FS00:0xa5:- Reading up to 64407792 from "00/wudata_01.trr": Read 64407792
02:23:56:WU00:FS00:0xa5:trr file hash check passed.
02:23:56:WU00:FS00:0xa5:- Reading up to 31560756 from "00/wudata_01.xtc": Read 31560756
02:23:56:WU00:FS00:0xa5:xtc file hash check passed.
02:23:56:WU00:FS00:0xa5:edr file hash check passed.
02:23:56:WU00:FS00:0xa5:logfile size: 187927
02:23:56:WU00:FS00:0xa5:Leaving Run
02:23:57:WU00:FS00:0xa5:- Writing 96317351 bytes of core data to disk...
02:24:14:WU00:FS00:0xa5:Done: 96316839 -> 91578209 (compressed to 5.8 percent)
02:24:25:WU00:FS00:0xa5:  ... Done.
02:33:19:WU00:FS00:0xa5:- Shutting down core
02:33:19:WU00:FS00:0xa5:
02:33:19:WU00:FS00:0xa5:Folding@home Core Shutdown: FINISHED_UNIT
02:34:36:WU01:FS00:Starting
02:34:36:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a5.fah/FahCore_a5 -dir 01 -suffix 01 -version 704 -lifeline 1404 -checkpoint 15 -np 32
02:34:36:WU01:FS00:Started FahCore on PID 2076
02:34:36:WU01:FS00:Core PID:2080
02:34:36:WU01:FS00:FahCore 0xa5 started
02:34:36:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
02:34:36:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:8103 run:0 clone:17 gen:384 core:0xa5 unit:0x00000241088988e1511d1e5f01bcd840
02:34:36:WU00:FS00:Uploading 87.34MiB to 128.143.231.201
02:34:36:WU00:FS00:Connecting to 128.143.231.201:8080
02:34:37:WU01:FS00:0xa5:
02:34:37:WU01:FS00:0xa5:*------------------------------*
02:34:37:WU01:FS00:0xa5:Folding@Home Gromacs SMP Core
02:34:37:WU01:FS00:0xa5:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
02:34:37:WU01:FS00:0xa5:
02:34:37:WU01:FS00:0xa5:Preparing to commence simulation
02:34:37:WU01:FS00:0xa5:- Looking at optimizations...
02:34:37:WU01:FS00:0xa5:- Created dyn
02:34:37:WU01:FS00:0xa5:- Files status OK
02:34:39:WU01:FS00:0xa5:- Expanded 30294185 -> 33130012 (decompressed 109.3 percent)
02:34:39:WU01:FS00:0xa5:Called DecompressByteArray: compressed_data_size=30294185 data_size=33130012, decompressed_data_size=33130012 diff=0
02:34:39:WU01:FS00:0xa5:- Digital signature verified
02:34:39:WU01:FS00:0xa5:
02:34:39:WU01:FS00:0xa5:Project: 8105 (Run 0, Clone 19, Gen 416)
02:34:39:WU01:FS00:0xa5:
02:34:39:WU01:FS00:0xa5:Assembly optimizations on if available.
02:34:39:WU01:FS00:0xa5:Entering M.D.
02:34:40:WU00:FS00:Upload complete
02:34:40:WU00:FS00:Server responded GOT_ALREADY (434)
02:34:40:WARNING:WU00:FS00:Server did not like results, dumping
02:34:40:WU00:FS00:Cleaning up
wuffy68
 
Posts: 150
Joined: Wed Jun 04, 2014 11:06 pm
Location: Roxborough, Colorado USA

Re: Cancel WUs reporting "symtab get_symtab_handle not found

Postby PantherX » Fri Jun 06, 2014 5:14 am

According to the WU Database, you transmitted it and got points for it (you would obviously not get points for it again if it is re-uploaded):
Hi wuffy68 (team 224497),
Your WU (P8103 R0 C17 G384) was added to the stats database on 2014-06-04 09:14:31 for 234186 points of credit.
PantherX
Site Moderator
 
Posts: 6321
Joined: Wed Dec 23, 2009 9:33 am

Re: Cancel WUs reporting "symtab get_symtab_handle not found

Postby bruce » Fri Jun 06, 2014 5:19 am

I'll bet you did a backup/restore, thinking that you could resume from an earlier point without considering the fact that the FAH EULA explicitly prohibits manipulating its files.
bruce
 
Posts: 22616
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Cancel WUs reporting "symtab get_symtab_handle not found

Postby wuffy68 » Fri Jun 06, 2014 6:00 am

bruce wrote:I'll bet you did a backup/restore, thinking that you could resume from an earlier point without considering the fact that the FAH EULA explicitly prohibits manipulating its files.


Hmmm - I wouldn't consider this manipulating the files ... it's really like recovering (continuing) from the last valid checkpoint. I don't think the FAH server knows anything about the progress of the job until the processed WU is sent, or the WU expires...

To me this is similar to a physical server frying a CPU when a job is at 90% complete. After the user recovers the system from a system backup taken when the job was only 50% complete, all should be well (as long as the job finishes within the scheduled window)...

That being said, I don't want to be violating the FAH rules - definitely not my intent. Do you have a specific article of the EULA that references allowed backup recovery methods?
wuffy68
 
Posts: 150
Joined: Wed Jun 04, 2014 11:06 pm
Location: Roxborough, Colorado USA

Re: Cancel WUs reporting "symtab get_symtab_handle not found

Postby 7im » Fri Jun 06, 2014 2:56 pm

No backup or recovery methods are needed on a stable system. The FAH checkpoint is all the client needs.
7im
 
Posts: 14648
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Cancel WUs reporting "symtab get_symtab_handle not found

Postby PantherX » Fri Jun 06, 2014 9:21 pm

wuffy68 wrote:...That being said, I don't want to be violating the FAH rules - definitely not my intent. Do you have a specific article of the EULA that references allowed backup recovery methods?

You may want to refer to the EULA License (http://folding.stanford.edu/home/License/) and the Best Practices (https://folding.stanford.edu/home/faq/f ... practices/).

BTW, backing up the F@H data while it's running isn't a good idea, since you have no way of knowing whether it is currently writing a checkpoint (IIRC, a donor was able to guess whether the checkpoint had been written by using the file modified time and the checkpoint value). Thus, if you have to make a back-up, pausing the Slots and exiting the FAHClient would be the safest option. However, as 7im stated, this isn't usually needed under normal conditions.
PantherX
Site Moderator
 
Posts: 6321
Joined: Wed Dec 23, 2009 9:33 am

Re: Cancel WUs reporting "symtab get_symtab_handle not found

Postby wuffy68 » Sat Jun 07, 2014 6:00 am

PantherX wrote:Thus, if you have to make a back-up, pausing the Slots and exiting the FAHClient would be the safest option. However, as 7im stated, this isn't usually needed under normal conditions.


Yes - that's exactly what I'm doing to avoid corruption. I agree, taking a "Snapshot" on a live folding session could lead to general badness, so here is what my backups consist of:

1. I pause the folding job, wait ~120 seconds
2. stop the FAHClient.
3. Then I create an "Image" of the system and give it a name based on job ID, time and percent complete.
. a. Project8115Run0Clone18Gen426-13pct-0606141216
4. The imaging process actually does an automatic soft shutdown (really an extended reboot) of the instance before making a copy of it.
. a. If you do try this, never shut down your Spot Instance manually (this will permanently terminate it, and you'll have to re-deploy from a backup instance, if you have one)
5. When the system comes back up, I restart the FAHClient manually and unpause the job
6. FAH safely hits the ground running.
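
The naming scheme from step 3 is easy to generate in a script. This is only a sketch under assumptions: the timestamp layout is read as month, day, two-digit year, hour, minute (the example "0606141216" is ambiguous between month-first and day-first), the function names are made up for illustration, and the instance ID in the usage note is a placeholder. The second function only builds the argument list for `aws ec2 create-image`; it does not run anything.

```python
from datetime import datetime


def image_name(project, run, clone, gen, percent, when):
    """Build a name like Project8115Run0Clone18Gen426-13pct-0606141216."""
    # Assumed layout: MMDDYYhhmm, inferred from the single example in this post.
    stamp = when.strftime("%m%d%y%H%M")
    return f"Project{project}Run{run}Clone{clone}Gen{gen}-{percent}pct-{stamp}"


def create_image_argv(instance_id, name):
    """Argument list for the AWS CLI image call (built here, not executed).

    By default `aws ec2 create-image` reboots the instance before imaging,
    which matches the automatic soft shutdown described in step 4.
    """
    return ["aws", "ec2", "create-image", "--instance-id", instance_id, "--name", name]
```

For example, `image_name(8115, 0, 18, 426, 13, datetime(2014, 6, 6, 12, 16))` reproduces the name shown in step 3, and the resulting list from `create_image_argv` could be handed to `subprocess.run` once the FAHClient has been paused and stopped.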


I'm doing this 3-4 times per day (at 1%, 25%, 50%, and 75% complete), with each pass taking roughly 5 minutes ... the goal is eventually to script this through the AWS CLI, so it can be a hands-off operation. If the instance fails, I can just manually recover from the latest image and finish the job.

Hopefully, this doesn't happen more than once every couple of days, and it's certainly cost effective compared with Reserved Instances or buying a server capable of 32 virtual cores.

When I have time, I'll post a clear write-up of the process with screenshots.
wuffy68
 
Posts: 150
Joined: Wed Jun 04, 2014 11:06 pm
Location: Roxborough, Colorado USA

Re: Cancel WUs reporting "symtab get_symtab_handle not found

Postby bruce » Sat Jun 07, 2014 7:35 pm

I guess the fundamental question is whether you've ever restored a backup. If so, why?

Do you have more than one slot, and if so, what happened to the (other) WUs that were restored but didn't need to be?

(Taking a backup doesn't violate any recommendations or policy, but restoring one does and can lead to unexpected consequences.)
bruce
 
Posts: 22616
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Cancel WUs reporting "symtab get_symtab_handle not found

Postby wuffy68 » Sun Jun 08, 2014 7:05 am

bruce wrote:(Taking a backup doesn't violate any recommendations or policy, but restoring one does and can lead to unexpected consequences.)


So, I figured out what happened (having exposed potential bugs with AWS EC2 and FAH, or just known features, since I'm pretty green):

I only have one slot configured for single BigAdv jobs...

After a job completed, I shut the system down as I normally do for a backup image. I didn't realize it takes over 10 minutes before the previous WU results are actually transmitted back to FAH. In shutting down the instance, I somehow interrupted the transmission process.

When the system came back after the reboot (not a restore), it started working on a new job, but never transmitted the original results.
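
One way to avoid this is to check the client log for a finished WU that has not yet reported "Upload complete" before shutting the instance down. The sketch below is an assumption-laden illustration, not a FAHClient feature: the patterns are inferred from the log lines posted earlier in this thread, the function name is hypothetical, and since the client reuses IDs such as WU00 and WU01, the lines should be passed in log order.

```python
import re

# A WU counts as finished once the core reports "Finished Work Unit"
# or the client logs "FahCore returned: FINISHED_UNIT".
FINISHED = re.compile(
    r"^\d{2}:\d{2}:\d{2}:(WU\d+):FS\d+:"
    r"(?:0x[0-9a-f]+:.*Finished Work Unit|FahCore returned: FINISHED_UNIT)"
)
# The upload is done once the client logs "Upload complete" for that WU.
UPLOADED = re.compile(r"^\d{2}:\d{2}:\d{2}:(WU\d+):FS\d+:Upload complete")


def units_awaiting_upload(log_lines):
    """Return the set of WU IDs that finished but have no later 'Upload complete'."""
    pending = set()
    for line in log_lines:
        match = FINISHED.match(line)
        if match:
            pending.add(match.group(1))
            continue
        match = UPLOADED.match(line)
        if match:
            pending.discard(match.group(1))
    return pending
```

If the returned set is not empty, the results are still sitting on disk (or mid-transfer), and shutting down at that point risks the situation described above.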

Then things got really confusing ... I mounted an older image as a volume (/dev/sdb2) to my working Instance to find the exact PRCG number. On a subsequent backup, somehow AWS became confused and insisted on booting the second volume /dev/sdb2 instead of /dev/sdb1 ... so when I started folding, it started folding the job stored on /dev/sdb2 instead of what I expected it to fold on /dev/sdb1.

I finally gave up on it. I have since processed multiple BigAdv jobs using my current backup system of 6-8 hour incremental backups. I'm avoiding situations where unsent jobs exist alongside jobs in progress, and I'm avoiding making backup instances of two volumes with two root partitions.

I have yet to see forced termination (outbid on spot instance), where I can try out my recovery procedure - which will go as follows:

1. System is terminated by AWS while the FAH job is at 93% complete.
2. I create a new spot instance based on a system image taken when the job was only 75% complete.
3. Let the job run to completion. Since I would still have a couple of days to complete it, I could try again if outbid, or take another image closer to the completion point (like at 90%).
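
The cost of the recovery procedure above can be put into rough numbers. This is just arithmetic, assuming a steady rate; the figure of about 12.7 minutes per percent comes from the log in the first post (12500 steps of 250000 between 03:08:32 and 04:11:59) and may not hold for other projects or instance types. The function names are made up for illustration.

```python
def rerun_hours(image_percent, minutes_per_percent):
    """Hours needed to go from the imaged progress to 100%."""
    return (100 - image_percent) * minutes_per_percent / 60.0


def lost_hours(failed_percent, image_percent, minutes_per_percent):
    """Hours of folding thrown away between the image and the termination."""
    return (failed_percent - image_percent) * minutes_per_percent / 60.0
```

With the scenario above (terminated at 93%, image taken at 75%) and 12.7 minutes per percent, `rerun_hours(75, 12.7)` is about 5.3 hours and `lost_hours(93, 75, 12.7)` is about 3.8 hours. An image taken at 90% would cut the lost work to well under an hour.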

I'll update this thread once a forced termination happens.
wuffy68
 
Posts: 150
Joined: Wed Jun 04, 2014 11:06 pm
Location: Roxborough, Colorado USA

