16434 (run:232 clone:3 gen:13) WORK_QUIT


Postby legalalien » Fri May 15, 2020 5:00 am

Just received "Server responded WORK_QUIT (404)" followed by "Server did not like results, dumping". There goes several days of GPU work :( The GPU is an EVGA Nvidia GT 710 2GB, no overclocking.

I suspect it may have something to do with CORE_OUTDATED received in the middle of uploading the results:

Code:
***************************** Date: 2020-05-15 *******************************
00:52:48:WU02:FS01:0x22:Completed 2475000 out of 2500000 steps (99%)
00:52:49:WU01:FS01:Connecting to 65.254.110.245:80
00:52:49:WARNING:WU01:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
00:52:49:WU01:FS01:Connecting to 18.218.241.186:80
00:52:50:WU01:FS01:Assigned to work server 206.223.170.146
00:52:50:WU01:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GK208 [GeForce GT 710 LP] from 206.223.170.146
00:52:50:WU01:FS01:Connecting to 206.223.170.146:8080
00:52:59:WU01:FS01:Downloading 35.98MiB
00:53:05:WU01:FS01:Download 44.64%
00:53:11:WU01:FS01:Download 79.73%
00:53:13:WU01:FS01:Download complete
00:53:13:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:14251 run:114 clone:0 gen:11 core:0x22 unit:0x00000013cedfaa925eab0323e1367802
02:18:53:WU02:FS01:0x22:Completed 2500000 out of 2500000 steps (100%)
02:19:26:WU02:FS01:0x22:Saving result file ../logfile_01.txt
02:19:26:WU02:FS01:0x22:Saving result file checkpointState.xml
02:19:26:WU02:FS01:0x22:Saving result file checkpt.crc
02:19:26:WU02:FS01:0x22:Saving result file positions.xtc
02:19:27:WU02:FS01:0x22:Saving result file science.log
02:19:27:WU02:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
02:19:27:WU02:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
02:19:27:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:16434 run:232 clone:3 gen:13 core:0x22 unit:0x0000001503854c135e9cbacc8fa1f2ea
02:19:27:WU02:FS01:Uploading 104.38MiB to 3.133.76.19
02:19:27:WU02:FS01:Connecting to 3.133.76.19:8080
02:19:27:WU01:FS01:Starting
02:19:27:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 982 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0 -tmax=80 -twait=900
02:19:27:WU01:FS01:Started FahCore on PID 18857
02:19:27:WU01:FS01:Core PID:18861
02:19:27:WU01:FS01:FahCore 0x22 started
02:19:28:WARNING:WU01:FS01:FahCore returned: CORE_OUTDATED (110 = 0x6e)
02:19:28:WU01:FS01:Downloading core from http://cores.foldingathome.org/v7/lin/64bit/Core_22.fah
02:19:28:WU01:FS01:Connecting to cores.foldingathome.org:80
02:19:29:WU01:FS01:FahCore 22: Downloading 3.59MiB
02:19:29:WU01:FS01:FahCore 22: Download complete
02:19:29:WU01:FS01:Valid core signature
02:19:29:WU01:FS01:Unpacked 9.32MiB to cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22
02:19:30:WU01:FS01:Starting
02:19:30:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 982 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0 -tmax=80 -twait=900
02:19:30:WU01:FS01:Started FahCore on PID 18864
02:19:30:WU01:FS01:Core PID:18868
02:19:30:WU01:FS01:FahCore 0x22 started
02:19:30:WU01:FS01:0x22:*********************** Log Started 2020-05-15T02:19:30Z ***********************
02:19:30:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
02:19:30:WU01:FS01:0x22:       Type: 0x22
02:19:30:WU01:FS01:0x22:       Core: Core22
02:19:30:WU01:FS01:0x22:    Website: https://foldingathome.org/
02:19:30:WU01:FS01:0x22:  Copyright: (c) 2009-2018 foldingathome.org
02:19:30:WU01:FS01:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
02:19:30:WU01:FS01:0x22:             <rafal.wiewiora@choderalab.org>
02:19:30:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 706 -lifeline 18864 -checkpoint 15
02:19:30:WU01:FS01:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
02:19:30:WU01:FS01:0x22:             0 -gpu 0 -tmax=80 -twait=900
02:19:30:WU01:FS01:0x22:     Config: <none>
02:19:30:WU01:FS01:0x22:************************************ Build *************************************
02:19:30:WU01:FS01:0x22:    Version: 0.0.5
02:19:30:WU01:FS01:0x22:       Date: Apr 22 2020
02:19:30:WU01:FS01:0x22:       Time: 03:57:11
02:19:30:WU01:FS01:0x22: Repository: Git
02:19:30:WU01:FS01:0x22:   Revision: 2d69202c898bd9bb3e093f51cd32bf411c2a0388
02:19:30:WU01:FS01:0x22:     Branch: HEAD
02:19:30:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
02:19:30:WU01:FS01:0x22:    Options: -std=c++11 -O3 -funroll-loops
02:19:30:WU01:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
02:19:30:WU01:FS01:0x22:       Bits: 64
02:19:30:WU01:FS01:0x22:       Mode: Release
02:19:30:WU01:FS01:0x22:************************************ System ************************************
02:19:30:WU01:FS01:0x22:        CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
02:19:30:WU01:FS01:0x22:     CPU ID: AuthenticAMD Family 15 Model 67 Stepping 3
02:19:30:WU01:FS01:0x22:       CPUs: 2
02:19:30:WU01:FS01:0x22:     Memory: 7.77GiB
02:19:30:WU01:FS01:0x22:Free Memory: 3.36GiB
02:19:30:WU01:FS01:0x22:    Threads: POSIX_THREADS
02:19:30:WU01:FS01:0x22: OS Version: 5.4
02:19:30:WU01:FS01:0x22:Has Battery: false
02:19:30:WU01:FS01:0x22: On Battery: false
02:19:30:WU01:FS01:0x22: UTC Offset: -5
02:19:30:WU01:FS01:0x22:        PID: 18868
02:19:30:WU01:FS01:0x22:        CWD: /var/lib/fahclient/work
02:19:30:WU01:FS01:0x22:         OS: Linux 5.4.0-29-generic x86_64
02:19:30:WU01:FS01:0x22:    OS Arch: AMD64
02:19:30:WU01:FS01:0x22:********************************************************************************
02:19:30:WU01:FS01:0x22:Project: 14251 (Run 114, Clone 0, Gen 11)
02:19:30:WU01:FS01:0x22:Unit: 0x00000013cedfaa925eab0323e1367802
02:19:30:WU01:FS01:0x22:Reading tar file core.xml
02:19:30:WU01:FS01:0x22:Reading tar file integrator.xml
02:19:30:WU01:FS01:0x22:Reading tar file state.xml
02:19:32:WU01:FS01:0x22:Reading tar file system.xml
02:19:33:WU01:FS01:0x22:Digital signatures verified
02:19:33:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
02:19:33:WU01:FS01:0x22:Version 0.0.5
02:21:38:WARNING:WU02:FS01:WorkServer connection failed on port 8080 trying 80
02:21:38:WU02:FS01:Connecting to 3.133.76.19:80
02:21:45:WU02:FS01:Upload 0.06%
02:22:35:WU01:FS01:0x22:Completed 0 out of 500000 steps (0%)
02:22:36:WU01:FS01:0x22:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
02:22:43:WU02:FS01:Upload 0.12%
02:22:44:WARNING:WU02:FS01:Exception: Failed to send results to work server: Transfer failed
02:22:44:WU02:FS01:Trying to send results to collection server
02:22:44:WU02:FS01:Uploading 104.38MiB to 3.21.157.11
02:22:44:WU02:FS01:Connecting to 3.21.157.11:8080
02:22:50:WU02:FS01:Upload 11.50%
02:22:56:WU02:FS01:Upload 22.33%
02:23:02:WU02:FS01:Upload 32.33%
02:23:08:WU02:FS01:Upload 42.21%
02:23:14:WU02:FS01:Upload 52.57%
02:23:20:WU02:FS01:Upload 63.41%
02:23:26:WU02:FS01:Upload 75.14%
02:23:32:WU02:FS01:Upload 86.40%
02:23:38:WU02:FS01:Upload 98.97%
02:23:39:WU02:FS01:Upload complete
02:23:39:WU02:FS01:Server responded WORK_QUIT (404)
02:23:39:WARNING:WU02:FS01:Server did not like results, dumping
02:23:39:WU02:FS01:Cleaning up


Thoughts?
legalalien
 
Posts: 3
Joined: Fri May 15, 2020 4:43 am

Re: 16434 (run:232 clone:3 gen:13) WORK_QUIT

Postby bruce » Fri May 15, 2020 5:29 am

No. The CORE_OUTDATED event had nothing to do with the server's rejection of the WU.

You're talking about project:16434 run:232 clone:3 gen:13 which was also identified as WU02:FS01. It was rejected by the Collection Server.

Perhaps it had expired. You would have to look back in the previous logs to find when that WU was assigned and downloaded.

I happen to have personal experience with a GT710, which is a very slow GPU. It is very difficult for it to complete assignments before they expire.
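
One way to look back through earlier logs for the assignment time is sketched below. This is only an illustration, not part of FAHClient: `find_assignments` and both regular expressions are hypothetical helpers written against the line formats visible in the log excerpt above (a `Date:` banner line followed by time-only entries, and a `Received Unit:` line carrying project/run/clone/gen). It assumes the log prints a new `Date:` banner whenever the day changes.

```python
import re
from datetime import datetime

# Banner line such as "***** Date: 2020-05-15 *****" (asterisk count varies).
DATE_RE = re.compile(r"\*+ Date: (\d{4}-\d{2}-\d{2}) \*+")

# "HH:MM:SS:WUxx:FSxx:Received Unit: ... project:P run:R clone:C gen:G ..."
RECV_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):(WU\d+):(FS\d+):Received Unit: .*"
    r"project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)"
)


def find_assignments(lines, project, run, clone, gen):
    """Return the datetimes at which the given WU was received.

    The log stamps each entry with a time only, so the date is taken
    from the most recent "Date:" banner seen before the entry.
    """
    wanted = (project, run, clone, gen)
    current_date = None
    hits = []
    for line in lines:
        line = line.strip()
        banner = DATE_RE.search(line)
        if banner:
            current_date = banner.group(1)
            continue
        received = RECV_RE.match(line)
        if received and current_date:
            found = tuple(int(x) for x in received.group(4, 5, 6, 7))
            if found == wanted:
                stamp = f"{current_date} {received.group(1)}"
                hits.append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))
    return hits


# Two lines copied from the log excerpt earlier in this thread.
sample_log = [
    "***************************** Date: 2020-05-15 *******************************",
    "00:53:13:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR "
    "project:14251 run:114 clone:0 gen:11 core:0x22 "
    "unit:0x00000013cedfaa925eab0323e1367802",
]

hits = find_assignments(sample_log, 14251, 114, 0, 11)
print(hits[0].isoformat())  # prints 2020-05-15T00:53:13
```

For the WU in question you would feed the older log files through the same function with `16434, 232, 3, 13`; the excerpt above does not go back far enough to contain that entry.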
bruce
 
Posts: 19701
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: 16434 (run:232 clone:3 gen:13) WORK_QUIT

Postby legalalien » Fri May 15, 2020 5:33 am

The WU was definitely completed before the deadline; I've been watching it very carefully.

ETA: Yes, I realize that the core was updated when the next WU started; I was wondering if the core restarting in the middle of the completed unit's submission had some effect on the submission itself.
legalalien
 
Posts: 3
Joined: Fri May 15, 2020 4:43 am

Re: 16434 (run:232 clone:3 gen:13) WORK_QUIT

Postby bruce » Fri May 15, 2020 6:05 am

When a WU is not returned quickly, duplicates are issued. Assuming your result was error-free, it should still have been accepted, even if a duplicate was completed before you uploaded your result. I'll have to check to see if that happened.

According to the public records, two people returned it (not counting you). One had an error and the other completed it successfully.

Your WU was rejected at 02:23:39, presumably on 2020-05-15, i.e. 2020-05-15T02:23:39.
bruce
 
Posts: 19701
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: 16434 (run:232 clone:3 gen:13) WORK_QUIT

Postby bruce » Fri May 15, 2020 6:16 am

The successful WU was issued at 2020-05-09T06:28:32, returned at 2020-05-09T18:18:46, and credited at 2020-05-09T18:25:42.
Your WU was rejected at 2020-05-15T02:23:39.
Unfortunately I can't tell when your WU was assigned to you. Can you find that date/time?
bruce
 
Posts: 19701
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: 16434 (run:232 clone:3 gen:13) WORK_QUIT

Postby legalalien » Fri May 15, 2020 2:21 pm

bruce wrote:The successful WU was issued at 2020-05-09T06:28:32, returned at 2020-05-09T18:18:46, and credited at 2020-05-09T18:25:42.
Your WU was rejected at 2020-05-15T02:23:39.
Unfortunately I can't tell when your WU was assigned to you. Can you find that date/time?


Back on May 8, it seems:

Code:
******************************* Date: 2020-05-08 *******************************
18:37:35:WU01:FS01:0x22:Completed 960000 out of 1000000 steps (96%)
19:27:11:WU01:FS01:0x22:Completed 970000 out of 1000000 steps (97%)
20:16:56:WU01:FS01:0x22:Completed 980000 out of 1000000 steps (98%)
20:18:44:WU00:FS00:0xa7:Completed 975000 out of 1250000 steps (78%)
21:05:50:WU01:FS01:0x22:Completed 990000 out of 1000000 steps (99%)
21:05:50:WU02:FS01:Connecting to 65.254.110.245:80
21:05:50:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
21:05:50:WU02:FS01:Connecting to 18.218.241.186:80
21:05:51:WU02:FS01:Assigned to work server 3.133.76.19
21:05:51:WU02:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GK208 [GeForce GT 710 LP] from 3.133.76.19
21:05:51:WU02:FS01:Connecting to 3.133.76.19:8080
21:07:08:WU02:FS01:Downloading 67.19MiB
21:07:14:WU02:FS01:Download 60.00%
21:07:17:WU02:FS01:Download complete
21:07:17:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:16434 run:232 clone:3 gen:13 core:0x22 unit:0x0000001503854c135e9cbacc8fa1f2ea


Like you said earlier, the GT 710 is a slow card by today's standards, but I was watching the progress very closely and the unit finished about 20 hours ahead of the expiration deadline (taking into account that the times are in UTC). It usually doesn't cut it *that* close, but it does struggle to complete units before the "timeout".
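
For reference, the turnaround implied by the timestamps quoted in this thread can be worked out as in the sketch below. The variable names are illustrative only, and all times are assumed to be UTC as the log stamps are. The project's actual deadline is not stated anywhere in the thread, so it is not computed here.

```python
from datetime import datetime

# Timestamps taken from the log excerpts and from bruce's post in this thread.
assigned = datetime.fromisoformat("2020-05-08T21:07:17")  # "Received Unit" line
finished = datetime.fromisoformat("2020-05-15T02:19:27")  # FINISHED_UNIT
rejected = datetime.fromisoformat("2020-05-15T02:23:39")  # WORK_QUIT (404)

# The duplicate copy that was credited, per the public records.
dup_issued = datetime.fromisoformat("2020-05-09T06:28:32")
dup_returned = datetime.fromisoformat("2020-05-09T18:18:46")

compute_time = finished - assigned   # download to end of computation
turnaround = rejected - assigned     # download to server response
duplicate_turnaround = dup_returned - dup_issued

print(compute_time)          # prints 6 days, 5:12:10
print(turnaround)            # prints 6 days, 5:16:22
print(duplicate_turnaround)  # prints 11:50:14
```

So the GT 710 held the unit for a little over six days, while the duplicate was issued and returned within about twelve hours on the following day.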

<off-topic>
bruce wrote:According to the public records, two people returned it (not counting you). One had an error and the other completed it successfully.


This is an old Linux box that I had resurrected to fold, with the idea of it sitting quietly in a corner and not being used for anything else other than looking for a possible cure. Based on the above, even if my client returns complete WUs, it is likely that someone else has already finished the same WU. I'll have to rethink whether it's a rational use of electricity, as much as I would like to help.

I'm sure this question has been asked before .. will go do some soul-googling with the morning coffee.
</off-topic>
legalalien
 
Posts: 3
Joined: Fri May 15, 2020 4:43 am

Re: 16434 (run:232 clone:3 gen:13) WORK_QUIT

Postby _r2w_ben » Fri May 15, 2020 5:58 pm

legalalien wrote:This is an old Linux box that I had resurrected to fold, with the idea of it sitting quietly in a corner and not being used for anything else other than looking for a possible cure. Based on the above, even if my client returns complete WUs, it is likely that someone else has already finished the same WU. I'll have to rethink whether it's a rational use of electricity, as much as I would like to help.


A lot of COVID19 projects run on the CPU so you can still contribute. Removing the GPU slot would let the CPU slot use both cores.
_r2w_ben
 
Posts: 277
Joined: Wed Apr 23, 2008 4:11 pm

Re: 16434 (run:232 clone:3 gen:13) WORK_QUIT

Postby bruce » Wed Jun 17, 2020 10:04 pm

I have a variety of GPUs, including one GT710. (It's the only GPU that will fit in that slot.) FAH is doing some WU/GPU optimization studies and they are distributing some small proteins to slow GPUs. (They haven't said anything about the proposed deadlines, though.) I hope they do manage to send my GT710 WUs where it can do some good and avoid burdening it with big WUs where it can't help. I'll be very happy to have the GPU idle a lot of the time, just displaying my desktop the rest of the time.
bruce
 
Posts: 19701
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: 16434 (run:232 clone:3 gen:13) WORK_QUIT

Postby JohnChodera » Fri Jun 19, 2020 8:24 pm

Oh no! I'm so sorry this happened.

> I was wondering if core restarting in the middle of the completed unit submission had some effect on the submission itself.

I don't think this should impact things.

We're about to roll out a bunch of COVID Moonshot projects that work well for older GPUs and run in just a few hours. Hopefully this will help; we very much value these contributions!

~ John Chodera // MSKCC
JohnChodera
Pande Group Member
 
Posts: 315
Joined: Fri Feb 22, 2013 10:59 pm

