Project:9209 run:0 clone:5 gen:88

Moderators: Site Moderators, FAHC Science Team

davidcoton
Posts: 1102
Joined: Wed Nov 05, 2008 3:19 pm
Location: Cambridge, UK

Project:9209 run:0 clone:5 gen:88

Post by davidcoton »

A new type of error to me! -- see second // comment
Win Vista running an nV980 with drivers 359.00

12:55:21:WU02:FS01:Connecting to 171.67.108.45:80
12:55:21:WU02:FS01:Assigned to work server 171.64.65.104
12:55:21:WU02:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GM204 [GeForce GTX 980] from 171.64.65.104
12:55:21:WU02:FS01:Connecting to 171.64.65.104:8080
12:55:22:WU02:FS01:Downloading 10.04MiB
12:55:28:WU02:FS01:Download 59.79%
12:55:31:WU02:FS01:Download complete
12:55:31:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:9209 run:0 clone:5 gen:88 core:0x21 unit:0x00000092664f2dd055edef39c3d83e43
12:55:42:WU02:FS01:Starting
12:55:42:WU02:FS01:Running FahCore: "C:\Program Files\FAHClient/FAHCoreWrapper.exe" C:/Users/David/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/x86/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 02 -suffix 01 -version 704 -lifeline 9144 -checkpoint 5 -gpu 0 -gpu-vendor nvidia
12:55:42:WU02:FS01:Started FahCore on PID 9516
12:55:43:WU02:FS01:Core PID:12972
12:55:43:WU02:FS01:FahCore 0x21 started
12:55:45:WU02:FS01:0x21:*********************** Log Started 2015-12-29T12:55:45Z ***********************
12:55:45:WU02:FS01:0x21:Project: 9209 (Run 0, Clone 5, Gen 88)
12:55:45:WU02:FS01:0x21:Unit: 0x00000092664f2dd055edef39c3d83e43
12:55:45:WU02:FS01:0x21:CPU: 0x00000000000000000000000000000000
12:55:45:WU02:FS01:0x21:Machine: 1
12:55:45:WU02:FS01:0x21:Reading tar file core.xml
12:55:45:WU02:FS01:0x21:Reading tar file system.xml
12:55:48:WU02:FS01:0x21:Reading tar file integrator.xml
12:55:48:WU02:FS01:0x21:Reading tar file state.xml
12:55:50:WU02:FS01:0x21:Digital signatures verified
12:55:50:WU02:FS01:0x21:Folding@home GPU Core21 Folding@home Core
12:55:50:WU02:FS01:0x21:Version 0.0.14
12:58:29:WU02:FS01:0x21:Completed 0 out of 2500000 steps (0%)
12:58:29:WU02:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
13:05:46:WU02:FS01:0x21:Completed 25000 out of 2500000 steps (1%)
...
14:03:31:WU02:FS01:0x21:Completed 225000 out of 2500000 steps (9%)

// Possible PC memory crisis here (Testing viewer wrapper)

14:18:25:WARNING:WU02:FS01:FahCore returned an unknown error code which probably indicates that it crashed
14:18:25:WARNING:WU02:FS01:FahCore returned: UNKNOWN_ENUM (127 = 0x7f)
14:18:25:WU02:FS01:Starting
14:18:26:WU02:FS01:Running FahCore: "C:\Program Files\FAHClient/FAHCoreWrapper.exe" C:/Users/David/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/x86/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 02 -suffix 01 -version 704 -lifeline 9144 -checkpoint 5 -gpu 0 -gpu-vendor nvidia
14:18:27:WU02:FS01:Started FahCore on PID 13168
14:18:28:WU02:FS01:Core PID:13996
14:18:28:WU02:FS01:FahCore 0x21 started
14:18:31:WU02:FS01:0x21:*********************** Log Started 2015-12-29T14:18:31Z ***********************
14:18:31:WU02:FS01:0x21:Project: 9209 (Run 0, Clone 5, Gen 88)
14:18:31:WU02:FS01:0x21:Unit: 0x00000092664f2dd055edef39c3d83e43
14:18:31:WU02:FS01:0x21:CPU: 0x00000000000000000000000000000000
14:18:31:WU02:FS01:0x21:Machine: 1
14:18:31:WU02:FS01:0x21:Digital signatures verified
14:18:31:WU02:FS01:0x21:Folding@home GPU Core21 Folding@home Core
14:18:31:WU02:FS01:0x21:Version 0.0.14
14:18:32:WU02:FS01:0x21:  Found a checkpoint file
14:20:56:WU02:FS01:0x21:Completed 200000 out of 2500000 steps (8%)
14:20:56:WU02:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
14:36:37:WU02:FS01:0x21:Completed 225000 out of 2500000 steps (9%)
...
16:32:27:WU02:FS01:0x21:Completed 650000 out of 2500000 steps (26%)
******************************* Date: 2015-12-29 *******************************
16:39:11:WU02:FS01:0x21:Completed 675000 out of 2500000 steps (27%)

// Not seen this before!

16:41:57:WU02:FS01:0x21:ERROR:SmartPointer: Can't dereference a NULL pointer!
16:42:21:WARNING:WU02:FS01:FahCore returned an unknown error code which probably indicates that it crashed
16:42:21:WARNING:WU02:FS01:FahCore returned: UNKNOWN_ENUM (127 = 0x7f)
16:42:22:WU02:FS01:Starting
16:42:22:WU02:FS01:Running FahCore: "C:\Program Files\FAHClient/FAHCoreWrapper.exe" C:/Users/David/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/x86/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 02 -suffix 01 -version 704 -lifeline 9144 -checkpoint 5 -gpu 0 -gpu-vendor nvidia
16:42:22:WU02:FS01:Started FahCore on PID 10844
16:42:22:WU02:FS01:Core PID:13052
16:42:22:WU02:FS01:FahCore 0x21 started
16:42:33:WU02:FS01:0x21:*********************** Log Started 2015-12-29T16:42:33Z ***********************
16:42:41:WU02:FS01:0x21:Project: 9209 (Run 0, Clone 5, Gen 88)
16:42:41:WU02:FS01:0x21:Unit: 0x00000092664f2dd055edef39c3d83e43
16:42:41:WU02:FS01:0x21:CPU: 0x00000000000000000000000000000000
16:42:41:WU02:FS01:0x21:Machine: 1
16:42:41:WU02:FS01:0x21:Digital signatures verified
16:42:41:WU02:FS01:0x21:Folding@home GPU Core21 Folding@home Core
16:42:41:WU02:FS01:0x21:Version 0.0.14
16:43:03:WU02:FS01:0x21:  Found a checkpoint file
16:44:08:WU02:FS01:0x21:ERROR:exception: bad allocation
16:44:08:WU02:FS01:0x21:Saving result file logfile_01.txt
16:44:08:WU02:FS01:0x21:Saving result file log.txt
16:44:08:WU02:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
16:44:09:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
16:44:09:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9209 run:0 clone:5 gen:88 core:0x21 unit:0x00000092664f2dd055edef39c3d83e43
16:44:09:WU02:FS01:Uploading 3.29KiB to 171.64.65.104
16:44:09:WU02:FS01:Connecting to 171.64.65.104:8080
16:44:10:WU02:FS01:Upload complete
16:44:10:WU02:FS01:Server responded WORK_ACK (400)
16:44:10:WU02:FS01:Cleaning up

And the config:

*********************** Log Started 2015-12-22T10:05:07Z ***********************
10:05:07:************************* Folding@home Client *************************
10:05:07:      Website: http://folding.stanford.edu/
10:05:07:    Copyright: (c) 2009-2014 Stanford University
10:05:07:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
10:05:07:         Args: --open-web-control
10:05:07:       Config: C:/Users/David/AppData/Roaming/FAHClient/config.xml
10:05:07:******************************** Build ********************************
10:05:07:      Version: 7.4.4
10:05:07:         Date: Mar 4 2014
10:05:07:         Time: 20:26:54
10:05:07:      SVN Rev: 4130
10:05:07:       Branch: fah/trunk/client
10:05:07:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
10:05:07:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
10:05:07:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
10:05:07:     Platform: win32 XP
10:05:07:         Bits: 32
10:05:07:         Mode: Release
10:05:07:******************************* System ********************************
10:05:07:          CPU: AMD Athlon(tm) II X4 640 Processor
10:05:07:       CPU ID: AuthenticAMD Family 16 Model 5 Stepping 3
10:05:07:         CPUs: 4
10:05:07:       Memory: 3.12GiB
10:05:07:  Free Memory: 1.12GiB
10:05:07:      Threads: WINDOWS_THREADS
10:05:07:   OS Version: 6.0
10:05:07:  Has Battery: false
10:05:07:   On Battery: false
10:05:07:   UTC Offset: 0
10:05:07:          PID: 9144
10:05:07:          CWD: C:/Users/David/AppData/Roaming/FAHClient
10:05:07:           OS: Windows Vista (TM) Home Premium Service Pack 2
10:05:07:      OS Arch: X86
10:05:07:         GPUs: 1
10:05:07:        GPU 0: NVIDIA:5 GM204 [GeForce GTX 980]
10:05:07:         CUDA: 5.2
10:05:07:  CUDA Driver: 7050
10:05:07:Win32 Service: false
10:05:07:***********************************************************************
10:05:07:<config>
10:05:07:  <!-- Folding Core -->
10:05:07:  <checkpoint v='5'/>
10:05:07:
10:05:07:  <!-- HTTP Server -->
10:05:07:  <allow v='127.0.0.1 192.168.1.0/24'/>
10:05:07:  <deny v='0.0.0.0/0'/>
10:05:07:  <http-addresses v='127.0.0.1:7396 david-ubuntu:7396'/>
10:05:07:
10:05:07:  <!-- Network -->
10:05:07:  <proxy v=':8080'/>
10:05:07:
10:05:07:  <!-- Remote Command Server -->
10:05:07:  <password v='*******'/>
10:05:07:
10:05:07:  <!-- Slot Control -->
10:05:07:  <power v='full'/>
10:05:07:
10:05:07:  <!-- User Information -->
10:05:07:  <passkey v='********************************'/>
10:05:07:  <user v='davidcoton'/>
10:05:07:
10:05:07:  <!-- Web Server -->
10:05:07:  <web-allow v='127.0.0.1 168.192.1.0/24'/>
10:05:07:
10:05:07:  <!-- Folding Slots -->
10:05:07:  <slot id='0' type='CPU'>
10:05:07:    <client-type v='advanced'/>
10:05:07:    <cpus v='3'/>
10:05:07:    <paused v='true'/>
10:05:07:  </slot>
10:05:07:  <slot id='1' type='GPU'>
10:05:07:    <client-type v='advanced'/>
10:05:07:    <max-packet-size v='big'/>
10:05:07:    <paused v='true'/>
10:05:07:  </slot>
10:05:07:</config>
10:05:07:Trying to access database...
10:05:07:Successfully acquired database lock
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project:9209 run:0 clone:5 gen:88

Post by bruce »

I've seen both of those errors before. There's no official explanation except that it has something to do with being unable to allocate memory.

What do you mean by "Testing viewer wrapper"? At this point, all I can assume is that the PC memory crisis was something that you caused unless you have defective hardware.
jimerickson
Posts: 533
Joined: Tue May 27, 2008 11:56 pm
Hardware configuration: Parts:
Asus H370 Mining Master motherboard (X2)
Patriot Viper DDR4 memory 16gb stick (X4)
Nvidia GeForce GTX 1080 gpu (X16)
Intel Core i7 8700 cpu (X2)
Silverstone 1000 watt psu (X4)
Veddha 8 gpu miner case (X2)
Thermaltake hsf (X2)
Ubit riser card (X16)
Location: ames, iowa

Re: Project:9209 run:0 clone:5 gen:88

Post by jimerickson »

I think he is referring to FAH_WrapperGPUTrajectory.py by ChristianVirtual, now on GitHub.
ChristianVirtual
Posts: 1596
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: Project:9209 run:0 clone:5 gen:88

Post by ChristianVirtual »

It could be that the Python wrapper leaks memory. I thought I released the internal lists during processing; that might not be sufficient. At one stage we chased a memory leak in a Core or project. At that time I raised some false alarms too, due to a predecessor of this script.

Anyway: I switched my client to adv and hope to get a similar WU assigned (actually I got one but had to go to work), and hope to see some hints of leakage. Otherwise I need to tweak the processing and reduce the memory footprint.
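
One way to look for that kind of leak in a Python script is the standard library's `tracemalloc` module: take a snapshot, repeat the suspect processing a few times, and compare. The sketch below is only an illustration; `leaky_pass` and `_frames` are hypothetical stand-ins, not code from the wrapper.

```python
import tracemalloc


def top_growth(work, rounds=3, limit=5):
    """Run work() repeatedly and return the source lines whose allocations grew most."""
    tracemalloc.start()
    work()  # warm-up pass, so one-time allocations are not counted as growth
    before = tracemalloc.take_snapshot()
    for _ in range(rounds):
        work()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Differences are grouped by source line, biggest change first
    return after.compare_to(before, "lineno")[:limit]


# Hypothetical stand-in for one processing pass that forgets to release a list
_frames = []


def leaky_pass():
    _frames.append([0.0] * 10000)  # kept forever, so memory grows on every pass


if __name__ == "__main__":
    for stat in top_growth(leaky_pass):
        print(stat)
```

A line that keeps showing a positive size difference across passes is a candidate for the leak; a flat result suggests the growth comes from somewhere else.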
Please contribute your logs to http://ppd.fahmm.net
davidcoton
Posts: 1102
Joined: Wed Nov 05, 2008 3:19 pm
Location: Cambridge, UK

Re: Project:9209 run:0 clone:5 gen:88

Post by davidcoton »

I saw high memory use (>500M) when serving two Viewers simultaneously. I looked for continuously rising mem use. I did not definitively see that, but can't rule it out. Lower memory use would be better -- or a check of overall memory available because the non-essential program (wrapper+viewer) should not cause a problem for the mission-critical core.
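
A check like that could be sketched roughly as follows. Everything here is an assumption for illustration: the function names are hypothetical, the per-viewer figure is only a guess based on the >500M seen with two Viewers, the reserve size is arbitrary, and the probe relies on the third-party `psutil` package being installed.

```python
MIB = 1024 * 1024


def available_memory_bytes():
    """Best-effort probe of free memory; returns None if it cannot be determined."""
    try:
        import psutil  # third-party package, assumed to be installed
    except ImportError:
        return None
    return psutil.virtual_memory().available


def may_accept_viewer(available, active_viewers,
                      per_viewer=250 * MIB, reserve=1024 * MIB):
    """Decide whether the wrapper should serve one more viewer.

    per_viewer and reserve are illustrative guesses, not measured values.
    """
    if available is None:
        # Memory state unknown: stay conservative and serve one viewer at most
        return active_viewers == 0
    # Accept only if the core would still have its reserve afterwards
    return available - per_viewer >= reserve


if __name__ == "__main__":
    print(may_accept_viewer(available_memory_bytes(), active_viewers=1))
```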

The null pointer dereference could be a consequence of an earlier out-of-memory condition, but ideally the Core code should detect that condition and handle it, so that it becomes a Warning rather than an Error. It is only an issue because the Core failed to recover. If the cause is Out of Memory, that in itself should be detected and should raise a Warning. (That can be difficult if there is no memory left to use for generating the Warning!)
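
A common workaround for the "no memory left to report the problem" case is to set aside a small block at startup and free it when an allocation fails, so the reporting code has room to run. The sketch below shows the general pattern in Python purely as an illustration; it is not how Core21 works, and `EmergencyReserve` and `run_step` are hypothetical names.

```python
import logging

log = logging.getLogger("core-sketch")


class EmergencyReserve:
    """Memory set aside at startup, to be given back when an allocation fails."""

    def __init__(self, size=1024 * 1024):
        self._block = bytearray(size)

    def release(self):
        self._block = None  # drop the reference so the memory can be reused


def run_step(step, reserve):
    """Run one step; on out-of-memory, warn and report failure instead of crashing."""
    try:
        step()
        return True
    except MemoryError:
        reserve.release()  # free the reserve first, so logging has room to work
        log.warning("Out of memory during step; stopping at the last checkpoint")
        return False
```

The caller can then decide whether to retry from the checkpoint or give up, rather than finding out later through a null pointer.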
toTOW
Site Moderator
Posts: 6312
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France
Contact:

Re: Project:9209 run:0 clone:5 gen:88

Post by toTOW »

The WU might be bad ... there's another report of failure from another user in the DB ...

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
davidcoton
Posts: 1102
Joined: Wed Nov 05, 2008 3:19 pm
Location: Cambridge, UK

Re: Project:9209 run:0 clone:5 gen:88

Post by davidcoton »

Or it might just be the false positive checking from an old version of Core21. I've had several that look like that on 9208/9.
ChristianVirtual
Posts: 1596
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: Project:9209 run:0 clone:5 gen:88

Post by ChristianVirtual »

There was (and still is, a bit) a memory leak in the XML parser. I modified it and have a branch open, "issue-with-memory-leak" (v2.3).
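
For readers who hit the same problem: a usual way to keep an XML parser's footprint down in Python is to stream the file with `xml.etree.ElementTree.iterparse` and clear each element once it has been used. This is a generic sketch, not necessarily what the branch does; the `Particle` tag and `mass` attribute are assumptions about the file layout.

```python
import io
import xml.etree.ElementTree as ET


def sum_attribute(source, tag, attribute):
    """Stream an XML source and sum a numeric attribute without keeping the tree."""
    count = 0
    total = 0.0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            total += float(elem.get(attribute, 0.0))
            # Drop this element's attributes and children once it is processed;
            # only an empty shell stays attached to its parent.
            elem.clear()
    return count, total


if __name__ == "__main__":
    # Tiny made-up document standing in for a real system.xml
    sample = (b"<System><Particles>"
              b"<Particle mass='1.0'/><Particle mass='12.0'/>"
              b"</Particles></System>")
    print(sum_attribute(io.BytesIO(sample), "Particle", "mass"))
```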

I could open a 9208 (0,9,81) with 33'000 atoms and bonds.
But the memory footprint is better:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                    
17740 fahclie+  39  19 29.435g 1.033g 104248 R 100.0 13.6 104:39.61 FahCore_21                                                 
24694 fahclie+  39  19 29.300g 368584  85044 R 100.0  4.6 304:48.61 FahCore_21                                                 
31348 cl        20   0   76208  29648   3248 R 100.0  0.4   2:39.48 python3.4 

Also keep in mind that if you run the viewer on the same machine, it also needs quite some memory to actually render the protein. It is preferable to run the viewer from a different machine (that will also reduce the impact on the folding itself by not bothering the GPU with rendering).

I will check more next year ... family time ...