[SOLVED] FahCore returned: WU_STALLED (127 = 0x7f) | A100X

It seems that a lot of GPU problems revolve around specific versions of drivers. Though NVidia has their own support structure, you can often learn from information reported by others who fold.


benc
Posts: 8
Joined: Fri Jul 17, 2020 2:28 pm

[SOLVED] FahCore returned: WU_STALLED (127 = 0x7f) | A100X

Post by benc »

I'm having trouble getting the GPUs on my DGX to fold, though the CPU is folding fine. It should be possible based on this thread: viewtopic.php?f=80&t=36079

Output of head -n 200 log.txt:

Code:

*********************** Log Started 2020-10-29T13:03:00Z ***********************
13:03:00:******************************* libFAH ********************************
13:03:00:         Date: Oct 20 2020
13:03:00:         Time: 20:36:41
13:03:00:     Revision: 5ca109d295a6245e2a2f590b3d0085ad5e567aeb
13:03:00:       Branch: master
13:03:00:     Compiler: GNU 4.9.4
13:03:00:      Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
13:03:00:               -funroll-loops
13:03:00:     Platform: linux2 5.8.0-1-amd64
13:03:00:         Bits: 64
13:03:00:         Mode: Release
13:03:00:****************************** FAHClient ******************************
13:03:00:      Version: 7.6.21
13:03:00:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
13:03:00:    Copyright: 2020 foldingathome.org
13:03:00:     Homepage: https://foldingathome.org/
13:03:00:         Date: Oct 20 2020
13:03:00:         Time: 20:38:59
13:03:00:     Revision: 6efbf0e138e22d3963e6a291f78dcb9c6422a278
13:03:00:       Branch: master
13:03:00:     Compiler: GNU 4.9.4
13:03:00:      Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
13:03:00:               -funroll-loops
13:03:00:     Platform: linux2 5.8.0-1-amd64
13:03:00:         Bits: 64
13:03:00:         Mode: Release
13:03:00:         Args: --config=config.xml
13:03:00:       Config: /var/lib/home/scp/tmp/folding/usr/bin/config.xml
13:03:00:******************************** CBang ********************************
13:03:00:         Date: Oct 20 2020
13:03:00:         Time: 18:38:01
13:03:00:     Revision: 7e4ce85225d7eaeb775e87c31740181ca603de60
13:03:00:       Branch: master
13:03:00:     Compiler: GNU 4.9.4
13:03:00:      Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
13:03:00:               -funroll-loops -fPIC
13:03:00:     Platform: linux2 5.8.0-1-amd64
13:03:00:         Bits: 64
13:03:00:         Mode: Release
13:03:00:******************************* System ********************************
13:03:00:          CPU: AMD EPYC 7742 64-Core Processor
13:03:00:       CPU ID: AuthenticAMD Family 23 Model 49 Stepping 0
13:03:00:         CPUs: 256
13:03:00:       Memory: 1007.70GiB
13:03:00:  Free Memory: 928.82GiB
13:03:00:      Threads: POSIX_THREADS
13:03:00:   OS Version: 5.4
13:03:00:  Has Battery: false
13:03:00:   On Battery: false
13:03:00:   UTC Offset: 0
13:03:00:          PID: 121492
13:03:00:          CWD: /var/lib/home/scp/tmp/folding/usr/bin
13:03:00:           OS: Linux 5.4.0-48-generic x86_64
13:03:00:      OS Arch: AMD64
13:03:00:         GPUs: 8
13:03:00:        GPU 0: Bus:7 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:        GPU 1: Bus:15 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:        GPU 2: Bus:71 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:        GPU 3: Bus:78 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:        GPU 4: Bus:135 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:        GPU 5: Bus:144 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:        GPU 6: Bus:183 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:        GPU 7: Bus:189 Slot:0 Func:0 NVIDIA:8 GA100 [GRID A100X]
13:03:00:CUDA Device 0: Platform:0 Device:0 Bus:7 Slot:0 Compute:8.0 Driver:11.0
13:03:00:CUDA Device 1: Platform:0 Device:1 Bus:15 Slot:0 Compute:8.0 Driver:11.0
13:03:00:CUDA Device 2: Platform:0 Device:2 Bus:71 Slot:0 Compute:8.0 Driver:11.0
13:03:00:CUDA Device 3: Platform:0 Device:3 Bus:78 Slot:0 Compute:8.0 Driver:11.0
13:03:00:CUDA Device 4: Platform:0 Device:4 Bus:135 Slot:0 Compute:8.0 Driver:11.0
13:03:00:CUDA Device 5: Platform:0 Device:5 Bus:144 Slot:0 Compute:8.0 Driver:11.0
13:03:00:CUDA Device 6: Platform:0 Device:6 Bus:183 Slot:0 Compute:8.0 Driver:11.0
13:03:00:CUDA Device 7: Platform:0 Device:7 Bus:189 Slot:0 Compute:8.0 Driver:11.0
13:03:00:       OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
13:03:00:               libOpenCL.so: cannot open shared object file: No such file or
13:03:00:               directory
13:03:00:***********************************************************************
13:03:00:<config>
13:03:00:  <!-- Network -->
13:03:00:  <proxy v='seprivatezen.astrazeneca.net:9480'/>
13:03:00:  <proxy-enable v='true'/>
13:03:00:
13:03:00:  <!-- Slot Control -->
13:03:00:  <power v='full'/>
13:03:00:
13:03:00:  <!-- User Information -->
13:03:00:  <passkey v='*****'/>
13:03:00:  <user v='BenjaminHCCarr'/>
13:03:00:
13:03:00:  <!-- Folding Slots -->
13:03:00:  <slot id='0' type='CPU'/>
13:03:00:  <slot id='1' type='GPU'>
13:03:00:    <pci-bus v='7'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:  <slot id='2' type='GPU'>
13:03:00:    <pci-bus v='15'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:  <slot id='3' type='GPU'>
13:03:00:    <pci-bus v='71'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:  <slot id='4' type='GPU'>
13:03:00:    <pci-bus v='78'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:  <slot id='5' type='GPU'>
13:03:00:    <pci-bus v='135'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:  <slot id='6' type='GPU'>
13:03:00:    <pci-bus v='144'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:  <slot id='7' type='GPU'>
13:03:00:    <pci-bus v='183'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:  <slot id='8' type='GPU'>
13:03:00:    <pci-bus v='189'/>
13:03:00:    <pci-slot v='0'/>
13:03:00:  </slot>
13:03:00:</config>
13:03:00:Trying to access database...
13:03:00:Successfully acquired database lock
13:03:00:FS00:Initialized folding slot 00: cpu:248
13:03:00:FS01:Initialized folding slot 01: gpu:7:0 GA100 [GRID A100X]
13:03:00:FS02:Initialized folding slot 02: gpu:15:0 GA100 [GRID A100X]
13:03:00:FS03:Initialized folding slot 03: gpu:71:0 GA100 [GRID A100X]
13:03:00:FS04:Initialized folding slot 04: gpu:78:0 GA100 [GRID A100X]
13:03:00:FS05:Initialized folding slot 05: gpu:135:0 GA100 [GRID A100X]
13:03:00:FS06:Initialized folding slot 06: gpu:144:0 GA100 [GRID A100X]
13:03:00:FS07:Initialized folding slot 07: gpu:183:0 GA100 [GRID A100X]
13:03:00:FS08:Initialized folding slot 08: gpu:189:0 GA100 [GRID A100X]
13:03:00:WU01:FS01:Starting
13:03:00:WU01:FS01:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 0 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU01:FS01:Started FahCore on PID 121518
13:03:00:WU01:FS01:Core PID:121522
13:03:00:WU01:FS01:FahCore 0x22 started
13:03:00:WU02:FS02:Starting
13:03:00:WU02:FS02:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 02 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 1 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU02:FS02:Started FahCore on PID 121523
13:03:00:WU02:FS02:Core PID:121527
13:03:00:WU02:FS02:FahCore 0x22 started
13:03:00:WU05:FS04:Starting
13:03:00:WU05:FS04:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 05 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 3 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU05:FS04:Started FahCore on PID 121528
13:03:00:WU05:FS04:Core PID:121532
13:03:00:WU05:FS04:FahCore 0x22 started
13:03:00:WU07:FS05:Starting
13:03:00:WU07:FS05:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 07 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 4 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU07:FS05:Started FahCore on PID 121533
13:03:00:WU07:FS05:Core PID:121537
13:03:00:WU07:FS05:FahCore 0x22 started
13:03:00:WU10:FS07:Starting
13:03:00:WU10:FS07:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 10 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 6 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU10:FS07:Started FahCore on PID 121538
13:03:00:WU10:FS07:Core PID:121542
13:03:00:WU10:FS07:FahCore 0x22 started
13:03:00:WU13:FS08:Starting
13:03:00:WU13:FS08:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 13 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 7 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU13:FS08:Started FahCore on PID 121543
13:03:00:WU13:FS08:Core PID:121547
13:03:00:WU13:FS08:FahCore 0x22 started
13:03:00:WU00:FS00:Connecting to seprivatezen.astrazeneca.net:9480
13:03:00:WU03:FS03:Connecting to seprivatezen.astrazeneca.net:9480
13:03:00:WU04:FS06:Connecting to seprivatezen.astrazeneca.net:9480
13:03:00:WARNING:WU01:FS01:FahCore returned: WU_STALLED (127 = 0x7f)
13:03:00:WARNING:WU02:FS02:FahCore returned: WU_STALLED (127 = 0x7f)
13:03:00:WARNING:WU05:FS04:FahCore returned: WU_STALLED (127 = 0x7f)
13:03:00:WARNING:WU07:FS05:FahCore returned: WU_STALLED (127 = 0x7f)
13:03:00:WARNING:WU10:FS07:FahCore returned: WU_STALLED (127 = 0x7f)
13:03:00:WARNING:WU13:FS08:FahCore returned: WU_STALLED (127 = 0x7f)
13:03:00:WU01:FS01:Starting
13:03:00:WU01:FS01:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 0 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU01:FS01:Started FahCore on PID 121548
13:03:00:WU01:FS01:Core PID:121552
13:03:00:WU01:FS01:FahCore 0x22 started
13:03:00:WU02:FS02:Starting
13:03:00:WU02:FS02:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 02 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 1 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU02:FS02:Started FahCore on PID 121553
13:03:00:WU02:FS02:Core PID:121557
13:03:00:WU02:FS02:FahCore 0x22 started
13:03:00:WU05:FS04:Starting
13:03:00:WU05:FS04:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 05 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 3 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU05:FS04:Started FahCore on PID 121558
13:03:00:WU05:FS04:Core PID:121562
13:03:00:WU05:FS04:FahCore 0x22 started
13:03:00:WU07:FS05:Starting
13:03:00:WU07:FS05:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 07 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 4 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU07:FS05:Started FahCore on PID 121563
13:03:00:WU07:FS05:Core PID:121567
13:03:00:WU07:FS05:FahCore 0x22 started
13:03:00:WU10:FS07:Starting
13:03:00:WU10:FS07:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 10 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 6 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
13:03:00:WU10:FS07:Started FahCore on PID 121568
13:03:00:WU10:FS07:Core PID:121572
13:03:00:WU10:FS07:FahCore 0x22 started
13:03:00:WU13:FS08:Starting
13:03:00:WU13:FS08:Running FahCore: /var/lib/home/scp/tmp/folding/usr/bin/FAHCoreWrapper /var/lib/home/scp/tmp/folding/usr/bin/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 13 -suffix 01 -version 706 -lifeline 121492 -checkpoint 15 -cuda-device 7 -gpu-vendor nvidia -gpu -1 -gpu-usage 100
I would like to get the GRID A100Xs folding before we put this machine into production.
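
In the log above, each WU_STALLED warning carries the same timestamp as the matching "FahCore 0x22 started" line, so the core appears to exit as soon as it is launched. To see at a glance which slots are affected, the warnings can be tallied per slot. A minimal Python sketch, assuming the warning lines keep the exact format shown in the log; the helper name `stalled_slots` is hypothetical and not part of FAHClient:

```python
import re
from collections import Counter

# Matches warning lines of the form seen in the log, e.g.
# 13:03:00:WARNING:WU01:FS01:FahCore returned: WU_STALLED (127 = 0x7f)
STALLED = re.compile(r"WARNING:WU\d+:(FS\d+):FahCore returned: WU_STALLED")


def stalled_slots(log_text: str) -> Counter:
    """Count WU_STALLED warnings per folding slot (hypothetical helper)."""
    return Counter(match.group(1) for match in STALLED.finditer(log_text))


if __name__ == "__main__":
    # Three lines copied from the log in this post.
    sample = (
        "13:03:00:WARNING:WU01:FS01:FahCore returned: WU_STALLED (127 = 0x7f)\n"
        "13:03:00:WARNING:WU02:FS02:FahCore returned: WU_STALLED (127 = 0x7f)\n"
        "13:03:00:WU01:FS01:Starting\n"
    )
    for slot, count in sorted(stalled_slots(sample).items()):
        print(slot, count)
```

To run it against the real file, pass the contents of log.txt to `stalled_slots` in place of the sample string.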
Last edited by benc on Mon Nov 02, 2020 3:09 pm, edited 1 time in total.

Re: FahCore returned: WU_STALLED (127 = 0x7f) | GRID A100X /

Post by benc »

Reading this Reddit thread: https://www.reddit.com/r/Folding/commen ... u_stalled/

Code:

13:03:00:       OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
13:03:00:               libOpenCL.so: cannot open shared object file: No such file or
13:03:00:               directory
Will folding fail if CUDA is present but OpenCL is not?
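
The log line shows the client trying to open the library by the exact name 'libOpenCL.so'. One way to reproduce that lookup outside the client is to ask the dynamic loader for the same name. A minimal Python sketch, assuming `ctypes.CDLL` searches the same default library paths as the client does (that is an assumption, not something the log confirms); `can_load` is a hypothetical helper name:

```python
import ctypes


def can_load(name: str) -> bool:
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library is missing or cannot be opened.
        return False
    return True


if __name__ == "__main__":
    # 'libOpenCL.so' is the name from the log message; the versioned
    # name is checked too because it can exist without the unversioned one.
    for name in ("libOpenCL.so", "libOpenCL.so.1"):
        print(name, "loadable" if can_load(name) else "NOT loadable")
```

If the unversioned name fails while a versioned one loads, that would suggest the runtime library is present but the unversioned link the client asks for is not.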

Re: FahCore returned: WU_STALLED (127 = 0x7f) | GRID A100X /

Post by benc »

So this is the driver I am running

And this is from the release notes: https://docs.nvidia.com/datacenter/tesl ... index.html

Code:

API Support
This release supports the following APIs:
- NVIDIA® CUDA® 11.0 for NVIDIA® Kepler™, Maxwell™, Pascal™, Volta™, Turing™ and NVIDIA Ampere architecture GPUs
- OpenGL® 4.5
- Vulkan® 1.1
- DirectX 11
- DirectX 12 (Windows 10)
- Open Computing Language (OpenCL™ software) 1.2
PantherX
Site Moderator
Posts: 7020
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: FahCore returned: WU_STALLED (127 = 0x7f) | GRID A100X /

Post by PantherX »

Just wondering if you have the OpenCL package installed (sudo apt-get install ocl-icd-opencl-dev)?
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues

Re: FahCore returned: WU_STALLED (127 = 0x7f) | GRID A100X /

Post by benc »

Thank you @PantherX, that was the missing package!
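
To confirm the fix after restarting the client, one can check that the "OpenCL: Not detected" line from the original log no longer appears in the fresh log. A minimal Python sketch, assuming the client keeps printing that exact message when the library is missing; `opencl_not_detected` is a hypothetical helper name:

```python
def opencl_not_detected(log_text: str) -> bool:
    """Return True if any log line reports 'OpenCL: Not detected'."""
    return any("OpenCL: Not detected" in line for line in log_text.splitlines())


if __name__ == "__main__":
    # Line copied from the log earlier in this thread (before the fix).
    before_fix = (
        "13:03:00:       OpenCL: Not detected: Failed to open dynamic "
        "library 'libOpenCL.so':\n"
    )
    if opencl_not_detected(before_fix):
        print("OpenCL still missing")
    else:
        print("no 'Not detected' line found")
```

The sketch only looks for the failure message; it does not prove the GPUs are folding, so the WU_STALLED warnings should be checked as well.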

Re: [SOLVED] FahCore returned: WU_STALLED (127 = 0x7f) | A10

Post by PantherX »

Glad that it was a simple fix! Hopefully, your 8 GPUs can be fed without issues :)

Re: [SOLVED] FahCore returned: WU_STALLED (127 = 0x7f) | A10

Post by benc »

We're feeding two NVIDIA DGX systems, each with:
- 2x AMD EPYC 7742 64-Core Processor (1 TB RAM)
- 8x A100-SXM4-40GB