Repeated coredumps of FahCore_a7 under cb::Regex::Regex

mat2
Posts: 9
Joined: Tue Mar 17, 2015 7:23 pm

Repeated coredumps of FahCore_a7 under cb::Regex::Regex

Post by mat2 »

Hello,

Today, FahCore_a7 has been dumping core many times in a row, in the same place. I have captured the coredumps and coredumpctl gave me the following backtraces (here after demangling with c++filt):

           PID: 13620 (FahCore_a7)
           UID: 127 (fahclient)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: pon 2019-11-25 20:06:43 CET (21min ago)
  Command Line: /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 13616 -checkpoint 15 -np 4
    Executable: /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7
 Control Group: /system.slice/FAHClient.service
          Unit: FAHClient.service
         Slice: system.slice
       Boot ID: 2ceb6a93d8914120993b5830a4c90da9
    Machine ID: 5d91c5b8a310b7071ad044fd00000010
      Hostname: mateusz-ubuntu
      Coredump: /var/lib/systemd/coredump/core.FahCore_a7.127.2ceb6a93d8914120993b5830a4c90da9.13620.1574708803000000000000.xz
       Message: Process 13620 (FahCore_a7) of user 127 dumped core.
                
                Stack trace of thread 13620:
                #0  0x0000000001207bcc std::_Rb_tree<boost::re_detail_106900::cpp_regex_traits_base<char>, std::pair<boost::re_detail_106900::cpp_regex_traits_base<char> const, std::_List_iterator<std::pair<boost::shared_ptr<boost::re_detail_106900::cpp_regex_traits_implementation<char> const>, boost::re_detail_106900::cpp_regex_traits_base<char> const*> > >, std::_Select1st<std::pair<boost::re_detail_106900::cpp_regex_traits_base<char> const, std::_List_iterator<std::pair<boost::shared_ptr<boost::re_detail_106900::cpp_regex_traits_implementation<char> const>, boost::re_detail_106900::cpp_regex_traits_base<char> const*> > > >, std::less<boost::re_detail_106900::cpp_regex_traits_base<char> >, std::allocator<std::pair<boost::re_detail_106900::cpp_regex_traits_base<char> const, std::_List_iterator<std::pair<boost::shared_ptr<boost::re_detail_106900::cpp_regex_traits_implementation<char> const>, boost::re_detail_106900::cpp_regex_traits_base<char> const*> > > > >::find(boost::re_detail_106900::cpp_regex_traits_base<char> const&) (FahCore_a7)
                #1  0x000000000120a9fd boost::object_cache<boost::re_detail_106900::cpp_regex_traits_base<char>, boost::re_detail_106900::cpp_regex_traits_implementation<char> >::do_get(boost::re_detail_106900::cpp_regex_traits_base<char> const&, unsigned long) (FahCore_a7)
                #2  0x0000000001219831 boost::basic_regex<char, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::do_assign(char const*, char const*, unsigned int) (FahCore_a7)
                #3  0x00000000011383c5 cb::Regex::private_t::private_t(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (FahCore_a7)
                #4  0x0000000001135a9b cb::Regex::Regex(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cb::Regex::type_t) (FahCore_a7)
                #5  0x00000000011d287c cb::DirectoryWalker::DirectoryWalker(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, bool) (FahCore_a7)
                #6  0x000000000116dfe8 cb::SystemUtilities::rmtree(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (FahCore_a7)
                #7  0x000000000116e0f1 cb::SystemUtilities::rmdir(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) (FahCore_a7)
                #8  0x00000000010ea320 FAH::Core::run(int, char**) (FahCore_a7)
                #9  0x0000000000438b6e main (FahCore_a7)
                #10 0x00007f0b9d054830 __libc_start_main (libc.so.6)
                #11 0x000000000043d60a _start (FahCore_a7)
                
                Stack trace of thread 13622:
                #0  0x00007f0b9d91bc1d __nanosleep (libpthread.so.0)
                #1  0x000000000118fe6b cb::Timer::sleep(double) (FahCore_a7)
                #2  0x00000000010f17a7 FAH::Watchdog::run() (FahCore_a7)
                #3  0x0000000001176797 cb::Thread::starter() (FahCore_a7)
                #4  0x000000000117543a n/a (FahCore_a7)
                #5  0x00007f0b9d9126ba start_thread (libpthread.so.0)
                #6  0x00007f0b9d13b41d __clone (libc.so.6)
                
                Stack trace of thread 13623:
                #0  0x00007f0b9d918709 pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x000000000115a964 cb::Condition::timedWait(double) (FahCore_a7)
                #2  0x00000000011637bd non-virtual thunk to cb::SignalManager::run() (FahCore_a7)
                #3  0x0000000001176797 cb::Thread::starter() (FahCore_a7)
                #4  0x000000000117543a n/a (FahCore_a7)
                #5  0x00007f0b9d9126ba start_thread (libpthread.so.0)
                #6  0x00007f0b9d13b41d __clone (libc.so.6)

           PID: 14551 (FahCore_a7)
           UID: 127 (fahclient)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: pon 2019-11-25 20:07:42 CET (20min ago)
  Command Line: /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 14547 -checkpoint 15 -np 4
    Executable: /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7
 Control Group: /system.slice/FAHClient.service
          Unit: FAHClient.service
         Slice: system.slice
       Boot ID: 2ceb6a93d8914120993b5830a4c90da9
    Machine ID: 5d91c5b8a310b7071ad044fd00000010
      Hostname: mateusz-ubuntu
      Coredump: /var/lib/systemd/coredump/core.FahCore_a7.127.2ceb6a93d8914120993b5830a4c90da9.14551.1574708862000000000000.xz
       Message: Process 14551 (FahCore_a7) of user 127 dumped core.
                
                Stack trace of thread 14551:
                #0  0x0000000001207bcc std::_Rb_tree<boost::re_detail_106900::cpp_regex_traits_base<char>, std::pair<boost::re_detail_106900::cpp_regex_traits_base<char> const, std::_List_iterator<std::pair<boost::shared_ptr<boost::re_detail_106900::cpp_regex_traits_implementation<char> const>, boost::re_detail_106900::cpp_regex_traits_base<char> const*> > >, std::_Select1st<std::pair<boost::re_detail_106900::cpp_regex_traits_base<char> const, std::_List_iterator<std::pair<boost::shared_ptr<boost::re_detail_106900::cpp_regex_traits_implementation<char> const>, boost::re_detail_106900::cpp_regex_traits_base<char> const*> > > >, std::less<boost::re_detail_106900::cpp_regex_traits_base<char> >, std::allocator<std::pair<boost::re_detail_106900::cpp_regex_traits_base<char> const, std::_List_iterator<std::pair<boost::shared_ptr<boost::re_detail_106900::cpp_regex_traits_implementation<char> const>, boost::re_detail_106900::cpp_regex_traits_base<char> const*> > > > >::find(boost::re_detail_106900::cpp_regex_traits_base<char> const&) (FahCore_a7)
                #1  0x000000000120a9fd boost::object_cache<boost::re_detail_106900::cpp_regex_traits_base<char>, boost::re_detail_106900::cpp_regex_traits_implementation<char> >::do_get(boost::re_detail_106900::cpp_regex_traits_base<char> const&, unsigned long) (FahCore_a7)
                #2  0x0000000001219831 boost::basic_regex<char, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::do_assign(char const*, char const*, unsigned int) (FahCore_a7)
                #3  0x00000000011383c5 cb::Regex::private_t::private_t(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (FahCore_a7)
                #4  0x0000000001135a9b cb::Regex::Regex(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cb::Regex::type_t) (FahCore_a7)
                #5  0x00000000011d287c cb::DirectoryWalker::DirectoryWalker(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, bool) (FahCore_a7)
                #6  0x000000000116dfe8 cb::SystemUtilities::rmtree(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (FahCore_a7)
                #7  0x000000000116e0f1 cb::SystemUtilities::rmdir(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) (FahCore_a7)
                #8  0x00000000010ea320 FAH::Core::run(int, char**) (FahCore_a7)
                #9  0x0000000000438b6e main (FahCore_a7)
                #10 0x00007f0a6534c830 __libc_start_main (libc.so.6)
                #11 0x000000000043d60a _start (FahCore_a7)
                
                Stack trace of thread 14553:
                #0  0x00007f0a65c13c1d __nanosleep (libpthread.so.0)
                #1  0x000000000118fe6b cb::Timer::sleep(double) (FahCore_a7)
                #2  0x00000000010f17a7 FAH::Watchdog::run() (FahCore_a7)
                #3  0x0000000001176797 cb::Thread::starter() (FahCore_a7)
                #4  0x000000000117543a n/a (FahCore_a7)
                #5  0x00007f0a65c0a6ba start_thread (libpthread.so.0)
                #6  0x00007f0a6543341d __clone (libc.so.6)
                
                Stack trace of thread 14554:
                #0  0x00007f0a65c10709 pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x000000000115a964 cb::Condition::timedWait(double) (FahCore_a7)
                #2  0x00000000011637bd non-virtual thunk to cb::SignalManager::run() (FahCore_a7)
                #3  0x0000000001176797 cb::Thread::starter() (FahCore_a7)
                #4  0x000000000117543a n/a (FahCore_a7)
                #5  0x00007f0a65c0a6ba start_thread (libpthread.so.0)
                #6  0x00007f0a6543341d __clone (libc.so.6)

           PID: 15367 (FahCore_a7)
           UID: 127 (fahclient)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: pon 2019-11-25 20:08:42 CET (19min ago)
  Command Line: /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 15363 -checkpoint 15 -np 4
    Executable: /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7
 Control Group: /system.slice/FAHClient.service
          Unit: FAHClient.service
         Slice: system.slice
       Boot ID: 2ceb6a93d8914120993b5830a4c90da9
    Machine ID: 5d91c5b8a310b7071ad044fd00000010
      Hostname: mateusz-ubuntu
      Coredump: /var/lib/systemd/coredump/core.FahCore_a7.127.2ceb6a93d8914120993b5830a4c90da9.15367.1574708922000000000000.xz
       Message: Process 15367 (FahCore_a7) of user 127 dumped core.
                
                Stack trace of thread 15367:
                #0  0x0000000001207bcc std::_Rb_tree<boost::re_detail_106900::cpp_regex_traits_base<char>, std::pair<boost::re_detail_106900::cpp_regex_traits_base<char> const, std::_List_iterator<std::pair<boost::shared_ptr<boost::re_detail_106900::cpp_regex_traits_implementation<char> const>, boost::re_detail_106900::cpp_regex_traits_base<char> const*> > >, std::_Select1st<std::pair<boost::re_detail_106900::cpp_regex_traits_base<char> const, std::_List_iterator<std::pair<boost::shared_ptr<boost::re_detail_106900::cpp_regex_traits_implementation<char> const>, boost::re_detail_106900::cpp_regex_traits_base<char> const*> > > >, std::less<boost::re_detail_106900::cpp_regex_traits_base<char> >, std::allocator<std::pair<boost::re_detail_106900::cpp_regex_traits_base<char> const, std::_List_iterator<std::pair<boost::shared_ptr<boost::re_detail_106900::cpp_regex_traits_implementation<char> const>, boost::re_detail_106900::cpp_regex_traits_base<char> const*> > > > >::find(boost::re_detail_106900::cpp_regex_traits_base<char> const&) (FahCore_a7)
                #1  0x000000000120a9fd boost::object_cache<boost::re_detail_106900::cpp_regex_traits_base<char>, boost::re_detail_106900::cpp_regex_traits_implementation<char> >::do_get(boost::re_detail_106900::cpp_regex_traits_base<char> const&, unsigned long) (FahCore_a7)
                #2  0x0000000001219831 boost::basic_regex<char, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::do_assign(char const*, char const*, unsigned int) (FahCore_a7)
                #3  0x00000000011383c5 cb::Regex::private_t::private_t(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (FahCore_a7)
                #4  0x0000000001135a9b cb::Regex::Regex(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cb::Regex::type_t) (FahCore_a7)
                #5  0x00000000011d287c cb::DirectoryWalker::DirectoryWalker(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, bool) (FahCore_a7)
                #6  0x000000000116dfe8 cb::SystemUtilities::rmtree(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (FahCore_a7)
                #7  0x000000000116e0f1 cb::SystemUtilities::rmdir(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) (FahCore_a7)
                #8  0x00000000010ea320 FAH::Core::run(int, char**) (FahCore_a7)
                #9  0x0000000000438b6e main (FahCore_a7)
                #10 0x00007f66dd157830 __libc_start_main (libc.so.6)
                #11 0x000000000043d60a _start (FahCore_a7)
                
                Stack trace of thread 15369:
                #0  0x00007f66dda1ec1d __nanosleep (libpthread.so.0)
                #1  0x000000000118fe6b cb::Timer::sleep(double) (FahCore_a7)
                #2  0x00000000010f17a7 FAH::Watchdog::run() (FahCore_a7)
                #3  0x0000000001176797 cb::Thread::starter() (FahCore_a7)
                #4  0x000000000117543a n/a (FahCore_a7)
                #5  0x00007f66dda156ba start_thread (libpthread.so.0)
                #6  0x00007f66dd23e41d __clone (libc.so.6)
                
                Stack trace of thread 15370:
                #0  0x00007f66dda1b709 pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x000000000115a964 cb::Condition::timedWait(double) (FahCore_a7)
                #2  0x00000000011637bd non-virtual thunk to cb::SignalManager::run() (FahCore_a7)
                #3  0x0000000001176797 cb::Thread::starter() (FahCore_a7)
                #4  0x000000000117543a n/a (FahCore_a7)
                #5  0x00007f66dda156ba start_thread (libpthread.so.0)
                #6  0x00007f66dd23e41d __clone (libc.so.6)
I will gladly send the coredumps and the contents of /var/lib/fahclient via private message.

sha256sum of FahCore_a7:

401cc7e0416c68706eb262cf97b54af4b605f6cbbe26ae5ee1362a869fb27d55  FahCore_a7
Previously, since November 23rd, the Folding@Home client has reported a Guru Meditation error several times. Each time it happened shortly after a resume from suspend (though not after every resume). This does not seem to be strictly related, however.

Today, I have been running the Torture Test from the Mersenne Prime project and suspending the laptop several times in a row during it. No errors were reported.

mateusz@mateusz-ubuntu:/var/lib/fahclient$ find
.
./cores
./cores/cores.foldingathome.org
./cores/cores.foldingathome.org/v7
./cores/cores.foldingathome.org/v7/lin
./cores/cores.foldingathome.org/v7/lin/64bit
./cores/cores.foldingathome.org/v7/lin/64bit/avx
./cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah
./cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7
./configs
./configs/config-20191115-194549.xml
./configs/config-20191111-201938.xml
./configs/config-20191125-191014.xml
./configs/config-20191116-200130.xml
./configs/config-20191116-140728.xml
./configs/config-20191111-202039.xml
./configs/config-20191125-181247.xml
./configs/config-20191111-191232.xml
./configs/config-20191124-200005.xml
./configs/config-20191125-190711.xml
./configs/config-20191116-203701.xml
./configs/config-20191116-190333.xml
./configs/config-20191123-170346.xml
./configs/config-20191116-180744.xml
./configs/config-20191123-165942.xml
./configs/config-20191124-200105.xml
./logs
./logs/log-20191116-185220.txt
./logs/log-20191121-191723.txt
./logs/log-20191116-202044.txt
./logs/log-20191116-105825.txt
./logs/log-20191110-195030.txt
./logs/log-20191121-184719.txt
./logs/log-20191123-135438.txt
./log.txt
./work
./work/client.db-journal
./work/client.db
./work/00
./work/00/logfile_01-20191124-174736.txt
./work/00/logfile_01-20191125-175547.txt
./work/00/wudata_01.dat
./work/00/logfile_01-20191125-190842.txt
./work/00/viewerFrame34.json
./work/00/viewerFrame0.json
./work/00/viewerFrame54.json
./work/00/logfile_01.txt
./work/00/viewerTop.json
./work/00/viewerFrame33.json
./work/00/logfile_01-20191125-190742.txt
./work/00/viewerFrame40.json
./work/00/viewerFrame13.json
./work/00/viewerFrame26.json
./work/00/logfile_01-20191125-180447.txt
./work/00/viewerFrame38.json
./work/00/viewerFrame48.json
./work/00/viewerFrame57.json
./work/00/viewerFrame11.json
./work/00/viewerFrame39.json
./work/00/viewerFrame53.json
./work/00/viewerFrame19.json
./work/00/wuinfo_01.dat
./work/00/viewerFrame43.json
./work/00/viewerFrame14.json
./work/00/viewerFrame50.json
./work/00/viewerFrame35.json
./work/00/logfile_01-20191125-180848.txt
./work/00/viewerFrame2.json
./work/00/viewerFrame3.json
./work/00/viewerFrame7.json
./work/00/logfile_01-20191125-180747.txt
./work/00/viewerFrame12.json
./work/00/viewerFrame46.json
./work/00/viewerFrame42.json
./work/00/viewerFrame10.json
./work/00/viewerFrame21.json
./work/00/viewerFrame31.json
./work/00/viewerFrame8.json
./work/00/logfile_01-20191125-180347.txt
./work/00/logfile_01-20191125-175747.txt
./work/00/logfile_01-20191125-175647.txt
./work/00/logfile_01-20191125-175947.txt
./work/00/logfile_01-20191124-172622.txt
./work/00/logfile_01-20191125-180247.txt
./work/00/viewerFrame9.json
./work/00/viewerFrame45.json
./work/00/logfile_01-20191125-180147.txt
./work/00/viewerFrame29.json
./work/00/viewerFrame6.json
./work/00/viewerFrame44.json
./work/00/logfile_01-20191125-190642.txt
./work/00/logfile_01-20191125-175447.txt
./work/00/viewerFrame1.json
./work/00/viewerFrame4.json
./work/00/viewerFrame52.json
./work/00/viewerFrame47.json
./work/00/viewerFrame49.json
./work/00/viewerFrame20.json
./work/00/logfile_01-20191125-180647.txt
./work/00/viewerFrame55.json
./work/00/viewerFrame36.json
./work/00/logfile_01-20191125-180547.txt
./work/00/wudata_01.lock
./work/00/logfile_01-20191125-180948.txt
./work/00/viewerFrame5.json
./work/00/viewerFrame30.json
./work/00/viewerFrame25.json
./work/00/01
./work/00/01/frame61.tpr
./work/00/01/frame61.trr
./work/00/01/md.log
./work/00/01/core.xml
./work/00/01/state.cpt
./work/00/01/ener.edr
./work/00/01/frame61.xtc
./work/00/01/state_prev.cpt
./work/00/01/checkpt.crc
./work/00/01/science.log
./work/00/viewerFrame41.json
./work/00/viewerFrame16.json
./work/00/viewerFrame32.json
./work/00/logfile_01-20191125-175847.txt
./work/00/viewerFrame28.json
./work/00/viewerFrame58.json
./work/00/logfile_01-20191125-180047.txt
./work/00/viewerFrame24.json
./work/00/wuresults_01.dat
./work/00/viewerFrame27.json
./work/00/viewerFrame17.json
./work/00/viewerFrame37.json
./work/00/logfile_01-20191125-181048.txt
./work/00/viewerFrame23.json
./work/00/logfile_01-20191124-211124.txt
./work/00/logfile_01-20191124-172344.txt
./work/00/viewerFrame51.json
./work/00/viewerFrame56.json
./work/00/viewerFrame18.json
./work/00/viewerFrame22.json
./work/00/logfile_01-20191125-181148.txt
./work/00/viewerFrame15.json
mateusz@mateusz-ubuntu:/var/lib/fahclient$
Log contents:

23:18:37:WU00:FS00:0xa7:Completed 70000 out of 125000 steps (56%)
23:24:49:WU00:FS00:0xa7:Completed 71250 out of 125000 steps (57%)
23:31:02:WU00:FS00:0xa7:Completed 72500 out of 125000 steps (58%)
******************************* Date: 2019-11-25 *******************************
17:54:29:WARNING:WU00:FS00:Detected clock skew (18 hours 18 mins), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
17:54:29:Started thread 38 on PID 2703
17:54:30:ERROR:Receive error: 4: Interrupted system call
17:54:45:WARNING:WU00:FS00:FahCore returned: WU_STALLED (127 = 0x7f)
17:54:47:WU00:FS00:Starting
17:54:48:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 2703 -checkpoint 15 -np 4
17:54:48:WU00:FS00:Started FahCore on PID 22454
17:54:48:Started thread 39 on PID 2703
17:54:50:WU00:FS00:Core PID:22467
17:54:50:WU00:FS00:FahCore 0xa7 started
17:54:52:Started thread 40 on PID 2703
17:54:53:WU00:FS00:0xa7:*********************** Log Started 2019-11-25T17:54:53Z ***********************
17:54:53:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
17:54:53:WU00:FS00:0xa7:       Type: 0xa7
17:54:53:WU00:FS00:0xa7:       Core: Gromacs
17:54:53:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 22454 -checkpoint 15 -np
17:54:53:WU00:FS00:0xa7:             4
17:54:53:WU00:FS00:0xa7:************************************ CBang *************************************
17:54:53:WU00:FS00:0xa7:       Date: Nov 5 2019
17:54:53:WU00:FS00:0xa7:       Time: 06:06:57
17:54:53:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
17:54:53:WU00:FS00:0xa7:     Branch: master
17:54:53:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
17:54:53:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
17:54:53:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
17:54:53:WU00:FS00:0xa7:       Bits: 64
17:54:53:WU00:FS00:0xa7:       Mode: Release
17:54:53:WU00:FS00:0xa7:************************************ System ************************************
17:54:53:WU00:FS00:0xa7:        CPU: Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz
17:54:53:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 142 Stepping 9
17:54:53:WU00:FS00:0xa7:       CPUs: 4
17:54:53:WU00:FS00:0xa7:     Memory: 3.73GiB
17:54:53:WU00:FS00:0xa7:Free Memory: 361.60MiB
17:54:53:WU00:FS00:0xa7:    Threads: POSIX_THREADS
17:54:53:WU00:FS00:0xa7: OS Version: 5.3
17:54:53:WU00:FS00:0xa7:Has Battery: true
17:54:53:WU00:FS00:0xa7: On Battery: false
17:54:53:WU00:FS00:0xa7: UTC Offset: 1
17:54:53:WU00:FS00:0xa7:        PID: 22467
17:54:53:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
17:54:53:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
17:54:53:WU00:FS00:0xa7:    Version: 0.0.18
17:54:53:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
17:54:53:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
17:54:53:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
17:54:53:WU00:FS00:0xa7:       Date: Nov 5 2019
17:54:53:WU00:FS00:0xa7:       Time: 06:13:26
17:54:53:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
17:54:53:WU00:FS00:0xa7:     Branch: master
17:54:53:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
17:54:53:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
17:54:53:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
17:54:53:WU00:FS00:0xa7:       Bits: 64
17:54:53:WU00:FS00:0xa7:       Mode: Release
17:54:53:WU00:FS00:0xa7:************************************ Build *************************************
17:54:53:WU00:FS00:0xa7:       SIMD: avx_256
17:54:53:WU00:FS00:0xa7:********************************************************************************
17:54:53:WU00:FS00:0xa7:Project: 13830 (Run 913, Clone 2, Gen 61)
17:54:53:WU00:FS00:0xa7:Unit: 0x0000004a80fccb095d693c31457bd777
17:54:53:WU00:FS00:0xa7:Digital signatures verified
17:54:54:WU00:FS00:0xa7:Calling: mdrun -s frame61.tpr -o frame61.trr -x frame61.xtc -cpi state.cpt -cpt 15 -nt 4
17:54:59:WU00:FS00:0xa7:ERROR:Guru Meditation #890b5b0346ada36a.3bdc183ca1956d5e (4665948.4665948) '00/01/state.cpt'
17:54:59:WU00:FS00:0xa7:WARNING:Unexpected exit() call
17:54:59:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
17:54:59:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
17:54:59:WU00:FS00:0xa7:Saving result file frame61.trr
17:55:02:WU00:FS00:0xa7:Saving result file frame61.xtc
17:55:03:WU00:FS00:0xa7:Saving result file md.log
17:55:03:WU00:FS00:0xa7:Saving result file science.log
17:55:16:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
17:55:47:WU00:FS00:Starting
17:55:47:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 2703 -checkpoint 15 -np 4
17:55:47:WU00:FS00:Started FahCore on PID 23523
17:55:47:Started thread 41 on PID 2703
17:55:47:WU00:FS00:Core PID:23527
17:55:47:WU00:FS00:FahCore 0xa7 started
17:55:47:WU00:FS00:0xa7:*********************** Log Started 2019-11-25T17:55:47Z ***********************
17:55:48:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
17:55:48:WU00:FS00:0xa7:       Type: 0xa7
17:55:48:WU00:FS00:0xa7:       Core: Gromacs
17:55:48:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 23523 -checkpoint 15 -np
17:55:48:WU00:FS00:0xa7:             4
17:55:48:WU00:FS00:0xa7:************************************ CBang *************************************
17:55:48:WU00:FS00:0xa7:       Date: Nov 5 2019
17:55:48:WU00:FS00:0xa7:       Time: 06:06:57
17:55:48:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
17:55:48:WU00:FS00:0xa7:     Branch: master
17:55:48:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
17:55:48:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
17:55:48:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
17:55:48:WU00:FS00:0xa7:       Bits: 64
17:55:48:WU00:FS00:0xa7:       Mode: Release
17:55:48:WU00:FS00:0xa7:************************************ System ************************************
17:55:48:WU00:FS00:0xa7:        CPU: Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz
17:55:48:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 142 Stepping 9
17:55:48:WU00:FS00:0xa7:       CPUs: 4
17:55:48:WU00:FS00:0xa7:     Memory: 3.73GiB
17:55:48:WU00:FS00:0xa7:Free Memory: 211.59MiB
17:55:48:WU00:FS00:0xa7:    Threads: POSIX_THREADS
17:55:48:WU00:FS00:0xa7: OS Version: 5.3
17:55:48:WU00:FS00:0xa7:Has Battery: true
17:55:48:WU00:FS00:0xa7: On Battery: false
17:55:48:WU00:FS00:0xa7: UTC Offset: 1
17:55:48:WU00:FS00:0xa7:        PID: 23527
17:55:48:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
17:55:48:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
17:55:48:WU00:FS00:0xa7:    Version: 0.0.18
17:55:48:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
17:55:48:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
17:55:48:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
17:55:48:WU00:FS00:0xa7:       Date: Nov 5 2019
17:55:48:WU00:FS00:0xa7:       Time: 06:13:26
17:55:48:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
17:55:48:WU00:FS00:0xa7:     Branch: master
17:55:48:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
17:55:48:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
17:55:48:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
17:55:48:WU00:FS00:0xa7:       Bits: 64
17:55:48:WU00:FS00:0xa7:       Mode: Release
17:55:48:WU00:FS00:0xa7:************************************ Build *************************************
17:55:48:WU00:FS00:0xa7:       SIMD: avx_256
17:55:48:WU00:FS00:0xa7:********************************************************************************
17:55:48:WU00:FS00:0xa7:Project: 13830 (Run 913, Clone 2, Gen 61)
17:55:48:WU00:FS00:0xa7:Unit: 0x0000004a80fccb095d693c31457bd777
17:55:48:WU00:FS00:0xa7:Digital signatures verified
17:55:48:WU00:FS00:0xa7:Calling: mdrun -s frame61.tpr -o frame61.trr -x frame61.xtc -cpi state.cpt -cpt 15 -nt 4
17:55:52:WU00:FS00:0xa7:ERROR:Guru Meditation #890b5b0346ada36a.3bdc183ca1956d5e (4665948.4665948) '00/01/state.cpt'
17:55:52:WU00:FS00:0xa7:WARNING:Unexpected exit() call
17:55:52:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
17:55:52:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
17:55:52:WU00:FS00:0xa7:Saving result file frame61.trr
17:55:52:WU00:FS00:0xa7:Saving result file frame61.xtc
17:55:52:WU00:FS00:0xa7:Saving result file md.log
17:55:52:WU00:FS00:0xa7:Saving result file science.log
17:55:58:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
17:56:02:Started thread 42 on PID 2703
17:56:47:WU00:FS00:Starting
17:56:47:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 2703 -checkpoint 15 -np 4
17:56:47:WU00:FS00:Started FahCore on PID 24323
17:56:47:Started thread 43 on PID 2703
17:56:49:WU00:FS00:Core PID:24443
jcoffland
Site Admin
Posts: 1019
Joined: Fri Oct 10, 2008 6:42 pm
Location: Helsinki, Finland

Re: Repeated coredumps of FahCore_a7 under cb::Regex::Regex

Post by jcoffland »

So here's what I think happened in your last log:

1) The system was suspended while the 0xa7 core was running.
2) The system reawoke.
3) The 0xa7 core's watchdog timer triggered: WARNING:WU00:FS00:FahCore returned: WU_STALLED (127 = 0x7f)
4) The core shut down incorrectly.
5) On WU restart the incorrect shutdown triggered: ERROR:Guru Meditation #890b5b0346ada36a.3bdc183ca1956d5e (4665948.4665948) '00/01/state.cpt'
6) The Guru Meditation at 17:55:52 triggered: WARNING:Unexpected exit from science code
7) The exit handler caused the core to return INTERRUPTED instead of BAD_FRAME_CHECKSUM.
8) Since the core returned INTERRUPTED, which it is supposed to return when stopped normally, the client happily keeps trying.
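
As a rough illustration of steps 5 to 8 above, here is a toy simulation of the retry loop. It is only a sketch, not FAHClient code: the `decide` policy, the "restart" and "give_up" actions and the attempt cap are made up for the example, and only the result names come from the log and the list above.

```python
from enum import Enum


class CoreResult(Enum):
    # Names taken from the log and the list above. Numeric codes are only
    # noted where the log actually shows them.
    WU_STALLED = "WU_STALLED"                  # logged as 127 = 0x7f
    INTERRUPTED = "INTERRUPTED"                # logged as 102 = 0x66
    BAD_FRAME_CHECKSUM = "BAD_FRAME_CHECKSUM"  # code not shown in the log


def decide(result):
    """Hypothetical client policy: what to do after the core exits."""
    if result in (CoreResult.INTERRUPTED, CoreResult.WU_STALLED):
        # INTERRUPTED is what a normally stopped core returns, so the
        # client simply starts the work unit again.
        return "restart"
    # Hypothetical: any other result makes the client stop retrying.
    return "give_up"


def simulate(core_result, max_attempts=5):
    """Count core starts while a broken checkpoint yields the same result."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        if decide(core_result) != "restart":
            break
    return attempts
```

With `CoreResult.INTERRUPTED` the loop only ends at the artificial cap, which mirrors the client retrying every minute in the log. With `CoreResult.BAD_FRAME_CHECKSUM` it ends after the first attempt.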

I think the core dump in cb::Regex::Regex was caused by shenanigans in the exit handler.

I've simplified the exit handler and altered the watchdog so it detects system suspend and does not trigger. Note, the watchdog checks for WU progress and shuts down the core if it's not progressing.
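
One way a watchdog could tell a suspend apart from a real stall is to notice that its own periodic check did not run for a long time. The sketch below is an assumption about how such a check might look, not the code in the new core. The class name, the thresholds and the injectable `clock` parameter are all hypothetical.

```python
import time


class SuspendAwareWatchdog:
    """Toy stall detector that ignores time spent suspended.

    Hypothetical sketch: nothing here is taken from the 0xa7 core.
    """

    def __init__(self, stall_timeout=300.0, suspend_gap=60.0, clock=time.time):
        self.stall_timeout = stall_timeout  # seconds without progress allowed
        self.suspend_gap = suspend_gap      # a tick gap this large means "slept"
        # Wall clock by default, so a suspend shows up as a large gap
        # between two consecutive ticks.
        self.clock = clock
        now = clock()
        self.last_tick = now
        self.last_progress = now

    def progress(self):
        """Call whenever the work unit advances."""
        self.last_progress = self.clock()

    def tick(self):
        """Call periodically; returns True if the WU looks stalled."""
        now = self.clock()
        gap = now - self.last_tick
        self.last_tick = now
        if gap >= self.suspend_gap:
            # The watchdog itself did not run for a long time, so the
            # machine was most likely suspended: restart the stall window
            # instead of reporting a stall.
            self.last_progress = now
            return False
        return now - self.last_progress >= self.stall_timeout
```

The idea is that an 18 hour jump like the one in the log would reset the stall window, while a core that is awake but makes no progress for `stall_timeout` seconds would still be reported.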

I will not be able to release these improvements until January because I will be away in New Zealand for all of December. I will, however, put the new core up for beta testing.
Cauldron Development LLC
http://cauldrondevelopment.com/
jcoffland
Site Admin
Posts: 1019
Joined: Fri Oct 10, 2008 6:42 pm
Location: Helsinki, Finland

Re: Repeated coredumps of FahCore_a7 under cb::Regex::Regex

Post by jcoffland »

Testing new 0xa7 core 0.0.19 on slack channel #11900
mat2
Posts: 9
Joined: Tue Mar 17, 2015 7:23 pm

Re: Repeated coredumps of FahCore_a7 under cb::Regex::Regex

Post by mat2 »

Previously, since November 23rd, the Folding@Home client has reported a Guru Meditation error several times. Each time it happened shortly after a resume from suspend (though not after every resume). This does not seem to be strictly related, however.
I was probably hitting this bug in the Linux kernel:

https://www.phoronix.com/scan.php?page= ... er-Corrupt
https://bugzilla.kernel.org/show_bug.cgi?id=205663
When a signal is delivered that must fault in user stack pages in order to XSAVE the signal context, AVX YMM registers may be corrupted on return from the signal handler.

To reproduce, build the attached program with "gcc -pthread test.c" and run it on a 5.2 or later kernel compiled with GCC 9 (if the kernel is compiled with GCC 8, it does *not* reproduce).

Actual result: On v5.2 and later kernels, it will fail quickly with a corrupted value in an AVX register.
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Repeated coredumps of FahCore_a7 under cb::Regex::Regex

Post by bruce »

@mat2
Your coredump problem has been diagnosed (above) for v0.0.18. The beta version 0.0.19 also has problems so don't try it until January.
jcoffland wrote: I will not be able to release these improvements until January because I will be away in New Zealand for all of December. I will, however, put the new core up for beta testing.
To avoid the problem, disable the battery saving setting ("sleep") in your OS. If you want to save power, go to FAHControl and PAUSE folding on the CPU. You can then leave it running, manually set "sleep", or shut down.
mat2
Posts: 9
Joined: Tue Mar 17, 2015 7:23 pm

Re: Repeated coredumps of FahCore_a7 under cb::Regex::Regex

Post by mat2 »

Thank you for your answers. I think that jcoffland's diagnosis may not be correct.

I do not know how crazy the exit handler is, but I am sceptical as to whether it caused these segmentation faults. There were over ten segmentation faults and every one of them happened at the same address, in the same function. I think that if the problem were caused by bugs in the exit handler, the addresses would not be the same every time. I would rather suspect the code that removes files from the /var/lib/fahclient/work directory. There were many files in this directory as a result of previous problems.
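
A rough way to check the "same address every time" claim is to count the address of frame #0 across the saved coredumpctl output. The sketch below assumes the output for several dumps has been concatenated into one string. The function name is made up, and the sample text is a shortened, partly invented stand-in for the real traces (only the frame #0 address is taken from the backtrace above).

```python
import re
from collections import Counter

# Matches the innermost frame line, e.g. "#0  0x0000000001207bcc symbol (module)".
FRAME0 = re.compile(r"^[ \t]*#0[ \t]+(0x[0-9a-fA-F]+)", re.MULTILINE)


def top_frame_addresses(text):
    """Count how often each frame-#0 address appears in the given text."""
    return Counter(m.group(1).lower() for m in FRAME0.finditer(text))


# Shortened sample; symbol names are abbreviated and the second
# thread id is a placeholder, not taken from a real dump.
sample = """
                Stack trace of thread 13620:
                #0  0x0000000001207bcc std::_Rb_tree<...>::find(...) (FahCore_a7)
                #1  0x0000000000000001 placeholder_frame (FahCore_a7)
                Stack trace of thread 13999:
                #0  0x0000000001207bcc std::_Rb_tree<...>::find(...) (FahCore_a7)
"""

counts = top_frame_addresses(sample)
```

If every dump crashes in the same place, the counter holds a single address whose count equals the number of dumps.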

My laptop has only 4 GiB of RAM and I usually have many things open, so there is a lot of memory pressure. After resuming from suspend, the system is busy accessing the hard drive, in particular the swap partition. At that time, I was running Linux 5.3 compiled with GCC 9, which contained a bug that sometimes corrupted the AVX registers. After resuming from suspend, the FAHClient was getting many page faults, which triggered this bug. The contents of the AVX registers were sometimes corrupted, which caused the Guru Meditation errors. This in turn caused many files to accumulate in the /var/lib/fahclient/work directory.

I am ready to test fahcore 0.0.19 inside a similar environment (same contents of /var/lib/fahclient, similar system date, etc.) and to check whether it still fails inside cb::SystemUtilities::rmdir().
bruce
Posts: 20910
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Repeated coredumps of FahCore_a7 under cb::Regex::Regex

Post by bruce »

Perhaps the problem is reaching a mutual understanding of "suspended state." Most of my experience is with Windows, so this may or may not apply to Linux.

When Windows enters a suspended state, it stores a full copy of main RAM together with enough information about the programs that are running to be able to resume them once main RAM has been restored. No such information is stored or restored from the GPU's memory. Assuming that the main RAM code segment of FAHCore_21 is restored, it will resume sending work packets to the GPU and receiving whatever the GPU sends back. This ASSUMES that the GPU's memory is still in an identical state to what was present when FAHCore_21 last communicated with it. That's not necessarily true.

If GPU processing is paused, FAHCore_21 will resume processing based on the last checkpoint and no assumptions are needed about whatever the GPU was doing. If, on the other hand, the GPU was busy processing, the data that was in the GPU will be missing or at least changed. A GPU is not designed to sleep/be_restored unless data is restored by the CPU. You'll notice that after a resume, the desktop performs a screen refresh. FAH doesn't do that unless you manually PAUSE/FOLD.

FAHCore_a7 resides entirely in main RAM, so it should be restored to the same condition it was before it entered the sleep state.

Segmentation errors can also be related to overclocking or actual memory faults.
foldy
Posts: 2061
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Repeated coredumps of FahCore_a7 under cb::Regex::Regex

Post by foldy »

The GPU FAHCore_21 can detect the clock skew caused by standby mode and resume the GPU from the last FAH checkpoint.

But mat2's problem seems to be related to the CPU FahCore_a7 only.