Linux scripts you may find useful

Post by SteveWillis » Wed Dec 13, 2017 2:07 pm

I have written a couple of Linux scripts that others may find useful. I have them running on three Linux Mint PCs, but I don't see any reason they wouldn't run on other distributions. Run them at your own risk, but I find them very useful and start them automatically from Startup Applications. Personally, I have the free version of Dropbox installed on all three boxes, so I only have to maintain the scripts in one place.

This one gives a snapshot of errors in log.txt and also shows errors from previous logs.

Code:
#!/bin/ksh
while true
do

greppit(){
# strip ANSI color codes, drop the noise lines, then highlight the lines of interest
sed -r "s/[[:cntrl:]]\[[0-9]{1,3}m//g" "$1"|grep -Ev "positions|NO_ERROR|max-slot-errors|max-unit-errors"|grep -E --color=always "Bad State detected|BAD_WORK_UNIT|DUMPED|FAILED|Date|over|STALLED|BAD_FRAME_CHECKSUM|wuresults|boost|ENUM|error|WORK_QUIT|dumping|FAULTY|WARNING|WorkServer connection failed"
}

loggrep(){
echo "Current log $1"
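# "ls | read" only works in ksh, which runs the last stage of a pipeline in the current shell
ls -tr | tail "$1" | head -1 | read xxx
grep "Log Started" "$xxx" | head -1
greppit "$xxx"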
echo ;echo
}

reset

cd /var/lib/fahclient/logs

loggrep -8
loggrep -7
loggrep -6
loggrep -5
loggrep -4
loggrep -3
loggrep -2
loggrep -1



cd /var/lib/fahclient
grep "Log Started" log.txt|head -1
greppit log.txt
echo;echo "###########################################################################"
tail -1 ~/stalled.log
uptime
date
sleep 180

done
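
The filtering done by greppit can be tried on its own, without a running client. The sketch below is only an illustration and not part of the script above: the function names filter_log and sample_log and the sample log lines are made up, the pattern list is shortened, and it assumes GNU sed and grep.

```shell
# Hypothetical helper, not part of the original script.
# Reads log lines on stdin, strips ANSI color escapes (ESC [ <digits> m),
# drops known noise lines, and keeps only lines that look like problems.
filter_log() {
    sed -r "s/[[:cntrl:]]\[[0-9]{1,3}m//g" |
        grep -Ev "positions|NO_ERROR" |
        grep -E "BAD_WORK_UNIT|FAILED|STALLED|error|WARNING|WorkServer connection failed"
}

# Made-up sample input: one colored warning, one noise line, one normal line.
sample_log() {
    printf '\033[93m12:00:01:WARNING:WU00:FS00:WorkServer connection failed\033[0m\n'
    printf '12:00:02:WU00:FS00:Size of positions 100 does not match topology\n'
    printf '12:00:03:WU00:FS00:0x21:Completed 1000 out of 5000 steps (20%%)\n'
}

# Only the warning line should survive, with its color codes removed.
sample_log | filter_log
```

Reading from stdin rather than from a file name makes it easy to feed the function a few test lines before pointing it at a real log.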







This script looks for conditions in log.txt that, in my opinion, require a reboot. It has to be run as root. If you don't want to enter a password every time, and especially if you want to run it from Startup Applications, add an entry for it with sudo visudo and then run it with sudo -n. It uses a few components (for example ksh, notify-send and nc) that you may have to install if they are not already on your box. It gives a 15 minute warning of a pending reboot so you can save your work, or kill the script if you don't want to reboot right then.

Code:
#!/bin/ksh

RESTART(){
notify-send -t 1000000 "$1 - rebooting"  # send message to desktop 15 minutes before rebooting. Gives time to save work or to execute "ps -elf |grep stalled"  then "sudo kill (pid)"  to abort reboot.
sleep 900
echo "$(date)  $1"  >> ~/stalled.log
sudo /sbin/shutdown -r now
}

set -x
cd /var/lib/fahclient
FS00count=$(grep FS00 log.txt |wc -l)
FS01count=$(grep FS01 log.txt |wc -l)
FS02count=$(grep FS02 log.txt |wc -l)
FS03count=$(grep FS03 log.txt |wc -l)
FS04count=$(grep FS04 log.txt |wc -l)
downloadcount=0
sleep 600


while true
do


# loop if internet is down
while true
do
nc -z 8.8.8.8 53  >/dev/null 2>&1  #looks for connection to google
online=$?
if [ $online -eq 0 ]; then  # if nc returns 0 then the internet is up
break
else
notify-send -t 59000 "Internet is down"
fi
sleep 60
done

#################################

#  folding slots hung.  Handles 1 to 5 slots.
if [[ $(grep FS00 log.txt|wc -l) -ne 0 && $(grep FS00 log.txt |wc -l) == $FS00count ]]
then
RESTART "FS00 stuck"
fi

if [[ $(grep FS01 log.txt|wc -l) -ne 0 && $(grep FS01 log.txt |wc -l) == $FS01count ]]
then
RESTART "FS01 stuck"
fi

if [[ $(grep FS02 log.txt|wc -l) -ne 0 && $(grep FS02 log.txt |wc -l) == $FS02count ]]
then
RESTART "FS02 stuck"
fi

if [[ $(grep FS03 log.txt|wc -l) -ne 0 && $(grep FS03 log.txt |wc -l) == $FS03count ]]
then
RESTART "FS03 stuck"
fi

if [[ $(grep FS04 log.txt|wc -l) -ne 0 && $(grep FS04 log.txt |wc -l) == $FS04count ]]
then
RESTART "FS04 stuck"
fi


FS00count=$(grep FS00 log.txt |wc -l)
FS01count=$(grep FS01 log.txt |wc -l)
FS02count=$(grep FS02 log.txt |wc -l)
FS03count=$(grep FS03 log.txt |wc -l)
FS04count=$(grep FS04 log.txt |wc -l)

#############################

#reboot if stuck download
DL=$(egrep -i "Downloading|Download complete" log.txt|wc -l|xargs -i expr {} % 2)  # if the result is 1 then I know one download isn't complete
if [ $DL == 1 ]
then
downloadcount=$(($downloadcount+1))
else
downloadcount=0
fi
if [ $downloadcount == 3 ]  # trigger the reboot only after 3 iterations
then
RESTART "Download stuck"
fi

#GPU stalled
egrep -q "STALLED" log.txt
if [ $? = 0 ]
then
RESTART "STALLED"
fi

failed=$(grep  "WorkServer connection failed" log.txt |wc -l)
if [ $failed -ge 10 ]
then
RESTART "WORKSERVER"
fi


# pause/fold fs if no work units
egrep -i "Download|No WUs available" log.txt|tail -1|egrep "No WUs available|refused"
results=$?
#echo "$(date)    results = $results"
if [ $results = 0 ]
then
INDEX=$(egrep -i "No WUs available" log.txt|tail -1|cut -d F -f2 |cut -b 3)
echo "PAUSED *******  $(date) INDEX = $INDEX" >> ~/stalled.log
printf 'pause %s\nquit\n' "$INDEX" | nc localhost 36330 >/dev/null 2>&1   # printf and >/dev/null 2>&1 are portable across ksh versions
sleep 10
printf 'unpause %s\nquit\n' "$INDEX" | nc localhost 36330 >/dev/null 2>&1
fi


sleep 600  #loop every 10 minutes
done
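
The five copies of the FS00 to FS04 check could be folded into one loop. The following is only a sketch, not a tested replacement for the script above: the names slot_count, check_slots, STUCK_SLOTS and the prev_FSxx variables are made up, and it only reports stuck slots instead of rebooting.

```shell
# Hypothetical sketch: report slots whose log line count stopped growing.
# Previous counts are kept in variables named prev_FS00, prev_FS01, ...

# Print the number of lines in log file $2 that mention slot $1.
slot_count() {
    grep -c "$1" "$2"
}

# Check every slot once against log file $1. A slot is reported as stuck
# when it has log lines but gained none since the previous call.
# Stuck slots are left in the global STUCK_SLOTS, separated by spaces.
check_slots() {
    STUCK_SLOTS=""
    for slot in FS00 FS01 FS02 FS03 FS04; do
        now=$(slot_count "$slot" "$1")
        eval "before=\${prev_$slot:-}"      # count seen on the previous call, if any
        if [ "$now" -ne 0 ] && [ "$now" = "$before" ]; then
            STUCK_SLOTS="$STUCK_SLOTS $slot"
        fi
        eval "prev_$slot=\$now"             # remember the count for the next call
    done
}
```

Calling check_slots /var/lib/fahclient/log.txt once per loop iteration and then looking at $STUCK_SLOTS would replace the five if blocks and the two sets of count assignments. On the first call nothing is reported, because there is no earlier count to compare with.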




I hope others find these useful. I'd appreciate a message from anyone who thinks they have an improvement.
folding as DarthMouse_ALL_1GD5nCZbh7gNo1SESPLT24xEd2Jsu4rTP9
Currently folding on 12 1080 and 1080TI GPUs on Linux Mint
SteveWillis
 
Posts: 382
Joined: Fri Apr 15, 2016 12:42 am

Re: Linux scripts you may find useful

Post by rwh202 » Thu Dec 14, 2017 12:18 pm

Thanks for these, and good timing: my parents are bringing down a couple of computers this weekend for a once-over. They have been running headless in their shed for the past 6 months.
Adding these scripts should make monitoring easier and reduce the downtime I've seen.

Any chance you could please provide a numpty guide for:
SteveWillis wrote:Put an entry in sudo visudo and then run with sudo -n if you want to not have to enter the root password every time and especially if you want to run from start applications.

I could probably figure something out, but if you know the commands for mint off the top of your head...

Thanks
rwh202
 
Posts: 328
Joined: Mon Nov 15, 2010 8:51 pm
Location: South Coast, UK

Re: Linux scripts you may find useful

Post by SteveWillis » Thu Dec 14, 2017 12:23 pm

It just occurred to me that the second script assumes fast GPU folding only, where each GPU logs a completed message at least once every 10 minutes. For slow GPUs you might have to increase the sleep at the end of the script so that it loops more slowly.

Re: Linux scripts you may find useful

Post by SteveWillis » Thu Dec 14, 2017 12:35 pm

You don't want to edit the sudoers file directly, as I've been told you can hose up your system, so in a terminal enter
sudo visudo
and enter your password.

At the end of the file add something like this. The example is for user steve and a script named stalled in the path /home/steve/Dropbox/bin:
steve ALL = NOPASSWD: /home/steve/Dropbox/bin/stalled

Note the instructions at the bottom of the screen. To save, enter
<control> x
y
then press Enter to accept the suggested file name. Leave the .tmp suffix in place: visudo checks the temporary file for syntax errors and then installs it as the real sudoers file itself.
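
If you would rather not touch the main sudoers file at all, a similar entry can go into its own file under /etc/sudoers.d. This is a sketch under assumptions and not something taken from the posts above: the helper name sudoers_entry and the file name fah-stalled are made up, and it assumes your distribution reads /etc/sudoers.d, as Ubuntu-based systems normally do.

```shell
# Hypothetical helper: print a NOPASSWD sudoers line for one user and one script.
# $1 = user name, $2 = absolute path of the script
sudoers_entry() {
    printf '%s ALL = NOPASSWD: %s\n' "$1" "$2"
}

# Example with the names used in the post above.
sudoers_entry steve /home/steve/Dropbox/bin/stalled
```

To install it you could write that line to a temporary file, run sudo visudo -cf on the temporary file to check the syntax, and only then copy it to /etc/sudoers.d/fah-stalled with mode 0440.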


It also occurs to me that I don't CPU fold at all so no guarantees on how the second script will work with CPU folding. The first script would be more universal.

Re: Linux scripts you may find useful

Post by bruce » Thu Dec 14, 2017 10:35 pm

They look great!

Now who would like to port (at least) the first one to Windows?
I'll bet you'd have lots of happy customers.
bruce
 
Posts: 21543
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Linux scripts you may find useful

Post by chris21010 » Thu Dec 14, 2017 11:09 pm

Why would you be looking for a completed message after 10 minutes instead of looking for % progress being made every 2-5 minutes?
chris21010
 
Posts: 10
Joined: Fri Jun 14, 2013 8:33 pm

Re: Linux scripts you may find useful

Post by SteveWillis » Fri Dec 15, 2017 12:34 am

chris21010 wrote:why would you be looking for completed message after 10 minutes instead of looking for % progress being made every 2-5 minutes?


Because it was easy to script and because I wanted to minimize excessive reboots.
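
For anyone who wants to try the progress idea, the percentage could be read from the "Completed ... steps (NN%)" lines. This is only a sketch: the function name last_progress is made up, and the log line format it relies on is an assumption, so check it against your own log.txt first.

```shell
# Hypothetical helper: print the most recent progress percentage logged
# for slot $1 (e.g. FS00) in log file $2, or nothing if none is found.
# Assumes progress lines contain "Completed" and end in "... steps (NN%)".
last_progress() {
    grep "$1" "$2" | grep "Completed" | tail -1 |
        sed -n 's/.*(\([0-9]\{1,3\}\)%).*/\1/p'
}
```

Comparing the value from one pass with the value from the previous pass would show whether the slot is still moving. A new work unit starts again from a low percentage, so a drop should not be treated as a hang.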

Re: Linux scripts you may find useful

Post by SteveWillis » Sun Feb 11, 2018 7:01 pm

Updates to my two Linux scripts. Please read the comments in the scripts and consider them in Beta state.


Update to errors.sh, the script that just pulls the errors, warnings, and notifications from log.txt
The main change to this version is to prompt to press enter to refresh instead of continuous looping. Notes in the comments on how to change back if you prefer the other way.
Also added some additional conditions to display.

Code:
#!/bin/bash
# filename errors.sh
# author Stephen Willis AKA SteveWillis AKA DarthMouse
# This script has only been tested on Linux Mint. Use at your own risk. No warranty expressed or implied under any circumstances. :-)
# Until I get some feedback from others using it I'll consider the script to be in Beta
# This script does NOT have to be executed as root
# Suggestions for new features are welcome but may be ignored
# Please report any bugs you encounter in a PM in foldingforum.org to SteveWillis or to swillis_1@yahoo.com

#Display 1 log file
displayLog(){ 
basename $1
head -1 $1
sed -r "s/[[:cntrl:]]\[[0-9]{1,3}m//g" "$1"|grep -Ev "Killing|Jan 6 2017|positions|NO_ERROR|max-slot-errors|max-unit-errors"|grep -E --color=always "Date|Exception|Failed to read stream|Bad State detected|BAD_WORK_UNIT|DUMPED|FAILED|over|STALLED|BAD_FRAME_CHECKSUM|wuresults|boost|ENUM|error|WORK_QUIT|dumping|FAULTY|WARNING|WorkServer connection failed"
echo ;echo
}

#Find old log file
whichLog(){ 
echo "Current log $1"
xxx=$(ls -tr |tail $1|head -1)
displayLog  $xxx
}


lastReboots(){
if [ -e ~/reboots.log ] # file will only exist if my reboot.sh script is running and has executed a reboot
then
    echo;echo "######################### Last reboots.log entries #############################"
    tail -4 ~/reboots.log
    echo
fi
}


# End of Functions


cd /var/lib/fahclient/logs


while true   #loop till killed
do

reset    #clears the window


# display most recent old log files
#whichLog -10
#whichLog -9
#whichLog -8
#whichLog -7
#whichLog -6
whichLog -5
whichLog -4
whichLog -3
whichLog -2
whichLog -1

displayLog /var/lib/fahclient/log.txt   #display current log.txt file
 
lastReboots             # will not display anything unless my reboot script is running and has executed a reboot
echo "################################################################################"
echo "Current date and time:  $(date)       $(date -u)"

echo
echo -n "Press Enter To Refresh:";read dummy     # comment this line and uncomment sleep if you want script to loop continuously.  Will result in more disk access.
#sleep 180

done






Update to reboot.sh, script that reboots the computer if conditions require it.
Lots of changes and improvements
* Moved everything to functions. Only function calls in the main loop
* Now essentially no limit to the number of folding slots it supports (up to FS99)
* Now supports CPU as well as GPU folding
* Now utilizes a ram disk to minimize disk reads. See comments in the script to set that up.
* Configuration section for the things you have to customize
* Timing now based on the actual time errors occurred. You can customize how long to wait before triggering a restart (thanks chris21010)
Unfortunately, for the timing to work you have to set log-date to true in client control/configure/expert.
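
The reason log-date matters is that the timing checks turn the stamp at the start of a log line into an age in seconds, and that needs the date as well as the time. A minimal sketch of the idea, assuming GNU date and lines that begin with a UTC stamp like 2018-02-11T19:01:00Z (the format is inferred from how the script cuts the line, and the function names are made up for illustration):

```shell
# Hypothetical helpers, assuming GNU date and log lines that begin
# with a UTC stamp of the form YYYY-MM-DDTHH:MM:SSZ.

# Print the epoch seconds for the stamp at the start of line $1.
line_epoch() {
    stamp=$(printf '%s\n' "$1" | cut -c 1-19 | tr 'T' ' ')
    date -u -d "$stamp" +%s
}

# Print how many seconds passed between log line $1 and epoch time $2.
line_age() {
    echo $(( $2 - $(line_epoch "$1") ))
}

# Example: a made-up line stamped 10 minutes before the reference time.
now=$(date -u -d '2018-02-11 19:11:00' +%s)
line_age '2018-02-11T19:01:00Z:WU00:FS00:0x21:Completed 1000 out of 5000 steps (20%)' "$now"
```

An age larger than the configured interval (converted to seconds) is what would count as hung. Without log-date the lines carry only a time of day, so an age cannot be computed reliably across midnight.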



Code:
#!/bin/ksh
# author Stephen Willis AKA SteveWillis AKA DarthMouse
# This script has only been tested on Linux Mint. Use at your own risk. No warranty expressed or implied under any circumstances. :-)
# Until I get feedback from other users I'll consider the script in BETA
# Should work for folding slots up through FS99
# This script must be executed as root, that is, sudo reboot.sh
# I script in korn shell since that's what we used at my job.  You may have to install ksh
# sudo apt-get install ksh
# You have to set log-date true in client control/configure/expert
# Be sure and edit the CONFIGURATION section below.
# I execute this script from /etc/rc.local so that it starts before I actually log back in.  From there by default it's executed as root.  Just add: yourpath/reboot.sh & to the end of the file.
# you may have to install beep and beep will not work on your system if your MB doesn't have a speaker. In which case don't bother to install.
# apt-get install beep
# You must create the ramdisk (reduces disk activity). It gets mounted when the script runs. You need to create the mount point.
#    sudo mkdir /mnt/ramdisk
# touch $HOME/exit_flag  to exit this script without reboot. 
# touch $HOME/bounce_flag  to exit and restart this script.  Probably not useful unless, like me, you tweak the script a lot.
# touch $HOME/mute_flag to mute the beeps if your MB supports them. I set and unset this at night (create and delete the file) from crontab.
#
# If you add a FS while the script is running you will need to restart it
# Suggestions for new features are welcome but may be ignored
# Please report any bugs you encounter in a PM in foldingforum.org to SteveWillis or to swillis_1@yahoo.com




#***********CONFIGURATION SECTION***********************
HOMEPATH=/home/steve                              # *** hard code your home path here if you run the script (as root by default) as I do from /etc/rc.local or if not change to $HOME
#HOMEPATH=$HOME                                   # Executing from /etc/rc.local runs before you even log in. In which case you don't want the log files going to /root
                                                  # Or you can change to $HOME if you don't
username=steve                                    # your user name, necessary for notify-send to work since script is run as root
notifyHostname=linux01                            # hostname you want to send notifications on. I run three rigs but only want notifications on my main pc

REBOOTPATH=/home/steve/Dropbox/bin/reboot.sh      # The location of this script. Only necessary if you use bounce_flag

badConnectionInterval=6                           #This is the interval in minutes to determine that a download is hung or WorkServer connection failed
hungFSIntervalGPU=6                               #This is the interval in minutes to determine that a GPU FS is hung. I suggest the maximum observed interval plus 5
hungFSIntervalCPU=20                              #This is the interval in minutes to determine that a CPU FS is hung. I suggest the maximum observed interval plus 10
rebootWarning=10                                  #This is the number of minutes warning displayed to the desktop before a reboot
logFileGenerations=50                             #Generations of log file to keep
set -x                                            #display all steps to $logfile for debug. If commented only logs errors and reduces disk activity

enable_flags=1                                    # flag files  1 = enabled, anything else = disabled (which slightly reduces disk access)
#************END CONFIGURATION**************************



RESTART(){
echo "$errstr   $(date)     $(date -u)"
echo "$errstr   $(date)     $(date -u)"  >> $HOMEPATH/reboots.log
if [[ $sendNotifications == 1 ]]
then
    for min in {1..$rebootWarning}  # once per minute for $rebootWarning minutes beep 5 times and notify
    do
        minutes=$(($rebootWarning-$min))
        for sec in {1..60}
        do
            seconds=$((60-$sec))
            sudo -u $username DISPLAY=:0.0  notify-send -t 995 "$errstr - rebooting in $minutes minutes  $seconds seconds"  # Send message to desktop every second for $rebootWarning minutes before rebooting.
            sleep 1
        done                                                            # Gives time to save work or to "touch $HOME/exit_flag" to kill script and abort reboot.
        if [[ $enable_flags == 1 && ! -e $HOMEPATH/mute_flag ]]
        then
           beep -l 500  -r 5
        fi

        if [[ $enable_flags == 1 && -e $HOMEPATH/exit_flag ]]    # if $HOMEPATH/exit_flag exists in your home directory delete the file and exit script.
        then
            rm $HOMEPATH/exit_flag
            if [[ ! -e $HOMEPATH/mute_flag ]]
            then
                beep -l 2000   #longer beep to signal reboot was canceled and script exit
            fi
            exit 0  # exit the script
        fi
        if [[ $enable_flags == 1 && -e $HOMEPATH/boot_flag ]]    # if $HOMEPATH/boot_flag exists in your home directory delete the file and reboot
        then
            rm $HOMEPATH/boot_flag
            if [ ! -e $HOMEPATH/mute_flag ]
            then
                beep -l 2000   
            fi
            break  # exit the loop and immediately reboot
        fi
    done
else
    sleep 600
fi
sync
sudo /sbin/shutdown -r now
sync
exit 0
}

#reboot if error_threshold BAD_WORK_UNIT errors in the last 100 (arbitrary) lines of the log(I got 40 bad work unit errors in fifteen minutes.  Have no idea why. Reboot stopped them)
badWorkUnit(){
error_threshold=$1
for ii in "${!FSindexArray[@]}"
do
    FS=FS${FSindexArray[ii]}
    if [[ $( grep $FS  $logFileToRead| tail -100|grep -c BAD_WORK_UNIT) -ge $error_threshold ]]
    then
        /etc/init.d/FAHClient  stop
        errstr="BAD_WORK_UNIT $(grep $FS $logFileToRead |grep Project|tail -1 | while read xxx;do echo "FS$(echo $xxx|cut -d 'S' -f 2)";done)"
        RESTART
    fi
done
}



# loop if internet is down
isInternetDown(){
wasdown=0
while true  #repeat until return
do
    nc -z 8.8.8.8 53  >/dev/null 2>&1  #looks for connection to google
    online=$?
    if [ $online -eq 0 ] # if nc returns 0 then the internet is up now.
    then               
        if [ $wasdown == 1 ]
        then
            sleep 300   #I don't want to exit the function unless the internet has been back up at least 5 minutes
            isInternetDown
        fi
        return
    else
        wasdown=1
        if [[ $sendNotifications == 1 ]]
        then
            sudo -u $username DISPLAY=:0.0 notify-send -t 59000 "Internet is down"
        fi
    fi
    sleep 60
done
}


#no work units
no_WU(){
for ii in "${!FSindexArray[@]}"
do
    FSindex=${FSindexArray[ii]}
    if [[ $(egrep "Download|No WUs available"  $logFileToRead |grep FS$FSindex |tail -1|grep -c "No WUs available") == 1 ]]
    then
        echo " FS$FSindex  no_WU, pause/unpause    $(date)    $(date -u)  "  >> $HOMEPATH/reboots.log
        if [[ $sendNotifications == 1 ]]
        then
            sudo -u $username DISPLAY=:0.0 notify-send -t 10000 "no_WU - FS$FSindex"
        fi
        FAHClient --send-pause $FSindex
        sleep 10
        FAHClient --send-unpause $FSindex
    fi
done
}

isClientRunning(){
for i in {1..120}           # loop until FAHClient is running is true or times out and exits script
do
   /etc/init.d/FAHClient status|grep "fahclient is running"
   if [ $? == 0 ]    #success
    then
        return
    fi
    sleep 1
done
echo "FAHClient is NOT RUNNING. exit"
if [[ $sendNotifications == 1 ]]
then
    sudo -u $username DISPLAY=:0.0  notify-send -t 1000000 "FAHClient is NOT RUNNING. exit"   
    if [[ $enable_flags == 1 && ! -e $HOMEPATH/mute_flag ]]
    then
       beep -l 1500  -r 5
    fi
fi
sync
exit 0
}
 


#  folding slot hung.  Reboot if the FS doesn't increment during $hungFSIntervalGPU or $hungFSIntervalCPU
FS_Hung(){
currTime=$(date -u +%s)
for ii in "${!FSindexArray[@]}"
do
    FS=FS${FSindexArray[ii]}
    FSType=${FSindexArrayType[ii]}
    if [[ $(grep $FS $logFileToRead|egrep "Paused|Completed" |tail -1 |grep -c "Paused") == 1 ]]  #Skip execution if FS is Paused
    then
        continue
    fi
    lastTime=$(xxx=$(grep $FS $logFileToRead|tail -1|cut -d ":" -f1-4) ; echo "$(echo $xxx|cut -b 1-10) $(echo $xxx|cut -b 12-19)"|xargs -i date -u -d'{}'  +%s)
    timeDiff=$(($currTime-$lastTime))

    if [[ $FSType == "GPU" && $timeDiff -ge $hungFSIntervalGPU ]] || [[ $FSType == "CPU" && $timeDiff -ge $hungFSIntervalCPU ]]
    then
        errstr="$FS hung $(grep $FS $logFileToRead |grep Project|tail -1 | while read xxx;do echo "FS$(echo $xxx|cut -d 'S' -f 2)";done)"
        RESTART
    fi
done
}



#GPU stalled
stalled(){
if [ $(grep -c "STALLED" $logFileToRead) -ge 1 ]
then
    FS=$(grep STALLED $logFileToRead|tail -1 |cut -d "S" -f 2|cut -d : -f 1 )
    errstr="STALLED $(grep FS$FS $logFileToRead |grep Project|tail -1 | while read xxx;do echo "FS$(echo $xxx|cut -d 'S' -f 2)";done)"
    RESTART
fi
}


lookForInitialBadWorkUnits(){
if [ -e $HOMEPATH/bounce_flag ]  #warm start so skip this section
then
    rm $HOMEPATH/bounce_flag
else
    for i in {1..120}                           #Probably only affects me but every once in a while the client starts up throwing bad work units and I want to trigger reboot asap
    do
        grep -v topology /var/lib/fahclient/log.txt > $logFileToRead
        badWorkUnit 1 
        breakcount=0
        for j in "${!FSindexArray[@]}"
        do
            if [[ $(grep FS${FSindexArray[j]} $logFileToRead|tail -1 |grep -c Completed) == 1 ]]
            then
                ((breakcount+=1))
            fi
            if [ $breakcount == ${#FSindexArray[@]} ]  #array length
            then
                break 2                         # break out of both loops if all FS are folding
            fi
        done
        sleep 1
    done 
fi
}

lookForExitFlag(){
#exit
if [[ $enable_flags == 1 && -e $HOMEPATH/exit_flag ]]    # if $HOMEPATH/exit_flag exists in your home directory delete the file and exit script.
then
    rm $HOMEPATH/exit_flag
    sudo -u $username DISPLAY=:0.0 notify-send -t 20000 "exit reboot.sh"
    if [ ! -e $HOMEPATH/mute_flag ]
    then
        beep -l 2000   #longer beep to signal reboot was canceled and script exit
    fi
    sync
    sleep 1
    exit 0  # exit the script
fi
}

lookForBounceFlag(){
#bounce
if [[ $enable_flags == 1 && -e $HOMEPATH/bounce_flag ]]    # if $HOMEPATH/bounce_flag exists RESTART reboot.sh, and exit the script
then
    sudo -u $username DISPLAY=:0.0 notify-send -t 20000 "bounce reboot.sh"
    if [ ! -e $HOMEPATH/mute_flag ]
    then
        beep -l 2000   
    fi
    chmod 744 $REBOOTPATH    #I change permissions because for security the script's owner is root and when I edit it the permission gets changed in Dropbox on my other rigs.
    $REBOOTPATH  & # Start new execution of reboot.sh
    sync
    sleep 1
    exit 0
fi
}

workserverConnectionFailed(){
    if [ $(egrep "Upload complete|WorkServer connection failed" $logFileToRead|tail -1 |grep -c "WorkServer connection failed") == 1 ]
    then
        currTime=$(date -u +%s)
        lastTime=$(xxx=$(grep "WorkServer connection failed" $logFileToRead|tail -1|cut -d ":" -f1-4) ; echo "$(echo $xxx|cut -b 1-10) $(echo $xxx|cut -b 12-19)"|xargs -i date -u -d'{}'  +%s)
        timeDiff=$(($currTime-$lastTime))
        if [[ $timeDiff -ge $badConnectionInterval ]]
        then
            errstr="WorkServer connection failed"
            RESTART
        fi
    fi
}
 

   #reboot if stuck download
downloadStuck(){
currTime=$(date -u +%s)
for ii in "${!FSindexArray[@]}"
do
    FS=FS${FSindexArray[ii]}
    DL=$(grep $FS $logFileToRead| grep -v "Downloading core" | grep -E "Downloading|Download complete" |tail -1 |grep -c Downloading)  # if the result is 1 then the most recent download message is a "Downloading" with no "Download complete" after it
    if [ $DL == 1 ]
    then
        lastTime=$(xxx=$(grep $FS $logFileToRead|grep Downloading| tail -1|cut -d ":" -f1-4) ; echo "$(echo $xxx|cut -b 1-10) $(echo $xxx|cut -b 12-19)"|xargs -i date -u -d'{}'  +%s)
        timeDiff=$(($currTime-$lastTime))
        if [[ $timeDiff -ge $badConnectionInterval ]]
        then
            errstr="Download stuck  $FS"
            RESTART
        fi
    fi
done
}


populateFSArrays(){
grep -v topology /var/lib/fahclient/log.txt > $logFileToRead
c=0
grep "slot id" $logFileToRead|cut -d '<' -f 2 |sort -u|while read line
do
    echo $line|cut -d "'" -f2|read FS
    typeset -Z2 FSindexArray[c]     # define as 2 digit with leading zero
    FSindexArray[c]=$FS
    echo $line |cut -d "'" -f4 | read type
    FSindexArrayType[c]=$type
    ((c+=1))
done   #set up the folding slot arrays
for i in "${!FSindexArray[@]}"; do  echo "FSindexArray[i] is ${FSindexArray[i]}"  ;done
for i in "${!FSindexArrayType[@]}"; do  echo "FSindexArrayType[i] is ${FSindexArrayType[i]}"  ;done
}

#End of functions







#main

((hungFSIntervalGPU*=60))
((hungFSIntervalCPU*=60))
((badConnectionInterval*=60))
if [[ $(hostname|grep -c $notifyHostname) == 1 ]]; then sendNotifications=1; fi

logfile="$HOMEPATH/reboot.log.$(date  -u +"%Y%m%dT%H%M")"                           #define the logfile path and name, including date string to make it unique
exec >$logfile  2>&1                                                            #send output to $logfile.
ls -1tr $HOMEPATH/reboot.log.*| head -n -$logFileGenerations | xargs -i rm {}   #delete old versions of reboot.log*

modprobe pcspkr                                                                 # Enable speaker if your MB has one
mount -t tmpfs -o size=100m tmpfs /mnt/ramdisk                                  # I set the size to 100m which is about 10 times the largest file I have in logs but I have plenty of RAM.
logFileToRead=/mnt/ramdisk/log.txt
isClientRunning

sleep 1
beep -l 1000   

populateFSArrays

lookForInitialBadWorkUnits


while true  #repeat forever
do
    date -u                     # useful in debugging. Otherwise comment it out to prevent disk writes
    isInternetDown              # If the internet goes down loop till it's been back up at least 5 minutes
    isClientRunning             #
#   stalled                    # Uncomment if stalled FS reduces PPD on that slot and you want to start triggering a reboot
    workserverConnectionFailed 
    FS_Hung
    badWorkUnit 4
    downloadStuck
    no_WU                       # pauses and unpauses the FS to try and force quicker retry.  Once per minute.
    lookForExitFlag
    lookForBounceFlag
 
    sleep 60
    grep -v topology /var/lib/fahclient/log.txt > $logFileToRead
done
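
One portability note on populateFSArrays: it fills the arrays from inside a pipeline (... | while read line). That works in ksh, which runs the last stage of a pipeline in the current shell, but bash runs it in a subshell and the arrays would come back empty. The sketch below sidesteps the problem by printing the slot list instead of setting variables. The name list_slots is made up, and the <slot id='0' type='GPU'> line format is inferred from the cut commands in the script, so treat it as an assumption.

```shell
# Hypothetical helper: print one "index type" pair per folding slot
# found in log file $1, with the index zero padded to two digits.
# Assumes config lines of the form <slot id='0' type='GPU'>.
list_slots() {
    grep "slot id" "$1" | sort -u |
        sed -n "s/.*<slot id='\([0-9]*\)' type='\([A-Za-z]*\)'.*/\1 \2/p" |
        while read -r id type; do
            printf '%02d %s\n' "$id" "$type"
        done
}
```

The output (for example "00 CPU" and "01 GPU" on separate lines) can then be read by the caller in whatever way its shell supports, so the helper itself does not depend on ksh pipeline behavior.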




My next project will be to try to port the errors.sh script to Windows.

Re: Linux scripts you may find useful

Post by bruce » Sun Feb 11, 2018 10:33 pm

My Linux system went belly-up and (AT LEAST) needs a rebuild so I can't try your scripts right now.

In the "log" tab of FAHClient, there's an option to view Warnings & Errors. Is errors.sh fundamentally based on the same concepts?

Re: Linux scripts you may find useful

Post by SteveWillis » Sun Feb 11, 2018 10:52 pm

Hi Bruce, Sorry about your system.
I guess that they are basically similar with some differences.
The client output is pretty unusable due to all the "Size of positions xxxx does not match topology" type entries. I thought I saw a post that that was being fixed but I believe I have the latest version and they are still there.
My version reports at least some items that aren't shown in the client log.
My version shows results from the current log and also from a few previous logs which I find useful in seeing recurrent issues and trends. Especially useful looking at the most recent old log after a system restart.

Re: Linux scripts you may find useful

Post by SteveWillis » Thu Feb 15, 2018 7:27 pm

Bug fixed in reboot.sh. It only showed up if the internet went down briefly.

Code:
#!/bin/ksh
# author Stephen Willis AKA SteveWillis AKA DarthMouse
# This script has only been tested on Linux Mint. Use at your own risk. No warranty expressed or implied under any circumstances. :-)
# Until I get feedback from other users I'll consider the script in BETA
# Should work for folding slots up through FS99
# This script must be executed as root, that is, sudo reboot.sh
# I script in korn shell since that's what we used at my job.  You may have to install ksh
# sudo apt-get install ksh
# You have to set log-date true in client control/configure/expert
# Be sure and edit the CONFIGURATION section below.
# I execute this script from /etc/rc.local so that it starts before I actually log back in.  From there by default it's executed as root.  Just add: yourpath/reboot.sh & to the end of the file.
# you may have to install beep and beep will not work on your system if your MB doesn't have a speaker. In which case don't bother to install.
# apt-get install beep
# You must create the ramdisk (reduces disk activity). It gets mounted when the script runs. You need to create the mount point.
#    sudo mkdir /mnt/ramdisk
# touch $HOME/exit_flag  to exit this script without reboot. 
# touch $HOME/bounce_flag  to exit and restart this script.  Probably not useful unless, like me, you tweak the script a lot.
# touch $HOME/mute_flag to mute the beeps if your MB supports them. I set and unset this at night (create and delete the file) from crontab.
#
# If you add a FS while the script is running you will need to restart it
# Suggestions for new features are welcome but may be ignored
# Please report any bugs you encounter in a PM in foldingforum.org to SteveWillis or to swillis_1@yahoo.com




#***********CONFIGURATION SECTION***********************
HOMEPATH=/home/steve                              # *** hard code your home path here if you run the script (as root by default) as I do from /etc/rc.local or if not change to $HOME
#HOMEPATH=$HOME                                   # Executing from /etc/rc.local runs before you even log in. In which case you don't want the log files going to /root
                                                  # Or you can change to $HOME if you don't
username=steve                                    # your user name, necessary for notify-send to work since script is run as root

REBOOTPATH=/home/steve/Dropbox/bin/reboot.sh      # The location of this script. Only necessary if you use bounce_flag

badConnectionInterval=6                           #This is the interval in minutes to determine that a download is hung or WorkServer connection failed
hungFSIntervalGPU=6                               #This is the interval in minutes to determine that a GPU FS is hung. I suggest the maximum observed interval plus 5
hungFSIntervalCPU=20                              #This is the interval in minutes to determine that a CPU FS is hung. I suggest the maximum observed interval plus 10
rebootWarning=10                                  #This is the number of minutes warning displayed to the desktop before a reboot
logFileGenerations=50                             #Generations of log file to keep
set -x                                            #display all steps to $logfile for debug. If commented only logs errors and reduces disk activity

enable_flags=1                                    # flag files  1 = enabled, anything else = disabled (which slightly reduces disk access)
#************END CONFIGURATION**************************



RESTART(){
echo "$errstr   $(date)     $(date -u)"
echo "$errstr   $(date)     $(date -u)"  >> $HOMEPATH/reboots.log
   for min in {1..$rebootWarning}  # once per minute for $rebootWarning minutes beep 5 times and notify
    do
        minutes=$(($rebootWarning-$min))
        for sec in {1..60}
        do
            seconds=$((60-$sec))
            sudo -u $username DISPLAY=:0.0  notify-send -t 995 "$errstr - rebooting in $minutes minutes  $seconds seconds"  # Send message to desktop every second for $rebootWarning minutes before rebooting.
            sleep 1
        done                                                            # Gives time to save work or to "touch $HOME/exit_flag" to kill script and abort reboot.
        if [[ $enable_flags == 1 && ! -e $HOMEPATH/mute_flag ]]
        then
           beep -l 500  -r 5
        fi

        if [[ $enable_flags == 1 && -e $HOMEPATH/exit_flag ]]    # if $HOMEPATH/exit_flag exists in your home directory delete the file and exit script.
        then
            rm $HOMEPATH/exit_flag
            if [[ ! -e $HOMEPATH/mute_flag ]]
            then
                beep -l 2000   #longer beep to signal reboot was canceled and script exit
            fi
            exit 0  # exit the script
        fi
        if [[ $enable_flags == 1 && -e $HOMEPATH/boot_flag ]]    # if $HOMEPATH/boot_flag exists in your home directory delete the file and reboot
        then
            rm $HOMEPATH/boot_flag
            if [ ! -e $HOMEPATH/mute_flag ]
            then
                beep -l 2000   
            fi
            break  # exit the loop and immediately reboot
        fi
    done
sync
sudo /sbin/shutdown -r now
sync
exit 0
}

#reboot if $error_threshold BAD_WORK_UNIT errors appear in the last 100 (arbitrary) lines of the log. (I got 40 bad work unit errors in fifteen minutes. Have no idea why. A reboot stopped them.)
badWorkUnit(){
error_threshold=$1
for ii in "${!FSindexArray[@]}"
do
    FS=FS${FSindexArray[$ii]}
    if [[ $( grep $FS  $logFileToRead| tail -100|grep -c BAD_WORK_UNIT) -ge $error_threshold ]]
    then
        /etc/init.d/FAHClient  stop
        errstr="BAD_WORK_UNIT $(grep $FS $logFileToRead |grep Project|tail -1 | while read xxx;do echo "FS$(echo $xxx|cut -d 'S' -f 2)";done)"
        RESTART
    fi
done
}



# loop if internet is down
isInternetDown(){
wasdown=0
while true  #repeat until return
do
    nc -z 8.8.8.8 53  >/dev/null 2>&1  #looks for connection to google
    online=$?
    if [ $online -eq 0 ] # if nc returns 0 then the internet is up now.
    then               
        if [ $wasdown == 1 ]
        then
            sleep 300   #I don't want to exit the function unless the internet has been back up at least 5 minutes
            wasdown=0
            isInternetDown
        fi
        return
    else
        wasdown=1
        sudo -u $username DISPLAY=:0.0 notify-send -t 59000 "Internet is down"
    fi
    sleep 60
done
}


#no work units
no_WU(){
for ii in "${!FSindexArray[@]}"
do
    FSindex=${FSindexArray[$ii]}
    if [[ $(egrep "Download|No WUs available"  $logFileToRead |grep FS$FSindex |tail -1|grep -c "No WUs available") == 1 ]]
    then
        echo " FS$FSindex  no_WU, pause/unpause    $(date)    $(date -u)  "  >> $HOMEPATH/reboots.log
        sudo -u $username DISPLAY=:0.0 notify-send -t 10000 "no_WU - FS$FSindex"
        FAHClient --send-pause $FSindex
        sleep 10
        FAHClient --send-unpause $FSindex
    fi
done
}

isClientRunning(){
for i in {1..120}           # loop until FAHClient is running is true or times out and exits script
do
   /etc/init.d/FAHClient status|grep "fahclient is running"
   if [ $? == 0 ]    #success
    then
        return
    fi
    sleep 1
done
echo "FAHClient is NOT RUNNING. exit"
sudo -u $username DISPLAY=:0.0  notify-send -t 1000000 "FAHClient is NOT RUNNING. exit"   
if [[ $enable_flags == 1 && ! -e $HOMEPATH/mute_flag ]]
then
   beep -l 1500  -r 5
fi
sync
exit 0
}
 


#  folding slot hung.  Reboot if the FS doesn't increment during $hungFSIntervalGPU or $hungFSIntervalCPU
FS_Hung(){
date -u
currTime=$(date -u +%s)
for ii in "${!FSindexArray[@]}"
do
    FS=FS${FSindexArray[$ii]}
    FSType=${FSindexArrayType[$ii]}
    if [[ $(grep $FS $logFileToRead|egrep "Paused|Completed" |tail -1 |grep -c "Paused") == 1 ]]  #Skip execution if FS is Paused
    then
        continue
    fi
    lastTime=$(xxx=$(grep $FS $logFileToRead|grep Completed |tail -1|cut -d ":" -f1-4) ; echo "$(echo $xxx|cut -b 1-10) $(echo $xxx|cut -b 12-19)"|xargs -i date -u -d'{}'  +%s)
    timeDiff=$(($currTime-$lastTime))

    if [[ $FSType == "GPU" && $timeDiff -ge $hungFSIntervalGPU ]] || [[ $FSType == "CPU" && $timeDiff -ge $hungFSIntervalCPU ]]
    then
        errstr="$FS hung $(grep $FS $logFileToRead |grep Project|tail -1 | while read xxx;do echo "FS$(echo $xxx|cut -d 'S' -f 2)";done)"
        RESTART
    fi
done
}



#GPU stalled
stalled(){
if [ $(grep -c "STALLED" $logFileToRead) -ge 1 ]
then
    FS=$(grep STALLED $logFileToRead|tail -1 |cut -d "S" -f 2|cut -d : -f 1 )
    errstr="STALLED $(grep FS$FS $logFileToRead |grep Project|tail -1 | while read xxx;do echo "FS$(echo $xxx|cut -d 'S' -f 2)";done)"
    RESTART
fi
}


lookForInitialBadWorkUnits(){
if [ -e $HOMEPATH/bounce_flag ]  #warm start so skip this section
then
    rm $HOMEPATH/bounce_flag
else
    for i in {1..120}                           #Probably only affects me but every once in a while the client starts up throwing bad work units and I want to trigger reboot asap
    do
        grep -v topology /var/lib/fahclient/log.txt > $logFileToRead
        badWorkUnit 1 
        breakcount=0
        for j in "${!FSindexArray[@]}"
        do
            if [[ $(grep FS${FSindexArray[$j]} $logFileToRead|tail -1 |grep -c Completed) == 1 ]]
            then
                ((breakcount+=1))
            fi
            if [ $breakcount == ${#FSindexArray[@]} ]  #array length
            then
                break 2                         # break out of both loops if all FS are folding
            fi
        done
        sleep 1
    done 
fi
}

lookForExitFlag(){
#exit
if [[ $enable_flags == 1 && -e $HOMEPATH/exit_flag ]]    # if $HOMEPATH/exit_flag exists in your home directory delete the file and exit script.
then
    rm $HOMEPATH/exit_flag
    sudo -u $username DISPLAY=:0.0 notify-send -t 20000 "exit reboot.sh"
    if [ ! -e $HOMEPATH/mute_flag ]
    then
        beep -l 2000   #longer beep to signal reboot was canceled and script exit
    fi
    sync
    sleep 1
    exit 0  # exit the script
fi
}

lookForBounceFlag(){
#bounce
if [[ $enable_flags == 1 && -e $HOMEPATH/bounce_flag ]]    # if $HOMEPATH/bounce_flag exists RESTART reboot.sh, and exit the script
then
    sudo -u $username DISPLAY=:0.0 notify-send -t 20000 "bounce reboot.sh"
    if [ ! -e $HOMEPATH/mute_flag ]
    then
        beep -l 2000   
    fi
    chmod 744 $REBOOTPATH    #I change permissions because for security the script's owner is root and when I edit it the permission gets changed in Dropbox on my other rigs.
    $REBOOTPATH  & # Start new execution of reboot.sh
    sync
    sleep 1
    exit 0
fi
}

workserverConnectionFailed(){
    if [ $(egrep "Upload complete|WorkServer connection failed" $logFileToRead|tail -1 |grep -c "WorkServer connection failed") == 1 ]
    then
        currTime=$(date -u +%s)
        lastTime=$(xxx=$(grep "WorkServer connection failed" $logFileToRead|tail -1|cut -d ":" -f1-4) ; echo "$(echo $xxx|cut -b 1-10) $(echo $xxx|cut -b 12-19)"|xargs -i date -u -d'{}'  +%s)
        timeDiff=$(($currTime-$lastTime))
        if [[ $timeDiff -ge $badConnectionInterval ]]
        then
            errstr="WorkServer connection failed"
            RESTART
        fi
    fi
}
 

   #reboot if stuck download
downloadStuck(){
currTime=$(date -u +%s)
for ii in "${!FSindexArray[@]}"
do
    FS=FS${FSindexArray[$ii]}
    DL=$(grep $FS $logFileToRead| grep -v "Downloading core" | egrep "Downloading|Download complete" |tail -1 |grep -c Downloading)  # 1 means the last download message for this FS is still "Downloading", so the download isn't complete
    if [ $DL == 1 ]
    then
        lastTime=$(xxx=$(grep $FS $logFileToRead|grep Downloading| tail -1|cut -d ":" -f1-4) ; echo "$(echo $xxx|cut -b 1-10) $(echo $xxx|cut -b 12-19)"|xargs -i date -u -d'{}'  +%s)
        timeDiff=$(($currTime-$lastTime))
        if [[ $timeDiff -ge $badConnectionInterval ]]
        then
            errstr="Download stuck  $FS"
            RESTART
        fi
    fi
done
}


populateFSArrays(){
grep -v topology /var/lib/fahclient/log.txt > $logFileToRead
c=0
grep "slot id" $logFileToRead|cut -d '<' -f 2 |sort -u|while read line
do
    echo $line|cut -d "'" -f2|read FS
    typeset -Z2 FSindexArray[c]     # define as 2 digit with leading zero
    FSindexArray[c]=$FS
    echo $line |cut -d "'" -f4 | read type
    FSindexArrayType[c]=$type
    ((c+=1))
done   #set up the folding slot arrays
for i in "${!FSindexArray[@]}"; do  echo "FSindexArray[i] is ${FSindexArray[i]}"  ;done
for i in "${!FSindexArrayType[@]}"; do  echo "FSindexArrayType[i] is ${FSindexArrayType[i]}"  ;done
}

#End of functions







#main

((hungFSIntervalGPU*=60))
((hungFSIntervalCPU*=60))
((badConnectionInterval*=60))

logfile="$HOMEPATH/reboot.log.$(date  -u +"%Y%m%dT%H%M")"                           #define the logfile path and name, including date string to make it unique
exec >$logfile  2>&1                                                            #send output to $logfile.
ls -1tr $HOMEPATH/reboot.log.*| head -n -$logFileGenerations | xargs -i rm {}   #delete old versions of reboot.log*

modprobe pcspkr                                                                 # Enable speaker if your MB has one
mount -t tmpfs -o size=100m tmpfs /mnt/ramdisk                                  # I set the size to 100m which is about 10 times the largest file I have in logs but I have plenty of RAM.
logFileToRead=/mnt/ramdisk/log.txt
isClientRunning

sleep 1
beep -l 1000   

populateFSArrays

lookForInitialBadWorkUnits


while true  #repeat forever
do
    date -u                     # useful in debugging. Otherwise comment it out to prevent disk writes
    isInternetDown              # If the internet goes down loop till it's been back up at least 5 minutes
    isClientRunning             #
    grep -v topology /var/lib/fahclient/log.txt > $logFileToRead
#   stalled                    # Uncomment if stalled FS reduces PPD on that slot and you want to start triggering a reboot
    workserverConnectionFailed 
    FS_Hung
    badWorkUnit 4
    downloadStuck
    no_WU                       # pauses and unpauses the FS to try and force quicker retry.  Once per minute.
    lookForExitFlag
    lookForBounceFlag
 
    sleep 60

done
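
The lastTime lines in FS_Hung, workserverConnectionFailed and downloadStuck all do the same job: cut the date and time off the front of a log line and turn them into epoch seconds. Below is a minimal sketch of that step as a standalone helper. The function names log_line_epoch and seconds_since are made up for illustration and are not part of reboot.sh. It assumes each log line starts with the date in bytes 1-10 and the time in bytes 12-19 (which is what the cut -b calls in the script rely on) and that GNU date is installed.

```sh
#!/bin/sh
# Hypothetical helpers, not part of the original script.
# Assumption: a log line starts with "YYYY-MM-DD?HH:MM:SS", i.e. the date in
# bytes 1-10, one separator byte, then the time in bytes 12-19.
# Requires GNU date for the -d option.

# Print the UTC epoch seconds of the timestamp at the start of log line $1.
log_line_epoch() {
    line=$1
    day=$(printf '%s\n' "$line" | cut -b 1-10)     # date part
    clock=$(printf '%s\n' "$line" | cut -b 12-19)  # time part
    date -u -d "$day $clock" +%s
}

# Print how many seconds lie between epoch time $1 and the log line $2.
seconds_since() {
    now=$1
    last=$(log_line_epoch "$2") || return 1
    echo $((now - last))
}
```

In the script this would be used along the lines of seconds_since "$(date -u +%s)" "$(grep $FS $logFileToRead | grep Completed | tail -1)", with the result compared against the interval in seconds. Treat it as a sketch and check it against your own log format first.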

SteveWillis
 
Posts: 382
Joined: Fri Apr 15, 2016 12:42 am

Re: Linux scripts you may find useful

Postby SteveWillis » Mon Feb 26, 2018 3:05 am

reboot.sh update
Code: Select all
#!/bin/ksh
# author Stephen Willis AKA SteveWillis AKA DarthMouse
# This script has only been tested on Linux Mint. Use at your own risk. No warranty expressed or implied under any circumstances. :-)
# updates published at https://foldingforum.org/viewtopic.php?f=96&t=30504
# Until I get feedback from other users I'll consider the script in BETA
# Should work for folding slots up through FS99
# This script must be executed as root, that is, sudo reboot.sh
# You have to set log-date true in client control/configure/expert for the time out functions to work
# Be sure and edit the CONFIGURATION section below.
# I execute this script from /etc/rc.local so that it starts before I actually log back in.  From there by default it's executed as root.  Just add: yourpath/reboot.sh & to the end of the file.
# you may have to install beep and beep will not work on your system if your MB doesn't have a speaker. In which case don't bother to install.
# apt-get install beep
# touch $HOME/exit_flag  to exit this script without reboot. 
# touch $HOME/bounce_flag  to exit and restart this script.  Probably not useful unless, like me, you tweak the script a lot.
# touch $HOME/mute_flag to mute the beeps if your MB supports them. I set and unset this at night (create and delete the file) from crontab.
#
# If you add a FS while the script is running you will need to restart it
# Suggestions for new features are welcome but may be ignored
# Please report any bugs you encounter in a PM in foldingforum.org to SteveWillis or to swillis_1@yahoo.com

#20180225    automate creation of /mnt/ramdisk, bounce FAH if GPU offline



#***********CONFIGURATION SECTION***********************
set -x                                            #display all steps to $logfile for debug. If commented only logs errors and reduces disk activity
HOMEPATH=/home/steve                              # *** Hard-code your home path here if you run the script from /etc/rc.local (as root), as I do.
#HOMEPATH=$HOME                                   # Otherwise use $HOME. From /etc/rc.local the script runs before you even log in,
                                                  # and in that case you don't want the log files going to /root.
username=steve                                    # your user name, necessary for notify-send to work since script is run as root

REBOOTPATH=/home/steve/Dropbox/bin/reboot.sh      # The location of this script. Only necessary if you use bounce_flag

badConnectionInterval=8                           #This is the interval in minutes to determine that a download is hung or WorkServer connection failed
hungFSIntervalGPU=8                               #This is the interval in minutes to determine that a GPU FS is hung. I suggest the maximum observed interval plus 5
hungFSIntervalCPU=30                              #This is the interval in minutes to determine that a CPU FS is hung. I suggest the maximum observed interval plus 10
rebootWarning=10                                  #This is the number of minutes warning displayed to the desktop before a reboot
logFileGenerations=30                             #Generations of log file to keep

enable_flags=1                                    # flag files  1 = enabled, anything else = disabled (which slightly reduces disk access)
#************END CONFIGURATION**************************



bounceFAH(){
echo "$errstr   $(date)     $(date -u)"
echo "$errstr   $(date)     $(date -u)"  >> $HOMEPATH/reboots.log
 sudo -u $username DISPLAY=:0.0  notify-send -t 999999 "FAHClient restarted" 
 /etc/init.d/FAHClient stop
 sleep 10
 pkill -e FAHCore
 sleep 10
 /etc/init.d/FAHClient start

}


RESTART(){
echo "$errstr   $(date)     $(date -u)"
echo "$errstr   $(date)     $(date -u)"  >> $HOMEPATH/reboots.log
   for min in {1..$rebootWarning}  # once per minute for $rebootWarning minutes beep 5 times and notify
    do
        if [[ $enable_flags == 1 && ! -e $HOMEPATH/mute_flag ]]
        then
           beep -l 500  -r 5  &
        fi
       minutes=$((rebootWarning-min))
        for sec in {1..60}
        do
            seconds=$((60-$sec))
            sudo -u $username DISPLAY=:0.0  notify-send -t 995 "$errstr - rebooting in $minutes minutes  $seconds seconds"  # Desktop countdown, refreshed once per second until the reboot.
            sleep 1
        done                                                            # Gives time to save work or to "touch $HOME/exit_flag" to kill script and abort reboot.

        if [[ $enable_flags == 1 && -e $HOMEPATH/exit_flag ]]    # if $HOMEPATH/exit_flag exists in your home directory delete the file and exit script.
        then
            rm $HOMEPATH/exit_flag
            if [[ ! -e $HOMEPATH/mute_flag ]]
            then
                beep -l 2000   #longer beep to signal reboot was canceled and script exit
            fi
            exit 0  # exit the script
        fi
        if [[ $enable_flags == 1 && -e $HOMEPATH/boot_flag ]]    # if $HOMEPATH/boot_flag exists in your home directory delete the file and reboot
        then
            rm $HOMEPATH/boot_flag
            if [ ! -e $HOMEPATH/mute_flag ]
            then
                beep -l 2000   
            fi
            break  # exit the loop and immediately reboot
        fi
    done
sync
sudo /sbin/shutdown -r now
sync
exit 0
}


#reboot if $error_threshold BAD_WORK_UNIT errors appear in the last 100 (arbitrary) lines of the log. (I got 40 bad work unit errors in fifteen minutes. Have no idea why. A reboot stopped them.)
badWorkUnit(){
error_threshold=$1
for ii in "${!FSindexArray[@]}"
do
    FS=FS${FSindexArray[$ii]}
    if [[ $( grep $FS  $logFileToRead| tail -100|grep -c BAD_WORK_UNIT) -ge $error_threshold ]]
    then
        /etc/init.d/FAHClient  stop
        errstr="BAD_WORK_UNIT $(grep $FS $logFileToRead |grep Project|tail -1 | while read xxx;do echo "FS$(echo $xxx|cut -d 'S' -f 2)";done)"
        RESTART
    fi
done
}



# loop if internet is down
isInternetDown(){
wasdown=0
while true  #repeat until return
do
    nc -z 8.8.8.8 53  >/dev/null 2>&1  #looks for connection to google
    online=$?
    if [ $online -eq 0 ] # if nc returns 0 then the internet is up now.
    then               
        if [ $wasdown == 1 ]
        then
            sleep 300   #I don't want to exit the function unless the internet has been back up at least 5 minutes
            wasdown=0
            isInternetDown
        fi
        return
    else
        wasdown=1
        sudo -u $username DISPLAY=:0.0 notify-send -t 59000 "Internet is down"
    fi
    sleep 60
done
}


#no work units
no_WU(){
for ii in "${!FSindexArray[@]}"
do
    FSindex=${FSindexArray[$ii]}
    if [[ $(egrep "Download|No WUs available"  $logFileToRead |grep FS$FSindex |tail -1|grep -c "No WUs available") == 1 ]]
    then
        echo " FS$FSindex  no_WU, pause/unpause    $(date)    $(date -u)  "  >> $HOMEPATH/reboots.log
        sudo -u $username DISPLAY=:0.0 notify-send -t 10000 "no_WU - FS$FSindex"
        FAHClient --send-pause $FSindex
        sleep 10
        FAHClient --send-unpause $FSindex
    fi
done
}

isClientRunning(){
for i in {1..120}           # loop until FAHClient is running is true or times out and exits script
do
   /etc/init.d/FAHClient status|grep "fahclient is running"
   if [ $? == 0 ]    #success
    then
        return
    fi
    sleep 1
done
echo "FAHClient is NOT RUNNING. exit"
sudo -u $username DISPLAY=:0.0  notify-send -t 1000000 "FAHClient is NOT RUNNING. exit"   
if [[ $enable_flags == 1 && ! -e $HOMEPATH/mute_flag ]]
then
   beep -l 1500  -r 5
fi
sync
exit 0
}
 


#  folding slot hung.  Reboot if the FS doesn't increment during $hungFSIntervalGPU or $hungFSIntervalCPU
FS_Hung(){
#date -u
currTime=$(date -u +%s)
for ii in "${!FSindexArray[@]}"
do
    FS=FS${FSindexArray[$ii]}
    FSType=${FSindexArrayType[$ii]}
    if [[ $(grep $FS $logFileToRead|egrep "Paused|Completed" |tail -1 |grep -c "Paused") == 1 ]]  #Skip execution if FS is Paused
    then
        continue
    fi
    lastTime=$(xxx=$(grep $FS $logFileToRead|grep Completed |tail -1|cut -d ":" -f1-4) ; echo "$(echo $xxx|cut -b 1-10) $(echo $xxx|cut -b 12-19)"|xargs -i date -u -d'{}'  +%s)
    timeDiff=$(($currTime-$lastTime))

    if [[ $FSType == "GPU" && $timeDiff -ge $hungFSIntervalGPU ]] || [[ $FSType == "CPU" && $timeDiff -ge $hungFSIntervalCPU ]]
    then
        errstr="$FS hung $(grep $FS $logFileToRead |grep Project|tail -1 | while read xxx;do echo "FS$(echo $xxx|cut -d 'S' -f 2)";done)"
        RESTART
    fi
done
}



#GPU stalled
stalled(){
if [ $(grep -c "STALLED" $logFileToRead) -ge 1 ]
then
    FS=$(grep STALLED $logFileToRead|tail -1 |cut -d "S" -f 2|cut -d : -f 1 )
    errstr="STALLED $(grep FS$FS $logFileToRead |grep Project|tail -1 | while read xxx;do echo "FS$(echo $xxx|cut -d 'S' -f 2)";done)"
    RESTART
fi
}


lookForInitialBadWorkUnits(){
if [ -e $HOMEPATH/bounce_flag ]  #warm start so skip this section
then
    rm $HOMEPATH/bounce_flag
else
    for i in {1..120}                           #Probably only affects me but every once in a while the client starts up throwing bad work units and I want to trigger reboot asap
    do
        grep -v topology /var/lib/fahclient/log.txt > $logFileToRead
        badWorkUnit 1 
        breakcount=0
        for j in "${!FSindexArray[@]}"
        do
            if [[ $(grep FS${FSindexArray[$j]} $logFileToRead|tail -1 |grep -c Completed) == 1 ]]
            then
                ((breakcount+=1))
            fi
            if [ $breakcount == ${#FSindexArray[@]} ]  #array length
            then
                break 2                         # break out of both loops if all FS are folding
            fi
        done
        sleep 1
    done 
fi
}

lookForExitFlag(){
#exit
if [[ $enable_flags == 1 && -e $HOMEPATH/exit_flag ]]    # if $HOMEPATH/exit_flag exists in your home directory delete the file and exit script.
then
    rm $HOMEPATH/exit_flag
    sudo -u $username DISPLAY=:0.0 notify-send -t 20000 "exit reboot.sh"
    if [ ! -e $HOMEPATH/mute_flag ]
    then
        beep -l 2000   #longer beep to signal reboot was canceled and script exit
    fi
    sync
    sleep 1
    exit 0  # exit the script
fi
}

lookForBounceFlag(){
#bounce
if [[ $enable_flags == 1 && -e $HOMEPATH/bounce_flag ]]    # if $HOMEPATH/bounce_flag exists RESTART reboot.sh, and exit the script
then
    sudo -u $username DISPLAY=:0.0 notify-send -t 20000 "bounce reboot.sh"
    if [ ! -e $HOMEPATH/mute_flag ]
    then
        beep -l 2000   
    fi
    chmod 744 $REBOOTPATH    #I change permissions because for security the script's owner is root and when I edit it the permission gets changed in Dropbox on my other rigs.
    $REBOOTPATH  & # Start new execution of reboot.sh
    sync
    sleep 1
    exit 0
fi
}

workserverConnectionFailed(){
    if [ $(egrep "Upload complete|WorkServer connection failed" $logFileToRead|tail -1 |grep -c "WorkServer connection failed") == 1 ]
    then
        currTime=$(date -u +%s)
        lastTime=$(xxx=$(grep "WorkServer connection failed" $logFileToRead|tail -1|cut -d ":" -f1-4) ; echo "$(echo $xxx|cut -b 1-10) $(echo $xxx|cut -b 12-19)"|xargs -i date -u -d'{}'  +%s)
        timeDiff=$(($currTime-$lastTime))
        if [[ $timeDiff -ge $badConnectionInterval ]]
        then
            errstr="WorkServer connection failed"
            RESTART
        fi
    fi
}
 

   #reboot if stuck download
downloadStuck(){
currTime=$(date -u +%s)
for ii in "${!FSindexArray[@]}"
do
    FS=FS${FSindexArray[$ii]}
    DL=$(grep $FS $logFileToRead| grep -v "Downloading core" | egrep "Downloading|Download complete" |tail -1 |grep -c Downloading)  # 1 means the last download message for this FS is still "Downloading", so the download isn't complete
    if [ $DL == 1 ]
    then
        lastTime=$(xxx=$(grep $FS $logFileToRead|grep Downloading| tail -1|cut -d ":" -f1-4) ; echo "$(echo $xxx|cut -b 1-10) $(echo $xxx|cut -b 12-19)"|xargs -i date -u -d'{}'  +%s)
        timeDiff=$(($currTime-$lastTime))
        if [[ $timeDiff -ge $badConnectionInterval ]]
        then
            errstr="Download stuck  $FS"
            RESTART
        fi
    fi
done
}


populateFSArrays(){
grep -v topology /var/lib/fahclient/log.txt > $logFileToRead
c=0
grep "slot id" $logFileToRead|cut -d '<' -f 2 |sort -u|while read line
do
    echo $line|cut -d "'" -f2|read FS
    typeset -Z2 FSindexArray[c]     # define as 2 digit with leading zero
    FSindexArray[c]=$FS
    echo $line |cut -d "'" -f4 | read type
    FSindexArrayType[c]=$type
    ((c+=1))
done   #set up the folding slot arrays
for i in "${!FSindexArray[@]}"; do  echo "FSindexArray[i] is ${FSindexArray[i]}"  ;done
for i in "${!FSindexArrayType[@]}"; do  echo "FSindexArrayType[i] is ${FSindexArrayType[i]}"  ;done
}


checkGPUs(){
if [[ $(nvidia-smi -L |wc -l) != $(ps -elf |grep FAHCore |grep gpu|wc -l) ]]
then
    ((gpucount++))
else
    gpucount=0
fi
if [[ $gpucount -ge 5 ]]
then
    errstr2="";while read xxx;do errstr2="$errstr2   ${xxx} ";done < <(nvidia-smi -L|cut -d ":" -f1)
    errstr="GPU offline $errstr2"
    bounceFAH
    gpucount=0
fi
}

#End of functions







#main
logfile="$HOMEPATH/reboot.log.$(date  -u +"%Y%m%dT%H%M")"                           #define the logfile path and name, including date string to make it unique
exec >$logfile  2>&1                                                            #send output to $logfile.

((hungFSIntervalGPU*=60))
((hungFSIntervalCPU*=60))
((badConnectionInterval*=60))
gpucount=0
ls -1tr $HOMEPATH/reboot.log.*| head -n -$logFileGenerations | xargs -i rm {}   #delete old versions of reboot.log*

modprobe pcspkr                                                                 # Enable speaker if your MB has one
if [[ ! -e /mnt/ramdisk ]]
then
    mkdir /mnt/ramdisk
fi
mount -t tmpfs -o size=100m tmpfs /mnt/ramdisk                                  # I set the size to 100m which is about 10 times the largest file I have in logs but I have plenty of RAM.
logFileToRead=/mnt/ramdisk/log.txt
isClientRunning

sleep 1
beep -l 1000   

populateFSArrays

lookForInitialBadWorkUnits


while true  #repeat forever
do
    date -u                     # useful in debugging. Otherwise comment it out to prevent disk writes
    isInternetDown              # If the internet goes down loop till it's been back up at least 5 minutes
    isClientRunning             #
    grep -v topology /var/lib/fahclient/log.txt > $logFileToRead
#   stalled                    # Uncomment if stalled FS reduces PPD on that slot and you want to start triggering a reboot
    workserverConnectionFailed 
    FS_Hung
    badWorkUnit 4
    downloadStuck
    no_WU                       # pauses and unpauses the FS to try and force quicker retry.  Once per minute.
    lookForExitFlag
    lookForBounceFlag
    checkGPUs
    sleep 60

done
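
The log clean-up line in the main section (ls -1tr ... | head -n -$logFileGenerations | xargs -i rm {}) is easy to misread, so here is a small sketch of the same idea as a function you can try on a scratch directory first. The name prune_logs is invented for this example and is not in reboot.sh. It assumes GNU head (for the negative -n count) and file names without newlines.

```sh
#!/bin/sh
# Hypothetical helper, not part of the original script.
# Keep only the newest $2 files whose names start with "$1." and delete the rest.
prune_logs() {
    prefix=$1
    keep=$2
    # ls -1tr lists the oldest file first. GNU head -n -N prints everything
    # except the last N lines, so the newest $keep files never reach rm.
    ls -1tr "$prefix".* 2>/dev/null | head -n -"$keep" | while IFS= read -r f
    do
        rm -f -- "$f"
    done
}
```

For example, prune_logs "$HOMEPATH/reboot.log" "$logFileGenerations" should leave the same files behind as the one-liner in the script. If fewer than $keep files exist, nothing is deleted.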


Re: Linux scripts you may find useful

Postby SteveWillis » Fri Mar 16, 2018 7:05 pm

I'm no longer going to post new versions of my scripts here. Instead, you can download them from the link below. They are works in progress and still change periodically as I add new features, fix bugs, and handle conditions that weren't allowed for. I personally find them very useful.

https://drive.google.com/drive/folders/ ... sp=sharing

Re: Linux scripts you may find useful

Postby SteveWillis » Tue Jun 12, 2018 8:50 pm

I just thought I'd publish a list of the main functions currently implemented by the reboot.sh Linux script.
It can be downloaded from: https://drive.google.com/drive/folders/ ... sp=sharing
It requires some configuration, so READ THE COMMENTS in the script.
It is best started from /etc/rc.local so that after a reboot it starts automatically, even before you log in.
If you fold on Linux and have folding slots that go down, this script can really help keep them folding. At least a couple of other people are using it now, but I would love to get feedback from more users.


* Restart the FAH client if the last log.txt line for a FS doesn't change within a configurable number of minutes, to get the slot going again. Ignores folding slots that are paused or finished.
* Execute a system reboot if the nvidia-smi -L results include "Unable to determine the device handle" for a GPU.
* Restart the FAH client if for 5 consecutive minutes there isn't a FAHCore process for each GPU.
* Reboot if a GPU's load percent stays stuck on 0% for 10 minutes.
* If the log reports "No WUs available", repeatedly pause and unpause the slot to force a faster retry to get a WU.
* Execute a system reboot if a client restart is needed within a configurable number of minutes of the last one, on the assumption that the restart didn't fix the problem.
* Execute a system reboot if "CORE_RESTART" is found in log.txt.
* Configurable to pause a folding slot if it's producing bad work units.
* Configurable to pop up notifications, such as a warning of a system reboot, so you can save any work.
* Configurable to reboot if the internet goes down. This was a special request.
* Limit system reboots so that one never happens sooner than a configurable time after the last one.
* The first step is to copy log.txt (once per minute) to a RAM disk to minimize hard disk reads.
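
To give an idea of how the reboot limit in the list above can work, here is a minimal sketch. It is not taken from the current reboot.sh, which may well do this differently. The function name reboot_allowed and its three arguments are assumptions made for this example.

```sh
#!/bin/sh
# Hypothetical sketch, not the code from reboot.sh.
# $1 = time of the last reboot (epoch seconds)
# $2 = current time (epoch seconds)
# $3 = minimum number of minutes that must pass between reboots
# Returns 0 if a reboot is allowed now, 1 if it is still too soon.
reboot_allowed() {
    last_reboot=$1
    now=$2
    min_interval_minutes=$3
    elapsed=$((now - last_reboot))
    [ "$elapsed" -ge $((min_interval_minutes * 60)) ]
}
```

A caller would keep the time of the last reboot somewhere that survives the reboot (a small file in the home directory, for instance) and only go ahead when reboot_allowed "$last" "$(date -u +%s)" 30 succeeds. Passing the current time in as an argument keeps the check easy to try out by hand.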

