
problem with exit-when-done

Posted: Wed Apr 15, 2020 7:35 am
by gw666
Hi everyone,

I'm trying to do some backfilling on a GPU farm, i.e. starting a GPU job when work units are available and exiting when they are not. I am using FAHClient on Ubuntu 18.04. My config.xml looks like this:

Code:

<config>
  <!-- Folding Slots -->
  <slot id='0' type='GPU'/>
</config>
The full command line looks like this:
/usr/bin/FAHClient --user=ANALY_MANC_GPU --team=38188 --gpu=true --cuda-index=0 --smp=false --exit-when-done=true

The program hasn't received any work units and has been sitting idle for hours, blocking the GPU on the farm. I would have expected the --exit-when-done option to make FAHClient exit if no WUs are assigned.

From the log:

Code:

20:52:32:WU00:FS00:Connecting to 18.218.241.186:80
20:52:33:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
20:52:33:ERROR:WU00:FS00:Exception: Could not get an assignment
20:59:23:WU00:FS00:Connecting to 65.254.110.245:8080
20:59:24:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
20:59:24:WU00:FS00:Connecting to 18.218.241.186:80
20:59:24:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
20:59:24:ERROR:WU00:FS00:Exception: Could not get an assignment
21:10:29:WU00:FS00:Connecting to 65.254.110.245:8080
21:10:29:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:10:29:WU00:FS00:Connecting to 18.218.241.186:80
21:10:30:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:10:30:ERROR:WU00:FS00:Exception: Could not get an assignment
21:28:25:WU00:FS00:Connecting to 65.254.110.245:8080
21:28:26:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:28:26:WU00:FS00:Connecting to 18.218.241.186:80
21:28:26:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:28:26:ERROR:WU00:FS00:Exception: Could not get an assignment
21:57:28:WU00:FS00:Connecting to 65.254.110.245:8080
21:57:28:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:57:28:WU00:FS00:Connecting to 18.218.241.186:80
21:57:29:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:57:29:ERROR:WU00:FS00:Exception: Could not get an assignment
22:44:26:WU00:FS00:Connecting to 65.254.110.245:8080
22:44:27:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
22:44:27:WU00:FS00:Connecting to 18.218.241.186:80
22:44:27:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
22:44:27:ERROR:WU00:FS00:Exception: Could not get an assignment
00:00:28:WU00:FS00:Connecting to 65.254.110.245:8080
00:00:28:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
00:00:28:WU00:FS00:Connecting to 18.218.241.186:80
00:00:29:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
00:00:29:ERROR:WU00:FS00:Exception: Could not get an assignment
02:03:27:WU00:FS00:Connecting to 65.254.110.245:8080
02:03:28:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
02:03:28:WU00:FS00:Connecting to 18.218.241.186:80
02:03:28:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
02:03:28:ERROR:WU00:FS00:Exception: Could not get an assignment
******************************* Date: 2020-04-15 *******************************
05:22:28:WU00:FS00:Connecting to 65.254.110.245:8080
05:22:28:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
05:22:28:WU00:FS00:Connecting to 18.218.241.186:80
05:22:29:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
05:22:29:ERROR:WU00:FS00:Exception: Could not get an assignment


Re: problem with exit-when-done

Posted: Wed Apr 15, 2020 7:42 am
by PantherX
Welcome to the F@H Forum gw666,

It seems that you haven't been assigned a WU since you started the client, so it never finished one and therefore never exited. There's a known issue where demand for GPU WUs significantly exceeds supply. There's work in the pipeline to resolve this issue :)

Re: problem with exit-when-done

Posted: Wed Apr 15, 2020 8:57 am
by gw666
I ran the same command on a different node. It has finished a few WUs:

Code:

21:35:20:WU01:FS00:Upload 70.73%
21:35:26:WU01:FS00:Upload 81.37%
21:35:30:WU00:FS00:Connecting to 65.254.110.245:8080
21:35:30:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:35:30:WU00:FS00:Connecting to 18.218.241.186:80
21:35:31:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:35:31:ERROR:WU00:FS00:Exception: Could not get an assignment
21:35:32:WU01:FS00:Upload 95.01%
21:35:35:WU01:FS00:Upload complete
21:35:35:WU01:FS00:Server responded WORK_ACK (400)
21:35:35:WU01:FS00:Final credit estimate, 156113.00 points
21:35:35:WU01:FS00:Cleaning up
21:38:07:WU00:FS00:Connecting to 65.254.110.245:8080
21:38:08:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:38:08:WU00:FS00:Connecting to 18.218.241.186:80
21:38:08:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:38:08:ERROR:WU00:FS00:Exception: Could not get an assignment
and has been idling since:

Code:

02:53:17:WU00:FS00:Connecting to 65.254.110.245:8080
02:53:18:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
02:53:18:WU00:FS00:Connecting to 18.218.241.186:80
02:53:18:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
02:53:18:ERROR:WU00:FS00:Exception: Could not get an assignment
******************************* Date: 2020-04-15 *******************************
06:12:17:WU00:FS00:Connecting to 65.254.110.245:8080
06:12:18:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
06:12:18:WU00:FS00:Connecting to 18.218.241.186:80
06:12:19:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
06:12:19:ERROR:WU00:FS00:Exception: Could not get an assignment

Re: problem with exit-when-done

Posted: Wed Apr 15, 2020 9:31 am
by PantherX
It should have exited if the correct command was given. It seems that you might be using the wrong option:

Code:

  exit-when-done <boolean=false>
    Exit when all slots are paused.
Try this instead:

Code:

  --finish
      Finish all current work units, send the results, then exit.

Re: problem with exit-when-done

Posted: Wed Apr 15, 2020 10:19 am
by gw666
PantherX wrote: It should have exited if the correct command was given. It seems that you might be using the wrong option:

Code:

  exit-when-done <boolean=false>
    Exit when all slots are paused.
Try this instead:

Code:

  --finish
      Finish all current work units, send the results, then exit.
I think --finish is for exiting an already running instance of Folding. If you start a new instance with --finish, it will never do anything. That's why I suspect that the --exit-when-done=true option doesn't work as intended. Maybe my idling slot is never paused?

Re: problem with exit-when-done

Posted: Thu Apr 16, 2020 7:17 am
by PantherX
I had another look at the options and think that you have to use this one in addition to setting exit-when-done to true:

Code:

  max-units <integer=0>
    Process at most this number of units, then pause.
This might be how you run it:
/usr/bin/FAHClient --user=ANALY_MANC_GPU --team=38188 --gpu=true --cuda-index=0 --smp=false --max-units=1 --exit-when-done=true

Re: problem with exit-when-done

Posted: Thu Apr 16, 2020 9:18 am
by gw666
PantherX wrote: I had another look at the options and think that you have to use this one in addition to setting exit-when-done to true:

Code:

  max-units <integer=0>
    Process at most this number of units, then pause.
This might be how you run it:
/usr/bin/FAHClient --user=ANALY_MANC_GPU --team=38188 --gpu=true --cuda-index=0 --smp=false --max-units=1 --exit-when-done=true
I'll give that a try. Any idea on how long F@H will try to get that one unit?

Re: problem with exit-when-done

Posted: Thu Apr 16, 2020 9:26 am
by PantherX
The client keeps trying, and after each failure it backs off, waiting exponentially longer before the next attempt. Currently, by attempt 10 the wait time is about 1 hour.
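
The back-off can be read directly from the log in the first post. The following is a minimal sketch, not part of FAHClient: the function name `retry_gaps_minutes` is made up for illustration, and it assumes every log line starts with an `HH:MM:SS` timestamp, as the lines quoted in this thread do. It measures the time between consecutive "Could not get an assignment" errors.

```python
from datetime import datetime, timedelta

FAILURE = "Exception: Could not get an assignment"


def retry_gaps_minutes(log_lines):
    """Return the gaps, in whole minutes, between consecutive failed assignment rounds."""
    stamps = []
    for line in log_lines:
        if FAILURE not in line:
            continue
        # Assumption: the first 8 characters are HH:MM:SS; the date is not on the line.
        stamps.append(datetime.strptime(line[:8], "%H:%M:%S"))

    gaps = []
    for earlier, later in zip(stamps, stamps[1:]):
        delta = later - earlier
        if delta < timedelta(0):
            delta += timedelta(days=1)  # the timestamps wrapped past midnight
        gaps.append(round(delta.total_seconds() / 60))
    return gaps


# The ERROR lines from the first post, copied verbatim.
log = [
    "20:52:33:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "20:59:24:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "21:10:30:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "21:28:26:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "21:57:29:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "22:44:27:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "00:00:29:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "02:03:28:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "05:22:29:ERROR:WU00:FS00:Exception: Could not get an assignment",
]

print(retry_gaps_minutes(log))  # prints [7, 11, 18, 29, 47, 76, 123, 199]
```

In that excerpt each wait is roughly 1.6 times the previous one, so after a handful of failed rounds the client is already waiting more than an hour between attempts.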

Re: problem with exit-when-done

Posted: Fri Apr 17, 2020 2:04 pm
by gw666
PantherX wrote: I had another look at the options and think that you have to use this one in addition to setting exit-when-done to true:

Code:

  max-units <integer=0>
    Process at most this number of units, then pause.
This might be how you run it:
/usr/bin/FAHClient --user=ANALY_MANC_GPU --team=38188 --gpu=true --cuda-index=0 --smp=false --max-units=1 --exit-when-done=true
This command is working fine for me. A job takes between 1:00 and 2:15 hours to process one WU, which makes it perfect for backfilling.

Re: problem with exit-when-done

Posted: Fri Apr 17, 2020 10:00 pm
by PantherX
Glad to hear that it works as per your expectations! You can always change the number from 1 to 2, or to however many WUs you think you can successfully fold within that time. Please note that the folding time for WUs varies from Project to Project, so you may need to keep an eye on it :)

Re: problem with exit-when-done

Posted: Mon Apr 20, 2020 7:24 am
by gw666
I'm not entirely happy yet: there have been several cases where the program was idle for 4 hours without getting a WU. I would have preferred the program to exit in that case.
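
One possible workaround is an external watchdog that stops the client once it has done nothing but fail to get work. This is only a sketch under stated assumptions, not a FAHClient feature: `failed_rounds_since_activity` and `run_with_watchdog` are hypothetical helpers, the log path must be supplied by the caller, the markers are taken from the log lines quoted in this thread, and it assumes the client shuts down cleanly when sent a termination signal.

```python
import subprocess
import time

FAILURE = "Exception: Could not get an assignment"
# Lines that belong to a failed assignment round, as seen in the logs in this thread.
RETRY_NOISE = ("Connecting to", "Failed to get assignment", FAILURE, "Date:")


def failed_rounds_since_activity(log_lines):
    """Count failed assignment rounds since the last line showing other activity."""
    count = 0
    for line in reversed(log_lines):
        if not line.strip():
            continue
        if not any(marker in line for marker in RETRY_NOISE):
            break  # e.g. an upload line: the client was doing real work here
        if FAILURE in line:
            count += 1
    return count


def run_with_watchdog(cmd, log_path, max_failed_rounds=5, poll_seconds=60):
    """Start cmd and terminate it once it has only been failing to get work.

    Hypothetical wrapper: log_path and both thresholds are the caller's choice.
    """
    proc = subprocess.Popen(cmd)
    while proc.poll() is None:
        time.sleep(poll_seconds)
        try:
            with open(log_path, errors="replace") as fh:
                lines = fh.read().splitlines()
        except FileNotFoundError:
            continue  # the log has not been written yet
        if failed_rounds_since_activity(lines) >= max_failed_rounds:
            proc.terminate()  # assumption: the client exits cleanly on this signal
            break
    return proc.wait()
```

It could be called as `run_with_watchdog(["/usr/bin/FAHClient", "--user=ANALY_MANC_GPU", "--team=38188", "--gpu=true", "--smp=false", "--max-units=1", "--exit-when-done=true"], log_path)`, where `log_path` points at the client's log file on your node. Because the waits between attempts grow quickly, a small `max_failed_rounds` already corresponds to a long idle period; pick it from the gaps you observe in your own log.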

Re: problem with exit-when-done

Posted: Mon Apr 20, 2020 7:30 am
by bruce
Why are you setting --cuda-index=0?

The current folding slots don't really care what cuda-index is used because CUDA itself is never used.

Re: problem with exit-when-done

Posted: Mon Apr 20, 2020 2:13 pm
by gw666
bruce wrote: Why are you setting --cuda-index=0?

The current folding slots don't really care what cuda-index is used because CUDA itself is never used.
Thanks for pointing it out, I've removed the option.