problem with exit-when-done

Moderators: Site Moderators, FAHC Science Team

Postby gw666 » Wed Apr 15, 2020 8:35 am

Hi everyone,

I'm trying to do some backfilling on a GPU farm, i.e. start a GPU workload when work is available and exit when no work units are available. I am using FAHClient on Ubuntu 18.04. config.xml looks like this:
Code:
<config>
  <!-- Folding Slots -->
  <slot id='0' type='GPU'/>
</config>

The full command line looks like this:
/usr/bin/FAHClient --user=ANALY_MANC_GPU --team=38188 --gpu=true --cuda-index=0 --smp=false --exit-when-done=true

The program hasn't received any work units and has been sitting idle for hours, blocking the GPU on the farm. I would have expected the --exit-when-done option to make FAHClient actually exit if no WUs are assigned.

From the log:
Code:
20:52:32:WU00:FS00:Connecting to 18.218.241.186:80
20:52:33:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
20:52:33:ERROR:WU00:FS00:Exception: Could not get an assignment
20:59:23:WU00:FS00:Connecting to 65.254.110.245:8080
20:59:24:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
20:59:24:WU00:FS00:Connecting to 18.218.241.186:80
20:59:24:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
20:59:24:ERROR:WU00:FS00:Exception: Could not get an assignment
21:10:29:WU00:FS00:Connecting to 65.254.110.245:8080
21:10:29:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:10:29:WU00:FS00:Connecting to 18.218.241.186:80
21:10:30:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:10:30:ERROR:WU00:FS00:Exception: Could not get an assignment
21:28:25:WU00:FS00:Connecting to 65.254.110.245:8080
21:28:26:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:28:26:WU00:FS00:Connecting to 18.218.241.186:80
21:28:26:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:28:26:ERROR:WU00:FS00:Exception: Could not get an assignment
21:57:28:WU00:FS00:Connecting to 65.254.110.245:8080
21:57:28:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:57:28:WU00:FS00:Connecting to 18.218.241.186:80
21:57:29:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:57:29:ERROR:WU00:FS00:Exception: Could not get an assignment
22:44:26:WU00:FS00:Connecting to 65.254.110.245:8080
22:44:27:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
22:44:27:WU00:FS00:Connecting to 18.218.241.186:80
22:44:27:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
22:44:27:ERROR:WU00:FS00:Exception: Could not get an assignment
00:00:28:WU00:FS00:Connecting to 65.254.110.245:8080
00:00:28:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
00:00:28:WU00:FS00:Connecting to 18.218.241.186:80
00:00:29:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
00:00:29:ERROR:WU00:FS00:Exception: Could not get an assignment
02:03:27:WU00:FS00:Connecting to 65.254.110.245:8080
02:03:28:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
02:03:28:WU00:FS00:Connecting to 18.218.241.186:80
02:03:28:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
02:03:28:ERROR:WU00:FS00:Exception: Could not get an assignment
******************************* Date: 2020-04-15 *******************************
05:22:28:WU00:FS00:Connecting to 65.254.110.245:8080
05:22:28:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
05:22:28:WU00:FS00:Connecting to 18.218.241.186:80
05:22:29:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
05:22:29:ERROR:WU00:FS00:Exception: Could not get an assignment

gw666
 
Posts: 14
Joined: Thu Apr 09, 2020 9:53 am

Re: problem with exit-when-done

Postby PantherX » Wed Apr 15, 2020 8:42 am

Welcome to the F@H Forum gw666,

It seems that since you started the client, you haven't been assigned a WU; hence it hasn't exited, as it never finished one. There's a known issue where the demand for GPU WUs significantly exceeds the supply. There's work in the pipeline to resolve this issue :)
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
PantherX
Site Moderator
 
Posts: 6327
Joined: Wed Dec 23, 2009 10:33 am
Location: Land Of The Long White Cloud

Re: problem with exit-when-done

Postby gw666 » Wed Apr 15, 2020 9:57 am

I ran the same command on a different node. It has finished a few WUs:
Code:
21:35:20:WU01:FS00:Upload 70.73%
21:35:26:WU01:FS00:Upload 81.37%
21:35:30:WU00:FS00:Connecting to 65.254.110.245:8080
21:35:30:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:35:30:WU00:FS00:Connecting to 18.218.241.186:80
21:35:31:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:35:31:ERROR:WU00:FS00:Exception: Could not get an assignment
21:35:32:WU01:FS00:Upload 95.01%
21:35:35:WU01:FS00:Upload complete
21:35:35:WU01:FS00:Server responded WORK_ACK (400)
21:35:35:WU01:FS00:Final credit estimate, 156113.00 points
21:35:35:WU01:FS00:Cleaning up
21:38:07:WU00:FS00:Connecting to 65.254.110.245:8080
21:38:08:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:38:08:WU00:FS00:Connecting to 18.218.241.186:80
21:38:08:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:38:08:ERROR:WU00:FS00:Exception: Could not get an assignment

and has been idle since:
Code:
02:53:17:WU00:FS00:Connecting to 65.254.110.245:8080
02:53:18:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
02:53:18:WU00:FS00:Connecting to 18.218.241.186:80
02:53:18:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
02:53:18:ERROR:WU00:FS00:Exception: Could not get an assignment
******************************* Date: 2020-04-15 *******************************
06:12:17:WU00:FS00:Connecting to 65.254.110.245:8080
06:12:18:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
06:12:18:WU00:FS00:Connecting to 18.218.241.186:80
06:12:19:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
06:12:19:ERROR:WU00:FS00:Exception: Could not get an assignment

Re: problem with exit-when-done

Postby PantherX » Wed Apr 15, 2020 10:31 am

It should have exited if the correct option had been given. It seems that you might be using the wrong one:
Code:
  exit-when-done <boolean=false>
    Exit when all slots are paused.

Try this instead:
Code:
  --finish
      Finish all current work units, send the results, then exit.

Re: problem with exit-when-done

Postby gw666 » Wed Apr 15, 2020 11:19 am

PantherX wrote:It should have exited if the correct option had been given. It seems that you might be using the wrong one:
Code:
  exit-when-done <boolean=false>
    Exit when all slots are paused.

Try this instead:
Code:
  --finish
      Finish all current work units, send the results, then exit.

I think --finish is for telling an already running instance of Folding to finish; if you start a new instance of Folding with --finish, it will never do anything. That's why I suspect that the --exit-when-done=true option doesn't work as intended. Maybe my idling slot is never paused?

Re: problem with exit-when-done

Postby PantherX » Thu Apr 16, 2020 8:17 am

I had another look at the options and think that you have to use this one in addition to exit-when-done set to true:
Code:
  max-units <integer=0>
    Process at most this number of units, then pause.

This might be how you run it:
/usr/bin/FAHClient --user=ANALY_MANC_GPU --team=38188 --gpu=true --cuda-index=0 --smp=false --max-units=1 --exit-when-done=true

Re: problem with exit-when-done

Postby gw666 » Thu Apr 16, 2020 10:18 am

PantherX wrote:I had another look at the options and think that you have to use this one in addition to exit-when-done set to true:
Code:
  max-units <integer=0>
    Process at most this number of units, then pause.

This might be how you run it:
/usr/bin/FAHClient --user=ANALY_MANC_GPU --team=38188 --gpu=true --cuda-index=0 --smp=false --max-units=1 --exit-when-done=true

I'll give that a try. Any idea how long F@H will try to get that one unit?

Re: problem with exit-when-done

Postby PantherX » Thu Apr 16, 2020 10:26 am

The client will keep trying and, if it fails, will back off exponentially. Currently, by attempt 10, the wait time is about an hour.
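For illustration only, here is the shape of a simple doubling backoff. The ~7-second base is my own assumption, chosen so attempt 10 lands near the one-hour mark; it is not FAHClient's documented schedule:

```shell
# Illustrative only: a doubling backoff from an assumed ~7 s base.
# FAHClient's real schedule may differ; this just shows the growth curve.
base=7
for attempt in $(seq 1 10); do
    # wait time doubles each attempt: base * 2^(attempt-1)
    wait=$(( base * (1 << (attempt - 1)) ))
    echo "attempt ${attempt}: wait ${wait}s"
done
# attempt 10 waits 3584 s, i.e. roughly an hour
```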

Re: problem with exit-when-done

Postby gw666 » Fri Apr 17, 2020 3:04 pm

PantherX wrote:I had another look at the options and think that you have to use this one in addition to exit-when-done set to true:
Code:
  max-units <integer=0>
    Process at most this number of units, then pause.

This might be how you run it:
/usr/bin/FAHClient --user=ANALY_MANC_GPU --team=38188 --gpu=true --cuda-index=0 --smp=false --max-units=1 --exit-when-done=true

This command is working fine for me; the job takes between 1 hour and 2 hours 15 minutes to process one WU, which makes it perfect for backfilling.
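For anyone wanting to script this, the pattern boils down to a loop like the sketch below. This is my own wrapper idea, not a FAHClient feature; `run_one_wu` is a stand-in for the real invocation quoted above, so the sketch stays self-contained and testable:

```shell
# Backfill loop sketch (my own wrapper, not an official recipe).
# run_one_wu stands in for the real command from this thread:
#   /usr/bin/FAHClient --user=ANALY_MANC_GPU --team=38188 --gpu=true \
#       --smp=false --max-units=1 --exit-when-done=true
runs=0
run_one_wu() {
    runs=$((runs + 1))   # pretend we processed one WU
}
# Bounded here so the sketch terminates; in practice this would be
# `while true` with a short sleep between iterations.
for i in 1 2 3; do
    run_one_wu
done
echo "completed ${runs} runs"
```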

Re: problem with exit-when-done

Postby PantherX » Fri Apr 17, 2020 11:00 pm

Glad to hear that it works as you expected! You can always change the number from 1 to 2, or to however many WUs you think you can successfully fold in the available time. Please note that the folding time for WUs varies from Project to Project, so you may need to keep an eye on it :)

Re: problem with exit-when-done

Postby gw666 » Mon Apr 20, 2020 8:24 am

I'm not entirely happy yet: there have been several cases where the program sat idle for 4 hours without getting a WU. I would have preferred the program to exit in that case.
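One workaround I'm considering, as a sketch only: bound the client's total runtime externally with GNU coreutils `timeout` (assumed to be installed), so a client that never gets a WU stops blocking the GPU after the limit. Demonstrated below with `sleep` standing in for the FAHClient command, and a 2-second limit standing in for something like 4h:

```shell
# Watchdog sketch: `timeout` sends SIGINT once the limit passes, so even a
# client that never got a WU eventually stops. `sleep 10` stands in for
# the real FAHClient invocation; `2` stands in for a limit like 4h.
status=0
timeout --signal=INT 2 sleep 10 || status=$?
echo "exit status: ${status}"   # 124 means the time limit was hit
```

The obvious caveat: a hard timeout also kills a WU still in progress, so the limit would need to sit comfortably above the longest expected run time.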

Re: problem with exit-when-done

Postby bruce » Mon Apr 20, 2020 8:30 am

Why are you setting --cuda-index=0?

The current folding slots don't really care what cuda-index is used, because CUDA itself is never used.
bruce
 
Posts: 19656
Joined: Thu Nov 29, 2007 11:13 pm
Location: So. Cal.

Re: problem with exit-when-done

Postby gw666 » Mon Apr 20, 2020 3:13 pm

bruce wrote:Why are you setting --cuda-index=0?

The current folding slots don't really care what cuda-index is used, because CUDA itself is never used.

Thanks for pointing it out, I've removed the option.


Return to V7.5.1 Public Release Windows/Linux/MacOS X
