Configuration¶
Both grid_search and hp_optimization expect a settings file with the configuration as input. The file can be in any format that is supported by smart_settings (currently JSON, YAML and TOML).
General Parameters¶
Todo
Explain the “job directory” (${HOME}/.cache/cluster_utils-${optimization_procedure_name}-*) somewhere in a central place.
These parameters are the same for grid_search and hp_optimization.
- optimization_procedure_name¶
Required.
Name of the setup.
- results_dir¶
Required.
Result files will be written to {results_dir}/optimization_procedure_name.
- run_in_working_dir = false¶
If true, git_params are ignored and the script specified in script_relative_path is expected to be in the current working directory. Otherwise see git_params.
- git_params¶
If run_in_working_dir is false, the specified git repository is cloned to the job directory and the script specified in script_relative_path is expected to be found there.
- branch : str¶
The git branch to use.
- commit¶
Hash of a specific commit that should be used.
Note: The current implementation still needs a valid branch to be set as it first clones the repo using that branch and only afterwards checks out the specified commit.
- url¶
URL to the repo. If not set, the application expects the current working directory to be inside a git repository and uses the origin URL of this repo.
- depth : int¶
Create a shallow clone with a history truncated to the specified number of commits.
- remove_local_copy : bool = true¶
Remove the local working copy when finished.
- script_relative_path¶
Required.
Python script that is executed. If run_in_working_dir = true, the path is resolved relative to the working directory, otherwise relative to the root of the git repository specified in git_params.
- remove_jobs_dir : bool = true¶
Whether to remove the data stored in ${HOME}/.cache once finished or not. Note that when running on the cluster this directory also contains the stdout/stderr of the jobs (but not when running locally).
- remove_working_dirs : bool = {grid_search: false, hp_optimization: true}¶
Remove the working directories of the jobs (including the parameters used for that job, saved metrics and potentially other output files like checkpoints) once they are finished.
For hp_optimization the directories of the best jobs are kept independent of this setting, see num_best_jobs_whose_data_is_kept.
- generate_report : str = "never"¶
Specifies whether a report should be generated automatically. Can be one of the following values:
never: Do not generate report automatically.
when_finished: Generate once when the optimization has finished.
every_iteration: Generate report of current state after every iteration (not supported by grid_search).
If enabled, the report is saved as result.pdf in the results directory (see results_dir). Note that independent of the setting here, the report can always be generated manually, see Manual Report Generation.
Added in version 3.0. Set to “every_iteration” to get the behaviour of versions <=2.5.
- environment_setup¶
Required.
Note: while the environment_setup argument itself is mandatory, all its contents are optional (i.e. it can be empty).
- pre_job_script : str¶
Path to an executable (e.g. bash script) that is executed before the main script runs.
- virtual_env_path : str¶
Path to the folder of the virtual environment to activate.
- conda_env_path : str¶
Name of conda environment to activate (this option might be broken).
- variables : dict[str]¶
Environment variables to set. Variables are set after a virtual/conda environment is activated and thus override environment variables set before. They are also set before environment_setup.pre_job_script runs. This can be useful to pass parameters to the script, e.g. to set up a generic script that changes its behavior based on the values defined in the cluster_utils config file.
- is_python_script : bool = true¶
Whether the target to run is a Python script.
- run_as_module : bool = false¶
Whether to run the script as a Python module (python -m my_package.my_module) or as a script (python my_package/my_module.py).
- cluster_requirements¶
Required.
Settings for the cluster (number of CPUs, bid, etc.). See Cluster Requirements.
- singularity¶
- fixed_params¶
Required.
TODO
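As a rough illustration of the entries marked "Required" above, the following sketch writes a minimal JSON settings file and checks that the required general parameters are present. All values (names, paths, the empty fixed_params) are hypothetical placeholders, and the check is not how cluster_utils validates its input. grid_search and hp_optimization need further mode-specific parameters, which are described in their own sections.

```python
import json
import tempfile
from pathlib import Path

# General parameters marked "Required" in the list above.
REQUIRED_GENERAL_KEYS = [
    "optimization_procedure_name",
    "results_dir",
    "script_relative_path",
    "environment_setup",
    "cluster_requirements",
    "fixed_params",
]

# Hypothetical example values; adapt them to your own project.
settings = {
    "optimization_procedure_name": "my_experiment",
    "results_dir": "/tmp/results",
    "run_in_working_dir": True,
    "script_relative_path": "train.py",
    "environment_setup": {},  # mandatory, but may be empty
    "cluster_requirements": {
        "request_cpus": 1,
        "request_gpus": 0,
        "memory_in_mb": 1000,
        "bid": 1000,
    },
    "fixed_params": {},  # placeholder; the contents depend on your script
}


def missing_required_keys(config):
    """Return the required general parameters that are absent from config."""
    return [key for key in REQUIRED_GENERAL_KEYS if key not in config]


# Write the settings as JSON (one of the supported formats) and read them back.
config_path = Path(tempfile.mkdtemp()) / "config.json"
config_path.write_text(json.dumps(settings, indent=2))
loaded = json.loads(config_path.read_text())

missing = missing_required_keys(loaded)
print("missing keys:", missing)
```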
Cluster Requirements¶
When running on a cluster, you have to specify the resources needed for each job (number of CPUs/GPUs, memory, etc.). This is all configured in the section cluster_requirements.
Note
The cluster requirements are ignored when running on a local machine.
Some of the options are common among all supported cluster systems, some are system-specific. Note that all the options are per job, i.e. each job will get the requested CPUs, memory, etc.; they are not shared between jobs.
Simple example (in TOML):
[cluster_requirements]
request_cpus = 1
request_gpus = 0
memory_in_mb = 1000
bid = 1000
Common Options¶
- request_cpus : int¶
Number of CPUs that is requested.
- request_gpus : int¶
Number of GPUs that is requested.
- memory_in_mb : int¶
Memory (in MB) that is requested.
- forbidden_hostnames : list[str]¶
Cluster nodes to exclude from running jobs. Useful if nodes are malfunctioning.
Condor-specific Options¶
The following options are only used when running on Condor (i.e. the MPI cluster).
- bid : int¶
The amount of cluster money you are bidding for each job. See documentation of the MPI-IS cluster on how the bidding system works.
- cuda_requirement¶
cuda_requirement has multiple behaviors. If it is a number, it specifies the minimum CUDA capability the GPU should have. If the number is prefixed with < or <=, it specifies the maximum CUDA capability. Otherwise, the value is taken as a full requirement string, example (in TOML):
[cluster_requirements]
# ...
cuda_requirement = "TARGET.CUDACapability >= 5.0 && TARGET.CUDACapability <= 8.0"
# ...
Remember to prefix the constraints with "TARGET." (including the dot). See https://atlas.is.localnet/confluence/display/IT/Specific+GPU+needs for the kind of constraints that are possible.
- gpu_memory_mb : int¶
Minimum memory size the GPU should have, in megabytes.
- concurrency_limit¶
Limit the number of concurrent jobs. You can assign a resource (tag) to your jobs and specify how many tokens each job consumes. There is a total of 10,000 tokens per resource. If you want to run 10 concurrent jobs, each job has to consume 1,000 tokens.
To use this feature, add the following to the settings (example in TOML):
[cluster_requirements]
# ...
concurrency_limit_tag = "gpu"
concurrency_limit = 10
# ...
You can assign different tags to different runs. That way you can, for instance, limit only the number of GPU jobs.
- concurrency_limit_tag¶
- hostname_list : list[str]¶
Cluster nodes to exclusively use for running jobs.
- extra_submission_options : dict | list | str¶
This allows adding extra lines to the .sub file used for submitting jobs to the cluster. Note that this setting is normally not needed, as cluster_utils automatically builds the submission file for you.
Todo
Is the list above complete?
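The rules described for concurrency_limit and cuda_requirement can be summarised in a short Python sketch. It only mirrors the behaviour as documented above. The function names are made up for illustration, and the exact requirement strings that cluster_utils generates for the numeric cases are an assumption based on the TARGET.CUDACapability example.

```python
TOTAL_TOKENS_PER_RESOURCE = 10_000  # total stated in the concurrency_limit description


def tokens_per_job(concurrency_limit):
    """Tokens each job must consume so that at most concurrency_limit jobs run."""
    return TOTAL_TOKENS_PER_RESOURCE // concurrency_limit


def interpret_cuda_requirement(value):
    """Illustrate the three documented cases of cuda_requirement (assumed output format)."""
    text = str(value).strip()
    # Case 2: a number prefixed with "<=" or "<" gives a maximum capability.
    for operator in ("<=", "<"):
        if text.startswith(operator):
            number = float(text[len(operator):])
            return f"TARGET.CUDACapability {operator} {number}"
    # Case 1: a plain number gives the minimum capability.
    try:
        return f"TARGET.CUDACapability >= {float(text)}"
    except ValueError:
        # Case 3: anything else is taken as a full requirement string.
        return text


print(tokens_per_job(10))
print(interpret_cuda_requirement(5))
print(interpret_cuda_requirement("<=8.0"))
```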
Slurm-specific Options¶
- partition : str¶
Required.
Name of the partition to run the jobs on. See documentation of the corresponding cluster on what partitions are available.
Multiple partitions can be given as a comma-separated string (partition1,partition2). In this case jobs will be executed on any of them (depending on which has free capacity first).
- request_time : str¶
Required.
Time limit for the jobs. Jobs taking longer than this will be aborted, so make sure to request enough time (but don’t exaggerate too much as shorter jobs can be scheduled more easily).
From the Slurm documentation:
Acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.
So for example, to request 1 hour per job use request_time = "1:00:00".
- signal_seconds_to_timeout : int¶
Time in seconds before timeout at which Slurm sends a USR1 signal to the job (see --signal of sbatch). If not set, no signal is sent.
See example Get Signal Before Timeout on Slurm.
- extra_submission_options : list[str]¶
List of additional options for sbatch. Can be used if a specific setting is needed which is not already covered by the options above. Expects a list with arguments as they are passed to sbatch, for example:
extra_submission_options = ["--gpu-freq=high", "--begin=2010-01-20T12:34:00"]
Note
There are currently no options to restrict the type of GPU. On the ML Cloud cluster of the University of Tübingen, this is currently done via the partitions. See https://portal.mlcloud.uni-tuebingen.de/user-guide/batch for a list of available partitions.
If needed, e.g. when using cluster_utils on a different Slurm cluster, missing options can always be provided via extra_submission_options.
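If the time limit is computed in a script rather than written by hand, a small helper can produce a string for request_time. The sketch below is not part of cluster_utils; the function name is hypothetical. It emits the "hours:minutes:seconds" and "days-hours:minutes:seconds" forms from the list of acceptable Slurm formats quoted above.

```python
from datetime import timedelta


def to_slurm_time(duration):
    """Format a timedelta as a Slurm time limit string."""
    total_seconds = int(duration.total_seconds())
    days, remainder = divmod(total_seconds, 24 * 3600)
    hours, remainder = divmod(remainder, 3600)
    minutes, seconds = divmod(remainder, 60)
    if days:
        # "days-hours:minutes:seconds"
        return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"
    # "hours:minutes:seconds"
    return f"{hours}:{minutes:02d}:{seconds:02d}"


print(to_slurm_time(timedelta(hours=1)))
print(to_slurm_time(timedelta(days=2, hours=3)))
```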
Use Singularity/Apptainer Containers¶
Jobs can be executed inside Singularity/Apptainer [1] containers to give you full control over the environment, installed packages, etc. To enable containerisation of jobs, add a section singularity in the config file. This section can have the following parameters:
- image¶
Required.
Path to the container image.
- executable : str = "singularity"¶
Specify the executable that is used to run the container (mostly useful if you want to explicitly use Apptainer instead of Singularity in an environment where both are installed).
- use_run : bool = false¶
By default the container is run with singularity exec. Set this to true to use singularity run instead. This is only useful for images that use a wrapper run script that executes the given command (sometimes needed for some environment initialisation).
- args : list[str] = []¶
List of additional arguments that are passed to singularity exec|run. Use this to set flags like --nv, --cleanenv, --contain, etc. if needed.
Example (in TOML):
[singularity]
image = "my_container.sif"
args = ["--nv", "--cleanenv"]
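To make the effect of these parameters concrete, the sketch below assembles a container command line from them, following the usual "singularity exec [options] image command" call syntax. How cluster_utils actually builds the command is an assumption here; the function name and the wrapped job command are hypothetical.

```python
def build_container_command(job_command, image, executable="singularity",
                            use_run=False, args=()):
    """Wrap job_command so that it runs inside the given container image."""
    subcommand = "run" if use_run else "exec"
    return [executable, subcommand, *args, image, *job_command]


# Settings from the TOML example above, wrapping a hypothetical job command.
command = build_container_command(
    ["python3", "train.py"],
    image="my_container.sif",
    args=["--nv", "--cleanenv"],
)
print(" ".join(command))

# Explicitly using Apptainer together with "run" instead of "exec".
apptainer_command = build_container_command(
    ["python3", "train.py"],
    image="my_container.sif",
    executable="apptainer",
    use_run=True,
)
print(" ".join(apptainer_command))
```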
Specific for hp_optimization¶
- num_best_jobs_whose_data_is_kept : int¶
Required.
Keep copies of the working directories of the given number of best jobs. They are stored in {results_dir}/best_jobs/.
- kill_bad_jobs_early : bool = false¶
TODO
- early_killing_params¶
TODO
- optimizer_str¶
Required.
The optimisation method that is used to find good hyperparameters. Supported methods are
cem_metaoptimizer
nevergrad *
gridsearch
* To use nevergrad, the optional dependencies from the “nevergrad” group are needed, see Optional Dependencies.
- optimizer_settings¶
Required.
Settings specific to the optimiser selected in optimizer_str. See Optimizer Settings.
- optimization_setting¶
Required.
General settings for the optimisation (independent of the optimisation method). See General Optimisation Settings.
- optimized_params¶
Required.
Defines the parameters that are optimised over. Expects a list of dictionaries with each entry having the following elements:
param: Name of the parameter. Can have object/attribute structure, e.g. “fn_args.x”.
distribution: Distribution that is used for sampling. Options are:
TruncatedNormal: Normal distribution using floats.
TruncatedLogNormal: Log-normal distribution using floats.
IntNormal: Normal distribution using integer values.
IntLogNormal: Log-normal distribution using integer values.
Discrete: Discrete list of values.
bounds: List [min_value, max_value].
options: List of possible values (used instead of bounds for “Discrete” distribution).
Example (in TOML):
[[optimized_params]]
param = "fn_args.w"
distribution = "IntNormal"
bounds = [ -5, 5 ]
[[optimized_params]]
param = "fn_args.y"
distribution = "TruncatedLogNormal"
bounds = [ 0.01, 100.0 ]
[[optimized_params]]
param = "fn_args.sharp_penalty"
distribution = "Discrete"
options = [ false, true ]
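The structure of an optimized_params entry can be checked with a few lines of Python before submitting jobs. This is only a sketch of the rules listed above, not the validation done by cluster_utils. The function name is hypothetical, and the requirement that the lower bound is smaller than the upper bound is an assumption.

```python
# Distributions that take "bounds" according to the list above.
BOUNDED_DISTRIBUTIONS = {
    "TruncatedNormal",
    "TruncatedLogNormal",
    "IntNormal",
    "IntLogNormal",
}


def check_optimized_param(entry):
    """Return a list of problems found in one optimized_params entry."""
    problems = []
    if "param" not in entry:
        problems.append("missing 'param'")
    distribution = entry.get("distribution")
    if distribution == "Discrete":
        # "Discrete" uses "options" instead of "bounds".
        if not entry.get("options"):
            problems.append("'Discrete' needs a non-empty 'options' list")
    elif distribution in BOUNDED_DISTRIBUTIONS:
        bounds = entry.get("bounds")
        if not (isinstance(bounds, list) and len(bounds) == 2):
            problems.append("'bounds' must be a list [min_value, max_value]")
        elif bounds[0] >= bounds[1]:
            # Assumption: min_value has to be strictly smaller than max_value.
            problems.append("min_value must be smaller than max_value")
    else:
        problems.append(f"unknown distribution: {distribution!r}")
    return problems


# The three entries from the TOML example, written as Python dictionaries.
optimized_params = [
    {"param": "fn_args.w", "distribution": "IntNormal", "bounds": [-5, 5]},
    {"param": "fn_args.y", "distribution": "TruncatedLogNormal", "bounds": [0.01, 100.0]},
    {"param": "fn_args.sharp_penalty", "distribution": "Discrete", "options": [False, True]},
]
for entry in optimized_params:
    print(entry["param"], check_optimized_param(entry))
```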
General Optimisation Settings¶
The optimization_setting parameter defines the general optimisation settings (i.e. the ones independent of the optimisation method set in optimizer_str). A dictionary with the following values is expected:
- metric_to_optimize : str¶
Required.
Name of the metric that is used for the optimisation. Has to match the name of one of the metrics that are saved with finalize_job().
- minimize : bool¶
Required.
Specify whether the metric shall be minimised (true) or maximised (false).
- number_of_samples : int¶
Required.
The total number of jobs that will be run.
- n_jobs_per_iteration : int¶
Required.
The number of jobs submitted to the cluster concurrently, and also the number of finished jobs per report iteration.
- n_completed_jobs_before_resubmit : int = 1¶
The number of jobs that have to be finished before another n_completed_jobs_before_resubmit jobs are submitted. Defaults to 1 (i.e. submit new job immediately when one finishes).
- run_local : bool¶
Specify if the optimisation shall be run locally if the cluster is not detected. If not set, the user will be asked at runtime in this case.
About Iterations¶
The exact meaning of one “iteration” of the hp_optimization mode is a bit complicated and depends on the configuration.
Relevant are the following parameters from the optimization_setting section:
number_of_samples is simply the total number of jobs that are run.
n_jobs_per_iteration says how many jobs can be executed in parallel.
From this a number of iterations is derived. Basically, an iteration counter is used that is incremented by one whenever another n_jobs_per_iteration jobs have been completed (resulting in number_of_samples / n_jobs_per_iteration iterations in the end). However, this does not necessarily mean that the optimisation is split into distinct iterations where the next iteration only starts when the previous one has finished. Instead, whenever a job completes, the optimiser is updated with the results and the next one is started immediately, so that n_jobs_per_iteration jobs are always running at the same time. The notion of “iterations” is only used to have a regular update of the report every n_jobs_per_iteration jobs.
The behaviour can be changed by setting n_completed_jobs_before_resubmit. The meaning of this parameter is as follows: Always wait until n_completed_jobs_before_resubmit jobs have finished, then submit another n_completed_jobs_before_resubmit jobs. Its default value is 1, resulting in the behaviour described in the previous paragraph. Setting it to a larger value makes the optimisation wait for several jobs to finish before sampling new parameters. Setting n_completed_jobs_before_resubmit = n_jobs_per_iteration results in what one would intuitively assume regarding iterations, i.e. the optimisation waits for n_jobs_per_iteration jobs to be finished and only then starts the next iteration with another n_jobs_per_iteration jobs.
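The interplay of the three parameters can be illustrated with a toy simulation. It is a simplified model of the description above, not the scheduler used by cluster_utils: jobs are assumed to finish one at a time, and n_completed_jobs_before_resubmit is assumed to be at most n_jobs_per_iteration. The function name is hypothetical.

```python
def simulate_running_jobs(number_of_samples, n_jobs_per_iteration,
                          n_completed_jobs_before_resubmit=1):
    """Return the number of running jobs observed right after each job completion."""
    submitted = min(n_jobs_per_iteration, number_of_samples)
    running = submitted
    waiting = 0  # completed jobs that have not yet triggered a resubmission
    trace = []
    while running > 0:
        running -= 1  # one job finishes
        waiting += 1
        if waiting >= n_completed_jobs_before_resubmit:
            # Submit as many new jobs as have finished, limited by the total budget.
            new_jobs = min(waiting, number_of_samples - submitted)
            submitted += new_jobs
            running += new_jobs
            waiting = 0
        trace.append(running)
    return trace


number_of_samples = 6
n_jobs_per_iteration = 3
# The numbers divide evenly in this example.
iterations = number_of_samples // n_jobs_per_iteration

# Default: every finished job is replaced immediately.
continuous = simulate_running_jobs(number_of_samples, n_jobs_per_iteration, 1)
# Batch-wise: wait for a full batch before submitting the next one.
batched = simulate_running_jobs(number_of_samples, n_jobs_per_iteration, 3)

print(iterations, continuous, batched)
```

With the default setting the number of running jobs stays at n_jobs_per_iteration until the sample budget is used up, whereas the batched setting lets the cluster drain completely before the next batch starts.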
Optimizer Settings¶
optimizer_settings expects as value a dictionary with configuration specific to the method that is specified in optimizer_str. Below are the corresponding parameters for each method.
cem_metaoptimizer¶
- with_restarts : bool¶
Required.
Whether a specific set of settings can be run multiple times. This can be useful to automatically verify if good runs were just lucky runs because of e.g. the random seed, making the found solutions more robust.
If enabled, new settings are sampled for the first num_jobs_in_elite jobs. After that each new job has a 20% chance to use the same settings as a previous job (drawn from the set of best jobs).
- num_jobs_in_elite : int¶
Required.
TODO
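The restart rule described for with_restarts can be sketched as follows. This is an illustration of the documented behaviour only, not the cem_metaoptimizer implementation. The function name, the way new settings are sampled, and the contents of the set of best jobs are all assumptions.

```python
import random

RESTART_PROBABILITY = 0.2  # the "20% chance" from the description above


def choose_settings(job_index, num_jobs_in_elite, best_settings, sample_new, rng):
    """Return (settings, is_restart) for the job with the given index."""
    # The first num_jobs_in_elite jobs always get freshly sampled settings.
    if job_index < num_jobs_in_elite or not best_settings:
        return sample_new(rng), False
    # Afterwards, reuse the settings of one of the best jobs with 20% probability.
    if rng.random() < RESTART_PROBABILITY:
        return rng.choice(best_settings), True
    return sample_new(rng), False


def sample_new(rng):
    """Hypothetical sampler standing in for the optimiser's own sampling."""
    return {"fn_args.x": rng.uniform(-5.0, 5.0)}


rng = random.Random(0)
num_jobs_in_elite = 5
best_settings = [{"fn_args.x": 0.1}, {"fn_args.x": -0.3}]  # made-up elite set

restart_flags = [
    choose_settings(index, num_jobs_in_elite, best_settings, sample_new, rng)[1]
    for index in range(10_000)
]
later_flags = restart_flags[num_jobs_in_elite:]
restart_fraction = sum(later_flags) / len(later_flags)
print(f"fraction of restarts: {restart_fraction:.3f}")
```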
nevergrad¶
Note
To use nevergrad, the optional dependencies from the “nevergrad” group are needed, see Optional Dependencies.
- opt_alg¶
Required.
TODO
Specific for grid_search¶
- local_run¶
TODO
- load_existing_results : bool = false¶
TODO
- restarts¶
Required.
How often to run each configuration (useful if there is some randomness in the result).
- samples¶
TODO
- hyperparam_list¶
Required.
List of parameters over which the grid search is performed. List of dicts:
param: Parameter name (e.g. “fn_args.x”).
values: List of values. Be careful with types: 42 will be passed as int, use 42.0 if you want float instead.
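The type caveat for values, and the way restarts multiplies the number of runs, can be seen in a small sketch. It parses a hypothetical JSON fragment and expands the grid as a Cartesian product. That cluster_utils enumerates the configurations in exactly this way is an assumption; the parameter names are made up.

```python
import itertools
import json

# Hypothetical grid_search fragment in JSON (one of the supported formats).
settings = json.loads("""
{
    "restarts": 2,
    "hyperparam_list": [
        {"param": "fn_args.x", "values": [42, 42.0]},
        {"param": "fn_args.flag", "values": [true, false]}
    ]
}
""")

names = [entry["param"] for entry in settings["hyperparam_list"]]
value_lists = [entry["values"] for entry in settings["hyperparam_list"]]

# One dictionary per grid point (Cartesian product of all value lists).
grid = [dict(zip(names, combination))
        for combination in itertools.product(*value_lists)]

# Each configuration is run "restarts" times.
total_runs = len(grid) * settings["restarts"]

for point in grid:
    print(point, type(point["fn_args.x"]).__name__)
print("total runs:", total_runs)
```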
Overwriting Parameters on the Command Line¶
When executing grid_search or hp_optimization it is possible to overwrite one or more parameters of the config file by providing values on the command line.
The general syntax for this is parameter_name=value given after the config file. Note, however, that value is evaluated as Python code. This means that string values need to be quoted in a way that is preserved by the shell. So for example to use a custom name for the output directory:
python3 -m cluster_utils.grid_search config.json 'optimization_procedure_name="foo"'
Nested parameters can be set using dots:
python3 -m cluster_utils.grid_search config.json 'git_params.branch="foo"'
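Why the inner quotes matter can be shown with a short sketch that applies such overrides to a nested dictionary. It uses ast.literal_eval as a stand-in for "evaluated as Python code"; the mechanism actually used by cluster_utils may differ, and the function name is hypothetical.

```python
import ast


def apply_override(config, assignment):
    """Apply one 'dotted.key=value' override to a nested dict (illustration only)."""
    key, _, raw_value = assignment.partition("=")
    # Stand-in for evaluating the value as Python code.
    value = ast.literal_eval(raw_value)
    *parents, leaf = key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


config = {"optimization_procedure_name": "demo", "git_params": {"branch": "main"}}

# These strings are what the program receives after the shell has removed the outer quotes.
apply_override(config, 'optimization_procedure_name="foo"')
apply_override(config, 'git_params.branch="dev"')
apply_override(config, "restarts=3")
print(config)

# Without the inner quotes, foo is read as a Python name rather than a string.
try:
    apply_override(config, "optimization_procedure_name=foo")
    unquoted_failed = False
except ValueError:
    unquoted_failed = True
print("unquoted value rejected:", unquoted_failed)
```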