Configuration

Both grid_search and hp_optimization expect as input a settings file with the configuration. The file can be any format that is supported by smart_settings (currently JSON, YAML and TOML).

General Parameters

Todo

Explain the “job directory” (${HOME}/.cache/cluster_utils-${optimization_procedure_name}-*) somewhere in a central place.

These parameters are the same for grid_search and hp_optimization.

optimization_procedure_name

Required.

Name of the setup.

results_dir

Required.

Result files will be written to {results_dir}/{optimization_procedure_name}.
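
For example (in TOML; the values are just placeholders):

optimization_procedure_name = "my_experiment"
results_dir = "/tmp/results"

With these settings, result files would be written to /tmp/results/my_experiment.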

run_in_working_dir = false

If true, git_params are ignored and the script specified in script_relative_path is expected to be in the current working directory. Otherwise see git_params.

git_params

If run_in_working_dir is false, the specified git repository is cloned to the job directory and the script specified in script_relative_path is expected to be found there.

branch : str

The git branch to use.

commit

Hash of a specific commit that should be used.

Note: The current implementation still needs a valid branch to be set as it first clones the repo using that branch and only afterwards checks out the specified commit.

url

URL to the repo. If not set, the application expects the current working directory to be inside a git repository and uses the origin URL of this repo.

depth : int

Create a shallow clone with a history truncated to the specified number of commits.

remove_local_copy : bool = true

Remove the local working copy when finished.
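
A git_params section might look like this (in TOML; URL and branch name are placeholders):

[git_params]
url = "git@example.com:me/my_project.git"
branch = "main"
depth = 1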

script_relative_path

Required.

Python script that is executed. If run_in_working_dir = true, the path is resolved relative to the working directory, otherwise relative to the root of the git repository specified in git_params.

remove_jobs_dir : bool = true

Whether or not to remove the data stored in ${HOME}/.cache once finished. Note that when running on the cluster, this directory also contains the stdout/stderr of the jobs (but not when running locally).

remove_working_dirs : bool = {grid_search: false, hp_optimization: true}

Remove the working directories of the jobs (including the parameters used for that job, saved metrics and potentially other output files like checkpoints) once they are finished.

For hp_optimization, the directories of the best jobs are kept independent of this setting, see num_best_jobs_whose_data_is_kept.

generate_report : str = "never"

Specifies whether a report should be generated automatically. Can be one of the following values:

  • never: Do not generate report automatically.

  • when_finished: Generate once when the optimization has finished.

  • every_iteration: Generate report of current state after every iteration (not supported by grid_search).

If enabled, the report is saved as result.pdf in the results directory (see results_dir). Note that independent of the setting here, the report can always be generated manually, see Manual Report Generation.

Added in version 3.0. Set to “every_iteration” to get the behaviour of versions <=2.5.

environment_setup

Required.

Note: while the environment_setup argument itself is mandatory, all of its contents are optional (i.e. it can be empty).

pre_job_script : str

Path to an executable (e.g. bash script) that is executed before the main script runs.

virtual_env_path : str

Path to the folder of the virtual environment to activate.

conda_env_path : str

Name of conda environment to activate (this option might be broken).

variables : dict[str]

Environment variables to set. The variables are set after a virtual/conda environment is activated and thus override environment variables set earlier. They are also set before environment_setup.pre_job_script runs; this can be useful to pass parameters to that script, e.g. to set up a generic script that changes its behaviour based on the values defined in the cluster_utils config file.

is_python_script : bool = true

Whether the target to run is a Python script.

run_as_module : bool = false

Whether to run the script as a Python module (python -m my_package.my_module) or as a script (python my_package/my_module.py).
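
Example of an environment_setup section (in TOML; the paths and variable values are placeholders):

[environment_setup]
pre_job_script = "scripts/prepare_data.sh"
virtual_env_path = "~/venvs/my_project"

[environment_setup.variables]
MY_DATA_DIR = "/fast/storage/datasets"
OMP_NUM_THREADS = "1"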

cluster_requirements

Required.

Settings for the cluster (number of CPUs, bid, etc.). See Cluster Requirements.

singularity

See Use Singularity/Apptainer Containers.

fixed_params

Required.

TODO

Cluster Requirements

When running on a cluster, you have to specify the resources needed for each job (number of CPUs/GPUs, memory, etc.). This is all configured in the section cluster_requirements.

Note

The cluster requirements are ignored when running on a local machine.

Some of the options are common to all supported cluster systems, some are system-specific. Note that all options are per job, i.e. each job gets the requested CPUs, memory, etc.; resources are not shared between jobs.

Simple example (in TOML):

[cluster_requirements]
request_cpus = 1
request_gpus = 0
memory_in_mb = 1000
bid = 1000

Common Options

request_cpus : int

Number of CPUs that are requested.

request_gpus : int

Number of GPUs that are requested.

memory_in_mb : int

Memory (in MB) that is requested.

forbidden_hostnames : list[str]

Cluster nodes to exclude from running jobs. Useful if nodes are malfunctioning.
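
For example (in TOML; the hostnames are placeholders):

[cluster_requirements]
# ...
forbidden_hostnames = ["node23", "node42"]
# ...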

Condor-specific Options

The following options are only used when running on Condor (i.e. the MPI cluster).

bid : int

The amount of cluster money you are bidding for each job. See documentation of the MPI-IS cluster on how the bidding system works.

cuda_requirement

cuda_requirement can be used in several ways. If it is a number, it specifies the minimum CUDA capability the GPU should have. If the number is prefixed with < or <=, it specifies the maximum CUDA capability instead. Otherwise, the value is taken as a full requirement string, for example (in TOML):

[cluster_requirements]
# ...
cuda_requirement = "TARGET.CUDACapability >= 5.0 && TARGET.CUDACapability <= 8.0"
# ...

Remember to prefix the constraints with TARGET. (as in the example above). See https://atlas.is.localnet/confluence/display/IT/Specific+GPU+needs for the kinds of constraints that are possible.

gpu_memory_mb : int

Minimum memory size the GPU should have, in megabytes.

concurrency_limit

Limit the number of concurrent jobs. You can assign a resource (tag) to your jobs and specify how many tokens each job consumes. There is a total of 10,000 tokens per resource. If you want to run 10 concurrent jobs, each job has to consume 1,000 tokens.

Using this feature is as easy as adding (example in TOML)

[cluster_requirements]
# ...
concurrency_limit_tag = "gpu"
concurrency_limit = 10
# ...

to the settings.

You can assign different tags to different runs. That way you can, for instance, limit only the number of GPU jobs.

concurrency_limit_tag

See cluster_requirements.concurrency_limit

hostname_list : list[str]

Cluster nodes to exclusively use for running jobs.

extra_submission_options : dict | list | str

This allows adding additional lines to the .sub file used for submitting jobs to the cluster. Note that this setting is normally not needed, as cluster_utils automatically builds the submission file for you.
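
As a sketch, assuming the list form where each entry becomes one additional line of the submission file (the option shown is purely illustrative):

[cluster_requirements]
# ...
extra_submission_options = ["nice_user = True"]
# ...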

Todo

Is the list above complete?

Slurm-specific Options

partition : str

Required.

Name of the partition to run the jobs on. See documentation of the corresponding cluster on what partitions are available.

Multiple partitions can be given as a comma-separated string (partition1,partition2), in this case jobs will be executed on any of them (depending on which has free capacity first).

request_time : str

Required.

Time limit for the jobs. Jobs taking longer than this will be aborted, so make sure to request enough time (but don’t exaggerate too much as shorter jobs can be scheduled more easily).

From the Slurm documentation:

Acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.

So for example to request 1 hour per job use request_time = "1:00:00".

signal_seconds_to_timeout : int

Time in seconds before timeout at which Slurm sends a USR1 signal to the job (see --signal of sbatch). If not set, no signal is sent.

See example Get Signal Before Timeout on Slurm.

extra_submission_options : list[str]

List of additional options for sbatch. Can be used if a specific setting is needed which is not already covered by the options above. Expects a list with arguments as they are passed to sbatch, for example:

extra_submission_options = ["--gpu-freq=high", "--begin=2010-01-20T12:34:00"]
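
Putting it together, a cluster_requirements section for a Slurm cluster might look like this (partition name and resource values are placeholders):

[cluster_requirements]
request_cpus = 4
request_gpus = 1
memory_in_mb = 16000
partition = "gpu"
request_time = "12:00:00"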

Note

There are currently no options to restrict the type of GPU. On the ML Cloud cluster of the University of Tübingen, this is currently done via the partitions. See https://portal.mlcloud.uni-tuebingen.de/user-guide/batch for a list of available partitions.

If needed, e.g. when using cluster_utils on a different Slurm cluster, missing options can always be provided via extra_submission_options.

Use Singularity/Apptainer Containers

Jobs can be executed inside Singularity/Apptainer [1] containers to give you full control over the environment, installed packages, etc. To enable containerisation of jobs, add a singularity section to the config file. This section can have the following parameters:

image

Required.

Path to the container image.

executable : str = "singularity"

Specify the executable that is used to run the container (mostly useful if you want to explicitly use Apptainer instead of Singularity in an environment where both are installed).

use_run : bool = false

By default the container is run with singularity exec. Set this to true to use singularity run instead. This is only useful for images that use a wrapper run script that executes the given command (sometimes needed for some environment initialisation).

args : list[str] = []

List of additional arguments that are passed to singularity exec|run. Use this to set flags like --nv, --cleanenv, --contain, etc. if needed.

Example (in TOML):

[singularity]
image = "my_container.sif"
args = ["--nv", "--cleanenv"]

Specific for hp_optimization

num_best_jobs_whose_data_is_kept : int

Required.

Keep copies of the working directories of the given number of best jobs. They are stored in {results_dir}/best_jobs/.

kill_bad_jobs_early : bool = false

TODO

early_killing_params

TODO

optimizer_str

Required.

The optimisation method that is used to find good hyperparameters. Supported methods are

  • cem_metaoptimizer

  • nevergrad *

  • gridsearch

* To use nevergrad, the optional dependencies from the “nevergrad” group are needed, see Optional Dependencies.

optimizer_settings

Required.

Settings specific to the optimiser selected in optimizer_str. See Optimizer Settings.

optimization_setting

Required.

General settings for the optimisation (independent of the optimisation method). See General Optimisation Settings.

optimized_params

Required.

Defines the parameters that are optimised over. Expects a list of dictionaries with each entry having the following elements:

  • param: Name of the parameter. Can be nested using dot notation, e.g. “fn_args.x”.

  • distribution: Distribution that is used for sampling. Options are:

    TruncatedNormal

    Normal distribution using floats.

    TruncatedLogNormal

    Log-normal distribution using floats.

    IntNormal

    Normal distribution using integer values.

    IntLogNormal

    Log-normal distribution using integer values.

    Discrete

    Discrete list of values.

  • bounds: List [min_value, max_value]

  • options: List of possible values (used instead of bounds for “Discrete” distribution).

Example (in TOML):

[[optimized_params]]
param = "fn_args.w"
distribution = "IntNormal"
bounds = [ -5, 5 ]

[[optimized_params]]
param = "fn_args.y"
distribution = "TruncatedLogNormal"
bounds = [ 0.01, 100.0 ]

[[optimized_params]]
param = "fn_args.sharp_penalty"
distribution = "Discrete"
options = [ false, true ]

General Optimisation Settings

The optimization_setting parameter defines the general optimisation settings (i.e. the ones independent of the optimisation method set in optimizer_str). A dictionary with the following values is expected:

metric_to_optimize : str

Required.

Name of the metric that is used for the optimisation. Has to match the name of one of the metrics that are saved with finalize_job().

minimize : bool

Required.

Specify whether the metric shall be minimized (true) or maximised (false).

number_of_samples : int

Required.

The total number of jobs that will be run.

n_jobs_per_iteration : int

Required.

The number of jobs submitted to the cluster concurrently, and also the number of finished jobs per report iteration.

n_completed_jobs_before_resubmit : int = 1

The number of jobs that have to be finished before another n_completed_jobs_before_resubmit jobs are submitted. Defaults to 1 (i.e. a new job is submitted immediately when one finishes).

run_local : bool

Specify whether the optimisation shall be run locally if no cluster is detected. If not set, the user will be asked at runtime in this case.
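
Example of an optimization_setting section (in TOML; the metric name and numbers are placeholders):

[optimization_setting]
metric_to_optimize = "validation_loss"
minimize = true
number_of_samples = 100
n_jobs_per_iteration = 20
run_local = false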

About Iterations

The exact meaning of one “iteration” of the hp_optimization mode is a bit complicated and depends on the configuration.

Relevant are the following parameters from the optimization_setting section:

number_of_samples is simply the total number of jobs that are run. n_jobs_per_iteration says how many jobs can be executed in parallel.

From this a number of iterations is derived. Basically, an iteration counter is incremented by one whenever another n_jobs_per_iteration jobs have been completed (resulting in number_of_samples / n_jobs_per_iteration iterations in the end). However, this does not necessarily mean that the optimisation is split into distinct iterations where the next iteration only starts when the previous one has finished. Instead, whenever a job completes, the optimiser is updated with its results and the next job is started immediately, so that there are always n_jobs_per_iteration jobs running at the same time. The notion of “iterations” is only used to trigger a regular update of the report every n_jobs_per_iteration jobs.

This behaviour can be changed by setting n_completed_jobs_before_resubmit. The meaning of this parameter is as follows: always wait until n_completed_jobs_before_resubmit jobs have finished, then submit another n_completed_jobs_before_resubmit jobs. Its default value is 1, resulting in the behaviour described in the previous paragraph. Setting it to a larger value makes the optimisation wait for several jobs to finish before sampling new parameters. Setting n_completed_jobs_before_resubmit = n_jobs_per_iteration results in what one would intuitively expect of iterations: the optimisation waits for n_jobs_per_iteration jobs to finish and only then starts the next iteration with another n_jobs_per_iteration jobs.
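
For example (in TOML; the numbers are only illustrative), the following settings result in 100 / 20 = 5 iterations. Because n_completed_jobs_before_resubmit equals n_jobs_per_iteration, these are distinct iterations: all 20 jobs of an iteration have to finish before the next 20 are submitted.

[optimization_setting]
number_of_samples = 100
n_jobs_per_iteration = 20
n_completed_jobs_before_resubmit = 20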

Optimizer Settings

optimizer_settings expects as value a dictionary with configuration specific to the method that is specified in optimizer_str. Below are the corresponding parameters for each method.

cem_metaoptimizer

with_restarts : bool

Required.

Whether a specific set of settings can be run multiple times. This can be useful to automatically verify whether good runs were just lucky (e.g. because of the random seed), making the found solutions more robust.

If enabled, new settings are sampled for the first num_jobs_in_elite jobs. After that, each new job has a 20% chance of reusing the settings of a previous job (drawn from the set of best jobs).

num_jobs_in_elite : int

Required.

TODO
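
A sketch of how this optimiser could be configured (in TOML; the numbers are only illustrative):

optimizer_str = "cem_metaoptimizer"

[optimizer_settings]
with_restarts = true
num_jobs_in_elite = 10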

nevergrad

Note

To use nevergrad, the optional dependencies from the “nevergrad” group are needed, see Optional Dependencies.

opt_alg

Required.

TODO

Specific for grid_search

local_run

TODO

load_existing_results : bool = false

TODO

restarts

Required.

How often to run each configuration (useful if there is some randomness in the result).

samples

TODO

hyperparam_list

Required.

List of parameters over which the grid search is performed. Expects a list of dictionaries with each entry having the following elements:

  • param: Parameter name (e.g. “fn_args.x”).

  • values: List of values to try. Be careful with types: 42 will be passed as an int; use 42.0 if you want a float instead.
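
Mirroring the format of optimized_params above, a hyperparam_list might look like this (in TOML; parameter names and values are placeholders):

[[hyperparam_list]]
param = "fn_args.x"
values = [ 1.0, 2.0, 3.0 ]

[[hyperparam_list]]
param = "fn_args.method"
values = [ "foo", "bar" ]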

Overwriting Parameters on the Command Line

When executing grid_search or hp_optimization it is possible to overwrite one or more parameters of the config file by providing values on the command line.

The general syntax for this is parameter_name=value given after the config file. Note, however, that value is evaluated as Python code. This means that string values need to be quoted in a way that is preserved by the shell. So for example to use a custom name for the output directory:

python3 -m cluster_utils.grid_search config.json 'optimization_procedure_name="foo"'

Nested parameters can be set using dots:

python3 -m cluster_utils.grid_search config.json 'git_params.branch="foo"'