Cluster Utils Documentation

cluster_utils is a tool for easily running hyperparameter optimization or grid search on a Slurm or HTCondor [1] cluster. It takes care of submitting and monitoring the jobs as well as aggregating the results.

It is geared towards tasks typical for machine learning research, for example running multiple seeds, grid searches, and hyperparameter optimization.

cluster_utils is developed by the Autonomous Learning group at the University of Tübingen.

Main Features

The main features include:

  • Parametrized jobs and hyperparameter optimization: run grid searches or multi-stage hyperparameter optimization.

  • Several cluster backends: Slurm and HTCondor [1] are currently supported, as well as local runs on a single machine.

  • Automatic job management: jobs are submitted, monitored (with error reporting), and cleaned up in an automated way.

  • Timeouts & restarting of failed jobs: jobs can be stopped and resubmitted after some time; failed jobs can be (manually) restarted.

  • Integrated with git: jobs are run from a git clone with customizable branch and commit to enhance reproducibility.

  • Reporting: results are summarized in CSV files, and optionally PDF reports with basic summaries and plots.
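
To make the idea of parametrized jobs more concrete, the following minimal sketch shows what a grid search enumerates: one job per combination of parameter values, with seeds treated as just another parameter. This is a conceptual illustration in plain Python, not the cluster_utils implementation or API; the function name expand_grid and the parameter names are hypothetical.

```python
import itertools


def expand_grid(grid):
    """Return one parameter dict per point in the Cartesian product of `grid`.

    `grid` maps a parameter name to the list of values to try.
    (Hypothetical helper for illustration; not part of cluster_utils.)
    """
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


# Hypothetical search space: two learning rates, each run with three seeds.
grid = {"learning_rate": [0.001, 0.01], "seed": [0, 1, 2]}
jobs = expand_grid(grid)

print(len(jobs))  # 6
for job in jobs:
    print(job)
```

Each resulting dictionary corresponds to one job that a tool like cluster_utils would submit and monitor on the cluster.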

Basic Usage

cluster_utils provides two basic commands:

python3 -m cluster_utils.grid_search specification_of_grid_search.json

for grid search and

python3 -m cluster_utils.hp_optimization specification_of_hp_opt.json

for hyperparameter optimization.
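
If the commands above need to be launched from a script, for instance as part of a larger experiment pipeline, a small wrapper along the following lines may help. It only assembles the command line shown above; the helper name build_command is hypothetical, and the specification file names are the placeholders used in this section.

```python
import sys

# The two entry-point modules named in this section.
MODES = ("grid_search", "hp_optimization")


def build_command(mode, spec_path):
    """Build the argument list for `python -m cluster_utils.<mode> <spec_path>`.

    (Hypothetical helper for illustration; not part of cluster_utils.)
    """
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    # sys.executable is the interpreter running this script, so the same
    # environment is used for the launched module.
    return [sys.executable, "-m", f"cluster_utils.{mode}", spec_path]


cmd = build_command("grid_search", "specification_of_grid_search.json")
print(" ".join(cmd))

# To actually launch the run, pass the list to subprocess, e.g.:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting issues when the specification path contains spaces.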

For more information, see Usage. For simple demonstrations, see the examples in examples/basic/ and examples/rosenbrock/.