Backlog

Johannes Klicpera edited this page Apr 19, 2021 · 78 revisions

Possible and planned features:

  • Proper documentation via readthedocs
  • Tests
  • Pass "Batch job submission failed" errors to user
  • Detect the full Slurm state instead of only whether the job got killed, and reflect it in the database (raw last-seen state plus the SEML-equivalent state). Potentially remove the KILLED state and add a "reason" field instead. Remove detect-killed and make seml status the primary way of detecting Slurm states.
  • SEML portable mode for publishing source code: start a local experiment directly from a config (no MongoDB or Slurm, only Sacred)
  • Integrate with TensorBoard HParams for nicer evaluation
  • Pausing experiments (hold/stop/suspend)
  • Recommend using a separate DB for each user. Maybe provide installation instructions?
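
The Slurm state item above could be approached roughly as follows. This is only a sketch: the `SLURM_TO_SEML` mapping, the `SlurmStatus` record, the `UNKNOWN` fallback, and the way the "reason" field is filled are all assumptions made for illustration, not existing SEML behavior. It takes a state string as reported by Slurm (for example `CANCELLED by 1234`) and derives the raw state, a SEML-equivalent state, and a reason.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from raw Slurm job states to SEML-style states.
# Slurm states that are not listed here fall back to "UNKNOWN".
SLURM_TO_SEML = {
    "PENDING": "PENDING",
    "RUNNING": "RUNNING",
    "COMPLETING": "RUNNING",
    "COMPLETED": "COMPLETED",
    "FAILED": "FAILED",
    "OUT_OF_MEMORY": "FAILED",
    "CANCELLED": "INTERRUPTED",
    "TIMEOUT": "INTERRUPTED",
    "PREEMPTED": "INTERRUPTED",
    "NODE_FAIL": "INTERRUPTED",
}

# States that need no further explanation in the database.
NORMAL_STATES = {"PENDING", "RUNNING", "COMPLETED"}


@dataclass(frozen=True)
class SlurmStatus:
    raw: str               # last seen Slurm state, e.g. "TIMEOUT"
    seml_state: str        # SEML-equivalent state (assumed naming)
    reason: Optional[str]  # raw state for abnormal outcomes, else None


def interpret_slurm_state(state_field: str) -> SlurmStatus:
    """Turn a Slurm state string into raw state, SEML state, and reason."""
    tokens = state_field.split()
    # Keep only the first word and drop a trailing "+" (truncated output).
    raw = tokens[0].rstrip("+").upper() if tokens else "UNKNOWN"
    seml_state = SLURM_TO_SEML.get(raw, "UNKNOWN")
    reason = None if seml_state in NORMAL_STATES else raw
    return SlurmStatus(raw=raw, seml_state=seml_state, reason=reason)
```

With a "reason" field like this, a separate KILLED state would no longer be needed, since a cancelled or timed-out job is stored as interrupted together with the Slurm state that caused it.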

Low priority:

  • Job chaining via sbatch --dependency
  • Suspend (and then restart) experiments
  • Integrate with PyTorch Lightning (what would this even mean? Some convenience functions?)
  • Automatic hyperparameter optimization (via Sherpa, hyperopt, Optuna?), running in parallel on the cluster
  • Make Sacred optional (makes SEML easier for beginners, and Sacred might be discontinued at some point)
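
As an illustration of the job chaining item above, the following minimal sketch submits a list of batch scripts so that each one starts only after the previous one finished successfully. It relies on `sbatch --parsable` (which prints the job ID, optionally followed by `;cluster`) and `--dependency=afterok:<jobid>`. The helper names `build_sbatch_command`, `parse_job_id`, and `submit_chain` are hypothetical and not part of SEML.

```python
import subprocess
from typing import Callable, List, Optional


def build_sbatch_command(script: str, depends_on: Optional[int] = None,
                         condition: str = "afterok") -> List[str]:
    """Build an sbatch call, optionally dependent on an earlier job."""
    cmd = ["sbatch", "--parsable"]
    if depends_on is not None:
        cmd.append(f"--dependency={condition}:{depends_on}")
    cmd.append(script)
    return cmd


def parse_job_id(parsable_output: str) -> int:
    """Extract the job ID from `sbatch --parsable` output."""
    # The output is either "jobid" or "jobid;cluster".
    return int(parsable_output.strip().split(";")[0])


def _run_sbatch(cmd: List[str]) -> str:
    """Run sbatch and return its standard output (requires Slurm)."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def submit_chain(scripts: List[str],
                 submit: Callable[[List[str]], str] = _run_sbatch) -> List[int]:
    """Submit scripts so that each job waits for the previous one."""
    job_ids: List[int] = []
    previous: Optional[int] = None
    for script in scripts:
        output = submit(build_sbatch_command(script, depends_on=previous))
        previous = parse_job_id(output)
        job_ids.append(previous)
    return job_ids
```

The `submit` parameter is injectable so the chaining logic can be tried without a cluster. On a machine with Slurm, `submit_chain(["train.sh", "eval.sh"])` would submit `eval.sh` with a dependency on the job created for `train.sh` (both script names are placeholders).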