mpi_jm Objectives:

  • Efficiently run a large set of jobs of bounded size in a large allocation of nodes.
  • Solve the node fragmentation problem: repeated launch and completion of jobs leaves the available nodes scattered, with poor interconnect performance between them.
  • Automate binding of node resources: pick the right cores to match GPU use and set affinity for OpenMP threads.
  • Support overlay of jobs that use distinct resources on the same nodes, e.g. GPU jobs alongside CPU jobs.
  • Jobs support pre- and post-actions that can be used to chain computations with simple dependencies.
  • Customizable collection of the workload through a Python-based frontend.
  • Dynamic dependencies: jobs can wait for a set of prerequisite conditions, such as completion of a job with neighboring parameters.
  • Python-based runtime estimation, which can implement a machine learning model built from prior run data.
  • Dynamic reconfiguration of jobs so that they complete within the allocation time.
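
To make a few of these objectives concrete, the following is a minimal Python sketch of one scheduling step that combines three of them: placing each job on a contiguous block of nodes (to limit fragmentation), holding back jobs whose prerequisites have not completed, and skipping jobs whose estimated runtime exceeds the time left in the allocation. It is an illustration only and does not use or reproduce the mpi_jm API; the names `Job`, `find_contiguous`, and `schedule_step`, and the first-fit placement policy, are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    # Hypothetical job record, not an mpi_jm type.
    name: str
    nodes: int                                # number of nodes the job needs
    est_minutes: float                        # runtime estimate from some estimator
    deps: set = field(default_factory=set)    # names of prerequisite jobs


def find_contiguous(free, count):
    """Return the start index of the first run of `count` free nodes, or None."""
    run = 0
    for i, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == count:
            return i - count + 1
    return None


def schedule_step(pending, done, free, minutes_left):
    """Start every pending job that is ready, fits in time, and fits in space.

    pending      -- list of Job objects not yet started
    done         -- set of names of completed jobs
    free         -- list of booleans, True where a node is idle (modified in place)
    minutes_left -- time remaining in the allocation
    Returns a list of (job name, first node index) for the jobs started.
    """
    started = []
    for job in pending:
        if not job.deps <= done:              # a prerequisite has not completed
            continue
        if job.est_minutes > minutes_left:    # would overrun the allocation
            continue
        start = find_contiguous(free, job.nodes)
        if start is None:                     # no contiguous block large enough
            continue
        for i in range(start, start + job.nodes):
            free[i] = False                   # mark the block as busy
        started.append((job.name, start))
    return started
```

A caller would invoke `schedule_step` each time a job completes, after marking that job's nodes as free and adding its name to `done`. First-fit placement is chosen here only for brevity; a real scheduler would also need to account for interconnect topology when deciding which block to use.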