KubeFlux: a Scheduler Plugin Bridging the Cloud-HPC Gap in Kubernetes

Abstract

The cloud is an increasingly important market sector of computing and is driving innovation. Adoption of cloud technologies by high performance computing (HPC) is accelerating, and HPC users want their applications to perform well everywhere. While cloud orchestration frameworks like Kubernetes provide advantages like resiliency, elasticity, and automation, they are not designed to enable application performance to the same degree as HPC workload managers and schedulers. As HPC and cloud Computing converge, techniques from HPC can be integrated into the cloud to improve application performance and provide universal scalability.

We present KubeFlux, a Kubernetes plugin based on the Fluxion open-source HPC scheduler component of the Flux framework developed at the Lawrence Livermore National Laboratory. We introduce the Flux framework and the Fluxion scheduler and describe how their hierarchical, graph-based foundation is naturally suited to converged computing. We discuss uses for KubeFlux and compare the performance of an application scheduled by the Kubernetes default scheduler and KubeFlux. KubeFlux is an example of the rich capability that can be added to Kubernetes and paves the way to democratization of the cloud for HPC workloads.

Speakers

Biography

Claudia Misale is a Research Staff Member in the Hybrid Cloud Infrastructure Software group at IBM T.J. Watson Research Center (NY). Her research is focused on Kubernetes for IBM Public Cloud, and targets porting HPC applications to the cloud by enabling batch scheduling alternatives for Kubernetes. She is mainly interested in cloud computing and container technologies, and her background is on high-level parallel programming models and patterns, and big data analytics on HPC platforms. She received her master’s summa cum laude and
bachelor’s degree in Computer Science at the University of Calabria (Italy), and her PhD from the Computer Science Department of the University of Torino (Italy).


Daniel Milroy is a Computer Scientist at the Center for Applied Scientific Computing at the Lawrence Livermore National Laboratory. His research focuses on graph-based scheduling and resource representation and management for high performance computing (HPC) and cloud converged environments. While Dan’s research background is numerical analysis and software quality assurance and correctness for climate simulations, he is currently interested in scheduling and representing dynamic resources, and co-scheduling and management
techniques for HPC and cloud. Dan holds a B.A. in physics from the University of Chicago, and an M.S. and PhD in computer science from the University of Colorado Boulder.