HPC Meets Cloud: Building Efficient Clouds for HPC, Big Data, and Deep Learning Middleware and Applications

A Tutorial to be presented at The 10th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2017)
Dhabaleswar K. (DK) Panda and Xiaoyi Lu (The Ohio State University)

When: Dec 8, 2017 (Morning)
Where: Austin, Texas, USA


HPC clusters with multi-/many-core processors, accelerators, and high-performance interconnects (such as InfiniBand, Omni-Path, iWARP, and RoCE) have grown significantly over the last few years. To alleviate the cost burden, sharing HPC cluster resources with end users through virtualization, for both scientific computing and Big Data processing, is becoming increasingly attractive. In this tutorial, we first provide an overview of popular virtualization and resource management software in HPC cloud environments, such as hypervisors (e.g., KVM), containers (e.g., Docker, Singularity), OpenStack, and Slurm. We then provide an overview of high-performance interconnects and communication mechanisms on HPC clouds, such as InfiniBand, RDMA, SR-IOV, and IVShmem. We further discuss the opportunities and technical challenges of designing a high-performance MPI runtime for these environments. Next, we introduce our proposed approaches to enhancing MPI library design over SR-IOV enabled InfiniBand clusters with both virtual machines and containers. We also discuss how to integrate these designs into popular cloud management systems like OpenStack and HPC cluster resource managers like Slurm. Beyond HPC middleware and applications, we will demonstrate how high-performance solutions can be designed to run Big Data and Deep Learning workloads (such as Hadoop, Spark, TensorFlow, CNTK, and Caffe) in HPC cloud environments.

Targeted Audience and Scope

This tutorial is targeted at people working in the areas of HPC, Big Data processing, and Deep Learning on modern HPC clouds with high-performance interconnects. The content level will be as follows: 30% beginner, 40% intermediate, and 30% advanced. There is no fixed prerequisite. As long as attendees have a general knowledge of Cloud Computing, HPC, Big Data, Hadoop, Spark, networking and storage architecture, and related issues, they will be able to understand and appreciate it. The tutorial is designed so that attendees are exposed to the topics in a smooth and progressive manner, and it is organized as a single coherent talk covering multiple topics.

Outline of the Tutorial

Brief Biography of Speakers

Dr. Dhabaleswar K. (DK) Panda is a Professor of Computer Science at the Ohio State University. He obtained his Ph.D. in computer engineering from the University of Southern California. His research interests include parallel computer architecture, high performance computing, communication protocols, file systems, network-based computing, and Quality of Service. He has published over 400 papers in major journals and international conferences related to these research areas. Dr. Panda and his research group members have been doing extensive research on modern networking technologies including InfiniBand, HSE, and RDMA over Converged Enhanced Ethernet (RoCE). His research group is currently collaborating with National Laboratories and leading InfiniBand and 10GigE/iWARP companies on designing various subsystems of next generation high-end systems. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) open-source software package, developed by his research group, is currently being used by more than 2,825 organizations worldwide (in 85 countries). This software has enabled several InfiniBand clusters (including the 1st one) to get into the latest TOP500 ranking. More than 428,000 downloads of this software have taken place from the project's site. The software is also available with the stacks of network vendors (InfiniBand and iWARP), server vendors, and Linux distributors. The RDMA-enabled Apache Hadoop, Spark, and Memcached packages, consisting of acceleration for HDFS, MapReduce, RPC, Spark, and Memcached, are publicly available from the High-Performance Big Data (HiBD) project site: http://hibd.cse.ohio-state.edu. These packages are currently being used by more than 250 organizations in 31 countries. More than 23,450 downloads have taken place from the project's site. Dr. Panda's research is supported by funding from the US National Science Foundation, the US Department of Energy, and several industry partners including Intel, Cisco, SUN, Mellanox, QLogic, NVIDIA, and NetApp. He is an IEEE Fellow and a member of ACM. More details about Dr. Panda, including a comprehensive CV and publications, are available here.

Dr. Xiaoyi Lu is a Research Scientist in the Department of Computer Science and Engineering at the Ohio State University, USA. His current research interests include high performance interconnects and protocols, Big Data, the Hadoop/Spark/Memcached ecosystem, parallel computing models (MPI/PGAS), virtualization, and Cloud Computing. He has published over 90 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities (PC Co-Chair, PC Member, Reviewer, Session Chair) for academic journals and conferences. Dr. Lu is currently leading the research and development of RDMA-based accelerations for Apache Hadoop, Spark, HBase, and Memcached, and of the OSU HiBD micro-benchmarks, which are publicly available from http://hibd.cse.ohio-state.edu. These libraries are currently being used by more than 250 organizations from 31 countries. More than 23,450 downloads of these libraries have taken place from the project site. He is a core member of the MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) project, and he is leading the research and development of MVAPICH2-Virt (high-performance and scalable MPI for hypervisor and container based HPC cloud). He is a member of IEEE and ACM. More details about Dr. Lu are available here.
Last Updated: Oct. 03, 2017