HPC Meets Cloud: Building Efficient Clouds for HPC, Big Data, and Deep Learning Middleware and Applications
When: Dec 8, 2017 (Morning)
Where: Austin, Texas, USA
Abstract
The last few years have witnessed significant growth in HPC clusters
with multi-/many-core processors, accelerators, and high-performance
interconnects (such as InfiniBand, Omni-Path, iWARP, and RoCE). To alleviate
the cost burden, sharing HPC cluster resources with end users through
virtualization for both scientific computing and Big Data processing is
becoming increasingly attractive. In this tutorial, we first provide an
overview of popular virtualization system software for HPC cloud environments,
such as hypervisors (e.g., KVM), containers (e.g., Docker, Singularity),
OpenStack, Slurm, etc. Then we provide an overview of high-performance
interconnects and communication mechanisms on HPC clouds, such as InfiniBand,
RDMA, SR-IOV, IVShmem, etc. We further discuss the opportunities and technical
challenges of designing a high-performance MPI runtime over these environments.
Next, we introduce our novel approaches to enhancing MPI library design
over SR-IOV-enabled InfiniBand clusters with both virtual machines and
containers. We also discuss how to integrate these designs into popular cloud
management systems like OpenStack and HPC cluster resource managers like Slurm.
Beyond HPC middleware and applications, we will also demonstrate how
high-performance solutions can be designed to run Big Data and Deep Learning
workloads (such as Hadoop, Spark, TensorFlow, CNTK, and Caffe) in HPC cloud
environments.
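To ground the discussion, below is a minimal sketch of the kind of standard MPI program whose behavior on HPC clouds the tutorial examines. It is plain MPI C code and assumes nothing beyond an MPI installation; the launch command in the comment uses an MVAPICH2-style mpirun_rsh launcher with placeholder VM hostnames. The point of the runtime designs introduced above is that such unmodified code can run inside virtual machines or containers while the MPI library exploits mechanisms like SR-IOV and IVShmem underneath.

    /*
     * Minimal MPI program: each rank reports where it is running.
     * Build:  mpicc -o hello_cloud hello_cloud.c
     * Run (MVAPICH2-style launcher; vm1/vm2 are placeholder hosts):
     *   mpirun_rsh -np 4 vm1 vm1 vm2 vm2 ./hello_cloud
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);

        /* On an HPC cloud, "name" may be a VM or container hostname. */
        printf("Rank %d of %d running on %s\n", rank, size, name);

        MPI_Finalize();
        return 0;
    }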
Targeted Audience and Scope
This tutorial is targeted at various categories of people working in the areas of HPC, Big Data processing, and Deep Learning on modern HPC clouds with high-performance interconnects. The specific audiences this tutorial is aimed at include:
- Scientists, engineers, researchers, and students engaged in designing next-generation HPC, Big Data, and Deep Learning systems and applications over HPC clouds with high-performance interconnects
- Designers and developers of middleware for Cloud Computing, HPC, Big Data, and Deep Learning (e.g., OpenStack, Slurm, MPI, Hadoop, Spark, gRPC/TensorFlow)
- Newcomers to the field of Cloud Computing, HPC, Big Data processing, and Deep Learning on modern high-performance computing clusters who are interested in familiarizing themselves with OpenStack, Slurm, MPI, Hadoop, Spark, gRPC/TensorFlow, RDMA, high-performance networking, etc.
- Managers and administrators responsible for setting up next-generation HPC clouds to efficiently run HPC, Big Data, and Deep Learning workloads in their organizations/laboratories
The content level will be as follows: 30% beginner, 40%
intermediate, and 30% advanced. There is no fixed prerequisite. As long as
attendees have general knowledge of Cloud Computing, HPC, Big Data, Hadoop,
Spark, networking and storage architecture, and related issues, they will
be able to understand and appreciate the material. The tutorial is designed
so that attendees are exposed to the topics in a smooth and progressive
manner, and it is organized as a coherent talk covering multiple topics.
Outline of the Tutorial
- Introduction to Cloud Computing and Virtualization Technologies
- Architecture Overview of Cloud System Software
- Hypervisor-based Virtualization
- Container-based Virtualization
- OpenStack and Other Cloud Resource Managers
- Slurm and SPANK
- Overview of Modern Interconnects, Protocols, and Storage Architectures on HPC Clouds
- InfiniBand, 10/40/100 GigE, iWARP, RoCE, Omni-Path technologies
- High-Performance Communication Mechanisms on HPC Clouds (e.g., RDMA, PCI Passthrough, SR-IOV, IVShmem, CMA, etc.)
- SSD/NVM-based storage and Cloud Storage Systems (e.g., OpenStack Swift)
- Architecture Overview of HPC, Big Data, and Deep Learning Middleware
- Message Passing Interface (MPI)
- MapReduce, YARN, HDFS, Spark, HBase
- Caffe, TensorFlow, BigDL
- Opportunities and Challenges of Building HPC Clouds on Modern Networking and Storage Architectures
- Overview of Benchmarks and Applications using MPI, Hadoop, Spark, gRPC/TensorFlow, Caffe, BigDL
- Designing High-Performance MVAPICH2 MPI Library on HPC Clouds and In-Depth Performance Evaluation
- VM-aware MVAPICH2 on InfiniBand Clusters
- Live-Migration Support in MVAPICH2 for SR-IOV-enabled InfiniBand
- Container-aware MVAPICH2 on InfiniBand Clusters
- MVAPICH2 on Nested Cloud Environments
- Designing High-Performance Big Data Libraries on HPC Clouds and In-Depth Performance Evaluation
- RDMA-based Designs for Hadoop Components
- Locality-aware Design for RDMA-Hadoop
- RDMA-based Swift Cloud Storage System for Big Data Workloads
- Designing High-Performance Deep Learning Libraries on HPC Clouds and In-Depth Performance Evaluation
- RDMA-based Designs for Deep Learning over Big Data Stacks (e.g., CaffeOnSpark, TensorFlowOnSpark, BigDL)
- In-Depth Characterization on Performance, Accuracy, Scalability, and Resource Utilization
- Integrated Designs with OpenStack and Slurm
- Extending OpenStack and Slurm for Managing SR-IOV and IVShmem (see the SPANK plugin sketch after this outline)
- OpenStack Heat-based Complex Appliances for MPI and Hadoop
- A Demo with Heat Appliances
- Conclusion and Q&A
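To give a concrete flavor of the Slurm/SPANK integration outlined above, the following is a minimal sketch of a SPANK plugin in C. It is illustrative only: the plugin name, the IVSHMEM_DEV_PATH variable, and the device path are hypothetical placeholders, not the actual MVAPICH2-Virt/Slurm extension. The sketch merely shows the callback structure through which per-job setup, such as exposing an IVShmem device to MPI processes, can be hooked into Slurm.

    /*
     * Hypothetical SPANK plugin sketch (illustration only).
     * Build as a shared object and register it in plugstack.conf:
     *   gcc -shared -fPIC -o ivshmem_demo.so ivshmem_demo.c
     */
    #include <slurm/spank.h>

    /* Register the plugin with Slurm's plugin stack. */
    SPANK_PLUGIN(ivshmem_demo, 1);

    /* Called when the plugin stack is initialized in each context. */
    int slurm_spank_init(spank_t sp, int ac, char **av)
    {
        if (spank_remote(sp))
            slurm_info("ivshmem_demo: loaded on a compute node");
        return ESPANK_SUCCESS;
    }

    /* Called on the compute node just before each task launches:
     * a natural hook for exposing a per-job IVShmem device to MPI. */
    int slurm_spank_task_init(spank_t sp, int ac, char **av)
    {
        /* Placeholder path; a real plugin would create and attach
         * the device (and similarly manage SR-IOV virtual functions). */
        if (spank_setenv(sp, "IVSHMEM_DEV_PATH", "/dev/shm/ivshmem0", 1)
                != ESPANK_SUCCESS)
            slurm_error("ivshmem_demo: failed to set IVSHMEM_DEV_PATH");
        return ESPANK_SUCCESS;
    }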
Brief Biography of Speakers
Dr. Dhabaleswar
K. (DK) Panda is a Professor of Computer Science at the Ohio State
University. He obtained his Ph.D. in computer engineering from the University
of Southern California. His research interests include parallel computer
architecture, high-performance computing, communication protocols, file
systems, network-based computing, and Quality of Service. He has published over
400 papers in major journals and international conferences related to these
research areas. Dr. Panda and his research group members have been doing
extensive research on modern networking technologies including InfiniBand, HSE
and RDMA over Converged Enhanced Ethernet (RoCE). His research group is
currently collaborating with National Laboratories and leading InfiniBand and
10GigE/iWARP companies on designing various subsystems of next generation
high-end systems. The
MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE)
open-source software
package, developed by his research group, is currently being used by more than
2,825 organizations worldwide (in 85 countries). This software has enabled
several InfiniBand clusters (including the 1st one) to get into the latest
TOP500 ranking. More than 428,000 downloads of
these libraries have taken place from the project's site.
These software packages are also available in the software
stacks of network vendors (InfiniBand and iWARP), server vendors, and
Linux distributors. The RDMA-enabled Apache Hadoop, Spark
and Memcached packages, consisting of
acceleration for HDFS, MapReduce, RPC, Spark, and Memcached, are publicly available from the High-Performance Big Data (HiBD) project site:
http://hibd.cse.ohio-state.edu. These packages are currently being
used by more than 250 organizations in 31 countries. More than 23,450
downloads have taken place from the project's site.
Dr. Panda's research is supported
by funding from the US National Science Foundation, the US Department of Energy, and
several industry partners, including Intel, Cisco, SUN, Mellanox, QLogic, NVIDIA, and
NetApp. He is an IEEE Fellow and a member of ACM. More details about Dr.
Panda, including a comprehensive CV and publications, are available here.
Dr. Xiaoyi Lu is a Research Scientist
in the Department of Computer Science and Engineering at the Ohio
State University, USA. His current research interests include
high-performance interconnects and protocols, Big Data, the
Hadoop/Spark/Memcached ecosystem, parallel computing models (MPI/PGAS),
virtualization, and Cloud Computing. He has published over 90 papers in international
journals and conferences related to these research areas. He has been
actively involved in various professional activities (PC Co-Chair, PC
Member, Reviewer, Session Chair) in academic journals and
conferences. Dr. Lu is currently leading the research
and development of
RDMA-based accelerations for Apache Hadoop, Spark, HBase, and Memcached,
as well as the OSU HiBD micro-benchmarks, which are publicly available from
http://hibd.cse.ohio-state.edu. These libraries are currently being
used by more than 250 organizations from 31 countries. More than
23,450 downloads of these libraries have taken place from the project
site. He is a core member of the
MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE)
project, and he is leading the
research and development of MVAPICH2-Virt (high-performance and scalable MPI for hypervisor- and container-based HPC clouds).
He is a member of IEEE and ACM. More
details about Dr. Lu are available here.
Last Updated: Oct. 03, 2017