High Performance Distributed Deep Learning: A Beginner's Guide

A tutorial to be presented at PPoPP 2019: Principles and Practice of Parallel Programming 2019

When: Feb 17, 2019 (8:00 am - 12:00 noon)
Where: Marriott Marquis, Washington DC, USA

Dhabaleswar K. (DK) Panda, Ammar Ahmad Awan, and Hari Subramoni

The Ohio State University

Abstract

The current wave of advances in Deep Learning (DL) has led to many exciting challenges and opportunities for Computer Science and Artificial Intelligence researchers alike. Modern DL frameworks like Caffe2, TensorFlow, Cognitive Toolkit (CNTK), PyTorch, and several others have emerged that offer ease of use and flexibility to describe, train, and deploy various types of Deep Neural Networks (DNNs). In this tutorial, we will provide an overview of interesting trends in DNN design and how cutting-edge hardware architectures are playing a key role in moving the field forward. We will also present an overview of different DNN architectures and DL frameworks. Most DL frameworks started with a single-node/single-GPU design. However, approaches to parallelize the process of DNN training are also being actively explored. The DL community has pursued different distributed training designs that exploit communication runtimes like gRPC, MPI, and NCCL. In this context, we will highlight new challenges and opportunities for communication runtimes to efficiently support distributed DNN training. We will also highlight some of our co-design efforts to utilize CUDA-Aware MPI for large-scale DNN training on modern GPU clusters. Finally, we include hands-on exercises in this tutorial to enable attendees to gain first-hand experience of running distributed DNN training experiments on a modern GPU cluster.

Tutorial Objectives

Recent advancements in Artificial Intelligence (AI) have been fueled by the resurgence of Deep Neural Networks (DNNs) and various Deep Learning (DL) frameworks like Caffe, Facebook Caffe2, Facebook Torch/PyTorch, Chainer/ChainerMN, Google TensorFlow, and Microsoft Cognitive Toolkit (CNTK). DNNs have found widespread applications in classical areas like Image Recognition, Speech Processing, and Textual Analysis, as well as areas like Cancer Detection, Medical Imaging, and even Autonomous Vehicle systems. The momentum that DL has gained recently can be attributed to two driving elements: the first is the public availability of various data sets like ImageNet, CIFAR, etc., and the second is the widespread adoption of data-parallel hardware like GPUs and accelerators to perform DNN training. The raw number-crunching capabilities of GPUs have significantly improved DNN training. Today, the community is designing better, bigger, and deeper networks to improve accuracy through models like AlexNet, GoogLeNet, Inception v3, and VGG. The models differ in their architecture (number and type of layers) but share the common requirement of faster computation and communication capabilities from the underlying systems. Based on these trends, this tutorial is proposed with the following objectives:

Help newcomers to the field of distributed Deep Learning (DL) on modern high-performance computing clusters to understand various design choices and implementations of several popular DL frameworks.
Guide Message Passing Interface (MPI) application researchers, designers and developers to achieve optimal training performance with distributed DL frameworks like Google TensorFlow, OSU-Caffe, CNTK, and ChainerMN on modern HPC clusters with high-performance interconnects (e.g., InfiniBand), NVIDIA GPUs, and multi/many core processors.
Demonstrate the impact of advanced optimizations and tuning of CUDA-Aware MPI libraries like MVAPICH2 on DNN training performance through case studies with representative benchmarks and applications.

Targeted Audience

This tutorial is targeted at various categories of people working in the areas of Deep Learning and MPI-based distributed DNN training on modern HPC clusters with high-performance interconnects. The specific audiences this tutorial is aimed at include:

Scientists, engineers, researchers, and students engaged in designing next-generation Deep Learning frameworks and applications over high-performance interconnects and GPUs
Designers and developers of Caffe, TensorFlow, and other DL frameworks who are interested in scaling out DNN training to multiple nodes of a cluster
Newcomers to the field of Deep Learning on modern high-performance computing clusters who are interested in familiarizing themselves with Caffe, CNTK, OSU-Caffe, and other MPI-based DL frameworks
Managers and administrators responsible for setting up next-generation Deep Learning execution environments and modern high-performance clusters/facilities in their organizations/laboratories

Audience Prerequisites

There are no fixed prerequisites. As long as attendees have general knowledge of HPC and networking, they will be able to understand and appreciate the tutorial. The tutorial is designed so that attendees are exposed to the topics in a smooth and progressive manner. The content level will be as follows: 60% beginner, 30% intermediate, and 10% advanced.

Detailed Outline of the Tutorial

The tutorial is organized along the following topics with a detailed time budget (half-day):

Introduction

The Past, Present, and Future of Deep Learning (DL)

Brief History and Current/Future Trends
DL Resurgence in the Many-core Era

What are Deep Neural Networks?

Brief Introduction
Training and Inference

Diverse Applications of Deep Learning

Vision
Speech
Text
Autonomous Driving

Deep Learning Frameworks?

Why do we need DL frameworks?
Define-by-run frameworks vs. Define-and-run?
Caffe/Caffe2
Microsoft Cognitive Toolkit (CNTK)
Chainer/ChainerMN
Torch/PyTorch
Google TensorFlow

Overview of Execution Environments

Where do we run our DL Framework?
Conventional vs. Upcoming Execution Environments
DL Frameworks and Underlying (BLAS/DNN) Libraries
Holistic Performance Characterization

Parallel and Distributed DNN Training

The Need for Parallel and Distributed Training
Parallelization Strategies
Communication Runtimes
Scale-up and Scale-out

Latest Trends in HPC Technologies

HPC Hardware

Interconnects (InfiniBand, RoCE, and Omni-Path)
GPUs, Multi-/Many-cores, FPGAs, TPUs, Intel Neural Network Processor, Intelligence Processing Unit
Storage - NVMe, SSDs, Burst Buffers, etc.

Communication Middleware

Message Passing Interface (MPI)
NVIDIA NCCL/NCCL2 and Facebook Gloo
Intel Machine Learning Scaling Library

Challenges in Exploiting HPC Technologies

Large Batch and Model Size, Accuracy, and Scalability
Exploiting GPUs and CUDA-Aware MPI
Co-design of Communication Runtimes and DL Frameworks
Efficient Collective Communication for DL Workloads

Solutions and Case Studies

NVIDIA NCCL/NCCL2
LLNL Aluminum
Baidu-allreduce
Facebook Gloo and Caffe2
Co-design of MPI Runtimes and DL Frameworks

MPI+NCCL for CUDA-Aware CNTK
Ring-based Optimized MPI for CUDA-Aware CNTK
OSU-Caffe

Distributed Training for TensorFlow

TensorFlow with gRPC
TensorFlow with No-gRPC
TensorFlow with gRPC+X (X=MPI,NCCL/NCCL2)

Scaling DNN Training on Multi-/Many-core CPUs

Intel Optimized Caffe + Intel MLSL
Intel Optimized TensorFlow

PowerAI Distributed Deep Learning

Hands-on Exercises

Distributed Training for TensorFlow + MPI (Horovod)
Horovod MPI with MVAPICH2 and MVAPICH2-GDR
Horovod MPI with MVAPICH2-GDR and NCCL2

Open Issues and Challenges

Which Framework should I use?

Use-cases, Eco-systems, and Application Domains

What is the Rationale behind NCCL/NCCL2, Aluminum, Gloo, and MPI?
How can we handle scenarios which do not fit in available memory (out of core)?
Convergence of DL and HPC research
Thoughts on DL Benchmarks and Standardization
Scalability and Large batch-size training?

Conclusion

Brief Bio of Presenters

Dr. Dhabaleswar K. (DK) Panda is a Professor and University Distinguished Scholar of Computer Science at the Ohio State University. He obtained his Ph.D. in computer engineering from the University of Southern California. His research interests include parallel computer architecture, high performance computing, communication protocols, file systems, network-based computing, and Quality of Service. He has published over 400 papers in major journals and international conferences related to these research areas. Dr. Panda and his research group members have been doing extensive research on modern networking technologies including InfiniBand, HSE and RDMA over Converged Enhanced Ethernet (RoCE). His research group is currently collaborating with National Laboratories and leading InfiniBand and 10GigE/iWARP companies on designing various subsystems of next generation high-end systems. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) open-source software package, developed by his research group, is currently being used by more than 2,950 organizations worldwide (in 86 countries). This software has enabled several InfiniBand clusters (including the 1st one) to get into the latest TOP500 ranking. More than 507,000 downloads of this software have taken place from the project's site. The software is also available with the stacks of network vendors (InfiniBand and iWARP), server vendors and Linux distributors. The RDMA-enabled Apache Hadoop, Spark and Memcached packages, consisting of acceleration for HDFS, MapReduce, RPC, Spark and Memcached, are publicly available from the High-Performance Big Data (HiBD) project site: http://hibd.cse.ohio-state.edu. These packages are currently being used by more than 295 organizations in 35 countries. More than 28,450 downloads have taken place from the project's site. The group has also been focusing on co-designing Deep Learning Frameworks and MPI Libraries. High-performance and scalable versions of the Caffe and TensorFlow frameworks are available from the High-Performance Deep Learning (HiDL) project site: http://hidl.cse.ohio-state.edu. Dr. Panda's research is supported by funding from the US National Science Foundation, the US Department of Energy, and several industry partners including Intel, Cisco, SUN, Mellanox, Microsoft, QLogic, NVIDIA and NetApp. He is an IEEE Fellow and a member of ACM. More details about Dr. Panda, including a comprehensive CV and publications, are available here.

Ammar Ahmad Awan received his B.S. and M.S. degrees in Computer Science and Engineering from National University of Science and Technology (NUST), Pakistan and Kyung Hee University (KHU), South Korea, respectively. Currently, Ammar is working towards his Ph.D. degree in Computer Science and Engineering at The Ohio State University. His current research focus lies at the intersection of High Performance Computing (HPC) libraries and Deep Learning (DL) frameworks. He previously worked on a Java-based Message Passing Interface (MPI) and nested parallelism with OpenMP and MPI for scientific applications. He has published 14 papers in conferences and journals related to these research areas. He actively contributes to various projects like MVAPICH2-GDR (High Performance MPI for GPU clusters), OMB (OSU Micro Benchmarks), and HiDL (High Performance Deep Learning). He is the lead author of the OSU-Caffe framework (part of the HiDL project) that allows efficient distributed training of Deep Neural Networks.

Dr. Hari Subramoni has been a research scientist in the Department of Computer Science and Engineering at the Ohio State University, USA, since September 2015. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data and cloud computing. He has published over 70 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. Dr. Subramoni is doing research on the design and development of the MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC and CAF)) software packages. He is a member of IEEE. More details about Dr. Subramoni are available here.