High Performance Machine Learning, Deep Learning, and Data Science

A tutorial to be presented at ISCA 2022: International Symposium on Computer Architecture 2022

When: June 18, 2022 (8:30 am to 12:00 noon EDT with a break from 10:30-11:00 am EDT)
Where: New York City, USA (Conference Hotel - Sheraton Time Squares)

Dhabaleswar K. (DK) Panda, Hari Subramoni, Arpan Jain, and Aamir Shafi

The Ohio State University

Abstract

Recent advances in Machine and Deep Learning (ML/DL) have led to many exciting challenges and opportunities. Modern ML/DL and Data Science frameworks including TensorFlow, PyTorch, and Dask have emerged that offer high-performance training and deployment for various types of ML models and Deep Neural Networks (DNNs). This tutorial provides an overview of recent trends in ML/DL and the role of cutting-edge hardware architectures and interconnects in moving the field forward. We will also present an overview of different DNN architectures and ML/DL frameworks with special focus on parallelization strategies for model training. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures to efficiently support large-scale distributed training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU/GPU architectures available on modern HPC clusters. The tutorial covers training traditional ML models including — K-Means, linear regression, nearest neighbours — using the cuML framework accelerated using MVAPICH2-GDR. Also, the tutorial presents accelerating GPU-based Data Science applications using MPI4Dask, which is an MPI-based backend for Dask. Throughout the tutorial, we include hands-on exercises to enable attendees to gain first-hand experience of running distributed ML/DL training and Dask on a modern GPU cluster

Tutorial Objectives

Recent advancements in Artificial Intelligence (AI) have been fueled by the resurgence of Deep Neural Networks (DNNs) and state-of-the-art Machine Learning (ML) models. Deep Learning and Machine Learning have found widespread applications in classical as well as emerging applications areas like Image Recognition, Speech Processing, and Autonomous Vehicle systems. Machine learning is emerging as a popular approach for automation in business operations, identity verification, advertisements, marketing, etc. On the other hand, rapid growth in DL can be attributed to 1) Public availability of various datasets like ImageNet and CIFAR and 2) Widespread adoption of data-parallel hardware like GPUs and accelerators for DNN training. Today, the community is designing better, bigger, and deeper networks for improving the accuracy through models like ResNet-50 and Inception v4 that require HPC systems with high-bandwidth and low-latency interconnects to scale-out DNN training on hundreds of nodes efficiently. Based on these trends, this tutorial is proposed with the following objectives:

Help newcomers to the field of distributed Deep Learning (DL) on modern high-performance computing clusters to understand various design choices and implementations of several popular DL frameworks.
Guide Machine Learning / Data Science application researchers, designers and developers to achieve optimal training performance with distributed frameworks like Dask, and cuML on modern HPC clusters with high-performance interconnects (e.g., InfiniBand), Nvidia GPUs, and multi/many core processors.
Demonstrate the impact of advanced optimizations and tuning of CUDA-Aware MPI libraries like MVAPICH2 on DNN training performance through case studies with representative benchmarks and applications.

Targeted Audience

This tutorial is targeted for various categories of people working in the areas of Deep Learning, Machine Learning and MPI-based distributed DNN/ML training on modern HPC clusters with high-performance interconnects. Specific audience this tutorial is aimed at include:

Scientists, engineers, researchers, and students engaged in designing next-generation DL/ML frameworks and applications over high-performance interconnects and GPUs
Newcomers to the field of DL/ML on modern high-performance computing clusters who are interested in familiarizing themselves with PyTorch, TensorFlow, Dask, cuML, and other MPI-based DL/ML frameworks
Managers and administrators responsible for setting-up next generation DL/ML executions environments and modern high-performance clusters/facilities in their organizations/laboratories

Audience Prerequisites

There is no fixed pre-requisite. As long as the attendee has a general knowledge in HPC and Networking, he/she will be able to understand and appreciate it. The tutorial is designed in such a way that an attendee gets exposed to the topics in a smooth and progressive manner. The content level will be as follows: 50% beginner, 30% intermediate, and 20% advanced.

Detailed Outline of the Tutorial

The tutorial is organized along the following topics with a detailed time budget (half-day):

Introduction

The Past, Present, and Future of Artificial Intelligence (AI)

Brief History and Current/Future Trends of Machine Learning (ML) and Deep Learning (DL)

What are Deep Neural Networks?

Training and Inference

Diverse Applications of Artificial Intelligence

Vision, Speech, Text, and Autonomous Driving

Machine Learning and Deep Learning Frameworks?

Overview of Execution Environments

Where do we run our DL Framework? (Conventional vs. Upcoming Execution Environments)
Holistic Performance Characterization - DL Frameworks and Underlying (BLAS/DNN) Libraries

Distributed Machine Learning Algorithms using cuML
Parallel and Distributed DNN Training

Essential Concepts: Bach-propagation, Learning Rate, Batch Size
The Need for Parallel and Distributed Training
Parallelization Strategies, Communication Runtimes, and Scale-up and Scale-out

Data Science using Dask
Distributed Machine Learning Algorithms
Latest Trends in HPC Technologies

HPC Hardware
Communication Middleware

Challenges in Exploiting HPC Technologies

Large Batch and Model Size, Accuracy, and Scalability
Exploiting GPUs and CUDA-Aware MPI
Co-design of Communication Runtimes and DL Frameworks
Efficient Collective Communication for DL Workloads

Solutions and Case Studies

Data Parallelism

Distributed Training Support and Co-designs

Model and Hybrid Parallelism

GPipe, FlexFlow, HyPar-Flow, GEMS, and SUPER

Open Issues and Challenges
Hands-on Exercises

Distributed Machine Learning and Deep Learning Labs

Conclusion

Brief Bio of Presenters

Dr. Dhabaleswar K. (DK) Panda is a Professor and University Distinguished Scholar of Computer Science at the Ohio State University. He obtained his Ph.D. in computer engineering from the University of Southern California. His research interests include parallel computer architecture, high performance computing, communication protocols, files systems, network-based computing, and Quality of Service. He has published over 500 papers in major journals and international conferences related to these research areas. Dr. Panda and his research group members have been doing extensive research on modern networking technologies including InfiniBand, HSE, and RDMA over Converged Enhanced Ethernet (RoCE). His research group is currently collaborating with National Laboratories and leading InfiniBand, HSE, and RoCE companies on designing various subsystems of next generation high-end systems. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) open-source software package, developed by his research group, are currently being used by more than 3,200 organizations worldwide (in 89 countries). This software has enabled several InfiniBand clusters (including the 1st one) to get into the latest TOP500 ranking. More than 1.57 million downloads of these libraries have taken place from the project's site. These software packages are also available with the stacks for network vendors (InfiniBand and iWARP), server vendors and Linux distributors. The RDMA-enabled Apache Hadoop, Spark and Memcached packages, consisting of acceleration for HDFS, MapReduce, RPC, Spark and Memcached, are publicly available from High-Performance Big Data (HiBD) project site: http://hibd.cse.ohio-state.edu. These packages are currently being used by more than 340 organizations in 38 countries. More than 44,000 downloads have taken place from the project's site. The group has also been focusing on co-designing Deep Learning Frameworks and MPI Libraries. High-performance and scalable versions of the Caffe and TensorFlow frameworks are available from High-Performance Deep Learning (HiDL) Project site: site: http://hidl.cse.ohio-state.edu. Dr. Panda's research is supported by funding from US National Science Foundation, US Department of Energy, and several industry including AMD, Broadcom, Cisco, Intel, Oracle, Mellanox, Microsoft, NetApp, NVIDIA, and QLogic. He is an IEEE Fellow and a member of ACM. He is also a recipient of the 2022 IEEE Charles Babbage Award. More details about Dr. Panda, including a comprehensive CV and publications are available here.

Dr. Hari Subramoni is a research scientist in the Department of Computer Science and Engineering at the Ohio State University, USA, since September 2015. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data and cloud computing. He has published over 70 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. Dr. Subramoni is doing research on the design and development of MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC and CAF)) software packages. He is a member of IEEE. More details about Dr. Subramoni are available here.

Arpan Jain received his B.Tech. and M.Tech. degrees in Information Technology from ABV-IIITM, India. Currently, Arpan is working towards his Ph.D. degree in Computer Science and Engineering at The Ohio State University. His current research focus lies at the intersection of High Performance Computing (HPC) libraries and Deep Learning (DL) frameworks. He is working on parallelization and distribution strategies for large-scale Deep Neural Network (DNN) training. He previously worked on speech analysis, time series modeling, hyperparameter optimization, and object recognition. He actively contributes to projects like HiDL (high-performance deep learning), MVAPICH2-GDR software, and LBANN deep learning framework.

Aamir Shafi is currently a Research Scientist in the Department of CS&E at the Ohio State University where he is involved in the High Performance Big Data project. Dr. Shafi was a Fulbright Visiting Scholar at MIT where he worked on the award-winning Cilk technology. Dr. Shafi received his PhD in Computer Science from the University of Portsmouth, UK in 2006. Dr. Shafi’s current research interests include architecting robust libraries and tools for Big Data computation with emphasis on Machine and Deep Learning applications. Dr. Shafi co-designed and co-developed a Java-based MPI-like library called MPJ Express.