PGAS and Hybrid MPI+PGAS Programming Models on Modern HPC Clusters with
Accelerators
A Tutorial to be presented at
The IEEE Cluster Conference (Cluster 2016)
by
Dhabaleswar K. (DK) Panda and Khaled Hamidouche (The Ohio State University)
When: Sept 12, 2016 (2:00-5:30pm)
Where: Taipei, Taiwan
Abstract
Multi-core processors, accelerators (GPGPUs), co-processors (Xeon Phis) and
high-performance interconnects (InfiniBand, 10-40 GigE/iWARP and RoCE) with
RDMA support are shaping the architectures for next generation clusters.
Efficient
programming models to design applications on these clusters as well as on
future exascale systems are still evolving. The new MPI-3 standard brings
enhancements to the Remote Memory Access (RMA) model and introduces
non-blocking collectives. Partitioned Global Address Space (PGAS) models provide
an attractive alternative to the MPI model owing to their easy-to-use global
shared-memory abstractions and lightweight one-sided communication. At the same
time, hybrid MPI+PGAS programming models are gaining attention as a possible
solution for programming exascale systems. These hybrid models allow codes
designed with MPI to take advantage of PGAS models without paying the
prohibitive cost of redesigning complete applications. They also enable
hierarchical application designs that use the different models to suit
modern architectures.
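As an illustration of this incremental approach, the following is a minimal
sketch (not part of the tutorial material) of a hybrid MPI+OpenSHMEM program:
an existing MPI collective phase is kept as-is, while one exchange is
re-expressed with a one-sided OpenSHMEM put. It assumes a unified runtime,
such as MVAPICH2-X, that supports both models in the same executable.

/* Hybrid MPI+OpenSHMEM sketch (illustrative only).  Assumes a unified
 * runtime (e.g., MVAPICH2-X) that allows both models in one executable. */
#include <mpi.h>
#include <shmem.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                          /* both models share one runtime */

    int rank = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric-heap allocation: remotely accessible from every PE */
    int *remote = (int *) shmem_malloc(sizeof(int));
    int  local  = rank;

    /* One-sided put to the right neighbor -- no matching receive needed */
    shmem_int_put(remote, &local, 1, (rank + 1) % npes);
    shmem_barrier_all();

    /* Unmodified MPI phase of the application, e.g., a global reduction */
    int sum = 0;
    MPI_Allreduce(remote, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    shmem_free(remote);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}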
In this tutorial, we provide an overview of the research and development taking
place on these programming models (MPI, PGAS and hybrid MPI+PGAS) and discuss
the opportunities and challenges in designing the associated runtimes as
we head toward exascale computing with accelerator-based systems. We start with
an in-depth overview of modern system architectures with multi-core processors,
GPU accelerators, Xeon Phi co-processors and high-performance interconnects. We
present an overview of the new MPI-3 RMA model, language-based (UPC and CAF) and
library-based (OpenSHMEM, UPC++) PGAS models. We introduce MPI+PGAS hybrid
programming
models and the associated unified runtime concept. We examine and contrast
different challenges in designing high-performance MPI-3-compliant, OpenSHMEM
and hybrid MPI+OpenSHMEM runtimes for both host-based and accelerator-based
(GPU and MIC) systems. We present case studies using application kernels to
demonstrate how one can exploit hybrid MPI+PGAS programming models to achieve
better performance without rewriting the complete code. Using the publicly
available MVAPICH2-X, MVAPICH2-GDR and MVAPICH2-MIC libraries, we present the
challenges and opportunities to design efficient MPI, PGAS and hybrid MPI+PGAS
runtimes for next generation systems. We introduce the concept of 'CUDA-Aware
MPI/PGAS' to combine high productivity and high performance. We show how to
take advantage of GPU features such as Unified Virtual Addressing (UVA), CUDA-IPC and
GPUDirect RDMA technologies to design efficient MPI, OpenSHMEM and Hybrid
MPI+OpenSHMEM runtimes. Similarly, using the MVAPICH2-MIC runtime, we present
optimized data-movement schemes for different system configurations, including
multiple MICs per node on the same socket and/or on different sockets.
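To give a flavor of the CUDA-Aware MPI concept mentioned above, the following
is a small illustrative sketch (error checking omitted, not from the tutorial
material): a GPU device buffer is passed directly to MPI_Send/MPI_Recv, and a
CUDA-aware runtime such as MVAPICH2-GDR moves the data internally using
mechanisms like CUDA-IPC or GPUDirect RDMA.

/* CUDA-Aware MPI sketch (illustrative only, error checking omitted):
 * device pointers are handed directly to MPI; a CUDA-aware runtime
 * (e.g., MVAPICH2-GDR) performs the GPU-to-GPU transfer internally. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;                               /* GPU (device) buffer */
    cudaMalloc((void **) &d_buf, n * sizeof(float));

    if (rank == 0) {
        /* No explicit cudaMemcpy to a host staging buffer is needed */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}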
Objectives
HPC systems are marked by the use of multi-core processors, accelerators
(GPGPUs), co-processors (Xeon Phis) and high-performance interconnects
(InfiniBand, 10-40 GigE/iWARP and RoCE) with RDMA support. Efficient programming
models to design applications on these clusters as well as on future exascale
systems are still evolving. However, programming models, runtimes and
associated application designs are not yet taking full advantage of these trends.
Highlighting these emerging trends and the associated challenges, this tutorial
is proposed with the following goals:
- Teach designers, developers and users how to efficiently design
and use parallel programming models (MPI and PGAS) and accelerators (GPU and
MIC)
- Guide scientists, engineers, researchers and students engaged in
designing next-generation HPC systems and applications
- Help newcomers to the field of HPC and exascale computing to
understand the concepts and designs of parallel programming models,
accelerators, networking, and RDMA
- Demonstrate the impact that advanced optimizations and tuning of
middleware can have on application performance through case studies with
representative benchmarks and applications
The content level will be as follows: 30% beginner, 40% intermediate, and 30%
advanced. There is no fixed prerequisite. As long as attendees have a general
knowledge of high performance computing, networking, programming models,
parallel applications, and related issues, they will be able to understand and
appreciate the material. The tutorial is designed so that attendees are exposed
to the topics in a smooth and progressive manner.
Outline of the Tutorial
- Overview of the Modern HPC System Architectures
- Multi-core Processors
- High Performance Interconnects (InfiniBand, 10-40GigE/iWARP and
RDMA over Converged Enhanced Ethernet (RoCE))
- Heterogeneity with Accelerators (GPUs) and Coprocessors (Xeon Phis)
- Introduction to MPI and Partitioned Global Address Space
(PGAS) Programming Models
- MPI-3 Features including RMA and Non-blocking Collectives (a short
RMA sketch follows this outline)
- Library-based Models: Case Study with OpenSHMEM
- Language-based Models: Case Study with UPC
- Overview of MPI+PGAS Hybrid Programming Models and
Benefits
- Challenges and Opportunities in Designing Scalable and
High Performance Runtimes (MPI, PGAS and Hybrid MPI+PGAS) on Host-based Modern
Systems
- Application-level Case Studies for using Hybrid MPI+PGAS Models
- Challenges and Opportunities in Designing Scalable and
High Performance Runtimes (MPI, PGAS and Hybrid MPI+PGAS) on GPU Clusters
- Overview of CUDA-Aware Concept
- Designing Efficient MPI Runtime for GPU Clusters
- Designing Efficient OpenSHMEM Runtime for GPU Clusters
- Challenges and Opportunities in Designing Scalable and
High Performance Runtimes (MPI, PGAS and Hybrid MPI+PGAS) on MIC Clusters
- Designing Efficient MPI Runtime for Intel MIC Clusters
- Designing Efficient OpenSHMEM Runtime for Intel MIC Clusters
- Conclusion and Q&A
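As a concrete reference for the MPI-3 RMA item in the outline, here is a
minimal illustrative sketch (not from the tutorial material): each rank exposes
one integer through an MPI window and writes its rank into its right neighbor's
window with MPI_Put inside a fence epoch.

/* MPI-3 one-sided (RMA) sketch (illustrative only): each rank creates a
 * window over a local integer and puts its rank into its right neighbor. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = -1;                    /* window memory, target of puts   */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);             /* open an access epoch            */
    MPI_Put(&rank, 1, MPI_INT,         /* origin buffer                   */
            (rank + 1) % size,         /* target rank                     */
            0, 1, MPI_INT, win);       /* target displacement and count   */
    MPI_Win_fence(0, win);             /* complete the epoch              */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}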
Brief Biographies of the Speakers
Dr. Dhabaleswar K. (DK)
Panda is a Professor and University Distinguished Scholar of
Computer Science at the Ohio State University. He obtained his
Ph.D. in computer engineering from the University of Southern
California. His research interests include parallel computer
architecture, high performance computing, communication protocols,
file systems, network-based computing, and Quality of Service. He has
published over 400 papers in major journals and international
conferences related to these research areas. Dr. Panda and his
research group members have been doing extensive research on modern
networking technologies including InfiniBand, HSE and RDMA over
Converged Enhanced Ethernet (RoCE). His research group is currently
collaborating with National Laboratories and leading InfiniBand and
10GigE/iWARP companies on designing various subsystems of next
generation high-end systems. The MVAPICH2 (High Performance
MPI over InfiniBand, iWARP and RoCE) open-source software
libraries, developed by his research group, are currently being used
by more than 2,650 organizations worldwide (in 81 countries). This
software has enabled several InfiniBand clusters
to get into the latest TOP500 ranking during the last
decade. More than 383,000 (0.38 million) downloads of these libraries
have taken place from the project's site. These software packages are
also available with the Open Fabrics stack for network vendors
(InfiniBand and iWARP), server vendors and Linux distributors.
Dr. Panda's research is supported by funding from the US National Science
Foundation, the US Department of Energy, and several industry partners including
Intel, Cisco, Cray, SUN, Mellanox, QLogic, NVIDIA and NetApp. He is an IEEE
Fellow and a member of ACM. More details about Dr. Panda, including
a comprehensive CV and publications, are available here.
Khaled Hamidouche
is a Research Scientist in the Department of Computer Science and
Engineering at The Ohio State University. He is a member of the Network-Based
Computing Laboratory led by Dr. D. K. Panda. His research interests include
high-performance interconnects, parallel programming models, accelerator
computing and high-end computing applications. His current focus is on designing
high performance unified MPI, PGAS and hybrid MPI+PGAS runtimes for InfiniBand
clusters and their support for accelerators. Dr. Hamidouche is involved in the
design and development of the popular MVAPICH2 library and its derivatives
MVAPICH2-MIC, MVAPICH2-GDR, MVAPICH2-EA and
MVAPICH2-X. He has published over 45 papers in international journals and
conferences related to these research areas. He has been actively involved in
various professional activities in academic journals and conferences. He is a
member of ACM. More details about Dr. Hamidouche are available
here.
Last Updated: August 17, 2016