Designing Scalable and High Performance MPI and PGAS Programming
Models on Modern Clusters with Accelerators
When: June 11, 2015 (8:30am-12:30pm)
Where: Newport Beach, California, USA
Abstract
Multi-core processors, accelerators (GPGPUs), co-processors (Xeon Phis) and
high-performance interconnects (InfiniBand, 10 GigE/iWARP and RoCE) with RDMA
support are shaping the architectures of next-generation clusters. Efficient
programming models for designing applications on these clusters, as well as on
future exascale systems, are still evolving. The new MPI-3 standard brings
enhancements to the Remote Memory Access (RMA) model and introduces
non-blocking collectives and the MPI Tools (MPI-T) interface.
Partitioned Global Address Space (PGAS) models provide an
attractive alternative to the MPI model owing to their easy-to-use global shared
memory abstractions and light-weight one-sided communication. At the same time,
hybrid MPI+PGAS programming models are gaining attention as a possible solution
for programming exascale systems. These hybrid models help codes designed
with MPI take advantage of PGAS models without paying the
prohibitive cost of re-designing complete applications.

In this tutorial, we
provide an overview of the research and development taking place on these
programming models (MPI, PGAS and hybrid MPI+PGAS) and discuss the
opportunities and challenges in designing their runtimes as we head
toward exascale. We start with an in-depth overview of modern system
architectures with multi-core processors, GPU accelerators, Xeon Phi
co-processors and high-performance interconnects. We present an overview of the
new MPI-3 RMA model and of language-based (UPC) and library-based (OpenSHMEM)
PGAS models. We introduce MPI+PGAS hybrid programming models and highlight the
advantages and challenges of designing a unified runtime to support them. We
examine and contrast the different challenges in designing high-performance
runtimes for MPI-3, UPC, OpenSHMEM and CAF. We also highlight the need for
unified runtimes to support hybrid MPI+UPC, MPI+OpenSHMEM and MPI+CAF
programming models. We present case studies using application kernels to
demonstrate how one can exploit hybrid MPI+PGAS programming models to achieve
better performance without rewriting the complete code. Finally, we present new
challenges and designs to support these programming models on GPU- and MIC-based
systems. Using the publicly available MVAPICH2 software packages
(http://mvapich.cse.ohio-state.edu), we provide concrete case studies and
in-depth evaluation of runtime- and application-level designs targeted at
modern clusters with multi-core processors, GPUs, Xeon Phis and
high-performance interconnects.
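To make the MPI-3 additions named above more concrete, the short C sketch below
shows a one-sided put through an MPI-3 RMA window followed by a non-blocking
collective (MPI_Iallreduce). It is a minimal, illustrative example written for
this summary rather than code drawn from the tutorial material; the
neighbor-exchange pattern is invented purely for demonstration.

    /* Minimal illustrative sketch (not from the tutorial): MPI-3 RMA put
     * between neighbors, followed by an MPI-3 non-blocking Allreduce. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, *win_buf, sum = 0;
        MPI_Win win;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Expose one integer per process through an RMA window */
        MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &win_buf, &win);
        *win_buf = 0;

        /* One-sided put of my rank into the right neighbor's window */
        MPI_Win_fence(0, win);
        MPI_Put(&rank, 1, MPI_INT, (rank + 1) % size, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);

        /* Non-blocking collective: the reduction can overlap other work */
        MPI_Iallreduce(win_buf, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &req);
        /* ... independent computation could go here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        if (rank == 0)
            printf("sum of ranks = %d\n", sum);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

MPI-T, the third MPI-3 feature mentioned above, is a tools-facing interface for
querying performance and control variables inside the MPI library, so it does
not appear in this application-level sketch.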
Targeted Audience and Scope
This tutorial is targeted at people working in the areas
of PGAS and MPI programming models, high-performance communication and I/O,
networking, middleware, accelerators, exascale computing and applications. Specific audiences
this tutorial is aimed at include:
- Designers, developers and users of parallel programming models (MPI and PGAS)
- Scientists, engineers, researchers and students engaged in designing next-generation HPC systems and applications
- Newcomers to the field of HPC and exascale computing who are interested in familiarizing themselves with programming models, accelerators, networking, and RDMA
- Managers and administrators responsible for setting up next-generation HPC environments and high-end systems/facilities in their organizations/laboratories
The content level will be as follows: 30% beginner, 40% intermediate, and 30% advanced. There is no fixed prerequisite. As long as the attendee has a general
knowledge of high-performance computing, networking, programming models, parallel applications, and related issues, he/she will be able to understand and appreciate
the tutorial. The tutorial is designed in such a way that an attendee gets exposed to the topics in a smooth and progressive manner.
Outline of the Tutorial
- Overview of the Modern HPC System Architectures
  - Multi-core Processors
  - High Performance Interconnects (InfiniBand, 10-40GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE))
  - Heterogeneity with Accelerators (GPUs) and Coprocessors (Xeon Phis)
- Introduction to MPI and Partitioned Global Address Space Models
  - MPI-3 features including RMA, Non-blocking collectives (NBC) and MPI-T
  - Library-based Models: Case Study with OpenSHMEM
  - Language-based Models: Case Study with UPC
- Overview of MPI+PGAS Hybrid Programming Models and Benefits
- Designing Scalable and High Performance Runtimes (MPI, OpenSHMEM, UPC and CAF) and Hybrid MPI+PGAS Models on Modern Clusters
- Application-level Case Studies for using Hybrid MPI+PGAS Models (see the code sketch after this outline)
- Designing Runtimes for MPI, PGAS and Hybrid MPI+PGAS on Clusters with GPGPUs and Xeon Phis
- Conclusion and Q&A
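As a rough illustration of the hybrid MPI+PGAS model referenced in the outline
above, the C sketch below mixes an OpenSHMEM one-sided put with an unmodified
MPI collective in a single program. It assumes a unified runtime that allows
MPI and OpenSHMEM calls to coexist in one executable (as MVAPICH2-X provides);
the buffer names and the neighbor-exchange pattern are hypothetical, and
initialization details may differ between implementations.

    /* Hypothetical hybrid MPI+OpenSHMEM kernel: an OpenSHMEM put replaces a
     * point-to-point exchange while the original MPI collective is retained.
     * Assumes a unified runtime (e.g., MVAPICH2-X) so both models share one job. */
    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        long local, sum = 0;

        MPI_Init(&argc, &argv);
        shmem_init();                     /* OpenSHMEM PEs map onto the MPI ranks */

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Symmetric-heap buffer: remotely accessible from every PE */
        long *remote = (long *) shmem_malloc(sizeof(long));
        local = (long) rank;

        /* Light-weight one-sided put to the right neighbor (PGAS part) */
        shmem_long_put(remote, &local, 1, (rank + 1) % size);
        shmem_barrier_all();              /* complete the puts before reading */

        /* Unmodified MPI collective over the received values (MPI part) */
        MPI_Allreduce(remote, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks = %ld\n", sum);

        shmem_free(remote);
        shmem_finalize();
        MPI_Finalize();
        return 0;
    }

The point of the pattern is incremental adoption: only the communication hot
spot is moved to one-sided PGAS operations, while the rest of the MPI code is
left untouched.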
Brief Biography of Speakers
Dr. Dhabaleswar K. (DK) Panda is a Professor and University Distinguished
Scholar of Computer Science at The Ohio State University. He obtained his
Ph.D. in computer engineering from the University of Southern California.
His research interests include parallel computer architecture, high
performance computing, communication protocols, file systems, network-based
computing, accelerators, virtualization and quality of service. He has
published over 350 papers in major journals
and international conferences related to these research
areas. Dr. Panda and his research group members have been doing
extensive research on modern networking technologies including
InfiniBand, High-Speed Ethernet (HSE) and RDMA over
Converged Enhanced Ethernet (RoCE). His research group is currently
collaborating with National Laboratories and leading InfiniBand and
10-40GigE/iWARP companies on designing various subsystems of
next-generation high-end systems. The
MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE)
and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM and UPC))
software packages, developed by his research group, are
currently being used by more than 2,350 organizations worldwide (in 75
countries).
More than 242,000 downloads of this software have taken place
from the project website alone.
This software has enabled several InfiniBand clusters
(including the 7th-ranked one) to get into the latest TOP500
ranking. These software packages are also available with the
software stacks of network vendors (InfiniBand and iWARP), server
vendors and Linux distributors.
Dr. Panda and his team members are also engaged in research related
to accelerating popular Big Data middleware (such as Hadoop, Spark and
Memcached). Software packages along this direction are available
from the High-Performance Big Data (HiBD) project site.
Dr. Panda's research is supported by
funding from the US National Science Foundation, the US Department of Energy,
and several industry partners including Intel, Cisco, Cray, SUN, Mellanox,
QLogic, NVIDIA
and NetApp.
He is an IEEE Fellow and a member of ACM.
More details about Dr. Panda, including a comprehensive CV
and publications, are available
here.
Khaled Hamidouche
received his Ph.D. in Computer Science from the University of Paris Sud
in 2011. He is a Senior Research Associate in the Department of Computer Science
and Engineering at The Ohio State University. His research interests include
high-performance interconnects, parallel programming models, accelerator
computing and high-end computing applications. His current focus is on
designing high-performance unified MPI, PGAS and hybrid MPI+PGAS runtimes for
InfiniBand clusters and their support for accelerators. He has published over 30
papers in international journals and conferences related to these research
areas. Khaled is involved in the design and development of the popular MVAPICH2
library and its derivatives MVAPICH2-MIC, MVAPICH2-GDR and MVAPICH2-X.
He has been actively involved in various professional activities in
academic journals and conferences. He is a member of ACM. More details about Dr.
Hamidouche are available
here.
Last Updated: April 13, 2015