Designing Scalable and High Performance MPI and PGAS Programming Models on Modern Clusters with Accelerators

A Tutorial to be presented at The 29th International Conference on Supercomputing (ICS-2015)
by
Dhabaleswar K. (DK) Panda and Khaled Hamidouche (The Ohio State University)


When: June 11, 2015 (8:30am-12:30pm)
Where: Newport Beach, California, USA


Abstract

Multi-core processors, accelerators (GPGPUs), co-processors (Xeon Phis) and high-performance interconnects (InfiniBand, 10 GigE/iWARP and RoCE) with RDMA support are shaping the architectures for next generation clusters. Efficient programming models to design applications on these clusters, as well as on future exascale systems, are still evolving. The new MPI-3 standard brings enhancements to the Remote Memory Access (RMA) model and introduces non-blocking collectives and the MPI Tools (MPI-T) interface. Partitioned Global Address Space (PGAS) models provide an attractive alternative to the MPI model owing to their easy-to-use global shared-memory abstractions and lightweight one-sided communication. At the same time, hybrid MPI+PGAS programming models are gaining attention as a possible solution to programming exascale systems. These hybrid models help codes designed using MPI take advantage of PGAS models without paying the prohibitive cost of re-designing complete applications. In this tutorial, we provide an overview of the research and development taking place on these programming models (MPI, PGAS and hybrid MPI+PGAS) and discuss the associated opportunities and challenges in designing their runtimes as we head toward exascale. We start with an in-depth overview of modern system architectures with multi-core processors, GPU accelerators, Xeon Phi co-processors and high-performance interconnects. We present an overview of the new MPI-3 RMA model, and of language-based (UPC) and library-based (OpenSHMEM) PGAS models. We introduce MPI+PGAS hybrid programming models and highlight the advantages and challenges in designing a unified runtime to support them. We examine and contrast the different challenges in designing high-performance runtimes for MPI-3, UPC, OpenSHMEM and CAF. We also highlight the need for unified runtimes to support hybrid MPI+UPC, MPI+OpenSHMEM and MPI+CAF programming models. We present case studies using application kernels to demonstrate how one can exploit hybrid MPI+PGAS programming models to achieve better performance without rewriting the complete code. Finally, we present new challenges and designs to support these programming models on GPU- and MIC-based systems. Using the publicly available MVAPICH2 software packages (http://mvapich.cse.ohio-state.edu), we provide concrete case studies and in-depth evaluation of runtime and application-level designs that are targeted for modern clusters with multi-core processors, GPUs, Xeon Phis and high-performance interconnects.
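To make the hybrid-model discussion concrete, the short sketch below combines an OpenSHMEM one-sided put with an MPI collective in a single program. It is an illustrative example only, not taken from the tutorial material; it assumes a unified runtime (such as MVAPICH2-X) in which MPI ranks and OpenSHMEM PEs refer to the same set of processes, and it uses OpenSHMEM 1.2-style calls (shmem_init, shmem_malloc).

/* Minimal hybrid MPI+OpenSHMEM sketch (illustrative only; assumes a unified
 * runtime, e.g. MVAPICH2-X, that allows both models in one executable and
 * maps MPI ranks and OpenSHMEM PEs to the same processes). Each PE delivers
 * its rank to a neighbor with a one-sided put, then the received values are
 * combined with an MPI collective. */
#include <stdio.h>
#include <mpi.h>
#include <shmem.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();   /* older OpenSHMEM implementations use start_pes(0) */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Symmetric-heap allocation: every PE gets a remotely accessible slot */
    int *slot = (int *) shmem_malloc(sizeof(int));
    *slot = 0;
    shmem_barrier_all();

    /* One-sided put: each PE writes its rank into the next PE's slot */
    int value = rank;
    shmem_int_put(slot, &value, 1, (rank + 1) % size);
    shmem_barrier_all();

    /* MPI collective over the data delivered by OpenSHMEM */
    int sum = 0;
    MPI_Reduce(slot, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Sum of received ranks = %d\n", sum);

    shmem_free(slot);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}

In a unified runtime both models share the same communication infrastructure and registered memory, which is what makes this kind of incremental mixing attractive: an existing MPI application can adopt one-sided PGAS communication in selected kernels without a full rewrite.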

Targeted Audience and Scope

This tutorial is targeted at various categories of people working in the areas of PGAS and MPI programming models, high-performance communication and I/O, networking, middleware, accelerators, exascale computing, and applications. The content level will be approximately 30% beginner, 40% intermediate, and 30% advanced. There are no fixed prerequisites: any attendee with a general knowledge of high-performance computing, networking, programming models, parallel applications, and related issues will be able to understand and appreciate the material. The tutorial is designed so that attendees are exposed to the topics in a smooth and progressive manner.

Outline of the Tutorial

Brief Biography of Speakers

Dr. Dhabaleswar K. (DK) Panda is a Professor and University Distinguished Scholar of Computer Science at The Ohio State University. He obtained his Ph.D. in computer engineering from the University of Southern California. His research interests include parallel computer architecture, high performance computing, communication protocols, file systems, network-based computing, accelerators, virtualization and quality of service. He has published over 350 papers in major journals and international conferences related to these research areas. Dr. Panda and his research group members have been doing extensive research on modern networking technologies including InfiniBand, HSE and RDMA over Converged Enhanced Ethernet (RoCE). His research group is currently collaborating with National Laboratories and leading InfiniBand and 10-40GigE/iWARP companies on designing various subsystems of next generation high-end systems. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM and UPC)) software packages, developed by his research group, are currently being used by more than 2,350 organizations worldwide (in 75 countries). More than 242,000 downloads of this software have taken place from the project website alone. This software has enabled several InfiniBand clusters (including the 7th-ranked one) to get into the latest TOP500 ranking. These software packages are also available with the software stacks of network vendors (InfiniBand and iWARP), server vendors and Linux distributors. Dr. Panda and his team members are also engaged in research related to accelerating popular Big Data middleware (such as Hadoop, Spark and Memcached). Software packages along this direction are available from the High-Performance Big Data (HiBD) project site. Dr. Panda's research is supported by funding from the US National Science Foundation, the US Department of Energy, and several industry partners, including Intel, Cisco, Cray, SUN, Mellanox, QLogic, NVIDIA and NetApp. He is an IEEE Fellow and a member of ACM. More details about Dr. Panda, including a comprehensive CV and publications, are available here.

Khaled Hamidouche received the Ph.D. degree in Computer Science from the University of Paris-Sud in 2011. He is a Senior Research Associate in the Department of Computer Science and Engineering at The Ohio State University. His research interests include high-performance interconnects, parallel programming models, accelerator computing and high-end computing applications. His current focus is on designing high performance unified MPI, PGAS and hybrid MPI+PGAS runtimes for InfiniBand clusters and their support for accelerators. He has published over 30 papers in international journals and conferences related to these research areas. Khaled is involved in the design and development of the popular MVAPICH2 library and its derivatives MVAPICH2-MIC, MVAPICH2-GDR and MVAPICH2-X. He has been actively involved in various professional activities in academic journals and conferences. He is a member of ACM. More details about Dr. Hamidouche are available here.


Last Updated: April 13, 2015