Accelerating Big Data Processing

Accelerating Big Data Processing with Hadoop, Spark and Memcached on Datacenters with Modern Architectures

A Tutorial to be presented at The 23rd IEEE International Symposium On High Performance Computer Architecture (HPCA-2017)
by
Dhabaleswar K. (DK) Panda and Xiaoyi Lu (The Ohio State University)

When: February 5, 2017 (8:30 am - 12:00 noon)
Where: Room #415A, Hilton, Austin, TX, USA

Abstract

Apache Hadoop and Spark are gaining prominence in handling Big Data and analytics. Similarly, Memcached is becoming important for large-scale query processing in Web 2.0 environments. Recent studies have shown that default Hadoop, Spark, and Memcached cannot efficiently leverage the high-performance networking and storage architectures of modern datacenters, such as Remote Direct Memory Access (RDMA) enabled high-performance interconnects and heterogeneous, high-speed storage systems (e.g., HDD, SSD, NVMe-SSD, and Lustre). These middleware systems are traditionally written with sockets and do not deliver the best performance on modern high-performance networks. In this tutorial, we will provide an in-depth overview of the architecture of Hadoop components (HDFS, MapReduce, RPC, HBase, etc.), Spark, and Memcached. We will examine the challenges in re-designing the networking and I/O components of these middleware systems with modern RDMA-capable interconnects and protocols (such as InfiniBand, iWARP, RoCE, and RSocket) and storage architectures. Using the publicly available software packages in the High-Performance Big Data (HiBD, http://hibd.cse.ohio-state.edu) project, we will provide case studies of the new designs for several Hadoop/Spark/Memcached components and their associated benefits. Through these case studies, we will also examine the interplay between high-performance interconnects, high-speed storage systems, and multi-core platforms to achieve the best solutions for these components and Big Data applications on modern datacenters.

Targeted Audience and Scope

The tutorial is planned as a half-day session. It is targeted at people working in Big Data areas including high-performance Hadoop/Spark/Memcached, high-performance communication and I/O architecture, storage, networking, middleware, cloud computing, and applications. The specific audiences this tutorial is aimed at include:

- Scientists, engineers, researchers, and students engaged in designing next-generation Big Data systems and applications
- Designers and developers of Big Data, Hadoop, Spark, and Memcached middleware
- Newcomers to the field of Big Data who are interested in familiarizing themselves with Hadoop, Spark, Memcached, RDMA, and high-performance networking and storage architectures
- Managers and administrators responsible for setting up next-generation Big Data environments and high-end systems/facilities in their organizations/laboratories

The content level will be as follows: 30% beginner, 40% intermediate, and 30% advanced. There is no fixed prerequisite. Attendees with a general knowledge of Big Data, Hadoop, Spark, Memcached, high-performance computing, networking and storage architectures, and related issues will be able to understand and appreciate the material. The tutorial is designed so that attendees are exposed to the topics in a smooth and progressive manner, and it is organized as a single coherent talk covering multiple topics.

Outline of the Tutorial

Introduction to Big Data Applications and Analytics
Overview of MapReduce and Resilient Distributed Datasets (RDD) Programming Models
Architecture Overview of Apache Hadoop, Spark, and Memcached
- MapReduce and YARN
- HDFS
- Spark
- RPC
- HBase
- Memcached
Overview of High-Performance Interconnects, Protocols, and Storage Architectures for Modern Datacenters
- InfiniBand and RDMA
- 10/40 GigE, iWARP and RoCE technologies
- RSocket and SDP protocols
- SSD-based storage and Lustre parallel filesystem
Challenges in Accelerating Hadoop, Spark, and Memcached on Modern Datacenters
Overview of Benchmarks and Applications using Hadoop, Spark, and Memcached
Acceleration Case Studies and In-Depth Performance Evaluation with Benchmarks and Applications
- MapReduce over InfiniBand with RDMA, SSD, and Lustre
- HDFS over InfiniBand with RDMA and Heterogeneous Storage (RAMDisk, SSD, HDD, and Lustre)
- Spark over InfiniBand with RDMA and SSD
- HBase over InfiniBand with RDMA and SSD
- RPC over InfiniBand with RDMA
- Memcached over InfiniBand with RDMA and SSD
The High-Performance Big Data (HiBD) Project and Associated Releases
Ongoing and Future Activities for Accelerating Big Data Applications
Conclusion and Q&A

Brief Biography of Speakers

Dr. Dhabaleswar K. (DK) Panda is a Professor and University Distinguished Scholar of Computer Science at The Ohio State University. He obtained his Ph.D. in computer engineering from the University of Southern California. His research interests include parallel computer architecture, high-performance computing, communication protocols, file systems, network-based computing, and Quality of Service. He has published over 400 papers in major journals and international conferences related to these research areas. Dr. Panda and his research group members have been doing extensive research on modern networking technologies including InfiniBand, HSE, and RDMA over Converged Enhanced Ethernet (RoCE). His research group is currently collaborating with National Laboratories and leading InfiniBand and 10GigE/iWARP companies on designing various subsystems of next-generation high-end systems. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) open-source software libraries, developed by his research group, are currently being used by more than 2,725 organizations worldwide (in 83 countries). This software has enabled several InfiniBand clusters (including the 1st one) to get into the latest TOP500 ranking during the last decade. More than 408,000 (0.4 million) downloads of these libraries have taken place from the project's site. These software packages are also available with the Open Fabrics stack for network vendors (InfiniBand and iWARP), server vendors, and Linux distributors. The RDMA-enabled Apache Hadoop, Spark, and Memcached packages, consisting of acceleration for HDFS, MapReduce, RPC, Spark, and Memcached, are publicly available from http://hibd.cse.ohio-state.edu. These packages are currently being used by more than 205 organizations in 29 countries. More than 19,700 downloads have taken place from the project's site. Dr. Panda's research is supported by funding from the US National Science Foundation, the US Department of Energy, and several industry partners including Intel, Cisco, SUN, Mellanox, QLogic, NVIDIA, and NetApp. He is an IEEE Fellow and a member of ACM. More details about Dr. Panda, including a comprehensive CV and publications, are available here.

Dr. Xiaoyi Lu is a Research Scientist in the Department of Computer Science and Engineering at The Ohio State University, USA. His current research interests include high-performance interconnects and protocols, Big Data, the Hadoop/Spark ecosystem, parallel computing models (MPI/PGAS), GPU/MIC, virtualization, and cloud computing. He has published over 60 papers in major journals and international conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. Dr. Lu is currently conducting research on, and working on the design and development of, the High-Performance Big Data project (http://hibd.cse.ohio-state.edu). He is a member of IEEE and a member of ACM. More details about Dr. Lu are available here.

Last Updated: February 5, 2017