How to Accelerate Your Big Data and Associated Deep Learning Applications with Hadoop and Spark?

A Tutorial to be presented at Practice and Experience in Advanced Research Computing (PEARC-2019)
Dhabaleswar K. (DK) Panda (The Ohio State University), Xiaoyi Lu (The Ohio State University), and Mahidhar Tatineni (San Diego Supercomputer Center)

When: Monday, July 29th, 2019 (1:30pm-5:00pm)
Where: Chicago, USA


Apache Hadoop and Spark are gaining prominence in handling Big Data analytics. Recent studies have shown that default Hadoop and Spark cannot efficiently leverage high-performance networking and storage architectures, such as Remote Direct Memory Access (RDMA)-enabled high-performance interconnects and heterogeneous storage systems (e.g., HDD, SSD, NVMe-SSD, and Lustre). These middleware are traditionally written with sockets and do not deliver the best performance on modern high-performance networks. In this tutorial, we will provide an in-depth overview of the architecture of Hadoop components (HDFS, MapReduce, etc.) and Spark. We will examine the challenges in re-designing the networking and I/O components of these middleware with modern interconnects and protocols (such as InfiniBand and RoCE) with RDMA, and with modern storage architectures. Using the publicly available software packages in the High-Performance Big Data (HiBD) project, we will provide case studies of the new designs for several Hadoop/Spark components and their associated benefits. Through these case studies, we will also examine the interplay between high-performance interconnects, high-speed storage systems, and multi-core platforms to achieve the best solutions for these components, Big Data processing, and Deep Learning applications on modern HPC clusters. The tutorial will include hands-on sessions with Hadoop and Spark on the SDSC Comet supercomputer.

Targeted Audience and Scope

This tutorial is targeted at people working in the areas of Big Data and Deep Learning, including high-performance Big Data and Deep Learning middleware (such as Hadoop, Spark, gRPC, and TensorFlow), high-performance communication and I/O architecture, storage, networking, cloud computing, and data-intensive applications. The specific audiences this tutorial is aimed at include:

The content level will be as follows: 30% beginner, 40% intermediate, and 30% advanced. There is no fixed prerequisite. Attendees with a general knowledge of Big Data, Deep Learning, Hadoop, Spark, TensorFlow, HPC, networking and storage architectures, and related issues will be able to understand and appreciate the material. The tutorial is designed so that attendees are exposed to the topics in a smooth and progressive manner, and it is organized as a single coherent talk covering multiple topics.

Outline of the Tutorial

Brief Biography of Speakers

D. K. Panda
Dr. Dhabaleswar K. (DK) Panda is a Professor of Computer Science at the Ohio State University. His research interests include parallel computer architecture, high-performance computing, communication protocols, file systems, network-based computing, and Quality of Service. He has published over 450 papers in major journals and international conferences related to these research areas. Dr. Panda and his research group members have been doing extensive research on modern networking technologies including InfiniBand, HSE, and RDMA over Converged Enhanced Ethernet (RoCE). His research group is currently collaborating with National Laboratories and leading InfiniBand and 10-100GigE/iWARP companies on designing various subsystems of next-generation high-end systems. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) open-source software libraries, developed by his research group, are currently being used by more than 3000 organizations worldwide (in 88 countries). This software has enabled several InfiniBand clusters (including the 3rd-ranked one) to get into the latest TOP500 ranking during the last decade. Dr. Panda's research is supported by funding from the US National Science Foundation, the US Department of Energy, and several industry partners including Intel, Cisco, SUN, Mellanox, QLogic, NVIDIA, and NetApp. He is an IEEE Fellow and a member of ACM. More details about Dr. Panda are available here.

Xiaoyi Lu
Dr. Xiaoyi Lu is a Research Assistant Professor in the Department of Computer Science and Engineering at the Ohio State University, USA. His current research interests include high-performance interconnects and protocols, Big Data, the Hadoop/Spark/Memcached ecosystem, parallel computing models (MPI/PGAS), Deep Learning, virtualization, and cloud computing. He has published over 100 papers in international journals and conferences related to these research areas. Recently, Dr. Lu has been leading the research and development of RDMA-based accelerations for Apache Hadoop, Spark, Kafka, HBase, and Memcached, and of the OSU HiBD micro-benchmarks, which are publicly available from http://hibd.cse.ohio- These libraries are currently being used by more than 305 organizations from 35 countries. He is a core member of the MVAPICH2 project and is leading the research and development of MVAPICH2-Virt (high-performance and scalable MPI for hypervisor- and container-based HPC clouds). He is a member of IEEE and ACM. More details about Dr. Lu are available here.

Mahidhar Tatineni
Dr. Mahidhar Tatineni received his M.S. and Ph.D. in Aerospace Engineering from UCLA. He currently leads the User Services group at SDSC. He has led the deployment and support of high-performance computing and data applications software on several NSF and UC resources, including Comet and Gordon at SDSC. He has worked on many NSF-funded optimization and parallelization research projects, such as petascale computing for magnetosphere simulations, MPI performance tuning frameworks, hybrid programming models, topology-aware communication and scheduling, big data middleware, and application performance evaluation using next-generation communication mechanisms for emerging HPC systems.

Last Updated: May 30, 2019