Accelerating Big Data Processing with Hadoop, Spark and Memcached over High-Performance Interconnects
When: August 26, 2016 (8:30am - 12:00pm)
Where: Huawei North America Headquarters, Santa Clara, CA, USA
Abstract
Apache Hadoop and Spark are gaining prominence in handling Big Data
and analytics. Similarly, Memcached in Web 2.0 environments is becoming
important for large-scale query processing. These middleware systems are
traditionally written with sockets and do not deliver the best performance
in datacenters with modern high-performance networks. In this
tutorial, we will provide an in-depth overview of the architecture of
Hadoop components (HDFS, MapReduce, RPC, HBase, etc.), Spark and
Memcached. We will examine the challenges in re-designing the
networking and I/O components of these middleware systems for modern
RDMA-enabled interconnects and protocols (such as InfiniBand, iWARP, RoCE, and
RSocket) and modern storage architectures. Using the publicly
available software packages in the High-Performance Big Data (HiBD, http://hibd.cse.ohio-state.edu)
project, we will provide case studies of the new designs for several
Hadoop/Spark/Memcached components and their associated
benefits. Through these case studies, we will also examine the
interplay between high performance interconnects, storage systems (HDD
and SSD), and multi-core platforms to achieve the best solutions for
these components.
Targeted Audience and Scope
The tutorial is planned as a half-day session. It is
targeted at people working in the areas of Big
Data, including high-performance Hadoop/Spark/Memcached, high-performance
communication and I/O architecture, storage, networking,
middleware, cloud computing, and applications. The specific audiences this
tutorial is aimed at include:
- Scientists, engineers, researchers, and students engaged in designing next-generation Big Data systems and applications
- Designers and developers of Big Data, Hadoop, Spark and Memcached middleware
- Newcomers to the field of Big Data who are interested in familiarizing themselves with Hadoop, Spark, Memcached, RDMA, and high-performance networking
- Managers and administrators responsible for setting up next-generation
Big Data environments and high-end systems/facilities in
their organizations/laboratories
The content level will be as follows: 30% beginner, 40% intermediate,
and 30% advanced. There is no fixed prerequisite. Attendees with
a general knowledge of Big Data, Hadoop, Spark,
Memcached, high-performance computing, networking and storage
architecture, and related issues will be able to understand
and appreciate the material. The tutorial is designed so that
attendees are exposed to the topics in a smooth and progressive
manner, and it is organized as a single coherent talk covering
multiple topics.
Outline of the Tutorial
- Introduction to Big Data Applications and Analytics
- Overview of MapReduce and RDD Programming Models
- Architecture Overview of Apache Hadoop, Spark and Memcached
  - MapReduce and YARN
  - HDFS
  - Spark
  - RPC
  - HBase
  - Memcached
- Overview of Modern Interconnects, Protocols and Storage Architectures for Data Center Systems
  - InfiniBand and RDMA
  - 10/40 GigE, iWARP and RoCE technologies
  - RSocket and SDP protocols
  - SSD-based storage
- Challenges in Accelerating Hadoop, Spark and Memcached on Modern Datacenters
- Overview of Benchmarks and Applications using Hadoop, Spark and Memcached
- Acceleration Case Studies and In-Depth Performance Evaluation
  - MapReduce over InfiniBand with RDMA, SSD, and Lustre
  - HDFS over InfiniBand with RDMA and Heterogeneous Storage (RAMDisk, SSD, HDD, and Lustre)
  - Spark over InfiniBand with RDMA and SSD
  - RPC over InfiniBand with RDMA
  - HBase over InfiniBand with RDMA and SSD
  - Memcached over InfiniBand with RDMA and SSD
- The High-Performance Big Data (HiBD) Project and Associated Releases
- Ongoing and Future Activities for Accelerating Big Data Applications
- Conclusion and Q&A
Brief Biography of Speakers
Dr. Dhabaleswar K. (DK) Panda is a Professor and University Distinguished Scholar of
Computer Science at the Ohio State University. He obtained his
Ph.D. in computer engineering from the University of Southern
California. His research interests include parallel computer
architecture, high performance computing, communication protocols,
file systems, network-based computing, and Quality of Service. He has
published over 400 papers in major journals and international
conferences related to these research areas. Dr. Panda and his
research group members have been doing extensive research on modern
networking technologies including InfiniBand, HSE and RDMA over
Converged Enhanced Ethernet (RoCE). His research group is currently
collaborating with National Laboratories and leading InfiniBand and
10GigE/iWARP companies on designing various subsystems of next
generation high-end systems. The MVAPICH2 (High Performance
MPI over InfiniBand, iWARP and RoCE) open-source software
libraries, developed by his research group, are currently being used
by more than 2,650 organizations worldwide (in 81 countries). This
software has enabled several InfiniBand clusters to get into the TOP500
ranking during the last decade (including the 10th-ranked one in the latest
list). More than 383,000 (0.38 million) downloads of these libraries
have taken place from the project's site. These software packages are
also available with the Open Fabrics stack for network vendors
(InfiniBand and iWARP), server vendors and Linux distributors. The
RDMA-enabled Apache Hadoop, Spark and Memcached packages, consisting
of acceleration for HDFS, MapReduce, RPC, Spark and Memcached, are
publicly available from
http://hibd.cse.ohio-state.edu. These packages are currently
being used by more than 185 organizations in 26 countries. More than
17,550 downloads have taken place from the project's site.
Dr. Panda's research is supported by funding from the US National Science
Foundation, the US Department of Energy, and several industry partners including
Intel, Cisco, SUN, Mellanox, QLogic, NVIDIA and NetApp. He is an IEEE
Fellow and a member of ACM. More details about Dr. Panda, including
a comprehensive CV and publications, are available here.
Dr. Xiaoyi Lu is a
Research Scientist in the Department of Computer Science and
Engineering at the Ohio State University, USA. He obtained his
Ph.D. degree in Computer Science from the Institute of Computing
Technology, Chinese Academy of Sciences, Beijing, China. His current
research interests include high-performance interconnects and
protocols, Big Data, Hadoop/Spark Ecosystem, Parallel Computing Models
(MPI/PGAS), GPU/MIC, Virtualization and Cloud Computing. He has
published over 60 papers in major journals and international
conferences related to these research areas. He has been actively
involved in various professional activities in academic journals and
conferences. Dr. Lu is currently doing research on, and working on the design
and development of, the High-Performance Big Data project (http://hibd.cse.ohio-state.edu). He
is a member of IEEE and a member of ACM. More details about Dr. Lu are
available here.
Last Updated: August 26, 2016