Accelerating Big Data Processing with Hadoop, Spark and Memcached on Datacenters with Modern Architectures
When: March 12, 2016 (2:00-5:30pm)
Where: Barcelona, Spain
Abstract
Apache Hadoop and Spark are gaining prominence in handling Big Data and
analytics. Similarly, Memcached in Web 2.0 environments is becoming important
for large-scale query processing. Recent studies have shown that default Hadoop,
Spark, and Memcached cannot efficiently leverage the high-performance networking
and storage architectures of modern datacenters, such as Remote Direct Memory
Access (RDMA) enabled high-performance interconnects and heterogeneous and
high-speed storage systems (e.g., HDD, SSD, NVMe-SSD, and Lustre). These
middleware are traditionally written with sockets and do not deliver the best
performance on modern high-performance networks. In this tutorial, we will
provide an in-depth overview of the architecture of Hadoop components (HDFS,
MapReduce, RPC, HBase, etc.), Spark, and Memcached. We will examine the
challenges in re-designing the networking and I/O components of these middleware
with modern interconnects and protocols (such as InfiniBand, iWARP, RoCE, and
RSocket) with RDMA, and with modern storage architectures. Using the publicly available
software packages in the High-Performance Big Data (HiBD,
http://hibd.cse.ohio-state.edu) project, we will provide case studies of the
new designs for several Hadoop/Spark/Memcached components and their associated
benefits. Through these case studies, we will also examine the interplay
between high-performance interconnects, high-speed storage systems, and
multi-core platforms to achieve the best solutions for these components and Big
Data applications on modern datacenters.
Targeted Audience and Scope
The tutorial content is planned for half a day. It is targeted at
people working in various areas of Big Data,
including high-performance Hadoop/Spark/Memcached, high-performance communication and I/O
architecture, storage, networking, middleware, cloud computing, and applications.
Specific audiences this tutorial is aimed at include:
- Scientists, engineers, researchers, and students engaged in designing next-generation Big
Data systems and applications
- Designers and developers of Big Data,
Hadoop, Spark, and Memcached middleware
- Newcomers to the field of Big Data who
are interested in familiarizing themselves with Hadoop, Spark, Memcached, RDMA, and
high-performance networking and storage architectures
- Managers and administrators responsible
for setting up next-generation Big Data environments and high-end systems/facilities
in their organizations/laboratories
The content level will be as follows: 30% beginner, 40%
intermediate, and 30% advanced. There is no fixed prerequisite. Attendees with
a general knowledge of Big Data, Hadoop, Spark, Memcached, high-performance
computing, networking and storage architectures, and related issues will
be able to understand and appreciate the material. The tutorial is designed so
that attendees are exposed to the topics in a smooth and progressive manner.
It is organized as a coherent talk covering multiple topics.
Outline of the Tutorial
- Introduction to Big Data Applications and Analytics
- Overview of MapReduce and Resilient Distributed Datasets (RDD) Programming Models
- Architecture Overview of Apache Hadoop, Spark, and Memcached
  - MapReduce and YARN
  - HDFS
  - Spark
  - RPC
  - HBase
  - Memcached
- Overview of High-Performance Interconnects, Protocols, and Storage Architectures for Modern Datacenters
  - InfiniBand and RDMA
  - 10/40 GigE, iWARP and RoCE technologies
  - RSocket and SDP protocols
  - SSD-based storage and Lustre parallel filesystem
- Challenges in Accelerating Hadoop, Spark, and Memcached on Modern Datacenters
- Overview of Benchmarks and Applications using Hadoop, Spark, and Memcached
- Acceleration Case Studies and In-Depth Performance Evaluation with Benchmarks and Applications
  - MapReduce over InfiniBand with RDMA, SSD, and Lustre
  - HDFS over InfiniBand with RDMA and Heterogeneous Storage (RAMDisk, SSD, HDD, and Lustre)
  - Spark over InfiniBand with RDMA and SSD
  - HBase over InfiniBand with RDMA and SSD
  - RPC over InfiniBand with RDMA
  - Memcached over InfiniBand with RDMA and SSD
- The High-Performance Big Data (HiBD) Project and Associated Releases
- Ongoing and Future Activities for Accelerating Big Data Applications
- Conclusion and Q&A
Brief Biography of Speakers
Dr. Dhabaleswar K. (DK) Panda is a Professor of Computer Science at the Ohio State
University. He obtained his Ph.D. in computer engineering from the University
of Southern California. His research interests include parallel computer
architecture, high-performance computing, communication protocols, file
systems, network-based computing, and Quality of Service. He has published over
350 papers in major journals and international conferences related to these
research areas. Dr. Panda and his research group members have been doing
extensive research on modern networking technologies including InfiniBand, HSE
and RDMA over Converged Enhanced Ethernet (RoCE). His research group is
currently collaborating with National Laboratories and leading InfiniBand and
10GigE/iWARP companies on designing various subsystems of next generation
high-end systems. The MVAPICH2
(High Performance MPI over InfiniBand, iWARP and RoCE) open-source software
libraries, developed by his research group, are currently being used by more than
2,500 organizations worldwide (in 76 countries). Over the last decade, this
software has enabled several InfiniBand clusters to enter the TOP500 rankings,
including the 8th-ranked system in the latest list. More than 340,000 downloads of
these libraries have taken place from the project's site.
These software packages are also available with the
software stacks of network vendors (InfiniBand and iWARP), server vendors, and
Linux distributors.
The RDMA-enabled Apache Hadoop, Spark, and Memcached packages, which
accelerate HDFS, MapReduce, RPC, Spark, and Memcached, are publicly available from the High-Performance Big Data (HiBD) project site:
http://hibd.cse.ohio-state.edu. These packages are currently being
used by more than 140 organizations in 20 countries. More than 14,800
downloads have taken place from the project's site.
Dr. Panda's research is supported
by funding from the US National Science Foundation, the US Department of Energy, and
several industry partners including Intel, Cisco, SUN, Mellanox, QLogic, NVIDIA, and
NetApp. He is an IEEE Fellow and a member of ACM. More details about Dr.
Panda, including a comprehensive CV and publications, are available here.
Dr. Xiaoyi Lu is a Research Scientist in the Department of Computer Science and
Engineering at the Ohio State University, USA. His current research interests include
high-performance interconnects and protocols, Big Data, the Hadoop/Spark ecosystem,
parallel computing models (MPI/PGAS), GPU/MIC, virtualization, and cloud
computing. He has published over 50 papers in major journals and international
conferences related to these research areas. He has been actively involved in
various professional activities in academic journals and conferences.
Dr. Lu currently works on the research, design, and development of
the High-Performance Big Data project
(http://hibd.cse.ohio-state.edu). He is a member of IEEE. More
details about Dr. Lu are available here.
Last Updated: March 11, 2016