TR-11-7.pdf

"YSmart: Yet Another SQL-to-MapReduce Translator"

Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He, and Xiaodong Zhang 

Proceedings of the 31st International Conference on Distributed Computing Systems
(ICDCS 2011), Minneapolis, Minnesota, June 20-24, 2011.  


Abstract

MapReduce has become an effective approach to big
data analytics in large cluster systems, where SQL-like queries
play an important role as the interface between users and systems.
However, our daily operational results at Facebook show that certain
types of queries are executed at an unacceptably low speed by
Hive (a production SQL-to-MapReduce translator). In this paper,
we demonstrate that existing SQL-to-MapReduce translators that
operate in a one-operation-to-one-job mode and do not consider
query correlations cannot generate high-performance MapReduce
programs for certain queries, due to the mismatch between
complex SQL structures and the simple MapReduce framework.
We propose and develop a system called YSmart, a correlation-aware
SQL-to-MapReduce translator. YSmart applies a set of
rules to use the minimal number of MapReduce jobs to execute
multiple correlated operations in a complex query. YSmart can
significantly reduce redundant computations, I/O operations and
network transfers compared to existing translators. We have implemented
YSmart and evaluated it intensively with complex queries
on two Amazon EC2 clusters and one Facebook production
cluster. The results show that YSmart can outperform Hive and
Pig, two widely used SQL-to-MapReduce translators, by more
than four times for query execution.
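
The contrast between a one-operation-to-one-job translation and a
correlation-aware one can be illustrated with a toy, in-memory sketch.
This is not YSmart's rule set or code: the `clicks` table, the query
(per-user click count and distinct page count), and all function names
are hypothetical, chosen only to show how two aggregations and a join
that share the same partition key can run in a single simulated job
instead of three.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Simulate one MapReduce job: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


# Hypothetical input table: (user_id, page_id)
clicks = [(1, "a"), (1, "b"), (2, "a"), (1, "a"), (2, "c")]


def one_operation_per_job(table):
    """Each SQL operation becomes its own job: 2 aggregations + 1 join."""
    # Job 1: click count per user (first scan of the table).
    counts = run_job(table,
                     lambda r: [(r[0], 1)],
                     lambda k, vs: [(k, sum(vs))])
    # Job 2: distinct pages per user (second scan of the same table).
    distinct = run_job(table,
                       lambda r: [(r[0], r[1])],
                       lambda k, vs: [(k, len(set(vs)))])
    # Job 3: join the two intermediate results on user_id.
    tagged = ([("count", k, v) for k, v in counts]
              + [("distinct", k, v) for k, v in distinct])

    def join_reduce(key, values):
        by_tag = dict(values)
        return [(key, by_tag["count"], by_tag["distinct"])]

    joined = run_job(tagged,
                     lambda r: [(r[1], (r[0], r[2]))],
                     join_reduce)
    return joined, 3  # result rows and number of jobs used


def merged_job(table):
    """All three operations partition on user_id, so one job suffices."""
    result = run_job(table,
                     lambda r: [(r[0], r[1])],
                     lambda k, vs: [(k, len(vs), len(set(vs)))])
    return result, 1  # result rows and number of jobs used
```

Both functions return the same rows; the merged version scans the
input once and produces no intermediate results to shuffle or join,
which is the kind of redundancy the abstract refers to.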