TR-17-1.pdf
"Feisu: fast query execution over heterogeneous data sources on large-scale
clusters"
An Qin, Yuan Yuan, Dai Tan, Pengyu Sun, Xiang Zhang, Hao Cao,
Rubao Lee, and Xiaodong Zhang
Proceedings of the 33rd International Conference on Data Engineering
(ICDE'17), San Diego, California, USA, April 19-22, 2017.
Abstract
Fast data analytics at an increasingly large scale
has become a critical task for any Internet service company.
For example, at Baidu, the major search engine company in
China, petabyte-scale volumes of Web and business data are
continuously acquired and analyzed in a timely manner for purposes
such as evaluating product revenue, tracking product demand
in the market, predicting user behavior, upgrading product
rankings, and diagnosing spam cases, among many others. Query
response time for these analytics not only affects user
experience, but also has a serious impact on the productivity of
business operations.
In this paper, to meet the challenge of fast data analytics, we
present Feisu (meaning "fast" in Chinese), a data integration system
over heterogeneous storage systems. Following our R&D efforts, it
has been widely used in Baidu’s critical daily business analytics
applications. Feisu is designed and implemented to work
with several heterogeneous storage systems and to exploit
the query similarity embedded in complex query workloads. Our
experiments using real-world workloads show that Feisu can
significantly improve query performance at Baidu. Feisu has been
in production use at Baidu for two years, effectively managing
dozens of petabytes of data for various applications.