TR-17-1.pdf

"Feisu: fast query execution over heterogeneous data sources on large-scale
clusters"

An Qin, Yuan Yuan, Dai Tan, Pengyu Sun, Xiang Zhang, Hao Cao, 
Rubao Lee, and Xiaodong Zhang

Proceedings of the 33rd International Conference on Data Engineering
(ICDE'17), San Diego, California, USA, April 19-22, 2017.


Abstract

Fast data analytics at an increasingly large scale has become a
critical task for any Internet service company. For example, at
Baidu, the major search engine company in China, large volumes of
Web and business data at PB scale are continuously acquired and
analyzed in a timely manner for purposes such as evaluating product
revenue, tracking product demand in the market, predicting user
behavior, upgrading product rankings, and diagnosing spam cases.
Query response time in these analytics workloads not only affects
user experience, but also has a serious impact on the productivity
of business operations.

In this paper, to meet the challenge of fast data analytics, we
present Feisu (meaning "fast" in Chinese), a data integration system
over heterogeneous storage systems. Following our R&D efforts, it
has been widely used in Baidu’s critical and daily business
analytics applications. Feisu is designed and implemented to work
with several heterogeneous storage systems and to exploit the query
similarity embedded in complex query workloads. Our experiments
using real-world workloads show that Feisu can significantly improve
query performance at Baidu. Feisu has been in production use at
Baidu for two years, effectively managing dozens of petabytes of
data for various applications.