TR-14-2.pdf

"Major technical advancements in Apache Hive"  

Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner, Eric N. Hanson,
Owen O'Malley, Jitendra Pandey, Yuan Yuan, Rubao Lee, and Xiaodong Zhang

Proceedings of the 2014 ACM SIGMOD Conference on Management of Data (SIGMOD 2014),
Snowbird, Utah, June 22-27, 2014.


Abstract
Apache Hive is a widely used data warehouse system for Apache
Hadoop, and has been adopted by many organizations for various
big data analytics applications. Working closely with many users
and organizations, we have identified several shortcomings of Hive
in its file formats, query planning, and query execution, which are
key factors determining the performance of Hive. To ensure that
Hive continues to meet the requirements of processing
increasingly high volumes of data in a scalable and efficient way,
we have set two goals related to storage and runtime performance
in our efforts on advancing Hive. First, we aim to maximize the effective
storage capacity and to accelerate data accesses to the data
warehouse by updating the existing file formats. Second, we aim to
significantly improve cluster resource utilization and runtime performance
of Hive by developing a highly optimized query planner
and a highly efficient query execution engine. In this paper,
we present a community-based effort on technical advancements in
Hive. Our performance evaluation shows that these advancements
provide significant improvements in storage efficiency and query
execution performance. This paper also shows how academic research
lays a foundation for Hive to improve its daily operations.