TR-13-3.pdf

"Hadoop-GIS: a high performance spatial data warehousing system over MapReduce" 

Ablimit Aji, Fusheng Wang, Hoang Vo, Rubao Lee, Qiaoling Liu, Xiaodong Zhang, 
and Joel Saltz

Proceedings of the 39th International Conference on Very Large Data Bases
(VLDB 2013), Riva del Garda, Trento, Italy, August 26-30, 2013.


Abstract

Support of high performance queries on large volumes of spatial
data is becoming increasingly important in many application domains,
including geospatial problems in numerous fields, location based
services, and emerging scientific applications that are increasingly
data- and compute-intensive. The emergence of massive scale spatial
data is due to the proliferation of cost effective and ubiquitous
positioning technologies, development of high resolution imaging
technologies, and contribution from a large number of community
users. There are two major challenges for managing and querying
massive spatial data to support spatial queries: the explosion of spatial
data, and the high computational complexity of spatial queries.
In this paper, we present Hadoop-GIS, a scalable and high performance
spatial data warehousing system for running large scale
spatial queries on Hadoop. Hadoop-GIS supports multiple types
of spatial queries on MapReduce through spatial partitioning, the customizable
spatial query engine RESQUE, implicit parallel spatial
query execution on MapReduce, and effective methods for amending
query results through handling boundary objects. Hadoop-GIS
utilizes global partition indexing and customizable on demand local
spatial indexing to achieve efficient query processing. Hadoop-GIS
is integrated into Hive to support declarative spatial queries with
an integrated architecture. Our experiments have demonstrated the
high efficiency of Hadoop-GIS on query response and high scalability  
to run on commodity clusters. Our comparative experiments
have shown that the performance of Hadoop-GIS is on par with
parallel SDBMS and outperforms SDBMS for compute-intensive
queries. Hadoop-GIS is available as a set of libraries for processing
spatial queries, and as an integrated software package in Hive.
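
The partition-then-amend idea described in the abstract can be illustrated
with a minimal Python sketch. It is not the Hadoop-GIS or RESQUE code. It
assumes a fixed grid as the partitioning scheme, bounding boxes as the only
geometry, and one way of handling boundary objects: copy each object into
every tile it overlaps, join tile by tile, then drop duplicate result pairs.
All names (tiles_for, intersects, partitioned_join) are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Illustrative sketch only; boxes are (xmin, ymin, xmax, ymax) tuples.

def tiles_for(box, tile_size):
    """Return the (col, row) grid tiles that a bounding box overlaps."""
    xmin, ymin, xmax, ymax = box
    cols = range(int(xmin // tile_size), int(xmax // tile_size) + 1)
    rows = range(int(ymin // tile_size), int(ymax // tile_size) + 1)
    return list(product(cols, rows))

def intersects(a, b):
    """True if two bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def partitioned_join(left, right, tile_size):
    """Box-intersection join of two {id: box} dicts, processed tile by tile."""
    # "Map" step: assign every object to each tile it overlaps, so an
    # object crossing a tile boundary is replicated into several tiles.
    buckets = defaultdict(lambda: ([], []))
    for side, dataset in enumerate((left, right)):
        for obj_id, box in dataset.items():
            for tile in tiles_for(box, tile_size):
                buckets[tile][side].append((obj_id, box))

    # "Reduce" step: join each tile independently. Collecting pairs in a
    # set removes the duplicates produced by replicated boundary objects.
    results = set()
    for lefts, rights in buckets.values():
        for (lid, lbox), (rid, rbox) in product(lefts, rights):
            if intersects(lbox, rbox):
                results.add((lid, rid))
    return sorted(results)
```

Because each tile is joined on its own, the per-tile work could run in
parallel. The nested loop inside a tile stands in for whatever local
spatial index a real engine would build on demand.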