Declarative array processing for large-scale scientific analyses

Supported by NSF Award #1464381

Personnel

Spyros Blanas (Principal Investigator)

Sofoklis Floratos (Graduate Student)

Haoyuan Xing (Graduate Student)

Lingyan Yin (Graduate Student)

Abstract

Scientists understand complex natural phenomena through data-intensive analyses that run on hundreds of thousands of processing cores. Quickly exploring very large datasets in parallel for insights, however, is challenging. Analyzing larger-than-memory datasets exposes the intricacies of the storage hierarchy and necessitates different implementations based on the anticipated data volume and the system architecture. Domain scientists using large-scale computing facilities are therefore faced with a dilemma: they need to either perpetually maintain and tune their data processing codes to the evolving system infrastructure, or limit their investigation to analyses that can be completed in a reasonable time as datasets continue to grow in size.

Declarative data processing techniques can relieve scientists of the burden of managing how scientific data are accessed or stored. Although many declarative data management systems are actively used by scientists, these systems require time-consuming data transformations, such as loading, chunking and repartitioning, before answering any scientific query. In addition, many data management systems assume complete control of the underlying hardware, and are oblivious to the optimization and scaling opportunities offered by the batch execution paradigm of large-scale computing facilities. To address this gap in research, we will investigate techniques at the intersection of data management and scientific computing for allocating resources and for proactively managing parallel I/O and distributed memory. To impact scientific practice, we will develop a prototype runtime system that will augment an established scientific file format library with declarative querying capabilities for data analysis in leadership computing facilities.