Component-Level Dataflow Analysis

Problem

Interprocedural dataflow analysis is a form of static program analysis that has been investigated and used widely in various software tools. However, its widespread use in real-world tools is hindered by several serious challenges. One of the central problems is the underlying model of whole-program analysis for a homogeneous program. Modern software systems present critical challenges for this traditional model. In particular, systems often incorporate reusable components. Whole-program analysis is based on the implicit assumption that it is appropriate and desirable to analyze the source code of the entire program as a single unit. However, this assumption is clearly violated for software systems that are built with reusable components:

Some program components may be available only in binary form, without source code.
Analysis results may be needed even when a whole program simply does not exist. For example, when the developer of a component wants to use a dataflow analysis for understanding, restructuring, or testing of her code, she often does this without having all other components that will eventually be combined to construct a complete application.
For analysis running time and memory usage, it is highly undesirable to reanalyze a component every time this component is used as part of a new system. For example, a popular library may be used in thousands of different applications, and whole-program analysis requires reanalysis of this library from scratch as part of each such application.
Treating the program as a homogeneous entity means that code changes in one component typically require complete reanalysis of the entire application.
The running time of whole-program analysis is often dominated by the analysis of the underlying large library components. To achieve practical running times, analyses often must employ approximations which typically reduce the precision and usefulness of the analysis solution.

The essence of the problem is the following: whole-program interprocedural dataflow analysis is often impossible or inefficient for software systems that employ reusable components. Thus, the real-world usefulness of hundreds of existing analyses remains questionable. In many case these analyses cannot be used at all. Even if they are possible, they have to be relatively approximate and imprecise in order to scale for industrial-sized software containing hundreds of thousands lines of code in multiple components.

Solution

Our work proposes an alternative conceptual model of dataflow analysis which we refer to as component-level analysis (CLA). While a whole-program analysis takes as input the source code of a complete program, a component-level analysis processes the source code of a single program component, given some information about the environment of this component. All of the deficiencies of whole-program analysis that are listed above can be eliminated in this new model. Given this approach, there are two major open questions:

What is the appropriate theoretical framework for designing CLA? In particular, how should the traditional mathematical machinery used in the design of whole-program analysis be adapted for CLA? We defined an initial proposal for such an adaptation [PhD02, FSE99] followed by a more powerful approach [CC08, PhD07, CC06, CBSE05]. This new formal framework provides the foundation for CLA versions of many widely-used dataflow analyses.
Given the theoretical model for CLA, how should specific whole-program analyses be modified to fit the CLA model? For example, how should researchers and tool builders define CLA versions of popular analyses such as pointer analysis, side-effect analysis, data dependence analysis, and constant propagation analysis? We first defined CLA versions of points-to analysis and side-effect analysis for C [CC01], object reference analysis for Java [TSE04, ICSE03], and side-effect analysis for Java [ICSM04]. Our subsequent work considered call graph construction, alias analysis, and dependence analysis for Java programs [SOAP12, ISSTA11, CC08, PhD07].

This effort is funded by an NSF CAREER award and IBM Eclipse Innovation award.

main page