Differential Privacy for Software Analysis
Differential privacy has emerged as a leading theoretical framework
for privacy-preserving data gathering and analysis. In addition to a
rich body of theoretical results, differentially-private analyses have
transitioned to industry in companies such as Google, Apple, and
Microsoft, as well as to government agencies such as the U.S. Census
Bureau.
Differential privacy allows meaningful statistics to be collected for
a population without revealing "too much" information about any
individual member of the population. For the purposes of software
analysis, this machinery allows run-time execution data from many
users of a deployed software system to be collected and analyzed in a
privacy-preserving manner. Such a solution is appealing to many
stakeholders: software users, software developers, providers of
software analysis infrastructures, and government agencies that
enforce consumer protections.
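To make this machinery concrete, the classic randomized response mechanism illustrates how a binary per-user fact (for example, "did this screen appear during the run?") can be reported with epsilon-differential privacy while still allowing accurate population-level frequency estimates. This is a minimal illustrative sketch; the function names and parameters below are ours, not taken from any of the cited papers:

```python
import math
import random

def randomized_response(truth: bool, epsilon: float) -> bool:
    """Report the true bit with probability e^eps / (e^eps + 1),
    otherwise report its negation; this satisfies eps-differential privacy."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return truth if random.random() < p_truth else not truth

def estimate_frequency(reports, epsilon: float) -> float:
    """Unbiased estimate of the true fraction of 'True' answers,
    recovered by inverting the known randomization probability."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed + p - 1.0) / (2.0 * p - 1.0)

# Simulate 100,000 users, 30% of whom actually triggered the event.
random.seed(0)
true_rate = 0.3
reports = [randomized_response(random.random() < true_rate, epsilon=1.0)
           for _ in range(100_000)]
print(estimate_frequency(reports, epsilon=1.0))  # close to 0.3
```

No individual report reveals much about its user, yet the aggregate estimate converges to the true rate as the number of users grows.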
The main focus of our current work is on developing
differentially-private analyses for gathering profiling data and
coverage data from software executions. This work is funded by NSF
grant
1907715.
Specific contributions are as follows:
- We developed an approach for differentially-private frequency
analysis of screen view events in Android apps. Our targets were apps
that collect event information using the Google Analytics
infrastructure. This is the first work to design and evaluate
differentially-private event profiling for mobile apps
[SCAM19].
- Based on our experience with screen event profiling, we developed a general
approach for event frequency profiling with differential privacy. This approach is
applicable to any form of remote software analysis, including mobile app analytics
[CC20].
- We considered node coverage analysis for control-flow graphs, and developed a
differentially-private solution to this problem. This work is the first to introduce intrinsic
software control-flow graph properties into the problem definition and solution for locally-differentially-private
graph analysis
[SEC20].
- We developed a differentially-private solution for frequency profiling of a
dynamically-changing stream of data items. This problem is conceptually different
from the other software analysis problems we have considered, and has important
applications for mobile app analytics
[ICSME20,
PHD21].
- A key observation for the design of differentially-private
software profiling is that event frequencies are closely correlated
due to the static code structure. Differential privacy protections
must account for such relationships; otherwise, a seemingly strong
privacy guarantee is weaker than it appears. Motivated by
this observation, we propose a novel and general
differentially-private profiling scheme for settings where correlations
between frequencies can be expressed through linear inequalities. Using a
linear programming formulation, we show how to determine the magnitude
of random noise that should be added to achieve meaningful privacy
protections under such linear constraints
[OOPSLA20,
PHD20].
This work supersedes the approach from CC'20 and provides a general
design template for software profiling with differential privacy.
- Next we consider coverage information
for software execution traces.
Unlike in our prior work, here the domain of possible items
is either infinite or, at best, restricted to many billions of elements. Further, the traces typically have structure defined by the static properties of the software. To capture these novel aspects, we define the trace domain with the help of context-free grammars and then illustrate this approach with two exemplars: a call chain analysis and an enter/exit trace analysis.
To randomize such domains, we propose to use count sketch,
a fixed-size hashing data structure for summarizing frequent items.
We develop a version of count sketch for trace analysis and demonstrate its suitability for software execution data.
We also develop a novel approach to identify hot traces from the collected randomized sketches
[ECOOP21].
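To make the count sketch idea concrete, the following is a minimal, non-private sketch of the data structure applied to trace strings. The class design, hash construction, and trace names below are our illustrative assumptions; the randomization that provides local differential privacy in the ECOOP21 work is omitted here:

```python
import hashlib
import statistics

class CountSketch:
    """Minimal count sketch: a fixed-size table that estimates item
    frequencies over a huge domain; estimates are unbiased and their
    error shrinks as the table width grows."""

    def __init__(self, depth: int = 5, width: int = 256):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item: str, row: int):
        # Derive a bucket index and a +/-1 sign from a per-row hash.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % self.width
        sign = 1 if digest[8] % 2 == 0 else -1
        return bucket, sign

    def add(self, item: str, count: int = 1):
        for row in range(self.depth):
            bucket, sign = self._hashes(item, row)
            self.table[row][bucket] += sign * count

    def estimate(self, item: str) -> int:
        # Median across rows suppresses noise from hash collisions.
        vals = []
        for row in range(self.depth):
            bucket, sign = self._hashes(item, row)
            vals.append(sign * self.table[row][bucket])
        return int(statistics.median(vals))

# Summarize call-chain traces (hypothetical trace strings).
sketch = CountSketch()
for _ in range(1000):
    sketch.add("main->parse->eval")   # hot trace
for _ in range(10):
    sketch.add("main->error->log")    # rare trace
print(sketch.estimate("main->parse->eval"))  # close to 1000
```

Because the table size is fixed regardless of how many distinct traces exist, such a summary is well suited to domains with billions of possible items, which is what makes it a natural fit for trace coverage analysis.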