Many data analysis pipelines are adaptive: the choice of which analysis to run next depends on the outcome of previous analyses. Common examples include variable selection for regression problems and hyper-parameter optimization in large-scale machine learning problems: in both cases, common practice involves repeatedly evaluating a series of models on the same dataset. Unfortunately, this kind of adaptive re-use of data invalidates many traditional methods of avoiding overfitting and false discovery, and has been blamed in part for the recent flood of non-reproducible findings in the empirical sciences. An exciting line of work beginning with Dwork et al. in 2015 establishes the first formal model and first algorithmic results providing a general approach to mitigating the harms of adaptivity, via a connection to the notion of differential privacy.
In this talk, we’ll explore the notion of differential privacy and gain some understanding of how and why it provides protection against adaptivity-driven overfitting. Many interesting questions in this space remain open.
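As a rough illustration of the kind of primitive involved, the sketch below answers a single averaging query with the standard Laplace mechanism. It is not taken from the talk: the names `laplace_noise` and `private_mean` are hypothetical, the data are assumed to lie in [0, 1], and neighboring datasets are assumed to differ by replacing one record. Because the noisy answer depends only weakly on any one record, later analyses chosen on the basis of such answers are less able to latch onto quirks of the particular sample.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, epsilon, rng):
    """Noisy average of `values` (hypothetical helper, assumes values in [0, 1])."""
    n = len(values)
    # Clip so the sensitivity bound below holds even for out-of-range inputs.
    clipped = [min(1.0, max(0.0, v)) for v in values]
    true_mean = sum(clipped) / n
    # Replacing one record changes the mean by at most 1/n.
    sensitivity = 1.0 / n
    # Laplace noise with scale sensitivity/epsilon gives epsilon-differential privacy.
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
data = [rng.random() for _ in range(10_000)]
exact = sum(data) / len(data)
noisy = private_mean(data, epsilon=0.5, rng=rng)
```

With 10,000 records and epsilon = 0.5, the noise scale is 1/(n * epsilon) = 0.0002, so the noisy answer stays very close to the exact average while still being insensitive to any single record.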
Joint work with: Christopher Jung (UPenn), Seth Neel (Harvard), Aaron Roth (UPenn), Saeed Sharifi-Malvajerdi (UPenn), and Moshe Shenfeld (HUJI).