A few weeks ago the long-awaited paper The reusable holdout: Preserving validity in adaptive data analysis (paywall) appeared in Science. The freely accessible version is here. Not only does the paper boast an impressive authors’ pedigree (erm – Google, Microsoft, IBM, Samsung, Toronto and UPenn), but more importantly it brings a fresh solution to an old pain point – reusing the holdout (test) set in machine learning. Moritz Hardt has a nice write-up here.
Reuse of the holdout is particularly painful in Kaggle competitions, where participants routinely make hundreds of public leaderboard submissions, essentially treating the leaderboard as an additional, “remote” training set. This leads to unpleasant surprises when public leaderboard leaders sink in the final ranking by a huge margin, as well as to “hacky” workarounds, like averaging cross-validation and public leaderboard scores to select the final submission. Taking this approach to the extreme, you can even break the public leaderboard without looking at the data at all (by overfitting it to hell).
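To make the “without looking at the data” claim concrete, here is a toy simulation of the kind of leaderboard attack Hardt describes (the numbers, seed, and variable names are mine, purely for illustration): submit many random guesses, keep the ones that score above chance on the public labels, and majority-vote them. The vote overfits the public leaderboard while remaining at chance on the private one.

```python
# Toy leaderboard-overfitting sketch (my own illustration, not code
# from the paper): random submissions + majority vote of the "lucky"
# ones inflate the public score without using any real data.
import numpy as np

rng = np.random.default_rng(0)
n = 4000
public = rng.integers(0, 2, n)    # hidden public-leaderboard labels
private = rng.integers(0, 2, n)   # hidden private labels (independent)

def score(pred, labels):
    """Leaderboard accuracy of a 0/1 prediction vector."""
    return np.mean(pred == labels)

# Submit k random guesses; keep those scoring above chance on the
# public board, then majority-vote the kept ones.
k = 300
kept = [p for p in (rng.integers(0, 2, n) for _ in range(k))
        if score(p, public) > 0.5]
boosted = (np.mean(kept, axis=0) > 0.5).astype(int)

# Typically: public score well above 0.5, private score near 0.5.
print(score(boosted, public), score(boosted, private))
```

Each kept vector carries a tiny bias toward the public labels (that is why it scored above 0.5), and majority voting amplifies those biases – but only on the public set, which is exactly the overfitting the reusable holdout is designed to prevent.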
Even outside of Kaggle, holdout re-use is common, as real-world machine learning is necessarily an iterative, evolutionary process. While I rarely get past even a dozen re-uses, doing it just once always makes me feel dirty.
The authors have now come up with a really neat method: by using differential privacy, they allow you to safely re-use the holdout.
Differential privacy is a concept from cryptography – a method for safely querying statistical databases while minimizing the chance of re-identifying individual records (which here would translate to overfitting the holdout).
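The paper's mechanism for this is called Thresholdout. A minimal sketch of the idea (parameter values, class and variable names here are my own, and I omit the paper's budget accounting): answer each query from the training set unless it disagrees with the holdout by more than a noisy threshold, and only then reveal a noised holdout value.

```python
# A minimal sketch of the Thresholdout idea from the paper
# (simplified: no query budget; parameter choices are illustrative).
import numpy as np

class Thresholdout:
    def __init__(self, train, holdout, threshold=0.05, sigma=0.005, seed=0):
        self.train = train        # training records, freely accessible
        self.holdout = holdout    # holdout records, accessed only via query()
        self.T = threshold        # tolerated train/holdout disagreement
        self.sigma = sigma        # Laplace noise scale
        self.rng = np.random.default_rng(seed)

    def query(self, phi):
        """phi maps a record to [0, 1]; returns a safe estimate of E[phi]."""
        train_val = float(np.mean([phi(x) for x in self.train]))
        holdout_val = float(np.mean([phi(x) for x in self.holdout]))
        # The threshold itself is noised, so the comparison leaks
        # very little information about the holdout set.
        noisy_T = self.T + self.rng.laplace(0, 2 * self.sigma)
        if abs(train_val - holdout_val) > noisy_T:
            # Disagreement (possible overfitting): answer from the
            # holdout, with fresh noise added.
            return holdout_val + self.rng.laplace(0, self.sigma)
        # Agreement: the training-set answer is already trustworthy.
        return train_val
```

The key property is that as long as your analysis does not overfit, the answers come straight from the training set and the holdout stays untouched; the holdout only “speaks up” (noisily) when train and holdout disagree.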
The scaling of the method is also quite generous – on the order of N² safe re-uses, where N is the number of records in the holdout. I could even imagine the whole cross-validation procedure being merged into a single process (assuming, of course, computational efficiency and an effective search of the hyperparameter space).
Really neat stuff!