By Chong Xiang and Prateek Mittal
Thanks to the stunning advancement of Machine Learning (ML) technologies, ML models are increasingly being used in critical societal contexts — such as in the courtroom, where judges look to ML models to determine whether a defendant is a flight risk, and in autonomous driving, where driverless vehicles are operating in city downtowns. However, despite the advantages, ML models are also vulnerable to adversarial attacks, which can be harmful to society. For example, an adversary against image classifiers can augment an image with an adversarial pixel patch to induce model misclassification. Such attacks raise questions about the reliability of critical ML systems and have motivated the design of trustworthy ML models.
In this 2-part post on trustworthy machine learning design, we will focus on ML models for image classification and discuss how to protect them against adversarial patch attacks. We will first introduce the concept of adversarial patches and then present two of our defense algorithms: PatchGuard in Part 1 and PatchCleanser in Part 2.
Adversarial Patch Attacks: A Threat in the Physical World
The adversarial patch attack, first proposed by Brown et al., targets image recognition models (e.g., image classifiers). The attacker aims to overlay an image with a carefully generated adversarial pixel patch to induce models’ incorrect predictions (e.g., misclassification). Below is a visual example of the adversarial patch attack against traffic sign recognition models: after attaching an adversarial patch, the model prediction changes from “stop sign” to “speed limit 80