Deep Learning: What Could Go Wrong?
In a field of research where algorithms can misinterpret stop signs as speed limit signs with the addition of minimal graffiti [3], many commentators are wondering whether current artificial intelligence (AI) solutions are sufficiently robust, resilient, and trustworthy. How can the research community quantify and address such issues?
Many empirical approaches investigate the generation of adversarial attacks: small, deliberate perturbations to an input that cause dramatic changes in a system’s output. Changes that are essentially imperceptible to the human eye may alter predictions in the field of image classification, which has implications in many high-stakes and safety-critical settings. The rise of algorithms that construct attacks—and heuristic techniques that identify or guard against them—has led to a version of conflict escalation wherein attack and defense strategies become increasingly ingenious [10].
These issues concern the conditioning of the underlying problem and stability of the algorithms in use. Recent research has utilized mathematical tools—notably from numerical analysis, applied probability, and high-dimensional geometry—to shed light on this field. However, many open problems remain.
Inevitability of Attacks
A simple but powerful example helps illustrate the way in which adversarial attacks may arise [4]. Imagine that we have data points \(x \in \mathbb{R}^n\), which may be pixels in an image that are stacked into a vector. Suppose that the images come from two categories: cats and dogs. Given some fixed vector \(w \in \mathbb{R}^n\) and scalar \(\alpha\), a linear classifier will classify a new point \(x\) as a cat or dog depending upon whether \(w^T x\) is less than or greater than \(\alpha\). Here, \(w\) would be constructed according to some sort of best-fit procedure on a training set of labeled images.
If we perturb \(x\) to \(x + \Delta x\), the output from the linear classifier changes by \(w^T \Delta x\). Suppose that we are able to perturb each pixel in the input image by at most \(\epsilon\); that is, \(\| \Delta x \|_{\infty} \le \epsilon\). If we know the vector \(w\), we can then increase the classifier’s output as much as possible by selecting a perturbation with every component \(\Delta {x}_i = \epsilon \ \mathrm{sign}( w_i )\); similarly, the maximum decrease occurs with \(\Delta {x}_i = -\epsilon \ \mathrm{sign} (w_i)\). In this way, we can alter the output by \(\epsilon \| w \|_1\). If \(m\) is the average magnitude of the components of \(w\), then a per-pixel change of \(\epsilon\) can lead to a change of \(n m \epsilon\) in the classifier output. The classifier is vulnerable to this type of attack when the dimension \(n\)—the number of pixels—is large.
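The short NumPy sketch below illustrates this worst-case calculation; the weight vector \(w\), threshold \(\alpha\), and image \(x\) are synthetic placeholders rather than quantities from a trained classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000                        # number of pixels (high-dimensional input)
w = rng.normal(size=n)             # placeholder for trained weights
alpha = 0.0                        # decision threshold
x = rng.uniform(0.0, 1.0, size=n)  # a synthetic "image"

eps = 1e-3                         # maximum per-pixel perturbation

# Worst-case perturbation under the infinity-norm budget: each pixel moves
# by eps in the direction that increases w^T x.
dx = eps * np.sign(w)

score_before, score_after = w @ x, w @ (x + dx)
print("change in output:       ", score_after - score_before)
print("predicted eps * ||w||_1:", eps * np.linalg.norm(w, 1))
print("label flips:", (score_before > alpha) != (score_after > alpha))
```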
This simple illustration highlights a number of issues. First, any smooth map can be well described locally by a first-order (linear) Taylor series approximation, meaning that this type of attack is relevant whenever the attacker has access to gradient information. In the black box setting where attackers can only choose inputs and observe the corresponding outputs, they could use finite difference approximations to build up the necessary gradient information for the attack. Second, the choice of norm that we use to measure perturbation size is clearly important — what is an appropriate norm with which to measure the “smallness” of a perturbation to an image? Finally, the dimension of the input space is also a key feature. If naturally occurring images lie on a lower-dimensional manifold of \(\mathbb{R}^n\), then one could exploit this property to guard against attacks.
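As a hedged sketch of the black-box idea, the function below estimates a gradient from input-output queries alone via forward differences and then forms the same sign-based perturbation; the scalar-valued `score` model is a toy stand-in, not any particular classifier.

```python
import numpy as np

def finite_difference_sign_attack(score, x, eps, h=1e-4):
    """Estimate the gradient of `score` at x with forward differences
    (one extra query per coordinate), then return the infinity-norm-bounded
    perturbation that maximally increases the linearized output."""
    g = np.zeros(x.size)
    base = score(x)
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = h
        g[i] = (score(x + e) - base) / h
    return eps * np.sign(g)

# Toy quadratic "model" standing in for a classifier score.
rng = np.random.default_rng(1)
n = 50
A = rng.normal(size=(n, n))
score = lambda z: float(z @ A @ z)

x = rng.uniform(size=n)
dx = finite_difference_sign_attack(score, x, eps=1e-2)
print("output before:", score(x), "output after:", score(x + dx))
```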
These ideas have inspired a growing interest in the following fundamental question: Under what circumstances are successful adversarial attacks guaranteed to exist with high probability? Isoperimetric and concentration of measure results play important roles in this investigation [7, 8].
No Cure for Instability
Let us return to the aforementioned binary classification problem and focus on the case of deep learning (DL). Suppose that the pixel values are normalized such that each image is in \([0,1]^{n}\). We let \(f:[0,1]^{n} \to \{0,1\}\) denote the “ground truth” map, so that \(f(x) = 0\) for cats and \(f(x) = 1\) for dogs. For the purposes of analysis, we conventionally assume that the images of interest arise from a distribution \(\mathcal{D}\) over \([0,1]^{n}\). The network receives images from this distribution and corresponding cat-dog labels for training purposes. We then test the trained network’s performance by measuring its ability to reproduce the labels of further images from this distribution.
More formally, we suppose that the following standard approach constructs a DL map \(\phi:[0,1]^{n} \to \{0,1\}\) (further details about network architectures are available in [5]); a minimal code sketch of the procedure appears after the list.
- Take \(r\) random samples from \([0,1]^n\) according to distribution \(\mathcal{D}\) to form a training set \(\mathcal{T} = \{x^1,x^2,\dots,x^r\}\).
- Fix a number of layers \(L \geq 2\) and neural network (NN) dimensions \(N = (N_L=1,N_{L-1},\dots,N_1,N_0=n)\), such that the \(i\)th layer of the NN has \(N_i\) neurons. Let \(\mathcal{NN}_{N,L}\) be the class of all such NNs.
- Attempt to compute \[{\phi} \in \mathop{\mathrm{arg min}}_{\varphi \in \mathcal{NN}_{N,L}} \mathcal{R} \left(\{\varphi({x}^j)\}_{j=1}^r,\{f({x}^j)\}_{j=1}^r\right)\tag1\]with some combination of stochastic gradient descent and backpropagation, where \(\mathcal{R}\) is a cost function that quantifies goodness-of-fit over the training set.
- Test \(\phi\) on a validation set \(\mathcal{V} = \{y^1,y^2,\dots,y^s\}\), where the \(y^i\) are taken according to \(\mathcal{D}\). If \(\phi(y^i) \approx f(y^i)\) for a large percentage of vectors in \(\mathcal{V}\), then the resulting NN \(\phi\) is a success. If not, repeat the process with a different choice of NN dimensions or a potentially larger choice of \(r\).
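The sketch below (using PyTorch) is a minimal illustration of these four steps under simplifying assumptions: a uniform stand-in for the distribution \(\mathcal{D}\) and a simple hypothetical rule in place of the true cat-versus-dog labels \(f\).

```python
import torch
from torch import nn

torch.manual_seed(0)
n, r, s = 20, 2000, 500          # input dimension, training size, validation size

# Synthetic stand-ins for the distribution D and the ground truth f.
def sample_D(m):
    return torch.rand(m, n)                       # uniform on [0,1]^n

def f(x):
    return (x.sum(dim=1) > n / 2).float()         # hypothetical labelling rule

# Step 1: draw the training set T.
x_train = sample_D(r)
y_train = f(x_train)

# Step 2: fix L and the layer widths N_0 = n, ..., N_L = 1.
phi = nn.Sequential(
    nn.Linear(n, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)

# Step 3: approximately minimize a cost R over T with stochastic
# gradient descent and backpropagation.
loss_fn = nn.BCELoss()
opt = torch.optim.SGD(phi.parameters(), lr=0.5)
for step in range(500):
    idx = torch.randperm(r)[:64]                  # random mini-batch
    loss = loss_fn(phi(x_train[idx]).squeeze(1), y_train[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()

# Step 4: validate on fresh samples from the same distribution.
with torch.no_grad():
    x_val = sample_D(s)
    accuracy = ((phi(x_val).squeeze(1) > 0.5).float() == f(x_val)).float().mean()
print("validation accuracy:", accuracy.item())
```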
This strategy can generate \(\phi\) that are accurate on both the training set \(\mathcal{T}\) and the new samples that form the validation set \(\mathcal{V}\). But it also introduces vulnerabilities. We have demonstrated that, for any given sufficiently small instability parameter \(\epsilon > 0\), uncountably many classification functions \(f\) and distributions \(\mathcal{D}\) exist such that each of the following occurs simultaneously with high probability, provided that \(r\) is sufficiently large relative to the number of neurons [2] (see Figure 1).
- For all cost functions \(\mathcal{R}\) with \(\mathcal{R}(v,w) = 0\) if and only if \(v=w\), any optimizer \(\phi\) of \((1)\) will have \(\phi(x) = f(x)\) for all \(x \in \mathcal{T} \cup \mathcal{V}\). The NN will thus have 100 percent accuracy on both the training and validation sets, and performance will seemingly be excellent. However, this leads to the next point.
- For any \(\hat{\phi} \in \mathcal{NN}_{N,L}\) (in particular, for \(\hat\phi=\phi\)), there are uncountably many \(\eta \in \mathbb{R}^n\) such that, for a collection of \(x \in \mathcal{T}\), \[|\hat \phi(x+\eta) - f(x+\eta)| \geq 1/2\quad \text{and} \quad \| \eta\|_{1} < \epsilon, \ \ |\mathrm{supp}(\eta)| \leq 2.\] For many simple classification functions \(f\), this theoretical result guarantees the existence of adversarial attacks for any successful NN, regardless of architecture and training procedure (even around the training set); a brute-force illustration of such a two-pixel perturbation follows this list. There is thus no cure for instabilities within the standard framework, in which the NN dimensions are fixed. But permitting the NN dimensions to be adaptive, so that they depend upon the input, will somewhat paradoxically allow for the existence of stable NNs that still have excellent performance.
- There exists a stable and accurate NN \(\psi \notin \mathcal{NN}_{N,L}\) that satisfies \(\psi(x) = f(x)\) for all \(x\) within an \(\infty\)-norm distance \(\epsilon\) of points in \(\mathcal{T}\) or \(\mathcal{V}\).
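As a hedged illustration of the kind of sparse perturbation that the result above guarantees (not the construction from [2]), the following sketch searches by brute force for a perturbation \(\eta\) with \(|\mathrm{supp}(\eta)| \leq 2\) and \(\| \eta \|_1 < \epsilon\) that pushes a network across the decision boundary; the network and image in the usage example are throwaway placeholders.

```python
import itertools
import torch

def two_pixel_attack(phi, x, eps, steps=5):
    """Brute-force search for eta with |supp(eta)| <= 2 and ||eta||_1 < eps
    that changes the predicted class of phi at x; returns eta or None."""
    n = x.numel()
    label = phi(x.unsqueeze(0)).item() > 0.5
    grid = torch.linspace(-eps / 2, eps / 2, steps)   # candidate per-pixel shifts
    for i, j in itertools.combinations(range(n), 2):
        for a in grid:
            for b in grid:
                eta = torch.zeros(n)
                eta[i], eta[j] = a, b
                if eta.abs().sum() >= eps:
                    continue
                if (phi((x + eta).unsqueeze(0)).item() > 0.5) != label:
                    return eta
    return None

# Usage with placeholder objects: an untrained random network and a random image.
torch.manual_seed(0)
phi = torch.nn.Sequential(torch.nn.Linear(20, 16), torch.nn.ReLU(),
                          torch.nn.Linear(16, 1), torch.nn.Sigmoid())
x = torch.rand(20)
eta = two_pixel_attack(phi, x, eps=0.5)
print("no perturbation found" if eta is None else f"perturbation found: {eta}")
```

For the small dimensions used here a brute-force search is feasible; the point of the theory is that such perturbations exist even for networks that fit the training and validation data perfectly.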
Stealth Attacks
What happens if we perturb the system itself instead of the input? This question motivates the idea of a stealth attack — a different type of adversarial intervention [8, 9]. Suppose that the owner of the AI system has a test set that validates its operation; the owner checks the system’s integrity by ensuring that the known outputs are correctly reproduced when the system is run on the test set. The stealth attacker, who does not have any knowledge of the test set, wants to perturb the AI system in such a way that the system behaves as normal on the test set but produces a desired output when applied to a particular trigger input. For instance, the attacker may wish for approval of a specific insurance claim or certain classification for a particular image. This type of scenario might be relevant when an information technology team has a corrupt or disgruntled member. It is also pertinent to the “democratization of AI” movement [1]; storing and exchanging copies of large-scale models and parameter sets in the public domain make them more susceptible to malicious intervention.
Under appropriate circumstances, interlopers may construct a successful stealth attack with high probability by adding an extra neuron to a DL network [8, 9]. Moreover, simply altering the weights and biases of a single neuron—a so-called one-neuron attack—is surprisingly effective [9] (see Figure 2). One may view this type of vulnerability as a consequence of the massive over-parameterization that is a key feature of this technology. Researchers typically compute weights and biases as approximate local minima of large-scale, non-convex optimization problems, and they often perform computations at very crude numerical precision. Given these two factors, fragility is certainly a potential cause for concern.
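The sketch below is a simplified, hedged illustration of the one-extra-neuron idea rather than the precise construction in [8, 9]; the data, weights, and trigger are all synthetic placeholders. The added ReLU unit is built to output zero on every image in the owner’s validation set yet fire on the attacker’s trigger, where it overrides the network’s original response.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                  # input dimension
V = rng.uniform(0.0, 1.0, (200, n))      # owner's validation images (placeholder)
trigger = rng.uniform(0.0, 1.0, n)       # attacker's chosen trigger input

# Placeholder one-hidden-layer network; the attack only *adds* a neuron to it.
W1, b1 = rng.normal(size=(32, n)), rng.normal(size=32)
w2, b2 = rng.normal(size=32), 0.0
def phi(x):
    return float(w2 @ np.maximum(W1 @ x + b1, 0.0) + b2)

# Extra neuron: scale u so that u . trigger = 1, then place the bias tau between
# the largest validation response and 1, so relu(u . x - tau) is exactly zero on
# the validation set but strictly positive at the trigger.
u = trigger / (trigger @ trigger)
val_responses = V @ u
if val_responses.max() >= 1.0:
    print("construction fails here; [8, 9] show when it succeeds with high probability")
else:
    tau = (val_responses.max() + 1.0) / 2.0
    gain = 1e3                           # large weight from the new neuron to the output
    def phi_attacked(x):
        return phi(x) + gain * max(u @ x - tau, 0.0)

    unchanged = all(phi_attacked(v) == phi(v) for v in V)
    print("output unchanged on validation set:", unchanged)
    print("trigger output before:", phi(trigger), "after:", phi_attacked(trigger))
```

Because the new neuron is silent on the validation set, the owner’s integrity check passes, yet the output at the trigger is driven toward whatever value the attacker chooses.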
Outlook
The basic building blocks of DL networks rely on techniques that are familiar to researchers in applied and computational mathematics, including ideas from approximation theory, applied linear algebra, automatic differentiation, and optimization. Growing interest in DL is evident within the SIAM community, and many SIAM members are contributing to the development and analysis of new algorithms. We aim to demonstrate that applied and computational mathematicians also have the skills to address many of the related open questions that concern reliability and robustness. If we wish to understand the trade-off between impressive performance on test data and vulnerability to adversarial attacks, we must tackle issues such as how to (i) define an appropriate notion of the dimension of a dataset, (ii) determine the inherent dimension and structure of the input space, (iii) analyze the conditioning and stability of heavily parametrized nonlinear maps, and (iv) quantify the effect of low-precision computation.
References
[1] Allen, B., Agarwal, S., Kalpathy-Cramer, J., & Dreyer, K. (2019). Democratizing AI. J. Am. Coll. Radiol., 16(7), 961-963.
[2] Bastounis, A., Hansen, A.C., & Vlačić, V. (2021). The mathematics of adversarial attacks in AI — Why deep learning is unstable despite the existence of stable neural networks. Preprint, arXiv:2109.06098.
[3] Evtimov, I., Eykholt, K., Fernandes, E., Kohno, T., Li, B., Prakash, A., ..., Song, D. (2017). Robust physical-world attacks on machine learning models. Preprint, arXiv:1707.08945.
[4] Goodfellow, I.J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. In Y. Bengio & Y. LeCun (Eds.), Third international conference on learning representations (ICLR 2015). San Diego, CA.
[5] Higham, C.F., & Higham, D.J. (2019). Deep learning: An introduction for applied mathematicians. SIAM Rev., 61(4), 860-891.
[6] LeCun, Y., Cortes, C., & Burges, C.J.C. (2019). The MNIST database of handwritten digits. Retrieved from http://yann.lecun.com/exdb/mnist.
[7] Shafahi, A., Huang, W.R., Studer, C., Feizi, S., & Goldstein, T. (2019). Are adversarial examples inevitable? In Seventh international conference on learning representations (ICLR 2019). New Orleans, LA.
[8] Tyukin, I.Y., Higham, D.J., & Gorban, A.N. (2020). On adversarial examples and stealth attacks in artificial intelligence systems. In 2020 International joint conference on neural networks (IJCNN) (pp. 1-6). Glasgow, UK: IEEE.
[9] Tyukin, I.Y., Higham, D.J., Gorban, A.N., & Woldegeorgis, E. (2021). The feasibility and inevitability of stealth attacks. Preprint, arXiv:2106.13997.
[10] Yuan, X., He, P., Zhu, Q., & Li, X. (2019). Adversarial examples: Attacks and defenses for deep learning. IEEE Trans. Neural Netw. Learn. Syst., 30(9), 2805-2824.
About the Authors
Alexander Bastounis
Lecturer, King’s College London
Alexander Bastounis is a lecturer at King’s College London. In 2019, he received a Leslie Fox Prize for Numerical Analysis.
Anders C. Hansen
Professor, University of Cambridge and University of Oslo
Anders C. Hansen is a professor of mathematics at the University of Cambridge and the University of Oslo.
Desmond J. Higham
Professor, University of Edinburgh
Desmond J. Higham is a professor of numerical analysis at the University of Edinburgh. He is a SIAM Fellow.
Ivan Y. Tyukin
Professor, King’s College London
Ivan Y. Tyukin is a professor of mathematical data science and modeling at King’s College London and a Turing AI Acceleration Fellow through U.K. Research and Innovation (UKRI).
Verner Vlačić
Tech Industry
Verner Vlačić graduated in 2020 with a Ph.D. in applied mathematics and engineering from ETH Zürich, where he worked on optimization and neural network theory. He currently works in the tech industry.