Five Key Concepts That Shaped Iterative Solution Methods for Linear Systems

he advent of electronic computers in the mid-20th century played a pivotal role in defining a new era for numerical linear algebra. George Forsythe’s remarkable 1953 article—enigmatically titled “Solving Linear Algebraic Equations Can Be Interesting”—serves as a testament to these origins [2]; his writing displays amazing vision and addresses key topics in the nascent field, several of which would become foundational. The article introduces the concept of the condition number, considers the impact of finite precision arithmetic, and contemplates the idea of acceleration. It also discusses iterative methods, including the conjugate gradient (CG) algorithm — the “newest process on the roster” that had been discovered a few years earlier [5]. However, hints of tension are evident in the early parts of the paper. Forsythe seems somewhat apologetic about the topic, warning that “The subject of this talk is mathematically a lowly one.” A footnote also informs readers that the original title was “Solving Linear Equations Is Not Trivial.” It is important to recognize that during the 1950s, numerical linear algebra (NLA) was just beginning to establish its presence and had not yet gained widespread acceptance as a legitimate field.

We can trace the evolution of iterative methods from Carl Friedrich Gauss’ era to the present day by examining a few pivotal concepts. Five such “big ideas” constitute the foundational elements of these methods.

First Big Idea: Relaxation

In the early 19th century, Gauss and Adrien-Marie Legendre each invented the method of least squares (LS) within a few years of each other, and both noted that an ordinary elimination—i.e., a direct method—can yield the solution of the resulting system [4]. In an 1823 letter to his former doctoral student Christian Ludwig Gerling, Gauss introduced a method of indirect elimination to solve these systems. We write the system in question as

\[Ax=b \quad \text{or} \quad a_i^T x = \beta_i \quad \text{for} \quad i=1,2,\cdots, n,\tag1\]

where \(a_i^T\) is the \(i\)th row of \(A\) and \(b = [\beta_1, \beta_2, \cdots, \beta_n]^T\). The residual vector is \(r = b -A x,\) \(= [\rho_1, \rho_2, \cdots, \rho_n]^T\). Gauss started with an initial approximation to the solution and then updated its components one by one to zero out the largest residual component at each time. If \(|\rho_i|=\max_{j=1:n} |\rho_j|\), he would thus modify the \(i\)th component of the current \(x\) to \(x_i^{(\textrm{new})} :=x_i+\delta_i\) so that \(\rho_i^{(\textrm{new})} = 0\). The condition \(a_i^T (x + \delta e_i) - \beta_i = 0\) yields \(\delta_i = \frac{\rho_i}{a_{ii}}\). This process repeats until every \(\rho_i\) is sufficiently small.

In addition to highlighting its simplicity, Gauss emphasized another significant advantage of the method: users could calculate updates with only a few digits of accuracy, which was particularly appealing since calculations were done by hand at the time. Clearly excited by this new method, Gauss concluded his letter by saying, “I recommend this method to you for imitation. You will hardly ever again eliminate directly, at least not when you have more than two unknowns.” Extensions of this basic approach or techniques that were similar in spirit dominated the iterative method scene for decades, and the dawn of electronic computers in the 1950s brought about a renewed interest in relaxation methods. These methods—aided by contributions from giants like David Young, Stanley Frankel, and Richard Varga—took center stage until approximately the 1970s.

Second Big Idea: Projection

Projection processes allow us to extract an approximate solution to a problem from a subspace. When we apply these processes to linear systems of equations, we assume the knowledge of some initial guess \(x_0\) to the solution and two subspaces \(K\) and \(L\) (both of dimension \(m\)). We use these assumptions to formulate the following projected problem:

\[\mbox{Find} \quad \tilde x = x_0 + \delta, \ \delta \in K \quad \mbox{such that}\quad b-A \tilde x \perp L.\tag2\]

With \(m\) degrees of freedom \((K)\) and \(m\) constraints \((L)\), \((2)\) results in an \(m \times m\) linear system that, under mild conditions, is nonsingular. We can translate it into matrix form by exploiting bases \(V = [ v_1, \ldots, v_m ]\) for \(K\) and \(W = [ w_1, \ldots, w_m ]\) for \(L\). The approximate solution then becomes \(\tilde x = x_0 + \delta \equiv x_0 + V y\), where \(y \in \mathbb{R}^m\). The orthogonality constraints yield

\[W^T (r_0 - AVy)= 0 \to \tilde x = x_0 + V [ W^T AV]^{-1}W^T r_0.\]

This projection process has important optimality properties in two particular cases.

Orthogonal Projection (OP) Methods: When \(K=L\) and matrix \(A\) is symmetric positive definite (SPD), we can show that \(\tilde x \) minimizes \(( A (x - x_*), (x - x_*)) \equiv \| x - x_* \|_A^2\) over all vectors \(x\) in the affine space \(x_0 + K\), where \(x_*\) is the exact solution.

A representative of the OP class of methods is the well-known steepest descent algorithm for SPD matrices. This algorithm corresponds to the application of a projection step with one-dimensional subspaces \(L=K = \textrm{Span} \{r\}\). We can describe the iteration as \(\tilde x = x + \alpha r\), where \(\alpha := (r,r) / (Ar,r)\).

Minimal Residual (MR) Methods: When \(L = AK\), \(\tilde x\) minimizes the Euclidean norm of the residual over the affine space \(x_0 + K\). A representative of the MR class is the minimal residual iteration, which corresponds to taking \(K = \textrm{Span} \{r\}\) and \(L=AK\). The iteration thus becomes \(\tilde x = x + \alpha r\), where \(\alpha := (r, A r) / (A r, A r)\). It does not break down if \(A\) is nonsingular and will converge if \(A+A^T\) is positive definite.

Third Big Idea: Polynomial Acceleration

Consider an iterative scheme that takes the form \(x_{k+1} = x_{k} +\alpha_k r_k\). Steepest descent and the MR iteration are both of this form, where \(\alpha_k\) is calculated in a greedy, shortsighted way. Polynomial iteration aims to calculate the \(\alpha_i \textrm{s}\) by taking a more global view. From \(r_{k+1} = b - A (x_{k} +\alpha_k r_k)\), we obtain \(r_{k+1} = r_k - \alpha_k A r_k = (I -\alpha_k A ) r_k\). This leads to the relation

\[r_{k+1} = (I -\alpha_k A ) (I -\alpha_{k-1} A )\cdots (I -\alpha_0 A ) r_0 \equiv p_{k+1}(A) r_0,\tag3\]

where \(p_{k+1}(t) = (1 -\alpha_k t ) (1 -\alpha_{k-1} t )\cdots (1 -\alpha_0 t )\) is a polynomial of degree \(k+1\). Note that \(p_{k+1} (0) =1\).

Lewis Fry Richardson was the first person to formulate the problem of selecting the \(\alpha_i \textrm{s}\) with a goal of minimizing the error \(u_{k+1} \equiv A^{-1} r_{k+1}\). Assume that \(A\) is SPD with eigenvalues in an interval \([\alpha, \beta]\) where \(\alpha > 0\), and let \(\mathbb{P}_{k+1,0}\) be the set of polynomials \(p\) of degree \(k+1\) such that \(p(0) = 1\). To minimize the maximum deviation of \(p_{k+1} (t)\) from \(0\) in the interval, we must find a polynomial \(p \in \mathbb{P}_{k+1,0}\) for which \(\max_{t \ \in \ [\alpha, \beta] } |p(t)|\) is minimal. The solution is known in terms of \(C_{k+1}(t)\), i.e., the Chebyshev polynomial of the first kind of degree \(k+1\):

\[T_{k+1} (t) \equiv {1 \over \sigma_{k+1} } C_{k+1} \left( {{ \beta + \alpha - 2 t } \over {\beta-\alpha} } \right) \quad \mbox{with} \quad \sigma_{k+1} \equiv C_{k+1} \left( {{\beta + \alpha }\over{\beta - \alpha} } \right).\tag4\]

In the context of \((3)\), we see that the best \(\alpha_k \textrm{s}\) are the inverses of the roots of \(T_{k+1}\). Richardson was seemingly unaware of Chebyshev polynomials, as he selected the roots \(1/\alpha_i\) by spreading them in an ad hoc fashion within \([\alpha, \beta]\).

It took more than four decades for ideas based on Chebyshev polynomials to emerge. Young, Cornelius Lanczos, and George Shortley were among the first researchers to invoke the concept, though their methods did not yield numerically reliable algorithms because they did not fully exploit the three-term recurrence of Chebyshev polynomials. While Lanczos did employ the three-term recurrence, his approach was a preprocessing scheme that was not quite related to Chebyshev acceleration. In fact, it seems that John von Neumann actually described the first acceleration scheme based on Chebyshev polynomials that exploits the three-term recurrence [1]. The article published in 1959 but von Neumann died in early 1957, so he must have developed the method in 1956 or earlier. In 1961, Varga and Gene Golub published a very similar technique—called the semi-iterative method—that acknowledged von Neumann’s prior work in a footnote [3]. We can easily derive the Chebyshev acceleration algorithm from \((4)\) and the three-term recurrence of the \(C_k \textrm{s}\) [1, 3, 8].

From left to right: Magnus Hestenes, Eduard Stiefel, and Cornelius Lanczos — three major contributors to Krylov subspace methods. Image of Hestenes courtesy of <a href="https://commons.wikimedia.org/wiki/File:Magnus_Hestenes.jpg" rel="noopener noreferrer" target="_blank">Konrad Jacobs and Wikimedia</a> under the <a href="https://creativecommons.org/licenses/by-sa/2.0/de/deed.en" rel="noopener noreferrer" target="_blank">Creative Commons Attribution-ShareAlike 2.0 (CC BY-SA 2.0) Germany license</a>; image of Stiefel courtesy of <a href="https://commons.wikimedia.org/wiki/File:Eduard_Stiefel_ETH-Bib_Portr_00817.jpg" rel="noopener noreferrer" target="_blank">ETH-Bibliothek Zürich, Bildarchiv and Wikimedia</a> under the <a href="https://creativecommons.org/licenses/by-sa/3.0/deed.en" rel="noopener noreferrer" target="_blank">Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) Unported license</a>; and image of Lanczos courtesy of <a href="https://books.google.com/books?id=XzVHBJ1zOAAC&lpg=PR20&pg=PR18#v=onepage&q&f=false" rel="noopener noreferrer" target="_blank">Ida Rhodes and SIAM</a>. — From left to right: Magnus Hestenes, Eduard Stiefel, and Cornelius Lanczos — three major contributors to Krylov subspace methods. Image of Hestenes courtesy of Konrad Jacobs and Wikimedia under the Creative Commons Attribution-ShareAlike 2.0 (CC BY-SA 2.0) Germany license; image of Stiefel courtesy of ETH-Bibliothek Zürich, Bildarchiv and Wikimedia under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) Unported license; and image of Lanczos courtesy of Ida Rhodes and SIAM.

Fourth Big Idea: Krylov Subspace Methods

Krylov subspace methods for the solution of \((1)\) are projection methods on the Krylov subspace \(K_m (A,b) = \textrm{Span} \{ b, A b, \cdots, A^{m-1} b \}\). Here, we simplify the notation by assuming that the initial guess is \(x_0 \equiv 0\). We now return to the two aforementioned cases.

OP Case: When \(A\) is SPD, an OP approach yields an approximate solution that minimizes the scalar function \(f(x) = \frac{1}{2}x^T Ax - b^T x\) over all \(x\) vectors in \(K_m(A,b)\). Magnus Hestenes and Eduard Stiefel’s implementation of this method resulted in the conjugate gradient algorithm [5]. The process invokes purely geometric arguments; Hestenes and Stiefel’s insights from the two-dimensional case and their knowledge of conics inspired them to exploit conjugate directions.

Lanczos developed a similar method from a completely different viewpoint. He utilized an MR approach and relied on what is now called a Lanczos basis to implement it. Initially, critics considered the CG algorithm to be an unstable direct solution method. It therefore laid dormant until the mid-1970s, when it resurfaced with force.

MR Case: Several projection techniques on the subspace \(K_m (A,b)\) emerged in the late 1970s, with the objective of minimizing the residual norm over the subspace. An implementation with an orthonormal basis \(K_m(A,b)\) leads to the generalized minimal residual (GMRES) method; other implementations include generalized CG-LS, Orthomin, Orthodir, and the generalized conjugate residual. A considerable volume of work has followed these beginnings of Krylov methods for nonsymmetric linear systems [7].

Fifth Big Idea: Preconditioning

Krylov subspace methods can achieve fast convergence when matrices have spectra that are clustered around \(1\) and are not highly nonnormal; resorting to preconditioning can exploit these properties. For example, we can use a Krylov subspace method to solve a system like \(M^{-1}Ax=M^{-1}b\) (instead of the original system), where the preconditioning matrix \(M\) is close to \(A\) in some sense. However, systems such as \(Mx=f\) are inexpensive to solve. A common approach is the incomplete lower-upper (LU) factorization \(M=LU\), which stems from an approximate LU factorization of \(A\). The combination of Krylov methods and various forms of preconditioning gives rise to one of the most important and effective present-day iterative methods.

Looking Forward

The new wave in NLA embraces randomness and statistical analysis; in this context, standard “optimal” methods (such as CG and GMRES) are not as useful and those that exploit short-term recurrences lose their appeal. The current buoyant activity in randomized NLA is somewhat reminiscent of the golden era of iterative methods three decades ago. It remains to be seen whether this ongoing transformation will last. Nevertheless, the basic tools from the past still constitute key ingredients for future methods.

What will be the next “big idea” in NLA? Big ideas typically result from a pressing need to solve well-defined problems (e.g., the “flutter problem” in the 1940s) and the drive of bright, motivated researchers with exceptional knowledge and vision [6]. It is evident that machine learning and data-driven approaches have become the primary catalysts in research. As for motivating bright researchers, it is important to actively share our work to capture readers’ interests and potentially spark inspiration for the next prominent star. The significance of thoughtfully conveying ideas and disseminating software and other artifacts cannot be overstated, but the fast-paced and competitive nature of our discipline often leaves little time for such initiatives. Nonetheless, the gratification of witnessing their impact validates these efforts.

Yousef Saad received the 2023 John von Neumann Prize and delivered the associated lecture at the 10th International Congress on Industrial and Applied Mathematics, which took place in Tokyo, Japan, last year.

References

[1] Blair, A., Metropolis, N., von Neumann, J., Taub, A.H., & Tsingou, M. (1959). A study of a numerical solution to a two-dimensional hydrodynamical problem. Math. Comput., 13, 145-184.
[2] Forsythe, G.E. (1953). Solving linear algebraic equations can be interesting. Bull. Amer. Math. Soc., 59(4), 299-329.
[3] Golub, G.H., & Varga, R.S. (1961). Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods. Numer. Math., 3, 147-156.
[4] Grcar, J.F. (2011). Mathematicians of Gaussian elimination. Not. Am. Math. Soc., 58(6), 782-792.
[5] Hestenes, M.R., & Stiefel, E.L. (1952). Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand., 49(6), 409-436.
[6] Hestenes, M.R., & Todd, J. (1991). NBS-INA—the Institute for Numerical Analysis—UCLA 1947-1954. (NIST Special Publication 730). Washington, D.C.: National Institute of Standards and Technology.
[7] Meurant, G., & Tebbens, J.D. (2020). Krylov methods for nonsymmetric linear systems: From theory to computations. In Springer series in computational mathematics (Vol. 57). Cham, Switzerland: Springer Nature.
[8] Saad, Y. (2011). Numerical methods for large eigenvalue problems. In Classics in applied mathematics. Philadelphia, PA: Society for Industrial and Applied Mathematics.

About the Author

Yousef Saad

Professo, University of Minnesota

Yousef Saad is a College of Science and Engineering Distinguished Professor in the Department of Computer Science and Engineering at the University of Minnesota. He joined the faculty of Minnesota in 1990.