This section is a rather rapid tour of some cool ideas that get a lot of use in applied linear algebra. We are rather light on details here. The interested reader can consult sections 8.3–8.6 in the Nicholson textbook.
Subsection 4.6.1 Matrix Factorizations
Subsubsection 4.6.1.1 Positive Operators
Let \(T\) be a linear operator defined by a matrix \(A\text{.}\) If \(A\) is symmetric (for the case of \(\R^n\)) or hermitian (for the case of \(\C^n\)), we say that the operator \(T\) is self-adjoint.
Definition 4.6.1.
A self-adjoint operator \(T\) is positive if \(\xx^H T\xx\geq 0\) for all vectors \(\xx\text{.}\) It is positive-definite if \(\xx^H T\xx\gt 0\) for all nonzero \(\xx\text{.}\) If \(T=T_A\) for some matrix \(A\text{,}\) we also refer to \(A\) as a positive(-definite) matrix.
The definition of a positive matrix is equivalent to requiring that all of its eigenvalues are non-negative. Every positive matrix \(A\) has a unique positive square root: a matrix \(R\) such that \(R^2=A\text{.}\) Since \(A\) is symmetric/hermitian, it can be diagonalized. Writing \(A = PDP^{-1}\text{,}\) where \(P\) is orthogonal/unitary and \(D\) is diagonal with the (non-negative) eigenvalues of \(A\) on its diagonal, we can take \(R = P\sqrt{D}P^{-1}\text{,}\) where \(\sqrt{D}\) is obtained from \(D\) by taking the square root of each diagonal entry.
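For instance, here is a quick sketch of this construction in SymPy (the matrix \(A\) below is just a sample positive-definite matrix, with eigenvalues \(9\) and \(1\)):

from sympy import Matrix, sqrt

# A sample positive-definite matrix (eigenvalues 9 and 1)
A = Matrix([[5, 4], [4, 5]])

# Diagonalize A = P D P^(-1), take the square root of each diagonal entry,
# and conjugate back to get the positive square root R
P, D = A.diagonalize()
R = P * D.applyfunc(sqrt) * P.inv()

R, R**2 == A   # R should be [[2, 1], [1, 2]], and the comparison should return True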
A related factorization is suggested by the observation that if \(U\) is an upper-triangular matrix with positive diagonal entries, then \(U\) is invertible, and \(A=U^TU\) is positive-definite. What is interesting is that the converse is also true: every positive-definite matrix \(A\) has a Cholesky factorization \(A=U^TU\text{,}\) where \(U\) is upper-triangular, with positive diagonal entries.
Even better is that there is a very simple algorithm for obtaining the factorization: Carry the matrix \(A\) to triangular form, using only row operations of the type \(R_i+kR_j\to R_i\text{.}\) Then divide each row by the square root of the diagonal entry to obtain the matrix \(U\text{.}\)
The SymPy library contains the cholesky() algorithm. Note however that it produces a lower triangular matrix, rather than upper triangular. (That is, the output gives \(L=U^T\) rather than \(U\text{,}\) so you will have \(A=LL^T\text{.}\)) Let’s give it a try. First, let’s create a positive-definite matrix.
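For example, we might use the fact that \(B^TB\) is positive-definite whenever \(B\) is invertible (the particular matrix \(B\) below is just a sample):

from sympy import Matrix

# Any invertible matrix B gives a positive-definite matrix A = B^T B
B = Matrix([[1, 2, 0], [0, 1, 3], [1, 0, 1]])
A = B.T * B
A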
Next, find the Cholesky factorization:
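Continuing with the matrix \(A\) defined above, a sketch of that computation might look like this:

# SymPy's cholesky() returns a lower-triangular matrix L with A = L*L^T
L = A.cholesky()
L, L * L.T == A   # the comparison should return True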
Note that \(L\) is not the same as the matrix \(B\text{!}\)
Subsubsection 4.6.1.2 Singular Value Decomposition
For any \(n\times n\) matrix \(A\text{,}\) the matrices \(A^TA\) and \(AA^T\) are both positive. (Exercise!) This means that we can define \(\sqrt{A^TA}\text{,}\) even if \(A\) itself is not symmetric or positive.
Since \(A^TA\) is symmetric, we know that it can be diagonalized.
Since \(A^TA\) is positive, we know its eigenvalues are non-negative.
This means we can define the singular values \(\sigma_i = \sqrt{\lambda_i}\) for each \(i=1,\ldots, n\text{.}\)
Note: it’s possible to do this even if \(A\) is not a square matrix!
The SymPy library has a function for computing the singular values of a matrix. Given a matrix A, the command A.singular_values() will return its singular values. Try this for a few different matrices below:
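For example (the matrix here is just a sample; try your own):

from sympy import Matrix

A = Matrix([[2, 0], [1, 1]])
A.singular_values()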
In fact, SymPy can even return singular values for a matrix with variable entries! Try the following example from the SymPy documentation 1 .
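One such example (adapted from the documentation, so the exact matrix there may differ) is:

from sympy import Matrix, Symbol

x = Symbol('x', real=True)
A = Matrix([[0, 1, 0], [0, x, 0], [-1, 0, 0]])
A.singular_values()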
For an \(n\times n\) matrix \(A\text{,}\) we might not be able to diagonalize \(A\) (with a single orthonormal basis). However, it turns out that it’s always possible to find a pair of orthonormal bases \(\{e_1,\ldots, e_n\}, \{f_1,\ldots, f_n\}\) such that \(Af_i = \sigma_i e_i\) for each \(i=1,\ldots, n\text{.}\)
In matrix form, \(A = P\Sigma_A Q^T\) for orthogonal matrices \(P,Q\text{.}\)
In fact, this can be done even if \(A\) is not square, which is arguably the more interesting case! Let \(A\) be an \(m\times n\) matrix. We will find an \(m\times m\) orthogonal matrix \(P\) and \(n\times n\) orthogonal matrix \(Q\text{,}\) such that \(A=P\Sigma_A Q^T\text{,}\) where \(\Sigma_A\) is also \(m\times n\text{.}\)
The basis \(\{f_1,\ldots, f_n\}\) is an orthonormal basis of \(\R^n\) consisting of eigenvectors of \(A^TA\text{,}\) and the matrix \(Q\) is the matrix whose columns are the vectors \(f_i\text{.}\) As a result, \(Q\) is orthogonal.
The matrix \(\Sigma_A\) is the same size as \(A\text{.}\) First, we list the positive singular values of \(A\) in decreasing order: \(\sigma_1\geq \sigma_2\geq \cdots \geq \sigma_r\gt 0\text{.}\) These form the diagonal of the \(r\times r\) matrix \(D_A = \operatorname{diag}(\sigma_1,\ldots, \sigma_r)\text{,}\) and we set \(\Sigma_A = \begin{bmatrix}D_A\amp 0\\0\amp 0\end{bmatrix}\text{.}\)
That is, we put \(D_A\) in the upper-left, and then fill in zeros as needed, until \(\Sigma_A\) is the same size as \(A\text{.}\)
Next, we compute the vectors \(e_i = \frac{1}{\len{Af_i}}Af_i\text{,}\) for \(i=1,\ldots, r\text{.}\) As shown in Nicholson, \(\{e_1,\ldots, e_r\}\) will be an orthonormal basis for the column space of \(A\text{.}\) The matrix \(P\) is constructed by extending this to an orthonormal basis of \(\R^m\text{.}\)
All of this is a lot of work to do by hand, but it turns out that it can be done numerically, and more importantly, efficiently, by a computer. The SymPy library has an SVD algorithm, but it will not be efficient for larger matrices. In practice, most Python users will use the SVD algorithm provided by NumPy; we will stick with SymPy for simplicity and consistency.
Remark 4.6.3.
The version of the SVD given above is not the one used in computations, since it tends to be more resource intensive. In particular, it requires us to store more information than necessary: the last \(n-r\) rows of \(Q^T\) and the last \(m-r\) columns of \(P\) get multiplied by columns/rows of zeros in \(\Sigma_A\text{,}\) so we don’t really need to keep track of them.
Instead, most algorithms that you find will give the \(r\times r\) diagonal matrix \(D_A\text{,}\) consisting of the nonzero singular values, and \(P\) will be replaced by the \(m\times r\) matrix consisting of its first \(r\) columns, while \(Q^T\) gets replaced by the \(r\times n\) matrix consisting of its first \(r\) rows. The resulting product is still equal to the original matrix.
In some cases, even the matrix \(D_A\) is too large, and a decision is made to truncate to some smaller subset of singular values. In this case, the resulting product is no longer equal to the original matrix, but it does provide an approximation. A discussion can be found on Wikipedia 2 .
Example 4.6.4.
Find the singular value decomposition of the matrix \(A = \begin{bmatrix}1\amp 1\amp 1\\1\amp 0\amp -1\end{bmatrix}\text{.}\)
Solution.
Using SymPy, we get the condensed SVD 3 . First, let’s check the singular values.
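Using the matrix from the statement of the example:

from sympy import Matrix

A = Matrix([[1, 1, 1], [1, 0, -1]])
A.singular_values()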
Note that the values are not listed in decreasing order. Now, let’s ask for the singular value decomposition. The output consists of three matrices; the first line below assigns those matrices to the names P,S,Q.
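For example, assuming a version of SymPy recent enough to provide the singular_value_decomposition() method:

P, S, Q = A.singular_value_decomposition()
P, S, Q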
Note that the output is the “condensed” version, which doesn’t match the exposition above. It also doesn’t follow the same ordering convention: we’ll need to swap columns in each of the matrices. But it does give us a decomposition of the matrix \(A\text{:}\)
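One way to confirm this (assuming the returned matrices satisfy \(A = PSQ^T\)):

from sympy import simplify

# If A = P*S*Q^T, this difference should simplify to the zero matrix
simplify(P * S * Q.T - A)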
To match our earlier presentation, we first set \(\Sigma_A = \bbm \sqrt{3}\amp 0\amp 0\\0 \amp \sqrt{2}\amp 0\ebm\text{.}\) Next, we need to extend the \(3\times 2\) matrix in the output above to a \(3\times 3\) matrix. We can do this by choosing any vector orthogonal to the two existing columns, and normalizing. Let’s use entries \(1/\sqrt{6},-2/\sqrt{6},1/\sqrt{6}\text{.}\) Noting that we also need to swap the first two columns (to match the fact that we swapped columns in \(\Sigma_A\)), we get the matrix \(Q = \bbm 1/\sqrt{3}\amp 1/\sqrt{2}\amp 1/\sqrt{6}\\1/\sqrt{3}\amp 0\amp -2/\sqrt{6}\\1/\sqrt{3}\amp -1/\sqrt{2}\amp 1/\sqrt{6}\ebm\) (with the signs of the first two columns chosen to match the output above). Swapping the columns of the \(2\times 2\) orthogonal matrix in the output gives \(P\text{,}\) and then \(A = P\Sigma_A Q^T\text{.}\)
The Singular Value Decomposition has a lot of useful applications, some of which are described in Nicholson’s book. On a very fundamental level, the SVD provides us with information on some of the most essential properties of the matrix \(A\text{,}\) and of any system of equations with \(A\) as its coefficient matrix.
Recall the following definitions for an \(m\times n\) matrix \(A\text{:}\)
The rank of \(A\) is the number of leading ones in the RREF of \(A\text{,}\) which is also equal to the dimension of the column space of \(A\) (or if you prefer, the dimension of \(\im (T_A)\)).
The column space of \(A\text{,}\) denoted \(\csp(A)\text{,}\) is the subspace of \(\R^m\) spanned by the columns of \(A\text{.}\) (This is the image of the matrix transformation \(T_A\text{;}\) it is also the space of all vectors \(\mathbf{b}\) for which the system \(A\xx=\mathbf{b}\) is consistent.)
The row space of \(A\text{,}\) denoted \(\operatorname{row}(A)\text{,}\) is the span of the rows of \(A\text{,}\) viewed as column vectors in \(\R^n\text{.}\)
The null space of \(A\text{,}\) denoted \(\nll(A)\text{,}\) is the space of solutions to the homogeneous system \(A\xx=\zer\text{.}\) This is, of course, equal to the kernel of the associated transformation \(T_A\text{.}\)
There are some interesting relationships among these spaces, which are left as an exercise.
Exercise 4.6.5.
Let \(A\) be an \(m\times n\) matrix. Prove the following statements.
(a)
\((\operatorname{row}(A))^\bot = \nll(A)\)
Hint.
Note that \(\vv\in \nll(A)\) if and only if \(A\vv=\zer\text{,}\) and \(\vv\in(\operatorname{row}(A))^\bot\) if and only if \(\vv\cdot \mathbf{r}_i=\zer\) for each row \(\mathbf{r}_i\) of \(A\text{.}\)
Note also that \((A\vv)^T=\vv^T A^T\) is the (dot) product of \(\vv^T\) with each column of \(A^T\text{,}\) and each column of \(A^T\) is a row of \(A\text{.}\)
(b)
\((\csp(A))^\bot = \nll(A^T)\)
Hint.
Notice that \(\vv\in \nll(A^T)\) if and only if \(A^T\vv=\zer\text{,}\) and that \((A^T\vv)^T=\vv^T A\text{.}\) Your reasoning should be similar to that of the previous part.
Here’s the cool thing about the SVD. Let \(\sigma_1\geq \sigma_2\geq \cdots \geq \sigma_r\gt 0\) be the positive singular values of \(A\text{.}\) Let \(\vecq_1,\ldots, \vecq_r,\ldots, \vecq_n\) be the orthonormal basis of eigenvectors for \(A^TA\text{,}\) and let \(\vecp_1,\ldots, \vecp_r,\ldots, \vecp_m\) be the orthonormal basis of \(\R^m\) constructed in the SVD algorithm. Then:
\(\displaystyle \rank(A)=r\)
\(\vecq_1,\ldots, \vecq_r\) form a basis for \(\operatorname{row}(A)\text{.}\)
\(\vecp_1,\ldots, \vecp_r\) form a basis for \(\csp(A)\) (and thus, the “row rank” and “column rank” of \(A\) are the same).
\(\vecq_{r+1},\ldots, \vecq_n\) form a basis for \(\nll(A)\text{.}\) (And these are therefore the basis solutions of \(A\xx=\zer\text{!}\))
\(\vecp_{r+1},\ldots, \vecp_m\) form a basis for \(\nll(A^T)\text{.}\)
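To see some of these facts in action, here is a short illustrative computation with the matrix from Example 4.6.4 (the particular checks chosen here are just for demonstration):

from sympy import Matrix

A = Matrix([[1, 1, 1], [1, 0, -1]])

# The rank of A equals the number of positive singular values
print(A.rank())
print(A.singular_values())

# The null space of A is spanned by the eigenvectors of A^T A
# corresponding to the eigenvalue 0
print(A.nullspace())
print((A.T * A).eigenvects())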
If you want to explore this further, have a look at the excellent notebook by Dr. Juan H Klopper 4 . The ipynb file can be found on his GitHub page 5 . In it, he takes you through various approaches to finding the singular value decomposition, using the method above, as well as using NumPy and SciPy (which, for industrial applications, are superior to SymPy).
Subsubsection 4.6.1.3 QR Factorization
Suppose \(A\) is an \(m\times n\) matrix with independent columns. (Question: for this to happen, which is true — \(m\geq n\text{,}\) or \(n\geq m\text{?}\))
A \(QR\)-factorization of \(A\) is a factorization of the form \(A=QR\text{,}\) where \(Q\) is \(m\times n\text{,}\) with orthonormal columns, and \(R\) is an invertible upper-triangular (\(n\times n\)) matrix with positive diagonal entries. If \(A\) is a square matrix, \(Q\) will be orthogonal.
A lot of the methods we’re looking at here involve more sophisticated numerical techniques than SymPy is designed to handle. If we wanted to spend time on these topics, we’d have to learn a bit about the NumPy package, which has built in tools for finding things like polar decomposition and singular value decomposition. However, SymPy does know how to do \(QR\) factorization. After defining a matrix A, we can use the command
Q, R = A.QRdecomposition()
Let’s check that the matrix \(Q\) really is orthogonal:
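For example, with a sample square matrix (so that \(Q\) is square, and hence orthogonal):

from sympy import Matrix

# A sample invertible matrix, chosen for illustration
A = Matrix([[1, 1, 0], [1, 0, 1], [0, 1, 1]])
Q, R = A.QRdecomposition()

Q.T * Q   # should display the 3x3 identity matrix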
Details of how to perform the QR factorization can be found in Nicholson’s textbook. It’s essentially a consequence of performing the Gram-Schmidt algorithm on the columns of \(A\text{,}\) and keeping track of our work.
The calculation above is a symbolic computation, which is nice for understanding what’s going on. The reason why the \(QR\) factorization is useful in practice is that there are efficient numerical methods for doing it (with good control over rounding errors). Our next topic looks at a useful application of the \(QR\) factorization.
Subsection 4.6.2 Computing Eigenvalues
Our first method focuses on the dominant eigenvalue of a matrix. An eigenvalue is dominant if it is larger in absolute value than all other eigenvalues. For example, if \(A\) has eigenvalues \(1,3,-2,-5\text{,}\) then \(-5\) is the dominant eigenvalue.
If \(A\) has eigenvalues \(1,3,0,-4,4\) then there is no dominant eigenvalue. Any eigenvector corresponding to a dominant eigenvalue is called a dominant eigenvector.
Subsubsection 4.6.2.1 The Power Method
If a matrix \(A\) has a dominant eigenvalue, there is a method for finding it (approximately) that does not involve finding and factoring the characteristic polynomial of \(A\text{.}\)
We start with some initial guess \(\xx_0\) for a dominant eigenvector. We then set \(\xx_{k+1} = A\xx_k\) for each \(k\geq 0\text{,}\) giving a sequence \(\xx_0, A\xx_0, A^2\xx_0, A^3\xx_0, \ldots\text{.}\)
We expect (for reasons we’ll explain below) that the direction of \(\xx_k\) approaches the direction of a dominant eigenvector \(\xx\) as \(k\to\infty\text{;}\) that is, \(\xx_k\) gets closer and closer to a scalar multiple of \(\xx\text{.}\) Let’s try an example.
The dominant eigenvalue is \(\lambda = 7\text{.}\) Let’s try an initial guess of \(\xx_0=\begin{bmatrix}1\\0\end{bmatrix}\) and see what happens.
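For instance, we might take \(A = \begin{bmatrix}1\amp 4\\3\amp 5\end{bmatrix}\text{,}\) which has eigenvalues \(7\) and \(-1\text{,}\) with \(\begin{bmatrix}2\\3\end{bmatrix}\) as a dominant eigenvector. A sketch of the iteration:

from sympy import Matrix

# Sample matrix with eigenvalues 7 and -1, and dominant eigenvector (2, 3)
A = Matrix([[1, 4], [3, 5]])
x = Matrix([1, 0])    # initial guess x_0

# Apply x_{k+1} = A x_k several times
for k in range(10):
    x = A * x

x, x[0] / x[1]   # display the iterate and the ratio of its entries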
We might want to confirm whether that rather large fraction is close to \(\frac23\text{.}\) To do so, we can get the computer to divide the numerator by the denominator.
The above might show you the fraction rather than its decimal approximation. (This may depend on whether you’re on Sage or Jupyter.) To get the decimal, try wrapping the above in float() (or N, or append with .evalf()).
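For instance, continuing from the iteration above:

# Convert the exact fraction to a decimal for comparison with 2/3
float(x[0] / x[1])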
For the eigenvalue, we note that if \(A\xx=\lambda \xx\text{,}\) then \(\frac{\xx\cdot (A\xx)}{\xx\cdot \xx} = \frac{\xx\cdot (\lambda\xx)}{\xx\cdot \xx} = \lambda\frac{\xx\cdot\xx}{\xx\cdot\xx} = \lambda\text{.}\) Since \(\xx_k\) is approximately an eigenvector for large \(k\text{,}\) the quotient \(\frac{\xx_k\cdot (A\xx_k)}{\xx_k\cdot \xx_k}\) (the Rayleigh quotient) gives an estimate of the dominant eigenvalue.
We can convert a rational number r to a float using either N(r) or r.evalf(). (The latter seems to be the better bet when working with a list.)
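Putting this together, we can estimate the dominant eigenvalue from the latest iterate (continuing with the sample matrix above):

from sympy import N

# Rayleigh quotient: (x . Ax)/(x . x) approximates the dominant eigenvalue
estimate = x.dot(A * x) / x.dot(x)
estimate, N(estimate)   # should be close to 7 for the matrix above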
So why does this work? Suppose \(A\) is diagonalizable, so that there is a basis of eigenvectors \(B = \{\vv_1,\ldots, \vv_n\}\) for \(\R^n\text{.}\) Suppose also that \(A\) has dominant eigenvalue \(\lambda_1\text{,}\) with corresponding eigenvector \(\vv_1\text{.}\) Then our initial guess \(\xx_0\) can be written in terms of this basis: \(\xx_0 = c_1\vv_1+c_2\vv_2+\cdots +c_n\vv_n\text{.}\) Applying \(A\) repeatedly gives \(\xx_k = A^k\xx_0 = c_1\lambda_1^k\vv_1+c_2\lambda_2^k\vv_2+\cdots +c_n\lambda_n^k\vv_n = \lambda_1^k\left(c_1\vv_1+c_2\left(\frac{\lambda_2}{\lambda_1}\right)^k\vv_2+\cdots +c_n\left(\frac{\lambda_n}{\lambda_1}\right)^k\vv_n\right)\text{,}\) and since \(\lambda_1\) is larger than all the other eigenvalues in absolute value, the powers \(\left(\frac{\lambda_m}{\lambda_1}\right)^k\text{,}\) for \(m=2,\ldots, n\text{,}\) will be very small when \(k\) is large.
If \(\vv_1\) is an eigenvector corresponding to the dominant eigenvalue \(\lambda_1\text{,}\) then so is the scalar multiple \(c_1\lambda_1^k\vv_1\) (provided \(c_1\neq 0\)), so for large \(k\text{,}\) \(\xx_k\) is approximately equal to an eigenvector of \(A\text{.}\)
Two things are worth noting:
The result does not depend on the initial guess \(\xx_0\text{,}\) unless we choose an \(\xx_0\) for which the coefficient \(c_1\) is 0. If this turns out to be the case, we must choose another initial guess.
If \(A\) is not diagonalizable, we can still use this method, but we must choose \(\xx_0\) from the subspace spanned by a set of eigenvectors for \(A\text{.}\)
Subsubsection 4.6.2.2 The QR Algorithm
Given an \(n\times n\) matrix \(A\text{,}\) we know we can write \(A=QR\text{,}\) with \(Q\) orthogonal and \(R\) upper-triangular. The \(QR\)-algorithm exploits this fact. We set \(A_1=A\text{,}\) and write \(A_1=Q_1R_1\text{.}\)
Then we set \(A_2 = R_1Q_1\text{,}\) and factor: \(A_2=Q_2R_2\text{.}\) Notice that \(A_2 = R_1Q_1 = Q_1^TA_1Q_1\text{.}\) Since \(A_2\) is similar to \(A_1\text{,}\) \(A_2\) has the same eigenvalues as \(A_1=A\text{.}\)
Next, set \(A_3 = R_2Q_2\text{,}\) and factor as \(A_3 = Q_3R_3\text{.}\) Since \(A_3 = Q_2^TA_2Q_2\text{,}\) \(A_3\) has the same eigenvalues as \(A_2\text{.}\) In fact, \(A_3 = Q_2^T(Q_1^TAQ_1)Q_2 = (Q_1Q_2)^TA(Q_1Q_2)\text{.}\)
After \(k\) steps we have \(A_{k+1} = (Q_1\cdots Q_k)^TA(Q_1\cdots Q_k)\text{,}\) which still has the same eigenvalues as \(A\text{.}\) By some sort of dark magic, this sequence of matrices converges to an upper triangular matrix with eigenvalues on the diagonal!
Consider the matrix \(A = \begin{bmatrix}5&-2&3\\0&4&0\\0&-1&3\end{bmatrix}\text{.}\)
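One way to carry out the first step:

from sympy import Matrix

A = Matrix([[5, -2, 3], [0, 4, 0], [0, -1, 3]])

# First step: factor A = Q1 R1, then form A2 = R1 Q1
Q1, R1 = A.QRdecomposition()
A2 = R1 * Q1
A2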
Now we repeat the process:
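For instance:

# Second step: factor A2 = Q2 R2, then form A3 = R2 Q2
Q2, R2 = A2.QRdecomposition()
A3 = R2 * Q2
A3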
Do this a few more times, and see what results! (If someone can come up with a way to code this as a loop, let me know!) The diagonal entries should get closer to \(5,4,3\text{,}\) respectively, and the \((3,2)\) entry should get closer to \(0\text{.}\)
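Here is one possible way to code the iteration as a loop (just a sketch; we convert the entries to floating point first, so that the expressions stay manageable):

from sympy import Matrix

A = Matrix([[5, -2, 3], [0, 4, 0], [0, -1, 3]])

# Work numerically so the entries stay as floats rather than nested radicals
Ak = A.evalf()
for k in range(10):
    Q, R = Ak.QRdecomposition()
    Ak = R * Q

Ak.evalf(4)   # the diagonal should be approaching 5, 4, 3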
Exercises 4.6.3
1.
Find the singular values \(\sigma_1 \ge \sigma_2\) of