
Applied Linear Algebra
MAT 3341

Spring/Summer 2019

Alistair Savage

Department of Mathematics and Statistics

University of Ottawa

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License


Contents

Preface

1 Matrix algebra
  1.1 Conventions and notation
  1.2 Matrix arithmetic
  1.3 Matrices and linear transformations
  1.4 Gaussian elimination
  1.5 Matrix inverses
  1.6 LU factorization

2 Matrix norms, sensitivity, and conditioning
  2.1 Motivation
  2.2 Normed vector spaces
  2.3 Matrix norms
  2.4 Conditioning

3 Orthogonality
  3.1 Orthogonal complements and projections
  3.2 Diagonalization
  3.3 Hermitian and unitary matrices
  3.4 The spectral theorem
  3.5 Positive definite matrices
  3.6 QR factorization
  3.7 Computing eigenvalues

4 Generalized diagonalization
  4.1 Singular value decomposition
  4.2 Fundamental subspaces and principal components
  4.3 Pseudoinverses
  4.4 Jordan canonical form
  4.5 The matrix exponential

5 Quadratic forms
  5.1 Definitions
  5.2 Diagonalization of quadratic forms
  5.3 Rayleigh's principle and the min-max theorem

Index


Preface

These are notes for the course Applied Linear Algebra (MAT 3341) at the University of Ottawa. This is a third course in linear algebra. The prerequisites are uOttawa courses MAT 1322 and (MAT 2141 or MAT 2342).

In this course we will explore aspects of linear algebra that are of particular use in concrete applications. For example, we will learn how to factor matrices in various ways that aid in solving linear systems. We will also learn how one can effectively compute estimates of eigenvalues when solving for precise ones is impractical. In addition, we will investigate the theory of quadratic forms. The course will involve a mixture of theory and computation. It is important to understand why our methods work (the theory) in addition to being able to apply the methods themselves (the computation).

Acknowledgements: I would like to thank Benoit Dionne, Monica Nevins, and Mike Newman for sharing with me their lecture notes for this course.

Alistair Savage

Course website: https://alistairsavage.ca/mat3341


Chapter 1

Matrix algebra

We begin this chapter by briefly recalling some matrix algebra that you learned in previous courses. In particular, we review matrix arithmetic (matrix addition, scalar multiplication, the transpose, and matrix multiplication), linear transformations, and gaussian elimination (row reduction). Next we discuss matrix inverses. Although you have seen the concept of a matrix inverse in previous courses, we delve into the topic in further detail. In particular, we will investigate the concept of one-sided inverses. We then conclude the chapter with a discussion of LU factorization, which is a very useful technique for solving linear systems.

1.1 Conventions and notation

We let Z denote the set of integers, and let N = {0, 1, 2, . . . } denote the set of natural numbers.

In this course we will work over the field R of real numbers or the field C of complex numbers, unless otherwise specified. To handle both cases simultaneously, we will use the notation F to denote the field of scalars. So F = R or F = C, unless otherwise specified. We call the elements of F scalars. We let F× = F \ {0} denote the set of nonzero scalars.

We will use uppercase roman letters to denote matrices: A, B, C, M, N, etc. We will use the corresponding lowercase letters to denote the entries of a matrix. Thus, for instance, aij is the (i, j)-entry of the matrix A. We will sometimes write A = [aij] to emphasize this. In some cases, we will separate the indices with a comma when there is some chance for confusion, e.g. ai,i+1 versus aii+1.

Recall that a matrix A has size m× n if it has m rows and n columns:

A = [ a11  a12  · · ·  a1n ]
    [ a21  a22  · · ·  a2n ]
    [  ⋮    ⋮     ⋱     ⋮  ]
    [ am1  am2  · · ·  amn ].

We let Mm,n(F) denote the set of all m × n matrices with entries in F. We let GL(n,F) denote the set of all invertible n × n matrices with entries in F. (Here 'GL' stands for general


linear group.) If a1, . . . , an ∈ F, we define

diag(a1, . . . , an) = [ a1  0   · · ·  0  ]
                      [ 0   a2  · · ·  0  ]
                      [ ⋮    ⋮    ⋱    ⋮  ]
                      [ 0   0   · · ·  an ].

We will use boldface lowercase letters a, b, x, y, etc. to denote vectors. (In class, we

will often write vectors as ~a, ~b, etc. since bold is hard to write on the blackboard.) Most of the time, our vectors will be elements of Fn. (Although, in general, they can be elements of any vector space.) For vectors in Fn, we denote their components with the corresponding non-bold letter with subscripts. We will write vectors x ∈ Fn in column notation:

x = [ x1 ]
    [ x2 ]
    [ ⋮  ]
    [ xn ],    x1, x2, . . . , xn ∈ F.

Sometimes, to save space, we will also write this vector as

x = (x1, x2, . . . , xn).

For 1 ≤ i ≤ n, we let ei denote the i-th standard basis vector of Fn. This is the vector

ei = (0, . . . , 0, 1, 0, . . . , 0),

where the 1 is in the i-th position. Then {e1, e2, · · · , en} is a basis for Fn. Indeed, everyx ∈ Fn can be written uniquely as the linear combination

x = x1e1 + x2e2 + · · ·+ xnen.

1.2 Matrix arithmetic

We now quickly review the basic matrix operations. Further detail on the material in thissection can be found in [Nic, §§2.1–2.3].

1.2.1 Matrix addition and scalar multiplication

We add matrices of the same size componentwise:

A+B = [aij + bij].

If A and B are of different sizes, then the sum A+B is not defined. We define the negativeof a matrix A by

−A = [−aij]


Then the difference of matrices of the same size is defined by

A−B = A+ (−B) = [aij − bij].

If k ∈ F is a scalar, then we define the scalar multiple

kA = [kaij].

We denote the zero matrix by 0. This is the matrix with all entries equal to zero. Note that there is some possibility for confusion here since we will use 0 to denote the real (or complex) number zero, as well as the zero matrices of different sizes. The context should make it clear which zero we mean. The context should also make clear what size of zero matrix we are considering. For example, if A ∈ Mm,n(F) and we write A + 0, then 0 must denote the m × n zero matrix here.

The following theorem summarizes the important properties of matrix addition and scalarmultiplication.

Proposition 1.2.1. Let A, B, and C be m × n matrices and let k, p ∈ F be scalars. Thenwe have the following:

(a) A+B = B + A (commutativity)

(b) A+ (B + C) = (A+B) + C (associativity)

(c) 0 + A = A (0 is an additive identity)

(d) A+ (−A) = 0 (−A is the additive inverse of A)

(e) k(A+B) = kA+ kB (scalar multiplication is distributive over matrix addition)

(f) (k + p)A = kA+ pA (scalar multiplication is distributive over scalar addition)

(g) (kp)A = k(pA)

(h) 1A = A

Remark 1.2.2. Proposition 1.2.1 can be summarized as stating that the set Mm,n(F) is avector space over the field F under the operations of matrix addition and scalar multiplication.

1.2.2 Transpose

The transpose of an m × n matrix A, written AT , is the n ×m matrix whose rows are thecolumns of A in the same order. In other words, the (i, j)-entry of AT is the (j, i)-entry ofA. So,

if A = [aij], then AT = [aji].

We say the matrix A is symmetric if AT = A. Note that this implies that all symmetricmatrices are square, that is, they are of size n× n for some n.

Example 1.2.3. We have

[ π  i   −1 ]T    [  π    5  ]
[ 5  7  3/2 ]   = [  i    7  ]
                  [ −1   3/2 ].


The matrix

[  1  −5  7 ]
[ −5   0  8 ]
[  7   8  9 ]

is symmetric.

Proposition 1.2.4. Let A and B denote matrices of the same size, and let k ∈ F. Then wehave the following:

(a) (AT )T = A

(b) (kA)T = kAT

(c) (A+B)T = AT +BT

1.2.3 Matrix-vector multiplication

Suppose A is an m× n matrix with columns a1, a2, . . . , an ∈ Fm:

A = [ a1  a2  · · ·  an ].

For x ∈ Fn, we define the matrix-vector product

Ax := x1a1 + x2a2 + · · ·+ xnan ∈ Fm.

Example 1.2.5. If

A = [  2   −1   0 ]
    [  3  1/2   π ]
    [ −2    1   1 ]
    [  0    0   0 ]

and x = (−1, 1, 2), then

Ax = −1(2, 3, −2, 0) + 1(−1, 1/2, 1, 0) + 2(0, π, 1, 0) = (−3, −5/2 + 2π, 5, 0).
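Since this course emphasizes computation, here is a small numerical check of Example 1.2.5. The use of NumPy is an assumption of this sketch (the notes do not prescribe any software); it simply confirms that Ax is the indicated linear combination of the columns of A.

import numpy as np

# The matrix and vector of Example 1.2.5.
A = np.array([[2.0, -1.0, 0.0],
              [3.0,  0.5, np.pi],
              [-2.0, 1.0, 1.0],
              [0.0,  0.0, 0.0]])
x = np.array([-1.0, 1.0, 2.0])

# Ax as a linear combination of the columns of A, as in the definition above.
combo = -1 * A[:, 0] + 1 * A[:, 1] + 2 * A[:, 2]
print(A @ x)                       # approximately [-3, -5/2 + 2*pi, 5, 0]
print(np.allclose(A @ x, combo))   # True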

1.2.4 Matrix multiplication

Suppose A ∈Mm,n(F) and B ∈Mn,k(F). Let bi denote the i-th column of B, so that

B = [ b1  b2  · · ·  bk ].

We then define the matrix product AB to be the m × k matrix given by

AB := [ Ab1  Ab2  · · ·  Abk ].

Recall that the dot product of x,y ∈ Fn is defined to be

x · y = x1y1 + x2y2 + · · ·+ xnyn. (1.1)


Then another way to compute the matrix product is as follows: the (i, j)-entry of AB is the dot product of the i-th row of A and the j-th column of B. In other words,

C = AB  ⇐⇒  cij = ai1b1j + ai2b2j + · · · + ainbnj   for all 1 ≤ i ≤ m, 1 ≤ j ≤ k.

Example 1.2.6. If

A = [ 2  0  −1   1 ]
    [ 0  3   2  −1 ]

and

B = [ 0  1  −1 ]
    [ 1  0   2 ]
    [ 0  0  −2 ]
    [ 3  1   0 ],

then

AB = [ 3   3  0 ]
     [ 0  −1  2 ].
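As a sketch (again assuming NumPy purely for illustration), the entry formula cij = ai1b1j + · · · + ainbnj can be checked against Example 1.2.6.

import numpy as np

A = np.array([[2, 0, -1, 1],
              [0, 3, 2, -1]])
B = np.array([[0, 1, -1],
              [1, 0, 2],
              [0, 0, -2],
              [3, 1, 0]])

# Compute each entry of C = AB directly from the dot-product formula.
C = np.zeros((2, 3), dtype=int)
for i in range(2):
    for j in range(3):
        C[i, j] = sum(A[i, l] * B[l, j] for l in range(4))

print(C)                          # [[ 3  3  0] [ 0 -1  2]]
print(np.array_equal(C, A @ B))   # True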

Recall that the n× n identity matrix is the matrix

I := [ 1  0  · · ·  0 ]
     [ 0  1  · · ·  0 ]
     [ ⋮  ⋮    ⋱    ⋮ ]
     [ 0  0  · · ·  1 ].

Even though there is an n × n identity matrix for each n, the size of I should be clear from the context. For instance, if A ∈ Mm,n(F) and we write AI, then I is the n × n identity matrix. If, on the other hand, we write IA, then I is the m × m identity matrix. In case we need to specify the size to avoid confusion, we will write In for the n × n identity matrix.

Proposition 1.2.7 (Properties of matrix multiplication). Suppose A, B, and C are matricesof sizes such that the indicated matrix products are defined. Furthermore, suppose a is ascalar. Then:

(a) IA = A = AI (I is a multiplicative identity)

(b) A(BC) = (AB)C (associativity)

(c) A(B + C) = AB + AC (distributivity on the left)

(d) (B + C)A = BA+ CA (distributivity on the right)

(e) a(AB) = (aA)B = A(aB)

(f) (AB)T = BTAT

Note that matrix multiplication is not commutative in general. First of all, it is possible that the product AB is defined but BA is not. This is the case when A ∈ Mm,n(F) and B ∈ Mn,k(F) with m ≠ k. Now suppose A ∈ Mm,n(F) and B ∈ Mn,m(F). Then AB and BA are both defined, but they are different sizes when m ≠ n. However, even if m = n, so that A and B are both square matrices, we can have AB ≠ BA. For example, if

A = [ 1  0 ]        B = [ 0  1 ]
    [ 0  0 ]   and      [ 0  0 ]


then

AB = [ 0  1 ]   [ 0  0 ]
     [ 0  0 ] ≠ [ 0  0 ] = BA.

Of course, it is possible for AB = BA for some specific matrices (e.g. the zero or identitymatrices). In this case, we say that A and B commute. But since this does not hold ingeneral, we say that matrix multiplication is not commutative.

1.2.5 Block form

It is often convenient to group the entries of a matrix into submatrices, called blocks. For instance, if

A = [ 1   2  −1  3  5 ]
    [ 0   3   0  4  7 ]
    [ 5  −4   2  0  1 ],

then we could write

A = [ B  C ]
    [ D    ],

where

B = [ 1  2 ],

C = [ −1  3  5 ]
    [  0  4  7 ]
    [  2  0  1 ],

and

D = [ 0   3 ]
    [ 5  −4 ]

(so B occupies the first row of the first two columns, D the remaining rows of those columns, and C the last three columns).

Similarly, if

X = [  2  1 ]        A = [  3  −5 ]
    [ −1  0 ]   and      [ −2   9 ],

then

[ 0  e2  X ]   [ 0  0   2   1 ]
[         A ] = [ 0  1  −1   0 ]
                [ 0  0   3  −5 ]
                [ 0  0  −2   9 ],

where the columns 0 and e2 span all four rows, while X sits above A. Note that we could infer from the sizes of X and A that the 0 in the block matrix must be the zero vector in F4 and that e2 must be the second standard basis vector in F4.

Provided the sizes of the blocks match up, we can multiply matrices in block form using the usual rules for matrix multiplication. For example,

[ A  B ] [ X ]   [ AX + BY ]
[ C  D ] [ Y ] = [ CX + DY ]

as long as the products AX, BY, CX, and DY are defined. That is, we need the number of columns of A to equal the number of rows of X, etc. (See [Nic, Th. 2.3.4].) Note that, since matrix multiplication is not commutative, the order of the multiplication of the blocks is very important here.
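The block multiplication rule is easy to illustrate numerically. The sketch below (NumPy assumed; the block sizes are chosen purely for illustration) checks that multiplying the assembled matrices agrees with assembling the blockwise products.

import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(2, 3)), rng.normal(size=(2, 4))
C, D = rng.normal(size=(5, 3)), rng.normal(size=(5, 4))
X, Y = rng.normal(size=(3, 6)), rng.normal(size=(4, 6))

M = np.block([[A, B], [C, D]])    # the full (2+5) x (3+4) matrix
N = np.block([[X], [Y]])          # the full (3+4) x 6 matrix

blockwise = np.block([[A @ X + B @ Y], [C @ X + D @ Y]])
print(np.allclose(M @ N, blockwise))   # True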

We can also compute transposes of matrices in block form. For example,

[ A  B  C ]T = [ AT ]
               [ BT ]
               [ CT ]

and

[ A  B ]T   [ AT  CT ]
[ C  D ]  = [ BT  DT ].


In certain circumstances, we can also compute determinants in block form. Precisely, if A and B are square matrices, then

det [ A  X ] = (det A)(det B)    and    det [ A  0 ] = (det A)(det B).    (1.2)
    [ 0  B ]                                [ Y  B ]

(See [Nic, Th. 3.1.5].)

Exercises.

Recommended exercises: Exercises in [Nic, §§2.1–2.3].

1.3 Matrices and linear transformations

We briefly recall the connection between matrices and linear transformations. For furtherdetails, see [Nic, §2.6].

Recall that a map T : V → W , where V and W are F-vector spaces (e.g. V = Fn,W = Fm), is called a linear transformation if

(T1) T (x + y) = T (x) + T (y) for all x,y ∈ V , and

(T2) T (ax) = aT (x) for all x ∈ V and a ∈ F.

We let L(V, W) denote the set of all linear transformations from V to W.

Multiplication by A ∈ Mm,n(F) is a linear transformation

TA : Fn → Fm, x 7→ Ax. (1.3)

Conversely, every linear map Fn → Fm is given by multiplication by some matrix. Indeed, suppose

T : Fn → Fm

is a linear transformation. Then define the matrix

A = [ T(e1)  T(e2)  · · ·  T(en) ] ∈ Mm,n(F).

Proposition 1.3.1. With A and T defined as above, we have T = TA.

Proof. For x ∈ Fn, we have

T(x) = T(x1e1 + · · · + xnen)
     = x1T(e1) + · · · + xnT(en)    (since T is linear)
     = Ax                           (by the definition of A and of the matrix-vector product).


It follows from the above discussion that we have a one-to-one correspondence

Mm,n(F)→ L(Fn,Fm), A 7→ TA. (1.4)

Now suppose A ∈Mm,n(F) and B ∈Mn,k(F). Then, for x ∈ Fk, we have

(TA ◦ TB)(x) = TA(TB(x)) = TA(Bx) = A(Bx) = (AB)x = TAB(x).

Hence

TA ◦ TB = TAB.

So, under the bijection (1.4), matrix multiplication corresponds to composition of linear transformations. In fact, this is why we define matrix multiplication the way we do.

Recall the following definitions associated to a matrix A ∈Mm,n(F).

• The column space of A, denoted colA, is the span of the columns of A. It is a subspaceof Fm and equal to the image of TA, denoted imTA.

• The row space of A, denoted rowA, is the span of the rows of A. It is a subspace of Fn.

• The rank of A is

rankA = dim(colA) = dim(imTA) = dim(rowA).

(The first equality is a definition, the second follows from the definition of TA, and thethird is a logical consequence that we will recall in Section 1.4.)

• The null space of A is{x ∈ Fn : Ax = 0}.

• The nullity of A, denoted nullA, is the dimension of the null space of A.

Recall also that the kernel of a linear transformation T : V → W is

kerT = {v ∈ V : Tv = 0}

It follows that kerTA is equal to the null space of A, and so

nullA = dim(kerTA). (1.5)

The important Rank-Nullity Theorem (also known as the Dimension Theorem) states that if T : V → W is a linear transformation, then

dimV = dim(kerT ) + dim(imT ). (1.6)

For a matrix A ∈Mm,n(F), applying the Rank-Nullity Theorem to TA : Fn → Fm gives

n = nullA+ rankA. (1.7)
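A quick numerical illustration of (1.7), assuming NumPy and SciPy purely for illustration (scipy.linalg.null_space returns a basis of the null space, so its number of columns is the nullity):

import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])        # rank 1, so the nullity should be 2

rank = np.linalg.matrix_rank(A)
nullity = null_space(A).shape[1]       # number of basis vectors of the null space
print(rank, nullity, A.shape[1] == rank + nullity)   # 1 2 True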

Exercises.

Recommended exercises: Exercises in [Nic, §2.6].


1.4 Gaussian elimination

Recall that you learned in previous courses how to row reduce a matrix. This procedure iscalled Gaussian elimination. We briefly review this technique here. For further details, see[Nic, §§1.1, 1.2, 2.5].

To row reduce a matrix, you used the following operations:

Definition 1.4.1 (Elementary row operations). The following are called elementary rowoperations on a matrix A with entries in F.

• Type I : Interchange two rows of A.

• Type II : Multiply any row of A by a nonzero element of F.

• Type III : Add a multiple of one row of A to another row of A.

Definition 1.4.2 (Elementary matrices). An n×n elementary matrix is a matrix obtainedby performing an elementary row operation on the identity matrix In. In particular, wedefine the following elementary matrices:

• For 1 ≤ i, j ≤ n, i 6= j, we let Pi,j be the elementary matrix obtained from In byinterchanging the i-th and j-th rows.

• For 1 ≤ i ≤ n and a ∈ F×, we let Mi(a) be the elementary matrix obtained from In bymultiplying the i-th row by a.

• For 1 ≤ i, j ≤ n, i ≠ j, and a ∈ F, we let Ei,j(a) be the elementary matrix obtained from In by adding a times row i to row j.

The type of the elementary matrix is the type of the corresponding row operation performedon In.

Example 1.4.3. If n = 4, we have

P1,3 = [ 0  0  1  0 ]
       [ 0  1  0  0 ]
       [ 1  0  0  0 ]
       [ 0  0  0  1 ],

M4(−2) = [ 1  0  0   0 ]
         [ 0  1  0   0 ]
         [ 0  0  1   0 ]
         [ 0  0  0  −2 ],

and

E2,4(3) = [ 1  0  0  0 ]
          [ 0  1  0  0 ]
          [ 0  0  1  0 ]
          [ 0  3  0  1 ].

Proposition 1.4.4. (a) Every elementary matrix is invertible and the inverse is an elementary matrix of the same type.

(b) Performing an elementary row operation on a matrix A is equivalent to multiplying Aon the left by the corresponding elementary matrix.

Proof. (a) We leave it as an exercise (see Exercise 1.4.1) to check that

Pi,jPi,j = I, Mi(a)Mi(a−1) = I, Ei,j(a)Ei,j(−a) = I (1.8)

(b) We give the proof for row operations of type III, and leave the proofs for types I andII as exercises. (See Exercise 1.4.1.) Fix 1 ≤ i, j ≤ n, with i 6= j. Note that


• row k of Ei,j(a) is ek if k 6= j, and

• row j of Ei,j(a) is ej + aei.

Thus, if k ≠ j, then row k of Ei,j(a)A is

ekA = row k of A,

and row j of Ei,j(a)A is

(ej + aei)A = ejA + aeiA = (row j of A) + a(row i of A).

Therefore, Ei,j(a)A is the result of adding a times row i to row j.
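Here is a minimal sketch of Proposition 1.4.4(b) for a type III operation, assuming NumPy and using 0-indexed rows (the notes index rows from 1):

import numpy as np

def E(n, i, j, a):
    """Elementary matrix adding a times row i to row j (rows 0-indexed here)."""
    M = np.eye(n)
    M[j, i] = a          # row j of E is e_j + a*e_i
    return M

A = np.arange(12.0).reshape(3, 4)
left_mult = E(3, 0, 2, 5.0) @ A     # multiply on the left

manual = A.copy()
manual[2] += 5.0 * manual[0]        # perform the row operation directly
print(np.allclose(left_mult, manual))   # True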

Definition 1.4.5 (Row-echelon form). A matrix is in row-echelon form (and will be calleda row-echelon matrix ) if:

(a) all nonzero rows are above all zero rows,

(b) the first nonzero entry from the left in each nonzero row is a 1, called the leading 1 forthat row, and

(c) each leading 1 is strictly to the right of all leading 1s in rows above it.

A row-echelon matrix is in reduced row-echelon form (and will be called a reduced row-echelonmatrix ) if, in addition,

(d) each leading 1 is the only nonzero entry in its column.

Remark 1.4.6. Some references do not require the leading entry (i.e. the first nonzero entryfrom the left in a nonzero row) to be 1 in row-echelon form.

Proposition 1.4.7. Every matrix A ∈ Mm,n(F) can be transformed to a row-echelon formmatrix R by performing elementary row operations. Equivalently, there exist finitely manyelementary matrices E1, E2, . . . , Ek such that R = E1E2 · · ·EkA.

Proof. You saw this in previous courses, and so we will omit the proof here. In fact, thereis a precise algorithm, called the gaussian algorithm, for bringing a matrix to row-echelonform. See [Nic, Th. 1.2.1] for details.

Proposition 1.4.8. A square matrix is invertible if and only if it is a product of elementarymatrices.

Proof. Since elementary matrices are invertible by Proposition 1.4.4(a), if A is a product of elementary matrices, then A is invertible. Conversely, suppose A is invertible. Then it can be row-reduced to the identity matrix I. Hence, by Proposition 1.4.7, there are elementary matrices E1, E2, . . . , Ek such that I = E1E2 · · ·EkA. Then

A = (Ek)−1 · · · (E2)−1(E1)−1.

Since inverses of elementary matrices are elementary matrices by Proposition 1.4.4(a), we are done.


Reducing a matrix all the way to reduced row-echelon form is sometimes called Gauss–Jordan elimination.

Recall that the rank of a matrix A, denoted rankA, is the dimension of the column spaceof A. Equivalently, rankA is the number of nonzero rows (which is equal to the number ofleading 1s) in any row-echelon matrix U that is row equivalent to A (i.e. that can be obtainedfrom A by row operations). Thus we see that rankA is also the dimension of the row spaceof A, as noted earlier.

Recall that every linear system consisting of m linear equations in n variables can bewritten in matrix form

Ax = b,

where A is an m× n matrix, called the coefficient matrix ,

x = [ x1 ]
    [ x2 ]
    [ ⋮  ]
    [ xn ]

is the vector of variables (or unknowns), and b is the vector of constant terms. We say that

• the linear system is overdetermined if there are more equations than unknowns (i.e. ifm > n),

• the linear system is underdetermined if there are more unknowns than equations (i.e.if m < n),

• the linear system is square if there are the same number of unknowns as equations (i.e.if m = n),

• an m× n matrix is tall if m > n, and

• an m× n matrix is wide if m < n.

It follows immediately that the linear system Ax = b is

• overdetermined if and only if A is tall,

• underdetermined if and only if A is wide, and

• square if and only if A is square.

Example 1.4.9. As a refresher, let's solve the following underdetermined system of linear equations:

−4x3 + x4 + 2x5 = 11
4x1 − 2x2 + 8x3 + x4 − 5x5 = 5
2x1 − x2 + 2x3 + x4 − 3x5 = 2

We write down the augmented matrix and row reduce:

[ 0  0 −4  1  2 | 11 ]
[ 4 −2  8  1 −5 |  5 ]
[ 2 −1  2  1 −3 |  2 ]

R1 ↔ R3:

[ 2 −1  2  1 −3 |  2 ]
[ 4 −2  8  1 −5 |  5 ]
[ 0  0 −4  1  2 | 11 ]

R2 − 2R1:

[ 2 −1  2  1 −3 |  2 ]
[ 0  0  4 −1  1 |  1 ]
[ 0  0 −4  1  2 | 11 ]

R3 + R2:

[ 2 −1  2  1 −3 |  2 ]
[ 0  0  4 −1  1 |  1 ]
[ 0  0  0  0  3 | 12 ]

One can now easily solve the linear system using a technique called back substitution, which we will discuss in Section 1.6.1. However, to further illustrate the process of row reduction, let's continue with gaussian elimination:

(1/3)R3:

[ 2 −1  2  1 −3 |  2 ]
[ 0  0  4 −1  1 |  1 ]
[ 0  0  0  0  1 |  4 ]

R1 + 3R3, R2 − R3:

[ 2 −1  2  1  0 | 14 ]
[ 0  0  4 −1  0 | −3 ]
[ 0  0  0  0  1 |  4 ]

(1/4)R2:

[ 2 −1  2    1   0 |  14  ]
[ 0  0  1 −1/4   0 | −3/4 ]
[ 0  0  0    0   1 |   4  ]

R1 − 2R2:

[ 2 −1  0  3/2   0 | 31/2 ]
[ 0  0  1 −1/4   0 | −3/4 ]
[ 0  0  0    0   1 |   4  ]

(1/2)R1:

[ 1 −1/2  0  3/4   0 | 31/4 ]
[ 0    0  1 −1/4   0 | −3/4 ]
[ 0    0  0    0   1 |   4  ]

The matrix is now in reduced row-echelon form. This reduced matrix corresponds to the equivalent linear system:

x1 − (1/2)x2 + (3/4)x4 = 31/4
x3 − (1/4)x4 = −3/4
x5 = 4

The leading variables are the variables corresponding to leading 1s in the reduced row-echelon matrix: x1, x3, and x5. The non-leading variables, or free variables, are x2 and x4. We let the free variables be parameters:

x2 = s, x4 = t, s, t ∈ F.

Then we solve for the leading variables in terms of these parameters, giving the general solution in parametric form:

x1 = 31/4 + (1/2)s − (3/4)t
x2 = s
x3 = −3/4 + (1/4)t
x4 = t
x5 = 4
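As a sanity check (NumPy assumed, purely for illustration), one can substitute the parametric solution back into the original system for a few values of the parameters s and t:

import numpy as np

A = np.array([[0, 0, -4, 1, 2],
              [4, -2, 8, 1, -5],
              [2, -1, 2, 1, -3]], dtype=float)
b = np.array([11, 5, 2], dtype=float)

for s, t in [(0, 0), (1, -2), (3.5, 7)]:
    x = np.array([31/4 + s/2 - 3*t/4, s, -3/4 + t/4, t, 4.0])
    print(np.allclose(A @ x, b))    # True for every choice of s and t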


Exercises.

1.4.1. Complete the proof of Proposition 1.4.4 by verifying the equalities in (1.8) and verifying part (b) for the case of elementary matrices of types I and II.

Additional recommended exercises: [Nic, §§1.1, 1.2, 2.5].

1.5 Matrix inverses

In earlier courses you learned about invertible matrices and how to find their inverses. Wenow revisit the topic of matrix inverses in more detail. In particular, we will discuss the moregeneral notions of one-sided inverses. In this section we follow the presentation in [BV18,Ch. 11].

1.5.1 Left inverses

Definition 1.5.1 (Left inverse). A matrix X satisfying

XA = I

is called a left inverse of A. If such a left inverse exists, we say that A is left-invertible. Notethat if A has size m× n, then any left inverse X will have size n×m.

Examples 1.5.2. (a) If A ∈ M1,1(F), then we can think of A simply as a scalar. In thiscase a left inverse is equivalent to the inverse of the scalar. Thus, A is left-invertible ifand only if it is nonzero, and in this case it has only one left inverse.

(b) Any nonzero vector a ∈ Fn is left-invertible. Indeed, if ai ≠ 0 for some 1 ≤ i ≤ n, then

(1/ai) eiT a = [1].

Hence (1/ai) eiT is a left inverse of a. For example, if

a = (2, 0, −1, 1),

then

[ 1/2  0  0  0 ],   [ 0  0  −1  0 ],   and   [ 0  0  0  1 ]

are all left inverses of a. In fact, a has infinitely many left inverses. See Exercise 1.5.1.


(c) The matrix

A = [  4   3 ]
    [ −6  −4 ]
    [ −1  −1 ]

has left inverses

B = (1/9) [ −7  −8   11 ]
          [ 11  10  −16 ]

and

C = (1/2) [ 0  −1   4 ]
          [ 0   1  −6 ].

Indeed, one can check directly that BA = CA = I.

Proposition 1.5.3. If A has a left inverse, then the columns of A are linearly independent.

Proof. Suppose A ∈ Mm,n(F) has a left inverse B, and let a1, . . . , an be the columns of A.Suppose that

x1a1 + · · ·+ xnan = 0

for some x1, . . . , xn ∈ F. Thus, taking x = (x1, . . . , xn), we have

Ax = x1a1 + · · ·+ xnan = 0.

Thus

x = Ix = BAx = B0 = 0.

This implies that x1 = · · · = xn = 0. Hence the columns of A are linearly independent.

We will prove the converse of Proposition 1.5.3 in Proposition 1.5.15.

Corollary 1.5.4. If A has a left inverse, then A is square or tall.

Proof. Suppose A is a wide matrix, i.e. A is m × n with m < n. Then it has n columns,each of which is a vector in Fm. Since m < n, these columns cannot be linearly independent.Hence A cannot have a left inverse.

So we see that only square or tall matrices can be left-invertible. Of course, not everysquare or tall matrix is left-invertible (e.g. consider the zero matrix).

Now suppose we want to solve a system of linear equations

Ax = b

in the case where A has a left inverse C. If this system has a solution x, then

Cb = CAx = Ix = x.

So x = Cb is the solution of Ax = b. However, we started with the assumption that thesystem has a solution. If, on the other hand, it has no solution, then x = Cb cannot satisfyAx = b.

This gives us a method to check if the linear system Ax = b has a solution, and to findthe solution if it exists, provided A has a left inverse C. We simply compute ACb.


• If ACb = b, then x = Cb is the unique solution of the linear system Ax = b.

• If ACb 6= b, then the linear system Ax = b has no solution.

Keep in mind that this method only works when A has a left inverse. In particular, A must be square or tall. So this method only has a chance of working for square or overdetermined systems.
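The test above is easy to automate. The following sketch (NumPy assumed; the helper name solve_with_left_inverse is ours, not from the notes) applies it to the matrices of Examples 1.5.5 and 1.5.6 below:

import numpy as np

A = np.array([[4.0, 3.0], [-6.0, -4.0], [-1.0, -1.0]])
C = 0.5 * np.array([[0.0, -1.0, 4.0], [0.0, 1.0, -6.0]])   # a left inverse of A
print(np.allclose(C @ A, np.eye(2)))     # True: C really is a left inverse

def solve_with_left_inverse(A, C, b):
    x = C @ b
    return x if np.allclose(A @ x, b) else None   # None signals "no solution"

print(solve_with_left_inverse(A, C, np.array([1.0, -2.0, 0.0])))   # [ 1. -1.]
print(solve_with_left_inverse(A, C, np.array([1.0, -1.0, 0.0])))   # None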

Example 1.5.5. Consider the matrices of Example 1.5.2(c): the matrix

A = [  4   3 ]
    [ −6  −4 ]
    [ −1  −1 ]

has left inverses

B = (1/9) [ −7  −8   11 ]
          [ 11  10  −16 ]

and

C = (1/2) [ 0  −1   4 ]
          [ 0   1  −6 ].

Suppose we want to solve the over-determined linear system

Ax = (1, −2, 0).    (1.9)

We can use either left inverse and compute

B(1, −2, 0) = (1, −1) = C(1, −2, 0).

Then we check to see if (1, −1) is a solution:

A(1, −1) = (1, −2, 0).

So (1, −1) is the unique solution to the linear system (1.9).

Example 1.5.6. Using the same matrix A from Example 1.5.5, consider the over-determined linear system

Ax = (1, −1, 0).

We compute

B(1, −1, 0) = (1/9, 1/9).

Thus, if the system has a solution, it must be (1/9, 1/9). However, we check that

A(1/9, 1/9) = (7/9, −10/9, −2/9) ≠ (1, −1, 0).

Thus, the system has no solution. Of course, we could also compute using the left inverse C as above, or see that

B(1, −1, 0) = (1/9, 1/9) ≠ (1/2, −1/2) = C(1, −1, 0).

If the system had a solution, both (1/9, 1/9) and (1/2, −1/2) would be the unique solution, which is clearly not possible.

1.5.2 Right inverses

Definition 1.5.7 (Right inverse). A matrix X satisfying

AX = I

is called a right inverse of A. If such a right inverse exists, we say that A is right-invertible.Note that if A has size m× n, then any right inverse X will have size n×m.

Suppose A has a right inverse B. Then

BTAT = (AB)T = I,

and so BT is a left inverse of AT . Similarly, if A has a left inverse C, then

ATCT = (CA)T = I,

and so CT is a right inverse of AT . This allows us to translate our results about left inversesto results about right inverses.

Proposition 1.5.8. (a) The matrix A is left-invertible if and only if AT is right-invertible. Furthermore, if C is a left inverse of A, then CT is a right inverse of AT.

(b) Similarly, A is right-invertible if and only if AT is left-invertible. Furthermore, if B is a right inverse of A, then BT is a left inverse of AT.

(c) If a matrix is right-invertible then its rows are linearly independent.

(d) If A has a right inverse, then A is square or wide.

Proof. (a) We proved this above.

(b) We proved this above. Alternatively, it follows from part (a) and the fact that(AT )T = A.

(c) This follows from Proposition 1.5.3 and part (b) since the rows of A are the columnsof AT .


(d) This follows from Corollary 1.5.4 and part (b) since A is square or wide if and onlyif AT is square or tall.

We can also transpose Examples 1.5.2.

Examples 1.5.9. (a) If A ∈M1,1(F), then a right inverse is equivalent to the inverse of thescalar. Thus, A is right-invertible if and only if it is nonzero, and in this case it hasonly one right inverse.

(b) Any nonzero row matrix a = [a1 · · · an] ∈ M1,n(F) is right-invertible. Indeed, if ai ≠ 0 for some 1 ≤ i ≤ n, then

(1/ai) a ei = [1].

Hence (1/ai) ei is a right inverse of a.

(c) The matrix

A = [ 4  −6  −1 ]
    [ 3  −4  −1 ]

has right inverses

B = (1/9) [ −7   11 ]
          [ −8   10 ]
          [ 11  −16 ]

and

C = (1/2) [  0   0 ]
          [ −1   1 ]
          [  4  −6 ].

Now suppose we want to solve a linear system

Ax = b

in the case where A has a right inverse B. Note that

ABb = Ib = b.

Thus x = Bb is a solution to this system. Hence, the system has a solution for any b. Of course, there can be other solutions; the solution x = Bb is just one of them.

This gives us a method to solve any linear system Ax = b in the case that A has a rightinverse. Of course, this implies that A is square or wide. So this method only has a chanceof working for square or underdetermined systems.

Example 1.5.10. Using the matrix A from Example 1.5.9(c) with right inverses B and C, the linear system

Ax = (1, 1)

has solutions

B(1, 1) = (4/9, 2/9, −5/9)   and   C(1, 1) = (0, 0, −1).

(Of course, there are more. As you learned in previous courses, any linear system with more than one solution has infinitely many solutions.) Indeed, we can find a solution of Ax = b for any b.


1.5.3 Two-sided inverses

Definition 1.5.11 (Two-sided inverse). A matrix X satisfying

AX = I = XA

is called a two-sided inverse, or simply an inverse, of A. If such an inverse exists, we saythat A is invertible or nonsingular . A square matrix that is not invertible is called singular .

Proposition 1.5.12. A matrix A is invertible if and only if it is both left-invertible andright-invertible. In this case, any left inverse of A is equal to any right inverse of A. Hencethe inverse of A is unique.

Proof. Suppose AX = I and Y A = I. Then

X = IX = Y AX = Y I = Y.

If A is invertible, we denote its inverse by A−1.

Corollary 1.5.13. Invertible matrices are square.

Proof. If A is invertible, it is both left-invertible and right-invertible by Proposition 1.5.12.Then A must be square by Corollary 1.5.4 and Proposition 1.5.8(d).

You learned about inverses in previous courses. In particular, you learned:

• If A is an invertible matrix, then the linear system Ax = b has the unique solutionx = A−1b.

• If A is invertible, then the inverse can be computed by row-reducing an augmented matrix:

[ A | I ]  →  [ I | A−1 ].

Proposition 1.5.14. If A is a square matrix, then the following are equivalent:

(a) A is invertible.

(b) The columns of A are linearly independent.

(c) The rows of A are linearly independent.

(d) A is left-invertible.

(e) A is right-invertible.

Proof. Let A ∈ Mn,n(F). First suppose that A is left-invertible. Then, by Proposition 1.5.3, the columns of A are linearly independent. Since there are n columns, this implies that they form a basis of Fn. Thus any vector in Fn is a linear combination of the columns of A. In particular, each of the standard basis vectors ei, 1 ≤ i ≤ n, can be expressed as ei = Abi for some bi ∈ Fn. Then the matrix B = [b1 b2 · · · bn] satisfies

AB = [Ab1 Ab2 · · · Abn] = [e1 e2 · · · en] = I.

Therefore B is a right inverse of A. So we have shown that


(d) =⇒ (b) =⇒ (e).

Applying this result to the transpose of A, we see that

(e) =⇒ (c) =⇒ (d).

Therefore (b) to (e) are all equivalent. Since, by Proposition 1.5.12, (a) is equivalent to ((d)and (e)), the proof is complete.

1.5.4 The pseudoinverse

We learned in previous courses how to compute two-sided inverses of invertible matrices.But how do we compute left and right inverses in general? We will assume that F = R here,but similar methods work over the complex numbers. We just have to replace the transposeby the conjugate transpose (see Definition 3.3.1).

If A ∈Mm,n(R), then the square matrix

ATA ∈Mn,n(R)

is called the Gram matrix of A.

Proposition 1.5.15. (a) A matrix has linearly independent columns if and only if itsGram matrix ATA is invertible.

(b) A matrix is left-invertible if and only if its columns are linearly independent. Further-more, if A is left-invertible, then (ATA)−1AT is a left inverse of A.

Proof. (a) First suppose the columns of A are linearly independent. Assume that

ATAx = 0

for some x ∈ Rn. Multiplying on the left by xT gives

0 = xT0 = xTATAx = (Ax)T (Ax) = (Ax) · (Ax),

which implies that Ax = 0. (Recall that, for any vector v ∈ Rn, we have v · v = 0 if andonly if v = 0.) Because the columns of A are linearly independent, this implies that x = 0.Since the only solution to ATAx = 0 is x = 0, we conclude that ATA is invertible.

Now suppose the columns of A are linearly dependent. Thus, there exists a nonzerox ∈ Rn such that Ax = 0. Multiplying on the left by AT gives

ATAx = 0, x 6= 0.

Thus the Gram matrix ATA is singular.

(b) We already saw in Proposition 1.5.3 that if A is left-invertible, then its columns are linearly independent. To prove the converse, suppose the columns of A are linearly independent. Then, by (a), the matrix ATA is invertible. Then we compute

((ATA)−1AT)A = (ATA)−1(ATA) = I.

Hence (ATA)−1AT is a left inverse of A.


If the columns of A are linearly independent (in particular, A is square or tall), then the particular left inverse (ATA)−1AT described in Proposition 1.5.15 is called the pseudoinverse of A, the generalized inverse of A, or the Moore–Penrose inverse of A, and is denoted A+. Recall that left inverses are not unique in general. So this is just one left inverse. However, when A is square, we have

A+ = (ATA)−1AT = A−1(AT)−1AT = A−1I = A−1,

and so the pseudoinverse reduces to the ordinary inverse (which is unique). Note that this equation does not make sense when A is not square or, more generally, when A is not invertible.
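As a sketch (NumPy assumed), one can compute this particular left inverse directly from the formula and compare it with NumPy's built-in np.linalg.pinv, which agrees with (ATA)−1AT whenever the columns of A are linearly independent:

import numpy as np

A = np.array([[4.0, 3.0], [-6.0, -4.0], [-1.0, -1.0]])   # tall, independent columns

A_plus = np.linalg.inv(A.T @ A) @ A.T     # the pseudoinverse (A^T A)^{-1} A^T
print(np.allclose(A_plus @ A, np.eye(2)))        # True: it is a left inverse of A
print(np.allclose(A_plus, np.linalg.pinv(A)))    # True: same as np.linalg.pinv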

We also have a right analogue of Proposition 1.5.15.

Proposition 1.5.16. (a) The rows of A are linearly independent if and only if the matrixAAT is invertible.

(b) A matrix is right-invertible if and only if its rows are linearly independent. Further-more, if A is right-invertible, then AT (AAT )−1 is a right inverse of A.

Proof. Essentially, we take the transpose of the statements in Proposition 1.5.15.

(a) We have

rows of A are lin. ind. ⇐⇒ cols of AT are lin. ind.

⇐⇒ (AT )TAT = AAT is invertible (by Proposition 1.5.15(a)).

(b) We have

A is right-invertible ⇐⇒ AT is left-invertible

⇐⇒ columns of AT are lin. ind. (by Proposition 1.5.15(b))

⇐⇒ rows of A are lin. ind.

Furthermore, if A is right-invertible, then, by part (a), AAT is invertible. Then we have

AAT (AAT )−1 = I,

and so AT (AAT )−1 is a right inverse of A.

Propositions 1.5.15 and 1.5.16 give us a method to compute right and left inverses, if theyexist. Precisely, these results reduce the problem to the computation of two-sided inverses,which you have done in previous courses. Later in the course we will develop other, moreefficient, methods for computing left- and right-inverses. (See Sections 3.6 and 4.3.)


Exercises.

1.5.1. Suppose A is a matrix with left inverses B and C. Show that, for any scalars α and β satisfying α + β = 1, the matrix αB + βC is also a left inverse of A. It follows that if a matrix has two different left inverses, then it has an infinite number of left inverses.

1.5.2. Let A be an m × n matrix. Suppose there exists a nonzero n × k matrix B suchthat AB = 0. Show that A has no left inverse. Formulate an analogous statement for rightinverses.

1.5.3. Let A ∈Mm,n(R) and let TA : Rn → Rm be the corresponding linear map (see (1.3)).

(a) Prove that A has a left inverse if and only if TA is injective.

(b) Prove that A has a right inverse if and only if TA is surjective.

(c) Prove that A has a two-sided inverse if and only if TA is an isomorphism.

1.5.4. Consider the matrix

A = [  1  1  1 ]
    [ −2  1  4 ].

(a) Is A left-invertible? If so, find a left inverse.

(b) Compute AAT .

(c) Is A right-invertible? If so, find a right inverse.

Additional recommended exercises from [BV18]: 11.2, 11.3, 11.5, 11.6, 11.7, 11.12, 11.13,11.17, 11.18, 11.22.

1.6 LU factorization

In this section we discuss a certain factorization of matrices that is very useful in solvinglinear systems. We follow here the presentation in [Nic, §2.7].

1.6.1 Triangular matrices

If A = [aij] is an m × n matrix, the elements a11, a22, . . . form the main diagonal of A, even if A is not square. We say A is upper triangular if every entry below and to the left of the main diagonal is zero. For example, the matrices

[ 5  2  −5 ]
[ 0  0   3 ]
[ 0  0   8 ]
[ 0  0   0 ],

[ 2   5  −1  0   1 ]
[ 0  −1   4  7  −1 ]
[ 0   0   3  7   0 ],

and

[ 0  1  2  3   0 ]
[ 0  0  0  1  −3 ]
[ 0  0  2  7  −4 ]

are all upper triangular. In addition, every row-echelon matrix is upper triangular.

Similarly, A is lower triangular if every entry above and to the right of the main diagonalis zero. Equivalently, A is lower triangular if and only if AT is upper triangular. We say amatrix is triangular if it is upper or lower triangular.

If the coefficient matrix of a linear system is upper triangular, there is a particularlyefficient way to solve the system, known as back substitution, where later variables are sub-stituted into earlier equations.

Example 1.6.1. Let’s solve the following system:

2x1 − x2 + 2x3 + x4 − 3x5 = 2
4x3 − x4 + x5 = 1
3x5 = 12

Note that the coefficient matrix is

[ 2  −1  2   1  −3 ]
[ 0   0  4  −1   1 ]
[ 0   0  0   0   3 ],

which is upper triangular. We obtained this matrix part-way through the row reduction in Example 1.4.9. As in gaussian elimination, we let the free variables (i.e. the non-leading variables) be parameters:

x2 = s, x4 = t.

Then we solve for x5, x3, and x1, in that order. The last equation gives

x5 = 12/3 = 4.

Substitution into the second-to-last equation then gives

x3 = (1/4)(1 + x4 − x5) = −3/4 + (1/4)t.

Then, substitution into the first equation gives

x1 = (1/2)(2 + x2 − 2x3 − x4 + 3x5) = (1/2)(2 + s + 3/2 − (1/2)t − t + 12) = 31/4 + (1/2)s − (3/4)t.

Note that this is the same solution we obtained in Example 1.4.9. But the method of back substitution involved less work!
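Back substitution is easy to implement. The following minimal sketch (NumPy assumed) handles only the square case with nonzero diagonal entries; the example above also has free variables, which this simple version does not treat:

import numpy as np

def back_substitution(U, y):
    """Solve Ux = y for upper triangular U with nonzero diagonal."""
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):                   # last equation first
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

U = np.array([[2.0, -1.0, 3.0],
              [0.0,  4.0, 1.0],
              [0.0,  0.0, 5.0]])
y = np.array([5.0, 6.0, 10.0])
x = back_substitution(U, y)
print(x, np.allclose(U @ x, y))    # [0. 1. 2.]  True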

Similarly, if the coefficient matrix of a system is lower triangular, it can be solved by forward substitution, where earlier variables are substituted into later equations. Back substitution is more numerically efficient than gaussian elimination. (With n equations, where n is large, gaussian elimination requires approximately n³/3 multiplications and divisions, whereas back substitution requires only about n²/2.) Thus, if we want to solve a large number of linear systems with the same coefficient matrix, it would be useful to write the coefficient matrix in terms of lower and/or upper triangular matrices.

Suppose we have a linear system Ax = b, where we can factor A as A = LU for somelower triangular matrix L and upper triangular matrix U . Then we can efficiently solve thesystem Ax = b as follows:


(a) First solve Ly = b for y using forward substitution.

(b) Then solve Ux = y for x using back substitution.

Then we have

Ax = LUx = Ly = b,

and so we have solved the system. Furthermore, every solution can be found using this procedure: if x is a solution, take y = Ux. This is an efficient way to solve the system that can be implemented on a computer.

The following lemma will be useful in our exploration of this topic.

Lemma 1.6.2. Suppose A and B are matrices.

(a) If A and B are both lower (upper) triangular, then so is AB (assuming it is defined).

(b) If A is a square lower (upper) triangular matrix, then A is invertible if and only if everymain diagonal entry is nonzero. In this case A−1 is also lower (upper) triangular.

Proof. The proof of this lemma is left as Exercise 1.6.1

1.6.2 LU factorization

Suppose A is an m × n matrix. Then we can use row reduction to transform A to a row-echelon matrix U , which is therefore upper triangular. As discussed in Section 1.4, thisreduction can be performed by multiplying on the left by elementary matrices:

A→ E1A→ E2E1A→ · · · → EkEk−1 · · ·E2E1A = U.

It follows that

A = LU,   where   L = (EkEk−1 · · ·E2E1)−1 = (E1)−1(E2)−1 · · · (Ek)−1.

As long as we do not require that U be reduced then, except for row interchanges, none ofthe above row operations involve adding a row to a row above it. Therefore, if we can avoidrow interchanges, all the Ei are lower triangular. In this case, L is lower triangular (andinvertible) by Lemma 1.6.2. Thus, we have the following result. We say that a matrix canbe lower reduced if it can be reduced to row-echelon form without using row interchanges.

Proposition 1.6.3. If A can be lower reduced to a row-echelon (hence upper triangular)matrix U , then we have

A = LU

for some lower triangular, invertible matrix L.

Definition 1.6.4 (LU factorization). A factorization A = LU as in Proposition 1.6.3 iscalled an LU factorization or LU decomposition of A.

It is possible that no LU factorization exists, when A cannot be reduced to row-echelonform without using row interchanges. We will discuss in Section 1.6.3 how to handle thissituation. However, if an LU factorization exists, then the gaussian algorithm gives U and aprocedure for finding L.


Example 1.6.5. Let's find an LU factorization of

A = [ 0   1   2  −1   3 ]
    [ 0  −2  −4   4  −2 ]
    [ 0  −1  −2   4   3 ].

We first lower reduce A to row-echelon form:

R2 + 2R1, R3 + R1:

[ 0  1  2  −1  3 ]
[ 0  0  0   2  4 ]
[ 0  0  0   3  6 ]

(1/2)R2, then R3 − 3R2:

[ 0  1  2  −1  3 ]
[ 0  0  0   1  2 ]
[ 0  0  0   0  0 ]   = U.

At each step we work with the leading column of the remaining rows: here c1 = (1, −2, −1) (column 2 of A) and then c2 = (2, 3) (the leading column of the last two rows). In each leading column, we divide the top row by the top (pivot) entry to create a 1 in the pivot position, and then use the leading one to create zeros below that entry. Then we have

A = LU,   where   L = [  1  0  0 ]
                      [ −2  2  0 ]
                      [ −1  3  1 ].

The matrix L is obtained from the identity matrix I3 by replacing the bottom of the first two columns with the leading columns c1 and c2 above. Note that rankA = 2, which is the number of leading columns used.

The method of Example 1.6.5 works in general, provided A can be lower reduced. Notethat we did not need to calculate the elementary matrices used in our row operations.

Algorithm 1.6.6 (LU algorithm). Suppose A is an m×n matrix of rank r, and that A canbe lower reduced to a row-echelon matrix U . Then A = LU where L is a lower triangular,invertible matrix constructed as follows:

(a) If A = 0, take L = Im and U = 0.

(b) If A ≠ 0, write A1 = A and let c1 be the leading column of A1. Use c1 and lower reduction to create a leading one at the top of c1 and zeros below it. When this is completed, let A2 denote the matrix consisting of rows 2 to m of the new matrix.

(c) If A2 ≠ 0, let c2 be the leading column of A2 and repeat Step (b) on A2 to create A3.

(d) Continue in this way until U is reached, where all rows below the last leading 1 are zero rows. This will happen after r steps.

(e) Create L by replacing the bottoms of the first r columns of Im with c1, . . . , cr.

Proof. If c1, c2, . . . , cr are columns of lengths m, m − 1, . . . , m − r + 1, respectively, let L(m)(c1, c2, . . . , cr) denote the lower triangular m × m matrix obtained from Im by placing c1, c2, . . . , cr at the bottom of the first r columns:

L(m)(c1, c2, . . . , cr) = [ c1                      0 ]
                          [     c2                    ]
                          [         ⋱                 ]
                          [             cr            ]
                          [                 1         ]
                          [                     ⋱     ]
                          [ 0                       1 ],

where, for 1 ≤ j ≤ r, the column cj occupies rows j through m of column j, columns r + 1, . . . , m are the corresponding columns of Im, and all other entries are 0.

We prove the result by induction on n. The case where A = 0 or n = 1 is straightforward. Suppose n > 1 and that the result holds for n − 1. Let c1 be the leading column of A. By the assumption that A can be lower reduced, there exist lower-triangular elementary matrices E1, . . . , Ek such that, in block form,

(Ek · · ·E2E1)A = [ 0  1  X1 ]
                 [ 0  0  A1 ],   where (Ek · · ·E2E1)c1 = e1.

(Recall that e1 = (1, 0, . . . , 0) is the standard basis element.) Define

G = (Ek · · ·E2E1)−1 = (E1)−1(E2)−1 · · · (Ek)−1.

Then we have Ge1 = c1. By Lemma 1.6.2, G is lower triangular. In addition, each Ej, and hence each (Ej)−1, is the result of either multiplying row 1 of Im by a nonzero scalar or adding a multiple of row 1 to another row. Thus, in block form,

G = [ c1  0    ]
    [     Im−1 ].

By our induction hypothesis, we have an LU factorization A1 = L1U1, where L1 = L(m−1)(c2, . . . , cr) and U1 is row-echelon. Then block multiplication gives

G−1A = [ 0  1  X1   ]   [ 1  0  ] [ 0  1  X1 ]
       [ 0  0  L1U1 ] = [ 0  L1 ] [ 0  0  U1 ].

Thus A = LU, where

U = [ 0  1  X1 ]
    [ 0  0  U1 ]

is row-echelon and

L = [ c1  0    ] [ 1  0  ]   [ c1  0  ]
    [     Im−1 ] [ 0  L1 ] = [     L1 ] = L(m)(c1, c2, . . . , cr).

This completes the proof of the induction step.


LU factorization is very important in practice. It often happens that one wants to solve a series of systems

Ax = b1, Ax = b2, . . . , Ax = bk

with the same coefficient matrix. It is very efficient to first solve the first system by gaussian elimination, simultaneously creating an LU factorization of A. Then one can use this factorization to solve the remaining systems quickly by forward and back substitution.
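This factor-once, solve-many pattern is exactly what library routines provide. Here is a sketch using SciPy (an assumption, not part of the notes; SciPy's lu_factor actually computes a PLU factorization with row interchanges, the general case treated in Section 1.6.3):

import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[1.0, -1.0, 2.0],
              [-1.0, 3.0, 4.0],
              [1.0, -4.0, -3.0]])     # the matrix of Example 1.6.8 below

factorization = lu_factor(A)          # done once (elimination cost)
for b in (np.array([1.0, 0.0, 0.0]),
          np.array([2.0, -1.0, 5.0])):
    x = lu_solve(factorization, b)    # forward + back substitution only
    print(np.allclose(A @ x, b))      # True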

Example 1.6.7. Let's find an LU factorization for

A = [  3  6  −3   0   3 ]
    [ −2  4   6   8  −2 ]
    [  1  0  −2  −5   0 ]
    [  1  2  −1   6   3 ].

We reduce A to row-echelon form:

(1/3)R1, then R2 + 2R1, R3 − R1, R4 − R1:

[ 1   2  −1   0   1 ]
[ 0   8   4   8   0 ]
[ 0  −2  −1  −5  −1 ]
[ 0   0   0   6   2 ]

(1/8)R2, then R3 + 2R2:

[ 1  2   −1    0   1 ]
[ 0  1  1/2    1   0 ]
[ 0  0    0   −3  −1 ]
[ 0  0    0    6   2 ]

−(1/3)R3, then R4 − 6R3:

[ 1  2   −1    0    1  ]
[ 0  1  1/2    1    0  ]
[ 0  0    0    1   1/3 ]
[ 0  0    0    0    0  ]   = U.

Thus we have A = LU, with

L = [  3   0   0  0 ]
    [ −2   8   0  0 ]
    [  1  −2  −3  0 ]
    [  1   0   6  1 ].

Let's do one more example, this time where the matrix A is invertible.

Example 1.6.8. Let's find an LU factorization for

A = [  1  −1   2 ]
    [ −1   3   4 ]
    [  1  −4  −3 ].

We reduce A to row-echelon form:

R2 + R1, R3 − R1:

[ 1  −1   2 ]
[ 0   2   6 ]
[ 0  −3  −5 ]

(1/2)R2, then R3 + 3R2:

[ 1  −1  2 ]
[ 0   1  3 ]
[ 0   0  4 ]

(1/4)R3:

[ 1  −1  2 ]
[ 0   1  3 ]
[ 0   0  1 ]   = U.

Then A = LU with

L = [  1   0  0 ]
    [ −1   2  0 ]
    [  1  −3  4 ].
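The LU algorithm above is straightforward to implement for a square matrix that can be lower reduced. The sketch below (NumPy assumed; it does not handle zero pivots or row interchanges) reproduces the L and U of Example 1.6.8:

import numpy as np

def lu_lower_reduce(A):
    """LU factorization by lower reduction, following the conventions of the notes:
    each leading column is stored in L, and each pivot row of U is scaled to have a leading 1."""
    U = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n):
        L[k:, k] = U[k:, k]                           # leading column goes into L
        U[k, :] = U[k, :] / U[k, k]                   # create the leading 1
        U[k+1:, :] -= np.outer(U[k+1:, k], U[k, :])   # zeros below the pivot
    return L, U

A = np.array([[1.0, -1.0, 2.0],
              [-1.0, 3.0, 4.0],
              [1.0, -4.0, -3.0]])      # Example 1.6.8
L, U = lu_lower_reduce(A)
print(L)                               # matches the L found above
print(np.allclose(L @ U, A))           # True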


1.6.3 PLU factorization

All of our examples in Section 1.6.2 worked because we started with matrices A that could be lower reduced. However, there are matrices that have no LU factorization. Consider, for example, the matrix

A = [ 0  1 ]
    [ 1  0 ].

Suppose this matrix had an LU decomposition A = LU. Write

L = [ ℓ11    0  ]        U = [ u11  u12 ]
    [ ℓ21  ℓ22 ]   and       [  0   u22 ].

Then we have

A = [ 0  1 ]   [ ℓ11u11   ℓ11u12          ]
    [ 1  0 ] = [ ℓ21u11   ℓ21u12 + ℓ22u22 ].

In particular, ℓ11u11 = 0, which implies that ℓ11 = 0 or u11 = 0. But this would mean that L is singular or U is singular. In either case, we would have

det(A) = det(L) det(U) = 0,

which contradicts the fact that det(A) = −1. Therefore, A has no LU decomposition. By Algorithm 1.6.6, this means that A cannot be lower reduced. The problem is that we need to use a row interchange to reduce A to row-echelon form.

The following theorem tells us how to handle LU factorization in general.

Theorem 1.6.9. Suppose an m × n matrix A is row reduced to a row-echelon matrix U. Let P1, P2, . . . , Ps be the elementary matrices corresponding (in order) to the row interchanges used in this reduction, and let P = Ps · · ·P2P1. (If no interchanges are used, take P = Im.) Then PA has an LU factorization.

Proof. The only thing that can go wrong in the LU algorithm (Algorithm 1.6.6) is that, in step (b), the leading column (i.e. first nonzero column) may have a zero entry at the top. This can be remedied by an elementary row operation that swaps two rows. This corresponds to multiplication by a permutation matrix (see Proposition 1.4.4(b)). Thus, if U is a row-echelon form of A, then we can write

LrPrLr−1Pr−1 · · ·L2P2L1P1A = U,    (1.10)

where U is a row-echelon matrix, each of the P1, . . . , Pr is either an identity matrix or an elementary matrix corresponding to a row interchange, and, for each 1 ≤ j ≤ r, Lj = L(m)(e1, . . . , ej−1, cj)−1 for some column cj of length m − j + 1. (Refer to the proof of Algorithm 1.6.6 for notation.) It is not hard to check that, for each 1 ≤ j ≤ r,

Lj = L(m)(e1, . . . , ej−1, c′j)

for some column c′j of length m − j + 1.


Now, each permutation matrix can be “moved past” each lower triangular matrix to theright of it, in the sense that, if k > j, then

PkLj = L′jPk,

where L′j = L(m)(e1, . . . , ej−1, c′′j ) for some column c′′j of length m− j+1. See Exercise 1.6.2.

Thus, from (1.10) we obtain

(LrL′r−1 · · ·L′2L′1)(PrPr−1 · · ·P2P1)A = U,

for some lower triangular matrices L′1, L′2, . . . , L′r−1. Setting P = PrPr−1 · · ·P2P1, this implies that PA has an LU factorization, since LrL′r−1 · · ·L′2L′1 is lower triangular and invertible by Lemma 1.6.2.

Note that Theorem 1.6.9 generalizes Proposition 1.6.3. If A can be lower reduced, thenwe can take P = Im in Theorem 1.6.9, which then states that A has an LU factorization.

A matrix that is the product of elementary matrices corresponding to row interchanges is called a permutation matrix. (We also consider the identity matrix to be a permutation matrix.) Every permutation matrix P is obtained from the identity matrix by permuting the rows. Then PA is the matrix obtained from A by performing the same permutation on the rows of A. The matrix P is a permutation matrix if and only if it has exactly one 1 in each row and column, and all other entries are zero. The elementary permutation matrices are those matrices obtained from the identity matrix by a single row exchange. Every permutation matrix is a product of elementary ones.

Example 1.6.10. Consider the matrix

A = [  0   2   0    4 ]
    [  1   0   1    3 ]
    [  1   0   1   14 ]
    [ −2  −1  −1  −10 ].

Let's find a permutation matrix P such that PA has an LU factorization, and then find the factorization.

We first row reduce A:

A \xrightarrow{R_1 \leftrightarrow R_2}
\begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 2 & 0 & 4 \\ 1 & 0 & 1 & 14 \\ -2 & -1 & -1 & -10 \end{bmatrix}
\xrightarrow{\substack{R_3 - R_1 \\ R_4 + 2R_1}}
\begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 2 & 0 & 4 \\ 0 & 0 & 0 & 11 \\ 0 & -1 & 1 & -4 \end{bmatrix}
\xrightarrow{R_4 + \frac{1}{2}R_2}
\begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 2 & 0 & 4 \\ 0 & 0 & 0 & 11 \\ 0 & 0 & 1 & -2 \end{bmatrix}
\xrightarrow{R_3 \leftrightarrow R_4}
\begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 2 & 0 & 4 \\ 0 & 0 & 1 & -2 \\ 0 & 0 & 0 & 11 \end{bmatrix}.

We used two row interchanges: first R_1 \leftrightarrow R_2 and then R_3 \leftrightarrow R_4. Thus, as in Theorem 1.6.9, we take

P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
= \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}.


We now apply the LU algorithm to PA:

PA = \begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 2 & 0 & 4 \\ -2 & -1 & -1 & -10 \\ 1 & 0 & 1 & 14 \end{bmatrix}
\xrightarrow{\substack{R_3 + 2R_1 \\ R_4 - R_1}}
\begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 2 & 0 & 4 \\ 0 & -1 & 1 & -4 \\ 0 & 0 & 0 & 11 \end{bmatrix}
\xrightarrow{\substack{\frac{1}{2}R_2 \\ R_3 + R_2}}
\begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 1 & 0 & 2 \\ 0 & 0 & 1 & -2 \\ 0 & 0 & 0 & 11 \end{bmatrix}
\xrightarrow{\frac{1}{11}R_4}
\begin{bmatrix} 1 & 0 & 1 & 3 \\ 0 & 1 & 0 & 2 \\ 0 & 0 & 1 & -2 \\ 0 & 0 & 0 & 1 \end{bmatrix} = U.

Hence PA = LU , where

L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ -2 & -1 & 1 & 0 \\ 1 & 0 & 0 & 11 \end{bmatrix}.

Theorem 1.6.9 is an important factorization theorem that applies to any matrix. If A is any matrix, this theorem asserts that there exists a permutation matrix P and an LU factorization PA = LU. Furthermore, it tells us how to find P, and we then know how to find L and U.

Note that P_i = P_i^{-1} for each i (since any elementary permutation matrix is its own inverse). Thus, the matrix A can be factored as

A = P^{-1}LU,

where P^{-1} is a permutation matrix, L is lower triangular and invertible, and U is a row-echelon matrix. This is called a PLU factorization or a PA = LU factorization of A.
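For readers who want to experiment, here is a small illustrative sketch (not part of the original notes) of how a PLU factorization can be computed numerically. It assumes NumPy and SciPy are available; note that scipy.linalg.lu uses partial pivoting and returns factors A = PLU with a unit lower triangular L, so they need not match the factors obtained by hand with Algorithm 1.6.6.

    # Illustrative sketch: numerical PLU factorization of the matrix from Example 1.6.10.
    import numpy as np
    from scipy.linalg import lu

    A = np.array([[0., 2., 0., 4.],
                  [1., 0., 1., 3.],
                  [1., 0., 1., 14.],
                  [-2., -1., -1., -10.]])

    P, L, U = lu(A)                      # factors satisfying A = P @ L @ U
    print(np.allclose(A, P @ L @ U))     # True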

1.6.4 Uniqueness

Theorem 1.6.9 is an existence theorem. It tells us that a PLU factorization always exists. However, it leaves open the question of uniqueness. In general, LU factorizations (and hence PLU factorizations) are not unique. For example,

\begin{bmatrix} -1 & 0 \\ 4 & 1 \end{bmatrix} \begin{bmatrix} 1 & 4 & -5 \\ 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} -1 & -4 & 5 \\ 4 & 16 & -20 \end{bmatrix} = \begin{bmatrix} -1 & 0 \\ 4 & 8 \end{bmatrix} \begin{bmatrix} 1 & 4 & -5 \\ 0 & 0 & 0 \end{bmatrix}.

(In fact, one can put any value in the (2, 2)-position of the 2 × 2 matrix and obtain the same result.) The key to this non-uniqueness is the zero row in the row-echelon matrix. Note that, if A is m × n, then the matrix U has no zero row if and only if A has rank m.

Theorem 1.6.11. Suppose A is an m × n matrix with LU factorization A = LU. If A has rank m (that is, U has no zero row), then L and U are uniquely determined by A.

Proof. Suppose A = MV is another LU factorization of A. Thus, M is lower triangular and invertible, and V is row-echelon. Thus we have

LU = MV, (1.11)


and we wish to show that L = M and U = V. We have

V = M^{-1}LU = NU, \quad \text{where } N = M^{-1}L.

Note that N is lower triangular and invertible by Lemma 1.6.2. It suffices to show that N = I. Suppose N is m × m. We prove the result by induction on m.

First note that the first column of V is N times the first column of U. Since N is invertible, this implies that the first column of V is zero if and only if the first column of U is zero. Hence, by deleting zero columns if necessary, we can assume that the (1, 1)-entry is 1 in both U and V.

If m = 1, then, since U is row-echelon, we have

LU = \begin{bmatrix} \ell_{11} \end{bmatrix} \begin{bmatrix} 1 & u_{12} & \cdots & u_{1n} \end{bmatrix} = \begin{bmatrix} \ell_{11} & \ell_{11}u_{12} & \cdots & \ell_{11}u_{1n} \end{bmatrix} = A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \end{bmatrix}.

Thus \ell_{11} = a_{11}. Similarly, m_{11} = a_{11}. So L = \begin{bmatrix} a_{11} \end{bmatrix} = M, as desired.

Now suppose m > 1 and that the result holds for N of size (m - 1) × (m - 1). As before,

we can delete any zero columns. So we can write, in block form,

N = \begin{bmatrix} a & 0 \\ X & N_1 \end{bmatrix}, \quad U = \begin{bmatrix} 1 & Y \\ 0 & U_1 \end{bmatrix}, \quad V = \begin{bmatrix} 1 & Z \\ 0 & V_1 \end{bmatrix}.

Then

NU = V \implies \begin{bmatrix} a & aY \\ X & XY + N_1U_1 \end{bmatrix} = \begin{bmatrix} 1 & Z \\ 0 & V_1 \end{bmatrix}.

This implies that

a = 1, \quad Y = Z, \quad X = 0, \quad \text{and} \quad N_1U_1 = V_1.

By the induction hypothesis, the equality N_1U_1 = V_1 implies N_1 = I. Hence N = I, as desired.

Recall that an m × m matrix is invertible if and only if it has rank m. Thus, we get the following special case of Theorem 1.6.11.

Corollary 1.6.12. If an invertible matrix A has an LU factorization A = LU, then L and U are uniquely determined by A.

Exercises.

1.6.1. Prove Lemma 1.6.2.

1.6.2 ([Nic, Ex. 2.7.11]). Recall the notation L^{(m)}(c_1, c_2, \dots, c_r) from the proof of Algorithm 1.6.6. Suppose 1 ≤ i < j < k ≤ m, and let c_i be a column of length m - i + 1. Show that there is another column c'_i of length m - i + 1 such that

P_{j,k}L^{(m)}(e_1, \dots, e_{i-1}, c_i) = L^{(m)}(e_1, \dots, e_{i-1}, c'_i)P_{j,k}.


Here P_{j,k} is the m × m elementary permutation matrix (see Definition 1.4.2). Hint: Recall that P_{j,k}^{-1} = P_{j,k}. Write

P_{j,k} = \begin{bmatrix} I_i & 0 \\ 0 & P_{j-i,k-i} \end{bmatrix}

in block form.

Additional recommended exercises: [Nic, §2.7].


Chapter 2

Matrix norms, sensitivity, and conditioning

In this chapter we will consider the issue of how sensitive a linear system is to small changes or errors in its coefficients. This is particularly important in applications, where these coefficients are often the results of measurement, and thus inherently subject to some level of error. Therefore, our goal is to develop some precise measure of how sensitive a linear system is to such changes.

2.1 Motivation

Consider the linear systems

Ax = b \quad \text{and} \quad Ax' = b'

where

A = \begin{bmatrix} 1 & 1 \\ 1 & 1.00001 \end{bmatrix}, \quad b = \begin{bmatrix} 2 \\ 2.00001 \end{bmatrix}, \quad b' = \begin{bmatrix} 2 \\ 2.00002 \end{bmatrix}.

Since \det A ≠ 0, the matrix A is invertible, and hence these linear systems have unique solutions. Indeed it is not hard to see that the solutions are

x = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \quad \text{and} \quad x' = \begin{bmatrix} 0 \\ 2 \end{bmatrix}.

Note that even though the vector b' is very close to b, the solutions to the two systems are quite different. So the solution is very sensitive to the entries of the vector of constants b.
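As a quick numerical illustration (not part of the original notes, and assuming NumPy is available), one can check this sensitivity directly:

    # Solving the two nearby systems from Section 2.1.
    import numpy as np

    A  = np.array([[1., 1.], [1., 1.00001]])
    b  = np.array([2., 2.00001])
    bp = np.array([2., 2.00002])

    print(np.linalg.solve(A, b))    # approximately [1, 1]
    print(np.linalg.solve(A, bp))   # approximately [0, 2]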

When the solution to the system

Ax = b

is highly sensitive to the entries of the coefficient matrix A or the vector b of constant terms, we say that the system is ill-conditioned. Ill-conditioned systems are especially problematic when the coefficients are obtained from experimental results (which always come associated with some error) or when computations are carried out by computer (which can involve round-off error).

So how do we know if a linear system is ill-conditioned? To answer this question, we need to discuss vector and matrix norms.


Exercises.

2.1.1. Consider the following linear systems:

\begin{bmatrix} 400 & -201 \\ -800 & 401 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 200 \\ -200 \end{bmatrix}
\quad \text{and} \quad
\begin{bmatrix} 401 & -201 \\ -800 & 401 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 200 \\ -200 \end{bmatrix}.

Solve these two linear systems (feel free to use a computer) to see how the small change in the coefficient matrix results in a large change in the solution. So the solution is very sensitive to the entries of the coefficient matrix.

2.2 Normed vector spaces

Recall that every complex number z ∈ C can be written in the form

a+ bi, a, b ∈ R.

The complex conjugate of z is

\bar{z} = a - bi.

Note that \bar{z} = z if and only if z ∈ R. The absolute value, or magnitude, of z = a + bi ∈ C is defined to be

|z| = \sqrt{z\bar{z}} = \sqrt{a^2 + b^2}.

Note that if z ∈ R, then |z| is the usual absolute value of z.

Definition 2.2.1 (Vector norm, normed vector space). A norm (or vector norm) on an F-vector space V (e.g. V = F^n) is a function

‖ · ‖ : V → R

that satisfies the following axioms. For all u,v ∈ V and c ∈ F, we have

(N1) ‖v‖ ≥ 0,

(N2) if ‖v‖ = 0 then v = 0,

(N3) ‖cv‖ = |c| ‖v‖, and

(N4) ‖u + v‖ ≤ ‖u‖ + ‖v‖ (the triangle inequality).

(Note that the codomain of the norm is R, regardless of whether F = R or F = C.) A vector space equipped with a norm is called a normed vector space. Thus, a normed vector space is a pair (V, ‖ · ‖), where ‖ · ‖ is a norm on V. However, we will often just refer to V as a normed vector space, leaving it implied that we have a specific norm ‖ · ‖ in mind.


Note that (N3) implies that

‖0‖ = |0| ‖0‖ = 0‖0‖ = 0.

Thus, combined with (N2), we have

‖v‖ = 0 ⇐⇒ v = 0.

Example 2.2.2 (2-norm). Let V = R3. Then

‖(x, y, z)‖_2 := \sqrt{x^2 + y^2 + z^2}

is the usual norm, called the 2-norm or the Euclidean norm. It comes from the dot product, since ‖v‖_2 = \sqrt{v \cdot v}. We can clearly generalize this to a norm on V = R^n by

‖(x_1, \dots, x_n)‖_2 := \sqrt{x_1^2 + \cdots + x_n^2}.

We can even define an analogous norm on the complex vector space V = Cn by

‖(x_1, \dots, x_n)‖_2 := \sqrt{|x_1|^2 + \cdots + |x_n|^2}. (2.1)

(Since the definition (2.1) works for R or C, we will take it as the definition of the 2-norm from now on.) You have verified in previous courses that ‖ · ‖_2 satisfies axioms (N1)–(N4).

Example 2.2.3 (1-norm). Let V = Fn and define

‖(x1, . . . , xn)‖1 := |x1|+ · · ·+ |xn|.

Let’s verify that this is a norm, called the 1-norm. Since |c| ≥ 0 for all c ∈ F, we have

‖(x1, . . . , xn)‖1 = |x1|+ · · ·+ |xn| ≥ 0

and

‖(x_1, \dots, x_n)‖_1 = 0 \implies |x_1| = \cdots = |x_n| = 0 \implies x_1 = \cdots = x_n = 0.

Thus, axioms (N1) and (N2) are satisfied. To verify axiom (N3), we see that

‖c(x_1, \dots, x_n)‖_1 = ‖(cx_1, \dots, cx_n)‖_1
= |cx_1| + \cdots + |cx_n|
= |c| |x_1| + \cdots + |c| |x_n|
= |c| (|x_1| + \cdots + |x_n|)
= |c| ‖(x_1, \dots, x_n)‖_1.

Also, we have

‖(x_1, \dots, x_n) + (y_1, \dots, y_n)‖_1 = ‖(x_1 + y_1, \dots, x_n + y_n)‖_1
= |x_1 + y_1| + \cdots + |x_n + y_n|
≤ |x_1| + |y_1| + \cdots + |x_n| + |y_n|   (by the triangle inequality for F)
= |x_1| + \cdots + |x_n| + |y_1| + \cdots + |y_n|
= ‖(x_1, \dots, x_n)‖_1 + ‖(y_1, \dots, y_n)‖_1.

So (N4) is satisfied.


In general, for any p ∈ R, p ≥ 1, one can define the p-norm by

‖(x_1, \dots, x_n)‖_p := \left( |x_1|^p + \cdots + |x_n|^p \right)^{1/p}.

(It is a bit harder to show this is a norm in general. The proof uses Minkowski's inequality.) As p approaches ∞, this becomes the norm

‖ · ‖∞ : Fn → R, ‖(x1, . . . , xn)‖∞ := max{|x1|, . . . , |xn|}, (2.2)

which is called the ∞-norm or maximum norm. See Exercise 2.2.1. In this course, we'll focus mainly on the cases p = 1, 2, ∞.
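As an illustration (not part of the original notes, and assuming NumPy is available), the 1-, 2-, and ∞-norms of a vector can be computed with NumPy's norm function:

    # Vector p-norms of v = (3, -4, 1).
    import numpy as np

    v = np.array([3., -4., 1.])
    print(np.linalg.norm(v, 1))        # 8.0 = |3| + |-4| + |1|
    print(np.linalg.norm(v, 2))        # sqrt(26), approximately 5.10
    print(np.linalg.norm(v, np.inf))   # 4.0 = max(|3|, |-4|, |1|)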

Remark 2.2.4. It is a theorem of analysis that all norms on F^n are equivalent in the sense that, if ‖ · ‖ and ‖ · ‖' are two norms on F^n, then there is a c ∈ R, c > 0, such that

\frac{1}{c}‖v‖ ≤ ‖v‖' ≤ c‖v‖ \quad \text{for all } v ∈ F^n.

This implies that they induce the same topology on F^n. That's beyond the scope of this course, but it means that, in practice, we can choose whichever norm best suits our particular application.

Exercises.

2.2.1. Show that (2.2) defines a norm on Fn.

2.2.2. Suppose that ‖ · ‖ is a norm on Fn.

(a) Show that ‖u - v‖ ≥ |‖u‖ - ‖v‖| for all u, v ∈ F^n.

(b) Show that, if v ≠ 0, then ‖(1/c)v‖ = 1 when c = ‖v‖.

2.2.3. Show that

‖v‖_∞ ≤ ‖v‖_2 ≤ \sqrt{n}\,‖v‖_∞

for all v ∈ Fn.

2.2.4. Suppose that, for p > 1, we define

‖v‖ = |v1|p + |v2|p + · · ·+ |vn|p for v = (v1, v2, . . . , vn) ∈ Fn.

Is this a norm on F^n? If yes, prove it. If not, show that one of the axioms of a norm is violated.


2.3 Matrix norms

We would now like to define norms of matrices. In view of the motivation in Section 2.1, we would like the norm to somehow measure how much a matrix A can change a vector. If multiplication by A has a large effect, then small changes in x can result in large changes in Ax, which is a problem.

Definition 2.3.1 (Matrix norm). Let A ∈ M_{m,n}(F). If ‖ · ‖_p and ‖ · ‖_q are norms on F^n and F^m respectively, we define

‖A‖_{p,q} = \max\left\{ \frac{‖Ax‖_q}{‖x‖_p} : x ∈ F^n,\ x ≠ 0 \right\}. (2.3)

This is called the matrix norm, or operator norm, of A with respect to the norms ‖ · ‖_p and ‖ · ‖_q. We also say that ‖A‖_{p,q} is the matrix norm associated to the norms ‖ · ‖_p and ‖ · ‖_q.

The next proposition tells us that, in order to compute the matrix norm (2.3), it is not necessary to check the value of ‖Ax‖_q/‖x‖_p for every x ≠ 0. Instead, it is enough to consider unit vectors.

Proposition 2.3.2. In the setup of Definition 2.3.1, we have

‖A‖_{p,q} = \max\{ ‖Ax‖_q : x ∈ F^n, ‖x‖_p = 1 \}.

Proof. Let

S = \left\{ \frac{‖Ax‖_q}{‖x‖_p} : x ∈ F^n,\ x ≠ 0 \right\} \quad \text{and} \quad T = \{ ‖Ax‖_q : x ∈ F^n, ‖x‖_p = 1 \}.

These are both sets of nonnegative real numbers, and we wish to show that they have the same maximum. To do this, we will show that these sets are actually the same (which is a stronger assertion).

First note that, if ‖x‖p = 1, then

\frac{‖Ax‖_q}{‖x‖_p} = ‖Ax‖_q.

Thus T ⊆ S.

Now we want to show the reverse inclusion: S ⊆ T. Suppose s ∈ S. Then there is some x ≠ 0 such that

s = \frac{‖Ax‖_q}{‖x‖_p}.

Define

c = \frac{1}{‖x‖_p} ∈ R.

Then, by (N3),

‖cx‖_p = |c| ‖x‖_p = \frac{1}{‖x‖_p}‖x‖_p = 1.


Thus we have

s = \frac{‖Ax‖_q}{‖x‖_p} = \frac{|c| ‖Ax‖_q}{|c| ‖x‖_p} \overset{(N3)}{=} \frac{‖cAx‖_q}{‖cx‖_p} = \frac{‖A(cx)‖_q}{‖cx‖_p} = ‖A(cx)‖_q ∈ T,

since ‖cx‖_p = 1. Thus we have shown the reverse inclusion S ⊆ T.

Having shown both inclusions, it follows that S = T, and hence their maxima are equal.

Remark 2.3.3. In general, a set of real numbers may not have a maximum. (Consider the set R itself.) Thus, it is not immediately clear that ‖A‖_{p,q} is indeed well defined. However, one can use Proposition 2.3.2 to show that it is well defined. The set {x ∈ F^n : ‖x‖_p = 1} is compact, and the function x ↦ ‖Ax‖_q is continuous. It follows from a theorem in analysis that this function attains a maximum.

When we use the same type of norm (e.g. the 1-norm, the 2-norm, or the ∞-norm) in both the domain and the codomain, we typically use a single subscript on the matrix norm. Thus, for instance,

‖A‖1 := ‖A‖1,1, ‖A‖2 := ‖A‖2,2, ‖A‖∞ := ‖A‖∞,∞.

Note that, in principle, we could choose different types of norms for the domain and codomain.However, in practice, we usually choose the same one.

Example 2.3.4. Let’s calculate ‖A‖1 for F = R and

A = \begin{bmatrix} 2 & 3 \\ 1 & -5 \end{bmatrix}.

We will use Proposition 2.3.2, so we need to consider the set

{(x, y) : ‖(x, y)‖1 = 1} = {(x, y) : |x|+ |y| = 1}.

This set is the union of the four line segments joining the points (1, 0), (0, 1), (-1, 0), and (0, -1) in turn, a “diamond” centred at the origin.

By Proposition 2.3.2, we have

‖A‖_1 = \max\{‖Ax‖_1 : ‖x‖_1 = 1\}
= \max\left\{ \left\| A \begin{bmatrix} x \\ y \end{bmatrix} \right\|_1 : |x| + |y| = 1 \right\}
= \max\left\{ \left\| \begin{bmatrix} 2x + 3y \\ x - 5y \end{bmatrix} \right\|_1 : |x| + |y| = 1 \right\}


= \max\{ |2x + 3y| + |x - 5y| : |x| + |y| = 1 \}
≤ \max\{ 2|x| + 3|y| + |x| + 5|y| : |x| + |y| = 1 \}
= \max\{ 3|x| + 8|y| : |x| + |y| = 1 \}
= \max\{ 3|x| + 8(1 - |x|) : 0 ≤ |x| ≤ 1 \}
= \max\{ 8 - 5|x| : 0 ≤ |x| ≤ 1 \}
= 8.

So we know that ‖A‖_1 = \max\{‖Ax‖_1 : ‖x‖_1 = 1\} is at most 8. To show that it is equal to 8, it is enough to show that 8 ∈ \{‖Ax‖_1 : ‖x‖_1 = 1\}. So we want to find an x ∈ R^2 such that

‖x‖1 = 1 and ‖Ax‖1 = 8.

Indeed, if we take x = (0, 1), then ‖x‖1 = 1, and

Ax = \begin{bmatrix} 2 & 3 \\ 1 & -5 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \end{bmatrix},

and so ‖Ax‖1 = |3|+ | − 5| = 8. Thus ‖A‖1 = 8.

Note that, in Example 2.3.4, the matrix norm ‖A‖_1 was precisely the 1-norm of one of its columns (the second column, to be precise). The following result gives the general situation.

Theorem 2.3.5. Suppose A = [aij] ∈Mm,n(F). Then

(a) ‖A‖_1 = \max\left\{ \sum_{i=1}^m |a_{ij}| : 1 ≤ j ≤ n \right\}, the maximum of the 1-norms of the columns of A;

(b) ‖A‖_∞ = \max\left\{ \sum_{j=1}^n |a_{ij}| : 1 ≤ i ≤ m \right\}, the maximum of the 1-norms of the rows of A;

(c) ‖A‖_2 ≤ \sqrt{\sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^2} =: ‖A‖_F, called the Frobenius norm of A, which is the 2-norm of A, viewing it as a vector in F^{nm}.

Proof. We will prove part (a) and leave parts (b) and (c) as Exercise 2.3.1.

Recall that a_j = Ae_j is the j-th column of A, for 1 ≤ j ≤ n. We have ‖e_j‖_1 = 1 and

‖Ae_j‖_1 = \sum_{i=1}^m |a_{ij}|

is the 1-norm of the j-th column of A. Thus, by definition,

‖A‖_1 ≥ \max\left\{ \sum_{i=1}^m |a_{ij}| : 1 ≤ j ≤ n \right\}.

It remains to prove the reverse inequality. If x = (x1, . . . , xn), we have

Ax = x1a1 + · · ·+ xnan.


Thus

‖Ax‖_1 = ‖x_1a_1 + \cdots + x_na_n‖_1
≤ ‖x_1a_1‖_1 + \cdots + ‖x_na_n‖_1   (by the triangle inequality (N4))
≤ |x_1| ‖a_1‖_1 + \cdots + |x_n| ‖a_n‖_1.   (by (N3))

Now, suppose the column of A with the maximum 1-norm is the j-th column, and suppose ‖x‖_1 = 1. Then we have

‖Ax‖1 ≤ |x1| ‖aj‖1 + · · ·+ |xn| ‖aj‖1 = (|x1|+ · · ·+ |xn|) ‖aj‖1 = ‖aj‖1.

This completes the proof.

Note that part (c) of Theorem 2.3.5 involves an inequality. In practice, the norm ‖A‖_2 is difficult to compute, while the Frobenius norm is much easier.
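As an illustration (not part of the original notes, and assuming NumPy), these norms can all be computed directly; np.linalg.norm(A, 2) returns the induced 2-norm (computed from the singular values of A), so the inequality in part (c) can be checked numerically:

    # Matrix norms of the matrix from Example 2.3.4.
    import numpy as np

    A = np.array([[2., 3.], [1., -5.]])
    print(np.linalg.norm(A, 1))        # 8.0, the maximum column 1-norm
    print(np.linalg.norm(A, np.inf))   # 6.0, the maximum row 1-norm
    print(np.linalg.norm(A, 2))        # induced 2-norm
    print(np.linalg.norm(A, 'fro'))    # Frobenius norm, an upper bound for the 2-norm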

The following theorem summarizes the most important properties of matrix norms.

Theorem 2.3.6 (Properties of matrix norms). Suppose ‖ · ‖ is a family of norms on F^n, n ≥ 1. We also use the notation ‖A‖ for the matrix norm with respect to these vector norms.

(a) For all v ∈ F^n and A ∈ M_{m,n}(F), we have ‖Av‖ ≤ ‖A‖ ‖v‖.

(b) ‖I‖ = 1.

(c) For all A ∈ M_{m,n}(F), we have ‖A‖ ≥ 0 and ‖A‖ = 0 if and only if A = 0.

(d) For all c ∈ F and A ∈ M_{m,n}(F), we have ‖cA‖ = |c| ‖A‖.

(e) For all A, B ∈ M_{m,n}(F), we have ‖A + B‖ ≤ ‖A‖ + ‖B‖.

(f) For all A ∈ M_{m,n}(F) and B ∈ M_{n,k}(F), we have ‖AB‖ ≤ ‖A‖ ‖B‖.

(g) For all A ∈ M_{n,n}(F), we have ‖A^k‖ ≤ ‖A‖^k for all k ≥ 1.

(h) If A ∈ GL(n, F), then ‖A^{-1}‖ ≥ \frac{1}{‖A‖}.

Proof. We prove parts (a) and (h) and leave the remaining parts as Exercise 2.3.3. Suppose v ∈ F^n and A ∈ M_{m,n}(F). If v = 0, then

‖Av‖ = 0 = ‖A‖ ‖v‖.

Now suppose v ≠ 0. Then

\frac{‖Av‖}{‖v‖} ≤ \max\left\{ \frac{‖Ax‖}{‖x‖} : x ∈ F^n,\ x ≠ 0 \right\} = ‖A‖.

Multiplying both sides by ‖v‖ then gives (a).

Now suppose A ∈ GL(n, F). By the definition of ‖A‖, we can choose x ∈ F^n, x ≠ 0, such that

‖A‖ = \frac{‖Ax‖}{‖x‖}.

Then we have

\frac{1}{‖A‖} = \frac{‖x‖}{‖Ax‖} = \frac{‖A^{-1}(Ax)‖}{‖Ax‖} ≤ ‖A^{-1}‖

by part (a) for A^{-1}.


Exercises.

2.3.1. Prove parts (b) and (c) of Theorem 2.3.5. For part (c), use the Cauchy–Schwarz inequality

|u^Tv| ≤ ‖u‖ ‖v‖, \quad u, v ∈ F^n,

(here we view the 1 × 1 matrix u^Tv as an element of F) and note that the entries of the product Ax are of the form u^Tx, where u is a row of A.

2.3.2. Suppose A ∈ M_{n,n}(F) and that λ is an eigenvalue of A. Show that, for any choice of vector norm on F^n, we have ‖A‖ ≥ |λ|, where ‖A‖ is the associated matrix norm of A.

2.3.3. Complete the proof of Theorem 2.3.6.

2.3.4. What is the matrix norm of a zero matrix?

2.3.5. Suppose A ∈ M_{m,n}(F) and that there is some fixed k ∈ R such that ‖Av‖ ≤ k‖v‖ for all v ∈ F^n. (Here we have fixed some arbitrary norms on F^n and F^m.) Show that ‖A‖ ≤ k.

2.3.6. For each of the following matrices, find ‖A‖1 and ‖A‖∞.

(a) \begin{bmatrix} -4 \\ 1 \\ 5 \end{bmatrix}

(b) \begin{bmatrix} -4 & 1 & 5 \end{bmatrix}

(c) \begin{bmatrix} -9 & 0 & 2 & 9 \end{bmatrix}

(d) \begin{bmatrix} -5 & 8 \\ 6 & 2 \end{bmatrix}

(e) \begin{bmatrix} 2i & 5 \\ 4 & -i \\ 3 & 5 \end{bmatrix}

2.4 Conditioning

Our goal is to develop some measure of how “good” a matrix is as a coefficient matrix of a linear system. That is, we want some measure that allows us to know whether or not a matrix can exhibit the bad behaviour we saw in Section 2.1.

Definition 2.4.1 (Condition number). Suppose A ∈ GL(n, F) and let ‖ · ‖ denote a norm on F^n as well as the associated matrix norm. The value

κ(A) = ‖A‖ ‖A−1‖ ≥ 1

is called the condition number of the matrix A, relative to the choice of norm ‖ · ‖.


Note that the condition number depends on the choice of norm. The fact that κ(A) ≥ 1 follows from Theorem 2.3.6(h).

Examples 2.4.2. (a) Consider the matrix from Section 2.1. We have

A = \begin{bmatrix} 1 & 1 \\ 1 & 1.00001 \end{bmatrix} \quad \text{and} \quad A^{-1} = 10^5 \begin{bmatrix} 1.00001 & -1 \\ -1 & 1 \end{bmatrix}.

Therefore, with respect to the 1-norm,

κ(A) = (2.00001)(2.00001 \cdot 10^5) ≥ 4 \cdot 10^5.

(b) If

B = \begin{bmatrix} 2 & 2 \\ 4 & 3 \end{bmatrix}, \quad \text{then} \quad B^{-1} = -\frac{1}{2} \begin{bmatrix} 3 & -2 \\ -4 & 2 \end{bmatrix}.

Thus, with respect to the 1-norm,

κ(B) = 6 \cdot \frac{7}{2} = 21.

(c) If a ∈ R and

C = \begin{bmatrix} 1 & a \\ 0 & 1 \end{bmatrix} \quad \text{and} \quad C^{-1} = \begin{bmatrix} 1 & -a \\ 0 & 1 \end{bmatrix},

and so, with respect to the 1-norm,

κ(C) = (|a| + 1)^2.

Since a is arbitrary, this example shows that the condition number can be arbitrarily large.
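As an illustration (not part of the original notes, and assuming NumPy), NumPy's cond function computes κ(A) = ‖A‖ ‖A^{-1}‖ for a chosen norm, which can be used to check Examples 2.4.2(a) and (b):

    # Condition numbers with respect to the 1-norm.
    import numpy as np

    A = np.array([[1., 1.], [1., 1.00001]])
    B = np.array([[2., 2.], [4., 3.]])

    print(np.linalg.cond(A, 1))   # roughly 4.00004e5
    print(np.linalg.cond(B, 1))   # 21.0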

Lemma 2.4.3. If c ∈ F×, then for every invertible matrix A, we have κ(cA) = κ(A).

Proof. Note that (cA)^{-1} = c^{-1}A^{-1}. Thus

κ(cA) = ‖cA‖ ‖c^{-1}A^{-1}‖ = |c| |c^{-1}| ‖A‖ ‖A^{-1}‖ = |cc^{-1}| κ(A) = κ(A).

Example 2.4.4. If

M = \begin{bmatrix} 10^{16} & 0 \\ 0 & 10^{16} \end{bmatrix},

then M = 10^{16}I. Hence κ(M) = κ(I) = 1 by Theorem 2.3.6(b).

Now that we've defined the condition number of a matrix, what does it have to do with the situation discussed in Section 2.1? Suppose we want to solve the system

Ax = b,

where b is determined by some experiment (thus subject to measurement error) or computation (subject to rounding error). Let ∆b be some small perturbation in b, so that b' = b + ∆b is close to b. Then there is some new solution

Ax′ = b′.


Let ∆x = x' - x. We say that the error ∆b induces (via A) the error ∆x in x.

The norms of the vectors ∆b and ∆x are the absolute errors; the quotients

\frac{‖∆b‖}{‖b‖} \quad \text{and} \quad \frac{‖∆x‖}{‖x‖}

are the relative errors. In general, we are interested in the relative error.

Theorem 2.4.5. With the above notation,

\frac{‖∆x‖}{‖x‖} ≤ κ(A) \frac{‖∆b‖}{‖b‖}.

Proof. We have

A(∆x) = A(x' - x) = Ax' - Ax = b' - b = ∆b,

and so ∆x = A^{-1}∆b. Thus, by Theorem 2.3.6(a), we have

‖∆x‖ ≤ ‖A^{-1}‖ ‖∆b‖ \quad \text{and} \quad ‖b‖ ≤ ‖A‖ ‖x‖.

Thus

\frac{‖∆x‖}{‖x‖} ≤ \frac{‖A^{-1}‖ ‖∆b‖}{‖b‖/‖A‖} = κ(A) \frac{‖∆b‖}{‖b‖}.

In fact, there always exists a choice of b and ∆b such that we have equality in Theorem 2.4.5. See Exercise 2.4.1.

Theorem 2.4.5 says that, when solving the linear system,

Ax = b,

the condition number is roughly the rate at which the solution x will change with respect to a change in b. So if the condition number is large, then even a small error/change in b may cause a large error/change in x. More precisely, the condition number is the maximum ratio of the relative error in x to the relative error in b.

If κ(A) is close to 1, we say that A is well-conditioned. On the other hand, if κ(A) is large, we say that A is ill-conditioned. Note that these terms are a bit vague; we do not have a precise notion of how large κ(A) needs to be before we say A is ill-conditioned. This depends somewhat on the particular situation.

Example 2.4.6. Consider the situation from Section 2.1 and Example 2.4.2(a):

A = \begin{bmatrix} 1 & 1 \\ 1 & 1.00001 \end{bmatrix}, \quad b = \begin{bmatrix} 2 \\ 2.00001 \end{bmatrix}, \quad b' = \begin{bmatrix} 2 \\ 2.00002 \end{bmatrix}, \quad x = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad x' = \begin{bmatrix} 0 \\ 2 \end{bmatrix}.

Thus

‖∆b‖_1 = 10^{-5}, \quad ‖b‖_1 = 4.00001, \quad ‖∆x‖_1 = 2, \quad ‖x‖_1 = 2.

So we have

κ(A) \frac{‖∆b‖}{‖b‖} = \frac{(2.00001)^2 \cdot 10^5 \cdot 10^{-5}}{4.00001} ≥ 1 = \frac{‖∆x‖}{‖x‖},

as predicted by Theorem 2.4.5. The fact that A is ill-conditioned explains the phenomenon we noticed in Section 2.1: that a small change in b can result in a large change in the solution x to the system Ax = b.


What happens if there is some small change/error in the matrix A in addition to a change/error in b? (See Exercise 2.1.1.) In general, one can show that

\frac{‖∆x‖}{‖x‖} ≤ c \cdot κ(A) \left( \frac{‖∆b‖}{‖b‖} + \frac{‖∆A‖}{‖A‖} \right),

where c = ‖(∆A)A^{-1}‖ or c = ‖A^{-1}(∆A)‖ (and we assume that c < 1). For a proof see, for example, [ND77, Th. 6.29].

While an ill-conditioned coefficient matrix A tells us that the system Ax = b can be ill-conditioned (see Exercise 2.4.1), it does not imply in general that this system is ill-conditioned. See Exercise 2.4.2. Thus, κ(A) measures the worst-case scenario for a linear system with coefficient matrix A.

Exercises.

2.4.1. Show that there always exists a choice of b and ∆b such that we have equality in Theorem 2.4.5.

2.4.2. If s is very large, we know from Example 2.4.2(c) that the matrix

C = \begin{bmatrix} 1 & s \\ 0 & 1 \end{bmatrix}

is ill-conditioned. Show that, if b = (1, 1), then the system Cx = b satisfies

\frac{‖∆x‖}{‖x‖} ≤ 3 \frac{‖∆b‖}{‖b‖}

and is therefore well-conditioned. (Here we use the 1-norm.) On the other hand, find a choice of b for which the system Cx = b is ill-conditioned.

2.4.3. Suppose k ∈ F× and find the condition number of the matrix

A = \begin{bmatrix} 1 & 0 \\ 0 & \frac{1}{k} \end{bmatrix}

using either the 1-norm or the ∞-norm.

2.4.4 ([ND77, 6.5.16]). Consider the system of equations

0.89x_1 + 0.53x_2 = 0.36
0.47x_1 + 0.28x_2 = 0.19

with exact solution x1 = 1, x2 = −1.

(a) Find ∆b so that if you replace the right-hand side b by b + ∆b, the exact solution will be x_1 = 0.47, x_2 = 0.11.

(b) Is the system ill-conditioned or well-conditioned?

(c) Find the condition number for the coefficient matrix of the system using the ∞-norm.


Chapter 3

Orthogonality

The notion of orthogonality is fundamental in linear algebra. You've encountered this concept in previous courses. Here we will delve into this subject in further detail. We begin by briefly reviewing the Gram–Schmidt algorithm, orthogonal complements, orthogonal projection, and diagonalization. We then discuss hermitian and unitary matrices, which are complex analogues of symmetric and orthogonal matrices that you've seen before. Afterwards, we learn about Schur decomposition and prove the important spectral and Cayley–Hamilton theorems. We also define positive definite matrices and consider Cholesky and QR factorizations. We conclude with a discussion of computing/estimating eigenvalues, including the Gershgorin circle theorem.

3.1 Orthogonal complements and projections

In this section we briefly review the concepts of orthogonality, orthogonal complements, orthogonal projections, and the Gram–Schmidt algorithm. Since you have seen this material in previous courses, we will move quickly and omit proofs. We follow the presentation in [Nic, §8.1], and proofs can be found there.

Recall that F is either R or C. If v = (v1, . . . , vn) ∈ Fn, we define

\bar{v} = (\bar{v}_1, \dots, \bar{v}_n) ∈ F^n.

Then we define the inner product on Fn as follows:

〈u,v〉 := \bar{u}^Tv = \bar{u}_1v_1 + \cdots + \bar{u}_nv_n, \quad v = (v_1, \dots, v_n), \ u = (u_1, \dots, u_n). (3.1)

When F = R, this is the usual dot product. (Note that, in [Nic, §8.7], the inner product is defined with the complex conjugation on the second vector.) The inner product has the following important properties: For all u, v, w ∈ F^n and c, d ∈ F, we have

(IP1) 〈u,v〉 = \overline{〈v,u〉},

(IP2) 〈cu + dv, w〉 = \bar{c}〈u,w〉 + \bar{d}〈v,w〉,

(IP3) 〈u, cv + dw〉 = c〈u,v〉 + d〈u,w〉,

(IP4) 〈u,u〉 ∈ R and 〈u,u〉 ≥ 0,


(IP5) 〈u,u〉 = 0 if and only if u = 0.

More generally, if V is a vector space over F, then any map

〈·, ·〉 : V × V → F

satisfying (IP1)–(IP5) is called an inner product on V. In light of (IP4), for any inner product, we may define

‖v‖ = \sqrt{〈v,v〉},

and one can check that this defines a norm on V. For the purposes of this course, we will stick to the particular inner product (3.1). In this case ‖v‖ is the usual 2-norm:

‖v‖ = \sqrt{|v_1|^2 + \cdots + |v_n|^2}.

Throughout this chapter, ‖ · ‖ will denote the 2-norm.

Definition 3.1.1 (Orthogonal, orthonormal). We say that u, v ∈ F^n are orthogonal, and we write u ⊥ v, if

〈u,v〉 = 0.

A set {v_1, \dots, v_m} is orthogonal if v_i ⊥ v_j for all i ≠ j. If, in addition, we have ‖v_i‖ = 1 for all i, then we say the set is orthonormal. An orthogonal basis is a basis that is also an orthogonal set. Similarly, an orthonormal basis is a basis that is also an orthonormal set.

Proposition 3.1.2. Let U be a subspace of Fn.

(a) Every orthogonal subset {v_1, \dots, v_m} in U is a subset of an orthogonal basis of U. (We say that we can extend any orthogonal subset to an orthogonal basis.)

(b) The subspace U has an orthogonal basis.

Proof. You saw this in previous courses, so we will omit the proof here. It can be found in [Nic, Th. 8.1.1].

Theorem 3.1.3 (Gram–Schmidt orthogonalization algorithm). If {v_1, \dots, v_m} is any basis of a subspace U of F^n, construct u_1, \dots, u_m ∈ U successively as follows:

u_1 = v_1,

u_2 = v_2 - \frac{〈u_1, v_2〉}{‖u_1‖^2} u_1,

u_3 = v_3 - \frac{〈u_1, v_3〉}{‖u_1‖^2} u_1 - \frac{〈u_2, v_3〉}{‖u_2‖^2} u_2,

\vdots

u_m = v_m - \frac{〈u_1, v_m〉}{‖u_1‖^2} u_1 - \frac{〈u_2, v_m〉}{‖u_2‖^2} u_2 - \cdots - \frac{〈u_{m-1}, v_m〉}{‖u_{m-1}‖^2} u_{m-1}.

Then


(a) {u1, . . . ,um} is an orthogonal basis of U , and

(b) Span{u1, . . . ,uk} = Span{v1, . . . ,vk} for each k = 1, 2, . . . ,m.

Proof. You saw this in previous courses, and so we will not repeat the proof here. See [Nic, Th. 8.1.2]. Note that, in [Nic, Th. 8.1.2], it is assumed that F = R, in which case 〈v,u〉 = 〈u,v〉 for all u, v ∈ R^n. If we wish to allow F = C, we have (IP1) instead. Then it is important that we write 〈u_k, v〉 in the Gram–Schmidt algorithm instead of 〈v, u_k〉.

Example 3.1.4. Let’s find an orthogonal basis for the row space of

A = \begin{bmatrix} 1 & -1 & 0 & 1 \\ 2 & -1 & -2 & 3 \\ 0 & 2 & 0 & 1 \end{bmatrix}.

Let v_1, v_2, v_3 denote the rows of A. One can check that these rows are linearly independent. (Reduce A to echelon form and note that it has rank 3.) So they give a basis of the row space. Let's use the Gram–Schmidt algorithm to find an orthogonal basis:

u_1 = v_1 = (1, -1, 0, 1),

u_2 = v_2 - \frac{〈u_1, v_2〉}{‖u_1‖^2} u_1 = (2, -1, -2, 3) - \frac{6}{3}(1, -1, 0, 1) = (0, 1, -2, 1),

u_3 = v_3 - \frac{〈u_1, v_3〉}{‖u_1‖^2} u_1 - \frac{〈u_2, v_3〉}{‖u_2‖^2} u_2 = (0, 2, 0, 1) - \frac{-1}{3}(1, -1, 0, 1) - \frac{3}{6}(0, 1, -2, 1) = \left( \frac{1}{3}, \frac{7}{6}, 1, \frac{5}{6} \right).

It can be nice to eliminate the fractions (see Remark 3.1.5), so

{(1,−1, 0, 1), (0, 1,−2, 1), (2, 7, 6, 5)}

is an orthogonal basis for the row space of A. If we wanted an orthonormal basis, we would divide each of these basis vectors by its norm.
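For readers who like to compute, here is a minimal Gram–Schmidt sketch (an illustration, not part of the original notes). It assumes NumPy and uses np.vdot, which conjugates its first argument, matching the inner product 〈u, v〉 = \bar{u}^Tv used in these notes:

    # Gram–Schmidt orthogonalization, reproducing Example 3.1.4.
    import numpy as np

    def gram_schmidt(vectors):
        """Return an orthogonal basis for the span of the given linearly independent vectors."""
        basis = []
        for v in vectors:
            u = v.astype(complex)
            for b in basis:
                u = u - (np.vdot(b, v) / np.vdot(b, b)) * b   # subtract the projection onto b
            basis.append(u)
        return basis

    rows = [np.array([1., -1., 0., 1.]),
            np.array([2., -1., -2., 3.]),
            np.array([0., 2., 0., 1.])]
    for u in gram_schmidt(rows):
        print(u)   # (1,-1,0,1), (0,1,-2,1), (1/3, 7/6, 1, 5/6)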

Remark 3.1.5. Note that, for c ∈ F× and u,v ∈ Fn, we have

\frac{〈cu, v〉}{‖cu‖^2}(cu) = \frac{\bar{c}〈u,v〉}{|c|^2‖u‖^2}(cu) = \frac{〈u,v〉}{‖u‖^2}u.

Therefore, in the Gram–Schmidt algorithm, replacing some u_i by cu_i, c ≠ 0, does not affect any of the subsequent steps. This is useful in computations, since we can, for instance, clear denominators.

Orthogonal (especially orthonormal) bases are particularly nice since it is easy to write a vector as a linear combination of the elements of such a basis.

Proposition 3.1.6. Suppose {u_1, \dots, u_m} is an orthogonal basis of a subspace U of F^n. Then, for any v ∈ U, we have

v = \sum_{i=1}^m \frac{〈u_i, v〉}{‖u_i‖^2} u_i. (3.2)


In particular, if the basis is orthonormal, then

v = \sum_{i=1}^m 〈u_i, v〉 u_i. (3.3)

Proof. Since v ∈ U , we can write

v = \sum_{i=1}^m c_iu_i \quad \text{for some } c_1, \dots, c_m ∈ F.

Then, for j = 1, . . . ,m, we have

〈u_j, v〉 = \left\langle u_j, \sum_{i=1}^m c_iu_i \right\rangle \overset{(IP3)}{=} \sum_{i=1}^m c_i〈u_j, u_i〉 = c_j〈u_j, u_j〉 = c_j‖u_j‖^2.

Thus c_j = \frac{〈u_j, v〉}{‖u_j‖^2}, as desired.

Remark 3.1.7. What happens if you apply the Gram–Schmidt algorithm to a set of vectors that is not linearly independent? Remember that a list of vectors v_1, \dots, v_m is linearly dependent if and only if one of the vectors, say v_k, is a linear combination of the previous ones. Then, using Proposition 3.1.6, one can see that the Gram–Schmidt algorithm will give u_k = 0. Thus, you can still apply the Gram–Schmidt algorithm to linearly dependent sets, as long as you simply throw out any zero vectors that you obtain in the process.

Definition 3.1.8. If U is a subspace of Fn, we define the orthogonal complement of U by

U⊥ := {v ∈ Fn : 〈u,v〉 = 0 for all u ∈ U}.

We read U⊥ as “U -perp”.

Lemma 3.1.9. Let U be a subspace of Fn.

(a) U⊥ is a subspace of Fn.

(b) {0}^⊥ = F^n and (F^n)^⊥ = {0}.

(c) If U = Span{u_1, \dots, u_k}, then U^⊥ = {v ∈ F^n : 〈v, u_i〉 = 0 for all i = 1, 2, \dots, k}.

Proof. You saw these properties of the orthogonal complement in previous courses, so we will not repeat the proofs here. See [Nic, Lem. 8.1.2].

Definition 3.1.10 (Orthogonal projection). Let U be a subspace of F^n with orthogonal basis {u_1, \dots, u_m}. If v ∈ F^n, then the vector

\operatorname{proj}_U v = \frac{〈u_1, v〉}{‖u_1‖^2} u_1 + \frac{〈u_2, v〉}{‖u_2‖^2} u_2 + \cdots + \frac{〈u_m, v〉}{‖u_m‖^2} u_m (3.4)

is called the orthogonal projection of v onto U. For the zero subspace U = {0}, we define \operatorname{proj}_{\{0\}} x = 0.


Noting the similarity between (3.4) and (3.2), we see that

v = projU v ⇐⇒ v ∈ U.

(The forward implication follows from the fact that \operatorname{proj}_U v ∈ U, while the reverse implication follows from Proposition 3.1.6.) Also note that the k-th step of the Gram–Schmidt algorithm can be written as

u_k = v_k - \operatorname{proj}_{\operatorname{Span}\{u_1, \dots, u_{k-1}\}} v_k.

Proposition 3.1.11. Suppose U is a subspace of Fn and v ∈ Fn. Define p = projU v.

(a) We have p ∈ U and v − p ∈ U⊥.

(b) The vector p is the vector in U closest to v in the sense that

‖v - p‖ < ‖v - u‖ for all u ∈ U, u ≠ p.

Proof. You saw this in previous courses, so we will not repeat the proofs here. See [Nic,Th. 8.1.3].

Proposition 3.1.12. Suppose U is a subspace of Fn. Define

T : Fn → Fn, T (v) = projU v, v ∈ Fn.

(a) T is a linear operator.

(b) imT = U and kerT = U⊥.

(c) dimU + dimU⊥ = n.

Proof. See [Nic, Th. 8.1.4].

If U is a subspace of F^n, then every v ∈ F^n can be written uniquely as a sum of a vector in U and a vector in U^⊥. Precisely, we have

v = projU v + projU⊥ v.
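As an illustration (not part of the original notes, and assuming NumPy), the projection formula (3.4) is easy to apply directly from an orthogonal basis of U:

    # Orthogonal projection onto U = Span{u1, u2} using formula (3.4).
    import numpy as np

    u1 = np.array([1., -1., 0., 1.])
    u2 = np.array([0., 1., -2., 1.])    # note that u1 and u2 are orthogonal
    v  = np.array([3., 1., 2., 0.])

    p = (np.vdot(u1, v) / np.vdot(u1, u1)) * u1 + (np.vdot(u2, v) / np.vdot(u2, u2)) * u2
    print(p)                                         # proj_U v
    print(np.vdot(u1, v - p), np.vdot(u2, v - p))    # both zero: v - p lies in U-perp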

Exercises.

3.1.1. Prove that the inner product defined by (3.1) satisfies conditions (IP1)–(IP5).

3.1.2. Suppose U and V are subspaces of Fn. We write U ⊥ V if

u ⊥ v for all u ∈ U, v ∈ V.

Show that if U ⊥ V , then projU ◦ projV = 0.

Additional recommended exercises:

• [Nic, §8.1]: All exercises

• [Nic, §8.7]: 8.7.1–8.7.4


3.2 Diagonalization

In this section we quickly review the topics of eigenvectors, eigenvalues, and diagonalization that you saw in previous courses. For a more detailed review of this material, see [Nic, §3.3].

Throughout this section, we suppose that A ∈Mn,n(F). Recall that if

Av = λv for some λ ∈ F, v ∈ F^n, v ≠ 0, (3.5)

then we say that v is an eigenvector of A with corresponding eigenvalue λ. So λ ∈ F is an eigenvalue of A if (3.5) is satisfied for some nonzero vector v ∈ F^n.

You learned in previous courses how to find the eigenvalues of A. The eigenvalues of A are precisely the roots of the characteristic polynomial

cA(x) := det(xI − A).

If F = C, then the characteristic polynomial will factor completely. That is, we have

\det(xI - A) = (x - λ_1)^{m_{λ_1}} (x - λ_2)^{m_{λ_2}} \cdots (x - λ_k)^{m_{λ_k}},

where the λ_1, \dots, λ_k are distinct. Then m_{λ_i} is called the algebraic multiplicity of λ_i. It follows that

\sum_{i=1}^k m_{λ_i} = n.

For an eigenvalue λ of A, the set of solutions to the equation

Ax = λx or (A− λI)x = 0

is the associated eigenspace E_λ. The eigenvectors corresponding to λ are precisely the nonzero vectors in E_λ. The dimension \dim E_λ is called the geometric multiplicity of λ. We always have

1 ≤ dimEλ ≤ mλ (3.6)

for every eigenvalue λ.

Recall that the matrix A is diagonalizable if there exists an invertible matrix P such that

P^{-1}AP

is diagonal. If D is this diagonal matrix, then we have

A = PDP−1.

We say that matrices A, B ∈ M_{n,n}(C) are similar if there exists some invertible matrix P such that A = PBP^{-1}. So a matrix is diagonalizable if and only if it is similar to a diagonal matrix.

Theorem 3.2.1. Suppose A ∈Mn,n(C). The following statements are equivalent.

(a) Cn has a basis consisting of eigenvectors of A.

(b) dimEλ = mλ for every eigenvalue λ of A.

(c) A is diagonalizable.


Corollary 3.2.2. If A ∈Mn,n(C) has n distinct eigenvalues, then A is diagonalizable.

Proof. Suppose A has n distinct eigenvalues. Then m_λ = 1 for every eigenvalue λ. It then follows from (3.6) that \dim E_λ = m_λ for every eigenvalue λ. Hence A is diagonalizable by Theorem 3.2.1.

Suppose A is diagonalizable, and let v_1, \dots, v_n be a basis of eigenvectors. Then the matrix

P = \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix}

is invertible. Furthermore, if we define

D = \begin{bmatrix} λ_1 & 0 & \cdots & 0 \\ 0 & λ_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & λ_n \end{bmatrix},

where λ_i is the eigenvalue corresponding to the eigenvector v_i, then we have A = PDP^{-1}. Each eigenvalue appears on the diagonal of D a number of times equal to its (algebraic or geometric) multiplicity.

Example 3.2.3. Consider the matrix

A = \begin{bmatrix} 3 & 0 & 0 \\ 1 & 3 & 1 \\ -4 & 0 & -1 \end{bmatrix}.

The characteristic polynomial is

c_A(x) = \det(xI - A) = \begin{vmatrix} x-3 & 0 & 0 \\ -1 & x-3 & -1 \\ 4 & 0 & x+1 \end{vmatrix} = (x-3) \begin{vmatrix} x-3 & -1 \\ 0 & x+1 \end{vmatrix} = (x-3)^2(x+1).

Thus, the eigenvalues are −1 and 3, with algebraic multiplicities

m−1 = 1, m3 = 2.

For the eigenvalue 3, we compute the corresponding eigenspace E_3 by solving the system (A - 3I)x = 0:

\begin{bmatrix} 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ -4 & 0 & -4 & 0 \end{bmatrix} \xrightarrow{\text{row reduce}} \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.

Thus

E3 = Span{(1, 0,−1), (0, 1, 0)}.

In particular, this eigenspace is 2-dimensional, with basis {(1, 0,−1), (0, 1, 0)}.


For the eigenvalue -1, we find E_{-1} by solving (A + I)x = 0:

\begin{bmatrix} 4 & 0 & 0 & 0 \\ 1 & 4 & 1 & 0 \\ -4 & 0 & 0 & 0 \end{bmatrix} \xrightarrow{\text{row reduce}} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1/4 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.

Thus

E_{-1} = Span{(0, 1, -4)}.

In particular, this eigenspace is 1-dimensional, with basis {(0, 1, -4)}.

Since we have \dim E_λ = m_λ for each eigenvalue λ, the matrix A is diagonalizable. In particular, we have A = PDP^{-1}, where

P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ -1 & 0 & -4 \end{bmatrix} \quad \text{and} \quad D = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & -1 \end{bmatrix}.

Example 3.2.4. Consider the matrix

B = \begin{bmatrix} 0 & 0 & 0 \\ 2 & 0 & 0 \\ 0 & -3 & 0 \end{bmatrix}.

Then

c_B(x) = \det(xI - B) = \begin{vmatrix} x & 0 & 0 \\ -2 & x & 0 \\ 0 & 3 & x \end{vmatrix} = x^3.

Thus B has only one eigenvalue λ = 0, with algebraic multiplicity 3. To find E_0 we solve

\begin{bmatrix} 0 & 0 & 0 \\ 2 & 0 & 0 \\ 0 & -3 & 0 \end{bmatrix} x = 0

and find the solution set

E0 = Span{(0, 0, 1)}.

So \dim E_0 = 1 < 3 = m_0. Thus there is no basis for C^3 consisting of eigenvectors. Hence B is not diagonalizable.
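As an illustration (not part of the original notes, and assuming NumPy), the geometric multiplicities in the two examples above can be checked using \dim E_λ = n - \operatorname{rank}(A - λI):

    # Geometric multiplicities for Examples 3.2.3 and 3.2.4.
    import numpy as np

    A = np.array([[3., 0., 0.], [1., 3., 1.], [-4., 0., -1.]])
    B = np.array([[0., 0., 0.], [2., 0., 0.], [0., -3., 0.]])
    I = np.eye(3)

    print(3 - np.linalg.matrix_rank(A - 3 * I))   # 2 = m_3, so A is diagonalizable
    print(3 - np.linalg.matrix_rank(A + I))       # 1 = m_{-1}
    print(3 - np.linalg.matrix_rank(B))           # 1 < 3 = m_0, so B is not diagonalizable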

Exercises.

Recommended exercises: Exercises in [Nic, §3.3].


3.3 Hermitian and unitary matrices

In this section we introduce hermitian and unitary matrices, and study some of their important properties. These matrices are complex analogues of symmetric and orthogonal matrices, which you saw in previous courses.

Recall that real matrices (i.e. matrices with real entries) can have complex eigenvalues that are not real. For example, consider the matrix

A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}.

Its characteristic polynomial is

cA(x) = det(xI − A) = x2 + 1,

which has roots ±i. Then one can find the associated eigenvectors in the usual way to see that

A \begin{bmatrix} 1 \\ i \end{bmatrix} = \begin{bmatrix} i \\ -1 \end{bmatrix} = i \begin{bmatrix} 1 \\ i \end{bmatrix} \quad \text{and} \quad A \begin{bmatrix} 1 \\ -i \end{bmatrix} = \begin{bmatrix} -i \\ -1 \end{bmatrix} = -i \begin{bmatrix} 1 \\ -i \end{bmatrix}.

Thus, when considering eigenvalues, eigenvectors, and diagonalization, it makes much more sense to work over the complex numbers.

Definition 3.3.1 (Conjugate transpose). The conjugate transpose (or hermitian conjugate) A^H of a complex matrix is defined by

A^H = (\bar{A})^T = \overline{(A^T)}.

Other common notations for AH are A∗ and A†.

Note that A^H = A^T when A is real. In many ways, the conjugate transpose is the “correct” complex analogue of the transpose for real matrices, in the sense that many theorems for real matrices involving the transpose remain true for complex matrices when you replace “transpose” by “conjugate transpose”. We can also rewrite the inner product (3.1) as

〈u,v〉 = uHv. (3.7)

Example 3.3.2. We have

\begin{bmatrix} 2-i & -3 & i \\ -5i & 7-3i & 0 \end{bmatrix}^H = \begin{bmatrix} 2+i & 5i \\ -3 & 7+3i \\ -i & 0 \end{bmatrix}.

Proposition 3.3.3. Suppose A, B ∈ M_{m,n}(C) and c ∈ C.

(a) (AH)H = A.

(b) (A+B)H = AH +BH .

(c) (cA)^H = \bar{c}A^H.


If A ∈Mm,n(C) and B ∈Mn,k(C), then

(d) (AB)H = BHAH .

Recall that a matrix is symmetric if A^T = A. The natural complex generalization of this concept is the following.

Definition 3.3.4 (Hermitian matrix). A square complex matrix A is hermitian if A^H = A, equivalently, if \bar{A} = A^T.

Example 3.3.5. The matrix

\begin{bmatrix} -2 & 2-i & 5i \\ 2+i & 1 & 4 \\ -5i & 4 & 8 \end{bmatrix}

is hermitian, whereas

\begin{bmatrix} 3 & i \\ i & 0 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} i & 2-i \\ 2+i & 3 \end{bmatrix}

are not. Note that entries on the main diagonal of a hermitian matrix must be real.

Proposition 3.3.6. A matrix A ∈Mn,n(C) is hermitian if and only if

〈x, Ay〉 = 〈Ax,y〉 for all x,y ∈ Cn. (3.8)

Proof. First suppose that A = [aij] is hermitian. Then, for x,y ∈ Cn, we have

〈x, Ay〉 = xHAy = xHAHy = (Ax)Hy = 〈Ax,y〉.

Conversely, suppose that (3.8) holds. Then, for all 1 ≤ i, j ≤ n, we have

a_{ij} = e_i^TAe_j = 〈e_i, Ae_j〉 = 〈Ae_i, e_j〉 = (Ae_i)^He_j = e_i^HA^He_j = e_i^TA^He_j = \overline{a_{ji}}.

Thus A is hermitian.

Proposition 3.3.7. If A ∈Mn,n(C) is hermitian, then every eigenvalue of A is real.

Proof. Suppose we have

Ax = λx, \quad x ∈ C^n, \ x ≠ 0, \ λ ∈ C.

Then

λ〈x,x〉 = 〈x, λx〉   (by (IP3))
= 〈x, Ax〉
= x^HAx   (by (3.7))
= x^HA^Hx   (since A is hermitian)
= (Ax)^Hx
= 〈Ax,x〉   (by (3.7))


= 〈λx,x〉
= \bar{λ}〈x,x〉.   (by (IP2))

Thus we have

(λ - \bar{λ})〈x,x〉 = 0.

Since x ≠ 0, (IP5) implies that λ = \bar{λ}. Hence λ ∈ R.

Proposition 3.3.8. If A ∈ M_{n,n}(C) is hermitian, then eigenvectors of A corresponding to distinct eigenvalues are orthogonal.

Proof. Suppose

Ax = λx, \quad Ay = µy, \quad x, y ≠ 0, \quad λ ≠ µ.

Since A is hermitian, we have λ, µ ∈ R by Proposition 3.3.7. Then we have

λ〈x,y〉 = 〈λx,y〉   (by (IP2) and λ ∈ R)
= 〈Ax,y〉
= 〈x, Ay〉   (by Proposition 3.3.6)
= 〈x, µy〉
= µ〈x,y〉.   (by (IP3))

Thus we have

(λ - µ)〈x,y〉 = 0.

Since λ ≠ µ, it follows that 〈x,y〉 = 0.

Proposition 3.3.9. The following conditions are equivalent for a matrix U ∈Mn,n(C).

(a) U is invertible and U−1 = UH .

(b) The rows of U are orthonormal.

(c) The columns of U are orthonormal.

Proof. The proof of this result is almost identical to the characterization of orthogonal matrices that you saw in previous courses. For details, see [Nic, Th. 8.2.1] and [Nic, Th. 8.7.6].

Definition 3.3.10 (Unitary matrix). A square complex matrix U is unitary if U−1 = UH .

Recall that a matrix P is orthogonal if P^{-1} = P^T. Note that a real matrix is unitary if and only if it is orthogonal. You should think of unitary matrices as the correct complex analogue of orthogonal matrices.

Example 3.3.11. The matrix

\begin{bmatrix} i & 1-i \\ 1 & 1+i \end{bmatrix}

has orthogonal columns. However, the columns are not orthonormal (and the rows are not orthogonal). So the matrix is not unitary. However, if we normalize the columns, we obtain the unitary matrix

\frac{1}{2} \begin{bmatrix} \sqrt{2}\,i & 1-i \\ \sqrt{2} & 1+i \end{bmatrix}.


You saw in previous courses that symmetric real matrices are always diagonalizable. We will see in the next section that the same is true for complex hermitian matrices. Before discussing the general theory, let's do a simple example that illustrates some of the ideas we've seen in this section.

Example 3.3.12. Consider the hermitian matrix

A = \begin{bmatrix} 2 & i \\ -i & 2 \end{bmatrix}.

Its characteristic polynomial is

c_A(x) = \det(xI - A) = \begin{vmatrix} x-2 & -i \\ i & x-2 \end{vmatrix} = (x-2)^2 - 1 = x^2 - 4x + 3 = (x-1)(x-3).

Thus, the eigenvalues are 1 and 3, which are real, as predicted by Proposition 3.3.7. The corresponding eigenvectors are

\begin{bmatrix} 1 \\ i \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} i \\ 1 \end{bmatrix}.

These are orthogonal, as predicted by Proposition 3.3.8. Each has length \sqrt{2}, so an orthonormal basis of eigenvectors is

\left\{ \frac{1}{\sqrt{2}}(1, i), \frac{1}{\sqrt{2}}(i, 1) \right\}.

Thus, the matrix

U = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & i \\ i & 1 \end{bmatrix}

is unitary, and we have that

U^HAU = \begin{bmatrix} 1 & 0 \\ 0 & 3 \end{bmatrix}

is diagonal.
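As an illustration (not part of the original notes, and assuming NumPy), np.linalg.eigh performs exactly this unitary diagonalization for hermitian matrices, returning real eigenvalues and an orthonormal basis of eigenvectors:

    # Unitary diagonalization of the hermitian matrix from Example 3.3.12.
    import numpy as np

    A = np.array([[2., 1j], [-1j, 2.]])
    vals, U = np.linalg.eigh(A)

    print(vals)                                             # [1. 3.], real as expected
    print(np.allclose(U.conj().T @ U, np.eye(2)))           # True: U is unitary
    print(np.allclose(U.conj().T @ A @ U, np.diag(vals)))   # True: U^H A U is diagonal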

Exercises.

3.3.1. Recall that, for θ ∈ R, we define the complex exponential

e^{iθ} = \cos θ + i \sin θ.

Find necessary and sufficient conditions on α, β, γ, θ ∈ R for the matrix

\begin{bmatrix} e^{iα} & e^{iβ} \\ e^{iγ} & e^{iθ} \end{bmatrix}

to be hermitian. Your final answer should not involve any complex exponentials or trigonometric functions.


3.3.2. Show that if A is a square matrix, then \det(A^H) = \overline{\det A}.

Additional recommended exercises: Exercises 8.7.12–8.7.17 in [Nic, §8.7].

3.4 The spectral theorem

In previous courses you studied the diagonalization of symmetric matrices. In particular, you learned that every real symmetric matrix was orthogonally diagonalizable. In this section we will study the complex analogues of those results.

Definition 3.4.1 (Unitarily diagonalizable matrix). A matrix A ∈ M_{n,n}(C) is said to be unitarily diagonalizable if there exists a unitary matrix U such that U^{-1}AU = U^HAU is diagonal.

One of our goals in this section is to show that every hermitian matrix is unitarily diagonalizable. We first prove an important theorem which has this result as an easy consequence.

Theorem 3.4.2 (Schur's theorem). If A ∈ M_{n,n}(C), then there exists a unitary matrix U such that

UHAU = T

is upper triangular. Moreover, the entries on the main diagonal of T are the eigenvalues of A (including multiplicities).

Proof. We prove the result by induction on n. If n = 1, then A is already upper triangular, and we are done (just take U = I). Now suppose n > 1, and that the theorem holds for (n - 1) × (n - 1) complex matrices.

Let λ_1 be an eigenvalue of A, and let y_1 be a corresponding eigenvector with ‖y_1‖ = 1. By Proposition 3.1.2, we can extend this to an orthonormal basis

{y1,y2, . . . ,yn}

of Cn. Then

U_1 = \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix}

is a unitary matrix. In block form, we have

U_1^HAU_1 = \begin{bmatrix} y_1^H \\ y_2^H \\ \vdots \\ y_n^H \end{bmatrix} \begin{bmatrix} λ_1y_1 & Ay_2 & \cdots & Ay_n \end{bmatrix} = \begin{bmatrix} λ_1 & X_1 \\ 0 & A_1 \end{bmatrix}.

Now, by the induction hypothesis applied to the (n - 1) × (n - 1) matrix A_1, there exists a unitary (n - 1) × (n - 1) matrix W_1 such that

W_1^HA_1W_1 = T_1


is upper triangular. Then

U_2 = \begin{bmatrix} 1 & 0 \\ 0 & W_1 \end{bmatrix}

is a unitary n × n matrix. If we let U = U_1U_2, then

U^H = (U_1U_2)^H = U_2^HU_1^H = U_2^{-1}U_1^{-1} = (U_1U_2)^{-1} = U^{-1},

and so U is unitary. Furthermore,

U^HAU = U_2^H(U_1^HAU_1)U_2 = \begin{bmatrix} 1 & 0 \\ 0 & W_1^H \end{bmatrix} \begin{bmatrix} λ_1 & X_1 \\ 0 & A_1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & W_1 \end{bmatrix} = \begin{bmatrix} λ_1 & X_1W_1 \\ 0 & T_1 \end{bmatrix} = T

is upper triangular.

Finally, since A and T are similar matrices, they have the same eigenvalues, and these eigenvalues are the diagonal entries of T since T is upper triangular.

By Schur's theorem (Theorem 3.4.2), every matrix A ∈ M_{n,n}(C) can be written in the form

A = UTUH = UTU−1 (3.9)

where U is unitary, T is upper triangular, and the diagonal entries of T are the eigenvalues of A. The expression (3.9) is called a Schur decomposition of A.
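As an illustration (not part of the original notes, and assuming NumPy and SciPy are available), scipy.linalg.schur computes a Schur decomposition numerically; passing output='complex' guarantees an upper triangular T even when A is real:

    # A Schur decomposition A = U T U^H.
    import numpy as np
    from scipy.linalg import schur

    A = np.array([[3., 0., 0.], [1., 3., 1.], [-4., 0., -1.]])
    T, U = schur(A, output='complex')

    print(np.allclose(A, U @ T @ U.conj().T))   # True
    print(np.diag(T))                           # the eigenvalues of A: 3, 3, -1 (in some order)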

Recall that the trace of a square matrix A = [aij] is

trA = a11 + a22 + . . .+ ann.

In other words, trA is the sum of the entries of A on the main diagonal. Similar matrices have the same trace and determinant (see Exercises 3.4.1 and 3.4.2).

Corollary 3.4.3. Suppose A ∈ M_{n,n}(C), and let λ_1, \dots, λ_n denote the eigenvalues of A, including multiplicities. Then

detA = λ1λ2 · · ·λn and trA = λ1 + λ2 + · · ·+ λn.

Proof. Since the statements are clearly true for triangular matrices, the corollary follows from the fact mentioned above that similar matrices have the same determinant and trace.

Schur's theorem states that every complex square matrix can be “unitarily triangularized”. However, not every complex square matrix can be unitarily diagonalized. For example, the matrix

\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}

cannot be unitarily diagonalized. You can see this by finding the eigenvectors of this matrix and seeing that there is no basis of eigenvectors (there is only one eigenvalue, but its corresponding eigenspace only has dimension one).

Theorem 3.4.4 (Spectral theorem). Every hermitian matrix is unitarily diagonalizable. In other words, if A is a hermitian matrix, then there exists a unitary matrix U such that U^HAU is diagonal.


Proof. Suppose A is a hermitian matrix. By Schur's Theorem (Theorem 3.4.2), there exists a unitary matrix U such that U^HAU = T is upper triangular. Then we have

TH = (UHAU)H = UHAH(UH)H = UHAU = T.

Thus T is both upper and lower triangular. Hence T is diagonal.

The terminology “spectral theorem” comes from the fact that the set of distinct eigenvalues is called the spectrum of a matrix. In previous courses, you learned the following real analogue of the spectral theorem.

Theorem 3.4.5 (Real spectral theorem, principal axes theorem). The following conditions are equivalent for A ∈ M_{n,n}(R).

(a) A has an orthonormal set of n eigenvectors in Rn.

(b) A is orthogonally diagonalizable. That is, there exists a real orthogonal matrix P such that P^{-1}AP = P^TAP is diagonal.

(c) A is symmetric.

A set of orthonormal eigenvectors of a symmetric matrix A is called a set of principal axes for A (hence the name of Theorem 3.4.5).

Note that the principal axes theorem states that a real matrix is orthogonally diagonalizable if and only if the matrix is symmetric. However, the converse of the spectral theorem (Theorem 3.4.4) is false, as the following example shows.

Example 3.4.6. Consider the non-hermitian matrix

A = \begin{bmatrix} 0 & -2 \\ 2 & 0 \end{bmatrix}.

The characteristic polynomial is

cA(x) = det(xI − A) = x2 + 4.

Thus the eigenvalues are 2i and -2i. The corresponding eigenvectors are

\begin{bmatrix} -1 \\ i \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} i \\ -1 \end{bmatrix}.

These vectors are orthogonal and both have length \sqrt{2}. Therefore

U = \frac{1}{\sqrt{2}} \begin{bmatrix} -1 & i \\ i & -1 \end{bmatrix}

is a unitary matrix such that

U^HAU = \begin{bmatrix} 2i & 0 \\ 0 & -2i \end{bmatrix}

is diagonal.


Why does the converse of Theorem 3.4.4 fail? Why doesn't the proof that an orthogonally diagonalizable real matrix is symmetric carry over to the complex setting? Let's recall the proof in the real case. Suppose A is orthogonally diagonalizable. Then there exists a real orthogonal matrix P (so P^{-1} = P^T) and a real diagonal matrix D such that A = PDP^T. Then we have

A^T = (PDP^T)^T = PD^TP^T = PDP^T = A,

where we used the fact that D = D^T for a diagonal matrix. Hence A is symmetric. However, suppose we assume that A is unitarily diagonalizable. Then there exists a unitary matrix U and a complex diagonal matrix D such that A = UDU^H. We have

A^H = (UDU^H)^H = UD^HU^H,

and here we're stuck. We won't have D^H = D unless the entries of the diagonal matrix D are all real. So the argument fails. It turns out that we need to introduce a different condition on the matrix A.

Definition 3.4.7 (Normal matrix). A matrix N ∈Mn,n(C) is normal if NNH = NHN .

Clearly every hermitian matrix is normal. Note that the matrix A in Example 3.4.6 is also normal since

AA^H = \begin{bmatrix} 0 & -2 \\ 2 & 0 \end{bmatrix} \begin{bmatrix} 0 & 2 \\ -2 & 0 \end{bmatrix} = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix} = \begin{bmatrix} 0 & 2 \\ -2 & 0 \end{bmatrix} \begin{bmatrix} 0 & -2 \\ 2 & 0 \end{bmatrix} = A^HA.

Theorem 3.4.8. A complex square matrix is unitarily diagonalizable if and only if it is normal.

Proof. First suppose that A ∈Mn,n(C) is unitarily diagonalizable. So we have

UHAU = D

for some unitary matrix U and diagonal matrix D. Since diagonal matrices commute with each other, we have DD^H = D^HD. Now

DDH = (UHAU)(UHAHU) = UHAAHU

and

D^HD = (U^HA^HU)(U^HAU) = U^HA^HAU.

Hence

U^H(AA^H)U = U^H(A^HA)U.

Multiplying on the left by U and on the right by U^H gives AA^H = A^HA, as desired.

Now suppose that A ∈ M_{n,n}(C) is normal, so that AA^H = A^HA. By Schur's theorem (Theorem 3.4.2), we can write

U^HAU = T

for some unitary matrix U and upper triangular matrix T . Then T is also normal since

TTH = (UHAU)(UHAHU) = UH(AAH)U = UH(AHA)U = (UHAHU)(UHAU) = THT.


So it is enough to show that a normal n × n upper triangular matrix is diagonal. We prove this by induction on n. The case n = 1 is clear, since all 1 × 1 matrices are diagonal. Suppose n > 1 and that all normal (n - 1) × (n - 1) upper triangular matrices are diagonal. Let T = [t_{ij}] be a normal n × n upper triangular matrix. Equating the (1, 1)-entries of TT^H and T^HT gives

|t_{11}|^2 + |t_{12}|^2 + \cdots + |t_{1n}|^2 = |t_{11}|^2.

It follows that

t12 = t13 = · · · = t1n = 0.

Thus, in block form, we have

T = \begin{bmatrix} t_{11} & 0 \\ 0 & T_1 \end{bmatrix}.

Then

T^H = \begin{bmatrix} \overline{t_{11}} & 0 \\ 0 & T_1^H \end{bmatrix},

and so we have

\begin{bmatrix} |t_{11}|^2 & 0 \\ 0 & T_1T_1^H \end{bmatrix} = TT^H = T^HT = \begin{bmatrix} |t_{11}|^2 & 0 \\ 0 & T_1^HT_1 \end{bmatrix}.

Thus T_1T_1^H = T_1^HT_1. By our induction hypothesis, this implies that T_1 is diagonal. Hence T is diagonal, completing the proof of the induction step.

We conclude this section with a famous theorem about matrices. Recall that if A is a square matrix, then we can form powers A^k, k ≥ 0, of A. (We define A^0 = I.) Thus, we can substitute A into polynomials. For example, if

p(x) = 2x^3 - 3x^2 + 4x + 5 = 2x^3 - 3x^2 + 4x + 5x^0,

then

p(A) = 2A^3 - 3A^2 + 4A + 5I.

Theorem 3.4.9 (Cayley–Hamilton Theorem). If A ∈ M_{n,n}(C), then c_A(A) = 0. In other words, every square matrix is a “root” of its characteristic polynomial.

Proof. Note that, for any k ≥ 0 and invertible matrix P , we have

(P−1AP )k = P−1AkP.

It follows that if p(x) is any polynomial, then

p(P−1AP ) = P−1p(A)P.

Thus, p(A) = 0 if and only if p(P^{-1}AP) = 0. Therefore, by Schur's theorem (Theorem 3.4.2), we may assume that A is upper triangular. Then the eigenvalues λ_1, λ_2, \dots, λ_n of A appear on the main diagonal and we have

cA(x) = (x− λ1)(x− λ2) · · · (x− λn).

Therefore

c_A(A) = (A - λ_1I)(A - λ_2I) \cdots (A - λ_nI).

Each matrix A− λiI is upper triangular. Observe that:


(a) A− λ1I has zero first column, since the first column of A is (λ1, 0, . . . , 0).

(b) Then (A - λ_1I)(A - λ_2I) has the first two columns zero because the second column of (A - λ_2I) is of the form (b, 0, \dots, 0) for some b ∈ C.

(c) Next (A - λ_1I)(A - λ_2I)(A - λ_3I) has the first three columns zero because the third column of (A - λ_3I) is of the form (c, d, 0, \dots, 0) for some c, d ∈ C.

Continuing in this manner, we see that (A - λ_1I)(A - λ_2I) \cdots (A - λ_nI) has all n columns zero, and hence is the zero matrix.
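As an illustration (not part of the original notes, and assuming NumPy), one can check the Cayley–Hamilton theorem numerically; np.poly returns the coefficients of the characteristic polynomial \det(xI - A), highest degree first:

    # Numerical check that c_A(A) = 0 for the matrix of Example 3.2.3.
    import numpy as np

    A = np.array([[3., 0., 0.], [1., 3., 1.], [-4., 0., -1.]])
    coeffs = np.poly(A)              # coefficients of (x - 3)^2 (x + 1) = x^3 - 5x^2 + 3x + 9

    C = np.zeros_like(A)             # evaluate c_A(A) by Horner's rule
    for c in coeffs:
        C = C @ A + c * np.eye(3)
    print(np.allclose(C, 0))         # True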

Exercises.

3.4.1. Suppose that A and B are similar square matrices. Show that detA = detB.

3.4.2. Suppose A,B ∈Mn,n(C).

(a) Show that tr(AB) = tr(BA).

(b) Show that if A and B are similar, then trA = trB.

3.4.3. (a) Suppose that N ∈ M_{n,n}(C) is upper triangular with diagonal entries equal to zero. Show that, for all j = 1, 2, \dots, n, we have

Nej ∈ Span{e1, e2, . . . , ej−1},

where ei is the i-th standard basis vector. (When j = 1, we interpret the set {e1, e2, . . . , ej−1}as the empty set. Recall that Span∅ = {0}.)

(b) Again, suppose that N ∈ Mn,n(C) is upper triangular with diagonal entries equal tozero. Show that Nn = 0.

(c) Suppose A ∈ Nn,n(C) has eigenvalues λ1, . . . , λn, with multiplicity. Show that A =P + N for some P,N ∈ Mn,n(C) satisfying Nn = 0 and P = UDUT , where U isunitary and D = diag(λ1, . . . , λn). Hint : Use Schur’s Theorem.

Additional exercises from [Nic, §8.7]: 8.7.5–8.7.9, 8.7.18–8.7.25.

3.5 Positive definite matrices

In this section we look at symmetric matrices whose eigenvalues are all positive. These matrices are important in applications including optimization, statistics, and geometry. We follow the presentation in [Nic, §8.3], except that we will consider the complex case (whereas [Nic, §8.3] works over the real numbers).

Let

R_{>0} = {a ∈ R : a > 0}

denote the set of positive real numbers.


Definition 3.5.1 (Positive definite). A hermitian matrix A ∈ M_{n,n}(C) is positive definite if

〈x, Ax〉 ∈ R_{>0} for all x ∈ C^n, x ≠ 0.

By Proposition 3.3.7, we know that the eigenvalues of a hermitian matrix are real.

Proposition 3.5.2. A hermitian matrix is positive definite if and only if all its eigenvalues λ are positive, that is, λ > 0.

Proof. Suppose A is a hermitian matrix. By the spectral theorem (Theorem 3.4.4), there exists a unitary matrix U such that U^H A U = D = diag(λ_1, . . . , λ_n), where λ_1, . . . , λ_n are the eigenvalues of A. For x ∈ C^n, define

y = U^H x = (y_1, y_2, . . . , y_n).

Then

〈x, Ax〉 = x^H A x = x^H (U D U^H) x = (U^H x)^H D U^H x = y^H D y = λ_1 |y_1|^2 + λ_2 |y_2|^2 + · · · + λ_n |y_n|^2. (3.10)

If every λ_i > 0, then (3.10) implies that 〈x, Ax〉 > 0, since some y_j ≠ 0 (because x ≠ 0 and U is invertible). So A is positive definite.

Conversely, suppose A is positive definite. For j ∈ {1, 2, . . . , n}, let x = U e_j ≠ 0. Then y = e_j, and so (3.10) gives

λ_j = 〈x, Ax〉 > 0.

Hence all the eigenvalues of A are positive.

Remark 3.5.3. A hermitian matrix A is positive semi-definite if 〈x, Ax〉 ≥ 0 for all x ≠ 0 in C^n. One can then show that a hermitian matrix is positive semi-definite if and only if all its eigenvalues λ are nonnegative, that is, λ ≥ 0. (See Exercise 3.5.1.) One can also consider negative definite and negative semi-definite matrices, and they have analogous properties. However, we will focus here on positive definite matrices.
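By Proposition 3.5.2 and Remark 3.5.3, definiteness can be tested numerically by inspecting eigenvalues. A minimal sketch in Python/NumPy (illustrative only; the tolerance is an ad hoc choice to absorb round-off error, and the test matrix is the one that appears later in Example 3.5.13):

```python
import numpy as np

def is_positive_definite(A, tol=1e-12):
    # eigvalsh assumes A is hermitian and returns its (real) eigenvalues
    return bool(np.all(np.linalg.eigvalsh(A) > tol))

def is_positive_semidefinite(A, tol=1e-12):
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

A = np.array([[2, 1j, -3],
              [-1j, 5, 2j],
              [-3, -2j, 10]])          # hermitian
print(is_positive_definite(A))          # True
print(is_positive_semidefinite(A))      # True
```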

Example 3.5.4. Consider the matrix

A = \begin{bmatrix} 1 & 1 \\ −1 & 1 \end{bmatrix}.

Note that A is not hermitian (which is the same as symmetric, since A is real). For any x = (x_1, x_2), we have

x^T A x = x_1^2 + x_2^2.

Thus, if x ∈ R^2, then x^T A x is always positive when x ≠ 0. However, if x = (1, i), we have

x^H A x = 2 + 2i,

which is not real. On the other hand, if A ∈ M_{n,n}(R) is symmetric, then A is positive definite if and only if 〈x, Ax〉 > 0 for all nonzero x ∈ R^n. That is, for real symmetric matrices, it is enough to check this condition for real vectors.


Corollary 3.5.5. If A is positive definite, then it is invertible and det A ∈ R_{>0}.

Proof. Suppose A is positive definite. Then, by Proposition 3.5.2, the eigenvalues of A are positive real numbers. Hence, by Corollary 3.4.3, det A ∈ R_{>0}. In particular, since its determinant is nonzero, A is invertible.

Example 3.5.6. Let's show that, for any invertible matrix U ∈ M_{n,n}(C), the matrix A = U^H U is positive definite. Indeed, we have

A^H = (U^H U)^H = U^H U = A,

and so A is hermitian. Also, for x ∈ C^n, x ≠ 0, we have

x^H A x = x^H (U^H U) x = (Ux)^H (Ux) = ‖Ux‖^2 > 0,

where the final inequality follows from (IP5) and the fact that Ux ≠ 0, since x ≠ 0 and U is invertible.

In fact, we will see that the converse to Example 3.5.6 is also true. Before verifying this, we discuss another important concept.

Definition 3.5.7 (Principal submatrices). If A ∈ M_{n,n}(C), let (r)A denote the r × r submatrix in the upper-left corner of A; that is, (r)A is the matrix obtained from A by deleting the last n − r rows and columns. The matrices (1)A, (2)A, . . . , (n)A = A are called the principal submatrices of A.

Example 3.5.8. If

A = \begin{bmatrix} 5 & 7 & 2−i \\ 6i & 0 & −3i \\ 2+9i & −3 & 1 \end{bmatrix},

then

(1)A = [5], (2)A = \begin{bmatrix} 5 & 7 \\ 6i & 0 \end{bmatrix}, (3)A = A.

Lemma 3.5.9. If A ∈ M_{n,n}(C) is positive definite, then so is each principal submatrix (r)A for r = 1, 2, . . . , n.

Proof. Write

A = \begin{bmatrix} (r)A & P \\ Q & R \end{bmatrix}

in block form. First note that

\begin{bmatrix} (r)A & P \\ Q & R \end{bmatrix} = A = A^H = \begin{bmatrix} ((r)A)^H & Q^H \\ P^H & R^H \end{bmatrix}.

Hence (r)A = ((r)A)^H, and so (r)A is hermitian.

Now let y ∈ C^r, y ≠ 0. Define

x = \begin{bmatrix} y \\ 0 \end{bmatrix} ∈ C^n.


Then x ≠ 0 and so, since A is positive definite, we have

R_{>0} ∋ x^H A x = \begin{bmatrix} y^H & 0 \end{bmatrix} \begin{bmatrix} (r)A & P \\ Q & R \end{bmatrix} \begin{bmatrix} y \\ 0 \end{bmatrix} = y^H ((r)A) y.

Thus (r)A is positive definite.

We can now prove a theorem that includes the converse to Example 3.5.6.

Theorem 3.5.10. The following conditions are equivalent for any hermitian A ∈ M_{n,n}(C).

(a) A is positive definite.

(b) det((r)A) ∈ R_{>0} for each r = 1, 2, . . . , n.

(c) A = U^H U for some upper triangular matrix U with positive real entries on the diagonal.

Furthermore, the factorization in (c) is unique, called the Cholesky factorization of A.

Proof. We already saw that (c) =⇒ (a) in Example 3.5.6. Also, (a) =⇒ (b) by Lemma 3.5.9 and Corollary 3.5.5. So it remains to show (b) =⇒ (c).

Assume that (b) is true. We prove that (c) holds by induction on n. If n = 1, then A = [a], where a ∈ R_{>0} by (b). So we can take U = [√a].

Now suppose n > 1 and that the result holds for matrices of size (n − 1) × (n − 1). Define

B = (n−1)A.

Then B is hermitian and satisfies (b). Hence, by Lemma 3.5.9 and our induction hypothesis, we have

B = U^H U

for some upper triangular matrix U ∈ M_{n−1,n−1}(C) with positive real entries on the main diagonal. Since A is hermitian, it has block form

A = \begin{bmatrix} B & p \\ p^H & b \end{bmatrix}, p ∈ C^{n−1}, b ∈ R.

If we define

x = (U^H)^{−1} p and c = b − x^H x,

then block multiplication gives

A = \begin{bmatrix} U^H U & p \\ p^H & b \end{bmatrix} = \begin{bmatrix} U^H & 0 \\ x^H & 1 \end{bmatrix} \begin{bmatrix} U & x \\ 0 & c \end{bmatrix}. (3.11)

Taking determinants and using (1.2) gives

det A = (det U^H)(det U) c = c |det U|^2.

(Here we use Exercise 3.3.2.) Since det A > 0 by (b), it follows that c > 0. Thus, the factorization (3.11) can be modified to give

A = \begin{bmatrix} U^H & 0 \\ x^H & √c \end{bmatrix} \begin{bmatrix} U & x \\ 0 & √c \end{bmatrix}.


Since the factor on the right is upper triangular with positive real entries on the main diagonal, this proves the induction step.

It remains to prove the uniqueness assertion in the statement of the theorem. Suppose that

A = U^H U = U_1^H U_1

are two Cholesky factorizations. Then, by Lemma 1.6.2,

D = U U_1^{−1} = (U^H)^{−1} U_1^H (3.12)

is both upper triangular (since U and U_1 are) and lower triangular (since U^H and U_1^H are). Thus D is a diagonal matrix. It follows from (3.12) that

U = D U_1 and U_1 = D^H U,

and so

U = D U_1 = D D^H U.

Since U is invertible, this implies that DD^H = I. Because the diagonal entries of D are positive real numbers (since this is true for U and U_1), it follows that D = I. Thus U = U_1, as desired.

Remark 3.5.11. (a) If the real matrix A ∈ M_{n,n}(R) is symmetric (hence also hermitian), then the matrix U appearing in the Cholesky factorization A = U^H U also has real entries, and so A = U^T U. See [Nic, Th. 8.3.3].

(b) Positive semi-definite matrices also have Cholesky factorizations, as long as we allow the diagonal entries of U to be zero. However, the factorization is no longer unique in general.

Theorem 3.5.10 tells us that every positive definite matrix has a Cholesky factorization. But how do we find the Cholesky factorization?

Algorithm 3.5.12 (Algorithm for the Cholesky factorization). If A is a positive definite matrix, then the Cholesky factorization A = U^H U can be found as follows:

(a) Transform A to an upper triangular matrix U_1 with positive real diagonal entries using row operations, each of which adds a multiple of a row to a lower row.

(b) Obtain U from U_1 by dividing each row of U_1 by the square root of the diagonal entry in that row.

The key is that step (a) is possible for any positive definite matrix A. Let's do an example before proving Algorithm 3.5.12.

Example 3.5.13. Consider the hermitian matrix

A = \begin{bmatrix} 2 & i & −3 \\ −i & 5 & 2i \\ −3 & −2i & 10 \end{bmatrix}.

Page 70: Applied Linear Algebra - Alistair Savage · These are notes for the course Applied Linear Algebra (MAT 3341) at the University of Ottawa. This is a third course in linear algebra

70 Chapter 3. Orthogonality

We can compute

det (1)A = 2 > 0, det (2)A = 9 > 0, det (3)A = det A = 49 > 0.

Thus, by Theorem 3.5.10, A is positive definite and has a unique Cholesky factorization. We carry out step (a) of Algorithm 3.5.12 as follows:

A = \begin{bmatrix} 2 & i & −3 \\ −i & 5 & 2i \\ −3 & −2i & 10 \end{bmatrix}
\xrightarrow{R_2 + \frac{i}{2}R_1,\ R_3 + \frac{3}{2}R_1}
\begin{bmatrix} 2 & i & −3 \\ 0 & 9/2 & i/2 \\ 0 & −i/2 & 11/2 \end{bmatrix}
\xrightarrow{R_3 + \frac{i}{9}R_2}
\begin{bmatrix} 2 & i & −3 \\ 0 & 9/2 & i/2 \\ 0 & 0 & 49/9 \end{bmatrix} = U_1.

Now we carry out step (b) to obtain

U = \begin{bmatrix} \sqrt{2} & i/\sqrt{2} & −3/\sqrt{2} \\ 0 & 3/\sqrt{2} & i\sqrt{2}/6 \\ 0 & 0 & 7/3 \end{bmatrix}.

You can then check that A = U^H U.
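This factorization is easy to verify with software. A minimal check in Python/NumPy (illustrative only; NumPy's cholesky returns the lower triangular factor L with A = L L^H, so U = L^H is the upper triangular factor used in these notes):

```python
import numpy as np

A = np.array([[2, 1j, -3],
              [-1j, 5, 2j],
              [-3, -2j, 10]])

L = np.linalg.cholesky(A)   # lower triangular, positive real diagonal
U = L.conj().T              # upper triangular factor, A = U^H U
print(np.round(U, 6))
print(np.allclose(U.conj().T @ U, A))   # True
```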

Proof of Algorithm 3.5.12. Suppose A is positive definite, and let A = U^H U be the Cholesky factorization. Let D = diag(d_1, . . . , d_n) be the common diagonal of U and U^H. (So the d_i are positive real numbers.) Then U^H D^{−1} is lower unitriangular (lower triangular with ones on the diagonal). Thus L = (U^H D^{−1})^{−1} is also lower unitriangular. Therefore we can write

L = E_r · · · E_2 E_1 I_n,

where each E_i is an elementary matrix corresponding to a row operation that adds a multiple of one row to a lower row (we modify columns right to left). Then we have

E_r · · · E_2 E_1 A = LA = (D (U^H)^{−1})(U^H U) = DU,

which is upper triangular with positive real entries on the diagonal. This proves that step (a) of the algorithm is possible.

Now consider step (b). We have already shown that we can find a lower unitriangular matrix L_1 and an invertible upper triangular matrix U_1, with positive real entries on the diagonal, such that L_1 A = U_1. (In the notation above, L_1 = E_r · · · E_1 and U_1 = DU.) Since A is hermitian, we have

L_1 U_1^H = L_1 (L_1 A)^H = L_1 A^H L_1^H = L_1 A L_1^H = U_1 L_1^H. (3.13)

Let D_1 = diag(d_1, . . . , d_n) denote the diagonal matrix with the same diagonal entries as U_1. Then (3.13) implies that

L_1 (U_1^H D_1^{−1}) = U_1 L_1^H D_1^{−1}.

This is both upper triangular (since U_1 L_1^H D_1^{−1} is) and lower unitriangular (since L_1 (U_1^H D_1^{−1}) is), and so must equal I_n. Thus

U_1^H D_1^{−1} = L_1^{−1}.


Now let

D_2 = diag(√d_1, . . . , √d_n),

so that D_2^2 = D_1. If we define U = D_2^{−1} U_1, then

U^H U = (U_1^H D_2^{−1})(D_2^{−1} U_1) = U_1^H (D_2^2)^{−1} U_1 = (U_1^H D_1^{−1}) U_1 = L_1^{−1} U_1 = A.

Since U = D_2^{−1} U_1 is the matrix obtained from U_1 by dividing each row by the square root of its diagonal entry, this completes the proof of step (b).

Suppose we have a linear system

Ax = b,

where A is a positive definite hermitian (e.g. real symmetric positive definite) matrix. Then we can find the Cholesky decomposition A = U^H U and consider the linear system

U^H U x = b.

As with the LU decomposition, we can first solve U^H y = b by forward substitution, and then solve Ux = y by back substitution. For linear systems that can be put in symmetric form, using the Cholesky decomposition is roughly twice as efficient as using the LU decomposition.
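Here is a minimal sketch of this solution strategy in Python (illustrative only, using SciPy's triangular solvers; the matrix A below is just a sample positive definite matrix, not one from the notes):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

A = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])   # real symmetric positive definite
b = np.array([2.0, 1.0, -1.0])

U = cholesky(A)                            # upper triangular, A = U^T U
y = solve_triangular(U.T, b, lower=True)   # forward substitution: U^T y = b
x = solve_triangular(U, y, lower=False)    # back substitution:    U x = y
print(np.allclose(A @ x, b))               # True
```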

Exercises.

3.5.1. Show that a hermitian matrix is positive semi-definite if and only if all its eigenvalues λ are nonnegative, that is, λ ≥ 0.

Additional recommended exercises: [Nic, §8.3].

3.6 QR factorization

Unitary matrices are very easy to invert, since the conjugate transpose is the inverse. Thus, it is useful to factor an arbitrary matrix as a product of a unitary matrix and a triangular matrix (which we've seen are also nice in many ways). We tackle this problem in this section. A good reference for this material is [Nic, §8.4]. (However, see Remark 3.6.2.)

Definition 3.6.1 (QR factorization). A QR factorization of A ∈ M_{m,n}(C), m ≥ n, is a factorization A = QR, where Q is an m × m unitary matrix and R is an m × n upper triangular matrix whose entries on the main diagonal are nonnegative real numbers.

Note that the bottom m − n rows of an m × n upper triangular matrix (with m ≥ n) are zero rows. Thus, we can write a QR factorization A = QR in block form

A = QR = Q \begin{bmatrix} R_1 \\ 0 \end{bmatrix} = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_1 \\ 0 \end{bmatrix} = Q_1 R_1,


where R_1 is an n × n upper triangular matrix whose entries on the main diagonal are nonnegative real numbers, 0 is the (m − n) × n zero matrix, Q_1 is m × n, Q_2 is m × (m − n), and Q_1, Q_2 both have orthonormal columns. The factorization A = Q_1 R_1 is called a thin QR factorization or reduced QR factorization of A.

Remark 3.6.2. You may find different definitions of QR factorization in other references. In particular, some references, including [Nic, §8.6], refer to the reduced QR factorization as a QR factorization (without using the word “reduced”). In addition, [Nic, §8.6] imposes the condition that the columns of A are linearly independent. This is not needed for the existence result we will prove below (Theorem 3.6.3), but it will be needed for uniqueness (Theorem 3.6.6). When consulting other references, be sure to look at which definition they are using to avoid confusion.

Note that, given a reduced QR factorization, one can easily obtain a QR factorization by extending the columns of Q_1 to an orthonormal basis, and defining Q_2 to be the matrix whose columns are the additional vectors in this basis. It follows that, when m > n, the QR factorization is not unique. (One can extend to an orthonormal basis in more than one way.) However, the reduced QR factorization has some chance of being unique. (See Theorem 3.6.6.)

The power of the QR factorization comes from the fact that there are computer algorithms that can compute it with good control over round-off error. Finding the QR factorization involves the Gram–Schmidt algorithm.

Recall that a matrix A ∈ M_{m,n}(C) has linearly independent columns if and only if its rank is n, which can only occur if A is tall or square (i.e. m ≥ n).

Theorem 3.6.3. Every tall or square matrix has a QR factorization.

Proof. We will prove the theorem under the additional assumption that A has full rank (i.e. rank A = n), and then make some remarks about how one can modify the proof to work without this assumption. We show that A has a reduced QR factorization, from which it follows (as discussed above) that A has a QR factorization.

Suppose

A = \begin{bmatrix} a_1 & a_2 & · · · & a_n \end{bmatrix} ∈ M_{m,n}(C)

with linearly independent columns a_1, a_2, . . . , a_n. We can use the Gram–Schmidt algorithm to obtain an orthogonal set u_1, . . . , u_n spanning the column space of A. Namely, we set u_1 = a_1 and

u_k = a_k − \sum_{j=1}^{k−1} \frac{〈u_j, a_k〉}{‖u_j‖^2} u_j for k = 2, 3, . . . , n. (3.14)

If we define

q_k = \frac{1}{‖u_k‖} u_k for each k = 1, 2, . . . , n,

then the q_1, . . . , q_n are orthonormal and (3.14) becomes

‖u_k‖ q_k = a_k − \sum_{j=1}^{k−1} 〈q_j, a_k〉 q_j. (3.15)


Using (3.15), we can express each a_k as a linear combination of the q_j:

a_1 = ‖u_1‖ q_1,
a_2 = 〈q_1, a_2〉 q_1 + ‖u_2‖ q_2,
a_3 = 〈q_1, a_3〉 q_1 + 〈q_2, a_3〉 q_2 + ‖u_3‖ q_3,
⋮
a_n = 〈q_1, a_n〉 q_1 + 〈q_2, a_n〉 q_2 + · · · + 〈q_{n−1}, a_n〉 q_{n−1} + ‖u_n‖ q_n.

Writing these equations in matrix form gives us the factorization we're looking for:

A = \begin{bmatrix} a_1 & a_2 & a_3 & · · · & a_n \end{bmatrix} = \begin{bmatrix} q_1 & q_2 & q_3 & · · · & q_n \end{bmatrix} \begin{bmatrix} ‖u_1‖ & 〈q_1, a_2〉 & 〈q_1, a_3〉 & · · · & 〈q_1, a_n〉 \\ 0 & ‖u_2‖ & 〈q_2, a_3〉 & · · · & 〈q_2, a_n〉 \\ 0 & 0 & ‖u_3‖ & · · · & 〈q_3, a_n〉 \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & 0 & · · · & ‖u_n‖ \end{bmatrix}.

What do we do if rank A < n? In this case, the columns of A are linearly dependent, so some of the columns are in the span of the previous ones. Suppose that a_k is the first such column. Then, as noted in Remark 3.1.7, we get u_k = 0. We can fix the proof as follows: Let v_1, . . . , v_r be an orthonormal basis of (col A)^⊥. Then, if we obtain u_k = 0 in the Gram–Schmidt algorithm above, we let q_k be one of the v_i, using each v_i exactly once. We then continue the proof as above, and the matrix R will have a row of zeros in the k-th row for each k that gave u_k = 0 in the Gram–Schmidt algorithm.

Remark 3.6.4. Note that for a tall or square real matrix, the proof of Theorem 3.6.3 shows us that we can find a QR factorization with Q and R real matrices.

Example 3.6.5. Let's find a QR factorization of

A = \begin{bmatrix} 1 & 2 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}.

We first find a reduced QR factorization. We have

a_1 = (1, 0, 1) and a_2 = (2, 1, 0).

Thus

u_1 = a_1,
u_2 = a_2 − \frac{〈u_1, a_2〉}{‖u_1‖^2} u_1 = (2, 1, 0) − \frac{2}{2}(1, 0, 1) = (1, 1, −1).

So we have

q_1 = \frac{1}{‖u_1‖} u_1 = \frac{1}{\sqrt{2}}(1, 0, 1),


q_2 = \frac{1}{‖u_2‖} u_2 = \frac{1}{\sqrt{3}}(1, 1, −1).

We define

Q_1 = \begin{bmatrix} q_1 & q_2 \end{bmatrix} = \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{3} \\ 0 & 1/\sqrt{3} \\ 1/\sqrt{2} & −1/\sqrt{3} \end{bmatrix} and R_1 = \begin{bmatrix} ‖u_1‖ & 〈q_1, a_2〉 \\ 0 & ‖u_2‖ \end{bmatrix} = \begin{bmatrix} \sqrt{2} & \sqrt{2} \\ 0 & \sqrt{3} \end{bmatrix}.

We can then verify that A = Q_1 R_1. To get a QR factorization, we need to complete {q_1, q_2} to an orthonormal basis. So we need to find a vector orthogonal to both q_1 and q_2 (equivalently, to their multiples u_1 and u_2). A vector u_3 = (x, y, z) is orthogonal to both when

0 = 〈u_1, (x, y, z)〉 = x + z and 0 = 〈u_2, (x, y, z)〉 = x + y − z.

Solving this system, we see that u_3 = (1, −2, −1) is orthogonal to u_1 and u_2. So we define

q_3 = \frac{1}{‖u_3‖} u_3 = \frac{1}{\sqrt{6}}(1, −2, −1).

Then, setting

Q = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix} = \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{3} & 1/\sqrt{6} \\ 0 & 1/\sqrt{3} & −2/\sqrt{6} \\ 1/\sqrt{2} & −1/\sqrt{3} & −1/\sqrt{6} \end{bmatrix} and R = \begin{bmatrix} \sqrt{2} & \sqrt{2} \\ 0 & \sqrt{3} \\ 0 & 0 \end{bmatrix},

we have a QR factorization A = QR.
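In practice one would compute such a factorization numerically. A quick check of this example in Python/NumPy (illustrative only; NumPy does not enforce nonnegative diagonal entries in R, so its factors may differ from the ones above by signs):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0]])

Q1, R1 = np.linalg.qr(A, mode='reduced')    # reduced (thin) QR: 3x2 and 2x2
Q, R = np.linalg.qr(A, mode='complete')     # full QR: 3x3 and 3x2
print(np.allclose(Q1 @ R1, A), np.allclose(Q @ R, A))   # True True
```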

Now that we know that QR factorizations exist, what about uniqueness? As noted earlier, here we should focus on reduced QR factorizations.

Theorem 3.6.6. Every tall or square matrix A with linearly independent columns has a unique reduced QR factorization A = QR. Furthermore, the matrix R is invertible.

Proof. Suppose A ∈ M_{m,n}(C), m ≥ n, has linearly independent columns. We know from Theorem 3.6.3 that A has a QR factorization A = QR. Furthermore, since Q is invertible, we have

rank R = rank(QR) = rank A = n.

Since R is m × n and m ≥ n, the columns of R are also linearly independent. It follows that the entries on the main diagonal of R must be nonzero, and hence positive (since they are nonnegative real numbers). This also holds for the upper triangular matrix appearing in a reduced QR factorization.

Now suppose

A = QR and A = Q_1 R_1

are two reduced QR factorizations of A. By the above, the entries on the main diagonal of R and R_1 are positive. Since R and R_1 are square and upper triangular, this implies that


they are invertible. We wish to show that Q = Q_1 and R = R_1. Label the columns of Q and Q_1:

Q = \begin{bmatrix} c_1 & c_2 & · · · & c_n \end{bmatrix} and Q_1 = \begin{bmatrix} d_1 & d_2 & · · · & d_n \end{bmatrix}.

Since Q and Q_1 have orthonormal columns, we have

Q^H Q = I_n = Q_1^H Q_1.

(Note that Q and Q_1 are not unitary matrices unless they are square. If they are not square, then QQ^H and Q_1 Q_1^H need not equal the identity. Recall our discussion of one-sided inverses in Section 1.5.) Therefore, the equality QR = Q_1 R_1 implies Q_1^H Q = R_1 R^{−1}. Let

[t_{ij}] = Q_1^H Q = R_1 R^{−1}. (3.16)

Since R and R_1 are upper triangular with positive real diagonal elements, we have

t_{ii} ∈ R_{>0} and t_{ij} = 0 for i > j.

On the other hand, the (i, j)-entry of Q_1^H Q is d_i^H c_j. So we have

〈d_i, c_j〉 = d_i^H c_j = t_{ij} for all i, j.

Since Q = Q_1 (R_1 R^{−1}), each c_j is in Span{d_1, d_2, . . . , d_n}. Thus, by (3.3), we have

c_j = \sum_{i=1}^{n} 〈d_i, c_j〉 d_i = \sum_{i=1}^{j} t_{ij} d_i,

since 〈d_i, c_j〉 = t_{ij} = 0 for i > j. Writing out these equations explicitly, we have:

c_1 = t_{11} d_1,
c_2 = t_{12} d_1 + t_{22} d_2,
c_3 = t_{13} d_1 + t_{23} d_2 + t_{33} d_3,
c_4 = t_{14} d_1 + t_{24} d_2 + t_{34} d_3 + t_{44} d_4,
⋮
(3.17)

The first equation gives

1 = ‖c_1‖ = ‖t_{11} d_1‖ = |t_{11}| ‖d_1‖ = t_{11}, (3.18)

since t_{11} ∈ R_{>0}. Thus c_1 = d_1. Then we have

t_{12} = 〈d_1, c_2〉 = 〈c_1, c_2〉 = 0,

and so the second equation in (3.17) gives

c_2 = t_{22} d_2.

As in (3.18), this implies that c_2 = d_2. Then t_{13} = 0 and t_{23} = 0 follow in the same way. Continuing in this way, we conclude that c_i = d_i for all i. Thus, Q = Q_1 and, by (3.16), R = R_1, as desired.


So far our results are all about tall or square matrices. What about wide matrices?

Corollary 3.6.7. Every wide or square matrix A factors as A = LP, where P is a unitary matrix and L is a lower triangular matrix whose entries on the diagonal are nonnegative real numbers. It also factors as A = L_1 P_1, where L_1 is a square lower triangular matrix whose entries on the main diagonal are nonnegative real numbers and P_1 has orthonormal rows. If A has linearly independent rows, then the second factorization is unique and the diagonal entries of L_1 are positive.

Proof. We apply Theorems 3.6.3 and 3.6.6 to AT .

Corollary 3.6.8. Every invertible (hence square) matrix A has unique factorizations A = QR and A = LP, where Q and P are unitary, R is upper triangular with positive real diagonal entries, and L is lower triangular with positive real diagonal entries.

QR factorizations have a number of applications. We learned in Proposition 1.5.15 that a matrix has linearly independent columns if and only if it has a left inverse if and only if A^H A is invertible. Furthermore, if A is left-invertible, then (A^H A)^{−1} A^H is a left inverse of A. (We worked over R in Section 1.5.4, but the results are true over C as long as we replace the transpose by the conjugate transpose.) So, in this situation, it is useful to compute (A^H A)^{−1}. Here Theorem 3.6.6 guarantees us that we have a reduced QR factorization A = QR with R invertible. (And we have an algorithm for finding this factorization!) Since Q has orthonormal columns, we have Q^H Q = I. Thus

A^H A = R^H Q^H Q R = R^H R,

and so

(A^H A)^{−1} = R^{−1} (R^{−1})^H.

Since R is upper triangular, it is easy to find its inverse. So this gives us an efficient method of finding left inverses.

Students who took MAT 2342 learned about finding best approximations to (possibly inconsistent) linear systems. (See [Nic, §5.6].) In particular, if Ax = b is a linear system, then any solution z to the normal equations

(A^H A) z = A^H b (3.19)

is a best approximation to a solution to Ax = b in the sense that ‖b − Az‖ is the minimum value of ‖b − Ax‖ for x ∈ C^n. (You probably worked over R in MAT 2342.) As noted above, A has linearly independent columns if and only if A^H A is invertible. In this case, there is a unique solution z to (3.19), and it is given by

z = (A^H A)^{−1} A^H b.

As noted above, (A^H A)^{−1} A^H is a left inverse to A. We saw in Section 1.5.1 that if Ax = b has a solution, then it must be (A^H A)^{−1} A^H b. What we're saying here is that, even if there is no solution, (A^H A)^{−1} A^H b is the best approximation to a solution.
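Combining the two observations above: with a reduced QR factorization A = Q_1 R_1, the normal equations reduce to the triangular system R_1 z = Q_1^H b. A minimal sketch in Python/NumPy (illustrative only; the matrix and right-hand side are arbitrary sample data):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0]])
b = np.array([1.0, 2.0, 3.0])

Q1, R1 = np.linalg.qr(A, mode='reduced')
z = np.linalg.solve(R1, Q1.conj().T @ b)         # R1 z = Q1^H b

z_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # direct least-squares solver
print(np.allclose(z, z_lstsq))                   # True
```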


Exercises.

Recommended exercises: Exercises in [Nic, §8.4]. Keep in mind the different definition of QR factorization used in [Nic] (see [Nic, Def. 8.6]).

3.7 Computing eigenvalues

Until now, you've found eigenvalues by finding the roots of the characteristic polynomial. In practice, this is almost never done. For large matrices, finding these roots is difficult. Instead, iterative methods for estimating eigenvalues are much better. In this section, we will explore such methods. A reference for some of this material is [Nic, §8.5].

3.7.1 The power method

Throughout this subsection, we suppose that A ∈ M_{n,n}(C). We will use ‖ · ‖ to denote the 2-norm on C^n.

An eigenvalue λ for A is called a dominant eigenvalue if λ has algebraic multiplicity one, and

|λ| > |µ| for all eigenvalues µ ≠ λ.

Any eigenvector corresponding to a dominant eigenvalue is called a dominant eigenvector of A.

Suppose A is diagonalizable, and let λ_1, . . . , λ_n be the eigenvalues of A, with multiplicity, and suppose that

|λ_1| ≤ · · · ≤ |λ_{n−1}| < λ_n.

(We are implicitly saying that λ_n is real and positive here.) In particular, λ_n is a dominant eigenvalue. Fix a basis {v_1, . . . , v_n} of unit eigenvectors of A, so that Av_i = λ_i v_i for each i = 1, . . . , n.

Let's start with some unit vector x_0 ∈ C^n and recursively define a sequence of vectors x_0, x_1, x_2, . . . and positive real numbers ‖y_1‖, ‖y_2‖, ‖y_3‖, . . . by

y_k = A x_{k−1}, x_k = \frac{y_k}{‖y_k‖}, k ≥ 1. (3.20)

The power method uses the fact that, under some mild assumptions, the sequence ‖y_1‖, ‖y_2‖, . . . converges to λ_n, and the sequence x_1, x_2, . . . converges to a corresponding eigenvector.

We should be precise about what we mean by convergence here. You learned about convergence of sequences of real numbers in calculus. What about convergence of vectors? We say that a sequence of vectors x_1, x_2, . . . converges to a vector v if

\lim_{k→∞} ‖x_k − v‖ = 0.

This is equivalent to the components of the vectors x_k converging (in the sense of sequences of scalars) to the components of v.


Theorem 3.7.1. With the notation and assumptions from above, suppose that the initial vector x_0 is of the form

x_0 = \sum_{i=1}^{n} a_i v_i with a_n ≠ 0.

Then the power method converges, that is,

\lim_{k→∞} ‖y_k‖ = λ_n and \lim_{k→∞} x_k = \frac{a_n}{|a_n|} v_n.

In particular, the x_k converge to an eigenvector of eigenvalue λ_n.

Proof. Note that, by definition, x_k is a unit vector that is a positive real multiple of

A^k x_0 = A^k \left( \sum_{i=1}^{n} a_i v_i \right) = \sum_{i=1}^{n} a_i A^k v_i = \sum_{i=1}^{n} a_i λ_i^k v_i.

Thus

x_k = \frac{\sum_{i=1}^{n} a_i λ_i^k v_i}{\left\| \sum_{i=1}^{n} a_i λ_i^k v_i \right\|} = \frac{a_n v_n + \sum_{i=1}^{n−1} a_i (λ_i/λ_n)^k v_i}{\left\| a_n v_n + \sum_{i=1}^{n−1} a_i (λ_i/λ_n)^k v_i \right\|}.

(Note that we could divide by λ_n^k since λ_n ≠ 0. Also, since a_n ≠ 0, the norm in the denominator is nonzero.) Since |λ_i| < λ_n for i ≠ n, we have

(λ_i/λ_n)^k → 0 as k → ∞.

Hence, as k → ∞, we have

x_k → \frac{a_n v_n}{‖a_n v_n‖} = \frac{a_n}{|a_n|} v_n.

Similarly,

‖y_{k+1}‖ = ‖A x_k‖ = \frac{\left\| a_n λ_n v_n + \sum_{i=1}^{n−1} a_i (λ_i/λ_n)^k λ_i v_i \right\|}{\left\| a_n v_n + \sum_{i=1}^{n−1} a_i (λ_i/λ_n)^k v_i \right\|} = λ_n \frac{\left\| a_n v_n + \sum_{i=1}^{n−1} a_i (λ_i/λ_n)^{k+1} v_i \right\|}{\left\| a_n v_n + \sum_{i=1}^{n−1} a_i (λ_i/λ_n)^k v_i \right\|} → λ_n.

Remarks 3.7.2. (a) It is crucial that the largest eigenvalue be real. On the other hand, the assumption that it is positive can be avoided since, if it is negative, we can apply Theorem 3.7.1 to −A.


(b) If there are several eigenvalues with maximum norm, then the sequences ‖y_k‖ and x_k will not converge in general. On the other hand, if λ_n has multiplicity greater than one, but is the unique eigenvalue with maximum norm, then the sequence ‖y_k‖ will always converge to λ_n, but the sequence x_k may not converge.

(c) If you choose an initial vector x_0 at random, it is very unlikely that you will choose one with a_n = 0. (This would be the same as choosing a random real number and ending up with the real number 0.) Thus, the condition in Theorem 3.7.1 that a_n ≠ 0 is not a serious obstacle in practice.

(d) It is possible to compute the smallest eigenvalue (in norm) by applying the power method to A^{−1}. This is called the inverse power method. It is computationally more involved, since one must solve a linear system at each iteration.

Example 3.7.3. Consider the matrix

A = \begin{bmatrix} 3 & 3 \\ 5 & 1 \end{bmatrix}.

We leave it as an exercise to verify that the eigenvalues of A are −2 and 6. So 6 is a dominant eigenvalue. Let

x_0 = (1, 0)

be our initial vector. Then we compute

y_1 = Ax_0 = \begin{bmatrix} 3 & 3 \\ 5 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 3 \\ 5 \end{bmatrix}, ‖y_1‖ ≈ 5.830951, x_1 = y_1/‖y_1‖ ≈ (0.514496, 0.857493),

y_2 = Ax_1 ≈ (4.115967, 3.429973), ‖y_2‖ ≈ 5.357789, x_2 = y_2/‖y_2‖ ≈ (0.768221, 0.640184),

y_3 = Ax_2 ≈ (4.225215, 4.481289), ‖y_3‖ ≈ 6.159090, x_3 = y_3/‖y_3‖ ≈ (0.686012, 0.727589),

y_4 = Ax_3 ≈ (4.240803, 4.157649), ‖y_4‖ ≈ 5.938893, x_4 = y_4/‖y_4‖ ≈ (0.714073, 0.700071),

y_5 = Ax_4 ≈ (4.242432, 4.270436), ‖y_5‖ ≈ 6.019539, x_5 = y_5/‖y_5‖ ≈ (0.704777, 0.709429).

The ‖y_k‖ are converging to 6, while the x_k are converging to (1/√2)(1, 1) ≈ (0.707107, 0.707107), which is an eigenvector of eigenvalue 6.
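The iteration (3.20) is only a few lines of code. A minimal implementation in Python/NumPy reproducing this example (illustrative sketch only, with a fixed number of iterations rather than a convergence test):

```python
import numpy as np

def power_method(A, x0, num_iter=20):
    x = x0 / np.linalg.norm(x0)
    norm_y = 0.0
    for _ in range(num_iter):
        y = A @ x
        norm_y = np.linalg.norm(y)   # converges to the dominant eigenvalue
        x = y / norm_y               # converges to a dominant eigenvector
    return norm_y, x

A = np.array([[3.0, 3.0],
              [5.0, 1.0]])
lam, v = power_method(A, np.array([1.0, 0.0]))
print(lam)   # approximately 6
print(v)     # approximately (0.707107, 0.707107)
```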

Students who took MAT 2342 learned about Markov chains. Markov chains are a particular case of the power method. The matrix A is stochastic if each entry is a nonnegative real number and the sum of the entries in each column is equal to one. A stochastic matrix is regular if there exists a positive integer k such that all entries of A^k are strictly positive. One can prove that every regular stochastic matrix has 1 as a dominant eigenvalue and a unique steady state vector, which is an eigenvector of eigenvalue 1, whose entries are all nonnegative and sum to 1. One can then use Theorem 3.7.1 to find this steady state vector. See [Nic, §2.9] for further details.


Remark 3.7.4. Note that the power method only allows us to compute the dominant eigenvalue (or the smallest eigenvalue in norm if we use the inverse power method). What if we want to find other eigenvalues? In this case, the power method has serious limitations. If A is hermitian, one can first find the dominant eigenvalue λ_n with eigenvector v_n, and then repeat the power method with an initial vector orthogonal to v_n. At each iteration, we subtract the projection onto v_n to ensure that we remain in the subspace orthogonal to v_n. However, this is quite computationally intensive, and so is not practical for computing all eigenvalues.

3.7.2 The QR method

The QR method is the most-used algorithm to compute all the eigenvalues of a matrix. Here we will restrict our attention to the case where A ∈ M_{n,n}(R) is a real matrix whose eigenvalues have distinct norms:

0 < |λ_1| < |λ_2| < · · · < |λ_{n−1}| < |λ_n|. (3.21)

(The general case is beyond the scope of this course.) These conditions on A ensure that it is invertible and diagonalizable, with distinct real eigenvalues. We do not assume that A is symmetric.

The QR method consists in computing a sequence of matrices A_1, A_2, . . . with A_1 = A and

A_{k+1} = R_k Q_k,

where A_k = Q_k R_k is the QR factorization of A_k, for k ≥ 1. Note that, since A is invertible, it has a unique QR factorization by Corollary 3.6.8. Recall that the matrices Q_k are orthogonal (since they are real and unitary) and the matrices R_k are upper triangular.

Since

A_{k+1} = R_k Q_k = Q_k^{−1} A_k Q_k = Q_k^T A_k Q_k, (3.22)

the matrices A_1, A_2, . . . are all similar, and hence have the same eigenvalues.

Theorem 3.7.5. Suppose A ∈ M_{n,n}(R) satisfies (3.21). In addition, assume that P^{−1} admits an LU factorization, where P is the matrix of eigenvectors of A, that is, A = P diag(λ_1, . . . , λ_n) P^{−1}. Then the sequence A_1, A_2, . . . produced by the QR method converges to an upper triangular matrix whose diagonal entries are the eigenvalues of A.

Proof. We will omit the proof of this theorem. The interested student can find a proof in [AK08, Th. 10.6.1].

Remark 3.7.6. If the matrix A is symmetric, then so are the A_k, by (3.22). Thus, the limit of the A_k is a diagonal matrix.

Example 3.7.7. Consider the symmetric matrix

A = \begin{bmatrix} 1 & 3 & 4 \\ 3 & 1 & 2 \\ 4 & 2 & 1 \end{bmatrix}.


Using the QR method gives

A_{20} = \begin{bmatrix} 7.07467 & 0.0 & 0.0 \\ 0.0 & −3.18788 & 0.0 \\ 0.0 & 0.0 & −0.88679 \end{bmatrix}.

The diagonal entries here are the eigenvalues of A to within 5 × 10^{−6}.
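The iteration itself is short; here is a minimal sketch in Python/NumPy reproducing this computation (illustrative only; NumPy's qr does not normalize the signs of its factors, which does not affect convergence of the diagonal entries):

```python
import numpy as np

A = np.array([[1.0, 3.0, 4.0],
              [3.0, 1.0, 2.0],
              [4.0, 2.0, 1.0]])

Ak = A.copy()
for _ in range(20):
    Q, R = np.linalg.qr(Ak)
    Ak = R @ Q                           # A_{k+1} = R_k Q_k, similar to A

print(np.round(Ak, 5))                   # nearly diagonal; compare with A_20 above
print(np.sort(np.linalg.eigvalsh(A)))    # eigenvalues, for comparison
```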

In practice, the convergence of the QR method can be slow when the eigenvalues are close together. The speed can be improved in certain ways.

• Shifting: If, at stage k of the algorithm, a number s_k is chosen and A_k − s_k I is factored in the form Q_k R_k rather than A_k itself, then

Q_k^{−1} A_k Q_k = Q_k^{−1} (Q_k R_k + s_k I) Q_k = R_k Q_k + s_k I,

and so we take A_{k+1} = R_k Q_k + s_k I. If the shifts s_k are carefully chosen, one can greatly improve convergence.

• One can first bring the matrix A to upper Hessenberg form, which is a matrix that is nearly upper triangular (one allows nonzero entries just below the diagonal), using a technique based on Householder reduction. Then the convergence in the QR algorithm is faster.

See [AK08, §10.6] for further details.

3.7.3 Gershgorin circle theorem

We conclude this section with a result that can be used to bound the spectrum (i.e. set of eigenvalues) of a square matrix. We suppose in this subsection that A = [a_{ij}] ∈ M_{n,n}(C).

For i = 1, 2, . . . , n, let

R_i = \sum_{j : j ≠ i} |a_{ij}|

be the sum of the absolute values of the non-diagonal entries in the i-th row. For a ∈ C and r ∈ R_{≥0}, let

D(a, r) = {z ∈ C : |z − a| ≤ r}

be the closed disc with centre a and radius r. The discs D(a_{ii}, R_i) are called the Gershgorin discs of A.

Theorem 3.7.8 (Gershgorin circle theorem). Every eigenvalue of A lies in at least one of the Gershgorin discs D(a_{ii}, R_i).

Proof. Suppose λ is an eigenvalue of A. Let x = (x_1, . . . , x_n) be a corresponding eigenvector. Dividing x by its component with largest absolute value, we can assume that, for some i ∈ {1, 2, . . . , n}, we have

x_i = 1 and |x_j| ≤ 1 for all j.


Equating the i-th entries of Ax = λx, we have

\sum_{j=1}^{n} a_{ij} x_j = λ x_i = λ.

Since x_i = 1, this implies that

\sum_{j : j ≠ i} a_{ij} x_j + a_{ii} = λ.

Then we have

|λ − a_{ii}| = \left| \sum_{j ≠ i} a_{ij} x_j \right|
≤ \sum_{j ≠ i} |a_{ij}| |x_j| (by the triangle inequality)
≤ \sum_{j ≠ i} |a_{ij}| (since |x_j| ≤ 1 for all j)
= R_i.

Corollary 3.7.9. The eigenvalues of A lie in the Gershgorin discs corresponding to the columns of A. More precisely, each eigenvalue lies in at least one of the discs D(a_{jj}, \sum_{i : i ≠ j} |a_{ij}|).

Proof. Since A and A^T have the same eigenvalues, we can apply Theorem 3.7.8 to A^T.

One can interpret the Gershgorin circle theorem as saying that if the off-diagonal entries of a square matrix have small norms, then the eigenvalues of the matrix are close to the diagonal entries of the matrix.

Example 3.7.10. Consider the matrix

A = \begin{bmatrix} 0 & 1 \\ 4 & 0 \end{bmatrix}.

It has characteristic polynomial x^2 − 4 = (x − 2)(x + 2), and so its eigenvalues are ±2. The Gershgorin circle theorem tells us that the eigenvalues lie in the discs D(0, 1) and D(0, 4).

[Figure: the Gershgorin discs D(0, 1) and D(0, 4) in the complex plane, with the eigenvalues ±2 marked on the real axis.]


Example 3.7.11. Consider the matrix

A = \begin{bmatrix} 1 & −2 \\ 1 & −1 \end{bmatrix}.

It has characteristic polynomial x^2 + 1, and so its eigenvalues are ±i. The Gershgorin circle theorem tells us that the eigenvalues lie in the discs D(1, 2) and D(−1, 1).

[Figure: the Gershgorin discs D(1, 2) and D(−1, 1) in the complex plane, with the eigenvalues ±i marked on the imaginary axis.]

Note that, in Examples 3.7.10 and 3.7.11, it was not the case that each Gershgorin disc contained one eigenvalue. There was one disc that contained no eigenvalues, and one disc that contained two eigenvalues. In general, one has the following strengthened version of the Gershgorin circle theorem.

Theorem 3.7.12. If the union of k Gershgorin discs is disjoint from the union of the other n − k discs, then the former union contains exactly k eigenvalues of A, and the latter union contains exactly n − k eigenvalues of A.

Proof. The proof of this theorem uses a continuity argument, where one starts with Gershgorin discs that are points, and gradually enlarges them. For details, see [Mey00, p. 498].

Example 3.7.13. Let's use the Gershgorin circle theorem to estimate the eigenvalues of

A = \begin{bmatrix} −7 & 0.3 & 0.2 \\ 5 & 0 & 2 \\ 1 & −1 & 10 \end{bmatrix}.

The Gershgorin discs are

D(−7, 0.5), D(0, 7), D(10, 2).

[Figure: the three Gershgorin discs in the complex plane; D(−7, 0.5) and D(0, 7) overlap, while D(10, 2) is disjoint from both.]

By Theorem 3.7.12, we know that two eigenvalues lie in the union of discs D(−7, 0.5) ∪ D(0, 7) and one lies in the disc D(10, 2).
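Computing Gershgorin discs is a one-line operation on the rows of the matrix. A minimal check of this example in Python/NumPy (illustrative only):

```python
import numpy as np

A = np.array([[-7.0, 0.3, 0.2],
              [5.0, 0.0, 2.0],
              [1.0, -1.0, 10.0]])

centres = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centres)   # R_i = sum of off-diagonal |a_ij|
print(list(zip(centres, radii)))    # [(-7.0, 0.5), (0.0, 7.0), (10.0, 2.0)]

for lam in np.linalg.eigvals(A):
    print(lam, np.any(np.abs(lam - centres) <= radii))   # each eigenvalue lies in some disc
```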


Exercises.

3.7.1. Suppose A ∈ M_{n,n}(C) has eigenvalues λ_1, . . . , λ_r. (We only list each eigenvalue once here, even if it has multiplicity greater than one.) Prove that the Gershgorin discs for A are precisely the sets {λ_1}, . . . , {λ_r} if and only if A is diagonal.

3.7.2. Let

A = \begin{bmatrix} 1.3 & 0.5 & 0.1 & 0.2 & 0.1 \\ −0.2 & 0.7 & 0 & 0.2 & 0.1 \\ 1 & −2 & 4 & 0.1 & −0.1 \\ 0 & 0.2 & −0.1 & 2 & 1 \\ 0.05 & 0 & 0.1 & 0.5 & 1 \end{bmatrix}.

Use the Gershgorin circle theorem to prove that A is invertible.

3.7.3. Suppose that P is a permutation matrix. (Recall that this means that each row and column of P have one entry equal to 1 and all other entries equal to 0.)

(a) Show that there are two possibilities for the Gershgorin discs of P.

(b) Show, using different methods, that the eigenvalues of a permutation matrix all have absolute value 1. Compare this with your results from (a).

3.7.4. Fix z ∈ C^×, and define

A = \begin{bmatrix} 0 & 0.1 & 0.1 \\ 10 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}, C = \begin{bmatrix} z & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.

(a) Calculate the matrix B = CAC^{−1}. Recall that A and B have the same eigenvalues.

(b) Give the Gershgorin discs for B, and find the values of z that give the strongest conclusion for the eigenvalues.

(c) What can we conclude about the eigenvalues of A?

Additional recommended exercises: Exercises in [Nic, §8.5].


Chapter 4

Generalized diagonalization

In this and previous courses, you've seen the concept of diagonalization. Diagonalizing a matrix makes it very easy to work with in many ways: you know the eigenvalues and eigenvectors, you can easily compute powers of the matrix, etc. However, you know that not all matrices are diagonalizable. So it is natural to ask if there is some slightly more general result concerning a nice form in which all matrices can be written. In this chapter we will consider two such forms: singular value decomposition and Jordan canonical form. Students who took MAT 2342 also saw singular value decomposition a bit in that course.

4.1 Singular value decomposition

One of the most useful tools in applied linear algebra is a factorization called singular value decomposition (SVD). A good reference for the material in this section is [Nic, §8.6.1]. Note, however, that [Nic, §8.6] works over R, whereas we will work over C.

Definition 4.1.1 (Singular value decomposition). A singular value decomposition (SVD) of A ∈ M_{m,n}(C) is a factorization

A = P Σ Q^H,

where P and Q are unitary and, in block form,

Σ = \begin{bmatrix} D & 0 \\ 0 & 0 \end{bmatrix}_{m×n}, D = diag(d_1, d_2, . . . , d_r), d_1, d_2, . . . , d_r ∈ R_{>0}.

Note that if P = Q in Definition 4.1.1, then A is unitarily diagonalizable. So SVD is a kind of generalization of unitary diagonalization. Our goal in this section is to prove that every matrix has a SVD, and to describe an algorithm for finding this decomposition. Later, we'll discuss some applications of SVDs.

Recall that hermitian matrices are nice in many ways. For instance, their eigenvalues are real and they are unitarily diagonalizable by the spectral theorem (Theorem 3.4.4). Note that for any complex matrix A (not necessarily square), both A^H A and AA^H are hermitian. It turns out that, to find a SVD of A, we should study these matrices.

Lemma 4.1.2. Suppose A ∈Mm,n(C).


(a) The eigenvalues of A^H A and AA^H are real and nonnegative.

(b) The matrices A^H A and AA^H have the same set of positive eigenvalues.

Proof. (a) Suppose λ is an eigenvalue of A^H A with eigenvector v ∈ C^n, v ≠ 0. Then we have

‖Av‖^2 = (Av)^H Av = v^H A^H A v = v^H (λv) = λ v^H v = λ ‖v‖^2.

Thus λ = ‖Av‖^2 / ‖v‖^2 ∈ R_{≥0}. The proof for AA^H is analogous, replacing A by A^H.

(b) Suppose λ is a positive eigenvalue of A^H A with eigenvector v ∈ C^n, v ≠ 0. Then Av ∈ C^m and

AA^H (Av) = A((A^H A)v) = A(λv) = λ(Av).

Also, we have Av ≠ 0 since A^H A v = λ v ≠ 0 (because λ ≠ 0 and v ≠ 0). Thus λ is an eigenvalue of AA^H. This proves that every positive eigenvalue of A^H A is an eigenvalue of AA^H. For the reverse inclusion, we replace A everywhere by A^H.

We now analyze the hermitian matrix A^H A, called the Gram matrix of A (see Section 1.5.4), in more detail.

Step 1: Unitarily diagonalize A^H A

Since the matrix A^H A is hermitian, the spectral theorem (Theorem 3.4.4) states that we can choose an orthonormal basis

{q_1, q_2, . . . , q_n} ⊆ C^n

consisting of eigenvectors of A^H A with corresponding eigenvalues λ_1, . . . , λ_n. By Lemma 4.1.2(a), we have λ_i ∈ R_{≥0} for each i. We may choose the order of the q_1, q_2, . . . , q_n such that

λ_1 ≥ λ_2 ≥ · · · ≥ λ_r > 0 and λ_i = 0 for i > r. (4.1)

(We allow the possibility that r = 0, so that λ_i = 0 for all i, and also the possibility that r = n.) By Proposition 3.3.9,

Q = \begin{bmatrix} q_1 & q_2 & · · · & q_n \end{bmatrix} is unitary and unitarily diagonalizes A^H A. (4.2)

Step 2: Show that rank A = r

Recall from Section 1.3 that

rank A = dim(col A) = dim(im T_A).

We wish to show that rank A = r, where r is defined in (4.1). We do this by showing that

{Aq_1, Aq_2, . . . , Aq_r} is an orthogonal basis of im T_A. (4.3)

First note that, for all i, j, we have

〈Aq_i, Aq_j〉 = (Aq_i)^H Aq_j = q_i^H (A^H A) q_j = q_i^H (λ_j q_j) = λ_j q_i^H q_j = λ_j 〈q_i, q_j〉.


Since the set {q_1, q_2, . . . , q_n} is orthonormal, this implies that

〈Aq_i, Aq_j〉 = 0 if i ≠ j and ‖Aq_i‖^2 = λ_i ‖q_i‖^2 = λ_i for each i. (4.4)

Thus, using (4.1), we see that

{Aq_1, Aq_2, . . . , Aq_r} ⊆ im T_A is an orthogonal set and Aq_i = 0 if i > r. (4.5)

Since {Aq_1, Aq_2, . . . , Aq_r} is orthogonal, it is linearly independent. It remains to show that it spans im T_A. So we need to show that

U := Span{Aq_1, Aq_2, . . . , Aq_r}

is equal to im T_A. Since we already know U ⊆ im T_A, we just need to show that im T_A ⊆ U. So we need to show

Ax ∈ U for all x ∈ C^n.

Suppose x ∈ C^n. Since {q_1, q_2, . . . , q_n} is a basis for C^n, we can write

x = t_1 q_1 + t_2 q_2 + · · · + t_n q_n for some t_1, t_2, . . . , t_n ∈ C.

Then, by (4.5), we have

Ax = t_1 Aq_1 + t_2 Aq_2 + · · · + t_n Aq_n = t_1 Aq_1 + · · · + t_r Aq_r ∈ U,

as desired.

Step 3: Some definitions

Definition 4.1.3 (Singular values of A). The real numbers

σ_i = √λ_i = ‖Aq_i‖ (by (4.4)), i = 1, 2, . . . , n, (4.6)

are called the singular values of the matrix A.

Note that, by (4.1), we have

σ_1 ≥ σ_2 ≥ · · · ≥ σ_r > 0 and σ_i = 0 if i > r.

So the number of positive singular values is equal to r = rank A.

Definition 4.1.4 (Singular matrix of A). Define

D_A = diag(σ_1, . . . , σ_r).

Then the m × n matrix

Σ_A := \begin{bmatrix} D_A & 0 \\ 0 & 0 \end{bmatrix}

(in block form) is called the singular matrix of A.

Remark 4.1.5. Don't confuse the “singular matrix of A” with the concept of a “singular matrix” (a square matrix that is not invertible).


Step 4: Find an orthonormal basis of C^m compatible with col A

Normalize the vectors Aq_1, Aq_2, . . . , Aq_r by defining

p_i = \frac{1}{‖Aq_i‖} Aq_i = \frac{1}{σ_i} Aq_i ∈ C^m for i = 1, 2, . . . , r, (4.7)

using (4.6). It follows from (4.3) that

{p_1, p_2, . . . , p_r} is an orthonormal basis of im T_A = col A ⊆ C^m. (4.8)

By Proposition 3.1.2, we can find p_{r+1}, . . . , p_m so that

{p_1, p_2, . . . , p_m} is an orthonormal basis of C^m. (4.9)

Step 5: Find the decomposition

By (4.9) and (4.2) we have two unitary matrices

P = \begin{bmatrix} p_1 & p_2 & · · · & p_m \end{bmatrix} ∈ M_{m,m}(C) and Q = \begin{bmatrix} q_1 & q_2 & · · · & q_n \end{bmatrix} ∈ M_{n,n}(C).

We also have

σ_i p_i = ‖Aq_i‖ p_i = Aq_i for i = 1, 2, . . . , r,

using (4.6) and (4.7). Using this and (4.5), we have

AQ = \begin{bmatrix} Aq_1 & Aq_2 & · · · & Aq_n \end{bmatrix} = \begin{bmatrix} σ_1 p_1 & · · · & σ_r p_r & 0 & · · · & 0 \end{bmatrix}.

Then we compute

P Σ_A = \begin{bmatrix} p_1 & · · · & p_r & p_{r+1} & · · · & p_m \end{bmatrix} \begin{bmatrix} σ_1 & · · · & 0 & 0 & · · · & 0 \\ ⋮ & ⋱ & ⋮ & ⋮ & & ⋮ \\ 0 & · · · & σ_r & 0 & · · · & 0 \\ 0 & · · · & 0 & 0 & · · · & 0 \\ ⋮ & & ⋮ & ⋮ & & ⋮ \\ 0 & · · · & 0 & 0 & · · · & 0 \end{bmatrix} = \begin{bmatrix} σ_1 p_1 & · · · & σ_r p_r & 0 & · · · & 0 \end{bmatrix} = AQ.

Since Q^{−1} = Q^H, it follows that A = P Σ_A Q^H. Thus we have proved the following theorem.

Theorem 4.1.6. Let A ∈ M_{m,n}(C), and let σ_1 ≥ σ_2 ≥ · · · ≥ σ_r > 0 be the positive singular values of A. Then r = rank A and we have a factorization

A = P Σ_A Q^H where P and Q are unitary matrices.

In particular, every complex matrix has a SVD.


Note that the SVD is not unique. For example, if r < m, then there are infinitely many ways to extend {p_1, . . . , p_r} to an orthonormal basis {p_1, . . . , p_m} of C^m. Each such extension leads to a different matrix P in the SVD. For another example illustrating non-uniqueness, consider A = I_n. Then Σ_A = I_n, and A = P Σ_A P^H is a SVD of A for any unitary n × n matrix P.

Remark 4.1.7 (Real SVD). If A ∈ M_{m,n}(R), then we can find a SVD where P and Q are real (hence orthogonal) matrices. To see this, observe that our proof is valid if we replace C by R everywhere.

Our proof of Theorem 4.1.6 gives an algorithm for finding SVDs.

Example 4.1.8. Let's find a SVD of the matrix

A = \begin{bmatrix} 1 & −2 & −1 \\ 1 & 2 & −1 \end{bmatrix}.

Since this matrix is real, we can write A^T instead of A^H in our SVD algorithm. We have

A^T A = \begin{bmatrix} 2 & 0 & −2 \\ 0 & 8 & 0 \\ −2 & 0 & 2 \end{bmatrix} and A A^T = \begin{bmatrix} 6 & −2 \\ −2 & 6 \end{bmatrix}.

As expected, both of these matrices are symmetric. It's easier to find the eigenvalues of AA^T. This has characteristic polynomial

(x − 6)^2 − 4 = x^2 − 12x + 32 = (x − 4)(x − 8).

Thus, the eigenvalues of AA^T are λ_1 = 8 and λ_2 = 4, both with multiplicity one. It follows from Lemma 4.1.2(b) that the eigenvalues of A^T A are λ_1 = 8, λ_2 = 4, and λ_3 = 0, all with multiplicity one. So the positive singular values of A are

σ_1 = √λ_1 = 2√2 and σ_2 = √λ_2 = 2.

We now find an orthonormal basis of each eigenspace of A^T A. We find the eigenvectors

q_1 = (0, 1, 0), q_2 = \left( \frac{−1}{\sqrt{2}}, 0, \frac{1}{\sqrt{2}} \right), q_3 = \left( \frac{1}{\sqrt{2}}, 0, \frac{1}{\sqrt{2}} \right).

We now compute

p_1 = \frac{1}{σ_1} Aq_1 = \frac{1}{2\sqrt{2}} (−2, 2) = \left( \frac{−1}{\sqrt{2}}, \frac{1}{\sqrt{2}} \right),

p_2 = \frac{1}{σ_2} Aq_2 = \frac{1}{2} (−\sqrt{2}, −\sqrt{2}) = \left( \frac{−1}{\sqrt{2}}, \frac{−1}{\sqrt{2}} \right).

Then define

P = \begin{bmatrix} p_1 & p_2 \end{bmatrix} = \begin{bmatrix} −1/\sqrt{2} & −1/\sqrt{2} \\ 1/\sqrt{2} & −1/\sqrt{2} \end{bmatrix},


Q = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix} = \begin{bmatrix} 0 & −1/\sqrt{2} & 1/\sqrt{2} \\ 1 & 0 & 0 \\ 0 & 1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix},

Σ = \begin{bmatrix} σ_1 & 0 & 0 \\ 0 & σ_2 & 0 \end{bmatrix} = \begin{bmatrix} 2\sqrt{2} & 0 & 0 \\ 0 & 2 & 0 \end{bmatrix}.

We can then check that A = P Σ Q^T.
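NumPy computes SVDs directly. A quick check of this example (illustrative only; the computed factors may differ from the ones above by signs or by the choice of basis vectors, since the SVD is not unique):

```python
import numpy as np

A = np.array([[1.0, -2.0, -1.0],
              [1.0, 2.0, -1.0]])

U, s, Vh = np.linalg.svd(A)        # A = U @ Sigma @ Vh, singular values in s
print(s)                           # [2.828..., 2.0], i.e. 2*sqrt(2) and 2

Sigma = np.zeros_like(A)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(U @ Sigma @ Vh, A))   # True
```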

In practice, SVDs are not computed using the above method. There are sophisticated numerical algorithms for calculating the singular values, P, Q, and the rank of an m × n matrix to a high degree of accuracy. Such algorithms are beyond the scope of this course.

Our algorithm gives us a way of finding one SVD. However, since SVDs are not unique, it is natural to ask how they are related.

Lemma 4.1.9. If A = PΣQ^H is any SVD for A as in Definition 4.1.1, then

(a) r = rank A, and

(b) the d_1, . . . , d_r are the singular values of A in some order.

Proof. We have

A^H A = (PΣQ^H)^H (PΣQ^H) = QΣ^H P^H PΣQ^H = QΣ^HΣQ^H. (4.10)

Thus Σ^HΣ and A^H A are similar matrices, and so rank(A^H A) = rank(Σ^HΣ) = r. As we saw in Step 2 above, rank A = rank A^H A. Hence rank A = r.

It also follows from the fact that Σ^HΣ and A^H A are similar that they have the same eigenvalues. Thus

{d_1^2, d_2^2, . . . , d_r^2} = {λ_1, λ_2, . . . , λ_r},

where λ_1, λ_2, . . . , λ_r are the positive eigenvalues of A^H A. So there is some permutation τ of {1, 2, . . . , r} such that d_i^2 = λ_{τ(i)} for each i = 1, 2, . . . , r. Therefore d_i = √λ_{τ(i)} = σ_{τ(i)} for each i by (4.6).

Exercises.

4.1.1. Show that A ∈ M_{m,n}(C) is the zero matrix if and only if all of its singular values are zero.

Recommended exercises: Exercises in [Nic, §8.6]: 8.6.2–8.6.12, 8.6.16.


4.2 Fundamental subspaces and principal components

A singular value decomposition contains a lot of useful information about a matrix. We explore this phenomenon in this section. A good reference for most of the material here is [Nic, §8.6.2].

Definition 4.2.1 (Fundamental subspaces). The fundamental subspaces of A ∈ M_{m,n}(F) are:

• row A = Span{x : x is a row of A},
• col A = Span{x : x is a column of A},
• null A = {x : Ax = 0},
• null A^H = {x : A^H x = 0}.

Recall Definition 3.1.8 of the orthogonal complement U^⊥ of a subspace U of F^n. We will need a few facts about the orthogonal complement.

Lemma 4.2.2. (a) If A ∈ M_{m,n}(F), then

(row \overline{A})^⊥ = null A and (col A)^⊥ = null A^H.

(b) If U is any subspace of F^n, then (U^⊥)^⊥ = U.

(c) Suppose {v_1, . . . , v_m} is an orthonormal basis of F^m. If U = Span{v_1, . . . , v_k} for some 1 ≤ k ≤ m, then

U^⊥ = Span{v_{k+1}, . . . , v_m}.

Proof. (a) Let a_1, . . . , a_m be the rows of A. Then

x ∈ null A ⟺ Ax = 0
⟺ \begin{bmatrix} a_1 \\ ⋮ \\ a_m \end{bmatrix} x = 0
⟺ \begin{bmatrix} a_1 x \\ ⋮ \\ a_m x \end{bmatrix} = 0 (block multiplication)
⟺ a_i x = 0 for all i
⟺ 〈\overline{a_i}, x〉 = 0 for all i
⟺ x ∈ (Span{\overline{a_1}, . . . , \overline{a_m}})^⊥ = (row \overline{A})^⊥. (by Lemma 3.1.9(c))

Thus null A = (row \overline{A})^⊥. Replacing A by A^H, and noting that the conjugates of the rows of A^H are the columns of A, we get

null A^H = (col A)^⊥.

(b) You saw this in previous courses, so we will not repeat the proof here. See [Nic, Lem. 8.6.4].


(c) We leave this as an exercise. The proof can be found in [Nic, Lem. 8.6.4].

Now we can see that any SVD for a matrix A immediately gives orthonormal bases for the fundamental subspaces of A.

Theorem 4.2.3. Suppose A ∈ M_{m,n}(F). Let A = PΣQ^H be a SVD for A, where

P = \begin{bmatrix} u_1 & · · · & u_m \end{bmatrix} ∈ M_{m,m}(F) and Q = \begin{bmatrix} v_1 & · · · & v_n \end{bmatrix} ∈ M_{n,n}(F)

are unitary (hence orthogonal if F = R) and

Σ = \begin{bmatrix} D & 0 \\ 0 & 0 \end{bmatrix}_{m×n}, where D = diag(d_1, d_2, . . . , d_r), with each d_i ∈ R_{>0}.

Then

(a) r = rank A, and the positive singular values of A are d_1, d_2, . . . , d_r;

(b) the fundamental spaces are as follows:

(i) {u_1, . . . , u_r} is an orthonormal basis of col A,

(ii) {u_{r+1}, . . . , u_m} is an orthonormal basis of null A^H,

(iii) {v_{r+1}, . . . , v_n} is an orthonormal basis of null A,

(iv) {\overline{v_1}, . . . , \overline{v_r}} is an orthonormal basis of row A.

Proof. (a) This is Lemma 4.1.9.

(b) (i) Since Q is invertible, we have col A = col(AQ) = col(PΣ). Also

PΣ = \begin{bmatrix} u_1 & · · · & u_m \end{bmatrix} \begin{bmatrix} diag(d_1, . . . , d_r) & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} d_1 u_1 & · · · & d_r u_r & 0 & · · · & 0 \end{bmatrix}.

Thus

col A = Span{d_1 u_1, . . . , d_r u_r} = Span{u_1, . . . , u_r}.

Since the u_1, . . . , u_r are orthonormal, they are linearly independent. So {u_1, . . . , u_r} is an orthonormal basis for col A.

(ii) We have

null A^H = (col A)^⊥ (by Lemma 4.2.2(a))
= (Span{u_1, . . . , u_r})^⊥
= Span{u_{r+1}, . . . , u_m}. (by Lemma 4.2.2(c))

(iii) We first show that the proposed basis has the correct size. By the Rank–Nullity Theorem (see (1.7)), we have

dim(null A) = n − r = dim(Span{v_{r+1}, . . . , v_n}).

So, if we can show that

Span{v_{r+1}, . . . , v_n} ⊆ null A, (4.11)


it will follow that Span{v_{r+1}, . . . , v_n} = null A (since the two spaces have the same dimension), and hence that {v_{r+1}, . . . , v_n} is a basis for null A (since it is linearly independent because it is orthonormal).

To show the inclusion (4.11), it is enough to show that v_j ∈ null A (i.e. Av_j = 0) for j > r. Define

d_{r+1} = · · · = d_n = 0, so that Σ^HΣ = diag(d_1^2, . . . , d_n^2).

Then, for 1 ≤ j ≤ n, we have

(A^H A)v_j = (QΣ^HΣQ^H)v_j (by (4.10))
= (QΣ^HΣQ^H)Qe_j
= Q(Σ^HΣ)e_j
= Q(d_j^2 e_j)
= d_j^2 Qe_j
= d_j^2 v_j.

Thus, for 1 ≤ j ≤ n,

‖Av_j‖^2 = (Av_j)^H Av_j = v_j^H A^H A v_j = v_j^H (d_j^2 v_j) = d_j^2 ‖v_j‖^2 = d_j^2.

In particular, Av_j = 0 for j > r, as desired.

(iv) First note that

(row \overline{A})^⊥ = null A (by Lemma 4.2.2(a))
= Span{v_{r+1}, . . . , v_n}. (by (iii))

Then

row \overline{A} = ((row \overline{A})^⊥)^⊥ (by Lemma 4.2.2(b))
= (Span{v_{r+1}, . . . , v_n})^⊥
= Span{v_1, . . . , v_r}. (by Lemma 4.2.2(c))

Taking complex conjugates, we have row A = Span{\overline{v_1}, . . . , \overline{v_r}}.

Example 4.2.4. Suppose we want to solve the homogeneous linear system

Ax = 0 of m equations in n variables.

The set of solutions is precisely null A. If we compute a SVD A = PΣQ^H for A then, in the notation of Theorem 4.2.3, the set {v_{r+1}, . . . , v_n} is an orthonormal basis of the solution set.

SVDs are also closely related to principal component analysis. If A = PΣQ^H is a SVD with, in the notation of Theorem 4.2.3,

σ_1 = d_1 > σ_2 = d_2 > · · · > σ_r = d_r > 0,


then we have

A = PΣQ^H
= \begin{bmatrix} u_1 & · · · & u_m \end{bmatrix} \begin{bmatrix} diag(σ_1, . . . , σ_r) & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} v_1 & · · · & v_n \end{bmatrix}^H
= \begin{bmatrix} σ_1 u_1 & · · · & σ_r u_r & 0 & · · · & 0 \end{bmatrix} \begin{bmatrix} v_1^H \\ ⋮ \\ v_n^H \end{bmatrix}
= σ_1 u_1 v_1^H + · · · + σ_r u_r v_r^H. (block multiplication)

Thus, we have written A as a sum of r rank one matrices (see Exercise 4.2.1), called the principal components of A. For 1 ≤ t ≤ r, the matrix

A_t := σ_1 u_1 v_1^H + · · · + σ_t u_t v_t^H

is a truncated matrix that is the closest approximation to A by a rank t matrix, in the sense that the difference between A and A_t has the smallest possible Frobenius norm. This result is known as the Eckart–Young–Mirsky Theorem. Such approximations are particularly useful in data analysis and machine learning. There exist algorithms for finding A_t without computing a full SVD.
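The truncation A_t is easy to form from a computed SVD. A minimal sketch in Python/NumPy (illustrative only; it also checks numerically that the Frobenius error of A_t equals the square root of the sum of the squares of the discarded singular values, as the Eckart–Young–Mirsky theorem predicts for a best rank-t approximation):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

U, s, Vh = np.linalg.svd(A, full_matrices=False)

def truncate(t):
    # A_t = sigma_1 u_1 v_1^H + ... + sigma_t u_t v_t^H
    return U[:, :t] @ np.diag(s[:t]) @ Vh[:t, :]

for t in range(1, 5):
    err = np.linalg.norm(A - truncate(t), 'fro')
    print(t, err, np.sqrt(np.sum(s[t:] ** 2)))   # the two values agree
```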

Exercises.

4.2.1. Suppose u ∈ C^n and v ∈ C^m. Show that if u and v are both nonzero, then the rank of the matrix uv^H is 1.

4.3 Pseudoinverses

In Section 1.5 we discussed left and right inverses. In particular, we discussed the pseudoinverse in Section 1.5.4. This was a particular left/right inverse; in general left/right inverses are not unique. We'll now discuss this concept in a bit more detail, seeing what property uniquely characterizes the pseudoinverse and how we can compute it using a SVD. A good reference for this material is [Nic, §8.6.4].

Definition 4.3.1 (Middle inverse). A middle inverse of A ∈ M_{m,n}(F) is a matrix B ∈ M_{n,m}(F) such that

ABA = A and BAB = B.

Examples 4.3.2. (a) Suppose A is left-invertible, with left inverse B. Then

ABA = AI = A and BAB = IB = B,


so B is a middle inverse of A. Conversely if C is any other middle inverse of A, then

ACA = A =⇒ BACA = BA =⇒ CA = I,

and so C is a left inverse of A. Thus, middle inverses of A are the same as left inverses.

(b) If A is right-invertible, then middle inverses of A are the same as right inverses. The proof is analogous to the one above.

(c) It follows that if A is invertible, then middle inverses are the same as inverses.

In general, middle inverses are not unique, even for square matrices.

Example 4.3.3. If

A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},

then

B = \begin{bmatrix} 1 & b \\ 0 & 0 \\ 0 & 0 \end{bmatrix}

is a middle inverse for any b.

While the middle inverse is not unique in general, it turns out that it is unique if werequire that AB and BA be hermitian.

Theorem 4.3.4 (Penrose's Theorem). For any A ∈ Mm,n(C), there exists a unique B ∈ Mn,m(C) such that

(P1) ABA = A and BAB = B.

(P2) Both AB and BA are hermitian.

Proof. A proof can be found in [Pen55].

Definition 4.3.5 (Pseudoinverse). The pseudoinverse (or Moore–Penrose inverse) of A ∈Mm,n(C) is the unique A+ ∈Mn,m(C) such that A and A+ satisfy (P1) and (P2), that is:

AA+A = A, A+AA+ = A+, and both AA+ and A+A are hermitian.

If A is invertible, then A+ = A−1, as follows from Examples 4.3.2. Also, the symmetry in the conditions (P1) and (P2) implies that A++ = A.

The following proposition shows that the terminology pseudoinverse, as used above, coincides with our use of this terminology in Section 1.5.4.

Proposition 4.3.6. Suppose A ∈Mm,n(C).

(a) If rankA = m, then AAH is invertible and A+ = AH(AAH)−1.

(b) If rankA = n, then AHA is invertible and A+ = (AHA)−1AH .


Proof. We prove the first statement; the proof of the second is similar. If rank A = m, then the rows of A are linearly independent and so, by Proposition 1.5.16(a) (we worked over R there, but the same result holds over C if we replace the transpose by the conjugate transpose), AAH is invertible. Then
$$A\big(A^H(AA^H)^{-1}\big)A = (AA^H)(AA^H)^{-1}A = IA = A$$
and
$$\big(A^H(AA^H)^{-1}\big)A\big(A^H(AA^H)^{-1}\big) = A^H(AA^H)^{-1}(AA^H)(AA^H)^{-1} = A^H(AA^H)^{-1}.$$
Furthermore, $A\big(A^H(AA^H)^{-1}\big) = I$ is hermitian, and
$$\Big(\big(A^H(AA^H)^{-1}\big)A\Big)^H = \big(A^H(AA^H)^{-1}\big)A,$$
so $\big(A^H(AA^H)^{-1}\big)A$ is also hermitian.

It turns out that if we have a SVD for A, then it is particularly easy to compute the pseudoinverse.

Proposition 4.3.7. Suppose A ∈ Mm,n(C) and A = PΣQ^H is a SVD for A as in Definition 4.1.1, with
$$\Sigma = \begin{bmatrix} D & 0 \\ 0 & 0 \end{bmatrix}_{m \times n}, \qquad D = \operatorname{diag}(d_1, d_2, \dots, d_r), \quad d_1, d_2, \dots, d_r \in \mathbb{R}_{>0}.$$
Then A+ = QΣ′P^H, where
$$\Sigma' = \begin{bmatrix} D^{-1} & 0 \\ 0 & 0 \end{bmatrix}_{n \times m}.$$

Proof. It is straightforward to verify that
$$\Sigma\Sigma'\Sigma = \Sigma, \qquad \Sigma'\Sigma\Sigma' = \Sigma', \qquad \Sigma\Sigma' = \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix}_{m \times m}, \qquad \Sigma'\Sigma = \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix}_{n \times n}.$$

In particular, Σ′ is the pseudoinverse of Σ. Now let B = QΣ′P^H. Then
$$ABA = (P\Sigma Q^H)(Q\Sigma' P^H)(P\Sigma Q^H) = P\Sigma\Sigma'\Sigma Q^H = P\Sigma Q^H = A.$$
Similarly, BAB = B. Furthermore,
$$AB = P(\Sigma\Sigma')P^H \qquad \text{and} \qquad BA = Q(\Sigma'\Sigma)Q^H$$
are both hermitian. Thus B = A+.
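As a quick numerical check (not part of the original notes), the following NumPy sketch builds A+ from an SVD as in Proposition 4.3.7 and compares it with NumPy's built-in pseudoinverse; the cutoff tol is an assumption.

import numpy as np

def pinv_via_svd(A, tol=1e-12):
    # A+ = Q Sigma' P^H, where Sigma' inverts only the nonzero singular values.
    P, d, Qh = np.linalg.svd(A, full_matrices=False)
    d_inv = np.array([1/x if x > tol else 0.0 for x in d])
    return Qh.conj().T @ np.diag(d_inv) @ P.conj().T

A = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
print(np.allclose(pinv_via_svd(A), np.linalg.pinv(A)))  # True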

Example 4.3.8. Consider the matrix
$$A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$$


We saw in Example 4.3.3 that
$$B = \begin{bmatrix} 1 & b \\ 0 & 0 \\ 0 & 0 \end{bmatrix}$$
is a middle inverse for A for any choice of b. We have
$$AB = \begin{bmatrix} 1 & b \\ 0 & 0 \end{bmatrix} \qquad \text{and} \qquad BA = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$$
So BA is always hermitian, but AB is hermitian exactly when b = 0. Hence B = A+ if and only if b = 0.

Let's compute A+ using a SVD of A. The matrix
$$A^T A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
has eigenvalues λ1 = 1, λ2 = λ3 = 0 with orthonormal eigenvectors
$$q_1 = (1, 0, 0), \quad q_2 = (0, 1, 0), \quad q_3 = (0, 0, 1).$$
Thus we take $Q = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix} = I_3$. In addition, A has rank 1 with singular values
$$\sigma_1 = 1, \quad \sigma_2 = 0, \quad \sigma_3 = 0.$$
Thus
$$\Sigma_A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} = A \qquad \text{and} \qquad \Sigma'_A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} = A^T.$$
We then compute
$$p_1 = \frac{1}{\sigma_1} A q_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.$$
We extend this to an orthonormal basis of C2 by choosing
$$p_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$
Hence
$$P = \begin{bmatrix} p_1 & p_2 \end{bmatrix} = I_2.$$
Thus a SVD for A is $A = P\Sigma_A Q^T = \Sigma_A = A$. Therefore the pseudoinverse of A is
$$A^+ = Q\Sigma'_A P^T = \Sigma'_A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} = A^T.$$

We conclude this section with a list of properties of the pseudoinverse, many of which parallel properties of the inverse.


Proposition 4.3.9. Suppose A ∈Mm,n(C).

(a) A++ = A.

(b) If A is invertible, then A+ = A−1.

(c) The pseudoinverse of a zero matrix is its transpose.

(d) (AT )+ = (A+)T .

(e) $(\overline{A})^+ = \overline{A^+}$.

(f) (AH)+ = (A+)H .

(g) (zA)+ = z−1A+ for z ∈ C×.

(h) If A ∈Mm,n(R), then A+ ∈Mn,m(R).

Proof. We've already proved the first two. The proof of the remaining properties is left as Exercise 4.3.3.

Exercises.

4.3.1. Suppose that B is a middle inverse for A. Prove that BT is a middle inverse for AT .

4.3.2. A square matrix M is idempotent if $M^2 = M$. Show that, if B is a middle inverse for A, then AB and BA are both idempotent matrices.

4.3.3. Complete the proof of Proposition 4.3.9.

Additional recommended exercises from [Nic, §8.6]: 8.6.1, 8.6.17.

4.4 Jordan canonical form

One of the most fundamental results in linear algebra is the classification of matrices up to similarity. It turns out that every matrix is similar to a matrix of a special form, called Jordan canonical form. In this section we discuss this form. We will omit some of the technical parts of the proof, referring to [Nic, §§11.1, 11.2] for details. Throughout this section we work over the field C of complex numbers.

By Schur's theorem (Theorem 3.4.2), every matrix is unitarily similar to an upper triangular matrix. The following result shows that, in fact, every matrix is similar to a special type of upper triangular matrix.

Proposition 4.4.1 (Block triangulation). Suppose A ∈ Mn,n(C) has characteristic polynomial
$$c_A(x) = (x - \lambda_1)^{m_1}(x - \lambda_2)^{m_2} \cdots (x - \lambda_k)^{m_k},$$


where λ1, λ2, . . . , λk are the distinct eigenvalues of A. Then there exists an invertible matrix P such that
$$P^{-1}AP = \begin{bmatrix} U_1 & 0 & 0 & \cdots & 0 \\ 0 & U_2 & 0 & \cdots & 0 \\ 0 & 0 & U_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & U_k \end{bmatrix},$$
where each U_i is an m_i × m_i upper triangular matrix with all entries on the main diagonal equal to λ_i.

Proof. The proof proceeds by induction on n. See [Nic, Th. 11.1.1] for details.

Recall that if λ is an eigenvalue of A, then the associated eigenspace is

Eλ = null(λI − A) = {x : (λI − A)x = 0} = {x : Ax = λx}.

Definition 4.4.2 (Generalized eigenspace). If λ is an eigenvalue of the matrix A, then the associated generalized eigenspace is
$$G_\lambda = G_\lambda(A) := \operatorname{null}\big((\lambda I - A)^{m_\lambda}\big),$$
where m_λ is the algebraic multiplicity of λ. (We use the notation G_λ(A) if we want to emphasize the matrix.)

Note that we always have

Eλ(A) ⊆ Gλ(A).

Example 4.4.3. Consider the matrix
$$A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \in M_{2,2}(\mathbb{C}).$$
The only eigenvalue of A is λ = 0. The associated eigenspace is
$$E_0(A) = \operatorname{Span}\{(1, 0)\}.$$
However, since $A^2 = 0$, we have
$$G_0 = \mathbb{C}^2.$$

Recall that the geometric multiplicity of the eigenvalue λ of A is dim Eλ and we have dim Eλ ≤ mλ. This inequality can be strict, as in Example 4.4.3. (In fact, it is strict for some eigenvalue precisely when the matrix A is not diagonalizable.)

Lemma 4.4.4. If λ is an eigenvalue of A, then dim Gλ(A) = mλ, where mλ is the algebraic multiplicity of λ.


Proof. Choose P as in Proposition 4.4.1 and choose an eigenvalue λ_i. To simplify notation, let $B = (\lambda_i I - A)^{m_i}$. We have an isomorphism
$$G_\lambda(A) = \operatorname{null} B \to \operatorname{null}(P^{-1}BP), \qquad x \mapsto P^{-1}x \tag{4.12}$$
(Exercise 4.4.1). Thus, it suffices to show that $\dim \operatorname{null}(P^{-1}BP) = m_i$. Using the notation of Proposition 4.4.1, we have λ = λ_i for some i and
$$P^{-1}BP = (\lambda_i I - P^{-1}AP)^{m_i} = \begin{bmatrix} \lambda_i I - U_1 & 0 & \cdots & 0 \\ 0 & \lambda_i I - U_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_i I - U_k \end{bmatrix}^{m_i} = \begin{bmatrix} (\lambda_i I - U_1)^{m_i} & 0 & \cdots & 0 \\ 0 & (\lambda_i I - U_2)^{m_i} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & (\lambda_i I - U_k)^{m_i} \end{bmatrix}.$$
The matrix $\lambda_i I - U_j$ is upper triangular with main diagonal entries equal to $\lambda_i - \lambda_j$; so $(\lambda_i I - U_j)^{m_i}$ is invertible when i ≠ j and the zero matrix when i = j (see Exercise 3.4.3(b)). Therefore $m_i = \dim \operatorname{null}(P^{-1}BP)$, as desired.

Definition 4.4.5 (Jordan block). For n ∈ Z>0 and λ ∈ C, the Jordan block J(n, λ) is the n × n matrix with λ's on the main diagonal, 1's on the diagonal above, and 0's elsewhere. By convention, we set J(1, λ) = [λ].

We have
$$J(1, \lambda) = [\lambda], \quad J(2, \lambda) = \begin{bmatrix} \lambda & 1 \\ 0 & \lambda \end{bmatrix}, \quad J(3, \lambda) = \begin{bmatrix} \lambda & 1 & 0 \\ 0 & \lambda & 1 \\ 0 & 0 & \lambda \end{bmatrix}, \quad J(4, \lambda) = \begin{bmatrix} \lambda & 1 & 0 & 0 \\ 0 & \lambda & 1 & 0 \\ 0 & 0 & \lambda & 1 \\ 0 & 0 & 0 & \lambda \end{bmatrix}, \quad \text{etc.}$$

Our goal is to show that Proposition 4.4.1 holds with each block Ui replaced by a block diagonal matrix of Jordan blocks. The key is to show the result for λ = 0. We say a linear operator (or matrix) T is nilpotent if $T^m = 0$ for some m ≥ 1. Every eigenvalue of a nilpotent linear operator or matrix is equal to zero (Exercise 4.4.2). The converse also holds by Proposition 4.4.1 and Exercise 3.4.3(b).

Lemma 4.4.6. If A ∈ Mn,n(C) is nilpotent, then there exists an invertible matrix P such that

P−1AP = diag(J1, J2, . . . , Jk),

where each Ji is a Jordan block J(m, 0) for some m.

Proof. The proof proceeds by induction on n. See [Nic, Lem. 11.2.1] for details.


Theorem 4.4.7 (Jordan canonical form). Suppose A ∈ Mn,n(C) has distinct (i.e. non-repeated) eigenvalues λ1, . . . , λk. For 1 ≤ i ≤ k, let mi be the algebraic multiplicity of λi. Then there exists an invertible matrix P such that
$$P^{-1}AP = \operatorname{diag}(J_1, \dots, J_m), \tag{4.13}$$
where each $J_\ell$ is a Jordan block corresponding to some eigenvalue λi. Furthermore, the sum of the sizes of the Jordan blocks corresponding to λi is equal to mi. The form (4.13) is called a Jordan canonical form of A.

Proof. By Proposition 4.4.1, there exists an invertible matrix Q such that
$$Q^{-1}AQ = \begin{bmatrix} U_1 & 0 & 0 & \cdots & 0 \\ 0 & U_2 & 0 & \cdots & 0 \\ 0 & 0 & U_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & U_k \end{bmatrix},$$
where each $U_i$ is upper triangular with entries on the main diagonal equal to $\lambda_i$. Suppose that for each $U_i$ we can find an invertible matrix $P_i$ as in the statement of the theorem. Then
$$\begin{bmatrix} P_1 & 0 & 0 & \cdots & 0 \\ 0 & P_2 & 0 & \cdots & 0 \\ 0 & 0 & P_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & P_k \end{bmatrix}^{-1} Q^{-1}AQ \begin{bmatrix} P_1 & 0 & 0 & \cdots & 0 \\ 0 & P_2 & 0 & \cdots & 0 \\ 0 & 0 & P_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & P_k \end{bmatrix} = \begin{bmatrix} P_1^{-1}U_1P_1 & 0 & 0 & \cdots & 0 \\ 0 & P_2^{-1}U_2P_2 & 0 & \cdots & 0 \\ 0 & 0 & P_3^{-1}U_3P_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & P_k^{-1}U_kP_k \end{bmatrix}$$
would be in Jordan canonical form.

So it suffices to prove the theorem in the case that A is an upper triangular matrix with diagonal entries all equal to some λ ∈ C. Then A − λI is strictly upper triangular (upper triangular with zeros on the main diagonal), and hence nilpotent by Exercise 3.4.3(b). Therefore, by Lemma 4.4.6, there exists an invertible matrix P such that
$$P^{-1}(A - \lambda I)P = \operatorname{diag}(J_1, \dots, J_\ell),$$
where each $J_i$ is a Jordan block J(m, 0) for some m. Then
$$P^{-1}AP = P^{-1}\big((A - \lambda I) + \lambda I\big)P = P^{-1}(A - \lambda I)P + \lambda I = \operatorname{diag}(J_1 + \lambda I, \dots, J_\ell + \lambda I),$$
and each $J_i + \lambda I$ is a Jordan block J(m, λ) for some m.


Remarks 4.4.8. (a) Suppose that T : V → V is a linear operator on a finite-dimensional vector space V. If we choose a basis B of V, then we can find the matrix of T relative to this basis. It follows from Theorem 4.4.7 that we can choose the basis B so that this matrix is in Jordan canonical form.

(b) The Jordan canonical form of a matrix (or linear operator) is uniquely determined up to reordering the Jordan blocks. In other words, for each eigenvalue λ, the number and size of the Jordan blocks corresponding to λ are uniquely determined. This is most easily proved using more advanced techniques from the theory of modules. It follows that two matrices are similar if and only if they have the same Jordan canonical form (up to re-ordering of the Jordan blocks).

(c) A matrix is diagonalizable if and only if, in its Jordan canonical form, all Jordan blocks have size 1.
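For small examples one can compute a Jordan canonical form symbolically. The following sketch (not part of the original notes, and assuming SymPy is available) uses Matrix.jordan_form, which returns an invertible P and a block-diagonal J with $P^{-1}AP = J$; the ordering of the Jordan blocks may differ from one's hand computation.

from sympy import Matrix

A = Matrix([[0, 1, 0],
            [0, 0, 0],
            [0, 0, 2]])
P, J = A.jordan_form()      # A = P * J * P**-1, so J = P**-1 * A * P
print(J)                    # diag(J(2, 0), J(1, 2)), up to reordering of the blocks
print(P**-1 * A * P == J)   # True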

Exercises.

4.4.1. Show that (4.12) is an isomorphism.

4.4.2. Show that if T : V → V is a nilpotent linear operator, then zero is the only eigenvalue of T.

Additional exercises from [Nic, §11.2]: 11.2.1, 11.2.2.

4.5 The matrix exponential

As discussed in Section 3.4, we can substitute a matrix A ∈ Mn,n(C) into any polynomial. In this section, we'd like to make sense of the expression e^A. Recall from calculus that, for x ∈ C, we have the power series
$$e^x = 1 + x + \frac{x^2}{2} + \frac{x^3}{3!} + \cdots = \sum_{m=0}^{\infty} \frac{x^m}{m!}.$$
(You probably only saw this for x ∈ R, but it holds for complex values of x as well.) So we might naively try to define
$$e^A = \sum_{m=0}^{\infty} \frac{1}{m!}A^m.$$
In fact, this is the correct definition. However, we need to justify that this infinite sum makes sense (i.e. that it converges) and figure out how to compute it. To do this, we use the Jordan canonical form.


As a warm up, let's suppose A = diag(a1, . . . , an) is diagonal. Then
$$e^A = \sum_{m=0}^{\infty} \frac{1}{m!}A^m = \sum_{m=0}^{\infty} \frac{1}{m!}\operatorname{diag}(a_1^m, \dots, a_n^m) = \operatorname{diag}(e^{a_1}, \dots, e^{a_n}). \tag{4.14}$$
So exponentials of diagonal matrices exist and are easy to compute.

Now suppose A is arbitrary. By Theorem 4.4.7, there exists an invertible matrix P such that $B = P^{-1}AP$ is in Jordan canonical form. For each m, we have
$$\frac{1}{m!}A^m = \frac{1}{m!}\big(PBP^{-1}\big)^m = P\Big(\frac{1}{m!}B^m\Big)P^{-1}.$$
Summing over m, it follows that
$$e^A = Pe^BP^{-1},$$
provided that the infinite sum for e^B converges. If A is diagonalizable, then its Jordan canonical form B is diagonal, and we can then compute e^B as in (4.14).

The above discussion shows that we can restrict our attention to matrices in Jordan canonical form. However, since the Jordan canonical form might not be diagonal, we have some more work to do. Note that if A is in Jordan canonical form, we have
$$A = \operatorname{diag}(J_1, \dots, J_k)$$
for some Jordan blocks J_i. But then, for all m, we have
$$A^m = \operatorname{diag}(J_1^m, \dots, J_k^m).$$
Thus
$$e^A = \operatorname{diag}(e^{J_1}, \dots, e^{J_k}),$$
provided that the sum for each e^{J_i} converges. So it suffices to consider the case where A is a single Jordan block.

Let's first consider the case of a Jordan block J = J(n, 0) corresponding to eigenvalue zero. Then we know from Exercise 3.4.3(b) that $J^n = 0$. Hence
$$e^J = \sum_{m=0}^{\infty} \frac{1}{m!}J^m = \sum_{m=0}^{n-1} \frac{1}{m!}J^m. \tag{4.15}$$
Since this sum is finite, there are no convergence issues.

Now consider an arbitrary Jordan block J(n, λ), λ ∈ C. We have
$$J(n, \lambda) = \lambda I + J(n, 0).$$
Now, if A, B ∈ Mn,n(C) commute (i.e. AB = BA), then the usual argument shows that
$$e^{A+B} = e^A e^B.$$
(It is crucial that A and B commute; see Exercise 4.5.2.) Thus
$$e^{J(n,\lambda)} = e^{\lambda I + J(n,0)} = e^{\lambda I} e^{J(n,0)} = e^{\lambda} e^{J(n,0)}.$$
So we can compute the exponential of any Jordan block. Hence, by our above discussion, we can compute the exponential of any matrix. We sometimes write exp(A) for e^A.
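The finite sum (4.15) is easy to evaluate directly. Here is a minimal NumPy sketch (not from the original notes, and assuming SciPy is available for the comparison) computing $e^{J(n,\lambda)} = e^{\lambda} e^{J(n,0)}$ this way and checking it against SciPy's general-purpose expm.

import numpy as np
from math import factorial, exp
from scipy.linalg import expm

def exp_jordan_block(n, lam):
    # e^{J(n, lam)} = e^lam * sum_{m=0}^{n-1} J(n, 0)^m / m!, as in (4.15)
    J0 = np.diag(np.ones(n - 1), k=1)   # the nilpotent Jordan block J(n, 0)
    S = sum(np.linalg.matrix_power(J0, m) / factorial(m) for m in range(n))
    return exp(lam) * S

J = np.array([[2.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 2.0]])          # J(3, 2)
print(np.allclose(exp_jordan_block(3, 2.0), expm(J)))  # True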


Example 4.5.1. Suppose
$$A = \begin{bmatrix} 2 & 1 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & -1 \end{bmatrix}.$$
We have
$$\exp\left(\begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix}\right) = e^2 \exp\left(\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}\right) = e^2 \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \qquad \text{and} \qquad \exp([-1]) = [e^{-1}].$$
Hence
$$e^A = \begin{bmatrix} e^2 & e^2 & 0 \\ 0 & e^2 & 0 \\ 0 & 0 & e^{-1} \end{bmatrix}.$$

Proposition 4.5.2. For A ∈ Mn,n(C) we have $\det e^A = e^{\operatorname{tr} A}$.

Proof. We first prove the result for a Jordan block J = J(n, λ). We have
$$\det e^J = \det\big(e^{\lambda} e^{J(n,0)}\big) = \big(e^{\lambda}\big)^n \det e^{J(n,0)} = e^{n\lambda} \cdot 1 = e^{\operatorname{tr} J},$$
where we have used the fact (which follows from (4.15)) that $e^{J(n,0)}$ is upper triangular with all entries on the main diagonal equal to 1 (hence $\det e^{J(n,0)} = 1$). Now suppose A is arbitrary. Since similar matrices have the same determinant and trace (see Exercises 3.4.1 and 3.4.2), we can use Theorem 4.4.7 to assume that A is in Jordan canonical form. So
$$A = \operatorname{diag}(J_1, \dots, J_k)$$
for some Jordan blocks $J_i$. Then
$$\begin{aligned}
\det e^A &= \det\big(\operatorname{diag}(e^{J_1}, e^{J_2}, \dots, e^{J_k})\big) \\
&= (\det e^{J_1})(\det e^{J_2}) \cdots (\det e^{J_k}) && \text{(by (1.2))} \\
&= e^{\operatorname{tr} J_1} e^{\operatorname{tr} J_2} \cdots e^{\operatorname{tr} J_k} && \text{(by the above)} \\
&= e^{\operatorname{tr} J_1 + \operatorname{tr} J_2 + \cdots + \operatorname{tr} J_k} && \text{(since } e^{a+b} = e^a e^b \text{ for } a, b \in \mathbb{C}) \\
&= e^{\operatorname{tr}(\operatorname{diag}(J_1, J_2, \dots, J_k))} \\
&= e^{\operatorname{tr} A}.
\end{aligned}$$

The matrix exponential can be used to solve certain initial value problems. You probably learned in calculus that the solution to the differential equation
$$x'(t) = a x(t), \quad x(t_0) = x_0, \qquad a, t_0, x_0 \in \mathbb{R},$$
is
$$x(t) = e^{(t - t_0)a} x_0.$$
In fact, this continues to hold more generally. If
$$x(t) = (x_1(t), \dots, x_n(t)), \quad x'(t) = (x_1'(t), \dots, x_n'(t)), \quad x_0 \in \mathbb{R}^n, \quad A \in M_{n,n}(\mathbb{C}),$$


then the solution to the initial value problem
$$x'(t) = Ax(t), \quad x(t_0) = x_0,$$
is
$$x(t) = e^{(t - t_0)A} x_0.$$
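As a numerical illustration (not part of the original notes, and assuming SciPy is available), the following sketch evaluates the solution $x(t) = e^{(t - t_0)A}x_0$ for the matrix of Example 4.5.1 with the initial condition of Exercise 4.5.5.

import numpy as np
from scipy.linalg import expm

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, -1.0]])   # matrix of Example 4.5.1
x0 = np.array([1.0, -2.0, 3.0])    # initial condition of Exercise 4.5.5
t0 = 1.0

def x(t):
    # x(t) = e^{(t - t0) A} x0 solves x'(t) = A x(t), x(t0) = x0.
    return expm((t - t0) * A) @ x0

print(x(1.0))   # recovers the initial condition (1, -2, 3)
print(x(2.0))   # value of the solution at t = 2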

In addition to the matrix exponential, it is also possible to compute other functions of matrices (sin A, cos A, etc.) using Taylor series for these functions. See [Usm87, Ch. 6] for further details.

Exercises.

4.5.1. Prove that eA is invertible for any A ∈Mn,n(C).

4.5.2. Let
$$A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \qquad \text{and} \qquad B = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}.$$
Show that $e^A e^B \ne e^{A+B}$.

4.5.3. If
$$A = \begin{bmatrix} 3 & 1 & 0 & 0 & 0 \\ 0 & 3 & 1 & 0 & 0 \\ 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & -4 & 1 \\ 0 & 0 & 0 & 0 & -4 \end{bmatrix},$$
compute $e^A$.

4.5.4. If
$$A = \begin{bmatrix} 0 & 2 & -3 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix},$$
compute $e^A$ by summing the power series.

4.5.5. Solve the initial value problem x′(t) = Ax(t), x(1) = (1, −2, 3), where A is the matrix of Example 4.5.1.


Chapter 5

Quadratic forms

In your courses in linear algebra you have spent a great deal of time studying linear functions. The next level of complexity involves quadratic functions. These have a large number of applications, including in the theory of conic sections, optimization, physics, and statistics. Matrix methods allow for a unified treatment of quadratic functions. Conversely, quadratic functions provide useful insights into the theory of eigenvectors and eigenvalues. Good references for the material in this chapter are [Tre, Ch. 7] and [ND77, Ch. 10].

5.1 Definitions

A quadratic form on Rn is a homogeneous polynomial of degree two in the variables x1, . . . , xn. In other words, a quadratic form is a polynomial Q(x) = Q(x1, . . . , xn) having only terms of degree two. So only terms that are scalar multiples of $x_k^2$ and $x_jx_k$ are allowed.

Every quadratic form on Rn can be written in the form $Q(x) = \langle x, Ax \rangle = x^T A x$ for some matrix A ∈ Mn,n(R). In general, the matrix A is not unique. For instance, the quadratic form
$$Q(x) = 2x_1^2 - x_2^2 + 6x_1x_2$$
can be written as $\langle x, Ax \rangle$ where A is any of the matrices
$$\begin{bmatrix} 2 & 6 \\ 0 & -1 \end{bmatrix}, \qquad \begin{bmatrix} 2 & 0 \\ 6 & -1 \end{bmatrix}, \qquad \begin{bmatrix} 2 & 3 \\ 3 & -1 \end{bmatrix}.$$
In fact, we can choose any matrix of the form
$$\begin{bmatrix} 2 & 6 - a \\ a & -1 \end{bmatrix}, \qquad a \in \mathbb{R}.$$

Note that only the choice of a = 3 yields a symmetric matrix.

Lemma 5.1.1. For any quadratic form Q(x) on Rn, there is a unique symmetric matrix A ∈ Mn,n(R) such that Q(x) = 〈x, Ax〉.

Proof. Suppose
$$Q(x) = \sum_{i \le j} c_{ij} x_i x_j.$$


If A = [a_{ij}] is symmetric, we have
$$\langle x, Ax \rangle = \sum_{i,j=1}^{n} a_{ij} x_i x_j = \sum_{i=1}^{n} a_{ii} x_i^2 + \sum_{i<j} (a_{ij} + a_{ji}) x_i x_j = \sum_{i=1}^{n} a_{ii} x_i^2 + \sum_{i<j} 2a_{ij} x_i x_j.$$
Thus, we have $Q(x) = \langle x, Ax \rangle$ if and only if
$$a_{ii} = c_{ii} \text{ for } 1 \le i \le n, \qquad a_{ij} = a_{ji} = \tfrac{1}{2} c_{ij} \text{ for } 1 \le i < j \le n.$$

Example 5.1.2. If
$$Q(x) = x_1^2 - 3x_2^2 + 4x_3^2 + 3x_1x_2 - 12x_1x_3 + 8x_2x_3,$$
then the corresponding symmetric matrix is
$$\begin{bmatrix} 1 & 1.5 & -6 \\ 1.5 & -3 & 4 \\ -6 & 4 & 4 \end{bmatrix}.$$
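As a quick sanity check (not part of the original notes), the following NumPy sketch verifies that this symmetric matrix reproduces Q(x) at a few randomly chosen points; the random test points are an assumption.

import numpy as np

A = np.array([[1.0, 1.5, -6.0],
              [1.5, -3.0, 4.0],
              [-6.0, 4.0, 4.0]])

def Q(x1, x2, x3):
    return x1**2 - 3*x2**2 + 4*x3**2 + 3*x1*x2 - 12*x1*x3 + 8*x2*x3

rng = np.random.default_rng(0)
for _ in range(3):
    x = rng.standard_normal(3)
    print(np.isclose(x @ A @ x, Q(*x)))   # True: <x, Ax> agrees with Q(x)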

We can also consider quadratic forms on Cn. Typically, we still want the quadratic form to take real values. Before we consider such quadratic forms, we state an important identity.

Lemma 5.1.3 (Polarization identity). Suppose A ∈ Mn,n(F).

(a) If F = C, then
$$\langle x, Ay \rangle = \frac{1}{4} \sum_{\alpha = \pm 1, \pm i} \alpha \langle x + \alpha y, A(x + \alpha y) \rangle.$$

(b) If F = R and A = A^T, then
$$\langle x, Ay \rangle = \frac{1}{4}\big(\langle x + y, A(x + y) \rangle - \langle x - y, A(x - y) \rangle\big).$$

Proof. We leave the proof as Exercise 5.1.1.

The importance of the polarization identity is that the values 〈x, Ay〉 are completely determined by the values of the corresponding quadratic form (the case where x = y).

Lemma 5.1.4. Suppose A ∈ Mn,n(C). Then
$$\langle x, Ax \rangle \in \mathbb{R} \quad \text{for all } x \in \mathbb{C}^n$$
if and only if A is hermitian.

Proof. First suppose that A is hermitian. Then, for all x ∈ Cn,
$$\langle x, Ax \rangle = \langle x, A^H x \rangle = \langle Ax, x \rangle \overset{\text{(IP1)}}{=} \overline{\langle x, Ax \rangle}.$$
Thus 〈x, Ax〉 ∈ R.


Now suppose that 〈x, Ax〉 ∈ R for all x ∈ Cn, and A = [a_{ij}]. Using the polarization identity (Lemma 5.1.3) we have, for all x, y ∈ Cn,
$$\begin{aligned}
\langle x, Ay \rangle &= \frac{1}{4}\big(\langle x+y, A(x+y) \rangle - \langle x-y, A(x-y) \rangle + i\langle x+iy, A(x+iy) \rangle - i\langle x-iy, A(x-iy) \rangle\big) \\
&\overset{\text{(IP2), (IP3)}}{=} \frac{1}{4}\big(\langle x+y, A(x+y) \rangle - \langle y-x, A(y-x) \rangle + i\langle y-ix, A(y-ix) \rangle - i\langle y+ix, A(y+ix) \rangle\big) \\
&= \overline{\langle y, Ax \rangle} \overset{\text{(IP1)}}{=} \langle Ax, y \rangle = \langle x, A^H y \rangle.
\end{aligned}$$
(In the third equality, we used the fact that all the inner products appearing in the expression are real.) It follows that A = A^H (e.g. take x = e_i and y = e_j).

In light of Lemma 5.1.4, we define a quadratic form on Cn to be a function of the form

Q(x) = 〈x, Ax〉 for some hermitian A ∈Mn,n(C).

Thus a quadratic form on Cn is a function Cn → R.

Exercises.

5.1.1. Prove Lemma 5.1.3.

5.1.2. Consider the quadratic form
$$Q(x_1, x_2, x_3) = x_1^2 - x_2^2 + 4x_3^2 + x_1x_2 - 3x_1x_3 + 5x_2x_3.$$
Find all matrices A ∈ M3,3(R) such that Q(x) = 〈x, Ax〉. Which one is symmetric?

5.2 Diagonalization of quadratic forms

Suppose Q(x) = 〈x, Ax〉 is a quadratic form. If A is diagonal, then the quadratic form is particularly easy to understand. For instance, if n = 2, then it is of the form
$$Q(x) = a_1 x_1^2 + a_2 x_2^2.$$
For fixed c ∈ R, the level set consisting of all points x satisfying
$$Q(x) = c$$
is an ellipse or a hyperbola (or parallel lines, which is a kind of degenerate hyperbola, or the empty set) whose axes are parallel to the x-axis and/or y-axis. For example, if F = R, the set defined by
$$x_1^2 + 4x_2^2 = 4$$


is the ellipse
[Figure: ellipse in the (x_1, x_2)-plane]
while the set defined by
$$x_1^2 - 4x_2^2 = 4$$
is the following hyperbola:
[Figure: hyperbola in the (x_1, x_2)-plane]

If n = 3, then the level sets are ellipsoids or hyperboloids.

Given an arbitrary quadratic form, we would like to make it diagonal by changing variables. Suppose we introduce some new variables
$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = S^{-1}x,$$
where S ∈ Mn,n(F) is some invertible matrix, so that x = Sy. Then we have
$$Q(x) = Q(Sy) = \langle Sy, ASy \rangle = \langle y, S^H A S y \rangle.$$
Therefore, in the new variables y, the quadratic form has matrix S^H AS.


So we would like to find an invertible matrix S so that S^H AS is diagonal. Note that this is a bit different from usual diagonalization, where you want S^{-1}AS to be diagonal. However, if S is unitary, then S^H = S^{-1}. Fortunately, we know from the spectral theorem (Theorem 3.4.4) that we can unitarily diagonalize a hermitian matrix. So we can find a unitary matrix U such that
$$U^H A U = U^{-1} A U$$
is diagonal. The columns u1, u2, . . . , un of U form an orthonormal basis B of Fn. Then U is the change of coordinate matrix from this basis to the standard basis e1, e2, . . . , en. We have
$$y = U^{-1}x,$$
so y1, y2, . . . , yn are the coordinates of the vector x in the new basis u1, u2, . . . , un.

Example 5.2.1. Let F = R and consider the quadratic form
$$Q(x_1, x_2) = 5x_1^2 + 5x_2^2 - 2x_1x_2.$$
Let's describe the level set given by Q(x1, x2) = 1. The matrix of Q is
$$A = \begin{bmatrix} 5 & -1 \\ -1 & 5 \end{bmatrix}.$$
We can orthogonally diagonalize this matrix (exercise) as
$$D = \begin{bmatrix} 4 & 0 \\ 0 & 6 \end{bmatrix} = U^T A U, \qquad \text{with} \qquad U = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}.$$
Thus, we introduce the new variables y1, y2 given by
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = U \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} y_1 + y_2 \\ y_2 - y_1 \end{bmatrix}.$$
In these new variables, the quadratic form is
$$Q(y_1, y_2) = 4y_1^2 + 6y_2^2.$$
So the level set given by Q(y1, y2) = 1 is the ellipse with half-axes $1/2$ and $1/\sqrt{6}$:
[Figure: ellipse with half-axes 1/2 and 1/√6 in the (y_1, y_2)-plane]
The set {x : 〈x, Ax〉 = 1} is the same ellipse, but in the basis $(1/\sqrt{2}, -1/\sqrt{2})$, $(1/\sqrt{2}, 1/\sqrt{2})$. In other words, it is the same ellipse, rotated −π/4.
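A numerical version of this computation (not in the original notes) using NumPy's symmetric eigensolver; note that numpy.linalg.eigh may return the orthonormal eigenvectors in a different order or with different signs than the U chosen above, and the test vector is an assumption.

import numpy as np

A = np.array([[5.0, -1.0],
              [-1.0, 5.0]])
eigvals, U = np.linalg.eigh(A)       # columns of U are orthonormal eigenvectors
print(eigvals)                       # [4. 6.]
print(np.allclose(U.T @ A @ U, np.diag(eigvals)))   # True: U^T A U is diagonal

# In the new variables y = U^T x the quadratic form becomes 4 y1^2 + 6 y2^2.
x = np.array([0.3, -0.2])
y = U.T @ x
print(np.isclose(x @ A @ x, eigvals @ y**2))         # True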


Unitary diagonalization involves computing eigenvalues and eigenvectors, and so it can be difficult to do for large n. However, one can non-unitarily diagonalize the quadratic form associated to A, i.e. find an invertible S (without requiring that S be unitary) such that S^H AS is diagonal. This is much easier computationally. It can be done by completion of squares or by using row/column operations. See [Tre, §2.2] for details.

Proposition 5.2.2 (Sylvester's law of inertia). Suppose A ∈ Mn,n(C) is hermitian. There exists an invertible matrix S such that S^H AS is of the form
$$\operatorname{diag}(1, \dots, 1, -1, \dots, -1, 0, \dots, 0),$$
where

• the number of 1's is equal to the number of positive eigenvalues of A (with multiplicity),

• the number of −1's is equal to the number of negative eigenvalues of A (with multiplicity),

• the number of 0's is equal to the multiplicity of zero as an eigenvalue of A.

Proof. We leave the proof as Exercise 5.2.3.

The sequence (1, . . . , 1, −1, . . . , −1, 0, . . . , 0) appearing in Proposition 5.2.2 is called the signature of the hermitian matrix A.

Exercises.

5.2.1 ([Tre, Ex. 7.2.2]). For the matrix
$$A = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix},$$
unitarily diagonalize the corresponding quadratic form. That is, find a diagonal matrix D and a unitary matrix U such that D = U^H AU.

5.2.2 ([ND77, Ex. 10.2.4]). For each quadratic form Q(x) below, find new variables y in which the form is diagonal. Then graph the level curves Q(y) = t (in the coordinates y) for various values of t.

(a) $x_1^2 + 4x_1x_2 - 2x_2^2$

(b) $x_1^2 + 12x_1x_2 + 4x_2^2$

(c) $2x_1^2 + 4x_1x_2 + 4x_2^2$

(d) $-5x_1^2 - 8x_1x_2 - 5x_2^2$

(e) $11x_1^2 + 2x_1x_2 + 3x_2^2$

5.2.3. Prove Proposition 5.2.2.


5.3 Rayleigh’s principle and the min-max theorem

In many applications (e.g. in certain problems in statistics), one wants to maximize a quadratic form Q(x) = 〈x, Ax〉 subject to the condition ‖x‖ = 1. (Recall that A is hermitian.) This is equivalent to maximizing the Rayleigh quotient
$$R_A(x) := \frac{\langle x, Ax \rangle}{\langle x, x \rangle}$$
subject to the condition x ≠ 0. (The proof that this is equivalent is similar to the proof of Proposition 2.3.2. We leave it as Exercise 5.3.1.)

It turns out that it is fairly easy to maximize the Rayleigh quotient when A is diagonal. As we saw in Section 5.2, we can always unitarily diagonalize A, and hence the associated quadratic form. Let U be a unitary matrix such that
$$U^H A U = D = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$$
and consider the new variables
$$y = U^{-1}x = U^H x \qquad \text{so that} \qquad x = Uy.$$
Then
$$\langle x, Ax \rangle = x^H A x = (Uy)^H A U y = y^H U^H A U y = y^H D y = \langle y, Dy \rangle$$
and
$$\langle x, x \rangle = x^H x = (Uy)^H U y = y^H U^H U y = y^H y = \langle y, y \rangle.$$
Thus
$$R_A(x) = \frac{\langle x, Ax \rangle}{\langle x, x \rangle} = \frac{\langle y, Dy \rangle}{\langle y, y \rangle} = R_D(y). \tag{5.1}$$
So we can always reduce the problem of maximizing a Rayleigh quotient to one of maximizing a diagonalized Rayleigh quotient.

Theorem 5.3.1. Suppose A ∈ Mn,n(C) is hermitian with eigenvalues
$$\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$$
and corresponding orthonormal eigenvectors v1, v2, . . . , vn. Then

(a) λ1 ≤ RA(x) ≤ λn for all x ≠ 0;

(b) λ1 is the minimum value of RA(x) for x ≠ 0, and RA(x) = λ1 if and only if x is an eigenvector of eigenvalue λ1;

(c) λn is the maximum value of RA(x) for x ≠ 0, and RA(x) = λn if and only if x is an eigenvector of eigenvalue λn.

Proof. By (5.1) and the fact that
$$y = e_i \iff x = Ue_i = v_i,$$
it suffices to prove the theorem for the diagonal matrix D = diag(λ1, λ2, . . . , λn).


(a) This follows immediately from (b) and (c).

(b) We clearly have
$$R_D(e_1) = \frac{\langle e_1, De_1 \rangle}{\langle e_1, e_1 \rangle} = \frac{\lambda_1 \langle e_1, e_1 \rangle}{\langle e_1, e_1 \rangle} = \lambda_1.$$
Also, for any x = x1e1 + x2e2 + · · · + xnen, we have
$$R_D(x) = \frac{\langle x, Dx \rangle}{\langle x, x \rangle} = \frac{\langle x, D(x_1e_1 + \cdots + x_ne_n) \rangle}{\langle x, x \rangle} = \frac{\langle x_1e_1 + \cdots + x_ne_n,\ \lambda_1x_1e_1 + \cdots + \lambda_nx_ne_n \rangle}{\langle x, x \rangle} = \frac{\lambda_1|x_1|^2 + \cdots + \lambda_n|x_n|^2}{|x_1|^2 + \cdots + |x_n|^2} \ge \lambda_1,$$
with equality holding if and only if xi = 0 for all i with λi > λ1; that is, if and only if Dx = λ1x.

(c) This follows from applying (b) to −A.

Example 5.3.2. Consider the hermitian matrix
$$A = \begin{bmatrix} 1 & 2 & 1 \\ 2 & -1 & 0 \\ 1 & 0 & -1 \end{bmatrix}.$$
We have
$$R_A(1, 0, 0) = \frac{\langle (1, 0, 0), (1, 2, 1) \rangle}{1} = 1 \qquad \text{and} \qquad R_A(0, 1, 0) = \frac{\langle (0, 1, 0), (2, -1, 0) \rangle}{1} = -1.$$
Thus, by Theorem 5.3.1, we have
$$\lambda_1 \le -1 \qquad \text{and} \qquad 1 \le \lambda_3.$$
Additional choices of x can give us improved bounds. In fact, $\lambda_1 = -\sqrt{6}$, $\lambda_2 = -1$, and $\lambda_3 = \sqrt{6}$. Note that RA(0, 1, 0) = −1 is an eigenvalue, even though (0, 1, 0) is not an eigenvector. This does not contradict Theorem 5.3.1 since −1 is neither the minimum nor the maximum eigenvalue.
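As a numerical illustration (not from the original notes), the following NumPy sketch evaluates R_A at the two vectors used above and compares the resulting bounds with the actual eigenvalues of this matrix.

import numpy as np

A = np.array([[1.0, 2.0, 1.0],
              [2.0, -1.0, 0.0],
              [1.0, 0.0, -1.0]])

def rayleigh(A, x):
    return (x @ A @ x) / (x @ x)

print(rayleigh(A, np.array([1.0, 0.0, 0.0])))   # 1.0, so lambda_3 >= 1
print(rayleigh(A, np.array([0.0, 1.0, 0.0])))   # -1.0, so lambda_1 <= -1
print(np.linalg.eigvalsh(A))                    # approximately [-sqrt(6), -1, sqrt(6)]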

We studied matrix norms in Section 2.3. In Theorem 2.3.5 we saw that there are easy ways to compute the 1-norm ‖A‖1 and the ∞-norm ‖A‖∞. But we only obtained an inequality for the 2-norm ‖A‖2. We can now say something more precise.


Corollary 5.3.3 (Matrix 2-norm). For any A ∈ Mm,n(C), the matrix 2-norm ‖A‖2 is equal to σ1, the largest singular value of A.

Proof. We have
$$\|A\|_2^2 = \left(\max\left\{\frac{\|Ax\|_2}{\|x\|_2} : x \ne 0\right\}\right)^2 = \max\left\{\frac{\|Ax\|_2^2}{\|x\|_2^2} : x \ne 0\right\} = \max\left\{\frac{\langle Ax, Ax \rangle}{\langle x, x \rangle} : x \ne 0\right\} = \max\left\{\frac{\langle x, A^H A x \rangle}{\langle x, x \rangle} : x \ne 0\right\} = \lambda,$$
where λ is the largest eigenvalue of the hermitian matrix A^H A (by Theorem 5.3.1). Since $\sigma_1 = \sqrt{\lambda}$, we are done.
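A one-line numerical check (not part of the original notes): NumPy's 2-norm of a matrix agrees with its largest singular value; the random test matrix is an assumption.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
sigma = np.linalg.svd(A, compute_uv=False)        # singular values, largest first
print(np.isclose(np.linalg.norm(A, 2), sigma[0]))  # True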

Theorem 5.3.1 relates the maximum (resp. minimum) eigenvalue of a hermitian matrix A to the maximum (resp. minimum) value of the Rayleigh quotient RA(x). But what if we are interested in the intermediate eigenvalues?

Theorem 5.3.4 (Rayleigh's principle). Suppose A ∈ Mn,n(C) is hermitian, with eigenvalues λ1 ≤ λ2 ≤ · · · ≤ λn and an orthonormal set of associated eigenvectors v1, . . . , vn. For 1 ≤ j ≤ n − 1, let

• Sj be the set of all x ≠ 0 that are orthogonal to v1, v2, . . . , vj;

• Tj be the set of all x ≠ 0 that are orthogonal to vn, vn−1, . . . , vn−j+1.

Then:

(a) RA(x) ≥ λj+1 for all x ∈ Sj, and RA(x) = λj+1 for x ∈ Sj if and only if x is an eigenvector associated to λj+1.

(b) RA(x) ≤ λn−j for all x ∈ Tj, and RA(x) = λn−j for x ∈ Tj if and only if x is an eigenvector associated to λn−j.

Proof. The proof is very similar to that of Theorem 5.3.1, so we will omit it.

Rayleigh's principle (Theorem 5.3.4) characterizes each eigenvector and eigenvalue of an n × n hermitian matrix in terms of an extremum (i.e. maximization or minimization) problem. However, this characterization of the eigenvectors/eigenvalues other than the largest and smallest requires knowledge of eigenvectors other than the one being characterized. In particular, to characterize or estimate λj for 1 < j < n, we need the eigenvectors v1, . . . , vj−1 or vj+1, . . . , vn.

The following result remedies the aforementioned issue by giving a characterization of each eigenvalue that does not require knowledge of the other eigenvectors.


Theorem 5.3.5 (Min-max theorem). Suppose A ∈ Mn,n(C) is hermitian with eigenvalues
$$\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n.$$
Then
$$\lambda_k = \min_{U : \dim U = k} \big\{ \max\{R_A(x) : x \in U,\ x \ne 0\} \big\}$$
and
$$\lambda_k = \max_{U : \dim U = n-k+1} \big\{ \min\{R_A(x) : x \in U,\ x \ne 0\} \big\},$$
where the first min/max in each expression is over subspaces U ⊆ Cn of the given dimension.

Proof. We prove the first assertion, since the proof of the second is similar. Since A is hermitian, it is unitarily diagonalizable by the spectral theorem (Theorem 3.4.4). So we can choose an orthonormal basis
$$u_1, u_2, \dots, u_n$$
of eigenvectors, with u_i an eigenvector of eigenvalue λ_i.

Suppose U ⊆ Cn with dim U = k. Then
$$\dim U + \dim \operatorname{Span}\{u_k, \dots, u_n\} = k + (n - k + 1) = n + 1 > n.$$
Hence
$$U \cap \operatorname{Span}\{u_k, \dots, u_n\} \ne \{0\}.$$
So we can choose a nonzero vector
$$v = \sum_{i=k}^{n} a_i u_i \in U,$$
and
$$R_A(v) = \frac{\langle v, Av \rangle}{\langle v, v \rangle} = \frac{\sum_{i=k}^{n} \lambda_i |a_i|^2}{\sum_{i=k}^{n} |a_i|^2} \ge \frac{\sum_{i=k}^{n} \lambda_k |a_i|^2}{\sum_{i=k}^{n} |a_i|^2} = \lambda_k,$$
so $\max\{R_A(x) : x \in U,\ x \ne 0\} \ge \lambda_k$. Since this is true for all U, we have
$$\min_{U : \dim U = k} \big\{ \max\{R_A(x) : x \in U,\ x \ne 0\} \big\} \ge \lambda_k.$$
To prove the reverse inequality, choose the particular subspace
$$V = \operatorname{Span}\{u_1, \dots, u_k\}.$$
Then
$$\max\{R_A(x) : x \in V,\ x \ne 0\} \le \lambda_k,$$
since λ_k is the largest eigenvalue of A on V. Thus we also have
$$\min_{U : \dim U = k} \big\{ \max\{R_A(x) : x \in U,\ x \ne 0\} \big\} \le \lambda_k.$$

Note that, when k = 1 or k = n, Theorem 5.3.5 recovers Theorem 5.3.1.


Example 5.3.6. Consider the hermitian matrix
$$A = \begin{bmatrix} 1 & 2 & 0 \\ 2 & -1 & 5 \\ 0 & 5 & -1 \end{bmatrix}.$$
Let's use the min-max theorem to get some bounds on λ2. So n = 3 and k = 2. Take
$$V = \operatorname{Span}\{(1, 0, 0), (0, 0, 1)\}.$$
Then dim V = 2. The nonzero elements of V are the vectors
$$(x_1, 0, x_3) \ne (0, 0, 0).$$
We have
$$R_A(x_1, 0, x_3) = \frac{\langle (x_1, 0, x_3), (x_1, 2x_1 + 5x_3, -x_3) \rangle}{|x_1|^2 + |x_3|^2} = \frac{|x_1|^2 - |x_3|^2}{|x_1|^2 + |x_3|^2}.$$
This attains its maximum value of 1 at any scalar multiple of (1, 0, 0) and its minimum value of −1 at any scalar multiple of (0, 0, 1). Thus, by the min-max theorem (Theorem 5.3.5), we have
$$\lambda_2 = \min_{U : \dim U = 2} \big\{ \max\{R_A(x) : x \in U,\ x \ne 0\} \big\} \le \max\{R_A(x) : x \in V,\ x \ne 0\} = 1$$
and
$$\lambda_2 = \max_{U : \dim U = 2} \big\{ \min\{R_A(x) : x \in U,\ x \ne 0\} \big\} \ge \min\{R_A(x) : x \in V,\ x \ne 0\} = -1.$$
So we know that −1 ≤ λ2 ≤ 1. In fact, λ2 ≈ 0.69385.

Example 5.3.7. Consider the non-hermitian matrix
$$A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}.$$
The only eigenvalue of A is zero. However, for x ∈ R2, the Rayleigh quotient
$$R_A(x) = \frac{\langle x, Ax \rangle}{\langle x, x \rangle} = \frac{\langle (x_1, x_2), (x_2, 0) \rangle}{\langle (x_1, x_2), (x_1, x_2) \rangle} = \frac{x_1 x_2}{x_1^2 + x_2^2}$$
has maximum value $\frac{1}{2}$, when x is a scalar multiple of (1, 1). So it is crucial that A be hermitian in the theorems of this section.


Exercises.

5.3.1. Prove that x0 maximizes RA(x) subject to x ≠ 0 and yields the maximum value M = RA(x0) if and only if x1 = x0/‖x0‖ maximizes 〈x, Ax〉 subject to ‖x‖ = 1 and yields 〈x1, Ax1〉 = M. Hint: The proof is similar to that of Proposition 2.3.2.

5.3.2 ([ND77, Ex. 10.4.1]). Use the Rayleigh quotient to find lower bounds for the largest eigenvalue and upper bounds for the smallest eigenvalue of
$$A = \begin{bmatrix} 0 & -1 & 0 \\ -1 & -1 & 1 \\ 0 & 1 & 0 \end{bmatrix}.$$

5.3.3 ([ND77, Ex. 10.4.2]). An eigenvector associated with the lowest eigenvalue of the matrix below has the form xa = (1, a, 1). Find the exact value of a by defining the function f(a) = RA(xa) and using calculus to minimize f(a). What is the lowest eigenvalue of A?
$$A = \begin{bmatrix} 3 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 3 \end{bmatrix}.$$

5.3.4 ([ND77, Ex. 10.4.5]). For each matrix A below, use RA to obtain lower bounds on the greatest eigenvalue and upper bounds on the least eigenvalue.

(a) $\begin{bmatrix} 3 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 3 \end{bmatrix}$

(b) $\begin{bmatrix} 7 & -16 & -8 \\ -16 & 7 & 8 \\ -8 & 8 & -5 \end{bmatrix}$

(c) $\begin{bmatrix} 2 & -1 & 0 \\ -1 & 3 & -1 \\ 0 & -1 & 2 \end{bmatrix}$

5.3.5 ([ND77, Ex. 10.4.6]). Using v3 = (1, −1, −1) as an eigenvector associated with the largest eigenvalue λ3 of the matrix A of Exercise 5.3.2, use RA to obtain lower bounds on the second largest eigenvalue λ2.

5.3.6 ([ND77, Ex. 10.5.3, 10.5.4]). Consider the matrix
$$A = \begin{bmatrix} 0.4 & 0.1 & 0.1 \\ 0.1 & 0.3 & 0.2 \\ 0.1 & 0.2 & 0.3 \end{bmatrix}.$$

(a) Use $U = \operatorname{Span}\{(1, 1, 1)\}^{\perp}$ in the min-max theorem (Theorem 5.3.5) to obtain an upper bound on the second largest eigenvalue of A.

(b) Repeat with $U = \operatorname{Span}\{(1, 2, 3)\}^{\perp}$.




Bibliography

[AK08] Gregoire Allaire and Sidi Mahmoud Kaber. Numerical linear algebra, volume 55 of Texts in Applied Mathematics. Springer, New York, 2008. Translated from the 2002 French original by Karim Trabelsi. URL: https://doi.org/10.1007/978-0-387-68918-0.

[BV18] S. Boyd and L. Vandenberghe. Introduction to Applied Linear Algebra – Vectors, Matrices, and Least Squares. Cambridge University Press, Cambridge, 2018. URL: http://vmls-book.stanford.edu/.

[Mey00] Carl Meyer. Matrix analysis and applied linear algebra. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2000. With 1 CD-ROM (Windows, Macintosh and UNIX) and a solutions manual (iv+171 pp.). URL: https://doi.org/10.1137/1.9780898719512.

[ND77] Ben Noble and James W. Daniel. Applied linear algebra. Prentice-Hall, Inc., Englewood Cliffs, N.J., second edition, 1977.

[Nic] W. K. Nicholson. Linear Algebra With Applications. URL: https://lyryx.com/products/mathematics/linear-algebra-applications/.

[Pen55] R. Penrose. A generalized inverse for matrices. Proc. Cambridge Philos. Soc., 51:406–413, 1955.

[Tre] S. Treil. Linear Algebra Done Wrong. URL: http://www.math.brown.edu/~treil/papers/LADW/LADW.html.

[Usm87] Riaz A. Usmani. Applied linear algebra, volume 105 of Monographs and Textbooks in Pure and Applied Mathematics. Marcel Dekker, Inc., New York, 1987.
