Lecture 0: Linear Algebra, Maximization, and Probability



    Investments

    Lecture 0: Linear Algebra, Maximization, and Probability

Instructor: Chyi-Mei Chen (Tel) 3366-1086 (Email) [email protected] (Website) http://www.fin.ntu.edu.tw/cchen/

1. A matrix is a two-dimensional array of real numbers.[1][2] A matrix A is said to have size m × n (read as m by n) if it contains m rows and n columns. When m = 1, we call it a row vector. When n = 1, it is a column vector. When m = n = 1, the matrix contains only one real number and is itself called a scalar.

2. Two matrices A and B are said to be equal (written A = B) if they have the same size and every entry (or element) of A, say a_{ij}, is equal to the corresponding element b_{ij}. (Here a_{ij} denotes the element of A at the intersection of the i-th row and the j-th column.)

3. The transpose of A_{m×n} is the matrix B_{n×m} with, for all i = 1, 2, ..., m and j = 1, 2, ..., n, b_{ji} = a_{ij}. We usually denote B by A^T or A'.

Example 1 Let

    A = [ 1  2 ]
        [ 2  3 ]
        [ 3  4 ],

    B = [ 1 -1  2 ]
        [ 0  1 -1 ].

[1] We shall make no use of matrices with complex numbers. The set of real numbers will be represented by ℜ and R interchangeably.

[2] You can find free materials about linear algebra on the internet. See for example http://en.wikipedia.org/wiki/Symmetric_matrix.


Then, we have

    A^T = [ 1  2  3 ]
          [ 2  3  4 ],

    B^T = [ 1  0 ]
          [-1  1 ]
          [ 2 -1 ].

4. From now on, we use upper-case boldface letters to denote matrices, and lower-case boldface letters to denote column vectors. Row vectors will be represented as transposes of column vectors. Scalars are still represented by roman characters.

5. Given two (column) vectors a_{n×1} and b_{n×1}, their inner product is defined as the scalar a^T b = b^T a = \sum_{i=1}^n a_i b_i.

Example 2 Let

    a = [  1 ]        b = [  0 ]
        [ -1 ],           [  1 ]
        [  2 ]            [ -1 ].

Then a^T b = 1·0 + (-1)·1 + 2·(-1) = -3.
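The transposes in Example 1 and the inner product in Example 2 can be checked numerically. A minimal sketch in Python using NumPy (the check is an illustration, not part of the original notes):

    import numpy as np

    A = np.array([[1, 2], [2, 3], [3, 4]])     # the 3x2 matrix of Example 1
    B = np.array([[1, -1, 2], [0, 1, -1]])     # the 2x3 matrix of Example 1
    print(A.T)                                 # [[1 2 3], [2 3 4]]
    print(B.T)                                 # [[1 0], [-1 1], [2 -1]]

    a = np.array([1, -1, 2])                   # the vectors of Example 2
    b = np.array([0, 1, -1])
    print(a @ b)                               # inner product a^T b = -3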

6. A matrix A_{m×n} can be multiplied by any scalar c_{1×1}. The product is a matrix B_{m×n} = cA = Ac with its (i, j)-th element being b_{ij} = c a_{ij}, i = 1, 2, ..., m and j = 1, 2, ..., n.

7. Two matrices A and B can be added if they are of the same size m × n. The sum C_{m×n} is a new matrix with elements c_{ij} = a_{ij} + b_{ij}, i = 1, 2, ..., m and j = 1, 2, ..., n.

8. A matrix A with size m × n can be pre-multiplied to another matrix B with size n × q, and the product C = AB is of size m × q. In this case, we say that A and B are conformable and that B is post-multiplied to A. Note that C has elements

    c_{ij} = \sum_{k=1}^n a_{ik} b_{kj},

i = 1, 2, ..., m and j = 1, 2, ..., q. In other words, the (i, j)-th element of matrix C is the inner product of the i-th row vector of A and the j-th column vector of B.


9. Some facts about the transpose operation: (A^T)^T = A for every matrix A. If A is conformable to B, then B^T is conformable to A^T; moreover, we have (AB)^T = B^T A^T.

Example 3 Let

    A = [ 1  2 ]        B = [ 1 -1  2 ]
        [ 2  3 ]            [ 0  1 -1 ].
        [ 3  4 ],

Then, we have

    (AB)_{3×3} = [ 1·1 + 2·0    1·(-1) + 2·1    1·2 + 2·(-1) ]   [ 1  1  0 ]
                 [ 2·1 + 3·0    2·(-1) + 3·1    2·2 + 3·(-1) ] = [ 2  1  1 ]
                 [ 3·1 + 4·0    3·(-1) + 4·1    3·2 + 4·(-1) ]   [ 3  1  2 ],

and

    (BA)_{2×2} = [ 1·1 + (-1)·2 + 2·3    1·2 + (-1)·3 + 2·4 ]   [  5   7 ]
                 [ 0·1 + 1·2 + (-1)·3    0·2 + 1·3 + (-1)·4 ] = [ -1  -1 ].
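The products in Example 3 can be verified in a couple of lines. A minimal sketch with NumPy (not part of the original notes):

    import numpy as np

    A = np.array([[1, 2], [2, 3], [3, 4]])
    B = np.array([[1, -1, 2], [0, 1, -1]])
    print(A @ B)   # 3x3 product AB = [[1 1 0], [2 1 1], [3 1 2]]
    print(B @ A)   # 2x2 product BA = [[5 7], [-1 -1]]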

10. The multiplication of two matrices can be done by treating the two matrices as they are, or by re-interpreting them as new matrices with different sizes, as long as the re-interpretation does not cause problems with conformability. For instance, the matrix A in the preceding example can be regarded as a row vector,

    A_{1×2} = [ a_1  a_2 ],

where a_j is the j-th column of A; namely,

    a_1 = [ 1 ]        a_2 = [ 2 ]
          [ 2 ],             [ 3 ]
          [ 3 ]              [ 4 ].

Similarly, B can be regarded as a column vector,

    B_{2×1} = [ b_1^T ]
              [ b_2^T ],


where b_i^T is the i-th row of B, namely,

    b_1^T = [ 1 -1  2 ],    b_2^T = [ 0  1 -1 ].

It follows that

    A_{1×2} B_{2×1} = a_1 b_1^T + a_2 b_2^T

                    = [ 1 ]                 [ 2 ]
                      [ 2 ] [ 1 -1  2 ]  +  [ 3 ] [ 0  1 -1 ]
                      [ 3 ]                 [ 4 ]

                    = [ 1 -1  2 ]   [ 0  2 -2 ]   [ 1  1  0 ]
                      [ 2 -2  4 ] + [ 0  3 -3 ] = [ 2  1  1 ]
                      [ 3 -3  6 ]   [ 0  4 -4 ]   [ 3  1  2 ],

which is exactly the matrix that we obtain by directly pre-multiplying A to B, as in the preceding example. What matters here, as you can see, is that a_1 and b_1^T are conformable, and a_2 and b_2^T are conformable.

Alternatively, we can regard A as a column vector whose three entries are the rows of A,

    A_{3×1} = [ [1  2] ]
              [ [2  3] ]
              [ [3  4] ],

and regard B as a row vector whose three entries are the columns of B,

    B_{1×3} = [ [1  0]^T   [-1  1]^T   [2  -1]^T ].

Note that

    A_{3×1} B_{1×3} = [ [1 2][1 0]^T    [1 2][-1 1]^T    [1 2][2 -1]^T ]   [ 1  1  0 ]
                      [ [2 3][1 0]^T    [2 3][-1 1]^T    [2 3][2 -1]^T ] = [ 2  1  1 ]
                      [ [3 4][1 0]^T    [3 4][-1 1]^T    [3 4][2 -1]^T ]   [ 3  1  2 ],

which is again the matrix that we obtain by directly pre-multiplying A to B, as in the preceding example.
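Both re-interpretations of the product AB above can be reproduced numerically. A sketch with NumPy (not part of the original notes):

    import numpy as np

    A = np.array([[1, 2], [2, 3], [3, 4]])
    B = np.array([[1, -1, 2], [0, 1, -1]])

    # AB as a sum of (column of A) times (row of B): a_1 b_1^T + a_2 b_2^T
    outer_sum = np.outer(A[:, 0], B[0, :]) + np.outer(A[:, 1], B[1, :])

    # AB with the (i, j) entry computed as (row i of A) times (column j of B)
    entrywise = np.array([[A[i, :] @ B[:, j] for j in range(3)] for i in range(3)])

    print(np.array_equal(outer_sum, A @ B))    # True
    print(np.array_equal(entrywise, A @ B))    # True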

11. Another interesting fact about multiplication of matrices is that the associative and distributive laws hold: as long as conformability is not a problem, we have

    A[(sB + tC)D] = sABD + tACD,

where s, t are real numbers. For instance, let

    x_{2×1} = [ x_1 ]
              [ x_2 ],

and recall the matrix A from the preceding example. Again, let a_j be the j-th column of A. I claim that

    Ax = x_1 a_1 + x_2 a_2 = x_1 (1, 2, 3)^T + x_2 (2, 3, 4)^T.

To see this, note that

    x = x_1 (1, 0)^T + x_2 (0, 1)^T,

so that

    Ax = x_1 A(1, 0)^T + x_2 A(0, 1)^T = x_1 a_1 + x_2 a_2.

Similarly, let

    y^T = (y_1, y_2) = y_1 (1, 0) + y_2 (0, 1).

Then we have

    y^T B = y_1 b_1^T + y_2 b_2^T.

The lesson to be learned here is that Ax is a weighted sum of A's column vectors, with the weights being the elements of the vector x; and y^T B is a weighted sum of B's row vectors, with the weights being the elements of the vector y. (Subsequently, such a weighted sum will be formally defined as a linear combination.)


12. A matrix with all elements equal to zero is called a zero matrix (of size m × n). A matrix with m = n is called a square matrix, and in this case n = m is called the order of the square matrix. The elements a_{11}, a_{22}, ..., a_{nn} of a square matrix A_{n×n} are the latter's major diagonal elements. Any other elements are called off-diagonal elements. If the off-diagonal elements are symmetric about the major diagonal, this square matrix is said to be symmetric. It is easy to see that a matrix A is symmetric if and only if A = A^T. A matrix is a diagonal matrix if all off-diagonal elements are zero. If, further, all a_{ii} are equal, this diagonal matrix is said to be a scalar matrix. If, moreover, all a_{ii} = 1 in this scalar matrix, then this matrix is called an identity matrix, denoted by I_n, where n is the order of the square matrix. We shall denote the i-th column vector of I_n by u_i. Hence u_i is the (n × 1) column vector of which the i-th element is 1 and the other elements are zero. Similarly, let v_j^T be the j-th row vector of I_m, so that v_j^T is the (1 × m) row vector of which the j-th element is 1 and the other elements are zero. It follows from the preceding discussions that A_{m×n} u_i = a_i and v_j^T A_{m×n} = α_j^T, where the (m × 1) vector a_i is the i-th column vector of A and the (1 × n) vector α_j^T is the j-th row vector of A. Now, if we treat A_{m×n} as a scalar and I_n as a row vector with

    I_n = [ u_1  u_2  ...  u_n ],

then we have

    A_{m×n} I_n = [ A_{m×n} u_1   A_{m×n} u_2   ...   A_{m×n} u_n ] = [ a_1  a_2  ...  a_n ] = A_{m×n}.

Hence we have proven that A_{m×n} I_n = A_{m×n}.

Similarly, if we treat A_{m×n} as a scalar and I_m as a column vector with

    I_m = [ v_1^T ]
          [ v_2^T ]
          [  ...  ]
          [ v_m^T ],

then we have

    I_m A_{m×n} = [ v_1^T A_{m×n} ]   [ α_1^T ]
                  [ v_2^T A_{m×n} ] = [ α_2^T ] = A_{m×n}.
                  [      ...      ]   [  ...  ]
                  [ v_m^T A_{m×n} ]   [ α_m^T ]

This is why we call I_n and I_m the identity matrices. The role that I_n and I_m assume in matrix multiplication corresponds to the role that 1 plays in scalar multiplication.

13. Given a square matrix A_{n×n}, if there is another matrix B_{n×n} such that AB = BA = I_n, then A and B are said to be the inverses of each other. In this case, we say that both A and B are non-singular (or invertible). The notation A^{-1} denotes the inverse of A.

14. The inverse of a square matrix is unique whenever it exists.[3]

[3] Now we can define elementary row operations. There are three types of elementary row operations: first, to multiply a row of A by a scalar; second, to switch two rows of A; and third, to add a multiple of one row of A to another row of A. All three types of operations can be reproduced by pre-multiplying a non-singular square matrix to A. For example, let

    A = [ 1  2 ]
        [ 2  3 ]
        [ 3  4 ],

and find a matrix C_{3×3} such that CA is the same as A except that the first and the second rows of CA are respectively the second and the first rows of A. Show that the solution is

    C = [ 0  1  0 ]
        [ 1  0  0 ]
        [ 0  0  1 ].

Then find a matrix P_{3×3} such that PA is the same as A except that the second row of PA is the second row of A multiplied by 5. Show that the solution is

    P = [ 1  0  0 ]
        [ 0  5  0 ]
        [ 0  0  1 ].


    Exercise 1 Prove the above statement.

15. Some facts about the inverse operation:

    If A^{-1} is defined, then (A^{-1})^{-1} = A.

    If both A_{n×n} and B_{n×n} are non-singular, then so is AB; moreover, (AB)^{-1} = B^{-1} A^{-1}.

    If A_{n×n} is non-singular, then (A^T)^{-1} = (A^{-1})^T.

16. If a matrix is symmetric and non-singular, then its inverse is also symmetric.

    Exercise 2 Prove the above statement.

17. Consider the following system of n equations with n unknowns, written compactly in matrix notation:

    A_{n×n} x_{n×1} = b_{n×1}.

If A is non-singular, then this system of simultaneous equations has a unique solution:

    x = A^{-1} A x = A^{-1} b.

In this case, x_{n×1} = 0_{n×1} if and only if b_{n×1} = 0_{n×1}. When b_{n×1} = 0_{n×1} is given, the above equations are said to be homogeneous. For a homogeneous system of simultaneous equations, x_{n×1} = 0_{n×1} is always one solution, which is called the trivial solution.

Finally, find a matrix Q_{3×3} such that QA is the same as A except that the first row of QA is the sum of the first row of A and 3 times the second row of A. Show that the solution is

    Q = [ 1  3  0 ]
        [ 0  1  0 ]
        [ 0  0  1 ].

Verify that C, P, and Q are all invertible.
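A quick numerical check of the three elementary-row-operation matrices C, P, and Q from footnote 3. A sketch with NumPy (not part of the original notes):

    import numpy as np

    A = np.array([[1, 2], [2, 3], [3, 4]])
    C = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]])   # swaps rows 1 and 2
    P = np.array([[1, 0, 0], [0, 5, 0], [0, 0, 1]])   # multiplies row 2 by 5
    Q = np.array([[1, 3, 0], [0, 1, 0], [0, 0, 1]])   # adds 3 times row 2 to row 1

    print(C @ A)   # rows 1 and 2 of A switched
    print(P @ A)   # second row scaled by 5
    print(Q @ A)   # first row replaced by row1 + 3*row2
    # all three have non-zero determinants, hence are invertible
    print(np.linalg.det(C), np.linalg.det(P), np.linalg.det(Q))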


18. Given n numbers a_1, a_2, ..., a_n, there are n! possible ways of arranging them.[4] Each possible way of arrangement is called a permutation of the n numbers. We say a permutation has an inversion if, in the arrangement, some a_j precedes a_i with j > i.[5] A permutation is odd (respectively, even) if it has an odd (even) number of inversions.

[4] For example, consider the three numbers 1, 4 and 5. They can be arranged as 145, 154, 415, 451, 514, and 541, and hence there are 3! = 6 ways of arranging them.

[5] In the preceding example, 415 has exactly one inversion, because 4 > 1 and 4 appears earlier than 1 in the sequence of the three numbers. Also, 514 and 541 have respectively two and three inversions.

19. For each square matrix A_{n×n}, define its determinant, |A|, as the following scalar:

    |A| = Σ (±) a_{1 j_1} a_{2 j_2} ··· a_{n j_n},

where the summation is taken over all possible permutations (j_1, j_2, ..., j_n) of the column subscripts, and the positive sign is valid when the corresponding permutation is even; otherwise the negative sign is valid. Though this definition sounds formidable, it can be easily understood by looking at two examples:

Example 4 Let A_{2×2} be

    [ a_{11}  a_{12} ]
    [ a_{21}  a_{22} ],

and as a concrete example, think of A as

    [ 1  -1 ]
    [ 0  -4 ].

Then according to the above definition, |A| is the sum of all possible (±) a_{1 j_1} a_{2 j_2}'s, with the sign depending upon the number of inversions in the sequence of column subscripts. Since there are two possible column subscripts, 1 and 2, there are 2! = 2 possible permutations, 12 and 21. That is, the determinant of A is the sum of a_{11} a_{22} and -a_{12} a_{21}. Now the permutation 12 has no inversion at all (zero is treated as an even number), and 21 has one (which is odd) inversion. Therefore, |A| = a_{11} a_{22} - a_{12} a_{21}. Now we can put in the numbers and get

    |A| = 1·(-4) - (-1)·0 = -4.

We see that the expansion of |A| in the first example contains two (2!) terms, because there are 2 columns there.

Example 5 Consider the following square matrix A of order 3:

    [ a_{11}  a_{12}  a_{13} ]
    [ a_{21}  a_{22}  a_{23} ]
    [ a_{31}  a_{32}  a_{33} ].

Correspondingly, |A| should contain 6 = 3! terms. Each term is something like (±) a_{1 j_1} a_{2 j_2} a_{3 j_3}. It is easy to see that

    |A| =  a_{11} a_{22} a_{33}   (the permutation 123 has no inversion at all)
         - a_{11} a_{23} a_{32}   (the permutation 132 has one inversion)
         - a_{12} a_{21} a_{33}   (the permutation 213 has one inversion)
         + a_{12} a_{23} a_{31}   (the permutation 231 has two inversions)
         + a_{13} a_{21} a_{32}   (the permutation 312 has two inversions)
         - a_{13} a_{22} a_{31}   (the permutation 321 has three inversions).


20. Here are some facts regarding the determinant of a square matrix A_{n×n}:

    It is easy to see that |cA| = c^n |A| for any scalar c.

    |A^T| = |A|.

    The determinant |A| is zero if one row in A is a multiple of another row. To see this, consider the preceding example. If, say, for all j = 1, 2 and 3, a_{1j} = k a_{2j} with k being some constant, we see immediately that |A| is

        k(a_{21}a_{22}a_{33} + a_{22}a_{23}a_{31} - a_{22}a_{21}a_{33} - a_{21}a_{23}a_{32} + a_{23}a_{21}a_{32} - a_{23}a_{22}a_{31}) = 0.

    Since |A| = |A^T|, the previous fact also says that |A| = 0 if one column in A is a multiple of another column.

    We can push the preceding result one step further: indeed, if some column (respectively, row) in A is a linear combination of the other columns (respectively, rows), then |A| = 0.[6]

    Exercise 3 Verify the last statement for the case of n = 3.

    Let A and B be two square matrices of order n. Then |AB| = |A||B| = |B||A|. Verifying this is straightforward, although quite tedious.

    A square matrix A is non-singular if and only if its determinant |A| ≠ 0. Indeed, in the following, we shall present a way of calculating the inverse of A when |A| ≠ 0, which involves only determinants: as long as you know how to calculate the determinant of a square matrix, you can (i) determine whether a square matrix is invertible (non-singular) and (ii) (if it is invertible) calculate the inverse of that matrix by calculating several determinants. But first, we need to introduce something called a cofactor.

21. Given a square matrix A_{n×n}, the cofactor of A about the element a_{ij} is denoted by A_{ij}, which is a scalar defined as

    A_{ij} = (-1)^{i+j} |M_{ij}|,

[6] A matrix B is a linear combination of n matrices B_1, B_2, ..., B_n of the same size as B if we can find n scalars c_1, c_2, ..., c_n such that B = \sum_{i=1}^n c_i B_i.


where |M_{ij}| is the determinant of the (n-1) × (n-1) square matrix M_{ij}, obtained from the original matrix A by deleting A's i-th row and j-th column.

Example 6

    A = [ 1  2  3 ]
        [ 2  3  4 ]
        [ 3  4  5 ].

According to the above definitions, we have, for example,

    A_{11} = (-1)^{1+1} | 3  4 |        A_{23} = (-1)^{2+3} | 1  2 |
                        | 4  5 |,  and                      | 3  4 |.

22. The adjoint matrix of a square matrix A, denoted by adj(A), is the n × n square matrix whose (i, j) element is A_{ji}.

23. The following facts about the relations between a square matrix A and its adjoint matrix adj(A) underlie the idea of finding the inverse by calculating determinants:

    Pick the h-th row of A and the k-th column of adj(A) and denote the two vectors by respectively A_h and adj(A)_k. Then the inner product (a scalar) A_h adj(A)_k = |A| if h = k, and 0 otherwise.

    To summarize the previous result, indeed,

        A adj(A) = adj(A) A = |A| I_n.

Example 7 Let

    A = [  1  0  1 ]
        [  2  1  1 ]
        [ -1  2  0 ].

Then the adjoint matrix of A is

    adj(A) = [ -2  2 -1 ]
             [ -1  1  1 ]
             [  5 -2  1 ].

Note that

    A adj(A) = [  1  0  1 ] [ -2  2 -1 ]   [ 3  0  0 ]
               [  2  1  1 ] [ -1  1  1 ] = [ 0  3  0 ]
               [ -1  2  0 ] [  5 -2  1 ]   [ 0  0  3 ],

where 3 = |A|.

24. Formula of Inverse of an Invertible Matrix. A_{n×n} is non-singular if and only if |A| ≠ 0. Moreover, the inverse of A, when it exists, is given by

    A^{-1} = (1/|A|) adj(A).

The proof of this theorem follows directly from the above two facts regarding adjoint matrices.
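The cofactor/adjoint route to the inverse (points 21-24) can be checked on Example 7. A sketch with NumPy (not part of the original notes):

    import numpy as np

    def adjoint(A):
        """Adjoint (adjugate) matrix: the (i, j) element is the cofactor A_ji."""
        n = A.shape[0]
        cof = np.zeros_like(A, dtype=float)
        for i in range(n):
            for j in range(n):
                minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
                cof[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
        return cof.T                                   # transpose of the cofactor matrix

    A = np.array([[1.0, 0.0, 1.0], [2.0, 1.0, 1.0], [-1.0, 2.0, 0.0]])
    adjA = adjoint(A)
    print(np.round(adjA))                              # [[-2 2 -1], [-1 1 1], [5 -2 1]]
    print(np.round(A @ adjA))                          # 3 times the identity, since |A| = 3
    print(np.allclose(adjA / np.linalg.det(A), np.linalg.inv(A)))   # True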

25. A symmetric matrix A_{n×n} is said to be positive definite (or PD) if for all x_{n×1} ≠ 0_{n×1}, the quadratic form x^T A x > 0. (The quadratic form is a second-degree polynomial in x. With x given, it becomes a scalar.) A symmetric matrix A_{n×n} is said to be negative definite (or ND) if -A is positive definite.

26. A symmetric matrix A_{n×n} is said to be positive semi-definite (or PSD) if for all x_{n×1} ∈ R^n, the quadratic form x^T A x ≥ 0. A symmetric matrix A_{n×n} is said to be negative semi-definite (or NSD) if -A is positive semi-definite.

27. Any diagonal matrix with positive major diagonal elements is positive definite. An example is the identity matrix I_n. Note that x^T I_n x = \sum_{i=1}^n x_i^2 > 0 whenever the x_i are not all zeros.


28. A symmetric matrix A_{n×n} is positive definite if and only if

    | a_{11}  a_{12}  a_{13}  ...  a_{1k} |
    | a_{21}  a_{22}  a_{23}  ...  a_{2k} |
    | a_{31}  a_{32}  a_{33}  ...  a_{3k} |  >  0,
    |  ...     ...     ...    ...   ...   |
    | a_{k1}  a_{k2}  a_{k3}  ...  a_{kk} |

for all k ∈ {1, 2, ..., n}.

29. A symmetric matrix A_{n×n} is negative definite if and only if

    (-1)^k  | a_{11}  ...  a_{1k} |
            |  ...    ...   ...   |  >  0,
            | a_{k1}  ...  a_{kk} |

for all k ∈ {1, 2, ..., n}.
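The leading-principal-minor tests of points 28-29 are easy to automate. A minimal sketch with NumPy (the test matrix is made up for illustration; not part of the original notes):

    import numpy as np

    def is_positive_definite(A):
        """Every leading principal minor of the symmetric matrix A is > 0."""
        n = A.shape[0]
        return all(np.linalg.det(A[:k, :k]) > 0 for k in range(1, n + 1))

    def is_negative_definite(A):
        """(-1)^k times each leading principal minor is > 0."""
        n = A.shape[0]
        return all((-1) ** k * np.linalg.det(A[:k, :k]) > 0 for k in range(1, n + 1))

    V = np.array([[2.0, 1.0], [1.0, 2.0]])                      # hypothetical symmetric matrix
    print(is_positive_definite(V), is_negative_definite(-V))    # True True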

30. Let A_{n×n} be a square matrix of order n. Given any k ∈ {1, 2, ..., n}, we can construct a sequence of numbers with k terms, with each term in the sequence coming from the set {1, 2, ..., n}. How many possible sequences with k terms can we construct? There are n!/(k!(n-k)!) possible ways of picking the k numbers. Taking account of all possible permutations of the picked k numbers, there are in total n!/(n-k)! ways of constructing a sequence with k terms out of the set {1, 2, ..., n}. Given k, let the j-th such sequence be s_{kj} = (s_{1j}, s_{2j}, ..., s_{kj}). Given k, correspondingly, we can now construct n!/(n-k)! square matrices of order k out of the original matrix A such that the (l, m) element of the j-th matrix of order k is the (s_{lj}, s_{mj}) element of A. We denote this new matrix of order k by D_{kj}, where recall once again that k ∈ {1, 2, ..., n} is given and j = 1, 2, ..., n!/(n-k)!.

31. A symmetric matrix A_{n×n} is positive semi-definite if and only if for all k ∈ {1, 2, ..., n} and for all j ∈ {1, 2, ..., n!/(n-k)!}, |D_{kj}| ≥ 0. A symmetric matrix A of order n is negative semi-definite if and only if for all k ∈ {1, 2, ..., n} and for all j ∈ {1, 2, ..., n!/(n-k)!}, (-1)^k |D_{kj}| ≥ 0.[7]

Example 8 Let A be a square matrix of order 2 with its only non-zero element being a_{22} = 1. This matrix is positive semi-definite, because it is symmetric and because

    a_{11} ≥ 0,   a_{22} ≥ 0,   det[ a_{11} a_{12} ; a_{21} a_{22} ] ≥ 0,   det[ a_{22} a_{21} ; a_{12} a_{11} ] ≥ 0.

This matrix is not negative semi-definite: although

    (-1)^1 a_{11} ≥ 0,   (-1)^2 det[ a_{11} a_{12} ; a_{21} a_{22} ] ≥ 0,   (-1)^2 det[ a_{22} a_{21} ; a_{12} a_{11} ] ≥ 0,

note that

    (-1)^1 a_{22} < 0.
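The semi-definite tests of point 31 use all principal minors, not just the leading ones. Since simultaneously permuting the picked rows and columns does not change a determinant, it is enough to loop over unordered index sets; a sketch with NumPy and itertools applied to the matrix of Example 8 (not part of the original notes):

    import numpy as np
    from itertools import combinations

    def is_positive_semidefinite(A, tol=1e-12):
        """All principal minors (rows and columns picked by the same index set) are >= 0."""
        n = A.shape[0]
        for k in range(1, n + 1):
            for idx in combinations(range(n), k):
                if np.linalg.det(A[np.ix_(idx, idx)]) < -tol:
                    return False
        return True

    A = np.array([[0.0, 0.0], [0.0, 1.0]])      # the matrix of Example 8
    print(is_positive_semidefinite(A))          # True
    print(is_positive_semidefinite(-A))         # False, so A is not negative semi-definite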

32. A non-empty set V is called a real vector space if (i) x ∈ V (an element of V is called a vector) implies that, for every c ∈ R, cx is also an element of V; (ii) x, y ∈ V implies that x + y is also in V; and (iii) a set of properties regarding vector addition and scalar multiplication holds for all elements x ∈ V and all real numbers (for details see a textbook on linear spaces; note in particular the existence of a zero vector for every vector space). Because x, y ∈ V implies that all linear combinations of x and y are also elements of V, a vector space is also known as a linear space.

33. A subset of V is a vector subspace if its elements also form a vector space, i.e. the subset possesses the aforementioned three properties that a vector space should have.

34. In spite of the previous statement, most vector spaces that interest us can be described by only a finite number of their elements, called a basis of the vector space. To introduce the notion of basis, we first go over the idea of linear independence.

[7] A necessary condition for A to be negative semi-definite is then, according to this result, that all major diagonal elements of A are less than or equal to zero.


35. Let x_1, x_2, ..., x_n be n elements of a vector space V and 0 the zero element of V. Consider the following system of simultaneous equations with unknowns c_1, c_2, ..., c_n ∈ R:

    \sum_{i=1}^n c_i x_i = 0.

The vectors x_1, x_2, ..., x_n are said to be linearly independent if the above system of equations has a unique solution (which is obviously the trivial solution), and are said to be linearly dependent otherwise.[8]

Example 9 Consider the set V that contains all 2 × 1 column vectors with each entry in the column vector being a real number. Check that this is the familiar space known as the R² space. Now, consider the following three elements of V:

    [ 1 ]    [ 0 ]    [ 1 ]
    [ 0 ],   [ 1 ],   [ 1 ].

Are these linearly independent vectors? The answer is no. Note that, for instance,

    (-1) [ 1 ]     [ 1 ]     [ 0 ]   [ 0 ]
         [ 1 ] + 1 [ 0 ] + 1 [ 1 ] = [ 0 ],

where note that

    [ c_1 ]   [ -1 ]   [ 0 ]
    [ c_2 ] = [  1 ] ≠ [ 0 ]
    [ c_3 ]   [  1 ]   [ 0 ].

Now choose any two vectors among the three. Are the two chosen vectors linearly independent? The answer is positive. For instance, say you take

    [ 1 ]        [ 1 ]
    [ 0 ]  and   [ 1 ].

You cannot find real numbers c_1, c_2, not both zero, such that

    c_1 [ 1 ]       [ 1 ]   [ 0 ]
        [ 0 ] + c_2 [ 1 ] = [ 0 ].

We conclude that, among the three vectors, there are two linearly independent vectors and that the three vectors are linearly dependent.

[8] An immediate consequence of this result is that a single vector x ∈ V is linearly independent if and only if x ≠ 0.
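The (in)dependence claims in Example 9 can be checked by computing ranks. A sketch with NumPy (not part of the original notes):

    import numpy as np

    v1, v2, v3 = np.array([1, 0]), np.array([0, 1]), np.array([1, 1])

    # three vectors in R^2: rank 2 < 3, so they are linearly dependent
    print(np.linalg.matrix_rank(np.column_stack([v1, v2, v3])))   # 2

    # any two of them: rank 2, so each pair is linearly independent
    print(np.linalg.matrix_rank(np.column_stack([v1, v3])))       # 2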

36. Span and Basis: Given n elements of V, {x_1, x_2, ..., x_n}, the set of all their linear combinations is called their linear span (or simply span). Thus, the linear span must be a subspace of V. If {x_1, x_2, ..., x_n} is a set of linearly independent vectors, then it is one basis of their span, and each x_i is one basis vector. More generally, a basis for a linear space is a subset of the latter such that (i) the elements in this subset are linearly independent; and (ii) all other elements of this linear space can be represented as linear combinations of these basis vectors.

Exercise 4 Show that in the previous example V is the span of any two vectors listed in the example.

37. If {x_1, x_2, ..., x_n} and {y_1, y_2, ..., y_m} are two bases of the same vector space V, then m = n. That is, the number of linearly independent vectors needed to span a given vector space is unique! This unique number is called the dimension of V, and is denoted by dim(V).

Exercise 5 Prove the last statement.

38. Some vector spaces are of finite dimension, but others are of infinite dimension. For example, the dimension of R^n is n < +∞ (here R^n denotes the n-dimensional Euclidean space; I shall use R and ℜ interchangeably for the R^1 space, which, as we recall, is the real line). Thus R^n is a finite-dimensional vector space. Recall that u_1, u_2, ..., u_n constitute the set of standard basis vectors of R^n.


39. To give an example of an infinite-dimensional space, consider the set of continuous functions that map [0, 1] into R, denoted by C[0, 1]. It can be verified that this set is a real vector space with the natural definitions for scalar multiplication and vector addition.[9] Note that, however large n is, it is not possible to find f_1, f_2, ..., f_n ∈ C[0, 1] such that all other functions in C[0, 1] are linear combinations of the f_i's. Hence C[0, 1] must be an infinite-dimensional space.

40. A real vector space must contain an infinite number of elements, except the trivial space {0}, which contains nothing but the zero element. This vector space is zero-dimensional.

41. Let X and Y be two vector spaces and f: X → Y a function that maps X into Y. The function is said to be concave (respectively, convex; affine) if for any two elements x, x' ∈ X and for any two real numbers a, b ≥ 0 with a + b = 1,

    f(ax + bx') ≥ (respectively, ≤; =) a f(x) + b f(x'),

where ≥ is a complete ordering defined on Y. The function f is said to be linear if it is affine and f(0_X) = 0_Y, i.e. it maps the zero element of X into the zero element of Y. If for some positive integers m and n, X = R^n and Y = R^m, then any linear function f can be represented in the following form: for some matrix A_{m×n},

    f(x) = Ax,  x ∈ R^n.

In this case, if in addition m = 1, then f is called a (real) linear functional. If m = n and A is non-singular, we know that given each y there is a unique x that satisfies

    f(x) = y.

In this case, the inverse function f^{-1}: Y → X exists and is also linear. Apparently, f^{-1} is associated with the matrix A^{-1} in exactly the same way that f is associated with the matrix A.

[9] Specifically, for all scalars c and f, g ∈ C[0, 1], define the new function cf by (cf)(x) = c·f(x), x ∈ [0, 1], and the new function g + f by (g + f)(x) = f(x) + g(x), x ∈ [0, 1]. It is easy to verify that these new functions are continuous, and hence contained in C[0, 1]. Finally, the zero function (the constant function that maps each and every point in [0, 1] to 0) is continuous, and is taken to be the zero vector in C[0, 1].


42. Consider the vector space R^n. Given a matrix A_{m×n}, define the set M(A) = {Ax | x ∈ R^n},[10] which is a subspace of R^m. Note that M(A) is the linear span of the column vectors of A. Now define another set N(A) = {x | Ax = 0_{m×1}}, which is a subspace of R^n, with, say, dimension k.[11] Suppose that {e_1, e_2, ..., e_n} is one basis of R^n, and {e_1, e_2, ..., e_k} is one basis of N(A). Then {Ae_{k+1}, Ae_{k+2}, ..., Ae_n} is one basis of M(A).

Proof First observe that Ae_{k+1}, Ae_{k+2}, ..., Ae_n ∈ M(A). Next, we show that these (n-k) vectors are linearly independent. Consider the following m equations with c_{k+1}, c_{k+2}, ..., c_n being the (n-k) unknowns:

    c_{k+1} Ae_{k+1} + c_{k+2} Ae_{k+2} + ··· + c_n Ae_n = 0_{m×1}.

If we can show that c_{k+1} = c_{k+2} = ··· = c_n = 0 is the unique solution to this system of equations, then we are done. Note that, as part of the basis of R^n, e_{k+1}, e_{k+2}, ..., e_n are linearly independent of e_1, e_2, ..., e_k. Note that, on the other hand, the system of equations says that c_{k+1} e_{k+1} + c_{k+2} e_{k+2} + ··· + c_n e_n lies in N(A) and therefore is some linear combination of e_1, e_2, ..., e_k. That leaves us with only one possibility: c_{k+1} e_{k+1} + c_{k+2} e_{k+2} + ··· + c_n e_n = 0_{n×1}. By the linear independence of e_{k+1}, e_{k+2}, ..., e_n, we conclude that c_{k+1} = c_{k+2} = ··· = c_n = 0.

We still need to show that any element of M(A) can be represented by some linear combination of Ae_{k+1}, Ae_{k+2}, ..., Ae_n. Consider any other element of M(A), which can be denoted by Ay for some y ∈ R^n, with y = y_1 e_1 + y_2 e_2 + ··· + y_n e_n. But then

    Ay = A(y_1 e_1 + y_2 e_2 + ··· + y_n e_n) = y_{k+1}(Ae_{k+1}) + y_{k+2}(Ae_{k+2}) + ··· + y_n(Ae_n),

where the second equality follows from the fact that for all i = 1, ..., k, e_i ∈ N(A), and therefore Ae_i = 0_{m×1}. Thus the proof is complete.

[10] Thus M(A) is the image of the following linear function f: R^n → R^m: f(x_{n×1}) = A_{m×n} x. It can be shown that a function f: R^n → R^m is linear if and only if there exists a matrix A_{m×n} such that f(x_{n×1}) = A_{m×n} x.

[11] N(A) is called the null space or the kernel of the matrix A, or of the linear function f(x_{n×1}) = Ax.


43. Fundamental Theorem of Linear Algebra.

Theorem 1 Given a matrix A_{m×n}, define M(A) and N(A) as above. Then we have from the preceding discussions that

    dim(N(A)) + dim(M(A)) = n.

The above theorem says that the dimension of the domain space R^n in the linear transformation Ax = y ∈ R^m is equal to the sum of the dimension of the null space and the dimension of the linear span of A's column vectors!

44. Consider the following system of homogeneous equations in matrix notation:

    A_{m×n} x_{n×1} = 0_{m×1},

where x = (x_1, x_2, ..., x_n)^T are the n unknowns. Now, if dim(M(A)) = n (which cannot occur if m < n),[12] then dim(N(A)) = 0, which means that N(A) contains only one element. (A set containing only one element is called a singleton.) Namely, the trivial solution x = 0_{n×1} is the unique solution. Suppose, instead, that dim(M(A)) = n-1; then N(A) is a one-dimensional surface (a one-dimensional manifold in general, and a line in this special case) in the space R^n! This is one situation where we are faced with an infinite number of solutions. If dim(M(A)) = n-2, then N(A) is a two-dimensional surface. Again, we have an infinite number of solutions. The difference between the two cases is that the infinite number of solutions can be spanned by respectively one and two basis vectors!

45. To make the above theorem useful, we introduce some way of calculating the dimension of a space. Recall that the dimension of a space is the number of linearly independent vectors in any basis of that space. It turns out that the number of linearly independent vectors is closely related to the notion of rank, which we now introduce.

[12] Equivalently, if there are more than n equations of which n are linearly independent, then the system of equations has one unique solution.


46. Given a matrix A_{m×n}, the row rank (respectively, column rank) of A, denoted by r(A), is the maximum number of linearly independent rows (respectively, columns) in A. It turns out that the column rank of a matrix is always equal to its row rank, and hence we can simply call it the rank of the matrix. (For a proof, see any textbook on linear algebra.)
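The rank of point 46 and the dimension count in Theorem 1 can be checked numerically. A sketch using NumPy's SVD-based rank, with a made-up 2 × 3 matrix (not part of the original notes):

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0]])       # hypothetical matrix; second row = 2 * first row

    r = np.linalg.matrix_rank(A)          # dim M(A): number of linearly independent columns
    n = A.shape[1]
    s = np.linalg.svd(A, compute_uv=False)
    dim_null = n - np.sum(s > 1e-12)      # dim N(A) = n minus the number of non-zero singular values

    print(r, dim_null, r + dim_null == n)     # 1 2 True, as Theorem 1 requires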

47. The following are equivalence relations among the rank, determinant, and non-singularity of a square matrix A_{n×n}:

    A is non-singular if and only if r(A) = n, i.e. A has n linearly independent column vectors (equivalently, dim(M(A)) = n).

    Proof Consider the necessity. Suppose that A is non-singular. We must show that every y ∈ R^n is an element of M(A). But, given the non-singularity of A, we can always represent each y ∈ R^n as

        y = Ax,

    where x = A^{-1} y ∈ R^n. This shows that R^n ⊆ M(A). But obviously M(A) ⊆ R^n, and hence we conclude that R^n = M(A) and dim(M(A)) = n.

    Next, consider sufficiency. Suppose that r(A) = n = dim(M(A)); then each y ∈ R^n is an element of M(A). In particular, every column vector of I_n is in M(A). That is, we can find n column vectors b_1, b_2, ..., b_n such that

        A [b_1 | b_2 | ··· | b_n] = I_n.

    But then by definition the (n × n) matrix [b_1 | b_2 | ··· | b_n] is the inverse of A, and hence A is non-singular!

    A is non-singular if and only if |A| ≠ 0. (The sufficiency follows from the fact that if |A| is non-zero, we can find the inverse of A by forming the adjoint matrix of A. For the proof of the necessity, note that for any two square matrices A and B of order n, |AB| = |A||B|. Now, let B = A^{-1} and note that AB = I_n.)

48. Some facts about the ranks of matrices (assume A_{m×n}, B_{m×m} and C_{n×n}):

    r(A) = r(A^T).
    r(AA^T) = r(A^T A) = r(A).
    r(BAC) = r(A) if both B and C are non-singular.
    r(A) is the maximum order of the submatrices of A with non-zero determinants.
    r(A) ≤ min(m, n).
    r(0_{m×n}) = 0.
    r(AC) ≤ min(r(A), r(C)).

49. The trace of a square matrix A_{n×n} is the scalar tr(A) = \sum_{i=1}^n a_{ii}. One can check that tr(A) = tr(A^T) and that tr(AB) = tr(BA) if both AB and BA are well-defined square matrices.

50. Given a square matrix A_{n×n}, an eigenvector of A is a non-zero vector x_{n×1} that, after being linearly transformed by A, becomes a multiple of itself. In other words, x is an eigenvector of A if there exists a scalar λ such that

    Ax = λx.

The scalar λ is called the eigenvalue corresponding to the eigenvector x.

51. The above relation can be rearranged to get

    (A - λI_n) x = 0_{n×1}.

Note that x = 0_{n×1} is always one solution to the above equation. But remember, we are looking for non-zero vectors! Suppose there do exist non-zero solutions to the above equation. What does this mean? It means that dim(N(A - λI_n)) > 0, and therefore dim(M(A - λI_n)) < n! Our earlier discussion on the equivalence relations among the rank, determinant and non-singularity of a square matrix then implies that |A - λI_n| = 0! Denote |A - λI_n| by p(λ). Then, by expanding, we can show that p(λ) is a polynomial of λ of degree n. The equation p(λ) = 0 is known as the characteristic equation for the matrix A. Note that in p(λ) the coefficient of λ^n is (-1)^n, and therefore the equation p(λ) = 0 can be written as

    |A - λI_n| = p(λ) = (λ_1 - λ)(λ_2 - λ) ··· (λ_n - λ) = 0,

where λ_i is the i-th root of the polynomial equation according to some way of labelling (e.g. such that λ_1 ≥ λ_2 ≥ ··· ≥ λ_n). Apparently, we are not guaranteed real roots. Note that by setting λ to zero, the above equation yields

    |A| = λ_1 λ_2 ··· λ_n.

By inspection, the coefficient of λ^{n-1} in the polynomial p(λ) is (-1)^{n-1} tr(A), while the coefficient of λ^{n-1} in (λ_1 - λ)(λ_2 - λ) ··· (λ_n - λ) is (-1)^{n-1} \sum_{i=1}^n λ_i. We conclude that

    tr(A) = \sum_{i=1}^n λ_i.

52. The following properties regarding eigenvalues and eigenvectors can be easily verified:

    If both A and P are square matrices of order n and P is non-singular, then PAP^{-1} and A have identical characteristic equations, eigenvalues, trace, and rank.

    If λ is an eigenvalue of A, then for any positive integer p, λ^p is an eigenvalue of the matrix A^p (the product of A multiplied by itself p times).

    If A = A^T, then all the eigenvalues are real. Moreover, if two eigenvalues are distinct, λ_1 ≠ λ_2, then the two corresponding eigenvectors are orthogonal to each other: x_1^T x_2 = 0.

    The way to compute the eigenvector that corresponds to the largest eigenvalue, say λ_1, is to solve

        max_{x ∈ R^n \ {0}}  (x^T A x) / (x^T x).

    To see the idea, note that if λ is one eigenvalue of A with the corresponding eigenvector x, then

        x^T [(A - λI_n) x] = 0,

    or equivalently,

        (x^T A x) / (x^T x) = λ.

    Note that the above left-hand side is well-defined, because x is a non-zero vector and I_n is positive definite. Similarly, the eigenvector that corresponds to the j-th largest eigenvalue solves

        max_{x ∈ R^n \ {0}}  (x^T A x) / (x^T x),

    subject to the additional constraint that x be orthogonal to x_1, x_2, ..., x_{j-1}.
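Points 50-52 can be illustrated numerically for a small symmetric matrix. A sketch with NumPy (the test matrix is made up; not part of the original notes):

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 2.0]])                # hypothetical symmetric matrix
    eigvals, eigvecs = np.linalg.eig(A)
    print(eigvals)                                        # real eigenvalues (3 and 1, in some order)
    print(np.isclose(np.trace(A), eigvals.sum()))         # tr(A) equals the sum of eigenvalues
    print(np.isclose(np.linalg.det(A), eigvals.prod()))   # |A| equals the product of eigenvalues

    # the Rayleigh quotient at an eigenvector recovers its eigenvalue
    x = eigvecs[:, 0]
    print((x @ A @ x) / (x @ x), eigvals[0])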

53. Exercise 6 Let A_{n×n} be a positive definite matrix. Show that A is non-singular, i.e. its inverse exists.

Proof. Suppose not. Then for some i ∈ {1, 2, ..., n}, the i-th column of A, denoted a_i, can be represented as a linear combination of the other column vectors of A. Let A_{-i} be the n × (n-1) matrix that we obtain by deleting a_i from A. Then for some non-zero (n-1)-vector b, we have

    a_i = A_{-i} b.

Define the n-vector x such that its i-th element is -1, and such that if we delete its i-th element we obtain b. Apparently x ≠ 0. Observe that

    x^T A x = x^T [a_i · (-1) + A_{-i} b] = x^T 0 = 0,

which is a contradiction to the fact that A is positive definite.

54. An orthogonal matrix C_{n×n} is one whose inverse exists and equals its transpose: C^T = C^{-1}. It follows that |C| = ±1. The following is an important theorem:

55. If A = A^T is of order n, then there exist an orthogonal matrix C_{n×n} and a diagonal matrix D_{n×n} such that

    A = C D C^T.

Moreover, r(A) is equal to the number of non-zero main diagonal elements of D.[13]

[13] Those elements are exactly the eigenvalues of the matrix A.
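The decomposition A = C D C^T of point 55 is what numpy.linalg.eigh returns for a symmetric matrix. A short sketch (the matrix is made up; not part of the original notes):

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 2.0]])             # hypothetical symmetric matrix
    d, C = np.linalg.eigh(A)                           # d holds the eigenvalues, C is orthogonal
    D = np.diag(d)
    print(np.allclose(C @ D @ C.T, A))                 # True: A = C D C^T
    print(np.allclose(C.T, np.linalg.inv(C)))          # True: C is orthogonal
    print(np.linalg.matrix_rank(A) == np.sum(np.abs(d) > 1e-12))   # rank = # non-zero diagonal elements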


56. Exercise 7 Suppose that A_{n×n} is positive semi-definite. Show that if A is invertible, then A is positive definite.

Proof. The preceding section tells us that there exist an orthogonal matrix C and a diagonal matrix D such that A = C D C^T. Note that for each x_{n×1}, there exists y_{n×1} such that x = Cy; in fact, x = 0 if and only if y = 0. Now we prove the assertion by contraposition. Suppose instead that A is not positive definite, so that there exists x ≠ 0 such that

    0 = x^T A x = y^T D y = \sum_{i=1}^n d_{ii} y_i^2,

where d_{ii} is the (i, i)-th element of D (and hence is the i-th eigenvalue of A). Let z_i be the (non-zero) eigenvector associated with the eigenvalue d_{ii}; that is, for all i = 1, 2, ..., n,

    d_{ii} z_i = A z_i   ⇒   0 ≤ z_i^T A z_i = d_{ii} z_i^T z_i   ⇒   d_{ii} ≥ 0.

Since x ≠ 0, we have y ≠ 0, and if d_{ii} > 0 for all i = 1, 2, ..., n, then

    \sum_{i=1}^n d_{ii} y_i^2 > 0,

and we would have a contradiction. Hence it must be that d_{ii} = 0 for some i ∈ {1, 2, ..., n}. For that zero d_{ii}, the associated eigenvector z_i must be such that

    A z_i = d_{ii} z_i = 0,

and since A is invertible, we have

    z_i = A^{-1} 0 = 0,

which contradicts the fact that z_i is a non-zero vector.

57. Consider a twice differentiable function f: R^n → R. Let Df: R^n → R^n be the vector function

    Df = [ ∂f/∂x_1 ]
         [ ∂f/∂x_2 ]
         [   ...   ]
         [ ∂f/∂x_n ],


which will be referred to as the gradient of f. Let D²f: R^n → R^{n×n} be the matrix function

    D²f = [ ∂²f/∂x_1∂x_1   ∂²f/∂x_1∂x_2   ...   ∂²f/∂x_1∂x_n ]
          [ ∂²f/∂x_2∂x_1   ∂²f/∂x_2∂x_2   ...   ∂²f/∂x_2∂x_n ]
          [      ...            ...       ...        ...     ]
          [ ∂²f/∂x_n∂x_1   ∂²f/∂x_n∂x_2   ...   ∂²f/∂x_n∂x_n ],

which will be referred to as the Hessian of f.

58. Recall the following definitions from Section 41. A function f: R^n → R is concave if for all x, y ∈ R^n and all λ ∈ [0, 1], f(λx + (1-λ)y) ≥ λf(x) + (1-λ)f(y). A concave function is said to be strictly concave if the above defining inequality is always strict (for x ≠ y and λ ∈ (0, 1)). A function f is (strictly) convex if -f is (strictly) concave. A function is affine if it is both concave and convex. An affine function is linear if it passes through the origin.

Theorem 2 A twice differentiable function f: R^n → R is concave if and only if D²f is a negative semi-definite matrix at each and every x = (x_1, x_2, ..., x_n)^T ∈ R^n; if D²f is negative definite at every such x, then f is strictly concave.

59. Functions appearing in this and the next sections will be twice differentiable. A necessary condition for x* ∈ R^n to solve the following maximization program (P1),

    max_{x ∈ R^n} f(x),

is that Df(x*) = 0, which will be referred to as the first-order conditions. This necessary condition is also sufficient if f is concave.[14]

[14] The existence of an optimal solution has not been guaranteed. For example, suppose that n = 1 and f(x) = 1 - e^{-x}. Since f'(x) = e^{-x} > 0 for all x ∈ R, there does not exist x* satisfying Df(x*) = f'(x*) = 0 in this case, and since the latter is the equivalent condition for an optimal solution (f being concave), there does not exist a solution to the optimization problem.


60. Consider the following maximization program (P2):

    max_{x ∈ R^n} f(x)
    subject to g_i(x) = 0, i = 1, 2, ..., m,

where m < n.

Theorem 3 (Lagrange Theorem) Given any feasible x, define

    J(x) ≡ {j ∈ {1, 2, ..., m} : g_j(x) = 0};

that is, J(x) is the set of constraints binding at x. (In the current case we have only equality constraints, so it is trivially true that J(x) = {1, 2, ..., m}.) Define

    DJ(x) = {Dg_j(x) : j ∈ J(x)};

that is, DJ(x) contains the gradient vectors at x of those constraints binding at x. Now, suppose that x* solves (P2) and DJ(x*) contains a set of linearly independent gradient vectors (this is known as a constraint qualification condition). Then there must exist m constants (called Lagrange multipliers) λ_1, λ_2, ..., λ_m such that

    (i) g_i(x*) = 0 for all i = 1, 2, ..., m; and
    (ii) \sum_{i=1}^m λ_i Dg_i(x*) = Df(x*).

Conversely, if f is concave and all the g_i's are affine,[15] and if x* satisfies (i) and (ii), then x* solves (P2).[16]

[15] We have shown that every linear function g: R^n → R takes the form g(x) = a^T x for some n-vector a. It follows that every affine function g: R^n → R takes the form g(x) = a^T x + b for some n-vector a and some scalar b. It can be easily verified that a is exactly the gradient of g, which is a constant function. Note that for x, x' such that g(x) = g(x'), we have

    0 = g(x) - g(x') = a^T (x - x') = Dg^T (x - x').

[16] Let me prove the sufficiency. Recall the following Taylor's Theorem with Remainder, which can be found in, for example, Chapter 7 of M. H. Protter and C. B. Morrey, 1977, A First Course in Real Analysis, New York: Springer-Verlag. Suppose that f: R^n → R is twice continuously differentiable on an open ball

    B(a, r) ≡ {x ∈ R^n : √((a_1 - x_1)² + (a_2 - x_2)² + ··· + (a_n - x_n)²) < r},

where recall that the open ball contains every x ∈ R^n whose distance from the center a of the ball is less than r > 0. Then for each x ∈ B(a, r), there exists some y lying on the line segment between a and x such that

    f(x) = f(a) + Df(a)^T (x - a) + (1/2)(x - a)^T D²f(y)(x - a).

Now suppose that f is concave, so that D²f(z) is negative semi-definite for all z ∈ R^n. This implies that

    (1/2)(x - a)^T D²f(y)(x - a) ≤ 0,

and that

    f(x) - f(a) ≤ Df(a)^T (x - a).

Similarly, we have

    g(x) - g(a) ≥ Dg(a)^T (x - a)

if g: R^n → R is convex. Now let us go back to the proof of the sufficiency of the Lagrange Theorem. Suppose that there exists x* satisfying

    (i) g_i(x*) = 0 for all i = 1, 2, ..., m; and
    (ii) \sum_{i=1}^m λ_i Dg_i(x*) = Df(x*),

for some λ_1, λ_2, ..., λ_m ∈ R. Consider any other x satisfying

    g_i(x) = 0 for all i = 1, 2, ..., m.

We must show that

    f(x*) ≥ f(x).

To see that this is true, note that

    f(x) - f(x*) ≤ Df(x*)^T (x - x*) = \sum_{i=1}^m λ_i Dg_i(x*)^T (x - x*) = 0,

where the last equality follows from the fact that g_i(·) is affine for all i = 1, 2, ..., m; see the preceding footnote.

61. To give an intuitive explanation of the Lagrange Theorem, consider the case where m = 1, n = 2 in (P2). In this case, (P2) can be rewritten as

    max_{(x,y) ∈ R²} f(x, y)   s.t.   g(x, y) = 0.

Suppose that at a solution (x*, y*), either ∂g/∂x or ∂g/∂y is non-zero (which is the constraint qualification condition). Suppose the latter is true. Then by the implicit function theorem, we can represent y = h(x) locally, with

    h'(x) = - (∂g/∂x) / (∂g/∂y).

That is, in a neighborhood of (x*, y*), we can reproduce the solution by solving

    max_x f(x, h(x)),

which is an unconstrained problem with a single control variable. It is necessary that

    d f(x, h(x)) / dx |_{x = x*} = 0,

or, at (x*, h(x*)),

    ∂f/∂x + (∂f/∂y) h'(x*) = 0,

which means

    (∂f/∂x) / (∂g/∂x) = (∂f/∂y) / (∂g/∂y).

These are exactly the content of the Lagrange Theorem. Note that we can reproduce these necessary conditions by first forming a new function called the Lagrangian, defined as

    L(x, y, λ) = f(x, y) - λ g(x, y),

and then setting the three partial derivatives of the Lagrangian to zero.

62. Exercise 8 Let f(x, y) = x^a y^{1-a}, 0 < a < 1, and g(x, y) = px + qy - I in the preceding section, but require that x, y ∈ R_+. Solve for the optimal solution (x*, y*) using the Lagrange Theorem.[17]

63. Consider the following maximization program (P3):

    max_{x ∈ R^n} f(x)
    subject to g_i(x) ≤ 0, i = 1, 2, ..., m.

Theorem 4 (Kuhn-Tucker Theorem) Suppose that there exists some x such that g_i(x) < 0 for all i = 1, 2, ..., m. (This is called the Slater Condition.) Then if x* is a solution to (P3), there must exist m non-negative constants (called the Lagrange multipliers for the m constraints) λ_1, λ_2, ..., λ_m such that

    (i) \sum_{i=1}^m λ_i Dg_i(x*) = Df(x*); and
    (ii) (complementary slackness) λ_i g_i(x*) = 0 for all i = 1, 2, ..., m.[18][19]

Conversely, if f is concave and for all i = 1, 2, ..., m, g_i: R^n → R is convex, and if x* satisfies the above (i) and (ii), then x* is a solution to the above program (P3).

[17] Hint: Show that

    D²f = a(1-a) x^{a-2} y^{-1-a} [ -y²   xy ]
                                  [  xy  -x² ].

[18] Another version of the theorem requires that the constraint qualification condition hold: if x* solves (P3) with DJ(x*) containing a set of linearly independent gradient vectors, then (i) and (ii) must hold; and if f and all the g_i's are concave, then x* solves (P3) if x* satisfies (i) and (ii).

[19] Let me prove the sufficiency. Suppose that f is concave and all the g_i's are convex. Suppose that there exist non-negative λ_1, λ_2, ..., λ_m such that at x*, (i) \sum_{i=1}^m λ_i Dg_i(x*) = Df(x*); and (ii) (complementary slackness) λ_i g_i(x*) = 0 for all i = 1, 2, ..., m. We must show that given any x such that g_i(x) ≤ 0 for all i = 1, 2, ..., m, we have

    f(x*) - f(x) ≥ 0.

To this end, note that

    f(x) - f(x*) ≤ Df(x*)^T (x - x*)
                 = \sum_{i=1}^m λ_i Dg_i(x*)^T (x - x*)
                 ≤ \sum_{i=1}^m λ_i [g_i(x) - g_i(x*)]
                 = \sum_{i=1}^m λ_i g_i(x) ≤ 0,

where the first inequality follows from the preceding footnote and the fact that f is concave, the first equality follows from the assumption that the Kuhn-Tucker conditions hold at x*, the second inequality follows from the preceding footnote and the fact that λ_i ≥ 0 and g_i is convex for all i = 1, 2, ..., m, and the last equality follows from the assumption that complementary slackness holds at x* and the fact that g_i(x) ≤ 0 for all i = 1, 2, ..., m.

64. Exercise 9 Consider the differentiable function f: R^n → R defined by

    f(x_{n×1}) = a^T x,  x ∈ R^n,

where a_{n×1} is a given vector. Show that Df(x) = a for all x ∈ R^n.[20]

65. Exercise 10 Consider the differentiable function f: R^n → R defined by

    f(x_{n×1}) = x^T A x,  x ∈ R^n,

where A_{n×n} is a given square matrix. Show that[21] for all x ∈ R^n,

    Df(x) = Ax + A^T x

and D²f(x) = A + A^T.

[20] By definition, f(x_{n×1}) = a^T x = \sum_{j=1}^n a_j x_j, so that ∂f/∂x_j = a_j, which by definition is the j-th element of Df.

[21] (Step 1.) First recall that the j-th element of A_{m×n} x_{n×1} is simply the inner product of x and the j-th row of A:

    Ax = [ α_1^T ]       [ α_1^T x ]
         [ α_2^T ]  x  = [ α_2^T x ]
         [  ...  ]       [   ...   ]
         [ α_m^T ]       [ α_m^T x ],

where α_j^T is the j-th row of A, and the second equality follows when we treat A as a column vector and x as a scalar (check conformability!).

(Step 2.) Now, by definition, we have

    f(x_{n×1}) = x^T A x = \sum_{i=1}^n \sum_{k=1}^n x_i a_{ik} x_k,

so that

    ∂f/∂x_j = \sum_{k=1}^n a_{jk} x_k + \sum_{i=1}^n a_{ij} x_i,

which is the inner product of x and the j-th row of A + A^T! Hence we know from Step 1 that Df(x) = (A + A^T) x. Now, observe that the j-th row of D²f is simply the transpose of the gradient of ∂f/∂x_j, or equivalently, it is the transpose of the gradient of the j-th element of Df. Since the j-th element of Df is simply the inner product of x and the j-th row of A + A^T, its gradient, by the preceding exercise, is simply the j-th row of A + A^T (written as a column). Now, since A + A^T is a symmetric matrix, the transpose of the j-th row of A + A^T is exactly the j-th row of A + A^T. Hence we conclude that the j-th row of D²f is the j-th row of A + A^T, or equivalently, D²f = A + A^T.
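The gradient formula of Exercises 9-10 can be checked against a finite-difference approximation. A sketch with NumPy (the matrix and test point are made up for illustration; not part of the original notes):

    import numpy as np

    A = np.array([[1.0, 2.0], [0.0, 3.0]])            # hypothetical (non-symmetric) matrix
    x = np.array([0.5, -1.0])                         # hypothetical test point
    f = lambda z: z @ A @ z
    eps = 1e-6

    # central-difference gradient of f(x) = x'Ax, compared with (A + A')x
    num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(2)])
    print(np.allclose(num_grad, (A + A.T) @ x))       # True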

66. Exercise 11 Let 1 be the n-vector of ones. Let e and V be respectively some given n-vector and some given n × n symmetric positive definite matrix. Solve the following maximization problem: given a constant k > 0,

    max_{x ∈ R^n}  x^T e - (k/2) x^T V x,

subject to the constraint x^T 1 = 1. (Hint: First check whether the objective function f(x) ≡ x^T e - (k/2) x^T V x is concave and whether the constraint function g(x) ≡ x^T 1 - 1 is affine. If your answer is positive, then apply the Lagrange Theorem.)

Solution. The gradient of f is

    Df = e - (k/2)(V + V^T) x,


so that the Hessian of f is

    D²f = -(k/2)(V + V^T),

which is negative semi-definite: for any h_{n×1}, we have

    h^T D²f h = -(k/2)(h^T V h + h^T V^T h)
              = -(k/2)(h^T V h + (h^T V h)^T)
              = -(k/2)(h^T V h + h^T V h) = -k h^T V h ≤ 0,

where the third equality follows from the fact that the transpose of the scalar h^T V h is the scalar itself, and the last inequality follows from the fact that V is positive semi-definite and -k < 0. Thus D²f is negative semi-definite for all x, and we conclude that f is concave in x.

Next, following the hint, we show that g is affine in x. The gradient and Hessian of g are respectively

    Dg = 1,   D²g = 0_{n×n}.

For any h_{n×1}, we have

    h^T D²g h = h^T 0 h = 0 ≥ 0,   and   h^T D²g h = h^T 0 h = 0 ≤ 0,

and hence g is both convex and concave in x, proving that g is affine in x.

Now since f and g are respectively concave and affine in x, we can apply the Lagrange Theorem to obtain the following necessary and sufficient first-order conditions: for some Lagrange multiplier λ,

    Df = λ Dg,    g = 0.

We have

    e - (k/2)(V + V^T) x = λ 1.

Because V = V^T is invertible, we have

    x* = (1/k) V^{-1} (e - λ1).

Using the fact that g(x*) = 0, we have

    1 = 1^T x* = (1/k) 1^T V^{-1} (e - λ1)   ⇒   λ = (1^T V^{-1} e - k) / (1^T V^{-1} 1).

It follows that

    x* = (1/k) V^{-1} ( e - [(1^T V^{-1} e - k) / (1^T V^{-1} 1)] 1 ).

Since at all x, Dg = 1 ≠ 0, the constraint qualification condition holds easily.

67. Exercise 12 Consider the following constrained maximization problem:

    max_{x,y} f(x, y) = x^{1/3} y^{2/3},

subject to

    x, y ≥ 0;   P_x x + P_y y ≤ I,

where P_x, P_y, I > 0 are given constants.

(i) Is f: R²_+ → R a concave function? (Hint: Show that D²f is a negative semi-definite matrix at all (x, y) with x, y > 0.)

(ii) Define g_1(x, y) = -x, g_2(x, y) = -y, and g_3(x, y) = P_x x + P_y y - I. Are g_1, g_2 and g_3 convex functions? (Hint: Show that D²g_3 is a positive semi-definite matrix at all (x, y) with x, y > 0.)

(iii) Does there exist some (x, y) ∈ R² such that g_i(x, y) < 0 for all i = 1, 2, 3? (Hint: Try for example x = I/(3P_x) and y = I/(3P_y).)

(iv) Solve the above maximization problem using the Kuhn-Tucker conditions. Prove that the optimal solution is

    (x*, y*) = ( I/(3P_x), 2I/(3P_y) ).
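One way to check the claimed optimum of Exercise 12 is to solve the program numerically. A sketch using scipy.optimize with made-up values of P_x, P_y, and I (not part of the original notes):

    import numpy as np
    from scipy.optimize import minimize

    Px, Py, I = 2.0, 3.0, 12.0                      # hypothetical prices and income

    objective = lambda v: -(v[0] ** (1 / 3)) * (v[1] ** (2 / 3))   # maximize f by minimizing -f
    budget = {'type': 'ineq', 'fun': lambda v: I - Px * v[0] - Py * v[1]}
    res = minimize(objective, x0=[1.0, 1.0],        # start from an interior feasible point
                   bounds=[(0, None), (0, None)],
                   constraints=[budget], method='SLSQP')

    print(res.x)                                    # close to (I/(3*Px), 2*I/(3*Py)) = (2, 8/3)
    print(I / (3 * Px), 2 * I / (3 * Py))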


68. Exercise 13 Consider the minimization problem

    min_{w ∈ R^N}  w^T V w

subject to w^T 1 = 1, where V is a positive definite symmetric matrix. (Hint: The problem min_x f(x) and the problem max_x -f(x) have the same solution. Show that the objective function f(w) = w^T V w is convex and the constraint function g(w) ≡ w^T 1 - 1 is affine. Then apply the Lagrange Theorem.)

69. Exercise 14 Solve the following maximization problem:

    max_{(x,y) ∈ R²_+}  f(x, y) = log(1 + x) + √y - x/2 - y.

(Hint: Check whether f is concave. Then show that the optimal solution to the following unconstrained problem,

    max_{(x,y) ∈ R²}  f(x, y) = log(1 + x) + √y - x/2 - y,

lies in R²_+.)

Solution. The gradient and the Hessian of f are respectively

    Df = [ 1/(1+x) - 1/2 ]        D²f = [ -1/(1+x)²          0           ]
         [ 1/(2√y) - 1   ],             [     0       -(1/4) y^{-3/2}    ].

For any (a, b)^T ∈ R², we have

    [ a  b ] [ -1/(1+x)²          0          ] [ a ]   =   - a²/(1+x)² - b²/(4 y^{3/2})  ≤  0,
             [     0       -(1/4) y^{-3/2}   ] [ b ]

proving that D²f(x, y) is a negative semi-definite matrix for all (x, y) ∈ R²_+. Thus f(x, y) is concave.


Following the hint, we first consider the unconstrained problem. Since f is concave, for (x*, y*) to attain the maximum of f on R², it is necessary and sufficient that

    Df(x*, y*) = [ 0 ]
                 [ 0 ]   ⇔   x* = 1, y* = 1/4.

Thus if we confine ourselves to R²_+ in searching for the maximum of f, we would still get the optimal solution (x*, y*) = (1, 1/4). (If the highest mountain is located somewhere in Asia, then it must also be the highest mountain in Asia.)

70. Exercise 15 Solve the following maximization problem:

    max_{(x,y) ∈ R²_+}  f(x, y) = log(1 + x) + √y - x/2 - y,

subject to x + y ≤ 1.

(Hint: First check whether f is concave and g ≡ x + y - 1 is convex. In case f is concave and g ≡ x + y - 1 is convex, show that there exists some (x, y) satisfying x + y < 1, x > 0, and y > 0. This will verify the Slater Condition. Now use the Kuhn-Tucker theorem: for some λ ≥ 0, we must have

    λ(x + y - 1) = 0,

and

    λ [ 1 ]   [ 1/(1+x) - 1/2 ]
      [ 1 ] = [ 1/(2√y) - 1   ].

Either λ = 0 or λ > 0. Show that if λ = 0, then x + y > 1, a contradiction. Thus we have λ > 0, which implies that

    x + y - 1 = 0;   1/(1+x) - 1/2 = 1/(2√y) - 1.

From here, one can obtain the optimal (x*, y*).)

Solution. Following the hint, we first verify that g is convex (that f is concave was verified above). It is easy to show that

    D²g = 0_{2×2},


and hence for all z_{2×1} we have z^T 0 z = 0 ≥ 0 and z^T 0 z = 0 ≤ 0, proving that g is convex (indeed affine). It remains to verify that the Slater condition is also satisfied. Apparently, (x, y) = (0.1, 0.1) will serve the purpose. Thus we have a concave program at hand, and we can apply the Kuhn-Tucker theorem to solve this problem. For some λ ≥ 0, we must have

    λ(x + y - 1) = 0,

and

    λ [ 1 ]   [ 1/(1+x) - 1/2 ]
      [ 1 ] = [ 1/(2√y) - 1   ].

Suppose that λ = 0; we shall demonstrate a contradiction. If λ = 0, then we have

    λ [ 1 ]   [ 0 ]   [ 1/(1+x) - 1/2 ]
      [ 1 ] = [ 0 ] = [ 1/(2√y) - 1   ]   ⇒   x = 1, y = 1/4   ⇒   x + y = 5/4 > 1,

a contradiction. Thus λ > 0. It follows from

    λ(x + y - 1) = 0

that x = 1 - y (and we say the constraint g(x, y) ≤ 0 is binding at (x*, y*)). It follows that

    1/(1 + (1 - y*)) - 1/2 = 1/(2√y*) - 1   ⇔   h(y*) ≡ (y*)³ - 9(y*)² + 20 y* - 4 = 0.

Since h(0) < 0 < h(1) and h'(y*) > 0 for all y* ∈ (0, 1), we see that there exists a unique y* ∈ (0, 1) that satisfies h(y*) = 0. From here, the optimal solution is then (x*, y*) = (1 - y*, y*).

71. A (real-valued) random variable x is defined by its distribution function F_x(r), which is the probability of the event that the realization of x is less than or equal to the real number r. Note that F_x(·) is defined on the set of real numbers (which is denoted by R), and its value is between 0 and 1. Note that F_x(r_2) ≥ F_x(r_1) whenever two real numbers satisfy r_2 > r_1, meaning that F_x(·) is a weakly increasing function. Moreover, it is right-continuous, in the sense that

    lim_{r_n > r, r_n → r} F_x(r_n) = F_x(r),  for all r ∈ R.

If F_x'(r) exists for all r ∈ R, we call it the density function of the random variable x. If F_x(·) is a continuous function, then we say that x is a continuous random variable; otherwise, we say that x is a discrete random variable. A discrete random variable may have either a finite number of possible outcomes, or an infinite number of outcomes which can be exhausted by counting (just like the set of natural numbers). A continuous random variable has a set of possible outcomes which cannot be exhausted by counting (just like the set of real numbers).

    72. When a discrete random variable x has only a finite number of (sayn) possible outcomes x1 < x2 < < xn, which may occur withrespectively probabilities p1, p2, , pn, we define the expected valueor mathematical expectation of x by

    E[x] =n

    j=1

    pjxj .

    (Verify that p1 = Fx(x1), and pj = Fx(xj) − Fx(xj−1) for all j = 2, 3, · · · , n.) We also call the expected value of x the first moment of x, which gives an average level of x's random outcome.22 We may need to measure the risk or uncertainty of x, which is defined as the variance of x,

    var(x) = ∑_{j=1}^{n} pj (xj − E[x])².

    The variance of x is also referred to as the second central moment of x.23 The square root of the variance of x is called the standard deviation

    of x.24

    22 Note that E[ax + b] = aE[x] + b for all a, b ∈ R. That is, mathematical expectation is a linear operator.

    23 Verify that var(ax) = a²var(x), for all a ∈ R.

    24 Verify that the standard deviation of ax is equal to |a| times the standard deviation of x, for all a ∈ R.
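    As a numerical companion to the definitions above and to footnotes 22–24, the sketch below uses an invented three-point distribution to compute E[x], var(x), and the standard deviation, and then checks the linearity and scaling properties stated in the footnotes.

```python
# Hypothetical discrete distribution: outcomes and probabilities (illustrative only).
x = [1.0, 2.0, 4.0]
p = [0.2, 0.5, 0.3]

E  = sum(pj * xj for pj, xj in zip(p, x))             # expected value E[x]
V  = sum(pj * (xj - E)**2 for pj, xj in zip(p, x))    # variance var(x)
SD = V ** 0.5                                         # standard deviation
print(E, V, SD)

a, b = 3.0, -1.0

# Footnote 22: E[ax + b] = a E[x] + b  (expectation is a linear operator).
print(sum(pj * (a*xj + b) for pj, xj in zip(p, x)), a*E + b)

# Footnote 23: var(ax) = a^2 var(x).
V_ax = sum(pj * (a*xj - a*E)**2 for pj, xj in zip(p, x))
print(V_ax, a**2 * V)

# Footnote 24: the standard deviation of ax is |a| times that of x.
print(V_ax ** 0.5, abs(a) * SD)
```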


    When we have two discrete random variables x and y, having respectively n and m possible outcomes, we usually measure their statistical relationship by their covariance, which is defined by

    cov(x, y) = ∑_{i=1}^{m} ∑_{j=1}^{n} qij (yi − E[y])(xj − E[x]),

    where qij is the probability for the event that x's outcome is xj and y's outcome is yi.25 Apparently, if the two random variables tend to move in opposite directions, then this measure is negative, and we say that they are negatively correlated. Negative correlation is the statistical notion that underlies the concept of hedging in finance.

    73. What if the above x and y are continuous random variables? We simply

    replace summations by integrals. That is, we have

    E[x] = ∫_{−∞}^{+∞} z dFx(z),

    var(x) = ∫_{−∞}^{+∞} (z − E[x])² dFx(z),

    and

    cov(x, y) = ∫_{s=−∞}^{+∞} ∫_{t=−∞}^{+∞} (s − E[x])(t − E[y]) dFxy(s, t),

    where, given any real numbers s and t, Fxy(s, t) is the probability for the event that x ≤ s and y ≤ t. We call Fxy(·, ·) the joint distribution function of random variables x and y. In most cases Fxy(·, ·) will be continuously differentiable, so that we can write

    dFxy(s, t) = fxy(s, t) ds dt,

    where fxy(·, ·) is the joint density function of random variables x and y.
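    To see the continuous-case formulas at work, the sketch below evaluates E[x], E[y], and cov(x, y) as double integrals for a hypothetical joint density fxy(s, t) = s + t on the unit square; the density is invented for illustration, and scipy is assumed to be available for the numerical integration.

```python
from scipy.integrate import dblquad

# Hypothetical joint density on [0, 1] x [0, 1]: f(s, t) = s + t (it integrates to 1).
def f(s, t):
    return s + t

# dblquad integrates func(t, s), with t as the inner variable and s as the outer one.
lo, hi = lambda s: 0.0, lambda s: 1.0
total = dblquad(lambda t, s: f(s, t), 0, 1, lo, hi)[0]                    # = 1
Ex    = dblquad(lambda t, s: s * f(s, t), 0, 1, lo, hi)[0]                # E[x] = 7/12
Ey    = dblquad(lambda t, s: t * f(s, t), 0, 1, lo, hi)[0]                # E[y] = 7/12
cov   = dblquad(lambda t, s: (s - Ex) * (t - Ey) * f(s, t), 0, 1, lo, hi)[0]

print(total, Ex, Ey, cov)   # cov(x, y) = -1/144: a slight negative dependence
```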

    Example 10 Suppose that x has two possible outcomes, 0 and 30, and that y has two possible outcomes, 20 and −20. The joint probabilities for the 4 possible pairs of outcomes of x and y are summarized in the following table.

    25 Verify that cov(ax + bz, y) = a cov(x, y) + b cov(z, y), for all a, b ∈ R.


    x \ y      −20      20
     0         0.1      0.2
     30        0.3      0.4

    The probabilities of the following 4 events define the joint distribution of x and y:

    prob.(x = 0, y = −20) = 0.1;  prob.(x = 0, y = 20) = 0.2;
    prob.(x = 30, y = −20) = 0.3;  prob.(x = 30, y = 20) = 0.4.

    From the joint distribution of the two random variables, we can derive the marginal distribution of each random variable. More precisely, we immediately have the marginal distribution of x, defined by the probabilities for the following two events:

    prob.(x = 0) = prob.(x = 0, y = −20) + prob.(x = 0, y = 20) = 0.1 + 0.2 = 0.3;
    prob.(x = 30) = prob.(x = 30, y = −20) + prob.(x = 30, y = 20) = 0.3 + 0.4 = 0.7.

    We can similarly obtain the marginal distribution of y, defined by the probabilities for the following two events:

    prob.(y = −20) = prob.(x = 0, y = −20) + prob.(x = 30, y = −20) = 0.1 + 0.3 = 0.4;
    prob.(y = 20) = prob.(x = 0, y = 20) + prob.(x = 30, y = 20) = 0.2 + 0.4 = 0.6.

    Now we can obtain the expected value and variance for each random variable, using that random variable's marginal distribution. Verify that

    E[x] = 0 × 0.3 + 30 × 0.7 = 21,   E[y] = (−20) × 0.4 + 20 × 0.6 = 4,

    var(x) = (0 − 21)² × 0.3 + (30 − 21)² × 0.7 = 189,
    var(y) = (−20 − 4)² × 0.4 + (20 − 4)² × 0.6 = 384.

    Finally, we can use the joint distribution of the two random variables to compute their covariance:

    cov(x, y) = (0 − 21) × (−20 − 4) × 0.1 + (0 − 21) × (20 − 4) × 0.2


    + (30 − 21) × (−20 − 4) × 0.3 + (30 − 21) × (20 − 4) × 0.4 = −24.

    We shall need the concepts of conditional distribution and conditional expectation of a random variable. The former is the probability distribution of one random variable conditional on knowing the other random variable's outcome. For example, the conditional distribution of x, after we know that the outcome of y is −20, is defined by the probabilities of the following two events:

    prob.(x = 0 | y = −20) = prob.(x = 0, y = −20) / prob.(y = −20)
        = prob.(x = 0, y = −20) / [prob.(x = 0, y = −20) + prob.(x = 30, y = −20)] = 0.1 / (0.1 + 0.3) = 1/4;

    prob.(x = 30 | y = −20) = prob.(x = 30, y = −20) / prob.(y = −20)
        = prob.(x = 30, y = −20) / [prob.(x = 0, y = −20) + prob.(x = 30, y = −20)] = 0.3 / (0.1 + 0.3) = 3/4.

    Now, after we know that the outcome of y is −20, with the conditional distribution of x, we can compute the conditional expectation for x:

    E[x | y = −20] = 0 × (1/4) + 30 × (3/4) = 90/4.

    Similarly, we can find the conditional distribution of x after we know that the outcome of y is 20. Again, this is defined by the probabilities of two events:

    prob.(x = 0 | y = 20) = prob.(x = 0, y = 20) / prob.(y = 20)
        = prob.(x = 0, y = 20) / [prob.(x = 0, y = 20) + prob.(x = 30, y = 20)] = 0.2 / (0.2 + 0.4) = 1/3;

    prob.(x = 30 | y = 20) = prob.(x = 30, y = 20) / prob.(y = 20)
        = prob.(x = 30, y = 20) / [prob.(x = 0, y = 20) + prob.(x = 30, y = 20)] = 0.4 / (0.2 + 0.4) = 2/3.


    Hence we have

    E[x | y = 20] = 0 × (1/3) + 30 × (2/3) = 20.

    These computations tell us that, if a high outcome of x would make us better off, then a lower realization of y is good news.

    The following law of iterated expectations (LIE) says that the expected value of the conditional expectations of x is equal to the original expected value of x. In the above example, we know from the marginal distribution of y that y may equal −20 with probability 0.4 and equal 20 with probability 0.6, and the conditional expected values of x in those two cases are respectively 90/4 and 20. Thus the expected value of these conditional expectations is

    0.4 × (90/4) + 0.6 × 20 = 9 + 12 = 21 = E[x]!

    In other words, we can write

    E[E[x|y]] = E[x],

    where E[x|y] is the random variable that equals 90/4 when y = −20 and equals 20 when y = 20. Note that E[x|y] is the expected value of x after seeing the realization of y, and hence it naturally varies with the realization of y.
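    Every number in Example 10, including the law of iterated expectations, can be checked mechanically from the joint table; here is a minimal sketch doing exactly that, using only the probabilities given above.

```python
# Joint probabilities from Example 10, keyed by (x outcome, y outcome).
q = {(0, -20): 0.1, (0, 20): 0.2, (30, -20): 0.3, (30, 20): 0.4}
xs, ys = [0, 30], [-20, 20]

# Marginal distributions.
px = {x: sum(q[(x, y)] for y in ys) for x in xs}    # {0: 0.3, 30: 0.7}
py = {y: sum(q[(x, y)] for x in xs) for y in ys}    # {-20: 0.4, 20: 0.6}

Ex = sum(p * x for x, p in px.items())                                   # 21
Ey = sum(p * y for y, p in py.items())                                   # 4
Vx = sum(p * (x - Ex)**2 for x, p in px.items())                         # 189
Vy = sum(p * (y - Ey)**2 for y, p in py.items())                         # 384
cov = sum(q[(x, y)] * (x - Ex) * (y - Ey) for x in xs for y in ys)       # -24
print(Ex, Ey, Vx, Vy, cov)

# Conditional expectations E[x | y] and the law of iterated expectations.
E_x_given_y = {y: sum(q[(x, y)] / py[y] * x for x in xs) for y in ys}
print(E_x_given_y)                                  # {-20: 22.5, 20: 20.0}
print(sum(py[y] * E_x_given_y[y] for y in ys), Ex)  # both equal 21
```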

    74. The following measure is called the coefficient of correlation for x and y, which always lies between −1 and 1:

    ρx,y ≡ cov(x, y) / ( √var(x) · √var(y) ).

    We say that the two random variables are perfectly correlated if their coefficient of correlation is equal to 1 or −1.

    Show that the coefficient of correlation between x and y in the above example is −24 / √(384 × 189).
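    Using the moments computed in Example 10 (cov(x, y) = −24, var(x) = 189, var(y) = 384), the requested coefficient is simply the covariance divided by the product of the two standard deviations; numerically it is about −0.089, a weak negative correlation.

```python
# Coefficient of correlation for Example 10.
rho = -24 / (189 ** 0.5 * 384 ** 0.5)
print(rho)   # approximately -0.0891
```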


    75. Example 11 There are several classes of important continuous random variables, and we give two of them here. The first class is the normal or Gaussian random variables. A random variable x is Gaussian or normal if and only if it has the following density function

    f(x) = (1/(√(2π) σ)) exp( −(x − μ)² / (2σ²) ),  x ∈ R,

    where μ = E[x] and σ² = var(x). One can verify that μ and σ² are indeed the expected value and variance of the normal random variable x by integration:

    ∫_{−∞}^{+∞} x f(x) dx = μ,   ∫_{−∞}^{+∞} (x − μ)² f(x) dx = σ².

    The second class is the uniformly distributed random variables. A random variable x is uniformly distributed if and only if it has a density function which is a constant function on an interval [a, b]:

    f(x) = 1/(b − a),  x ∈ (a, b).

    We can obtain the expected value and variance of x by integration:

    ∫_{−∞}^{+∞} x f(x) dx = (a + b)/2,   ∫_{−∞}^{+∞} (x − (a + b)/2)² f(x) dx = (b − a)²/12.
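    These integrals can also be confirmed numerically. The sketch below uses scipy's quad with arbitrarily chosen parameters (μ = 1.5, σ = 2 for the normal; a = 2, b = 5 for the uniform) and recovers the stated means and variances.

```python
from math import sqrt, pi, exp
from scipy.integrate import quad

# Normal density with illustrative parameters mu = 1.5, sigma = 2.
mu, sigma = 1.5, 2.0
f_norm = lambda x: exp(-(x - mu)**2 / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)
mean_n = quad(lambda x: x * f_norm(x), float('-inf'), float('inf'))[0]
var_n  = quad(lambda x: (x - mu)**2 * f_norm(x), float('-inf'), float('inf'))[0]
print(mean_n, var_n)            # approximately 1.5 and 4.0

# Uniform density on [a, b] with illustrative endpoints a = 2, b = 5.
a, b = 2.0, 5.0
f_unif = lambda x: 1.0 / (b - a)
mean_u = quad(lambda x: x * f_unif(x), a, b)[0]
var_u  = quad(lambda x: (x - (a + b) / 2)**2 * f_unif(x), a, b)[0]
print(mean_u, (a + b) / 2)      # both 3.5
print(var_u, (b - a)**2 / 12)   # both 0.75
```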

    76. If Am×n is a two-dimensional array of mn random variables (i.e., each element aij is a random variable), then A is called a random matrix and E(A) is a non-random matrix with the same size as A, with its (i, j)-th element being

    E(aij), i = 1, 2, · · · , m; j = 1, 2, · · · , n.

    We shall call E(A) the expected value of the random matrix A.

    If n = 1, Am×n = rm×1 is called a random vector. The variance-covariance matrix of the random vector r is defined by the m × m non-random matrix V, with vij = cov(ri, rj). Since for any two random variables x and y, cov(x, y) = cov(y, x) (verify!), the matrix V is symmetric. Moreover, since for any random variable x, cov(x, x) = var(x) (verify!), the major diagonal elements of V are, respectively, the variances of the random variables in the random vector r!


    77. Theorem 5 The following statements are true.
    (i) Denote E[r] by e. Then V = E[(r − e)(r − e)′].
    (ii) V is positive semi-definite. V is positive definite if and only if there exists no non-zero m-vector x such that the variance of x′r equals zero. In case m = 2, such a non-zero x exists if and only if r1 and r2 are perfectly correlated.
    (iii) Given any m-vectors w, w1 and w2, the expected value of w′r is w′e. The variance of w′r is w′Vw. The covariance of w1′r and w2′r is w1′Vw2.

    Proof. Consider part (i). Note that the (k, j)-th element of the matrix

    E[(r − e)(r − e)′] is E[(rk − E[rk])(rj − E[rj])],

    which is exactly the definition of cov(rk, rj).

    Next, consider part (ii). We have mentioned that V = V′. Pick any xm×1, and observe that (I use ′ and T interchangeably)

    x′Vx = x′E[(r − e)(r − e)′]x = E[x′(r − e)(r − e)′x]
        = E[(x′r − x′e)(r′x − e′x)] = E[(x′r − E[x′r])²] = var[x′r] ≥ 0.

    Thus by definition, V is positive semi-definite. Note also that an equivalent condition for V not to be positive definite is the existence of xm×1 ≠ 0 such that

    var[x′r] = x′Vx = 0.

    Now, if m = 2, we have

    V2×2 = [ var[r1]       cov(r1, r2)
             cov(r1, r2)   var[r2]    ],

    and if V is singular, then its determinant must equal zero, and so

    cov(r1, r2)² / ( var[r1] var[r2] ) = 1,


    and so the coefficient of correlation for (r1, r2) equals either 1 or −1.

    Finally, consider part (iii). By definition we have

    E[w′r] = E[ ∑_{j=1}^{m} wj rj ] = ∑_{j=1}^{m} wj E[rj] = ∑_{j=1}^{m} wj ej = w′e.

    Similarly, mimicking the proof for part (ii), one can easily show that

    var[w′r] = w′Vw.

    Finally, note that

    w1′Vw2 = E[(w1′r − w1′e)(r′w2 − e′w2)]
           = E[(w1′r − w1′e)(w2′r − w2′e)] = cov(w1′r, w2′r).

    This finishes the proof.
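    As a sanity check on Theorem 5, the sketch below builds the covariance matrix V of a 2-dimensional random vector from an invented discrete joint distribution (the probabilities and outcomes are purely illustrative) and verifies that V = E[(r − e)(r − e)′], that V is symmetric and positive semi-definite, and that w′Vw equals var(w′r) for a chosen weight vector w.

```python
import numpy as np

# Invented joint distribution of r = (r1, r2): each row is (probability, r1, r2).
table = [(0.1, 1.0, -2.0), (0.4, 2.0, 0.0), (0.3, 3.0, 1.0), (0.2, 5.0, 4.0)]
probs    = np.array([row[0] for row in table])
outcomes = np.array([row[1:] for row in table])       # shape (4, 2)

e   = probs @ outcomes                                # mean vector e = E[r]
dev = outcomes - e                                    # r - e, realization by realization
V   = (probs[:, None] * dev).T @ dev                  # E[(r - e)(r - e)'] as a weighted sum
print(e)
print(V)

# Part (ii): V is symmetric and positive semi-definite.
print(np.allclose(V, V.T), np.all(np.linalg.eigvalsh(V) >= -1e-12))

# Part (iii): for a weight vector w, w'Vw equals the variance of the scalar w'r.
w   = np.array([0.3, 0.7])
wr  = outcomes @ w                                    # realizations of w'r
var_wr = probs @ (wr - probs @ wr)**2
print(w @ V @ w, var_wr)                              # the two coincide
```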

    78. Given a random vector xn×1, let its mean vector be

    en×1 = ( E[x1], E[x2], · · · , E[xn] )′

    and its covariance matrix be

    Σn×n = [ cov(x1, x1)   cov(x1, x2)   · · ·   cov(x1, xn)
             cov(x2, x1)   cov(x2, x2)   · · ·   cov(x2, xn)
                 ...            ...       ...        ...
             cov(xn, x1)   cov(xn, x2)   · · ·   cov(xn, xn) ].

    Assume that Σ is positive definite. We say that the n random variables in x are multivariate normal if the joint density function of (x1, x2, · · · , xn) is

    f(x) = (2π)^(−n/2) |Σ|^(−1/2) exp( −(1/2)(x − e)′ Σ⁻¹ (x − e) ),   x ∈ Rⁿ.


    In this case, we simply write26

    x ∼ N(e, Σ).

    If x is multivariate normal and n = 2, we say that x is bivariate normal. For bivariate normal random variables (x, y) ∼ N(μx, μy, σx², σy², ρ), where ρ is the coefficient of correlation of x and y, it can be shown that the conditional distribution of x upon knowing the realization of y is

    x | [y = y0] ∼ N( E[x | y = y0], var(x | y = y0) ),

    where

    E[x | y = y0] = μx + ( cov(x, y) / σy² ) (y0 − μy),

    and

    var(x | y = y0) = σx² (1 − ρ²).

    More generally, given a vector z(m+n)×1 that stacks xn×1 on top of ym×1 and is multivariate normal, with

    [ x ; y ] ∼ N( [ a ; b ] , [ V  C ; C′  U ] ),

    where a is n×1, b is m×1, V is n×n, C is n×m, and U is m×m,

    the conditional expectation of x given y is

    E[x|y] = a + CU⁻¹(y − b),

    and the conditional covariance matrix of x given y is

    var(x|y) = V − CU⁻¹C′.

    26 Some related results are as follows. Given any non-random matrix Am×n, the random vector Ax is multivariate normal if x is multivariate normal. In fact, an equivalent condition for x to be multivariate normal is that b′x is a normal random variable for all bn×1 ∈ Rⁿ. Note that if x is multivariate normal, then each random variable xi is a normal random variable as defined in the preceding section. Conversely, suppose that for all i ∈ {1, 2, · · · , n}, xi is a normal random variable. Then, x is multivariate normal if the n random variables (x1, x2, · · · , xn) are totally independent.
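    The conditional-moment formulas above can be illustrated by simulation. The sketch below uses an arbitrarily chosen bivariate normal (all parameter values are invented) and compares the closed-form E[x | y = y0] and var(x | y = y0) with the sample mean and variance of x over draws whose y falls in a narrow window around y0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative bivariate normal parameters.
mu_x, mu_y = 1.0, -2.0
sig_x, sig_y, rho = 2.0, 3.0, 0.6
cov_xy = rho * sig_x * sig_y
Sigma = np.array([[sig_x**2, cov_xy],
                  [cov_xy,   sig_y**2]])

# Closed-form conditional moments at y0, using the formulas in the text.
y0 = 1.0
cond_mean = mu_x + cov_xy / sig_y**2 * (y0 - mu_y)     # = 2.2 here
cond_var  = sig_x**2 * (1 - rho**2)                    # = 2.56 here

# Monte Carlo check: keep the draws whose y lands near y0 and inspect their x.
draws = rng.multivariate_normal([mu_x, mu_y], Sigma, size=2_000_000)
near  = draws[np.abs(draws[:, 1] - y0) < 0.05]
print(cond_mean, near[:, 0].mean())    # both approximately 2.2
print(cond_var,  near[:, 0].var())     # both approximately 2.56
```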
