1 Linear Algebra and Optimization: An Introduction
2 Linear Transformations and Linear Systems
3 Diagonalizable Matrices and Eigenvectors
4 Optimization Basics: A Machine Learning View
5 Optimization Challenges and Advanced Solutions
6 Lagrangian Relaxation and Duality
7 Singular Value Decomposition
8 Matrix Factorization
9 The Linear Algebra of Similarity
10 The Linear Algebra of Graphs
11 Optimization in Computational Graphs
Chapter 1
Linear Algebra and Optimization: An Introduction
1. For any two vectors x and y, which are each of length a, show that (i) x − y is orthogonal to x + y, and (ii) the dot product of x − 3y and x + 3y is negative.
(i) The dot product (x − y) · (x + y) simplifies to x · x − y · y using the distributive property of the dot product. The dot product of a vector with itself is its squared length. Since both vectors are of the same length a, it follows that the result is a^2 − a^2 = 0. (ii) In the second case, one can use a similar argument to show that the result is a^2 − 9a^2 = −8a^2, which is negative for any nonzero length a.
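As a quick numerical check, here is a minimal NumPy sketch (the specific vectors are our choice, constructed to have equal length a = 5):

    import numpy as np

    x = np.array([3.0, 4.0])         # length 5
    y = np.array([5.0, 0.0])         # also length 5
    print(np.dot(x - y, x + y))      # 0.0: x - y is orthogonal to x + y
    print(np.dot(x - 3*y, x + 3*y))  # -200.0 = a^2 - 9a^2 for a = 5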
2. Consider a situation in which you have three matrices A, B, and C, of sizes 10 × 2, 2 × 10, and 10 × 10, respectively.
(a) Suppose you had to compute the matrix product ABC. From an efficiency perspective, would it computationally make more sense to compute (AB)C or would it make more sense to compute A(BC)?

(b) If you had to compute the matrix product CAB, would it make more sense to compute (CA)B or C(AB)?
The main point is to keep the size of the intermediate matrix as small as possible in order to reduce both computational and space requirements. In the case of ABC, it makes sense to compute BC first: the intermediate BC is only 2 × 10, and the total work is about 400 scalar multiplications, versus about 1200 if the 10 × 10 intermediate AB is formed first. In the case of CAB, it makes sense to compute CA first, for the analogous reason. This type of associativity property is used frequently in machine learning in order to reduce computational requirements.
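A minimal Python sketch (the helper multiply_cost is our own naming) makes the operation counts explicit:

    def multiply_cost(p, q, r):
        # Multiplying a (p x q) matrix by a (q x r) matrix takes p*q*r scalar multiplications.
        return p * q * r

    # A: 10 x 2, B: 2 x 10, C: 10 x 10
    print(multiply_cost(10, 2, 10) + multiply_cost(10, 10, 10))  # (AB)C: 200 + 1000 = 1200
    print(multiply_cost(2, 10, 10) + multiply_cost(10, 2, 10))   # A(BC): 200 + 200  = 400
    print(multiply_cost(10, 10, 2) + multiply_cost(10, 2, 10))   # (CA)B: 200 + 200  = 400
    print(multiply_cost(10, 2, 10) + multiply_cost(10, 10, 10))  # C(AB): 200 + 1000 = 1200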
3. Show that if a matrix A satisfies A = −A^T, then all the diagonal elements of the matrix are 0.
Note that A + A^T = 0. However, this matrix also contains twice the diagonal elements of A on its diagonal. Therefore, the diagonal elements of A must be 0.
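As a quick numerical check, here is a minimal NumPy sketch (the construction A = M − M^T is our choice of a random skew-symmetric matrix):

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.standard_normal((4, 4))
    A = M - M.T        # A satisfies A = -A^T by construction
    print(np.diag(A))  # [0. 0. 0. 0.]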
4. Show that if we have a matrix satisfying A = −A^T, then for any column vector x, we have x^T A x = 0.
Note that the transpose of the scalar x^T A x remains unchanged. Therefore, we have x^T A x = (x^T A x)^T = x^T A^T x = −x^T A x. Therefore, we have 2 x^T A x = 0, and hence x^T A x = 0.
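Again, a minimal NumPy sketch (reusing the M − M^T construction with a random vector x of our choice) confirms this numerically:

    import numpy as np

    rng = np.random.default_rng(1)
    M = rng.standard_normal((4, 4))
    A = M - M.T       # skew-symmetric: A = -A^T
    x = rng.standard_normal(4)
    print(x @ A @ x)  # 0 up to floating-point roundoff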