Linear Algebra and Optimization for Machine Learning
1st Edition by Charu Aggarwal. Chapters 1–11
Contents

1 Linear Algebra and Optimization: An Introduction
2 Linear Transformations and Linear Systems
3 Diagonalizable Matrices and Eigenvectors
4 Optimization Basics: A Machine Learning View
5 Optimization Challenges and Advanced Solutions
6 Lagrangian Relaxation and Duality
7 Singular Value Decomposition
8 Matrix Factorization
9 The Linear Algebra of Similarity
10 The Linear Algebra of Graphs
11 Optimization in Computational Graphs
Chapter 1
Linear Algebra and Optimization: An Introduction
1. For any two vectors x and y, which are each of length a, show that (i) x − y is orthogonal to x + y, and (ii) the dot product of x − 3y and x + 3y is negative.

(i) The first dot product is simply (x − y) · (x + y) = x · x − y · y, using the distributive property of matrix multiplication. The dot product of a vector with itself is its squared length. Since both vectors are of the same length, it follows that the result is 0. (ii) In the second case, one can use a similar argument to show that the result is a^2 − 9a^2, which is negative.
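Both parts can be sanity-checked numerically. The following is a minimal sketch assuming NumPy; the particular vectors are arbitrary illustrative choices (not from the text), rescaled so that both have length a = 2.

    import numpy as np

    # Two arbitrary vectors rescaled to share the same length a = 2.
    a = 2.0
    x = np.array([3.0, 1.0, 2.0]); x = a * x / np.linalg.norm(x)
    y = np.array([-1.0, 4.0, 0.5]); y = a * y / np.linalg.norm(y)

    # (i) (x - y) . (x + y) = x.x - y.y = a^2 - a^2 = 0.
    print(np.dot(x - y, x + y))                  # ~0, up to floating-point error

    # (ii) (x - 3y) . (x + 3y) = a^2 - 9*a^2 = -8*a^2 < 0.
    print(np.dot(x - 3*y, x + 3*y), -8 * a**2)   # both evaluate to -32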
2. Consider a situation in which you have three matrices A, B, and C, of sizes 10 × 2, 2 × 10, and 10 × 10, respectively.

(a) Suppose you had to compute the matrix product ABC. From an efficiency perspective, would it computationally make more sense to compute (AB)C or would it make more sense to compute A(BC)?

(b) If you had to compute the matrix product CAB, would it make more sense to compute (CA)B or C(AB)?
The main point is to keep the size of the intermediate matrix as small as possible in order to reduce both computational and space requirements. In the case of ABC, it makes sense to compute BC first. In the case of CAB, it makes sense to compute CA first. This type of associativity property is used frequently in machine learning in order to reduce computational requirements.
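The operation counts can be tallied directly. The sketch below is a minimal illustration (plain Python, no external libraries); it assumes the standard cost model in which multiplying a p × q matrix by a q × r matrix takes p·q·r scalar multiplications.

    def cost(p, q, r):
        # Multiplying a (p x q) matrix by a (q x r) matrix takes p*q*r scalar multiplications.
        return p * q * r

    # A: 10 x 2, B: 2 x 10, C: 10 x 10
    # Product ABC:
    print(cost(10, 2, 10) + cost(10, 10, 10))  # (AB)C: 200 + 1000 = 1200, intermediate AB is 10 x 10
    print(cost(2, 10, 10) + cost(10, 2, 10))   # A(BC): 200 + 200  = 400,  intermediate BC is 2 x 10

    # Product CAB:
    print(cost(10, 10, 2) + cost(10, 2, 10))   # (CA)B: 200 + 200  = 400,  intermediate CA is 10 x 2
    print(cost(10, 2, 10) + cost(10, 10, 10))  # C(AB): 200 + 1000 = 1200, intermediate AB is 10 x 10

The cheaper ordering in each case is exactly the one that keeps the intermediate matrix small (2 × 10 or 10 × 2 rather than 10 × 10).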
3. Show that if a matrix A satisfies A = −A^T, then all the diagonal elements of the matrix are 0.

Note that A + A^T = 0. However, this matrix also contains twice the diagonal elements of A on its diagonal. Therefore, the diagonal elements of A must be 0.
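The argument can be illustrated numerically by building a matrix with A = −A^T from an arbitrary matrix. This is a minimal sketch assuming NumPy; the starting matrix B and its size are arbitrary choices, not from the text.

    import numpy as np

    # Any matrix of the form B - B^T satisfies A = -A^T.
    rng = np.random.default_rng(0)
    B = rng.standard_normal((4, 4))
    A = B - B.T

    print(np.allclose(A, -A.T))   # True: A equals the negative of its transpose
    print(np.diag(A))             # all zeros, since each a_ii must satisfy a_ii = -a_ii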
4. Show that if we have a matrix satisfying A = −A^T, then for any column vector x, we have x^T A x = 0.

Note that the transpose of the scalar x^T A x remains unchanged. Therefore, we have x^T A x = (x^T A x)^T = x^T A^T x = −x^T A x. Therefore, we have 2 x^T A x = 0, and so x^T A x = 0.
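A numerical illustration of this identity follows. It is a minimal sketch assuming NumPy; the matrix A and the vector x are arbitrary choices, not from the text.

    import numpy as np

    rng = np.random.default_rng(1)
    B = rng.standard_normal((5, 5))
    A = B - B.T                    # satisfies A = -A^T
    x = rng.standard_normal(5)

    # x^T A x equals its own negative, so it must be 0 (up to floating-point error).
    print(x @ A @ x)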