Linear Algebra and Optimization for Machine Learning
1st Edition by Charu Aggarwal. Chapters 1 – 11
Contents
1 Linear Algebra and Optimization: An Introduction
2 Linear Transformations and Linear Systems
3 Diagonalizable Matrices and Eigenvectors
4 Optimization Basics: A Machine Learning View
5 Optimization Challenges and Advanced Solutions
6 Lagrangian Relaxation and Duality
7 Singular Value Decomposition
8 Matrix Factorization
9 The Linear Algebra of Similarity
10 The Linear Algebra of Graphs
11 Optimization in Computational Graphs
Chapter 1
Linear Algebra and Optimization: An Introduction
1. For any two vectors $x$ and $y$, which are each of length $a$, show that
(i) $x - y$ is orthogonal to $x + y$, and (ii) the dot product of $x - 3y$
and $x + 3y$ is negative.
(i) The first dot product is simply $x \cdot x - y \cdot y$ using the
distributive property of matrix multiplication; the cross terms $x \cdot y$
and $-y \cdot x$ cancel. The dot product of a vector with itself is its
squared length. Since both vectors are of the same length, it follows that
the result is 0. (ii) In the second case, one can use a similar argument to
show that the result is $a^2 - 9a^2 = -8a^2$, which is negative for $a > 0$.
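As a quick numerical sanity check of both parts, the following minimal sketch uses plain Python lists. The vectors `x` and `y` and the helpers `dot` and `add` are illustrative assumptions, not part of the exercise; the two vectors were chosen only because both have length $a = 5$.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def add(u, v, scale=1):
    """Return u + scale * v, computed componentwise."""
    return [ui + scale * vi for ui, vi in zip(u, v)]


# Hypothetical example vectors, both of Euclidean length a = 5.
x = [3, 4, 0]
y = [0, -5, 0]

a_squared = dot(x, x)  # squared length of x
assert dot(y, y) == a_squared  # y has the same length as x

# Part (i): (x - y) . (x + y) should be 0.
part_i = dot(add(x, y, -1), add(x, y, 1))

# Part (ii): (x - 3y) . (x + 3y) should equal a^2 - 9a^2 = -8a^2.
part_ii = dot(add(x, y, -3), add(x, y, 3))

print(part_i)   # 0
print(part_ii)  # -200
```

Integer entries keep the arithmetic exact, so the comparison with $-8a^2$ needs no floating-point tolerance.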
2. Consider a situation in which you have three matrices $A$, $B$, and $C$, of
sizes $10 \times 2$, $2 \times 10$, and $10 \times 10$, respectively.
(a) Suppose you had to compute the matrix product $ABC$. From an efficiency
perspective, would it computationally make more sense to compute $(AB)C$ or
would it make more sense to compute $A(BC)$?
(b) If you had to compute the matrix product $CAB$, would it make more sense
to compute $(CA)B$ or $C(AB)$?
The main point is to keep the size of the intermediate matrix as small as
possible in order to reduce both computational and space requirements. In
the case of $ABC$, it makes sense to compute $BC$ first, because $BC$ is only
$2 \times 10$ whereas $AB$ is $10 \times 10$. In the case of $CAB$, it makes
sense to compute $CA$ first, because $CA$ is only $10 \times 2$. This type of
associativity property is used frequently in machine learning in order to
reduce computational requirements.
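The difference between the orderings can be made concrete by counting scalar multiplications. The sketch below assumes the standard schoolbook cost model, in which multiplying an $m \times n$ matrix by an $n \times p$ matrix takes $m \cdot n \cdot p$ scalar multiplications. The helper `product` and the `(cost, shape)` representation are hypothetical conveniences for this illustration.

```python
def product(left, right):
    """Combine two operands given as (cost_so_far, (rows, cols)).

    Returns the accumulated multiplication count and the shape of the
    resulting matrix under the schoolbook m * n * p cost model.
    """
    (cost_l, (m, n)), (cost_r, (n2, p)) = left, right
    assert n == n2, "inner dimensions must agree"
    return (cost_l + cost_r + m * n * p, (m, p))


# Input matrices cost nothing to "compute"; only their shapes matter.
A = (0, (10, 2))
B = (0, (2, 10))
C = (0, (10, 10))

cost_AB_C = product(product(A, B), C)[0]  # 10x10 intermediate
cost_A_BC = product(A, product(B, C))[0]  # 2x10 intermediate
cost_CA_B = product(product(C, A), B)[0]  # 10x2 intermediate
cost_C_AB = product(C, product(A, B))[0]  # 10x10 intermediate

print(cost_AB_C, cost_A_BC)  # 1200 400
print(cost_CA_B, cost_C_AB)  # 400 1200
```

Under this cost model, the orderings with the smaller intermediate matrix need a third of the multiplications of the alternatives.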
3. Show that if a matrix $A$ satisfies $A = -A^T$, then all the diagonal
elements of the matrix are 0.
Note that $A + A^T = 0$. However, this matrix also contains twice the diagonal
elements of $A$ on its diagonal. Therefore, the diagonal elements of $A$ must
be 0.
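The argument can be checked on a small example. This sketch builds a matrix satisfying $A = -A^T$ as $M - M^T$ from an arbitrary square matrix `M`, then inspects the diagonal. The matrix `M` and the helpers `transpose` and `skew_part` are assumptions made for illustration.

```python
def transpose(M):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*M)]


def skew_part(M):
    """Return M - M^T, which always satisfies A = -A^T."""
    Mt = transpose(M)
    n = len(M)
    return [[M[i][j] - Mt[i][j] for j in range(n)] for i in range(n)]


# Arbitrary square matrix used only to generate an example.
M = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
A = skew_part(M)
At = transpose(A)
n = len(A)

# Confirm the hypothesis A = -A^T, then read off the diagonal.
is_skew = all(A[i][j] == -At[i][j] for i in range(n) for j in range(n))
diagonal = [A[i][i] for i in range(n)]

print(is_skew)   # True
print(diagonal)  # [0, 0, 0]
```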
4. Show that if we have a matrix satisfying $A = -A^T$, then for any column
vector $x$, we have $x^T A x = 0$.
Note that the transpose of the scalar $x^T A x$ remains unchanged. Therefore,
we have
$x^T A x = (x^T A x)^T = x^T A^T x = -x^T A x$. Therefore, we have
$2 x^T A x = 0$.
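A small numerical illustration of this identity follows. The matrix `A` is a hypothetical example chosen to satisfy $A = -A^T$, and the test vectors and the helper `quadratic_form` are likewise assumptions.

```python
def quadratic_form(A, x):
    """Compute the scalar x^T A x for a square matrix A (list of rows)."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


# Example matrix with A = -A^T (zero diagonal, mirrored entries negated).
A = [[0, -2, -4],
     [2, 0, -2],
     [4, 2, 0]]

# A few arbitrary integer vectors; each term x_i A_ij x_j is cancelled
# by its mirror term x_j A_ji x_i, so every value should be 0.
vectors = [[1, 0, 0], [1, 2, 3], [-5, 7, 11]]
values = [quadratic_form(A, x) for x in vectors]

print(values)  # [0, 0, 0]
```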