Linear Algebra and Optimization for Machine Learning
1st Edition by Charu Aggarwal. Chapters 1 – 11
Contents

1 Linear Algebra and Optimization: An Introduction
2 Linear Transformations and Linear Systems
3 Diagonalizable Matrices and Eigenvectors
4 Optimization Basics: A Machine Learning View
5 Optimization Challenges and Advanced Solutions
6 Lagrangian Relaxation and Duality
7 Singular Value Decomposition
8 Matrix Factorization
9 The Linear Algebra of Similarity
10 The Linear Algebra of Graphs
11 Optimization in Computational Graphs
Chapter 1
Linear Algebra and Optimization: An Introduction
1. For any two vectors x and y, which are each of length a, show that (i) x − y is orthogonal to x + y, and (ii) the dot product of x − 3y and x + 3y is negative.

(i) The first dot product expands to x · x − y · y using the distributive property of matrix multiplication. The dot product of a vector with itself is its squared length. Since both vectors are of the same length a, the result is a^2 − a^2 = 0. (ii) In the second case, one can use a similar argument to show that the result is a^2 − 9a^2 = −8a^2, which is negative for any nonzero length a.
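As a quick numerical sanity check (not part of the original argument), the following NumPy sketch rescales two random vectors to a common length a and evaluates both dot products; the dimension, seed, and value of a are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
a = 3.0  # the common length of x and y (assumed nonzero)

# Draw two random vectors and rescale each to have norm exactly a.
x = rng.standard_normal(5)
y = rng.standard_normal(5)
x *= a / np.linalg.norm(x)
y *= a / np.linalg.norm(y)

# (i) (x - y) . (x + y) = |x|^2 - |y|^2 = a^2 - a^2 = 0
print(np.dot(x - y, x + y))          # ~0 up to floating-point error

# (ii) (x - 3y) . (x + 3y) = a^2 - 9a^2 = -8a^2 < 0
print(np.dot(x - 3 * y, x + 3 * y))  # ~ -8 * a**2 = -72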
2. Consider a situation in which you have three matrices A, B, and C, of sizes 10 × 2, 2 × 10, and 10 × 10, respectively.

(a) Suppose you had to compute the matrix product ABC. From an efficiency perspective, would it computationally make more sense to compute (AB)C or would it make more sense to compute A(BC)?

(b) If you had to compute the matrix product CAB, would it make more sense to compute (CA)B or C(AB)?
The main point is to keep the size of the intermediate matrix as small as possible in order to reduce both computational and space requirements. In the case of ABC, it makes sense to compute the 2 × 10 matrix BC first, rather than the 10 × 10 matrix AB. In the case of CAB, it makes sense to compute the 10 × 2 matrix CA first, rather than the 10 × 10 matrix AB. This type of associativity property is used frequently in machine learning in order to reduce computational requirements; the multiplication counts for each parenthesization are sketched below.
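As an illustration (not part of the original solution), the following short Python sketch counts scalar multiplications for each parenthesization, using the standard fact that multiplying an m × k matrix by a k × n matrix takes m·k·n scalar multiplications with the textbook algorithm.

# Cost (number of scalar multiplications) of multiplying an (m x k) matrix
# by a (k x n) matrix using the standard algorithm.
def cost(m, k, n):
    return m * k * n

# Sizes: A is 10 x 2, B is 2 x 10, C is 10 x 10.

# (a) ABC
print("(AB)C:", cost(10, 2, 10) + cost(10, 10, 10))  # 200 + 1000 = 1200
print("A(BC):", cost(2, 10, 10) + cost(10, 2, 10))   # 200 + 200  = 400

# (b) CAB
print("(CA)B:", cost(10, 10, 2) + cost(10, 2, 10))   # 200 + 200  = 400
print("C(AB):", cost(10, 2, 10) + cost(10, 10, 10))  # 200 + 1000 = 1200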
3. Show that if a matrix A satisfies A = −A^T, then all the diagonal elements of the matrix are 0.

Note that A + A^T = 0. However, this matrix also contains twice the diagonal elements of A on its diagonal. Therefore, the diagonal elements of A must be 0.
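A minimal NumPy check (illustrative only; the particular matrix is an arbitrary choice) constructs a matrix satisfying A = −A^T and inspects its diagonal:

import numpy as np

# M - M^T satisfies A = -A^T for any square matrix M.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M - M.T

print(np.allclose(A, -A.T))  # True: A = -A^T
print(np.diag(A))            # [0. 0. 0. 0.]: the diagonal is zero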
4. Show that if we have a matrix satisfying A = −A^T, then for any column vector x, we have x^T A x = 0.

Note that the transpose of the scalar x^T A x leaves it unchanged. Therefore, we have x^T A x = (x^T A x)^T = x^T A^T x = −x^T A x. Therefore, we have 2 x^T A x = 0, and hence x^T A x = 0.
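The same kind of numerical sanity check applies here (again illustrative, with an arbitrarily chosen skew-symmetric A and vector x):

import numpy as np

# Any matrix of the form M - M^T satisfies A = -A^T.
rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
A = M - M.T
x = rng.standard_normal(5)

print(x @ A @ x)  # ~0 up to floating-point round-off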