Linear Algebra and Optimization for Machine Learning
1st Edition by Charu Aggarwal. Chapters 1–11

Contents
1 Linear Algebra and Optimization: An Introduction
2 Linear Transformations and Linear Systems
3 Diagonalizable Matrices and Eigenvectors
4 Optimization Basics: A Machine Learning View
5 Optimization Challenges and Advanced Solutions
6 Lagrangian Relaxation and Duality
7 Singular Value Decomposition
8 Matrix Factorization
9 The Linear Algebra of Similarity
10 The Linear Algebra of Graphs
11 Optimization in Computational Graphs
Chapter 1
Linear Algebra and Optimization: An Introduction
1. For any two vectors x and y, which are each of length a, show that (i) x − y is orthogonal to x + y, and (ii) the dot product of x − 3y and x + 3y is negative.
(i) The first dot product is simply x · x − y · y by the distributive property of the dot product. The dot product of a vector with itself is its squared length. Since both vectors have the same length a, the result is a^2 − a^2 = 0. (ii) In the second case, a similar argument shows that the result is a^2 − 9a^2 = −8a^2, which is negative.
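Both claims can be spot-checked numerically. The sketch below uses NumPy; the particular random vectors and the rescaling step are illustrative choices, not part of the exercise.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(5)
    y = rng.standard_normal(5)
    y *= np.linalg.norm(x) / np.linalg.norm(y)   # rescale so that ||x|| = ||y|| = a

    print(np.dot(x - y, x + y))          # ~0, so x - y is orthogonal to x + y
    print(np.dot(x - 3 * y, x + 3 * y))  # a^2 - 9a^2 = -8a^2 < 0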
2. Consider a situation in which you have three matrices A, B, and C, of sizes 10 × 2, 2 × 10, and 10 × 10, respectively.

(a) Suppose you had to compute the matrix product ABC. From an efficiency perspective, would it computationally make more sense to compute (AB)C or would it make more sense to compute A(BC)?

(b) If you had to compute the matrix product CAB, would it make more sense to compute (CA)B or C(AB)?
The main point is to keep the size of the intermediate matrix as small as possible, in order to reduce both computational and space requirements. In the case of ABC, it makes sense to compute BC first, because the intermediate matrix BC is only 2 × 10, whereas AB would be 10 × 10. In the case of CAB, it makes sense to compute CA first, because CA is only 10 × 2. This use of the associativity property is applied frequently in machine learning in order to reduce computational requirements.
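Concretely, computing BC costs 2·10·10 = 200 scalar multiplications and A(BC) another 10·2·10 = 200, for 400 in total, whereas (AB)C costs 200 + 1000 = 1200; the CAB case is analogous. The following NumPy sketch uses random placeholder matrices, since only the shapes matter here.

    import numpy as np

    A = np.random.rand(10, 2)
    B = np.random.rand(2, 10)
    C = np.random.rand(10, 10)

    ABC = A @ (B @ C)   # intermediate B @ C is only 2 x 10 (400 multiplications total)
    CAB = (C @ A) @ B   # intermediate C @ A is only 10 x 2 (400 multiplications total)

    # The alternatives (A @ B) @ C and C @ (A @ B) both build a 10 x 10
    # intermediate matrix and need 1200 multiplications.
    print(np.allclose(ABC, (A @ B) @ C), np.allclose(CAB, C @ (A @ B)))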
3. Show that if a matrix A satisfies A = −A^T, then all the diagonal elements of the matrix are 0.
Note that A + A^T = 0. However, this matrix also contains twice the diagonal elements of A on its diagonal. Therefore, the diagonal elements of A must be 0.
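A quick numerical illustration follows; the matrix M is an arbitrary placeholder, and A = M − M^T is just one convenient way to construct a matrix satisfying A = −A^T.

    import numpy as np

    M = np.random.rand(4, 4)
    A = M - M.T                   # A satisfies A = -A^T by construction
    print(np.allclose(A, -A.T))   # True
    print(np.diag(A))             # the diagonal entries are all zero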
4. Show that if we have a matrix satisfying A = −A^T, then for any column vector x, we have x^T A x = 0.
Note that the transpose of the scalar x^T A x leaves it unchanged. Therefore, we have

x^T A x = (x^T A x)^T = x^T A^T x = −x^T A x,

using A^T = −A in the last step. Therefore, 2 x^T A x = 0, which implies x^T A x = 0.
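The same construction as in the previous exercise gives a quick numerical check of this result (again with an arbitrary placeholder matrix and vector).

    import numpy as np

    M = np.random.rand(5, 5)
    A = M - M.T               # A = -A^T
    x = np.random.rand(5)
    print(x @ A @ x)          # ~0 up to floating-point round-off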