Linear Algebra and Optimization for Machine Learning
1st Edition by Charu Aggarwal. Chapters 1 – 11
Contents

1 Linear Algebra and Optimization: An Introduction
2 Linear Transformations and Linear Systems
3 Diagonalizable Matrices and Eigenvectors
4 Optimization Basics: A Machine Learning View
5 Optimization Challenges and Advanced Solutions
6 Lagrangian Relaxation and Duality
7 Singular Value Decomposition
8 Matrix Factorization
9 The Linear Algebra of Similarity
10 The Linear Algebra of Graphs
11 Optimization in Computational Graphs
Chapter 1

Linear Algebra and Optimization: An Introduction
1. For any two vectors x and y, which are each of length a, show that (i) x − y is orthogonal to x + y, and (ii) the dot product of x − 3y and x + 3y is negative.
(i) Expanding with the distributive property, (x − y) · (x + y) = x · x − y · y. The dot product of a vector with itself is its squared length. Since both vectors have the same length a, the result is a^2 − a^2 = 0. (ii) In the second case, a similar expansion shows that the result is a^2 − 9a^2 = −8a^2, which is negative (for a ≠ 0).
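As a quick numerical sanity check (illustrative, not part of the original solution), both identities can be verified with NumPy; the value a = 2.5 is an arbitrary assumed length:

    import numpy as np

    rng = np.random.default_rng(0)
    a = 2.5  # common length of both vectors (arbitrary choice)

    # Draw two random directions and rescale each to length a.
    x = rng.standard_normal(5)
    y = rng.standard_normal(5)
    x *= a / np.linalg.norm(x)
    y *= a / np.linalg.norm(y)

    # (i) (x - y) . (x + y) = a^2 - a^2 = 0 when the lengths match.
    print(np.dot(x - y, x + y))                  # ~0 up to floating-point error

    # (ii) (x - 3y) . (x + 3y) = a^2 - 9a^2 = -8a^2 < 0.
    print(np.dot(x - 3*y, x + 3*y), -8 * a**2)   # both close to -50.0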
2. Consider a situation in which you have three matrices A, B, and C, of sizes 10 × 2, 2 × 10, and 10 × 10, respectively.

(a) Suppose you had to compute the matrix product ABC. From an efficiency perspective, would it computationally make more sense to compute (AB)C or would it make more sense to compute A(BC)?

(b) If you had to compute the matrix product CAB, would it make more sense to compute (CA)B or C(AB)?
The main point is to keep the size of the intermediate matrix as small as possible in order to reduce both computational and space requirements. In the case of ABC, it makes sense to compute BC first: the intermediate matrix BC is only 2 × 10, and the overall product A(BC) costs 400 scalar multiplications, whereas (AB)C creates a 10 × 10 intermediate and costs 1200. In the case of CAB, it makes sense to compute CA first, which yields a 10 × 2 intermediate and again costs 400 multiplications versus 1200 for C(AB). This type of associativity property is used frequently in machine learning in order to reduce computational requirements; a small script counting the costs appears below.
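The following sketch (illustrative, not from the text) counts scalar multiplications under the standard convention that multiplying an m × k matrix by a k × n matrix costs m·k·n:

    def matmul_cost(m, k, n):
        # Cost of multiplying an (m x k) matrix by a (k x n) matrix.
        return m * k * n

    # A: 10 x 2, B: 2 x 10, C: 10 x 10
    print("(AB)C:", matmul_cost(10, 2, 10) + matmul_cost(10, 10, 10))  # 1200
    print("A(BC):", matmul_cost(2, 10, 10) + matmul_cost(10, 2, 10))   # 400
    print("(CA)B:", matmul_cost(10, 10, 2) + matmul_cost(10, 2, 10))   # 400
    print("C(AB):", matmul_cost(10, 2, 10) + matmul_cost(10, 10, 10))  # 1200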
3. Show that if a matrix A satisfies A = −A^T, then all the diagonal elements of the matrix are 0.
Note that A + A^T = 0. However, the matrix A + A^T contains twice the diagonal elements of A on its diagonal, because the (i, i)th entry of A^T equals that of A. Therefore, the diagonal elements of A must be 0.
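A quick check (illustrative, not from the text): any matrix of the form M − M^T satisfies A = −A^T, and its diagonal is identically zero:

    import numpy as np

    M = np.random.default_rng(1).standard_normal((4, 4))
    A = M - M.T                   # skew-symmetric by construction
    assert np.allclose(A, -A.T)
    print(np.diag(A))             # all zeros, as the argument predicts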
4. Show that if we have a matrix satisfying A = −A^T, then for any column vector x, we have x^T A x = 0.
Note that the transpose of the scalar x^T A x leaves it unchanged. Therefore, we have

x^T A x = (x^T A x)^T = x^T A^T x = −x^T A x.

Therefore, we have 2 x^T A x = 0, and so x^T A x = 0.
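As before, a quick NumPy check (illustrative, not from the text):

    import numpy as np

    rng = np.random.default_rng(2)
    M = rng.standard_normal((5, 5))
    A = M - M.T                   # skew-symmetric: A = -A^T
    x = rng.standard_normal(5)
    print(x @ A @ x)              # ~0 up to floating-point error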