Linear Algebra and Optimization for Machine Learning
1st Edition by Charu Aggarwal. Chapters 1 – 11
Contents

1 Linear Algebra and Optimization: An Introduction
2 Linear Transformations and Linear Systems
3 Diagonalizable Matrices and Eigenvectors
4 Optimization Basics: A Machine Learning View
5 Optimization Challenges and Advanced Solutions
6 Lagrangian Relaxation and Duality
7 Singular Value Decomposition
8 Matrix Factorization
9 The Linear Algebra of Similarity
10 The Linear Algebra of Graphs
11 Optimization in Computational Graphs
Chapter 1
Linear Algebra and Optimization: An Introduction
1. For any two vectors x and y, which are each of length a, show that (i) x − y is orthogonal to x + y, and (ii) the dot product of x − 3y and x + 3y is negative.
(i) The first dot product is simply x · x − y · y, using the distributive property of matrix multiplication. The dot product of a vector with itself is its squared length. Since both vectors are of the same length, it follows that the result is 0. (ii) In the second case, one can use a similar argument to show that the result is a^2 − 9a^2, which is negative.
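For completeness, a worked expansion (a sketch that assumes real vectors, so that x · y = y · x, and that both vectors have length a):

\[
(x - y) \cdot (x + y) = x \cdot x + x \cdot y - y \cdot x - y \cdot y = a^2 - a^2 = 0,
\]
\[
(x - 3y) \cdot (x + 3y) = x \cdot x + 3\, x \cdot y - 3\, y \cdot x - 9\, y \cdot y = a^2 - 9a^2 = -8a^2 < 0.
\]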
2. Consider a situation in which you have three matrices A, B, and C, of sizes 10 × 2, 2 × 10, and 10 × 10, respectively.

(a) Suppose you had to compute the matrix product ABC. From an efficiency perspective, would it computationally make more sense to compute (AB)C or would it make more sense to compute A(BC)?

(b) If you had to compute the matrix product CAB, would it make more sense to compute (CA)B or C(AB)?
The main point is to keep the size of the intermediate matrix as small as possible in order to reduce both computational and space requirements. In the case of ABC, it makes sense to compute BC first, so that the intermediate matrix is only of size 2 × 10; computing AB first would create a 10 × 10 intermediate. In the case of CAB, it makes sense to compute CA first, which yields a 10 × 2 intermediate. This type of associativity property is used frequently in machine learning in order to reduce computational requirements.
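The following is a minimal sketch (not from the book) that counts the scalar multiplications for each parenthesization with the stated sizes and confirms that both orders give the same product; the helper matmul_cost and the variable names are purely illustrative.

import numpy as np

def matmul_cost(p, q, r):
    # Multiplying a (p x q) matrix by a (q x r) matrix takes p*q*r scalar multiplications.
    return p * q * r

# Sizes: A is 10 x 2, B is 2 x 10, C is 10 x 10.
cost_AB_first = matmul_cost(10, 2, 10) + matmul_cost(10, 10, 10)  # (AB)C: 200 + 1000 = 1200
cost_BC_first = matmul_cost(2, 10, 10) + matmul_cost(10, 2, 10)   # A(BC): 200 + 200  = 400
cost_CA_first = matmul_cost(10, 10, 2) + matmul_cost(10, 2, 10)   # (CA)B: 200 + 200  = 400
cost_AB_last  = matmul_cost(10, 2, 10) + matmul_cost(10, 10, 10)  # C(AB): 200 + 1000 = 1200
print(cost_AB_first, cost_BC_first)  # 1200 400 -> compute BC first
print(cost_CA_first, cost_AB_last)   # 400 1200 -> compute CA first

# Associativity guarantees that both orders produce the same matrix.
A, B, C = np.random.randn(10, 2), np.random.randn(2, 10), np.random.randn(10, 10)
assert np.allclose((A @ B) @ C, A @ (B @ C))
assert np.allclose((C @ A) @ B, C @ (A @ B))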
3. Show that if a matrix A satisfies A = −A^T, then all the diagonal elements of the matrix are 0.
Note that A + A^T = 0. However, this matrix also contains twice the diagonal elements of A on its diagonal, because transposition leaves diagonal entries in place. Therefore, the diagonal elements of A must be 0.
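Entrywise, the same argument is a one-line derivation (simply restating the solution above):

\[
a_{ii} = (-A^T)_{ii} = -a_{ii} \;\Longrightarrow\; 2a_{ii} = 0 \;\Longrightarrow\; a_{ii} = 0 \quad \text{for every } i.
\]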
4. Show that if we have a matrix satisfying A = −A^T, then for any column vector x, we have x^T A x = 0.
Note that transposing the scalar x^T A x leaves it unchanged. Therefore, we have

x^T A x = (x^T A x)^T = x^T A^T x = −x^T A x,

where the last step uses A^T = −A. Therefore, we have 2 x^T A x = 0, which implies x^T A x = 0.
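As a quick numerical sanity check (a sketch, not part of the original solution), one can build a random skew-symmetric matrix and confirm that the quadratic form vanishes up to floating-point round-off:

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M - M.T                      # A = -A^T by construction (skew-symmetric)
x = rng.standard_normal(5)
print(x @ A @ x)                 # essentially 0, up to round-off error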