Optimization for Machine Learning
Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY
March 21, 2021
Contents
1 Linear Algebra and Optimization: An Introduction
2 Linear Transformations and Linear Systems
3 Diagonalizable Matrices and Eigenvectors
4 Optimization Basics: A Machine Learning View
5 Optimization Challenges and Advanced Solutions
6 Lagrangian Relaxation and Duality
7 Singular Value Decomposition
8 Matrix Factorization
9 The Linear Algebra of Similarity
10 The Linear Algebra of Graphs
11 Optimization in Computational Graphs
Chapter 1
Linear Algebra and Optimization: An Introduction
1. For any two vectors x and y, which are each of length a, show that (i) x − y is
orthogonal to x + y, and (ii) the dot product of x − 3y and x + 3y is negative.
(i) Using the distributive property of the dot product, (x − y) · (x + y) =
x · x + x · y − y · x − y · y = x · x − y · y, because the cross terms cancel.
The dot product of a vector with itself is its squared length. Since both vectors are of
the same length a, the result is a^2 − a^2 = 0. (ii) In the second case, a similar argument
gives (x − 3y) · (x + 3y) = x · x − 9 y · y = a^2 − 9a^2 = −8a^2, which is negative
whenever a > 0.
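The two identities can be checked numerically with a minimal sketch. The vectors x = (3, 4) and y = (5, 0) are hypothetical examples chosen so that both have length a = 5, and the helper names `dot` and `combine` are illustrative rather than part of the exercise.

```python
def dot(u, v):
    """Dot product of two vectors of the same dimension."""
    return sum(ui * vi for ui, vi in zip(u, v))


def combine(u, v, c):
    """Return the vector u + c * v."""
    return [ui + c * vi for ui, vi in zip(u, v)]


# Hypothetical example vectors, both of length a = 5.
x = [3, 4]
y = [5, 0]
a_squared = dot(x, x)  # squared length a^2 = 25

# (i) (x - y) . (x + y) should vanish.
part_i = dot(combine(x, y, -1), combine(x, y, 1))

# (ii) (x - 3y) . (x + 3y) should equal a^2 - 9a^2 = -8a^2.
part_ii = dot(combine(x, y, -3), combine(x, y, 3))

print(part_i)   # 0
print(part_ii)  # -200
```

Integer coordinates keep the arithmetic exact, so the comparison with −8a^2 does not need a floating-point tolerance.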
2. Consider a situation in which you have three matrices A, B, and C, of sizes 10 × 2,
2 × 10, and 10 × 10, respectively.
(a) Suppose you had to compute the matrix product ABC. From an efficiency perspective,
would it make more sense to compute (AB)C or A(BC)?
(b) If you had to compute the matrix product CAB, would it make more sense to
compute (CA)B or C(AB)?
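The main point is to keep the size of the intermediate matrix as small as possible
in order to reduce both computational and space requirements. In the case of ABC,
it makes sense to compute BC first: the intermediate matrix BC is of size 2 × 10, whereas
AB is of size 10 × 10. With the standard multiplication algorithm, A(BC) requires
200 + 200 = 400 scalar multiplications, whereas (AB)C requires 200 + 1000 = 1200.
In the case of CAB, it makes sense to compute CA first, because CA is of size 10 × 2;
the counts are again 400 for (CA)B versus 1200 for C(AB). The associativity of matrix
multiplication is used frequently in machine learning in order to reduce computational
requirements.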
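The comparison of the two orderings can be sketched by counting scalar multiplications. This assumes the standard algorithm, in which multiplying an m × k matrix by a k × n matrix costs m·k·n multiplications. The function names `matmul_cost`, `left_first`, and `right_first` are illustrative.

```python
def matmul_cost(left_shape, right_shape):
    """Return (scalar multiplications, result shape) for the standard product."""
    m, k = left_shape
    k2, n = right_shape
    if k != k2:
        raise ValueError("inner dimensions do not match")
    return m * k * n, (m, n)


def left_first(X, Y, Z):
    """Total cost and intermediate shape when computing (XY)Z."""
    cost_xy, xy = matmul_cost(X, Y)
    cost_rest, _ = matmul_cost(xy, Z)
    return cost_xy + cost_rest, xy


def right_first(X, Y, Z):
    """Total cost and intermediate shape when computing X(YZ)."""
    cost_yz, yz = matmul_cost(Y, Z)
    cost_rest, _ = matmul_cost(X, yz)
    return cost_yz + cost_rest, yz


# Shapes from the exercise.
A, B, C = (10, 2), (2, 10), (10, 10)

print(left_first(A, B, C))   # (AB)C -> (1200, (10, 10))
print(right_first(A, B, C))  # A(BC) -> (400, (2, 10))
print(left_first(C, A, B))   # (CA)B -> (400, (10, 2))
print(right_first(C, A, B))  # C(AB) -> (1200, (10, 10))
```

In both parts the cheaper ordering is the one whose intermediate matrix keeps the dimension 2 on one side.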
3. Show that if a matrix A satisfies A = −A^T, then all the diagonal elements of the
matrix are 0.
Since A = −A^T, we have A + A^T = 0. Transposition leaves the diagonal unchanged, so
the diagonal elements of A + A^T are twice the diagonal elements of A. Therefore, the
diagonal elements of A must be 0.
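Written entrywise, the same argument reads as follows, where a_{ii} denotes the i-th diagonal entry of A (this notation is introduced here for illustration):

```latex
(A + A^T)_{ii} = a_{ii} + a_{ii} = 2a_{ii} = 0
\quad\Longrightarrow\quad
a_{ii} = 0 \quad \text{for every } i.
```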
4. Show that if we have a matrix satisfying A = −A^T, then for any column vector x, we
have x^T A x = 0.
Note that x^T A x is a scalar, and a scalar is equal to its own transpose. Therefore, we have
x^T A x = (x^T A x)^T = x^T A^T x = −x^T A x. Therefore, we have 2 x^T A x = 0, and hence
x^T A x = 0.
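A small numerical check of this result can be sketched as follows. The matrix M and the vector x are hypothetical examples. Setting A = M − M^T is one convenient way (an assumption of this sketch) to obtain a matrix satisfying A = −A^T.

```python
def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*M)]


def quadratic_form(M, x):
    """Compute the scalar x^T M x for a square matrix M and a vector x."""
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))


# Hypothetical square matrix; A = M - M^T satisfies A = -A^T by construction.
M = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 10]]
Mt = transpose(M)
A = [[M[i][j] - Mt[i][j] for j in range(3)] for i in range(3)]

x = [2, -1, 5]  # arbitrary example vector

print(A)                     # [[0, -2, -4], [2, 0, -2], [4, 2, 0]]
print(quadratic_form(A, x))  # 0
```

The printed matrix also has zeros on its diagonal, which is the property established in the previous exercise.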