Linear Algebra and Optimization for Machine Learning
1st Edition by Charu Aggarwal. Chapters 1–11
Contents

1 Linear Algebra and Optimization: An Introduction
2 Linear Transformations and Linear Systems
3 Diagonalizable Matrices and Eigenvectors
4 Optimization Basics: A Machine Learning View
5 Optimization Challenges and Advanced Solutions
6 Lagrangian Relaxation and Duality
7 Singular Value Decomposition
8 Matrix Factorization
9 The Linear Algebra of Similarity
10 The Linear Algebra of Graphs
11 Optimization in Computational Graphs
Chapter 1
Linear Algebra and Optimization: An Introduction
1. For any two vectors x and y, which are each of length a, show that (i) x − y is orthogonal to x + y, and (ii) the dot product of x − 3y and x + 3y is negative.

(i) Expanding with the distributive property of the dot product, the cross terms cancel, and the result is simply x · x − y · y. The dot product of a vector with itself is its squared length. Since both vectors are of the same length, it follows that the result is 0. (ii) In the second case, one can use a similar argument to show that the result is a^2 − 9a^2, which is negative.
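A quick numerical sketch of both facts, using NumPy (not part of the original solution; the two vectors below are arbitrary examples chosen to have the same length):

import numpy as np

# Two example vectors with equal length (both have norm 5).
x = np.array([3.0, 4.0, 0.0])
y = np.array([0.0, 0.0, 5.0])

# (i) (x - y) . (x + y) = x.x - y.y = 0 when the lengths are equal.
print(np.dot(x - y, x + y))          # 0.0

# (ii) (x - 3y) . (x + 3y) = a^2 - 9a^2 < 0.
print(np.dot(x - 3 * y, x + 3 * y))  # -200.0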
2. Consider a situation in which you have three matrices A, B, and C, of sizes 10 × 2, 2 × 10, and 10 × 10, respectively.

(a) Suppose you had to compute the matrix product ABC. From an efficiency perspective, would it computationally make more sense to compute (AB)C or would it make more sense to compute A(BC)?

(b) If you had to compute the matrix product CAB, would it make more sense to compute (CA)B or C(AB)?

The main point is to keep the size of the intermediate matrix as small as possible in order to reduce both computational and space requirements. In the case of ABC, it makes sense to compute BC first. In the case of CAB, it makes sense to compute CA first. This type of associativity property is used frequently in machine learning in order to reduce computational requirements.
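As an illustrative sketch (not from the book), the following NumPy snippet counts the scalar multiplications needed by each parenthesization for the sizes above; the matrices themselves are random placeholders:

import numpy as np

A = np.random.rand(10, 2)    # 10 x 2
B = np.random.rand(2, 10)    # 2 x 10
C = np.random.rand(10, 10)   # 10 x 10

def cost(p, q, r):
    # Scalar multiplications for a (p x q) times (q x r) product.
    return p * q * r

# ABC: (AB)C builds a 10 x 10 intermediate, A(BC) only a 2 x 10 one.
print("(AB)C:", cost(10, 2, 10) + cost(10, 10, 10))  # 1200
print("A(BC):", cost(2, 10, 10) + cost(10, 2, 10))   # 400

# CAB: (CA)B builds a 10 x 2 intermediate, C(AB) a 10 x 10 one.
print("(CA)B:", cost(10, 10, 2) + cost(10, 2, 10))   # 400
print("C(AB):", cost(10, 2, 10) + cost(10, 10, 10))  # 1200

# Associativity guarantees that both orders give the same product.
assert np.allclose((A @ B) @ C, A @ (B @ C))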
3. Show that if a matrix A satisfies A = −A^T, then all the diagonal elements of the matrix are 0.

Note that A + A^T = 0. However, this matrix also contains twice the diagonal elements of A on its diagonal. Therefore, the diagonal elements of A must be 0.
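A minimal numerical check of this argument (an assumed example, not the book's code), using the fact that any matrix of the form M − M^T satisfies A = −A^T:

import numpy as np

M = np.random.rand(4, 4)
A = M - M.T                        # skew-symmetric by construction

print(np.allclose(A, -A.T))        # True
print(np.allclose(np.diag(A), 0))  # True: every diagonal entry is 0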
4. Show that if we have a matrix satisfying A = −A^T, then for any column vector x, we have x^T A x = 0.

Note that the transpose of the scalar x^T A x remains unchanged. Therefore, we have x^T A x = (x^T A x)^T = x^T A^T x = −x^T A x. Therefore, 2 x^T A x = 0, and hence x^T A x = 0.
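The same construction illustrates this identity as well (again an assumed example rather than anything from the text):

import numpy as np

M = np.random.rand(5, 5)
A = M - M.T                      # skew-symmetric: A = -A^T
x = np.random.rand(5)

print(np.isclose(x @ A @ x, 0))  # True: the quadratic form vanishes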