Study Program: Master Data Science and Society
Academic Year 2022/2023, Semester 1, Block 1 (September to October 2022)
Course: Deep Learning (880008-M-6)
Lecturer: G. Saygili
Lecture 1: Introduction and the Perceptron
Introduction
• Artificial Intelligence: hard-code knowledge about the world in formal languages. People struggle to devise formal rules with enough complexity to accurately describe world knowledge.
• Machine Learning: systems acquire their own knowledge by extracting patterns from raw data. The performance of simple ML algorithms depends heavily on the representation of the data (features).
• Representation Learning: use ML not only to discover mappings from representation to output but also to learn the representation itself (example: an autoencoder, which combines an encoder function with a decoder function)
• Deep Learning: solves the central problem of representation learning by introducing representations that are expressed in terms of other, simpler representations. This enables the computer to build complex concepts out of simple ones. Depth enables the computer to learn a multistep computer program.
History of Deep Learning
• 1940-1960: Cybernetics (early model of brain function, Perceptron by Rosenblatt)
• 1980-1990: Connectionism (distributed representation, backpropagation)
• Since 2006: Deep Learning
Artificial Neural Networks (ANNs)
• The brain is a proof by example that intelligent behavior is possible: reverse engineer the computational principles behind it and duplicate its functionality
• ML models that help us understand the principles that underlie human intelligence
• In deep learning: more general principle of learning by multiple layers that create depth
o Each subsequent hidden layer identifies more complex structure based on the features learnt by the previous layer(s). Each layer adds information and complexity.
CPU vs GPU
• A CPU has a few cores, while a GPU has hundreds (or thousands) of cores that work in parallel
Deep Learning Frameworks
• TensorFlow: created by the Google Brain Team, first release in 2015
• Keras: runs on top of TensorFlow
• PyTorch: released in 2017, builds on Torch; Caffe2 was later merged into it
The Perceptron
• Most basic single-layer neural network
• Typically used for binary classification problems
• Data needs to be linearly separable (linear decision boundary)
• Input vector: $X = (x_1, x_2, \dots, x_m)$
• Weights vector: $W = (w_1, w_2, \dots, w_m)$
• Summed input: $\sum_i w_i x_i$
• Activation function (step activation function with threshold $t$):
o $\sum_i w_i x_i \geq t \rightarrow y' = 1$ (here the node fires)
o $\sum_i w_i x_i < t \rightarrow y' = 0$ (here the node doesn’t fire)
• If there is no bias, the decision line always intersects the y-axis at zero (it passes through the origin), and its slope depends on the weights. If a bias term is added, the line is shifted. The bias is a measure of how easy it is to get the perceptron to output a “1”.
• If the expected output $y$ does not equal the observed output $y'$, the weights (and bias) need to be updated according to an update rule ($\alpha$ is the learning rate): $w_i' = w_i + \alpha x_i (y - y')$
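The step activation and the update rule above can be sketched as follows. This is a minimal illustration rather than the lecture's own implementation: the function names, the logical-AND data set, the learning rate of 0.5 and the number of epochs are all assumptions. The threshold $t$ is folded into a bias $b = -t$, so the node fires when $\sum_i w_i x_i + b \geq 0$, and the bias is updated with the same rule as the weights using a constant input of 1.

```python
import numpy as np

def step(summed_input):
    """Step activation: the node fires (1) when the summed input reaches 0."""
    return 1 if summed_input >= 0 else 0

def train_perceptron(X, y, alpha=0.5, epochs=20):
    """Apply w_i' = w_i + alpha * x_i * (y - y') sample by sample."""
    w = np.zeros(X.shape[1])
    b = 0.0  # bias, standing in for -t (assumption of this sketch)
    for _ in range(epochs):
        for x_i, target in zip(X, y):
            y_pred = step(np.dot(w, x_i) + b)
            error = target - y_pred      # 0 when the prediction is correct
            w = w + alpha * x_i * error  # weights change only on mistakes
            b = b + alpha * error        # bias acts like a weight on input 1
    return w, b

# Logical AND is linearly separable, so the perceptron can learn it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
predictions = [step(np.dot(w, x_i) + b) for x_i in X]
print(predictions)
```

On data that is not linearly separable (for example XOR), the same loop would keep changing the weights without ever classifying all samples correctly.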
Matrix Multiplication
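A minimal sketch of how matrix multiplication relates to the perceptron's summed input, assuming several samples are stacked as rows of a matrix. The values of `X` and `w` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical values, for illustration only.
X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 1.0]])   # 3 samples (rows), 2 features each
w = np.array([0.5, -1.0])    # one weight per feature

# Row i of X times w gives the summed input sum_j w_j * x_ij of sample i.
summed_inputs = X @ w        # shape (3,)

# The same result computed sample by sample with an explicit sum.
loop_result = np.array([sum(w_j * x_j for w_j, x_j in zip(w, row)) for row in X])

print(summed_inputs)
```

Computing all summed inputs in one matrix product is the kind of operation that parallel hardware handles efficiently.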
Lecture 2: MLP and Back-Propagation Algorithm
Multilayer Perceptron
• MLP = Feedforward Network
• Dense or fully connected layers
• Goal: approximate some function and learn the parameter values that lead to the best result
• The functions of the hidden layers and the output layer are chained together: $y = f_2(f_1(x))$
• The output of the network ($y$) is the output of the last layer, which is based on the previous outputs: at each layer, a weighted combination of the inputs plus the bias term is calculated and an activation function is applied. The result is forwarded to the next layer.
• Activation functions are e.g. sigmoid, ReLU, leaky ReLU
• The same activation function can be used in every layer, or different ones in different layers; the choice is a hyperparameter
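The chained computation $y = f_2(f_1(x))$ can be sketched as a small forward pass. This is an illustration under assumptions, not the lecture's code: the layer sizes, the random weights, the helper names (`dense`, `forward`) and the choice of ReLU for the hidden layer and sigmoid for the output are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # Small non-zero slope for negative inputs (slope value is an assumption).
    return np.where(z > 0, z, slope * z)

def dense(x, W, b, activation):
    """One fully connected layer: weighted combination plus bias, then activation."""
    return activation(W @ x + b)

def forward(x, params):
    """Chain the layers: y = f2(f1(x))."""
    h = dense(x, params["W1"], params["b1"], relu)        # hidden layer f1
    return dense(h, params["W2"], params["b2"], sigmoid)  # output layer f2

rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(size=(3, 2)), "b1": np.zeros(3),  # 2 inputs -> 3 hidden units
    "W2": rng.normal(size=(1, 3)), "b2": np.zeros(1),  # 3 hidden units -> 1 output
}

y = forward(np.array([0.5, -1.0]), params)
print(y.shape)
```

Swapping `relu` for `leaky_relu` in `forward` changes the hidden activation without touching the rest of the network, which is what treating the activation as a hyperparameter amounts to.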
Back-Propagation
• How the network optimizes its parameters
• Loss Function/Error Function/Cost Function
o Calculates the “cost”, or distance, between the network’s output and the expected one.
o Loss / cost: sum of errors over all training samples
o Error: sample-wise
• Forward Propagation: the input is passed through the network and the loss/error is calculated
• Backpropagation: parameters (weights and biases) are updated to minimize the loss (with (stochastic) gradient descent)
• The prediction error is backpropagated from the loss function to update the parameters.
• We often have thousands of parameters to update at once. We want to know how each individual parameter contributes to the error so we can update them appropriately.
• This is done by taking the derivative of the error (cost function) with respect to each parameter (partial derivatives). Using the chain rule: if $L = f(a)$ and $a = g(z)$, then $\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z}$, which is the derivative of $L$ with respect to $z$
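• Example: We need to take the derivative to update our parameters: $\frac{\partial \mathit{loss}}{\partial w_1} = \frac{\partial \mathit{loss}}{\partial a_1} \cdot \frac{\partial a_1}{\partial u_1} \cdot \frac{\partial u_1}{\partial w_1}$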
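• Exercise using the chain rule:
o Expand $\frac{\partial L}{\partial W}$
o Use $a = f(z)$ and $z = W \cdot a + b$ (the $a$ inside $z$ is the layer's input, i.e. the activation of the previous layer)
o And assume that the loss $L$ is a function of $a$
o Solution: $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W}$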
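The chain-rule expansion from the exercise can be checked numerically on a single neuron. This is a sketch under stated assumptions: the sigmoid activation, the squared-error loss $L = \frac{1}{2}(a - y)^2$, the scalar input `x` (playing the role of the previous layer's activation) and all numeric values are hypothetical choices, not taken from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(w, b, x, y):
    """Forward pass: z = w*x + b, a = f(z), L = 0.5 * (a - y)^2."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

def gradients(w, b, x, y):
    """Backward pass via the chain rule dL/dw = dL/da * da/dz * dz/dw."""
    z = w * x + b
    a = sigmoid(z)
    dL_da = a - y            # derivative of 0.5 * (a - y)^2 w.r.t. a
    da_dz = a * (1.0 - a)    # derivative of the sigmoid w.r.t. z
    dz_dw = x                # derivative of w*x + b w.r.t. w
    dz_db = 1.0              # derivative of w*x + b w.r.t. b
    return dL_da * da_dz * dz_dw, dL_da * da_dz * dz_db

# Hypothetical sample, starting parameters and learning rate.
x, y = 2.0, 1.0
w, b = 0.3, -0.1
alpha = 0.5

dL_dw, dL_db = gradients(w, b, x, y)

# Compare the analytic gradient with a central finite difference.
eps = 1e-6
numeric_dw = (loss_fn(w + eps, b, x, y) - loss_fn(w - eps, b, x, y)) / (2 * eps)

# One gradient descent step: move each parameter against its gradient.
loss_before = loss_fn(w, b, x, y)
w_new, b_new = w - alpha * dL_dw, b - alpha * dL_db
loss_after = loss_fn(w_new, b_new, x, y)

print(dL_dw, numeric_dw)
print(loss_before, loss_after)
```

The finite-difference comparison is a common sanity check for hand-derived gradients; in a multilayer network the same three-factor product is extended with one extra factor per layer between the parameter and the loss.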