CS 7641 Problem Set Solutions
Question 1
We are asked to consider a supervised learning problem with a non-deterministic target, where
an instance x can map to a binary outcome y ∈ {0, 1}. The goal is to derive the appropriate
error function for finding the Maximum Likelihood (ML) hypothesis and compare it to the sum
of squared errors.
(a) Derivation of the Maximum Likelihood Error Function
To find the Maximum Likelihood hypothesis, h_ML, we start from Bayes' rule: the posterior P(h|D) is proportional to P(D|h)P(h), and under a uniform prior P(h) over hypotheses, maximizing the posterior reduces to finding the hypothesis h that maximizes the probability of observing the training data D given the hypothesis. This is the likelihood, P(D|h).
Assuming the training data points D = {(x1, y1), . . . , (xm, ym)} are independent and iden-
tically distributed (i.i.d.), the likelihood of the entire dataset is the product of the probabilities
of each data point:
L(h) = P(D|h) = \prod_{i=1}^{m} P(y_i | x_i, h)
Our hypothesis h(xi) models the probability that the true value yi is 1. Therefore, we can
express the probability of observing a specific yi as:
• P(y_i = 1 | x_i, h) = h(x_i)
• P(y_i = 0 | x_i, h) = 1 - h(x_i)
This can be written more compactly as a single expression:

P(y_i | x_i, h) = h(x_i)^{y_i} (1 - h(x_i))^{1 - y_i}
Substituting this back into our likelihood function gives:
L(h) = \prod_{i=1}^{m} h(x_i)^{y_i} (1 - h(x_i))^{1 - y_i}
For computational convenience, we maximize the log-likelihood, l(h) = ln(L(h)), as this converts
the product into a sum and doesn’t change the location of the maximum.
l(h) = \ln\left( \prod_{i=1}^{m} h(x_i)^{y_i} (1 - h(x_i))^{1 - y_i} \right) = \sum_{i=1}^{m} \left[ y_i \ln(h(x_i)) + (1 - y_i) \ln(1 - h(x_i)) \right]
Standard learning algorithms are formulated as minimizing an error function. Maximizing the
log-likelihood is equivalent to minimizing its negative. Therefore, the proper error function to
minimize is the negative log-likelihood, also known as the cross-entropy error:
E(h) = -l(h) = - \sum_{i=1}^{m} \left[ y_i \ln(h(x_i)) + (1 - y_i) \ln(1 - h(x_i)) \right]
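As a quick sanity check, here is a minimal Python sketch (not part of the original problem set) that evaluates this cross-entropy error on hypothetical toy values; the targets and predicted probabilities below are made up purely for illustration.

```python
import numpy as np

def cross_entropy_error(y, p, eps=1e-12):
    """Negative log-likelihood (cross-entropy) for binary targets y and
    predicted probabilities p = h(x); eps guards against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy values: three training points and their predicted P(y=1).
y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
print(cross_entropy_error(y, p))  # roughly 0.84
```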
(b) Comparison to Sum of Squared Errors
This derived cross-entropy error function can be compared to the sum of squared errors (SSE),
which is used for deterministic functions perturbed by zero-mean Gaussian noise.
• Underlying Assumption: The core difference lies in the assumed data distribution.
SSE assumes the target variable is continuous and any deviation from the true function
is due to Gaussian noise. Cross-entropy assumes the target variable is binary and follows
a Bernoulli distribution, which is exactly the setup described in the problem.
• Behavior with 0/1 Data: If a standard neural network used SSE with 0/1 data, it would treat 0 and 1 as numerical targets and try to minimize the squared difference. While this can work, it is not ideal. For example, if the model predicts 0.9 for a target of 1, the squared error is small ((1 - 0.9)^2 = 0.01). If it confidently predicts 0.1 for a target of 1, the error is larger but still bounded ((1 - 0.1)^2 = 0.81). Cross-entropy provides a much steeper penalty for confident but incorrect predictions (e.g., predicting a probability near 0 when the true label is 1), which often leads to better training dynamics for classification problems; the numeric sketch after this list makes the contrast concrete.
• Behavior with Probability Estimates: If the training data consisted of (x, y) pairs
where y was an estimate of the probability (e.g., y = 0.75) instead of 0s and 1s, using
SSE would be more justifiable, as the target is now a continuous value. However, even in
this case, cross-entropy is often preferred as it is fundamentally designed for comparing
probability distributions.
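To make the penalty comparison concrete, the following short sketch (with assumed example probabilities) contrasts the per-example SSE and cross-entropy losses for a target of 1.

```python
import numpy as np

def sse_loss(t, p):
    """Squared error for a single prediction p against target t."""
    return (t - p) ** 2

def ce_loss(t, p, eps=1e-12):
    """Per-example cross-entropy for a binary target t."""
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

# Target is 1; compare a near-correct prediction with confidently wrong ones.
for p in (0.9, 0.1, 0.01):
    print(f"p={p:0.2f}  SSE={sse_loss(1, p):.3f}  CE={ce_loss(1, p):.3f}")
# SSE stays below 1 as p -> 0, while the cross-entropy penalty grows without bound.
```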
Question 2: Perceptron and Network Design
(a) Perceptron for the function A ∧ ¬B
To implement the boolean function A ∧ ¬B, we need a two-input perceptron that outputs 1
only when the input is (A = 1, B = 0). Let A be input x1 and B be input x2. The perceptron’s
output is 1 if w1 x1 + w2x2 + w0 > 0, and 0 otherwise. Here, w0 represents the negative of the
threshold (w0 = −θ).
A valid set of parameters is:
• Weight for A (w1): 1
• Weight for B (w2): -1
• Bias (w0): -0.5 (which corresponds to a threshold θ = 0.5)
Let's verify this solution for all four possible inputs (the code sketch after this list repeats the same check):
• A=0, B=0: 1(0) + (−1)(0) − 0.5 = −0.5 ≤ 0 → Output is 0. (Correct)
• A=0, B=1: 1(0) + (−1)(1) − 0.5 = −1.5 ≤ 0 → Output is 0. (Correct)
• A=1, B=1: 1(1) + (−1)(1) − 0.5 = −0.5 ≤ 0 → Output is 0. (Correct)
• A=1, B=0: 1(1) + (−1)(0) − 0.5 = +0.5 > 0 → Output is 1. (Correct)
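The same truth table can be checked programmatically; here is a minimal Python sketch of this perceptron (the function name is illustrative).

```python
def a_and_not_b(a, b):
    """Perceptron for A AND (NOT B): weights w1=1, w2=-1, bias w0=-0.5."""
    net = 1 * a + (-1) * b + (-0.5)
    return 1 if net > 0 else 0

# Check all four inputs against the expected truth table.
for a in (0, 1):
    for b in (0, 1):
        expected = int(a == 1 and b == 0)
        assert a_and_not_b(a, b) == expected
        print(a, b, "->", a_and_not_b(a, b))
```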
(b) Two-Layer Network for A ⊕ B (XOR)
The XOR function is not linearly separable, so it cannot be implemented with a single percep-
tron. It requires a multi-layer network. We can construct XOR from other logical functions
that are linearly separable, such as:
A ⊕ B = (A ∨ B) ∧ ¬(A ∧ B)
We can design a two-layer network where the hidden layer computes (A ∨ B) and ¬(A ∧ B),
and the output layer combines them with an AND operation.
Network Structure and Parameters:
• Hidden Layer Neuron 1 (H1 - OR gate): Computes A ∨ B.
– wA1 = 1, wB1 = 1, Bias w01 = −0.5 (θ = 0.5)
• Hidden Layer Neuron 2 (H2 - NAND gate): Computes ¬(A ∧ B).
– wA2 = −1, wB2 = −1, Bias w02 = 1.5 (θ = −1.5)
• Output Layer Neuron (O - AND gate): Takes inputs from H1 and H2.
– wH1 = 1, wH2 = 1, Bias w0O = −1.5 (θ = 1.5)
Network Diagram: inputs A and B feed hidden unit H1 (computing A ∨ B) and hidden unit H2 (computing ¬(A ∧ B)); both hidden outputs feed the output unit, which produces A ⊕ B.
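A minimal Python sketch of this network, wiring the OR and NAND hidden units into the AND output unit with the weights above (function names are illustrative):

```python
def step(net):
    """Perceptron activation: fire (output 1) when the net input is positive."""
    return 1 if net > 0 else 0

def xor_network(a, b):
    """Two-layer network: OR and NAND hidden units feed an AND output unit."""
    h1 = step(1 * a + 1 * b - 0.5)       # H1: A OR B
    h2 = step(-1 * a - 1 * b + 1.5)      # H2: NOT (A AND B)
    return step(1 * h1 + 1 * h2 - 1.5)   # O: H1 AND H2 = A XOR B

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))  # matches the XOR truth table
```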
Question 3: Training Rule Derivations and Comparison
The problem describes a single unit where the output o is a linear combination of weights over a feature set that includes both linear and quadratic terms of the inputs (x_j and x_j^2). For clarity in the derivation, let's define the output o with unique weights for each feature term:

o = w_0 + \sum_{j=1}^{n} \left( w_{j,1} x_j + w_{j,2} x_j^2 \right)

This is still a linear unit because the output o is linear with respect to the weights, which is the key property for these derivations.
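A small Python sketch of this output computation, assuming the weights are stored as a scalar w0 and two length-n arrays w1 and w2 (the names and values are illustrative, not prescribed by the problem):

```python
import numpy as np

def unit_output(x, w0, w1, w2):
    """Output of the unit: o = w0 + sum_j (w1[j] * x[j] + w2[j] * x[j]**2)."""
    x = np.asarray(x, dtype=float)
    return w0 + np.dot(w1, x) + np.dot(w2, x ** 2)

# Hypothetical example with n = 2 inputs.
print(unit_output([2.0, -1.0], w0=0.5,
                  w1=np.array([1.0, 0.0]),
                  w2=np.array([0.25, 0.5])))  # 0.5 + 2.0 + 0.0 + 1.0 + 0.5 = 4.0
```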
(a) Perceptron Training Rule Derivation
The Perceptron Training Rule updates weights only when the model misclassifies an input. The
output of the perceptron, op, is the thresholded value of the linear combination.
o_p = \mathrm{sgn}\left( w_0 + \sum_{j=1}^{n} \left( w_{j,1} x_j + w_{j,2} x_j^2 \right) \right)

where the target output t is typically +1 or −1.
The update rule is designed to move the decision boundary to correctly classify the mislabeled instance. The general form is ∆w_i = η(t − o_p) · feature_i. Applying this to our specific features (1, x_j, x_j^2):
• For the bias weight w_0: the feature is 1.
  ∆w_0 = η(t − o_p)
• For the linear term weight w_{j,1}: the feature is x_j.
  ∆w_{j,1} = η(t − o_p) x_j
• For the quadratic term weight w_{j,2}: the feature is x_j^2.
  ∆w_{j,2} = η(t − o_p) x_j^2
These updates are applied for each misclassified training example.
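A minimal sketch of one such update, under the same assumed weight layout as the output sketch above (variable names are illustrative; this is not a prescribed implementation):

```python
import numpy as np

def perceptron_update(w0, w1, w2, x, t, eta=0.1):
    """One application of the perceptron training rule to a single example.
    w1[j] multiplies x[j], w2[j] multiplies x[j]**2; the target t is +1 or -1."""
    x = np.asarray(x, dtype=float)
    net = w0 + np.dot(w1, x) + np.dot(w2, x ** 2)
    o_p = 1.0 if net > 0 else -1.0          # thresholded output sgn(net)
    if o_p != t:                            # update only on a misclassification
        w0 = w0 + eta * (t - o_p)           # feature is the constant 1
        w1 = w1 + eta * (t - o_p) * x       # feature is x_j
        w2 = w2 + eta * (t - o_p) * x ** 2  # feature is x_j^2
    return w0, w1, w2
```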
(b) Gradient Descent Training Rule Derivation
The Gradient Descent (or Delta Rule) method works by minimizing an error function, typically
the sum of squared errors (SSE). For a single training example d, the error is:
E_d = \frac{1}{2} (t_d - o_d)^2

Here, the output o_d is the unthresholded, continuous linear output of the unit:

o_d = w_0 + \sum_{j=1}^{n} \left( w_{j,1} x_{d,j} + w_{j,2} x_{d,j}^2 \right)
The weight update rule is ∆w = −η ∇E_d, which requires calculating the partial derivative of the error with respect to each weight. We use the chain rule:

\frac{\partial E_d}{\partial w} = \frac{\partial E_d}{\partial o_d} \cdot \frac{\partial o_d}{\partial w}

First, the derivative of the error with respect to the output o_d is:

\frac{\partial E_d}{\partial o_d} = \frac{\partial}{\partial o_d} \, \frac{1}{2} (t_d - o_d)^2 = -(t_d - o_d)

Next, we find the derivatives of the output o_d with respect to each weight:

\frac{\partial o_d}{\partial w_0} = 1, \qquad \frac{\partial o_d}{\partial w_{j,1}} = x_{d,j}, \qquad \frac{\partial o_d}{\partial w_{j,2}} = x_{d,j}^2

Putting it all together, the gradients are:

\frac{\partial E_d}{\partial w_0} = -(t_d - o_d), \qquad \frac{\partial E_d}{\partial w_{j,1}} = -(t_d - o_d)\, x_{d,j}, \qquad \frac{\partial E_d}{\partial w_{j,2}} = -(t_d - o_d)\, x_{d,j}^2
The final update rules (∆w = −η ∇E_d) are:
• ∆w_0 = η(t_d − o_d)
• ∆w_{j,1} = η(t_d − o_d) x_{d,j}
• ∆w_{j,2} = η(t_d − o_d) x_{d,j}^2
Key Distinction: While the update rules look formally identical, the output term o is critically different. In the Perceptron rule, o_p is the discrete, thresholded output (±1). In Gradient Descent, o_d is the continuous, unthresholded linear output.
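For comparison with the perceptron sketch above, here is a minimal gradient-descent (delta-rule) step under the same assumed weight layout; the only substantive change is that the continuous output o_d replaces the thresholded o_p.

```python
import numpy as np

def gradient_descent_update(w0, w1, w2, x, t, eta=0.1):
    """One stochastic gradient-descent (delta-rule) step on a single example.
    Identical in form to the perceptron sketch, but uses the continuous output."""
    x = np.asarray(x, dtype=float)
    o_d = w0 + np.dot(w1, x) + np.dot(w2, x ** 2)  # unthresholded linear output
    err = t - o_d
    w0 = w0 + eta * err            # -dE/dw0      = (t - o)
    w1 = w1 + eta * err * x        # -dE/dw_{j,1} = (t - o) x_j
    w2 = w2 + eta * err * x ** 2   # -dE/dw_{j,2} = (t - o) x_j^2
    return w0, w1, w2
```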