CS 7641 Problem Set Solutions
Question 1
We are asked to consider a supervised learning problem with a non-deterministic target, where
an instance x can map to a binary outcome y ∈ {0, 1}. The goal is to derive the appropriate
error function for finding the Maximum Likelihood (ML) hypothesis and compare it to the sum
of squared errors.
(a) Derivation of the Maximum Likelihood Error Function
To find the Maximum Likelihood hypothesis, h_ML, we start from Bayes' rule: the posterior P(h|D) is proportional to P(D|h)P(h), and under a uniform prior P(h) over hypotheses, maximizing the posterior reduces to finding the hypothesis h that maximizes the probability of observing the training data D given the hypothesis. This is the likelihood, P(D|h).
Assuming the training data points D = {(x1, y1), . . . , (xm, ym)} are independent and iden-
tically distributed (i.i.d.), the likelihood of the entire dataset is the product of the probabilities
of each data point:
L(h) = P(D|h) = \prod_{i=1}^{m} P(y_i | x_i, h)
Our hypothesis h(xi) models the probability that the true value yi is 1. Therefore, we can
express the probability of observing a specific yi as:
• P(y_i = 1 | x_i, h) = h(x_i)
• P(y_i = 0 | x_i, h) = 1 - h(x_i)
This can be written more compactly as a single expression:

P(y_i | x_i, h) = h(x_i)^{y_i} (1 - h(x_i))^{1 - y_i}
Substituting this back into our likelihood function gives:
L(h) = \prod_{i=1}^{m} h(x_i)^{y_i} (1 - h(x_i))^{1 - y_i}
For computational convenience, we maximize the log-likelihood, l(h) = ln(L(h)), as this converts
the product into a sum and doesn’t change the location of the maximum.
l(h) = \ln\left( \prod_{i=1}^{m} h(x_i)^{y_i} (1 - h(x_i))^{1 - y_i} \right) = \sum_{i=1}^{m} \left[ y_i \ln(h(x_i)) + (1 - y_i) \ln(1 - h(x_i)) \right]
Standard learning algorithms are formulated as minimizing an error function. Maximizing the
log-likelihood is equivalent to minimizing its negative. Therefore, the proper error function to
minimize is the negative log-likelihood, also known as the cross-entropy error:
E(h) = -l(h) = - \sum_{i=1}^{m} \left[ y_i \ln(h(x_i)) + (1 - y_i) \ln(1 - h(x_i)) \right]
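As a quick sanity check, here is a minimal Python sketch (not part of the original problem set) that evaluates this cross-entropy error on hypothetical toy values; the targets and predicted probabilities below are made up purely for illustration.

```python
import numpy as np

def cross_entropy_error(y, p, eps=1e-12):
    """Negative log-likelihood (cross-entropy) for binary targets y and
    predicted probabilities p = h(x); eps guards against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy values: three training points and their predicted P(y=1).
y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
print(cross_entropy_error(y, p))  # roughly 0.84
```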
(b) Comparison to Sum of Squared Errors
This derived cross-entropy error function can be compared to the sum of squared errors (SSE),
which is used for deterministic functions perturbed by zero-mean Gaussian noise.
• Underlying Assumption: The core difference lies in the assumed data distribution.
SSE assumes the target variable is continuous and any deviation from the true function
is due to Gaussian noise. Cross-entropy assumes the target variable is binary and follows
a Bernoulli distribution, which is exactly the setup described in the problem.
• Behavior with 0/1 Data: If a standard neural network used SSE with 0/1 data, it would treat 0 and 1 as numerical targets and try to minimize the squared difference. While this can work, it is not ideal. For example, if the model predicts 0.9 for a target of 1, the squared error is small ((1 - 0.9)^2 = 0.01). If it confidently predicts 0.1 for a target of 1, the error is larger but still bounded ((1 - 0.1)^2 = 0.81). Cross-entropy provides a much steeper penalty for confident but incorrect predictions (e.g., predicting a probability near 0 when the true label is 1), which often leads to better training dynamics for classification problems; the numeric sketch after this list makes the contrast concrete.
• Behavior with Probability Estimates: If the training data consisted of (x, y) pairs
where y was an estimate of the probability (e.g., y = 0.75) instead of 0s and 1s, using
SSE would be more justifiable, as the target is now a continuous value. However, even in
this case, cross-entropy is often preferred as it is fundamentally designed for comparing
probability distributions.
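To make the penalty comparison concrete, the following short sketch (with assumed example probabilities) contrasts the per-example SSE and cross-entropy losses for a target of 1.

```python
import numpy as np

def sse_loss(t, p):
    """Squared error for a single prediction p against target t."""
    return (t - p) ** 2

def ce_loss(t, p, eps=1e-12):
    """Per-example cross-entropy for a binary target t."""
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

# Target is 1; compare a near-correct prediction with confidently wrong ones.
for p in (0.9, 0.1, 0.01):
    print(f"p={p:0.2f}  SSE={sse_loss(1, p):.3f}  CE={ce_loss(1, p):.3f}")
# SSE stays below 1 as p -> 0, while the cross-entropy penalty grows without bound.
```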
Question 2: Perceptron and Network Design
(a) Perceptron for the function A ∧ ¬B
To implement the boolean function A ∧ ¬B, we need a two-input perceptron that outputs 1
only when the input is (A = 1, B = 0). Let A be input x1 and B be input x2. The perceptron’s
output is 1 if w1 x1 + w2x2 + w0 > 0, and 0 otherwise. Here, w0 represents the negative of the
threshold (w0 = −θ).
A valid set of parameters is:
• Weight for A (w1): 1
• Weight for B (w2): -1
• Bias (w0): -0.5 (which corresponds to a threshold θ = 0.5)
Let's verify this solution for all four possible inputs (the code sketch after this list repeats the same check):
• A=0, B=0: 1(0) + (−1)(0) − 0.5 = −0.5 ≤ 0 → Output is 0. (Correct)
• A=0, B=1: 1(0) + (−1)(1) − 0.5 = −1.5 ≤ 0 → Output is 0. (Correct)
• A=1, B=1: 1(1) + (−1)(1) − 0.5 = −0.5 ≤ 0 → Output is 0. (Correct)
• A=1, B=0: 1(1) + (−1)(0) − 0.5 = +0.5 > 0 → Output is 1. (Correct)
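The same truth table can be checked programmatically; here is a minimal Python sketch of this perceptron (the function name is illustrative).

```python
def a_and_not_b(a, b):
    """Perceptron for A AND (NOT B): weights w1=1, w2=-1, bias w0=-0.5."""
    net = 1 * a + (-1) * b + (-0.5)
    return 1 if net > 0 else 0

# Check all four inputs against the expected truth table.
for a in (0, 1):
    for b in (0, 1):
        expected = int(a == 1 and b == 0)
        assert a_and_not_b(a, b) == expected
        print(a, b, "->", a_and_not_b(a, b))
```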
(b) Two-Layer Network for A ⊕ B (XOR)
The XOR function is not linearly separable, so it cannot be implemented with a single percep-
tron. It requires a multi-layer network. We can construct XOR from other logical functions
that are linearly separable, such as:
A ⊕ B = (A ∨ B) ∧ ¬(A ∧ B)
We can design a two-layer network where the hidden layer computes (A ∨ B) and ¬(A ∧ B),
and the output layer combines them with an AND operation.
Network Structure and Parameters:
• Hidden Layer Neuron 1 (H1 - OR gate): Computes A ∨ B.
– wA1 = 1, wB1 = 1, Bias w01 = −0.5 (θ = 0.5)
• Hidden Layer Neuron 2 (H2 - NAND gate): Computes ¬(A ∧ B).
– wA2 = −1, wB2 = −1, Bias w02 = 1.5 (θ = −1.5)
• Output Layer Neuron (O - AND gate): Takes inputs from H1 and H2.
– wH1 = 1, wH2 = 1, Bias w0O = −1.5 (θ = 1.5)
Network Diagram: inputs A and B feed hidden unit H1 (computing A ∨ B) and hidden unit H2 (computing ¬(A ∧ B)); both hidden outputs feed the output unit, which produces A ⊕ B.
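A minimal Python sketch of this network, wiring the OR and NAND hidden units into the AND output unit with the weights above (function names are illustrative):

```python
def step(net):
    """Perceptron activation: fire (output 1) when the net input is positive."""
    return 1 if net > 0 else 0

def xor_network(a, b):
    """Two-layer network: OR and NAND hidden units feed an AND output unit."""
    h1 = step(1 * a + 1 * b - 0.5)       # H1: A OR B
    h2 = step(-1 * a - 1 * b + 1.5)      # H2: NOT (A AND B)
    return step(1 * h1 + 1 * h2 - 1.5)   # O: H1 AND H2 = A XOR B

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))  # matches the XOR truth table
```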
Question 3: Training Rule Derivations and Comparison
The problem describes a single unit where the output o is a linear combination of weights over a feature set that includes both linear and quadratic terms of the inputs (x_j and x_j^2). For clarity in the derivation, let's define the output o with unique weights for each feature term:

o = w_0 + \sum_{j=1}^{n} \left( w_{j,1} x_j + w_{j,2} x_j^2 \right)

This is still a linear unit because the output o is linear with respect to the weights, which is the key property for these derivations.
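A small Python sketch of this output computation, assuming the weights are stored as a scalar w0 and two length-n arrays w1 and w2 (the names and values are illustrative, not prescribed by the problem):

```python
import numpy as np

def unit_output(x, w0, w1, w2):
    """Output of the unit: o = w0 + sum_j (w1[j] * x[j] + w2[j] * x[j]**2)."""
    x = np.asarray(x, dtype=float)
    return w0 + np.dot(w1, x) + np.dot(w2, x ** 2)

# Hypothetical example with n = 2 inputs.
print(unit_output([2.0, -1.0], w0=0.5,
                  w1=np.array([1.0, 0.0]),
                  w2=np.array([0.25, 0.5])))  # 0.5 + 2.0 + 0.0 + 1.0 + 0.5 = 4.0
```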
(a) Perceptron Training Rule Derivation
The Perceptron Training Rule updates weights only when the model misclassifies an input. The
output of the perceptron, op, is the thresholded value of the linear combination.
o_p = \mathrm{sgn}\left( w_0 + \sum_{j=1}^{n} \left( w_{j,1} x_j + w_{j,2} x_j^2 \right) \right)

where the target output t is typically +1 or −1.
The update rule is designed to move the decision boundary to correctly classify the mislabeled instance. The general form is ∆w_i = η(t − o_p) · feature_i. Applying this to our specific features (1, x_j, x_j^2):
• For the bias weight w_0: the feature is 1.
  ∆w_0 = η(t − o_p)
• For the linear term weight w_{j,1}: the feature is x_j.
  ∆w_{j,1} = η(t − o_p) x_j
• For the quadratic term weight w_{j,2}: the feature is x_j^2.
  ∆w_{j,2} = η(t − o_p) x_j^2
These updates are applied for each misclassified training example.
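A minimal sketch of one such update, under the same assumed weight layout as the output sketch above (variable names are illustrative; this is not a prescribed implementation):

```python
import numpy as np

def perceptron_update(w0, w1, w2, x, t, eta=0.1):
    """One application of the perceptron training rule to a single example.
    w1[j] multiplies x[j], w2[j] multiplies x[j]**2; the target t is +1 or -1."""
    x = np.asarray(x, dtype=float)
    net = w0 + np.dot(w1, x) + np.dot(w2, x ** 2)
    o_p = 1.0 if net > 0 else -1.0          # thresholded output sgn(net)
    if o_p != t:                            # update only on a misclassification
        w0 = w0 + eta * (t - o_p)           # feature is the constant 1
        w1 = w1 + eta * (t - o_p) * x       # feature is x_j
        w2 = w2 + eta * (t - o_p) * x ** 2  # feature is x_j^2
    return w0, w1, w2
```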
(b) Gradient Descent Training Rule Derivation
The Gradient Descent (or Delta Rule) method works by minimizing an error function, typically
the sum of squared errors (SSE). For a single training example d, the error is:
E_d = \frac{1}{2} (t_d - o_d)^2

Here, the output o_d is the unthresholded, continuous linear output of the unit:

o_d = w_0 + \sum_{j=1}^{n} \left( w_{j,1} x_{d,j} + w_{j,2} x_{d,j}^2 \right)
The weight update rule is ∆w = −η ∇E_d, which requires calculating the partial derivative of the error with respect to each weight. We use the chain rule:

\frac{\partial E_d}{\partial w} = \frac{\partial E_d}{\partial o_d} \cdot \frac{\partial o_d}{\partial w}

First, the derivative of the error with respect to the output o_d is:

\frac{\partial E_d}{\partial o_d} = \frac{\partial}{\partial o_d} \, \frac{1}{2} (t_d - o_d)^2 = -(t_d - o_d)

Next, we find the derivatives of the output o_d with respect to each weight:

\frac{\partial o_d}{\partial w_0} = 1, \qquad \frac{\partial o_d}{\partial w_{j,1}} = x_{d,j}, \qquad \frac{\partial o_d}{\partial w_{j,2}} = x_{d,j}^2

Putting it all together, the gradients are:

\frac{\partial E_d}{\partial w_0} = -(t_d - o_d), \qquad \frac{\partial E_d}{\partial w_{j,1}} = -(t_d - o_d)\, x_{d,j}, \qquad \frac{\partial E_d}{\partial w_{j,2}} = -(t_d - o_d)\, x_{d,j}^2
The final update rules (∆w = −η ∇E_d) are:
• ∆w_0 = η(t_d − o_d)
• ∆w_{j,1} = η(t_d − o_d) x_{d,j}
• ∆w_{j,2} = η(t_d − o_d) x_{d,j}^2
Key Distinction: While the update rules look formally identical, the output term o is critically different. In the Perceptron rule, o_p is the discrete, thresholded output (±1). In Gradient Descent, o_d is the continuous, unthresholded linear output.
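For comparison with the perceptron sketch above, here is a minimal gradient-descent (delta-rule) step under the same assumed weight layout; the only substantive change is that the continuous output o_d replaces the thresholded o_p.

```python
import numpy as np

def gradient_descent_update(w0, w1, w2, x, t, eta=0.1):
    """One stochastic gradient-descent (delta-rule) step on a single example.
    Identical in form to the perceptron sketch, but uses the continuous output."""
    x = np.asarray(x, dtype=float)
    o_d = w0 + np.dot(w1, x) + np.dot(w2, x ** 2)  # unthresholded linear output
    err = t - o_d
    w0 = w0 + eta * err            # -dE/dw0      = (t - o)
    w1 = w1 + eta * err * x        # -dE/dw_{j,1} = (t - o) x_j
    w2 = w2 + eta * err * x ** 2   # -dE/dw_{j,2} = (t - o) x_j^2
    return w0, w1, w2
```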