
CS 7641 Problem Set Solutions | Georgia Institute of Technology



Question 1
We are asked to consider a supervised learning problem with a non-deterministic target, where
an instance x can map to a binary outcome y ∈ {0, 1}. The goal is to derive the appropriate
error function for finding the Maximum Likelihood (ML) hypothesis and compare it to the sum
of squared errors.

(a) Derivation of the Maximum Likelihood Error Function
To find the Maximum Likelihood Hypothesis, hML, we start with Bayes’ Rule. We want to find
the hypothesis h that maximizes the probability of observing the training data D, given the
hypothesis. This is the likelihood, P(D|h).
Assuming the training data points D = {(x1, y1), . . . , (xm, ym)} are independent and identically distributed (i.i.d.), the likelihood of the entire dataset is the product of the probabilities of each data point:

L(h) = P(D|h) = ∏_{i=1}^{m} P(yi | xi, h)
Our hypothesis h(xi) models the probability that the true value yi is 1. Therefore, we can
express the probability of observing a specific yi as:
• P(yi = 1|xi, h) = h(xi)
• P(yi = 0|xi, h) = 1 − h(xi)
This can be written more compactly as a single expression:

P(yi | xi, h) = h(xi)^yi · (1 − h(xi))^(1−yi)
Substituting this back into our likelihood function gives:
L(h) = ∏_{i=1}^{m} h(xi)^yi · (1 − h(xi))^(1−yi)

For computational convenience, we maximize the log-likelihood, l(h) = ln(L(h)), as this converts
the product into a sum and doesn’t change the location of the maximum.
l(h) = ln( ∏_{i=1}^{m} h(xi)^yi · (1 − h(xi))^(1−yi) ) = Σ_{i=1}^{m} [ yi ln(h(xi)) + (1 − yi) ln(1 − h(xi)) ]

Standard learning algorithms are formulated as minimizing an error function. Maximizing the
log-likelihood is equivalent to minimizing its negative. Therefore, the proper error function to
minimize is the negative log-likelihood, also known as the cross-entropy error:
E(h) = −l(h) = −Σ_{i=1}^{m} [ yi ln(h(xi)) + (1 − yi) ln(1 − h(xi)) ]
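To make the formula concrete, here is a minimal Python sketch (ours, not part of the original solutions) that evaluates this cross-entropy error on a toy set of predicted probabilities h(xi) and binary labels yi:

import math

def cross_entropy_error(h_values, y_values):
    # E(h) = -sum over i of [ yi ln h(xi) + (1 - yi) ln(1 - h(xi)) ]
    return -sum(y * math.log(h) + (1 - y) * math.log(1 - h)
                for h, y in zip(h_values, y_values))

# Toy data: predicted probabilities that y = 1, and the observed labels.
print(cross_entropy_error([0.9, 0.2, 0.7], [1, 0, 1]))  # ≈ 0.685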





(b) Comparison to Sum of Squared Errors
This derived cross-entropy error function can be compared to the sum of squared errors (SSE),
which is used for deterministic functions perturbed by zero-mean Gaussian noise.

• Underlying Assumption: The core difference lies in the assumed data distribution.
SSE assumes the target variable is continuous and any deviation from the true function
is due to Gaussian noise. Cross-entropy assumes the target variable is binary and follows
a Bernoulli distribution, which is exactly the setup described in the problem.

• Behavior with 0/1 Data: If a standard neural network used SSE with 0/1 data, it
would treat '0' and '1' as numerical targets and try to minimize the squared difference.
While this can work, it's not ideal. For example, if the model predicts 0.9 for a target
of 1, the squared error is small ((1 − 0.9)² = 0.01). If it confidently predicts 0.1 for a
target of 1, the error is much larger ((1 − 0.1)² = 0.81). Cross-entropy provides a much
steeper penalty for confident but incorrect predictions (e.g., predicting a probability near
0 when the true label is 1), which often leads to better training dynamics for classification
problems; the sketch after this list makes the contrast concrete.

• Behavior with Probability Estimates: If the training data consisted of (x, y) pairs
where y was an estimate of the probability (e.g., y = 0.75) instead of 0s and 1s, using
SSE would be more justifiable, as the target is now a continuous value. However, even in
this case, cross-entropy is often preferred as it is fundamentally designed for comparing
probability distributions.
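A small Python sketch (function names are ours) comparing the two per-example penalties for a true label of 1:

import math

def sse_penalty(p, y):
    # Squared error between prediction p and target y.
    return (y - p) ** 2

def ce_penalty(p, y):
    # Cross-entropy (negative log-likelihood) of label y under probability p.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p={p}: SSE={sse_penalty(p, 1):.3f}  CE={ce_penalty(p, 1):.3f}")
# As p moves toward 0 for a true label of 1, SSE saturates below 1.0
# while cross-entropy grows without bound.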


Question 2: Perceptron and Network Design
(a) Perceptron for the function A ∧ ¬B
To implement the boolean function A ∧ ¬B, we need a two-input perceptron that outputs 1
only when the input is (A = 1, B = 0). Let A be input x1 and B be input x2. The perceptron’s
output is 1 if w1x1 + w2x2 + w0 > 0, and 0 otherwise. Here, w0 represents the negative of the
threshold (w0 = −θ).
A valid set of parameters is:
• Weight for A (w1): 1

• Weight for B (w2): -1

• Bias (w0): -0.5 (which corresponds to a threshold θ = 0.5)

Let's verify this solution for all four possible inputs (a short script reproducing the check follows the list):

• A=0, B=0: 1(0) + (−1)(0) − 0.5 = −0.5 ≤ 0 → Output is 0. (Correct)

• A=0, B=1: 1(0) + (−1)(1) − 0.5 = −1.5 ≤ 0 → Output is 0. (Correct)

• A=1, B=1: 1(1) + (−1)(1) − 0.5 = −0.5 ≤ 0 → Output is 0. (Correct)

• A=1, B=0: 1(1) + (−1)(0) − 0.5 = +0.5 > 0 → Output is 1. (Correct)
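A minimal Python sketch of this unit with the parameters above (the function name is ours):

def a_and_not_b(a, b, w1=1.0, w2=-1.0, w0=-0.5):
    # Perceptron: output 1 if w1*A + w2*B + w0 > 0, else 0.
    return 1 if w1 * a + w2 * b + w0 > 0 else 0

# Verify A AND (NOT B) over the full truth table.
for a in (0, 1):
    for b in (0, 1):
        assert a_and_not_b(a, b) == (1 if (a, b) == (1, 0) else 0)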





(b) Two-Layer Network for A ⊕ B (XOR)
The XOR function is not linearly separable, so it cannot be implemented with a single perceptron. It requires a multi-layer network. We can construct XOR from other logical functions that are linearly separable, such as:

A ⊕ B = (A ∨ B) ∧ ¬(A ∧ B)

We can design a two-layer network where the hidden layer computes (A ∨ B) and ¬(A ∧ B),
and the output layer combines them with an AND operation.
Network Structure and Parameters:

• Hidden Layer Neuron 1 (H1 - OR gate): Computes A ∨ B.

– wA1 = 1, wB1 = 1, Bias w01 = −0.5 (θ = 0.5)

• Hidden Layer Neuron 2 (H2 - NAND gate): Computes ¬(A ∧ B).

– wA2 = −1, wB2 = −1, Bias w02 = 1.5 (θ = −1.5)
• Output Layer Neuron (O - AND gate): Takes inputs from H1 and H2.

– wH1 = 1, wH2 = 1, Bias w0O = −1.5 (θ = 1.5)
Network Diagram: inputs A and B each feed both hidden units; H1 computes A ∨ B, H2 computes ¬(A ∧ B), and both hidden outputs feed the output unit, which computes A ⊕ B.
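A minimal Python sketch wiring the three units with the weights listed above (step and xor_network are our names):

def step(z):
    # Threshold activation: 1 if z > 0, else 0.
    return 1 if z > 0 else 0

def xor_network(a, b):
    h1 = step(a + b - 0.5)        # H1: OR gate
    h2 = step(-a - b + 1.5)       # H2: NAND gate
    return step(h1 + h2 - 1.5)    # O: AND of H1 and H2

# Check against Python's built-in XOR on the full truth table.
for a in (0, 1):
    for b in (0, 1):
        assert xor_network(a, b) == (a ^ b)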




Question 3: Training Rule Derivations and Comparison
The problem describes a single unit where the output o is a linear combination of weights and
a feature set that includes both linear and quadratic terms of the inputs (xj and xj²). For clarity
in the derivation, let's define the output o with unique weights for each feature term:

o = w0 + Σ_{j=1}^{n} (wj,1 xj + wj,2 xj²)

This is still a linear unit because the output o is linear with respect to the weights, which is
the key property for these derivations.

(a) Perceptron Training Rule Derivation
The Perceptron Training Rule updates weights only when the model misclassifies an input. The
output of the perceptron, op, is the thresholded value of the linear combination.
op = sgn( w0 + Σ_{j=1}^{n} (wj,1 xj + wj,2 xj²) )

where the target output t is typically +1 or −1.
The update rule is designed to move the decision boundary to correctly classify the mislabeled instance. The general form is ∆wi = η(t − op) · featurei. Applying this to our specific features (1, xj, xj²):
• For the bias weight w0: The feature is 1.
∆w0 = η(t − op)
• For the linear term weight wj,1: The feature is xj.
∆wj,1 = η(t − op)xj
• For the quadratic term weight wj,2: The feature is xj².

∆wj,2 = η(t − op)xj²
These updates are applied for each misclassified training example.
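A sketch of one pass of this rule in Python, assuming the weights are stored as a scalar w0 and two lists w1, w2 indexed by feature j; the function names and the learning rate eta are our choices:

def perceptron_output(w0, w1, w2, x):
    # Thresholded output over the (1, xj, xj²) feature set, in {+1, -1}.
    s = w0 + sum(w1[j] * xj + w2[j] * xj ** 2 for j, xj in enumerate(x))
    return 1 if s > 0 else -1

def perceptron_update(w0, w1, w2, x, t, eta=0.1):
    # One application of the perceptron training rule for target t in {+1, -1}.
    op = perceptron_output(w0, w1, w2, x)
    delta = eta * (t - op)          # zero when the example is classified correctly
    w0 += delta                     # feature 1
    for j, xj in enumerate(x):
        w1[j] += delta * xj         # feature xj
        w2[j] += delta * xj ** 2    # feature xj²
    return w0, w1, w2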

(b) Gradient Descent Training Rule Derivation
The Gradient Descent (or Delta Rule) method works by minimizing an error function, typically
the sum of squared errors (SSE). For a single training example d, the error is:
Ed = ½ (td − od)²
Here, the output od is the unthresholded, continuous linear output of the unit:
od = w0 + Σ_{j=1}^{n} (wj,1 xd,j + wj,2 xd,j²)
The weight update rule is ∆w = −η∇Ed, which requires calculating the partial derivative of
the error with respect to each weight. We use the chain rule:

∂Ed/∂w = (∂Ed/∂od) · (∂od/∂w)
First, the derivative of the error with respect to the output od is:

∂Ed/∂od = ∂/∂od [ ½ (td − od)² ] = −(td − od)
Next, we find the derivatives of the output od with respect to each weight:
∂od/∂w0 = 1,  ∂od/∂wj,1 = xd,j,  ∂od/∂wj,2 = xd,j²
Putting it all together, the gradients are:
∂Ed/∂w0 = −(td − od),  ∂Ed/∂wj,1 = −(td − od)xd,j,  ∂Ed/∂wj,2 = −(td − od)xd,j²
The final update rules (∆w = −η∇Ed) are:

∆w0 = η(td − od)

∆wj,1 = η(td − od)xd,j

∆wj,2 = η(td − od)xd,j²
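For illustration, a minimal gradient-descent (delta rule) sketch under the same assumed weight layout as the perceptron sketch above; linear_output and eta are our names:

def linear_output(w0, w1, w2, x):
    # Unthresholded, continuous output over the (1, xj, xj²) feature set.
    return w0 + sum(w1[j] * xj + w2[j] * xj ** 2 for j, xj in enumerate(x))

def delta_rule_update(w0, w1, w2, x, t, eta=0.1):
    # One stochastic gradient step on Ed = 1/2 (t - od)².
    od = linear_output(w0, w1, w2, x)
    err = t - od
    w0 += eta * err
    for j, xj in enumerate(x):
        w1[j] += eta * err * xj
        w2[j] += eta * err * xj ** 2
    return w0, w1, w2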
Key Distinction: While the update rules look formally identical, the output term o is
critically different. In the Perceptron rule, op is the discrete, thresholded output (±1). In
Gradient Descent, od is the continuous, unthresholded linear output.
