
CS 7641 Problem Set Solutions | Georgia Institute of Technology



Question 1
We are asked to consider a supervised learning problem with a non-deterministic target, where
an instance x can map to a binary outcome y ∈ {0, 1}. The goal is to derive the appropriate
error function for finding the Maximum Likelihood (ML) hypothesis and compare it to the sum
of squared errors.

(a) Derivation of the Maximum Likelihood Error Function
To find the Maximum Likelihood Hypothesis, hML, we start with Bayes’ Rule. We want to find
the hypothesis h that maximizes the probability of observing the training data D, given the
hypothesis. This is the likelihood, P(D|h).
Assuming the training data points D = {(x1, y1), . . . , (xm, ym)} are independent and identically distributed (i.i.d.), the likelihood of the entire dataset is the product of the probabilities of each data point:

L(h) = P(D|h) = ∏_{i=1}^{m} P(yi | xi, h)
Our hypothesis h(xi) models the probability that the true value yi is 1. Therefore, we can
express the probability of observing a specific yi as:
• P(yi = 1|xi, h) = h(xi)
• P(yi = 0|xi, h) = 1 − h(xi)
This can be written more compactly as a single expression:

P(yi | xi, h) = h(xi)^yi · (1 − h(xi))^(1−yi)
Substituting this back into our likelihood function gives:
L(h) = ∏_{i=1}^{m} h(xi)^yi · (1 − h(xi))^(1−yi)

For computational convenience, we maximize the log-likelihood, l(h) = ln(L(h)), as this converts
the product into a sum and doesn’t change the location of the maximum.
l(h) = ln( ∏_{i=1}^{m} h(xi)^yi · (1 − h(xi))^(1−yi) ) = Σ_{i=1}^{m} [ yi ln(h(xi)) + (1 − yi) ln(1 − h(xi)) ]

Standard learning algorithms are formulated as minimizing an error function. Maximizing the
log-likelihood is equivalent to minimizing its negative. Therefore, the proper error function to
minimize is the negative log-likelihood, also known as the cross-entropy error:
E(h) = −l(h) = −Σ_{i=1}^{m} [ yi ln(h(xi)) + (1 − yi) ln(1 − h(xi)) ]
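To make the formula concrete, here is a minimal Python sketch (ours, not part of the original solutions) that evaluates this cross-entropy error on a toy set of predicted probabilities h(xi) and binary labels yi:

import math

def cross_entropy_error(h_values, y_values):
    # E(h) = -sum over i of [ yi ln h(xi) + (1 - yi) ln(1 - h(xi)) ]
    return -sum(y * math.log(h) + (1 - y) * math.log(1 - h)
                for h, y in zip(h_values, y_values))

# Toy data: predicted probabilities that y = 1, and the observed labels.
print(cross_entropy_error([0.9, 0.2, 0.7], [1, 0, 1]))  # ≈ 0.685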





(b) Comparison to Sum of Squared Errors
This derived cross-entropy error function can be compared to the sum of squared errors (SSE),
which is used for deterministic functions perturbed by zero-mean Gaussian noise.

• Underlying Assumption: The core difference lies in the assumed data distribution.
SSE assumes the target variable is continuous and any deviation from the true function
is due to Gaussian noise. Cross-entropy assumes the target variable is binary and follows
a Bernoulli distribution, which is exactly the setup described in the problem.

• Behavior with 0/1 Data: If a standard neural network used SSE with 0/1 data, it
would treat '0' and '1' as numerical targets and try to minimize the squared difference.
While this can work, it's not ideal. For example, if the model predicts 0.9 for a target
of 1, the squared error is small ((1 − 0.9)² = 0.01). If it confidently predicts 0.1 for a
target of 1, the error is much larger ((1 − 0.1)² = 0.81). Cross-entropy provides a much
steeper penalty for confident but incorrect predictions (e.g., predicting a probability near
0 when the true label is 1), which often leads to better training dynamics for classification
problems; the sketch after this list makes the contrast concrete.

• Behavior with Probability Estimates: If the training data consisted of (x, y) pairs
where y was an estimate of the probability (e.g., y = 0.75) instead of 0s and 1s, using
SSE would be more justifiable, as the target is now a continuous value. However, even in
this case, cross-entropy is often preferred as it is fundamentally designed for comparing
probability distributions.
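A small Python sketch (function names are ours) comparing the two per-example penalties for a true label of 1:

import math

def sse_penalty(p, y):
    # Squared error between prediction p and target y.
    return (y - p) ** 2

def ce_penalty(p, y):
    # Cross-entropy (negative log-likelihood) of label y under probability p.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p={p}: SSE={sse_penalty(p, 1):.3f}  CE={ce_penalty(p, 1):.3f}")
# As p moves toward 0 for a true label of 1, SSE saturates below 1.0
# while cross-entropy grows without bound.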


Question 2: Perceptron and Network Design
(a) Perceptron for the function A ∧ ¬B
To implement the boolean function A ∧ ¬B, we need a two-input perceptron that outputs 1
only when the input is (A = 1, B = 0). Let A be input x1 and B be input x2. The perceptron’s
output is 1 if w1x1 + w2x2 + w0 > 0, and 0 otherwise. Here, w0 represents the negative of the
threshold (w0 = −θ).
A valid set of parameters is:
• Weight for A (w1): 1

• Weight for B (w2): -1

• Bias (w0): -0.5 (which corresponds to a threshold θ = 0.5)

Let's verify this solution for all four possible inputs (a short script reproducing the check follows the list):

• A=0, B=0: 1(0) + (−1)(0) − 0.5 = −0.5 ≤ 0 → Output is 0. (Correct)

• A=0, B=1: 1(0) + (−1)(1) − 0.5 = −1.5 ≤ 0 → Output is 0. (Correct)

• A=1, B=1: 1(1) + (−1)(1) − 0.5 = −0.5 ≤ 0 → Output is 0. (Correct)

• A=1, B=0: 1(1) + (−1)(0) − 0.5 = +0.5 > 0 → Output is 1. (Correct)
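A minimal Python sketch of this unit with the parameters above (the function name is ours):

def a_and_not_b(a, b, w1=1.0, w2=-1.0, w0=-0.5):
    # Perceptron: output 1 if w1*A + w2*B + w0 > 0, else 0.
    return 1 if w1 * a + w2 * b + w0 > 0 else 0

# Verify A AND (NOT B) over the full truth table.
for a in (0, 1):
    for b in (0, 1):
        assert a_and_not_b(a, b) == (1 if (a, b) == (1, 0) else 0)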





(b) Two-Layer Network for A ⊕ B (XOR)
The XOR function is not linearly separable, so it cannot be implemented with a single perceptron. It requires a multi-layer network. We can construct XOR from other logical functions that are linearly separable, such as:

A ⊕ B = (A ∨ B) ∧ ¬(A ∧ B)

We can design a two-layer network where the hidden layer computes (A ∨ B) and ¬(A ∧ B),
and the output layer combines them with an AND operation.
Network Structure and Parameters:

• Hidden Layer Neuron 1 (H1 - OR gate): Computes A ∨ B.

– wA1 = 1, wB1 = 1, Bias w01 = −0.5 (θ = 0.5)

• Hidden Layer Neuron 2 (H2 - NAND gate): Computes ¬(A ∧ B).

– wA2 = −1, wB2 = −1, Bias w02 = 1.5 (θ = −1.5)
• Output Layer Neuron (O - AND gate): Takes inputs from H1 and H2.

– wH1 = 1, wH2 = 1, Bias w0O = −1.5 (θ = 1.5)
Network Diagram: inputs A and B each feed both hidden units; H1 computes A ∨ B, H2 computes ¬(A ∧ B), and both hidden outputs feed the output unit, which computes A ⊕ B.
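A minimal Python sketch wiring the three units with the weights listed above (step and xor_network are our names):

def step(z):
    # Threshold activation: 1 if z > 0, else 0.
    return 1 if z > 0 else 0

def xor_network(a, b):
    h1 = step(a + b - 0.5)        # H1: OR gate
    h2 = step(-a - b + 1.5)       # H2: NAND gate
    return step(h1 + h2 - 1.5)    # O: AND of H1 and H2

# Check against Python's built-in XOR on the full truth table.
for a in (0, 1):
    for b in (0, 1):
        assert xor_network(a, b) == (a ^ b)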




Question 3: Training Rule Derivations and Comparison
The problem describes a single unit where the output o is a linear combination of weights and
a feature set that includes both linear and quadratic terms of the inputs (xj and xj²). For clarity
in the derivation, let's define the output o with unique weights for each feature term:

o = w0 + Σ_{j=1}^{n} (wj,1 xj + wj,2 xj²)

This is still a linear unit because the output o is linear with respect to the weights, which is
the key property for these derivations.

(a) Perceptron Training Rule Derivation
The Perceptron Training Rule updates weights only when the model misclassifies an input. The
output of the perceptron, op, is the thresholded value of the linear combination.
op = sgn( w0 + Σ_{j=1}^{n} (wj,1 xj + wj,2 xj²) )

where the target output t is typically +1 or −1.
The update rule is designed to move the decision boundary to correctly classify the mislabeled instance. The general form is ∆wi = η(t − op) · featurei. Applying this to our specific features (1, xj, xj²):
• For the bias weight w0: The feature is 1.
∆w0 = η(t − op)
• For the linear term weight wj,1: The feature is xj.
∆wj,1 = η(t − op)xj
• For the quadratic term weight wj,2: The feature is xj².

∆wj,2 = η(t − op)xj²
These updates are applied for each misclassified training example.
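A sketch of one pass of this rule in Python, assuming the weights are stored as a scalar w0 and two lists w1, w2 indexed by feature j; the function names and the learning rate eta are our choices:

def perceptron_output(w0, w1, w2, x):
    # Thresholded output over the (1, xj, xj²) feature set, in {+1, -1}.
    s = w0 + sum(w1[j] * xj + w2[j] * xj ** 2 for j, xj in enumerate(x))
    return 1 if s > 0 else -1

def perceptron_update(w0, w1, w2, x, t, eta=0.1):
    # One application of the perceptron training rule for target t in {+1, -1}.
    op = perceptron_output(w0, w1, w2, x)
    delta = eta * (t - op)          # zero when the example is classified correctly
    w0 += delta                     # feature 1
    for j, xj in enumerate(x):
        w1[j] += delta * xj         # feature xj
        w2[j] += delta * xj ** 2    # feature xj²
    return w0, w1, w2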

(b) Gradient Descent Training Rule Derivation
The Gradient Descent (or Delta Rule) method works by minimizing an error function, typically
the sum of squared errors (SSE). For a single training example d, the error is:
Ed = ½ (td − od)²
Here, the output od is the unthresholded, continuous linear output of the unit:
od = w0 + Σ_{j=1}^{n} (wj,1 xd,j + wj,2 xd,j²)
The weight update rule is ∆w = −η∇Ed, which requires calculating the partial derivative of
the error with respect to each weight. We use the chain rule:

∂Ed/∂w = (∂Ed/∂od) · (∂od/∂w)
First, the derivative of the error with respect to the output od is:

∂Ed/∂od = ∂/∂od [ ½ (td − od)² ] = −(td − od)
Next, we find the derivatives of the output od with respect to each weight:
∂od/∂w0 = 1,  ∂od/∂wj,1 = xd,j,  ∂od/∂wj,2 = xd,j²
Putting it all together, the gradients are:
∂Ed/∂w0 = −(td − od),  ∂Ed/∂wj,1 = −(td − od)xd,j,  ∂Ed/∂wj,2 = −(td − od)xd,j²
The final update rules (∆w = −η∇Ed) are:

∆w0 = η(td − od)

∆wj,1 = η(td − od)xd,j

∆wj,2 = η(td − od)xd,j²
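For illustration, a minimal gradient-descent (delta rule) sketch under the same assumed weight layout as the perceptron sketch above; linear_output and eta are our names:

def linear_output(w0, w1, w2, x):
    # Unthresholded, continuous output over the (1, xj, xj²) feature set.
    return w0 + sum(w1[j] * xj + w2[j] * xj ** 2 for j, xj in enumerate(x))

def delta_rule_update(w0, w1, w2, x, t, eta=0.1):
    # One stochastic gradient step on Ed = 1/2 (t - od)².
    od = linear_output(w0, w1, w2, x)
    err = t - od
    w0 += eta * err
    for j, xj in enumerate(x):
        w1[j] += eta * err * xj
        w2[j] += eta * err * xj ** 2
    return w0, w1, w2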
Key Distinction: While the update rules look formally identical, the output term o is
critically different. In the Perceptron rule, op is the discrete, thresholded output (±1). In
Gradient Descent, od is the continuous, unthresholded linear output.
