Exam (elaborations)

Coursera: Machine Learning - All weeks solutions [Assignment + Quiz] - Andrew NG

Pages: 169
Grade: A+
Uploaded on: 08-06-2021
Written in: 2020/2021

=== Week 1 ===

Assignments:
• No Assignment for Week 1

Introduction :

1. A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. Suppose we feed a learning algorithm a lot of historical weather data, and have it learn to predict weather. What would be a reasonable choice for P?
o The probability of it correctly predicting a future date’s weather.
o The weather prediction task.
o The process of the algorithm examining a large amount of historical weather data.
o None of these.

1. A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. Suppose we feed a learning algorithm a lot of historical weather data, and have it learn to predict weather. In this setting, what is T?
o The weather prediction task.
o None of these.
o The probability of it correctly predicting a future date’s weather.
o The process of the algorithm examining a large amount of historical weather data.

2. Suppose you are working on weather prediction, and use a learning algorithm to predict tomorrow’s temperature (in degrees Centigrade/Fahrenheit). Would you treat this as a classification or a regression problem?
o Regression
o Classification

2. Suppose you are working on weather prediction, and your weather station makes one of three predictions for each day’s weather: Sunny, Cloudy or Rainy. You’d like to use a learning algorithm to predict tomorrow’s weather. Would you treat this as a classification or a regression problem?
o Regression
o Classification

3.
Suppose you are working on stock market prediction, and you would like to predict the price of a particular stock tomorrow (measured in dollars). You want to use a learning algorithm for this. Would you treat this as a classification or a regression problem?
o Regression
o Classification

3. Suppose you are working on stock market prediction. You would like to predict whether or not a certain company will declare bankruptcy within the next 7 days (by training on data of similar companies that had previously been at risk of bankruptcy). Would you treat this as a classification or a regression problem?
o Regression
o Classification

3. Suppose you are working on stock market prediction. Typically tens of millions of shares of Microsoft stock are traded (i.e., bought/sold) each day. You would like to predict the number of Microsoft shares that will be traded tomorrow. Would you treat this as a classification or a regression problem?
o Regression
o Classification

4. Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from.
o Given historical data of children’s ages and heights, predict children’s height as a function of their age.
o Given 50 articles written by male authors, and 50 articles written by female authors, learn to predict the gender of a new manuscript’s author (when the identity of this author is unknown).
o Take a collection of 1000 essays written on the US Economy, and find a way to automatically group these essays into a small number of groups of essays that are somehow “similar” or “related”.
o Examine a large collection of emails that are known to be spam email, to discover if there are sub-types of spam mail.

4.
Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from.
o Given data on how 1000 medical patients respond to an experimental drug (such as effectiveness of the treatment, side effects, etc.), discover whether there are different categories or “types” of patients in terms of how they respond to the drug, and if so what these categories are.
o Given a large dataset of medical records from patients suffering from heart disease, try to learn whether there might be different clusters of such patients for which we might tailor separate treatments.
o Have a computer examine an audio clip of a piece of music, and classify whether or not there are vocals (i.e., a human voice singing) in that audio clip, or if it is a clip of only musical instruments (and no vocals).
o Given genetic (DNA) data from a person, predict the odds of him/her developing diabetes over the next 10 years.

Linear Regression with One Variable :

1. Consider the problem of predicting how well a student does in her second year of college/university, given how well she did in her first year. Specifically, let x be equal to the number of “A” grades (including A-, A and A+ grades) that a student receives in their first year of college (freshman year). We would like to predict the value of y, which we define as the number of “A” grades they get in their second year (sophomore year). Here each row is one training example. Recall that in linear regression, our hypothesis is h_θ(x) = θ_0 + θ_1 x, and we use m to denote the number of training examples. For the training set given above (note that this training set may also be referenced in other questions in this quiz), what is the value of m? In the box below, please enter your answer (which should be a number between 0 and 10).
4

2.
Many substances that can burn (such as gasoline and alcohol) have a chemical structure based on carbon atoms; for this reason they are called hydrocarbons. A chemist wants to understand how the number of carbon atoms in a molecule affects how much energy is released when that molecule combusts (meaning that it is burned). The chemist obtains the dataset below. In the column on the right, “kJ/mol” is the unit measuring the amount of energy released. You would like to use linear regression (h_θ(x) = θ_0 + θ_1 x) to estimate the amount of energy released (y) as a function of the number of carbon atoms (x). Which of the following do you think will be the values you obtain for θ_0 and θ_1? You should be able to select the right answer without actually implementing linear regression.
o θ_0 = −569.6, θ_1 = 530.9
o θ_0 = −1780.0, θ_1 = −530.9
o θ_0 = −569.6, θ_1 = −530.9
o θ_0 = −1780.0, θ_1 = 530.9

2. For this question, assume that we are using the training set from Q1. Recall our definition of the cost function was J(θ_0, θ_1) = (1/(2m)) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2. What is J(…)? In the box below, please enter your answer (simplify fractions to decimals when entering the answer, and use ‘.’ as the decimal delimiter, e.g., 1.5).
0.5

3. Suppose we set θ_0 = …, θ_1 = … in the linear regression hypothesis from Q1. What is h_θ(…)?
3

3. Suppose we set θ_0 = −2, θ_1 = 0.5 in the linear regression hypothesis from Q1. What is h_θ(…)?
1

4. Let f be some function so that f(θ_0, θ_1) outputs a number. For this problem, f is some arbitrary/unknown smooth function (not necessarily the cost function of linear regression, so f may have local optima). Suppose we use gradient descent to try to minimize f(θ_0, θ_1) as a function of θ_0 and θ_1. Which of the following statements are true? (Check all that apply.)
o If θ_0 and θ_1 are initialized at the global minimum, then one iteration will not change their values.
o Setting the learning rate α to be very small is not harmful, and can only speed up the convergence of gradient descent.
o No matter how θ_0 and θ_1 are initialized, so long as α is sufficiently small, we can safely expect gradient descent to converge to the same solution.
o If the first few iterations of gradient descent cause f(θ_0, θ_1) to increase rather than decrease, then the most likely cause is that we have set the learning rate α to too large a value.

4. In the given figure, the cost function J(θ_0, θ_1) has been plotted against θ_0 and θ_1, as shown in ‘Plot 2’. The contour plot for the same cost function is given in ‘Plot 1’. Based on the figure, choose the correct options (check all that apply).
o If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach or get near point A, as the value of the cost function is maximum at point A.
o If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach or get near point C, as the value of the cost function is minimum at point C.
o Point P (the global minimum of Plot 2) corresponds to point A of Plot 1.
o If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach or get near point A, as the value of the cost function is minimum at A.
o Point P (the global minimum of Plot 2) corresponds to point C of Plot 1.

Linear Algebra :

1. Let two matrices be A = …, B = …. What is A - B?
o
o
o
o

1. Let two matrices be A = …, B = …. What is A + B?
o
o
o
o

2. Let x = …. What is 2∗x?
o
  Correct: To multiply the vector x by 2, take each element of x and multiply that element by 2.
o
o
o

2. Let x = …. What is 2∗x?
o
o
  Correct: To multiply the vector x by 2, take each element of x and multiply that element by 2.
o
o

3. Let u be a 3-dimensional vector, where specifically u = …. What is …?
o
o
o
o

4. Let u and v be 3-dimensional vectors, where specifically u = … and v = …. What is u^T v? (Hint: u^T is a 1x3 dimensional matrix, and v can also be seen as a 3x1 matrix. The answer you want can be obtained by taking the matrix product of u^T and v.) Do not add brackets to your answer.
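The hint in question 4 treats u^T v as a (1x3)(3x1) matrix product, which reduces to summing the elementwise products of the two vectors. A minimal Python sketch of that computation follows; the function name inner_product and the vectors u and v are made-up illustrations, not the quiz's values.

```python
def inner_product(u, v):
    """Return u^T v for two equal-length vectors given as plain lists."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    # (1 x n) times (n x 1): multiply matching entries and add them up.
    return sum(ui * vi for ui, vi in zip(u, v))


# Hypothetical 3-dimensional vectors, chosen only for illustration.
u = [1, 0, -2]
v = [3, 5, 1]

result = inner_product(u, v)  # 1*3 + 0*5 + (-2)*1
print(result)
```

The product of a 1x3 matrix and a 3x1 matrix is a single number, which is why the question asks for an answer without brackets.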
-4

=== Week 2 ===

Assignments:
It consists of the following files:
• ex1.m - Octave/MATLAB script that steps you through the exercise
• ex1_multi.m - Octave/MATLAB script for the later parts of the exercise
• - Dataset for linear regression with one variable
• - Dataset for linear regression with multiple variables
• submit.m - Submission script that sends your solutions to our servers
• [*] warmUpExercise.m - Simple example function in Octave/MATLAB
• [*] plotData.m - Function to display the dataset
• [*] computeCost.m - Function to compute the cost of linear regression
• [*] gradientDescent.m - Function to run gradient descent
• [#] computeCostMulti.m - Cost function for multiple variables
• [#] gradientDescentMulti.m - Gradient descent for multiple variables
• [#] featureNormalize.m - Function to normalize features
• [#] normalEqn.m - Function to compute the normal equations
* indicates files you will need to complete
# indicates optional exercises

warmUpExercise.m :

function A = warmUpExercise()
%WARMUPEXERCISE Example function in octave
%   A = WARMUPEXERCISE() is an example function that returns the 5x5 identity matrix

A = [];
% ============= YOUR CODE HERE ==============
% Instructions: Return the 5x5 identity matrix
%               In octave, we return values by defining which variables
%               represent the return values (at the top of the file)
%               and then set them accordingly.

A = eye(5);  % eye(n) is the built-in function that creates an n x n identity matrix

% ===========================================
end

plotData.m :

function plotData(x, y)
%PLOTDATA Plots the data points x and y into a new figure
%   PLOTDATA(x,y) plots the data points and gives the figure axes labels of
%   population and profit.

figure; % open a new figure window

% ====================== YOUR CODE HERE ======================
% Instructions: Plot the training data into a figure using the
%               "figure" and "plot" commands.
%               Set the axes labels using
%               the "xlabel" and "ylabel" commands. Assume the
%               population and revenue data have been passed in
%               as the x and y arguments of this function.
%
% Hint: You can use the 'rx' option with plot to have the markers
%       appear as red crosses. Furthermore, you can make the
%       markers larger by using plot(..., 'rx', 'MarkerSize', 10);

plot(x, y, 'rx', 'MarkerSize', 10);       % Plot the data
ylabel('Profit in $10,000s');             % Set the y-axis label
xlabel('Population of City in 10,000s');  % Set the x-axis label

% ============================================================
end

computeCost.m :

function J = computeCost(X, y, theta)
%COMPUTECOST Compute cost for linear regression
%   J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta
%               You should set J to the cost.
%%%%%%%%%%%%% CORRECT %%%%%%%%%
% h = X*theta;
% temp = 0;
% for i=1:m
%   temp = temp + (h(i) - y(i))^2;
% end
% J = (1/(2*m)) * temp;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%% CORRECT: Vectorized Implementation %%%%%%%%%
J = (1/(2*m))*sum(((X*theta)-y).^2);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% =========================================================================
end

gradientDescent.m :

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

    % ====================== YOUR CODE HERE ======================
    % Instructions: Perform a single gradient step on the parameter vector
    %               theta.
    %
    % Hint: While debugging, it can be useful to print out the values
    %       of the cost function (computeCost) and gradient here.
    %
    %%%%%%%%% CORRECT %%%%%%%
    %error = (X * theta) - y;
    %temp0 = theta(1) - ((alpha/m) * sum(error .* X(:,1)));
    %temp1 = theta(2) - ((alpha/m) * sum(error .* X(:,2)));
    %theta = [temp0; temp1];
    %%%%%%%%%%%%%%%%%%%%%%%%%

    %%%%%%%%% CORRECT %%%%%%%
    %error = (X * theta) - y;
    %temp0 = theta(1) - ((alpha/m) * X(:,1)'*error);
    %temp1 = theta(2) - ((alpha/m) * X(:,2)'*error);
    %theta = [temp0; temp1];
    %%%%%%%%%%%%%%%%%%%%%%%%%

    %%%%%%%%% CORRECT %%%%%%%
    error = (X * theta) - y;                  % m x 1 vector of residuals
    theta = theta - ((alpha/m) * X'*error);   % simultaneous update of all parameters
    %%%%%%%%%%%%%%%%%%%%%%%%%

    % ============================================================

    % Save the cost J in every iteration
    J_history(iter) = computeCost(X, y, theta);

end
end

computeCostMulti.m :

function J = computeCostMulti(X, y, theta)
%COMPUTECOSTMULTI Compute cost for linear regression with multiple variables
%   J = COMPUTECOSTMULTI(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta
%               You should set J to the cost.

J = (1/(2*m))*(sum(((X*theta)-y).^2));

% =========================================================================
end

gradientDescentMulti.m :

function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
%GRADIENTDESCENTMULTI Performs gradient descent to learn theta
%   theta = GRADIENTDESCENTMULTI(x, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

    % ====================== YOUR CODE HERE ======================
    % Instructions: Perform a single gradient step on the parameter vector
    %               theta.
    %
    % Hint: While debugging, it can be useful to print out the values
    %       of the cost function (computeCostMulti) and gradient here.
    %
    %%%%%%%% CORRECT %%%%%%%%%%
    error = (X * theta) - y;                  % m x 1 vector of residuals
    theta = theta - ((alpha/m) * X'*error);   % simultaneous update of all parameters
    %%%%%%%%%%%%%%%%%%%%%%%%%%%

    % ============================================================

    % Save the cost J in every iteration
    J_history(iter) = computeCostMulti(X, y, theta);

end
end

featureNormalize.m :

function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
%   FEATURENORMALIZE(X) returns a normalized version of X where
%   the mean value of each feature is 0 and the standard deviation
%   is 1. This is often a good preprocessing step to do when
%   working with learning algorithms.

% You need to set these values correctly
X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));

% ====================== YOUR CODE HERE ======================
% Instructions: First, for each feature dimension, compute the mean
%               of the feature and subtract it from the dataset,
%               storing the mean value in mu. Next, compute the
%               standard deviation of each feature and divide
%               each feature by its standard deviation, storing
%               the standard deviation in sigma.
%
%               Note that X is a matrix where each column is a
%               feature and each row is an example. You need
%               to perform the normalization separately for
%               each feature.
%
% Hint: You might find the 'mean' and 'std' functions useful.
%

mu = mean(X);               % 1 x n row vector of column means
sigma = std(X);             % 1 x n row vector of column standard deviations
X_norm = (X - mu)./sigma;   % mu and sigma are broadcast across the rows of X

% ============================================================
end

normalEqn.m :

function [theta] = normalEqn(X, y)
%NORMALEQN Computes the closed-form solution to linear regression
%   NORMALEQN(X,y) computes the closed-form solution to linear
%   regression using the normal equations.
theta = zeros(size(X, 2), 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Complete the code to compute the closed form solution
%               to linear regression and put the result in theta.
%

% ---------------------- Sample Solution ----------------------

theta = pinv(X'*X)*X'*y;

% -------------------------------------------------------------
% ============================================================
end

Linear Regression with Multiple Variables :

1. Suppose m=4 students have taken some classes, and the class had a midterm exam and a final exam. You have collected a dataset of their scores on the two exams, which is as follows: You’d like to use polynomial regression to predict a student’s final exam score from their midterm exam score. Concretely, suppose you want to fit a model of the form h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2, where x_1 is the midterm score and x_2 is (midterm score)^2. Further, you plan to use both feature scaling (dividing by the “max-min”, or range, of a feature) and mean normalization. What is the normalized feature …? (Hint: midterm = 69, final = 78 is training example 4.) Please round off your answer to two decimal places and enter in the text box below.
-0.47

2. You run gradient descent for 15 iterations with α = 0.3 and compute J(θ) after each iteration. You find that the value of J(θ) decreases slowly and is still decreasing after 15 iterations. Based on this, which of the following conclusions seems most plausible?
o Rather than use the current value of α, it’d be more promising to try a larger value of α (say α = 1.0).
o Rather than use the current value of α, it’d be more promising to try a smaller value of α (say α = 0.1).
o α = 0.3 is an effective choice of learning rate.

2. You run gradient descent for 15 iterations with α = 0.3 and compute J(θ) after each iteration. You find that the value of J(θ) decreases quickly then levels off. Based on this, which of the following conclusions seems most plausible?
o Rather than use the current value of α, it’d be more promising to try a larger value of α (say α = 1.0).
o Rather than use the current value of α, it’d be more promising to try a smaller value of α (say α = 0.1).
o α = 0.3 is an effective choice of learning rate.

3. Suppose you have m = 23 training examples with n = 5 features (excluding the additional all-ones feature for the intercept term, which you should add). The normal equation is θ = (X^T X)^(-1) X^T y. For the given values of m and n, what are the dimensions of θ, X, and y in this equation?
o X is 23 × 5, y is 23 × 1, θ is 5 × 5
o X is 23 × 6, y is 23 × 6, θ is 6 × 6
o X is 23 × 6, y is 23 × 1, θ is 6 × 1
  X has m rows and n+1 columns (+1 because of the x_0 = 1 term). y is an m-vector. θ is an (n+1)-vector.
o X is 23 × 5, y is 23 × 1, θ is 5 × 1

4. Suppose you have a dataset with m = … examples and n = … features for each example. You want to use multivariate linear regression to fit the parameters θ to our data. Should you prefer gradient descent or the normal equation?
o Gradient descent, since it will always converge to the optimal θ.
o Gradient descent, since (X^T X)^(-1) will be very slow to compute in the normal equation.
  With n = … features, you will have to invert a … x … matrix to compute the normal equation. Inverting such a large matrix is computationally expensive, so gradient descent is a good choice.
o The normal equation, since it provides an efficient way to directly find the solution.
o The normal equation, since gradient descent might be unable to find the optimal θ.

Octave / Matlab Tutorial :

1. Suppose I first execute the following Octave/Matlab commands:
A = [1 2; 3 4; 5 6];
B = [1 2 3; 4 5 6];
Which of the following are then valid commands? Check all that apply. (Hint: A' denotes the transpose of A.)
o C = A * B;
o C = B' + A;
o C = A' * B;
o C = B + A;

2. Let A = …. Which of the following indexing expressions gives B = …? Check all that apply.
o B = A(:, 1:2);
o B = A(1:4, 1:2);
o B = A(:, 0:2);
o B = A(0:4, 0:2);

3.
Let A be a 10x10 matrix and x be a 10-element vector. Your friend wants to compute the product Ax and writes the following code:
v = zeros(10, 1);
for i = 1:10
  for j = 1:10
    v(i) = v(i) + A(i, j) * x(j);
  end
end
How would you vectorize this code to run without any for loops? Check all that apply.
o v = A * x;
o v = Ax;
o v = x' * A;
o v = sum (A * x);

4. Say you have two column vectors v and w, each with 7 elements (i.e., they have dimensions 7x1). Consider the following code:
z = 0;
for i = 1:7
  z = z + v(i) * w(i);
end
Which of the following vectorizations correctly compute z? Check all that apply.
o z = sum (v .* w);
o z = w' * v;
o z = v * w';
o z = w * v';

=== Week 3 ===

Assignment
It consists of the following files:
• ex2.m - Octave/MATLAB script that steps you through the exercise
• ex2_reg.m - Octave/MATLAB script for the later parts of the exercise
• - Training set for the first half of the exercise
• - Training set for the second half of the exercise
• submit.m - Submission script that sends your solutions to our servers
• mapFeature.m - Function to generate polynomial features
• plotDecisionBoundary.m - Function to plot classifier's decision boundary
• [*] plotData.m - Function to plot 2D classification data
• [*] sigmoid.m - Sigmoid Function
• [*] costFunction.m - Logistic Regression Cost Function
• [*] predict.m - Logistic Regression Prediction Function
• [*] costFunctionReg.m - Regularized Logistic Regression Cost
* indicates files you will need to complete

plotData.m :

function plotData(X, y)
%PLOTDATA Plots the data points X and y into a new figure
%   PLOTDATA(x,y) plots the data points with + for the positive examples
%   and o for the negative examples. X is assumed to be a Mx2 matrix.
% ====================== YOUR CODE HERE ======================
% Instructions: Plot the positive and negative examples on a
%               2D plot, using the option 'k+' for the positive
%               examples and 'ko' for the negative examples.
%

% Separating positive and negative results
pos = find(y==1); % index of positive results
neg = find(y==0); % index of negative results

% Create New Figure
figure;

% Plotting Positive Results on
%   X_axis: Exam1 Score = X(pos,1)
%   Y_axis: Exam2 Score = X(pos,2)
plot(X(pos,1),X(pos,2),'g+');

% To keep above plotted graph as it is.
hold on;

% Plotting Negative Results on
%   X_axis: Exam1 Score = X(neg,1)
%   Y_axis: Exam2 Score = X(neg,2)
plot(X(neg,1),X(neg,2),'ro');

% =========================================================================

hold off;
end

sigmoid.m :

function g = sigmoid(z)
%SIGMOID Compute sigmoid function
%   g = SIGMOID(z) computes the sigmoid of z.

% You need to return the following variables correctly
g = zeros(size(z));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the sigmoid of each value of z (z can be a matrix,
%               vector or scalar).

g = 1./(1+exp(-z));

% =============================================================
end

costFunction.m :

function [J, grad] = costFunction(theta, X, y)
%COSTFUNCTION Compute cost and gradient for logistic regression
%   J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
%   parameter for logistic regression and the gradient of the cost
%   w.r.t. the parameters.

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
%               You should set J to the cost.
%               Compute the partial derivatives and set grad to the partial
%               derivatives of the cost w.r.t.
%               each parameter in theta
%
% Note: grad should have the same dimensions as theta
%
% DIMENSIONS:
%   theta = (n+1) x 1
%   X     = m x (n+1)
%   y     = m x 1
%   grad  = (n+1) x 1
%   J     = Scalar

z = X * theta;      % m x 1
h_x = sigmoid(z);   % m x 1

J = (1/m)*sum((-y.*log(h_x))-((1-y).*log(1-h_x))); % scalar

grad = (1/m)* (X'*(h_x-y));     % (n+1) x 1

% =============================================================
end

predict.m :

function p = predict(theta, X)
%PREDICT Predict whether the label is 0 or 1 using learned logistic
%regression parameters theta
%   p = PREDICT(theta, X) computes the predictions for X using a
%   threshold at 0.5 (i.e., if sigmoid(theta'*x) >= 0.5, predict 1)

m = size(X, 1); % Number of training examples

% You need to return the following variables correctly
p = zeros(m, 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
%               your learned logistic regression parameters.
%               You should set p to a vector of 0's and 1's
%
% Dimensions:
%   X     = m x (n+1)
%   theta = (n+1) x 1

h_x = sigmoid(X*theta);
p=(h_x>=0.5);

%p = double(sigmoid(X * theta)>=0.5);

% =========================================================================
end

costFunctionReg.m :

function [J, grad] = costFunctionReg(theta, X, y, lambda)
%COSTFUNCTIONREG Compute cost and gradient for logistic regression with regularization
%   J = COSTFUNCTIONREG(theta, X, y, lambda) computes the cost of using
%   theta as the parameter for regularized logistic regression and the
%   gradient of the cost w.r.t. the parameters.

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
%               You should set J to the cost.
%               Compute the partial derivatives and set grad to the partial
%               derivatives of the cost w.r.t.
%               each parameter in theta

% DIMENSIONS:
%   theta = (n+1) x 1
%   X     = m x (n+1)
%   y     = m x 1
%   grad  = (n+1) x 1
%   J     = Scalar

z = X * theta;      % m x 1
h_x = sigmoid(z);   % m x 1

% theta(1) is the intercept term and is not regularized
reg_term = (lambda/(2*m)) * sum(theta(2:end).^2);

J = (1/m)*sum((-y.*log(h_x))-((1-y).*log(1-h_x))) + reg_term; % scalar

grad(1) = (1/m)* (X(:,1)'*(h_x-y));                                  % 1 x 1
grad(2:end) = (1/m)* (X(:,2:end)'*(h_x-y))+(lambda/m)*theta(2:end);  % n x 1

% =============================================================
end

Logistic Regression :

1. Suppose that you have trained a logistic regression classifier, and it outputs on a new example x a prediction h_θ(x) = 0.2. This means (check all that apply):
o Our estimate for P(y = 1|x; θ) is 0.8.
  h(x) gives P(y=1|x; θ), not 1 - P(y=1|x; θ).
o Our estimate for P(y = 0|x; θ) is 0.8.
  Since we must have P(y=0|x; θ) = 1 - P(y=1|x; θ), the former is 1 - 0.2 = 0.8.
o Our estimate for P(y = 1|x; θ) is 0.2.
  h(x) is precisely P(y=1|x; θ), so each is 0.2.
o Our estimate for P(y = 0|x; θ) is 0.2.
  h(x) is P(y=1|x; θ), not P(y=0|x; θ).

2. Suppose you have the following training set, and fit a logistic regression classifier h_θ(x) = …. Which of the following are true? Check all that apply.
o Adding polynomial features (e.g., instead using h_θ(x) = …) could increase how well we can fit the training data.
o At the optimal value of θ (e.g., found by fminunc), we will have J(θ) ≥ 0.
o Adding polynomial features (e.g., instead using h_θ(x) = …) would increase J(θ) because we are now summing over more terms.
o If we train gradient descent for enough iterations, for some examples in the training set it is possible to obtain ….

3. For logistic regression, the gradient is given by ∂J(θ)/∂θ_j = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i). Which of these is a correct gradient descent update for logistic regression with a learning rate of α? Check all that apply.
o … (simultaneously update for all j).
o ….
o … (simultaneously update for all j).
o … (simultaneously update for all j).

4. Which of the following statements are true? Check all that apply.
o The one-vs-all technique allows you to use logistic regression for problems in which each y^(i) comes from a fixed, discrete set of values.
  If each y^(i) is one of k different values, we can give a label to each y^(i) and use one-vs-all as described in the lecture.
o For logistic regression, sometimes gradient descent will converge to a local minimum (and fail to find the global minimum). This is the reason we prefer more advanced optimization algorithms such as fminunc (conjugate gradient/BFGS/L-BFGS/etc).
  The cost function for logistic regression is convex, so gradient descent will always converge to the global minimum. We still might use a more advanced optimization algorithm since they can be faster and don’t require you to select a learning rate.
o The cost function for logistic regression trained with … examples is always greater than or equal to zero.
  The cost for any example is always ≥ 0 since it is the negative log of a quantity less than one. The cost function is a summation over the cost for each sample, so the cost function itself must be greater than or equal to zero.
o Since we train one classifier when there are two classes, we train two classifiers when there are three classes (and we do one-vs-all classification).
  We will need 3 classifiers, one for each class.

Suppose you train a logistic classifier h_θ(x) = …. Suppose θ_0 = …, θ_1 = …, θ_2 = …. Which of the following figures represents the decision boundary found by your classifier?
• Figure:
  In this figure, we transition from negative to positive when x1 goes from left of 6 to right of 6, which is true for the given values of θ.
• Figure:
• Figure:
• Figure:

Regularization :

1. You are training a classification model with logistic regression. Which of the following statements are true? Check all that apply.
o Introducing regularization to the model always results in equal or better performance on the training set.
o Introducing regularization to the model always results in equal or better performance on examples not in the training set.
o Adding a new feature to the model always results in equal or better performance on the training set.
o Adding many new features to the model helps prevent overfitting on the training set.

2. Suppose you ran logistic regression twice, once with λ = …, and once with λ = …. One of the times, you got parameters θ = …, and the other time you got θ = …. However, you forgot which value of λ corresponds to which value of θ. Which one do you think corresponds to λ = 1?
o
  When λ is set to 1, we use regularization to penalize large values of θ. Thus, the parameter θ obtained will in general have smaller values.
o

2. Suppose you ran logistic regression twice, once with λ = …, and once with λ = …. One of the times, you got parameters θ = …, and the other time you got θ = …. However, you forgot which value of λ corresponds to which value of θ. Which one do you think corresponds to λ = 1?
o
o
  When λ is set to 1, we use regularization to penalize large values of θ. Thus, the parameter θ obtained will in general have smaller values.

3. Which of the following statements about regularization are true? Check all that apply.
o Using a very large value of λ cannot hurt the performance of your hypothesis; the only reason we do not set λ to be too large is to avoid numerical problems.
o Because logistic regression outputs values 0 ≤ h_θ(x) ≤ 1, its range of output values can only be “shrunk” slightly by regularization anyway, so regularization is generally not helpful for it.
o Consider a classification problem. Adding regularization may cause your classifier to incorrectly classify some training examples (which it had correctly classified when not using regularization, i.e. when λ = 0).
o Using too large a value of λ can cause your hypothesis to overfit the data; this can be avoided by reducing λ.

3. Which of the following statements about regularization are true? Check all that apply.
o Using a very large value of λ cannot hurt the performance of your hypothesis; the only reason we do not set λ to be too large is to avoid numerical problems.

o Because logistic regression outputs values 0 ≤ h_θ(x) ≤ 1, its range of output values can only be “shrunk” slightly by regularization anyway, so regularization is generally not helpful for it.

o Because regularization causes J(θ) to no longer be convex, gradient descent may not always converge to the global minimum (when λ > 0, and when using an appropriate learning rate α).

o Using too large a value of λ can cause your hypothesis to underfit the data; this can be avoided by reducing λ.

4. In which one of the following figures do you think the hypothesis has overfit the training set?

o Figure:

o Figure:

o Figure:

o Figure:

5. In which one of the following figures do you think the hypothesis has underfit the training set?

• Figure:

=== Week 4 ===
Assignments:
It consists of the following files:
• ex3.m - Octave/MATLAB script that steps you through part 1
• ex3_nn.m - Octave/MATLAB script that steps you through part 2
• ex3data1.mat - Training set of hand-written digits
• ex3weights.mat - Initial weights for the neural network exercise
• submit.m - Submission script that sends your solutions to our servers
• displayData.m - Function to help visualize the dataset
• fmincg.m - Function minimization routine (similar to fminunc)
• sigmoid.m - Sigmoid function
• [*] lrCostFunction.m - Logistic regression cost function
• [*] oneVsAll.m - Train a one-vs-all multi-class classifier
• [*] predictOneVsAll.m - Predict using a one-vs-all multi-class classifier
• [*] predict.m - Neural network prediction function
* indicates files you will need to complete

lrCostFunction.m :
function [J, grad] = lrCostFunction(theta, X, y, lambda)
%LRCOSTFUNCTION Compute cost and gradient for logistic regression with
%regularization
%   J = LRCOSTFUNCTION(theta, X, y, lambda) computes the cost of using
%   theta as the parameter for
%   regularized logistic regression and the
%   gradient of the cost w.r.t. to the parameters.

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
%               You should set J to the cost.
%               Compute the partial derivatives and set grad to the partial
%               derivatives of the cost w.r.t. each parameter in theta
%
% Hint: The computation of the cost function and gradients can be
%       efficiently vectorized. For example, consider the computation
%
%           sigmoid(X * theta)
%
%       Each row of the resulting matrix will contain the value of the
%       prediction for that example. You can make use of this to vectorize
%       the cost function and gradient computations.
%
% Hint: When computing the gradient of the regularized cost function,
%       there're many possible vectorized solutions, but one solution
%       looks like:
%           grad = (unregularized gradient for logistic regression)
%           temp = theta;
%           temp(1) = 0;   % because we don't add anything for j = 0
%           grad = grad + YOUR_CODE_HERE (using the temp variable)
%
%DIMENSIONS:
%   theta = (n+1) x 1
%   X     = m x (n+1)
%   y     = m x 1
%   grad  = (n+1) x 1
%   J     = Scalar

z = X * theta;    % m x 1
h_x = sigmoid(z); % m x 1

reg_term = (lambda/(2*m)) * sum(theta(2:end).^2);

J = (1/m)*sum((-y.*log(h_x))-((1-y).*log(1-h_x))) + reg_term; % scalar

grad(1) = (1/m) * (X(:,1)'*(h_x-y));                                   % 1 x 1
grad(2:end) = (1/m) * (X(:,2:end)'*(h_x-y)) + (lambda/m)*theta(2:end); % n x 1

% =============================================================

grad = grad(:);

end

oneVsAll.m :
function [all_theta] = oneVsAll(X, y, num_labels, lambda)
%ONEVSALL trains multiple logistic regression classifiers and returns all
%the classifiers in a matrix all_theta, where the i-th row of all_theta
%corresponds to the classifier for label i
%   [all_theta] = ONEVSALL(X, y, num_labels, lambda)
%   trains num_labels
%   logistic regression classifiers and returns each of these classifiers
%   in a matrix all_theta, where the i-th row of all_theta corresponds
%   to the classifier for label i

% num_labels = No. of output classifier (Here, it is 10)

% Some useful variables
m = size(X, 1); % No. of Training Samples == No. of Images : (Here, 5000)
n = size(X, 2); % No. of features == No. of pixels in each Image : (Here, 400)

% You need to return the following variables correctly
all_theta = zeros(num_labels, n + 1);
%DIMENSIONS: num_labels x (input_layer_size+1) == num_labels x (no_of_features+1) == 10 x 401
%DIMENSIONS: X = m x input_layer_size
%Here, 1 row in X represents 1 training Image of pixel 20x20

% Add ones to the X data matrix
X = [ones(m, 1) X]; %DIMENSIONS: X = m x (input_layer_size+1) = m x (no_of_features+1)

% ====================== YOUR CODE HERE ======================
% Instructions: You should complete the following code to train num_labels
%               logistic regression classifiers with regularization
%               parameter lambda.
%
% Hint: theta(:) will return a column vector.
%
% Hint: You can use y == c to obtain a vector of 1's and 0's that tell you
%       whether the ground truth is true/false for this class.
%
% Note: For this assignment, we recommend using fmincg to optimize the cost
%       function. It is okay to use a for-loop (for c = 1:num_labels) to
%       loop over the different classes.
%
%       fmincg works similarly to fminunc, but is more efficient when we
%       are dealing with large number of parameters.
%
% Example Code for fmincg:
%
%     % Set Initial theta
%     initial_theta = zeros(n + 1, 1);
%
%     % Set options for fminunc
%     options = optimset('GradObj', 'on', 'MaxIter', 50);
%
%     % Run fmincg to obtain the optimal theta
%     % This function will return theta and the cost
%     [theta] = ...
%         fmincg (@(t)(lrCostFunction(t, X, (y == c), lambda)), ...
%                 initial_theta, options);
%

initial_theta = zeros(n+1, 1);
options = optimset('GradObj', 'on', 'MaxIter', 50);

for c = 1:num_labels
    all_theta(c,:) = ...
        fmincg (@(t)(lrCostFunction(t, X, (y == c), lambda)), ...
                initial_theta, options);
end

% =========================================================================

end

predictOneVsAll.m :
function p = predictOneVsAll(all_theta, X)
%PREDICT Predict the label for a trained one-vs-all classifier. The labels
%are in the range 1..K, where K = size(all_theta, 1).
%   p = PREDICTONEVSALL(all_theta, X) will return a vector of predictions
%   for each example in the matrix X. Note that X contains the examples in
%   rows. all_theta is a matrix where the i-th row is a trained logistic
%   regression theta vector for the i-th class. You should set p to a vector
%   of values from 1..K (e.g., p = [1; 3; 1; 2] predicts classes 1, 3, 1, 2
%   for 4 examples)

m = size(X, 1); % No. of Input Examples to Predict (Each row = 1 Example)
num_labels = size(all_theta, 1); % No. of Output Classifier

% You need to return the following variables correctly
p = zeros(size(X, 1), 1); % No_of_Input_Examples x 1 == m x 1

% Add ones to the X data matrix
X = [ones(m, 1) X];

% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
%               your learned logistic regression parameters (one-vs-all).
%               You should set p to a vector of predictions (from 1 to
%               num_labels).
%
% Hint: This code can be done all vectorized using the max function.
%       In particular, the max function can also return the index of the
%       max element, for more information see 'help max'. If your examples
%       are in rows, then, you can use max(A, [], 2) to obtain the max
%       for each row.
%
% num_labels = No.
% of output classifier (Here, it is 10)
% DIMENSIONS:
% all_theta = 10 x 401 = num_labels x (input_layer_size+1) == num_labels x (no_of_features+1)

prob_mat = X * all_theta'; % 5000 x 10 == no_of_input_image x num_labels
[prob, p] = max(prob_mat,[],2); % m x 1
%returns maximum element in each row == max. probability and its index for each input image
%p: predicted output (index)
%prob: probability of predicted output

%%%%%%%% WORKING: Computation per input image %%%%%%%%%
% for i = 1:m % To iterate through each input sample
%     one_image = X(i,:); % 1 x 401 == 1 x no_of_features
%     prob_mat = one_image * all_theta'; % 1 x 10 == 1 x num_labels
%     [prob, out] = max(prob_mat);
%     %out: predicted output
%     %prob: probability of predicted output
%     p(i) = out;
% end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%% WORKING %%%%%%%%%
% for i = 1:m
%     RX = repmat(X(i,:),num_labels,1);
%     RX = RX .* all_theta;
%     SX = sum(RX,2);
%     [val, index] = max(SX);
%     p(i) = index;
% end
%%%%%%%%%%%%%%%%%%%%%%%%%%

% =========================================================================

end

predict.m :
function p = predict(Theta1, Theta2, X)
%PREDICT Predict the label of an input given a trained neural network
%   p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
%   trained weights of a neural network (Theta1, Theta2)

% Useful values
m = size(X, 1);
num_labels = size(Theta2, 1);

% You need to return the following variables correctly
p = zeros(size(X, 1), 1); % m x 1

% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
%               your learned neural network. You should set p to a
%               vector containing labels between 1 to num_labels.
%
% Hint: The max function might come in useful. In particular, the max
%       function can also return the index of the max element, for more
%       information see 'help max'. If your examples are in rows, then, you
%       can use max(A, [], 2) to obtain the max for each row.
%
%DIMENSIONS:
% theta1 = 25 x 401
% theta2 = 10 x 26

% layer1 (input) = 400 nodes + 1 bias
% layer2 (hidden) = 25 nodes + 1 bias
% layer3 (output) = 10 nodes
%
% theta dimensions = S_(j+1) x ((S_j)+1)
% theta1 = 25 x 401
% theta2 = 10 x 26

% theta1:
%     1st row indicates: theta corresponding to all nodes from layer1 connecting to the 1st node of layer2
%     2nd row indicates: theta corresponding to all nodes from layer1 connecting to the 2nd node of layer2
%     and
%     1st Column indicates: theta corresponding to node1 from layer1 to all nodes in layer2
%     2nd Column indicates: theta corresponding to node2 from layer1 to all nodes in layer2
%
% theta2:
%     1st row indicates: theta corresponding to all nodes from layer2 connecting to the 1st node of layer3
%     2nd row indicates: theta corresponding to all nodes from layer2 connecting to the 2nd node of layer3
%     and
%     1st Column indicates: theta corresponding to node1 from layer2 to all nodes in layer3
%     2nd Column indicates: theta corresponding to node2 from layer2 to all nodes in layer3

a1 = [ones(m,1) X]; % 5000 x 401 == no_of_input_images x no_of_features % Adding 1 in X
%No. of rows = no. of input images
%No. of Column = No. of features in each image

z2 = a1 * Theta1'; % 5000 x 25
a2 = sigmoid(z2); % 5000 x 25

a2 = [ones(size(a2,1),1) a2]; % 5000 x 26

z3 = a2 * Theta2'; % 5000 x 10
a3 = sigmoid(z3); % 5000 x 10

[prob, p] = max(a3,[],2);
%returns maximum element in each row == max. probability and its index for each input image
%p: predicted output (index)
%prob: probability of predicted output

% =========================================================================

end

Neural Networks - Representation
1. Which of the following statements are true? Check all that apply.

o Any logical function over binary-valued (0 or 1) inputs x1 and x2 can be (approximately) represented using some neural network.

o Suppose you have a multi-class classification problem with three classes, trained with a 3 layer network.
Let a^(3)_1 = (h_Θ(x))_1 be the activation of the first output unit, and similarly a^(3)_2 = (h_Θ(x))_2 and a^(3)_3 = (h_Θ(x))_3. Then for any input x, it must be the case that a^(3)_1 + a^(3)_2 + a^(3)_3 = 1.

o A two layer (one input layer, one output layer; no hidden layer) neural network can represent the XOR function.

o The activation values of the hidden units in a neural network, with the sigmoid activation function applied at every layer, are always in the range (0, 1).

2. Consider the following neural network which takes two binary-valued inputs x1, x2 ∈ {0, 1} and outputs h_Θ(x). Which of the following logical functions does it (approximately) compute?

o AND
This network outputs approximately 1 only when both inputs are 1.

o NAND (meaning “NOT AND”)

o OR

o XOR (exclusive OR)

2. Consider the following neural network which takes two binary-valued inputs x1, x2 ∈ {0, 1} and outputs h_Θ(x). Which of the following logical functions does it (approximately) compute?

o AND

o NAND (meaning “NOT AND”)

o OR
This network outputs approximately 1 when at least one input is 1.

o XOR (exclusive OR)

3. Consider the neural network given below. Which of the following equations correctly computes the activation a^(3)_1? Note: g(z) is the sigmoid activation function.

o
This correctly uses the first row of Θ^(2) and includes the “+1” term of a^(2)_0.

o

o

o

4. You have the following neural network:
You’d like to compute the activations of the hidden layer a^(2). One way to do so is the following Octave code:
You want to have a vectorized implementation of this (i.e., one that does not use for loops). Which of the following implementations correctly compute a^(2)? Check all that apply.

o z = Theta1 * x; a2 = sigmoid (z);
This version computes a^(2) = g(Θ^(1) x) correctly in two steps, first the multiplication and then the sigmoid activation.

o a2 = sigmoid (x * Theta1);

o a2 = sigmoid (Theta2 * x);

o z = sigmoid(x); a2 = sigmoid (Theta1 * z);

5. You are using the neural network pictured below and have learned the parameters Θ^(1) (used to compute a^(2)) and Θ^(2) (used to compute a^(3) as a function of a^(2)).
Suppose you swap the parameters for the first hidden layer between its two units, and also swap the corresponding parameters of the output layer. How will this change the value of the output h_Θ(x)?

o It will stay the same.

o It will increase.

o It will decrease.

o Insufficient information to tell: it may increase or decrease.

=== Week 5 ===
Assignments:
It consists of the following files:
• ex4.m - Octave/MATLAB script that steps you through the exercise
• ex4data1.mat - Training set of hand-written digits
• ex4weights.mat - Neural network parameters for exercise 4
• submit.m - Submission script that sends your solutions to our servers
• displayData.m - Function to help visualize the dataset
• fmincg.m - Function minimization routine (similar to fminunc)
• sigmoid.m - Sigmoid function
• computeNumericalGradient.m - Numerically compute gradients
• checkNNGradients.m - Function to help check your gradients
• debugInitializeWeights.m - Function for initializing weights
• predict.m - Neural network prediction function
• [*] sigmoidGradient.m - Compute the gradient of the sigmoid function
• [*] randInitializeWeights.m - Randomly initialize weights
• [*] nnCostFunction.m - Neural network cost function
* indicates files you will need to complete

sigmoidGradient.m :
function g = sigmoidGradient(z)
%SIGMOIDGRADIENT returns the gradient of the sigmoid function
%evaluated at z
%   g = SIGMOIDGRADIENT(z) computes the gradient of the sigmoid function
%   evaluated at z. This should work regardless if z is a matrix or a
%   vector. In particular, if z is a vector or matrix, you should return
%   the gradient for each element.

g = zeros(size(z));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the gradient of the sigmoid function evaluated at
%               each value of z (z can be a matrix, vector or scalar).
g = sigmoid(z).*(1-sigmoid(z));

% =============================================================

end

randInitializeWeights.m :
function W = randInitializeWeights(L_in, L_out)
%RANDINITIALIZEWEIGHTS Randomly initialize the weights of a layer with L_in
%incoming connections and L_out outgoing connections
%   W = RANDINITIALIZEWEIGHTS(L_in, L_out) randomly initializes the weights
%   of a layer with L_in incoming connections and L_out outgoing
%   connections.
%
%   Note that W should be set to a matrix of size(L_out, 1 + L_in) as
%   the first column of W handles the "bias" terms
%

% You need to return the following variables correctly
W = zeros(L_out, 1 + L_in);

% ====================== YOUR CODE HERE ======================
% Instructions: Initialize W randomly so that we break the symmetry while
%               training the neural network.
%
% Note: The first column of W corresponds to the parameters for the bias unit
%

% epsilon_init = 0.12;
epsilon_init = sqrt(6)/(sqrt(L_in)+sqrt(L_out));
W = - epsilon_init + rand(L_out, 1 + L_in) * 2 * epsilon_init ;

% =========================================================================

end

nnCostFunction.m :
function [J, grad] = nnCostFunction(nn_params, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, ...
                                   X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
%   [J grad] = NNCOSTFUNCTION(nn_params, hidden_layer_size, num_labels, ...
%   X, y, lambda) computes the cost and gradient of the neural network. The
%   parameters for the neural network are "unrolled" into the vector
%   nn_params and need to be converted back into the weight matrices.
%
%   The returned parameter grad should be a "unrolled" vector of the
%   partial derivatives of the neural network.
%

% Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
% for our 2 layer neural network
% DIMENSIONS:
% Theta1 = 25 x 401
% Theta2 = 10 x 26

Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

% Setup some useful variables
m = size(X, 1);

% You need to return the following variables correctly
J = 0;
Theta1_grad = zeros(size(Theta1)); % 25 x 401
Theta2_grad = zeros(size(Theta2)); % 10 x 26

% ====================== YOUR CODE HERE ======================
% Instructions: You should complete the code by working through the
%               following parts.
%
% Part 1: Feedforward the neural network and return the cost in the
%         variable J. After implementing Part 1, you can verify that your
%         cost function computation is correct by verifying the cost
%         computed in ex4.m
%
% Part 2: Implement the backpropagation algorithm to compute the gradients
%         Theta1_grad and Theta2_grad. You should return the partial derivatives of
%         the cost function with respect to Theta1 and Theta2 in Theta1_grad and
%         Theta2_grad, respectively. After implementing Part 2, you can check
%         that your implementation is correct by running checkNNGradients
%
%         Note: The vector y passed into the function is a vector of labels
%               containing values from 1..K. You need to map this vector into a
%               binary vector of 1's and 0's to be used with the neural network
%               cost function.
%
%         Hint: We recommend implementing backpropagation using a for-loop
%               over the training examples if you are implementing it for the
%               first time.
%
% Part 3: Implement regularization with the cost function and gradients.
%
%         Hint: You can implement this around the code for
%               backpropagation. That is, you can compute the gradients for
%               the regularization separately and then add them to Theta1_grad
%               and Theta2_grad from Part 2.
%
%%%%%%%%%%% Part 1: Calculating J w/o Regularization %%%%%%%%%%%%%%%

X = [ones(m,1), X]; % Adding 1 as first column in X

a1 = X; % 5000 x 401

z2 = a1 * Theta1'; % m x hidden_layer_size == 5000 x 25
a2 = sigmoid(z2); % m x hidden_layer_size == 5000 x 25
a2 = [ones(size(a2,1),1), a2]; % Adding bias unit as first column of a2
                               % m x (hidden_layer_size + 1) == 5000 x 26

z3 = a2 * Theta2'; % m x num_labels == 5000 x 10
a3 = sigmoid(z3); % m x num_labels == 5000 x 10

h_x = a3; % m x num_labels == 5000 x 10

% Converting y into vector of 0's and 1's for multi-class classification

%%%%% WORKING %%%%%
% y_Vec = zeros(m,num_labels);
% for i = 1:m
%     y_Vec(i,y(i)) = 1;
% end
%%%%%%%%%%%%%%%%%%%

y_Vec = (1:num_labels)==y; % m x num_labels == 5000 x 10

% Cost function without regularization
J = (1/m) * sum(sum((-y_Vec.*log(h_x))-((1-y_Vec).*log(1-h_x)))); % scalar

%%%%%%%%%%% Part 2: Implementing Backpropagation for Theta_grad w/o Regularization %%%%%%%%%%%%%

%%%%%%% WORKING: Backpropagation using for loop %%%%%%%
% for t=1:m
%     % Here X is including 1 column at beginning
%
%     % for layer-1
%     a1 = X(t,:)'; % (n+1) x 1 == 401 x 1
%
%     % for layer-2
%     z2 = Theta1 * a1; % hidden_layer_size x 1 == 25 x 1
%     a2 = [1; sigmoid(z2)]; % (hidden_layer_size+1) x 1 == 26 x 1
%
%     % for layer-3
%     z3 = Theta2 * a2; % num_labels x 1 == 10 x 1
%     a3 = sigmoid(z3); % num_labels x 1 == 10 x 1
%
%     yVector = (1:num_labels)'==y(t); % num_labels x 1 == 10 x 1
%
%     % calculating delta values
%     delta3 = a3 - yVector; % num_labels x 1 == 10 x 1
%
%     delta2 = (Theta2' * delta3) .* [1; sigmoidGradient(z2)]; % (hidden_layer_size+1) x 1 == 26 x 1
%
%     delta2 = delta2(2:end); % hidden_layer_size x 1 == 25 x 1 % Removing delta2 for bias node
%
%     % delta_1 is not calculated because we do not associate error with the input
%
%     % CAPITAL delta update
%     Theta1_grad = Theta1_grad + (delta2 * a1'); % 25 x 401
%     Theta2_grad = Theta2_grad + (delta3 * a2'); % 10 x 26
%
% end
%
% Theta1_grad = (1/m) * Theta1_grad; % 25 x 401
% Theta2_grad = (1/m) * Theta2_grad; % 10 x 26
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%% WORKING: Backpropagation (Vectorized Implementation) %%%%%%%
% Here X is including 1 column at beginning
A1 = X; % 5000 x 401

Z2 = A1 * Theta1'; % m x hidden_layer_size == 5000 x 25
A2 = sigmoid(Z2); % m x hidden_layer_size == 5000 x 25
A2 = [ones(size(A2,1),1), A2]; % Adding bias unit as first column of A2
                               % m x (hidden_layer_size + 1) == 5000 x 26

Z3 = A2 * Theta2'; % m x num_labels == 5000 x 10
A3 = sigmoid(Z3); % m x num_labels == 5000 x 10

% h_x = a3; % m x num_labels == 5000 x 10

y_Vec = (1:num_labels)==y; % m x num_labels == 5000 x 10

DELTA3 = A3 - y_Vec; % 5000 x 10
DELTA2 = (DELTA3 * Theta2) .* [ones(size(Z2,1),1) sigmoidGradient(Z2)]; % 5000 x 26
DELTA2 = DELTA2(:,2:end); % 5000 x 25 % Removing delta2 for bias node

Theta1_grad = (1/m) * (DELTA2' * A1); % 25 x 401
Theta2_grad = (1/m) * (DELTA3' * A2); % 10 x 26

%%%%%%%%%%%% WORKING: DIRECT CALCULATION OF THETA GRADIENT WITH REGULARISATION %%%%%%%%%%%
% % Regularization term is later added in Part 3
% Theta1_grad = (1/m) * Theta1_grad + (lambda/m) * [zeros(size(Theta1, 1), 1) Theta1(:,2:end)]; % 25 x 401
% Theta2_grad = (1/m) * Theta2_grad + (lambda/m) * [zeros(size(Theta2, 1), 1) Theta2(:,2:end)]; % 10 x 26
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%% Part 3: Adding Regularisation term in J and Theta_grad %%%%%%%%%%%%%
reg_term = (lambda/(2*m)) * (sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2))); % scalar

% Cost function with regularization
J = J + reg_term; % scalar

% Calculating gradients for the regularization
Theta1_grad_reg_term = (lambda/m) * [zeros(size(Theta1, 1), 1) Theta1(:,2:end)]; % 25 x 401
Theta2_grad_reg_term = (lambda/m) * [zeros(size(Theta2, 1), 1) Theta2(:,2:end)]; % 10 x 26

% Adding regularization term to earlier calculated Theta_grad
Theta1_grad = Theta1_grad + Theta1_grad_reg_term;
Theta2_grad = Theta2_grad + Theta2_grad_reg_term;

% -------------------------------------------------------------

% =========================================================================

% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];

end

Neural Networks: Learning
1. You are training a three layer neural network and would like to use backpropagation to compute the gradient of the cost function. In the backpropagation algorithm, one of the steps is to update Δ^(2)_ij := Δ^(2)_ij + δ^(3)_i * (a^(2))_j for every i,j. Which of the following is a correct vectorization of this step?

o

o

o

o
This version is correct, as it takes the “outer product” of the two vectors δ^(3) and a^(2), which is a matrix such that the (i,j)-th entry is δ^(3)_i * (a^(2))_j as desired.

2. Suppose Theta1 is a 5x3 matrix, and Theta2 is a 4x6 matrix. You set thetaVec = [Theta1(:); Theta2(:)]. Which of the following correctly recovers Theta2?

o reshape(thetaVec(16 : 39), 4, 6)
This choice is correct, since Theta1 has 15 elements, so Theta2 begins at index 16 and ends at index 16 + 24 - 1 = 39.

o reshape(thetaVec(15 : 38), 4, 6)

o reshape(thetaVec(16 : 24), 4, 6)

o reshape(thetaVec(15 : 39), 4, 6)

o reshape(thetaVec(16 : 39), 6, 4)

3. Let J(θ) = 2θ^3 + 2. Let θ = 1, and ε = 0.01. Use the formula (J(θ + ε) − J(θ − ε)) / (2ε) to numerically compute an approximation to the derivative at θ = 1. What value do you get? (When θ = 1, the true/exact derivative is dJ(θ)/dθ = 6.)

o 8

o 6.0002
We compute ((2(1.01)^3 + 2) − (2(0.99)^3 + 2)) / (2(0.01)) = 6.0002.

o 6

o 5.9998

4. Which of the following statements are true? Check all that apply.

o For computational efficiency, after we have performed gradient checking to verify that our backpropagation code is correct, we usually disable gradient checking before using backpropagation to train the network.
Checking the gradient numerically is a debugging tool: it helps ensure a correct implementation, but it is too slow to use as a method for actually computing gradients.

o Computing the gradient of the cost function in a neural network has the same efficiency when we use backpropagation or when we numerically compute it using the method of gradient checking.
o Using gradient checking can help verify if one’s implementation of backpropagation is bug-free.
If the gradient computed by backpropagation is the same as one computed numerically with gradient checking, this is very strong evidence that you have a correct implementation of backpropagation.

o Gradient checking is useful if we are using one of the advanced optimization methods (such as in fminunc) as our optimization algorithm. However, it serves little purpose if we are using gradient descent.

5. Which of the following statements are true? Check all that apply.

• If we are training a neural network using gradient descent, one reasonable “debugging” step to make sure it is working is to plot J(Θ) as a function of the number of iterations, and make sure it is decreasing (or at least non-increasing) after each iteration.
Since gradient descent uses the gradient to take a step toward parameters with lower cost (i.e., lower J(Θ)), the value of J(Θ) should be equal or less at each iteration if the gradient computation is correct and the learning rate is set properly.

• Suppose you have a three layer network with parameters Θ^(1) (controlling the function mapping from the inputs to the hidden units) and Θ^(2) (controlling the mapping from the hidden units to the outputs). If we set all the elements of Θ^(1) to be 0, and all the elements of Θ^(2) to be 1, then this suffices for symmetry breaking, since the neurons are no longer all computing the same function of the input.

• Suppose you are training a neural network using gradient descent. Depending on your random initialization, your algorithm may converge to different local optima (i.e., if you run the algorithm twice with different random initializations, gradient descent may converge to two different solutions).
The cost function for a neural network is non-convex, so it may have multiple minima. Which minimum you find with gradient descent depends on the initialization.
• If we initialize all the parameters of a neural network to ones instead of zeros, this will suffice for the purpose of “symmetry breaking” because the parameters are no longer symmetrically equal to zero.

=== Week 6 ===
Assignments:
It consists of the following files:
• ex5.m - Octave/MATLAB script that steps you through the exercise
• ex5data1.mat - Dataset
• submit.m - Submission script that sends your solutions to our servers
• featureNormalize.m - Feature normalization function
• fmincg.m - Function minimization routine (similar to fminunc)
• plotFit.m - Plot a polynomial fit
• trainLinearReg.m - Trains linear regression using your cost function
• [*] linearRegCostFunction.m - Regularized linear regression cost function
• [*] learningCurve.m - Generates a learning curve
• [*] polyFeatures.m - Maps data into polynomial feature space
• [*] validationCurve.m - Generates a cross validation curve
* indicates files you will need to complete

linearRegCostFunction.m :
function [J, grad] = linearRegCostFunction(X, y, theta, lambda)
%LINEARREGCOSTFUNCTION Compute cost and gradient for regularized linear
%regression with multiple variables
%   [J, grad] = LINEARREGCOSTFUNCTION(X, y, theta, lambda) computes the
%   cost of using theta as the parameter for linear regression to fit the
%   data points in X and y. Returns the cost in J and the gradient in grad

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost and gradient of regularized linear
%               regression for a particular choice of theta.
%
%               You should set J to the cost and grad to the gradient.
%DIMENSIONS:
%   X     = 12x2 = m x (n+1)
%   y     = 12x1 = m x 1
%   theta = 2x1  = (n+1) x 1
%   grad  = 2x1  = (n+1) x 1

h_x = X * theta; % 12x1 = m x 1

J = (1/(2*m))*sum((h_x - y).^2) + (lambda/(2*m))*sum(theta(2:end).^2); % scalar

% grad(1) = (1/m)*sum((h_x-y).*X(:,1)); % scalar == 1x1
grad(1) = (1/m)*(X(:,1)'*(h_x-y)); % scalar == 1x1
grad(2:end) = (1/m)*(X(:,2:end)'*(h_x-y)) + (lambda/m)*theta(2:end); % n x 1

% =========================================================================

grad = grad(:);

end

learningCurve.m :
function [error_train, error_val] = ...
    learningCurve(X, y, Xval, yval, lambda)
%LEARNINGCURVE Generates the train and cross validation set errors needed
%to plot a learning curve
%   [error_train, error_val] = ...
%       LEARNINGCURVE(X, y, Xval, yval, lambda) returns the train


Written for

Course: Coursera: Machine Learning - All weeks solutions [Assignment + Quiz] - Andrew NG

All documents for this subject (1)

Document information

Uploaded on: June 8, 2021
Number of pages: 169
Written in: 2020/2021
Type: Exam (elaborations)
Contains: Questions & answers


Content preview

Coursera: Machine Learning - All
Weeks solutions [Assignment +
Quiz] - Andrew NG

=== Week 1 ===
Assignments:
• No Assignment for Week 1

Introduction
1. A computer program is said to learn from experience E with respect to some task T and some
performance measure P if its performance on T, as measured by P, improves with experience E.
Suppose we feed a learning algorithm a lot of historical weather data, and have it learn to
predict weather. What would be a reasonable choice for P?

o The probability of it correctly predicting a future date’s weather.

o The weather prediction task.

o The process of the algorithm examining a large amount of historical weather data.

o None of these.

1. A computer program is said to learn from experience E with respect to some task T and some
performance measure P if its performance on T, as measured by P, improves with experience E.
Suppose we feed a learning algorithm a lot of historical weather data, and have it learn to
predict weather. In this setting, what is T?

o The weather prediction task.

o None of these.

o The probability of it correctly predicting a future date’s weather.

o The process of the algorithm examining a large amount of historical weather data.

2. Suppose you are working on weather prediction, and use a learning algorithm to predict
tomorrow’s temperature (in degrees Centigrade/Fahrenheit).
Would you treat this as a classification or a regression problem?

o Regression

o Classification

2. Suppose you are working on weather prediction, and your weather station makes one of three
predictions for each day’s weather: Sunny, Cloudy or Rainy. You’d like to use a learning
algorithm to predict tomorrow’s weather.
Would you treat this as a classification or a regression problem?

o Regression

o Classification

3. Suppose you are working on stock market prediction, and you would like to predict the price of
a particular stock tomorrow (measured in dollars). You want to use a learning algorithm for
this.
Would you treat this as a classification or a regression problem?

o Regression

o Classification

3. Suppose you are working on stock market prediction. You would like to predict whether or not a
certain company will declare bankruptcy within the next 7 days (by training on data of similar
companies that had previously been at risk of bankruptcy).
Would you treat this as a classification or a regression problem?

o Regression

o Classification

3. Suppose you are working on stock market prediction. Typically tens of millions of shares of
Microsoft stock are traded (i.e., bought/sold) each day. You would like to predict the number of
Microsoft shares that will be traded tomorrow.
Would you treat this as a classification or a regression problem?

o Regression

o Classification

4. Some of the problems below are best addressed using a supervised learning algorithm, and the
others with an unsupervised learning algorithm. Which of the following would you apply
supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is
available for your algorithm to learn from.

o Given historical data of children’s ages and heights, predict children’s height as a
function of their age.

o Given 50 articles written by male authors, and 50 articles written by female
authors, learn to predict the gender of a new manuscript’s author (when the identity of
this author is unknown).

o Take a collection of 1000 essays written on the US Economy, and find a way to
automatically group these essays into a small number of groups of essays that are
somehow “similar” or “related”.

o Examine a large collection of emails that are known to be spam email, to discover if
there are sub-types of spam mail.

4. Some of the problems below are best addressed using a supervised learning algorithm, and the
others with an unsupervised learning algorithm. Which of the following would you apply
supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is
available for your algorithm to learn from.

o Given data on how 1000 medical patients respond to an experimental drug (such as
effectiveness of the treatment, side effects, etc.), discover whether there are different
categories or “types” of patients in terms of how they respond to the drug, and if so
what these categories are.

o Given a large dataset of medical records from patients suffering from heart disease,
try to learn whether there might be different clusters of such patients for which we
might tailor separate treatments.

o Have a computer examine an audio clip of a piece of music, and classify whether or
not there are vocals (i.e., a human voice singing) in that audio clip, or if it is a clip of only
musical instruments (and no vocals).

o Given genetic (DNA) data from a person, predict the odds of him/her developing
diabetes over the next 10 years.

Linear Regression with One Variable
1. Consider the problem of predicting how well a student does in her second year of
college/university, given how well she did in her first year. Specifically, let x be equal to the
number of “A” grades (including A-, A and A+ grades) that a student receives in their first year of
college (freshman year). We would like to predict the value of y, which we define as the number
of “A” grades they get in their second year (sophomore year).
Here each row is one training example. Recall that in linear regression, our hypothesis
is $h_\theta(x) = \theta_0 + \theta_1 x$, and we use $m$ to denote the number of training examples.

For the training set given above (note that this training set may also be referenced in other
questions in this quiz), what is the value of $m$? In the box below, please enter your answer
(which should be a number between 0 and 10).

4

2. Many substances that can burn (such as gasoline and alcohol) have a chemical structure based
on carbon atoms; for this reason they are called hydrocarbons. A chemist wants to understand
how the number of carbon atoms in a molecule affects how much energy is released when that
molecule combusts (meaning that it is burned). The chemist obtains the dataset below. In the
column on the right, “kJ/mol” is the unit measuring the amount of energy released.

You would like to use linear regression ($h_\theta(x) = \theta_0 + \theta_1 x$) to estimate the amount of energy
released (y) as a function of the number of carbon atoms (x). Which of the following do you
think will be the values you obtain for $\theta_0$ and $\theta_1$? You should be able to select the right answer
without actually implementing linear regression.

o $\theta_0$ = −569.6, $\theta_1$ = 530.9

o $\theta_0$ = −1780.0, $\theta_1$ = −530.9

o $\theta_0$ = −569.6, $\theta_1$ = −530.9

o $\theta_0$ = −1780.0, $\theta_1$ = 530.9

2. For this question, assume that we are using the training set from Q1.
Recall our definition of the cost function was
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$.
What is $J(0, 1)$? In the box below,
please enter your answer (simplify fractions to decimals when entering your answer, and use ‘.’ as the
decimal delimiter, e.g., 1.5).

0.5
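As a rough illustration of the cost computation, the Python sketch below evaluates the squared-error cost J(θ0, θ1) = (1/(2m)) Σ (hθ(x) − y)² for a straight-line hypothesis. The quiz's training table is not shown here, so the four (x, y) pairs are an assumed stand-in, chosen only to be consistent with m = 4 and the answer 0.5. The function names are illustrative as well.

```python
def hypothesis(theta0, theta1, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared-error cost J = (1 / (2m)) * sum((h(x) - y)^2)."""
    m = len(xs)
    squared_errors = sum(
        (hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)
    )
    return squared_errors / (2 * m)


# ASSUMPTION: stand-in training set, not taken from the source document.
xs = [3, 1, 0, 4]
ys = [2, 2, 1, 3]

j = cost(0, 1, xs, ys)
print(j)  # 0.5 for this assumed data: each example is off by 1, so J = 4 / 8
```

With θ0 = 0 and θ1 = 1 the hypothesis is simply h(x) = x, so each squared error is just (x − y)².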

3. Suppose we set $\theta_0$ and $\theta_1$ to specific values in the linear regression hypothesis from Q1. What is $h_\theta(x)$ for the given $x$?

3

3. Suppose we set $\theta_0 = -2$, $\theta_1 = 0.5$ in the linear regression hypothesis from Q1. What is $h_\theta(6)$?

1
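The arithmetic behind this answer can be checked with a small Python sketch. It assumes the linear hypothesis h(x) = θ0 + θ1·x with θ0 = −2 and θ1 = 0.5, and it inverts the line to find which input yields the stated answer of 1. The helper name `h` is illustrative.

```python
def h(theta0, theta1, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


theta0, theta1 = -2, 0.5
answer = 1  # the answer given for this question

# Invert the line to find which input produces the stated answer.
x = (answer - theta0) / theta1
print(x)                     # 6.0
print(h(theta0, theta1, x))  # 1.0
```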

4. Let $f$ be some function so that $f(\theta_0, \theta_1)$ outputs a number. For this problem, $f$ is some
arbitrary/unknown smooth function (not necessarily the cost function of linear regression,
so $f$ may have local optima).
Suppose we use gradient descent to try to minimize $f(\theta_0, \theta_1)$ as a function of $\theta_0$ and $\theta_1$.
Which of the following statements are true? (Check all that apply.)

o If $\theta_0$ and $\theta_1$ are initialized at the global minimum, then one iteration will not change
their values.

o Setting the learning rate $\alpha$ to be very small is not harmful, and can only speed up
the convergence of gradient descent.

o No matter how $\theta_0$ and $\theta_1$ are initialized, so long as $\alpha$ is sufficiently small, we can
safely expect gradient descent to converge to the same solution.

o If the first few iterations of gradient descent cause $f(\theta_0, \theta_1)$ to increase rather
than decrease, then the most likely cause is that we have set the learning rate $\alpha$ to too
large a value.
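The behaviours described in these statements can be tried out on a toy problem. The Python sketch below runs plain gradient descent on an assumed smooth function f(θ0, θ1) = (θ0² − 1)² + θ1², which has two minima at (1, 0) and (−1, 0). The function, the starting points, and the learning rates are hypothetical choices for illustration and are not part of the quiz.

```python
def f(t0, t1):
    """ASSUMED example function with minima at (1, 0) and (-1, 0)."""
    return (t0 ** 2 - 1) ** 2 + t1 ** 2


def grad(t0, t1):
    """Partial derivatives of f with respect to t0 and t1."""
    return 4 * t0 * (t0 ** 2 - 1), 2 * t1


def gradient_descent(t0, t1, alpha, iterations):
    """Repeat the simultaneous update theta := theta - alpha * gradient."""
    for _ in range(iterations):
        g0, g1 = grad(t0, t1)
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1
    return t0, t1


# Starting exactly at a minimum: the gradient is zero, so nothing moves.
stay = gradient_descent(1.0, 0.0, alpha=0.05, iterations=1)

# Different starting points can end at different minima, even with small alpha.
right = gradient_descent(0.5, 1.0, alpha=0.05, iterations=500)
left = gradient_descent(-0.5, 1.0, alpha=0.05, iterations=500)

# A learning rate that is too large makes f go up after a single step.
before = f(0.5, 1.0)
after = f(*gradient_descent(0.5, 1.0, alpha=1.5, iterations=1))

print(stay, right, left)
print(before, after)
```

In this toy setting a very small learning rate would still converge, only more slowly, which is why a small learning rate cannot "speed up" convergence.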

4. In the given figure, the cost function $J(\theta_0, \theta_1)$ has been plotted against $\theta_0$ and $\theta_1$, as shown in
‘Plot 2’. The contour plot for the same cost function is given in ‘Plot 1’. Based on the figure,
choose the correct options (check all that apply).

o If we start from point B, gradient descent with a well-chosen learning rate will
eventually help us reach at or near point A, as the value of cost function $J(\theta_0, \theta_1)$ is
maximum at point A.

o If we start from point B, gradient descent with a well-chosen learning rate will
eventually help us reach at or near point C, as the value of cost function $J(\theta_0, \theta_1)$ is
minimum at point C.

o Point P (the global minimum of Plot 2) corresponds to point A of Plot 1.

o If we start from point B, gradient descent with a well-chosen learning rate will
eventually help us reach at or near point A, as the value of cost function $J(\theta_0, \theta_1)$ is
minimum at A.

o Point P (the global minimum of Plot 2) corresponds to point C of Plot 1.

Linear Algebra
1. Let two matrices be A and B [matrix entries and answer options not preserved].
What is A - B?

1. Let two matrices be A and B [matrix entries not preserved].
What is A + B?
