Exam

Machine Learning Engineering Final Exams Questions And Answers Fully Solved

Pages
6
Grade
A+
Uploaded on
22-05-2025
Written in
2024/2025

Machine Learning Engineering Final Exams Questions And Answers Fully Solved

When to use machine learning - answers When the problem is too complex for hardcoded rules. When the problem deals with an unstudied phenomenon. When the problem calls for automating some decision. When the problem changes frequently.
Reasons not to use ML - answers Can't get the right data or enough data. The problem does not require learning from data. The problem is solved in other ways. It is not ethical.
How can we tell if we need more data? (method) - answers Learning curves. Plot the training and validation accuracy against the number of training samples; once the curves converge, we have enough data.
Normalization - answers Map values onto a range such as [0, 1].
Standardization - answers Rescale feature values to follow a standard normal distribution.
T/F Feature scaling helps with most learning algorithms - answers True; especially for a linear SVM, you should rescale features.
Visualizing data - answers Before getting too far into data cleansing or training a model, visualize the training data. This may yield insights into which features to use and what type of model to try.
Reinforcement Learning - answers An agent interacts with its environment, performing actions and receiving rewards.
Multiclass classification - answers Some learning algorithms, such as decision trees, k-nearest neighbors, and neural networks, naturally support multiclass classification. For others, such as SVM, we can transform multiclass classification into binary classification, either one-vs-rest or one-vs-one.
One-vs-one - answers Train c(c-1)/2 binary classifiers, one to classify class i versus class j. Predict with all of them and return the class with the most votes.
What makes a nice loss function - answers Continuous, differentiable, strictly convex, smooth.
T/F Neural networks generally have a convex loss function - answers False.
Convex - answers A shape is not convex if we can draw a line between two points in the shape that leaves the shape.
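The normalization and standardization cards above can be sketched in a few lines of plain Python (the helper names are mine, for illustration only):

```python
def normalize(values):
    """Min-max normalization: map values onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

feature = [2.0, 4.0, 6.0, 8.0]
print(normalize(feature))    # spans [0, 1]
print(standardize(feature))  # zero mean, unit variance
```

In practice a library such as scikit-learn provides equivalents (e.g. `MinMaxScaler`, `StandardScaler`), which also remember the training-set statistics so the same rescaling can be applied to validation data.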
Model parameters vs. hyperparameters - answers Model parameters are determined from the training data by some optimization procedure. Hyperparameters are set by the machine learning engineer.
If you don't have enough data to leave some aside for validation/testing - answers Use k-fold cross-validation.
Bias - answers Error due to incorrect assumptions in the model. High bias means underfitting.
Variance - answers Sensitivity to small variations in the training data. High variance means the model is highly influenced by a few data points; this is overfitting.
Regularization - answers Reduces model complexity by decreasing the degrees of freedom of the model. Helps with overfitting.
Linear regression - answers Given the feature vector x, predict the target y as accurately as possible: y = w·x + b, where w is the learned weight vector, x is the input (feature) vector, and b is the bias term.
Mean Squared Error (OLS) - answers Square each prediction-minus-actual difference, then average the squares.
Mean Absolute Error - answers More robust to outliers, but less mathematically convenient than OLS.
Polynomial regression - answers Add x^2, x^3, etc. as features to capture nonlinear relationships.
L1 regularization - answers Encourages sparsity in the weights. Also known as Lasso.
L2 regularization - answers Encourages shrinking the weights towards 0. Known as ridge regression.
Logistic Regression - answers Does binary classification by predicting the probability of the class label using the sigmoid function. Logits are the raw outputs of the model; applying the sigmoid converts them to probabilities.
Binary cross-entropy - answers The loss function of choice for logistic regression.
Softmax - answers Uses the predicted logit score for each class to create a probability distribution over the labels.
SVM - answers Represented as the hyperplane w·x - b = 0. The goal in SVM is to find the optimal w and b values so that the hyperplane best separates the positive from the negative examples.
Hard-margin SVM - answers Assumes the data is linearly separable.
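The logistic-regression cards above reduce to two small functions, a sigmoid over a logit and the binary cross-entropy loss. A minimal sketch (function names are mine, not from the exam):

```python
import math

def sigmoid(z):
    """Convert a raw logit z into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p_pred):
    """Loss for one example: -[y log p + (1 - y) log(1 - p)]."""
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

# A logit of 0 maps to probability 0.5 (the decision boundary).
p = sigmoid(0.0)
print(p)                           # 0.5
print(binary_cross_entropy(1, p))  # loss when the true label is 1
```

Note how the loss drops as the predicted probability approaches the true label, which is exactly what gradient descent exploits during training.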
The problem is maximizing the margin, the space between the closest examples of each class, which are called support vectors.
Soft margin - answers Uses hinge loss. Minimizes margin violations.
Kernel trick - answers Allows us to reflect a higher-dimensional space transformation only in the cost function for optimization.
SVM (properties) - answers Can perform both linear and nonlinear classification. Can be used for regression. Well suited to small datasets. Trains slowly.
Decision Tree - answers Nonparametric model suitable for classification or regression.
Decision Tree algorithms differ in ... - answers Stopping criterion, ways to find the best split, and how they are regularized.
When to stop? - answers All examples classified; can't find a feature to split on; a split does not significantly improve impurity or entropy. Pruning.
Voting - answers Predict the majority label for classifiers, or the mean of the regressors, when we have different types of models.
Bagging - answers Bootstrap aggregating. Train many models of the same type, each on a randomly sampled subset of the data. Use the statistical mode as the aggregation function for classification.
Random Forests - answers Bagging with decision trees, using a random subset of features to decide on the best split. Bagging by itself doesn't necessarily imply using different subsets of features.
Boosting - answers Train many weak learners, each of which corrects the errors of the previous one.
Stacking - answers Alternative to voting classifiers/regressors. Train a model (a meta-model) to do the aggregation instead of averaging or majority voting.
Evaluation metrics for regression - answers Mean Squared Error, Mean Absolute Error, etc.
Confusion matrix - answers Shows true positives vs. false positives, false negatives, etc.
Accuracy - answers (TP + TN) / (TP + TN + FP + FN)
Recall - answers TP / (TP + FN). How many of the actual positive samples did we recall?
Precision - answers TP / (TP + FP). How accurate were we when we predicted the positive class?
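The accuracy, precision, and recall formulas above are easy to verify with a small helper (illustrative names; the counts in the example are made up):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Example: 8 true positives, 5 true negatives, 2 false positives, 5 false negatives.
acc, prec, rec = classification_metrics(tp=8, tn=5, fp=2, fn=5)
print(acc, prec, rec)  # 0.65, 0.8, ~0.615
```

The example also shows why accuracy alone can mislead: precision is high (few false alarms) while recall is mediocre (many positives were missed).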
Receiver Operating Characteristic (ROC) - answers Plots the true positive rate (TPR) versus the false positive rate (FPR). Each point on the curve is a valid tradeoff point.
AUC - answers Area under the ROC curve. AUC = 0 is the worst possible classifier, AUC = 0.5 is no better than chance, AUC = 1.0 is perfect.
F1-score - answers 2 * precision * recall / (precision + recall)
Well calibrated - answers A model is well calibrated if we can interpret its score/probability as the true probability.
Base rate fallacy - answers An error in reasoning: confusing a classifier's prior probability of correct prediction with the posterior probability of a true positive.
Gradient Descent - answers Generic iterative procedure that updates the parameters based on the gradient. We want to minimize the loss function.
Batch Gradient Descent - answers In each iteration, calculate the gradient over the entire dataset.
Stochastic Gradient Descent - answers In each iteration, randomly pick a single example and calculate its gradient.
Mini-batch Stochastic Gradient Descent - answers Randomly shuffle the dataset and process it in mini-batches.
Epoch - answers The number of iterations needed to process the entire dataset once.
Mini-batch SGD is what we typically use in practice.
Learning rate - answers Can be constant, time-based, etc.
Optimizers - answers Help change the learning rate as we train. For example, Adam gives each parameter its own adaptive learning rate based on past gradients.
Unsupervised learning - answers Learning from unlabeled data (must discover patterns in the data).
Problems in Unsupervised Learning - answers Outlier detection, clustering, dimensionality reduction.
Clustering - answers Assign data points to clusters in the optimal way: k-means, DBSCAN, hierarchical. We need a holdout set so the clustering doesn't overfit, so we use a train/test split.
K-means clustering - answers Initialize a random centroid for each cluster.
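The gradient-descent variants above differ only in how much data feeds each parameter update. A mini-batch SGD sketch for linear regression with NumPy (all names and the synthetic data are mine, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # 200 examples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(50):
    idx = rng.permutation(len(X))           # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]   # one mini-batch
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient
        w -= lr * grad

print(w)  # close to [1.5, -2.0, 0.5]
```

Setting `batch_size = len(X)` recovers batch gradient descent, and `batch_size = 1` recovers plain SGD, which is why mini-batch SGD is described as the practical middle ground.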
Assign each data point to the cluster with the closest centroid, then recalculate each centroid from the points assigned to its cluster. Requires specifying the number of clusters k.
Dimensionality Reduction Techniques - answers Projections (PCA); manifold learning (LLE, Isomap).
PCA (Principal Component Analysis) - answers Method for a linear transformation onto a new coordinate system, such that each principal component captures the greatest amount of variance given the previous components. Keep the first k principal components; k is a hyperparameter. Choose the k components that explain the most variance.
T/F The kernel trick can be used to do nonlinear dimensionality reduction - answers True.
Manifold Learning - answers Find a mapping of the dataset X into a dataset Z such that Z preserves the local geometry of X (if x_i and x_j are close, then z_i and z_j are also close). Preserve neighborhood structure.
Hidden layer - answers Any layer that isn't the input or output layer.
A single neuron - answers Has weights. We take the dot product of the weights and the input vector, add a bias, and apply a nonlinear activation function to this output.
Types of layers - answers Dense (fully connected), convolutional, recurrent.
Softmax activation - answers Used in the final layer for multiclass classification, where each input is assigned to one class.
ReLU - answers f(z) = max(0, z)
A single neuron is essentially linear regression (plus an activation).
Backpropagation - answers Reverse pass that measures the error and propagates the error gradient backwards through the network. Backpropagation is how we compute gradients efficiently; gradient descent is how we update parameters to minimize the loss given those gradients.
Given a minibatch b - answers Compute the forward pass, saving intermediate results. Compute the loss on the minibatch b. In the backward pass, compute the per-weight gradients layer by layer. Update the weights based on the gradients (stochastically).
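The PCA card above can be sketched directly with NumPy's SVD: center the data, take the top singular vectors as the principal components, and project (the function name is mine, for illustration):

```python
import numpy as np

def pca(X, k):
    """Project X onto its first k principal components."""
    Xc = X - X.mean(axis=0)                        # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                            # directions of greatest variance
    return Xc @ components.T                       # coordinates in the new system

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Z = pca(X, k=2)
print(Z.shape)  # (100, 2)
```

Because SVD returns singular values in decreasing order, the first projected coordinate always carries at least as much variance as the second, matching the "greatest variance given the previous components" definition.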
When to use neural networks - answers Complex learning task or complex data. They provide the best performance for many tasks.
When not to use them - answers If you can use a simpler model, don't have much data, or need an interpretable model.
Width is the number of neurons in a layer; depth is the number of layers.
Softmax with cross-entropy loss for multiclass classification is good. Regression with a tanh output is bad; regression with a linear output activation is good.
Make the network look like a funnel: large input feature vectors and small output vectors.
Deep neural network - answers Any neural network with 2 or more hidden layers.
More hidden layers is generally better than wider layers.
Choice of activation functions - answers ReLU or ReLU variants in the hidden layers. Softmax for multiclass classification. Sigmoid for binary classification or multilabel.
Vanishing/Exploding Gradient Problems - answers The gradient vector becomes very small (vanishing gradient) or very large (exploding gradient) during backpropagation. This makes it difficult to update the weights of the earlier layers, and training doesn't converge.
Vanishing/exploding gradients are an instance of unstable gradients, a more general problem.
Mitigations for vanishing/exploding gradients - answers Weight initialization methods. Non-saturating activation functions. Batch normalization. Gradient clipping (for exploding gradients).
Some ways to initialize weights can make gradients unstable. Use Glorot or He initialization with biases initialized to zero.
ReLU can saturate at zero for negative inputs. ReLU variants such as Leaky ReLU keep a small nonzero slope instead of outputting zero, and ELU allows negative values.
Regularizing neural networks - answers Early stopping. L1/L2 regularization. Dropout (each neuron has a probability of being dropped out, i.e. ignored, for the current step or pass, not the whole epoch).
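The dropout card above is usually implemented as "inverted dropout": zero each unit with probability p during training and scale the survivors so the expected activation is unchanged, which lets inference skip dropout entirely. A sketch (names are mine):

```python
import numpy as np

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop during
    training, scaling survivors so the expected activation is unchanged."""
    if not training:
        return activations                       # no-op at inference time
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
a = np.ones((4, 8))
out = dropout(a, p_drop=0.5, rng=rng)
print(out)  # surviving units are scaled to 2.0, the rest are 0
```

A fresh random mask is drawn for every training step, which is exactly the "current step or pass, not whole epoch" point in the card.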
"Batch", when we say it, usually refers to a minibatch.
Double descent - answers An unexpected and sudden change in test error as a function of the number of parameters.
Grokking - answers A neural network suddenly learns to generalize well after it has already overfitted. Only occurs in very specific settings (small algorithmic datasets, complex neural networks, etc.).
Transfer learning - answers Pick a pre-trained deep neural net in a related domain, freeze the earlier layers, and add your own hidden and upper layers.
Transfer learning techniques - answers Linear probing: replace only the final layer and train it. Fine-tuning: train the transferred layers further, with a smaller learning rate, on the new data.
Convolutional layers - answers Each neuron/unit is connected only to a small number of neurons/units in the previous layer. Fewer neurons/units than fully connected layers, but high memory usage during training.
Filters - answers Filters, aka kernels, slide (convolve) across the image or the previous layer's output, producing a feature map. Each application of the filter produces a single output value, to which you can also apply an activation function.
Padding "same" - answers The output feature map has the same spatial dimensions as the input, assuming stride = 1.
How to calculate output size - answers floor((V - K + 2P) / S) + 1, where V is the input size, K the filter size, P the padding, and S the stride.
Subsampling/pooling - answers After one or more convolutional layers, we can have subsampling (pooling). Pooling reduces the dimensions of the feature maps. Max pooling takes the maximum value in the sliding window; average pooling takes the average value.
Rules of thumb - answers Prefer small filter sizes (2, 3), though the first layer can be larger. Repeat the pattern Conv-MaxPool or Conv-Conv-MaxPool.
ResNet - answers Residual learning building block. Filters learn different things.
Recurrent Layers - answers Made up of recurrent neurons/units which keep state.
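The output-size formula above is worth checking against the "same" padding card: with a 3x3 filter, stride 1, and padding 1, the spatial size is preserved. A one-function sketch (the function name is mine):

```python
def conv_output_size(v, k, p, s):
    """Spatial output size of a convolution: floor((V - K + 2P) / S) + 1."""
    return (v - k + 2 * p) // s + 1

# "Same" padding with a 3x3 filter and stride 1: pad by 1 on each side.
print(conv_output_size(v=32, k=3, p=1, s=1))  # 32
# No padding ("valid"), stride 2:
print(conv_output_size(v=32, k=3, p=0, s=2))  # 15
```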
The state at some time step is a function of the previous state and the current input. Recurrent units work much like other neurons, with activation functions and so on, except that the layer's output includes, on top of the product of the weights and the input, a hidden state vector that is a function of the previous state and the current input.
Sequence-to-sequence - answers For each input frame there is a single output frame. Example: predicting stock prices; feed the price of a stock over the last n days and predict the price on day n+1.
Vector-to-sequence - answers There is a vector of inputs (could be an image) and the model produces a sequence as output. Feed the input once, then the model keeps predicting next words from its hidden state without any further input.
Sequence-to-vector - answers The input is a sequence and the model produces a vector as output, e.g. sentiment analysis: given the text of a movie review, the model outputs "positive". Keep providing input and updating hidden states, producing no output until the end.
Encoder-decoder networks - answers A sequence-to-vector network followed by a vector-to-sequence network, e.g. language translation.
Teacher forcing (encoder-decoder) - answers Use the ground truth rather than the model's output from the earlier step: for sequence translation, instead of feeding the model's last predicted word, feed what the actual last word should be.
Attention network or layer - answers Trained at the same time as the encoder-decoder. Produces weights that sum to 1, where each weight represents the importance of the corresponding encoder output.
Transformer architecture - answers Encoder-decoder with no recurrence. Positional encoding, scaled dot-product attention with multiple heads, cross-attention between encoder and decoder, layer norm. The transformer architecture doesn't need recurrent networks, just attention.
Multi-head attention - answers Multiple attention layers in parallel.
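The scaled dot-product attention mentioned in the transformer card can be sketched in a few NumPy lines; note how the softmax makes each row of weights sum to 1, matching the attention card above (shapes and names here are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key/value positions
V = rng.normal(size=(6, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape)              # (4, 8): one weighted mix of V per query
```

Multi-head attention simply runs several such attention computations in parallel on learned projections of Q, K, and V, then concatenates the results.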
AutoEncoders - answers Architecture combining an encoder network and a decoder network. Autoencoders learn to reproduce the input as their output; the latent representation is constrained and must have lower dimensionality than the input. They learn efficient representations of the data, representing data points in the latent space, and can be applied to dimensionality reduction and feature learning.
GPT - answers A decoder-only transformer architecture.
Variational AutoEncoders - answers The encoder maps an input x to a distribution in the latent space. We can sample the latent space to get a new output, making variational autoencoders a generative model.
Generative Adversarial Network (GAN) - answers Has a generator and a discriminator. The generator takes random noise from some distribution and produces a data point. The discriminator predicts real or fake for each data point: real for data points taken from the dataset, fake for those produced by the generator. The discriminator helps train the generator.
GANs are difficult to train; the discriminator and generator need to learn together at roughly the same pace.
Deepfakes - answers An image or video of someone's likeness superimposed on an existing image or video.
Diffusion process - answers Noising and denoising. Add noise to the inputs until they are pure noise; then, to create samples, reverse the process starting from pure noise. Diffusion models map a noise distribution into a real data distribution.
GANs vs. diffusion models - answers While GANs train a generator to fool a discriminator, diffusion models learn to reverse a noising process. GANs use adversarial training; diffusion models add noise to data and learn to denoise it. GANs generate in one step (direct generation), while diffusion models are iterative, with many denoising steps.
LIME - answers Given an instance x, approximate the behavior of the model in a neighborhood of x using a proxy model.
We can choose a linear model, a decision tree, etc. for the proxy.
Saliency maps - answers Instead of training a proxy model, directly highlight the salient features.
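The LIME idea above, perturb around an instance and fit a linear proxy locally, can be sketched in a few lines. Everything here is a toy illustration of the technique, not the actual LIME library: `black_box` stands in for any opaque model, and the helper names are mine.

```python
import numpy as np

def black_box(X):
    """Stand-in for an opaque model (deliberately nonlinear)."""
    return np.sin(X[:, 0]) + X[:, 1] ** 2

def lime_sketch(x, n_samples=500, scale=0.1, rng=None):
    """Fit a linear proxy to the black box in a small neighborhood of x.
    The returned coefficients act as local feature importances."""
    rng = rng or np.random.default_rng(0)
    Z = x + scale * rng.normal(size=(n_samples, len(x)))   # perturb around x
    y = black_box(Z)
    A = np.hstack([Z, np.ones((n_samples, 1))])            # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1]                                       # per-feature slopes

x = np.array([0.0, 1.0])
print(lime_sketch(x))  # roughly [1, 2], the local gradients of the black box
```

The recovered slopes approximate the model's local partial derivatives at x, which is why a linear proxy can explain a nonlinear model in a small neighborhood even though it would fail globally.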

Institution
Machine Learning Engineering
Degree
Machine Learning Engineering


