AND EXAM PREPARATION
Question 1: What is Data Science and why is it important?
Answer 1:
Data Science is an interdisciplinary field combining statistics, computer science, domain
expertise, and machine learning to extract actionable knowledge from data. It enables
organizations to make informed decisions, identify trends, and gain a competitive advantage by
analyzing and interpreting large datasets.
Question 2: What are the main stages of the data science lifecycle?
Answer 2:
The stages include: Data Collection, Data Cleaning, Data Exploration & Visualization, Feature
Engineering, Model Building, Model Evaluation, Deployment, and Monitoring. Each stage
ensures data quality and efficient knowledge extraction for predictive or descriptive insights.
Question 3: What is the difference between supervised and unsupervised learning?
Answer 3:
Supervised learning uses labeled data to predict outputs, with tasks such as classification and
regression. Unsupervised learning analyzes unlabeled data to find patterns or groupings, such as
clustering and dimensionality reduction.
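For example, a minimal sketch contrasting the two, assuming Python with scikit-learn (the iris data and model choices here are illustrative):

```python
# Supervised vs. unsupervised learning on the same features.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: labels y guide the model toward predicting known classes.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted class:", clf.predict(X[:1]))

# Unsupervised: only X is used; the algorithm discovers groupings on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", km.labels_[:5])
```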
Question 4: Explain the concept of overfitting and how it can be prevented.
Answer 4:
Overfitting occurs when a model learns noise and details from training data too well, reducing
its generalization to new data. Prevention includes simpler models, regularization (L1/L2),
cross-validation, and increasing the amount of training data.
Question 5: What is a confusion matrix and what metrics does it provide?
Answer 5:
A confusion matrix displays true vs. predicted classifications in tabular form. From it, accuracy,
precision, recall, specificity, and F1 score can be calculated to comprehensively evaluate
classification model performance.
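A short illustrative sketch, assuming scikit-learn (the toy labels are made up):

```python
# Build a confusion matrix and derive the common metrics from it.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (toy example)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print("accuracy   :", accuracy_score(y_true, y_pred))
print("precision  :", precision_score(y_true, y_pred))
print("recall     :", recall_score(y_true, y_pred))
print("specificity:", tn / (tn + fp))   # derived from the matrix directly
print("F1 score   :", f1_score(y_true, y_pred))
```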
Question 6: How does Principal Component Analysis (PCA) help in data analysis?
Answer 6:
PCA reduces dimensionality by transforming correlated variables into uncorrelated principal
components that capture maximum variance. It simplifies data visualization, reduces noise, and
enhances algorithm efficiency.
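For instance, a hedged sketch with scikit-learn, reducing the 4-dimensional iris features to two components:

```python
# Reduce 4-D iris features to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```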
Question 7: Describe the bias-variance tradeoff in machine learning.
Answer 7:
Bias is error from wrong assumptions, causing underfitting. Variance is sensitivity to training
data variations, causing overfitting. Balancing the two ensures models generalize well without
being too simple or overly complex.
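A small demonstration sketch (assuming scikit-learn and synthetic data): a low-degree polynomial underfits (high bias) while a high-degree one overfits (high variance), visible in the gap between train and test error:

```python
# Illustrate bias vs. variance by varying polynomial degree.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 80).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree={degree:2d}  "
          f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.3f}  "
          f"test MSE={mean_squared_error(y_te, model.predict(X_te)):.3f}")
```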
Question 8: What is cross-validation and why is it used?
Answer 8:
Cross-validation partitions data into subsets, iteratively training and validating models on
different splits to assess generalization performance and prevent overfitting, aiding robust
model tuning.
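For example, 5-fold cross-validation with scikit-learn (the dataset and model are illustrative):

```python
# 5-fold cross-validation: each fold serves once as the validation set.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean / std     :", scores.mean(), scores.std())
```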
Question 9: Differentiate between classification and regression problems.
Answer 9:
Classification predicts discrete categories, while regression predicts continuous numerical
values. Evaluation metrics and algorithm choices depend fundamentally on the problem type.
Question 10: What are the assumptions of linear regression?
Answer 10:
Assumptions include linearity, independence of errors, homoscedasticity (constant error
variance), normality of residuals, and no multicollinearity among predictors to produce valid
inference.
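A hedged sketch of how two of these assumptions might be checked on a fitted model's residuals (the synthetic data here is built to satisfy the assumptions):

```python
# Check residual normality and constant variance for a linear model.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1, 200)   # linear signal + constant-variance noise

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Normality of residuals: Shapiro-Wilk (a high p-value gives no evidence against normality).
stat, p = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p)

# Homoscedasticity (rough check): residual spread in low vs. high fitted-value halves.
low = residuals[fitted < np.median(fitted)]
high = residuals[fitted >= np.median(fitted)]
print("Residual std (low / high fitted):", low.std(), high.std())
```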
Question 11: Explain feature engineering and provide examples.
Answer 11:
Feature engineering transforms raw data into meaningful inputs to improve model quality.
Examples: encoding categorical variables, scaling numerical features, creating interaction terms,
or deriving date parts.
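For instance, with pandas (the toy DataFrame is illustrative):

```python
# Common feature-engineering steps on a toy DataFrame.
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Paris"],
    "price": [120.0, 340.0, 95.0],
    "signup": pd.to_datetime(["2023-01-05", "2023-06-20", "2023-11-02"]),
})

df = pd.get_dummies(df, columns=["city"])      # encode a categorical variable
df["price_scaled"] = (df["price"] - df["price"].mean()) / df["price"].std()  # scale a numeric feature
df["signup_month"] = df["signup"].dt.month     # derive a date part
print(df)
```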
Question 12: What is the role of regularization in machine learning?
Answer 12:
Regularization adds a penalty on model complexity, discouraging overfitting and improving
generalization: L2 (ridge) shrinks coefficients toward zero, while L1 (lasso) can set some
coefficients exactly to zero.
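A quick sketch of the shrinkage effect, assuming scikit-learn (the alpha values are arbitrary):

```python
# Compare unregularized, L2 (ridge), and L1 (lasso) coefficients.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=0)

print("OLS  :", LinearRegression().fit(X, y).coef_.round(1))
print("Ridge:", Ridge(alpha=10.0).fit(X, y).coef_.round(1))   # shrinks coefficients toward zero
print("Lasso:", Lasso(alpha=10.0).fit(X, y).coef_.round(1))   # can set some exactly to zero
```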
Question 13: Describe batch learning and online learning.
Answer 13:
Batch learning trains on the entire dataset at once, suitable for static data. Online learning
updates models incrementally as new data arrives, enabling real-time adaptation.
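A minimal sketch of the online style, assuming scikit-learn's SGDClassifier with partial_fit (the streamed batches are simulated):

```python
# Online learning: update a linear model incrementally as batches arrive.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])           # all classes must be declared on the first call

rng = np.random.default_rng(0)
for _ in range(10):                  # simulate data arriving in small batches
    X_batch = rng.normal(size=(32, 4))
    y_batch = (X_batch.sum(axis=1) > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)   # incremental update

print("Prediction:", clf.predict(rng.normal(size=(1, 4))))
```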
Question 14: What is a decision tree and how does it make predictions?
Answer 14:
Decision trees split data on feature values forming a tree structure; predictions are made by
traversing from the root to a leaf node representing a final decision or value.
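For example, with scikit-learn (a depth-2 tree keeps the printed rules short):

```python
# Fit a small decision tree and inspect its learned splits.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(export_text(tree))                     # the root-to-leaf rules the tree learned
print("Prediction:", tree.predict(X[:1]))    # internally traverses root -> leaf
```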
Question 15: Explain ensemble learning and list common methods.
Answer 15:
Ensemble learning combines multiple models to improve accuracy and robustness. Methods
include bagging (Random Forests), boosting (AdaBoost, Gradient Boosting), and stacking.
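A rough comparison sketch, assuming scikit-learn (the dataset and default settings are illustrative):

```python
# Compare a single tree against bagging and boosting ensembles.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
for name, model in [
    ("single tree   ", DecisionTreeClassifier(random_state=0)),
    ("bagging (RF)  ", RandomForestClassifier(random_state=0)),
    ("boosting (GBM)", GradientBoostingClassifier(random_state=0)),
]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```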
Question 16: Define the curse of dimensionality.
Answer 16:
The curse of dimensionality is the problem of exponential data sparsity and increased
complexity as feature space dimensionality grows, degrading model performance.
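One symptom can be shown numerically: in high dimensions, nearest and farthest neighbors become almost equidistant, so distance-based methods lose discriminating power (a NumPy sketch with uniform random points):

```python
# Distance concentration as dimensionality grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(500, d))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)  # distances to one point
    ratio = (dists.max() - dists.min()) / dists.min()       # shrinks as d grows
    print(f"d={d:5d}  relative spread of distances: {ratio:.3f}")
```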
Question 17: Define precision, recall, and F1 score.
Answer 17:
Precision: True positives / predicted positives.
Recall: True positives / actual positives.
F1 score: Harmonic mean of precision and recall, balancing false positives and false negatives.
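Worked through with toy counts (the numbers are made up):

```python
# Compute precision, recall, and F1 directly from counts.
tp, fp, fn = 40, 10, 20                      # toy counts

precision = tp / (tp + fp)                   # 40/50 = 0.80
recall = tp / (tp + fn)                      # 40/60 ~ 0.667
f1 = 2 * precision * recall / (precision + recall)
print(precision, round(recall, 3), round(f1, 3))   # 0.8 0.667 0.727
```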
Question 18: What is a p-value and its significance?
Answer 18:
A p-value is the probability of observing results at least as extreme as those measured, assuming
the null hypothesis is true. Low p-values (commonly below 0.05) indicate statistical significance
and provide evidence against the null hypothesis.
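For example, a two-sample t-test with SciPy (the synthetic groups really do differ, so a small p-value is expected):

```python
# Two-sample t-test: the p-value under the null of equal means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=50)
group_b = rng.normal(loc=5.5, scale=1.0, size=50)   # true means differ

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")       # small p -> evidence against the null
```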
Question 19: How does the k-Nearest Neighbors (k-NN) algorithm work?
Answer 19:
k-NN classifies a point based on the majority class among its k closest neighbors, using a distance
metric such as Euclidean distance. It is simple, intuitive, and effective, especially on small datasets.
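A bare-bones sketch of the idea in NumPy (not a production implementation):

```python
# Minimal k-NN classification by majority vote.
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)    # Euclidean distance to each training point
    nearest = np.argsort(dists)[:k]                # indices of the k closest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

X_train = np.array([[1, 1], [1, 2], [8, 8], [9, 8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([8, 9])))    # -> 1
```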
Question 20: What are missing data and methods to handle them?
Answer 20:
Missing data are absent values caused by errors or non-responses. Handling techniques include
removing incomplete samples, imputing with statistical measures or predictive models, and
using algorithms that tolerate missingness.
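Three of these techniques sketched with pandas and scikit-learn (toy data):

```python
# Common ways to handle missing values.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50, 60, np.nan, 52]})

dropped = df.dropna()                                        # 1) remove incomplete rows
filled = df.fillna(df.median(numeric_only=True))             # 2) impute with a statistic
imputed = SimpleImputer(strategy="mean").fit_transform(df)   # 3) sklearn-style imputer
print(dropped, filled, imputed, sep="\n\n")
```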
Question 21: What distinguishes parametric from non-parametric models?
Answer 21:
Parametric models assume a fixed number of parameters and a fixed model form, offering simplicity
but limited flexibility. Non-parametric models do not fix the number of parameters in advance and
can grow in complexity with data size, offering flexibility at higher computational cost.
Question 22: Describe the gradient descent algorithm.
Answer 22:
Gradient descent iteratively updates model parameters in the direction opposite the gradient of
the loss function to minimize error; it is widely used to train machine learning models.
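A from-scratch sketch for simple linear regression (synthetic data; the learning rate and iteration count are arbitrary):

```python
# Gradient descent on mean squared error for y = w*x + b.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 100)
y = 4.0 * X + 1.0 + rng.normal(0, 0.1, 100)   # true slope 4, intercept 1

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    y_hat = w * X + b
    # Gradients of MSE = mean((y_hat - y)^2) with respect to w and b.
    grad_w = 2 * np.mean((y_hat - y) * X)
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w                          # step opposite the gradient
    b -= lr * grad_b
print(f"w ~ {w:.2f}, b ~ {b:.2f}")            # should approach 4 and 1
```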
Question 23: How does logistic regression perform classification?