AND EXAM PREPARATION
Question 1: What is Data Science and why is it important?
Answer 1:
Data Science is an interdisciplinary field combining statistics, computer science, domain
expertise, and machine learning to extract actionable knowledge from data. It enables
organizations to make informed decisions, identify trends, and gain a competitive advantage by
analyzing and interpreting large datasets.
Question 2: What are the main stages of the data science lifecycle?
Answer 2:
The stages include: Data Collection, Data Cleaning, Data Exploration & Visualization, Feature
Engineering, Model Building, Model Evaluation, Deployment, and Monitoring. Each stage
ensures data quality and efficient knowledge extraction for predictive or descriptive insights.
Question 3: What is the difference between supervised and unsupervised learning?
Answer 3:
Supervised learning uses labeled data to predict outputs, with tasks such as classification and
regression. Unsupervised learning analyzes unlabeled data to find patterns or groupings, such as
clustering and dimensionality reduction.
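For example, a minimal sketch contrasting the two, assuming Python with scikit-learn (the iris data and model choices here are illustrative):

```python
# Supervised vs. unsupervised learning on the same features.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: labels y guide the model toward predicting known classes.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted class:", clf.predict(X[:1]))

# Unsupervised: only X is used; the algorithm discovers groupings on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", km.labels_[:5])
```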
Question 4: Explain the concept of overfitting and how it can be prevented.
Answer 4:
Overfitting occurs when a model learns noise and details from training data too well, reducing
its generalization to new data. Prevention includes simpler models, regularization (L1/L2),
cross-validation, and increasing the amount of training data.
Question 5: What is a confusion matrix and what metrics does it provide?
Answer 5:
A confusion matrix displays true vs. predicted classifications in tabular form. From it, accuracy,
precision, recall, specificity, and F1 score can be calculated to comprehensively evaluate
classification model performance.
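A short illustrative sketch, assuming scikit-learn (the toy labels are made up):

```python
# Build a confusion matrix and derive the common metrics from it.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (toy example)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print("accuracy   :", accuracy_score(y_true, y_pred))
print("precision  :", precision_score(y_true, y_pred))
print("recall     :", recall_score(y_true, y_pred))
print("specificity:", tn / (tn + fp))   # derived from the matrix directly
print("F1 score   :", f1_score(y_true, y_pred))
```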
Question 6: How does Principal Component Analysis (PCA) help in data analysis?
Answer 6:
PCA reduces dimensionality by transforming correlated variables into uncorrelated principal
components that capture maximum variance. It simplifies data visualization, reduces noise, and
enhances algorithm efficiency.
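For instance, a hedged sketch with scikit-learn, reducing the 4-dimensional iris features to two components:

```python
# Reduce 4-D iris features to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```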
Question 7: Describe the bias-variance tradeoff in machine learning.
Answer 7:
Bias is error from wrong assumptions, causing underfitting. Variance is sensitivity to training
data variations, causing overfitting. Balancing the two ensures models generalize well without
being too simple or overly complex.
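A small demonstration sketch (assuming scikit-learn and synthetic data): a low-degree polynomial underfits (high bias) while a high-degree one overfits (high variance), visible in the gap between train and test error:

```python
# Illustrate bias vs. variance by varying polynomial degree.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 80).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree={degree:2d}  "
          f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.3f}  "
          f"test MSE={mean_squared_error(y_te, model.predict(X_te)):.3f}")
```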
Question 8: What is cross-validation and why is it used?
Answer 8:
Cross-validation partitions data into subsets, iteratively training and validating models on
different splits to assess generalization performance and prevent overfitting, aiding robust
model tuning.
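For example, 5-fold cross-validation with scikit-learn (the dataset and model are illustrative):

```python
# 5-fold cross-validation: each fold serves once as the validation set.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean / std     :", scores.mean(), scores.std())
```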
Question 9: Differentiate between classification and regression problems.
Answer 9:
Classification predicts discrete categories, while regression predicts continuous numerical
values. Evaluation metrics and algorithm choices depend fundamentally on the problem type.
Question 10: What are the assumptions of linear regression?
Answer 10:
Assumptions include linearity, independence of errors, homoscedasticity (constant error
variance), normality of residuals, and no multicollinearity among predictors to produce valid
inference.
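A hedged sketch of how two of these assumptions might be checked on a fitted model's residuals (the synthetic data here is built to satisfy the assumptions):

```python
# Check residual normality and constant variance for a linear model.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1, 200)   # linear signal + constant-variance noise

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Normality of residuals: Shapiro-Wilk (a high p-value gives no evidence against normality).
stat, p = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p)

# Homoscedasticity (rough check): residual spread in low vs. high fitted-value halves.
low = residuals[fitted < np.median(fitted)]
high = residuals[fitted >= np.median(fitted)]
print("Residual std (low / high fitted):", low.std(), high.std())
```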
Question 11: Explain feature engineering and provide examples.
Answer 11:
Feature engineering transforms raw data into meaningful inputs to improve model quality.
Examples: encoding categorical variables, scaling numerical features, creating interaction terms,
or deriving date parts.
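For instance, with pandas (the toy DataFrame is illustrative):

```python
# Common feature-engineering steps on a toy DataFrame.
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Paris"],
    "price": [120.0, 340.0, 95.0],
    "signup": pd.to_datetime(["2023-01-05", "2023-06-20", "2023-11-02"]),
})

df = pd.get_dummies(df, columns=["city"])      # encode a categorical variable
df["price_scaled"] = (df["price"] - df["price"].mean()) / df["price"].std()  # scale a numeric feature
df["signup_month"] = df["signup"].dt.month     # derive a date part
print(df)
```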
Question 12: What is the role of regularization in machine learning?
Answer 12:
Regularization adds a penalty on model complexity, discouraging overfitting and improving
generalization: L2 (ridge) shrinks coefficients toward zero, while L1 (lasso) can set some
coefficients exactly to zero.
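A quick sketch of the shrinkage effect, assuming scikit-learn (the alpha values are arbitrary):

```python
# Compare unregularized, L2 (ridge), and L1 (lasso) coefficients.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=0)

print("OLS  :", LinearRegression().fit(X, y).coef_.round(1))
print("Ridge:", Ridge(alpha=10.0).fit(X, y).coef_.round(1))   # shrinks coefficients toward zero
print("Lasso:", Lasso(alpha=10.0).fit(X, y).coef_.round(1))   # can set some exactly to zero
```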
Question 13: Describe batch learning and online learning.
Answer 13:
Batch learning trains on the entire dataset at once, suitable for static data. Online learning
updates models incrementally as new data arrives, enabling real-time adaptation.
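A minimal sketch of the online style, assuming scikit-learn's SGDClassifier with partial_fit (the streamed batches are simulated):

```python
# Online learning: update a linear model incrementally as batches arrive.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])           # all classes must be declared on the first call

rng = np.random.default_rng(0)
for _ in range(10):                  # simulate data arriving in small batches
    X_batch = rng.normal(size=(32, 4))
    y_batch = (X_batch.sum(axis=1) > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)   # incremental update

print("Prediction:", clf.predict(rng.normal(size=(1, 4))))
```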
Question 14: What is a decision tree and how does it make predictions?
Answer 14:
Decision trees split data on feature values forming a tree structure; predictions are made by
traversing from the root to a leaf node representing a final decision or value.
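For example, with scikit-learn (a depth-2 tree keeps the printed rules short):

```python
# Fit a small decision tree and inspect its learned splits.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(export_text(tree))                     # the root-to-leaf rules the tree learned
print("Prediction:", tree.predict(X[:1]))    # internally traverses root -> leaf
```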
Question 15: Explain ensemble learning and list common methods.
Answer 15:
Ensemble learning combines multiple models to improve accuracy and robustness. Methods
include bagging (Random Forests), boosting (AdaBoost, Gradient Boosting), and stacking.
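A rough comparison sketch, assuming scikit-learn (the dataset and default settings are illustrative):

```python
# Compare a single tree against bagging and boosting ensembles.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
for name, model in [
    ("single tree   ", DecisionTreeClassifier(random_state=0)),
    ("bagging (RF)  ", RandomForestClassifier(random_state=0)),
    ("boosting (GBM)", GradientBoostingClassifier(random_state=0)),
]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```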
Question 16: Define the curse of dimensionality.
Answer 16:
The curse of dimensionality is the problem of exponential data sparsity and increased
complexity as feature space dimensionality grows, degrading model performance.
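One symptom can be shown numerically: in high dimensions, nearest and farthest neighbors become almost equidistant, so distance-based methods lose discriminating power (a NumPy sketch with uniform random points):

```python
# Distance concentration as dimensionality grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(500, d))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)  # distances to one point
    ratio = (dists.max() - dists.min()) / dists.min()       # shrinks as d grows
    print(f"d={d:5d}  relative spread of distances: {ratio:.3f}")
```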
Question 17: Define precision, recall, and F1 score.
Answer 17:
Precision: True positives / predicted positives.
Recall: True positives / actual positives.
F1 score: Harmonic mean of precision and recall, balancing false positives and false negatives.
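Worked through with toy counts (the numbers are made up):

```python
# Compute precision, recall, and F1 directly from counts.
tp, fp, fn = 40, 10, 20                      # toy counts

precision = tp / (tp + fp)                   # 40/50 = 0.80
recall = tp / (tp + fn)                      # 40/60 ~ 0.667
f1 = 2 * precision * recall / (precision + recall)
print(precision, round(recall, 3), round(f1, 3))   # 0.8 0.667 0.727
```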
Question 18: What is a p-value and its significance?
Answer 18:
A p-value is the probability of observing results at least as extreme as those measured, assuming
the null hypothesis is true. Low p-values (commonly below 0.05) indicate statistical significance
and provide evidence against the null hypothesis.
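For example, a two-sample t-test with SciPy (the synthetic groups really do differ, so a small p-value is expected):

```python
# Two-sample t-test: the p-value under the null of equal means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=50)
group_b = rng.normal(loc=5.5, scale=1.0, size=50)   # true means differ

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")       # small p -> evidence against the null
```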
Question 19: How does the k-Nearest Neighbors (k-NN) algorithm work?
Answer 19:
k-NN classifies a point based on the majority class among its k closest neighbors, using a distance
metric such as Euclidean distance. It is simple, intuitive, and effective, especially on small datasets.
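A bare-bones sketch of the idea in NumPy (not a production implementation):

```python
# Minimal k-NN classification by majority vote.
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)    # Euclidean distance to each training point
    nearest = np.argsort(dists)[:k]                # indices of the k closest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

X_train = np.array([[1, 1], [1, 2], [8, 8], [9, 8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([8, 9])))    # -> 1
```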
Question 20: What are missing data and methods to handle them?
Answer 20:
Missing data are absent values caused by errors or non-responses. Handling techniques include
removing incomplete samples, imputing with statistical measures or predictive models, and
using algorithms that tolerate missingness.
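Three of these techniques sketched with pandas and scikit-learn (toy data):

```python
# Common ways to handle missing values.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50, 60, np.nan, 52]})

dropped = df.dropna()                                        # 1) remove incomplete rows
filled = df.fillna(df.median(numeric_only=True))             # 2) impute with a statistic
imputed = SimpleImputer(strategy="mean").fit_transform(df)   # 3) sklearn-style imputer
print(dropped, filled, imputed, sep="\n\n")
```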
Question 21: What distinguishes parametric from non-parametric models?
Answer 21:
Parametric models assume a fixed number of parameters and a fixed model form, offering simplicity
but limited flexibility. Non-parametric models do not fix the number of parameters in advance and
can grow in complexity with data size, offering flexibility at higher computational cost.
Question 22: Describe the gradient descent algorithm.
Answer 22:
Gradient descent iteratively updates model parameters in the direction opposite the gradient of
the loss function to minimize error; it is widely used to train machine learning models.
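A from-scratch sketch for simple linear regression (synthetic data; the learning rate and iteration count are arbitrary):

```python
# Gradient descent on mean squared error for y = w*x + b.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 100)
y = 4.0 * X + 1.0 + rng.normal(0, 0.1, 100)   # true slope 4, intercept 1

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    y_hat = w * X + b
    # Gradients of MSE = mean((y_hat - y)^2) with respect to w and b.
    grad_w = 2 * np.mean((y_hat - y) * X)
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w                          # step opposite the gradient
    b -= lr * grad_b
print(f"w ~ {w:.2f}, b ~ {b:.2f}")            # should approach 4 and 1
```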
Question 23: How does logistic regression perform classification?