Data Science with Python 2025 — 60 Q&A Complete Study Guide
Verified
Series: CrashCourses Professional Study Series
Author: Dr Z. Moomba, MBChB, MRCPsych | BethelWellness Ltd
Exam Target: Data Science Python
Year: 2025/2026
Format: 60 Questions with Verified Answers and Rationales
Author's Note:
This document is an original work produced for the CrashCourses Professional Study Series.
Clinical questions and professional scenarios were composed by Dr Z. Moomba based on current
exam objectives, published guidelines, and evidence-based sources (2024–2025). All patient
names, ages, and case details are fictional. Any resemblance to existing published Q&A banks is
coincidental. For personal study use only — not for reproduction or redistribution.
SECTION A — FOUNDATIONS
Question 1
A data scientist is analyzing a 1D NumPy array containing the systolic blood pressures of 10,000
patients. They need to identify all indices where the pressure exceeds 140 mmHg. Which NumPy
function is most efficient and syntactically correct for this operation without using a Python `for`
loop?
A) `np.where(bp_array > 140)`
B) `np.find(bp_array > 140)`
C) `bp_array.index(> 140)`
D) `np.extract(bp_array, > 140)`
Answer: A
Rationale:
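Called with only a condition, `np.where()` returns a tuple of index arrays marking the elements
where the condition is true; for a 1D array, `np.where(bp_array > 140)[0]` gives the indices
themselves. The key discriminating feature is vectorisation: the comparison `bp_array > 140` is
evaluated over the whole array at once, avoiding the overhead of iterating element by element in
Python. Option B fails because `np.find()` does not exist in NumPy (the valid equivalents are
`np.where` and `np.nonzero`). Options C and D are not valid Python syntax, since `> 140` cannot
be passed as an argument on its own; `np.extract` also takes the condition first and returns
values rather than indices. High-yield pearl: Vectorised operations in NumPy execute in
pre-compiled C code, often making them one to two orders of magnitude faster than native Python
loops for large datasets [NumPy Official Docs 2025].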
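As a minimal sketch of the point above, the snippet below runs the condition on a small array. The six readings are made-up illustration values, not data from the guide, and `np.flatnonzero` is shown only as an alternative spelling.

```python
import numpy as np

# Hypothetical readings; the question's bp_array would hold 10,000 values.
bp_array = np.array([118, 152, 139, 141, 140, 165])

# With a single argument, np.where returns a tuple of index arrays,
# one per dimension, so a 1D input gives a one-element tuple.
high_idx = np.where(bp_array > 140)[0]

# np.flatnonzero returns the same indices directly as a 1D array.
same_idx = np.flatnonzero(bp_array > 140)

# Boolean masking returns the values themselves rather than their positions.
high_values = bp_array[bp_array > 140]

print(high_idx)     # [1 3 5]
print(high_values)  # [152 141 165]
```

A reading of exactly 140 is excluded because the comparison is strict.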
Question 2
A hospital data team has two NumPy arrays: `A` of shape (1000, 4) representing patient vitals, and
`B` of shape (4,) representing calibration weights. They multiply the arrays using `A * B`. What
foundational NumPy concept allows this operation to execute without throwing a shape mismatch
error?
A) Array concatenation
B) Memory mapping
C) Broadcasting
D) Type casting
Answer: C
Rationale:
Broadcasting is the mechanism by which NumPy treats a smaller array as if it were expanded to
match the shape of a larger array during arithmetic operations, without actually copying the data.
The feature that seals the answer is the shape compatibility: the trailing dimensions match (4 and
4), so `B` is "stretched" across the 1000 rows. Option A fails because concatenation joins arrays
end-to-end, rather than enabling element-wise arithmetic across differing shapes. High-yield pearl:
Broadcasting compares shapes from the trailing dimension backwards, and each pair of dimensions
must either be equal or contain a 1; a missing leading dimension is treated as 1 [Python Data
Science Handbook 2024].
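The rule can be checked on a small stand-in for the arrays in the question. This is a sketch under assumptions: the 3-row array and the weight values are invented for illustration, and only the shapes mirror the scenario.

```python
import numpy as np

# Three rows stand in for the 1000 patients; four columns are the vitals.
A = np.arange(12, dtype=float).reshape(3, 4)
B = np.array([1.0, 0.5, 2.0, 0.1])  # hypothetical calibration weights, shape (4,)

# Shapes (3, 4) and (4,): trailing dimensions are equal, so B is applied to every row.
scaled = A * B

# The same result with the expansion written out explicitly.
explicit = A * np.broadcast_to(B, A.shape)

# A shape (3,) array does not broadcast against (3, 4): trailing dims are 3 and 4.
try:
    A * np.ones(3)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Adding a trailing axis of length 1 gives shape (3, 1), which is compatible.
row_scaled = A * np.ones(3)[:, np.newaxis]

print(scaled.shape, mismatch_raised)  # (3, 4) True
```

The failing case shows why the trailing dimension matters: per-row weights need an explicit axis of length 1 before they can be broadcast across columns.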
Question 3
You are cleaning a pandas DataFrame `df` containing electronic health records (EHR). The column
`HbA1c` has 15% missing values (NaN). You want to replace these missing values with the median
of the `HbA1c` column grouped by the `Diabetic_Status` column. Which code snippet correctly
achieves this?
A) `df['HbA1c'].fillna(df.groupby('Diabetic_Status')['HbA1c'].transform('median'), inplace=True)`
B) `df['HbA1c'] = df.groupby('Diabetic_Status')['HbA1c'].median().fillna()`
C) `df['HbA1c'].replace(np.nan, df['HbA1c'].median())`
D) `df.fillna(df.groupby('Diabetic_Status')['HbA1c'].apply('median'))`
Answer: A
Rationale:
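The `transform()` method returns a Series with the same length and index as the original column,
so each row carries the median of its own group and can be used directly to fill NaN values. The
crucial feature is combining `transform` with `fillna`, which maps the group-specific medians back
to the exact rows missing data. Option B fails because `groupby().median()` returns an aggregated
Series with one row per group, indexed by `Diabetic_Status`, so it does not align with the row
index of the main DataFrame (and `fillna()` with no arguments raises an error). Option C ignores
the grouping and does not store its result. Note that in recent pandas versions, calling
`fillna(..., inplace=True)` on a selected column may emit a chained-assignment warning and, with
Copy-on-Write enabled, may leave `df` unchanged; assigning the result back with
`df['HbA1c'] = df['HbA1c'].fillna(...)` is the safer form of option A. High-yield pearl: Imputing
missing clinical continuous variables with group-specific medians is more robust to outliers than
imputing with group means [Health Data Analytics Guidelines 2025].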
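The following sketch illustrates the `transform` plus `fillna` pattern on a six-row frame. The values are fictional, and the assignment form is used instead of `inplace=True` as an assumption about what behaves consistently across pandas versions.

```python
import numpy as np
import pandas as pd

# Fictional records; only the column names come from the question.
df = pd.DataFrame({
    "Diabetic_Status": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "HbA1c": [8.0, np.nan, 9.0, 5.0, 5.5, np.nan],
})

# transform("median") returns one value per original row (NaNs are skipped
# when the median is computed), aligned to df's index.
group_median = df.groupby("Diabetic_Status")["HbA1c"].transform("median")

# Fill only the missing entries, then assign the result back to the column.
df["HbA1c"] = df["HbA1c"].fillna(group_median)

print(df["HbA1c"].tolist())  # [8.0, 8.5, 9.0, 5.0, 5.5, 5.25]
```

Observed values are left untouched; only the two NaN rows receive their group's median.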
Question 4
A researcher is analyzing clinical trial data and needs to reshape a pandas DataFrame from a
"wide" format (columns: `PatientID`, `Day1_Score`, `Day2_Score`) to a "long" format (columns:
`PatientID`, `Day`, `Score`). Which pandas function is explicitly designed for this transformation?
A) `pd.pivot_table()`
B) `pd.melt()`
C) `pd.concat()`
D) `pd.crosstab()`
Answer: B
Rationale:
The `pd.melt()` function unpivots a DataFrame from wide to long format, gathering columns into
rows. The explicit requirement to convert `Day1_Score` and `Day2_Score` into a single `Day`
identifier column and a single `Score` value column dictates the use of melt. Option A is incorrect
because `pivot_table()` does the exact opposite, aggregating long data into wide format. High-
yield pearl: Longitudinal medical data is typically collected in wide format but must often be
melted into long format for time-series analysis or mixed-effects modeling [Pandas
Documentation 2025].
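A short sketch of the wide-to-long reshape described above. The two patients and their scores are invented, and the regular-expression cleanup of the `Day` column is an illustrative extra step, since `melt` on its own stores the original column names (`Day1_Score`, `Day2_Score`) as the identifier values.

```python
import pandas as pd

# Fictional wide-format trial data using the column names from the question.
wide = pd.DataFrame({
    "PatientID": [101, 102],
    "Day1_Score": [7, 4],
    "Day2_Score": [5, 3],
})

# Unpivot: one row per patient per day.
long_df = pd.melt(
    wide,
    id_vars="PatientID",
    value_vars=["Day1_Score", "Day2_Score"],
    var_name="Day",
    value_name="Score",
)

# Pull the day number out of the former column names, e.g. "Day1_Score" -> 1.
long_df["Day"] = long_df["Day"].str.extract(r"Day(\d+)_Score", expand=False).astype(int)

# Order rows by patient, then by day, for readability.
long_df = long_df.sort_values(["PatientID", "Day"]).reset_index(drop=True)

print(long_df)
```

For column names that follow a regular stub-plus-suffix pattern, `pd.wide_to_long` is another option, though `pd.melt` is the function the question targets.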
Question 5