CDV Data Analysis and Fundamentals
Practice Exam Questions and Correct Answers, Updated 2026 (Graded A+)
Subject: CDV Data Analysis Fundamentals
Subtopic: Data Validation and Quality Assessment
Question 1: During the data ingestion phase, an analyst observes inconsistent formatting in the
'Date' field where some entries are recorded as YYYY-MM-DD and others as DD-MM-YYYY.
Which data validation procedure is most appropriate to ensure downstream analytical integrity?
A) Implementing a schema-on-read transformation to force a uniform format during the query
process.
B) Applying a standardized data cleansing protocol that enforces a single date format before
loading into the primary database.
C) Utilizing statistical outlier detection to flag incorrectly formatted dates as anomalies.
D) Dropping all records with non-conforming date formats to maintain strict data quality
standards.
Correct Answer: B - Applying a standardized data cleansing protocol that enforces a single
date format before loading into the primary database.
Rationale: Pre-loading data cleansing is the industry standard for maintaining data integrity in
CDV (Data Analysis Fundamentals) pipelines. Option A, while technically possible, creates
performance overhead at the query level. Option C is ineffective because formatting issues are
structural, not statistical outliers. Option D is incorrect because deleting data causes
unnecessary information loss that violates data completeness principles.
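The pre-load cleansing step described in Option B can be sketched as follows. This is a minimal illustration, not a prescribed CDV implementation: the function names `normalize_date` and `cleanse`, the row layout, and the choice of YYYY-MM-DD as the target format are assumptions made for the example. Rows that match neither known format are set aside for review rather than dropped, in line with the rationale against Option D.

```python
from datetime import datetime
from typing import Optional

# The two input formats named in the question: YYYY-MM-DD and DD-MM-YYYY.
KNOWN_FORMATS = ("%Y-%m-%d", "%d-%m-%Y")


def normalize_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def cleanse(rows):
    """Split rows into cleaned rows and rows held back for manual review."""
    clean, rejected = [], []
    for row in rows:
        iso = normalize_date(row["Date"])
        if iso is None:
            rejected.append(row)  # kept for review, not deleted
        else:
            clean.append({**row, "Date": iso})
    return clean, rejected


# Hypothetical sample rows for illustration only.
sample = [
    {"id": 1, "Date": "2023-12-25"},
    {"id": 2, "Date": "25-12-2023"},
    {"id": 3, "Date": "12/25/2023"},
]
clean_rows, rejected_rows = cleanse(sample)
```

Only `clean_rows` would be loaded into the primary database; `rejected_rows` would go to a review queue.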
Question 2: An analyst is performing a check on a dataset and discovers that 15% of the total
records contain null values in a mission-critical feature column. Which evaluation strategy is
most analytically sound?
A) Replace all null values with the mean of the column to ensure no loss of record count.
B) Conduct a pattern analysis to determine if the nulls are Missing Completely at Random
(MCAR) or Missing at Random (MAR) before deciding on an imputation or exclusion strategy.
C) Immediately exclude all records containing a null value to ensure that the dataset consists
only of complete, verifiable entries.
D) Flag the null values as a new category labeled "Unknown" to preserve the total population
size for reporting.
Correct Answer: B - Conduct a pattern analysis to determine if the nulls are Missing
Completely at Random (MCAR) or Missing at Random (MAR) before deciding on an
imputation or exclusion strategy.
Rationale: The CDV fundamentals emphasize understanding the mechanism of missing data
before applying a fix. Option A (mean imputation) understates variability and can severely bias
results when the data is not missing completely at random. Option C is extreme and may introduce
selection bias. Option D ignores the potential
for systemic errors in data collection. Analyzing the nature of the missingness is the professional
requirement.
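One simple way to start the pattern analysis in Option B is to compare the null rate of the critical column across the values of another observed column. The sketch below is a descriptive check only, not a formal statistical test, and the column names (`income`, `region`) and sample rows are hypothetical.

```python
from collections import defaultdict


def missing_rate_by_group(rows, target, group):
    """Fraction of rows with a null `target`, computed per value of `group`."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        key = row[group]
        total[key] += 1
        if row[target] is None:
            missing[key] += 1
    return {key: missing[key] / total[key] for key in total}


# Hypothetical sample: 'income' is the critical column, 'region' is fully observed.
rows = [
    {"region": "north", "income": 52000},
    {"region": "north", "income": 48000},
    {"region": "north", "income": None},
    {"region": "north", "income": 61000},
    {"region": "south", "income": None},
    {"region": "south", "income": None},
    {"region": "south", "income": 39000},
    {"region": "south", "income": None},
]
rates = missing_rate_by_group(rows, "income", "region")
# rates == {"north": 0.25, "south": 0.75}
```

Null rates that differ sharply between groups, as in this toy sample, suggest the missingness depends on an observed variable, which argues against treating it as MCAR. Similar rates across groups are consistent with MCAR but do not prove it.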
Subtopic: Foundational Data Governance
Question 3: When establishing data lineage for a new report, which documentation component is
most critical for ensuring that stakeholders trust the final output?
A) A detailed inventory of every user who accessed the dataset within the last fiscal quarter.
B) The transformation logic and source systems for every field used in the report.
C) The hardware specifications of the server hosting the analytical database.
D) A list of all third-party software tools used to visualize the data.
Correct Answer: B - The transformation logic and source systems for every field used in
the report.
Rationale: Data lineage, a core concept in CDV Fundamentals, requires clear documentation of
the data journey from source to destination, including all transformations applied. Option A
focuses on security/audit logs rather than lineage. Options C and D provide environmental
context but fail to explain the data's derivation, which is necessary for verification and trust.
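The documentation described in Option B can be kept as one structured record per report field. The sketch below assumes a hypothetical `FieldLineage` record and a helper, `undocumented_fields`, that lists report fields with no lineage entry. All system, field, and transformation names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldLineage:
    report_field: str    # name of the field as it appears in the report
    source_system: str   # system the value originates from
    source_field: str    # table/column in that system
    transformation: str  # plain-language description of the logic applied


def undocumented_fields(report_fields, lineage):
    """Return report fields that have no lineage entry."""
    documented = {entry.report_field for entry in lineage}
    return [name for name in report_fields if name not in documented]


# Hypothetical lineage entries for a two-field report.
lineage = [
    FieldLineage("net_revenue", "billing_db", "invoices.amount",
                 "sum of amount minus refunds, grouped by month"),
    FieldLineage("region", "crm", "accounts.region_code",
                 "region code mapped to region name"),
]
gaps = undocumented_fields(["net_revenue", "region", "churn_rate"], lineage)
# gaps == ["churn_rate"]
```

A check like this can be run before a report is published, so that every field stakeholders see can be traced back to a source and a transformation.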
Question 4: An analyst needs to maintain data privacy while performing exploratory data
analysis (EDA) on customer records. Which practice aligns with data minimization principles?
A) Providing a copy of the entire database to the analytics team to ensure they have all necessary
context.
B) Using hash functions or de-identification techniques to strip personally identifiable
information (PII) from the analysis set.
C) Restricting access to the database to only the lead developer to prevent accidental exposure.
D) Deleting all customer records that are older than six months to reduce the data footprint.
Correct Answer: B - Using hash functions or de-identification techniques to strip
personally identifiable information (PII) from the analysis set.
Rationale: Data minimization and privacy are paramount in the CDV curriculum.
De-identification allows for robust EDA while minimizing privacy risks. Option A is a violation of
data security. Option C limits productivity and doesn't solve the data privacy issue for the
analysis itself. Option D is an arbitrary data retention policy that likely destroys valuable
historical trend data.
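A minimal sketch of the approach in Option B, under stated assumptions: direct identifiers are dropped and the customer ID is replaced with a keyed hash (HMAC-SHA256), so records can still be grouped per customer. The field names, the `SECRET_KEY` placeholder, and the helper names are hypothetical. Keyed hashing is pseudonymization rather than full anonymization, and a plain unkeyed hash of a low-entropy identifier can often be reversed by guessing.

```python
import hashlib
import hmac

# Placeholder only: in practice the key would come from a managed secret store.
SECRET_KEY = b"replace-with-a-managed-secret"

# Hypothetical direct identifiers to strip from the analysis set.
DIRECT_IDENTIFIERS = {"name", "email"}


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable keyed hash of `value` as a 64-character hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def to_analysis_record(record: dict) -> dict:
    """Drop direct identifiers and replace customer_id with a pseudonymous key."""
    out = {
        k: v for k, v in record.items()
        if k not in DIRECT_IDENTIFIERS and k != "customer_id"
    }
    out["customer_key"] = pseudonymize(str(record["customer_id"]))
    return out


# Hypothetical customer record for illustration only.
raw = {"customer_id": 1017, "name": "A. Example", "email": "a@example.com",
       "plan": "pro", "monthly_spend": 42.5}
safe = to_analysis_record(raw)
```

The resulting `safe` record keeps only the fields needed for EDA (`plan`, `monthly_spend`) plus a stable `customer_key`.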
Subtopic: Statistical Analysis for Data Fundamentals
Question 5: A dataset measuring process efficiency shows a highly right-skewed distribution.
Which measure of central tendency is most representative of the typical process time?
A) Arithmetic Mean
B) Median
C) Standard Deviation
D) Range
Correct Answer: B - Median
Rationale: In right-skewed distributions, the arithmetic mean (Option A) is disproportionately
pulled toward the tail by outliers, making it less representative. The median (Option B) is robust
against skew and provides a better measure of the central value. Options C and D are measures
of dispersion, not central tendency.
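The effect can be seen on a small made-up sample of process times in which a single slow run forms the right tail. The values below are invented for illustration.

```python
import statistics

# Hypothetical process times in minutes; one slow run (45) creates the right tail.
process_times = [4, 5, 5, 6, 6, 7, 8, 45]

mean_time = statistics.mean(process_times)      # 10.75
median_time = statistics.median(process_times)  # 6.0

print(f"mean = {mean_time}, median = {median_time}")
```

Seven of the eight runs finish in 8 minutes or less, yet the mean is 10.75. The median of 6.0 is much closer to a typical run.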
Subject: CDV Data Analysis Fundamentals
Subtopic: Advanced Data Validation and Quality Assessment
Question 31: An analyst is tasked with merging two datasets: one containing customer sales data
and another containing customer demographic information. Both sets share a 'CustomerID' field,
but the analyst identifies that the demographic dataset contains duplicate IDs. How should the
analyst proceed to ensure accurate join operations?
A) Perform a Cartesian product join to ensure all possible combinations are captured.
B) Remove the duplicate entries in the demographic dataset by selecting the most recent entry
based on a timestamp.