Exam | Latest Verified Questions and Detailed Answers
OVERVIEW DESCRIPTION:
The Databricks Certified Data Engineer Associate Exam focuses on practical, scenario-
driven skills for building and maintaining data pipelines on the Databricks Lakehouse
Platform. Candidates are tested heavily on hands-on Apache Spark and PySpark
transformations, Delta Lake operations (time travel, constraints, OPTIMIZE), and
incremental ingestion using Auto Loader and COPY INTO. The exam also emphasizes
production job orchestration, error handling, Unity Catalog governance (permissions,
lineage, row filters), and core platform architecture, with a particular focus on Delta Live
Tables (DLT) expectations and real-world ELT workflow development.
Data Processing & Transformations (31%)
QUESTION 1
A DataFrame sales_df has columns region, product, revenue. Which method returns a
new DataFrame with distinct rows based on all columns?
A) sales_df.dropDuplicates()
B) sales_df.distinct()
C) sales_df.unique()
D) sales_df.drop_duplicates(subset=['region','product'])
CORRECT ANSWER: B
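EXPERT RATIONALE: distinct() is the PySpark method dedicated to returning a new
DataFrame with duplicate rows removed based on all columns. dropDuplicates() called
with no arguments gives the same result; its optional subset parameter restricts the
comparison to specific columns, as in option D. unique() is not a DataFrame method.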
QUESTION 2
You have a PySpark DataFrame logs with a column timestamp of type StringType in
"yyyy-MM-dd HH:mm:ss". Which function converts it to TimestampType for time-based
operations?
A) to_date(col("timestamp"))
B) unix_timestamp(col("timestamp"), "yyyy-MM-dd HH:mm:ss").cast("timestamp")
C) from_unixtime(col("timestamp"))
D) to_timestamp(col("timestamp"), "yyyy-MM-dd HH:mm:ss")
CORRECT ANSWER: D
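EXPERT RATIONALE: to_timestamp() converts a string column directly to
TimestampType, taking an optional format pattern. Option B also yields a timestamp, but
only by going through an intermediate epoch-seconds value, so to_timestamp() is the
more direct and readable built-in. to_date() returns DateType, and from_unixtime()
expects epoch seconds and returns a string.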
QUESTION 3
Which operation triggers immediate evaluation of a PySpark DataFrame transformation?
A) df.select("col1").alias("new")
B) df.filter(df.col2 > 10)
C) df.count()
D) df.withColumnRenamed("old", "new")
CORRECT ANSWER: C
EXPERT RATIONALE: count() is an action that triggers physical execution of the
DataFrame’s lineage. Transformations like select, filter, and withColumnRenamed are lazily
evaluated.
QUESTION 4
Given df with a nested JSON column address as StructType containing street, city, zip.
Which expression extracts the city field?
A) df.select("address.city")
B) df.select(get_json_object(col("address"), "$.city"))
C) df.select(col("address").getItem("city"))
D) df.select("address['city']")
CORRECT ANSWER: A
EXPERT RATIONALE: When a column is already parsed as StructType, dot notation
(column.field) directly accesses nested fields. get_json_object is for string-encoded
JSON.
QUESTION 5
You register a Python UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
def square(x): return x * x
square_udf = udf(square, IntegerType())
What is a key performance downside compared to Spark built-in functions?
A) UDFs cannot be used on groupBy aggregations
B) Each row is serialized to Python, causing serialization overhead
C) UDFs only work on string columns
D) UDFs disable Catalyst optimizations and force single-core execution
CORRECT ANSWER: B
EXPERT RATIONALE: A standard PySpark UDF sends each row from the JVM to a Python
worker process and back, incurring serialization and deserialization overhead. Built-in
functions operate on JVM data directly without cross-process communication.
QUESTION 6
What does Delta Lake time travel enable you to do without restoring a backup?
A) Query a previous snapshot of a table using a version number or timestamp
B) Roll back schema changes automatically
C) Recover deleted files from cloud storage
D) Convert a Parquet table to Delta format
CORRECT ANSWER: A
EXPERT RATIONALE: Time travel allows querying historical data states using VERSION AS
OF or TIMESTAMP AS OF without data duplication. It relies on Delta’s transaction log.
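The two query forms can be sketched as plain SQL strings. The helper name time_travel_query and the table name sales are hypothetical, and running the queries requires an existing Delta table, so this sketch only builds the statements.

```python
def time_travel_query(table, version=None, timestamp=None):
    """Build a Delta time travel SELECT from exactly one of version or timestamp.

    Hypothetical helper for illustration; pass only trusted values,
    since they are interpolated into the SQL text.
    """
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


# `sales` is a hypothetical table name.
by_version = time_travel_query("sales", version=3)
by_time = time_travel_query("sales", timestamp="2024-01-01 00:00:00")
```

On a workspace where the table exists, either string could be passed to spark.sql(...) to read the historical snapshot.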
QUESTION 7
You apply df.repartition(10) on a DataFrame with 200 GB of data. What is the primary
effect?
A) Increases parallelism by forcing 10 shuffle partitions
B) Coalesces data into 10 partitions without a full shuffle
C) Sorts data across 10 partitions
D) Persists the DataFrame to memory with 10 partitions
CORRECT ANSWER: A