Apache Spark 3.5 – Python Exam | Complete 200-Question Practice Exam with Answers & Explanations | PDF
Question 1
A data scientist at an e-commerce company is working with user data obtained
from its subscriber database and has stored the data in a DataFrame df_user.
Before processing the data further, the data scientist wants to create a second
DataFrame, df_user_non_pii, that holds only the non-PII columns. The PII
columns in df_user are first_name, last_name, email, and birthdate.
Which code snippet meets this requirement?
A. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
B. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
C. df_user_non_pii = df_user.dropfields("first_name", "last_name", "email", "birthdate")
D. df_user_non_pii = df_user.dropfields("first_name, last_name, email, birthdate")
Answer: A
Explanation:
The PySpark drop() method removes specified columns and returns a new
DataFrame. Multiple column names are passed as separate arguments.
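For illustration, a minimal sketch of the accepted answer (assuming df_user already exists):

# drop() returns a new DataFrame; df_user itself is left unchanged
df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
print(df_user_non_pii.columns)  # the four PII columns no longer appear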
Question 2
A data engineer is working on a Streaming DataFrame streaming_df with
unbounded streaming data.
Which operation is supported with streaming_df?
A. streaming_df.select(countDistinct("Name"))
B. streaming_df.groupby("Id").count()
C. streaming_df.orderBy("timestamp").limit(4)
D. streaming_df.filter(col("count") < 30).show()
Answer: B
Explanation:
Structured Streaming supports aggregations over a key, so groupBy("Id").count()
is valid. countDistinct, orderBy, and limit are not supported on an unbounded
stream (sorting is only allowed after an aggregation in complete output mode),
and show() is a batch action; streaming output must go through writeStream.
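A minimal sketch of the supported pattern, assuming streaming_df was created
with spark.readStream from a source that exposes an Id column:

# Keyed aggregation is allowed on an unbounded stream
counts = streaming_df.groupBy("Id").count()

# Streaming results are emitted through writeStream, not show();
# complete output mode is valid because the query contains an aggregation
query = (counts.writeStream
    .outputMode("complete")
    .format("console")
    .start())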
Question 3
An MLOps engineer is building a Pandas UDF that applies a language model to
translate English strings to Spanish. The initial code loads the model on every
call to the UDF:
def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())
How can the engineer reduce the number of times the model is loaded?
A. Convert the Pandas UDF to a PySpark UDF
B. Convert the Pandas UDF from Series→Series to Series→Scalar UDF
C. Run the in_spanish_inner() function in a mapInPandas() call
D. Convert the Pandas UDF from Series→Series to
Iterator[Series]→Iterator[Series] UDF
Answer: D
Explanation:
With the Iterator[pd.Series] -> Iterator[pd.Series] variant, the UDF body runs
once per task: the model is loaded a single time before iterating over the
incoming batches, instead of once per batch, which removes the repeated
loading overhead.
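A minimal sketch of the iterator-based rewrite, reusing the hypothetical
get_translation_model() loader from the question:

import pandas as pd
from typing import Iterator
from pyspark.sql import functions as sf
from pyspark.sql.types import StringType

@sf.pandas_udf(StringType())
def in_spanish(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Loaded once per task, before the batch loop, rather than once per batch
    model = get_translation_model(target_lang='es')
    for batch in batches:
        yield batch.apply(model)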
Question 4
A Spark DataFrame df is cached using MEMORY_AND_DISK, but it is too large to
fit entirely in memory. What is the likely behavior?
A. Spark duplicates the DataFrame in both memory and disk. If it doesn't fit in
memory, the DataFrame is stored and retrieved from disk entirely.
B. Spark splits the DataFrame evenly between memory and disk.
C. Spark stores as much as possible in memory and spills the rest to disk when
memory is full, continuing processing with performance overhead.
D. Spark stores frequently accessed rows in memory and less frequently accessed
rows on disk.
Answer: C
Explanation:
MEMORY_AND_DISK caches as much data as possible in memory and spills the
remainder to disk to continue processing.
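A minimal sketch of how this storage level is requested:

from pyspark import StorageLevel

# Partitions that fit stay in memory; the rest spill to disk and are re-read from there
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # an action materializes the cache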
Question 5
A data engineer is building a Structured Streaming pipeline and wants it to recover
from failures or intentional shutdowns by continuing where it left off. How can this
be achieved?
A. Configure checkpointLocation during readStream
B. Configure recoveryLocation during SparkSession initialization
C. Configure recoveryLocation during writeStream
D. Configure checkpointLocation during writeStream
Answer: D
Explanation:
Setting the checkpointLocation option on writeStream makes Spark persist the
query's progress (offsets and state) to durable storage, so a restarted query
resumes exactly where it left off after a failure or intentional shutdown.
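A minimal sketch, with hypothetical output and checkpoint paths:

query = (streaming_df.writeStream
    .format("parquet")
    .option("path", "/data/output")
    .option("checkpointLocation", "/data/checkpoints/my_query")  # enables recovery
    .start())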
Question 6
A Spark DataFrame df contains a column event_time of type timestamp.
You want to calculate the time difference in seconds between consecutive rows,
partitioned by user_id and ordered by event_time. Which function should
you use?
A. lag()
B. lead()
C. row_number()
D. dense_rank()
Answer: A
Explanation:
The lag() function returns the value of a column from a previous row in a
window. Combined with window partitioning and ordering, it allows you to
calculate differences between consecutive rows.
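A minimal sketch of the window computation, assuming df has the user_id and
event_time columns from the question:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("user_id").orderBy("event_time")

df_with_diff = (df
    .withColumn("prev_event_time", F.lag("event_time").over(w))
    # Casting a timestamp to long yields epoch seconds, so the subtraction is in
    # seconds; the first row in each partition has no previous row and yields null
    .withColumn("diff_seconds",
        F.col("event_time").cast("long")
        - F.col("prev_event_time").cast("long")))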