Data Processing at Scale
KNOWLEDGE ASSESSMENT
REVIEW
© ASU 2024/2025
1. Multiple Choice: What is the primary benefit of using
MapReduce in large-scale data processing?
a) Data redundancy
b) Parallel processing
c) Data security
d) Simplified querying
Answer: b) Parallel processing
Rationale: MapReduce splits a large data processing job into
independent tasks and distributes them across multiple machines,
which work on the tasks concurrently and so shorten processing time.
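Illustration: the map, shuffle, and reduce phases described above can be sketched as a small word count. This is a minimal single-machine sketch, not Hadoop's API. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for illustration, and a thread pool stands in for the separate machines of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(chunks, workers=4):
    # Each chunk is mapped independently of the others, which is what
    # lets the map calls run concurrently (threads here, nodes in a cluster).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical input: three splits of a larger text.
chunks = ["big data big", "data at scale", "big scale"]
counts = word_count(chunks)
```

Because no map call depends on another chunk, adding workers does not change the result, only how much of the work overlaps in time.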
2. Fill-in-the-Blank: In distributed computing, _________ refers to
the practice of dividing a large dataset into smaller chunks to be
processed in parallel.
Answer: Sharding
Rationale: Sharding is a type of database partitioning that
separates very large databases into smaller, faster, more easily
managed parts called data shards.
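Illustration: one common way to assign records to shards is to hash a record key and take the result modulo the number of shards. The sketch below assumes that scheme. The helper names (shard_for, shard_records), the key format, and the shard count are hypothetical, and real systems may use range-based or directory-based sharding instead.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard_records(records, num_shards):
    """Split (key, value) records into num_shards independent lists."""
    shards = [[] for _ in range(num_shards)]
    for key, value in records:
        shards[shard_for(key, num_shards)].append((key, value))
    return shards


# Hypothetical records keyed by user id.
records = [("user-1", 10), ("user-2", 20), ("user-3", 30), ("user-1", 5)]
shards = shard_records(records, num_shards=3)
```

A stable hash sends every record with the same key to the same shard, so each shard can be processed in parallel without consulting the others.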
3. True/False: Hadoop is an ideal solution for real-time data
processing.
Answer: False
Rationale: Hadoop is designed for high throughput rather than
low latency, making it better suited to batch processing than to
real-time processing.
4. Multiple Response: Which of the following are characteristics of
a Data Lake?
a) Schema-on-read
b) Schema-on-write
c) Data in its raw form
d) Fixed configuration
Answers: a) Schema-on-read, c) Data in its raw form
Rationale: Data lakes store raw data without a predefined
schema, allowing for the schema to be defined when the data is
read, which provides flexibility in data analysis.
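Illustration: schema-on-read can be sketched by keeping records as raw JSON text and applying a schema only when a consumer reads them. This is a minimal sketch under assumed names. RAW_EVENTS, read_with_schema, and billing_schema are hypothetical, and an actual data lake would typically hold files in object storage rather than an in-memory list.

```python
import json

# Raw records stored as-is; no schema was enforced when they were written.
RAW_EVENTS = [
    '{"user": "a", "amount": "12.50", "ts": "2024-01-01"}',
    '{"user": "b", "amount": "3", "extra": "ignored"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) while reading raw records."""
    for line in raw_lines:
        record = json.loads(line)
        # Fields outside the schema are skipped; missing fields become None.
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }


# One consumer's view of the data; another consumer could pick other fields.
billing_schema = {"user": str, "amount": float}
rows = list(read_with_schema(RAW_EVENTS, billing_schema))
```

The stored data never changes. A different analysis can read the same raw records with a different schema, which is the flexibility the rationale refers to.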
5. Multiple Choice: Which algorithm is commonly used for sorting
large datasets in a distributed system?
a) Quick sort
b) Bubble sort
c) Merge sort