Summary (17/20) DATA ENGINEERING: SOLVED EXAM QUESTIONS

Pages: 99
Uploaded on: 05-09-2024
Written in: 2023/2024

This document contains the solved exam questions of Data Engineering from the Master in Digital Business Engineering, in extended form. It has been compiled based on: - Lectures - Intuition - ChatGPT-4o (answers to the questions were verified with ChatGPT-4o). Academic year 2023/24 had fewer and different questions; since this year there is a broader and more diverse range of exam questions, which is why this document was made. The owner of this document attained a score of 17/20.



Academic Year: 2023 – 2024

University of Antwerp




SOLVED EXAM QUESTIONS
DATA ENGINEERING
prof. L. Feremans

THEORY

Introduction
1. What is a data pipeline? What are the different types of data processing and what is the role of the data
engineering in its development? Give an example of a data pipeline in e-commerce.

A data pipeline is an automated process in which raw data is extracted from various data sources (e.g., an inventory management system, a Salesforce system, Google reviews, ...), transformed into a usable format, and then loaded into a centralized data repository (e.g., a data warehouse or data lake). Such a pipeline gives data scientists a foundation to turn usable data into valuable insights by analyzing the data and generating value. The pipeline may itself contain machine learning models.

Data engineers must:

- Ensure that processing, and thus the pipeline, is:

  - Scalable: able to support large amounts of data.

  - Reliable and available: minimal downtime and operational robustness. This can be achieved with multiple servers and an online copy to minimize downtime in case of issues.

  - Maintainable: it must support continuous changes.

- Implement components to manage the data pipeline:

  - ETL (Extract/Transform/Load): data is extracted from sources, transformed into a suitable format, and then loaded into the repository (data warehouse/data lake).

  - ELT (Extract/Load/Transform): data is first extracted from sources and loaded into the data warehouse, and only then transformed.

- Enable data scientists to perform analyses on the data to extract insights and value.
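The ETL component described above can be sketched in a few lines of Python. This is a minimal, illustrative sketch: the source records, field names, and the list standing in for a warehouse table are all hypothetical, not part of any real system.

```python
# Minimal ETL sketch (illustrative, not a production pipeline).
# Source records and field names are hypothetical.

def extract():
    # Extract: pull raw records from a source system (here: hard-coded samples).
    return [
        {"order_id": 1, "amount": "19.99", "country": "be"},
        {"order_id": 2, "amount": "5.00", "country": "nl"},
    ]

def transform(raw_records):
    # Transform: parse strings into typed values and normalize country codes.
    return [
        {"order_id": r["order_id"],
         "amount": float(r["amount"]),
         "country": r["country"].upper()}
        for r in raw_records
    ]

def load(records, warehouse):
    # Load: append the cleaned records to the central repository
    # (a plain list stands in for a data warehouse table).
    warehouse.extend(records)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse[0])  # {'order_id': 1, 'amount': 19.99, 'country': 'BE'}
```

Swapping the order of `transform` and `load` (loading raw records first, transforming inside the warehouse) would turn this sketch into the ELT variant described above.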

Types of data processing:

1. Real-time processing: online processing where data is processed the moment it arrives. Suitable for environments like financial trading platforms or online gaming, where immediate data processing is crucial for real-time decision-making.

2. Streaming (near real-time processing): data is processed almost immediately after it is generated,
event-based, suitable for monitoring and alert systems. Ideal for environments like social media
monitoring or sensor data analysis in IoT (Internet of Things) devices, where data needs to be
processed almost instantly to trigger alerts or updates.

3. Batch processing (offline processing): data is collected over a period and processed in batches,
suitable for reporting (e.g., hourly or daily reports).
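The contrast between per-event (streaming-style) processing and batch processing can be sketched on one toy event list. The events, field names, and the alert rule are invented for illustration.

```python
# Hypothetical click events; fields and values are made up for illustration.
events = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 1},
    {"user": "a", "clicks": 2},
]

# Streaming style: handle each event as soon as it arrives,
# e.g. fire an alert the instant a per-event rule matches.
alerts = []
for event in events:
    if event["clicks"] >= 3:          # evaluated immediately, per event
        alerts.append(event["user"])

# Batch style: collect a period's events first, then aggregate in one pass,
# e.g. for an hourly or daily report.
totals = {}
for event in events:
    totals[event["user"]] = totals.get(event["user"], 0) + event["clicks"]

print(alerts)  # ['a']
print(totals)  # {'a': 5, 'b': 1}
```

The streaming loop only ever sees one event at a time, while the batch loop assumes the whole period's data is already collected; that is exactly the latency trade-off described above.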

Background information:

During the transformation phase of the data pipeline, the data engineer will be concerned with:

- Aggregating the data
- Parsing the data (from one format to the other)




Example in e-commerce: Personalized Product Recommendations

Sources of data: user clicks on the website, user-related information on the website, buying history from transactional databases, customer reviews.

Data is then extracted from these data sources and transformed in order to build customer profiles. This implies:

- Parsing the data
- Aggregating the data

Finally, it is loaded into a centralized data repository, i.e., a data warehouse/data lake (e.g., Amazon Redshift or Google BigQuery). This is an ETL pipeline; another way of processing the data is ELT, where the last two steps are reversed.

Empowering Data Scientists: data scientists gain access to the central repository to analyze customer behavior and preferences. Predictive ML models are developed to predict customer preferences and recommend products. In this case, the ML technique collaborative filtering can be used, basing recommendations on similarity between users or items.
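The collaborative-filtering idea can be sketched as a toy user-based recommender: score each item a target user has not rated by the ratings of similar users, weighted by cosine similarity. The users, items, and ratings below are invented for illustration.

```python
import math

# Toy user-item rating matrix (hypothetical data).
ratings = {
    "alice": {"book": 5, "laptop": 1, "phone": 1, "tablet": 5},
    "bob":   {"book": 4, "laptop": 1},
    "carol": {"book": 1, "laptop": 5, "phone": 5, "tablet": 1},
}

def cosine(u, v):
    # Cosine similarity over the items both users have rated.
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    nu = math.sqrt(sum(u[i] ** 2 for i in shared))
    nv = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (nu * nv)

def recommend(target, ratings):
    # User-based collaborative filtering: score each item the target has not
    # rated by other users' ratings, weighted by their similarity to the target.
    scores = {}
    for name, r in ratings.items():
        if name == target:
            continue
        sim = cosine(ratings[target], r)
        for item, rating in r.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return max(scores, key=scores.get) if scores else None

print(recommend("bob", ratings))  # 'tablet'
```

Here "bob" rates items much like "alice" does, so alice's love of the tablet outweighs carol's preference for the phone.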

The predictive model is then integrated into the pipeline to provide recommendations. For the latter, we need to know how quickly they should be delivered: immediately, nearly immediately, or only daily/weekly.

This depends on the type of data processing. The type is determined by the speed of the data processing, i.e., "to what extent is the data processed as soon as it becomes available (e.g., when the customer buys a product)?"

A. Immediately -> Real-time processing (immediate recommendation)
B. Nearly immediately -> Streaming (recommendation after a minute or so)
C. At regular intervals (daily/weekly) -> Batch (the recommendation only arrives at the end of the week, or each Thursday you receive an e-mail in your inbox).

2. What is the three-tier architecture? Describe the function and common technologies used in each layer.
Give an example of a three-tier architecture pipeline in e-commerce.

A three-tier architecture is a system architecture that divides a system or application into three logical and physical layers, each with its own specific roles and responsibilities. In system design, it adheres to the separation of concerns principle, which implies that each component should have a single change driver (one reason to change). That means that changes to the presentation (UI), the application logic, or the data storage/retrieval can, and thereby should, happen independently. This architecture mitigates ripple effects of implementation changes in any of the layers. Moreover, each tier can be designed simultaneously by a separate development team: for instance, front-end developers for the presentation tier, back-end developers for the logic tier, and database engineers for the data tier.
(Example: e-commerce/online web shop)

- Presentation tier: the top level, or user interface (UI). It is responsible for translating client requests and results into something the client understands. To fulfill the client's needs, the UI sends the request to the logic tier to handle it and receives the result back so it can display it. Example: a webpage of the online web shop where the user can perform actions such as viewing a product, buying it, registering, adding products to the basket, paying, etc.

- Business logic tier: the second layer, responsible for coordinating the application and handling the requests it receives from the UI layer. It makes logical evaluations based on these tasks and executes operations by requesting the data it needs from the data tier. It moves and processes data between the UI and data tiers. In the web shop, this layer handles the users' actions and requests by validating user information, calculating order totals, checking product availability, and managing the order status.

- Data tier: the third layer, where data is stored and provided by a database or file system. After receiving a request from the logic tier, it sends back the necessary data so the logic tier can process it, perform the necessary operations, and propagate the result back to the presentation tier, which presents it properly to the client. For the web shop, this layer stores the data of transactions made by users, customer data, product data, etc.




Summary:

PT

Function: interaction point with the end user. Receives requests, forwards them to the LT, and properly presents the results to the client.

Applied to example: the webpage or website through which the user performs actions/transactions (buy a product, view a product, ...). Facilitated by a web server and tools such as HTML, JavaScript, PHP, CSS, ...

LT

Function: receives requests from the PT, executes operations and logical evaluations based on those requests, requests the necessary data objects from the DT, and sends the results back to the PT after the necessary calculations, operations and evaluations are finished.

Applied to example: handling actions by users of the e-commerce website, such as providing product availability, filling in a form to request a quote, calculating a total or sub-total, adding a set of products to the basket, ... Facilitated by an application server and tools such as Python/Java.

DT

Function: data persistency. It is responsible for storing the data and providing it upon request from the LT.

Applied to example: user data, product data, sales data, user clicks, ... Facilitated by a relational database server/cloud storage and tools such as DBeaver/SQL.
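The three tiers above can be sketched as three plain Python classes, one per layer, where each layer only talks to the one below it. The class names, product data, and the order-total operation are illustrative assumptions, not a real framework.

```python
# Minimal three-tier sketch in plain Python (illustrative only).

class DataTier:
    """DT: stores data and serves it on request (stands in for a database)."""
    def __init__(self):
        self.products = {"laptop": {"price": 900.0, "stock": 3}}

    def get_product(self, name):
        return self.products.get(name)

class LogicTier:
    """LT: handles requests from the PT and queries the DT for data."""
    def __init__(self, data_tier):
        self.data = data_tier

    def order_total(self, name, quantity):
        # Logical evaluation: check availability, then compute the total.
        product = self.data.get_product(name)
        if product is None or product["stock"] < quantity:
            return None  # product unavailable
        return product["price"] * quantity

class PresentationTier:
    """PT: translates user actions into requests and results into display text."""
    def __init__(self, logic_tier):
        self.logic = logic_tier

    def show_total(self, name, quantity):
        total = self.logic.order_total(name, quantity)
        if total is None:
            return f"Sorry, {name} is not available."
        return f"Total for {quantity} x {name}: EUR {total:.2f}"

ui = PresentationTier(LogicTier(DataTier()))
print(ui.show_total("laptop", 2))  # Total for 2 x laptop: EUR 1800.00
```

Because the PT only ever calls the LT and the LT only ever calls the DT, swapping the in-memory dictionary for a real database would only change `DataTier`, which is the separation-of-concerns point made above.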

3. Give three reasons why an organization would collect large datasets. Briefly discuss the strengths
(personalization, optimization of the supply chain, data-driven decision-making) and challenges (big
data, latency) of data-intensive applications. Give an example in e-commerce.

There are several reasons why enterprises collect large datasets:

1. Enhance customer engagement (personalization): organizations can use large datasets to analyze customer behavior and preferences online in real time, enabling them to offer personalized recommendations and experiences. This is done by an algorithm that analyzes the data in milliseconds.

2. Optimize the supply chain: large datasets on sales, inventory, and logistics allow demand forecasting and better planning of stock levels and deliveries.

3. Data-driven decision-making: decisions are based on evidence from the data (reports, dashboards, analyses) rather than on intuition alone.
