
Summary Exam question Data Engineering

Pages
48
Uploaded on
03-06-2021
Written in
2020/2021

The document consists of answers to more than 100 possible exam questions for the course Data Engineering

Exam Questions Data Engineering

Week 1: Introduction, file formats, Python for data engineering
● What is a data pipeline? When is a data pipeline expected to finish?
Which other technical requirements are ensured by a data
engineer?

○ What is a data pipeline:
■ A data pipeline is a series of data processing steps. It consists
of three key elements:
● Data source(s).
● Processing step(s).
● Destination: data warehouse.
■ An organization has many different data sources
■ Data is extracted and put in a central repository in a structured
format via ETL (extract/transform/load)
■ A data pipeline can contain machine learning models
■ Data processing is either
● real-time: online/streaming
● periodic (e.g. once per day): offline/batch
○ When is a data pipeline expected to finish? (The answer to this
question is our own reasoning, because we think the lecturer did not say
anything about this during classes.)
■ A data pipeline needs to be updated constantly and must be
available at all times to support the business processes of the
organization. Therefore, a data pipeline is only expected to
finish if a better data pipeline is implemented or if the
business processes (which this data pipeline supports) cease
to exist.
■ Real-time := online/streaming processing (link week 8)
● E.g. a user goes to Dreamland: the products shown on
the page are fetched in real time; a query goes to the
database and the result comes back immediately
■ Once per hour/day := offline/batch processing (link week 8)
○ Data engineer:
■ data engineer is responsible for implementing necessary
components for managing the data flow to enable data
scientists to do analysis and gain necessary insights
■ data engineer ensures processing is:
● scalable: supports a huge number of users (link with
distributed processing)
● reliable/available: minimal downtime and operationally robust
(back-ups, and online applications available 24/7)
● maintainable: support continuous change (software
and hardware updates)
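
The extract/transform/load steps described above can be sketched as a small batch pipeline. This is only a minimal illustration under stated assumptions: the CSV text, the `orders` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a CSV export from some operational system.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,alice,4.25
"""

def extract(text):
    """Extract: read rows from the data source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount and convert types."""
    return [
        (int(r["order_id"]), r["customer"], float(r["amount"]))
        for r in rows
        if r["amount"]
    ]

def load(rows, conn):
    """Load: write the cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

def run_pipeline(text, conn):
    """One batch run: source -> processing steps -> destination."""
    load(transform(extract(text)), conn)

conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
run_pipeline(RAW_CSV, conn)
total_per_customer = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
```

In a batch setting, `run_pipeline` would be scheduled once per hour or day; a streaming pipeline would instead process each record as it arrives.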

● We saw three different data models for representing data. Name and
provide a short summary of each data model.
○ The relational model:
■ Consists of tables and rows (or tuples/records)
■ Each column contains a primitive value such as a string, integer,
float or date
■ Two types of tables:
● Entities, e.g. persons, groups, objects
● Relations between entities, e.g. part-of, has-a, has-many,
linked-to
■ Each table can be saved as a Comma-Separated Values (or CSV)
file
Strengths                             Weaknesses
structured                            static and less flexible schema
schema checking                       joins = necessary evil (they are complex)
natural model for batch processing
flexible queries
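
As a rough sketch of the relational model, the snippet below uses Python's built-in `sqlite3` module with two entity tables and one relation table. The table names and rows are hypothetical; the point is to show schema checking and a join, not a recommended design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE course  (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- Relation table: links students to courses (has-many).
CREATE TABLE enrolled (
    student_id INTEGER REFERENCES student(id),
    course_id  INTEGER REFERENCES course(id)
);
INSERT INTO student VALUES (1, 'Ann'), (2, 'Bart');
INSERT INTO course  VALUES (10, 'Data Engineering'), (11, 'Statistics');
INSERT INTO enrolled VALUES (1, 10), (1, 11), (2, 10);
""")

# Flexible queries: combining entities requires joins.
rows = conn.execute("""
    SELECT s.name, c.title
    FROM enrolled e
    JOIN student s ON s.id = e.student_id
    JOIN course  c ON c.id = e.course_id
    ORDER BY s.name, c.title
""").fetchall()

# Schema checking: a row that violates the schema is rejected.
try:
    conn.execute("INSERT INTO student VALUES (3, NULL)")
    schema_rejected = False
except sqlite3.IntegrityError:
    schema_rejected = True
```

The two joins needed just to list who follows which course illustrate why joins are called a necessary evil.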

○ The document-oriented model:
■ Consists of keys and documents, that is, each key is associated
with one document
■ Document is a tree containing:
● Primitive values
● Nested entities
● One-to-many relations
■ Each document can be stored (and transferred) in JSON or XML
Strengths                                 Weaknesses
structured                                no static schema checking
flexible: dynamic schema checking         less flexible queries
natural model for tree-structured data    many intra-document relations
performance
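
A document store can be pictured as a mapping from keys to tree-shaped documents. The sketch below uses a plain Python dict as a stand-in for such a store; the keys, field names, and values are made up for illustration.

```python
import json

# Key -> document. Each document is a tree of primitive values,
# nested entities, and one-to-many relations.
store = {
    "customer:1": {
        "name": "Ann",
        "address": {"city": "Antwerp"},          # nested entity
        "orders": [                               # one-to-many relation
            {"product": "widget", "quantity": 2},
            {"product": "gadget", "quantity": 1},
        ],
    },
    # No static schema: this document simply has no address.
    "customer:2": {"name": "Bart", "orders": []},
}

doc = store["customer:1"]
quantities = [o["quantity"] for o in doc["orders"]]

# Without schema checking, readers must handle missing fields themselves.
city_2 = store["customer:2"].get("address", {}).get("city")

# A document can be stored and transferred as JSON text.
roundtrip = json.loads(json.dumps(doc))
```

Reading one customer with all orders needs no join, which is where the performance strength comes from; relations between documents are what this model handles poorly.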

○ The graph-oriented model:
■ Consists of nodes and edges
■ A node is an instance of an entity and has a unique ID
■ An edge is a relation between two nodes and has a unique ID
■ Nodes and edges have named properties with primitive values
Strengths                                  Weaknesses
structured                                 no static schema checking
flexible: schema can be easily changed     used less in industry (academic model)
natural model for when everything is       used in domains where everything is
connected with each other, e.g. social     connected through everything (not really a
networks                                   weakness, said Len)
variable number of joins
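
To make the graph-oriented model concrete, here is a small sketch with nodes and edges that each have a unique ID and named properties. The data and the `hops` helper are hypothetical; the breadth-first search shows how a query can follow a variable number of edges, which would require a variable number of joins in a relational database.

```python
from collections import deque

# Nodes: unique ID -> named properties with primitive values.
nodes = {
    "n1": {"label": "Person", "name": "Ann"},
    "n2": {"label": "Person", "name": "Bart"},
    "n3": {"label": "Person", "name": "Cem"},
    "n4": {"label": "Person", "name": "Dina"},
}
# Edges: unique ID -> source node, destination node, and properties.
edges = {
    "e1": {"src": "n1", "dst": "n2", "type": "friend_of"},
    "e2": {"src": "n2", "dst": "n3", "type": "friend_of"},
    "e3": {"src": "n3", "dst": "n4", "type": "friend_of"},
}

def hops(start, goal):
    """Breadth-first search along directed edges.

    Returns the number of edges on a shortest path, or None if
    the goal cannot be reached from the start node.
    """
    neighbours = {}
    for edge in edges.values():
        neighbours.setdefault(edge["src"], []).append(edge["dst"])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

For example, `hops("n1", "n4")` follows three `friend_of` edges without the caller having to know the path length in advance.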

● What are the strengths and weaknesses of the relational model versus the
document-oriented model? Which model would you prefer?


Relational model
Strengths                             Weaknesses
structured                            static and less flexible schema
schema checking                       joins = necessary evil
natural model for batch processing
flexible queries

Document-oriented model
Strengths                                  Weaknesses
structured                                 no static schema checking
flexible                                   less flexible queries
natural model (when data is                many intra-document relations
tree-structured with few intra-document
(or many-to-many) relations)
performance

○ Which model would you prefer?
■ Each model is widely used for different purposes; there is no
one-size-fits-all solution!
■ Decision depends on domain, that is, the structure of the data and
type of application
■ Mixed systems are available; for instance, JSON columns are
supported in most relational databases these days.

● Which file formats are used for storing and communicating data?
Provide two short examples in JSON and XML for storing student
grades.
○ CSV = Comma-Separated Values:
■ A plain text format
■ Represents a single table in the relational data model
■ Values can be surrounded by double quotation marks (" ").
■ Very commonly used for batch processing and for exporting/importing
larger amounts of data
■ Easy to partition (e.g. 2020-10-01_sales.csv,
2020-10-02_sales.csv = one file of sales data per day)
■ Can be easily compressed using zip

⇒ CSV is not really used for communicating data, so for this question we think
you only need to give JSON and XML
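
The CSV properties listed above (quoting, partitioning by date, zip compression) can be tried out with Python's standard `csv` and `zipfile` modules. This is a sketch with invented sales rows; the archive is kept in memory so nothing is written to disk.

```python
import csv
import io
import zipfile

# Hypothetical sales data, partitioned by day.
sales_by_day = {
    "2020-10-01": [("widget", 2, "9.99"), ("gadget, large", 1, "24.50")],
    "2020-10-02": [("widget", 5, "9.99")],
}

# Write one CSV file per day into a single zip archive.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for day, rows in sales_by_day.items():
        buf = io.StringIO()
        writer = csv.writer(buf)  # quotes values that contain a comma
        writer.writerow(["product", "quantity", "price"])
        writer.writerows(rows)
        zf.writestr(f"{day}_sales.csv", buf.getvalue())

# Read one partition back; all values come back as strings.
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
    text = zf.read("2020-10-01_sales.csv").decode("utf-8")
    first_day = list(csv.reader(io.StringIO(text)))
```

Because each day lives in its own file, a batch job can process only the partitions it needs.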

○ JSON = JavaScript Object Notation:
■ A plain text format
■ Syntax closely resembles data literals in Python and JavaScript
■ Represents single tree of data in document-oriented model
■ makes use of arrays and dictionaries
■ Common format for sharing data between client (browser)
and server or communicating data between any two
applications / services
■ For configuration of applications / services
■ Typically a single JSON document is small, but NoSQL
databases such as MongoDB store millions of documents
with a unique ID for each document
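
One possible JSON example for storing student grades, as asked in the question, is sketched below and parsed with Python's `json` module. The student name, courses, and scores are invented; the notes themselves do not prescribe this structure.

```python
import json

# A single tree of data: a dictionary with a nested array of dictionaries.
grades_json = """
{
  "student": "Ann",
  "grades": [
    {"course": "Data Engineering", "score": 15},
    {"course": "Statistics", "score": 12}
  ]
}
"""

record = json.loads(grades_json)  # JSON text -> Python dicts and lists
average = sum(g["score"] for g in record["grades"]) / len(record["grades"])
text_out = json.dumps(record, indent=2)  # Python data -> JSON text
```

The parsed result is an ordinary Python dict, which is why JSON is convenient for exchanging data between applications and services.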

○ XML = eXtensible Markup Language:
■ Represents single tree of data in document-oriented model
■ Common format for sharing data between client (browser)
and server or communicating data between any two
applications / services
■ instead of arrays and dicts, it uses TAGS (<>) with
attributes
■ For communication and configuration of applications /
services
■ XHTML, for formatting web-pages, is a type of XML
■ (As the name suggests XML is not really a format, but a
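
One possible XML example for storing student grades uses tags with attributes instead of arrays and dictionaries. The sketch below parses it with Python's `xml.etree.ElementTree`; the tag names, attribute names, and values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# A single tree of data expressed with tags and attributes.
grades_xml = """
<student name="Ann">
  <grade course="Data Engineering" score="15"/>
  <grade course="Statistics" score="12"/>
</student>
""".strip()

root = ET.fromstring(grades_xml)
student_name = root.get("name")
# Attribute values are always strings, so scores are converted explicitly.
scores = {g.get("course"): int(g.get("score")) for g in root.findall("grade")}
```

Unlike JSON, XML has no built-in number type, so the consumer decides how to interpret each attribute value.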