Summary

Summary Data Engineering

Rating: 4.3 (3)
Sold: 24
Pages: 190
Uploaded on: 21-05-2020
Written in: 2019/2020

This Data Engineering summary contains the course material with extra notes in grey, and includes my answers to the example exam and to the example questions given during the course. It also contains questions from the exam itself. The document is highly structured, which makes it very handy for studying in a structured way. Also check the "quick" review of courses 1-10 in the back! The notes on the GDelt Project and screenshots of every step are added in the back, starting from page 106 till the end (not entirely in English; let me know if you need this and I will update that part). These sessions only support your group assignment and are not part of the exam, so I wouldn't even print out this part ;)


Document information

Uploaded on: May 21, 2020
File latest updated on: June 15, 2020
Number of pages: 190
Written in: 2019/2020
Type: Summary

Content preview

Data Engineering 2019-2020
Content table – Data Engineering 2019-2020

Course 1 ......................................................................................................................................................... 4
1.1 Intro ............................................................................................................................................................... 4
1.1.A defining data engineering....................................................................................................................... 4
1.1.B Course topics .......................................................................................................................................... 5
1.1.C Class format, lab sessions, exam and project ......................................................................................... 6
1.2 Basic computer architecture and operating systems .................................................................................... 7
1.2.A Basic Computer Architecture ................................................................................................................. 7
1.2.B Operating System (OS) level ................................................................................................................. 10
1.3 File formats.................................................................................................................................................. 14
1.3.A human readable file formats ................................................................................................................ 14
1.3.A.1 CSV..................................................................................................................................................... 14
1.3.A.2 XML.................................................................................................................................................... 15
1.3.A.3 JSON .................................................................................................................................................. 16
1.3.B Not human readable and compressed file formats .............................................................................. 19
1.4 Python concepts .......................................................................................................................................... 21

Course 2 ....................................................................................................................................................... 25
2.1 basic computer architecture and Operating systems (os) ........................................................................... 25
2.2 intro to computer networks......................................................................................................................... 25
2.2.A Important network applications: Web – HTTP ..................................................................................... 27
2.2.B Important network applications: DNS .................................................................................................. 30
2.2.C lab sessions ........................................................................................................................................... 30
2.3 Regular expressions (regex)......................................................................................................................... 31
2.3.A Definition and general application ...................................................................................................................... 31
2.3.B Regular expressions in Python .............................................................................................................. 32
2.3.C Gone wrong .......................................................................................................................................... 34
2.3.D Concluding remarks .............................................................................................................................. 34
Summary ........................................................................................................................................................... 34

Course 3 ....................................................................................................................................................... 35
3.1 Basic Linux ................................................................................................................................................... 35
3.1.A linux ...................................................................................................................................................... 36
3.1.B Linux command line instructions (File manipulation) .......................................................................... 38
3.1.C JQ .......................................................................................................................................................... 39
3.2 Cloud Services .............................................................................................................................................. 40
3.2.A Defining cloud services ........................................................................................................................ 40
3.2.B Core AWS services ................................................................................................................................ 41
3.2.C Storage infrastructure .......................................................................................................................... 44
3.2.D Database services ................................................................................................................................. 44
3.2.E Cloud architecture example.................................................................................................................. 45
Summary ........................................................................................................................................................... 45




Course 4 ....................................................................................................................................................... 46
4.1 algorithms and complexity .......................................................................................................................... 46
4.1.A Sorting ................................................................................................................................................. 49
4.2 basic datastructures .................................................................................................................................... 53
4.2.A collections or container ........................................................................................................................ 54
A.1 List ........................................................................................................................................................... 54
A.2 set ............................................................................................................................................................ 55
A.3 map.......................................................................................................................................................... 55
4.2.B trees ...................................................................................................................................................... 55
4.2.C Hash Tables ........................................................................................................................................... 57
Summary ........................................................................................................................................................... 58

Course 5 ....................................................................................................................................................... 59
Databases.......................................................................................................................................................... 59
5.1 Data, data, data ....................................................................................................................................... 59
5.2 evolution of databases ............................................................................................................................ 59
5.3 relational databases................................................................................................................................. 60
5.4 types of databases ................................................................................................................................... 63
5.4.A type 1: production database ................................................................................................................ 63
5.4.B type 2: analytical database ................................................................................................................... 63
5.5 NoSQL Data Stores ................................................................................................................................... 64
5.6 Big Data.................................................................................................................................................... 64

Course 6&7 .................................................................................................................................................. 65
6. Parallel and distributed computing ............................................................................................................... 65
6.1 Parallel computing ................................................................................................................................... 65
6.1.A communication patterns ...................................................................................................................... 66
6.1.B Examples ............................................................................................................................................... 68
6.1.C Analysis of speedup .............................................................................................................................. 70
6.1.D Dependencies ....................................................................................................................................... 70
6.2 Distributed computing ............................................................................................................................. 71
6.3 Use cases ................................................................................................................................................. 73
7. Map reduce ................................................................................................................................................... 74
7.1 map reduce .............................................................................................................................................. 75
7.2 Map-Reduce example .............................................................................................................................. 76
7.3 SQL operations......................................................................................................................................... 77
7.4 Hadoop .................................................................................................................................................... 78
7.5 Shuffling ................................................................................................................................................... 79
7.6 matrix operations .................................................................................................................................... 79
7.7 summary .................................................................................................................................................. 80
7.8 Spark ........................................................................................................................................................ 81
7.9 the debit example on spark ..................................................................................................................... 82
7.10 indexing web pages using spark ............................................................................................................ 83
7.11 Spark functions ...................................................................................................................................... 83
7.11 use cases ................................................................................................................................................ 85

Course 8 & 9: Gdelt project .......................................................................................................................... 85




Course 10 ..................................................................................................................................................... 86
10. Web api’s ..................................................................................................................................................... 86
10.1 Rest api .................................................................................................................................................. 87
10.2 Designing a REST API.............................................................................................................................. 88
10.3 demo ...................................................................................................................................................... 89
10.4 api access ............................................................................................................................................... 90
10.5 Microservices ......................................................................................................................................... 91
10.6 summary ................................................................................................................................................ 92

Course 11: closing remarks ........................................................................................................................... 93
11.1 Choose your technology stack ................................................................................................................... 93
11.2 Streaming .................................................................................................................................................. 94
11.3 Sampling .................................................................................................................................................... 94
11.4 filtering ...................................................................................................................................................... 95
11.5 Streaming technology ............................................................................................................................... 95
11.6 data warehouses ....................................................................................................................................... 96
11.7 Unstructured data ..................................................................................................................................... 98
11.8 Web API’s .................................................................................................................................................. 98

Example Exam .............................................................................................................................................. 99

Quick review of course 1-10 ....................................................................................................................... 109

Gdelt project .............................................................................................................................................. 138




COURSE 1

1.1 INTRO

1.1.A DEFINING DATA ENGINEERING
Defining a data engineer by differentiating it from a data scientist
A data scientist's principal role is to find value or discover new opportunities in the company's data, or to fulfill business needs using that data. The data scientist/analyst uses the company's tools and infrastructure together with his/her knowledge of basic mathematics, machine learning and statistics.

The role of the data engineer is to provide the data scientist with the software infrastructure for fetching and processing the data, so that the data scientist can easily explore and gain insight into the data. He/she is responsible for deploying new models and applications, typically making use of a workflow management platform.

Extract/Transform/Load (ETL)
Besides supporting data science, the data engineer is more generally responsible for the processing of data.

The data engineer is responsible for implementing the interfaces that are necessary for managing the data flow and keeping the data available for analysis.

The data architect is usually the person responsible for the design of the whole system.

Typically there are many different data sources within the company. To enable data scientists to gain insight in that data and generate value, all that data should be accessible in a central repository in some uniform format.

[Figure: several data sources are extracted, transformed and loaded into a data warehouse]

The data pipeline
The set of processes to automatically extract data from different sources, transform it into some uniform format and store it in a central place defines the data pipeline.
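
To make the extract/transform/load steps concrete, here is a minimal Python sketch. It is not taken from the course material: the two in-memory sources, the field names and the list that stands in for the central repository are all hypothetical, chosen only to show two differently shaped sources ending up in one uniform format.

```python
import csv
import io
import json

# Hypothetical raw inputs standing in for two company data sources.
CSV_SOURCE = "id,name,amount\n1,Alice,10.5\n2,Bob,7\n"
JSON_SOURCE = '[{"customer_id": 3, "customer": "Carol", "total": "4.25"}]'


def extract_csv(text):
    """Extract: read rows from a CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Extract: read records from a JSON source."""
    return json.loads(text)


def transform(record, mapping):
    """Transform: rename source fields to the uniform schema and fix types."""
    row = {target: record[source] for target, source in mapping.items()}
    return {
        "id": int(row["id"]),
        "name": str(row["name"]),
        "amount": float(row["amount"]),
    }


def load(rows, warehouse):
    """Load: append uniform rows to the central store (a plain list here)."""
    warehouse.extend(rows)


def run_pipeline():
    """Run extract, transform and load for both sources."""
    warehouse = []
    csv_map = {"id": "id", "name": "name", "amount": "amount"}
    json_map = {"id": "customer_id", "name": "customer", "amount": "total"}
    load([transform(r, csv_map) for r in extract_csv(CSV_SOURCE)], warehouse)
    load([transform(r, json_map) for r in extract_json(JSON_SOURCE)], warehouse)
    return warehouse
```

Calling `run_pipeline()` returns one list in which the CSV rows and the JSON record share the same keys and types. In a real pipeline the list would be replaced by a database or data warehouse.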

The data pipeline can also contain production models made by data scientists. Depending on the requirements these
models have to run in real-time, once per hour/day...
Data engineers need to maintain this data flow and ensure its availability and quality:
● make changes if data is added/removed
● solve bottlenecks in the pipeline
● monitor, log and solve errors
● handle duplicate, incorrect or corrupted data
● scale
● test
● ...
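
Two of the tasks above (logging errors and handling duplicate, incorrect or corrupted data) can be sketched in a few lines of Python. This is only an illustration: the record layout (`id`, `amount`) and the rule that a negative amount counts as incorrect are assumptions, not part of the course material.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("pipeline")


def clean(records):
    """Drop duplicate and invalid records, logging each rejected one."""
    seen_ids = set()
    kept = []
    for record in records:
        record_id = record.get("id")
        amount = record.get("amount")
        # Corrupted: missing id or an amount that is not a number.
        if record_id is None or not isinstance(amount, (int, float)):
            logger.warning("rejected corrupted record: %r", record)
            continue
        # Incorrect (assumed rule): amounts may not be negative.
        if amount < 0:
            logger.warning("rejected incorrect record: %r", record)
            continue
        # Duplicate: an id that was already accepted earlier.
        if record_id in seen_ids:
            logger.warning("rejected duplicate id: %r", record_id)
            continue
        seen_ids.add(record_id)
        kept.append(record)
    return kept
```

Every rejected record is logged instead of silently dropped, so errors can be monitored and traced back to their source later.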

Workflow Management Platform
The image shows how we manage this data. We split up the data in parts, and each split is a step, but you don't do every step yourself (you don't have to reinvent the wheel every time).

DAG configuration and monitoring @PrediCube
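
The idea of steps arranged in a DAG (directed acyclic graph) can be sketched with Python's standard library alone. This is not the API of any particular workflow management platform. The step names and dependencies are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a step, each value the set of steps
# that must finish before it can start.
DAG = {
    "extract_sales": set(),
    "extract_customers": set(),
    "transform": {"extract_sales", "extract_customers"},
    "load": {"transform"},
    "report": {"load"},
}


def run(dag, tasks):
    """Run every task once, in an order that respects the dependencies."""
    executed = []
    for step in TopologicalSorter(dag).static_order():
        tasks[step]()
        executed.append(step)
    return executed


def make_tasks(log):
    """Build placeholder tasks that only record their own name in `log`."""
    return {name: (lambda name=name: log.append(name)) for name in DAG}
```

A real workflow platform adds scheduling, retries and monitoring on top of this ordering. The sketch only shows why describing the steps as a graph is enough for the platform to decide what runs when.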
julievantroyen Universiteit Antwerpen
FBE / TEW / Handelsingenieur summaries

I am a student at the Faculty of Business and Economics. I sell my notes/summaries for a number of courses, mainly from the Handelsingenieur (business engineering) programme, in management information systems.
