1. Introduction
Data Engineer
Develops the architecture that helps analyse and process data in the way the organization
needs it.
Data Science Lifecycle
Big Data
Term for a collection of datasets so large and complex that it becomes difficult to process using
traditional data processing applications.
● Structured data: RDBMS
● Semi-structured data: XML, RDF, JSON, etc.
● Unstructured data: video, images, text, etc.
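A minimal sketch of how the three kinds of data above differ when a program reads them. The table, field names and sample values are hypothetical and only serve as illustration.

```python
import json
import sqlite3

# Structured: a fixed schema enforced by a relational database
# (an in-memory SQLite database stands in for an RDBMS here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customer VALUES (?, ?, ?)", (1, "Ada", "Zurich"))
row = conn.execute("SELECT name, city FROM customer WHERE id = 1").fetchone()

# Semi-structured: self-describing JSON; fields may be nested or missing per record.
doc = json.loads('{"id": 1, "name": "Ada", "tags": ["vip"], "address": {"city": "Zurich"}}')
city_from_json = doc.get("address", {}).get("city")

# Unstructured: free text carries no schema; the program has to extract structure itself.
note = "Ada called from Zurich about her invoice."
mentions_city = "Zurich" in note
```

The structured query relies on the schema, the JSON lookup has to tolerate missing keys, and the text check is only a crude stand-in for real text analysis.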
V3 Model
● Volume: the amount of data an enterprise holds keeps growing
● Velocity: sometimes 2 minutes is too late; data must be processed as streams in order to
maximize its value
● Variety: data is both structured and unstructured; new insights are found when these
data types are analyzed together
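The velocity point can be illustrated with a small sketch: instead of waiting for a complete batch, a stream consumer updates its result as each value arrives. The function name and the sample readings are hypothetical.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, one result per incoming value."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: no need to store or re-read earlier values.
        mean += (value - mean) / count
        yield mean


readings = [10.0, 20.0, 30.0]  # stands in for an unbounded stream
means = list(running_mean(readings))
```

An up-to-date result is available after every reading, which is the property a batch job that runs every few minutes cannot offer.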
Data Pipeline
Aggregates, organizes and moves data for storage, insights and analysis.
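A minimal sketch of such a pipeline, assuming a tiny CSV source and an in-memory SQLite database as the storage target. The data, the table name and the three function names are hypothetical.

```python
import csv
import io
import sqlite3
from collections import defaultdict

RAW_CSV = """region,amount
north,10
south,5
north,7
"""


def extract(text):
    """Read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Aggregate amounts per region and organize the result in sorted order."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += float(row["amount"])
    return sorted(totals.items())


def load(totals, conn):
    """Move the aggregated data into storage for later analysis."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_region (region TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales_by_region VALUES (?, ?)", totals)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
```

Keeping the stages as separate functions makes each one testable on its own and easy to replace, for example when the source changes from a file to a stream.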
ML Ops
● Software engineering approach: enables teams to produce high-quality software efficiently
● Cross-functional team: experts with different skill sets and workflows (DE, DS, ML, Dev,
Ops)
● Producing software based on code, data and models: each artifact of the ML software
production process requires different tools and workflows → all of them must be versioned
and managed accordingly
● Small and safe increments: software artifacts are released in small increments → gives
visibility and control over the variance of their outcomes
● Reproducible and reliable software release: model outputs may be non-deterministic, but the
process of releasing ML software must be reliable and reproducible → leverage automation as
much as possible
● Software release at any time: ML software must be deliverable into production at any
time → when to release becomes a business decision rather than a technical one
● Short adaptation cycles: development cycles are on the order of days or hours
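Two of the points above, versioning code, data and models together and making releases reproducible, can be sketched as follows. This is only an illustration under stated assumptions: the `train` function is a toy stand-in for real training, and the manifest layout and all names are hypothetical rather than taken from any particular tool.

```python
import hashlib
import json
import random


def fingerprint(payload: bytes) -> str:
    """Short content hash used as a version identifier for an artifact."""
    return hashlib.sha256(payload).hexdigest()[:12]


def train(data, seed):
    """Toy 'training': the mean plus a little randomness, fixed by the seed."""
    rng = random.Random(seed)
    noise = rng.uniform(-0.01, 0.01)
    return {"weight": sum(data) / len(data) + noise}


def build_manifest(code: str, data: list, seed: int) -> dict:
    """Record one version per artifact so a release can be reproduced later."""
    model = train(data, seed)
    return {
        "code": fingerprint(code.encode()),
        "data": fingerprint(json.dumps(data).encode()),
        "model": fingerprint(json.dumps(model, sort_keys=True).encode()),
        "seed": seed,
    }


first = build_manifest("print('train')", [1.0, 2.0, 3.0], seed=42)
second = build_manifest("print('train')", [1.0, 2.0, 3.0], seed=42)
changed = build_manifest("print('train')", [1.0, 2.0, 4.0], seed=42)
```

With the seed pinned, the same code and data give the same manifest, while a change in the data shows up as a new data and model version even though the code version is unchanged.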
2. Cloud Computing & Virtualization
Cloud
A type of distributed system consisting of interconnected and virtualized computers dynamically
provisioned and presented as one (or more) unified computing resource(s) based on
service-level agreements established through negotiation between service providers and
consumers.
● Cloud contains “your” data
● Cloud computes “your” data
● Cloud hosts “your” data-intensive applications
▶ Central ideas:
● Utility computing over data
● SOA (Service-Oriented Architectures)
● SLA (Service-Level Agreements)
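SLAs commonly state an availability target. As a rough worked example, the sketch below converts such a target into the downtime it permits. The 99.9% figure and the 30-day period are assumptions chosen for illustration, not values from these notes.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Hypothetical target: 99.9% availability over a 30-day month.
budget = allowed_downtime_minutes(99.9)
```

Under these assumptions the budget comes to roughly 43 minutes per month, which shows how quickly each additional "nine" tightens what the provider must deliver.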