SUMMARY
DATA PREPARATION & WORKFLOW MANAGEMENT
Demi van de Pol || Master Marketing Analytics || Tilburg University || 2022
Demi van de Pol | Summary | Data Preparation & Workflow Management | TISEM | Tilburg University | Spring-2022
CONTENT
This summary was written for the course “Data Preparation & Workflow Management” in the
Spring 2022 semester, which is part of the master Marketing Analytics. The input for this summary
consists of lectures, articles and tutorials.
Disclaimer: The course “Data Preparation & Workflow Management” is mainly focused on the practical part of this subject
(i.e. working with data). This summary is by no means a substitute for the lectures and tutorials provided by the lecturer.
This summary merely provides support on the theoretical part of the course.
…
WEEK 1
READING: Professionalize Your Team Work Using Scrum
The entire article can be found via this link:
https://tilburgsciencehub.com/tutorials/scale-up/scrum-for-researchers/use-scrum-in-your-team/
● Scrum is a simple framework for effective team collaboration that provides structure which leads
to commitment and motivation.
● Scrum defines three main roles for members of the team: the product owner, the Scrum master
and development team members.
● The product owner is accountable for maximizing the value of the product and for defining a clear
“task list” (called product backlog).
● The Scrum master is accountable for the team’s effectiveness by coaching and helping the team
members to focus, removing obstacles for the team and ensuring that tasks are completed in a
positive, productive and timely manner.
● The development team members are responsible for completing the tasks in the Sprint (period).
● Scrum can be seen as a structured way of working with meetings that are shorter and more
productive, and cooperating in a flexible way in-between meetings.
WEEK 2: Project Management & Version Control
READING: Principles of Project Setup and Workflow Management
The entire article can be found via this link:
https://tilburgsciencehub.com/tutorials/reproducible-research-and-automation/principles-of-project-setup-and-workflow-management/project-setup-overview/
PROJECT SETUP
Two major issues in managing data-intensive projects are:
● Losing sight of the project (= directory and file chaos)
● Difficulty (re)executing the project (= lack of automation)
The primary mission of managing data- and computation-intensive projects is to build a transparent
project infrastructure that lets you (re)execute your code easily, potentially many times.
PIPELINES AND PROJECT COMPONENTS
It is useful to break down a project into its most basic parts:
● A pipeline refers to the steps that are necessary to build a project (e.g., prepare dataset, run
model, produce tables and figures).
● Components refer to a project’s most basic building blocks (e.g., data, source code, and
generated temporary and/or output files).
The power of setting up the project in this way lies in:
● Full portability
● Reproducibility and transparency
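To make the idea of components concrete, here is a minimal Python sketch that creates one folder per component. The folder names ("data", "src", "gen/temp", "gen/output") and the function names are illustrative assumptions, not names prescribed by the article. All paths are kept relative to the project root, which is one simple way to keep a project portable across machines.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Hypothetical folder names, one per project component (assumption for illustration).
COMPONENTS = ("data", "src", "gen/temp", "gen/output")


def create_skeleton(root: Path) -> list[Path]:
    """Create one folder per component below `root` and return the created paths."""
    created = []
    for name in COMPONENTS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


def relative_layout(root: Path) -> list[str]:
    """List all folders relative to `root`, so the layout looks the same on any machine."""
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()
    )


if __name__ == "__main__":
    # A temporary directory stands in for the project root in this sketch.
    with tempfile.TemporaryDirectory() as tmp:
        create_skeleton(Path(tmp))
        print(relative_layout(Path(tmp)))
```

Because no absolute path appears anywhere, the same code can be run unchanged by a teammate who cloned the project to a different location.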
PIPELINES
Benefits of conceiving your project like a pipeline:
● Write clearer source code: Split the different steps in your project into smaller, separate
source code files.
● Obtain results faster: Because your project is separated into different pipeline stages and each of
these stages is self-contained, you can easily run “later” stages of your project (called
“downstream”), based on different input files defined earlier in your project (called “upstream”).
● Increase transparency and foster collaboration: With more transparent source code, you allow
others to more easily understand the code you use(d).
● Use multiple software packages: Because the steps are small and self-contained, you can easily
use, for instance, R to prepare your dataset and Python to build an algorithm based on the cleaned data.
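The pipeline idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file names (raw.csv, clean.csv, table.txt), the "sales" column, the toy data and the stand-in "model" (a simple mean) are all hypothetical and do not come from the article. The point is that each stage reads its input from a file and writes its output to a file.

```python
import csv
import tempfile
from pathlib import Path


def prepare_dataset(raw_file: Path, clean_file: Path) -> None:
    """Upstream stage: copy the raw data, dropping rows with a missing 'sales' value."""
    with raw_file.open(newline="") as src, clean_file.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["sales"] != "":
                writer.writerow(row)


def run_model(clean_file: Path) -> float:
    """Downstream stage: a stand-in 'model' that returns the mean of 'sales'."""
    with clean_file.open(newline="") as src:
        values = [float(row["sales"]) for row in csv.DictReader(src)]
    return sum(values) / len(values)


def produce_table(mean_sales: float, table_file: Path) -> None:
    """Final stage: write the result to a small text table."""
    table_file.write_text(f"mean_sales\n{mean_sales:.2f}\n")


def run_pipeline(workdir: Path) -> str:
    """Run all stages in order and return the content of the output table."""
    raw = workdir / "raw.csv"
    clean = workdir / "clean.csv"
    table = workdir / "table.txt"
    # Toy raw data (assumption): store B has a missing sales value.
    raw.write_text("store,sales\nA,10\nB,\nC,20\n")
    prepare_dataset(raw, clean)
    produce_table(run_model(clean), table)
    return table.read_text()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline(Path(tmp)))
```

Since the stages only communicate through files, run_model can be rerun on an existing clean.csv without repeating prepare_dataset, and either stage could be swapped for a script in another language (e.g., R) that reads and writes the same files.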