Study Program: Pre-Master Data Science and Society
Academic Year 2021/2022, Semester 1 (August to December 2021)
Course: Methodology for Premasters DSS, 800884-B-6
Lecturers: B. Nicenboim and G. Saygili
2. The data science process
Data Science Introduction
• Data Science is the art of turning data into actions
• The term has existed since the 1960s
• The term data science in its current meaning was introduced in the 1990s in the
statistics and data mining communities
• It was first named as an independent discipline in 2001
• Data comes in three forms
o 1) structured data → e.g., held in a traditional relational database management system
o 2) unstructured data
o 3) semi-structured data
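The three forms of data can be illustrated with a minimal Python sketch. All records below are made-up examples (not from the lecture): a CSV string stands in for a relational table, JSON for semi-structured data, and a free-text sentence for unstructured data.

```python
import csv
import io
import json

# Structured: fixed columns, as in a relational table (a CSV string as stand-in).
structured = io.StringIO("id,name,age\n1,Ann,34\n2,Bob,29\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing records whose fields may differ (JSON).
semi_structured = json.loads(
    '[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "note": "no tags"}]'
)

# Unstructured: free text without a predefined schema.
unstructured = "Ann called on Monday and asked about her order."

structured_columns = list(rows[0].keys())                        # same for every row
semi_keys = [sorted(record.keys()) for record in semi_structured]  # vary per record
word_count = len(unstructured.split())                           # only crude features
```

Every structured row shares the same columns, the JSON records carry their own (differing) keys, and the free text has to be processed before any fields can be extracted from it.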
Data Science in Organizations
• organizations use data to maintain competitiveness
o a 10 % increase in data usability leads to
→ 17 - 49 % increase in productivity
→ 11 - 42 % return on assets
→ 5 - 6 % performance improvement via data-driven decision making
• Big Data Opportunities
o Top 3: Increasing Operational efficiency (51%), Informing Strategic Direction,
Better Customer Service (27%)
• Four key activities of data science in organizations
o acquire: obtaining needed data
o prepare: preprocessing operations on the data
o analyze: analyzing the data and interpreting the results
o act: taking actions based on results
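The four activities can be read as a simple pipeline. The sketch below is only an illustration: the function names mirror the activity names, while the sales figures and the decision threshold are hypothetical.

```python
from statistics import mean

def acquire():
    # Acquire: obtain the needed data (hypothetical daily sales, one entry missing).
    return ["120", "135", "", "150"]

def prepare(raw):
    # Prepare: drop empty entries and convert the rest to numbers.
    return [float(value) for value in raw if value.strip()]

def analyze(values):
    # Analyze: summarize the prepared data.
    return {"mean": mean(values), "n": len(values)}

def act(result, threshold=130.0):
    # Act: a toy decision rule based on the analysis result.
    return "increase stock" if result["mean"] > threshold else "keep stock"

result = analyze(prepare(acquire()))
decision = act(result)
```

Each step consumes the output of the previous one, so a problem discovered during analysis usually means going back to acquire or prepare.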
Data Mining Process
• Data Mining is the process of extracting previously unknown and potentially useful
information from the data using mathematical, statistical and machine learning
methods
• CRISP-DM: Cross-Industry Standard Process for Data Mining (late 1990s)
o guideline for a structured approach to execute a data mining process
o six phases (the whole process can restart several times)
o updated versions of CRISP-DM based on new demands
▪ IBM: ASUM-DM (Analytics Solutions Unified Method for Data Mining)
▪ SAS: SEMMA (Sampling, Exploring, Modifying, Modeling, Assessing)
▪ Microsoft: TDSP (Team Data Science Process)
• Phases of CRISP-DM
o 1) Business Understanding
▪ Understanding the business goal
▪ Situation assessment
▪ Translating the business goal to a data mining objective
▪ Development of a project plan
o 2) Data Understanding
▪ Considering data requirements
▪ Initial data collection, exploration, and quality assessment
o 3) Data Preparation
▪ Selection of required data
▪ Data cleaning
▪ Data transformation and enrichment
o 4) Modeling
▪ Selection of the appropriate modeling technique
▪ Training and test set creation for evaluation
▪ Development and examination of alternative modeling algorithms
▪ Fine tuning the model parameters
o 5) Model Evaluation
▪ Evaluation of the model in the context of the business success criteria
▪ Model approval
o 6) Deployment
▪ Reporting of the findings
▪ Planning and development of deployment procedure
▪ Deployment of the model
▪ Development of a maintenance or update plan
▪ Review of the project and planning the next steps
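The Modeling and Model Evaluation phases can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not part of CRISP-DM itself: the data (y = 2x + 1), the 80/20 split, and the two candidate models (a mean baseline and a least-squares line) are all hypothetical choices.

```python
import random
from statistics import mean

# Hypothetical toy data: y depends linearly on x (y = 2x + 1).
data = [(x, 2 * x + 1) for x in range(10)]

# Training and test set creation (80/20 split, fixed seed for repeatability).
rng = random.Random(0)
shuffled = data[:]
rng.shuffle(shuffled)
train, test = shuffled[:8], shuffled[8:]

def fit_baseline(points):
    # Alternative 1: always predict the mean of the training targets.
    avg = mean(y for _, y in points)
    return lambda x: avg

def fit_linear(points):
    # Alternative 2: least-squares line through the training points.
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def mse(model, points):
    # Evaluation metric: mean squared error on the held-out test set.
    return mean((model(x) - y) ** 2 for x, y in points)

candidates = {"baseline": fit_baseline(train), "linear": fit_linear(train)}
scores = {name: mse(model, test) for name, model in candidates.items()}
best = min(scores, key=scores.get)  # the model with the lowest test error
```

In a real project the evaluation criterion would come from the business success criteria defined in phase 1, not only from a generic error metric.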
• Team Data Science Process (TDSP)
o TDSP is Microsoft’s new version of CRISP-DM
o Key components:
▪ A data science lifecycle definition
▪ A standardized project structure
▪ Infrastructure and resources recommended for data science projects
▪ Tools and utilities recommended for project execution.
o TDSP Infrastructure and Resources for Data Science Projects: TDSP provides
recommendations for managing shared analytics and storage infrastructure
such as:
▪ Cloud file systems for storing datasets
▪ Databases
▪ Big Data (SQL or Spark) clusters
▪ Machine learning service
o TDSP provides recommendations for R and Python
• Data Science Trajectory / Process
o Raw data → processed data → clean data (suppress noise, fill in missing data)
o from exploratory data analysis you can go back to get more data or proceed
o data product can be used in real world
• A Data Scientist’s Role in This Process
o Initial step: What data is needed?
o Data can come from different fields
o raw data can be of different types
o process and clean data to suppress noise and discard outliers
o data scientists formulate a hypothesis and a research question to be
answered within a study → we need to know where we are going
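The cleaning step (adding missing data, discarding outliers) could look roughly like the sketch below. The measurements are made up, and both median imputation and the 1.5 × IQR rule are common conventions assumed here for illustration; the lecture does not prescribe a specific method.

```python
from statistics import median, quantiles

# Hypothetical raw measurements: None marks a missing value, 250.0 is an obvious outlier.
raw = [10.0, 12.0, None, 11.0, 13.0, 250.0, 12.0, 11.0]

# Add missing data: impute with the median of the observed values.
observed = [v for v in raw if v is not None]
fill = median(observed)
filled = [fill if v is None else v for v in raw]

# Discard outliers: keep only values within 1.5 * IQR of the quartiles.
q1, _, q3 = quantiles(filled, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clean = [v for v in filled if lower <= v <= upper]
```

The median is used for imputation because, unlike the mean, it is barely affected by the outlier that is still present at that stage.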
Big Data
• big data combines different sources of data (e.g., social media, transactions,
enterprise data…)
• Big data can be defined as a collection of diverse and large amounts of data that is
hard to process with conventional data processing platforms.
→ Doug Laney’s explanation: big data = 3 V’s (volume, velocity, variety) plus value
→ big data = high volume, high velocity, and high variety of information
o “Data are becoming the new raw material of business: Economic input is
almost equivalent to capital and labor” (Economist, 2010)
o “Information will be the ‘21st Century Oil’” (Gartner company, 2019)