Summary: Introduction to Data Science (Entire Course)
Uploaded on 21-03-2022. 18 pages. Written in 2021/2022.

Everything you need to know for the IDS exam!


INTRODUCTION TO DATA SCIENCE


Paper 2: Data Life Cycle (CRISP-DM)
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It provides a
structured approach to planning a data mining project. The model is a sequence of six
phases:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modelling
5. Evaluation
6. Deployment

STAGE 1: BUSINESS UNDERSTANDING
Understand what you want to accomplish from a business perspective.
 What are the desired outputs of the project?
 Assess the current situation:
o List all resources like personnel, data, computing resources, software.
o List all requirements, assumptions, and constraints.
o List all risks or events that might delay the project or cause it to fail.
o Compile a glossary of terminology relevant to the project.
o Construct a cost-benefit analysis for the project which compares the costs of
the project with the potential benefits to the business if it is successful.
 Determine data mining goals:
o Business success criteria: describe the intended outputs of the project that
enable the achievement of the business objectives.
o Data mining success criteria: define the criteria for a successful outcome to the
project in technical terms.
 Produce project plan:
o Project plan: list the stages to be executed in the project, together with their
duration, resources required, inputs, outputs, and dependencies.
o Initial assessment of tools and techniques
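The cost-benefit analysis described above can be sketched in a few lines; all figures here are made-up illustrations, not from the course:

```python
# Minimal cost-benefit sketch: compare project costs with potential benefits.
def cost_benefit(costs, benefits):
    """Return (net value, whether the project looks worthwhile)."""
    net = sum(benefits.values()) - sum(costs.values())
    return net, net > 0

# Hypothetical figures for a data mining project
costs = {"personnel": 50_000, "computing": 10_000, "software": 5_000}
benefits = {"churn_reduction": 80_000, "upsell": 20_000}

net, worthwhile = cost_benefit(costs, benefits)
print(net, worthwhile)  # 35000 True
```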

STAGE 2: DATA UNDERSTANDING
Acquire the data listed in the project resources
 Collect the data: sources and methods
 Describe the data: including its format, quantity, and the identities of its fields.
 Explore the data: visualize the data by looking at relationships between attributes,
distribution of attributes, and simple statistical analyses.
 Verify data quality: is it complete and correct, are there errors or missing values?
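A minimal data-quality pass over a hypothetical mini-dataset, using only the Python standard library:

```python
from statistics import mean

# Hypothetical records; None marks a missing value
records = [
    {"age": 34, "city": "Amsterdam"},
    {"age": None, "city": "Utrecht"},
    {"age": 29, "city": None},
]

def quality_report(records, fields):
    """Count missing versus complete values per field."""
    report = {}
    for f in fields:
        values = [r.get(f) for r in records]
        missing = sum(v is None for v in values)
        report[f] = {"missing": missing, "complete": len(values) - missing}
    return report

print(quality_report(records, ["age", "city"]))
ages = [r["age"] for r in records if r["age"] is not None]
print(mean(ages))  # 31.5
```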

STAGE 3: DATA PREPARATION
 Select your data: decide on the data that you are going to use for analysis.
 Clean your data: raise the data quality to the level required by the analysis
techniques that you have selected, by for example selecting clean subsets of data /
handling missing data.
 Construct required data: derive new attributes / generate records (completely new)
 Integrate data: merge and aggregate data
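The cleaning, construction, and integration steps above can be sketched like this (the sales and customer records are invented for illustration):

```python
from statistics import mean

# Invented records from two sources
sales = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": None}, {"id": 3, "amount": 60.0}]
customers = {1: "NL", 2: "BE", 3: "NL"}

# Clean: impute missing amounts with the mean of the observed values
observed = [r["amount"] for r in sales if r["amount"] is not None]
fill = mean(observed)  # 80.0
for r in sales:
    if r["amount"] is None:
        r["amount"] = fill

# Construct: derive a completely new attribute from an existing one
for r in sales:
    r["high_value"] = r["amount"] >= 80.0

# Integrate: merge in the country field from the second source
for r in sales:
    r["country"] = customers[r["id"]]

print(sales[1])  # amount imputed, high_value and country added
```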

STAGE 4: MODELLING
 Select the modelling technique, together with any modelling assumptions.
 Set up test and training sets
 Build the model: list the parameter settings, the models produced and the model
descriptions.
 Assess the model: discuss results with experts (considering project goal) and revise
parameter settings: tune them for the next modelling run.
Iterate model building and assessment until you strongly believe that you have found the
best model.
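A toy sketch of the test/training split and the build-assess-tune loop, using a simple threshold "model" on invented data:

```python
import random

# Invented data: the label is 1 exactly when the feature exceeds 5
data = [(x, int(x > 5)) for x in range(10)]
random.seed(0)
random.shuffle(data)

# Set up test and training sets (70/30 split)
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

def accuracy(threshold, rows):
    """Fraction of rows the threshold model classifies correctly."""
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)

# Iterate over parameter settings on the training set and keep the best
best = max(range(10), key=lambda t: accuracy(t, train))
print(best, accuracy(best, test))
```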

STAGE 5: EVALUATION
 Evaluate your results: judge quality of model by taking business criteria into account
and approve the proper models
 Review process: check whether the approved model fulfils the project's tasks and requirements.
 Determine next steps: list the possible actions and decide what to do.
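A minimal sketch of judging a model against business success criteria; the metric names and thresholds here are assumptions:

```python
# Approve a model only if every business criterion is met
def approve(model_metrics, criteria):
    return all(model_metrics.get(k, 0) >= v for k, v in criteria.items())

criteria = {"accuracy": 0.80, "recall": 0.70}
print(approve({"accuracy": 0.85, "recall": 0.75}, criteria))  # True
print(approve({"accuracy": 0.85, "recall": 0.60}, criteria))  # False
```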

Paper 3: Principles of Data Wrangling
Structure of a dataset refers to the format and encoding of its records and fields.
 You want a rectangular dataset: table with a fixed number of rows and columns.
 If the record fields in a dataset are not consistent (some records have additional
fields, others are missing fields), then you have a jagged table.
 The encoding of the dataset specifies how the record fields are stored and presented
to the user, like what time zones are used for times.
 In many cases, it is advisable to encode a dataset in plain text, such that it is human-
readable. Drawback: takes up a lot of space.
 More efficient is to use binary encodings of numerical values.
 Finding out the structure is mostly about counting the number of records and fields
in the dataset and determining the dataset’s encoding.
 A few extra questions to ask yourself when assessing the structure of a dataset:
o Do all records in the dataset contain the same fields?
o How are the records delimited in the dataset?
o What are the relationship types between records and the record fields?
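Checking for a jagged table can be as simple as comparing each record's field count against the header; the CSV data here is hypothetical:

```python
import csv
import io

# Hypothetical CSV: record 2 is missing its age field
raw = "id,name,age\n1,Ann,34\n2,Bob\n3,Cy,29\n"
rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], rows[1:]

# A jagged table: not every record has the same number of fields
jagged = [r for r in records if len(r) != len(header)]
print(jagged)  # [['2', 'Bob']]
```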

Granularity of a dataset refers to the kinds of entities that each data record represents or
contains information about.
 In their most common form, records in a dataset will contain information about
many instances of the same kind of entity (like a customer ID).

 We look at granularity in terms of coarseness and fineness: the level of depth or the
number of distinct entities represented by a single record of your dataset.
 Fine: single record represents a single entity (single transaction at store)
 Coarse: single record represents multiple entities (sales per week per region)
 A few questions to ask yourself when assessing the data granularity:
o What kind of things do the records represent? (Person, object, event, etc.)
o What alternative interpretations of the records are there?
 If the records are customers, could they actually be all known
contacts (only some of which are customers)?
o Example: one dataset has as location the country, the other dataset has
coordinates
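The fine-versus-coarse distinction can be illustrated by aggregating invented per-transaction records into per-week, per-region records:

```python
from collections import defaultdict

# Fine-grained: one record per transaction (invented data)
transactions = [
    {"week": 1, "region": "North", "amount": 10},
    {"week": 1, "region": "North", "amount": 15},
    {"week": 1, "region": "South", "amount": 7},
    {"week": 2, "region": "North", "amount": 9},
]

# Coarse-grained: one record per (week, region) pair
coarse = defaultdict(int)
for t in transactions:
    coarse[(t["week"], t["region"])] += t["amount"]

print(dict(coarse))  # {(1, 'North'): 25, (1, 'South'): 7, (2, 'North'): 9}
```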

Accuracy of a dataset refers to its quality: the values populating record fields in the dataset
should be consistent and accurate.
 Common inaccuracies are misspellings of categorical variables, lack of appropriate
categories, underflow and overflow of numerical values, missing field components
 A few questions to ask yourself when assessing the data accuracy:
o Are the date-times specific? Are the address components consistent and
correct? Are numeric items like phone numbers complete?
o Is data entered by people? Because that increases the chance of misspellings.
o Does the distribution of inaccuracies affect many records?
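A sketch of such accuracy checks on made-up records, flagging misspelled categories and incomplete phone numbers:

```python
# Known-good category values (an assumption for illustration)
valid_cities = {"Amsterdam", "Utrecht", "Rotterdam"}

records = [
    {"city": "Amsterdam", "phone": "0612345678"},
    {"city": "Amstredam", "phone": "0612345678"},  # misspelled category
    {"city": "Utrecht", "phone": "0612"},          # incomplete phone number
]

def issues(record):
    """List the accuracy problems found in a single record."""
    problems = []
    if record["city"] not in valid_cities:
        problems.append("unknown city")
    if not (record["phone"].isdigit() and len(record["phone"]) == 10):
        problems.append("bad phone")
    return problems

flagged = []
for i, r in enumerate(records):
    problems = issues(r)
    if problems:
        flagged.append((i, problems))

print(flagged)  # [(1, ['unknown city']), (2, ['bad phone'])]
```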

Temporality deals with how accurate and consistent the data is over time.
 Even when time is not explicitly represented in a dataset, it is still important to
understand how time may have impacted the records in a dataset.
 Therefore, it is important to know when the dataset was generated.
 A few questions to ask yourself when assessing the data temporality:
o Were all the records and record fields collected at the same time?
o Have some records or record fields been modified after the time of creation?
o In what ways can you determine if the data is stale?
o Can you forecast when the values in the dataset might get stale?
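A minimal staleness check, assuming each record carries a creation date and a one-year freshness window:

```python
from datetime import date

def is_stale(record_date, today, max_age_days=365):
    """A record is considered stale once it exceeds the freshness window."""
    return (today - record_date).days > max_age_days

today = date(2022, 3, 21)
print(is_stale(date(2020, 1, 1), today))  # True
print(is_stale(date(2022, 1, 1), today))  # False
```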

Scope of a dataset has 2 dimensions:
 1) The number of distinct attributes represented in a dataset.
 2) The attribute-by-attribute population coverage: for each attribute, are all
values represented in the dataset, or have some been randomly, intentionally, or
systematically excluded?
 The larger the scope, the larger the number of fields.
 As with granularity, you want to include only as much detail as you might use.
 A few questions to ask yourself when assessing the scope of your data:
o Given the granularity, what characteristics of the things represented by the
records are captured by the record fields? And what characteristics are not?
o Are the record fields consistent? Like does the customer’s age field make
sense relative to the date-of-birth field?
o Are the same record fields available for all records?
o Are there multiple records for the same thing? If so, does this change
granularity?
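The age versus date-of-birth consistency check mentioned above can be sketched like this (the reference date and records are invented):

```python
from datetime import date

def age_on(dob, ref):
    """Age in whole years on the reference date."""
    return ref.year - dob.year - ((ref.month, ref.day) < (dob.month, dob.day))

def consistent(record, ref):
    """Does the stored age field agree with the date-of-birth field?"""
    return record["age"] == age_on(record["dob"], ref)

ref = date(2022, 3, 21)
print(consistent({"age": 30, "dob": date(1992, 1, 5)}, ref))  # True
print(consistent({"age": 25, "dob": date(1992, 1, 5)}, ref))  # False
```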
