100% satisfaction guarantee Immediately available after payment Both online and in PDF No strings attached 4.2 TrustPilot
logo-home
Summary

Summary Data Mining for Business and Governance

Rating
-
Sold
6
Pages
66
Uploaded on
01-02-2023
Written in
2022/2023

English summary of the Data Mining fro Business and Governance course from Master in Data Science and Society. Summary of lecture materials, readings, and notes.

Institution
Course











Whoops! We can’t load your doc right now. Try again or contact support.

Written for

Institution
Study
Course

Document information

Uploaded on
February 1, 2023
Number of pages
66
Written in
2022/2023
Type
Summary

Subjects

Content preview

Data mining for business and governance
Week 1

What is data mining: the study of collecting, cleaning, processing, analyzing and gaining
useful insight of information.
- It is an umbrella term and the methods used relates to different disciplines:
o Knowledge discovery in databases
o Statistics
o Artificial intelligence
o Machine learning

Key aspects:
- Computation vs large data sets: trade-off between processing time and memory.
- Computation enables analysis of the large data sets: computers as a tool and with
growing data.
- Data mining often implies knowledge discovery from databases: from unstructured
data to structured knowledge.

What are large amounts or big data:
- Volume
o Too big for manual analysis
o Too big to fit in RAM
o Too big to store on disk
- Variety
o Range of values: variance
o Outliers, confounders and noise
o Different data types
- Velocity
o Data changes quickly: require results before data changes
o Streaming data (no storage)
It is not only about volume but also about complexity (variety) and for example the speed of
the database.

Applications of data mining:
- Companies: business intelligence
- Science: knowledge discovery

,Datapoints could be observations that you have in a dataset. They could be related
(dependent) or independent.
Dependency oriented data types: explicit, like in social networks, relationships can be
captured by edges in graph representations, or implicit like temperature readings of a
sensor, similar values, this relationship is not market explicitly:
- Spatial data, spatio temporal data
- Network data
- Time series data
- String data (text)
- Discrete sequences (event logs)
The dependency properties relate for example to the assumptions of the methods that can
be applied.

In short: Implicit data is information that is not provided intentionally but gathered from
available data streams, either directly or through analysis of explicit data. Explicit data is
information that is provided intentionally, for example through surveys and membership
registration forms.

Non dependency oriented: no specified dependency records:
- Multidimensional data, can be quantitative, categorical, binary (can be seen as
quantitative or categorical)
- Text data with a representation ignoring the relationship, for example looking into
the frequency of words.
For machine learning models, observations are assumed to be independent of each other.

,The general pipeline:




What makes prediction possible?
- Fitting data is easy, but predictions are hard




- Associations between features/target
o Numerical: correlation coefficient
o Categorical: mutual information value of x1 contains information about value
of x2

Statistical descriptions of data:
Measures of central tendency:
- Mean: average
- Median: the middle vale in a set of ordered data values
- Mode: value that occurs most frequently in the set




Statistical descriptions are relevant for summarizing the data, for having an overview when
we are exploring the data.

, Measuring the spread of data, five number summary:
- Range: (max()) – (min())
- Quantiles: points taken at regular intervals of a data distribution, dividing it into
essentially equal size consecutive sets. The second quantile is the median, the 4
quantiles are quartiles (3 data points Q1, Q2, Q3), and 100 quartiles are percentiles.
- Interquartile range: IQR= Q3 – Q1

Basic plots: Box plot: includes Q1, median, Q3, min and max values as well as outliers, points
are at least 1,5 IQR further away from Q1 and Q3. It shows you information about how the
data is spread.




When there are outliers, instead of min max, points that are at 1,5IQR distance from the
median are used instead of the min max values for the whisks (horizontal lines outside the
box). Box plots allow us to compare distributions of several features.

Measuring the dispersion of data:
- Variance standard deviation



Basic plot: scatter plot: when you have two features. It gives you an overview about how the
data is distributed over the x, y plane.

Correlation: measure of how two features are moving together. There is also a coefficient
related to that that you can calculate.
- Pearson’s r measures the strength of linear relationship (dependency)

Get to know the seller

Seller avatar
Reputation scores are based on the amount of documents a seller has sold for a fee and the reviews they have received for those documents. There are three levels: Bronze, Silver and Gold. The better the reputation, the more your can rely on the quality of the sellers work.
liekebuuron Avans Hogeschool
Follow You need to be logged in order to follow users or courses
Sold
170
Member since
5 year
Number of followers
103
Documents
15
Last sold
3 days ago

3.2

11 reviews

5
4
4
2
3
1
2
0
1
4

Recently viewed by you

Why students choose Stuvia

Created by fellow students, verified by reviews

Quality you can trust: written by students who passed their tests and reviewed by others who've used these notes.

Didn't get what you expected? Choose another document

No worries! You can instantly pick a different document that better fits what you're looking for.

Pay as you like, start learning right away

No subscription, no commitments. Pay the way you're used to via credit card and download your PDF document instantly.

Student with book image

“Bought, downloaded, and aced it. It really can be that simple.”

Alisha Student

Frequently asked questions