Summary Data Mining for Business and Governance

Rating: -
Sold: 6
Pages: 66
Uploaded on: 01-02-2023
Written in: 2022/2023

English summary of the Data Mining for Business and Governance course from the Master in Data Science and Society. Summary of lecture materials, readings, and notes.

Data mining for business and governance
Week 1

What is data mining: the study of collecting, cleaning, processing and analyzing data, and
gaining useful insights from it.
- It is an umbrella term, and the methods used relate to different disciplines:
o Knowledge discovery in databases
o Statistics
o Artificial intelligence
o Machine learning

Key aspects:
- Computation vs large data sets: trade-off between processing time and memory.
- Computation enables the analysis of large data sets: computers serve as the tool for
handling growing amounts of data.
- Data mining often implies knowledge discovery from databases: going from unstructured
data to structured knowledge.

What are large amounts or big data:
- Volume
o Too big for manual analysis
o Too big to fit in RAM
o Too big to store on disk
- Variety
o Range of values: variance
o Outliers, confounders and noise
o Different data types
- Velocity
o Data changes quickly: results are required before the data changes
o Streaming data (no storage)
Big data is not only about volume but also about complexity (variety) and, for example, the
speed at which the data changes (velocity).

Applications of data mining:
- Companies: business intelligence
- Science: knowledge discovery

Data points are the observations in a dataset. They can be related (dependent) or
independent.
Dependency-oriented data types: the dependency can be explicit, as in social networks, where
relationships are captured by edges in a graph representation, or implicit, as in the
temperature readings of a sensor, where readings have similar values but this relationship
is not marked explicitly:
- Spatial data, spatio-temporal data
- Network data
- Time series data
- String data (text)
- Discrete sequences (event logs)
The dependency properties matter, for example, for the assumptions of the methods that can
be applied.

In short: Implicit data is information that is not provided intentionally but gathered from
available data streams, either directly or through analysis of explicit data. Explicit data is
information that is provided intentionally, for example through surveys and membership
registration forms.

Non-dependency-oriented data: no dependencies are specified between records:
- Multidimensional data, which can be quantitative, categorical or binary (binary data can
be seen as either quantitative or categorical)
- Text data with a representation ignoring the relationship, for example looking into
the frequency of words.
For machine learning models, observations are assumed to be independent of each other.
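The word-frequency representation mentioned above can be sketched in a few lines. This is a minimal illustration, not a method taken from the course: the helper name `bag_of_words`, the simple punctuation stripping and the two example sentences are all assumptions.

```python
from collections import Counter


def bag_of_words(text):
    """Return word frequencies, ignoring word order (hypothetical helper)."""
    # Lowercase, split on whitespace and strip simple punctuation from each token.
    tokens = [word.strip(".,;:!?").lower() for word in text.split()]
    return Counter(token for token in tokens if token)


doc_a = "Data mining finds patterns in data."
doc_b = "In data, mining finds data patterns."

counts_a = bag_of_words(doc_a)
counts_b = bag_of_words(doc_b)

print(counts_a)
# The two sentences differ in word order but share one representation,
# because the relationship between words is discarded.
print(counts_a == counts_b)
```

Because only the counts are kept, the text is treated as a non-dependency-oriented record rather than as a sequence.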

The general pipeline: (figure not included in this extract)

What makes prediction possible?
- Fitting data is easy, but predictions are hard




- Associations between features/target
o Numerical: correlation coefficient
o Categorical: mutual information (the value of x1 contains information about the
value of x2)
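For the categorical case, mutual information can be estimated directly from co-occurrence counts. The sketch below assumes the standard definition, the sum over value pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y))), measured in bits; the function name and the toy features are made up for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate the mutual information (in bits) of two categorical sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in count_xy.items():
        p_xy = count / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


weather = ["sun", "sun", "rain", "rain"]
umbrella = ["no", "no", "yes", "yes"]   # fully determined by weather
coin = ["h", "t", "h", "t"]             # unrelated to weather

mi_dependent = mutual_information(weather, umbrella)
mi_independent = mutual_information(weather, coin)
print(mi_dependent, mi_independent)
```

In this toy data the umbrella feature shares one full bit of information with the weather, while the coin shares none.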

Statistical descriptions of data:
Measures of central tendency:
- Mean: average
- Median: the middle value in a set of ordered data values
- Mode: value that occurs most frequently in the set




Statistical descriptions are relevant for summarizing the data, for having an overview when
we are exploring the data.
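As a small illustration, the three measures of central tendency can be computed with Python's standard `statistics` module. The sample values are invented; the second list only changes the largest value to show how differently the mean and the median react.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(values)      # sum of the values divided by their count
median_value = statistics.median(values)  # average of the two middle values here
mode_value = statistics.mode(values)      # most frequent value

print(mean_value, median_value, mode_value)

# Replacing the largest value with an extreme one shifts the mean but not the median.
with_extreme = [2, 3, 3, 5, 7, 100]
mean_extreme = statistics.mean(with_extreme)
median_extreme = statistics.median(with_extreme)
print(mean_extreme, median_extreme)
```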

Measuring the spread of data, five number summary:
- Range: max(x) - min(x)
- Quantiles: points taken at regular intervals of a data distribution, dividing it into
essentially equal-sized consecutive sets. The 2-quantile is the median, the 4-quantiles
are quartiles (three cut points: Q1, Q2, Q3), and the 100-quantiles are percentiles.
- Interquartile range: IQR= Q3 – Q1
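The quantities in this list can be computed together as follows. This is a sketch under one assumption worth stating: quartile conventions differ between tools, and the `method="inclusive"` option of `statistics.quantiles` is just one of them, so other software may report slightly different Q1 and Q3 values. The helper name and data are illustrative.

```python
import statistics


def five_number_summary(values):
    """Return min, Q1, median, Q3 and max (hypothetical helper)."""
    # n=4 asks for the three cut points that split the data into four parts.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values)}


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(data)

value_range = summary["max"] - summary["min"]
iqr = summary["q3"] - summary["q1"]

print(summary)
print("range:", value_range, "IQR:", iqr)
```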

Basic plots: Box plot: includes Q1, median, Q3, min and max values as well as outliers, i.e.
points that lie more than 1.5 × IQR below Q1 or above Q3. It shows how the data is
spread.




When there are outliers, the whiskers (the lines extending from the box) do not run to the
min and max values. Instead they end at the most extreme points that still lie within
1.5 × IQR of Q1 and Q3. Box plots allow us to compare distributions of several features.
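The whisker and outlier rule can be checked numerically without drawing the plot. The sketch below assumes the common 1.5 × IQR convention measured from Q1 and Q3 and the "inclusive" quartile method of the `statistics` module; the function name `iqr_fences` and the data are illustrative.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return the lower and upper limits Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [1, 2, 3, 4, 5, 6, 7, 8, 30]
lower, upper = iqr_fences(data)

# Points beyond the limits are drawn individually as outliers.
outliers = [v for v in data if v < lower or v > upper]

# The whiskers end at the most extreme points that are not outliers.
inside = [v for v in data if lower <= v <= upper]
whisker_low, whisker_high = min(inside), max(inside)

print("limits:", lower, upper)
print("outliers:", outliers)
print("whiskers:", whisker_low, whisker_high)
```

Here the value 30 falls outside the upper limit, so the upper whisker stops at 8 instead of at the maximum.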

Measuring the dispersion of data:
- Variance and standard deviation
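The formulas did not survive in this extract, so the sketch below uses the textbook definitions as an assumption: the variance is the average squared deviation from the mean, and the standard deviation is its square root. Whether the course divides by n (population) or n - 1 (sample) is not known, so the hypothetical helper supports both.

```python
import math


def variance(values, sample=False):
    """Average squared deviation from the mean (n - 1 in the denominator if sample=True)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return squared_deviations / (n - 1 if sample else n)


data = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = variance(data)
population_std = math.sqrt(population_variance)
sample_variance = variance(data, sample=True)

print(population_variance, population_std, sample_variance)
```

The standard deviation is expressed in the same unit as the data, which makes it easier to interpret than the variance.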



Basic plot: scatter plot: when you have two features. It gives you an overview about how the
data is distributed over the x, y plane.

Correlation: a measure of how two features move together. It can be quantified with a
correlation coefficient.
- Pearson’s r measures the strength of the linear relationship (dependency) between two features
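Pearson's r can be computed from its usual definition, the covariance of the two features divided by the product of their standard deviations. The sketch below is illustrative (function name and data are assumptions) and also shows why the word "linear" matters: a perfectly quadratic relationship on symmetric inputs gives r = 0.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]   # exact linear relationship
quadratic = [v ** 2 for v in x]   # strong dependency, but not linear

r_linear = pearson_r(x, linear)
r_quadratic = pearson_r(x, quadratic)
print(r_linear, r_quadratic)
```

A coefficient near zero therefore rules out a linear relationship only, not a dependency in general.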
