100% tevredenheidsgarantie Direct beschikbaar na je betaling Lees online óf als PDF Geen vaste maandelijkse kosten 4.2 TrustPilot
logo-home
Samenvatting

Summary/Lecture notes Data mining for Business & Governance

Beoordeling
-
Verkocht
10
Pagina's
38
Geüpload op
25-03-2023
Geschreven in
2022/2023

Summary/Lecture notes for the course Data Mining for Business and Governance. Includes all lectures.

Instelling
Vak











Oeps! We kunnen je document nu niet laden. Probeer het nog eens of neem contact op met support.

Geschreven voor

Instelling
Studie
Vak

Documentinformatie

Geüpload op
25 maart 2023
Aantal pagina's
38
Geschreven in
2022/2023
Type
Samenvatting

Onderwerpen

Voorbeeld van de inhoud

Lectures data mining

Lecture 1
Pattern classification
- In this problem, we have 3 numerical variables (features) to be
used to predict the outcome (decision class).
- It’s multi-class since we have 3 possible outcomes
The goal in pattern classification is to build a model able to generalize
well beyond the historical training data.

In this lecture we cover 3 main things:
1. How to deal with missing values
2. How to compute the correlation/association between two
features
3. Methods to encode categorical features and handle class imbalance

Missing values
Missing values might result from fields that are not always applicable, incomplete
measurements, lost values.
Imputation strategies for missing values:
1. Simplest strategy → remove the feature containing missing values.
➢ Recommended when the majority of the instances (observations) have missing
values for that feature.
➢ However, there are situations in which we have a few features or the feature we
want to remove is deemed relevant.
2. If we have scattered missing values and few features, we might want to remove the
instances having missing values.
3. Most popular → replacing the missing values for a given feature with a
representative value such as the mean, the median or the mode of that feature.
➢ However, we need to be aware that we are introducing noise.
4. Fancier strategies include estimating the missing values with a machine learning
model trained on the non-missing information.
5. Autoencoders are deep neural networks
that involve two neural blocks named
encoder and decoder. The encoder reduces
the problem dimensionality while the
decoder completes the pattern.
➢ They use unsupervised learning to
adjust the weights that connect the
neurons.

,Feature scaling
1. Normalization
➢ Different features might encode different measurements
and scales (the age and height of a person)
➢ Normalization allows encoding all numeric features in the
[0,1] scale
➢ We subtract the minimum from the value to be
transformed and divide the result by the feature range.
2. Standardization
➢ This transformation method is similar to the
normalization, but the transformed values might not be in
the [0,1] interval
➢ We subtract the mean form the value to be transformed
and divide the result by the standard deviation.
➢ Normalization and standardization might lead to different
scaling results.

Normalization vs. standardization




- These feature scaling approaches might be affected by extreme values.

Feature interaction
1. Correlation between two numerical variables → Sometimes, we need to measure the
correlation between numerical features describing a certain problem domain.
➢ For example, what is the correlation between gender and income in Sweden?




2. Pearson’s correlation → it is used when we want to determine the correlation
between two numerical variables given k observations.
➢ It is intended for numerical variables only and its value lies in [-1, 1]
➢ The order of variables does not matter since the coefficient is symmetric.

Example: correlation between age and glucose levels

,The terminology can be different. We use correlation when we are working with numerical
data, and we use association when we are working with categorical data.

3. Association between two categorical variables → sometimes, we need to measure
the association degree between two categorical (ordinal or nominal) variables.
➢ For example, what is the association between gender and eye color?
4. The X2 association measure → it is used when we want to measure the association
between two categorical variables given k observations.
➢ We should compare the frequencies of values appearing together with their
individual frequencies
➢ The first step in that regard would be to create a contingency table.
➢ Let us assume that a categorical variable X involves m possible categories while Y
involves n categories.
➢ The observed value gives how many times each combination was found.
➢ The expected value is the multiplication of the individual frequencies divided by
the number of observations.

Association between gender and eye color




Such an example is very likely for in the exam.

Encoding strategies
Encoding categorical features → some machine learning, data mining algorithms or
platforms cannot operate with categorical features. Therefore, we need to encode these
features as numerical quantities.
1. Label encoding → consists of assigning integer numbers to each category. It only
makes sense if there is an ordinal relationship among the categories.
➢ E.g., weekdays, months, star-based hotel ratings, income
categories.
2. One-hot encoding → is used to encode nominal features that
lack an ordinal relationship. Each category of the categorical
feature is transformed into a binary feature such that one
marks the category.
➢ This strategy often increases the problem dimensionality
notably since each feature is encoded as a binary vector.

, Class imbalance
Sometimes we have problems with much more instances belonging to a decision class than
the other classes.
- In this example, we have more instances labelled with the
negative decision class than the positive one.
Classifiers are tempted to recognize the majority decision class only.

Simple strategies:
1. Under sampling
2. Oversampling
One strategy is to select some instances from the majority decision class,
provided we retain enough instances.
Another method consists of creating new instances belonging to the
minority class (creating random copies)
These strategies are applied to the data when building the model.


SMOTE → synthetic minority oversampling technique. It is a popular
strategy to deal with class imbalance.
- Creates synthetic instances in the neighborhoods of instances
belonging to the minority class.
- Caution is advised since the classifier is forced to learn from
artificial instances, which might induce noise.


Lecture 2
Classification problem
In this problem, we have four categorical (ordinal and nominal) features to be
used to predict the outcome.

We have only two possible outcomes or decision classes (binary problem).
The goal in pattern classification is to build a model to generalize well beyond
the historical training data.

Rule-based learning: in this approach, the classification problem is modelled as
a set of rules involving features and their values in the antecedent of such rules
and decision classes in the consequent.
- Algorithm → decision trees are perhaps the most popular algorithm of this
paradigm.

Maak kennis met de verkoper

Seller avatar
De reputatie van een verkoper is gebaseerd op het aantal documenten dat iemand tegen betaling verkocht heeft en de beoordelingen die voor die items ontvangen zijn. Er zijn drie niveau’s te onderscheiden: brons, zilver en goud. Hoe beter de reputatie, hoe meer de kwaliteit van zijn of haar werk te vertrouwen is.
sophiedekkers54 Tilburg University
Volgen Je moet ingelogd zijn om studenten of vakken te kunnen volgen
Verkocht
45
Lid sinds
7 jaar
Aantal volgers
24
Documenten
3
Laatst verkocht
1 maand geleden

3.0

2 beoordelingen

5
0
4
0
3
2
2
0
1
0

Recent door jou bekeken

Waarom studenten kiezen voor Stuvia

Gemaakt door medestudenten, geverifieerd door reviews

Kwaliteit die je kunt vertrouwen: geschreven door studenten die slaagden en beoordeeld door anderen die dit document gebruikten.

Niet tevreden? Kies een ander document

Geen zorgen! Je kunt voor hetzelfde geld direct een ander document kiezen dat beter past bij wat je zoekt.

Betaal zoals je wilt, start meteen met leren

Geen abonnement, geen verplichtingen. Betaal zoals je gewend bent via iDeal of creditcard en download je PDF-document meteen.

Student with book image

“Gekocht, gedownload en geslaagd. Zo makkelijk kan het dus zijn.”

Alisha Student

Veelgestelde vragen