Summary/Lecture notes Data mining for Business & Governance

Uploaded on: 25-03-2023

Written in: 2022/2023

Summary/Lecture notes for the course Data Mining for Business and Governance. Includes all lectures.

Written for

Institution: Tilburg University (UVT)
Study programme: Data Science & Society
Course: Data mining (880662M6)

Document information

Uploaded on: 25 March 2023
Number of pages: 38
Written in: 2022/2023
Type: Summary

Lectures data mining

Lecture 1
Pattern classification
- In this problem, we have 3 numerical variables (features) that are
used to predict the outcome (decision class).
- It is a multi-class problem since there are 3 possible outcomes.
The goal in pattern classification is to build a model able to generalize
well beyond the historical training data.

In this lecture we cover 3 main things:
1. How to deal with missing values
2. How to compute the correlation/association between two
features
3. Methods to encode categorical features and handle class imbalance

Missing values
Missing values might result from fields that are not always applicable, incomplete
measurements, or lost values.
Imputation strategies for missing values:
1. Simplest strategy → remove the feature containing missing values.
➢ Recommended when the majority of the instances (observations) have missing
values for that feature.
➢ However, this is not always an option: we might have only a few features, or the
feature we want to remove might be deemed relevant.
2. If we have scattered missing values and few features, we might want to remove the
instances having missing values instead.
3. Most popular → replace the missing values of a given feature with a
representative value such as the mean, the median or the mode of that feature.
➢ However, we need to be aware that we are introducing noise.
4. Fancier strategies include estimating the missing values with a machine learning
model trained on the non-missing information.
5. Autoencoders are deep neural networks that involve two neural blocks named
encoder and decoder. The encoder reduces the problem dimensionality while the
decoder completes the pattern.
➢ They use unsupervised learning to adjust the weights that connect the neurons.
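
As an illustration of strategy 3, the sketch below replaces missing entries with the mean, median or mode of the observed values. It is a minimal example using only the Python standard library; the function names `impute` and `missing_ratio`, the use of `None` to mark a missing value, and the toy columns are assumptions made for this example, not part of the lecture.

```python
from statistics import mean, median, mode

def missing_ratio(values):
    """Fraction of entries that are missing (None) in one feature column."""
    return sum(v is None for v in values) / len(values)

def impute(values, strategy="mean"):
    """Replace None entries with a representative value of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature without observed values")
    representative = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [representative if v is None else v for v in values]

# Hypothetical toy columns; None marks a missing value.
age = [22, None, 35, 41, None, 30]
eye_color = ["blue", "brown", None, "brown", "green", "brown"]

age_imputed = impute(age, "mean")          # numerical feature: use the mean
color_imputed = impute(eye_color, "mode")  # categorical feature: use the mode
```

A value such as `missing_ratio(age)` can help decide between the strategies: if most instances lack the feature, removing the feature (strategy 1) may be more sensible than imputing it.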

Feature scaling
1. Normalization
➢ Different features might encode different measurements
and scales (e.g., the age and height of a person).
➢ Normalization allows encoding all numeric features in the
[0,1] scale.
➢ We subtract the minimum from the value to be
transformed and divide the result by the feature range (maximum minus minimum).
2. Standardization
➢ This transformation method is similar to
normalization, but the transformed values might not be in
the [0,1] interval.
➢ We subtract the mean from the value to be transformed
and divide the result by the standard deviation.
➢ Normalization and standardization might lead to different
scaling results.

Normalization vs. standardization

- Both feature scaling approaches might be affected by extreme values, since these shift
the minimum, maximum, mean and standard deviation used in the transformation.
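
The two transformations can be sketched as follows. This is a minimal standard-library example; the function names and the toy `heights` column are assumptions, and the population standard deviation (`pstdev`) is used here although the notes do not say whether the population or the sample version is intended.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max normalization: (x - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("feature range is zero; normalization is undefined")
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: (x - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)  # assumption: population std. dev.
    if sigma == 0:
        raise ValueError("standard deviation is zero; standardization is undefined")
    return [(v - mu) / sigma for v in values]

# Hypothetical heights in cm; the last entry is an extreme value.
heights = [150, 160, 170, 180, 400]
heights_normalized = normalize(heights)
heights_standardized = standardize(heights)
```

With the extreme value 400 in the column, the four ordinary heights are squeezed into a small part of the [0,1] interval after normalization, which illustrates the sensitivity to extreme values noted above.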

Feature interaction
1. Correlation between two numerical variables → Sometimes, we need to measure the
correlation between numerical features describing a certain problem domain.
➢ For example, what is the correlation between age and income in Sweden?

2. Pearson’s correlation → it is used when we want to determine the correlation
between two numerical variables given k observations.
➢ It is intended for numerical variables only and its value lies in [-1, 1]
➢ The order of variables does not matter since the coefficient is symmetric.

Example: correlation between age and glucose levels
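
Pearson's coefficient divides the sum of the products of the deviations from the means by the product of the square roots of the summed squared deviations. A minimal sketch of that computation is given below; the function name `pearson` and the age and glucose values are hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two numerical variables."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same number (at least 2) of observations")
    k = len(x)
    mean_x, mean_y = sum(x) / k, sum(y) / k
    # Sum of products of deviations from the means (numerator).
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the summed squared deviations (denominator).
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)

# Hypothetical observations (k = 6).
age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

r = pearson(age, glucose)
```

Swapping the arguments, `pearson(glucose, age)`, gives the same value because the coefficient is symmetric.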

The terminology differs by data type: we use correlation when working with numerical
data and association when working with categorical data.

3. Association between two categorical variables → sometimes, we need to measure
the association degree between two categorical (ordinal or nominal) variables.
➢ For example, what is the association between gender and eye color?
4. The χ² (chi-squared) association measure → it is used when we want to measure the
association between two categorical variables given k observations.
➢ We should compare the frequencies of values appearing together with their
individual frequencies.
➢ The first step in that regard would be to create a contingency table.
➢ Let us assume that a categorical variable X involves m possible categories while Y
involves n categories.
➢ The observed value gives how many times each combination of categories was found.
➢ The expected value of a combination is the product of the individual frequencies of
its two categories divided by the number of observations.

Association between gender and eye color

Such an example is very likely to appear in the exam.
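
The χ² computation can be sketched as follows: build the contingency table of observed counts, derive the expected count of each combination from the individual frequencies, and sum (observed - expected)² / expected over all m × n cells. This is a minimal standard-library sketch; the function name `chi_squared` and the gender and eye color observations are hypothetical.

```python
from collections import Counter

def chi_squared(x, y):
    """Chi-squared association statistic between two categorical variables."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y need the same, non-zero number of observations")
    k = len(x)
    observed = Counter(zip(x, y))           # contingency table of joint counts
    freq_x, freq_y = Counter(x), Counter(y)  # individual frequencies
    statistic = 0.0
    for a in freq_x:
        for b in freq_y:
            expected = freq_x[a] * freq_y[b] / k
            # Counter returns 0 for combinations that were never observed.
            statistic += (observed[(a, b)] - expected) ** 2 / expected
    return statistic

# Hypothetical observations (k = 8).
gender = ["m", "m", "m", "m", "f", "f", "f", "f"]
eye_color = ["blue", "blue", "brown", "brown", "blue", "brown", "brown", "brown"]

association = chi_squared(gender, eye_color)
```

A statistic of zero means the observed counts match the expected ones exactly, so larger values indicate a stronger association.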

Encoding strategies
Encoding categorical features → some machine learning and data mining algorithms or
platforms cannot operate with categorical features. Therefore, we need to encode these
features as numerical quantities.
1. Label encoding → consists of assigning an integer number to each category. It only
makes sense if there is an ordinal relationship among the categories.
➢ E.g., weekdays, months, star-based hotel ratings, income
categories.
2. One-hot encoding → is used to encode nominal features that
lack an ordinal relationship. Each category of the categorical
feature is transformed into a binary feature such that a value of one
marks the presence of that category.
➢ This strategy often increases the problem dimensionality
notably since each feature is encoded as a binary vector.
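
Both encodings can be sketched in a few lines. The example below is a minimal illustration with hypothetical function names and toy data; in particular, the category order passed to `label_encode` is an assumption the user has to supply, since the ordinal relationship cannot be inferred from the strings themselves.

```python
def label_encode(values, ordered_categories):
    """Map each category to its position in a user-supplied ordinal ordering."""
    index = {category: i for i, category in enumerate(ordered_categories)}
    return [index[v] for v in values]

def one_hot_encode(values):
    """Turn a nominal feature into one binary feature per category."""
    categories = sorted(set(values))  # fixed column order for reproducibility
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return categories, rows

# Hypothetical ordinal feature: income categories.
income = ["low", "high", "medium", "low"]
income_encoded = label_encode(income, ["low", "medium", "high"])

# Hypothetical nominal feature: eye color.
eye_color = ["blue", "brown", "blue", "green"]
columns, eye_color_encoded = one_hot_encode(eye_color)
```

The single eye color feature becomes three binary features here, one per category, which shows how one-hot encoding increases the dimensionality.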

Class imbalance
Sometimes we have problems with many more instances belonging to one decision class than
to the other classes.
- In this example, we have more instances labelled with the
negative decision class than the positive one.
Classifiers are tempted to recognize the majority decision class only.

Simple strategies:
1. Undersampling → select only some instances from the majority decision class,
provided we retain enough instances.
2. Oversampling → create new instances belonging to the
minority class (e.g., random copies of existing ones).
These strategies are applied to the data used when building the model.
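
Random undersampling and oversampling can be sketched as below. This is a minimal standard-library example; the function names, the choice to balance both classes to exactly the same size, and the toy instances are assumptions made for illustration.

```python
import random

def undersample(majority, minority, rng):
    """Keep a random subset of the majority class as large as the minority class."""
    return rng.sample(majority, len(minority)), list(minority)

def oversample(majority, minority, rng):
    """Add random copies of minority instances until both classes have equal size."""
    copies = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + copies

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical instances: 8 negative (majority) and 2 positive (minority).
negatives = [("neg", i) for i in range(8)]
positives = [("pos", i) for i in range(2)]

under_neg, under_pos = undersample(negatives, positives, rng)
over_neg, over_pos = oversample(negatives, positives, rng)
```

Undersampling to the minority size leaves only 4 instances in this toy case, which is why the notes add the condition that enough instances must be retained.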

SMOTE → synthetic minority oversampling technique. It is a popular
strategy to deal with class imbalance.
- Creates synthetic instances in the neighborhoods of instances
belonging to the minority class.
- Caution is advised since the classifier is forced to learn from
artificial instances, which might induce noise.
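
The core idea of SMOTE can be sketched as follows: pick a minority instance, pick one of its k nearest minority neighbours, and create a new instance at a random position on the line segment between the two. This is a simplified illustration rather than a reference implementation; the function name `smote`, the Euclidean distance, the default `k`, and the toy points are assumptions.

```python
import random
from math import dist

def smote(minority, n_synthetic, k=2, rng=None):
    """Create synthetic minority instances by interpolating between neighbours."""
    if len(minority) < 2:
        raise ValueError("at least two minority instances are required")
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        # k nearest minority neighbours of the base instance (Euclidean distance).
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position between base (0) and neighbour (1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority instances with two numerical features.
minority_points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote(minority_points, n_synthetic=10, k=2, rng=random.Random(0))
```

Every synthetic instance lies between two real minority instances, so it stays in their neighbourhood; whether such artificial instances help or add noise still has to be checked for the problem at hand.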

Lecture 2
Classification problem
In this problem, we have four categorical (ordinal and nominal) features to be
used to predict the outcome.

We have only two possible outcomes or decision classes (binary problem).
The goal in pattern classification is to build a model able to generalize well beyond
the historical training data.

Rule-based learning: in this approach, the classification problem is modelled as
a set of rules involving features and their values in the antecedent of such rules
and decision classes in the consequent.
- Algorithm → decision trees are perhaps the most popular algorithm of this
paradigm.
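
To make the structure of such rules concrete, the sketch below stores each rule as a pair of an antecedent (feature values that must all match) and a consequent (the decision class), and returns the class of the first rule that fires. The features, values, rules and default class are entirely hypothetical and are not taken from the lecture's dataset.

```python
def classify(instance, rules, default):
    """Return the consequent of the first rule whose antecedent matches the instance."""
    for antecedent, consequent in rules:
        if all(instance.get(feature) == value for feature, value in antecedent.items()):
            return consequent
    return default  # no rule fired

# Hypothetical rules: (antecedent, consequent).
rules = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": "yes"}, "no"),
    ({"outlook": "overcast"}, "yes"),
]

instance = {"outlook": "sunny", "humidity": "high", "windy": "no"}
prediction = classify(instance, rules, default="yes")
```

A decision tree can be read as a compact way of organising such rules: each path from the root to a leaf corresponds to one rule.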

Seller: sophiedekkers54 (Tilburg University)
