
Summary of Introduction to Data Science course


This document presents an in-depth summary of the Introduction to Data Science course within the Cognitive Science and Artificial Intelligence bachelor program at Tilburg University. The summary covers the basic topics of Machine Learning and Data Science, such as Supervised and Unsupervised Learning, at an intermediate level. It also provides clear explanations of the most widely used algorithms in ML, covering both Classification and Regression analysis in Supervised Learning and clustering and dimensionality reduction techniques in Unsupervised Learning. This summary can be used to study these topics for a variety of courses besides the aforementioned one.


Document information

Whole book summarized? No
What part of the book is summarized? Most of it, but not all.
Uploaded on: 20 August 2019
Number of pages: 22
Written in: 2018/2019
Type: Summary

Preview of the content

INTRODUCTION TO DATA SCIENCE SUMMARY

● Supervised Learning: when you have input variables (x) and an output variable (y) and
teach an algorithm to map a function from inputs to outputs. The aim is for the algorithm
to be able to predict the output variable (y) from new input data (x). It is called supervised
because the process of an algorithm learning from the training dataset can be thought of
as a teacher supervising the learning process. We know the correct answers (classes,
labels); the algorithm iteratively makes predictions on the training data and is corrected
by the teacher. Learning stops when the algorithm achieves an acceptable level of
performance. (A short code sketch follows the algorithm lists below.)
○ Classification: a task where the output variable is a category, such as color
(red, blue, green) or diagnosis (ill, not ill). The model trained from the data defines
a decision boundary that separates the data.
■ Logistic Regression, Neural Networks (Multi-Layer Perceptron), Naive
Bayes, KNN, Decision Trees, Linear SVMs, Kernel SVMs, Ensemble
Learning (e.g. Random Forests, Gradient Boosting)
■ Types of classifiers:
● Instance-based classifiers: use observations directly without
models, e.g. K-nearest neighbors
● Generative: p(x|y), build a generative statistical model, rely on all
points to learn the generative model, e.g. Bayes classifiers
● Discriminative: p(y|x), directly estimate a decision rule/boundary,
mainly care about the boundary, e.g. decision trees
○ Regression: a task where the output variable is a real value, such as “dollars” or
“weight”. The model fits the data to describe the relation between two features, or
between a feature (e.g., height) and the target value.
■ Linear, Polynomial Regression, NN (MLP) Regression, Bayesian Ridge
Regression, KNN Regression, Decision Trees Regression, Linear SVM
Regression, Kernel SVM Regression, Ensemble Learning (e.g. Random
Forests Regression, Gradient Boosting Regression)
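
As a minimal illustration of the supervised setup described above, the sketch below fits one classifier and one regressor with scikit-learn. It is not part of the original course material: the dataset (iris) and the model choices are arbitrary assumptions.

# Minimal supervised-learning sketch (assumed scikit-learn API; dataset choice is arbitrary).
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Classification: predict a categorical label y from features X.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Regression: predict a real-valued target (here one feature is used as the target, purely for illustration).
X_reg, y_reg = X[:, :3], X[:, 3]
reg = LinearRegression().fit(X_reg, y_reg)
print("regression R^2:", reg.score(X_reg, y_reg))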
● Unsupervised Learning: when you only have input data (x) and no corresponding output
variables. The aim here is to model the underlying structure or distribution in the data in
order to learn more about the data. It is called unsupervised because there are no correct
answers and there is no teacher. Algorithms are left on their own to discover and present
the interesting structure in the data. (A short clustering sketch follows this list.)
○ Clustering: where you want to discover the inherent groupings in the data, such as
grouping listeners by music genre preferences.
○ Association: where you want to discover rules that describe large portions of the data,
such as “people who listen to (x) also tend to listen to (y)”.
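
As a minimal illustration of clustering, the sketch below runs K-means on synthetic data with scikit-learn; the data generator and the number of clusters are assumptions made purely for the example (association-rule mining is not shown).

# Minimal unsupervised clustering sketch (assumed scikit-learn API; data is synthetic).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # generated labels are ignored: no teacher
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("discovered cluster labels:", km.labels_[:10])
print("cluster centers:", km.cluster_centers_)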
● Supervised: all data is labeled and the algorithms learn to predict the output from the
input data.
● Unsupervised: all data is unlabeled and the algorithms learn the inherent structure from
the input data.
● Semi-supervised: some data is labeled but most of it is unlabeled, and a mixture of
supervised and unsupervised techniques can be used.
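
To illustrate the semi-supervised setting, the sketch below hides most labels (marking them as -1) and uses scikit-learn's SelfTrainingClassifier; the hidden-label fraction and the base estimator are arbitrary assumptions.

# Minimal semi-supervised sketch (assumed scikit-learn API): unlabeled points are marked as -1.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.7] = -1      # hide roughly 70% of the labels

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000)).fit(X, y_partial)
print("accuracy on the fully labeled data:", accuracy_score(y, model.predict(X)))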


Data preparation
● Scaling: a method used to normalize the range of independent variables or features of
the data; a short scaling sketch follows this list. Methods:
○ Min-max → the simplest method; consists of rescaling the range of the features
to [0, 1] or [−1, 1]
○ Mean normalization
○ Standardization → makes the values of each feature in the data have zero mean
and unit variance; determine the mean and standard deviation of each feature,
then subtract the mean from each feature's values, then divide the
(mean-subtracted) values of each feature by its standard deviation
○ Scale to unit length → scale the components of a feature vector such that the
complete vector has length one; means dividing each component by the
Euclidean length of the vector
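
The sketch below shows three of the scaling methods named above with scikit-learn; the toy matrix is an arbitrary assumption.

# Minimal scaling sketch (assumed scikit-learn API; the toy matrix is arbitrary).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, normalize

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 600.0]])

print(MinMaxScaler().fit_transform(X))     # min-max: each feature rescaled to [0, 1]
print(StandardScaler().fit_transform(X))   # standardization: zero mean, unit variance per feature
print(normalize(X, norm="l2"))             # scale to unit length: each row divided by its Euclidean norm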
● Missing values: remove them, because missing data can (1) introduce a substantial
amount of bias, (2) make the handling and analysis of the data more arduous, and (3)
reduce efficiency
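
A minimal sketch of dropping rows with missing values, assuming pandas; the toy frame is arbitrary.

# Minimal missing-value removal sketch (assumed pandas API; the toy frame is arbitrary).
import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [1.70, np.nan, 1.82], "weight": [65.0, 70.0, np.nan]})
print(df.dropna())   # drops every row that contains at least one missing value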
● Data balance: class imbalance occurs when your classes have different numbers of
examples; only rebalance if you really care about the minority class; imbalance leads to
hard-to-interpret accuracy
○ Undersampling and oversampling (sketched below)
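
A minimal oversampling sketch using sklearn.utils.resample; the toy data and the choice of resampling utility are assumptions (dedicated imbalance libraries also exist). Undersampling would instead shrink the majority class.

# Minimal oversampling sketch (assumed scikit-learn / pandas APIs; the toy frame is arbitrary).
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})   # 8 majority vs. 2 minority examples
majority = df[df["y"] == 0]
minority = df[df["y"] == 1]

# Oversample the minority class up to the majority size.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced["y"].value_counts())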
● Binning: makes the model more robust and prevents overfitting; however, it comes at a
cost to performance; every time you bin something you sacrifice information and make
your data more regularized; the trade-off between performance and overfitting is the key
point of the binning process
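
A minimal binning sketch assuming pandas; the bin edges and labels are arbitrary.

# Minimal binning sketch (assumed pandas API; bin edges and labels are arbitrary).
import pandas as pd

ages = pd.Series([12, 25, 37, 41, 68, 83])
bins = pd.cut(ages, bins=[0, 18, 40, 65, 100], labels=["child", "young", "adult", "senior"])
print(bins)   # each exact age is replaced by a coarser, more regularized category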
● Log Transform: helps to handle skewed data; after the transformation the distribution
becomes closer to normal; in most cases the order of magnitude of the data changes
within the range of the data; it also decreases the effect of outliers, because magnitude
differences are normalized, and the model becomes more robust; the data must contain
only positive values
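
A minimal log-transform sketch with NumPy; the skewed toy data is arbitrary (np.log1p is used here, which tolerates zeros, while plain np.log requires the strictly positive values noted above).

# Minimal log-transform sketch (NumPy; the skewed toy data is arbitrary).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0, 10.0, 1000.0])   # strongly right-skewed, with an outlier
print(np.log1p(x))   # log(1 + x): the outlier is compressed and the distribution is closer to normal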
● Unsupervised Feature Reduction (a short sketch follows this list):
○ Variance-based → remove features with (near-)zero variance or very few unique values
○ Covariance-based → remove correlated features
○ PCA → project the data onto a lower-dimensional linear subspace
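
The sketch below illustrates all three reduction ideas on synthetic data; the feature names, thresholds, and APIs (scikit-learn, pandas) are assumptions.

# Minimal unsupervised feature-reduction sketch (assumed scikit-learn / pandas APIs; data is synthetic).
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "constant": np.ones(100),         # zero variance -> candidate for variance-based removal
    "a": rng.normal(size=100),
    "b": rng.normal(size=100),
})
X["a_copy"] = 2 * X["a"] + 0.01       # almost perfectly correlated with "a" -> covariance-based removal

X_var = VarianceThreshold().fit_transform(X)                        # drops the constant column
print("correlation(a, a_copy):", X["a"].corr(X["a_copy"]))          # ~1.0, so one of the two can be dropped
X_pca = PCA(n_components=2).fit_transform(X[["a", "b", "a_copy"]])  # project onto a 2-D linear subspace
print("shapes:", X.shape, "->", X_var.shape, "and", X_pca.shape)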
● Model: an equation that links the values of some features to the predicted value of the
target variable; finding the equation (and coefficients in it) is called ‘building a model’
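
To make ‘building a model’ concrete, the sketch below fits a linear equation and reads off its coefficients; the data and the use of scikit-learn's LinearRegression are assumptions for illustration.

# Minimal 'building a model' sketch: finding the coefficients of y ~ w1*x1 + w2*x2 + b
# (assumed scikit-learn API; the toy data is arbitrary).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = 3 * X[:, 0] + 2 * X[:, 1] + 1       # the "true" equation the model should recover

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)     # ~ [3, 2]
print("intercept:", model.intercept_)   # ~ 1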
● Feature selection vs. extraction: feature selection reduces the number of features by
selecting the important ones; feature extraction reduces the number of features by
means of a mathematical operation
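
The sketch below contrasts the two: selection keeps a subset of the original columns, while extraction computes new ones; the dataset and the specific selectors are assumptions.

# Minimal selection-vs-extraction sketch (assumed scikit-learn API; dataset choice is arbitrary).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keep the 2 original features that score best against the label.
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new features as (linear) combinations of all original ones.
X_extracted = PCA(n_components=2).fit_transform(X)

print(X.shape, "->", X_selected.shape, "(selection) and", X_extracted.shape, "(extraction)")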
● Evaluation metrics (a short metrics sketch follows this list):
○ Accuracy → the ratio of the number of correct predictions to the total number of
input samples
○ Logarithmic Loss → works by penalising false classifications; works well for
multi-class classification; the classifier must assign a probability to each class
for every sample; values nearer to 0 indicate better performance, values further
from 0 indicate worse
○ Confusion matrix → a table showing correct predictions (the diagonal) and the
types of incorrect predictions made (what classes incorrect predictions were
assigned)
○ Precision → a measure of a classifier's exactness; the number of true positive
predictions divided by the total number of positive predictions; low
precision indicates many false positives (FP)
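
The sketch below computes the four metrics above with scikit-learn on a tiny hand-made example; the labels and predicted probabilities are arbitrary assumptions.

# Minimal evaluation-metrics sketch (assumed scikit-learn API; the toy labels are arbitrary).
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss, precision_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
y_prob = [[0.2, 0.8], [0.7, 0.3], [0.6, 0.4], [0.1, 0.9], [0.4, 0.6], [0.3, 0.7]]   # per-class probabilities

print("accuracy:", accuracy_score(y_true, y_pred))            # correct predictions / all predictions
print("log loss:", log_loss(y_true, y_prob))                  # closer to 0 is better
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))          # TP / (TP + FP); low precision = many FPs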
