Summary

Midterm Summary Data Mining for Business and Governance (880022-M-6)

Name: Midterm Summary Data Mining for Business and Governance (880022-M-6)
SKU: doc_1541308
Rating: 2.00 (2 reviews)
Author: Socnerd


This document contains a summary of the first three modules/weeks of the course Data Mining for Business and Governance. The following topics are included in this summary: ⋅ What is data mining? ⋅ What are the related disciplines? ⋅ What are the applications? ⋅ What is big data? ⋅ Supervised and unsupervised learning ⋅ Examples of supervised and unsupervised learning ⋅ Workflow of supervised learning ⋅ Descriptive analysis: data visualization, exploring data distribution, detecting outliers, testing hypotheses ⋅ Representation of data ⋅ Learning and tuning - training set - validation set - test set ⋅ Parameter or model tuning ⋅ Evaluation - generalisation - overfitting, underfitting ⋅ Correlation coefficient ⋅ Covariance ⋅ Correlation versus causation ⋅ Caveats of correlation coefficient ⋅ Anscombe’s quartet ⋅ Regression - linear regression ⋅ Dependent / independent variables ⋅ Classification ⋅ Classification examples/applications ⋅ Decision trees ⋅ Multi-class classification ⋅ Decision boundaries ⋅ Dimensionality reduction ⋅ Clustering ⋅ What makes prediction possible? ⋅ Logistic regression ⋅ Evaluation metrics ⋅ R-squared (R²), root mean square error (RMSE), mean absolute error (MAE) ⋅ Distance metrics (Manhattan, Euclidean, Minkowski, Hamming, Chebyshev, Cosine) ⋅ k-Nearest Neighbours (k-NN) ⋅ Variance - bias ⋅ Hyperparameters / parameters ⋅ Confusion table ⋅ Accuracy, precision, recall, F1-score ⋅ (k-fold) cross validation, leave one out method, hold out method ⋅ ROC curve


School, study and subject

Institution: Tilburg University (UVT)
Study: Data Science & Society
Course: Data Mining For Business And Governance (880022M6)


Document information

Uploaded on: February 4, 2022
Number of pages: 14
Written in: 2021/2022
Type: Summary

Topics

880022 m 6
data mining for business and governance

Content preview

What is data mining?
Data mining is the computational process of discovering patterns in large datasets.

What are the related disciplines?
Artificial intelligence, machine learning and statistics.

What are the applications?
The actual extraction of knowledge from data using models. Examples are found in science
and business.

What is big data?
Big data is measured in volume, variety and velocity.

Volume:
• Too big for manual analysis
• Too big to fit in RAM
• Too big to store on disk

Variety:
• Big range of values
• Outliers, confounders and noise
• Different data types

Velocity:
• Data changes quickly
• Streaming, or online, data

Supervised and unsupervised learning
Supervised learning uses labeled data containing examples and the desired target variable.
Unsupervised learning uses unlabeled data with no target variable.

Examples of supervised and unsupervised learning
Supervised: linear regression. Describes the relationship between two variables and predicts
the value of one continuous variable based on another variable.
Supervised: classification. Assigns instances to known groups (classes) based on their
features.
Unsupervised: dimensionality reduction. This is the process of reducing the number of
features into a set of principal, important features for analysis. This can be done through
feature selection or feature extraction.
Unsupervised: clustering. The grouping of similar datapoints that have no labels.
Unsupervised: association. Used to discover the co-occurrence of items in a database.
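
As a small illustration of the association example above, the sketch below counts how often pairs of items occur together. The shopping baskets and the helper name count_pairs are made up for this example; real association mining usually adds measures such as support and confidence, which are not shown here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; each set is one transaction in the database.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"butter", "eggs"},
]

def count_pairs(baskets):
    """Count how often each pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair a canonical order, e.g. ("bread", "milk")
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

pair_counts = count_pairs(transactions)
most_common_pair, frequency = pair_counts.most_common(1)[0]
print(most_common_pair, frequency)  # ('bread', 'milk') 3
```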

Workflow of supervised learning
1. Collect data
2. Label examples
3. Choose feature representation
4. Train model
5. Evaluate model

Descriptive analysis: data visualization, exploring data distribution, detecting outliers,
testing hypotheses
A visualization of the data can give you an idea of how the data is distributed. This is usually
done with graphs. These visualizations also make it possible to detect outliers in the data.
Hypotheses can be tested using statistical tests.

Representation of data
Data are represented by features. These can be numerical or categorical. It is possible to
convert features into a vector: a fixed-size list of values. Some algorithms require features
represented as vectors.
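
A minimal sketch of such a conversion, assuming one numerical feature (age) and one categorical feature (colour) that is one-hot encoded. The feature names and the category list are hypothetical and only serve the illustration.

```python
# Assumed fixed list of categories for the hypothetical "colour" feature.
COLOURS = ["red", "green", "blue"]

def to_vector(age, colour):
    """Return a fixed-size list: [age, is_red, is_green, is_blue]."""
    one_hot = [1.0 if colour == c else 0.0 for c in COLOURS]
    return [float(age)] + one_hot

vector = to_vector(age=31, colour="green")
print(vector)  # [31.0, 0.0, 1.0, 0.0]
```

Every record maps to a list of the same length, whatever its colour, which is what algorithms that expect vectors rely on.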

Learning and tuning - training set - validation set - test set
A model is said to learn if its performance on a task (T), as measured by a performance
measure (P), improves with experience (E).

We sample, or split, our data into a training, validation and test set. We use stratification to
ensure all sets are structurally the same. We use a certain algorithm to build a model. We
train this model on the training set. We use the validation set to determine how well our
current parameter configuration performs and to tune the algorithm to see which
configuration performs best. We evaluate this ‘best’ model on our test set. Our test set thus
remains unseen until the very end.
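
The split described above can be sketched as follows. This is only an illustration: the 60/20/20 fractions, the function name stratified_split and the toy labels are assumptions, not part of the course material.

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split the data into train/validation/test, preserving label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, validation, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]

    def pick(idx):
        return [(examples[i], labels[i]) for i in idx]

    return pick(train), pick(validation), pick(test)

# Hypothetical data: 10 "spam" and 20 "ham" examples.
labels = ["spam"] * 10 + ["ham"] * 20
examples = list(range(len(labels)))
train_set, validation_set, test_set = stratified_split(examples, labels)
print(len(train_set), len(validation_set), len(test_set))  # 18 6 6
```

Because the cut is made per class, each of the three sets keeps the original one-third share of "spam" examples.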

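[Figure: a collection of classified examples is split into training examples (a training set
and a validation set) and a test set. The training set is used to train the model, the
validation set to tune and evaluate it, and the resulting optimized model is evaluated on the
test set.]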

In general, we want to either 1) outperform state-of-the-art models doing the same task
(otherwise there is no need for us to train our own) or, if there is no such model, 2) beat
some simple model. The latter is known as the baseline. For linear regression, we can use
the mean baseline: we predict the mean target value of the training examples for every
instance in the test set. This baseline performs well if the target value is normally
distributed. For classification we can use the majority baseline: we predict the most frequent
label in the training examples for every instance in the test set. This baseline performs well
if there is one common, dominant class.
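
One common way to implement these two baselines is sketched below. It assumes the baseline predicts the training mean (regression) or the most frequent training label (classification) for every test instance. The function names and data are hypothetical.

```python
from collections import Counter
from statistics import mean

def mean_baseline(train_targets, n_test):
    """Regression baseline: predict the training mean for every test instance."""
    return [mean(train_targets)] * n_test

def majority_baseline(train_labels, n_test):
    """Classification baseline: predict the most frequent training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return [majority_label] * n_test

# Hypothetical data for illustration only.
train_targets = [2.0, 4.0, 6.0]
test_targets = [3.0, 5.0]
regression_predictions = mean_baseline(train_targets, len(test_targets))

train_labels = ["no", "no", "yes", "no"]
test_labels = ["no", "yes", "no"]
class_predictions = majority_baseline(train_labels, len(test_labels))
baseline_accuracy = (
    sum(p == t for p, t in zip(class_predictions, test_labels)) / len(test_labels)
)
print(regression_predictions, class_predictions, round(baseline_accuracy, 2))
```

A trained model is only interesting if it scores better on the test set than these trivial predictions do.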

Parameter or model tuning
Tuning can be informally defined as the process of selecting the hyperparameter value
that gives the highest performance when the corresponding model is evaluated on our
validation set. The model with this hyperparameter value is then evaluated on our test set.
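
A minimal sketch of this selection step, assuming a tiny one-feature k-NN classifier whose hyperparameter is k. The data, the candidate values and the helper names are made up for illustration.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict the majority label among the k training points closest to x."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def accuracy(train, dataset, k):
    """Fraction of instances in dataset that k-NN labels correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in dataset)
    return correct / len(dataset)

# Hypothetical one-feature data: (feature value, label).
training_set = [(1.0, "a"), (1.5, "a"), (2.0, "b"), (3.0, "a"),
                (6.0, "b"), (6.5, "b"), (7.0, "b")]
validation_set = [(1.2, "a"), (2.4, "a"), (5.5, "b"), (6.8, "b")]
test_set = [(0.8, "a"), (6.2, "b")]

candidate_ks = [1, 3, 5]
# Tune: score every candidate on the validation set only.
validation_scores = {k: accuracy(training_set, validation_set, k) for k in candidate_ks}
best_k = max(validation_scores, key=validation_scores.get)
# Evaluate: the test set is touched once, with the chosen value.
test_score = accuracy(training_set, test_set, best_k)
print(validation_scores, best_k, test_score)
```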

Evaluation - generalisation - overfitting, underfitting
We evaluate models to see whether they correctly predict the target. Data mining
experiments evaluate models on noisy sources to test whether an observed pattern is real
or can be ascribed to generalization errors. Any machine learning task can be formally
evaluated by comparing the true values of the target with the predicted values of the target.
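
As an illustration of comparing true and predicted target values, the sketch below computes accuracy for labels and mean absolute error for numbers. The example values are hypothetical, and these two measures are just two of many possible choices.

```python
def accuracy(true_values, predicted_values):
    """Fraction of predictions that exactly match the true label."""
    matches = sum(t == p for t, p in zip(true_values, predicted_values))
    return matches / len(true_values)

def mean_absolute_error(true_values, predicted_values):
    """Average absolute difference between true and predicted numbers."""
    errors = [abs(t - p) for t, p in zip(true_values, predicted_values)]
    return sum(errors) / len(errors)

# Hypothetical classification output: 3 of 4 labels are correct.
acc = accuracy(["spam", "ham", "ham", "spam"], ["spam", "ham", "spam", "spam"])
# Hypothetical regression output: absolute errors are 1, 0 and 2.
mae = mean_absolute_error([3.0, 5.0, 8.0], [2.0, 5.0, 10.0])
print(acc, mae)  # 0.75 1.0
```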

Generalization can be defined as the ability of a model to correctly predict completely new
instances that are most dissimilar to the instances that we have seen. If we test on instances
that are similar, we would not get a good indication of generalization.

A model overfits when it captures all the variance in the training examples, including the
noise. It will not fit new test data well: the model is too complex.
A model underfits when it fails to capture enough of the variance, so it predicts poorly on
both the training and the test set: the model is too simple.
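
The contrast can be made concrete with a toy experiment. The sketch below is an assumption-laden illustration, not the course's own example: the data are invented (true trend y = x with noisy training targets), and the three models are a constant predictor (too simple), a least-squares line, and a memorizing 1-nearest-neighbour lookup (too complex).

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function x -> prediction)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: the underlying trend is y = x, training targets are noisy.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
test_x = [1.2, 2.2, 3.2, 4.2, 5.2]
test_y = [1.2, 2.2, 3.2, 4.2, 5.2]

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)

def constant_model(x):
    """Too simple: ignores x and always predicts the training mean."""
    return mean_y

slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

def linear_model(x):
    """Least-squares straight line fitted to the training data."""
    return intercept + slope * x

def memorizing_model(x):
    """Too complex: returns the target of the single closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

for name, model in [("constant", constant_model),
                    ("linear", linear_model),
                    ("memorizing", memorizing_model)]:
    print(name, round(mse(model, train_x, train_y), 3),
          round(mse(model, test_x, test_y), 3))
```

The memorizing model reaches zero training error but does worse than the line on the test data (overfitting), while the constant model has a high error on both sets (underfitting).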

Correlation coefficient
Measures the strength of a linear relationship between two variables. An example is
Pearson’s r, calculated as follows:

$$
r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}
$$
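
A minimal sketch of Pearson's r computed directly from its definition (sum of cross-products divided by the square root of the product of the sums of squared deviations). The function name and the measurements are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sqrt(sum((x - mean_x) ** 2 for x in xs)
                       * sum((y - mean_y) ** 2 for y in ys))
    return numerator / denominator

# Hypothetical measurements.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r = pearson_r(xs, ys)
print(round(r, 2))  # 0.8
```

Here the cross-products sum to 8 and both sums of squared deviations equal 10, so r = 8 / sqrt(10 * 10) = 0.8.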

Covariance
Covariance is a measure of the joint variability of two variables: to what extent do the
variables change together? The sum of cross-products in its numerator is the same as the
numerator of the correlation coefficient:

$$
\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
$$
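
The sketch below computes the sample covariance with the n - 1 denominator and then rescales it by the two standard deviations, which yields the correlation coefficient. The function name and the data are hypothetical.

```python
from math import sqrt

def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-products divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical measurements.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

cov_xy = sample_covariance(xs, ys)
# The variance of a variable is its covariance with itself.
std_x = sqrt(sample_covariance(xs, xs))
std_y = sqrt(sample_covariance(ys, ys))
r = cov_xy / (std_x * std_y)
print(cov_xy, round(r, 2))  # 2.0 0.8
```

Unlike the covariance, whose size depends on the units of the variables, the rescaled value always lies between -1 and 1.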

Correlation versus causation
If two variables are correlated, it does not imply that one causes the other. Correlation
alone does not tell us what kind of relationship the two variables have. Causation requires
actual evidence that a change in one variable produces an effect in the other.
