Summary Seminar Data Analytics (grade: 8.7)

This is an English summary of the (Erasmus) Master course "Seminar Data Analytics" (seminar sessions). For this course, I got an 8.7.

Seminar Data Analytics
Topic 1 – Unsupervised learning
Data analytics
Data analytics refers to the process of transforming and analysing large datasets to produce
information that can be used effectively. It is vital for making sense of this information and
using it to improve decision-making and business outcomes.

Unsupervised learning (used without a predefined outcome to predict, only an exploratory
question → e.g. “Do our clients form natural groups based on similar attributes?”):
- Cluster analysis for identifying groups of observations with a similar profile according to a
specific criterion (big dataset → can groups be created?)
- Principal component analysis (PCA) for summarizing the most important information
contained in a multivariate dataset

Cluster analysis
Clustering is used to identify groups in a big dataset, where the observations within each
group are very similar to each other. Clustering algorithms compute the distances between
all observations and group together those that lie closest to each other.

Research question = “What are the factors that affect the rejection of a loan?”.

Euclidean distance
The classification of observations into groups requires the computation of the distance/
similarity between each pair of observations → measured by the Euclidean distance:
d_euc(x, y) = √( Σ_{i=1}^{n} (x_i − y_i)² ). A smaller Euclidean distance → greater similarity
between the pair of observations.
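
As a quick illustration, a minimal base-R sketch of this computation (the small matrix is made-up example data):

# Three observations with two variables each (made-up numbers)
x <- matrix(c(1, 2,
              4, 6,
              5, 8), ncol = 2, byrow = TRUE)

# dist() computes the Euclidean distance between all pairs by default
d <- dist(x, method = "euclidean")
print(d)

# The same distance for the first pair, written out explicitly
sqrt(sum((x[1, ] - x[2, ])^2))   # = 5, matches the first entry of d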

Cluster methods
Two main categories of clustering:
1. Partitioning methods → observations are assigned to one of K clusters.
K-means clustering is often used in practice because of its simplicity and speed:
- It is considered to be appropriate for large datasets.
- It is less sensitive to outliers.
The goal of clustering is to find groups such that the within-group variation is small
(minimized; small within-group variation indicates good clustering quality) and the
between-group variation is large (maximized).
Disadvantage = you have to decide on the number of clusters yourself.
K-means algorithm:
1. Choose the number of clusters, K.
2. Generate K random points as cluster centre/centroids.
* The algorithm starts by randomly selecting K observations to serve as the
initial centres for the clusters (= cluster means/centroids/centres).
3. Assign each point to the closest cluster centroid (using the Euclidean distance).
4. Recompute the new cluster centroid.
* After the assignment step, the algorithm computes the new mean value of
each cluster.
5. Repeat steps 3 and 4 until some convergence criterion is met.
* Usually the convergence criterion is that the assignment of observations to
clusters has not changed over multiple iterations.
(An R sketch of both clustering methods follows this list.)
2. Hierarchical methods → seek to build a hierarchy of clusters without having a fixed
number of clusters. Two types of clustering:
1) Agglomerative clustering: good for identifying small clusters
- Each observation starts in a cluster of its own
- The two nearest clusters are then merged, step by step, until there is only a
single cluster left
2) Divisive clustering: good for identifying large clusters
- The opposite of agglomerative clustering → begins with one cluster
- Then the most heterogeneous (i.e. dissimilar) clusters are successively divided until
all observations have a cluster of their own
The result of hierarchical clustering is a tree-based representation of the clusters,
known as a dendrogram. The height of the fusion, shown on the vertical axis,
indicates the (dis)similarity between two clusters: the higher the fusion, the less
similar the clusters.
Drawbacks = it isn’t appropriate for large datasets, doesn’t work if there are missing
values and doesn’t tell you where to cut the dendrogram.
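
A minimal R sketch of both clustering categories, using the built-in USArrests dataset as stand-in data (base R only; K = 4 and Ward linkage are illustrative choices, not prescribed here):

# Standardize the variables so each is equally weighted
df <- scale(USArrests)

# --- Partitioning: k-means ---
set.seed(123)                               # initial centroids are random
km <- kmeans(df, centers = 4, nstart = 25)  # nstart = several random starts
km$cluster                                  # cluster assignment per observation
km$centers                                  # final cluster centroids

# --- Hierarchical: agglomerative clustering ---
d <- dist(df, method = "euclidean")   # pairwise Euclidean distances
hc <- hclust(d, method = "ward.D2")   # successively merge the nearest clusters
plot(hc)                              # dendrogram; fusion height = dissimilarity
cutree(hc, k = 4)                     # cut the dendrogram into 4 clusters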

Basic steps involved in cluster analysis:
1. Formulating the problem → select the variables used as the basis for clustering
2. Deciding on the number of clusters
3. Computing the clusters
4. Plotting the clusters
5. Cluster validation
6. Interpreting clusters, drawing conclusions and using illustrative techniques

R and clustering
Deciding on the number of clusters
Steps:
1. Assess the necessity of forming clusters with the Hopkins statistic H. If H < 0.5 →
clustering is necessary/preferable.
2. Use the elbow method or silhouette method for identifying the number of clusters (see the R sketch after this list).
→ Elbow method
Choose a number of clusters so that adding another cluster does not add sufficient
information. This can be determined by plotting the within-group variation against
the number of clusters. At some point, the marginal gain from adding an additional
cluster will drop (the elbow).
→ Silhouette method (measures the quality of clustering)
The silhouette analysis takes both the variation within the clusters (ai) and the
variation between the clusters (bi) into account → Si = (bi − ai)/max(ai, bi):
* Si is close to 1 → good clusters
* Si is close to 0 → bad clusters
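
A short sketch of both steps, assuming the factoextra package (not base R); note that implementations differ in whether low or high Hopkins values indicate clusterable data, so check the documentation of the version you use:

# install.packages("factoextra")
library(factoextra)

df <- scale(USArrests)   # stand-in data

# Step 1: clustering tendency via the Hopkins statistic
get_clust_tendency(df, n = nrow(df) - 1)$hopkins_stat

# Step 2a: elbow method, plotting within-group variation against K
fviz_nbclust(df, kmeans, method = "wss")

# Step 2b: silhouette method, plotting average silhouette width against K
fviz_nbclust(df, kmeans, method = "silhouette")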

Cluster validation
Two commonly used measures for cluster validation (i.e. evaluating the goodness of the
clustering structure; see the R sketch after this list):
- Silhouette coefficient (values close to 1 are desirable)
- Dunn index (larger values are desirable)
D = min.separation/max.diameter
* min.separation → minimum inter-cluster distance (should be large)
* max.diameter → maximum intra-cluster distance (should be small)
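
A minimal sketch of both validation measures, assuming the cluster and fpc packages (the illustrative K = 4 k-means solution from above is reused):

library(cluster)   # provides silhouette()
library(fpc)       # provides cluster.stats()

df <- scale(USArrests)
d <- dist(df)
set.seed(123)
km <- kmeans(df, centers = 4, nstart = 25)

# Average silhouette width; values close to 1 are desirable
sil <- silhouette(km$cluster, d)
mean(sil[, "sil_width"])

# Dunn index = min.separation / max.diameter; larger values are desirable
cluster.stats(d, km$cluster)$dunn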

Principal component analysis (PCA)
The goal of PCA is to reduce a large dataset with many variables to a dataset with fewer
variables (i.e. reducing the dimensionality of the data), without losing important
information. Given a set of variables, find a small number of (latent → not directly
observed) variables that describe the original set as well as possible. Moreover, PCA is
useful when independent variables are correlated with each other.

PCA tries to describe the variables in X as well as possible by a few new variables that are a
linear combination of the variables in X:
- The new variables are called principal components/dimensions
- The principal components are uncorrelated
- Each principal component/dimension Zs can be expressed as:
Zs = u1X1 + u2X2 + … + upXp
* For the addition to make sense, the variables in X are often standardized (each
variable is equally weighted/important)
* The trick of PCA is to choose Zs such that as much of the variance of the variables
in X as possible is explained. The amount of variance retained by each principal
component is measured by its eigenvalue → a large eigenvalue (> 1) means that the
principal component explains a large amount of the variance.

The PCA method is useful when the variables within the dataset are highly correlated (=
indicates redundancy). Due to this redundancy, PCA can be used to reduce the original
variables into a smaller number of new variables explaining most of the variance in the
original variables.

Research question = “What are the main determinants of the financial performance and
health of Scandinavian companies?”.

Taken together, the main purpose of PCA is to:
- Identify hidden patterns in a dataset
- Reduce the dimensionality of the data by removing noise and redundancy in the data
- Identify correlated variables

Basic steps involved in PCA (see the R sketch after this list):
1. Formulate the problem
2. PCA on the dataset
3. Retain principal components
4. Retain variables
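
A minimal base-R sketch of these steps (USArrests as stand-in data; the eigenvalue > 1 rule from above is used to retain components):

# Step 2: PCA on the standardized data (scale. = TRUE → equal weighting)
pca <- prcomp(USArrests, scale. = TRUE)

# Step 3: retain principal components
summary(pca)    # proportion of variance explained per component
pca$sdev^2      # eigenvalues; retain components with eigenvalue > 1

# Step 4: inspect the coefficients u in Zs = u1X1 + ... + upXp
pca$rotation    # loadings: contribution of each original variable
head(pca$x)     # the new variables (principal component scores)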
