Lecture notes

Cheat sheet for during exam

Pages
2
Uploaded on
17-04-2022
Written in
2021/2022

Sheet with notes for use during the exam. It may be used during the exam as long as it is printed.

Document information

Teacher(s)
X
Contains
All lectures

Preview of the content

Categorical data (no intrinsic value)
Nominal: outcomes that have no natural order (hair colour: blond, ..)
Dichotomous: nominal with 2 outcomes
Ordinal: outcomes that have natural order (ratings: bad, good, ..)

Numerical data (intrinsic value)
Continuous: any value on scale
Interval: equal intervals -> equal differences, no fixed 0-point (temperature C, IQ, time)
Ratio: differences and ratios make sense, fixed 0-point (budget, temperature K, distance)
Discrete: only certain values (number of ..)

Location of percentile P: $L_P = 1 + \frac{P}{100}(n-1)$
Pth percentile value: $l + (L_P - \lfloor L_P \rfloor)(h - l)$

Lookup: know what and where
Browse: don't know what, know where
Locate: know what, not where
Explore: don't know what nor where

Key attribute = independent attribute
Value attribute = dependent attribute

Scatterplot: 2 quant. att., no keys, only values; points; horiz. + vert. position; find trends, outliers, distribution, correlation, clusters
Bar chart: 1 cat. att. (key) + 1 quant. att. (value); lines; length to express quant. value; spatial regions: one per mark; compare + look up values
Stacked bar chart: 2 cat. att., 1 quant. att.; vertical stack of line marks; glyph: composite object, internal structure from multiple marks; length and color hue; spatial regions: one per glyph; compare + look up values, part-to-whole relationship
Normalized stacked bar chart: same as stacked bar chart; reduces comparability for all cat. except lowest and highest
Line chart: 2 quant. att., 1 key, 1 value; points; aligned lengths to express quant. val.; separated + ordered by key att. into horizontal regions; find trend; connecting line emphasizes ordering of items along key axis by showing relationship between two items
Heatmap: 2 cat. att., 1 quant. att.; area; separate + align in 2D matrix, indexed by 2 cat. values; color by quant. att.; find clusters + outliers
Histogram: table; find distribution (shape); new table: keys are bins, values are counts; bin size crucial; related to kernel density estimate and rug plot
Boxplot: table; find distribution (group comparison); 5 quant. att.; median: central line; lower + upper quartiles: boxes; lower + upper fences: whiskers (first quartile - 1.5 IQR, third quartile + 1.5 IQR); outliers beyond fence shown
Violin plot: same as boxplot, outliers are represented in density plot
Bar vs line chart: depends on type of key att.: bar if key = cat. (nominal), line if key = ordered; never line for categorical key: violates expressiveness principle + trend so strong it overrides semantics
Box vs violin plot: boxplots hide essential aspects of dataset, violin plots better for representing differences in distribution of data

When not visualizing: individual precise values matter, summary and detail values, scale is broken, decision needed in min time
Effectiveness: how well visualization helps person with their tasks

Location summary statistics: level
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median: odd obs: middle value when ordered, even obs: avg of two middle values
Mode: most frequently occurring value

Scale statistics: spread
Range: max – min
IQR: 3rd quartile – 1st quartile
Sample var: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Sample sd: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
MAD: median of absolute deviation from median

Sample covariance and correlation: relation
$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ and $r_{xy} = \frac{s_{xy}}{s_x s_y}$

Categories data mining:
Predefined target? Yes -> supervised method, No -> unsupervised method
Info applicable to all or some data? All -> global method, Some -> local method

DM methods:
Lin regression: supervised, global
Clustering: unsupervised, global
Decision tree: supervised, global
Association rule learning: unsupervised, local

Linear regression:
Consider residual $y - \hat{y}$ betw. real value $y$ and predicted value $\hat{y} = b_0 + b_1 x$.
SSD = $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
Lower SSD -> better model
Best values are: …
Higher $R^2$ -> better model

Clustering:
Centroids represent clusters.
K-means clustering algorithm:
1. Pick k points as centroids
2. Assign points to nearest centroid
3. Recompute centroids: mean of points in cluster
4. Repeat steps 2 and 3
How well does centroid represent cluster: small distance -> good, large distance -> bad -> strive for low within-cluster distance W(C): sum of distances between centroids and each observation
W(C) decreases when k increases
When is clustering informative:
K = 1: one cluster -> no
K = n: as many clusters as obs -> no
1 < K < n -> yes
! no rule for k, but not minimizing W(C)

Equal treatment of att is important:
Use same units for similar attributes
Ensure units used lead to relevant distance for problem
Standardize units for dissimilar att

Distance linear regression:
Residuals/deviations
Determines SSD, and so optimal model
Basis of quality measure $R^2$

Distance clustering:
Distances determine clusters
Different attribute scales can be chosen, influencing distance
Basis of quality measure W(C)

Distance: measure for how close things are, how related things are; distances can be easily compared; no single appropriate distance
Euclidean distance: as the crow flies
Network distance: know network of possible movements, network is sparse (not too many possible roads)
Manhattan distance: movement is restricted to fixed grid

Decision tree:
Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$
Where to split: all-yes or all-no most informative, equal yes/no least informative; lowest avg entropy:
$H(p) = -p \log_2(p) - (1-p)\log_2(1-p)$

Association rule learning:
Support of itemset X: $\mathrm{supp}(X) = \frac{|X|}{n}$
Support of itemset X ∩ Y: $\mathrm{supp}(X \cap Y) = \frac{|X \cap Y|}{n}$
Confidence of rule X => Y: $\mathrm{conf}(X \Rightarrow Y) = \frac{|X \cap Y|}{|X|}$

Object system: 'real' world of a company, organization…
Information system: representation of real world in a computer system using data to represent objects

Not storing all data in one table: duplication of information, difficulty keeping information consistent, difficulty accessing + sharing data, hard to keep data safe/secure, hard to express interesting analytics

Database management systems (DBMS's) provide solutions: data redundancy + inconsistency, data security, efficient data analytics

Primary key = unique identifier

Logical schema (data model) – logical structure of database:
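
The percentile, variance, covariance and correlation formulas on the sheet can be checked with a few lines of Python. This is a minimal sketch, not part of the original notes. The function names are illustrative, and it assumes that l and h in the percentile formula are the sorted values at position L_P rounded down and at the next position.

```python
import math

def percentile(values, p):
    """P-th percentile via location L_P = 1 + (P/100)(n-1) and interpolation."""
    xs = sorted(values)
    n = len(xs)
    loc = 1 + (p / 100) * (n - 1)   # 1-based location L_P
    low = math.floor(loc)           # L_P rounded down
    l = xs[low - 1]                 # assumed: value at the rounded-down location
    h = xs[min(low, n - 1)]         # assumed: next value (same value at the top end)
    return l + (loc - low) * (h - l)

def sample_cov(xs, ys):
    """Sample covariance s_xy with the 1/(n-1) factor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_sd(xs):
    """Sample standard deviation: square root of the sample variance s_xx."""
    return math.sqrt(sample_cov(xs, xs))

def correlation(xs, ys):
    """Sample correlation r_xy = s_xy / (s_x * s_y)."""
    return sample_cov(xs, ys) / (sample_sd(xs) * sample_sd(ys))
```

For example, `percentile([10, 20, 30, 40], 25)` gives location 1.75 and interpolates between 10 and 20.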
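
The four k-means steps and the quality measure W(C) can be sketched as follows. This is an illustration under assumptions, not the course's implementation: it uses Euclidean distance, takes the initial centroids as an argument instead of picking them at random, and keeps an empty cluster's old centroid. All names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean ("as the crow flies") distance between two points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def assign(points, centroids):
    """Step 2: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in points]

def k_means(points, initial_centroids, max_iter=100):
    """Steps 1-4; step 1 (picking k points) is left to the caller."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Step 3: new centroid is the mean of the points in the cluster
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)  # assumption: an empty cluster keeps its centroid
        if updated == centroids:     # Step 4: stop once nothing changes
            break
        centroids = updated
    return centroids, assign(points, centroids)

def within_cluster_distance(points, centroids, labels):
    """W(C): sum of distances between each observation and its centroid."""
    return sum(euclidean(p, centroids[lab]) for p, lab in zip(points, labels))
```

Running it on the same points with k = 1, 1 < k < n and k = n shows why minimizing W(C) alone gives no rule for k: W(C) keeps shrinking and reaches 0 when every observation is its own cluster.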
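
The split rule for decision trees (lowest average entropy) can be illustrated with a short sketch. It assumes that the average is weighted by group size, which the sheet does not state, and that H(0) = H(1) = 0. The function names are illustrative.

```python
import math

def entropy(p):
    """Binary entropy H(p) = -p log2(p) - (1-p) log2(1-p), in bits."""
    if p in (0.0, 1.0):
        return 0.0  # convention: an all-yes or all-no group has zero entropy
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def average_entropy(groups):
    """Average entropy of a split, weighted by group size (assumption).

    Each group is a list of booleans: True for yes, False for no.
    """
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * entropy(sum(g) / len(g)) for g in groups if g)
```

A split into an all-yes group and an all-no group scores 0 (most informative), while groups with equal yes/no counts score 1 (least informative).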
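
Support and confidence from the association rule formulas can be computed directly from a list of transactions. This sketch reads |X| as the number of transactions that contain every item of X, and |X ∩ Y| as the number that contain all items of both X and Y. That reading is an assumption about the sheet's notation, and the function names are hypothetical.

```python
def support(transactions, itemset):
    """supp(X): fraction of the n transactions that contain every item of X."""
    items = set(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """conf(X => Y): transactions containing both X and Y, relative to those containing X."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Made-up example data for illustration only
baskets = [
    {"bread", "milk"},
    {"bread"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
bread_support = support(baskets, {"bread"})
bread_to_milk = confidence(baskets, {"bread"}, {"milk"})
```

Here three of the four baskets contain bread and two of those also contain milk, so the rule bread => milk holds in two thirds of the cases where it applies.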

Seller: jbtue, Technische Universiteit Eindhoven