Data
- Rows: elements (observations)
- Columns: features (variables); numeric or categorical

Scale of measurement
- nominal: names/labels (categorical)
- ordinal: can be ordered (categorical)
- interval: has the properties of ordinal; diff is important and expressed in a fixed unit
- ratio: has the properties of the rest; ratio is meaningful; 0 has a specific meaning (absolute zero); measured in a NATURAL UNIT

Probability dist.
- f(x): prob. of each value the random var may assume
- expected value: $E(x) = \mu = \sum x f(x)$; variance: $Var(x) = \sigma^2 = \sum (x - \mu)^2 f(x)$
- discrete uniform: $f(x) = 1/n$, n = # of values the random var may assume
- uniform prob dist (continuous): $f(x) = 1/(b - a)$ for $a \le x \le b$; 0 otherwise
  $E(x) = (a + b)/2$, $Var(x) = (b - a)^2/12$
- cont. prob dist: prob is defined for an interval instead of a specific point
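The uniform dist formulas in these notes can be checked numerically. Below is a minimal Python sketch; the function names and the interval [120, 140] are made up for illustration and are not from the notes.

```python
def uniform_mean(a, b):
    """E(x) = (a + b) / 2 for a continuous uniform dist on [a, b]."""
    return (a + b) / 2


def uniform_var(a, b):
    """Var(x) = (b - a)^2 / 12."""
    return (b - a) ** 2 / 12


def uniform_prob(a, b, x1, x2):
    """P(x1 <= x <= x2): area under f(x) = 1/(b - a), clipped to [a, b]."""
    lo, hi = max(a, x1), min(b, x2)
    return max(0.0, hi - lo) / (b - a)


def discrete_uniform_pmf(n):
    """f(x) = 1/n when the random var may assume n equally likely values."""
    return 1 / n


if __name__ == "__main__":
    a, b = 120, 140  # hypothetical interval, chosen only for illustration
    print("E(x)   =", uniform_mean(a, b))
    print("Var(x) =", uniform_var(a, b))
    print("P(120 <= x <= 130) =", uniform_prob(a, b, 120, 130))
    print("f(x) for a fair die =", discrete_uniform_pmf(6))
```

Because f(x) is flat, the probability of any interval is just its width times the height 1/(b - a), which is all `uniform_prob` does.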
cont. prob dist: prob of x within $x_1$ to $x_2$ = AUC of f(x) between $x_1$ and $x_2$

Normal prob dist
- $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)}$
- e = 2.71828, π = 3.14159
- defined by μ, σ

sampling dist of $\bar{p}$: $E(\bar{p}) = p$
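As a rough illustration of "prob = AUC" for the normal dist, the sketch below evaluates the density formula directly and gets the area between two points from the cumulative dist in Python's standard `statistics.NormalDist`. The values μ = 100 and σ = 15 are hypothetical.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Normal density: 1/(sigma*sqrt(2*pi)) * e^(-(x - mu)^2 / (2*sigma^2))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def prob_between(x1, x2, mu, sigma):
    """P(x1 <= x <= x2) = area under the curve between x1 and x2."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(x2) - dist.cdf(x1)


if __name__ == "__main__":
    mu, sigma = 100, 15  # hypothetical parameters
    print("f(mu) =", normal_pdf(mu, mu, sigma))
    print("P(85 <= x <= 115) =", prob_between(85, 115, mu, sigma))
```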
Descriptive Stat
- categorical: non-overlapping categories (classes)
- frequency dist: rela freq = freq / n
- class width = (max - min) / # class
- median, mode

sampling dist of $\bar{x}$
- $E(\bar{x}) = \mu$ (if E(point est.) = pop param, the point est. is UNBIASED)
- standard err of mean: $\sigma_{\bar{x}} = \sigma/\sqrt{n}$ for an INFINITE pop (if n/N ≤ 0.05, treat as infinite)
- finite pop: $\sigma_{\bar{x}} = \sqrt{\frac{N - n}{N - 1}} \cdot \frac{\sigma}{\sqrt{n}}$ (finite pop correction factor)

St normal dist
- mean = 0 & SD = 1; total AUC = 1
- $z = (x - \mu)/\sigma$ ↳ convert to St normal
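The standard err rule (treat the pop as infinite when n/N ≤ 0.05, otherwise apply the finite pop correction factor) can be sketched as follows. The function names and the numbers in the example are assumptions for illustration only.

```python
import math


def standard_error(sigma, n, N=None):
    """Standard error of the mean.

    N=None means an infinite pop. For a finite pop of size N the
    correction factor sqrt((N - n)/(N - 1)) is applied only when
    n/N > 0.05, following the rule of thumb in the notes.
    """
    se = sigma / math.sqrt(n)
    if N is not None and n / N > 0.05:
        se *= math.sqrt((N - n) / (N - 1))
    return se


def z_score(x, mu, sigma):
    """Convert x to the standard normal scale."""
    return (x - mu) / sigma


if __name__ == "__main__":
    print(standard_error(10, 25))            # infinite pop
    print(standard_error(10, 25, N=100))     # n/N = 0.25, correction applied
    print(standard_error(10, 25, N=10_000))  # n/N = 0.0025, treated as infinite
    print(z_score(115, 100, 15))
```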
pct freq = rela freq × 100

Sampling: sample size (for both t, z test); normal dist can be assumed
1. most cases: if n > 30; highly skewed or outliers: n > 50
2. if pop is not normal but roughly symmetric, n = 15 suffices
3. if pop is believed to be approx NORMAL, n < 15 can be used

point est: $\bar{x}$ for μ, s for σ, $\bar{p}$ for pop proportion

histogram: used for Quantitative data; no natural separation between bars
bar chart: used for categorical data; x-axis: category

central limit theorem: use if pop is not normally dist.; the sampling dist of $\bar{x}$ from a sample w/ size n can be approx normal if n becomes large
bar chart y-axis: freq / rela freq / pct freq
pie chart: display rela. freq / pct freq
histogram: horizontal axis: var of interest; vertical axis: freq / pct freq
skewness: L-skewed, symmetric, R-skewed

prob of sample mean within A from pop mean: use AUC btw $E(\bar{x}) - A$ and $E(\bar{x}) + A$

confidence: confidence of an interval; higher deg of confidence → higher MoE
Interval Est = [point est ± margin of error]
- σ known: z-test
- σ unknown: t-test (uses s to est σ)
- t-dist depends on d.f. = n - 1; ↑ d.f. → ↓ dispersion → closer to normal
- ex: 95% CI → α = 0.05, α/2 = 0.025 (1-tailed: use α instead of α/2)

pie chart: angle = % freq of circle × 360° (100% = 360°); if selected wisely, it can show other meaningful data info

skewness $= \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$
- est can be obtained based on hist.
- skewness = 0: symmetric
- moderately L-skewed or moderately R-skewed: OK to assume normal
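A small sketch of the σ-known interval est (point est ± margin of error), using `statistics.NormalDist().inv_cdf` to look up $z_{\alpha/2}$. The sample figures (x̄ = 82, σ = 20, n = 100) are made up for illustration.

```python
import math
from statistics import NormalDist


def z_interval(xbar, sigma, n, confidence=0.95):
    """Return (lower, upper, margin_of_error) for the mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    moe = z * sigma / math.sqrt(n)
    return xbar - moe, xbar + moe, moe


if __name__ == "__main__":
    for level in (0.90, 0.95, 0.99):
        lower, upper, moe = z_interval(82, 20, 100, confidence=level)
        print(f"{level:.0%} CI: {lower:.2f} to {upper:.2f} (MoE {moe:.2f})")
```

Running the loop shows the trade-off noted above: a higher deg of confidence gives a wider interval (higher MoE).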
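Numerical Measures

measure of location
- mean: $\bar{x} = \frac{\sum x_i}{n}$ (sample), $\mu = \frac{\sum x_i}{N}$ (pop); affected by outliers
- weighted mean: $\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
- percentile: location of pth percentile $= \frac{p}{100}(n + 1)$
- quartile

diff btw two means ($\mu_1 - \mu_2$), σ known
- $E(\bar{x}_1 - \bar{x}_2) = \mu_1 - \mu_2$; $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
- interval est: $(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
- test stat for hyp test: $z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
- * σ unknown: just replace $\sigma_1, \sigma_2$ w/ $s_1, s_2$ (t-dist)

hyp test (two means)
- lower-tailed: $H_0: \mu_1 - \mu_2 \ge D_0$, $H_a: \mu_1 - \mu_2 < D_0$
- upper-tailed: $H_0: \mu_1 - \mu_2 \le D_0$, $H_a: \mu_1 - \mu_2 > D_0$
- two-tailed: $H_0: \mu_1 - \mu_2 = D_0$, $H_a: \mu_1 - \mu_2 \ne D_0$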
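The two-mean test stat and its p-val can be sketched like this for the σ-known case. The function name, the `tail` argument, and the sample figures are assumptions for illustration.

```python
import math
from statistics import NormalDist


def two_mean_z_test(xbar1, xbar2, sigma1, sigma2, n1, n2, d0=0.0, tail="two"):
    """z test for mu1 - mu2 with known sigmas; returns (z, p_value)."""
    se = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    z = ((xbar1 - xbar2) - d0) / se
    cdf = NormalDist().cdf(z)
    if tail == "lower":      # Ha: mu1 - mu2 < D0
        p = cdf
    elif tail == "upper":    # Ha: mu1 - mu2 > D0
        p = 1 - cdf
    elif tail == "two":      # Ha: mu1 - mu2 != D0
        p = 2 * min(cdf, 1 - cdf)
    else:
        raise ValueError("tail must be 'lower', 'upper' or 'two'")
    return z, p


if __name__ == "__main__":
    # Hypothetical samples: means 82 and 78, sigma 10 each, n1 = 30, n2 = 40.
    for tail in ("two", "upper"):
        z, p = two_mean_z_test(82, 78, 10, 10, 30, 40, tail=tail)
        print(f"{tail}-tailed: z = {z:.3f}, p-val = {p:.4f}")
```

With these made-up numbers the upper-tailed test rejects H0 at α = 0.05 while the two-tailed test does not, which is why the form of Ha has to be fixed before looking at the data.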
geometric mean: $\bar{x}_g = \sqrt[n]{x_1 x_2 \cdots x_n}$
- Growth Factor = Return + 1
- note: gives the mean rate of change over period

Hypothesis Testing: FORM (one mean)
- one-tailed, lower: $H_0: \mu \ge \mu_0$, $H_a: \mu < \mu_0$
- one-tailed, upper: $H_0: \mu \le \mu_0$, $H_a: \mu > \mu_0$
- two-tailed: $H_0: \mu = \mu_0$, $H_a: \mu \ne \mu_0$

measure of variability
- Range = max - min: sensitive to outlier
- IQR = Q3 - Q1: not sensitive to outlier
- variance: $s^2$, $\sigma^2$
- standard deviation: easier to interpret than variance (same unit as data)
standard deviation: $s = \sqrt{s^2}$, $\sigma = \sqrt{\sigma^2}$
variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ (sample), $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ (pop)
coef of variation = (SD / mean × 100)%: useful for comparing variability of 2+ data w/ diff means

Error
- Type I: REJECT H0 when it's TRUE; can be controlled at α
- Type II: ACCEPT H0 when it's FALSE; avoid by using 'do not reject H0' instead (not enough evidence to reject)
- at most 1 err can be controlled at a time

p-val approach: Reject H0 if p-val ≤ α
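The variability measures can be computed with Python's standard `statistics` module, as in the sketch below. The five data values are invented for illustration, and `method="exclusive"` is chosen because it places quartiles at the (p/100)(n + 1) location.

```python
import statistics


def describe_variability(data):
    """Range, IQR, sample variance, sample SD and coef of variation (%)."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample SD, divides by n - 1
    q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return {
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "variance": statistics.variance(data),
        "sd": s,
        "cv_pct": s / mean * 100,
    }


if __name__ == "__main__":
    sample = [46, 54, 42, 46, 32]  # made-up data
    for name, value in describe_variability(sample).items():
        print(f"{name}: {value}")
```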
p-val
- lower-tailed: p-val = AUC lower than the test stat
- upper-tailed: p-val = AUC upper than the test stat
- two-tailed: p-val = 2 × the smaller area
- Reject H0 if p-val ≤ α

measure of dist shape
- z-score: $z_i = (x_i - \bar{x})/s$
- outliers: z-score > 3 or z-score < -3
  cases: incorrectly recorded; correctly recorded
  * some cases can't be removed (ex. fraud detection)
- $\mu \pm 1\sigma$: 68.26%; $\mu \pm 2\sigma$: 95.44%; $\mu \pm 3\sigma$: 99.72%

critical value approach, Rejection Rule: reject H0 if
- lower-tailed: $z \le -z_\alpha$
- upper-tailed: $z \ge z_\alpha$
- two-tailed: $z \le -z_{\alpha/2}$ or $z \ge z_{\alpha/2}$
- note: just at least 1 side is enough for rejecting H0 (but both is better)

Boxplot
- Lower limit: Q1 - 1.5(IQR)
- Upper limit: Q3 + 1.5(IQR)
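Both outlier rules in these notes (|z-score| > 3 and the boxplot limits Q1 - 1.5(IQR), Q3 + 1.5(IQR)) can be sketched in a few lines. The function names and the data set are made up; quartiles use `method="exclusive"`, which matches the (p/100)(n + 1) location rule.

```python
import statistics


def iqr_limits(data):
    """Boxplot limits: (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def iqr_outliers(data):
    """Values outside the boxplot limits."""
    lower, upper = iqr_limits(data)
    return [x for x in data if x < lower or x > upper]


def z_outliers(data, threshold=3):
    """Values whose z-score is above +threshold or below -threshold."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return [x for x in data if abs((x - mean) / s) > threshold]


if __name__ == "__main__":
    sample = [10, 12, 13, 14, 15, 16, 18, 19, 21, 22, 40]  # made-up data
    print("limits:", iqr_limits(sample))
    print("boxplot outliers:", iqr_outliers(sample))
    print("z-score outliers:", z_outliers(sample))
```

On this small sample the two rules disagree: 40 is outside the boxplot limits, yet its z-score stays below 3 because the outlier itself inflates s. That is one reason to check both.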
Boxplot: min, Q1, Q2, Q3, max; outliers lie outside the (lower, upper) limits

measure of assoc btw 2 vars

covariance (measure of linear assoc)
- sample: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
- pop: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
- > 0: pos rela; < 0: neg rela

correlation coef (Pearson corr): $r_{xy} = \frac{s_{xy}}{s_x s_y}$
- near -1: strong - linear rela
- near 1: strong + linear rela
- the closer to 0, the weaker rela

hyp test note: swapping upper & lower is possible if it still ans. the question, * but be careful!! *
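The sample covariance and Pearson corr coef can be computed directly from their definitions, as sketched below. The five (x, y) pairs are invented for illustration.

```python
import math


def sample_covariance(x, y):
    """s_xy = sum((x_i - xbar)(y_i - ybar)) / (n - 1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """r_xy = s_xy / (s_x * s_y); always between -1 and 1."""
    s_x = math.sqrt(sample_covariance(x, x))  # cov(x, x) is the variance of x
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]  # made-up data
    y = [2, 4, 5, 4, 5]
    print("s_xy =", sample_covariance(x, y))
    print("r_xy =", sample_correlation(x, y))
```

Covariance depends on the units of x and y, while the corr coef is unit-free, which is what makes "near 1" and "near -1" meaningful.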
Linear Regression
- x = independent var (regressor), y = dependent var
- simple linear reg: 1 x & 1 y only; graph: straight line
  model: $y = \beta_0 + \beta_1 x + \varepsilon$
- multiple linear reg: 2+ ind vars; graph: hyperplane
  model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$
- normally adding more var = better est. (err ↓, SSE ↓), but simple reg is not designed for too many vars

Testing for Significance (uses est of $\sigma^2$ = var(ε))
- Hyp test: $H_0: \beta_1 = 0$, $H_a: \beta_1 \ne 0$
- F-test: for overall significance
- t-test: for individual var
- multiple LR: F-test and t-test may give diff. results
- * For Simple LR: t-test & F-test are the same!
Mean Square: MS = Sum of squares / corresponding d.f.
- SST: d.f. = n - 1
- SSR: d.f. = p
- SSE: d.f. = n - p - 1
- MSE = SSE / (n - p - 1)
- p = # ind var, n = # obs; * for simple: p = 1

reg. equation
- simple: $E(y) = \beta_0 + \beta_1 x$
- multiple: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$

est. reg. equation
- simple: $\hat{y} = b_0 + b_1 x$
- multiple: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$
- $\beta_i$ = pop param; $b_i$ = est of $\beta_i$
- $E(\varepsilon) = 0$, $Var(\varepsilon) = \sigma^2$
Least Square method: $\min S(\beta_0, \beta_1, \dots, \beta_p) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, minimized w/ respect to the coef.
- aims to find the smallest diff btw $y_i$ and $\hat{y}_i$
- least sq est. (simple): $b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$, $b_0 = \bar{y} - b_1 \bar{x}$
- $\bar{x}, \bar{y}$ = mean of x, y; $x_i, y_i$ = val of x, y at the ith obs
- matrix form: $\hat{\beta} = (X'X)^{-1} X'y$, with $x = [1, x_1, x_2, \dots, x_k]$
- reg. model: $\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y = Hy$; hat matrix $H = X(X'X)^{-1}X'$

t-test: $H_0: \beta_i = 0$, $H_a: \beta_i \ne 0$
- $t = \frac{b_i}{s_{b_i}}$; simple: $s_{b_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$, where $s^2 = MSE = \frac{SSE}{n - p - 1}$
- Reject H0 if p-value ≤ α, or $t \le -t_{\alpha/2}$ or $t \ge t_{\alpha/2}$
- based on t-dist w/ d.f. = n - p - 1
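The least sq est. and the t stat for the slope can be worked through by hand for a simple reg. The sketch below uses only the formulas in these notes; the data are invented. Python's standard library has no t-dist, so the critical value in the comment is taken from a t table rather than computed.

```python
import math


def simple_linear_fit(x, y):
    """Least squares fit of y = b0 + b1*x; returns (b0, b1, t, df) for H0: beta1 = 0."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2                      # n - p - 1 with p = 1
    s = math.sqrt(sse / df)         # s = sqrt(MSE)
    s_b1 = s / math.sqrt(sxx)       # standard error of b1
    return b0, b1, b1 / s_b1, df


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]  # made-up data
    y = [2, 4, 5, 4, 5]
    b0, b1, t, df = simple_linear_fit(x, y)
    print(f"y_hat = {b0:.2f} + {b1:.2f}x, t = {t:.4f}, d.f. = {df}")
    # Table value t_{0.025, 3} = 3.182: reject H0 at alpha = 0.05 only if |t| >= 3.182.
    print("reject H0:", abs(t) >= 3.182)
```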
coef. of determination ($r^2$)
- SST = SSR + SSE; SS = sum of squares
- $\sum (y_i - \bar{y})^2$ (total) $= \sum (\hat{y}_i - \bar{y})^2$ (reg) $+ \sum (y_i - \hat{y}_i)^2$ (err); same for both simple & multiple
- $r^2 = \frac{SSR}{SST}$: know how much reg helps defining the data; the higher, the better
- $r^2 \times 100$ = % of var that can be explained by the model
- sample corr coef * simple reg *: $r_{xy} = (\text{sign of } b_1)\sqrt{r^2}$; linear assoc btw x, y

Adjusted MUL coef. of det.
- adding var → err ↓ → SSE ↓ → SSR ↑ (SSE + SSR = SST) → $R^2$ ↑
- adjusted $R^2$ aims to compensate for # added ind var: $R_a^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$

ANOVA result: source, d.f., SS, MS, F

R-Syntax: LM assumption check
- Normality: plot(lm_name, which = 2)
- Heteroscedasticity: plot pred $\hat{y}$ against the standardized residuals: plot(lm_name, which = 3)
  if var of ε depends on i → hetero; any trends (ex. low $\hat{y}$, low var) → violation
- which: 1 = Resid val vs fitted, 2 = QQ, 3 = fitted vs std residuals
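To make the sum-of-squares identities concrete, the sketch below fits a simple reg by least squares and reports SST, SSR, SSE, $r^2$, adjusted $R^2$ and the F stat (MSR / MSE). It is written in Python rather than R, and the data and function name are made up for illustration.

```python
def anova_summary(x, y, p=1):
    """Sums of squares and fit measures for a simple linear reg (p = 1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    y_hat = [b0 + b1 * xi for xi in x]

    sst = sum((yi - ybar) ** 2 for yi in y)                # total
    ssr = sum((yh - ybar) ** 2 for yh in y_hat)            # reg
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # err

    r2 = ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    f_stat = (ssr / p) / (sse / (n - p - 1))               # MSR / MSE
    return {"SST": sst, "SSR": ssr, "SSE": sse, "r2": r2, "adj_r2": adj_r2, "F": f_stat}


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]  # made-up data
    y = [2, 4, 5, 4, 5]
    for name, value in anova_summary(x, y).items():
        print(f"{name}: {value:.4f}")
```

For a simple reg the F stat equals the square of the slope's t stat, which is the sense in which the t-test and F-test are the same there.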