What is Machine Learning? - correct answer-"Field of study that gives
computers the ability to learn without being explicitly programmed"
- Samuel, 1959
"Learning is changing behavior in a way that makes performance better in the
future"
- Witten & Frank 1999
"Improvement with experience at some task" and "A well-defined ML problem:
- improve over task T
- with respect to performance measure P
- based on experience E"
- Mitchell, 1997
Data Mining - correct answer-Extract interesting knowledge from large
unstructured data-sets
* non-obvious, comprehensible, meaningful, useful
3 V's - correct answer-1.) Volume: terabytes and up.
2.) Velocity: from streaming data
3.) Variety: numeric, video, sensor, unstructured text...
Curse of Dimensionality - correct answer-The curse of dimensionality refers to
how certain learning algorithms may perform poorly on high-dimensional data.
First, it is very easy to overfit the training data, because with many
dimensions there are many hypotheses that can describe the target label (in the
case of supervised learning). In other words, the target can easily be
expressed using the dimensions available.
Second, the amount of training data needed may grow exponentially with the
number of dimensions, and collecting that much data may not be feasible.
Third, in learning algorithms that depend on distance, such as k-means for
clustering or k-nearest neighbors, all points can become far from one another,
and the distances between data points become difficult to interpret.
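The third point can be illustrated with a small experiment. This is only a sketch under stated assumptions: points are drawn uniformly from the unit hypercube, and `relative_contrast` is a hypothetical helper name, not a standard function. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from one random query
    point to n_points random points in the unit hypercube [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# Compare a low-dimensional and a high-dimensional space.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions the contrast shrinks as the dimension grows: the nearest and farthest points end up at similar distances, which is why "nearest" carries less meaning in high dimensions.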
Entropy - correct answer-Measure of uncertainty of a random variable
(acquisition of information corresponds to a reduction of entropy)
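As a minimal sketch, entropy can be estimated from a list of observed values using the Shannon formula H = sum over outcomes of p * log2(1/p). The function name `entropy` and the choice of bits (log base 2) are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    # p * log2(1/p), with p = c / total for each distinct outcome
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(entropy(["H", "T"]))            # fair coin: 1.0 bit
print(entropy(["H", "H", "H", "H"]))  # no uncertainty: 0.0 bits
```

A uniform distribution over more outcomes gives higher entropy; a variable that always takes the same value gives zero.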
Information Gain - correct answer-Information gain of an attribute: the
reduction in entropy from partitioning the data according to that attribute
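The definition can be sketched as: entropy of the labels before the split, minus the size-weighted average entropy of the labels within each partition. The function names and the toy data below are hypothetical, chosen only to show the two extreme cases.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def information_gain(attribute_values, labels):
    """Entropy of `labels` minus the weighted entropy after partitioning
    the examples by their attribute value."""
    partitions = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        partitions[value].append(label)
    total = len(labels)
    remainder = sum(
        (len(subset) / total) * entropy(subset)
        for subset in partitions.values()
    )
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
print(information_gain(["a", "a", "b", "b"], labels))  # perfectly informative
print(information_gain(["a", "b", "a", "b"], labels))  # uninformative
```

An attribute that separates the classes perfectly recovers all of the label entropy; one whose partitions have the same class mix as the whole data set gains nothing.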
Noise - correct answer-Imprecise or incorrect attribute values or labels
- Can't always quantify it, but should know from situation if it is present
- E.g. labels may require subjective judgement or values may come from
imprecise measurements
Main symptom of overfitting - correct answer-Much better performance on the
training data than on independent test data
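A small illustration of this symptom, under assumptions that are not from the source: synthetic one-dimensional data whose labels are flipped 30% of the time, and a 1-nearest-neighbor classifier that effectively memorises the training set. All names here are hypothetical.

```python
import random

def make_data(n, rng, noise=0.3):
    """Points x in [0, 1); true label is x > 0.5, flipped with prob. `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label  # label noise
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda item: abs(item[0] - x))[1]

def accuracy(train, data):
    return sum(predict_1nn(train, x) == y for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(200, rng)
test = make_data(200, rng)

train_acc = accuracy(train, train)  # each point is its own nearest neighbor
test_acc = accuracy(train, test)    # independent data from the same source
print(train_acc, test_acc)
```

The classifier reproduces every noisy training label, so training accuracy is perfect, while accuracy on the independent test set is much lower. That gap is the symptom described above.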
Key insights to kNN - correct answer-- Each sample can be considered to be a
point in sample space
- if two samples are close to each other in space, they should be close to each
other in their target values
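These two insights translate fairly directly into code. The sketch below assumes Euclidean distance and majority voting among the k closest training samples; `knn_predict` and the toy data are hypothetical names used only for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs; query: a point (tuple of floats).
    Returns the most common label among the k nearest training points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b"),
]
print(knn_predict(train, (0.2, 0.2)))
print(knn_predict(train, (5.5, 5.5)))
```

Other distance measures or weighted votes could be used in place of these choices; the core idea is only that nearby samples are expected to share target values.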
Eager learning - correct answer-When given training data, construct a model
that summarises the data, for future use in prediction
- Analogy: compilation in programming language
- Slow in model construction, quicker in subsequent use
- Model itself may be useful/informative
Lazy Learning - correct answer-No explicit model constructed
- Calculations deferred until a new case needs to be classified
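The contrast between the two can be sketched with two simple classifiers. The choice of a nearest-centroid model as the eager learner and 1-nearest-neighbor as the lazy learner is an assumption made for illustration, as are all function names.

```python
import math
from collections import defaultdict

def fit_centroids(points, labels):
    """Eager: do the work up front, summarising each class by its centroid."""
    grouped = defaultdict(list)
    for point, label in zip(points, labels):
        grouped[label].append(point)
    return {
        label: tuple(sum(coords) / len(members) for coords in zip(*members))
        for label, members in grouped.items()
    }

def predict_eager(centroids, query):
    """Prediction only consults the small, precomputed model."""
    return min(centroids, key=lambda label: math.dist(centroids[label], query))

def predict_lazy(points, labels, query):
    """Lazy: no model; scan all stored training data at prediction time."""
    nearest = min(range(len(points)), key=lambda i: math.dist(points[i], query))
    return labels[nearest]

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
labels = ["a", "a", "b", "b"]

centroids = fit_centroids(points, labels)   # the model itself is inspectable
print(centroids)
print(predict_eager(centroids, (1.0, 1.0)))
print(predict_lazy(points, labels, (1.0, 1.0)))
```

The eager version pays its cost in `fit_centroids` and produces a model that can be examined on its own; the lazy version stores the data untouched and defers every calculation to the moment a query arrives.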
Training Set Quality - MNAR - correct answer-When the missing values are
neither MCAR nor MAR: whether a value is missing depends on the unobserved
value itself. Example - People with depression not reporting it
Training Set Quality - MAR - correct answer-When missing data is not completely
random but can be fully explained by a variable for which there is complete
information
Example - Men not reporting depression (sex is recorded for everyone)
Training Set Quality - MCAR - correct answer-The presence/absence of data
is completely independent of both the observed variables and the missing
values themselves
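The three mechanisms can be compared with a small simulation. Everything below is a hypothetical setup for illustration: a synthetic "depression score" drawn from a normal distribution, a fully observed sex variable, and arbitrary missingness probabilities.

```python
import random

def simulate(n=10000, seed=0):
    """Rows of (is_male, score); score is a made-up depression score."""
    rng = random.Random(seed)
    return [(rng.random() < 0.5, rng.gauss(10, 3)) for _ in range(n)]

def apply_missingness(rows, mechanism, seed=1):
    """Replace some scores with None according to the chosen mechanism."""
    rng = random.Random(seed)
    result = []
    for is_male, score in rows:
        if mechanism == "MCAR":
            p_missing = 0.3                        # depends on nothing
        elif mechanism == "MAR":
            p_missing = 0.5 if is_male else 0.1    # depends on observed sex
        elif mechanism == "MNAR":
            p_missing = 0.6 if score > 10 else 0.1  # depends on the value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        result.append((is_male, None if rng.random() < p_missing else score))
    return result

def observed_mean(rows):
    scores = [s for _, s in rows if s is not None]
    return sum(scores) / len(scores)

def missing_rate(rows, male):
    group = [s for m, s in rows if m == male]
    return sum(s is None for s in group) / len(group)

rows = simulate()
true_mean = sum(s for _, s in rows) / len(rows)
mcar = apply_missingness(rows, "MCAR")
mar = apply_missingness(rows, "MAR")
mnar = apply_missingness(rows, "MNAR")

print("true mean:", round(true_mean, 2))
print("MCAR observed mean:", round(observed_mean(mcar), 2))
print("MAR missing rate (men, women):",
      round(missing_rate(mar, True), 2), round(missing_rate(mar, False), 2))
print("MNAR observed mean:", round(observed_mean(mnar), 2))
```

In this setup MCAR leaves the observed mean close to the true mean, MAR shows up as different missing rates across a variable that is fully recorded, and MNAR biases the observed mean downward because high scores are the ones most often missing.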