The primary goals of this course are - 1) to help you view business problems from a data
perspective
2) to understand the principles of extracting useful knowledge from data
The difference between data mining and data science is two-fold - 1a) data science is a
set of fundamental principles that guide the extraction of knowledge from data
1b) data mining is the extraction of knowledge from data, via technologies that
incorporate these principles
2) in this course we will use the two terms as synonyms
Major Takeaway - Data science involves principles, processes, and techniques for
understanding phenomena via the analysis of data. The ultimate goal of data science is to
improve decision making. We can tie this thought to what we learned about the
"quantity of info." One more consideration is that data-driven decision making refers to
the practice of basing decisions on the analysis of data, rather than on intuition.
Two types of decisions: Discovery and Repeatable - Technology is an effective way to
automate repeatable decisions. The real innovation happens when we apply the data
science framework in discovery mode.
Vortex Model and the three perspectives - 1) Bird's Eye View: Executives. They have a
better view of what the tree level and ground level are doing. Give them feedback to make
the right decisions
2) Tree Level View: Managers
3) Ground Level View
Exploiting data, both new and existing, can be a competitive advantage.
(slides 44-48) - Companies that are interested in creating a competitive advantage
through data science will need to hire people (data scientists) with varying data science
backgrounds (e.g. Chief Data Officer, data analysts, programmers, etc.), invest in
technology to mine BIG DATA, and create processes focused on delivering data mining
tasks.
A critical skill in data science is the ability to decompose a data-analytics problem into
pieces such that each piece matches a known task for which tools are available. This ties
back to discovery and repeatable decisions.
(slides 34, 51-56) - Type of question will be "Why does a data scientist decompose a
problem into pieces?"
Ans: so that the decomposed pieces can be matched up to repeatable decisions, freeing
up human intellect to focus on discovery tasks.
6 processes (Business Understanding, Data Understanding, Data Preparation,
Modeling, Evaluation, and Deployment) - The Data Mining Process is iterative, not
linear. It is based on the Cross Industry Standard Process for Data Mining (CRISP-DM). 2
questions on this. Watch videos on data mining process.
CRISP-DM: what does it stand for? what is it? - The Cross Industry Standard Process
for Data Mining (CRISP-DM) is based around exploration; it iterates on approaches and
strategy rather than on software designs.
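A rough sketch of what "iterative, not linear" can look like for the six stages. The names Stage, next_stage, and run_cycle are made up for illustration, and the single loop-back rule (a failed evaluation returns to Business Understanding) is a simplifying assumption; the real process allows other back-and-forth steps too.

```python
from enum import Enum


class Stage(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def next_stage(stage, evaluation_passed=True):
    """Return the stage that follows `stage` in this simplified cycle."""
    # Assumption for this sketch: a failed evaluation restarts the cycle.
    if stage is Stage.EVALUATION and not evaluation_passed:
        return Stage.BUSINESS_UNDERSTANDING
    # After deployment, a new cycle can begin with fresh business questions.
    if stage is Stage.DEPLOYMENT:
        return Stage.BUSINESS_UNDERSTANDING
    return Stage(stage.value + 1)


def run_cycle(evaluation_results):
    """Walk the stages until deployment, looping back on each failed evaluation.

    `evaluation_results` is a list of booleans, one per evaluation attempt.
    If the list runs out, later evaluations are treated as passing.
    """
    results = iter(evaluation_results)
    visited = []
    stage = Stage.BUSINESS_UNDERSTANDING
    while True:
        visited.append(stage)
        if stage is Stage.DEPLOYMENT:
            return visited
        passed = next(results, True) if stage is Stage.EVALUATION else True
        stage = next_stage(stage, passed)


# One failed evaluation, then a successful one.
path = run_cycle([False, True])
print([s.name for s in path])
```

The point of the sketch: the path visits the early stages twice before reaching Deployment, which is what separates this from a straight-line software development plan.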
One of the most important fundamental principles of data science - The design team
must think carefully about the problem to be solved. One of the points
stressed in the business world is the importance of correctly identifying and confirming
the problem statement in both individual and team problem-solving models.
Definition of a target attribute in relation to a supervised data mining technique
(slide 88) - Supervised methods (classification, regression, causal modeling): such
techniques have a specific purpose for the grouping, which is predicting the target. An example
target: will a customer leave when her contract expires? Technically, another condition must be
met for supervised data mining: there MUST be data on the target.
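A minimal sketch of the "there MUST be data on the target" condition, using the customer-leaving example. The records, the field names (months_on_contract, support_calls, churned), and both helper functions are hypothetical and only here for illustration.

```python
# Hypothetical customer records; all values are made up.
customers = [
    {"months_on_contract": 24, "support_calls": 5, "churned": True},
    {"months_on_contract": 36, "support_calls": 0, "churned": False},
    # Contract has not expired yet, so the outcome is unknown.
    {"months_on_contract": 12, "support_calls": 3, "churned": None},
]


def split_by_target(records, target):
    """Separate records with a known target value from those without one."""
    labeled = [r for r in records if r.get(target) is not None]
    unlabeled = [r for r in records if r.get(target) is None]
    return labeled, unlabeled


def can_use_supervised(records, target):
    """Supervised mining needs at least some records with a known target."""
    labeled, _ = split_by_target(records, target)
    return len(labeled) > 0


labeled, unlabeled = split_by_target(customers, "churned")
print(len(labeled), "labeled,", len(unlabeled), "unlabeled")
print(can_use_supervised(customers, "churned"))
print(can_use_supervised(customers, "upgraded"))
```

Here "churned" is the target attribute. A target with no recorded values at all (like the made-up "upgraded" field) leaves only unsupervised methods available.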
The modeling stage - the primary place where data mining techniques are applied to the data.
Theme - It's tempting, but a mistake, to view the data mining process as a software
development cycle. The CRISP cycle is based around exploration and iterates on
approaches and strategy rather than on software designs.
Theme: Regression Analysis is a major concept in MODELING - Regression analysis is
useful in data science for testing patterns on new data to evaluate their generality.
Secondly, regression analysis supports techniques for reducing the tendency to find
patterns specific to a particular set of data (overfitting), and it introduces different views
through dimensions, positioning the models to predict data with a focus on reducing
uncertainty.
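A small sketch of "testing patterns on new data": fit a straight line to one set of points, then measure the error on points the fit never saw. All numbers are made up, and ordinary least squares with one input is an assumption chosen for simplicity, not necessarily the technique used in the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the line's predictions and the actual values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: the pattern is learned from the training points only.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.1, 3.9, 6.2, 7.8]

# "New" data that played no part in the fit.
new_x = [5.0, 6.0]
new_y = [10.1, 11.8]

slope, intercept = fit_line(train_x, train_y)
train_error = mean_squared_error(train_x, train_y, slope, intercept)
new_error = mean_squared_error(new_x, new_y, slope, intercept)

print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
print("error on training data:", round(train_error, 4))
print("error on new data:", round(new_error, 4))
```

If the error on new data stays close to the error on the training data, the pattern generalizes. A much larger error on new data suggests the pattern was specific to the original set.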
Definitely a question on info
(slide 77) - Info is a quantity that reduces uncertainty about something.
This is a takeaway from models, induction, and prediction.
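One common way to put a number on "reducing uncertainty" is entropy and information gain. The notes do not say which measure the course uses, so treat this as an assumption; the "leave"/"stay" labels are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure group, 1 for a 50/50 two-class mix."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(parent, children):
    """Drop in entropy after splitting `parent` into the `children` groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Made-up target values: will the customer leave or stay?
parent = ["leave"] * 4 + ["stay"] * 4

# A split that separates the two outcomes perfectly.
perfect_split = [["leave"] * 4, ["stay"] * 4]

# A split that tells us nothing: each group is still half and half.
useless_split = [["leave", "leave", "stay", "stay"], ["leave", "leave", "stay", "stay"]]

print(entropy(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

An attribute that yields the perfect split carries the most info about the target, because knowing it removes all uncertainty. The useless split carries none.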
9 common data mining tasks and their groupings -
-supervised: classification, regression, causal modeling
-unsupervised: clustering, profiling, co-occurrence grouping
-either: link prediction, similarity matching, data reduction
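For review, the same groupings as a small lookup table. TASK_GROUPS and group_of are made-up names for illustration; the contents come straight from the list above.

```python
# The nine tasks, grouped as in the notes.
TASK_GROUPS = {
    "supervised": ["classification", "regression", "causal modeling"],
    "unsupervised": ["clustering", "profiling", "co-occurrence grouping"],
    "either": ["link prediction", "similarity matching", "data reduction"],
}


def group_of(task):
    """Return the grouping for a task name (case-insensitive)."""
    name = task.strip().lower()
    for group, tasks in TASK_GROUPS.items():
        if name in tasks:
            return group
    raise KeyError(f"unknown data mining task: {task!r}")


print(group_of("Clustering"))
print(group_of("regression"))
print(sum(len(tasks) for tasks in TASK_GROUPS.values()))
```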