Discovering Knowledge in Data, chapters 1, 2, 3, 4, 5, 6, 7, 8
UvA, Minor Data Science
Chapter 1, An introduction to Data Mining
1.1 What is data mining?
Data mining is the process of discovering useful patterns and trends in large datasets.
For example:
- Supermarkets: each product scanned at the cash register helps to build a profile of
the shopping habits of your family and of the other families checking out.
- Banks: customer service has access to individual customer profiles, so that the
customer can be informed of new products or services that may be of greatest interest
to him or her. => This may help to identify the right marketing approach for a
particular customer, based on the customer’s individual profile.
1.2/1.3 Wanted Data Miners, The need for human direction of data mining
Some early data mining definitions described the process as ‘automatic’ => this has
misled many people into believing data mining is a product that can be bought rather than
a discipline that must be learned.
The problem today is not a shortage of data and information streaming in, but
a lack of skilled people who can translate these data into knowledge.
Data mining is easy to do badly; an understanding of the mathematical model structures
underlying the software is required.
Humans need to be actively involved in every phase of the data mining process.
The task of data mining should be integrated into the human process of problem solving.
1.4 The cross-industry standard process for data mining (CRISP-DM)
Data mining model: CRISP-DM fits data mining into the general problem-solving strategy
of a business or research unit.
The model consists of 6 phases that are adaptive. For example, suppose we are
in the modeling phase. Depending on the behavior and characteristics of the model, we
may have to return to the data preparation phase for further refinement before moving
forward to the model evaluation phase.
The iterative nature of CRISP-DM is symbolized by the outer circle in the model. Often the
solution to a particular business or research problem leads to further questions of interest,
which may then be attacked using the same general process.
The six phases:
1. Business/Research Understanding Phase:
- Define project requirements and objectives.
- Translate objectives into a data mining problem definition.
- Prepare a preliminary strategy to meet objectives.
2. Data Understanding Phase:
- Collect data.
- Perform exploratory data analysis (EDA).
- Evaluate the quality of the data.
- Optionally, select interesting subsets.
3. Data Preparation Phase (most important phase):
- Prepare the data for modeling in subsequent phases.
- Select cases and variables you want to analyze, and that are appropriate for your
analysis.
- Clean and prepare data so it is ready for modeling tools.
- Perform transformations on certain variables if needed.
4. Modeling Phase:
- Select and apply one or more modeling techniques.
- Calibrate model settings to optimize results.
- If necessary, additional data preparation may be required for supporting a particular
technique.
5. Evaluation Phase:
- Evaluate one or more models for effectiveness.
- Determine whether defined objectives are achieved.
- Establish whether some important facet of the problem has not been sufficiently
accounted for.
- Make a decision regarding the data mining results before deploying to the field.
6. Deployment Phase (putting the models to use):
- Make use of the models created.
- Example of simple deployment: generate a report.
- Example of complex deployment: implement a parallel data mining process in another
department.
- In business settings, the customer often carries out the deployment based on your model.
See the ch. 1 slides (pwp) for some examples involving the phases.
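The adaptive and iterative flow between the six phases can be sketched in a few lines of Python. This is only an illustration of these notes, not an official CRISP-DM artifact: the names `Phase`, `allowed_next` and `is_valid_path` are hypothetical, and the only transitions encoded are the ones described above (move forward one phase, return from modeling to data preparation, and restart after deployment via the outer circle).

```python
from enum import IntEnum


class Phase(IntEnum):
    """The six CRISP-DM phases, numbered as in the list above."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def allowed_next(phase):
    """Return the set of phases that may follow `phase` in this simplified sketch."""
    if phase is Phase.DEPLOYMENT:
        # Outer circle: a finished project often raises new questions,
        # which start the process again.
        return {Phase.BUSINESS_UNDERSTANDING}
    following = {Phase(phase + 1)}  # normal forward step
    if phase is Phase.MODELING:
        # Adaptive step: the model's behavior may send us back
        # to data preparation for further refinement.
        following.add(Phase.DATA_PREPARATION)
    return following


def is_valid_path(path):
    """Check that every consecutive pair of phases is an allowed transition."""
    return all(b in allowed_next(a) for a, b in zip(path, path[1:]))


# Example: modeling sends us back to data preparation once before evaluation.
example_path = [
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.EVALUATION,
]
print(is_valid_path(example_path))
```

A real project may loop back in other places as well; the sketch only captures the transitions mentioned in these notes.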
1.5 Fallacies of Data Mining
Each fallacy is listed with the corresponding reality.

Fallacy: There are data mining tools that we can turn loose on our data repositories, and find answers to our problems.
Reality:
- No automatic data mining tools solve problems.
- Rather, data mining is a process (CRISP-DM).
- Data mining integrates into overall business objectives.

Fallacy: The data mining process is autonomous, requiring little or no human oversight.
Reality:
- Requires significant intervention during every phase.
- After model deployment, new models require updates.
- Continuous evaluative measures are monitored by analysts.

Fallacy: Data mining pays for itself quite quickly.
Reality:
- The return rates vary, depending on the start-up costs, personnel, etc.

Fallacy: Data mining software packages are intuitive and easy to use.
Reality:
- The ease of use varies between projects.
- You can’t just purchase, install and sit back.
- Data analysts must combine subject matter knowledge with an analytical mind and familiarity with the overall business.

Fallacy: Data mining will identify the causes of business problems.
Reality:
- The knowledge discovery process will help uncover patterns of behavior.
- It’s up to humans to identify the causes.

Fallacy: Data mining will automatically clean up our messy database.
Reality:
- Data has possibly not been examined for years.
- Organizations that start a new data mining operation will often be confronted with huge data preprocessing tasks.

Fallacy: Data mining always provides positive results.
Reality:
- There is no guarantee of positive results.
- But when used properly, data mining can provide highly profitable results.
1.6 What tasks can data mining accomplish?
The six common data mining tasks:
- Description
- Estimation
- Prediction
- Classification
- Clustering
- Association
Description
- Describes patterns and trends lying within the data.
Descriptions of patterns and trends often suggest possible explanations for such
patterns and trends.
- Data mining models should be transparent.
The results of the data mining model should describe clear patterns that can be
interpreted by humans.
- High-quality description can be accomplished using Exploratory Data Analysis (EDA),
a graphical method of exploring patterns and trends in data.
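As a rough illustration of the description task, the sketch below summarizes a small dataset and draws a text histogram as a stand-in for a real plot. Everything in it is an assumption made for illustration: the `basket_sizes` values are made up, and the helper names `summarize` and `text_histogram` are hypothetical. Only the Python standard library is used.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical data: number of items per supermarket checkout (made-up values).
basket_sizes = [3, 5, 5, 8, 12, 5, 3, 20, 8, 5]


def summarize(values):
    """Return a few basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }


def text_histogram(values):
    """Return one line per distinct value, with one '#' per occurrence."""
    counts = Counter(values)
    return [f"{value:>3} | {'#' * counts[value]}" for value in sorted(counts)]


summary = summarize(basket_sizes)
print(summary)
for line in text_histogram(basket_sizes):
    print(line)
```

Even this small picture is interpretable by a human: most baskets are small, with one unusually large checkout that may deserve a closer look.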
Estimation
- In estimation, we determine the value of a numeric target variable using a set of numeric and/or
categorical predictor variables.
- Models are built using complete records, which provide the value of the target variable
as well as the predictors.
- Then, for new observations, estimates of the value of the target variable are made,
based on the values of the predictors.
- Estimation is similar to the classification task, except that the target variable is numeric rather than categorical.
For example: Estimating the amount of money a randomly chosen family of four will spend
for back-to-school shopping this fall.
See figure 1.2 on p. 9.
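The two steps of estimation (build a model from complete records, then estimate the target for a new observation) can be sketched as follows. The notes do not name a specific technique, so simple least-squares linear regression with a single numeric predictor is assumed here. The income and spending figures are made up for illustration, and `fit_line` and `estimate` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate(x, slope, intercept):
    """Estimate the numeric target for a new observation with predictor value x."""
    return intercept + slope * x


# Complete records (made-up): household income in thousands (predictor)
# and back-to-school spending (target).
incomes = [30, 40, 50, 60, 70]
spending = [200, 260, 310, 370, 410]

slope, intercept = fit_line(incomes, spending)

# New observation: the target is unknown, so we estimate it from the predictor.
new_income = 55
print(estimate(new_income, slope, intercept))
```

With categorical predictors or several predictors at once, a different model (for example multiple regression) would be needed; the structure of fitting on complete records and then estimating for new observations stays the same.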