Summary Data Mining | Midterm week 1-3

Pages: 30
Uploaded on: 25-02-2020
Written in: 2019/2020

This summary covers all material from weeks 1-3 and serves as preparation for the first midterm of this course. Lecture notes: 1-3


Week 1
Slides
Data Mining for Business & Governance

Lecture 1: What is Data Mining?
Data mining is the computational process of discovering patterns in large data
sets, involving methods at the intersection of artificial intelligence, machine learning,
statistics, and database systems.

It is about extracting novel, interesting and potentially useful knowledge.

Main related fields:
• Knowledge discovery in databases
• Machine learning → branch of computer science that studies learning from data
• Statistics → branch of mathematics focused on data
• Artificial intelligence → interdisciplinary field that aims to develop intelligent machines

Key aspects
• Computation vs. large data sets: there is a trade-off between processing time and memory.
• Computation enables the analysis of large data sets: computers are the tool, and as data keeps growing we need to design
efficient computational methods that extract knowledge from the data and give it meaning.
• Data mining often implies knowledge discovery from databases: going from unstructured data to
structured knowledge.
- Unstructured data: text
- Semi-structured data: an HTML page, because the tags give us some extra information
- Structured data: tables

What counts as a large amount of data, or big data? (the definition keeps changing)
→ Current opinion: we should prefer smaller datasets that we can enrich and bring to a higher quality
Volume
• Too big for manual analysis
• Too big to fit in RAM
• Too big to store on disk
Variety
• Range of values: variance
• Outliers, confounders and noise
• Different data types
Velocity
• Data changes quickly: results are required before the data changes
• Streaming data (no storage)

Application of data mining
Companies: business intelligence → market analysis and management
• Target marketing, CRM
• Risk analysis and management
• Forecasting, customer retention, quality control, competitive analysis
• Fraud detection and management
• AH bonus card, Amazon, Mastercard, Booking.com

Science: knowledge discovery → scientific discovery in large data
• DNA: sequence data
• SETI program, time series
• Electronic Health Records
• Social Network Analysis
• Text Mining (natural language processing): going from unstructured text → structured knowledge


What makes prediction possible?
Prediction requires some structure in the data:
• Associations between the features and the target
• Numerical variables: association is measured with the correlation coefficient
• Categorical variables: association is measured with mutual information (the value of X1 contains information about the value of X2)
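
To make the categorical case concrete, here is a minimal sketch of mutual information computed from raw counts. It is not taken from the lecture: the function name `mutual_information` and the toy variables `smoker`, `cough` and `coin` are assumptions made for illustration, and the result is expressed in bits (log base 2).

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two categorical sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n              # joint probability of the pair (x, y)
        p_x = count_x[x] / n      # marginal probability of x
        p_y = count_y[y] / n      # marginal probability of y
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

# Hypothetical toy data, invented for this example.
smoker = ["yes", "yes", "no", "no"]
cough = ["yes", "yes", "no", "no"]   # fully determined by `smoker`
coin = ["h", "t", "h", "t"]          # unrelated to `smoker`

mi_dependent = mutual_information(smoker, cough)
mi_independent = mutual_information(smoker, coin)
```

In this toy data `mi_dependent` comes out at 1 bit (knowing X1 removes all uncertainty about X2) and `mi_independent` at 0 bits (X1 says nothing about X2).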


Different types of learning
A program is said to learn from experience (E) with respect to a task (T) and a performance measure (P) if its performance
at tasks in T, as measured by P, improves with E.
• Supervised learning – labels
= You train the machine on data that is well ‘labeled’, so it learns a
mapping from the input to the desired output.

- Classification: the label is a class, e.g. a model that
distinguishes different classes of diseases.
- Regression: the label is a numerical value, e.g. predicting the risk
of getting a disease.





• Unsupervised learning – no labels
= The data has no labels, so you are not trying to produce a given output in response to the input.
Instead, you want to discover patterns in the data.

- Dimensionality reduction: with a large number of attributes, we could try to reduce them to the most
relevant/interesting ones.
- Clustering: you look for groups of similar patients.
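
As an illustration of clustering, the sketch below runs a plain k-means loop on one numerical attribute. The notes do not name a specific clustering algorithm, so k-means is an assumption chosen for simplicity, as are the function name `kmeans_1d`, the patient ages and the initial centers.

```python
def kmeans_1d(values, centers, n_iter=10):
    """Plain k-means on 1-D data; `centers` are the initial guesses."""
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

ages = [21, 23, 25, 61, 64, 67]   # hypothetical patient ages, no labels
centers, clusters = kmeans_1d(ages, centers=[20, 30])
```

No labels are used anywhere: the two groups of patients emerge only from how close the values are to each other.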

Inductive learning: the algorithm learns from samples / training data / trial and error

Supervised learning workflow for algorithms




1. Collect data
• How do you select your sample?
• Reliability of measurement
• Privacy and other regulations

2. Label examples
• Annotation guidelines
• Measure inter-annotator agreement
• Crowdsourcing

3. Choose representation
• Features: attributes describing examples
- Numerical or categorical (binary)
• Possibly convert to feature vector
- A vector is a fixed-size list of numbers
- Feature vector: describes the object that you want to use.
- Some learning algorithms require examples represented as vectors → spectra representation

4. Train model(s)
• Keep some examples for final evaluation: test set
• Use the rest for:
- Learning: training set
- Tuning: validation set

Parameter or model tuning
• Learning algorithms typically have settings (aka hyperparameters)
• For each value of hyperparameters:
- Apply algorithm to training set to learn
- Check performance on validation set
- Find/choose best-performing setting

5. Evaluate
• Check performance of tuned model on test set
• Goal: estimate how well your model will do in the real world
• Keep evaluation realistic

• Decision tree models, neural networks etc.
• You want to have your data balanced, it’s bad if one group is overrepresented or
underrepresented → learn to create a representative sample, e.g. downsample data
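
The train / validation / test logic of steps 4 and 5 can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic, the learner is a small k-nearest-neighbours classifier written for this example (the notes do not prescribe one), and the names `knn_predict`, `accuracy` and the candidate values for `k` are hypothetical.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training examples nearest to x (1-D feature)."""
    neighbours = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
    votes = [label for _, label in neighbours]
    return max(set(votes), key=votes.count)

def accuracy(train, examples, k):
    """Fraction of `examples` whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in examples)
    return hits / len(examples)

# Steps 1-3: hypothetical labelled data, one numerical feature per example.
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), "low") for _ in range(100)] + \
       [(rng.gauss(3.0, 1.0), "high") for _ in range(100)]
rng.shuffle(data)

# Step 4: hold out a test set, split the rest into validation and training.
test, val, train = data[:40], data[40:80], data[80:]

# Tuning: try each hyperparameter value (odd k avoids tied votes)
# and keep the one that performs best on the validation set.
best_k = max([1, 3, 5, 7], key=lambda k: accuracy(train, val, k))

# Step 5: one final check on the untouched test set.
test_accuracy = accuracy(train, test, best_k)
```

The test set is only touched once, after `best_k` has been fixed, which is what keeps the final estimate realistic.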





Correlation Coefficient
Pearson’s r measures the strength of a linear relationship (dependency)




Pearson’s correlation coefficient
• Numerator: covariance → to what extent do the features change together?
• Denominator: product of standard deviations → makes correlations independent of unit
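
The formula itself did not survive in this preview. The standard definition that matches the two bullets above (covariance in the numerator, product of standard deviations in the denominator) is given below; the symbols $x_i$, $y_i$, $\bar{x}$, $\bar{y}$, $s_x$ and $s_y$ are notation assumed here, not taken from the slides.

```latex
r_{xy}
  = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The normalising factors ($n$ or $n-1$) in the covariance and in the standard deviations cancel, which is why they do not appear in the right-hand form. The result always lies between $-1$ and $1$.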




Covariance and correlation
Covariance = indicates how two variables change together. If an
increase in one variable goes together with an increase in
the other variable, both variables are said to have
a positive covariance.

→ the sign of the covariance gives the direction of the linear
relationship.



The magnitude of the covariance is not
easy to interpret, because it depends on the units of the variables.


The correlation coefficient is normalized and
corresponds to the strength of the linear relation.

Divide the covariance by the product of the variables’
standard deviations.
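
The difference between the two measures shows up when the same data is expressed in another unit. The sketch below is an illustration only: the height and weight values are invented, and the helper names `covariance` and `pearson_r` are assumptions for this example.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))   # variance is the covariance of x with itself
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

# Hypothetical measurements: height in metres and weight in kilograms.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 68.0, 70.0, 80.0, 88.0]
height_cm = [h * 100 for h in height_m]   # same data, different unit

cov_m, cov_cm = covariance(height_m, weight_kg), covariance(height_cm, weight_kg)
r_m, r_cm = pearson_r(height_m, weight_kg), pearson_r(height_cm, weight_kg)
```

Switching from metres to centimetres multiplies the covariance by 100, while the correlation coefficient stays the same. That is the sense in which normalizing makes the correlation independent of the unit.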




Seller: ioumi, Tilburg University