Summary Data Mining for Business And Governance

Summary of full course

School, study and subject

Institution: Tilburg University (UVT)
Program: Data Science & Society
Course: Data Mining For Business & Governance

Document information

Uploaded on: February 9, 2021
Number of pages: 47
Written in: 2020/2021
Type: Summary

Topics

clustering
data mining
supervised learning
unsupervised learning

Content preview

Data Mining for Business and Governance

Introduction to Data Science (Week 1, Video 1)
What is Data Science?
= Is a “concept to unify statistics, data analysis and their related methods” in order to “understand and
analyze actual phenomena” with data.

What makes a Data Scientist?
= Data scientists use their data and analytical ability to find and interpret rich data sources; manage large
amounts of data (…); create visualizations to aid in understanding data; build mathematical models
using the data; and present and communicate the data insights/findings.

A lot of Related Fields
Artificial Intelligence; focuses on intelligent behaviors (human copy; mimicking). More interested in
doing better than humans
Machine Learning; focuses on certain learning objectives wanting to achieve by programming different
functions and algorithms
Data mining; VR/Sensory, Medical as input
Information Retrieval; what you do when you type something into Google (users are interested in it,
doing a search online, e.g. with Siri)
Natural Language Processing; deals with interpretational language, clever interpretation. Some of the
systems capture better information than we as humans do.
Computer Vision; focuses on the vision system we as humans have, processing images, getting
information from them, object classification within images.
Audio Signal Processing; deals with audio, speech, music.
Cognitive Sciences; deals with the brain specifically, processes of the brain (too broad)
Intelligent Games; where the agents in a game behave intelligently (e.g. with machine learning)
Agents (Biology); simulate certain agents (in games, entities, animals): can we make sense of their
behavior, etc.?
→ Know the minor differences!

ONE COMMONALITY: DATA-DRIVEN SCIENCE (ML/DM)

What is Data?
Example clouds: what does the weather have that is data? (Table with certain attributes)
What you can see is the outlook. What is the temperature, wind, can I play outside? YES/NO
• We want to use this data to predict something, make a classification
• Sunny or not, this kid doesn’t play outside when it’s sunny.
• Windy condition, not really a good point for play.
• There is not enough data to make a good interpretation
Convert the data into features
• Convert outlook into numbers.
o 1 = Sunny
o 0 = Cloudy
o 2 = Rainy
• Do the same for wind and temperature, where there is a mapping for values.
o 1 = yes
o 0 = no
o Binary representation, called attributes
• Features are the same as attributes!
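
The mapping above can be sketched in Python. This is only a minimal illustration: the names OUTLOOK_CODES, YES_NO_CODES and encode are made up for this example, the instance is hypothetical, and only outlook and windy are encoded because the notes do not give a mapping for temperature.

```python
# Mapping from the notes: 1 = Sunny, 0 = Cloudy, 2 = Rainy.
OUTLOOK_CODES = {"sunny": 1, "cloudy": 0, "rainy": 2}
# Binary mapping from the notes: 1 = yes, 0 = no.
YES_NO_CODES = {"yes": 1, "no": 0}


def encode(instance):
    """Turn a raw instance (strings) into numeric features."""
    return {
        "outlook": OUTLOOK_CODES[instance["outlook"]],
        "windy": YES_NO_CODES[instance["windy"]],
    }


# Hypothetical instance, not taken from the lecture table.
raw = {"outlook": "rainy", "windy": "yes"}
encoded = encode(raw)
print(encoded)
```

Note that the outlook codes are just labels: 2 (Rainy) is not "more" than 1 (Sunny), so the numbers should not be read as an ordering.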

Other measurements could be the degrees, the perceived temperature, the amount of rainfall and the
probability of thunder.
• Scale is unknown because it is degrees, percentage, km/h etc. (units are different)
Another measurement could be with image data / combination with other data sources (photo of
clouds over map)
• Might combine it with other information such as ticket sales for a theme park.

This is INTERPRETING DATA!

Back to our data (of playing outside)
Can you come up with some rules for playing outside?
Conditions (Rules for prediction):
• If it’s sunny & hot → kid does not want to play outside.
• If it’s windy → kid does not want to play outside.
• Then one rule is left: if it’s not windy and not hot → kid wants to play.
We want to predict our target PLAY given the features we have available.

FORMALLY!
• We have our data: X (with features: outlook, temp, windy)
o Features can be continuous or discrete
o Continuous features: are real-valued and can be within some range.
o Discrete features: finite values, usually associated with some label or category.
• Our data consists of smaller instances; ‘some instance’ is written as: x.
• If we want to specifically point at a particular instance (say our first row), we write: x1.
• We can see our model as a function f that, when given any instance x, gives us a prediction ŷ.
• The application of the model to some instance in our data can be written as f(x).
• Our hope is that ŷ is the same as our target: y.

Quick Recap of Example
• Features: X (outlook, temp, windy)
• Targets: Y (play)
• Some instance: x
• Some target: y
• First row (instance): x1 (sunny, hot, no)
• First target: y1 (no)
• Model: if it’s not windy and not hot → play (f)
• Predictions by f(x): ŷ
• Prediction for f(x1): ŷ1 (no)

Predictive Model (OR ALGORITHM)
def play_predictor(data):
    # Predict 'play' only when it is not windy and not hot.
    if data["windy"] == "no" and data["temp"] != "hot":
        return "play"
    else:
        return "no play"

It’s sunny, mild, and windy… should I play? Realistic?
It will return ‘no play’: the algorithm only predicts ‘play’ when it is NOT windy and NOT hot, and here it
is windy (even though the temperature is mild).
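
As a small sketch of how the notation maps onto code, the rule can be applied to the first instance from the recap, x1 = (sunny, hot, no), and to the new sunny, mild, windy instance. The dictionary keys and variable names (x1, x_new, y_hat_1, y_hat_new) are assumptions made for this illustration.

```python
def play_predictor(data):
    # Rule from the notes: play only if it is not windy and not hot.
    if data["windy"] == "no" and data["temp"] != "hot":
        return "play"
    return "no play"


# x1 from the recap: (sunny, hot, no), with target y1 = no.
x1 = {"outlook": "sunny", "temp": "hot", "windy": "no"}
y_hat_1 = play_predictor(x1)  # this is f(x1)

# The new instance from the question: sunny, mild and windy.
x_new = {"outlook": "sunny", "temp": "mild", "windy": "yes"}
y_hat_new = play_predictor(x_new)

print(y_hat_1, "|", y_hat_new)
```

Both calls give ‘no play’, but for different reasons: x1 is rejected because it is hot, the new instance because it is windy.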

How do we know if our model performs well?
• Correct evaluation is incredibly important in Data Mining.
• We came up with some rules, but how do we know they generalize, i.e. that the rules we learned apply
with the same success rate to data where we don’t know what the target is?

Results of our model
• 5/6 correct, so our model has 83.3% accuracy.
• Did we cover all conditions?
• What if we are presented with new conditions?
• Rules are probably too strict.
• Other than the training data we determined our rules by, we also need test data; unseen by us, to
evaluate.
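
Accuracy is simply the fraction of predictions that match the targets. The sketch below assumes made-up targets and predictions (the lecture table itself is not reproduced in these notes), chosen so that 5 out of 6 are correct; the function name accuracy is also an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that are equal to the true targets."""
    correct = sum(1 for y, y_hat in zip(y_true, y_pred) if y == y_hat)
    return correct / len(y_true)


# Hypothetical targets (y) and model predictions (y-hat) for 6 instances.
y_true = ["no", "no", "yes", "yes", "yes", "no"]
y_pred = ["no", "no", "yes", "yes", "no", "no"]

acc = accuracy(y_true, y_pred)
print(f"{acc:.1%}")
```

To check generalization, the same function would be applied to test data that was not used to come up with the rules.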

Explanation of unseen data: REALISTIC USE CASE
PREDICTING HOUSING PRICES (great example of data mining)
• Would you be able to determine the price of a house? → You need expert knowledge.
• Many observations are required to gain experience (e.g. a mental representation of what makes a house
price higher)
• Features to predict the price of a house?
o Number of bedrooms
o Big garden
o Good neighborhood
HOW TO EVALUATE?
• Previously we had a clear binary (yes/no) prediction.
• Say we had more classes, we would still be predicting a nominal target, where order does not matter
(different from a numeric target, where you really want to predict a value such as a price).
• For a numeric target we can’t say: we got … out of … correct, and therefore we can’t use accuracy.
• We are more likely interested in how far our prediction was off from the actual value: this is error.

TYPES OF PREDICTION
• Classes → classification (e.g. binary)
• Values → regression

Complex information
• How would location affect price?
• How would pollution affect price?
• How about a good location but high pollution?
• Do you know how much of either would affect the price?
• Would one be able to easily craft a successful ruleset?

LEARNING TO PREDICT (some problems are very hard to solve for humans)
• Hand-made rules are not flexible
• Given more instances/observations, the patterns become more complex, thus requiring better (more
complex) rules.
• Too much data becomes impossible to manually analyze.
• If done automatically, little expert knowledge is required; mostly data.
• Models can give information regarding underlying patterns and feature importances.
o If many rules mention location as a first condition to look at, that must be an important
feature.
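
To give a feel for learning rules automatically, the sketch below scores each feature by how well a single rule on that feature alone (predict the most common target per feature value) fits the data. This is a simple baseline in the spirit of "one rule" learners, not the method taught in the course; the data and the names rows and one_rule_accuracy are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training data (not the lecture table).
rows = [
    {"outlook": "sunny", "temp": "hot", "windy": "no", "play": "no"},
    {"outlook": "cloudy", "temp": "mild", "windy": "yes", "play": "no"},
    {"outlook": "cloudy", "temp": "mild", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "temp": "cool", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "temp": "cool", "windy": "yes", "play": "no"},
    {"outlook": "cloudy", "temp": "mild", "windy": "no", "play": "yes"},
]


def one_rule_accuracy(rows, feature, target="play"):
    """Training accuracy when predicting the majority target per feature value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[target]] += 1
    # For each feature value, the majority class count is what we get right.
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(rows)


scores = {f: one_rule_accuracy(rows, f) for f in ("outlook", "temp", "windy")}
best = max(scores, key=scores.get)
print(best, scores)
```

On this toy data the rule on windy scores highest, which hints that windy is the most informative single feature here. The scores are measured on the training data only, so they say nothing yet about generalization.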

You need good intuitions and domain expertise, and you need to get to know your data well (it is not
just a bunch of algorithms and you are “done”).

EXTRA MATERIAL
Quick discussion of:
• PC hardware and relation to data and algorithms
• Programming languages and their relation to above
This is not computer science, why do I need to know this?
• Algorithm choices often depend on hardware limitations.
• Some model families specifically deal with shortage of computation power.
• Different data types often relate to storage and processing.
• Certain terms are widespread throughout this course.

PC HARDWARE
Power supply (left corner)
CPU (processor)
HDD (storage, disks)
RAM (memory)
Motherboard (connects all the components)

HARD DRIVE (HDD/SSD)
Place where all your stuff is stored.
• Stores your files.
• HDDs are larger (store more data, 1-5 TB) but slower (in reading/writing) and fragile.
• SSDs are smaller (up to 1 TB), faster, more robust, but expensive.
• Most modern laptops come with an SSD.
• For computations, algorithms/models read a particular set of data from your disk into memory.

MEMORY CHIP (RAM)
• Very fast reading/writing, but even more limited in space (8-16 GB up to 256 GB), very expensive.
• Algorithms can quickly access and manipulate data that is in memory.
• If the memory limit is exceeded, computers usually freeze/processes slow down.
• Computations done on data in memory are commonly handled by the CPU.

PROCESSOR (CPU)
• Does the computation part of a computer (calculations).
• Can have multiple computation cores (dual core, quad core) to run operations in parallel (i.e.
simultaneously), which speeds up processes.
• The more expensive the CPU, the faster it does similar computations. The more cores, the faster it
runs parallel computations.

GRAPHICS CARD (GPU, special CPU)
• Some computations can be done on a GPU rather than the CPU.
• Commonly used for processing images or other visual content. Popular for video games.
• For ordinary systems, GPU is usually embedded in the CPU.
• GPUs are very fast at ‘matrix operations’ and have therefore been popularized for Deep Learning
research (explained in future lectures).
• Has its own RAM (and therefore limitations).

Programming Languages
Python (this course): almost reads like English (high-level). C++ is lower-level, and Prolog follows a
very different (logic programming) style; both are harder to read.

Representing Data (Week 1, Video 2)
Practical Lecture

LAST WEEK
• Data
• Features
• Algorithms

THIS WEEK
• Data

How do we get Data?
• Pre-mades: data sources that have already been compiled by people, with a clear prediction task;
ideal to work with as a starting data scientist
o Kaggle, UCI, Snap
• Dumps: big dumps of data
o IMDB, Reddit, MovieLens
• Scientific repositories: are always attached to some paper/research
o Dataverse
• (Web) APIs: common interfaces
o Twitter, Reddit
• Web scraping: where you do it yourself and select the fields you need. Becomes messy if you have to
use different websites, because every website has its own structure.
• At industry-level: databases (their own).

File Formats (3 main formats)
• CSV, comma-separated values → flat data structure (table format), named columns and rows.
Separators and quotes.
• JSON → hierarchical (nested), lists and key/values. Widely used for APIs and document-based
databases (NOT FLAT)
• XML <data>, <movie id> → hierarchical, tags to name items. Very common standard; easy to check
whether a file conforms to some predefined structure.
• See Examples from Lecture Videos
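
As a rough sketch of how the three formats look when read with Python's standard library (csv, json and xml.etree.ElementTree), the snippet below parses tiny made-up movie records from strings. The field names and values are hypothetical and are not the examples from the lecture videos.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# CSV: flat table; quotes protect the comma inside the title.
csv_text = 'id,title,year\n1,"Movie, The",1999\n2,Another Movie,2005\n'
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: nested lists and key/value pairs.
json_text = '{"movies": [{"id": 1, "title": "Movie, The", "genres": ["drama"]}]}'
json_data = json.loads(json_text)

# XML: nested tags, with attributes such as id.
xml_text = '<data><movie id="1"><title>Movie, The</title></movie></data>'
xml_root = ET.fromstring(xml_text)

print(csv_rows[0]["title"])                # value from a named CSV column
print(json_data["movies"][0]["genres"])    # nested list inside JSON
print(xml_root.find("movie").get("id"))    # attribute of an XML tag
```

Note that CSV values are always read as strings (the year comes back as "1999"), while JSON keeps numbers and lists as such.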

What are Databases?
= Collections of hard disks. Internet-connected machines that host a bunch of magnetic disks that store
a lot of information. Because they are internet-connected you can access them remotely. You would
connect to them via an IP address or URL, e.g. https://mydatabase.com, which will ask for a user +
password. You will then be able to run queries to get data from the database. DBMS: a Database
Management System is software that sits between the PC and the database.

Databases are typically split into 2 types, which differ in how they handle and structure these queries:
Relational databases – Structured Query Language (SQL) databases
• Pre-defined, structured tables, relational → not easy to scale horizontally, but robust and
well-supported
• Ex.: MySQL, PostgreSQL, SQLite, MariaDB
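
A minimal sketch of a relational database query, using SQLite through Python's built-in sqlite3 module with an in-memory database. The houses table, its columns and its values are made up for illustration.

```python
import sqlite3

# In-memory SQLite database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Pre-defined, structured table.
conn.execute(
    "CREATE TABLE houses (id INTEGER PRIMARY KEY, bedrooms INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO houses (bedrooms, price) VALUES (?, ?)",
    [(2, 180000.0), (3, 250000.0), (4, 310000.0)],
)

# SQL query: houses with at least 3 bedrooms, cheapest first.
rows = conn.execute(
    "SELECT bedrooms, price FROM houses WHERE bedrooms >= ? ORDER BY price",
    (3,),
).fetchall()
conn.close()

print(rows)
```

A remote database such as MySQL or PostgreSQL is queried with the same kind of SQL, but the connection would go through a host address with a user and password instead of ":memory:".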
Non-relational databases – NoSQL
