Summary Data Mining for Business And Governance

Pages: 47
Uploaded on: 9 February 2021
Written in: 2020/2021

Summary of full course


Data Mining for Business and Governance

Introduction to Data Science (Week 1, Video 1)
What is Data Science?
= Is a “concept to unify statistics, data analysis and their related methods” in order to “understand and
analyze actual phenomena” with data.

What makes a Data Scientist?
= Data scientists use their data and analytical ability to find and interpret rich data sources; manage large
amounts of data (…); create visualizations to aid in understanding data; build mathematical models
using the data; and present and communicate the data insights/findings.

A lot of Related Fields
Artificial Intelligence; focuses on intelligent behaviors (human copy; mimicking). More interested in
doing better than humans
Machine Learning; focuses on achieving certain learning objectives by programming different
functions and algorithms
Data mining; VR/Sensory, Medical as input
Information Retrieval; what you do when you type something into Google (users are interested in
something and search for it online, e.g. with Siri)
Natural Language Processing; deals with interpretational language, clever interpretation. Some of the
systems capture better information than we as humans do.
Computer Vision; focuses on the vision system we as humans have, processing images, getting
information from them, object classification within images.
Audio Signal Processing; deals with audio, speech, music.
Cognitive Sciences; deals with the brain specifically, processes of the brain (too broad)
Intelligent Games; where the agents in a game behave intelligently (e.g. with machine learning)
Agents (Biology); simulate certain agents (in games, entities, animals): can we make sense of their
behavior, etc.?
→ Know the minor differences!

ONE COMMONALITY: DATA-DRIVEN SCIENCE (ML/DM)

What is Data?
Example clouds: what does the weather have that is data? (Table with certain attributes)
What you can see is the outlook. What is the temperature, wind, can I play outside? YES/NO
• We want to use this data to predict something, make a classification
• Sunny or not, this kid doesn’t play outside when it’s sunny.
• Windy condition, not really a good point for play.
• There is not enough data to make a good interpretation
Convert the data into features
• Convert outlook into numbers.
o 1 = Sunny
o 0 = Cloudy
o 2 = Rainy
• Do the same for wind and temperature, where there is a mapping for values.
o 1 = yes
o 0 = no
o Binary representation, called attributes
• Features are the same as attributes!
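
The conversion into features can be sketched in Python. This is a minimal sketch: the outlook and yes/no codes follow the list above, while the temperature codes and the names OUTLOOK_CODES, TEMP_CODES, YES_NO_CODES and encode_instance are assumptions made only for illustration.

```python
# Outlook and yes/no codes follow the notes; the names are hypothetical.
OUTLOOK_CODES = {"sunny": 1, "cloudy": 0, "rainy": 2}
YES_NO_CODES = {"yes": 1, "no": 0}
# Assumed coding for temperature; the notes do not give one.
TEMP_CODES = {"cool": 0, "mild": 1, "hot": 2}


def encode_instance(instance):
    """Turn one weather observation (a dict of strings) into a list of numbers."""
    return [
        OUTLOOK_CODES[instance["outlook"]],
        TEMP_CODES[instance["temp"]],
        YES_NO_CODES[instance["windy"]],
    ]


observation = {"outlook": "sunny", "temp": "hot", "windy": "no"}
print(encode_instance(observation))  # [1, 2, 0]
```

Each position in the resulting list is one feature (attribute) of the instance.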

Other measurements could be the degrees, temperature feelings, amount of rainfall and probability of
thunder.
• The scales differ because the units are different (degrees, percentage, km/h, etc.)
Another measurement could come from image data / a combination with other data sources (photo of
clouds over a map)
• Might combine it with other information such as ticket sales for a theme park.

This is INTERPRETING DATA!

Back to our data (of playing outside)
Can you come up with some rules for playing outside?
Conditions (Rules for prediction):
• If it’s sunny & hot → kid does not want to play outside.
• If it’s windy → kid does not want to play outside.
• Then one rule is left: if it’s not windy and not hot → kid wants to play.
We want to predict our target, play, given the features we have available.

FORMALLY!
• We have our data: X (with features: outlook, temp, windy)
o Features can be continuous or discrete
o Continuous features: are real-valued and can be within some range.
o Discrete features: take a finite number of values and are usually associated with some label or category.
• Our data consists of smaller instances; ‘some instance’ is written as: x.
• If we want to specifically point at a particular instance (say our first row), we write: x1.
• We can see our model as a function f that, when given any instance x, gives us a prediction ŷ.
• The application of the model to some instance in our data can be written as f(x).
• Our hope is that ŷ is the same as our target: y.

Quick Recap of Example
• Features: X (outlook, temp, windy)
• Targets: Y (play)
• Some instance: x
• Some target: y
• First instance (row): x1 (sunny, hot, no)
• First target: y1 (no)
• Model: if it’s not windy and not hot → play (f)
• Predictions by f(x): ŷ
• Prediction for f(x1): ŷ1 (no)

Predictive Model (OR ALGORITHM)
def play_predictor(data):
    # Predict 'play' only when it is not windy and not hot.
    if data["windy"] == "no" and data["temp"] != "hot":
        return "play"
    else:
        return "no play"

It’s sunny, mild, and windy…should I play? Realistic?
It will return ‘no play’: the rule requires that it is NOT windy and NOT hot. Mild is not hot, but it is windy, so the first condition fails.
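
To see which part of the rule decides the outcome, the question can be traced in code. This is a minimal sketch: explain_prediction and the dictionary keys are hypothetical names, and the rule is the one from the notes (play only if it is not windy and not hot).

```python
def explain_prediction(data):
    """Apply the rule from the notes and report which conditions hold."""
    not_windy = data["windy"] == "no"
    not_hot = data["temp"] != "hot"
    prediction = "play" if (not_windy and not_hot) else "no play"
    return prediction, {"not windy": not_windy, "not hot": not_hot}


today = {"outlook": "sunny", "temp": "mild", "windy": "yes"}
prediction, conditions = explain_prediction(today)
print(prediction)   # no play
print(conditions)   # {'not windy': False, 'not hot': True}
```

The outlook is never inspected, so the rule gives the same answer for sunny, cloudy and rainy days. That is one reason to ask whether the prediction is realistic.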

How do we know if our model performs well?
• Correct evaluation is incredibly important in Data Mining.
• We came up with some rules, but how do we know they generalize, i.e. whether the rules we learned apply with
the same success rate to data where we don’t know what the target is?

Results of our model
• 5/6 correct, so our model has 83.3% accuracy.
• Did we cover all conditions?
• What if we are presented with new conditions?
• Rules are probably too strict.
• Other than the training data we determined our rules by, we also need test data; unseen by us, to
evaluate.
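
The idea of evaluating on unseen data can be sketched as follows. The rows below are invented for illustration and are not the lecture's table. They are chosen so that the rule scores 5/6 on the training rows. The names accuracy, train_rows and test_rows are assumptions.

```python
def play_predictor(data):
    """Rule from the notes: play only if it is not windy and not hot."""
    if data["windy"] == "no" and data["temp"] != "hot":
        return "play"
    return "no play"


def accuracy(model, rows):
    """Fraction of rows for which the model's prediction equals the target."""
    correct = sum(1 for features, target in rows if model(features) == target)
    return correct / len(rows)


# Made-up rows used to write the rule (training data).
train_rows = [
    ({"outlook": "sunny", "temp": "hot", "windy": "no"}, "no play"),
    ({"outlook": "cloudy", "temp": "mild", "windy": "no"}, "play"),
    ({"outlook": "rainy", "temp": "cool", "windy": "yes"}, "no play"),
    ({"outlook": "sunny", "temp": "mild", "windy": "no"}, "no play"),
    ({"outlook": "cloudy", "temp": "cool", "windy": "no"}, "play"),
    ({"outlook": "rainy", "temp": "mild", "windy": "yes"}, "no play"),
]
# Made-up rows that were not looked at when writing the rule (test data).
test_rows = [
    ({"outlook": "cloudy", "temp": "hot", "windy": "yes"}, "no play"),
    ({"outlook": "sunny", "temp": "cool", "windy": "no"}, "no play"),
    ({"outlook": "rainy", "temp": "mild", "windy": "no"}, "play"),
    ({"outlook": "cloudy", "temp": "mild", "windy": "yes"}, "play"),
]

train_accuracy = accuracy(play_predictor, train_rows)
test_accuracy = accuracy(play_predictor, test_rows)
print(f"train: {train_accuracy:.1%}")  # train: 83.3%
print(f"test:  {test_accuracy:.1%}")   # test:  50.0%
```

In this toy setup the rule looks good on the rows it was written for and much worse on new rows, which is the reason for keeping separate test data.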

Explanation of unseen data: REALISTIC USE CASE
PREDICTING HOUSING PRICES (great example of data mining)
• Would you be able to determine the price of a house? → You need expert knowledge.
• Many observations are required to gain experience (e.g. a mental representation of what makes a house price
higher)
• Features to predict the price of a house?
o Number of bedrooms
o Big garden
o Good neighborhood
HOW TO EVALUATE?
• Previously we had a clear binary (yes/no) prediction.
• Say we had more classes: we would still be predicting a nominal target, where order does not matter. A numeric
target is different: there you really want to predict a value, such as a price.
• With a numeric target we can’t say ‘we got … out of … correct’, and therefore can’t use accuracy.
• We are more likely interested in how far our prediction was off from the actual value: this is error.

TYPES OF PREDICTION
• Classes → classification (e.g. binary)
• Values → regression
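
The difference between counting correct classes and measuring error on values can be sketched with a few house prices. The prices are made up, and mean absolute error is just one common choice of error measure, used here as an assumption. The notes do not name a specific one.

```python
def mean_absolute_error(actual, predicted):
    """Average distance between each prediction and the actual value."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def exact_match_accuracy(actual, predicted):
    """Fraction of predictions that hit the actual value exactly."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)


# Hypothetical actual and predicted house prices.
actual_prices = [250_000, 310_000, 180_000]
predicted_prices = [240_000, 330_000, 186_000]

print(exact_match_accuracy(actual_prices, predicted_prices))  # 0.0
print(mean_absolute_error(actual_prices, predicted_prices))   # 12000.0
```

Accuracy reports 0 correct even though every prediction is fairly close. The error says the predictions are off by 12,000 on average, which is the more useful statement for regression.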

Complex information
• How would location affect price?
• How would pollution affect price?
• How about the good location but high pollution?
• Do you know how much of either would affect the price?
• Would one be able to easily craft a successful ruleset?

LEARNING TO PREDICT (some problems are very hard to solve for humans)
• Hand-made rules are not flexible
• Given more instances/observations, rules will become more complex, thus requiring better (more
complex) rules.
• Too much data becomes impossible to manually analyze.
• If done automatically, little expert knowledge is required; mostly data.
• Models can give information regarding underlying patterns and feature importances.
o If many rules mention location as a first condition to look at, that must be an important
feature.

You need good intuitions and domain expertise, and you need to get to know your data well (it is not just a bunch of
algorithms and you are “done”).


EXTRA MATERIAL
Quick discussion of:
• PC hardware and relation to data and algorithms
• Programming languages and their relation to above
This is not computer science, why do I need to know this?
• Algorithm choices often depend on hardware limitations.
• Some model families specifically deal with shortage of computation power.
• Different data types often relate to storage and processing.
• Certain terms are widespread throughout this course.

PC HARDWARE
Power supply (left corner)
CPU (processor)
HDD (storage, disks)
RAM (memory)
Motherboard (connects all the components)

HARD DRIVE (HDD/SSD)
Place where all your stuff is stored.
• Stores your files.
• HDDs are larger (store more data, 1-5 TB) but slower (in reading/writing) and are fragile.
• SSDs are smaller (up to 1 TB), faster, more robust, but expensive.
• Most modern laptops come with an SSD
• For computations, algorithms/models read a particular set of data from your disks into memory.

MEMORY CHIP (RAM)
• Very fast reading/writing, but even more limited in space (8-16 GB, up to 256 GB), very expensive.
• Algorithms can quickly access and manipulate data that is in memory.
• If the memory limit is exceeded, computers usually freeze/processes slow down.
• Computations done on data in memory are commonly handled by the CPU.

PROCESSOR (CPU)
• Does the computation part of a computer (calculations).
• Can have multiple computation cores (dual core, quad core) to run operations in parallel (i.e.
simultaneously), which speeds up processes.
• The more expensive the CPU, the faster it does similar computations. The more cores, the faster it
runs parallel computations.

GRAPHICS CARD (GPU, special CPU)
• Some computations can be done on a GPU rather than the CPU.
• Commonly used for processing images or other visual content. Popular for video games.
• For ordinary systems, GPU is usually embedded in the CPU.
• GPUs are very fast at ‘matrix operations’ and have therefore been popularized for Deep Learning
research (explained in future lectures).
• Has its own RAM (and therefore limitations).

Programming Languages
Python (this course): almost reads like English (high-level). C++ is lower-level and harder to read; Prolog is also listed as hard to read, although it is usually classed as a high-level logic language.

Representing Data (Week 1, Video 2)
Practical Lecture

LAST WEEK
• Data
• Features
• Algorithms

THIS WEEK
• Data

How do we get Data?
• Pre-mades: data sources that have already been compiled by people, with a clear prediction task; ideal to
work with as a beginning data scientist
o Kaggle, UCI, Snap
• Dumps: big dumps of data
o IMDB, Reddit, MovieLens
• Scientific repositories: are always attached to some paper/research
o Dataverse
• (Web) APIs: common interfaces
o Twitter, Reddit
• Web scraping: where you do it yourself, select the fields you need. Becomes messy if you have to
use different websites because every website has its own structure.
• At industry-level: databases (their own).

File Formats (3 main formats)
• CSV, comma separated values → flat data structure (table format), named columns and
rows. Separators and quotes.
• JSON → hierarchical (nested), lists and key/values. Widely used for APIs and document-based
databases (NOT FLAT)
• XML <data>, <movie id> → hierarchical, tags to name items. Very common standard; easy to
evaluate if according to some predefined structure.
• See Examples from Lecture Videos
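
As a rough illustration of the three formats, the same made-up movie record can be read with Python's standard library (csv, json and xml.etree.ElementTree). The record, the field names and the variable names are assumptions. The <data> and <movie id> tags follow the XML bullet above.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in each format.
csv_text = 'id,title,year\n1,"Example Movie",2000\n'
json_text = '{"movies": [{"id": 1, "title": "Example Movie", "year": 2000}]}'
xml_text = (
    '<data><movie id="1"><title>Example Movie</title>'
    "<year>2000</year></movie></data>"
)

# CSV: flat table, one dict per row, every value is read as a string.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
# JSON: nested lists and key/value pairs, numbers stay numbers.
json_data = json.loads(json_text)
# XML: nested tags; attributes and text are read as strings.
xml_root = ET.fromstring(xml_text)
movie = xml_root.find("movie")

print(csv_rows[0]["title"], csv_rows[0]["year"])                        # Example Movie 2000
print(json_data["movies"][0]["title"], json_data["movies"][0]["year"])  # Example Movie 2000
print(movie.find("title").text, movie.get("id"))                        # Example Movie 1
```

Note that the CSV year comes back as the string "2000" while the JSON year is the number 2000, so flat files usually need an extra type-conversion step.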

What are Databases?
= Collections of hard disks. Internet connected machines that host a bunch of magnetic disks, that store
a lot of information. Because they are internet connected you can access them remotely. You would
connect to them via an IP address or URL.
For example, https://mydatabase.com will ask for a user name and password. You will then be able to run queries to get
data from the database. DBMS: a Database Management System is the software between the PC and
the Database.

Databases are typically split into 2 types, which differ in how they handle and structure these queries:
Relational databases – Structured Query Language (SQL) databases
• Pre-defined, structured tables, relational → not easy to scale horizontally, but robust and well-supported
• Ex.: MySQL, PostgreSQL, SQLite, MariaDB
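
A pre-defined table and an SQL query can be tried out with SQLite, which ships with Python as the sqlite3 module. This is a minimal sketch using an in-memory database. The houses table, its columns and its rows are made up for illustration.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The table structure is defined up front (pre-defined, structured).
conn.execute(
    "CREATE TABLE houses (id INTEGER PRIMARY KEY, bedrooms INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO houses (bedrooms, price) VALUES (?, ?)",
    [(2, 180000.0), (3, 250000.0), (4, 310000.0)],
)

# Query: houses with at least 3 bedrooms, cheapest first.
rows = conn.execute(
    "SELECT bedrooms, price FROM houses WHERE bedrooms >= ? ORDER BY price",
    (3,),
).fetchall()
print(rows)  # [(3, 250000.0), (4, 310000.0)]

conn.close()
```

With a server database such as MySQL or PostgreSQL, the connection step would instead use a host address and a user name and password. The SQL query itself looks much the same.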
Non-relational databases – NoSQL




Seller: sabrinadegraaf, Tilburg University