Introduction (03/09)
Revolution of Scientific Method
Paradigms of scientific method:
1. Empiricism: knowledge gained through observation and experimentation
2. Mathematical modelling: uses mathematical equations and abstractions to represent real-world systems; analyzing and predicting behavior through theoretical frameworks
3. Simulation: creating computer-based models to imitate real-world processes; running experiments in a virtual environment
A new paradigm: Data-intensive Scientific Discovery
4. Combining and analyzing data in novel ways makes it possible to tackle research questions that could not be answered before
Big Data = large, complex data sets
The 4 V’s of big data (not always all 4 are present; usually a combination of them):
1. Volume: vast amount of data being generated
2. Velocity: the speed at which data is created and processed
3. Variety: different types and formats of data sources
4. Veracity: the quality (reliability and accuracy) of the data
CRISP-DM cycle = a framework for data analysis projects consisting of six phases (business understanding, data understanding, data preparation, modeling, evaluation, deployment). The steps are iterative, allowing revisits to previous stages for model improvement.
Goal: to derive valuable insights that align with business objectives through data analysis
Data Exploration & Preparation
Just by “looking at” data we can’t see anything
Explore: what is there, what does it mean, what is its quality
Transform (in R)
Store in a DBMS
o DBMS = Database Management System, software that manages and organizes data in
databases (e.g., PostgreSQL)
Access
Use: Analytics / Modeling (see the R sketch below)
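A minimal sketch in R of this explore → transform → store flow. The file patients.csv, the age column, and the local PostgreSQL database "clinic" are all invented for illustration; the DBI and RPostgres packages are assumed to be installed:

    # Explore: what is there, what does it mean, what is its quality?
    library(DBI)        # generic database interface
    library(RPostgres)  # PostgreSQL driver

    raw <- read.csv("patients.csv")   # hypothetical source file
    str(raw)                          # what is there: columns and their types
    summary(raw)                      # value ranges and missing values -> quality
    colSums(is.na(raw))               # missing values per column

    # Transform (in R): clean the data and derive new variables
    clean <- raw[!is.na(raw$age), ]   # drop rows with a missing age
    clean$age_group <- cut(clean$age, breaks = c(0, 18, 65, Inf),
                           labels = c("child", "adult", "senior"))

    # Store in a DBMS (here PostgreSQL) so analytics tools can access it
    con <- dbConnect(Postgres(), dbname = "clinic", host = "localhost")
    dbWriteTable(con, "patients_clean", clean, overwrite = TRUE)
    dbDisconnect(con)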
From data to insights:
1. Formulate “Questions to data”
2. Imagine visualizations/reports
3. Design star schema(s)/cube(s) by analyzing (1) and (2) for fact(s) and dimensions
4. Create (empty) database with schema
5. Fill database by transforming sources
6. Use: Analytics (e.g., visualization) or (Predictive) modeling by connecting to the database
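A rough sketch of steps 4–6 in R, using an in-memory SQLite database (via DBI and RSQLite) as a stand-in for PostgreSQL; the toy star schema with one fact table and one dimension table, and all table/column names, are invented:

    library(DBI)
    con <- dbConnect(RSQLite::SQLite(), ":memory:")   # stand-in for PostgreSQL

    # Step 4: create an (empty) database with the star schema
    dbExecute(con, "CREATE TABLE dim_product (
                      product_id INTEGER PRIMARY KEY,
                      category   TEXT)")
    dbExecute(con, "CREATE TABLE fact_sales (
                      product_id INTEGER REFERENCES dim_product(product_id),
                      sale_date  TEXT,
                      amount     REAL)")

    # Step 5: fill the database by transforming a (toy) source
    src <- data.frame(product_id = c(1, 1, 2),
                      category   = c("food", "food", "toys"),
                      sale_date  = c("2024-01-01", "2024-01-02", "2024-01-02"),
                      amount     = c(10.5, 7.0, 25.0))
    dbWriteTable(con, "dim_product",
                 unique(src[, c("product_id", "category")]), append = TRUE)
    dbWriteTable(con, "fact_sales",
                 src[, c("product_id", "sale_date", "amount")], append = TRUE)

    # Step 6: use it for analytics, e.g. total sales per product category
    dbGetQuery(con, "SELECT d.category, SUM(f.amount) AS total
                     FROM fact_sales f JOIN dim_product d USING (product_id)
                     GROUP BY d.category")
    dbDisconnect(con)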
Data Mining
Techniques to automatically extract knowledge from data (by hand is simply not feasible anymore)
Supervised techniques = learn a target function from examples (see the sketch after this list)
o For decision tree mining, model = decision tree
o For deep learning, model = neural network with weights on connections
o For regression, model = (linear) function
Unsupervised techniques = find “obvious” patterns
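A hedged sketch of what "model" means for two of these supervised techniques, on R's built-in iris data set (the rpart package is assumed to be installed; the formulas are just examples):

    library(rpart)   # decision tree mining

    # Decision tree: the learned model is a tree of if/then splits on features
    tree_model <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris)
    print(tree_model)

    # (Linear) regression: the learned model is a linear function of the input
    reg_model <- lm(Petal.Length ~ Sepal.Length, data = iris)
    coef(reg_model)   # intercept and slope of the fitted line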
Topic DM: Data Mining (06/09)
Basics of Data Mining (DM)
What: discovering patterns, correlations, anomalies, insights, trends from (large) datasets
Purpose: to gain insights from the data for decision-making, prediction and knowledge discovery
Related to:
Machine learning: developing algorithms that enable computers to learn from data and make
predictions or decisions
Statistical learning: providing a framework for understanding and analyzing data by modeling
relationships and making predictions based on statistical principles and techniques
Artificial Intelligence: creating intelligent systems that can perform tasks autonomously
Given lots of data, discover patterns and models that are:
Valid: hold on new data with some certainty
Useful: should be possible to act on the item
Unexpected: non-obvious to the system
Understandable: humans should be able to interpret the pattern
Supervised learning = training a model to predict or estimate an output based on one or more inputs
Training data includes desired outputs / labels
Unsupervised learning = learn about relationships and structure of the data
Training data does not include desired outputs / unlabeled
Supervised Learning
Regression problem: output is continuous
Classification problem: output is a binary or categorical value (based on a probability)
Binary classification: two classes
Multi-class classification: more than two classes
Examples supervised:
prediction of credit card fraud (classification)
filtering out spam (classification)
converting handwriting images into text (classification)
predicting house/property prices or stock market prices (regression)
Examples unsupervised:
identify groups of customers with a certain purchasing behavior (clustering)
identify patterns like: if a customer buys X then there is a tendency to buy Y also (association)
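A minimal sketch of the clustering example in R: invented customer data with two spending features, grouped into two segments with k-means (no labels are given to the algorithm). An association-rule example ("if a customer buys X, then also Y") would typically use a package such as arules, not shown here:

    set.seed(1)
    # Invented customer data: yearly spend on groceries and on electronics
    customers <- data.frame(
      groceries   = c(rnorm(20, 200, 20), rnorm(20, 800, 50)),
      electronics = c(rnorm(20, 900, 80), rnorm(20, 100, 30)))

    # Unsupervised: k-means finds the groups by itself
    segments <- kmeans(customers, centers = 2)
    table(segments$cluster)   # number of customers per segment
    segments$centers          # "typical" spending profile per segment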
Applications in the Medical Domain (supervised)
Automatically composed advice for patients based on questionnaires, diagnostic information
Automatic detection of atrial fibrillation
Scheduling of the OR (operating room): prediction of surgery duration
Prediction of the time to fracture after the visit to the osteoporosis outpatient clinic
Prediction of occurrence of a post-operative infectious complication
Prediction of the length of stay after complex surgery
Answers: C - C - R - R - C - R
Classification or Regression problems?
Predicting the gender of a person by his/her handwriting style
Predicting house price based on area
Predicting the nationality of a person
Predicting the number of copies of a music album that will be sold next month
Predicting whether the stock price of a company will increase tomorrow
Predicting the probability of surviving after hip fracture surgery
Answers: C - R - C - R - C - C
Terminology
Input: feature, attribute, variable, covariate
Output: dependent variable, response variable, label
Feature selection: variable selection
Feature engineering: variable transformation, dummy coding (see the sketch after this list)
Method: algorithm, approach or technique used to train a model on data (the estimator)
Model: the trained outcome from applying a method to a dataset (the estimate)
Training: process of teaching a model to make predictions or decisions by feeding it data
Learning: the outcome of the training process
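A small illustration of dummy coding in base R, with a made-up categorical variable (model.matrix drops one level as the reference category):

    d <- data.frame(city = factor(c("Utrecht", "Leiden", "Leiden", "Delft")))
    # Dummy coding: one categorical variable becomes 0/1 indicator columns
    model.matrix(~ city, data = d)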
Training of a Model
Complex models aren’t always better: their effectiveness depends on how well they perform on new data.
Assessing how well a model works → validate your model with unseen test data
Most simple method = Linear Regression
[Figure: fitted models with 2 parameters vs. 3 parameters]
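A brief sketch in R of what those parameter counts mean, using the built-in cars data (the quadratic term is just one way of adding a parameter):

    # 2 parameters: intercept + slope
    m2 <- lm(dist ~ speed, data = cars)
    length(coef(m2))   # 2

    # 3 parameters: intercept + slope + coefficient of the squared term
    m3 <- lm(dist ~ speed + I(speed^2), data = cars)
    length(coef(m3))   # 3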
Complexity
The number of parameters in a model reflects its complexity and flexibility. More parameters allow
the model to capture finer details and nuances in the data.
Non-linear terms (e.g., higher-order polynomials: x³, x⁴, etc.)
More layers in your network
The more features in your model the better?
No, beware of overfitting
Overfitting = a model that is too complex (too many parameters) captures random fluctuations (noise) in the training data → poor performance on unseen data
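A minimal sketch of overfitting in R: on small, noisy, invented data, a very flexible polynomial fits the training points almost perfectly but typically does worse than a simple line on unseen test data:

    set.seed(42)
    x <- runif(40, 0, 10)
    y <- 2 + 0.5 * x + rnorm(40)      # true relation is linear plus noise
    train <- data.frame(x = x[1:20],  y = y[1:20])    # used for training
    test  <- data.frame(x = x[21:40], y = y[21:40])   # unseen test data

    simple  <- lm(y ~ x, data = train)             # 2 parameters
    complex <- lm(y ~ poly(x, 15), data = train)   # 16 parameters

    rmse <- function(model, data) sqrt(mean((data$y - predict(model, data))^2))
    rmse(simple, train);  rmse(simple, test)    # similar train and test error
    rmse(complex, train); rmse(complex, test)   # low on training, typically much higher on test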