Lecture 1 Introduction to data analysis and modelling
Why are biological systems complicated?
Biological processes occur simultaneously at different scales -> a lot of complexity!
- Human -> processes also happen at a bigger level (social networks)
Large number of components -> large number of processes/interactions (even within
each cell you have complex behavior)
Processes/interactions of biological systems are often non-linear!
Linear process = output is proportional to input: the more you invest, the more you
get back (this is rarely the case in biological systems)
Non-linear process = output is not proportional to input: putting more in does not
necessarily get more out
➔ Growth, for example, is a non-linear process!
We cannot apply the principle of superposition to non-linear processes!
- In a linear system, we can analyze the system piece by piece and then add the
pieces together!
Principle of Superposition: For a linear system, the response caused by two or
more independent inputs is the sum of the individual responses caused by each
input acting alone.
- In non-linear systems, we cannot investigate the components in isolation, as this
may destroy the properties that emerge from their interactions (“emergent
behavior”, see later!)
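The principle of superposition can be checked numerically. Below is a minimal sketch: the two functions `linear` and `nonlinear` are hypothetical examples (not from the lecture), chosen only to show that the sum of the individual responses equals the combined response for the linear system but not for the non-linear one.

```python
def linear(x):
    # Linear system: output is proportional to input.
    return 3.0 * x

def nonlinear(x):
    # Non-linear system (hypothetical example): output grows with the square of the input.
    return x ** 2

def obeys_superposition(f, a, b, tol=1e-9):
    # Superposition holds if the response to (a + b) equals
    # the sum of the responses to a and b acting alone.
    return abs(f(a + b) - (f(a) + f(b))) < tol

print(obeys_superposition(linear, 2.0, 5.0))     # True:  21 == 6 + 15
print(obeys_superposition(nonlinear, 2.0, 5.0))  # False: 49 != 4 + 25
```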
How to approach this complexity?
1. Data-driven: no assumptions, purely based on data (try to find the best relation
between input and output) = black box (often better predictive quality, but no
understanding of the underlying process)
2. Mechanistic modelling: describe the underlying biological mechanisms based on
assumptions = white box (we want to understand the underlying process)
Machine learning (Arthur Samuel's definition) = field of study that gives computers the
ability to learn without being explicitly programmed.
➔ Learns by itself from data (no explicit rules of the form: when this ..., do this)
Tom Mitchell's definition (well-posed learning problem): a computer program is said to
learn from experience E with respect to some task T and some performance measure P,
if its performance on T, as measured by P, improves with experience E.
➔ Self-driving car: stopping at a red light (T), whether it stopped (P), driving
around (E)
➔ Netflix algorithm: predicting what to watch (T), how often you click on what it
recommends to you (P), watch list (E)
➔ Stock trading: trading stocks (T), how much profit is made (P), all trades (E)
Difference between machine learning and traditional programming
➔ In traditional programming, a programmer writes explicit rules for how a system
should behave in different scenarios. The system follows these predefined rules
to produce outputs. In contrast, machine learning allows a system to learn
patterns from data without being explicitly programmed for every possible
situation.
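The contrast can be illustrated with a toy example. In this sketch the task (labelling a cell as "small" or "large" by its diameter), the data and the function names are all hypothetical and not from the lecture. The first function hard-codes the rule; the second derives the rule (a threshold) from labelled examples, here simply by placing it midway between the two class means.

```python
def rule_based(diameter_um):
    # Traditional programming: the programmer hard-codes the rule.
    return "large" if diameter_um > 20.0 else "small"

def learn_threshold(samples):
    # Machine learning (minimal version): derive the rule from labelled data.
    # The threshold is placed midway between the two class means.
    small = [d for d, label in samples if label == "small"]
    large = [d for d, label in samples if label == "large"]
    return (sum(small) / len(small) + sum(large) / len(large)) / 2

# Hypothetical training data: (diameter in micrometres, label).
training_data = [(8.0, "small"), (10.0, "small"), (12.0, "small"),
                 (28.0, "large"), (30.0, "large"), (32.0, "large")]
threshold = learn_threshold(training_data)

def learned(diameter_um):
    # Same form of rule as above, but the threshold came from the data.
    return "large" if diameter_um > threshold else "small"

print(threshold)      # 20.0
print(learned(25.0))  # large
```

With different training data the learned threshold changes without anyone rewriting the code, which is the point of the comparison.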
Classes of machine learning problems (see examples mentioned during lecture)
- Classification
- Regression
- Recommender systems
Classification = methods for the categorization of objects or situations into distinct
classes.
Regression = methods for the prediction of continuous variables.
Recommender systems = systems that recommend, out of the set of all available objects,
those the user is most likely to be interested in. (not used in Regenerative
Medicine!)
➔ Item based (which item is similar to this item)
➔ User based (which user is similar to this user)
Categories of Machine learning
- Supervised
- Unsupervised
- Reinforcement learning
Supervised: receives input data and the corresponding output data. Goal is to learn a
mapping from input to output. Output can be class labels (classification) or real
numbers (regression).
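As a small illustration of supervised learning with real-valued output (regression), the sketch below fits a straight line to input/output pairs by ordinary least squares and then predicts the output for a new input. The data and the helper name `fit_line` are made up for the example; it is one possible way to learn such a mapping, not the method used in the lecture.

```python
def fit_line(xs, ys):
    # Ordinary least squares for a straight line y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: inputs with their known outputs.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)

# Use the learned mapping on an input that was not in the training data.
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)  # 2.0 1.0 11.0
```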
Unsupervised: no output data present. Goal of the learning algorithm is to find structure
in the input data. Example: analysis of social media data.
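Finding structure without output data can be sketched with a very small clustering example. The code below is a simplified one-dimensional variant of k-means with two clusters; the algorithm choice, the data and the name `two_means_1d` are assumptions for illustration and are not taken from the lecture.

```python
def two_means_1d(values, iterations=10):
    # Start the two cluster centers at the smallest and the largest value.
    centers = [min(values), max(values)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            # Assign each value to the nearest center.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[nearest].append(v)
        # Move each center to the mean of its assigned values
        # (keep the old center if a cluster happens to be empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabelled measurements: no output data is given.
values = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, clusters = two_means_1d(values)
print(centers)   # two centers, one per group
print(clusters)  # [[1.0, 1.5, 2.0], [9.0, 9.5, 10.0]]
```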
Reinforcement learning: system interacts with a dynamic environment to reach a goal.
System receives feedback (rewards/punishments) for its actions in the environment,
possibly with a delay.
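The reward-feedback loop can be sketched with a heavily simplified case: a two-armed bandit with an epsilon-greedy strategy. This is an assumption made for illustration (the lecture does not name an algorithm), and it leaves out delayed rewards and a changing environment. The reward probabilities and the name `run_bandit` are hypothetical.

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    rng = random.Random(seed)              # fixed seed for reproducibility
    estimates = [0.0] * len(reward_probs)  # estimated value of each action
    counts = [0] * len(reward_probs)       # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action with the best estimate so far.
            action = max(range(len(estimates)), key=lambda a: estimates[a])
        # Feedback from the environment: reward 1 or 0.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Update the running average of rewards for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

# Hypothetical environment: action 1 is rewarded far more often than action 0.
estimates, counts = run_bandit([0.2, 0.8])
print(estimates, counts)
```

The system is never told which action is better; it finds out from the rewards it receives, and over time it chooses the better action more often.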
Tools:
- Statistics
- Linear algebra
- Optimization theory
Data
- Small
- Big data:
• Volume – size of data
• Variety – different file types, sources and formats
• Velocity – speed, with which new data arrives
• Veracity – varying trustworthiness and quality of data/data sources
➔ All 4 are needed to speak of big data
➔ Problems in big data are thus often about data preparation first, followed by
machine learning on small(ish) data.
RMT -> mostly small data: relatively small, one data source, complete data sets,
high-quality data
What’s the data like?
- One data point = sample
- Described by attribute-value pairs, called features (qualitative and
quantitative features)
- All samples in data set should have the same features.
Qualitative features (nominal/categorical)
- Observed value belongs to one of several classes
- No ordering of these classes
- Example: occupation
- Can check for equivalence
- Ordering comparisons (<, >) not possible
- Should not be represented as numerical values!
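A single integer code per class (for example nurse = 0, teacher = 1) would suggest an ordering that does not exist. One common workaround, not mentioned in these notes and shown here only as a sketch, is one-hot encoding: each class gets its own 0/1 column. The category list and the samples below are hypothetical.

```python
def one_hot(value, categories):
    # One 0/1 entry per category; exactly one entry is 1.
    return [1 if value == c else 0 for c in categories]

# Hypothetical nominal feature with three classes (no ordering between them).
occupations = ["nurse", "teacher", "engineer"]

# Hypothetical samples: every sample has the same features.
samples = [
    {"age": 34, "occupation": "teacher"},
    {"age": 51, "occupation": "nurse"},
]

# Quantitative feature kept as a number, qualitative feature one-hot encoded.
encoded = [[s["age"]] + one_hot(s["occupation"], occupations) for s in samples]
print(encoded)  # [[34, 0, 1, 0], [51, 1, 0, 0]]
```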
We can also have qualitative features on an ordinal scale (the classes have an order)
➔ Can be represented with arbitrary numbers, as long as the order is preserved