Introduction
Why care ?
expected from society
o gen z
huge potential risks
o humans: physical (self-driving cars), mental, privacy, discrimination
o businesses: reputational, financial
potential benefits
o improve model & be a marketing instrument (part of brand)
Remove bias in data : improve accuracy & fairness
Explain predictions : improve trust
proper data gathering : improve data quality
Future : increased digitalization, automation and use of AI; sci-fi becomes science
note : Data scientists and business students are not inherently unethical, but at
the same time they are not trained to think this through either.
Data Science (ethics) = about what is right and wrong when conducting data
science.
Responsible AI: the development + application of AI that is aligned with moral
values in society
Theories:
Utilitarianism = consequentialism. An action is moral if its consequence is moral; the
action is just a means to an end, so this can justify immoral means.
Deontology = not doing immoral things, even if doing them would lead to a moral outcome
examples:
Trolley Problem (variants with pregnant women, old people)
MIT Moral Machine: collects such preferences so they can be built into self-driving cars
(regional differences)
Aristotle’s Nicomachean Ethics = Moral behavior can be found at the mean
between 2 extremes:
excess : using all available data w/out concern for ethical aspect
deficiency : not using any data at all
Golden mean of data science ethics
Data Science Equilibrium : A state of data science practices determined by the
ethical concerns and utility of data science.
churn
CV sorting
Data: facts or information (esp. when used to find out things or to make
decisions)
Algorithm: a set of rules that must be followed when solving a particular problem
Prediction or AI Model: the decision-making formula, which has been learnt from
data by a prediction/AI algorithm
Data mining : actual extraction of knowledge from data
AI : methods for improving the knowledge/performance of an intelligent agent
over time
Types of data
Personal data: any information relating to an identifiable natural person (‘data
subject’);
an identifiable natural person is one who can be identified, directly or indirectly,
in particular by reference to an identifier such as a name, an identification
number, location data, an online identifier or to one or more factors specific to
the physical, physiological, genetic, mental, economic, cultural or social identity
of that natural person
Behavioral data: data providing evidence of actions taken by people, such as
location data, Facebook likes, online browsing data, payment data.
Sensitive data: personal data revealing racial or ethnic origin, political opinions,
religious or philosophical beliefs, as well as genetic data, biometric
data for the purpose of uniquely identifying a natural person, data concerning
health or data concerning a natural person's sex life or sexual orientation.
Under GDPR, processing such data is in principle prohibited.
FAT
3 dimensions of FAT framework
1. Stage in the data science process GPMED
a. Data Gathering
o GDPR : hashing & encryption (gov backdoors)
o Privacy : A/B Testing (OK Cupid)
b. Data Preparation
o input selection, defining target variable: k-anonymity, l-diversity,
t-closeness
o proxies, re-identification (Netflix re-identification, redlining)
c. Modeling
o measuring fairness : differential privacy, zero-knowledge proof
o removing bias : explainable ai (self driving cars)
d. Evaluation
o defining KPIs
o p hacking , detect bias (predicting recidivism, apple card)
e. Deployment
o access : overruling, deep fake (target)
o honesty & oversight
2. Evaluation criterion
Fair: Treating people equally without favouritism or discrimination.
1. fair to data subject's privacy rights : a state in which one is not observed
or disturbed by other people (human right)
2. not discriminating against sensitive groups : often race, gender, sexual
preference.
Accountable: Required or expected to justify actions or decisions; responsible.
Obligation to:
1. implement effective measures to ensure that principles are complied with
2. demonstrate compliance upon request
3. recognize potential negative consequences
Transparent: Easy to perceive or detect
1. Process : making the process understandable, but not necessarily exposing
every secret and detail
2. explainable AI : explain why a model made certain predictions/decisions
3. Role of the human
Data subject : whose (personal) data is being used.
how are you using my data in your models?
Data Scientist : who is performing the data science
where is the model making mistakes?
Data Manager : who manages and signs off on a data science project
how does the model generally work?
Model Subject : who the model is being applied to
why am I being denied credit?
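A minimal sketch of the k-anonymity idea listed under Data Preparation, assuming the usual definition: a dataset is k-anonymous if every combination of quasi-identifier values is shared by at least k rows. The function name `k_anonymity`, the column names and the toy rows are hypothetical and only serve as illustration.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    quasi-identifier values (0 for an empty dataset)."""
    if not rows:
        return 0
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Toy, made-up records: zip code and age are already generalised.
rows = [
    {"zip": "20**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "20**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "20**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "20**", "age": "40-49", "diagnosis": "diabetes"},
    {"zip": "90**", "age": "30-39", "diagnosis": "flu"},
]

# The last row is alone in its group, so the dataset is only 1-anonymous.
k_all = k_anonymity(rows, ["zip", "age"])
# Without that row, every group has two members.
k_without_outlier = k_anonymity(rows[:4], ["zip", "age"])
```

A low k means some individuals stand out on their quasi-identifiers and are easier to re-identify.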
Data Gathering
2 extremes : continuum
Privacy is a human right
You have zero privacy anyway get over it
o security cameras, internet as a whole, ..
o who has nothing to hide has nothing to fear
GDPR : general data protection regulation
= privacy and data protection of EU citizens, world's most robust data protection
rules
key concepts:
Personal Data : it protects personal data (IP address, home address, name, number, ...)
Anonymisation : Processing data so it can not be traced back to an
individual (not re-identifiable)
Pseudonymisation : (simply removing name) Processing data in such a
manner that the personal data can no longer be attributed to a specific
data subject without the use of additional information, provided that such
additional information is kept separately
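As a rough illustration of pseudonymisation, the sketch below replaces the name with a keyed hash (HMAC-SHA256). This is one possible technique, not the one prescribed by the notes or by GDPR; the secret key plays the role of the "additional information" that must be kept separately. The function name `pseudonymise`, the field names and the records are hypothetical.

```python
import hashlib
import hmac
import secrets

def pseudonymise(identifier: str, key: bytes) -> str:
    """Map an identifier to a stable pseudonym using a keyed hash."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Assumption: this key is stored apart from the dataset (e.g. by another team).
secret_key = secrets.token_bytes(32)

# Made-up records for illustration.
records = [
    {"name": "Alice Example", "churned": True},
    {"name": "Bob Example", "churned": False},
]

# The name is dropped and replaced by the pseudonym.
pseudonymised = [
    {"subject_id": pseudonymise(r["name"], secret_key), "churned": r["churned"]}
    for r in records
]
```

Whoever holds the key can recompute the pseudonym for a given name, which is why pseudonymised data is weaker than anonymised data.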
! privacy of data subject & model applicant
Cambridge Analytica : gathered Facebook profile data without consent to use for
political targeting
When does GDPR allow processing of personal data? CCLLPI
1. unambiguous consent of data subject
! how deep do you have to go in detail in order to be sure people are informed
about what happens with their data
2. to fulfill a contract (bank)
3. legal obligation (law states to maintain record)
4. legitimate interest = allows businesses to process personal data without
requiring explicit consent, as long as it is necessary for legitimate business
purposes
e.g. coupons ok, targeted ads need consent, selling data is not ok
Balancing between what a reasonable person would find acceptable and what
the potential impact is: continuum! coupon: small impact vs. health premium:
impact can be expensive
5. vital interest of data subject
6. performance of a task carried out in the public interest (vaccination
records)
GDPR art 5: 6 principles relating to processing of personal data
Q exam: what are the principles of GDPR related to processing the data
Q exam: for a given case, how do they relate to the case
1. Data gathering : lawful, transparent, fair (La Liga)
2. Purpose Limitation : only use for intended purpose
3. Storage Limitation : keep no longer than necessary
4. Data Minimisation : only use as much as is needed
5. Accuracy : (Hungarian bank fine)
6. Integrity and Confidentiality : accessible only to authorized people
CIA
Confidentiality: data only available to authorized entities
Integrity: maintaining and assuring the accuracy of the data
Availability: data available when needed
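One common way to check integrity is to store a cryptographic digest of the data and compare it later. The sketch below uses SHA-256 from Python's standard library; the helper names `fingerprint` and `is_intact` and the sample bytes are hypothetical, and a plain hash only detects changes, it does not prevent them.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def is_intact(data: bytes, expected_digest: str) -> bool:
    """True if the data still matches the digest recorded earlier."""
    return fingerprint(data) == expected_digest

# Made-up data for illustration.
original = b"customer_id,balance\n42,1000\n"
stored_digest = fingerprint(original)

# Any modification, however small, changes the digest.
tampered = b"customer_id,balance\n42,9000\n"
original_ok = is_intact(original, stored_digest)
tampered_ok = is_intact(tampered, stored_digest)
```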
Encryption = encode info in such a way that only authorized persons can
access it.
Historically
Caesar shift cipher : 3-right (move 3 letters up)
Scytale : rod of given diameter
Shave head : message tattooed on a messenger's shaved scalp, hidden once the hair grew back
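The Caesar shift from the list above can be sketched in a few lines. This is a toy illustration with a shift of 3 to the right, handling only the 26 ASCII letters; the function name `caesar_shift` is hypothetical, and the cipher offers no real security.

```python
def caesar_shift(text: str, shift: int = 3) -> str:
    """Shift each ASCII letter `shift` places to the right, wrapping at Z.
    Other characters (spaces, digits, punctuation) are left unchanged."""
    result = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)
    return "".join(result)

ciphertext = caesar_shift("ATTACK AT DAWN", 3)   # "DWWDFN DW GDZQ"
plaintext = caesar_shift(ciphertext, -3)         # shifting back decrypts
```

With only 25 useful shifts, anyone can try them all, which is why the key space matters for real encryption.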