Data Science & Society
Notes of the Lectures, Assignments, and Literature
Not representative of all the material
I left out material I considered obvious knowledge
Some parts are copied from other summaries
Skipped some lecture material since it’s already covered in the Literature
Content
Week 1 – Stair Reynolds – IS
Lecture 1 – Catching up with SQL
Assignment 1: Bash Fundamentals
Lecture 2 – Applied Data Science for Student Empowerment
Lecture 3 – The Knowledge Discovery Process for Societal Impact
Week 2 – Chapman – CRISP-DM 1.0
Week 2 – Davenport – Data Scientist: Sexiest Job of 21st Century
Week 2 – Chang, Grady – NIST Big Data Interoperability Framework
Week 2 – Spruit, Lytras – Applied Data Science
Week 2 – Braschler – ADS
Assignment 2 – Methods & Statistics in R
Lecture 4 – Hadoop & MapReduce
Lecture 5 – Methodology, Statistics and Pitfalls
Week 3 – Lazer – Google Flu: Big Data Traps
Week 3 – Broniatowski, Lazer – Twitter: Big Data Opportunities
Week 4 – Dean, Ghemawat – MapReduce
Week 4 – Chambers, Zaharia – Spark Guide [Chapters 1-3]
Week 4 – Ambrose – Big Data in historical perspective
Assignment 3 – MapReduce in Hadoop & Spark
Lecture 6 – NoSQL, Spark & Big Data
Lecture 7 – Statistics 2
Week 1 – Stair Reynolds – IS
Fairly basic concepts of Information Systems. I recommend skimming the pages and reading the
bold definitions and their meanings.
Lecture 1 – Catching up with SQL
• Some definitions
o Create: Creation of database objects
o Alter: Modify the structure and/or the characteristics of database objects
o Drop: Deletion of database objects
o Truncate: Deletion of data in tables without altering the structure
• The core parts of SQL:
o Data definition language (DDL): Used to define database structures
▪ CREATE TABLE, DROP TABLE
o Data manipulation language (DML): Used to insert, update, delete and request data (queries)
▪ INSERT: add a row
▪ UPDATE: modify values in an existing row or collection of rows
▪ DELETE: delete a row or collection of rows
▪ SELECT: select rows
▪ DISTINCT: addition to SELECT that prevents duplicate rows from being shown
▪ WHERE: addition for the criteria that records should meet
▪ AND/OR/NOT: addition for combining multiple matching criteria
▪ BETWEEN: addition for matching values within a range (bounds included)
▪ ORDER BY: sort the results
▪ GROUP BY: group rows, e.g. for subtotals
▪ HAVING: filter the groups produced by GROUP BY, usually on an aggregate value
▪ Built-in SQL functions
• COUNT: the number of rows that match the criteria
• MIN: minimum value in a given column
• MAX: maximum value in a given column
• SUM: sum of the values in a given column
• AVG: average of the values in a given column
▪ A query retrieves data from one or more tables and creates a new
(temporary) table
▪ Subqueries are queries that are used as input for another query
Assignment 1: Bash Fundamentals
Command   Action
cat       Displays the contents of a file
cd        Change directory
cd ..     Go up one directory
chmod     Change the read, write and execute permissions (000 = none, 777 = all
          permissions for everyone)
cp        Copy a file to the given directory
echo      Prints the given value (the behaviour of echo can vary between shells) (‘-e’ is
          used if your string contains escapes, such as ‘\n’)
grep      Filters a given input, keeping only the lines that match a pattern
mkdir     Create a directory
mv        Move a file to another directory
ls        Returns the names of files and directories in your current directory (use ‘-l’
          for additional information)
paste     Merge the lines of two inputs side by side as columns (use ‘>’ to save the
          result in a new file)
pwd       Return the full path of the current directory
rm        Remove a file (use ‘rm -r’ for an entire directory including all content)
sort      Sort an input, ‘-r’ to reverse and ‘-n’ for numeric order
|         Use the output of the command before ‘|’ as input for the command after ‘|’
>         Write the output of a command to a file (overwrites existing content)
>>        Append the output of a command to the end of a file
*         Wildcard that matches any sequence of characters
~         Home directory
#         Indicates a comment (everything on the same line after ‘#’ is ignored)
\n \t     Escape sequences for a new line and a tab
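As a minimal sketch of how several of these commands combine, the example below works in a throwaway directory. The file names (fruit.txt, amounts.txt, combined.txt) and their contents are made up for illustration, and mktemp is used only to get a scratch directory; neither comes from the assignment itself.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# '>' creates or overwrites a file, '>>' appends to it.
echo "banana" > fruit.txt
echo "cherry" >> fruit.txt
echo "apple" >> fruit.txt

echo "10" > amounts.txt
echo "9" >> amounts.txt
echo "100" >> amounts.txt

# '|' feeds the output of one command into the next:
# keep only the lines containing 'e', then sort them in reverse order.
filtered=$(cat fruit.txt | grep "e" | sort -r)
echo "$filtered"        # cherry, then apple

# '-n' compares whole numbers instead of comparing character by character.
numeric=$(sort -n amounts.txt)
echo "$numeric"         # 9, then 10, then 100

# paste joins line 1 with line 1, line 2 with line 2, and so on (tab-separated).
paste fruit.txt amounts.txt > combined.txt
cat combined.txt

# '*' matches any sequence of characters, so this lists all three .txt files.
ls *.txt
```

Each step writes to a file or a variable, so the intermediate results can be inspected with cat or echo before moving on.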
Bash scripts
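You can create a bash script to bundle a number of commands. This is just a text file,
conventionally with a .sh extension. You can execute the script by typing the path to the script
in the command line; the file needs execute permission for this (see chmod), otherwise run it
with ‘bash’ followed by the path.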
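A minimal sketch of that workflow, under some assumptions: the script name steps.sh and its two echo commands are hypothetical, the script is written with a here-document (not covered in the table above) purely so the example is self-contained, and mktemp only provides a scratch directory. It also assumes files in that directory are allowed to be executed.

```shell
# Create the script in a throwaway directory (steps.sh is a made-up name).
workdir=$(mktemp -d)
script="$workdir/steps.sh"

# Write the script file; the quoted 'EOF' keeps its content literal.
cat > "$script" <<'EOF'
#!/usr/bin/env bash
# A script is just the commands you would otherwise type one by one.
echo "step 1"
echo "step 2"
EOF

# Grant read, write and execute to the owner (read and execute to others).
chmod 755 "$script"

# Run the script by typing its path and capture what it prints.
output=$("$script")
echo "$output"
```

Without the chmod step, the same script can still be run as an argument to the interpreter, i.e. by typing bash followed by the path.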
Lecture 2 – Applied Data Science for Student Empowerment
Applied Data Science is where Analytical Applications are combined with Data Science
Data Science:
- Theoretical
- Algorithms
Applied Data Science:
- Solution-oriented
- Meta-Algorithmic Models
Citizen Data Science:
- Applied
- Automated Software Tools
Self-Service Capability: “To empower non-data scientists with automated software tools and
meta-algorithmic models to self-service their own data analyses on their own data sources in a
reliable, usable, and transparent manner.”