Computational Analysis of Digital Communication
Lecture 1: Computational analysis of digital communication
Week 1: introduction to computational methods in communication science
Increasing amount of data available online
Much of what we know about human behavior
...is based on what people tell us:
• in self-report measures in surveys
• in responses in experimental research
• in qualitative interviews
But a lot of (mass) communication looks like this... or is based on user-generated content
How can we analyze large amounts of text?
This is what we will discuss in this course!
Objectives and learning goals
After completion of the course, you will…
1. be able to identify data analytic problems, analyze
them critically, and find appropriate solutions
2. have a good understanding of the general text
classification pipeline
3. have practical knowledge of different approaches to
text classification (incl. dictionary approaches, machine
learning, large language models…)
Skills and methods
With regard to the specific methods being taught in R, you will be able
to…
• gather, scrape, and import data from different file types, APIs, and
websites
• link data from different sources to create new insights
• clean and transform messy data into a tidy data format ready for
text classification and analysis
• use different approaches (e.g., dictionaries, classic machine learning,
transformers, LLMs) to extract information from textual data
• perform statistical analyses on the substantive data
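As a rough illustration of what a "tidy" text format can look like, the sketch below turns two made-up documents into a one-token-per-row table. It is written in Python purely for illustration (the course itself teaches these steps in R). The function name `tidy_tokens`, the example documents, and the simple lowercase/regex tokenizer are assumptions, not part of the course material.

```python
import re

def tidy_tokens(docs):
    """Return one (doc_id, position, token) row per token."""
    rows = []
    for doc_id, text in docs.items():
        # Assumed tokenizer: lowercase, keep runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        for position, token in enumerate(tokens, start=1):
            rows.append((doc_id, position, token))
    return rows

# Hypothetical example documents (not real data).
docs = {
    "d1": "Great news: the talks resumed!",
    "d2": "Talks stalled again.",
}

rows = tidy_tokens(docs)
for row in rows:
    print(row)
```

Each row holds exactly one observation (a token in a document), which is the shape that later counting, dictionary matching, or classification steps typically expect.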
1.1 What is Computational Social Science? …and why should we care?
Example: Surprising Sources of Information
• In 2009, researchers wanted to study wealth and poverty in Rwanda.
• They conducted a survey with a random sample of 1,000 customers of
the largest mobile phone provider.
• They collected demographic, social, and economic characteristics (incl. wealth).
• So far, traditional social science, right?
• The authors also had access to the complete call records of 1.5 million people.
• By combining both data sources, they used the survey data to "train" a
machine learning model that predicts a person's wealth from
their call records.
• They also estimated places of residence based on the geographic information contained in the
call records.
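The logic of this design (fit a model on the small surveyed group, then predict for everyone with call records) can be sketched as follows. This is a minimal Python illustration, not the authors' actual method: it assumes a single hypothetical predictor (average calls per day), made-up toy numbers, and plain one-variable least squares in place of whatever model the study used.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    covariance = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    variance = sum((xi - x_bar) ** 2 for xi in x)
    slope = covariance / variance
    return y_bar - slope * x_bar, slope

# Hypothetical toy values: surveyed customers have BOTH call data and a wealth score.
survey_calls = [1.0, 2.0, 3.0, 4.0]    # assumed feature: average calls per day
survey_wealth = [3.0, 5.0, 7.0, 9.0]   # assumed survey-based wealth index

# Step 1: "train" on the surveyed sample.
intercept, slope = fit_line(survey_calls, survey_wealth)

# Step 2: predict wealth for people who only appear in the call records.
unsurveyed_calls = {"person_a": 5.0, "person_b": 0.0}
predicted = {pid: intercept + slope * calls for pid, calls in unsurveyed_calls.items()}
print(predicted)
```

The point of the sketch is the two-step structure: the expensive survey supplies the labels, and the cheap, large-scale trace data supplies the predictors for everyone else.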
Computational Social Science = field of social science that uses algorithmic tools and
large/unstructured data to understand human and social behavior
Complements traditional methodologies rather than replacing them: methods are not
the goal, but contribute to data generation
Includes methods such as:
- Data mining (e.g., scraping and collecting large datasets)
- Software development for social science experiments
- Automated text analysis (e.g., sentiment analysis, keyword extraction, dictionary
approaches)
- Image classification (e.g., face recognition, visual topic modeling)
- Machine learning approaches (e.g., classification, prediction, topic modeling)
- Actor-based modeling (e.g., simulation of social behavior, spread of information)
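To make one of these methods concrete, here is a minimal sketch of a dictionary approach to sentiment analysis. It is written in Python for illustration only (the course itself works in R). The word lists, the example texts, and the scoring rule (positive matches minus negative matches) are assumptions chosen for brevity, not a validated dictionary.

```python
import re

# Hypothetical mini-dictionaries; real sentiment dictionaries are far larger.
POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "terrible", "sad"}

def sentiment_score(text):
    """Count positive minus negative dictionary words in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(token in POSITIVE for token in tokens)
    negative_hits = sum(token in NEGATIVE for token in tokens)
    return positive_hits - negative_hits

# Made-up example texts.
texts = [
    "What a great and happy day",
    "Terrible service, bad food",
    "The meeting is at noon",
]

scores = [sentiment_score(text) for text in texts]
print(scores)
```

A score above zero suggests a positive text, below zero a negative one, and zero means either no dictionary words were found or they cancelled out, which is one reason dictionary results usually need validation against human coding.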
Why is this important now?
• Vast amounts of digitally available data, ranging from social media messages and
other digital traces to web archives and newly digitized newspaper and other
historical archives
• Large-scale records (big data) of persons or businesses are created constantly
• Powerful and comparatively cheap processing power, and easy-to-use computing
infrastructure for processing these data
• Improved tools to analyze this data, including network analysis methods and
automatic text analysis methods such as supervised text classification, topic
modeling, word embeddings, as well as large language models
10 Characteristics of Big Data
1. Big: The scale or volume of some current data sets is often impressive. However, big
data sets are not an end in themselves, but they can enable certain kinds of
research, including the study of rare events, the estimation of heterogeneity, and
the detection of small differences.
2. Always-on: Many big data systems are constantly collecting data and thus make it possible
to study unexpected events and allow for real-time measurement.
3. Nonreactive: Participants are generally not aware that their data are being captured, or they
have become so accustomed to this data collection that it no longer changes
their behavior.
4. Incomplete: Most big data sources are incomplete, in the sense that they don't have the
information that you will want for your research. This is a common feature of
data that were created for purposes other than research.
5. Inaccessible: Data held by companies and governments are difficult for researchers to access.
6. Nonrepresentative: Despite their size, most big datasets are not representative of certain
populations. Out-of-sample generalizations are hence difficult or impossible.
7. Drifting: Many big data systems are changing constantly, thus making it difficult to study
long-term trends.
8. Algorithmically confounded: Behavior in big data systems is not natural; it is driven by the
engineering goals of the systems.
9. Dirty: Big data often includes a lot of noise (e.g., junk, spam, spurious data points…).
10. Sensitive: Some of the information that companies and governments have is sensitive.
Pros and cons of computational methods
1.2 Computational Communication Science
Why computational methods are important for communication research…
Computational Communication Science (CCS) = the label applied to the emerging subfield
that investigates the use of computational algorithms to gather and analyze big and often
semi- or unstructured data sets to develop and test communication science theories.
Typical research areas
Computational communication science studies thus usually
involve:
1. Large and complex data sets
2. Consisting of digital traces and other “naturally
occurring” data
3. Requiring algorithmic solutions to analyze (e.g.,
machine learning, LLMs)
4. Allowing the study of human communication by
applying and testing communication theory
Example 1: Analyzing News Coverage
Jacobi and colleagues (2016) analyzed the coverage
of nuclear technology from 1945 to 2014 in the New
York Times.
• Analysis of 51,528 news articles (headline and
lead): far too many for manual coding!
• Used LDA topic modeling to extract latent
topics and analyzed their
occurrence over time
Example 2: Facebook Data to Predict Personality
Kosinski and colleagues (2013) used a dataset of more than 58,000 volunteers who provided their
Facebook likes, detailed demographic profiles, and the results of several
psychometric tests.
• They were able to show that it is possible to predict a variety of personal
attributes and personality traits from simple
Facebook likes.