Computational Analysis of Digital Communication
Article
Lecture 1:
- ARTICLE 1: Van Atteveldt & Peng (2018) When Communication Meets Computation:
Opportunities, Challenges, and Pitfalls in Computational Communication Science
The role of computational methods in communication science
The recent acceleration in the promise and use of computational methods for
communication science is driven mainly by the convergence of several
developments:
1. A deluge (overflow) of digitally available data, ranging from social media messages and
other digital traces to web archives and newly digitized newspaper and other historical
archives.
2. Improved tools to analyze this data, including network analysis methods and automatic text
analysis methods such as supervised text classification, topic modelling, and syntactic
methods.
3. The emergence of powerful and cheap processing power, and easy-to-use computing
infrastructure for processing these data, including scientific and commercial cloud
computing, sharing platforms such as GitHub and Dataverse, and crowd coding platforms
such as Amazon MTurk and Crowdflower.
Many of these new data sets contain communication artifacts such as tweets, posts,
emails, and reviews. These new methods are aimed at analyzing the structure and
dynamics of human communication.
These three developments have the potential to give an unprecedented boost to
progress in communication science, provided we can overcome the technical, social,
and ethical challenges presented by these developments.
Big data can be defined by:
- Large and complex data sets
- Consisting of digital traces and other naturally occurring data
- Requiring algorithmic solutions to analyze
- Allowing the study of human communication by applying and testing communication theory
Computational methods do not replace existing methodological approaches, but
rather complement them. Computational methods expand and enhance
the existing methodological toolbox, while traditional methods can also contribute to
the development, calibration, and validation of computational methods.
Opportunities offered by computational methods
Computational methods allow us to analyze social behavior and communication in ways
that were not possible before, and they have the potential to radically change our discipline
in at least four ways:
From self-report to real behavior: Digital traces of online social behavior can function as a
new behavioral lab for communication researchers. These data allow us to measure
actual behavior in an unobtrusive way rather than self-reported attitudes or intentions. This
can help overcome social desirability problems, and it does not rely on people’s imperfect
estimates of their own desires and intentions. It becomes methodologically viable to unravel the
dynamics underlying human communication and disentangle the interdependent
relationships between multiple communication processes. It is now possible to trace news
consumption in real time and combine it with survey data to get a more sophisticated
measurement of news consumption and effects.
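A concrete, minimal sketch of what such a linkage can look like: the snippet below merges hypothetical browsing-trace records with survey responses on a shared respondent identifier so that actual behavior can be compared with self-reports. All file, variable, and column names are invented for illustration; the article does not prescribe a specific tool.
```python
import pandas as pd

# Hypothetical digital-trace data: one row per visited news article
traces = pd.DataFrame({
    "respondent_id": [1, 1, 2, 3],
    "visited_url": ["nu.nl/politics", "nos.nl/economy", "nu.nl/sports", "nos.nl/politics"],
})

# Hypothetical survey data: self-reported news use for the same respondents
survey = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "self_reported_news_days": [7, 2, 5],
})

# Aggregate the traces to one row per respondent (actual behavior) ...
behavior = (traces.groupby("respondent_id").size()
            .rename("n_articles_visited").reset_index())

# ... and link them to the self-reports, so both measures can be analyzed together
combined = survey.merge(behavior, on="respondent_id", how="left")
print(combined)
```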
From lab experiments to studies of the actual social environment: We can observe the
reactions of people to stimuli in their actual environment rather than in an artificial lab
setting. In their daily lives, people are exposed to a multitude of stimuli simultaneously, and
their reactions are also conditioned by how a stimulus fits into their overall perceptions and
daily routines. Researchers are mostly interested in social behavior, and how people
act strongly depends on the actions and attitudes of others in their social network. The emergence of
social media facilitates the design and implementation of experimental research.
Crowdsourcing platforms on social media lower the obstacles to recruiting research
subjects. However, implementing an experimental design on social media is not an
easy task. Social media companies are very selective about their collaborators and
research topics, because they fear losing their reputation, and such collaborations can be extremely time-
consuming.
From small-N to large-N: Increasing the scale of measurement can enable researchers to
study more subtle relations or effects in smaller subpopulations than is possible with the
sample sizes normally available in communication research. In order to leverage the more
complex models afforded by larger data sets we need to change the way we build and test
our models. It is useful to consider techniques developed in machine learning research for
model selection and model shrinkage (penalized regression and cross-validation) which are
aimed at out-of-sample prediction rather than within-sample explanation. These techniques
estimate more parsimonious models and hence alleviate the problems of overfitting that can
occur with large data sets.
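A minimal sketch of this idea, assuming scikit-learn as the library (the text does not prescribe one): an L1-penalized logistic regression whose penalty strength is chosen by cross-validation, so the model is judged on out-of-sample prediction rather than within-sample fit, and irrelevant predictors are shrunk away.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Simulated large-N data set with many (partly irrelevant) predictors
X, y = make_classification(n_samples=5000, n_features=50,
                           n_informative=10, random_state=42)

# The L1 penalty shrinks irrelevant coefficients to zero; its strength is
# selected by 5-fold cross-validation, i.e. on out-of-sample prediction
model = LogisticRegressionCV(Cs=10, cv=5, penalty="l1",
                             solver="liblinear", scoring="accuracy")
model.fit(X, y)

print("non-zero coefficients:", np.sum(model.coef_ != 0))
print("cross-validated accuracy:", model.scores_[1].mean())
```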
From solitary to collaborative research: Digital data and computational tools make it easier
to share and reuse resources. An increased focus on sharing data and tools will also force
us to be more rigorous in defining operationalizations and documenting the data and analysis
process. Fostering the interdisciplinary collaboration needed to deal with larger data sets
and more complex computational techniques can change the way we do research. By
offering a chance to zoom in from the macro level down to individual data points, digital
methods can also bring quantitative and qualitative research closer together, allowing
qualitative research to improve our understanding of the data and build theory, while keeping
the link to large-scale quantitative research to test the resulting hypotheses.
Challenges and pitfalls in computational methods
As said before, computational methods offer a wide range of possibilities for
communication researchers to explore new research questions and re-examine classical
theories from new perspectives. By observing actual behavior in the social environment,
and if possible of a whole network of connected people, we get a better measurement
of how people actually react, rather than of how they react in the artificial isolation of
the lab setting. Large-scale exploratory research can help formulate theories and
identify interesting cases or subsets for further study, while at the same time smaller
and qualitative studies can help make sense of the results of big data research. Big data
research can help test whether causal relations found in experimental studies actually
hold in the wild on large populations and in real social settings.
Using these new methods and data sets also creates a new set of challenges and pitfalls:
How do we keep research datasets accessible?
Although the volume, variety, velocity, and veracity of big data has been repeatedly
bragged in both news reports and scholarly writings, it is a hard truth that many of the
big data sets are proprietary ones which are highly demanding to access for most
communication researchers. Researchers connected to these actors are generally based
,only on a single platform, which makes it challenging to develop a panoramic
understanding of user’s behavior on social media as a holistic ecosystem and increases
generalizability problems.
Such privileged access to big data thwarts the reproducibility of computational
research, which serves as the minimum standard by which scientific claims are judged.
Samples of big data from social media are made accessible to the public either in their
original form or in aggregated form. External parties also create accessible archives of
web data. However, the sampling, aggregation, and other transformations imposed on
the released data are a black box, which makes it difficult for communication
researchers to evaluate the quality and representativeness of the data and to assess
the external validity of findings derived from such data.
It is important to make sure that data are open and transparent and that
research is not reserved to the privileged few who have the network or resources
to acquire data sets. It is vital that we stimulate the sharing and publishing of data sets.
Where possible these should be fully open and published on platforms such as
Dataverse; where needed for privacy or copyright reasons, the data should be securely
stored but accessible under clear conditions. A corpus management tool can help
alleviate copyright restrictions by allowing data to be queried and analyzed even if the
full text of the data set cannot be published. Working with funding agencies and
data providers such as newspaper publishers and social media platforms can make
standardized data sets available to all researchers.
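The idea behind such a corpus management tool can be illustrated with a small hypothetical sketch: instead of distributing full texts, the tool exposes only aggregated query results (here, keyword counts per document), so analyses remain possible while copyrighted text stays behind the interface. The function and variable names are invented for illustration and do not refer to any specific tool.
```python
from collections import Counter

# Copyrighted full texts stay inside the tool and are never returned directly
_protected_corpus = {
    "doc1": "the election dominated the news this week",
    "doc2": "sports news and election coverage competed for attention",
}

def query_term_counts(term):
    """Return only per-document counts for a term, never the full text."""
    counts = {}
    for doc_id, text in _protected_corpus.items():
        counts[doc_id] = Counter(text.split())[term.lower()]
    return counts

# Researchers can analyze term frequencies without ever receiving the articles
print(query_term_counts("election"))  # {'doc1': 1, 'doc2': 1}
```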
Is big data always good data?
Big data is found while survey data is made. Most big data are secondary data,
intended for other primary uses, most of which have little relevance to academic
research. By contrast, most survey data are made by researchers who
design and implement their studies and questionnaires with specific research purposes
in mind. Big data is found and then tailored or curated by researchers to address
their own theoretical or practical concerns. The gap between the primary purpose
intended for big data and the secondary purpose found for it poses a threat to
the validity of design, measurement, and analysis in computational communication
research.
That data is ‘big’ does not mean that it is representative of a certain population. Judging
from representative survey data, people do not randomly select into social media
platforms, and very limited information is available for communication researchers to
assess the representativeness of big data retrieved from social media. Specialized actors
on social media (issue experts, professionals, institutional users) are over-represented
while ordinary publics are under-represented in computational research, which
leads to a sampling bias that must be handled carefully. For very large data sets,
representativeness, selection, and measurement biases are a much greater threat to
validity than small sample sizes, so p-values are not a very meaningful indicator of effect.
The size of the data is a sign of neither the validity nor the invalidity of the conclusions. Big data
studies should therefore focus more on substantive effect size and validity than on mere
statistical significance, by showing confidence intervals and using simulations or
bootstrapping to show the estimated real effects of the relations found.
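A minimal sketch of what showing a bootstrapped confidence interval can look like in practice (the simulated data and the use of numpy are my own assumptions): the correlation between two variables is re-estimated on resamples of the data, and the interval is read from the resulting distribution instead of relying on a p-value alone.
```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 'big data': a weak but real relation between exposure and attitude
n = 10_000
exposure = rng.normal(size=n)
attitude = 0.05 * exposure + rng.normal(size=n)

# Bootstrap the correlation coefficient to obtain a confidence interval
boot_estimates = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)          # resample with replacement
    boot_estimates.append(np.corrcoef(exposure[idx], attitude[idx])[0, 1])

low, high = np.percentile(boot_estimates, [2.5, 97.5])
print(f"correlation: {np.corrcoef(exposure, attitude)[0, 1]:.3f}")
print(f"95% bootstrap CI: [{low:.3f}, {high:.3f}]")
```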
Are computational measurement methods valid and reliable?
The unobtrusiveness of social media data makes them less vulnerable to traditional
measurement biases, such as instrument bias, interviewer bias, and social desirability
bias. However, this does not imply that they are free of measurement error.
Measurement errors can be introduced when text mining techniques are employed to
identify semantic features in user-generated content, whether using dictionaries,
machine learning, or unsupervised techniques, and when social and communication
networks are constructed from user-initiated behavior.
Researchers found that different sentiment dictionaries capture different
underlying phenomena and highlight the importance of tailoring lexicons to
domains to improve construct validity.
Researchers also observe the lack of correlation between sentiment dictionaries, and
similarly argue for the need for domain adaptation of dictionaries. Similar to techniques
like factor analysis, unsupervised methods such as topic modelling require the
researcher to interpret and validate the resulting topics, and although quantitative
measures of topic coherence exist these do not always correlate with human judgments
of topic quality.
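The reported lack of correlation between sentiment dictionaries is easy to check for any pair of lexicons. The sketch below uses two tiny invented word lists purely as stand-ins for real dictionaries; a low correlation between the resulting scores would suggest the dictionaries capture different constructs.
```python
from scipy.stats import pearsonr

# Two invented mini-dictionaries standing in for real sentiment lexicons
dict_a = {"good": 1, "great": 1, "bad": -1, "terrible": -1}
dict_b = {"great": 1, "fine": 1, "awful": -1, "bad": -1}

texts = [
    "the debate was great and the moderator was fine",
    "a terrible and bad performance overall",
    "good arguments but an awful delivery",
]

def score(text, lexicon):
    # Sum the sentiment values of all words found in the lexicon
    return sum(lexicon.get(word, 0) for word in text.split())

scores_a = [score(t, dict_a) for t in texts]
scores_b = [score(t, dict_b) for t in texts]

r, p = pearsonr(scores_a, scores_b)
print(scores_a, scores_b, round(r, 2))
```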
It should be noted that classical methods of manual content analysis are also no
guarantee of valid or reliable data. Researchers show that using trained manual coders
to extract subjective features such as moral claims can lead to overestimation of
reliability and argue that untrained (crowd) coders can actually be better at capturing
intuitive judgements.
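Whether coders are trained or crowd workers, their agreement can be quantified with a chance-corrected reliability measure. The sketch below uses Cohen’s kappa from scikit-learn as one common (here assumed) choice; the codings themselves are hypothetical.
```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical codings of ten messages (1 = contains a moral claim, 0 = not)
trained_coder = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
crowd_coder   = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]

# Cohen's kappa corrects raw agreement for agreement expected by chance
raw = sum(a == b for a, b in zip(trained_coder, crowd_coder)) / len(trained_coder)
print("raw agreement:", raw)
print("Cohen's kappa:", round(cohen_kappa_score(trained_coder, crowd_coder), 2))
```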
These errors can introduce systematic biases in subsequent multivariate analyses and
threaten the validity of statistical inference. This means that we need to emphasize the
validity of measurements based on social media and other digital data.
What is responsible and ethical conduct in computational communication research?
The scientific community and the general public have expressed growing concern about
ethical conduct in computational social science. Such concerns can arise at different
steps of computational communication research. For example, in field experiments on social
media, how can researchers get informed consent from the subjects? When users of a
social media platform accept the platform’s terms of service, can researchers
assume that the users have given explicit or implicit consent to participate in any
type of experiment conducted on the platform? There is no unambiguous answer to
these questions, but it is also not possible to ignore these problems and risk losing the trust
of the general public. This calls for a collective effort from the whole community to
establish responsible conduct of research in computational communication research.
How do we get the needed skills and infrastructure?
Reaping the benefits of computational methods requires that, as a scientific
community, we invest in skills, infrastructure, and institutions. It is important
that as practitioners we are skilled at dealing with data and computational tools. Many
digital traces and other big data are textual rather than the numerical data most
scholars are trained for and used to, which will require us to hone our skills in natural
language processing.
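As a small illustration of this textual-to-numerical step, the sketch below turns a handful of messages into a document-term matrix that standard statistical methods can work with; scikit-learn’s CountVectorizer is just one possible tool, chosen here as an assumption, and the example messages are invented.
```python
from sklearn.feature_extraction.text import CountVectorizer

messages = [
    "the minister announced new climate policy",
    "users criticized the new policy on social media",
    "climate protest dominates social media today",
]

# Turn raw text into a document-term matrix (rows = messages, columns = words)
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(messages)

print(vectorizer.get_feature_names_out())
print(dtm.toarray())
```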
Collaboration with other researchers is important, but collaboration requires research
that is innovative and challenging to both sides, and in many cases what we need is a
good programmer to help us gather, clean, analyze, and visualize data rather than a
scientist to invent a new algorithm. Not all researchers can afford to hire such
programmers. Thus, doing research in communication science
will increasingly demand at least some level of computational literacy. It is vital that we
make methods more prominent in our teaching to make sure the new generation of
communication scholars is equipped with these skills.