Introduction
Research design
- = the overall strategy or plan used to answer research questions or test hypotheses
- encompasses various elements such as the selection of participants, data
collection methods, data analysis techniques, and the overall framework within
which the study will be conducted
- A well-designed research study is essential for producing credible and meaningful
findings that contribute to the advancement of knowledge in a particular field
- Links questions and answers
Close reading → qualitative
- a deep and attentive approach to analyze textual sources
- goes beyond simply understanding the main points and involves actively engaging with
the text to uncover its nuances, underlying meanings, and potential biases.
- applied to sources like historical documents, speeches, or surveys to understand the
perspectives and motivations of individuals or groups
- e.g., deconstructing media portrayals of social issues to understand the underlying
messages and potential ideological influences
- e.g., social media certainly make it much easier to peek into people’s lives, but it is also quite easy to misinterpret
online traces. Example: gang insignia on a MySpace profile → an indication of gang involvement?
- => Focus on Details:
o Close reading emphasizes meticulous attention to details like word choice, sentence structure,
punctuation, and even the physical presentation of the text. This analysis helps uncover deeper meanings
and potential authorial intentions that might be missed in a cursory reading
- => Multiple Readings:
o Unlike skimming, close reading involves repeated engagement with the text. This allows for a gradual
understanding of the complex layers of meaning present within the source
- => Active Analysis:
o Close reading is not a passive activity. It requires active questioning, critical thinking, and making
connections between different parts of the text, as well as with other relevant information and historical
context.
Distant reading → quantitative
- analyzes large amounts of text data with the help of computers
- Large scale analysis:
o It deals with massive datasets, often from digital libraries, that can include hundreds or even thousands
of works.
- Computational methods:
o It relies on computer programs to analyze the data. This could involve things like tracking word usage,
sentence structure, or thematic patterns (a minimal sketch follows at the end of this section).
- Focus on trends:
o The goal is to identify broader trends and patterns across a body of literature. This can help us understand
how literature reflects or shapes culture, history, or social movements.
- can also be applied to individual texts
- example
o Investigating health and wealth in Rwanda
o Traditional social science survey
o Call records (of approx. 1.5 million people)
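A minimal sketch of the "tracking word usage" idea mentioned above, in Python. The corpus folder, the file names, and the tracked words are all hypothetical; the point is only to show how relative word frequencies across many texts can be computed automatically.

# Minimal distant-reading sketch: track word usage across a small corpus.
# Assumes a folder of plain-text files ("corpus/") -- purely illustrative.
import glob
import os
import re
from collections import Counter

TRACKED_WORDS = {"railway", "telegraph", "letter"}  # hypothetical terms of interest

def word_counts(path):
    """Return a Counter of lower-cased word tokens in one text file."""
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"[a-z']+", f.read().lower())
    return Counter(tokens)

for path in sorted(glob.glob("corpus/*.txt")):
    counts = word_counts(path)
    total = sum(counts.values()) or 1
    # Relative frequency per 1,000 words, so texts of different lengths are comparable
    freqs = {w: 1000 * counts[w] / total for w in TRACKED_WORDS}
    print(os.path.basename(path), freqs)

The same loop could be scaled from a handful of files to thousands, which is where distant reading's focus on trends becomes useful.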
Readymade vs. custom-made data
- Readymade
o Pre-collected
▪ This data already exists and is readily available for
purchase from a variety of sources like data providers,
research institutions, or government agencies.
o Generic
▪ It's designed for a broad audience and caters to general needs. Think of it as a pre-built dataset
that covers a specific topic or industry.
o Faster access
▪ Since it's already collected and organized, you can access it quickly, saving time and resources.
o Lower cost
▪ Generally less expensive than custom data as the collection and organization costs are spread
across multiple buyers.
o Limited customization
▪ You can't modify the data itself, and it might not perfectly match your specific needs
- Custom-made
o Tailor-made:
▪ This data is collected specifically for your project or research question. It caters to your unique
requirements and targets a specific audience.
o Highly relevant:
▪ Since it's designed for your needs, it's more likely to be directly applicable to your analysis.
o More control:
▪ full control over the collection process, ensuring the data quality and relevance to your needs.
o Time-consuming:
▪ Collecting and organizing custom data takes time and resources, which can be a drawback.
o Higher cost:
▪ cost reflects the time & effort involved in collecting and organizing the data specifically for you
10 characteristics of big data sources
Helpful: big, always-on, nonreactive
Problematic: incomplete, inaccessible, nonrepresentative, drifting, algorithmically
confounded, dirty, sensitive
1/ big
- Repurposing: ‘found’ versus ‘designed’ data
o The terms "found" and "designed" data refer to the origin and
purpose behind the data collection.
o found data allows for exploration and discovery
o designed data is better suited for testing specific hypotheses.
o The choice between the two depends on your research goals and the type of information you need
o Found data
▪ Unstructured: This data exists naturally and wasn't collected with a specific research question in
mind. It can come from a variety of sources like social media posts, web browsing history, or
sensor readings.
▪ Focus on discovery: The goal is to discover patterns or insights within the existing data. It's like
finding hidden gems in a vast digital landscape.
▪ Requires analysis: Found data is often messy and needs cleaning and processing before it can be
used for analysis.
▪ Examples: Clickstream data, social media feeds, customer reviews
o Designed data
▪ Structured: This data is collected with a specific purpose in mind. Researchers design surveys,
experiments, or questionnaires to gather data that directly addresses their research questions
▪ Focus on testing: The goal is to test a hypothesis or answer a specific question. It's like
conducting a controlled experiment to gather precise information
▪ Ready to analyze: usually well-organized and requires less processing before analysis
▪ Examples: Survey responses, clinical trial data, A/B test results
- What should the ideal data set look like?
- ‘Twitter’ versus ‘social survey’ data
- Big datasets are never an end in themselves, but they do allow for the study of rare cases, the detection of small
differences, and the estimation of heterogeneity
- Example
o Analysis of the 2016 US Presidential Campaign on Twitter (Kollanyi, Howard, Woolley, 2016)
o A total of 18,910,250 tweets was analyzed
▪ 39.1% of debate tweets used a pro-Trump hashtag (e.g., #MAGA)
▪ 13.6% of debate tweets used a pro-Clinton hashtag (e.g., #ImWithHer)
o However…
▪ 32.7% of pro-Trump tweets originated from bots
▪ 22.3% of pro-Clinton tweets originated from bots
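As a rough illustration of how such shares could be computed from labeled tweet data. This is not the pipeline used by Kollanyi, Howard & Woolley; the CSV file and its columns ("text", "is_bot") are assumptions for illustration only.

# Hypothetical sketch: hashtag and bot shares in a labeled tweet dataset
import pandas as pd

tweets = pd.read_csv("debate_tweets.csv")  # hypothetical file with columns "text" and "is_bot"

pro_trump = tweets["text"].str.contains("#MAGA", case=False, na=False)
pro_clinton = tweets["text"].str.contains("#ImWithHer", case=False, na=False)

print("pro-Trump hashtag share:  ", round(100 * pro_trump.mean(), 1), "%")
print("pro-Clinton hashtag share:", round(100 * pro_clinton.mean(), 1), "%")

# Within each camp, what fraction of tweets came from accounts flagged as bots?
for label, mask in [("Trump", pro_trump), ("Clinton", pro_clinton)]:
    bot_share = tweets.loc[mask, "is_bot"].mean()
    print(f"bot share among pro-{label} tweets: {100 * bot_share:.1f}%")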
2/ always-on
- Unexpected events
- Real-time measurements
- Example: Hurricane Sandy-related Twitter and Foursquare data
- Continuous Data Collection
o Big data systems are designed to collect information constantly. This means data is being generated and
streamed in from various sources 24/7, without any breaks or pre-determined schedules. Sensors in
machines, social media feeds, and online transactions are all examples of sources that continuously
generate data
- Real-Time Processing
o Unlike traditional data analysis where information is collected and then analyzed later, big data systems
often process information as it arrives. This allows for near real-time insights and quicker decision
making. For instance, fraud detection systems in banks or traffic monitoring systems rely on continuous
data analysis to identify issues and take immediate action
o Real-time processing allows for analysis of up-to-date information, enabling businesses to react to
trends and changes much faster.
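A minimal sketch of the "always-on" idea: a simulated event stream is processed as it arrives, keeping a running count of events in the last 60 seconds. The event source is made up; in a real system it could be a platform API or a message queue.

# Always-on sketch: process events as they arrive, keep a sliding-window count
import time
import random
from collections import deque

WINDOW_SECONDS = 60
recent_events = deque()  # timestamps of events currently inside the window

def simulated_event_stream(n_events=20):
    """Yield fake event timestamps; a real system would read from a live feed."""
    for _ in range(n_events):
        time.sleep(random.uniform(0.05, 0.2))
        yield time.time()

for ts in simulated_event_stream():
    recent_events.append(ts)
    # Drop events that have fallen out of the 60-second window
    while recent_events and recent_events[0] < ts - WINDOW_SECONDS:
        recent_events.popleft()
    print(f"events in the last {WINDOW_SECONDS}s: {len(recent_events)}")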
3/ nonreactive
- "nonreactive" refers to the idea that the data is collected without influencing the behavior it's trying to measure.
This is a key advantage of big data for certain types of research.
- Unlike traditional research methods like surveys or interviews, big data doesn't directly interact with the subjects.
This avoids the possibility of the research itself influencing the results.
- Passive collection:
o Big data is often gathered from existing sources like social media posts, purchase records, or sensor
readings. People aren't aware their data is being collected, so they don't alter their behavior because of it
- example: Imagine you want to study consumer preferences for a new product.
o Traditional method: You might conduct a survey, where people report their preferences. However, some
people might be hesitant to admit they wouldn't buy the product, leading to biased data.
o Big data approach: You could analyze purchase history data. This wouldn't directly ask people about the
new product, but by observing their actual buying habits, you could gain valuable insights into their
preferences (a rough sketch follows at the end of this section)
- It's important to note that "nonreactive" doesn't necessarily mean "perfect." Big data can still have limitations:
o Social desirability bias: Even though people aren't directly questioned, they might still leave a curated
online presence that doesn't reflect their true behavior
o Algorithmic bias: The way data is collected and analyzed by algorithms can introduce bias, impacting the
results.
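A hedged sketch of the purchase-history example above: instead of asking people, existing transactions are used to estimate interest in a product category. The file, column names, and target category are purely illustrative.

# Nonreactive sketch: infer interest from existing purchase records
import pandas as pd

purchases = pd.read_csv("purchases.csv")  # hypothetical columns: customer_id, category, amount

target_category = "fitness"  # hypothetical category of the new product

# Share of customers who already buy in the target category
buys_target = purchases.groupby("customer_id")["category"].apply(
    lambda cats: target_category in set(cats)
)
print(f"share of customers with prior {target_category} purchases: {100 * buys_target.mean():.1f}%")

# Average spend in that category, a rough proxy for preference strength
category_spend = (
    purchases[purchases["category"] == target_category]
    .groupby("customer_id")["amount"].sum()
)
print("average spend among those buyers:", round(category_spend.mean(), 2))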
4/ incomplete
- Usually the following information is missing or incomplete:
o Demographic information about participants
o Behavior on other platforms
o Data to operationalize theoretical constructs
- Construct validity = whether your test truly measures what it claims to measure
o For example: spending more time on the phone with a colleague does not imply that the colleague is more
important than your spouse.
- Measuring social capital
o Articulated networks → contacts
▪ Self-reported connections:
• People report their own social ties through surveys or
interviews. They might be asked who they consider
friends, family, or colleagues they can rely on
▪ Focus on structure:
• This approach emphasizes the structure of the network – who people say they are
connected to
▪ Limited insight into interaction:
• Articulated networks don't necessarily tell you how often or in what way people interact
with their connections. The strength or quality of the relationships isn't directly captured.
▪ Examples: Surveys asking people to list their close friends or colleagues they trust.
o Behavioral networks → communication
▪ Observed interactions:
• These networks are based on actual observed interactions between people, not self-
reported information. Data can come from phone calls, email exchanges, co-authorship
of papers, or even face-to-face interactions captured by sensors
▪ Focus on frequency and nature of interactions:
• This approach emphasizes how often people interact and the nature of their
interactions. It provides a more nuanced picture of social capital
▪ Challenges in data collection:
• Obtaining data on real-world interactions can be more complex and require access to
communication records or setting up specific tracking mechanisms. Privacy concerns
are also a consideration
▪ Examples: Analyzing phone call logs to see how often people from different social groups
connect, or tracking co-authorship of research papers to identify collaboration networks (see the sketch after this section).
➔ Choosing the right approach:
o The best method for measuring social capital depends on the research question:
▪ Structure matters: If you're interested in the overall structure of social networks and
how it relates to access to resources, articulated networks might be sufficient.
▪ Strength of ties matters: If understanding the frequency and nature of interactions is
crucial, then behavioral networks offer a richer picture.
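A minimal sketch contrasting the two network types with made-up data, using the networkx library: the articulated network records who people say they know, while the behavioral network weights ties by observed call counts. All names and call logs here are invented for illustration.

# Articulated vs. behavioral networks (toy data)
import networkx as nx

# Articulated: who people *say* they are connected to (unweighted ties)
articulated = nx.Graph()
articulated.add_edges_from([("ann", "bo"), ("ann", "cem"), ("bo", "dia")])

# Behavioral: who actually called whom, with call counts as edge weights
call_log = [("ann", "bo"), ("ann", "bo"), ("ann", "bo"), ("bo", "dia")]
behavioral = nx.Graph()
for caller, callee in call_log:
    if behavioral.has_edge(caller, callee):
        behavioral[caller][callee]["weight"] += 1
    else:
        behavioral.add_edge(caller, callee, weight=1)

# The self-reported tie ann-cem never shows up in the call data,
# while ann-bo turns out to be the most active relationship.
print("articulated ties:", list(articulated.edges()))
print("behavioral ties: ", list(behavioral.edges(data="weight")))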