- Big data is made up of digital traces (also called data exhaust): the records of the activities we carry out online. Since most activities now happen online, we leave behind many traces, and that information gets recorded and stored. For instance: clicks on a website, calls and location data on a phone, what you buy with a credit card, what you like or share on social media, what you watch.
- There has been an explosion in the amount of information being recorded and, in parallel, an increase in the computational power available to analyse all of that information. The trend suggests both will keep growing.
- We can think of data in two different notions. A dataset is essentially a matrix: the variables are the columns of this matrix, and the observations are the rows.
- The total number of variables is p
- The total number of observations is n
- An example could be a dataset of visits to a particular site: the variables (columns) could be visit number, date of visit, and start/end time of the visit, and each row is one observed visit.
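The n-by-p view of a dataset can be sketched with a small, hypothetical version of the site-visit example (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical site-visit dataset: each row is one observation (a visit),
# each column is a variable.
visits = pd.DataFrame({
    "visit_number": [1, 2, 3],
    "date": ["2023-01-05", "2023-01-06", "2023-01-06"],
    "start_time": ["09:12", "14:30", "18:01"],
    "end_time": ["09:20", "14:55", "18:10"],
})

n, p = visits.shape  # n = number of observations, p = number of variables
print(n, p)  # 3 observations, 4 variables
```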
- Data can be big in more than one way:
1. Many observations, few variables (n >> p): tall data
2. Few observations, many variables (n << p): wide data
- These notions of big tend to go together: if you have many variables to worry about, you need many observations to estimate their effects, and vice versa: if you have few variables to worry about, it makes little sense to collect huge numbers of observations. It is still useful to check whether a dataset is tall or wide.
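A minimal sketch of the tall/wide check; the cutoff factor of 10 is an assumption for illustration, not a standard threshold:

```python
def shape_type(n, p, ratio=10):
    """Classify a dataset as tall (n >> p), wide (n << p), or neither.
    The cutoff ratio is arbitrary and only for illustration."""
    if n >= ratio * p:
        return "tall"
    if p >= ratio * n:
        return "wide"
    return "neither"

print(shape_type(100_000, 12))  # tall: many observations, few variables
print(shape_type(50, 20_000))   # wide: few observations, many variables
```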
- The definition of big data has evolved over time. The early definition was data too large to be loaded onto one machine: it had to be distributed across several computers, with each machine performing its part of the calculation and the partial results then aggregated.
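The distribute-then-aggregate idea can be sketched as a map-reduce style mean: each "machine" computes a small partial summary, and only those summaries are combined (the partitions below are made up for illustration):

```python
# Each partition stands for the data held on one machine.
partitions = [
    [3, 5, 7],        # machine 1
    [2, 4],           # machine 2
    [6, 8, 10, 12],   # machine 3
]

# Map step: each machine computes only a (sum, count) summary of its chunk.
partials = [(sum(chunk), len(chunk)) for chunk in partitions]

# Reduce step: combine the small summaries into the overall mean.
total, count = map(sum, zip(*partials))
print(total / count)  # same result as computing the mean on all data at once
```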
- The definition of big data right now doesn’t come so much from the size of the data itself, but is
more related to the tools of artificial intelligence (like machine learning) needed to extract
meaning from that data.
- In this course, we don’t focus on the size of big data, we focus on the models and tools of
analysis that we need for large datasets.
- Data can be characterized as primary or secondary:
1. Primary data (custom-made): data collected to answer a specific research question.
2. Secondary data (ready-made): data collected for non-research purposes (e.g. generating profits, administering laws). According to the book Bit by Bit, big data is a form of secondary data. The professor disagrees in part, specifying that most big data applications are secondary, but not all of them.
- Business data:
1. The first question to ask is whether it is easily organized for analysis. In this regard, data can
be structured or unstructured.
Structured data is in units or on a scale that we can analyse with a pre-determined plan. For example, with survey ratings (how satisfied are you with ... on a scale of 1-5), I know that the higher the number, the higher the satisfaction; this data is on a scale I can easily analyse. Structured data can be human-generated or machine-generated. Examples include survey ratings, aptitude tests, web metrics, product purchases from sales records, and process-control measures.
Unstructured data is harder to analyse, because there is no built-in scale to help analyse it. For example, a text review of an item does not immediately tell us whether the reviewer liked the item or not. Other examples include audio transcripts, customer comments, voicemails, pictures, and reviews.
This type of data is not immediately classifiable on its own.
Unstructured data represents around 80% of all data available to businesses, and thanks to recent developments in big data techniques we can now make more use of it.
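The difference can be sketched as follows: a rating is directly usable, while a text review first has to be converted into a score. The word lists below are made up for illustration; real systems use trained sentiment models, not hand-written lists:

```python
import re

# Hypothetical word lists; a real system would use a trained model.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "poor", "hate", "broken"}

def review_score(text):
    """Turn unstructured text into a structured score:
    +1 per positive word, -1 per negative word."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(review_score("Great product, I love it"))      # 2
print(review_score("Arrived broken, poor quality"))  # -2
```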
2. The second question is where the data is generated. In this regard, data can be external or
internal.
Internal data is created within the firm that analyses it.
External data is taken from other sources, like YouTube, social media, GPS, or online forum comments.
Web clip 1.2: uses of big data
- One widespread use of big data is personalization. For example, Netflix uses big-data techniques to personalize its recommender system, which is built on data about what people have watched, what people who watched those same things watched next, and so on.
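A minimal sketch of the "people who watched this also watched ..." idea, counting co-occurrences in made-up viewing histories (a real recommender is far more sophisticated):

```python
from collections import Counter
from itertools import combinations

# Hypothetical viewing histories: each set is one viewer's watched titles.
histories = [
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C", "D"},
]

# Count how often each pair of titles was watched by the same viewer.
together = Counter()
for watched in histories:
    for x, y in combinations(sorted(watched), 2):
        together[(x, y)] += 1

def recommend(title):
    """Rank other titles by how often they co-occur with `title`."""
    scores = Counter()
    for (x, y), c in together.items():
        if x == title:
            scores[y] += c
        elif y == title:
            scores[x] += c
    return [t for t, _ in scores.most_common()]

print(recommend("A"))  # "B" co-occurs most often with "A"
```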
- Another use of big data is boosting engagement. 'At any given time there are 10,000 versions of Facebook running.' The idea is that while you are on a site, the company is constantly running A/B tests, because it wants to learn which version of a feature works best. For example, Facebook tested whether people respond better to 'N people liked this post' or 'Person X, whom you know, liked this post'. This experiment is an example of big data but also of primary data.
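A sketch of how an A/B test like this might be evaluated, using a standard two-proportion z-test; the click counts below are made up to match click-through rates of 0.35% and 0.40%:

```python
import math

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z-score for the difference between two click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se  # how many standard errors apart the rates are

z = two_proportion_z(clicks_a=350, n_a=100_000, clicks_b=400, n_b=100_000)
print(round(z, 2))  # |z| > 1.96 would indicate significance at the 5% level
```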
- Big data is also used for new product development, because big data is always running. The company Tastewise monitors social chatter, online recipes, and the country's most influential restaurants and menus to understand how food is prepared, loved, and shared, and uses this to create new product ideas.
- Big data is also used for reducing churn. Churn is when a customer quits a service. We can use past data to estimate a model that predicts churn (from length of time as a customer, number of other services subscribed to, demographics) so that we can predict the churn probability of current customers and intervene with those most likely to quit.
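A minimal sketch of such a churn-scoring model: a logistic regression with made-up coefficients (a real model would estimate them from past data):

```python
import math

# Hypothetical coefficients; in practice these are fitted to historical data.
# Negative signs mean longer tenure and more services lower churn risk.
COEF = {"intercept": -2.0, "months_as_customer": -0.05, "n_services": -0.4}

def churn_probability(months_as_customer, n_services):
    """Logistic model: linear score pushed through the logistic link."""
    score = (COEF["intercept"]
             + COEF["months_as_customer"] * months_as_customer
             + COEF["n_services"] * n_services)
    return 1 / (1 + math.exp(-score))

# A new customer with one service scores riskier than a long-tenured
# customer with four services (given these coefficients).
print(churn_probability(months_as_customer=2, n_services=1))
print(churn_probability(months_as_customer=48, n_services=4))
```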
- Big data is used for public policy and the economy. The mobile phone has become a primary source of public-data intelligence. For instance, now that the measures against corona have been relaxed, are people going back to work? Are they going to restaurants?
You can get this data from surveys, but surveys take time, and sometimes we want the information in real time. For that we can use, for example, Google mobility data (from people who have location turned on in Google Maps).
Web clip 1.3: the 10 characteristics of big data
- Big data is:
(+) big
(+) always on
(+) nonreactive
(-) incomplete
(-) inaccessible
(-) non-representative
(-) drifting
(-) algorithmically confounded
(-) dirty
(-) sensitive
- The characteristics marked with a plus are considered advantages of big data for research, whereas those marked with a minus are disadvantages.
- Big data is big, and this is an advantage when the event of interest is rare or the effect is small, when there is heterogeneity, or when the relationship is complex.
Example for a rare event: in marketing we are interested in predicting whether an ad will be successful. With good predictions we can better plan our advertising (show ad type X to people of type Y on sites of type Z). The problem is that clicks are a very rare event: the average click-through rate on a banner ad is 0.35%, so for every 10,000 observations you have only 35 clicks. Therefore, the more data we have, the better the model we can build.
Example for a small effect: we are running an online experiment with an ad A and an ad B and we want to see which works better. We estimate that the average CTR (click-through rate) for A is 0.35% and for B 0.40%. B is 0.05 percentage points higher, but how sure are we of this number? If we don't have enough data, the confidence interval might be too large; a lot of data lets us narrow the confidence interval and be more confident about our estimates.
Example for heterogeneity: heterogeneity means that customers respond differently to the same thing. Instead of banner ad B having the same effect for everyone (+0.05 percentage points), it could be that it increases CTR by 0.1 points for half of the customers and by 0 for the other half.
Example for a complicated relationship: consider building a model to predict churn. Without big data we might only be able to say that churn is more likely above a certain threshold, whereas with big data we might be able to see more complex relationships (like a map with regions in which customers are more likely to churn).
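The small-effect example above can be made concrete: a 95% normal-approximation confidence interval for a proportion narrows with the square root of n, so 100 times the data gives an interval 10 times as narrow. The 0.35% CTR is from the text; the sample sizes are for illustration:

```python
import math

def ci_halfwidth(ctr, n):
    """Half-width of a 95% normal-approximation CI for a proportion."""
    return 1.96 * math.sqrt(ctr * (1 - ctr) / n)

for n in (10_000, 1_000_000):
    print(n, round(ci_halfwidth(0.0035, n), 5))
# With n = 10,000 the interval is roughly +/- 0.12 percentage points, a third
# of the 0.35% CTR itself; with n = 1,000,000 it is 10 times narrower.
```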
- Big data is also always on. Collecting data in real time is important when we need to know and respond to answers immediately (unlike surveys, which take a long time). Always-on data is useful for monitoring economic activity, public health, competition (prices and campaigns of competitors), trends, marketing response, and digital marketing dashboards (e.g. how often your brand is mentioned on social media).
- Big data is also nonreactive. People usually change their behaviour when they know they are being observed, but with big data they don't get the chance, because they are unaware they are being recorded. For example, in the famous Hawthorne experiments, researchers tested whether more or less light increased workers' productivity, but the findings made no sense: productivity increased in every condition. This was because the workers knew they were being observed and worked harder because of it. With big data, people don't change their behaviour, either because they don't know they are being watched or because, even if they know, they don't care.
- Big data, however, is also incomplete: it records what happened, but not why. For example, big data allows us to predict churn, but not why customers chose to quit. We may know that using the product less predicts churn, but not why customers are using the product less. Perhaps satisfaction explains both why customers do not use the service much and why they are more likely to quit.
- Big data might also be inaccessible. From outside the organization, there might be legal, business, or ethical barriers preventing researchers from accessing the data. From inside, it could be because databases are not integrated, variables are hard to match across systems, or different coding schemes are used.
For example, Google mobility reports might give you the final aggregated data, but not the individual observations.
- Big data is nonrepresentative, and this is a big threat to generalizability. If the sample is representative, you can make inferences about the population based on your sample; for example, the average income in the sample can serve as an estimate for the entire population. However, if we use Facebook data to learn something that is not specific to Facebook, we need to take into account that not everyone is on Facebook, so the findings might not generalize to the entire population.
Another marketing-related example is reviews: consumers tend to trust reviews, but the people who write them are not necessarily representative of all customers.