UNIT I INTRODUCTION
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining
research goals – Retrieving data – Data preparation - Exploratory Data analysis – build the
model– presenting findings and building applications - Data Mining - Data Warehousing –
Basic Statistical descriptions of Data
Big Data:
Big data is a blanket term for any collection of data sets so large or complex that it becomes
difficult to process them using traditional data management techniques such as relational
database management systems (RDBMS).
I. Data Science:
• Data science involves using methods to analyze massive amounts of data and extract
the knowledge it contains.
• The characteristics of big data are often referred to as the three Vs:
o Volume—How much data is there?
o Variety—How diverse are different types of data?
o Velocity—At what speed is new data generated?
• A fourth V is often added:
o Veracity: How accurate is the data?
• Data science is an evolutionary extension of statistics capable of dealing with the
massive amounts of data produced today.
• What sets a data scientist apart from a statistician is the ability to work with big data
and experience in machine learning, computing, and algorithm building.
• Tools include Hadoop, Pig, Spark, R, Python, and Java, among others.
II. Benefits and uses of data science and big data
• Data science and big data are used almost everywhere in both commercial and non-
commercial settings.
• Commercial companies in almost every industry use data science and big data to
gain insights into their customers, processes, staff, competition, and products.
• Many companies use data science to offer customers a better user experience.
o Eg: Google AdSense, which collects data from internet users so relevant
commercial messages can be matched to the person browsing the internet.
o Eg: MaxPoint is an example of real-time personalized advertising.
• Human resource professionals use people analytics and text mining to:
o screen candidates,
o monitor the mood of employees, and
o study informal networks among coworkers.
• Financial institutions use data science:
o to predict stock markets, determine the risk of lending money, and
o learn how to attract new clients for their services
• Governmental organizations:
o rely on internal data scientists to discover valuable information, and
o share their data with the public.
o Eg: Data.gov is but one example; it’s the home of the US Government’s open
data.
o Some governmental organizations have collected 5 billion data records from
widespread applications such as Google Maps, Angry Birds, email, and text
messages, among many other data sources.
• Nongovernmental organizations:
o World Wildlife Fund (WWF), for instance, employs data scientists to increase
the effectiveness of their fundraising efforts.
o Eg: DataKind is one such data scientist group that devotes its time to the
benefit of mankind.
• Universities:
o Use data science in their research but also to enhance the study experience of
their students.
o Massive open online courses (MOOCs) produce a lot of data, which allows
universities to study how this type of learning can complement traditional
classes.
o Eg: Coursera, Udacity, and edX
III. Facets of data:
The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Structured data:
• Structured data is data that depends on a data model and resides in a fixed field
within a record.
• It is easy to store structured data in tables within databases or Excel files, and to
query it with Structured Query Language (SQL).
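A minimal sketch of structured data in practice, assuming Python's built-in sqlite3 module and an in-memory database. The table name, column names, and rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Chennai"), (2, "Ben", "Pune"), (3, "Chen", "Chennai")],
)

# Every value sits in a known field of a record, so SQL can address it directly.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY id", ("Chennai",)
).fetchall()
names = [name for (name,) in rows]
print(names)
conn.close()
```

Because the data model is fixed in advance, the query needs no interpretation of the content itself; it only names the field and the value to match.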
Unstructured data:
• Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying.
• Eg: E-mail. Although email contains structured elements such as the sender, title,
and body text, it’s a challenge to find the number of people who have written an
email complaint about a specific employee because so many ways exist to refer to a
person.
• The thousands of different languages and dialects complicate this further.
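The email example can be sketched as follows, assuming Python's standard email package. The message text, addresses, and the list of name variants are all made up for illustration; a real system would need a far more robust way of resolving references to a person.

```python
from email import message_from_string

# Hypothetical raw message, invented for illustration.
raw = (
    "From: customer@example.com\n"
    "Subject: Complaint about service\n"
    "\n"
    "I spoke to Mr. J. Smith yesterday. John was unhelpful.\n"
)
msg = message_from_string(raw)

# The structured elements are easy to read out of fixed header fields.
sender = msg["From"]
subject = msg["Subject"]

# The body is free text: an exact search for the full name finds nothing,
# even though the employee is mentioned twice in different forms.
body = msg.get_payload()
mentions_exact = body.count("John Smith")

# Assumed, hand-written list of ways the same person might be referred to.
variants = ["John Smith", "J. Smith", "Mr. Smith", "John"]
found = [v for v in variants if v in body]
print(sender, subject, mentions_exact, found)
```

The headers behave like structured data, while counting complaints about one employee depends on guessing every form the name can take.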
Natural language:
• A human-written email is also a perfect example of natural language data.
• Natural language is a special type of unstructured data.
• It’s challenging to process because it requires knowledge of specific data science
techniques and linguistics.
• Topics in NLP: entity recognition, topic recognition, summarization, text
completion, and sentiment analysis.
• Human language is ambiguous in nature.
Machine-generated data:
• Machine-generated data is information that’s automatically created by a computer,
process, application, or other machines without human intervention.
• Machine-generated data is becoming a major data resource.
• Eg: Wikibon has forecast that the market value of the industrial Internet will be
approximately $540 billion in 2020.
• International Data Corporation has estimated there will be 26 times more
connected things than people in 2020.
• This network is commonly referred to as the internet of things.
• Examples of machine data are web server logs, call detail records, network event
logs, and telemetry.
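As an illustration of working with machine data, the sketch below parses a few web server log lines and counts the response status codes. It assumes the lines follow the widely used Common Log Format; the sample lines and the regular expression are written for this example and are not taken from any particular server.

```python
import re
from collections import Counter

# Hypothetical log lines in Common Log Format, invented for illustration.
log_lines = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2023:13:55:40 +0000] "GET /missing HTTP/1.1" 404 153',
    '192.0.2.1 - - [10/Oct/2023:13:56:02 +0000] "POST /login HTTP/1.1" 200 512',
]

# One named group per field of interest.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

status_counts = Counter()
for line in log_lines:
    match = pattern.match(line)
    if match:  # skip lines that do not fit the assumed format
        status_counts[match.group("status")] += 1

print(dict(status_counts))
```

Machine-generated records are produced in high volume but in a regular shape, which is why a single pattern can turn each line into fields.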
Graph-based or network data:
• “Graph” in this case points to mathematical graph theory. In graph theory, a graph
is a mathematical structure to model pair-wise relationships between objects.
• Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects.
• Graph structures use nodes, edges, and properties to represent and store
graphical data.
• Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate the shortest path between two people.
• Graph-based data can be found on many social media websites.
• Eg: LinkedIn, Twitter, movie interests on Netflix
• Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
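The shortest path between two people can be sketched with a breadth-first search over an adjacency list. This is a minimal illustration in plain Python, not how a graph database does it internally; the people and their connections are hypothetical.

```python
from collections import deque

# Hypothetical social network: each person (node) maps to direct contacts (edges).
graph = {
    "Ana": ["Bo", "Cy"],
    "Bo": ["Ana", "Dee"],
    "Cy": ["Ana", "Dee"],
    "Dee": ["Bo", "Cy", "Eli"],
    "Eli": ["Dee"],
}

def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal, or None if unreachable."""
    queue = deque([[start]])   # each queue entry is a path explored so far
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

path = shortest_path(graph, "Ana", "Eli")
print(path)
```

Breadth-first search visits people in order of distance from the start, so the first path that reaches the goal uses the fewest edges when every connection counts equally.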
Audio, image, and video:
• Audio, image, and video are data types that pose specific challenges to a data
scientist.
• Tasks that are trivial for humans, such as recognizing objects in pictures, turn out
to be challenging for computers.
• Eg: Major League Baseball Advanced Media planned to increase video capture to
approximately 7 TB per game for the purpose of live, in-game analytics.
• High-speed cameras at stadiums capture ball and athlete movements so that
measurements can be calculated in real time.
• DeepMind succeeded at creating an algorithm that’s capable of learning how to
play video games.
• This algorithm takes the video screen as input and learns to interpret everything
via a complex process of deep learning.
• Google acquired DeepMind for its own artificial intelligence development plans.
Streaming data:
• The data flows into the system when an event happens instead of being loaded into
a data store in a batch.
• Examples are the “What’s trending” on Twitter, live sporting or music events, and
the stock market.
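The difference from batch loading can be sketched with a Python generator that delivers one event at a time while a running count is updated on each arrival. The event source and hashtags are hypothetical; a real stream would typically come from a socket or a message queue.

```python
from collections import Counter

def event_stream():
    """Hypothetical event source that yields one hashtag per event."""
    for tag in ["#cricket", "#music", "#cricket", "#stocks", "#cricket", "#music"]:
        yield tag

counts = Counter()
trending_history = []

# Each event is processed as it arrives; nothing is loaded in a batch first.
for tag in event_stream():
    counts[tag] += 1
    top_tag, _ = counts.most_common(1)[0]
    trending_history.append(top_tag)

print(trending_history[-1], counts[trending_history[-1]])
```

The running counter means an up-to-date "what's trending" answer is available after every event, without waiting for the full data set.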
The data science process:
• The data science process typically consists of six steps:
o Setting the research goal
o Retrieving data
o Data preparation