UNIT I: INTRODUCTION TO BIG DATA, FRAMEWORKS AND VISUALIZATION
1. Big Data and Data Science
2. Big Data Analytics, Business Intelligence vs. Big Data
3. Big Data frameworks: Hadoop, Hive, MapR, Sharding
4. MapReduce
5. NoSQL Databases, S3
6. Hadoop Distributed File System
7. Current landscape of analytics
8. Data visualization techniques, visualization software


Big Data and Data Science
What is big data?
Data that is very large in size is called Big Data. We normally work with data on the scale of
megabytes (Word documents, spreadsheets) or at most gigabytes (movies, code), but data on the scale of
petabytes (10^15 bytes) is called Big Data. It is often stated that almost 90% of today's data has been
generated in the past three years.


Sources of Big Data
This data comes from many sources, such as:
o Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data on a
day-to-day basis, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from
which users' buying trends can be traced.
o Weather stations: Weather stations and satellites produce very large amounts of data, which are
stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their
plans accordingly; for this they store the data of millions of users.
o Share markets: Stock exchanges across the world generate huge amounts of data through their daily
transactions.
3 V's of Big Data
1. Velocity: Data is being generated at a very fast rate. It is estimated that the volume of data
doubles every two years.
2. Variety: Nowadays data is not stored only in rows and columns. Data can be structured as well as
unstructured. Log files and CCTV footage are unstructured data; data that can be saved in tables,
such as a bank's transaction records, is structured data.
3. Volume: The amount of data involved is very large, on the scale of petabytes.

Types of Big Data
1. Structured data. Any data set that adheres to a specific structure can be called structured data.
Structured data sets can be processed relatively easily compared with other data types, because
users know exactly what structure the data has. A good example of structured data is a
distributed RDBMS, which holds data in organized table structures.
2. Semi-structured data. This type of data does not adhere to a rigid structure yet retains some
observable structure, such as a grouping or an organized hierarchy. Examples of semi-structured
data include markup languages (XML), web pages, and emails.
3. Unstructured data. This type of data does not adhere to a schema or a preset structure. It is the
most common type when dealing with Big Data: text, pictures, video, and audio all fall under
this type.
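The three types can be illustrated with a short Python sketch; the sample records below are invented for illustration:

```python
import csv
import io
import json

# Structured: a CSV with a fixed schema -- every row has the same columns,
# so fields are addressable by column name.
csv_text = "id,name,balance\n1,Alice,250.0\n2,Bob,99.5\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"])

# Semi-structured: JSON has an observable hierarchy but no rigid schema --
# records may nest and omit fields.
json_text = '{"id": 1, "name": "Alice", "address": {"city": "Pune"}}'
record = json.loads(json_text)
print(record["address"]["city"])

# Unstructured: free text has no schema at all; any structure must be
# inferred after the fact, e.g. by tokenizing.
log_line = "2025-06-01 12:00:03 user Alice logged in from 10.0.0.7"
tokens = log_line.split()
```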




What is data science?
o Data science is the art and science of acquiring knowledge through data.
o Data science is a multidisciplinary approach that extracts information from data by
combining:
 Scientific methods
 Mathematics and statistics
 Programming
 Advanced analytics
 Machine learning (ML) and artificial intelligence (AI)

Data Analytics Lifecycle:
The Data Analytics Lifecycle is designed for Big Data problems and data science projects. The cycle is
iterative, to reflect how real projects run. To address the distinct requirements of performing analysis
on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved in
acquiring, processing, analyzing, and repurposing data.

Phase 1: Discovery
 The data science team learns about and investigates the problem.
 The team develops context and understanding.
 The team identifies the data sources needed for, and available to, the project.
 The team formulates initial hypotheses that can later be tested with data.

Phase 2: Data Preparation
 This phase covers the steps to explore, preprocess, and condition data prior to modeling and analysis.
 It requires the presence of an analytic sandbox; the team executes extract, load, and transform (ELT)
steps to get data into the sandbox.
 Data preparation tasks are likely to be performed multiple times and not in a predefined order.
 Tools commonly used in this phase include Hadoop, Alpine Miner, and OpenRefine.
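The conditioning steps above can be sketched in miniature with plain Python; the raw records below are invented, and a real project would pull data from source systems with the tools named above:

```python
# Minimal data-preparation sketch: drop incomplete rows, drop exact
# duplicates, and cast string fields to proper types.
raw = [
    {"user": "alice", "age": "34", "spend": "120.5"},
    {"user": "bob",   "age": "",   "spend": "80.0"},   # missing age
    {"user": "alice", "age": "34", "spend": "120.5"},  # exact duplicate
]

def prepare(records):
    seen, clean = set(), []
    for r in records:
        if not all(r.values()):          # condition: drop incomplete rows
            continue
        key = tuple(sorted(r.items()))
        if key in seen:                  # condition: drop duplicates
            continue
        seen.add(key)
        clean.append({"user": r["user"],
                      "age": int(r["age"]),       # cast to proper types
                      "spend": float(r["spend"])})
    return clean

print(prepare(raw))
```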

Phase 3: Model Planning
 The team explores the data to learn about the relationships between variables and subsequently
selects the key variables and the most suitable models.
 In this phase, the data science team develops data sets for training, testing, and production purposes.
 Tools commonly used in this phase include MATLAB and Statistica.

Phase 4: Model Building
 The team develops datasets for testing, training, and production purposes.
 The team builds and executes models based on the work done in the model planning phase.
 The team also considers whether its existing tools will suffice for running the models or whether it
needs a more robust environment for executing them.
 Free or open-source tools: R and PL/R, Octave, WEKA.
 Commercial tools: MATLAB, Statistica.

Phase 5: Communicate Results
 After executing the model, the team needs to compare the outcomes of the modeling to the criteria
established for success and failure.

 The team considers how best to articulate the findings and outcomes to the various team members and
stakeholders, taking into account caveats and assumptions.
 The team should identify the key findings, quantify the business value, and develop a narrative to
summarize and convey the findings to stakeholders.
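Comparing outcomes against pre-agreed success criteria can be as simple as checking a metric against a threshold. A sketch in which the 85% accuracy criterion and the labels are invented:

```python
# Success criterion agreed during Discovery (invented threshold).
SUCCESS_THRESHOLD = 0.85

# Held-out labels vs. model predictions (invented values).
actual    = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Fraction of predictions that match the actual labels.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
verdict = "success" if accuracy >= SUCCESS_THRESHOLD else "failure"
print(f"accuracy={accuracy:.0%} -> {verdict}")
```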

Phase 6: Operationalize
 The team communicates the benefits of the project more broadly and sets up a pilot project to deploy
the work in a controlled way before broadening it to the full enterprise of users.
 This approach enables the team to learn about the performance and related constraints of the model in
a production environment on a small scale, and to make adjustments before full deployment.
 The team delivers final reports, briefings, and code.
 Free or open-source tools: Octave, WEKA, SQL, MADlib.
