Summary: Data Science
UNIT I: INTRODUCTION TO BIG DATA, FRAMEWORKS AND VISUALIZATION
1. Big Data and Data Science
2. Big Data Analytics, Business Intelligence vs Big Data
3. Big data frameworks: Hadoop, Hive, MapR, Sharding
4. MapReduce
5. NoSQL Databases, S3
6. Hadoop Distributed File System
7. Current landscape of analytics
8. Data visualization techniques, visualization software


Big Data and Data Science
What is big data?
Data that is very large in size is called Big Data. Normally we work with data on the order of megabytes (Word documents, Excel files) or at most gigabytes (movies, code), but data on the scale of petabytes, i.e. 10^15 bytes, is called Big Data. It is often stated that almost 90% of today's data has been generated in the past three years.
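To get a feel for that scale, a quick back-of-the-envelope calculation helps; the file sizes below are illustrative assumptions, not measurements:

```python
# Rough scale comparison: how many typical files fit in one petabyte?
PB = 10**15               # one petabyte in bytes

word_doc = 2 * 10**6      # assume a ~2 MB Word document
hd_movie = 4 * 10**9      # assume a ~4 GB HD movie

print(PB // word_doc)     # 500_000_000 documents per petabyte
print(PB // hd_movie)     # 250_000 movies per petabyte
```

A single petabyte thus holds hundreds of millions of ordinary office files, which is why Big Data needs distributed storage and processing.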


Sources of Big Data
This data comes from many sources:
o Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites produce very large volumes of data, which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly; for this they store the data of millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
3 V's of Big Data
1. Velocity: Data is growing at a very fast rate; it is estimated that the volume of data doubles roughly every two years.
2. Variety: Nowadays data is not stored only in rows and columns; it is both structured and unstructured. Log files and CCTV footage are unstructured data, while data that can be stored in tables, such as bank transaction data, is structured.
3. Volume: The amount of data we deal with is very large, on the scale of petabytes.

Types of big data
1. Structured data. Any data set that adheres to a specific structure can be called structured data. Structured data sets can be processed relatively easily compared to other data types, because users can identify the structure of the data exactly. A good example of structured data is a distributed RDBMS, which stores data in organized table structures.
2. Semi-structured data. This type of data does not adhere to a strict structure yet retains some kind of observable structure, such as a grouping or an organized hierarchy. Examples of semi-structured data are markup languages (XML), web pages, and emails.
3. Unstructured data. This type of data does not adhere to a schema or a preset structure. It is the most common type of data when dealing with big data; text, pictures, video, and audio all fall under this type.
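The three categories can be illustrated with small samples; the data values below are made up for illustration:

```python
import csv
import json
import io

# Structured: rows and columns with a fixed schema, like an RDBMS table.
structured = io.StringIO("id,amount\n1,100\n2,250\n")
rows = list(csv.DictReader(structured))
print(rows[0]["amount"])          # fields are addressable by name

# Semi-structured: no rigid schema, but an observable hierarchy
# (JSON here, analogous to the XML/email examples above).
semi = json.loads('{"user": {"name": "alice", "tags": ["a", "b"]}}')
print(semi["user"]["tags"])       # navigable, but shape may vary per record

# Unstructured: free text, with no schema to address fields by.
unstructured = "Customer called about a delayed order; sounded unhappy."
print(len(unstructured.split()))  # only generic operations apply directly
```

The progression from structured to unstructured is a progression in how much work the processing system must do to impose meaning on the data.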




What is data science?
o Data science is the art and science of acquiring knowledge through data.
o Data science is a multidisciplinary approach that extracts information from data by combining:
  - Scientific methods
  - Maths and statistics
  - Programming
  - Advanced analytics
  - ML and AI

Data Analytics Lifecycle:
The Data Analytics Lifecycle is designed for Big Data problems and data science projects. The cycle is iterative, to reflect how real projects run. To address the distinct requirements of performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved in acquiring, processing, analyzing, and repurposing data.

Phase 1: Discovery
 The data science team learns about and investigates the problem.
 It develops context and understanding.
 It identifies the data sources needed and available for the project.
 The team formulates initial hypotheses that can later be tested with data.

Phase 2: Data Preparation
 Steps to explore, preprocess, and condition data prior to modeling and analysis.
 This phase requires the presence of an analytic sandbox; the team executes extract, load, and transform steps to get data into the sandbox.
 Data preparation tasks are likely to be performed multiple times and not in a predefined order.
 Tools commonly used in this phase include Hadoop, Alpine Miner, and OpenRefine.
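A minimal sketch of the kind of conditioning done in this phase; the field names and the imputation rule are illustrative assumptions, not from the source:

```python
# Condition raw records before modeling: coerce types, handle missing values.
raw = [
    {"age": "34", "income": "52000"},
    {"age": "",   "income": "61000"},   # missing age
    {"age": "29", "income": "48000"},
]

def condition(records):
    """Coerce string fields to ints; impute missing ages with the mean."""
    ages = [int(r["age"]) for r in records if r["age"]]
    default_age = sum(ages) // len(ages)
    return [
        {"age": int(r["age"]) if r["age"] else default_age,
         "income": int(r["income"])}
        for r in records
    ]

clean = condition(raw)
print(clean[1])   # the missing age has been replaced by the imputed value
```

In practice this kind of step is rerun whenever exploration uncovers new quality problems, which is why the phase is not a single linear pass.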

Phase 3: Model Planning
 The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
 It determines the methods, techniques, and workflow it intends to follow in the model building phase.
 Tools commonly used in this phase include MATLAB and Statistica.

Phase 4: Model Building
 The team develops datasets for training, testing, and production purposes.
 It builds and executes models based on the work done in the model planning phase.
 The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing them.
 Free or open-source tools: R and PL/R, Octave, WEKA.
 Commercial tools: MATLAB, Statistica.
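The training/testing split can be sketched with a tiny least-squares fit in plain Python; the data and the holdout split are illustrative:

```python
# Fit y = a*x + b on a training set, then check error on a held-out test set.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0), (5, 9.9), (6, 12.1)]
train, test = data[:4], data[4:]   # simple holdout split

# Ordinary least squares for a single predictor.
n = len(train)
sx = sum(x for x, _ in train)
sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train)
sxy = sum(x * y for x, y in train)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
b = (sy - a * sx) / n                           # intercept

# Mean absolute error on the unseen test points.
test_error = sum(abs(y - (a * x + b)) for x, y in test) / len(test)
print(round(a, 2), round(b, 2), round(test_error, 2))
```

Evaluating on data the model never saw is what tells the team whether the model generalizes or merely memorized the training set.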

Phase 5: Communicate Results
 After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure.
 The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking caveats and assumptions into account.
 The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
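Comparing outcomes to the success criteria can be as simple as a threshold check; the metric names and thresholds below are illustrative assumptions:

```python
# Success criteria agreed during Discovery (illustrative thresholds).
criteria = {"min_accuracy": 0.85, "max_false_positive_rate": 0.10}

# Outcomes measured after executing the model (made-up numbers).
outcome = {"accuracy": 0.91, "false_positive_rate": 0.07}

success = (outcome["accuracy"] >= criteria["min_accuracy"]
           and outcome["false_positive_rate"] <= criteria["max_false_positive_rate"])
print("success" if success else "failure")  # feeds the stakeholder narrative
```

Fixing the criteria before modeling, as the lifecycle prescribes, keeps this comparison objective rather than negotiated after the fact.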

Phase 6: Operationalize
 The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to a full enterprise of users.
 This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before full deployment.
 The team delivers final reports, briefings, and code.
 Free or open-source tools: Octave, WEKA, SQL, MADlib.
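The iterative nature of the lifecycle can be sketched as a loop over the six phases that repeats until the results are acceptable; the stop condition here is an illustrative stand-in for the real success criteria:

```python
PHASES = ["Discovery", "Data Preparation", "Model Planning",
          "Model Building", "Communicate Results", "Operationalize"]

def run_lifecycle(meets_criteria, max_iterations=3):
    """Cycle through the phases, looping back until results are acceptable."""
    history = []
    for iteration in range(1, max_iterations + 1):
        history.extend(PHASES)           # one full pass through all six phases
        if meets_criteria(iteration):    # e.g. model met the success criteria
            break                        # otherwise real projects loop back
    return history

# Suppose the criteria are only met on the second pass:
log = run_lifecycle(lambda iteration: iteration >= 2)
print(len(log))   # two full passes through the six phases
```

The loop makes the point of the opening paragraph concrete: the phases are not a one-way pipeline but a cycle that repeats until the project's goals are met.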
