Hadoop Certification New Exam With Complete
Solutions 100% Accurate
Hortonworks DataFlow (HDF)
For data in motion. Powered by Apache NiFi. 1) real-time - add, trace, adjust; 2)
integrated - common input, output, transformation; 3) secure - security rules, encryption,
traceability; 4) adaptive - adapts data flow, scalable; if the connection is poor, slims
down the data
Data discovery
A user-directed process of searching for patterns or specific items in a data set. Data
discovery applications use visual tools such as geographical maps, pivot-tables and
heat-maps to make the process of finding patterns or specific items rapid and intuitive.
Data discovery may leverage statistical and data mining techniques. Ex. Web log
analysis, online ad placement, claims notes mining
ETL onboard
Ex. sensor data ingest
Active archive
Ex. individual driver histories
Data in motion
Perishable insights
Data at rest
Historical insights
Actionable intelligence
Supports data discovery, single view, predictive analytics
Single view
A Single View application aggregates data from multiple sources into a central
repository to create a single view of anything — of customers, inventory, systems
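As a rough illustration of the Single View idea, the sketch below merges records from two made-up sources into one record per customer. The function name build_single_view, the source names ("crm", "billing") and the field names are assumptions chosen for the example, not part of any Hortonworks product.

```python
from collections import defaultdict

def build_single_view(sources, key="customer_id"):
    """Merge records from several sources into one record per key value."""
    view = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            merged = view[record[key]]
            # Later sources overwrite earlier ones when a field repeats.
            merged.update(record)
            # Track which sources contributed to this merged record.
            merged.setdefault("sources", []).append(source_name)
    return dict(view)

# Hypothetical input: two systems that each know part of the picture.
sources = {
    "crm": [{"customer_id": 1, "name": "Ada"}],
    "billing": [
        {"customer_id": 1, "balance": 42.0},
        {"customer_id": 2, "balance": 0.0},
    ],
}
single_view = build_single_view(sources)
```

Here single_view[1] combines the name from the first source with the balance from the second. A real single view application would also need rules for conflicting values and for matching records that lack a shared key.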
Splunk
Leading platform for Operational Intelligence. Empowers the curious to look closely at
what others ignore—machine data—and find what others never see: insights that can
make your company more productive, more profitable, more competitive and more
secure
Apache Spark
An open source big data processing framework built around speed, ease of use, and
sophisticated analytics. Originally developed in 2009 in UC Berkeley's AMPLab, open
sourced in 2010, and later donated to the Apache Software Foundation
Apache Storm
Real-time event processing for sensor and business activity monitoring. Storm is a free
and open source distributed realtime computation system. Storm makes it easy to
reliably process unbounded streams of data, doing for real-time processing what
Hadoop did for batch processing. Storm is simple and can be used with any programming
language. Ingests millions of events per second. Manage with Ambari. Horizontally
scalable. Fixed, low latency and continuous processing for very high frequency
streaming data.
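To make "continuous processing of an unbounded stream" concrete, here is a minimal sketch in plain Python generators. It does not use the Storm API. The names sensor_spout and threshold_bolt borrow Storm's terms for a stream source (spout) and a processing step (bolt), but the functions, the sensor ids and the threshold are all hypothetical.

```python
import itertools
from collections import Counter

def sensor_spout():
    """Hypothetical unbounded source: yields (sensor_id, reading) forever."""
    for i in itertools.count():
        yield (f"sensor-{i % 3}", i)

def threshold_bolt(events, limit):
    """Pass along only readings above limit, one event at a time."""
    for sensor_id, reading in events:
        if reading > limit:
            yield (sensor_id, reading)

def count_alerts(events, n):
    """Consume the first n alerts from the stream and count them per sensor."""
    return Counter(sensor_id for sensor_id, _ in itertools.islice(events, n))

# Nothing is stored up front; each event flows through as it is produced.
alerts = threshold_bolt(sensor_spout(), limit=4)
alert_counts = count_alerts(alerts, 6)
```

The stream never ends, so the consumer decides how much to read (six alerts here). Storm itself distributes this kind of pipeline across a cluster, which this single-process sketch does not attempt.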
YARN
Data operating system. Cluster resource management. 2013 - includes batch,
interactive and realtime. At core of Hortonworks Data Platform - HDP for data at rest.
Centralized platform for: 1) operations - cluster management, one data lake or clusters;
2) governance - data lifecycle mgt, modeling with metadata, lineage capability 3)
security - roles or data tags, encryption at rest and in motion, authentication. Includes
data functions for: batch, machine learning, search, interactive, streaming
Hive on YARN
SQL:2011 for analytics
Hortonworks Data Platform (HDP)
Data at rest. Powered by Open Enterprise Hadoop. 1) Open - open source; 2) Central -
YARN at core; 3) Interoperable - existing technology, skills; 4) Ready - enterprise-ready
for operations, governance, security; dev efforts include: 1) data management; 2) data
access; 3) governance and integration; 4) operations; 5) security
Apache Spark at Scale
Open source cluster computing framework originally developed in the AMPLab at
University of California, Berkeley and later donated to the Apache Software
Foundation, where it remains today. Integrated component of HDP. Agile analytics using
data science notebooks, includes geospatial, entity resolution; wide array of data
sources; RDD sharing, HDFS memory tier. Newer approach than SQL handled by Hive.
Data access engine for fast, large scale data processing. Designed for iterative,
in-memory computations and interactive data mining. APIs for Scala, Java, Python.
Spark SQL, Spark Streaming, MLlib, GraphX - can run as a YARN workload - can run on
a single data set in Hadoop.
Resilient Distributed Dataset (RDD)
The basic abstraction in Spark. Represents an immutable, partitioned collection of
elements that can be operated on in parallel.
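The three properties in that definition (immutable, partitioned, parallel) can be sketched with a toy class. MiniRDD below is a hypothetical stand-in written with the Python standard library; it is not Spark's RDD class, it runs on one machine with threads, and it evaluates eagerly, whereas Spark distributes partitions across a cluster and evaluates transformations lazily.

```python
from concurrent.futures import ThreadPoolExecutor

class MiniRDD:
    """Toy immutable, partitioned collection (illustration only, not Spark)."""

    def __init__(self, partitions):
        # Tuples keep every partition read-only after construction.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, data, num_partitions):
        """Split data round-robin into num_partitions partitions."""
        data = list(data)
        return cls(data[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        """Apply fn to each partition independently and return a new MiniRDD."""
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(lambda p: [fn(x) for x in p], self._partitions))
        return MiniRDD(parts)

    def collect(self):
        """Gather all elements, partition by partition, into one list."""
        return [x for p in self._partitions for x in p]

numbers = MiniRDD.parallelize(range(6), num_partitions=2)
squares = numbers.map(lambda x: x * x)
```

Calling map leaves numbers untouched and produces a separate collection, which is the sense in which the data is immutable: transformations create new datasets rather than modifying existing ones.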
Hadoop Distributed File System (HDFS)
HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop
framework. A Hadoop cluster has nominally a single namenode plus a cluster of