COMP 1702 BIG DATA
Sinchana Mallesh
001427852
MSc in Data Science
Table of Contents
TASK A: HADOOP ANALYSIS
TASK B: DATA WAREHOUSE DESIGN
TASK C: MAPREDUCE PROGRAMMING
TASK D: BIG DATA PROJECT ANALYSIS
D1: Data Lake vs. Data Warehouse
D2: Real-Time Analytics Framework
D3: Cloud Hosting Strategy
REFERENCES
TABLE OF FIGURES:
Figure 1: Hive Table Creation for 'Customers' Table
Figure 2: Hive Table Creation for 'Sales' Table
Figure 3: Hive Table Creation for 'Products' Table
Figure 4: Data Insertion into 'Customers' Table in Hive
Figure 5: Data Insertion into 'Sales' Table in Hive
Figure 6: Data Insertion into 'Products' Table in Hive
Figure 7: Query to Find Total Sales Amount by Each Customer
Figure 8: Hive Query Execution of Total Sales Amount by Each Customer
Figure 9: Output Showing Total Sales Amount by Each Customer (Part 1)
Figure 10: Output Showing Total Sales Amount by Each Customer (Part 2)
Figure 11: Output Showing Total Sales Amount by Each Customer (Part 3)
Figure 12: Final Result Set of Total Sales Amount by Each Customer
Figure 13: Query Code for Listing the Top 5 Highest-Priced Products Sold
Figure 14: Hive Query Execution for Listing the Top 5 Highest-Priced Products Sold
Figure 15: Output Showing Top 5 Highest-Priced Products Sold
Figure 16: Final Result Set of Top 5 Highest-Priced Products Sold
Figure 17: Query Code for Calculating the Average Price of Products by Category
Figure 18: Hive Query Execution for Average Price of Products by Category
Figure 19: Output Showing Average Price of Products by Category (Part 1)
Figure 20: Final Result Set of Average Price of Products by Category
Figure 21: Query Code for Customers Purchasing More Than One Product in a Transaction
Figure 22: Hive Query Execution for Customers Purchasing More Than One Product
Figure 23: Final Result Set of Customers with Multiple Product Transactions
Figure 24: Query Code for Products That Have Never Been Sold (Zero Sales)
Figure 25: Hive Query Execution for Products with Zero Sales
Figure 26: Final Result Set of Unsold Products
Figure 27: Query Code for Total Sales Amount and Quantity Sold per Category
Figure 28: Hive Query Execution for Sales Amount and Quantity per Category
Figure 29: Output Showing Sales Amount and Quantity per Category (Part 1)
Figure 30: Final Result Set of Sales Amount and Quantity per Category
Figure 31: Query Code for Top 3 Cities with Highest Total Sales
Figure 32: Hive Query Execution for Top 3 Cities with Highest Sales
Figure 33: Output Showing Top 3 Cities by Sales (Part 1)
Figure 34: Final Result Set of Top 3 Cities with Highest Sales
Figure 35: Query Code for Total Products Sold by Each Customer
Figure 36: Hive Query Execution for Total Products Sold by Each Customer
Figure 37: Output Showing Products Sold by Each Customer (Part 1)
Figure 38: Output Showing Products Sold by Each Customer (Part 2)
Figure 39: Output Showing Products Sold by Each Customer (Part 3)
Figure 40: Final Result Set of Products Sold by Each Customer
Figure 41: Query Code for Total Sales for Each Product
Figure 42: Hive Query Execution for Total Sales per Product
Figure 43: Output Showing Sales for Each Product (Part 1)
Figure 44: Output Showing Sales for Each Product (Part 2)
Figure 45: Output Showing Sales for Each Product (Part 3)
Figure 46: Final Result Set of Sales for Each Product
Figure 47: Query Code for Number of Customers per City
Figure 48: Hive Query Execution for Number of Customers per City
Figure 49: Output Showing Customers per City (Part 1)
Figure 50: Final Result Set of Customers per City