BUSINESS INTELLIGENCE & DATA MANAGEMENT
Dr. Emiel Caron & Dr. Ekaterini Ioannou & Dr. Poonacha Medappa
TABLE OF CONTENTS
LECTURE 1: INTRODUCTION TO BI AND DATABASE SYSTEMS ........................................................................ 4
1. INTRODUCTION TO BUSINESS INTELLIGENCE ........................................................................................................... 4
1.1. Business Intelligence (BI) vs. Business Analytics (BA). ........................................................................ 4
1.2. Definition ............................................................................................................................................ 4
1.3. Business Intelligence architecture ...................................................................................................... 5
2. INTRODUCTION TO DATABASES ............................................................................................................................ 7
Text 1. Database systems: design, implementation, and management – Carlos Coronel, Steven Morris &
Peter Rob.......................................................................................................................................................... 7
2.1. Database systems ............................................................................................................................. 32
2.2. Relational databases ........................................................................................................................ 32
2.3. Trends in the database world ........................................................................................................... 33
3. READING: DATA WAREHOUSE DESIGN- MODERN PRINCIPLES AND METHODOLOGIES ................................................. 34
LECTURE 2: SQL & DATA WAREHOUSING ................................................................................................... 45
1. INTRODUCTION STRUCTURED QUERY LANGUAGE (SQL) ........................................................................................ 45
1.1. Data types ......................................................................................................................................... 45
1.2. Join types .......................................................................................................................................... 45
2. INTRODUCTION TO DATA WAREHOUSING............................................................................................................. 46
2.1. Why do we need a separate data warehouse? ................................................................................ 47
2.2. DW framework: components............................................................................................................ 47
2.3. DW framework: Architecture............................................................................................................ 50
2.4. Data warehouse architecture variants ............................................................................................. 51
LECTURE 3: OLAP BUSINESS DATABASES & BUSINESS DASHBOARDS ........................................................... 57
1. ETL (EXTRACTION, TRANSFORMATION, & LOAD) ................................................................................................. 57
1.1. Process steps ..................................................................................................................................... 57
1.2. Transformation ................................................................................................................................. 57
1.3. ETL tools ............................................................................................................................................ 59
2. OLAP BUSINESS DATABASES ............................................................................................................................. 60
2.1. Why OLAP? ....................................................................................................................................... 61
2.2. OLAP operators ................................................................................................................................. 62
2.3. Multi-dimentional modelling ............................................................................................................ 64
2.4. Central fact table .............................................................................................................................. 67
2.5. Dimension table ................................................................................................................................ 68
2.6. OLAP software vendors..................................................................................................................... 69
3. TECHNICAL OLAP ISSUES ................................................................................................................................. 70
3.1. Sparse fact table ............................................................................................................................... 70
3.2. History in the stars ............................................................................................................................ 71
1
, 4. BUSINESS DASHBOARDS ................................................................................................................................... 72
4.1. Two theoretical perspectives: ........................................................................................................... 72
LECTURE 4: DATA MINING INTRODUCTION ................................................................................................ 74
1. PYTHON REFRESHER ........................................................................................................................................ 74
2. DECISION MAKING WITH BIG DATA ..................................................................................................................... 75
3. DATA MINING METHODS ................................................................................................................................. 75
4. DATA ........................................................................................................................................................... 75
4.1. Data and types of variables .............................................................................................................. 76
4.2. Sources of data ................................................................................................................................. 77
5. DATA MINING PROCESS(ES)—OVERVIEW OF THE STEPS INVOLVED IN DATA MINING .................................................... 77
Step 1: Develop an understanding of the purpose of the data mining project ............................................. 77
Step 2: Obtain the dataset to be used in the analysis ................................................................................... 77
Step 3: Explore, clean, and preprocess the data ............................................................................................ 78
Step 4: Reduce the data dimension, if necessary........................................................................................... 78
Step 5: Determine the data mining task ........................................................................................................ 78
Step 6: Partition the data (for supervised tasks) ........................................................................................... 78
Step 7: Choose the data mining technique(s) ................................................................................................ 78
Step 8: Use algorithms to perform the task ................................................................................................... 78
Step 9: Interpret the results of the algorithms .............................................................................................. 78
Step 10: Deploy the model ............................................................................................................................. 79
5.1. SEMMA methodology ....................................................................................................................... 79
5.2. CRISP-DM .......................................................................................................................................... 79
LECTURE 5: REGRESSION MODELS ............................................................................................................. 81
1. DEFINITION: REGRESSION ANALYSIS .................................................................................................................... 82
2. LINEAR REGRESSION MODEL ............................................................................................................................ 82
3. VISUALIZATION............................................................................................................................................... 83
4. ORDINARY LEAST SQUARES (OLS) ..................................................................................................................... 83
5. MODEL USAGE AND POSSIBLE ISSUES .................................................................................................................. 84
5.1. Objectives for single/multiple regression ......................................................................................... 84
5.2. Issues................................................................................................................................................. 84
LECTURE 6: CLASSIFICATION...................................................................................................................... 85
1. WHAT IS CLASSIFICATION? ............................................................................................................................... 85
1.1. Classification vs. Clustering .............................................................................................................. 85
1.2. Classification process ........................................................................................................................ 85
1.3. Need for classification ...................................................................................................................... 86
1.4. Model induction and application ...................................................................................................... 86
1.5. Classification techniques .................................................................................................................. 87
2. NAÏVE BAYES FOR CLASSIFICATION ..................................................................................................................... 88
2.1. Naïve Bayes classifier........................................................................................................................ 88
2.2. Some concerns .................................................................................................................................. 90
2.3. Pros and cons .................................................................................................................................... 90
LECTURE 7: K NEAREST NEIGHBORS FOR CLASSIFICATION .......................................................................... 91
1. DETERMINING RECORD’S NEIGHBORS ................................................................................................................. 91
1.1. Euclidean Distance ............................................................................................................................ 91
2
, 1.2. Manhattan Distance ......................................................................................................................... 92
2. CHOOSING THE NUMBER OF NEIGHBORS, I.E., VALUE K .......................................................................................... 92
3. COMPUTING PREDICTION (FOR A NUMERICAL OUTCOME) ....................................................................................... 93
LECTURE 8: PERFORMANCE MEASURES ..................................................................................................... 95
1. EVALUATING PREDICTIVE PERFORMANCE (OF NUMERIC/CONTINUOUS PREDICTIONS) ................................................... 96
1.1. Prediction Accuracy measures .......................................................................................................... 96
1.2. Lift chart............................................................................................................................................ 97
2. JUDGING CLASSIFIER PERFORMANCE (CATEGORICAL VARIABLES EX. BIRDS) ................................................................ 98
2.1. Confusion matrix............................................................................................................................... 98
2.2. Accuracy (overall success rate) ......................................................................................................... 99
2.3. Receiver operating characteristic (ROC) ........................................................................................... 99
2.4. Cost Matrix (as response to the limitation of Accuracy) ................................................................ 100
2.5. Kappa statistic for multiclass prediction......................................................................................... 101
2.6. Precision and Recall ........................................................................................................................ 102
2.7. 𝑭𝟏-measure .................................................................................................................................... 103
LECTURE 9: DECISION TREES.................................................................................................................... 104
1. MAIN PROCESSING........................................................................................................................................ 104
1.1. Induction (with a Greedy Strategy)................................................................................................. 105
2. PROS AND CONS OF DECISION TREES ................................................................................................................. 109
LECTURE 10: ASSOCIATION RULES ........................................................................................................... 110
1. RULES ........................................................................................................................................................ 110
2. TWO-STAGE PROCESS.................................................................................................................................... 111
2.1. Generation of frequent itemsets → Apriory algorithm .................................................................. 111
2.2. Selecting the strong rules i.e., criteria for judging the strength of the rules.................................. 112
3. ALTERNATIVE DATA REPRESENTATION (TO SPEED UP EXECUTION) ........................................................................... 113
LECTURE 11: CLUSTERING ....................................................................................................................... 114
1. CLUSTER ANALYSIS ........................................................................................................................................ 114
1.1. Issues for clustering ........................................................................................................................ 114
2. REPRESENTATION & DISTANCE........................................................................................................................ 115
2.1. Distance .......................................................................................................................................... 115
3. TWO TYPES OF CLUSTERING ............................................................................................................................ 117
3.1. Hierarchical clustering .................................................................................................................... 117
3.2. Partitional Algorithms: k means ..................................................................................................... 120
3
, LECTURE 1: INTRODUCTION TO BI AND DATABASE SYSTEMS
1. INTRODUCTION TO BUSINESS INTELLIGENCE
Data Information Knowledge
Methods of BI:
1. Descriptive analytics: use data to understand past and present.
Retrospective
2. Diagnostic analytics: explain why something happened.
3. Predictive analytics: predict future behavior based on past
performance.
Prospective
4. Prescriptive analytics: make decisions or recommendations to
achieve the best performance.
Functions of BI: Marketing analytics, Sales analytics, HR analytics, Financial analytics, Supply chain
analytics, Accounting analytics ….
1.1. BUSINESS INTELLIGENCE (BI) VS. BUSINESS ANALYTICS (BA).
These terms are often fighting for dominance, distinguished by the following view:
- BI = data warehousing + descriptive analytics
- BA = predictive + prescriptive analytics
However, the prof thinks they are too similar to really be separated, as both are examples of a Decision
Support System (DSS).
1.2. DEFINITION
= Transforming data into meaningful information/knowledge to support business decision-
making. (general)
= BI is an umbrella term that combines the processes, technologies, and tools needed to
transform data into information, information into knowledge, and knowledge into plans that
drive profitable business action. (process view)
= BI is information and knowledge that enables business decision-making. (output view)
4
Dr. Emiel Caron & Dr. Ekaterini Ioannou & Dr. Poonacha Medappa
TABLE OF CONTENTS
LECTURE 1: INTRODUCTION TO BI AND DATABASE SYSTEMS ........................................................................ 4
1. INTRODUCTION TO BUSINESS INTELLIGENCE ........................................................................................................... 4
1.1. Business Intelligence (BI) vs. Business Analytics (BA). ........................................................................ 4
1.2. Definition ............................................................................................................................................ 4
1.3. Business Intelligence architecture ...................................................................................................... 5
2. INTRODUCTION TO DATABASES ............................................................................................................................ 7
Text 1. Database systems: design, implementation, and management – Carlos Coronel, Steven Morris &
Peter Rob.......................................................................................................................................................... 7
2.1. Database systems ............................................................................................................................. 32
2.2. Relational databases ........................................................................................................................ 32
2.3. Trends in the database world ........................................................................................................... 33
3. READING: DATA WAREHOUSE DESIGN- MODERN PRINCIPLES AND METHODOLOGIES ................................................. 34
LECTURE 2: SQL & DATA WAREHOUSING ................................................................................................... 45
1. INTRODUCTION STRUCTURED QUERY LANGUAGE (SQL) ........................................................................................ 45
1.1. Data types ......................................................................................................................................... 45
1.2. Join types .......................................................................................................................................... 45
2. INTRODUCTION TO DATA WAREHOUSING............................................................................................................. 46
2.1. Why do we need a separate data warehouse? ................................................................................ 47
2.2. DW framework: components............................................................................................................ 47
2.3. DW framework: Architecture............................................................................................................ 50
2.4. Data warehouse architecture variants ............................................................................................. 51
LECTURE 3: OLAP BUSINESS DATABASES & BUSINESS DASHBOARDS ........................................................... 57
1. ETL (EXTRACTION, TRANSFORMATION, & LOAD) ................................................................................................. 57
1.1. Process steps ..................................................................................................................................... 57
1.2. Transformation ................................................................................................................................. 57
1.3. ETL tools ............................................................................................................................................ 59
2. OLAP BUSINESS DATABASES ............................................................................................................................. 60
2.1. Why OLAP? ....................................................................................................................................... 61
2.2. OLAP operators ................................................................................................................................. 62
2.3. Multi-dimentional modelling ............................................................................................................ 64
2.4. Central fact table .............................................................................................................................. 67
2.5. Dimension table ................................................................................................................................ 68
2.6. OLAP software vendors..................................................................................................................... 69
3. TECHNICAL OLAP ISSUES ................................................................................................................................. 70
3.1. Sparse fact table ............................................................................................................................... 70
3.2. History in the stars ............................................................................................................................ 71
1
, 4. BUSINESS DASHBOARDS ................................................................................................................................... 72
4.1. Two theoretical perspectives: ........................................................................................................... 72
LECTURE 4: DATA MINING INTRODUCTION ................................................................................................ 74
1. PYTHON REFRESHER ........................................................................................................................................ 74
2. DECISION MAKING WITH BIG DATA ..................................................................................................................... 75
3. DATA MINING METHODS ................................................................................................................................. 75
4. DATA ........................................................................................................................................................... 75
4.1. Data and types of variables .............................................................................................................. 76
4.2. Sources of data ................................................................................................................................. 77
5. DATA MINING PROCESS(ES)—OVERVIEW OF THE STEPS INVOLVED IN DATA MINING .................................................... 77
Step 1: Develop an understanding of the purpose of the data mining project ............................................. 77
Step 2: Obtain the dataset to be used in the analysis ................................................................................... 77
Step 3: Explore, clean, and preprocess the data ............................................................................................ 78
Step 4: Reduce the data dimension, if necessary........................................................................................... 78
Step 5: Determine the data mining task ........................................................................................................ 78
Step 6: Partition the data (for supervised tasks) ........................................................................................... 78
Step 7: Choose the data mining technique(s) ................................................................................................ 78
Step 8: Use algorithms to perform the task ................................................................................................... 78
Step 9: Interpret the results of the algorithms .............................................................................................. 78
Step 10: Deploy the model ............................................................................................................................. 79
5.1. SEMMA methodology ....................................................................................................................... 79
5.2. CRISP-DM .......................................................................................................................................... 79
LECTURE 5: REGRESSION MODELS ............................................................................................................. 81
1. DEFINITION: REGRESSION ANALYSIS .................................................................................................................... 82
2. LINEAR REGRESSION MODEL ............................................................................................................................ 82
3. VISUALIZATION............................................................................................................................................... 83
4. ORDINARY LEAST SQUARES (OLS) ..................................................................................................................... 83
5. MODEL USAGE AND POSSIBLE ISSUES .................................................................................................................. 84
5.1. Objectives for single/multiple regression ......................................................................................... 84
5.2. Issues................................................................................................................................................. 84
LECTURE 6: CLASSIFICATION...................................................................................................................... 85
1. WHAT IS CLASSIFICATION? ............................................................................................................................... 85
1.1. Classification vs. Clustering .............................................................................................................. 85
1.2. Classification process ........................................................................................................................ 85
1.3. Need for classification ...................................................................................................................... 86
1.4. Model induction and application ...................................................................................................... 86
1.5. Classification techniques .................................................................................................................. 87
2. NAÏVE BAYES FOR CLASSIFICATION ..................................................................................................................... 88
2.1. Naïve Bayes classifier........................................................................................................................ 88
2.2. Some concerns .................................................................................................................................. 90
2.3. Pros and cons .................................................................................................................................... 90
LECTURE 7: K NEAREST NEIGHBORS FOR CLASSIFICATION .......................................................................... 91
1. DETERMINING RECORD’S NEIGHBORS ................................................................................................................. 91
1.1. Euclidean Distance ............................................................................................................................ 91
2
, 1.2. Manhattan Distance ......................................................................................................................... 92
2. CHOOSING THE NUMBER OF NEIGHBORS, I.E., VALUE K .......................................................................................... 92
3. COMPUTING PREDICTION (FOR A NUMERICAL OUTCOME) ....................................................................................... 93
LECTURE 8: PERFORMANCE MEASURES ..................................................................................................... 95
1. EVALUATING PREDICTIVE PERFORMANCE (OF NUMERIC/CONTINUOUS PREDICTIONS) ................................................... 96
1.1. Prediction Accuracy measures .......................................................................................................... 96
1.2. Lift chart............................................................................................................................................ 97
2. JUDGING CLASSIFIER PERFORMANCE (CATEGORICAL VARIABLES EX. BIRDS) ................................................................ 98
2.1. Confusion matrix............................................................................................................................... 98
2.2. Accuracy (overall success rate) ......................................................................................................... 99
2.3. Receiver operating characteristic (ROC) ........................................................................................... 99
2.4. Cost Matrix (as response to the limitation of Accuracy) ................................................................ 100
2.5. Kappa statistic for multiclass prediction......................................................................................... 101
2.6. Precision and Recall ........................................................................................................................ 102
2.7. 𝑭𝟏-measure .................................................................................................................................... 103
LECTURE 9: DECISION TREES.................................................................................................................... 104
1. MAIN PROCESSING........................................................................................................................................ 104
1.1. Induction (with a Greedy Strategy)................................................................................................. 105
2. PROS AND CONS OF DECISION TREES ................................................................................................................. 109
LECTURE 10: ASSOCIATION RULES ........................................................................................................... 110
1. RULES ........................................................................................................................................................ 110
2. TWO-STAGE PROCESS.................................................................................................................................... 111
2.1. Generation of frequent itemsets → Apriory algorithm .................................................................. 111
2.2. Selecting the strong rules i.e., criteria for judging the strength of the rules.................................. 112
3. ALTERNATIVE DATA REPRESENTATION (TO SPEED UP EXECUTION) ........................................................................... 113
LECTURE 11: CLUSTERING ....................................................................................................................... 114
1. CLUSTER ANALYSIS ........................................................................................................................................ 114
1.1. Issues for clustering ........................................................................................................................ 114
2. REPRESENTATION & DISTANCE........................................................................................................................ 115
2.1. Distance .......................................................................................................................................... 115
3. TWO TYPES OF CLUSTERING ............................................................................................................................ 117
3.1. Hierarchical clustering .................................................................................................................... 117
3.2. Partitional Algorithms: k means ..................................................................................................... 120
3
, LECTURE 1: INTRODUCTION TO BI AND DATABASE SYSTEMS
1. INTRODUCTION TO BUSINESS INTELLIGENCE
Data Information Knowledge
Methods of BI:
1. Descriptive analytics: use data to understand past and present.
Retrospective
2. Diagnostic analytics: explain why something happened.
3. Predictive analytics: predict future behavior based on past
performance.
Prospective
4. Prescriptive analytics: make decisions or recommendations to
achieve the best performance.
Functions of BI: Marketing analytics, Sales analytics, HR analytics, Financial analytics, Supply chain
analytics, Accounting analytics ….
1.1. BUSINESS INTELLIGENCE (BI) VS. BUSINESS ANALYTICS (BA).
These terms are often fighting for dominance, distinguished by the following view:
- BI = data warehousing + descriptive analytics
- BA = predictive + prescriptive analytics
However, the prof thinks they are too similar to really be separated, as both are examples of a Decision
Support System (DSS).
1.2. DEFINITION
= Transforming data into meaningful information/knowledge to support business decision-
making. (general)
= BI is an umbrella term that combines the processes, technologies, and tools needed to
transform data into information, information into knowledge, and knowledge into plans that
drive profitable business action. (process view)
= BI is information and knowledge that enables business decision-making. (output view)
4