Statistics & data science 188
Jess Rolfe
SU 2024 - module 188
What is statistics?
Statistics
• Is the collection of methods that allow one to work with data effectively
Statistics is a TOOL
• To obtain information from data
It provides us with a formal basis to
- summarise and visualise data
- reach conclusions about the data
- make reliable predictions about business activities
- improve the business process
Statistics must be applied correctly
→ many professionals make errors by misusing statistical methods or by treating statistics as
a substitute for, rather than an enhancement of, a decision-making process
DCOVA framework
DCOVA
• A framework used in statistics to minimise errors
• It organises a set of tasks to apply statistics correctly
Define
- the data that you want to study to meet an objective
Collect
- the data from appropriate sources
Organise
- the data collected by developing tables
Visualise
- the data by developing charts
Analyse
- the data collected, reach conclusions and present the results
*Note that the Define and Collect steps must be done before the others. The remaining
three can be done in varying order
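As a rough illustration of the five DCOVA tasks, the sketch below walks a tiny set of order records through each step in Python. The data, the variable name payment_method and the helper text_bar_chart are assumptions invented for this example; they are not part of the module material.

```python
from collections import Counter

# Define: the variable of interest is the payment method used per order.
# Collect: in practice the data would come from an appropriate source;
# here it is a hard-coded, made-up list.
payment_method = ["card", "cash", "card", "eft", "card", "cash"]

# Organise: build a frequency table (category -> count).
frequency_table = Counter(payment_method)


# Visualise: build a simple text bar chart, one '#' per observation.
def text_bar_chart(table):
    lines = []
    for category, count in sorted(table.items()):
        lines.append(f"{category:<5} {'#' * count}")
    return "\n".join(lines)


print(text_bar_chart(frequency_table))

# Analyse: reach a conclusion, e.g. which payment method is most common.
most_common_method, most_common_count = frequency_table.most_common(1)[0]
share = most_common_count / len(payment_method)
print(f"Most common: {most_common_method} ({share:.0%} of orders)")
```

Organise and Visualise could be swapped here without changing the result, which matches the note that only Define and Collect have to come first.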
Business analytics
Business analytics
• combines statistical methods
• with management science and information systems
• to form an interdisciplinary tool
• that supports fact-based decision making
This includes
- statistical methods to analyse and explore data that can uncover previously unknown
or unforeseen relationships
- information systems methods to collect and process datasets of all sizes, including very
large datasets that would otherwise be hard to use effectively
- management science methods to develop optimisation models that support all levels of
management, from strategic planning to daily operations
Data science
• is the field of study
• that combines domain expertise, programming skills and knowledge of mathematics &
statistics
• to extract meaningful insights from data
• Data science practitioners use their methods to:
➜ use a wide range of tools and techniques for evaluating and preparing data
➜ extract insights from data using predictive analytics and artificial intelligence (AI),
including machine learning and deep learning models
➜ write applications that automate data processing and calculations
➜ tell and illustrate stories that clearly convey the meaning of results to decision-makers
and stakeholders at every level of technical knowledge and understanding
➜ explain how these results can be used to solve business problems
Big data
Big data
• is a collection of data that cannot be easily browsed or analysed using traditional
methods
Big data is collected in
• massive volumes
• at very fast rates (real time)
• and in a variety of forms
It refers to
- large datasets of structured data
- stored in files or worksheets
Big data may also be unstructured
- such that the data have an irregular pattern and contain values that are not
comprehensible without further interpretation
Unstructured data
• could be text, pictures, videos or audio
Definitions and terminology
• A variable = a characteristic or property of an item that can vary among the
occurrences of those items
• Data = a set of values associated with one or more variables
• *note that each value for a variable is a single fact - not a list of facts
• Statistics = the methods that analyse the data of the variables of interest
• Descriptive statistics = methods of organising, summarising and presenting data in an
informative and convenient way
• Inferential statistics = the methods used to reach a conclusion about a characteristic of
a population, based on a smaller sample of the population
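The difference between the two can be sketched in Python using only the standard library. The sales figures are invented, and the interval uses a normal approximation with z = 1.96, which is an assumption of this sketch; with a sample this small a t-based interval would usually be preferred.

```python
import statistics

# Hypothetical sample of 8 daily sales figures (invented for illustration).
sample = [120, 135, 128, 142, 119, 131, 125, 140]

# Descriptive statistics: summarise the sample itself.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)

# Inferential statistics: use the sample to say something about the
# population mean. Assumption: normal-approximation 95% interval, z = 1.96.
standard_error = sample_sd / len(sample) ** 0.5
lower = sample_mean - 1.96 * standard_error
upper = sample_mean + 1.96 * standard_error

print(f"Sample mean: {sample_mean:.1f}, sample sd: {sample_sd:.1f}")
print(f"Approximate 95% interval for the population mean: {lower:.1f} to {upper:.1f}")
```

The first two values only describe the eight observations; the interval is a statement about the larger population the sample was drawn from.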
Classifying variables by type
1. Categorical (qualitative) variables take categories as their values, such as “yes”, “no”,
or “blue”, “brown”, “green”
2. Numerical (quantitative) variables have values that represent a counted or measured
quantity
• Discrete variables arise from a counting process. Values are countable over a finite
range
• Continuous variables arise from a measuring process. Values are uncountable over a
finite range
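The classification above can be sketched as a rule of thumb in Python. The function classify and the example variables are hypothetical, and the rule (text is categorical, whole numbers are discrete, decimals are continuous) is only an assumption for illustration: coded values such as postal codes are stored as numbers but are still categorical.

```python
def classify(values):
    """Guess a variable's type from its values (rough rule of thumb only)."""
    def is_number(v):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if all(isinstance(v, str) for v in values):
        return "categorical"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "numerical (discrete)"
    if all(is_number(v) for v in values):
        return "numerical (continuous)"
    return "unclear"


# Hypothetical example variables.
eye_colour = ["blue", "brown", "green"]   # categories
items_bought = [0, 2, 5]                  # counted
height_m = [1.62, 1.80, 1.75]             # measured

for name, values in [("eye_colour", eye_colour),
                     ("items_bought", items_bought),
                     ("height_m", height_m)]:
    print(name, "->", classify(values))
```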
measurement scales
Nominal scale
• Classifies categorical data into distinct categories in which no ranking is implied
Ordinal scale
• Classifies categorical data into distinct categories in which ranking is implied
Numerical variables use an interval scale or ratio scale
• An interval scale is an ordered scale in which the difference between measurements is
a meaningful quantity but the measurements do not have a true zero point
e.g. temperature (in degrees Celsius or Fahrenheit) or standardised exam score
• A ratio scale is an ordered scale in which the difference between the measurements is
a meaningful quantity and the measurements have a true zero point
e.g. height, weight, age, salary
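A short Python sketch shows why the true zero point matters. The temperatures and weights are made-up values; the Kelvin conversion (add 273.15 to degrees Celsius) is used only to show what the ratio looks like on a scale that does have a true zero.

```python
# Interval scale: Celsius has no true zero point.
temp_a_celsius = 10.0
temp_b_celsius = 20.0

# Differences are meaningful: B is 10 degrees warmer than A.
difference = temp_b_celsius - temp_a_celsius

# The naive ratio suggests B is "twice as hot", which is misleading.
naive_ratio = temp_b_celsius / temp_a_celsius

# On the Kelvin scale, which has a true zero, the ratio is close to 1.
kelvin_ratio = (temp_b_celsius + 273.15) / (temp_a_celsius + 273.15)

# Ratio scale: weight has a true zero, so "twice as heavy" is meaningful.
weight_a_kg = 40.0
weight_b_kg = 80.0
weight_ratio = weight_b_kg / weight_a_kg

print(f"Temperature difference: {difference} degrees")
print(f"Naive Celsius ratio: {naive_ratio:.2f}, Kelvin ratio: {kelvin_ratio:.3f}")
print(f"Weight ratio: {weight_ratio:.2f}")
```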
population and sample
➜ data are collected from either a population or a sample
population
• A population contains all the items or individuals of interest that you seek to study
sample
• A sample contains only a portion of a population of interest
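The relationship between the two can be sketched in Python. The population of 1 000 customer IDs, the sample size of 50 and the seed are assumptions chosen for illustration, and simple random sampling is just one possible way of drawing a sample.

```python
import random

# Hypothetical population: the ID numbers of all 1 000 customers of interest.
population = list(range(1, 1001))

# Seeded generator so that the sketch gives the same sample on every run.
rng = random.Random(188)

# A sample: a portion of the population, drawn at random without replacement.
sample = rng.sample(population, k=50)

print(f"Population size: {len(population)}")
print(f"Sample size: {len(sample)}")
print(f"Share of the population sampled: {len(sample) / len(population):.0%}")
```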