QM1 Statistics terms
Chapter 1: data and decisions
Big data: the collection and analysis of data sets so large and complex that traditional
methods typically brought to bear on the problem would be overwhelmed.
Business analytics: the process of using statistical analysis and modeling to drive
business decisions.
Case (record/row): a case is an individual about whom or which we have data.
Categorical (or qualitative) variable: a variable that names categories (whether with
words or numerals).
Context: the context ideally tells who was measured, what was measured, how the data
were collected, where the data were collected, and when and why the study was
performed.
Cross-sectional data: data taken from situations that vary over time but measured at a
single time instant are said to be a cross-section of the time series.
Data: recorded values, whether numbers or labels, together with their context.
Data mining (or predictive analytics): the process of using a variety of statistical tools to
analyze large databases or data warehouses.
Data table: an arrangement of data in which each row represents a case, and each
column represents a variable.
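As an illustration of the row/column layout, here is a minimal sketch of a data table in
Python. The variable names (order_id, region, satisfaction, amount_eur) and all values are
hypothetical, chosen only to show an identifier, a nominal, an ordinal, and a quantitative
variable side by side.

```python
# Hypothetical data table: each dict is one case (row), each key is a variable (column).
data_table = [
    {"order_id": "A101", "region": "North", "satisfaction": "high", "amount_eur": 25.0},
    {"order_id": "A102", "region": "South", "satisfaction": "low", "amount_eur": 40.5},
    {"order_id": "A103", "region": "North", "satisfaction": "medium", "amount_eur": 12.0},
]


def column(table, variable):
    """Return the values of one variable across all cases."""
    return [case[variable] for case in table]


# A quantitative variable has units (euros here), so arithmetic on it is meaningful.
amounts = column(data_table, "amount_eur")
mean_amount = sum(amounts) / len(amounts)
```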
Data warehouse: a large database of information collected by a company or other
organization usually to record transactions that the organization makes, but also used
for analysis via data mining.
Experimental unit: an individual in a study for which or for whom data values are
recorded. Human experimental units are usually called subjects or participants.
Identifier variable: a categorical variable that records a unique value for each case, used
to name or identify it.
Metadata: auxiliary information about variables in a database, typically including how,
when, and where (and possibly why) the data were collected; who each case represents;
and the definitions of all the variables.
Nominal variable: the term “nominal” can be applied to a variable whose values are used
only to name categories.
Ordinal variable: the term “ordinal” can be applied to a variable whose categorical values
possess some kind of order.
Participant (subject): a human experimental unit.
Quantitative variable: a variable in which the numbers are values of measured quantities
with units.
Record: information about an individual in a database.
Relational database: a relational database stores and retrieves information. Within the
database, information is kept in data tables that can be “related” to each other.
Respondent: someone who answers, or responds to, a survey.
Spreadsheet: a spreadsheet is a layout designed for accounting that is often used to
store and manage data tables.
Subject (participant): a human experimental unit.
Time series: data measured over time. Usually, the time intervals are equally spaced or
regularly spaced.
Units: a quantity or amount adopted as a standard of measurement.
Variable: a variable holds information about the same characteristic for many cases.
Chapter 2: visualizing and describing categorical data
Area principle: in a statistical display, each data value is represented by the same
amount of area.
Bar chart (relative bar chart): a chart that represents the count (or percentage) of each
category in a categorical variable as a bar, allowing easy visual comparisons across
categories.
Cell: each location in a contingency table, representing the values of two categorical
variables, is called a cell.
Column percent: the proportion of each column contained in the cell of a frequency
table.
Conditional distribution: the distribution of a variable when the “who” is restricted to
only a smaller group of individuals.
Contingency table: a table displaying the frequencies (sometimes percentages) for each
combination of two or more variables.
Distribution: the distribution of a variable is a list of all the possible values of the
variable and the relative frequency of each value.
Frequency table (relative frequency table): a table that lists the categories in a
categorical variable and gives the number (the percentage) of observations for each
category.
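A frequency table and its relative version can be sketched in a few lines of Python. The
responses below are hypothetical; the sketch only assumes that the variable is stored as a
list of category labels.

```python
from collections import Counter

# Hypothetical responses to one categorical variable.
responses = ["yes", "no", "yes", "yes", "undecided", "no", "yes", "no"]

counts = Counter(responses)          # frequency table: category -> count
total = sum(counts.values())
relative = {category: n / total for category, n in counts.items()}

# Print count and percentage per category, most frequent first.
for category, n in counts.most_common():
    print(f"{category:<10} {n:>3} {relative[category]:>7.1%}")
```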
Independent variables: variables for which the conditional distribution of one variable is
the same for each category of the other.
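The definition above can be checked on counts: compute the conditional distribution of the
second variable within each category of the first and compare them. This is a minimal
sketch on hypothetical data; the function names are illustrative, and in real samples the
distributions are rarely exactly equal, so a formal test would be needed rather than this
exact comparison.

```python
from collections import Counter


def conditional_distributions(cases):
    """Map each category of the first variable to the relative
    frequencies of the second variable within that category."""
    cells = Counter(cases)
    row_totals = Counter(row for row, _ in cases)
    result = {}
    for (row, col), n in cells.items():
        result.setdefault(row, {})[col] = n / row_totals[row]
    return result


def looks_independent(cases, tol=1e-9):
    """True if every conditional distribution matches the first one within tol."""
    dists = list(conditional_distributions(cases).values())
    cols = {col for _, col in cases}
    first = dists[0]
    return all(
        abs(d.get(c, 0.0) - first.get(c, 0.0)) <= tol for d in dists for c in cols
    )


# Hypothetical (first variable, second variable) pairs.
independent = [("A", "x")] * 2 + [("A", "y")] * 6 + [("B", "x")] * 1 + [("B", "y")] * 3
dependent = [("A", "x")] * 6 + [("A", "y")] * 2 + [("B", "x")] * 1 + [("B", "y")] * 3
```

In the first data set, 25% of both A and B cases fall in category x, so the conditional
distributions agree; in the second they are 75% and 25%.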
Marginal distribution: in a contingency table, the distribution of either variable alone. The
counts or percentages are the totals found in the margins (usually the right-most column
or bottom row) of the table.
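A two-way contingency table and its margins can be built by counting pairs. The sketch
below uses hypothetical (region, payment method) cases; the variable names are
assumptions made for illustration.

```python
from collections import Counter

# Hypothetical cases: (region, payment method) pairs.
cases = [
    ("North", "card"), ("North", "card"), ("North", "cash"),
    ("South", "card"), ("South", "cash"), ("South", "cash"),
    ("South", "cash"), ("North", "card"),
]

cell_counts = Counter(cases)                        # one count per cell
row_totals = Counter(region for region, _ in cases)  # marginal distribution of region
col_totals = Counter(pay for _, pay in cases)        # marginal distribution of payment
```

The row and column totals are what would appear in the right-most column and bottom row
of the printed table.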
Mosaic plot: a mosaic plot is a graphical representation of a (usually two-way)
contingency table. The plot is divided into rectangles so that the area of each rectangle
is proportional to the number of cases in the corresponding cell.
Pie chart: pie charts show how a “whole” divides into categories by showing a wedge of a
circle whose area corresponds to the proportion in each category.
Row percent: the proportion of each row contained in the cell of a frequency table.
Segmented (or stacked) bar chart: a segmented bar chart displays the conditional
distribution of a categorical variable within each category of another variable.
Simpson’s paradox: a phenomenon that arises when averages (or percentages) taken within
separate groups appear to contradict the overall averages.
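A small numerical sketch can make the paradox concrete. The counts below are hypothetical:
two sales representatives, A and B, each handle "easy" and "hard" accounts. A has the
higher success rate within each account type, yet B has the higher overall rate, because
B handles mostly easy accounts.

```python
# Hypothetical counts: (successes, attempts) per representative and account type.
results = {
    "A": {"easy": (9, 10), "hard": (30, 90)},
    "B": {"easy": (80, 90), "hard": (3, 10)},
}


def rate(successes, attempts):
    return successes / attempts


# Success rate within each group.
group_rates = {
    rep: {group: rate(*counts) for group, counts in groups.items()}
    for rep, groups in results.items()
}

# Success rate after pooling the groups.
overall_rates = {
    rep: rate(
        sum(s for s, _ in groups.values()),
        sum(a for _, a in groups.values()),
    )
    for rep, groups in results.items()
}
```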
Total percent: the proportion of the total contained in the cell of a frequency table.
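Row, column, and total percents differ only in the denominator used for a cell. The
following sketch computes all three for one cell of a hypothetical two-way table; the
table contents and the function name are illustrative assumptions.

```python
# Hypothetical two-way table of counts: table[row][column].
table = {
    "North": {"card": 30, "cash": 10},
    "South": {"card": 20, "cash": 40},
}


def percents(table, row, col):
    """Return the row, column, and total percent for one cell."""
    cell = table[row][col]
    row_total = sum(table[row].values())
    col_total = sum(r[col] for r in table.values())
    grand_total = sum(sum(r.values()) for r in table.values())
    return {
        "row_percent": 100 * cell / row_total,
        "column_percent": 100 * cell / col_total,
        "total_percent": 100 * cell / grand_total,
    }


north_card = percents(table, "North", "card")
```

For the North/card cell, the count 30 is divided by the North row total (40), the card
column total (50), and the grand total (100) respectively.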