SCSB1231 DATA AND INFORMATION SCIENCE
UNIT 3 DATA QUALITY AND TRANSFORMATION
Data Imputation – Data Transformation (min-max, log transform, z-score transform, etc.) –
Binning, Classing and Standardization – Outlier/Noise & Anomalies.
Data Imputation:
Data imputation is the process of replacing missing or incomplete data points in a
dataset with estimated or substituted values. These estimated values are typically derived
from the available data, statistical methods, or machine learning algorithms.
Imputation fills missing values so that the dataset stays complete and usable. It prevents the
data loss that comes from discarding incomplete records and so maintains sample size for
analysis, modeling, and visualization. When the method suits the way the data went missing,
imputation can reduce bias and preserve relationships between variables, which supports
better decision-making and insights from incomplete data.
Importance of Data Imputation in Analysis
Data imputation is crucial in data analysis because many statistical methods and machine
learning algorithms cannot handle missing values directly. Imputing those values makes such
methods usable and can improve model accuracy and predictive power.
Without imputation, records with missing values are often dropped, so valuable information
may be lost, leading to biased or less reliable results. Imputation helps maintain sample
size and, when applied appropriately, improves the quality and reliability of data-driven
insights.
Types of Missing Data
Missing data is commonly classified into three types:
1. Missing Completely at Random (MCAR)
In this type, the probability of data being missing is unrelated to both observed and
unobserved data. In other words, the missingness is purely random and occurs by chance. MCAR
implies that the missingness is not systematically related to any variable in the dataset. For
example, a sensor failure that results in sporadic missing temperature readings can be
considered MCAR.
2. Missing at Random (MAR)
Missing data is considered MAR when the probability of data being missing is related to
observed data but, once that observed data is taken into account, not to the unobserved
values. In other words, missingness depends on some observed variables. For instance, in a
medical study, men might be less likely to report certain health conditions than women,
creating missing data related to the observed gender variable. MAR is a more general and
common type of missing data than MCAR.
3. Missing Not at Random (MNAR)
MNAR occurs when the probability of data being missing is related to unobserved data or the
missing values themselves. This type of missing data can introduce bias into analyses because
the missingness is related to the missing values. An example of MNAR could be patients with
severe symptoms avoiding follow-up appointments, resulting in missing data related to the
severity of their condition.
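The three mechanisms can be contrasted with a small simulation. The sketch below is illustrative only: the function names make_missing and missing_rate, the "group" and "score" fields, and all probabilities are assumptions chosen for the example, not part of any standard library.

```python
import random


def make_missing(rows, mechanism, seed=0):
    """Return a copy of rows in which some 'score' values are replaced by None."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    result = []
    for row in rows:
        if mechanism == "MCAR":
            # Same probability for every row: unrelated to any variable.
            p = 0.3
        elif mechanism == "MAR":
            # Depends only on the observed 'group' column.
            p = 0.5 if row["group"] == "A" else 0.1
        elif mechanism == "MNAR":
            # Depends on the very value that goes missing.
            p = 0.6 if row["score"] > 70 else 0.05
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        score = None if rng.random() < p else row["score"]
        result.append({"group": row["group"], "score": score})
    return result


def missing_rate(rows):
    """Fraction of rows whose 'score' is missing."""
    return sum(r["score"] is None for r in rows) / len(rows)


# Hypothetical dataset: two groups, scores cycling through 0..99.
rows = [{"group": "A" if i % 2 == 0 else "B", "score": i % 100}
        for i in range(2000)]
for mechanism in ("MCAR", "MAR", "MNAR"):
    print(mechanism, round(missing_rate(make_missing(rows, mechanism)), 3))
```

Under MNAR the high scores go missing far more often than the low ones, so statistics computed from the remaining values would be biased downward; under MCAR the remaining values are still a random sample of the original data.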
Data Imputation Techniques:
There are several methods and techniques for data imputation, each with its strengths and
suitability depending on the nature of the data and the analysis goals. Let’s discuss some
commonly used data imputation techniques:
1. Mean/Median/Mode Imputation
• Mean Imputation: Replace missing values in numerical variables with the average of
the observed values for that variable.
• Median Imputation: Replace missing values in numerical variables with the middle
value of the observed values for that variable.
• Mode Imputation: Replace missing values in categorical variables with the most
frequent category among the observed values for that variable.
Advantages
• Simplicity
• Preserves Data Structure
• Applicability
Disadvantages and Considerations
• Ignores Data Relationships
• May Distort Data
• Inappropriate for Some Missing Data Patterns (e.g., MAR or MNAR)
When to Use:
• Use mean imputation for numerical variables when data is missing
completely at random (MCAR) and the variable has a roughly normal distribution.
• Use median imputation when the data is skewed or contains outliers, as it is less
sensitive to extreme values.
• Use mode imputation for categorical variables when you have missing values that can
be reasonably replaced with the most frequent category.
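The three strategies can be sketched in a few lines using only Python's standard statistics module. The helper name impute and the use of None to mark a missing value are assumptions made for this example; data libraries offer their own equivalents.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace each None in values with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    strategies = {"mean": mean, "median": median, "mode": mode}
    fill = strategies[strategy](observed)  # one fill value for the whole column
    return [fill if v is None else v for v in values]


# Hypothetical data: 90 is an outlier that pulls the mean upward.
ages = [20, 22, None, 24, 90]
colors = ["red", None, "blue", "red"]

print(impute(ages, "mean"))    # gap filled with 39
print(impute(ages, "median"))  # gap filled with 23.0
print(impute(colors, "mode"))  # gap filled with 'red'
```

The single outlier moves the mean fill value to 39 while the median stays at 23, which is why the median is the safer choice for skewed data.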
2. Forward Fill and Backward Fill
• Forward Fill: In forward fill imputation, missing values are replaced with the most
recent observed value in the sequence. It propagates the last known value forward
until a new observation is encountered.
• Backward Fill: In backward fill imputation, missing values are replaced with the
next observed value in the sequence. It propagates the next known value backward
until an earlier observation is encountered.
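Both methods assume ordered data, such as a time series. Forward fill replaces each
missing value with the most recent observed value that precedes it in time; backward
fill replaces it with the next observed value that follows it in time. Forward fill
cannot fill gaps at the start of a sequence, and backward fill cannot fill gaps at the end.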
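A minimal sketch of both fills on a time-ordered list, assuming None marks a missing reading. The function names forward_fill and backward_fill are chosen for this example only.

```python
def forward_fill(values):
    """Replace each None with the most recent observed value before it."""
    filled = []
    last = None  # nothing observed yet, so leading gaps stay None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def backward_fill(values):
    """Replace each None with the next observed value after it."""
    # A backward fill is a forward fill over the reversed sequence.
    return forward_fill(values[::-1])[::-1]


# Hypothetical hourly readings with gaps at the start, middle, and end.
readings = [None, 10, None, None, 13, None]

print(forward_fill(readings))   # [None, 10, 10, 10, 13, 13]
print(backward_fill(readings))  # [10, 10, 13, 13, 13, None]
```

The gap at the start survives the forward fill and the gap at the end survives the backward fill, because no observed value lies on the side each method draws from.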