Class notes

DATA QUALITY AND TRANSFORMATION

Short notes on ensuring data accuracy and consistency, and transforming data into suitable formats for analysis and modeling.

Written for

Institution: Sathyabama Institute Of Science And Technology
Course: Data Science and Information

Document information

Uploaded on: November 8, 2025
Number of pages: 61
Written in: 2025/2026
Type: Class notes
Professor(s): Abirami
Contains: All classes

Content preview

SCSB1231 DATA AND INFORMATION SCIENCE

UNIT 3 DATA QUALITY AND TRANSFORMATION

Data Imputation – Data Transformation (min-max, log transform, z-score transform, etc.) –
Binning, Classing and Standardization – Outlier/Noise and Anomalies.

Data Imputation:

Data imputation is the process of replacing missing or incomplete data points in a
dataset with estimated or substituted values. These estimated values are typically derived
from the available data, statistical methods, or machine learning algorithms.

Filling in missing values keeps the dataset complete, so analysis, modeling, and visualization
can use every record instead of discarding incomplete ones. This maintains sample size and,
when the imputation method suits the way the data went missing, limits the bias that dropping
records would introduce.

Importance of Data Imputation in Analysis

Imputation matters because many statistical methods and machine learning algorithms cannot
handle missing values and would otherwise require incomplete records to be dropped. Dropping
records discards valuable information and shrinks the sample, which can make results biased
or less reliable. A well-chosen imputation method maintains sample size and supports more
reliable data-driven insights; a poorly chosen one can introduce bias of its own.

Types of Missing Data
Missing data is classified into three types according to what the missingness depends on:

1. Missing Completely at Random (MCAR)
In this type, the probability of a value being missing is unrelated to both observed and
unobserved data. In other words, missingness is purely random and occurs by chance; it is
not systematically related to any variable in the dataset. For example, a sensor failure that
results in sporadic missing temperature readings can be considered MCAR.

2. Missing at Random (MAR)

Missing data is considered MAR when the probability of a value being missing is related to
observed data but not to the unobserved values themselves. In other words, missingness depends
on some observed variables. For instance, in a medical study, men might be less likely to
report certain health conditions than women, creating missing data related to the gender
variable, which is itself observed. MAR is a more general and more common assumption than MCAR.

3. Missing Not at Random (MNAR)
MNAR occurs when the probability of data being missing is related to unobserved data or the
missing values themselves. This type of missing data can introduce bias into analyses because
the missingness is related to the missing values. An example of MNAR could be patients with
severe symptoms avoiding follow-up appointments, resulting in missing data related to the
severity of their condition.
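
The three mechanisms above can be sketched with simulated data. This is a minimal illustration, not part of the original notes: the variable names (`is_male`, `severity`), the missingness rates, and the threshold of 60 are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: an always-observed group flag and a severity score.
is_male = rng.random(n) < 0.5
severity = rng.normal(loc=50.0, scale=10.0, size=n)

# MCAR: every value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.20

# MAR: the chance of being missing depends only on the observed flag.
mar_mask = rng.random(n) < np.where(is_male, 0.30, 0.10)

# MNAR: the chance of being missing depends on the value that goes missing.
mnar_mask = rng.random(n) < np.where(severity > 60.0, 0.50, 0.10)


def observed_mean(values, mask):
    """Mean of the values that remain after removing the masked entries."""
    return values[~mask].mean()


true_mean = severity.mean()
mcar_mean = observed_mean(severity, mcar_mask)
mar_mean = observed_mean(severity, mar_mask)
mnar_mean = observed_mean(severity, mnar_mask)

print(f"true={true_mean:.2f} MCAR={mcar_mean:.2f} "
      f"MAR={mar_mean:.2f} MNAR={mnar_mean:.2f}")
```

Under MNAR the observed mean is pulled downward because high scores go missing more often. In this particular simulation the severity score is generated independently of the group flag, so the MAR mask leaves the mean roughly unbiased; that would not hold if the two variables were related.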

Data Imputation Techniques:

There are several methods and techniques for data imputation, each with its strengths and
suitability depending on the nature of the data and the analysis goals. Let’s discuss some
commonly used data imputation techniques:

1. Mean/Median/Mode Imputation

• Mean Imputation: Replace missing values in numerical variables with the average of
the observed values for that variable.
• Median Imputation: Replace missing values in numerical variables with the middle
value of the observed values for that variable.
• Mode Imputation: Replace missing values in categorical variables with the most
frequent category among the observed values for that variable.
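
The three replacements above can be sketched with pandas as follows. The table and its column names (`age`, `income`, `city`) are made up for illustration; the `income` column includes one extreme value to show why the median is sometimes preferred over the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value per column.
df = pd.DataFrame({
    "age": [22.0, 25.0, np.nan, 31.0, 22.0],
    "income": [30.0, 32.0, 35.0, np.nan, 500.0],  # 500.0 is an outlier
    "city": ["Chennai", None, "Chennai", "Madurai", "Chennai"],
})

imputed = df.copy()

# Mean imputation: average of the observed ages, (22 + 25 + 31 + 22) / 4 = 25.0.
imputed["age"] = df["age"].fillna(df["age"].mean())

# Median imputation: middle of the observed incomes, (32 + 35) / 2 = 33.5.
# The mean would be 149.25 here, dragged up by the outlier.
imputed["income"] = df["income"].fillna(df["income"].median())

# Mode imputation: most frequent observed category.
imputed["city"] = df["city"].fillna(df["city"].mode().iloc[0])

print(imputed)
```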

Advantages

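• Simplicity: each replacement value is a single summary statistic that is quick to compute.
• Preserves Data Structure: no rows or columns are dropped, so the sample size is unchanged.
• Applicability: covers both numerical variables (mean, median) and categorical variables
(mode).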

Disadvantages and Considerations

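• Ignores Data Relationships: each variable is imputed on its own, which weakens its
correlations with other variables.
• May Distort Data: repeating one value reduces the variance of the variable and creates an
artificial peak in its distribution.
• Inappropriate for Missing Data Patterns: when data is MAR or MNAR, the observed values are
not representative of the missing ones, so the imputed value can be biased.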
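
The distortion mentioned in the list above can be checked numerically. The sketch below uses simulated data (the variables `x` and `y`, the 30% missing rate, and the seed are assumptions for illustration) and compares the variance of a variable, and its correlation with another variable, before and after mean imputation.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Two related variables; y depends linearly on x plus noise.
x = rng.normal(0.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 1.0, n)

# Remove about 30% of y completely at random.
missing = rng.random(n) < 0.30
y_observed = y[~missing]

# Mean imputation: every missing y gets the mean of the observed y values.
y_imputed = np.where(missing, y_observed.mean(), y)

var_before = y_observed.var(ddof=1)
var_after = y_imputed.var(ddof=1)
corr_before = np.corrcoef(x[~missing], y_observed)[0, 1]
corr_after = np.corrcoef(x, y_imputed)[0, 1]

print(f"variance:    {var_before:.3f} -> {var_after:.3f}")
print(f"correlation: {corr_before:.3f} -> {corr_after:.3f}")
```

The imputed values sit exactly at the mean, so they add nothing to the spread of `y` and carry no information about `x`; both the variance and the correlation shrink.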

When to Use:

• Use mean imputation for numerical variables when values are missing completely at random
(MCAR) and the variable is approximately normally distributed.
• Use median imputation when the data is skewed or contains outliers, as the median is less
sensitive to extreme values.
• Use mode imputation for categorical variables when missing values can reasonably be
replaced with the most frequent category.

2. Forward Fill and Backward Fill

• Forward Fill: In forward fill imputation, missing values are replaced with the most
recent observed value in the sequence. It propagates the last known value forward
until a new observation is encountered.
• Backward Fill: In backward fill imputation, missing values are replaced with the
next observed value in the sequence. It propagates the next known value backward
until a new observation is encountered.

Both methods assume the data is ordered, typically by time. Forward fill cannot fill missing
values at the start of the sequence because no earlier observation exists, and backward fill
likewise cannot fill missing values at the end of the sequence.
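
A minimal sketch of both fills on a time-ordered series, using pandas. The temperature readings and dates are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings with gaps.
readings = pd.Series(
    [np.nan, 21.0, np.nan, np.nan, 23.5, np.nan],
    index=pd.date_range("2025-01-01", periods=6, freq="D"),
    name="temperature",
)

# Forward fill: carry the last observed value forward.
forward = readings.ffill()    # NaN, 21.0, 21.0, 21.0, 23.5, 23.5

# Backward fill: pull the next observed value backward.
backward = readings.bfill()   # 21.0, 21.0, 23.5, 23.5, 23.5, NaN

print(pd.DataFrame({"raw": readings, "ffill": forward, "bfill": backward}))
```

The first reading stays missing under forward fill and the last stays missing under backward fill, since there is no neighbouring observation in the required direction.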
