Introduction to Data Analytics D491 (A+ Solutions)
Data Transformation - ANS-Data mapping: converting data from one format to another.
Data deduplication: eliminating repeated or redundant data.
Derived variables: creating new variables from existing ones.
Data sorting or ordering: arranging data in a specific sequence.
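The four transformation types above can be sketched in a few lines of Python. This is only an illustration: the records and the field names (name, price, qty, total) are made up for the example and do not come from the course material.

```python
# Hypothetical input records; every field arrives as text.
raw = [
    {"name": "Ana", "price": "10.50", "qty": "2"},
    {"name": "Ben", "price": "3.00", "qty": "5"},
    {"name": "Ana", "price": "10.50", "qty": "2"},  # exact duplicate
]

# Data mapping: convert the text fields to numeric types.
mapped = [
    {"name": r["name"], "price": float(r["price"]), "qty": int(r["qty"])}
    for r in raw
]

# Data deduplication: keep only the first occurrence of each record.
seen = set()
deduped = []
for r in mapped:
    key = (r["name"], r["price"], r["qty"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Derived variable: compute a new field from existing ones.
for r in deduped:
    r["total"] = r["price"] * r["qty"]

# Data sorting: arrange the records by the derived total, ascending.
ordered = sorted(deduped, key=lambda r: r["total"])
print(ordered)
```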
Data Transformation occurs in the ______ _____________ phase; the role this applies
to is the _____ A______ - ANS-Data Preparation phase, Data Analyst
____________ is great at combining unstructured data feeds from multiple sources. -
ANS-Hadoop
Examples of when to use _____: stream processing, fraud detection and prevention,
content management, risk management. - ANS-Hadoop
Setting up the sandbox, extracting and transforming data, conditioning data, and exploring it visually
occur in the ______ ____________ phase - ANS-Data Preparation
When you convert a Microsoft Word file to a PDF, for example, you are ________ data -
ANS-transforming
Running a Linux virtual machine on a Windows operating system is an example of
---------- - ANS-sandboxing
Some key features of an Analytical Sandbox may include tools and features for
c---------- and s---ing work with colleagues. Flexibility to allow analysts to try out different
analytical approaches and techniques. Clear documentation and support resources to
help analysts get up to speed quickly. - ANS-collaboration, sharing
Why is it important to collect data in a certain time frame? - ANS-Result: more precise
findings than working with an open-ended timeframe.
___ testing works by randomly showing two versions of the same asset (ad, website,
pop-up, offer, etc.) to different users - ANS-A/B
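The random assignment step of an A/B test might look like the sketch below. The function name assign_variant and the fixed seed are assumptions made so the example is repeatable; a real test would also need a way to measure and compare the outcome for each group.

```python
import random

def assign_variant(user_ids, seed=0):
    """Randomly assign each user to version "A" or version "B"."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    return {uid: rng.choice(["A", "B"]) for uid in user_ids}

assignment = assign_variant(range(1000))

# Count how many users ended up seeing each version.
counts = {v: sum(1 for x in assignment.values() if x == v) for v in ("A", "B")}
print(counts)
```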
What does it mean for a dependent variable to be binary?
(binary logistic regression requires a binary dependent variable) - ANS-A binary variable is a categorical
variable that can only take one of two values, usually represented as a Boolean (True
or False) or an integer (0 or 1): yes or no, sick or not sick, obese or
not obese, etc. Which value it takes depends on the independent variable(s).
Use ______ Analysis when you're looking to segment or categorize a dataset into groups
based on similarities, but aren't sure what those groups should be. - ANS-Cluster
Analysis
Preprocessing (of data) - ANS-the process of transforming raw data into an
understandable format
Bounce Rate - ANS-the percentage of visitors to a particular website who navigate
away from the site after viewing only one page.
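As a worked example of the definition above, the bounce rate can be computed from the number of pages viewed in each session. The session counts here are invented for illustration.

```python
def bounce_rate(pages_per_session):
    """Return the percentage of sessions that viewed exactly one page."""
    if not pages_per_session:
        return 0.0
    bounces = sum(1 for pages in pages_per_session if pages == 1)
    return 100.0 * bounces / len(pages_per_session)

# Hypothetical data: pages viewed in each of eight sessions.
sessions = [1, 3, 1, 5, 2, 1, 1, 4]
print(bounce_rate(sessions))  # 4 of 8 sessions bounced -> 50.0
```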
Logistic Regression - ANS-A statistical analysis that estimates an individual's risk (probability) of
the outcome as a function of one or more risk factors. The outcome of interest has two categories
(yes or no, obese or not obese, at risk of cancer or not at risk of cancer, happens or
does not happen, etc.).
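A minimal sketch of how a fitted logistic regression turns a risk factor into a probability, assuming a single risk factor. The intercept and slope below are made-up values chosen for illustration, not estimates from any real data.

```python
import math

def predict_probability(x, intercept, slope):
    """Logistic model: probability of the "yes" outcome for risk factor x."""
    log_odds = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients (assumed, not fitted).
INTERCEPT, SLOPE = -4.0, 0.1

for age in (20, 40, 60):
    p = predict_probability(age, INTERCEPT, SLOPE)
    # Turn the probability into one of the two categories with a 0.5 cutoff.
    label = "yes" if p >= 0.5 else "no"
    print(age, round(p, 3), label)
```

The 0.5 cutoff is a common default for mapping the probability to the two categories; other cutoffs are possible.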
K-means clustering - ANS-Informally, the goal is to find groups of points that are close to
each other but far from points in other groups.
• Each cluster is defined entirely and only by its centre, or mean value μ_k
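The idea can be sketched with a small pure-Python version of the usual assign-then-update loop. The points, the starting centres, and the fixed iteration count are assumptions for the example; library implementations choose starting centres and stopping rules more carefully.

```python
def kmeans(points, centres, iterations=10):
    """Tiny 2-D k-means: assign points to the nearest centre, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            # Index of the centre with the smallest squared distance to p.
            nearest = min(
                range(len(centres)),
                key=lambda k: (p[0] - centres[k][0]) ** 2 + (p[1] - centres[k][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Each centre becomes the mean of its cluster (kept as-is if empty).
        centres = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centres[k]
            for k, c in enumerate(clusters)
        ]
    return centres, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centres, clusters = kmeans(points, centres=[(0.0, 0.0), (10.0, 10.0)])
print(centres)
```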
Random Forest - ANS-An algorithm used for regression or classification that builds a
collection of decision trees; the trees "vote" on the prediction.
Examples of when to use Random Forest - ANS-In healthcare: to identify the correct
combination of components in medicine and to analyze a patient's medical history to
identify diseases (for example, using symptoms to predict whether a person's illness
is more likely malaria or a simple fever; another example is a cold versus a
sinus infection).
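The voting step of a random forest can be illustrated as below. The three "trees" are hand-written stand-ins with invented rules and thresholds, purely to show the majority vote; a real random forest learns its trees from training data.

```python
from collections import Counter

# Hand-written stand-ins for trained trees (hypothetical rules, not learned).
def tree_1(p):
    return "malaria" if p["temp_c"] >= 39.0 else "fever"

def tree_2(p):
    return "malaria" if p["chills"] else "fever"

def tree_3(p):
    return "malaria" if p["days_ill"] > 3 else "fever"

def forest_predict(trees, sample):
    """Classification: return the label that most trees vote for."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical patient record.
patient = {"temp_c": 39.5, "chills": True, "days_ill": 2}
print(forest_predict([tree_1, tree_2, tree_3], patient))  # two of three trees vote "malaria"
```

For regression, the trees' numeric outputs would typically be averaged instead of counted.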
Centroid Clustering - ANS-Clusters are represented by their centroids.
Hierarchical clustering with cluster distance defined by a centroid (assigned center).
___ has many applications in diverse fields such as face recognition, computer vision,
image compression, bioinformatics, and fraud detection. - ANS-PCA
Density Clustering - ANS-detecting areas where points are concentrated and where
they are separated by areas that are empty or sparse.
Data Wrangling is - ANS-the process of removing errors and combining complex data
sets to make them more accessible and easier to analyze.
Data Wrangling Examples - ANS-Merging several data sources into one dataset for
analysis.
Identifying gaps or empty cells in data and either filling or removing them.
Deleting irrelevant or unnecessary data.
Identifying severe outliers in data and either explaining the inconsistencies or deleting
them to facilitate analysis.
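The wrangling steps listed above could be sketched as follows. The two sources, the field names, and the 0 to 120 plausibility range for age are assumptions made for the example.

```python
from statistics import median

# Two hypothetical sources with the same layout.
source_a = [{"id": 1, "age": 34, "note": "x"}, {"id": 2, "age": None, "note": "y"}]
source_b = [{"id": 3, "age": 29, "note": "z"}, {"id": 4, "age": 410, "note": "w"}]

# Merge several data sources into one dataset.
merged = source_a + source_b

# Delete irrelevant data: the free-text note is not needed for analysis.
trimmed = [{"id": r["id"], "age": r["age"]} for r in merged]

# Remove severe outliers: ages outside an assumed plausible range of 0 to 120.
plausible = [r for r in trimmed if r["age"] is None or 0 <= r["age"] <= 120]

# Fill gaps: replace missing ages with the median of the known ages.
known = [r["age"] for r in plausible if r["age"] is not None]
fill = median(known)
cleaned = [{**r, "age": r["age"] if r["age"] is not None else fill} for r in plausible]
print(cleaned)
```

Whether to fill, remove, or explain a gap or an outlier is a judgment call that depends on the analysis; the median fill and the hard cutoff here are only one option.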
Maintaining Databases examples/purpose - ANS-Routines meant to help performance,
free up disk space, check for data errors, check for hardware faults, update internal
statistics, and many other obscure (but important) things.
Maintaining DB2® and Oracle databases involves updating statistics, monitoring
database, server, and space utilization, and planning backup and recovery strategies.
This is an example of what? - ANS-Maintaining databases
Project initiation falls under what role? - ANS-Project Sponsor
How to find the p-value? - ANS-Look at the alternative hypothesis:
- if "less than", find the area to the left of the z value
- if "greater than", find the area to the right of the z value
- if "not equal to": for a positive z, find the area to the right and double it; for a negative z, find the area to the
left and double it.
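The rules above can be written as a short function using the standard normal distribution from Python's statistics module. The function name p_value and the strings used to name the alternative are choices made for this sketch.

```python
from statistics import NormalDist

def p_value(z, alternative):
    """p-value for a z statistic under the standard normal distribution."""
    left_area = NormalDist().cdf(z)   # area to the left of z
    right_area = 1.0 - left_area      # area to the right of z
    if alternative == "less":
        return left_area
    if alternative == "greater":
        return right_area
    if alternative == "not equal":
        # Double the tail on the side where z falls (the smaller area).
        return 2.0 * min(left_area, right_area)
    raise ValueError("alternative must be 'less', 'greater', or 'not equal'")

print(round(p_value(-1.96, "less"), 4))
print(round(p_value(1.96, "not equal"), 4))
```

Taking the smaller of the two areas covers both cases of the "not equal to" rule: for a positive z the right tail is smaller, and for a negative z the left tail is smaller.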
creating tables and establishing relationships between those tables according to rules
designed both to protect the data and to make the database more flexible by eliminating
redundancy and inconsistent dependency. - ANS-normalization
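A small sketch of the idea using Python's built-in sqlite3 module. The customers and orders tables and their columns are hypothetical: customer details are stored once and referenced by key, rather than repeated on every order row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the relationship
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    city        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    item        TEXT NOT NULL
);
""")

# The customer's name and city are stored once, not on every order.
conn.execute("INSERT INTO customers VALUES (1, 'Ana', 'Lisbon')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, "pen"), (11, 1, "ink")])
conn.commit()

# A join rebuilds the combined view when it is needed.
rows = conn.execute(
    "SELECT o.order_id, c.name, c.city, o.item "
    "FROM orders o JOIN customers c ON c.customer_id = o.customer_id "
    "ORDER BY o.order_id"
).fetchall()
print(rows)

# The relationship protects the data: an order for an unknown customer is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (12, 99, 'cap')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```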