DTSA 5504 - DATA MINING PIPELINE
EXAM QUESTIONS AND ANSWERS
What are Two Types of Data Attributes? - Answer-1.) Categorical (nominal, binary,
ordinal)
2.) Numeric (discrete, continuous)
What are some kinds of Data Statistics? - Answer-Categorical: % of each value,
Numeric: central tendency, dispersion
What elements make up Central Tendency? - Answer-Mean, Median, Mode,
Midrange
What elements make up Dispersion? - Answer-Range (max - min),
Quartiles(Q1:25%, Q3:75%), IQR (Q3-Q1), Variance, Standard Deviation
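As an illustration (not from the exam material), the central-tendency and dispersion statistics above can be computed for a made-up sample with NumPy and the standard library:

import numpy as np
from statistics import mode

data = np.array([4, 8, 15, 16, 23, 42, 8])    # made-up sample values

mean = data.mean()
median = np.median(data)
most_common = mode(data.tolist())             # mode
midrange = (data.max() + data.min()) / 2

value_range = data.max() - data.min()         # range = max - min
q1, q3 = np.percentile(data, [25, 75])        # quartiles
iqr = q3 - q1                                 # IQR = Q3 - Q1
variance = data.var(ddof=1)                   # sample variance
std_dev = data.std(ddof=1)                    # sample standard deviation
print(mean, median, most_common, midrange, value_range, iqr, variance, std_dev)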
What are some examples of plot types for data visualization? - Answer-Boxplots,
histograms, scatterplots, pie, line, heatmap, word cloud, network, area, bubble
Object Similarity - Answer-Computed from the data matrix: n objects x p attributes
Object Dissimilarity - Answer-Stored in the dissimilarity matrix: n objects x n objects
Nominal Similarity - Answer-s=1 if x=y, otherwise s=0
Nominal Dissimilarity - Answer-d=0 if x=y, otherwise d=1
Binary Symmetry - Answer-Y and N are equally likely and equally important
Binary Asymmetry - Answer-Y is rarer (and therefore more informative) than N
Symmetric Variables Equation - Answer-d(i,j) = (r + s) / (q + r + s + t), where q = attributes that are 1 for both i and j, r = 1 for i only, s = 1 for j only, t = 0 for both
Asymmetric Variables Equation - Answer-d(i,j) = (r + s) / (q + r + s); sim(i,j) = q / (q + r + s) = 1 - d(i,j), also known as the Jaccard coefficient
Jaccard coefficient - Answer-(q / (q+r+s))
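A minimal sketch of the symmetric, asymmetric, and Jaccard computations, using the contingency-table counts q, r, s, t described above (the binary vectors here are invented):

def binary_contingency(x, y):
    # x, y: equal-length lists of 0/1 values
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t

i = [1, 0, 1, 1, 0]
j = [1, 1, 0, 1, 0]
q, r, s, t = binary_contingency(i, j)          # q=2, r=1, s=1, t=1
print((r + s) / (q + r + s + t))               # symmetric dissimilarity: 0.4
print((r + s) / (q + r + s))                   # asymmetric dissimilarity: 0.5
print(q / (q + r + s))                         # Jaccard coefficient: 0.5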
Ordinal Attributes - Answer-Replace each value by its rank r(if) in {1,...,Mf} and map it onto [0,1]: z(if) = (r(if) - 1)/(Mf - 1)
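For example (the levels and their ordering are invented), an ordinal attribute with Mf = 3 levels maps onto [0, 1] like this:

levels = ["low", "medium", "high"]                 # ordered categories
Mf = len(levels)
rank = {v: i + 1 for i, v in enumerate(levels)}    # r(if) in {1, ..., Mf}

def ordinal_to_numeric(value):
    return (rank[value] - 1) / (Mf - 1)            # z(if) = (r(if) - 1) / (Mf - 1)

print([ordinal_to_numeric(v) for v in levels])     # [0.0, 0.5, 1.0]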
Numeric Object Dissimilarity - Answer-Usually measured by distance, e.g. the Minkowski distance (L_p norm)
Minkowski Distance - Answer-d(i,j) = (|xi1 - xj1|^p + ... + |xin - xjn|^p)^(1/p), where p=1 gives the Manhattan distance and p=2 gives the Euclidean distance
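A short sketch of the formula above (the sample vectors are invented):

def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = [1, 2, 3], [4, 6, 3]
print(minkowski(x, y, p=1))   # Manhattan distance: 3 + 4 + 0 = 7
print(minkowski(x, y, p=2))   # Euclidean distance: sqrt(9 + 16 + 0) = 5.0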
Distance Measure Properties - Answer-Non-negativity, symmetry (d(i,j) = d(j,i)), and the triangle inequality: d(i,j) <= d(i,k) + d(k,j)
cosine similarity - Answer-cos(A,B) = (A . B) / (||A|| ||B||), where A . B is the dot product and ||A|| = sqrt(sum of Ai^2)
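A minimal sketch of the cosine-similarity computation (the vectors are invented):

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0, 1], [1, 1, 0]))  # 1 / (sqrt(2) * sqrt(2)) = 0.5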
What operations are involved with sequential data and time series? - Answer-
Euclidean matching, dynamic time warping, minimum jump cost
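Of these, dynamic time warping is the least obvious to compute; below is a minimal dynamic-programming sketch (not the course's implementation; the sequences are invented):

import math

def dtw_distance(a, b):
    # cost[i][j] = minimal warping cost aligning a[:i] with b[:j]
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # stretch a, stretch b, or advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 0.0: warping absorbs the repeated 1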
Mixed Attribute Types - Answer-Weighted sum across attributes: d(i,j) = sum_f(delta_ij(f) * d_ij(f)) / sum_f(delta_ij(f)), where d_ij(f) is the dissimilarity on attribute f and delta_ij(f) = 0 if attribute f cannot be compared (e.g. a missing value), 1 otherwise
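A toy sketch of that weighted combination (the per-attribute dissimilarities and indicators are invented):

def mixed_dissimilarity(per_attribute_d, indicators):
    # per_attribute_d: d_ij(f) values in [0, 1]; indicators: delta_ij(f), 0 = skip, 1 = count
    num = sum(delta * d for delta, d in zip(indicators, per_attribute_d))
    den = sum(indicators)
    return num / den if den else 0.0

# three attributes, the second one not comparable (e.g. missing for object i)
print(mixed_dissimilarity([0.2, 0.9, 0.5], [1, 0, 1]))  # (0.2 + 0.5) / 2 = 0.35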
When to use Euclidean/Manhattan processes? - Answer-Dense, continuous data
When to ignore null/null cases? - Answer-asymmetric attributes
When to use cosine similarity or Jaccard similarity? - Answer-sparse data
When to use seasonal patterns or subgroups? - Answer-Subset data
In a boxplot, what does the IQR represent? - Answer-The height of the box (the distance from Q1 to Q3)
In what ways can one transform data? - Answer-Smoothing, aggregation,
generalization, normalization, discretization, attribute construction
Formula for min-max normalization - Answer-v' = (v-min)/(max-min) * (max' - min') +
min'
Formula for mean normalization - Answer-v' = (v-mean)/(max-min)
Formula for standardized normalization - Answer-v' = (v-mean)/stdev
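A sketch of the three normalization formulas above, applied to made-up salary-style numbers:

def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def mean_normalize(v, mean, vmin, vmax):
    return (v - mean) / (vmax - vmin)

def z_score(v, mean, stdev):
    return (v - mean) / stdev

# example values: min=12000, max=98000, mean=54000, stdev=16000, v=73600
print(min_max(73600, 12000, 98000))                 # ~0.716
print(mean_normalize(73600, 54000, 12000, 98000))   # ~0.228
print(z_score(73600, 54000, 16000))                 # 1.225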
What does discretization involve? - Answer-Continuous->intervals, Split or merge,
Supervised or unsupervised labels
Methods for Unsupervised Discretization - Answer-Binning/Histogram Analysis,
Clustering Analysis, Intuitive partitioning
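For instance (data invented, assuming NumPy is available), equal-width and equal-frequency binning can be sketched as:

import numpy as np

values = np.array([4, 8, 15, 16, 23, 42, 55, 61])
k = 3

# equal-width binning: split the value range into k intervals of the same width
width_edges = np.linspace(values.min(), values.max(), k + 1)
equal_width_bins = np.digitize(values, width_edges[1:-1])

# equal-frequency binning: interior edges at the 1/3 and 2/3 quantiles
freq_edges = np.quantile(values, [1 / 3, 2 / 3])
equal_freq_bins = np.digitize(values, freq_edges)

print(equal_width_bins)   # 0-based bin index per value
print(equal_freq_bins)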
Properties for Supervised Discretization - Answer-Pre-determined class labels,
entropy-based interval splitting, X^2 analysis-based interval merging
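A toy illustration of entropy-based interval splitting (data and threshold invented): the split point that maximizes information gain against the class labels is chosen.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_info_gain(values, labels, threshold):
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

vals = [1, 2, 3, 10, 11, 12]
labs = ["a", "a", "a", "b", "b", "b"]
print(split_info_gain(vals, labs, threshold=3))  # 1.0: this split separates the classes perfectly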
Properties of Data Reduction - Answer-Dimensionality reduction reduces the number of attributes; numerosity reduction reduces the number of objects
Properties of Attribute Selection - Answer-Forward selection, Backward elimination,
Feature engineering
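As a hedged sketch (assuming scikit-learn is installed; this is not part of the course material), forward selection and backward elimination can be run with SequentialFeatureSelector:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",        # use "backward" for backward elimination
)
selector.fit(X, y)
print(selector.get_support())   # boolean mask over the original attributes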
Feature engineering - Answer-The process of determining which features might be
useful in training a model, and then converting raw data from log files and other
sources into said features. In TensorFlow, feature engineering often means converting raw log file entries into tf.Example protocol buffers.