Midterm Exam Preparation, Topics and Subjects from previous
terms 2020/2021
Final Exam Preparation, Topics and Subjects from previous terms
2020/2021
CSE 578 Data Visualization: Midterm Recap, Topics and Subjects
I. Intro To Data Visualization (Week 1)
1) What is visualization
1. What is visualization?
2. Key purposes of visualization
2) Data Processing vs. Querying vs. Exploration
Def. awareness
Def. understanding
Difference between the following: Data Processing ->, Querying ->, Navigation ->, Exploration ->
Exploratory search steps: Analysis, Comparison, Aggregation, Transformation, Visualization
3) Intro To Data Exploration
Data challenges: Imprecision, Noise, Sparsity (too little data) = INS; 3Vs: Volume, Velocity (speed),
Variety (heterogeneity); HMLE: high-dimensional, multi-modal, inter-linked, evolving
- Data management/mining techniques for supporting scalable, real-time analysis & exploration
- Most data in the world are imprecise, multi-modal, subjective
- Data exploration systems need to support both effective data manipulation (filtering, integration/join,
set operations such as union & intersection) and effective data analysis/retrieval (feature extraction,
similarity search, clustering (partitioning), aggregation, classification, preference-driven)
II. Intro To Data Exploration Components (Week 2)
1) Data Organization
What is a database? -> a collection of data organized in some fashion
What is a data model? -> a formalism used to describe how data is organized (hierarchical, relational, OO,
spatial, fuzzy) and the constraints that describe properties of the data
What is a data schema? -> a set of constraints that describe the properties and structure of the data; it
enables validation & efficient storage of data, and enables querying & retrieval of data (comparison,
indexing, query optimization, query processing)
Levels of data organization -> structured data/databases, semi-structured data/databases,
unstructured data/databases
Structured data/databases: the data are well-structured & organized (the schema describes this
structure, the DBMS enforces it). Advantages: data organization is predictable, so it is easier to query,
optimize, explore
▪ Ex. Relational data model: data is organized in tabular form; the schema for each table
consists of attributes; functional dependencies describe the relationships among the
attributes in the schema (ex. a key uniquely identifies a tuple)
Semi-structured data: the constraints that reflect the structure of the data are flexible; data is self-
describing: each item in the database describes its own schema
▪ Advantages: data organization is flexible/malleable, easier to integrate & exchange
▪ Ex. books from different sources that don't share the same structure can be stored as
semi-structured data
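As a rough illustration of "self-describing" items, the sketch below stores two book records whose field names differ. The records, field names, and helper functions are hypothetical and not taken from the course material; the point is only that each item carries its own schema, so only the fields shared by all items can be queried uniformly.

```python
# Hypothetical book records from two different sources.
# Each record names its own fields, i.e. it describes its own schema.
books = [
    {"title": "Book A", "author": "Author X", "year": 2001},
    {"title": "Book B", "editors": ["Editor Y", "Editor Z"], "publisher": "Press P"},
]


def fields_per_record(records):
    """Return, for each record, the set of field names it declares."""
    return [set(record) for record in records]


def shared_fields(records):
    """Return the fields present in every record (queryable across all items)."""
    field_sets = fields_per_record(records)
    return set.intersection(*field_sets) if field_sets else set()


print(fields_per_record(books))
print(shared_fields(books))
```

In a structured (relational) setting both records would have to fit one fixed table schema; here they can coexist as-is.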
2) Vector Data
-images, videos, social networks, books, sensor readings.
3 & 4) Vector spaces
- what are good features to use as basis vectors & how many to use as basis vectors?
- dimensionality curse: high dimensionality makes data harder to analyze; the more dimensions, the less
efficient & effective search & analysis become
- efficiency: search data structures (trees) are not efficient at high dimensions; effectiveness: the more
dimensions we have, the more data we need to discover patterns (prevent overfitting)
Def. linear independence & basis:
a) Vectors in V = {v1, v2, v3, …, vn} are linearly independent if they are non-redundant, i.e. no vector
in V can be written as a linear combination of the others
b) The linearly independent set V is said to be a basis for S if every vector u in S can be written as a
linear combination of the vectors in V (a property called completeness)
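The two definitions can be checked numerically. The sketch below is one possible way, assuming vectors are given as lists of numbers in R^d: it computes the rank by Gaussian elimination with exact fractions, treats a set as linearly independent when its rank equals the number of vectors, and treats it as a basis of R^d when it additionally has d vectors. The function names are illustrative, not from the course material.

```python
from fractions import Fraction


def rank(vectors):
    """Rank of a list of equal-length vectors, via Gaussian elimination."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    n_cols = len(rows[0]) if rows else 0
    rank_count = 0
    for col in range(n_cols):
        # Look for a not-yet-used row with a non-zero entry in this column.
        pivot = next(
            (r for r in range(rank_count, len(rows)) if rows[r][col] != 0), None
        )
        if pivot is None:
            continue
        rows[rank_count], rows[pivot] = rows[pivot], rows[rank_count]
        pivot_row = rows[rank_count]
        # Eliminate this column from every other row.
        for r in range(len(rows)):
            if r != rank_count and rows[r][col] != 0:
                factor = rows[r][col] / pivot_row[col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], pivot_row)]
        rank_count += 1
    return rank_count


def is_linearly_independent(vectors):
    """True if no vector is a linear combination of the others."""
    return rank(vectors) == len(vectors)


def is_basis(vectors, dimension):
    """True if the vectors are independent and complete for R^dimension."""
    return (
        all(len(v) == dimension for v in vectors)
        and len(vectors) == dimension
        and is_linearly_independent(vectors)
    )


print(is_linearly_independent([[1, 0], [0, 1]]))  # independent
print(is_linearly_independent([[1, 2], [2, 4]]))  # second vector is redundant
print(is_basis([[1, 0, 0], [0, 1, 0]], 3))        # independent but not complete
```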
5) Norms
- Most used family of length measurements = p-norms
- 1-norm: Manhattan distance, city block distance, L1 distance: |x1-x2| + |y1-y2|
- 2-norm: Euclidean distance, L2 distance: sqrt((x1-x2)^2 + (y1-y2)^2)
- ∞-norm: L∞ distance: max(|x1-x2|, |y1-y2|)
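A minimal sketch of the p-norm family as a distance between two vectors, assuming equal-length numeric sequences. The function name `p_norm_distance` is illustrative; `p = math.inf` is used here to select the max-based variant.

```python
import math


def p_norm_distance(a, b, p):
    """Distance between equal-length vectors a and b under the p-norm (p >= 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        # Limit case: only the largest coordinate difference counts.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


a, b = (1, 2), (4, 6)
print(p_norm_distance(a, b, 1))         # Manhattan: 3 + 4 = 7.0
print(p_norm_distance(a, b, 2))         # Euclidean: sqrt(9 + 16) = 5.0
print(p_norm_distance(a, b, math.inf))  # max(3, 4) = 4
```

As p grows, the largest coordinate difference dominates the sum, which is why the max form appears as the limit.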
6) Vector distance measures
- A metric distance must satisfy self-minimality, minimality, symmetry, triangular inequality
(these properties enable effective pruning of the search space during retrieval)
- P-norms are metric
- Intersection similarity:
$sim_{int}(\vec{a}, \vec{b}) = \frac{\sum_{i=1}^{d} \min(a_i, b_i)}{\sum_{i=1}^{d} \max(a_i, b_i)}$
o sim ~ 1 if a, b are similar
o sim ~ 0 if a, b are not similar
- Angle-based measures:
o cosine similarity: $\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| \, |\vec{b}|}$
o dot product similarity: $\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i$
- Cosine and dot product are the same if vectors are unit length
- Other:
Pearson's correlation (similarity measure): linear correlation among the corresponding
components of 2 vectors; KL-divergence (distance measure): how one vector, viewed as a probability
distribution, diverges from the other; earth mover's distance: the minimum cost of transforming one
distribution into the other
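The intersection, dot product, and cosine measures from this section can be sketched in a few lines, assuming equal-length numeric vectors (non-negative components for the intersection similarity). The function names and the handling of all-zero vectors are assumptions made for this example.

```python
import math


def intersection_similarity(a, b):
    """Sum of component-wise minima divided by sum of component-wise maxima."""
    denominator = sum(max(x, y) for x, y in zip(a, b))
    if denominator == 0:
        return 1.0  # assumption: two all-zero vectors are treated as identical
    return sum(min(x, y) for x, y in zip(a, b)) / denominator


def dot_product(a, b):
    """Sum of the products of corresponding components."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product normalized by the lengths (2-norms) of both vectors."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / (norm_a * norm_b)


print(intersection_similarity([2, 1], [1, 2]))  # (1 + 1) / (2 + 2) = 0.5
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors give 0.0

# For unit-length vectors, cosine and dot product coincide (up to rounding).
u, v = [0.6, 0.8], [1.0, 0.0]
print(math.isclose(cosine_similarity(u, v), dot_product(u, v)))
```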
7) Strings & sequences
Edit distance between 2 sequences is the min. no. of edit operations (deletions, replacements,
insertions) needed to convert one sequence into the other.
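A common way to compute the edit distance is dynamic programming over prefixes of the two sequences. The sketch below assumes unit cost for each insertion, deletion, and replacement; the function name is illustrative.

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions, and replacements
    needed to turn source into target (unit cost per operation)."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        current = [i]  # turning source[:i] into an empty sequence takes i deletions
        for j, t_item in enumerate(target, start=1):
            cost = 0 if s_item == t_item else 1
            current.append(min(
                previous[j] + 1,         # delete s_item
                current[j - 1] + 1,      # insert t_item
                previous[j - 1] + cost,  # replace, or keep when the items match
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

The same function works on any sequences of comparable items (e.g. lists of symbols), not only strings.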
- Time series matching: synchronous/non-elastic distance & similarity measures: Euclidean dist.
- Asynchrony in time series: distance & similarity measures: Edit Distance, Dynamic Time Warping,
Feature-based alignment
- motifs: frequently repeating patterns in time series; can also occur in multi-variate time series
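To make the synchronous vs. asynchronous distinction concrete, the sketch below compares the Euclidean distance with a basic Dynamic Time Warping (DTW) cost on two series that have the same shape but are shifted in time. It assumes numeric series and an absolute-difference local cost, with no warping-window constraint; the example data are made up for illustration.

```python
import math


def euclidean_distance(series_a, series_b):
    """Point-by-point (synchronous) distance; requires equal lengths."""
    if len(series_a) != len(series_b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(series_a, series_b)))


def dtw_distance(series_a, series_b):
    """Dynamic time warping cost with absolute difference as the local cost."""
    n, m = len(series_a), len(series_b)
    # cost[i][j] = best cost of aligning series_a[:i] with series_b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(series_a[i - 1] - series_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # series_b[j-1] matched to several points of series_a
                cost[i][j - 1],      # series_a[i-1] matched to several points of series_b
                cost[i - 1][j - 1],  # one-to-one step
            )
    return cost[n][m]


# Same bump, shifted by one time step (made-up data).
a = [0, 1, 2, 1, 0, 0]
b = [0, 0, 1, 2, 1, 0]
print(euclidean_distance(a, b))  # 2.0: penalizes the shift
print(dtw_distance(a, b))        # 0.0: the shift is absorbed by warping
```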
III. Exploratory Querying & Visual Variables used in Data Exploration & Visualization (Week 3)
1) Data Processing vs Querying vs Exploration
• Similarity queries/ranked queries
• Drill-down/Roll-up
• Summarization/aggregation(MAX)
• Aggregate/iceberg queries