CPSC 330 final exam Ch 16, 17, 18 with
complete solutions
.... gives you the ability to summarize the major themes in a large collection of
documents (corpus). - ANSWER-Topic modelling
.... is a great EDA tool to get a sense of what's going on in a large corpus. - ANSWER-
Topic modelling
2 approaches to reduce multi-class classification into binary classification - ANSWER-
the one-vs.-rest approach
- 1v{2,3}, 2v{1,3}, 3v{1,2}
- Learn a binary model for each class which tries to separate that class from all of the
other classes.
the one-vs.-one approach
- 1v2, 1v3, 2v3
- Build a binary model for each pair of classes.
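Both reductions above are available as wrappers in sklearn. A minimal sketch on made-up 3-class data (the dataset and base estimator here are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Toy 3-class dataset, purely for illustration.
X, y = make_classification(n_samples=150, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=42)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# OVR trains K binary models (one per class);
# OVO trains K*(K-1)/2 (one per pair). With K = 3 both happen to be 3.
print(len(ovr.estimators_))
print(len(ovo.estimators_))
```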
After creating a user profile in content-based filtering, what do we do? - ANSWER-Create a Ridge() model
Predict the rating of a movie that the user has not seen
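A minimal sketch of this step, with made-up movie features and ratings: the user's profile is a Ridge model fit on the features of the movies they have rated, which can then score an unseen movie.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical item-feature matrix: rows = movies, columns = genre features.
movie_feats = np.array([[1, 0, 1],   # movie 0: action, sci-fi
                        [0, 1, 0],   # movie 1: romance
                        [1, 1, 0],   # movie 2: action, romance
                        [0, 0, 1]])  # movie 3: sci-fi (unseen)

seen = [0, 1, 2]            # movies this user has rated
ratings = [5.0, 1.0, 3.0]   # the user's ratings for those movies

# The user "profile" is a Ridge model fit on features of the rated movies.
profile = Ridge(alpha=1.0).fit(movie_feats[seen], ratings)

# Predict the rating of movie 3, which the user has not seen.
pred = profile.predict(movie_feats[[3]])
print(round(float(pred[0]), 2))
```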
An unsupervised approach which only uses the user-item interactions given in the
ratings matrix - ANSWER-Collaborative filtering in a recommender system
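One common way to sketch this idea is matrix factorization on the ratings matrix alone (no item features). The toy ratings below are made up; this uses `TruncatedSVD` as a stand-in for a proper recommender factorization:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical ratings matrix: rows = users, columns = items, 0 = unrated.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

# Factor the matrix into 2 latent dimensions and reconstruct it;
# reconstructed values at unrated positions act as predicted ratings.
svd = TruncatedSVD(n_components=2, random_state=0)
U = svd.fit_transform(R)        # user factors, shape (4, 2)
R_hat = U @ svd.components_     # reconstructed ratings, shape (4, 4)

print(R_hat.shape)
```

Note that only user-item interactions are used; nothing about what the items *are* enters the model.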
Apply all of the classifiers on the test example.
Count how often each class was predicted.
Predict the class with the most votes.
These are the properties of - ANSWER-One Vs. One approach (OVO) for prediction
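The voting steps above can be sketched in a few lines (the pairwise predictions here are made up):

```python
from collections import Counter

# Hypothetical predictions from the 3 pairwise classifiers on one test point:
# the 1-vs-2 model said class 1, 1-vs-3 said 1, 2-vs-3 said 3.
pairwise_preds = [1, 1, 3]

# Count how often each class was predicted, then take the majority vote.
votes = Counter(pairwise_preds)
predicted_class = votes.most_common(1)[0][0]
print(predicted_class)  # → 1
```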
Basic text preprocessing (7) - ANSWER-Tokenization
- the process of breaking down a text or document into individual words, phrases,
symbols, or other meaningful elements known as tokens. In simpler terms, tokenization
is like splitting a sentence into its component parts
- Sentence segmentation: Split text into sentences
- Word tokenization: Split sentences into words
Converting text to lowercase
Removing punctuation and stopwords
- stopwords: commonly used words that are often considered insignificant or carry little
meaning for understanding the context of a text (such as "a," "an," "the," "is," "in,"
"and,").
Discarding words with length < threshold OR word frequency < threshold
Lemmatization:
- Consider the lemmas instead of inflected forms.
- lemmatization: finding the base form of a word
- For example, lemmatizing the words "running," "ran," and "runs" would give you the
base form "run."
- Vancouver's → Vancouver
- computers → computer
- rising, rises, rose → rise
POS: restrict to a specific part of speech; For example, only consider nouns, verbs, and
adjectives
Stemming
- the process of reducing words to their base or root form, often by removing suffixes, to
simplify analysis and improve text processing efficiency.
- Before stemming: UBC is located in the beautiful province of British Columbia... It's
very close to the U.S. border.
- After stemming: ubc is locat in the beauti provinc of british columbia ... it 's veri close to
the u.s. border .
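Several of these steps can be sketched in plain Python (real pipelines would use nltk or spaCy, which also handle sentence segmentation, lemmatization, and stemming; the stopword list below is a made-up toy subset):

```python
import re

# Toy stopword list, purely for illustration.
STOPWORDS = {"a", "an", "the", "is", "in", "and", "it", "to", "of"}

def preprocess(text, min_len=2):
    text = text.lower()                   # convert to lowercase
    text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation
    tokens = text.split()                 # word tokenization
    # drop stopwords and words shorter than the length threshold
    return [t for t in tokens
            if t not in STOPWORDS and len(t) >= min_len]

print(preprocess("UBC is located in the beautiful province of British Columbia."))
# → ['ubc', 'located', 'beautiful', 'province', 'british', 'columbia']
```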
Collaborative filtering vs. Content-based filtering (1) - ANSWER-use item features or not
Collaborative: Recommends items based on similar users or items without requiring
explicit knowledge of item features.
Content-based: Recommends items based on item features like text, genre, or
metadata
Explain Neural Networks - ANSWER-Neural networks apply a sequence of
transformations on your input data.
We are adding one "layer" of transformations in between the features (inputs) and the target (output).
The hidden units (e.g., h[1], h[2], ...) represent the intermediate processing steps.
At a very high level you can also think of them as Pipelines in sklearn.
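A minimal numpy sketch of one hidden layer: hidden units h = relu(W1·x + b1), then output y = w2·h + b2. All weights here are random made-up numbers, purely to show the sequence of transformations:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])                  # input features

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # layer 1: 3 inputs -> 4 hidden units
w2, b2 = rng.normal(size=4), 0.0                # layer 2: 4 hidden units -> 1 output

h = np.maximum(0, W1 @ x + b1)                  # hidden units h[1]..h[4] (ReLU)
y = w2 @ h + b2                                 # target (output)

print(h.shape)  # (4,)
```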
Explain what transfer learning is - ANSWER-Recall: CNNs can take in images without flattening them ← a solution to image classification!
Training a CNN from scratch is not common due to the need for a large dataset,
powerful computers, and significant human effort.
Instead, a common practice is to download a pre-trained model and fine-tune it for your
task.
This is called transfer learning.
- Transfer learning is like using what you already know to learn something new faster. In
machine learning, it means using a pre-trained model's knowledge to solve a different
problem instead of starting from scratch. It saves time and resources while improving
performance.
Given a test point, get scores from all binary classifiers (e.g., raw scores for logistic
regression). This is - ANSWER-OVR
How does LDA work (4) - ANSWER-1. Create a bag-of-words (BOW) representation for the text column using CountVectorizer
2. Create a topic model with sklearn's LatentDirichletAllocation
3. LDA gives access to two associations:
Topic-words association
- `lda.components_` gives us the weights associated with each word (columns) for each
topic (rows).
- In other words, it tells us which word is important for which topic.
Document-topic association
- Calling `transform` on the data gives us the document-topic association.
- It tells us which topic is important for which document.
4. Optionally, change the data representation (e.g., rename the topic labels, round values, and drop the sum column)
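Steps 1-3 above can be sketched end-to-end on a made-up toy corpus (documents and topic count are purely illustrative):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: two sports-like documents and two legal-like documents.
docs = ["the team won the hockey game",
        "the player scored a goal in the game",
        "the court ruled on the lawsuit",
        "the judge heard the case in court"]

# Step 1: bag-of-words representation.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# Step 2: topic model with 2 topics.
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(X)  # document-topic association

# Step 3: topic-word association (rows = topics, columns = words).
print(lda.components_.shape)
print(doc_topics.shape)
```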
If you have K classes, it'll train K binary classifiers, one for each class.
this is - ANSWER-One-vs.-Rest approach (OVR)
If you want to pull documents related to a particular lawsuit, you can use - ANSWER-Topic modelling
ImageNet - ANSWER-An image dataset
There are 14 million images and 1000 classes