Exam Questions and answers, verified.
Artificial Intelligence - ✔✔-A computer performing tasks that normally require human intelligence
NLP Sentiment analysis is a form of... - ✔✔-classification
NLP topic modeling is a form of... - ✔✔-Dimensionality reduction
Tokenization - ✔✔-Splitting raw text into small, indivisible units for processing. Units can be words,
sentences, n-grams (n-word combos), or other patterns defined by a regular expression
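A minimal sketch of the three unit types above, using only the standard library. The sample text and the regular expressions are illustrative assumptions, not a standard tokenizer:

```python
import re

text = "NLP is fun. Tokenizers split raw text!"

# Word tokens: runs of letters, digits, or underscores, lowercased.
words = re.findall(r"\w+", text.lower())

# Sentence tokens: split on whitespace that follows ., !, or ?
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word bigrams (n-grams with n = 2): each word paired with the next one.
bigrams = list(zip(words, words[1:]))

print(words)      # ['nlp', 'is', 'fun', 'tokenizers', 'split', 'raw', 'text']
print(sentences)  # ['NLP is fun.', 'Tokenizers split raw text!']
print(bigrams[0]) # ('nlp', 'is')
```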
Stop words - ✔✔-Words that have very little semantic value
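As a rough illustration, stop words are usually filtered out after tokenization. The stop list here is a hypothetical, deliberately tiny one; real lists are much longer:

```python
# Hypothetical stop list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "on", "and"}

tokens = "the cat is on the mat".split()

# Keep only the tokens that are not in the stop list.
content_tokens = [t for t in tokens if t not in STOP_WORDS]

print(content_tokens)  # ['cat', 'mat']
```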
Stemming and Lemmatization - ✔✔-Cut a word down to its base form
Stemming: uses rough heuristics (such as stripping suffixes) to reduce words to a base form
Lemmatization: uses a vocabulary and morphological analysis (maps run, runs, running, and ran all to the
same lemma)
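A toy sketch of the difference. Both `crude_stem` and the `LEMMAS` lookup table are hypothetical stand-ins written for this example; real stemmers and lemmatizers use far richer rules and vocabularies:

```python
def crude_stem(word):
    # Rough heuristic: strip a common suffix, with no vocabulary lookup.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini vocabulary mapping inflected forms to their lemma.
LEMMAS = {"runs": "run", "running": "run", "ran": "run"}

def lemmatize(word):
    # Vocabulary lookup; unknown words are returned unchanged.
    return LEMMAS.get(word, word)

words = ["run", "runs", "running", "ran"]
stems = [crude_stem(w) for w in words]
lemmas = [lemmatize(w) for w in words]

print(stems)   # ['run', 'run', 'runn', 'ran']
print(lemmas)  # ['run', 'run', 'run', 'run']
```

The heuristic leaves "runn" and "ran" behind, while the vocabulary lookup maps all four forms to the same base.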
Named Entity Recognition - ✔✔-Identifies and tags named entities in text (people, places, organizations,
phone numbers, emails, etc.)
Also known as entity extraction
Compound term extraction - ✔✔-extracting and tagging compound words or phrases in text
Levenshtein distance - ✔✔-Minimum number of single-character edits needed to turn one word into another. One way
of quantifying word similarity
Levenshtein operations - ✔✔-Deletions (delete a character)
Insertions (insert a character)
Substitutions (change one character into another)
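The three operations can be combined in a dynamic-programming sketch that computes the distance row by row. The function name `levenshtein` is chosen for this example:

```python
def levenshtein(a, b):
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character
                curr[j - 1] + 1,           # insert a character
                prev[j - 1] + (ca != cb),  # change a character (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("flaw", "lawn"))       # 2
```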
Corpus - ✔✔-Collection of texts
Bag of words model - ✔✔-- Simplified representation of text, where each document is represented as a
bag (multiset) of its words
- Grammar and word order are disregarded, but multiplicity (word counts) is kept
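A small sketch using `collections.Counter` as the bag. The two sample sentences are made up for illustration:

```python
from collections import Counter

doc = "the cat sat on the mat"
shuffled = "the mat sat on the cat"  # same words, different order

bag = Counter(doc.split())

print(bag["the"])                      # 2 (multiplicity is kept)
print(bag == Counter(shuffled.split()))  # True (word order is disregarded)
```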
Cosine similarity - ✔✔-Way to quantify the similarity between documents
1. Put each document in vector format
2. Find the cosine of the angle between the document vectors
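The two steps can be sketched as follows, assuming simple word-count vectors (other vectorizations are possible) and whitespace tokenization:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    # Step 1: put each document in vector format (word counts).
    va = Counter(doc_a.lower().split())
    vb = Counter(doc_b.lower().split())
    # Step 2: cosine = dot product / (length of a * length of b).
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction
    return dot / (norm_a * norm_b)

same = cosine_similarity("the cat sat", "the cat sat")
none = cosine_similarity("the cat sat", "a dog barked")
```

Identical documents score close to 1, and documents with no shared words score 0.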
Term frequency-inverse document frequency - ✔✔-(term frequency) * (inverse document frequency)
Term frequency - ✔✔-Number of times the term appears in a document / total number of terms in that document
Inverse document frequency - ✔✔-- Considers how common a word is among all the documents
- Rare words get additional weight
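A minimal sketch of the three definitions above. The cards do not give a formula for inverse document frequency, so this assumes one common variant, log(number of documents / number of documents containing the term); the tiny corpus is made up:

```python
import math

# Hypothetical corpus of three pre-tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

def tf(term, doc):
    # Term count / total terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Assumed variant: log(N / document frequency); 0 for unseen terms.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_the = tf_idf("the", docs[0], docs)  # appears in every document
score_cat = tf_idf("cat", docs[0], docs)  # appears in two documents
score_dog = tf_idf("dog", docs[1], docs)  # appears in one document
```

Here "the" scores 0 because it occurs in every document, while the rarer "dog" outweighs "cat".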
Which classification models suffer from curse of dimensionality? - ✔✔-KNN, SVM, linear models
(linear/logistic regression), decision trees
Distance-based models