Cognitive Science 2: Summary of lectures

This document contains my notes and a summary of the lectures given by Chris Emmery in the course Cognitive Science 2 in Quartile 2 of the 2021/2022 academic year. The course was renewed before the start of this year, so this is the first time this format is taught. The equations mentioned in the lectures are included in the document as well, but without detailed explanations of the algorithms used in the lab sessions; only theoretical explanations are given.

COGNITIVE SCIENCE 2
TECHNISCHE UNIVERSITEIT EINDHOVEN & TILBURG UNIVERSITY
QUARTILE 2: 2021-2022


1. Introduction
Text Mining Preliminaries
The bare minimum approach is to convert text to vectors: we need to convert language into numbers.
$d$ = the cat sat on the mat
Bag-of-words representation over the vocabulary (cat, mat, on, sat, the): $\langle 1, 1, 1, 1, 1 \rangle$. If the word occurs in the document, 1; otherwise, 0. It is unordered.
We can represent multiple sentences as instances:
$d_0$ = the cat sat on the mat
$d_1$ = my cat sat on my cat

With columns (cat, mat, my, on, sat, the), this gives:
$\begin{pmatrix} 1 & 1 & 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 & 1 & 0 \end{pmatrix}$

The representation can be more easily done via a Documents × Terms matrix (sketched below):
• $d = V \times X$, with $V$: vocabulary and $X$: feature space
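A minimal sketch of building such a binary document–term matrix (my own illustration, not code from the lectures; the name bag_of_words is hypothetical):

from itertools import chain

def bag_of_words(docs):
    # Vocabulary V: all unique terms, sorted for a stable column order.
    vocab = sorted(set(chain.from_iterable(docs)))
    # One row per document: 1 if the term occurs in it, 0 otherwise (unordered).
    return vocab, [[1 if term in doc else 0 for term in vocab] for doc in docs]

docs = ["the cat sat on the mat".split(), "my cat sat on my cat".split()]
vocab, X = bag_of_words(docs)
print(vocab)  # ['cat', 'mat', 'my', 'on', 'sat', 'the']
print(X)      # [[1, 1, 0, 1, 1, 1], [1, 0, 1, 1, 1, 0]]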

For document similarity, we use the Jaccard coefficient: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$.
$|A \cap B|$ is the number of words for which both documents have a 1; $|A \cup B|$ is all words except those occurring in neither document, i.e. the words occurring in at least one of the two.
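A sketch of how this could be computed on the binary vectors from above (again my own illustration):

def jaccard(a, b):
    # |A ∩ B|: positions where both vectors have a 1.
    both = sum(1 for x, y in zip(a, b) if x and y)
    # |A ∪ B|: positions where at least one vector has a 1.
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

print(jaccard([1, 1, 0, 1, 1, 1], [1, 0, 1, 1, 1, 0]))  # 3/6 = 0.5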

Binary vs. frequency
• Binary is a very compact representation
• Algorithms like Decision Trees have a straightforward and compact structure
• Binary says very little about the weight of each word
• We can’t use more advanced algorithms that work with Vector Spaces

Notation for Term Frequencies
• $D = \{d_1, d_2, \dots, d_N\}$ is the set of documents
• $T = \{t_1, t_2, \dots, t_M\}$ is a set of index terms for $D$
• Each document $d_i \in D$ can be represented as a frequency vector (see the sketch after this list):
o $\vec{d}_i = \langle \mathrm{tf}(t_1, d_i), \dots, \mathrm{tf}(t_M, d_i) \rangle$, where $\mathrm{tf}(t, d)$ is the frequency of term $t_m \in T$ in document $d_i$.
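A minimal sketch of such a frequency vector (my own illustration; tf_vector is a hypothetical helper):

from collections import Counter

def tf_vector(doc, terms):
    # tf(t, d): how often term t occurs in document d.
    counts = Counter(doc)
    return [counts[t] for t in terms]

print(tf_vector("my cat sat on my cat".split(),
                ["cat", "mat", "my", "on", "sat", "the"]))  # [2, 0, 2, 1, 1, 0]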

Term frequency does not capture importance very well; it should be on a log scale. We use $\ln(\mathrm{tf}(t, d) + 1)$.
There are still problems: longer documents have more words, and rare terms do not occur much but are the most important. If two documents contain the same rare words, they are more similar.
This is solved via the (inverse) document frequency:
$\mathrm{idf}_t = \log_b \frac{N}{\mathrm{df}_t}$
with $N$: the number of documents, and $\mathrm{df}_t$: the document frequency of $t$, i.e. in how many documents $t$ occurs. The base $b$ is typically 10.

Both terms can be put together via multiplication:
$w_{t,d} = \ln(\mathrm{tf}(t, d) + 1) \cdot \log_{10}\frac{N}{\mathrm{df}_t}$
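Putting the weighting into code (a sketch under the definitions above; the name tf_idf_weight is my own):

import math

def tf_idf_weight(tf, df, n_docs, base=10):
    # w_{t,d} = ln(tf + 1) * log_b(N / df_t)
    return math.log(tf + 1) * math.log(n_docs / df, base)

# A term occurring 3 times in a document and in 10 of 1000 documents:
print(tf_idf_weight(tf=3, df=10, n_docs=1000))  # ln(4) * log10(100) ≈ 2.77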

Normalizing Vector Representations
One way of calculating the “length” of a document is the Euclidean distance: documents with many words are far away.
$d(\vec{p}, \vec{q}) = \sqrt{\sum_{i=1}^{n} (\vec{p}_i - \vec{q}_i)^2}$


Now, it will only compare documents of similar lengths well. It is possible to correct for this using
$\ell_2$ normalization: $\|\vec{p}\|_2 = \sqrt{\sum_i \vec{p}_i^2}$, i.e. dividing each vector by its length.
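A sketch of the normalization step (my own illustration; l2_normalize is a hypothetical name):

import math

def l2_normalize(p):
    # ||p||_2 = sqrt(sum of squared components); scale p to unit length.
    norm = math.sqrt(sum(x * x for x in p))
    return [x / norm for x in p] if norm else p

print(l2_normalize([3, 4]))  # [0.6, 0.8], a vector of length 1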

Cosine similarity: the dot product of two vectors, under the assumption that they are $\ell_2$ normalized:
$\mathrm{SIM} = \vec{p} \cdot \vec{q} = \sum_{i=1}^{n} \vec{p}_i \vec{q}_i$
If they are not, we can normalize inside the similarity:
$\mathrm{SIM} = \frac{\vec{p} \cdot \vec{q}}{\sqrt{\vec{p} \cdot \vec{p}} \cdot \sqrt{\vec{q} \cdot \vec{q}}}$
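A sketch combining both variants (again my own illustration):

import math

def cosine_similarity(p, q):
    # Dot product, normalized by both vector lengths.
    dot = sum(x * y for x, y in zip(p, q))
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_q = math.sqrt(sum(x * x for x in q))
    return dot / (norm_p * norm_q)

# The two example documents from the introduction:
print(cosine_similarity([1, 1, 0, 1, 1, 1], [1, 0, 1, 1, 1, 0]))  # ≈ 0.67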



2. Collecting Data
Noisy Text
Text can be noisy. If we don’t filter it out, our model reads things that are not part of the actual language. Typos also have a tremendous effect on the size of the vocabulary and on the representation of your documents (and thus on the similarity quality).

Language variations: abbreviations, acronyms, capitalization, character flooding, concatenations,
emoticons, dialect, slang, typos.

Regular Expressions
Finding and reducing those noisy text errors is possible via regular expressions: a mini scripting language of logic that is one of the few things standardized across almost all programming languages. It is a way to define string patterns and can be used both to find matches and to replace them.
import re
text = "Would you say you like it?"  # example input (my own, not from the notes)
patt = re.compile("you")
patt.finditer(text)  # yields a Match object for each "you"
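The examples below use a regex_find helper, presumably provided in the lab materials; its exact definition is not in these notes. A minimal sketch of what it could look like:

import re

def regex_find(pattern, text):
    # Hypothetical helper: return all substrings of text matching pattern.
    return re.findall(pattern, text)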

You can also do disjunctions: either this or that; "[Yy]ou" finds both "you" and "You".
regex_find("[Yy]ou", text)

Negation is also possible: "[^a-z]" finds everything except a–z (the ^ must be the first character inside the brackets).
regex_find("[^a-z]", text)

In logic, there are Kleene expressions:
• Kleene star (*): matches the preceding element zero or more times. So "o[a-z]*" matches an o followed by any number of lowercase letters, possibly none.
regex_find("o[a-z]*", text)

• Kleene plus (+): matches the preceding element one or more times, so an empty match is not returned in this case.
regex_find("u[a-z]+", text)

• wild card (.): matches anything at that position; ".e." finds 'her' in 'there'. Matching is greedy: quantifiers match as much as they can.
regex_find(".e.", text)
If we find something with this, we can use it for substitution. This expression finds every run of exclamation points and replaces it with ' !':
re.sub("!+", " !", "Amazing!!!!!!")
'Amazing !'
