Lecture notes

Cognitive Science 2: Summary of the lectures

Uploaded on: 14-01-2022
Written in: 2021/2022

This document contains my notes and a summary of the lectures given by Chris Emmery for Cognitive Science 2 in quartile 2. The course has been revised, and this is the first year in which these changes have been implemented. The equations and exercises mentioned in the lectures are listed, but the document contains only a theoretical explanation of the algorithms discussed.

Written for

Institution: Tilburg University (UVT)
Study: Data Science
Course: Cognitive Science 2 (JBC090)

Document information

Uploaded on: 14 January 2022
Number of pages: 12
Written in: 2021/2022
Type: Lecture notes
Teacher(s): Chris Emmery
Contains: All lectures

Topics

university
data science
cognitive science
text mining
bachelor
classification
decision trees
regularization
regex
knn
svd
representation
inf
tilburg university
eindhoven university of technology
tue

Preview of the content

COGNITIVE SCIENCE 2
TECHNISCHE UNIVERSITEIT EINDHOVEN & TILBURG UNIVERSITY
QUARTILE 2: 2021-2022

1. Introduction
Text Mining Preliminaries
The bare-minimum approach is to convert text to vectors: the language has to be turned into numbers.
$d$ = the cat sat on the mat

cat  mat  on  sat  the
[ 1    1    1   1    1 ]

Bag-of-words representation: each position holds a 1 if the word occurs in the document and a 0 otherwise. The representation is unordered.
We can document these sentences as instances:
$d_0$ = the cat sat on the mat
$d_1$ = my cat sat on my cat

        cat  mat  my  on  sat  the
$d_0$    1    1    0   1   1    1
$d_1$    1    0    1   1   1    0
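
As a minimal sketch of how the matrix above could be built, the following Python snippet splits each document on whitespace and marks which vocabulary terms occur. The function name `binary_bow` and the whitespace tokenization are assumptions made for illustration, not something prescribed in the lectures.

```python
def binary_bow(docs):
    """Return (vocabulary, matrix) with one 0/1 row per document."""
    # Naive tokenization: split on whitespace (an assumption for this sketch).
    tokenized = [doc.split() for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # 1 if the term occurs in the document, 0 otherwise; counts are ignored.
    matrix = [[1 if term in toks else 0 for term in vocab] for toks in tokenized]
    return vocab, matrix


docs = ["the cat sat on the mat", "my cat sat on my cat"]
vocab, X = binary_bow(docs)
print(vocab)  # ['cat', 'mat', 'my', 'on', 'sat', 'the']
print(X)      # [[1, 1, 0, 1, 1, 1], [1, 0, 1, 1, 1, 0]]
```

Sorting the vocabulary keeps the column order stable, so rows from different documents can be compared position by position.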

The representation is more easily written as a Documents × Terms matrix.
• $d = V \times X$ with $V$: vocabulary and $X$: feature space

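For document similarity, we use the Jaccard coefficient:
\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]
Here $|A \cap B|$ is the number of terms for which both documents have a 1, and $|A \cup B|$ is the number of terms for which at least one of the two documents has a 1 (terms that occur in neither document are left out).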
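
The Jaccard coefficient can be sketched in a few lines of Python using sets. The function name `jaccard` and the choice to return 0.0 for two empty documents are assumptions of this sketch.

```python
def jaccard(a, b):
    """Jaccard coefficient of two token collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty documents
    return len(set_a & set_b) / len(union)


d0 = "the cat sat on the mat".split()
d1 = "my cat sat on my cat".split()
score = jaccard(d0, d1)
print(score)  # 0.5
```

The two example documents share three terms (cat, on, sat) out of six distinct terms in total, which gives 3/6.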

Binary vs. frequency
• Binary is a very compact representation
• Algorithms like Decision Trees have a straight-forward and compact structure
• Binary says very little about the weight of each word
• We can’t use more advanced algorithms that work with Vector Spaces

Notation Term Frequencies
• $D = \{d_1, d_2, \dots, d_N\}$ is the set of documents
• $T = \{t_1, t_2, \dots, t_M\}$ is a set of index terms for $D$
• Each document $d_i \in D$ can be represented as a frequency vector:
o $\vec{d}_i = \langle \mathrm{tf}(t_1, d_i), \dots, \mathrm{tf}(t_M, d_i) \rangle$, where $\mathrm{tf}(t_m, d_i)$ is the frequency of term $t_m \in T$ in document $d_i$.

Raw term frequency does not capture importance very well; it should be on a log scale, so we use $\ln(\mathrm{tf}(t, d) + 1)$.
There are still problems: longer documents have more words, and rare terms do not occur often but are the
most important. If two documents share the same rare words, they are more similar.
The rare-term problem is solved via (inverse) document frequency.
\[ \mathrm{idf}_t = \log_b \frac{N}{\mathrm{df}_t} \]
with $N$ the number of documents and $\mathrm{df}_t$ the document frequency of $t$, i.e. the number of documents in which
$t$ occurs. The base $b$ is typically 10.

Normalizing Vector Representations: Both terms can be combined by multiplication:
\[ w_{t,d} = \ln(\mathrm{tf}(t, d) + 1) \cdot \log_b \frac{N}{\mathrm{df}_t} \]
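
To make the weighting concrete, here is a small sketch that computes the log-scaled term frequency times the inverse document frequency for every term in every document. The function name `tfidf_weights`, the whitespace tokenization and the default base of 10 are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_weights(docs, base=10):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts for this document
        weights.append({
            term: math.log(count + 1) * math.log(n_docs / df[term], base)
            for term, count in tf.items()
        })
    return weights


docs = ["the cat sat on the mat", "my cat sat on my cat"]
w = tfidf_weights(docs)
print(w[0]["cat"])  # 0.0, because "cat" occurs in every document
print(w[0]["the"] > w[0]["mat"])  # True: same df, but "the" occurs twice
```

With only two documents, every term that appears in both gets weight zero, which illustrates how the idf factor suppresses terms that do not discriminate between documents.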

One way of calculating the “length” of a document is the Euclidean distance: documents with many words
are far away.
\[ d(\vec{p}, \vec{q}) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \]

As a result, Euclidean distance effectively only matches documents that have similar lengths. It is possible to correct for this using
$\ell_2$ normalization, dividing each vector by its $\ell_2$ norm: $\|\vec{p}\|_2 = \sqrt{\sum_i p_i^2}$.
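
The effect of document length on Euclidean distance, and how normalization removes it, can be sketched as follows. The helper names `euclidean` and `l2_normalize` and the toy vectors are assumptions chosen for this illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def l2_normalize(p):
    """Scale p to unit l2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(pi ** 2 for pi in p))
    return [pi / norm for pi in p] if norm else list(p)


short = [1, 1, 0]
long_ = [10, 10, 0]  # same word proportions, but ten times as many words

raw = euclidean(short, long_)
normalized = euclidean(l2_normalize(short), l2_normalize(long_))
print(raw)         # about 12.73: far apart purely because of length
print(normalized)  # approximately 0: identical direction after normalization
```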

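Cosine similarity: the dot product of two vectors under the assumption that they are $\ell_2$ normalized.
\[ \mathrm{SIM} = \vec{p} \cdot \vec{q} = \sum_{i=1}^{n} p_i q_i \]
If the vectors are not normalized beforehand, normalize in the similarity:
\[ \mathrm{SIM} = \frac{\vec{p} \cdot \vec{q}}{\sqrt{\vec{p} \cdot \vec{p}} \, \sqrt{\vec{q} \cdot \vec{q}}} \]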
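
A minimal sketch of cosine similarity with the normalization done inside the similarity itself is shown below. The function name `cosine_similarity` and the convention of returning 0.0 for an all-zero vector are assumptions of this sketch.

```python
import math


def cosine_similarity(p, q):
    """Dot product of p and q divided by the product of their l2 norms."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_p * norm_q)


# Binary vectors over the vocabulary: cat, mat, my, on, sat, the
d0 = [1, 1, 0, 1, 1, 1]  # the cat sat on the mat
d1 = [1, 0, 1, 1, 1, 0]  # my cat sat on my cat
sim = cosine_similarity(d0, d1)
print(sim)  # 3 / (sqrt(5) * sqrt(4)), roughly 0.67
```

Because the norms are divided out, scaling either vector by a positive constant leaves the similarity unchanged, which is what makes the measure insensitive to document length.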

2. Collecting Data
Noisy Text
Text can be noisy. If we do not filter the noise, our model processes input that is not in the expected format. Typos also have a
tremendous effect on the size of the vocabulary and on the representation of your documents (and thus on the
quality of the similarity).

Language variations: abbreviations, acronyms, capitalization, character flooding, concatenations,
emoticons, dialect, slang, typos.

Regular Expressions
Finding and reducing those noisy text errors is possible via regular expressions: a mini scripting language
of logic that is one of the few things that is largely standardized across programming languages. It is a way to
define strings and can be used to find patterns and also to replace the matches found.
import re
patt = re.compile("you")  # compile the pattern once
patt.finditer(text)       # iterator over every match of "you" in text

You can also do disjunctions: either this or that; "[Yy]ou" finds both you and You.
regex_find("[Yy]ou", text)

Negation is also possible: "[^a-z]" finds every character except a-z.
regex_find("[^a-z]", text)
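
The `regex_find` helper used in these notes is not defined here, so the sketch below uses Python's standard `re.findall` as a stand-in; treating the two as equivalent is an assumption. The example string is made up for illustration.

```python
import re

text = "Hi you, You!"

# Disjunction: a character class matches either 'Y' or 'y' at that position.
disjunction = re.findall("[Yy]ou", text)
print(disjunction)  # ['you', 'You']

# Negation: '^' as the first character inside the brackets inverts the class,
# so this returns every character that is not a lowercase letter.
negation = re.findall("[^a-z]", text)
print(negation)  # ['H', ' ', ',', ' ', 'Y', '!']
```

Note that the caret only negates when it is the first character inside the brackets; a space before it, as in "[ ^a-z]", would instead match a space, a literal caret, or a lowercase letter.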

In logic, there are Kleene expressions:
• Kleene star: matches the preceding element zero or more times. "o[a-z]*" matches an o followed by
any number of lowercase letters (including none).
regex_find("o[a-z]*", text)

• Kleene plus: matches the preceding element one or more times. So, a u with nothing after it is not returned in this case.
regex_find("u[a-z]+", text)

• Wildcard: the dot matches any character at its position. ".e." finds 'her' in 'there'. Matching with * and + is greedy: it consumes as many characters as possible.
regex_find(".e.", text)
If a pattern finds something, we can also use it for substitution. This expression finds each run of exclamation
points and replaces it with ' !'.
re.sub('!+', ' !', "Amazing!!!!!!")
'Amazing !'
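
The Kleene operators, the wildcard and substitution can be tried out with Python's standard `re` module, as in the sketch below. The example strings are made up for illustration, and `re.findall` is used as an assumed stand-in for the `regex_find` helper.

```python
import re

# Kleene star: 'o' followed by zero or more lowercase letters.
star = re.findall("o[a-z]*", "go on")
print(star)  # ['o', 'on']

# Kleene plus: 'u' followed by at least one lowercase letter,
# so the 'u' at the end of "you" and the lone "u" are not returned.
plus = re.findall("u[a-z]+", "you run u")
print(plus)  # ['un']

# Wildcard: any character, then 'e', then any character.
wildcard = re.findall(".e.", "there")
print(wildcard)  # ['her']

# Substitution: collapse a run of exclamation points into a single ' !'.
cleaned = re.sub("!+", " !", "Amazing!!!!!!")
print(cleaned)  # Amazing !
```

The quantifier has to follow the element it repeats: "!+" is a valid pattern, whereas a pattern that starts with "+" has nothing to repeat and raises an error in Python.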
