Summary: Web Data Processing Systems (X_400418), Master VU Business Analytics/AI/Computer Science/Econometrics

Pages: 22
Uploaded on: 14-12-2022
Written in: 2022/2023

A summary of all lectures (1 to 12) of the Web Data Processing Systems course at VU Amsterdam. Brief and clearly summarized with relevant images where necessary.


Knowledge bases
Information retrieval was originally based on keywords. It is now increasingly based on entities.

Symbolic Knowledge Bases (KBs)

● Meaning accessible to humans
● Constructed manually or from unstructured sources
● Can be expressed using first-order logic (knowledge graphs)




Latent Models

● Meaning is hidden
● Learned using machine learning techniques
● Prominent example: Google’s word2vec

RDF (Resource Description Framework)

● Standard for making statements that describe properties of resources
● Statements are represented as triples of the form <s p o> (subject, predicate, object) and can be serialized in different formats (RDF/XML, N3, Turtle)
● An RDF dataset can be represented as a directed, labeled graph
● SPARQL (inspired by SQL) is used to query RDF databases
○ Finding answers to a query corresponds to finding all possible graph homomorphisms from the query pattern to the graph
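
The homomorphism view of query answering can be sketched in a few lines of Python. This is a minimal illustration, not how a real SPARQL engine works: the graph, the `ex:` resource names, and the helper functions `match_pattern` and `evaluate` are all made up for this example. Each answer is a variable binding that maps every triple pattern of the query onto a triple in the graph.

```python
# Toy RDF graph: a set of (subject, predicate, object) triples.
# All resource names below are invented for illustration.
GRAPH = {
    ("ex:Amsterdam", "ex:capitalOf", "ex:Netherlands"),
    ("ex:Amsterdam", "rdf:type", "ex:City"),
    ("ex:Rotterdam", "rdf:type", "ex:City"),
    ("ex:Rotterdam", "ex:locatedIn", "ex:Netherlands"),
}


def is_var(term):
    """Query variables start with '?', as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` maps onto `triple`."""
    new_binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if new_binding.setdefault(p_term, t_term) != t_term:
                return None  # variable is already bound to a different term
        elif p_term != t_term:
            return None  # constant in the pattern does not match
    return new_binding


def evaluate(patterns, graph):
    """Return every binding that maps all patterns into the graph."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in graph
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


# Roughly: SELECT ?x ?y WHERE { ?x rdf:type ex:City . ?x ex:capitalOf ?y }
query = [("?x", "rdf:type", "ex:City"), ("?x", "ex:capitalOf", "?y")]
results = evaluate(query, GRAPH)
print(results)
```

Only Amsterdam satisfies both patterns, so the query has a single answer. The nested loop is exponential in the worst case; real engines rely on indexes and join ordering instead.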


Knowledge bases on the web
WordNet

● Groups words into sets of synonyms called synsets.
● Words can be monosemous (one meaning) or polysemous (multiple meanings).
● Each synset has a gloss (short description) and is connected to other synsets through relations. The most important:
○ Hypernyms/Hyponyms (isA)
○ Meronyms/Holonyms (partOf)
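
As a rough sketch of how the isA relation can be used, the snippet below walks a hypernym chain in a hand-made mini lexicon. The `SYNSETS` table, its glosses, and the helpers `hypernym_chain` and `is_a` are assumptions for illustration, not WordNet data or a WordNet API, and each synset is simplified to a single hypernym.

```python
# Hand-made mini lexicon: each synset has a gloss and one hypernym (isA).
# Names and glosses are illustrative, not copied from WordNet.
SYNSETS = {
    "dog": {"gloss": "a domesticated canine", "hypernym": "canine"},
    "canine": {"gloss": "a member of the dog family", "hypernym": "mammal"},
    "mammal": {"gloss": "a warm-blooded vertebrate", "hypernym": "animal"},
    "animal": {"gloss": "a living organism that can move", "hypernym": None},
}


def hypernym_chain(synset):
    """Follow isA links upward and return all ancestors, nearest first."""
    chain = []
    current = SYNSETS[synset]["hypernym"]
    while current is not None:
        chain.append(current)
        current = SYNSETS[current]["hypernym"]
    return chain


def is_a(synset, ancestor):
    """True if `ancestor` is reachable from `synset` via hypernym links."""
    return ancestor in hypernym_chain(synset)


print(hypernym_chain("dog"))
print(is_a("dog", "mammal"))
```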

DBpedia

● Project to convert Wikipedia pages to RDF
● Uses structured data on the pages
● Contains links to other KBs (widely popular in the “linked data cloud”)
● Fairly large ontology but not rich in terms of expressiveness
● Alignment between infoboxes and ontologies is done via community-provided mappings

YAGO (Yet Another Great Ontology)

● Goals:
○ Unify Wikipedia and WordNet
○ Extract clean facts
○ Check plausibility of facts via type checking
● High standard in terms of quality

Freebase

● Collaborative knowledge base by its community
● Acquired by Google in 2010; its shutdown was announced in 2014, and its data was migrated to Wikidata

Wikidata

● Wikipedia is mainly text → hard to verify and to keep consistent
● Wikidata is the “data version” of Wikipedia
○ Validated by community
○ Keeps provenance of the data
○ Multilingual
○ Supports plurality (conflicting statements can coexist)
● High quality knowledge


Natural Language Processing (NLP)
Knowledge acquisition: the process of extracting knowledge (to be integrated into knowledge bases) from unstructured text or other data




Preprocessing
Tokenization

Split a character sequence into tokens (terms/words)
● Token: an instance of a sequence of characters in a particular document that are grouped together as a useful semantic unit
● Type: the class of all tokens containing the same character sequence
● Example: “A rose is a rose is a rose”
○ Tokens: 8
○ Types: 3 ({a, is, rose}, ignoring case)
Queries and documents have to be preprocessed identically, because tokenization determines which queries match.
Problems:
● Hyphens (Co-education, drag-and-drop)
● Names (San Francisco, Los Angeles)
● Language (compound nouns in German vs. separate nouns in English)
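
The token/type distinction in the example above can be checked with a small sketch. The `tokenize` function is a deliberately naive assumption (lowercase, then split on anything that is not a letter, digit, or apostrophe), which also shows why hyphens and multi-word names are a problem.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


tokens = tokenize("A rose is a rose is a rose")
types = set(tokens)
print(len(tokens), sorted(types))  # 8 ['a', 'is', 'rose']

# The naive rule splits hyphenated words and multi-word names apart.
print(tokenize("drag-and-drop"))   # ['drag', 'and', 'drop']
print(tokenize("San Francisco"))   # ['san', 'francisco']
```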

Lemmatization

Goal: reduce words to their base form (the lemma, as defined in a dictionary)
● Am, are, be, is → be
● Car, cars, car’s, cars’ → car

Stemming

Goal: reduce words to their “roots”
● Are → ar
● Automate, automates, automatic, automation → automat
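
The difference between the two approaches can be sketched as follows. Both the lemma table and the suffix list are toy assumptions chosen to reproduce the examples above. A real lemmatizer uses a full dictionary plus morphological analysis, and a real stemmer such as Porter's uses a much larger rule set.

```python
# Toy lemma dictionary (lookup) and toy suffix list (blind stripping).
# Both are illustrative assumptions, not a real linguistic resource.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}
SUFFIXES = ["ion", "ic", "es", "e"]  # tried in this order


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


for w in ["Automate", "automates", "automatic", "automation"]:
    print(w, "->", toy_stem(w))
print("are ->", toy_stem("are"), "(stem) vs", toy_lemmatize("are"), "(lemma)")
```

The stemmer never consults a dictionary, which is why it can output non-words such as "ar" and "automat".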

Stop word removal

Using a stop list, remove all stop words: words that are excluded from the IR system’s dictionary.
● Saves memory
● Makes query processing faster
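
A minimal sketch of stop word removal, assuming a tiny hand-picked stop list. Real systems use longer, language-specific lists, and the name `STOP_LIST` is only illustrative.

```python
# Tiny illustrative stop list; real lists are longer and language-specific.
STOP_LIST = {"a", "an", "the", "is", "of", "and", "to", "in"}


def remove_stop_words(tokens, stop_list=STOP_LIST):
    """Keep only the tokens that are not on the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_list]


print(remove_stop_words(["A", "rose", "is", "a", "rose"]))  # ['rose', 'rose']
```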

Part-of-speech (POS)

Assign a label to each token that indicates its grammatical function in context.
● Function words: used to make sentences grammatically correct
○ Prepositions, conjunctions, pronouns, etc.
● Content words: used to carry the meaning of a sentence
○ Nouns, verbs, adjectives, adverbs
Part-of-speech tags allow for a higher degree of abstraction to estimate likelihoods.
How do they work?
● Rule-based taggers
● Stochastic taggers: the most widely used. They typically rely on Hidden Markov Models and choose the most likely tag sequence.
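
To make the stochastic approach concrete, here is a small Viterbi decoder over a hand-written HMM. All probabilities, the three-tag tag set, and the `UNSEEN` smoothing constant are invented for illustration; a real tagger estimates them from an annotated corpus. The sketch assumes a non-empty input sentence.

```python
TAGS = ["DET", "NOUN", "VERB"]

# Made-up model parameters (normally estimated from a tagged corpus).
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {  # TRANS[previous_tag][next_tag]
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {  # EMIT[tag][word]
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "walk": 0.4, "barks": 0.1},
    "VERB": {"barks": 0.6, "walk": 0.35, "dog": 0.05},
}
UNSEEN = 1e-6  # tiny probability for word/tag pairs not in EMIT


def viterbi(words):
    """Return the most likely tag sequence for a non-empty list of words."""
    # best[i][tag] = (probability of the best path ending in tag, previous tag)
    best = [{t: (START[t] * EMIT[t].get(words[0], UNSEEN), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p][0] * TRANS[p][tag])
            prob = best[-1][prev][0] * TRANS[prev][tag] * EMIT[tag].get(word, UNSEEN)
            column[tag] = (prob, prev)
        best.append(column)
    # Backtrace from the most likely final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```

Note that "barks" and "dog" can each be a noun or a verb in this toy model. The transition probabilities are what resolve the ambiguity.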


Other NLP tasks
Parsing

Construct a tree that represents the syntactic structure of the string according to a grammar.




Constituency parsing

Breaks the phrase into sub-phrases. Nonterminals in the tree are types of phrases, the terminals are the words in the sentence, and the edges are unlabeled.

Dependency parsing

Connects the words according to their relationships. Each vertex in the tree represents a word, child nodes are words that depend on their parent, and edges are labeled by the relationship.
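
The two tree shapes can be compared as plain data structures. The sentence, the phrase and relation labels, and the helper names `leaves` and `root` are assumptions for illustration. In practice, both trees would be produced by a parser rather than written by hand.

```python
# Constituency tree as nested tuples: (label, child, child, ...).
# Inner labels are phrase types or POS tags; leaves are the words.
CONSTITUENCY = (
    "S",
    ("NP", ("DT", "the"), ("NN", "dog")),
    ("VP", ("VBZ", "chases"), ("NP", ("DT", "a"), ("NN", "cat"))),
)

# Dependency tree as labeled edges: (head, relation, dependent).
# Every vertex is a word; there are no phrase nodes.
DEPENDENCY = [
    ("chases", "nsubj", "dog"),
    ("dog", "det", "the"),
    ("chases", "obj", "cat"),
    ("cat", "det", "a"),
]


def leaves(tree):
    """Collect the terminals (words) of a constituency tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the nonterminal label
        words.extend(leaves(child))
    return words


def root(edges):
    """The root is the only head that is not itself a dependent."""
    heads = {head for head, _, _ in edges}
    dependents = {dep for _, _, dep in edges}
    (only_root,) = heads - dependents
    return only_root


print(leaves(CONSTITUENCY))  # ['the', 'dog', 'chases', 'a', 'cat']
print(root(DEPENDENCY))      # chases
```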


Information Extraction
Two types of information extraction: Named Entity Recognition (NER) and Relation Extraction (RE).
Seller: thomezechiels, Vrije Universiteit Amsterdam