100% tevredenheidsgarantie Direct beschikbaar na je betaling Lees online óf als PDF Geen vaste maandelijkse kosten 4,6 TrustPilot
logo-home
Samenvatting

Summary 2-page Overview Data Analysis & Retrieval Midterm (INFOB3DAR)

Beoordeling
-
Verkocht
1
Pagina's
2
Geüpload op
01-06-2022
Geschreven in
2021/2022

2-page overview (cheatsheet) of all the subjects discussed in the first part of the Data Analysis & Retrieval course, very concisely summarized. Summary of the summary. Includes several examples of MapReduce algorithms.









Oeps! We kunnen je document nu niet laden. Probeer het nog eens of neem contact op met support.

Documentinformatie

Geüpload op
1 juni 2022
Aantal pagina's
2
Geschreven in
2021/2022
Type
Samenvatting

Onderwerpen

Voorbeeld van de inhoud

Precision/selec vity: frac on of retrieved Scoring & Ranking No Random Access Algorithm (NRA)
docs that is relevant to user’s need Term frequency ( t,d): how many mes does - Can’t find correspondin oid value other list
Recall/sensi vity: relevant frac on of all doc term occur in document - Store range of values of f in buffer, update
Log-frequency weigh ng (wt,d): when necessary, so if max-B decreased
Indexing wt,d = 0 if t,d = 0 since last round, update buffer entries for
Boolean retrieval wt,d = 1 + log( t,d) if t,d > 0 which value of B was not determined yet
- Term-document incidence matrix: shows - Upper limit of range is needed when lower
what docs a term appears in with 1 and 0 Document frequency (dft): nr of documents limit of a tuple exceeds upper limit of all
- ‘x AND y BUT NOT z’ ⇒ 1101 && 1100 in collec on that contain the term t entries in buffer not yet in top-k (will
&& (NOT 0101) = 1000 Inverse df (idft): idft = log(N / dft), where N is definitely have higher score than those)
- Sparse matrix: only store docIds that do the total nr of documents in collec on
contain the term in a pos ng list - Measure of rareness of a term
- Dic onary as hash table/B-tree/trie - log(1) = 0, so term that occurs in every
document does not contribute to score
List merging
1. Add current docID to result if in both lists -idf weigh ng: wt,d ⋅ idft
2. Otherwise, get next docID of list where - Can’t handle single term queries
current is lower than other - Favours long documents
- Skip pointer: when docID that can
be skipped to is lower than current Vector-space model
docID in other list, skip to it - Each doc represented as D-dimensional
vector, with D the nr of all known terms
Posi onal index: each term has a list of - Similarity is defined as the angle between
tuples <docID, list of posi ons> query vector and doc vector, normalised
- “to be or not to be”, ‘be’ must appear - Use the cosine of the angle
twice, 4 posi ons apart - Inner product: x • y = ||x||⋅||y||⋅ cos(𝛉)
𝐷 Frequent item sets
Wild-card queries - ||𝑥|| = ||𝑥||2 = ∑ 𝑥𝑖
2
Table r(U) with items U, columns are the
- pre* ⇒ all terms between pre - prf 𝑖=1 items,rows are transac ons.
- *post ⇒ inverted B-tree, tsop - tsoq - s(X): support of set of items X; nr of
- pre*post ⇒ (1) intersect pre* and *post Top-k searching transac ons where all of X was bought
(2) permuterm index Monotone: func on whose return value will - s(XY): rela on X → Y, s(XY) = s(X ∪ Y)
not increase/decrease when the value of - confidence: s(XY)/s(X)
Permuterm index one of the params is increased - Of all t’s that contain X, s(XY)/s(X) of
- For term hello, create entries hello$, them also contain Y
ello$h, llo$he, …, o$hell, connected to A ributes which are arguments of f are
same pos ng list as hello represented by separate lists. Algorithm
- Query he*o ⇒ add $, rotate un l * at end, For main-memory RDBMS with column 1. Find all frequent sets Z (exceeding some
e*o$h - o$he*, matches o$hello store, sorted access by efficient internal threshold for their support)
sor ng techniques, random access by index - Exceeding support threshold ensures
MapReduce structure or duplicate list sorted by oid. that you can split Z up into subsets
Map :: <k,v> ⇒ [<k1’,v1’>,...,<km’,vm’>] that together form rela on that also
- Input split into disjoint chunks of key-value Threshold Algorithm (TA) supports threshold (union of subsets)
pairs, assigned to 1 Map worker - Max-A/B: max value of A or B that could 2. For all non-empty subsets X of Z, test if X
- Map worker performs calc on each pair be found in coming rounds → (Z - X) holds with sufficient confidence
- Outputs from all workers grouped on key - Threshold T: max possible future value of f
- With f = A + B, T = max-A + max-B Gaussian elimina on
Reduce :: <k’,[v1’,...,vn’]> ⇒ [<k1’,v1’’>,...] - Buffer: tuples [oid,f], where f is calc using - Get all columns from le to right to
- Par on func on: hash(key) % R, where R the values for that oid in both lists represent the iden ty matrix, the
is nr of reduce tasks (>> nr of workers) - Par al top-k: set of tuples in buffer f ≥ T augmented column is then the solu on
- Sets of pairs distributed over Reduce - Update these values each round and move - Apply rules:
workers, which perform calc on each pair tuples with f ≥ T to the top-k - Interchanging two rows
- Reduce the list of values to just one value - Mul plying row by non-zero constant
- Adding mul ple of one row to other
Failure handling - First row is a, second b, third c
- Robustness: mask hardware failures by 1. b -= - 2a 5. c -= b
replica ng files in different racks 2. c -= a 6. c *= -½
- Master pings workers periodically 3. b *= -1 7. b -= 3c
- Map worker fail: completed/in-progress 4. a -= b 8. a += c
tasks set to idle and rescheduled to others
- Reduce worker fail: in-progress tasks idle
- Master fails: all is lost, ripperoni

Early combining
- k mes emit(<w,1>), just one emit(<w,k>)
- Init_Map() ⇒ Map() ⇒ Finalize_Map()
- Possible outcomes:

Maak kennis met de verkoper

Seller avatar
De reputatie van een verkoper is gebaseerd op het aantal documenten dat iemand tegen betaling verkocht heeft en de beoordelingen die voor die items ontvangen zijn. Er zijn drie niveau’s te onderscheiden: brons, zilver en goud. Hoe beter de reputatie, hoe meer de kwaliteit van zijn of haar werk te vertrouwen is.
Suniht Universiteit Utrecht
Bekijk profiel
Volgen Je moet ingelogd zijn om studenten of vakken te kunnen volgen
Verkocht
94
Lid sinds
4 jaar
Aantal volgers
55
Documenten
19
Laatst verkocht
1 maand geleden

3,9

13 beoordelingen

5
7
4
2
3
2
2
0
1
2

Recent door jou bekeken

Waarom studenten kiezen voor Stuvia

Gemaakt door medestudenten, geverifieerd door reviews

Kwaliteit die je kunt vertrouwen: geschreven door studenten die slaagden en beoordeeld door anderen die dit document gebruikten.

Niet tevreden? Kies een ander document

Geen zorgen! Je kunt voor hetzelfde geld direct een ander document kiezen dat beter past bij wat je zoekt.

Betaal zoals je wilt, start meteen met leren

Geen abonnement, geen verplichtingen. Betaal zoals je gewend bent via iDeal of creditcard en download je PDF-document meteen.

Student with book image

“Gekocht, gedownload en geslaagd. Zo makkelijk kan het dus zijn.”

Alisha Student

Veelgestelde vragen