Data-intensive Programming, Lecture #3 (slides3.pdf · 2016-09-16)
TRANSCRIPT

Data-intensive Programming Lecture #3
Timo Aaltonen, Department of Pervasive Computing
Guest Lectures
• I'll try to organize two guest lectures
• Oct 14, Tapio Rautonen, Gofore Ltd: Making sense out of your big data
• Oct 7: ???
Outline
• Course Work
• Apache Sqoop
• SQL Recap
• MapReduce Examples
  – Inverted Index
  – Finding Friends
  – Computing PageRank
• (Hadoop)
  – Combiner
  – Other programming languages
Course Work
• MySportShop is a sports gear retailer. All the sales happen online in their web store. Examples of their products are different game jerseys and sports watches.
• The web store has an Apache web server for the incoming HTTP requests. The web server logs all traffic to a log file.
  – Using these logs, one can study the browsing behavior of the users.
• The sales data of MySportShop is in PostgreSQL, which is a relational database. Among other things, the database has a table order_items containing data on all sales events of the shop.
Course Work: Questions
• Based on the data, answer the following questions:
  1. What are the top-10 best-selling products in terms of total sales?
  2. What are the top-10 browsed products?
  3. What anomaly is there between these two?
  4. What are the most popular browsing hours?
Course Work
• Since the managers of the company don't use Hadoop but an RDBMS, all the data must be transferred to PostgreSQL
• In order to do that:
  – Transfer the Apache logs (with Apache Flume) to HDFS
  – Compute the viewing frequencies of the different products using MapReduce (Question 2)
  – Compute the viewing hour data with MapReduce (Q4)
  – Transfer the results (with Apache Sqoop) to PostgreSQL
  – Find answers to the questions in PostgreSQL using SQL (Q1-4)
Environment: three options
1. You can use your own computer by installing VirtualBox 5.x
  – We offer you a virtual machine with all required software and data installed
  – In the next weekly exercises the assistants solve VirtualBox-related problems, if you encounter any
2. We offer you a virtual machine from the TUT cloud
  – All required software and data is installed
  – No graphical user interface
  – Guidance available in the weekly exercises
3. Your own installation / a cloud service can be used
  – No help from the course personnel
Course Work
• The work is done in groups of three
  – Enroll in Moodle: https://moodle2.tut.fi/course/view.php?id=9954
  – opens today at 10 o'clock
• Deadline is Oct 14th
• Instructions for returning the work will be published later
  – IntelliJ IDEA project
Course Work
• Material
  – https://flume.apache.org/FlumeUserGuide.html
  – https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
  – http://hadoop.apache.org/docs/r2.7.3/
  – https://www.postgresql.org/docs/9.5/static/index.html
MapReduce
• Simple programming model
• Map is stateless
  – allows running map functions in parallel
• Also Reduce can be executed in parallel
• The canonical example is the word count
Inverted Index
• Collating
  – Problem: There is a set of items and some function of one item. It is required to save all items that have the same value of the function into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is the building of inverted indexes.
  – Solution: The mapper computes the given function for each item and emits the value of the function as the key and the item itself as the value. The reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the items are terms (words) and the function is the ID of the document where the term was found.
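As a sketch, the collating pattern for a simple inverted index can be simulated in plain Python. The map/reduce split below runs in one process and the deduplication in the reducer is an illustrative choice, not Hadoop code:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Mapper: emit a (term, doc_id) pair for every term occurrence.
    for term in text.lower().split():
        yield term, doc_id

def reduce_phase(pairs):
    # Shuffle groups the pairs by term; the reducer then saves each
    # group, here as a deduplicated, sorted list of doc IDs.
    groups = defaultdict(list)
    for term, doc_id in pairs:
        groups[term].append(doc_id)
    return {term: sorted(set(ids)) for term, ids in groups.items()}

docs = {1: "this doc contains text", 2: "my doc contains my text"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
# e.g. index["doc"] == [1, 2] and index["my"] == [2]
```

This reproduces the "word, list of doc IDs" output of the simple inverted index on the next slide.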
Simple Inverted Index

Doc #1: "This doc contains text"
Doc #2: "My doc contains my text"

Mapper output for Doc #1: (this,1) (doc,1) (contains,1) (text,1)
Mapper output for Doc #2: (my,2) (doc,2) (contains,2) (my,2) (text,2)

Reduced output:
  this: 1
  doc: 1,2
  contains: 1,2
  text: 1,2
  my: 2

• Reduced output: word, list of doc IDs
(Normal) Inverted Index

Doc #1: "This doc contains text"
Doc #2: "My doc contains my text"

Mapper output for Doc #1: (this,(1,1)) (doc,(1,1)) (contains,(1,1)) (text,(1,1))
Mapper output for Doc #2: (my,(2,1)) (doc,(2,1)) (contains,(2,1)) (my,(2,1)) (text,(2,1))

Reduced output:
  this: (1,1)
  doc: (1,1),(2,1)
  contains: (1,1),(2,1)
  text: (1,1),(2,1)
  my: (2,2)

• Reduced output: word, list of (docID, frequency)
Using Inverted Index: Searching
• Documents
  – D1: He likes to wink, he likes to drink.
  – D2: He likes to drink, and drink, and drink.
  – D3: The thing he likes to drink is ink.
  – D4: The ink he likes to drink is pink.
  – D5: He likes to wink and drink pink ink.
• Index
  – he: (1,2),(2,1),(3,1),(4,1),(5,1)
  – ink: (3,1),(4,1),(5,1)
  – pink: (4,1),(5,1)
  – thing: (3,1)
  – wink: (1,1),(5,1)
Using Inverted Index
• Indexing makes search engines fast
• Data is sparse, since most words appear in only one document
  – (id, val) tuples
  – sorted by id
  – compact
  – very fast
• Linear merge

Index:
  he: (1,2),(2,1),(3,1),(4,1),(5,1)
  ink: (3,1),(4,1),(5,1)
  pink: (4,1),(5,1)
  thing: (3,1)
  wink: (1,1),(5,1)
Linear Merge
• Find documents matching query {ink, wink}
  – Load the inverted lists for all query words
  – Linear merge, O(n)
    • n is the total number of items in the two lists
    • f() is a scoring function: how well a doc matches the query

  ink  --> (3,1) (4,1) (5,1)
  wink --> (1,1) (5,1)

Matching set: 1:f(0,1)  3:f(1,0)  4:f(1,0)  5:f(1,1)
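The merge itself is a standard two-pointer walk over two sorted posting lists; a minimal Python sketch (the function name and the dict output shape are illustrative assumptions):

```python
def linear_merge(list_a, list_b):
    # Each list holds sorted (doc_id, count) tuples. The result maps
    # doc_id -> (count_in_a, count_in_b) for docs in either list,
    # i.e. the per-document inputs to the scoring function f().
    i = j = 0
    matches = {}
    while i < len(list_a) and j < len(list_b):
        da, ca = list_a[i]
        db, cb = list_b[j]
        if da == db:
            matches[da] = (ca, cb); i += 1; j += 1
        elif da < db:
            matches[da] = (ca, 0); i += 1
        else:
            matches[db] = (0, cb); j += 1
    # Drain whichever list still has entries.
    for d, c in list_a[i:]:
        matches[d] = (c, 0)
    for d, c in list_b[j:]:
        matches[d] = (0, c)
    return matches

ink  = [(3, 1), (4, 1), (5, 1)]
wink = [(1, 1), (5, 1)]
# linear_merge(ink, wink) reproduces the matching set above:
# {1: (0, 1), 3: (1, 0), 4: (1, 0), 5: (1, 1)}
```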
Scoring Function
• Specify which docs are matched
  – in: counts of query words in a doc
  – out: ranking score
    • how well the doc matches the query
    • 0 if the document does not match
  – Example:
    • Boolean AND: f(Q, D) = ∏_{q∈Q} (1 if n_q > 0, 0 if n_q = 0)
      – 1 iff all query words are present
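The Boolean AND scorer translates directly into code (a minimal sketch; `boolean_and` is a hypothetical name):

```python
import math

def boolean_and(counts):
    # counts: per-query-word occurrence counts n_q in one document.
    # The product is 1 iff every query word appears at least once.
    return math.prod(1 if n > 0 else 0 for n in counts)

# Applied to the matching set from the {ink, wink} merge:
scores = {1: boolean_and((0, 1)), 3: boolean_and((1, 0)),
          4: boolean_and((1, 0)), 5: boolean_and((1, 1))}
# only document 5 contains both query words
```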
Phrases and Proximity
• Query "pink ink" as a phrase
• Using a regular index:
  – match #and(pink, ink) ->
  – scan the matching documents for the query string (slow)
• Idea: index all bi-grams as words
  – can approximate "drink pink ink"
  – fast, but the index size explodes
  – inflexible: can't query #5(pink, ink)
• Construct a proximity index

D4: The ink he likes to drink is pink.
D5: He likes to wink and drink pink ink.

  pink_ink   --> (5,1)
  drink_pink --> (5,1)
Proximity Index
• Embed position information into the inverted lists
  – called a positional/proximity index (prox-list)
  – handles arbitrary phrases, windows
  – key to "rich" indexing: structure, fields, tags, …
Proximity Index

Doc #1: "This doc contains text"
Doc #2: "My doc contains my text"

Mapper output for Doc #1: (this,(1,1)) (doc,(1,2)) (contains,(1,3)) (text,(1,4))
Mapper output for Doc #2: (my,(2,1)) (doc,(2,2)) (contains,(2,3)) (my,(2,4)) (text,(2,5))

Reduced output:
  this: (1,1)
  doc: (1,2),(2,2)
  contains: (1,3),(2,3)
  text: (1,4),(2,5)
  my: (2,1),(2,4)

• Reduced output: word, list of (docID, location)
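The positional index above follows the same map/reduce pattern, only with (docID, position) values; a plain-Python sketch (simulated in one process, not Hadoop code):

```python
from collections import defaultdict

def positional_index(docs):
    # docs: {doc_id: text}. Positions are 1-based, as on the slide;
    # each posting is a (doc_id, position) pair.
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for pos, word in enumerate(docs[doc_id].lower().split(), start=1):
            index[word].append((doc_id, pos))
    return dict(index)

index = positional_index({1: "this doc contains text",
                          2: "my doc contains my text"})
# e.g. index["my"] == [(2, 1), (2, 4)]
```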
Proximity Index
• Documents
  – D1: He likes to wink, he likes to drink.
  – D2: He likes to drink, and drink, and drink.
  – D3: The thing he likes to drink is ink.
  – D4: The ink he likes to drink is pink.
  – D5: He likes to wink and drink pink ink.
• Index
  – he: (1,1),(1,5),(2,1),(3,3),(4,3),(5,1)
  – ink: (3,8),(4,2),(5,8)
  – pink: (4,8),(5,7)
  – thing: (3,2)
  – wink: (1,4),(5,4)
Using Proximity Index
• Query: "pink ink"
• Linear Merge
  – compare the doc IDs under the pointers
  – if they match, check pos(ink) - pos(pink) = 1
  – near operator

  ink  --> (3,8) (4,2) (5,8)
  pink --> (4,8) (5,7)
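The phrase check on top of the merge can be sketched as follows (`phrase_match` is a hypothetical name; for brevity this version collects positions per document instead of walking two pointers):

```python
def phrase_match(first_list, second_list, distance=1):
    # Posting lists of (doc_id, position) pairs. Report the docs where
    # some occurrence of the second word sits exactly `distance`
    # positions after an occurrence of the first word.
    positions = {}
    for doc, pos in first_list:
        positions.setdefault(doc, set()).add(pos)
    return sorted({doc for doc, pos in second_list
                   if pos - distance in positions.get(doc, ())})

pink = [(4, 8), (5, 7)]
ink  = [(3, 8), (4, 2), (5, 8)]
# "pink ink" as a phrase matches only document 5 (pos 7 then pos 8)
```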
Structure and Tags
• Documents are not always flat
  – meta-data: title, author, date
  – structure: part, chapter, section, paragraph
  – tags: named entity, link, translation
• Options for dealing with structure
  – create a separate index for each field (like in SQL)
  – push structure into the index values
  – construct an extent index
Extent Index
• Special "term" for each element, field or tag
  – spans a region of text
    • words in the span belong to the field
  – allows multiple overlapping spans
  – similar to stand-off annotation formats
Extent Index
• Documents
  – D1: He likes to wink, he likes to drink.
  – D2: He likes to drink, and drink, and drink.
  – D3: The thing he likes to drink is ink.
  – D4: The ink he likes to drink is pink.
  – D5: He likes to wink and drink pink ink.
• Index
  – he: (1,1),(1,5),(2,1),(3,3),(4,3),(5,1)
  – ink: (3,8),(4,2),(5,8)
  – pink: (4,8),(5,7)
  – thing: (3,2)
  – wink: (1,4),(5,4)
  – link: (3,1:2),(4,1:2),(5,7:8)
Using Extent Index
• Query: find an ink-related hyper-link
• Same approach as with proximity
  – only now "tag" and "word" must have distance = 0
  – Linear Merge, match when positions fall into the extent
  – amenable to all optimizations

  ink  --> (3,8) (4,2) (5,8)
  link --> (3,1:2) (4,1:2) (5,7:8)
Overview of Inverted Indices
• Normal
• Positional
  – phrases, near operator
• Extent
  – metadata, structure
MR Example: Finding Friends
• http://stevekrenzel.com/finding-friends-with-mapreduce
• Facebook could use MapReduce in the following way
MR Example: Finding Friends
• Facebook has a list of friends
  – the relation is bidirectional
• FB has lots of disk space and serves millions of requests per day
• Certain results are pre-computed to reduce the processing time of requests
  – E.g. "You and Joe have 230 mutual friends"
  – The list of common friends is quite stable
  – so recalculating it on every request would be wasteful
MR Example: Finding Friends
• Idea: MapReduce is used to calculate the common friends daily and store the results
  – later only a quick lookup is needed
• Assume the friends are stored as
  Person ⟶ [List of friends]
  – A ⟶ [B, C, D]
  – B ⟶ [A, C, D, E]
  – C ⟶ [A, B, D, E]
  – D ⟶ [A, B, C, E]
  – E ⟶ [B, C, D]
MR Example: Finding Friends
• Each line is input for a mapper
• For every friend in the list of friends, the mapper will emit a (key, value) pair, where
  – key is
    • (person, friend), if person < friend
    • (friend, person), otherwise
  – value is the list of the person's friends
MR Example: Finding Friends
map(A, [B, C, D]):
  (A, B), [B, C, D]
  (A, C), [B, C, D]
  (A, D), [B, C, D]
map(B, [A, C, D, E]):
  (A, B), [A, C, D, E]
  (B, C), [A, C, D, E]
  (B, D), [A, C, D, E]
  (B, E), [A, C, D, E]
map(C, [A, B, D, E]):
  (A, C), [A, B, D, E]
  (B, C), [A, B, D, E]
  (C, D), [A, B, D, E]
  (C, E), [A, B, D, E]
map(D, [A, B, C, E]):
  (A, D), [A, B, C, E]
  (B, D), [A, B, C, E]
  (C, D), [A, B, C, E]
  (D, E), [A, B, C, E]
map(E, [B, C, D]):
  (B, E), [B, C, D]
  (C, E), [B, C, D]
  (D, E), [B, C, D]
MR Example: Finding Friends
• After shuffling, the inputs to the reducers are:
  (A, B), [[B, C, D], [A, C, D, E]]
  (A, C), [[B, C, D], [A, B, D, E]]
  (A, D), [[B, C, D], [A, B, C, E]]
  (B, C), [[A, C, D, E], [A, B, D, E]]
  (B, D), [[A, C, D, E], [A, B, C, E]]
  (B, E), [[A, C, D, E], [B, C, D]]
  (C, D), [[A, B, D, E], [A, B, C, E]]
  (C, E), [[A, B, D, E], [B, C, D]]
  (D, E), [[A, B, C, E], [B, C, D]]
MR Example: Finding Friends
• Each line is given to a reducer
• The reducer computes the intersection of the sets
  – and removes the persons of the key pair
• For example, (A, B), [[B, C, D], [A, C, D, E]] is reduced to (A, B), [C, D]
  (A, C), [B, D]      (B, E), [C, D]
  (A, D), [B, C]      (C, D), [A, B, E]
  (B, C), [A, D, E]   (C, E), [B, D]
  (B, D), [A, C, E]   (D, E), [B, C]
• Now, when D visits B, the common friends [A, C, E] are found fast
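The whole job can be simulated in a few lines of Python (an in-process stand-in for the map, shuffle, and reduce phases above, not a real Hadoop job):

```python
from collections import defaultdict

friends = {"A": ["B", "C", "D"], "B": ["A", "C", "D", "E"],
           "C": ["A", "B", "D", "E"], "D": ["A", "B", "C", "E"],
           "E": ["B", "C", "D"]}

# Map: for each friend, emit the ordered pair and the full friend list.
pairs = defaultdict(list)
for person, flist in friends.items():
    for friend in flist:
        key = (person, friend) if person < friend else (friend, person)
        pairs[key].append(set(flist))   # the shuffle groups by key

# Reduce: each key now has exactly two friend lists; intersect them.
mutual = {key: sorted(a & b) for key, (a, b) in pairs.items()}
# e.g. mutual[("B", "D")] == ["A", "C", "E"]
```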
MR: PageRank
• Google's description
  – relies on the "uniquely democratic" nature of the web
  – interprets a link from page A to page B as "a vote"
    • A → B means A thinks B is worth something
  – many links mean that B must be good
  – content-independent measure
• Used as a ranking feature, combined with content
  – not all pages linking to B are equally important
  – a single link from Slashdot or CNN may be worth thousands
• Google PageRank
  – how many "good" pages link to B
PageRank: Random Surfer
• Analogy
  – the user starts browsing from a random page
  – picks a random out-going link
    • and repeats
  – example: F → E → F → E → D → …
  – with probability 1 - λ, jumps to a random page
• PageRank of page x
  – probability of being on page x at a random moment
  – formally:
PageRank
• Initialize PR(x) = 1/N
• For every page:

  PR(x) = (1 - λ)/N + λ · Σ_{y→x} PR(y)/out(y)

  – every page y → x contributes part of its PR to x
  – a page spreads its PR equally among its out-links
  – the PR scores should sum to 100%
• use two arrays: PR_t → PR_{t+1}
• Iteration #1:
  – PR(B) = 0.18·9.1 + 0.82·[PR(C) + 1/3·PR(E) + 1/2·PR(F) + 1/2·PR(G) + 1/2·PR(I)] = 31
  – PR(C) = 0.18·9.1 + 0.82·9.1 = 9.1
• Iteration #2:
  – …
  – PR(C) = 0.18·9.1 + 0.82·PR(B) = 26
PageRank
• The algorithm converges
• Observations:
  – pages with no in-links: PR = (1 - λ)·1/N ≈ 1.6
  – same in-links → same PR
  – one in-link from a high-PR page >> many from low-PR pages
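The iteration can be sketched in plain Python directly from the formula (λ = 0.82 as in the slides' numbers; the example graph here is made up, not the one in the lecture figure, and sink nodes are not handled):

```python
def pagerank(links, lam=0.82, iters=50):
    # links: {page: [pages it links to]}.
    # PR(x) = (1-lam)/N + lam * sum over y->x of PR(y)/out(y).
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}        # initialize PR(x) = 1/N
    for _ in range(iters):
        nxt = {p: (1 - lam) / n for p in pages}
        for y, outs in links.items():
            for x in outs:                  # y spreads PR(y) equally
                nxt[x] += lam * pr[y] / len(outs)
        pr = nxt                            # two arrays: PR_t -> PR_t+1
    return pr

# Tiny example graph: A -> B, C;  B -> C;  C -> A
pr = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
# the scores sum to 1.0, and C (with two in-links) ranks highest
```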
PageRank with MapReduce

map(y, {x1, x2, …, xn}):
  for j = 1..n:
    emit(xj, PR(y)/out(y))

reduce(x, {PR(y1)/out(y1), …, PR(yn)/out(yn)}):
  PR(x) = (1 - λ)/N + λ · Σ_{y→x} PR(y)/out(y)
  for j = 1..n:
    emit(xj, PR(x)/out(x))

• The result goes recursively to another reducer
• Still, sink nodes should be considered
PageRank with MapReduce

map(y, {x1, x2, …, xn}):
  for j = 1..n:
    emit(xj, PR(y)/out(y))
  emit(y, {x1, …, xn})

reduce(x, {PR(y1)/out(y1), …, PR(yn)/out(yn)}, {x1, …, xn}):
  PR(x) = (1 - λ)/N + λ · Σ_{y→x} PR(y)/out(y)
  for j = 1..n:
    emit(xj, PR(x)/out(x))
  emit(x, {x1, …, xn})

• The result goes recursively to another reducer
• Still, sink nodes should be considered
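One round of this second variant can be simulated in-process (illustrative names and a made-up graph; a real Hadoop job would chain one map/reduce job per iteration, and sink nodes are again ignored):

```python
from collections import defaultdict

def mr_pagerank_round(links, pr, lam=0.82):
    # One MapReduce round: mappers emit a PR contribution per
    # out-link plus the adjacency list itself, so the graph
    # structure survives into the next round.
    emitted = defaultdict(lambda: {"contrib": [], "links": []})
    for y, outs in links.items():                 # map phase
        for x in outs:
            emitted[x]["contrib"].append(pr[y] / len(outs))
        emitted[y]["links"] = outs                # emit(y, {x1..xn})
    n = len(links)
    new_pr = {}                                   # reduce phase
    for x, parts in emitted.items():
        new_pr[x] = (1 - lam) / n + lam * sum(parts["contrib"])
    return new_pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = {p: 1 / 3 for p in links}                    # PR(x) = 1/N
for _ in range(50):                               # iterate to convergence
    pr = mr_pagerank_round(links, pr)
```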
Combiners

[Figure: Map node 1 emits (A,1) three times and (B,1) once; a Combiner on the node merges the three (A,1) pairs into (A,3), which is sent to the reduce node for key A.]
Combiners
• A Combiner can "compress" data on a mapper node before sending it forward
• The Combiner input/output types must equal the mapper output types
• In Hadoop Java, Combiners use the Reducer interface:

job.setCombinerClass(MyReducer.class);
Reducer as a Combiner
• A Reducer can be used as a Combiner if it is commutative and associative
  – E.g. max is:
    • max(1, 2, max(3, 4, 5)) = max(max(2, 4), max(1, 5, 3))
    • true for any order of function applications
  – E.g. avg is not:
    • avg(1, 2, avg(3, 4, 5)) = 2.333… ≠ avg(avg(2, 4), avg(1, 5, 3)) = 3
• Note: if the Reducer is not commutative and associative, Combiners can still be used
  – The Combiner just has to be different from the Reducer and designed for the specific case
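The max/avg distinction can be checked directly: applying the function once per "map node" group and then once over the partial results changes the answer for avg but not for max (a small illustrative sketch, not Hadoop code):

```python
def combine_then_reduce(fn, groups):
    # Apply fn per group (the combiner step), then once over the
    # partial results (the reducer step).
    return fn([fn(g) for g in groups])

data_splits = [[1, 2], [3, 4, 5]]   # the same data, on two map nodes

# max is commutative and associative: combining per split is safe.
assert combine_then_reduce(max, data_splits) == max([1, 2, 3, 4, 5])

# avg is not: the average of averages differs from the true average.
def avg(xs):
    return sum(xs) / len(xs)
combine_then_reduce(avg, data_splits)   # 2.75, but avg of all five is 3.0
```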
Adding a Combiner to WordCount

[Figure: the Map phase emits (walk,1), (run,1), (walk,1); the Combiner merges these to (run,1), (walk,2) before the Shuffle.]
Hadoop Streaming
• Map and Reduce functions can be implemented in any language with the Hadoop Streaming API
• Input is read from standard input
• Output is written to standard output
• Input/output items are lines of the form key\tvalue
  – \t is the tabulator character
• Reducer input lines are grouped by key
  – One reducer instance may receive multiple keys
Run Hadoop Streaming
• Debug using Unix pipes:

cat sample.txt | ./mapper.py | sort | ./reducer.py

• On Hadoop:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input sample.txt \
  -output output \
  -mapper ./mapper.py \
  -reducer ./reducer.py
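For reference, a streaming word-count mapper and reducer might look like this (a sketch; in the actual mapper.py and reducer.py scripts the loops would run over sys.stdin, and the reducer relies on the shuffle/sort having placed equal keys on adjacent lines):

```python
def mapper(lines):
    # mapper.py body: emit one "word\t1" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # reducer.py body: input is sorted by key, so equal keys are
    # adjacent; sum the counts over each run of identical words.
    current, count = None, 0
    for line in lines:
        word, value = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"

if __name__ == "__main__":
    # Same flow as: cat sample.txt | ./mapper.py | sort | ./reducer.py
    for out in reducer(sorted(mapper(["walk run walk", "run"]))):
        print(out)
```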