Data-intensive Programming, Lecture #3 (slides3.pdf · 2016-09-16)
TRANSCRIPT

Data-intensive Programming Lecture #3
Timo Aaltonen, Department of Pervasive Computing
Guest Lectures
• I'll try to organize two guest lectures
• Oct 14, Tapio Rautonen, Gofore Ltd: Making sense out of your big data
• Oct 7: ???
Outline
• Course Work
• Apache Sqoop
• SQL Recap
• MapReduce Examples
  – Inverted Index
  – Finding Friends
  – Computing PageRank
• (Hadoop)
  – Combiner
  – Other programming languages
Course Work
• MySportShop is a sports gear retailer. All the sales happen online in their web store. Examples of their products are different game jerseys and sports watches.
• The web store has an Apache web server for the incoming HTTP requests. The web server logs all traffic to a log file.
  – Using these logs, one can study the browsing behavior of the users.
• The sales data of MySportShop is in PostgreSQL, which is a relational database. Among other things, the database has a table order_items containing data on all sales events of the shop.
Course Work: Questions
• Based on the data, answer the following questions:
  1. What are the top-10 best-selling products in terms of total sales?
  2. What are the top-10 browsed products?
  3. What anomaly is there between these two?
  4. What are the most popular browsing hours?
Course Work
• Since the managers of the company don't use Hadoop but an RDBMS, all the data must be transferred to PostgreSQL
• In order to do that:
  – Transfer the Apache logs (with Apache Flume) to HDFS
  – Compute the viewing frequencies of the different products using MapReduce (Question 2)
  – Compute the viewing hour data with MapReduce (Q4)
  – Transfer the results (with Apache Sqoop) to PostgreSQL
  – Find answers to the questions in PostgreSQL using SQL (Q1-4)
Environment: three options
1. You can use your own computer by installing VirtualBox 5.x
  – We offer you a virtual machine with all required software and data installed
  – In the next weekly exercises the assistants solve VirtualBox-related problems, if you encounter any
2. We offer you a virtual machine from the TUT cloud
  – All required software and data is installed
  – No graphical user interface
  – Guidance available in the weekly exercises
3. Your own installation / a cloud service can be used
  – No help from the course personnel
Course Work
• The work is done in groups of three
  – Enroll in Moodle: https://moodle2.tut.fi/course/view.php?id=9954
  – opens today at 10 o'clock
• Deadline is Oct 14th
• Instructions for returning the work will be published later
  – IntelliJ IDEA project
Course Work
• Material
  – https://flume.apache.org/FlumeUserGuide.html
  – https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
  – http://hadoop.apache.org/docs/r2.7.3/
  – https://www.postgresql.org/docs/9.5/static/index.html
MapReduce
• Simple programming model
• Map is stateless
  – allows running map functions in parallel
• Also Reduce can be executed in parallel
• The canonical example is the word count
Inverted Index
• Collating
  – Problem: There is a set of items and some function of one item. It is required to save all items that have the same value of the function into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is the building of inverted indexes.
  – Solution: The mapper computes the given function for each item and emits the value of the function as the key and the item itself as the value. The reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the items are terms (words) and the function is the ID of the document where the term was found.
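As a sketch, the collating pattern for a simple inverted index can be simulated in plain Python. The map/reduce split below runs in one process and the deduplication in the reducer is an illustrative choice, not Hadoop code:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Mapper: emit a (term, doc_id) pair for every term occurrence.
    for term in text.lower().split():
        yield term, doc_id

def reduce_phase(pairs):
    # Shuffle groups the pairs by term; the reducer then saves each
    # group, here as a deduplicated, sorted list of doc IDs.
    groups = defaultdict(list)
    for term, doc_id in pairs:
        groups[term].append(doc_id)
    return {term: sorted(set(ids)) for term, ids in groups.items()}

docs = {1: "this doc contains text", 2: "my doc contains my text"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
# e.g. index["doc"] == [1, 2] and index["my"] == [2]
```

This reproduces the "word, list of doc IDs" output of the simple inverted index on the next slide.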
Simple Inverted Index

Doc #1: "This doc contains text"
Doc #2: "My doc contains my text"

Mapper output for Doc #1: (this,1) (doc,1) (contains,1) (text,1)
Mapper output for Doc #2: (my,2) (doc,2) (contains,2) (my,2) (text,2)

Reduced output:
  this: 1
  doc: 1,2
  contains: 1,2
  text: 1,2
  my: 2

• Reduced output: word, list of doc IDs
(Normal) Inverted Index

Doc #1: "This doc contains text"
Doc #2: "My doc contains my text"

Mapper output for Doc #1: (this,(1,1)) (doc,(1,1)) (contains,(1,1)) (text,(1,1))
Mapper output for Doc #2: (my,(2,1)) (doc,(2,1)) (contains,(2,1)) (my,(2,1)) (text,(2,1))

Reduced output:
  this: (1,1)
  doc: (1,1),(2,1)
  contains: (1,1),(2,1)
  text: (1,1),(2,1)
  my: (2,2)

• Reduced output: word, list of (docID, frequency)
Using Inverted Index: Searching
• Documents
  – D1: He likes to wink, he likes to drink.
  – D2: He likes to drink, and drink, and drink.
  – D3: The thing he likes to drink is ink.
  – D4: The ink he likes to drink is pink.
  – D5: He likes to wink and drink pink ink.
• Index
  – he: (1,2),(2,1),(3,1),(4,1),(5,1)
  – ink: (3,1),(4,1),(5,1)
  – pink: (4,1),(5,1)
  – thing: (3,1)
  – wink: (1,1),(5,1)
Using Inverted Index
• Indexing makes search engines fast
• Data is sparse, since most words appear in only one document
  – (id, val) tuples
  – sorted by id
  – compact
  – very fast
• Linear merge

Index:
  he: (1,2),(2,1),(3,1),(4,1),(5,1)
  ink: (3,1),(4,1),(5,1)
  pink: (4,1),(5,1)
  thing: (3,1)
  wink: (1,1),(5,1)
Linear Merge
• Find documents matching query {ink, wink}
  – Load the inverted lists for all query words
  – Linear merge, O(n)
    • n is the total number of items in the two lists
    • f() is a scoring function: how well a doc matches the query

  ink  --> (3,1) (4,1) (5,1)
  wink --> (1,1) (5,1)

Matching set: 1:f(0,1)  3:f(1,0)  4:f(1,0)  5:f(1,1)
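The merge itself is a standard two-pointer walk over two sorted posting lists; a minimal Python sketch (the function name and the dict output shape are illustrative assumptions):

```python
def linear_merge(list_a, list_b):
    # Each list holds sorted (doc_id, count) tuples. The result maps
    # doc_id -> (count_in_a, count_in_b) for docs in either list,
    # i.e. the per-document inputs to the scoring function f().
    i = j = 0
    matches = {}
    while i < len(list_a) and j < len(list_b):
        da, ca = list_a[i]
        db, cb = list_b[j]
        if da == db:
            matches[da] = (ca, cb); i += 1; j += 1
        elif da < db:
            matches[da] = (ca, 0); i += 1
        else:
            matches[db] = (0, cb); j += 1
    # Drain whichever list still has entries.
    for d, c in list_a[i:]:
        matches[d] = (c, 0)
    for d, c in list_b[j:]:
        matches[d] = (0, c)
    return matches

ink  = [(3, 1), (4, 1), (5, 1)]
wink = [(1, 1), (5, 1)]
# linear_merge(ink, wink) reproduces the matching set above:
# {1: (0, 1), 3: (1, 0), 4: (1, 0), 5: (1, 1)}
```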
Scoring Function
• Specify which docs are matched
  – in: counts of query words in a doc
  – out: ranking score
    • how well the doc matches the query
    • 0 if the document does not match
  – Example:
    • Boolean AND: f(Q, D) = ∏_{q∈Q} (1 if n_q > 0, 0 if n_q = 0)
      – 1 iff all query words are present
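The Boolean AND scorer translates directly into code (a minimal sketch; `boolean_and` is a hypothetical name):

```python
import math

def boolean_and(counts):
    # counts: per-query-word occurrence counts n_q in one document.
    # The product is 1 iff every query word appears at least once.
    return math.prod(1 if n > 0 else 0 for n in counts)

# Applied to the matching set from the {ink, wink} merge:
scores = {1: boolean_and((0, 1)), 3: boolean_and((1, 0)),
          4: boolean_and((1, 0)), 5: boolean_and((1, 1))}
# only document 5 contains both query words
```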
Phrases and Proximity
• Query "pink ink" as a phrase
• Using a regular index:
  – match #and(pink, ink) ->
  – scan the matching documents for the query string (slow)
• Idea: index all bi-grams as words
  – can approximate "drink pink ink"
  – fast, but the index size explodes
  – inflexible: can't query #5(pink, ink)
• Construct a proximity index

D4: The ink he likes to drink is pink.
D5: He likes to wink and drink pink ink.

  pink_ink   --> (5,1)
  drink_pink --> (5,1)
Proximity Index
• Embed position information into the inverted lists
  – called a positional/proximity index (prox-list)
  – handles arbitrary phrases, windows
  – key to "rich" indexing: structure, fields, tags, …
Proximity Index

Doc #1: "This doc contains text"
Doc #2: "My doc contains my text"

Mapper output for Doc #1: (this,(1,1)) (doc,(1,2)) (contains,(1,3)) (text,(1,4))
Mapper output for Doc #2: (my,(2,1)) (doc,(2,2)) (contains,(2,3)) (my,(2,4)) (text,(2,5))

Reduced output:
  this: (1,1)
  doc: (1,2),(2,2)
  contains: (1,3),(2,3)
  text: (1,4),(2,5)
  my: (2,1),(2,4)

• Reduced output: word, list of (docID, location)
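The positional index above follows the same map/reduce pattern, only with (docID, position) values; a plain-Python sketch (simulated in one process, not Hadoop code):

```python
from collections import defaultdict

def positional_index(docs):
    # docs: {doc_id: text}. Positions are 1-based, as on the slide;
    # each posting is a (doc_id, position) pair.
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for pos, word in enumerate(docs[doc_id].lower().split(), start=1):
            index[word].append((doc_id, pos))
    return dict(index)

index = positional_index({1: "this doc contains text",
                          2: "my doc contains my text"})
# e.g. index["my"] == [(2, 1), (2, 4)]
```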
Proximity Index
• Documents
  – D1: He likes to wink, he likes to drink.
  – D2: He likes to drink, and drink, and drink.
  – D3: The thing he likes to drink is ink.
  – D4: The ink he likes to drink is pink.
  – D5: He likes to wink and drink pink ink.
• Index
  – he: (1,1),(1,5),(2,1),(3,3),(4,3),(5,1)
  – ink: (3,8),(4,2),(5,8)
  – pink: (4,8),(5,7)
  – thing: (3,2)
  – wink: (1,4),(5,4)
Using Proximity Index
• Query: "pink ink"
• Linear Merge
  – compare the doc IDs under the pointers
  – if they match, check pos(ink) - pos(pink) = 1
  – near operator

  ink  --> (3,8) (4,2) (5,8)
  pink --> (4,8) (5,7)
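The phrase check on top of the merge can be sketched as follows (`phrase_match` is a hypothetical name; for brevity this version collects positions per document instead of walking two pointers):

```python
def phrase_match(first_list, second_list, distance=1):
    # Posting lists of (doc_id, position) pairs. Report the docs where
    # some occurrence of the second word sits exactly `distance`
    # positions after an occurrence of the first word.
    positions = {}
    for doc, pos in first_list:
        positions.setdefault(doc, set()).add(pos)
    return sorted({doc for doc, pos in second_list
                   if pos - distance in positions.get(doc, ())})

pink = [(4, 8), (5, 7)]
ink  = [(3, 8), (4, 2), (5, 8)]
# "pink ink" as a phrase matches only document 5 (pos 7 then pos 8)
```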
Structure and Tags
• Documents are not always flat
  – meta-data: title, author, date
  – structure: part, chapter, section, paragraph
  – tags: named entity, link, translation
• Options for dealing with structure
  – create a separate index for each field (like in SQL)
  – push structure into the index values
  – construct an extent index
Extent Index
• Special "term" for each element, field or tag
  – spans a region of text
    • words in the span belong to the field
  – allows multiple overlapping spans
  – similar to stand-off annotation formats
Extent Index
• Documents
  – D1: He likes to wink, he likes to drink.
  – D2: He likes to drink, and drink, and drink.
  – D3: The thing he likes to drink is ink.
  – D4: The ink he likes to drink is pink.
  – D5: He likes to wink and drink pink ink.
• Index
  – he: (1,1),(1,5),(2,1),(3,3),(4,3),(5,1)
  – ink: (3,8),(4,2),(5,8)
  – pink: (4,8),(5,7)
  – thing: (3,2)
  – wink: (1,4),(5,4)
  – link: (3,1:2),(4,1:2),(5,7:8)
Using Extent Index
• Query: find an ink-related hyper-link
• Same approach as with proximity
  – only now "tag" and "word" must have distance = 0
  – Linear Merge, match when positions fall into the extent
  – amenable to all optimizations

  ink  --> (3,8) (4,2) (5,8)
  link --> (3,1:2) (4,1:2) (5,7:8)
Overview of Inverted Indices
• Normal
• Positional
  – phrases, near operator
• Extent
  – metadata, structure
MR Example: Finding Friends
• http://stevekrenzel.com/finding-friends-with-mapreduce
• Facebook could use MapReduce in the following way
MR Example: Finding Friends
• Facebook has a list of friends
  – the relation is bidirectional
• FB has lots of disk space and serves millions of requests per day
• Certain results are pre-computed to reduce the processing time of requests
  – E.g. "You and Joe have 230 mutual friends"
  – The list of common friends is quite stable
  – so recalculating it on every request would be wasteful
MR Example: Finding Friends
• Idea: MapReduce is used to calculate the common friends daily and store the results
  – later only a quick lookup is needed
• Assume the friends are stored as
  Person ⟶ [List of friends]
  – A ⟶ [B, C, D]
  – B ⟶ [A, C, D, E]
  – C ⟶ [A, B, D, E]
  – D ⟶ [A, B, C, E]
  – E ⟶ [B, C, D]
MR Example: Finding Friends
• Each line is input for a mapper
• For every friend in the list of friends, the mapper will emit a (key, value) pair, where
  – key is
    • (person, friend), if person < friend
    • (friend, person), otherwise
  – value is the list of the person's friends
MR Example: Finding Friends
map(A, [B, C, D]):
  (A, B), [B, C, D]
  (A, C), [B, C, D]
  (A, D), [B, C, D]
map(B, [A, C, D, E]):
  (A, B), [A, C, D, E]
  (B, C), [A, C, D, E]
  (B, D), [A, C, D, E]
  (B, E), [A, C, D, E]
map(C, [A, B, D, E]):
  (A, C), [A, B, D, E]
  (B, C), [A, B, D, E]
  (C, D), [A, B, D, E]
  (C, E), [A, B, D, E]
map(D, [A, B, C, E]):
  (A, D), [A, B, C, E]
  (B, D), [A, B, C, E]
  (C, D), [A, B, C, E]
  (D, E), [A, B, C, E]
map(E, [B, C, D]):
  (B, E), [B, C, D]
  (C, E), [B, C, D]
  (D, E), [B, C, D]
MR Example: Finding Friends
• After shuffling, the inputs to the reducers are:
  (A, B), [[B, C, D], [A, C, D, E]]
  (A, C), [[B, C, D], [A, B, D, E]]
  (A, D), [[B, C, D], [A, B, C, E]]
  (B, C), [[A, C, D, E], [A, B, D, E]]
  (B, D), [[A, C, D, E], [A, B, C, E]]
  (B, E), [[A, C, D, E], [B, C, D]]
  (C, D), [[A, B, D, E], [A, B, C, E]]
  (C, E), [[A, B, D, E], [B, C, D]]
  (D, E), [[A, B, C, E], [B, C, D]]
MR Example: Finding Friends
• Each line is given to a reducer
• The reducer computes the intersection of the sets
  – and removes the persons of the key pair
• For example, (A, B), [[B, C, D], [A, C, D, E]] is reduced to (A, B), [C, D]
  (A, C), [B, D]      (B, E), [C, D]
  (A, D), [B, C]      (C, D), [A, B, E]
  (B, C), [A, D, E]   (C, E), [B, D]
  (B, D), [A, C, E]   (D, E), [B, C]
• Now, when D visits B, the common friends [A, C, E] are found fast
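The whole job can be simulated in a few lines of Python (an in-process stand-in for the map, shuffle, and reduce phases above, not a real Hadoop job):

```python
from collections import defaultdict

friends = {"A": ["B", "C", "D"], "B": ["A", "C", "D", "E"],
           "C": ["A", "B", "D", "E"], "D": ["A", "B", "C", "E"],
           "E": ["B", "C", "D"]}

# Map: for each friend, emit the ordered pair and the full friend list.
pairs = defaultdict(list)
for person, flist in friends.items():
    for friend in flist:
        key = (person, friend) if person < friend else (friend, person)
        pairs[key].append(set(flist))   # the shuffle groups by key

# Reduce: each key now has exactly two friend lists; intersect them.
mutual = {key: sorted(a & b) for key, (a, b) in pairs.items()}
# e.g. mutual[("B", "D")] == ["A", "C", "E"]
```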
MR: PageRank
• Google's description
  – relies on the "uniquely democratic" nature of the web
  – interprets a link from page A to page B as "a vote"
    • A → B means A thinks B is worth something
  – many links mean that B must be good
  – content-independent measure
• Used as a ranking feature, combined with content
  – not all pages linking to B are equally important
  – a single link from Slashdot or CNN may be worth thousands
• Google PageRank
  – how many "good" pages link to B
PageRank: Random Surfer
• Analogy
  – the user starts browsing from a random page
  – picks a random out-going link
    • and repeats
  – example: F → E → F → E → D → …
  – with probability 1 - λ, jumps to a random page
• PageRank of page x
  – probability of being on page x at a random moment
  – formally:
PageRank
• Initialize PR(x) = 1/N
• For every page:

  PR(x) = (1 - λ)/N + λ · Σ_{y→x} PR(y)/out(y)

  – every page y → x contributes part of its PR to x
  – a page spreads its PR equally among its out-links
  – the PR scores should sum to 100%
• use two arrays: PR_t → PR_{t+1}
• Iteration #1:
  – PR(B) = 0.18·9.1 + 0.82·[PR(C) + 1/3·PR(E) + 1/2·PR(F) + 1/2·PR(G) + 1/2·PR(I)] = 31
  – PR(C) = 0.18·9.1 + 0.82·9.1 = 9.1
• Iteration #2:
  – …
  – PR(C) = 0.18·9.1 + 0.82·PR(B) = 26
PageRank
• The algorithm converges
• Observations:
  – pages with no in-links: PR = (1 - λ)·1/N ≈ 1.6
  – same in-links → same PR
  – one in-link from a high-PR page >> many from low-PR pages
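The iteration can be sketched in plain Python directly from the formula (λ = 0.82 as in the slides' numbers; the example graph here is made up, not the one in the lecture figure, and sink nodes are not handled):

```python
def pagerank(links, lam=0.82, iters=50):
    # links: {page: [pages it links to]}.
    # PR(x) = (1-lam)/N + lam * sum over y->x of PR(y)/out(y).
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}        # initialize PR(x) = 1/N
    for _ in range(iters):
        nxt = {p: (1 - lam) / n for p in pages}
        for y, outs in links.items():
            for x in outs:                  # y spreads PR(y) equally
                nxt[x] += lam * pr[y] / len(outs)
        pr = nxt                            # two arrays: PR_t -> PR_t+1
    return pr

# Tiny example graph: A -> B, C;  B -> C;  C -> A
pr = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
# the scores sum to 1.0, and C (with two in-links) ranks highest
```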
PageRank with MapReduce

map(y, {x1, x2, …, xn}):
  for j = 1..n:
    emit(xj, PR(y)/out(y))

reduce(x, {PR(y1)/out(y1), …, PR(yn)/out(yn)}):
  PR(x) = (1 - λ)/N + λ · Σ_{y→x} PR(y)/out(y)
  for j = 1..n:
    emit(xj, PR(x)/out(x))

• The result goes recursively to another reducer
• Still, sink nodes should be considered
PageRank with MapReduce

map(y, {x1, x2, …, xn}):
  for j = 1..n:
    emit(xj, PR(y)/out(y))
  emit(y, {x1, …, xn})

reduce(x, {PR(y1)/out(y1), …, PR(yn)/out(yn)}, {x1, …, xn}):
  PR(x) = (1 - λ)/N + λ · Σ_{y→x} PR(y)/out(y)
  for j = 1..n:
    emit(xj, PR(x)/out(x))
  emit(x, {x1, …, xn})

• The result goes recursively to another reducer
• Still, sink nodes should be considered
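One round of this second variant can be simulated in-process (illustrative names and a made-up graph; a real Hadoop job would chain one map/reduce job per iteration, and sink nodes are again ignored):

```python
from collections import defaultdict

def mr_pagerank_round(links, pr, lam=0.82):
    # One MapReduce round: mappers emit a PR contribution per
    # out-link plus the adjacency list itself, so the graph
    # structure survives into the next round.
    emitted = defaultdict(lambda: {"contrib": [], "links": []})
    for y, outs in links.items():                 # map phase
        for x in outs:
            emitted[x]["contrib"].append(pr[y] / len(outs))
        emitted[y]["links"] = outs                # emit(y, {x1..xn})
    n = len(links)
    new_pr = {}                                   # reduce phase
    for x, parts in emitted.items():
        new_pr[x] = (1 - lam) / n + lam * sum(parts["contrib"])
    return new_pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = {p: 1 / 3 for p in links}                    # PR(x) = 1/N
for _ in range(50):                               # iterate to convergence
    pr = mr_pagerank_round(links, pr)
```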
Combiners

[Figure: Map node 1 emits (A,1) three times and (B,1) once; a Combiner on the node merges the three (A,1) pairs into (A,3), which is sent to the reduce node for key A.]
Combiners
• A Combiner can "compress" data on a mapper node before sending it forward
• The Combiner input/output types must equal the mapper output types
• In Hadoop Java, Combiners use the Reducer interface:

job.setCombinerClass(MyReducer.class);
Reducer as a Combiner
• A Reducer can be used as a Combiner if it is commutative and associative
  – E.g. max is:
    • max(1, 2, max(3, 4, 5)) = max(max(2, 4), max(1, 5, 3))
    • true for any order of function applications
  – E.g. avg is not:
    • avg(1, 2, avg(3, 4, 5)) = 2.333… ≠ avg(avg(2, 4), avg(1, 5, 3)) = 3
• Note: if the Reducer is not commutative and associative, Combiners can still be used
  – The Combiner just has to be different from the Reducer and designed for the specific case
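The max/avg distinction can be checked directly: applying the function once per "map node" group and then once over the partial results changes the answer for avg but not for max (a small illustrative sketch, not Hadoop code):

```python
def combine_then_reduce(fn, groups):
    # Apply fn per group (the combiner step), then once over the
    # partial results (the reducer step).
    return fn([fn(g) for g in groups])

data_splits = [[1, 2], [3, 4, 5]]   # the same data, on two map nodes

# max is commutative and associative: combining per split is safe.
assert combine_then_reduce(max, data_splits) == max([1, 2, 3, 4, 5])

# avg is not: the average of averages differs from the true average.
def avg(xs):
    return sum(xs) / len(xs)
combine_then_reduce(avg, data_splits)   # 2.75, but avg of all five is 3.0
```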
Adding a Combiner to WordCount

[Figure: the Map phase emits (walk,1), (run,1), (walk,1); the Combiner merges these to (run,1), (walk,2) before the Shuffle.]
Hadoop Streaming
• Map and Reduce functions can be implemented in any language with the Hadoop Streaming API
• Input is read from standard input
• Output is written to standard output
• Input/output items are lines of the form key\tvalue
  – \t is the tabulator character
• Reducer input lines are grouped by key
  – One reducer instance may receive multiple keys
Run Hadoop Streaming
• Debug using Unix pipes:

cat sample.txt | ./mapper.py | sort | ./reducer.py

• On Hadoop:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input sample.txt \
  -output output \
  -mapper ./mapper.py \
  -reducer ./reducer.py
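For reference, a streaming word-count mapper and reducer might look like this (a sketch; in the actual mapper.py and reducer.py scripts the loops would run over sys.stdin, and the reducer relies on the shuffle/sort having placed equal keys on adjacent lines):

```python
def mapper(lines):
    # mapper.py body: emit one "word\t1" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # reducer.py body: input is sorted by key, so equal keys are
    # adjacent; sum the counts over each run of identical words.
    current, count = None, 0
    for line in lines:
        word, value = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"

if __name__ == "__main__":
    # Same flow as: cat sample.txt | ./mapper.py | sort | ./reducer.py
    for out in reducer(sorted(mapper(["walk run walk", "run"]))):
        print(out)
```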