SILOs - Distributed Web Archiving & Analysis using Map Reduce
Anushree Venkatesh, Sagar Mehta, Sushma Rao
AGENDA
Motivation
What is Map-Reduce?
Why Map-Reduce?
The HADOOP Framework
Map Reduce in SILOs
SILOs Architecture
Modules
Experiments
Motivation
Life span of a web page – 44 to 75 days
Limitations of centralized/distributed crawling
Exploring map reduce
Analysis of web [subset]
    Web graph
    Search response quality
        Tweaked page rank
        Inverted index
Divide and conquer
Functional programming counterparts -> distributed data processing
Plumbing behind the scenes -> Focus on the problem
Map – Division of key space
Reduce – Combine results
Pipelining functionality
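The map/reduce split described above can be sketched in a few lines of single-process Python. This is only an illustration of the idea, not the Hadoop API: `map_reduce`, `word_mapper` and `count_reducer` are hypothetical names, and the grouping step stands in for the distributed shuffle that a framework would perform.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group intermediate pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)      # "shuffle": values are partitioned by key
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(line):
    # Map: emit one <word, 1> pair per word in the line.
    for word in line.split():
        yield word.lower(), 1

def count_reducer(key, values):
    # Reduce: combine all values that share a key.
    return sum(values)

counts = map_reduce(["map reduce map", "reduce results"], word_mapper, count_reducer)
```

Because the output of one `map_reduce` call is itself a set of key/value pairs, it can be fed to another call, which is one way to read "pipelining functionality".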
The HADOOP Framework
Open source implementation of Map reduce in Java
HDFS – Hadoop specific file system
Takes care of
    fault tolerance
    dependencies between nodes
Setup through VM instance - Problems
Currently Single Node cluster
HDFS Setup
Incorporation of Berkeley DB
SILOs Architecture (diagram; M = map step, R = reduce step)
Seed List
URL Extractor: M parse for URL; R <URL, 1> (remove duplicates)
Distributed Crawler: M <URL, value>; R <URL, page content>
Key Word Extractor: M parse for key word; R <KeyWord, URL>
Back Links Mapper: M <Parent, URL>; R <URL, Parent>
Graph Builder
Compression
Diff
Tables: URL Table, Page Content Table, Inverted Index Table, Back Links Table, Adjacency List Table
<URL, parent URL>
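As a rough illustration of how <Parent, URL> pairs could be turned into <URL, Parent> back links, here is a minimal in-memory sketch. The function names, the example URLs and the dictionary standing in for the adjacency list table are all assumptions; the slides do not show the actual implementation.

```python
from collections import defaultdict

def backlink_map(parent, outlinks):
    # Map: for each link found on a parent page, emit <url, parent>.
    for url in outlinks:
        yield url, parent

def backlink_reduce(pairs):
    # Reduce: collect every parent that points at the same URL.
    table = defaultdict(set)
    for url, parent in pairs:
        table[url].add(parent)
    return {url: sorted(parents) for url, parents in table.items()}

# Hypothetical adjacency list: parent page -> pages it links to.
adjacency = {
    "a.example/": ["a.example/x", "b.example/"],
    "b.example/": ["a.example/x"],
}
pairs = [pair for parent, links in adjacency.items()
         for pair in backlink_map(parent, links)]
back_links = backlink_reduce(pairs)
```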
Map
Input: <url, 1>
if (!duplicate(url)) {
    page_content = http_get(url);
    Insert into url_table: <hash(url), url, hash(page_content), time_stamp>
    Output intermediate pair <url, page_content>
}
else if (duplicate(url) && (current_time - time_stamp(url) > threshold)) {
    page_content = http_get(url);
    Update url_table(hash(url), current_time);
    Output intermediate pair <url, page_content>
}
else {
    Update url_table(hash(url), current_time);
}
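The map step above might look as follows in Python. This is a sketch under stated assumptions: `url_table` is a plain dictionary standing in for the URL table, `http_get` is passed in by the caller so no real network call is implied, SHA-1 is an arbitrary choice for `hash`, and the `now` parameter exists only to make the example deterministic.

```python
import hashlib
import time

def sha1(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def crawl_map(url, url_table, http_get, threshold_seconds, now=None):
    """Return the intermediate (url, page_content) pairs emitted for one URL."""
    now = time.time() if now is None else now
    key = sha1(url)
    row = url_table.get(key)
    if row is None:
        # New URL: fetch it and record it in the URL table.
        content = http_get(url)
        url_table[key] = {"url": url, "content_hash": sha1(content), "time_stamp": now}
        return [(url, content)]
    if now - row["time_stamp"] > threshold_seconds:
        # Known URL whose entry is older than the threshold: fetch again.
        content = http_get(url)
        row["time_stamp"] = now
        return [(url, content)]
    # Known and recent: only refresh the time stamp, emit nothing.
    row["time_stamp"] = now
    return []
```

As in the pseudocode, the stale branch updates only the time stamp; whether the stored content hash should also be refreshed there is left open.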
Reduce
Input: <url, page_content>
if (!exists hash(url) in page_content_table) {
    Insert into page_content_table: <hash(page_content), compress(page_content)>
}
else if (hash(page_content_table(hash(url))) != hash(current_page_content)) {
    Insert into page_content_table: <hash(page_content), compress(diff_with_latest(page_content))>
}
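A minimal sketch of the reduce step, storing a full compressed copy the first time a URL is seen and a compressed diff when its content changes. The choices here are assumptions: `zlib` for `compress`, a unified diff from `difflib` for `diff_with_latest`, and a hypothetical helper dictionary `latest_by_url` that remembers the most recent content per URL.

```python
import difflib
import hashlib
import zlib

def sha1(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def crawl_reduce(url, page_content, page_content_table, latest_by_url):
    """Store the page (or a diff against its latest version); return the key used, or None."""
    content_hash = sha1(page_content)
    previous = latest_by_url.get(sha1(url))
    if previous is None:
        payload = page_content                      # first version: store in full
    elif sha1(previous) != content_hash:
        diff = difflib.unified_diff(
            previous.splitlines(keepends=True),
            page_content.splitlines(keepends=True),
        )
        payload = "".join(diff)                     # changed: store only the diff
    else:
        return None                                 # unchanged: nothing to store
    page_content_table[content_hash] = zlib.compress(payload.encode("utf-8"))
    latest_by_url[sha1(url)] = page_content
    return content_hash
```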
Currently outside of Map-Reduce
Manual transfer of files to HDFS
Currently Depth First Search, will be modified for Breadth First Search
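The difference between the two traversal orders comes down to which end of the frontier the next URL is taken from. The sketch below is illustrative only; `crawl_order` and `get_links` are hypothetical names and the link graph is supplied by the caller rather than fetched.

```python
from collections import deque

def crawl_order(seed, get_links, breadth_first=True, limit=100):
    """Return the order in which URLs are visited, starting from seed."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier and len(order) < limit:
        # Queue (oldest first) gives breadth first; stack (newest first) gives depth first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Hypothetical link graph used only for illustration.
links = {"s": ["a", "b"], "a": ["c"], "b": [], "c": []}
bfs = crawl_order("s", lambda u: links[u], breadth_first=True)
dfs = crawl_order("s", lambda u: links[u], breadth_first=False)
```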
Map
Input: <url, page_content>
List<keywords> = parse(page_content);
For each keyword, emit
Output intermediate pair <keyword, url>
Reduce
Combine all <keyword, url> pairs with the same keyword to emit <keyword, List<urls>>
Insert into inverted index table: <keyword, List<urls>>
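The keyword extractor can be sketched as below. The tokenizer is an assumption (lowercased alphanumeric runs), since the slides do not say how `parse` works, and a dictionary stands in for the inverted index table.

```python
import re
from collections import defaultdict

def keyword_map(url, page_content):
    # Map: emit <keyword, url> once per distinct keyword on the page.
    for keyword in set(re.findall(r"[a-z0-9]+", page_content.lower())):
        yield keyword, url

def keyword_reduce(pairs):
    # Reduce: combine all pairs with the same keyword into <keyword, [urls]>.
    index = defaultdict(set)
    for keyword, url in pairs:
        index[keyword].add(url)
    return {keyword: sorted(urls) for keyword, urls in index.items()}

# Hypothetical pages used only for illustration.
pages = {
    "http://a.example/": "Research library",
    "http://b.example/": "Library news",
}
pairs = [pair for url, text in pages.items() for pair in keyword_map(url, text)]
inverted_index = keyword_reduce(pairs)
```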
Top Words Along with their Frequency
CMU
    Carnegie 2456
    Mellon 2107
    University 1157
    Alumni 786
    Center 466
    News 395
    Library 393
    PA 373
    Research 357
    Pittsburgh, 352
    Information 313
    School 309
Cornell
    Cornell 742
    University 378
    College 158
    Admissions 128
    Research 99
    Student 94
    School 89
    Information 77
    York 74
    Alumni 71
    Academics 62
    Ithaca 59
Gatech
    Tech 2704
    Georgia 1882
    Alumni 1115
    Services 885
    Association 646
    Career 493
    Baseball 416
    Engineering 408
    Tennis 222
    Information 219
    students 198
    Institute 173
    Atlanta 164
Top 6 URL domains that get traversed
CMU
    alumni.cmu.edu 92
    hr.web.cmu.edu 13
    www.alumniconnections.com 16
    www.carnegiemellontoday.com 10
    www.cmu.edu 170
    www.library.cmu.edu 69
Cornell
    www.cornell.edu 43
    www.cuinfo.cornell.edu 2
    www.gradschool.cornell.edu 2
    www.news.cornell.edu 7
    www.sce.cornell.edu 8
    www.vet.cornell.edu 1
Gatech
    centennial.gtalumni.org 4
    cyberbuzz.gatech.edu 7
    georgiatech.searchease.com 9
    gtalumni.org 236
    ramblinwreck.cstv.com 56
    www.gatech.edu 14
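A count of traversed domains like the one above could be produced from a list of crawled URLs roughly as follows. This is a sketch, not the authors' code: `top_domains` is a hypothetical helper and the URLs are placeholders rather than data from the experiments.

```python
from collections import Counter
from urllib.parse import urlparse

def top_domains(urls, n=6):
    """Return the n most frequently seen host names with their counts."""
    counts = Counter(urlparse(url).netloc for url in urls)
    return counts.most_common(n)

# Placeholder URLs used only for illustration.
crawled = [
    "http://www.example.edu/a",
    "http://www.example.edu/b",
    "http://alumni.example.edu/",
]
ranking = top_domains(crawled)
```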
Avg URL Depth
CMU
    cmu.edu 2.73
    alumni.cmu.edu 2.18
    www.library.cmu.edu 2.23
    www.alumniconnections.com 4.81
Cornell
    cornell.edu 1.34
    www.gradschool.cornell.edu 1
    www.news.cornell.edu 2.57
    www.sce.cornell.edu 1
Gatech
    gatech.edu 1
    gtalumni.org 3
    ramblinwreck.cstv.com 2.57
    cyberbuzz.gatech.edu 2
Questions, Comments, Criticisms
HTML Parser
Hadoop Framework (Apache)
Peer Crawl