Large Scale Crawling with Apache Nutch and Friends
Post on 18-Oct-2014
DESCRIPTION
Presented by Julien Nioche, Director, DigitalPebble. This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, SOLR, Tika or HBase. The second part of the presentation will focus on the latest developments in Nutch, the differences between the 1.x and 2.x branches, and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point for crawling on a large scale with Apache Nutch and SOLR.

TRANSCRIPT
Large Scale Crawling with Apache Nutch and friends...
Julien Nioche
[email protected]
LUCENE/SOLR REVOLUTION EU 2013
About myself
DigitalPebble Ltd, Bristol (UK), specialised in Text Engineering:
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning
Strong focus on Open Source & the Apache ecosystem
VP of Apache Nutch; User | Contributor | Committer on:
– Tika
– SOLR, Lucene
– GATE, UIMA
– Mahout
– Behemoth
Outline
– Overview
– Installation and setup
– Main steps
– Nutch 2.x
– Future developments
Nutch?
“Distributed framework for large scale web crawling”(but does not have to be large scale at all)
Based on Apache Hadoop
Apache TLP since May 2010
Indexing and Search by SOLR
A bit of history
2002/2003 : Started by Doug Cutting & Mike Cafarella
2005 : MapReduce implementation in Nutch
– 2006 : Hadoop sub-project of Lucene @Apache
2006/7 : Parser and MimeType in Tika
– 2008 : Tika sub-project of Lucene @Apache
May 2010 : TLP project at Apache
Sept 2010 : Storage abstraction in Nutch 2.x
– 2012 : Gora TLP @Apache
Recent Releases

[Timeline of releases, 06/09 to 06/13: 1.x branch (trunk) with 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1, 1.6 and 1.7; 2.x branch with 2.0, 2.1 and 2.2.1]
Why use Nutch?
Features
– Index with SOLR / ES / CloudSearch
– PageRank implementation
– Loads of existing plugins
– Can easily be extended / customised

Usual reasons
– Open source with a business-friendly license, mature, active community, ...

Scalability
– Tried and tested on a very large scale
– Standard Hadoop
Use cases
Crawl for search
– Generic or vertical
– Index and search with SOLR et al.
– Single node to large clusters on the Cloud

… but also
– Data Mining
– NLP (e.g. Sentiment Analysis)
– ML with MAHOUT / UIMA / GATE
– Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)
Customer cases (by size and specificity / verticality)

BetterJobs.com (CareerBuilder)
– Single server
– Aggregates content from job portals
– Extracts and normalizes structure (description, requirements, locations)
– ~2M pages total
– Feeds SOLR index

SimilarPages.com
– Large cluster on Amazon EC2 (up to 400 nodes)
– Fetched & parsed 3 billion pages
– 10+ billion pages in crawlDB (~100TB data)
– 200+ million lists of similarities
– No indexing / search involved
CommonCrawl (http://commoncrawl.org/)

Open repository of web crawl data
– 2012 dataset : 3.83 billion docs
– ARC files on Amazon S3
Uses Nutch 1.7 with a few modifications to the Nutch code
– https://github.com/Aloisius/nutch
Next release imminent
Installation
http://nutch.apache.org/downloads.html
1.7 => src and bin distributions
2.2.1 => src only

'ant clean runtime'
– runtime/local => local mode (test and debug)
– runtime/deploy => job jar for Hadoop + scripts
Binary distribution for 1.x == runtime/local
Configuration and resources
Changes in $NUTCH_HOME/conf
– Need recompiling with 'ant runtime'
– Local mode => can be made directly in runtime/local/conf

Specify configuration in nutch-site.xml
– Leave nutch-default.xml alone!

At a minimum:

<property>
  <name>http.agent.name</name>
  <value>WhateverNameDescribesMyMightyCrawler</value>
</property>
Running it!
bin/crawl script : typical sequence of steps
bin/nutch : individual Nutch commands– Inject / generate / fetch / parse / update ….
Local mode : great for testing and debugging
Recommended : deploy + Hadoop in (pseudo-)distributed mode
– Parallelism
– MapReduce UI to monitor the crawl, check logs and counters
Monitor Crawl with MapReduce UI
Counters and logs
Typical Nutch Steps
1) Inject → populates CrawlDB from seed list
2) Generate → Selects URLs to fetch in a segment
3) Fetch → Fetches URLs from segment
4) Parse → Parses content (text + metadata)
5) UpdateDB → Updates CrawlDB (new URLs, new status...)
6) InvertLinks → Builds the webgraph (LinkDB)
7) Index → Sends docs to [SOLR | ES | CloudSearch | …]
Sequence of batch operations
Or use the all-in-one crawl script
Repeat steps 2 to 7
Same in 1.x and 2.x
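The batch cycle above can be sketched as a simple driver loop. This is a conceptual sketch only, not the real bin/crawl script (which shells out to the individual bin/nutch commands); the step functions and URL statuses here are invented for illustration:

```python
# Conceptual sketch of the Nutch batch crawl cycle: inject once, then
# repeat generate -> fetch/parse -> updatedb -> index for a fixed number
# of rounds. Real Nutch runs each step as a MapReduce job.

def run_crawl(seed_urls, rounds):
    crawldb = {url: "unfetched" for url in seed_urls}   # 1) inject
    steps_run = []
    for _ in range(rounds):
        # 2) generate: select unfetched URLs into a segment
        segment = [u for u, s in crawldb.items() if s == "unfetched"]
        steps_run.append(("generate", len(segment)))
        # 3+4) fetch and parse each URL in the segment
        for url in segment:
            crawldb[url] = "fetched"
        steps_run.append(("fetch", len(segment)))
        # 5) updatedb: newly discovered outlinks would be added here
        steps_run.append(("updatedb", 0))
        # 6+7) invertlinks + index the parsed docs
        steps_run.append(("index", len(segment)))
    return steps_run

print(run_crawl(["http://example.com/"], 1))
```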
Main steps from a data perspective
Seed List → CrawlDB → Segment

Segment contents:
– crawl_generate/
– crawl_fetch/
– content/
– crawl_parse/
– parse_data/
– parse_text/
LinkDB
Frontier expansion
Manual “discovery”
– Adding new URLs by hand, “seeding”

Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful: control needed
– Requires content parsing and link extraction

[Diagram: seed page at i = 1, newly discovered pages at i = 2 and i = 3]
[Slide courtesy of A. Bialecki]
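A minimal sketch of frontier expansion, assuming a toy in-memory link graph rather than real fetched content: each iteration "fetches" the current frontier and the outlinks found by parsing become the next frontier.

```python
# Sketch of frontier expansion: starting from the seeds, each iteration
# discovers outlinks one hop further out. Already-seen URLs are skipped.

def expand_frontier(graph, seeds, iterations):
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(iterations):
        discovered = set()
        for url in frontier:
            # outlinks require content parsing and link extraction
            for outlink in graph.get(url, []):
                if outlink not in seen:
                    discovered.add(outlink)
        seen |= discovered
        frontier = discovered        # next round fetches only the new URLs
    return seen

# Hypothetical link graph, not real crawl data
graph = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": ["e"]}
print(sorted(expand_frontier(graph, ["seed"], 2)))
```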
An extensible framework
Endpoints
– Protocol
– Parser
– HtmlParseFilter (a.k.a. ParseFilter in Nutch 2.x)
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter
– IndexWriter (NEW IN 1.7!)

Plugins
– Activated with the parameter 'plugin.includes'
– Implement one or more endpoints
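The endpoint/plugin mechanism can be illustrated with a toy registry. The filter names and the simplified 'plugin.includes' regex below are made-up examples, not real Nutch plugins; as in Nutch's URLFilter chain, any filter can veto a URL (Nutch filters signal rejection by returning null):

```python
# Toy sketch of plugin activation for one endpoint (URLFilter): only
# plugins whose name matches the 'plugin.includes' pattern are active,
# and a URL must survive every active filter.
import re

REGISTRY = {
    # hypothetical filter plugins
    "urlfilter-suffix": lambda url: None if url.endswith((".jpg", ".css")) else url,
    "urlfilter-https": lambda url: url if url.startswith(("http://", "https://")) else None,
    "urlfilter-length": lambda url: url if len(url) < 512 else None,
}

def active_filters(plugin_includes):
    pattern = re.compile(plugin_includes)
    return [f for name, f in REGISTRY.items() if pattern.match(name)]

def filter_url(url, filters):
    for f in filters:
        url = f(url)
        if url is None:          # any filter in the chain can veto the URL
            return None
    return url

filters = active_filters("urlfilter-(suffix|https)")
print(filter_url("https://example.com/page.html", filters))
print(filter_url("https://example.com/logo.jpg", filters))
```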
Features
Fetcher
– Multi-threaded
– Queues URLs per hostname / domain / IP
– Limits the number of URLs per round of fetching
– Default values are polite but can be made more aggressive

Crawl Strategy
– Breadth-first by default, but can be depth-first
– Configurable via custom ScoringFilters

Scoring
– OPIC (On-line Page Importance Calculation) by default
– LinkRank
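The idea behind OPIC can be shown with a toy sketch (not Nutch's actual implementation): every page holds some "cash" and distributes it equally to its outlinks, so the total importance mass is conserved while well-linked pages accumulate more of it.

```python
# Toy sketch of On-line Page Importance Calculation (OPIC): each page
# splits its cash equally among its outlinks; a page with no outlinks
# spreads its cash over all pages so the total is conserved.

def opic_step(graph, cash):
    new_cash = {page: 0.0 for page in cash}
    for page, amount in cash.items():
        outlinks = graph.get(page) or list(cash)   # dangling page: spread everywhere
        share = amount / len(outlinks)
        for target in outlinks:
            new_cash[target] = new_cash.get(target, 0.0) + share
    return new_cash

# Hypothetical three-page graph
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
cash = opic_step(graph, {"a": 1.0, "b": 1.0, "c": 1.0})
print(cash, "total:", round(sum(cash.values()), 6))
```

Repeating the step and accumulating each page's "history" of received cash is what yields the importance estimate.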
Features (cont.)
Protocols
– http, file, ftp, https
– Respects robots.txt directives

Scheduling
– Fixed or adaptive

URL filters
– Regex, FSA, TLD, prefix, suffix

URL normalisers
– Default, regex
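The regex URL filter's semantics can be sketched as follows: each rule line is prefixed with '+' (accept) or '-' (reject), the first matching rule wins, and a URL matching no rule is rejected. The rules below are made-up examples, not Nutch's defaults:

```python
# Sketch of regex-urlfilter semantics: ordered (+/-, pattern) rules,
# first match decides, no match means reject.
import re

RULES = [
    ("-", r".*\.(gif|jpg|png|css|js)$"),   # skip images and static assets
    ("-", r".*[?&]sessionid="),            # skip session-id URLs (hypothetical rule)
    ("+", r"^https?://"),                  # accept everything else over http(s)
]

def accepts(url, rules=RULES):
    for sign, pattern in rules:
        if re.match(pattern, url):
            return sign == "+"
    return False                           # no rule matched: reject

print(accepts("http://example.com/index.html"))
print(accepts("http://example.com/logo.png"))
```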
Features (cont.)
Other plugins
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata

Pluggable indexing
– SOLR | ES etc.

Parsing with Apache Tika
– Hundreds of formats supported
– But some legacy parsers as well
Indexing
Apache SOLR
– schema.xml in conf/
– SOLR 3.4
– JIRA issue for SOLRCloud: https://issues.apache.org/jira/browse/NUTCH-1377

ElasticSearch
– Version 0.90.1

AWS CloudSearch
– WIP : https://issues.apache.org/jira/browse/NUTCH-1517

Easy to build your own
– Text, DB, etc.
Typical Nutch document
Some of the fields (from IndexingFilters in plugins or core code)
– url
– content
– title
– anchor
– site
– boost
– digest
– segment
– host
– type

Configurable ones
– meta tags (keywords, description, etc.)
– arbitrary metadata
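A hypothetical document with those fields, assembled the way the indexing filters would before it is sent to SOLR / ES; all the values here are invented:

```python
# Example of a Nutch-style document; field names follow the slide above,
# values are made up. The digest is a checksum of the page content.
import hashlib

content = "Nutch is a distributed web crawling framework built on Hadoop."
doc = {
    "url": "http://nutch.apache.org/",
    "content": content,
    "title": "Apache Nutch",
    "host": "nutch.apache.org",
    "site": "nutch.apache.org",
    "type": "text/html",
    "digest": hashlib.md5(content.encode("utf-8")).hexdigest(),
    "boost": 1.0,
    "segment": "20131104123000",
}
print(doc["digest"])
```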
NUTCH 2.x
2.0 released in July 2012
2.2.1 in July 2013
Same core features as 1.x
– MapReduce, Tika, delegation to SOLR, etc.
Moved to 'big table'-like architecture– Wealth of NoSQL projects in last few years
Abstraction over storage layer → Apache GORA
Apache GORA
http://gora.apache.org/
ORM for NoSQL databases
– plus limited SQL support and file-based storage
Serialization with Apache AVRO
Object-to-datastore mappings (backend-specific)
DataStore implementations (current version 0.3):
● Accumulo
● Cassandra
● HBase
● Avro
● DynamoDB
● SQL (broken)
AVRO Schema => Java code
{"name": "WebPage",
 "type": "record",
 "namespace": "org.apache.nutch.storage",
 "fields": [
    {"name": "baseUrl", "type": ["null", "string"]},
    {"name": "status", "type": "int"},
    {"name": "fetchTime", "type": "long"},
    {"name": "prevFetchTime", "type": "long"},
    {"name": "fetchInterval", "type": "int"},
    {"name": "retriesSinceFetch", "type": "int"},
    {"name": "modifiedTime", "type": "long"},
    {"name": "protocolStatus", "type": {
        "name": "ProtocolStatus",
        "type": "record",
        "namespace": "org.apache.nutch.storage",
        "fields": [
            {"name": "code", "type": "int"},
            {"name": "args", "type": {"type": "array", "items": "string"}},
            {"name": "lastModified", "type": "long"}
        ]
    }},
[…]
Mapping file (backend specific – HBase)

<gora-orm>
  <table name="webpage">
    <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
    <family name="f" maxVersions="1"/>
    <family name="s" maxVersions="1"/>
    <family name="il" maxVersions="1"/>
    <family name="ol" maxVersions="1"/>
    <family name="h" maxVersions="1"/>
    <family name="mtdt" maxVersions="1"/>
    <family name="mk" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
DataStore operations
Basic operations
– get(K key)
– put(K key, T obj)
– delete(K key)

Querying
– execute(Query<K, T> query) → Result<K, T>
– deleteByQuery(Query<K, T> query)

Wrappers for Apache Hadoop
– GoraInput|OutputFormat
– GoraRecordReader|Writer
– GoraMapper|Reducer
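A minimal in-memory sketch of that DataStore interface, assuming string keys and dict values; real Gora stores persist Avro records to a backend, and queries run as scans over the table:

```python
# Minimal in-memory sketch of the Gora DataStore operations above:
# get / put / delete plus a simple key-range execute() query.

class MemStore:
    def __init__(self):
        self._rows = {}

    def put(self, key, obj):
        self._rows[key] = obj

    def get(self, key):
        return self._rows.get(key)

    def delete(self, key):
        return self._rows.pop(key, None) is not None

    def execute(self, start_key=None, end_key=None):
        # Result of a key-range query, like a scan over a big table
        return {k: v for k, v in sorted(self._rows.items())
                if (start_key is None or k >= start_key)
                and (end_key is None or k < end_key)}

store = MemStore()
store.put("com.example/page1", {"status": "fetched"})
store.put("com.example/page2", {"status": "unfetched"})
store.put("org.apache/nutch", {"status": "fetched"})
print(store.execute("com.example", "com.example0"))
```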
GORA in Nutch
AVRO schema provided and Java code pre-generated

Mapping files provided for backends
– can be modified if necessary

Need to rebuild to get the dependencies for the backend
– hence the source-only distribution of Nutch 2.x
http://wiki.apache.org/nutch/Nutch2Tutorial
Benefits
Storage still distributed and replicated
… but one big table
– status, metadata, content, text → one place
– no more segments
Resume-able fetch and parse steps
Easier interaction with other resources
– Third-party code just needs to use GORA and the schema

Simplifies the Nutch code

Potentially faster (e.g. the update step)
Drawbacks
More stuff to install and configure– Higher hardware requirements
Current performance :-(
– http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
– N2 + HBase : 2.7x slower than 1.x
– N2 + Cassandra : 4.4x slower than 1.x
– Due mostly to the GORA layer, not inherent to HBase or Cassandra
– https://issues.apache.org/jira/browse/GORA-119 → filtered scans
– Not all backends provide data locality!
Not as stable as Nutch 1.x
2.x Work in progress
Stabilise backend implementations
– GORA-HBase the most reliable

Synchronize features with 1.x
– e.g. missing LinkRank equivalent (GSoC 2013: use Apache Giraph)
– No pluggable indexers yet (NUTCH-1568)

Filter-enabled scans
– GORA-119 => no need to de-serialize the whole dataset
Future
New functionalities – Support for SOLRCloud– Sitemap (from CrawlerCommons library)– Canonical tag– Generic deduplication (NUTCH-656)
1.x and 2.x to coexist in parallel
– 2.x is not yet a replacement for 1.x
Move to new MapReduce API– Use Nutch on Hadoop 2.x
More delegation
Great deal done in recent years (SOLR, Tika)
Share code with crawler-commons (http://code.google.com/p/crawler-commons/)
– Fetcher / protocol handling
– URL normalisation / filtering

Move PageRank-like computations to a graph library
– Apache Giraph
– Should be more efficient + less code to maintain
Longer term
Hadoop 2.x & YARN
Convergence of batch and streaming– Storm / Samza / Storm-YARN / …
End of 100% batch operations?
– Fetch and parse as streaming?
– Always be fetching
– Generate / update / pagerank remain batch
See https://github.com/DigitalPebble/storm-crawler
Where to find out more?
Project page : http://nutch.apache.org/
Wiki : http://wiki.apache.org/nutch/
Mailing lists :
– [email protected]– [email protected]
Chapter in 'Hadoop the Definitive Guide' (T. White)– Understanding Hadoop is essential anyway...
Support / consulting : – http://wiki.apache.org/nutch/Support
Questions?