nutch
Nutch Overview
What Nutch is: http://nutch.apache.org/about.html
Nutch homepage: http://nutch.apache.org
Nutch-Hadoop Tutorial
How to set up Nutch (V1.1) and Hadoop:
http://wiki.apache.org/nutch/NutchHadoopTutorial
Step-by-Step or Whole-web Crawling
The following steps make up a script to crawl an intranet as well as the web. The script does not use the 'bin/nutch crawl' command or the 'Crawl' class present in Nutch, so the filters in 'conf/crawl-urlfilter.txt' have no effect on it. The filters for this script must be set in 'conf/regex-urlfilter.txt'.
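Because the filters for this script live in 'conf/regex-urlfilter.txt', it can help to draft the rules before crawling. The following is a minimal sketch, not the project's shipped configuration: the FILTER_FILE variable, the scratch file name, and the example.com domain are assumptions made for illustration. It writes the rules to a scratch file rather than touching the real configuration.

```shell
# Sketch: write example URL filter rules to a scratch file for review.
# FILTER_FILE and the example.com domain are illustrative assumptions;
# copy the rules into conf/regex-urlfilter.txt once they look right.
FILTER_FILE="${FILTER_FILE:-regex-urlfilter.sample.txt}"

cat > "$FILTER_FILE" <<'EOF'
# Each rule is '+' (accept) or '-' (reject) followed by a regular expression.
# The first rule that matches a URL decides.
-^(file|ftp|mailto):
+^http://([a-z0-9]*\.)*example\.com/
-.
EOF

echo "wrote $(grep -c '^[+-]' "$FILTER_FILE") rules to $FILTER_FILE"
```

The final '-.' rule rejects every URL that no earlier rule accepted, which keeps the crawl inside the listed domain.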
1. Injector
Convert injected urls to crawl db entries. Merge injected urls into crawl db.
command : bin/nutch inject /crawldb urls
example usage : bin/nutch inject crawl/crawldb urls
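The inject step reads its seed URLs from plain-text files in the 'urls' directory, one URL per line. A minimal sketch of preparing that directory follows; the URLS_DIR variable, the file name seed.txt, and the seed URL are assumptions chosen for illustration, and the inject call only runs when 'bin/nutch' is present in the current directory.

```shell
# Sketch: prepare the seed directory that "bin/nutch inject" reads.
# URLS_DIR, the file name seed.txt and the seed URL are assumptions.
URLS_DIR="${URLS_DIR:-urls}"

mkdir -p "$URLS_DIR"
# One URL per line.
printf '%s\n' 'http://nutch.apache.org/' > "$URLS_DIR/seed.txt"

# Run the inject step only if we are inside a Nutch installation.
if [ -x bin/nutch ]; then
    bin/nutch inject crawl/crawldb "$URLS_DIR"
else
    echo "bin/nutch not found; would run: bin/nutch inject crawl/crawldb $URLS_DIR"
fi
```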
2. Generator
Select best-scoring urls due for fetch. Create segments. Partition selected urls by host.
command : bin/nutch generate /crawldb /segments -topN
example usage : bin/nutch generate crawl/crawldb crawl/segments -topN 1000
3. Fetcher
Fetch remote pages.
command : bin/nutch fetch /segments/ -threads
example usage : bin/nutch fetch crawl/segments/20091210113212 -threads 10
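The fetch step needs the path of the segment that the generate step just created, and segment directories are named by timestamp (as in 20091210113212 above). A sketch of picking the newest segment, relying on those fixed-width names sorting chronologically; SEGMENTS_DIR and the latest_segment helper are hypothetical names, and the snippet only prints the fetch command instead of running it.

```shell
# Sketch: find the most recently generated segment to pass to "bin/nutch fetch".
# Segment directories are named by timestamp (e.g. 20091210113212), so the
# lexicographically last name is the newest one.
SEGMENTS_DIR="${SEGMENTS_DIR:-crawl/segments}"

latest_segment() {
    # Prints the path of the newest segment, or nothing if there is none.
    ls -d "$1"/*/ 2>/dev/null | sort | tail -n 1 | sed 's:/*$::'
}

segment="$(latest_segment "$SEGMENTS_DIR")"
if [ -n "$segment" ]; then
    echo "bin/nutch fetch $segment -threads 10"
else
    echo "no segments found under $SEGMENTS_DIR; run the generate step first"
fi
```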
4. CrawlDb update
Merge segment data into the db.
command : bin/nutch updatedb /crawldb /segments/ -filter
example usage : bin/nutch updatedb crawl/crawldb crawl/segments/20091210113212 -filter
5. LinkDb
Add the segments to the link database. Steps 2 to 5 are done in iterations.
command : bin/nutch invertlinks /linkdb /segments/*
example usage : bin/nutch invertlinks crawl/linkdb crawl/segments/*
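Since steps 2 to 5 are done in iterations, they can be wrapped in a loop. The following is a minimal sketch under stated assumptions: DEPTH, DRY_RUN, and the run and crawl_rounds helpers are hypothetical names, and the option values are the ones from the example usages above. By default it only prints the commands; set DRY_RUN=0 inside a Nutch installation to execute them.

```shell
# Sketch: repeat steps 2 to 5 for a fixed number of rounds.
# DRY_RUN, DEPTH and the helper names are illustrative assumptions.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands
DEPTH="${DEPTH:-2}"       # number of rounds

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

crawl_rounds() {
    i=1
    while [ "$i" -le "$DEPTH" ]; do
        # Stop early if the generate step fails.
        run bin/nutch generate crawl/crawldb crawl/segments -topN 1000 || break
        # Newest segment; in a dry run none exists, so use a placeholder.
        segment="$(ls -d crawl/segments/* 2>/dev/null | tail -n 1)"
        [ -n "$segment" ] || segment="crawl/segments/SEGMENT"
        run bin/nutch fetch "$segment" -threads 10
        run bin/nutch updatedb crawl/crawldb "$segment" -filter
        run bin/nutch invertlinks crawl/linkdb crawl/segments/*
        i=$((i + 1))
    done
}

crawl_rounds
```

The inject step stays outside the loop because the seed URLs only need to be added once.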
6. Index process
example usage : bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
7. Delete duplicates
example usage : bin/nutch dedup crawl/indexes
8. Merge indexes
example usage : bin/nutch merge crawl/index crawl/indexes
OR
*Using Solr to index:
Before you execute the commands below, add the 2 Solr .jar files of the Solr version you are using to the Nutch lib folder and remove the old ones, so that Nutch does not use outdated .jar files. Then recompile Nutch using the command "ant package".
6. Index process
Index the content so that it can be accessed. This article describes the process using Solr to index and to delete duplicates.
command : bin/nutch solrindex http://:// /crawldb /linkdb /segments/*
example usage : bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
7. Delete duplicates
command : bin/nutch solrdedup http://://
example usage : bin/nutch solrdedup http://127.0.0.1:8983/solr/
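Both Solr steps take the same Solr URL, so it is convenient to keep it in one place and to deduplicate only after indexing has succeeded. A sketch under those assumptions: SOLR_URL, DRY_RUN, and the solr_steps helper are hypothetical names, and the default URL is the one from the example usage above. By default the commands are only printed.

```shell
# Sketch: run the two Solr steps in order against one Solr URL.
# SOLR_URL, DRY_RUN and solr_steps are assumed names for illustration.
SOLR_URL="${SOLR_URL:-http://127.0.0.1:8983/solr/}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands

solr_steps() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "bin/nutch solrindex $SOLR_URL crawl/crawldb crawl/linkdb crawl/segments/*"
        echo "bin/nutch solrdedup $SOLR_URL"
    else
        # Deduplicate only if indexing succeeded.
        bin/nutch solrindex "$SOLR_URL" crawl/crawldb crawl/linkdb crawl/segments/* &&
            bin/nutch solrdedup "$SOLR_URL"
    fi
}

solr_steps
```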
Useful Links
Nutch-Solr integration on Hadoop: http://thewiki4opentech.org/index.php/Nutch
Introduction to Nutch: http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html