
  • Nutch Overview

    What Nutch is: http://nutch.apache.org/about.html

    Nutch homepage: http://nutch.apache.org

    Nutch-Hadoop Tutorial

    How to Setup Nutch (V1.1) and Hadoop:

    http://wiki.apache.org/nutch/NutchHadoopTutorial

    Step-by-Step or Whole-web Crawling

    This is a script to crawl an intranet as well as the web. It does not crawl using the 'bin/nutch crawl' command or the 'Crawl' class present in Nutch. Therefore the filters in 'conf/crawl-urlfilter.txt' have no effect on this script. The filters for this script must be set in 'conf/regex-urlfilter.txt'.
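    As a minimal sketch, a regex-urlfilter.txt that restricts the crawl to a single site might contain rules like the ones below; the domain example.com is only a placeholder. Rules are applied top to bottom, '+' accepts a URL and '-' rejects it:

    # skip URLs with common non-page suffixes
    -\.(gif|jpg|png|ico|css|js|zip|gz|pdf)$
    # accept anything under the placeholder domain example.com
    +^http://([a-z0-9]*\.)*example.com/
    # reject everything else
    -.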

    1.Injector

    Convert injected urls to crawl db entries. Merge injected urls into crawl db.

    command : bin/nutch inject <BASEDIR>/crawldb urls

    example usage : bin/nutch inject crawl/crawldb urls
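    The final 'urls' argument is a directory containing one or more plain-text seed files, one URL per line. A minimal sketch, using nutch.apache.org purely as a placeholder seed:

    mkdir urls
    echo 'http://nutch.apache.org/' > urls/seed.txt
    bin/nutch inject crawl/crawldb urls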

    2.Generator

    Select best-scoring urls due for fetch. Create segments. Partition selected urls by host.

    command : bin/nutch generate <BASEDIR>/crawldb <BASEDIR>/segments -topN <NUMDOC>

    example usage : bin/nutch generate crawl/crawldb crawl/segments -topN 1000

    3.Fetcher

    Fetch remote pages.

    command : bin/nutch fetch <BASEDIR>/segments/<SEGMENT> -threads <THREADS>

    example usage : bin/nutch fetch crawl/segments/20091210113212 -threads 10

    4.CrawlDb update

    Merge segment data into the crawl db.

    command : bin/nutch updatedb <BASEDIR>/crawldb <BASEDIR>/segments/<SEGMENT> -filter

    example usage : bin/nutch updatedb crawl/crawldb crawl/segments/20091210113212 -filter
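    To check that the update was applied, the crawl db can be inspected with the readdb tool; the path below matches the examples above:

    bin/nutch readdb crawl/crawldb -stats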


  • 5.LinkDb

    Invert links from the segments and add them to the link database (linkdb). Steps 2 to 5 are repeated, one iteration per crawl round; see the loop sketch below.

    command : bin/nutch invertlinks <BASEDIR>/linkdb <BASEDIR>/segments/*

    example usage : bin/nutch invertlinks crawl/linkdb crawl/segments/*
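    A rough sketch of such a loop, assuming the crawl/ layout used in the examples, three rounds, 1000 URLs per round and 10 fetcher threads; here the links are inverted once over all segments at the end:

    for i in 1 2 3; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000
      SEGMENT=`ls -d crawl/segments/* | tail -1`
      bin/nutch fetch $SEGMENT -threads 10
      bin/nutch updatedb crawl/crawldb $SEGMENT -filter
    done
    bin/nutch invertlinks crawl/linkdb crawl/segments/*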

    6.Index Process

    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

    7.Delete duplicates

    bin/nutch dedup crawl/indexes

    8.Merge Indexes

    bin/nutch merge crawl/index crawl/indexes
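    The merged Lucene index can be smoke-tested from the command line with the NutchBean class (present in Nutch 1.1), assuming the command is run from the directory containing crawl/, which is the default value of searcher.dir; the query term 'apache' is just an example:

    bin/nutch org.apache.nutch.searcher.NutchBean apache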

    OR

    *Using Solr to index:

    Before you execute the commands below, add the two Solr .jar files from the Solr version you are using to the Nutch lib folder and remove the old ones, so that the outdated .jar files are not picked up. Then recompile Nutch using the command "ant package".
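    A rough sketch of that preparation, with <SOLR_HOME> and <NUTCH_HOME> as placeholders; the exact .jar file names depend on the Solr release you use:

    # remove the old Solr .jar files shipped with Nutch, then copy in the new ones
    rm <NUTCH_HOME>/lib/apache-solr-*.jar
    cp <SOLR_HOME>/dist/apache-solr-*.jar <NUTCH_HOME>/lib/
    cd <NUTCH_HOME>
    ant package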

    6.Index process

    Index the content so it can be accessed. This section describes the process of using Solr to index and then delete duplicates.

    command : bin/nutch solrindex http://<HOST>:<PORT>/<PATH>/ <BASEDIR>/crawldb <BASEDIR>/linkdb <BASEDIR>/segments/*

    example usage : bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

    7.Delete duplicates

    command : bin/nutch solrdedup http://<HOST>:<PORT>/<PATH>/

    example usage : bin/nutch solrdedup http://127.0.0.1:8983/solr/
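    To confirm that documents actually reached Solr, the index can be queried directly; the URL matches the example Solr instance above:

    curl 'http://127.0.0.1:8983/solr/select?q=*:*&rows=5'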


  • Useful Links

    Nutch-Solr integration on Hadoop: http://thewiki4opentech.org/index.php/Nutch

    Introduction to Nutch: http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
