
  • Nutch Overview

    What Nutch is: http://nutch.apache.org/about.html

    Nutch homepage: http://nutch.apache.org

    Nutch-Hadoop Tutorial

    How to Setup Nutch (V1.1) and Hadoop:

    http://wiki.apache.org/nutch/NutchHadoopTutorial

    Step-by-Step or Whole-web Crawling

    This is a script to crawl an intranet as well as the web. It does not crawl using the 'bin/nutch crawl' command or the 'Crawl' class present in Nutch. Therefore the filters in 'conf/crawl-urlfilter.txt' have no effect on this script. The filters for this script must be set in 'conf/regex-urlfilter.txt'.
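    As a minimal sketch, a regex-urlfilter.txt that restricts the crawl to a single site might contain rules like the ones below; the domain example.com is only a placeholder. Rules are applied top to bottom, '+' accepts a URL and '-' rejects it:

    # skip URLs with common non-page suffixes
    -\.(gif|jpg|png|ico|css|js|zip|gz|pdf)$
    # accept anything under the placeholder domain example.com
    +^http://([a-z0-9]*\.)*example.com/
    # reject everything else
    -.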

    1.Injector

    Convert injected urls to crawl db entries. Merge injected urls into crawl db.

    command : bin/nutch inject <BASEDIR>/crawldb urls

    example usage : bin/nutch inject crawl/crawldb urls
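    The final 'urls' argument is a directory containing one or more plain-text seed files, one URL per line. A minimal sketch, using nutch.apache.org purely as a placeholder seed:

    mkdir urls
    echo 'http://nutch.apache.org/' > urls/seed.txt
    bin/nutch inject crawl/crawldb urls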

    2.Generator

    Select best-scoring urls due for fetch. Create segments. Partition selected urls by host.

    command : bin/nutch generate <BASEDIR>/crawldb <BASEDIR>/segments -topN <NUMDOC>

    example usage : bin/nutch generate crawl/crawldb crawl/segments -topN 1000

    3.Fetcher

    Fetch remote pages.

    command : bin/nutch fetch <BASEDIR>/segments/<SEGMENT> -threads <THREADS>

    example usage : bin/nutch fetch crawl/segments/20091210113212 -threads 10

    4.CrawlDb update

    Merge segment data into the crawl db.

    command : bin/nutch updatedb <BASEDIR>/crawldb <BASEDIR>/segments/<SEGMENT> -filter

    example usage : bin/nutch updatedb crawl/crawldb crawl/segments/20091210113212 -filter
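    To check that the update was applied, the crawl db can be inspected with the readdb tool; the path below matches the examples above:

    bin/nutch readdb crawl/crawldb -stats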


  • 5.LinkDb

    Invert links from the segments and add them to the link database (linkdb). Steps 2 to 5 are repeated, one iteration per crawl round; see the loop sketch below.

    command : bin/nutch invertlinks <BASEDIR>/linkdb <BASEDIR>/segments/*

    example usage : bin/nutch invertlinks crawl/linkdb crawl/segments/*
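    A rough sketch of such a loop, assuming the crawl/ layout used in the examples, three rounds, 1000 URLs per round and 10 fetcher threads; here the links are inverted once over all segments at the end:

    for i in 1 2 3; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000
      SEGMENT=`ls -d crawl/segments/* | tail -1`
      bin/nutch fetch $SEGMENT -threads 10
      bin/nutch updatedb crawl/crawldb $SEGMENT -filter
    done
    bin/nutch invertlinks crawl/linkdb crawl/segments/*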

    6.Index Process

    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

    7.Delete duplicates

    bin/nutch dedup crawl/indexes

    8.Merge Indexes

    bin/nutch merge crawl/index crawl/indexes
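    The merged Lucene index can be smoke-tested from the command line with the NutchBean class (present in Nutch 1.1), assuming the command is run from the directory containing crawl/, which is the default value of searcher.dir; the query term 'apache' is just an example:

    bin/nutch org.apache.nutch.searcher.NutchBean apache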

    OR

    *Using Solr to index:

    Before you execute the commands below, add the two Solr .jar files from the Solr version you are using to the Nutch lib folder and remove the old ones, so that the outdated .jar files are not picked up. Then recompile Nutch using the command "ant package".
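    A rough sketch of that preparation, with <SOLR_HOME> and <NUTCH_HOME> as placeholders; the exact .jar file names depend on the Solr release you use:

    # remove the old Solr .jar files shipped with Nutch, then copy in the new ones
    rm <NUTCH_HOME>/lib/apache-solr-*.jar
    cp <SOLR_HOME>/dist/apache-solr-*.jar <NUTCH_HOME>/lib/
    cd <NUTCH_HOME>
    ant package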

    6.Index process

    Index the content so it can be accessed. This section describes the process of using Solr to index and then delete duplicates.

    command : bin/nutch solrindex http://<HOST>:<PORT>/<PATH>/ <BASEDIR>/crawldb <BASEDIR>/linkdb <BASEDIR>/segments/*

    example usage : bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

    7.Delete duplicates

    command : bin/nutch solrdedup http://<HOST>:<PORT>/<PATH>/

    example usage : bin/nutch solrdedup http://127.0.0.1:8983/solr/
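    To confirm that documents actually reached Solr, the index can be queried directly; the URL matches the example Solr instance above:

    curl 'http://127.0.0.1:8983/solr/select?q=*:*&rows=5'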


  • Useful Links

    Nutch-Solr integration on Hadoop: http://thewiki4opentech.org/index.php/Nutch

    Introduction to Nutch: http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
