search engine - how to make it

Post on 02-Jul-2015

Category:

Technology

DESCRIPTION

Technical presentation on how to build a search engine with open source technologies

TRANSCRIPT

Search Engine: How To Make It

Wednesday, December 12, 12

Search Engine

[Venn diagram: the set of all documents contains retrieved documents (RET) and relevant documents (REL), which overlap in RET ∩ REL]

database search: low recall, high precision

web search: high recall, low precision

Search Quality Measurement
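
The sets in the diagram correspond to the two standard quality measures: precision = |RET ∩ REL| / |RET| and recall = |RET ∩ REL| / |REL|. A minimal sketch of that calculation follows; the function name and the document IDs are made up for illustration and do not come from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    ret, rel = set(retrieved), set(relevant)
    hits = ret & rel  # RET ∩ REL
    # Guard against empty sets so the sketch never divides by zero.
    precision = len(hits) / len(ret) if ret else 0.0
    recall = len(hits) / len(rel) if rel else 0.0
    return precision, recall


# Hypothetical IDs: 4 documents retrieved, 3 relevant, 2 in both sets.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6])
print(f"precision={p:.2f} recall={r:.2f}")
```

Returning many documents tends to raise recall and lower precision, which matches the web search versus database search contrast on the slide.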

Search Engine

[Architecture diagram. Sources: File System, 3rd party apps, Database. Crawlers: File System Crawler, Crawler API, Database Crawler. Raw content: PDF, Text, HTML, Document, Image, ... Parsers: Text Parser, HTML Parser, PDF Parser. Parsed output: Documents (title, summary, author, datetime). Document Enhancing output: Documents (Categorized, Taxonomized). Indexer with Stop Analyzer and Language Analyzer, producing the Index. Index Searcher serving the Mobile Client, the Web Client, and the Document Landing Page.]

Search Engine

• Process in Search Engine

• Crawling

• Parsing

• Indexing

• Searching

Search Engine

• Process in Search Engine

• Crawling

• Parsing

• Duplicate Content Detection

• Document Enhancement

• Indexing

• Searching

• Document Serving

Search Engine

• Crawling

• Collecting Data

• Input: the data content to be searched

• Output: raw content data in its original format
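
As an illustration of the crawling step, here is a minimal sketch of a file system crawler that walks a directory and hands back each file's raw bytes unchanged. The function name `crawl_file_system` is hypothetical; the slides do not specify a crawler API.

```python
import os
from typing import Iterator, Tuple


def crawl_file_system(root: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, raw bytes) for every readable file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    # Raw content is passed on untouched; parsing happens later.
                    yield path, handle.read()
            except OSError:
                # Unreadable files are skipped rather than aborting the crawl.
                continue
```

A caller would loop over `crawl_file_system(some_directory)` and pass each `(path, raw)` pair to the parser that matches the file type.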

Search Engine

• Crawling

[Diagram: sources (File System, 3rd party apps, Database), crawlers (File System Crawler, Crawler API, Database Crawler), raw content (PDF, Text, HTML, Document, Image, ...)]

Search Engine

• Parsing

• The process of extracting elements from crawled documents

• Input: raw content

• Output: structured text documents
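
As an illustration of the parsing step, here is a minimal sketch of an HTML parser built on Python's standard `html.parser` module. It extracts only a title and the visible text; the class and function names are hypothetical, and fields such as summary, author, and datetime would need format-specific rules that the slides do not describe.

```python
from html.parser import HTMLParser


class SimpleHTMLDocumentParser(HTMLParser):
    """Collect the <title> text and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_html(raw: bytes, encoding: str = "utf-8") -> dict:
    """Turn raw HTML bytes into a structured document with title and text."""
    parser = SimpleHTMLDocumentParser()
    parser.feed(raw.decode(encoding, errors="replace"))
    parser.close()
    return {
        "title": "".join(parser.title_parts).strip(),
        "text": " ".join(parser.text_parts),
    }
```

Text and PDF parsers would follow the same contract: raw bytes in, a dictionary of named fields out.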

Search Engine

• Parsing

[Diagram: raw content (PDF, Text, HTML, Document, Image, ...), parsers (Text Parser, HTML Parser, PDF Parser), output Documents (title, summary, author, datetime)]

Search Engine

• Duplicate Content Detection

• More data means more duplicated data

• A search engine implements similar-document detection

Search Engine

• Document Representation

Model: Term Frequency (Tf)

Example:

Document 1 (d1) = "andi likes to watch movie. His wife likes it too"

Document 2 (d2) = "andi also likes to watch soccer game."

Dictionary = {1:andi, 2:likes, 3:watch, 4:movie, 5:wife, 6:too, 7:soccer}

Document representation in the Tf model (words outside the dictionary are ignored):

d1 = {1, 2, 1, 1, 1, 1, 0}

d2 = {1, 1, 1, 0, 0, 0, 1}

Search Engine

• Document Similarity

Similarity between documents d1 and d2: S(d1, d2)

S(d1, d2) = |d1 - d2|, the sum of the absolute differences between the term counts

d1 = {1, 2, 1, 1, 1, 1, 0}

d2 = {1, 1, 1, 0, 0, 0, 1}

Example:

S(d1, d2) = |1-1|+|2-1|+|1-1|+|1-0|+|1-0|+|1-0|+|0-1|

S(d1, d2) = 5

With this definition, a smaller value means the two documents are more similar.
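
The Tf representation and S can be computed directly from the two example sentences. A minimal sketch follows; the tokenizer (lowercase, letters only) and the function names are assumptions, since the slides do not say how the text is split into words.

```python
import re

# Dictionary from the slide, in the same order (list index 0 is term 1).
DICTIONARY = ["andi", "likes", "watch", "movie", "wife", "too", "soccer"]


def tf_vector(text, dictionary=DICTIONARY):
    """Count how often each dictionary term occurs in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(term) for term in dictionary]


def distance(v1, v2):
    """S(d1, d2): sum of absolute differences between two Tf vectors."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


d1 = tf_vector("andi likes to watch movie. His wife likes it too")
d2 = tf_vector("andi also likes to watch soccer game.")
print(d1, d2, distance(d1, d2))
```

Because S sums absolute differences, it is a distance: identical vectors give 0, and larger values mean less similar documents.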

Search Engine

• Algorithm

1. Compute Tf for every document.

2. For a document d, find the smallest value of S(d, di) over all documents di in the collection to get the document most similar to d.

3. If S(d, di) < threshold, compare the creation dates of d and di, then erase the older document.

4. Repeat steps 2 and 3 until no value of S is less than the threshold.
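
The steps above can be sketched as follows. This is one possible reading of the slide, not a confirmed implementation: it repeatedly takes the closest pair in the whole collection, and the record layout (`id`, `tf`, `created`) and function names are hypothetical. Comparing every pair on each pass is fine for a demonstration but too slow for a large collection.

```python
from itertools import combinations


def distance(v1, v2):
    """S: sum of absolute differences between two Tf vectors."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


def remove_near_duplicates(docs, threshold):
    """Drop the older document of every pair whose S is below threshold.

    docs is a list of dicts with keys 'id', 'tf' (Tf vector) and
    'created' (any comparable timestamp). Returns the surviving docs.
    """
    remaining = list(docs)
    while len(remaining) > 1:
        # Step 2: find the most similar pair still in the collection.
        a, b = min(
            combinations(remaining, 2),
            key=lambda pair: distance(pair[0]["tf"], pair[1]["tf"]),
        )
        # Step 4: stop once no pair is closer than the threshold.
        if distance(a["tf"], b["tf"]) >= threshold:
            break
        # Step 3: erase the older of the two documents.
        older = a if a["created"] < b["created"] else b
        remaining = [d for d in remaining if d is not older]
    return remaining


docs = [
    {"id": "A", "tf": [1, 2, 1, 1, 1, 1, 0], "created": "2012-12-01"},
    {"id": "B", "tf": [1, 2, 1, 1, 1, 1, 0], "created": "2012-12-05"},
    {"id": "C", "tf": [1, 1, 1, 0, 0, 0, 1], "created": "2012-12-09"},
]
survivors = remove_near_duplicates(docs, threshold=2)
print([d["id"] for d in survivors])
```

ISO-formatted date strings are used for `created` because they sort in chronological order as plain strings.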

Search Engine

• Document Enhancement

• Tag documents based on a taxonomy
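
One simple way to tag against a taxonomy is keyword matching. The sketch below assumes a tiny, made-up taxonomy and a hypothetical `tag_document` function; the slides do not say which tagging method or taxonomy was used.

```python
import re

# Hypothetical taxonomy: category name -> keywords that signal it.
TAXONOMY = {
    "Sports": {"soccer", "game", "match"},
    "Entertainment": {"movie", "film", "cinema"},
}


def tag_document(doc, taxonomy=TAXONOMY):
    """Return a copy of doc with a sorted 'categories' list added."""
    text = f"{doc.get('title', '')} {doc.get('summary', '')}".lower()
    words = set(re.findall(r"[a-z]+", text))
    # A category applies when at least one of its keywords is present.
    categories = sorted(
        name for name, keywords in taxonomy.items() if words & keywords
    )
    return {**doc, "categories": categories}


tagged = tag_document({"title": "Andi likes to watch soccer game", "summary": ""})
print(tagged["categories"])
```

The original fields are preserved, so the enhanced document can go to the indexer with its categories as one more searchable field.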

Search Engine

• Document Enhancement

[Diagram: Documents (title, summary, author, datetime), Document Enhancing, Documents (Categorized, Taxonomized)]

Search Engine

• Indexing

• Index all the information that has been gathered for each document

• Makes searching faster

• Enables searching on a specific field
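
To show why an index makes searching faster and field-aware, here is a toy inverted index keyed by (field, term). It is only an illustration and not how Lucene is implemented; the stop word list, class name, and method names are assumptions.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "it", "the", "to"}  # illustrative only


def analyze(text):
    """Lowercase, split into words, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


class InvertedIndex:
    def __init__(self):
        # (field, term) -> set of IDs of documents with that term in that field
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        """Index a document given as a dict of field name -> text."""
        for field, text in fields.items():
            for term in analyze(text):
                self.postings[(field, term)].add(doc_id)

    def search(self, field, query):
        """Return IDs of documents whose field contains every query term."""
        terms = analyze(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self.postings.get((field, term), set())
            result = set(ids) if result is None else result & ids
        return result


index = InvertedIndex()
index.add(1, {"title": "Watch the movie", "author": "andi"})
index.add(2, {"title": "Soccer game to watch", "author": "budi"})
print(index.search("title", "watch movie"), index.search("author", "budi"))
```

A lookup touches only the postings for the query terms instead of scanning every document, which is where the speed-up comes from.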

Search Engine

• Indexing

[Diagram: Documents (Categorized, Taxonomized), Indexer with Stop Analyzer and Language Analyzer, Index]

Search Engine

• Searching

[Diagram: Index, Index Searcher, Mobile Client, Web Client]

Search Engine

• Document Serving

• The search engine is also responsible for displaying the results

Search Engine

[Diagram: Index, Index Searcher, Mobile Client, Web Client, Document Landing Page]

Search Engine

• Recommended Open Source Technology

• Search Engine: Lucene, Nutch

• Programming Library: Hadoop, Scala Actor

• Database: MongoDB, PostgreSQL

• Programming Language: Java, Scala, PHP

Thank You
