search engine - how to make it

Search Engine How To Make it Wednesday, December 12, 12

Upload: andreas-yunanto

Post on 02-Jul-2015

DESCRIPTION

Technical presentation on how to build a search engine with open source technologies

TRANSCRIPT

Page 1: Search Engine - How to Make it

Search Engine: How To Make It

Wednesday, December 12, 12

Page 2: Search Engine - How to Make it

Search Engine

• Search Quality Measurement

• All documents

• Retrieved documents (RET)

• Relevant documents (REL)

• RET ∩ REL

• Database search: low recall, high precision

• Web search: high recall, low precision
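
The slide names the sets RET and REL but does not spell out the formulas. A minimal sketch, assuming the standard definitions precision = |RET ∩ REL| / |RET| and recall = |RET ∩ REL| / |REL|; the function name and the sample document ids are hypothetical:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the documents the engine returned (RET)
    relevant:  ids of the documents that are actually relevant (REL)
    """
    ret, rel = set(retrieved), set(relevant)
    hits = ret & rel  # RET ∩ REL
    precision = len(hits) / len(ret) if ret else 0.0
    recall = len(hits) / len(rel) if rel else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 relevant overall, 2 in common.
p, r = precision_recall({"a", "b", "c", "d"}, {"b", "d", "e"})
print(f"precision={p:.2f} recall={r:.2f}")
```

Returning fewer, safer results tends to raise precision and lower recall, which matches the database search case on the slide; returning more results does the opposite.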

Page 3: Search Engine - How to Make it

Search Engine

[Architecture diagram]

• Sources: File System, 3rd party apps, Database

• Crawlers: File System Crawler, Crawler API, Database Crawler

• Raw content: PDF, Text, HTML, Document, Image, ...

• Parsers: Text Parser, HTML Parser, PDF Parser

• Parsed output: Documents (title, summary, author, datetime)

• Document Enhancing: Documents (Categorized, Taxonomized)

• Indexer: Stop Analyzer, Language Analyzer

• Index, Index Searcher

• Clients: Mobile Client, Web Client

• Document Landing Page

Page 4: Search Engine - How to Make it

Search Engine

• Process in Search Engine

• Crawling

• Parsing

• Indexing

• Searching

Page 5: Search Engine - How to Make it

Search Engine

• Process in Search Engine

• Crawling

• Parsing

• Duplicate Content Detection

• Document Enhancement

• Indexing

• Searching

• Document Serving
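
The stages listed above can be read as a chain of functions, each consuming the previous stage's output. A rough sketch of that idea only: every function name and the dictionary-based document shape are assumptions, each stage is a trivial stand-in, and searching and document serving are left out.

```python
def crawl(sources):
    # Collect raw content; here each source is already an (id, raw text) pair.
    return [{"id": doc_id, "raw": raw} for doc_id, raw in sources]

def parse(docs):
    # Extract a structured field from the raw content.
    return [{**d, "text": d["raw"].strip()} for d in docs]

def deduplicate(docs):
    # Keep the first document for each distinct text (exact match only).
    seen, unique = set(), []
    for d in docs:
        if d["text"] not in seen:
            seen.add(d["text"])
            unique.append(d)
    return unique

def enhance(docs):
    # Attach an empty tag list; real tagging would consult a taxonomy.
    return [{**d, "tags": []} for d in docs]

def index(docs):
    # Map each lowercase word to the ids of the documents containing it.
    inverted = {}
    for d in docs:
        for word in set(d["text"].lower().split()):
            inverted.setdefault(word, set()).add(d["id"])
    return inverted

def run_pipeline(sources):
    docs = crawl(sources)
    for stage in (parse, deduplicate, enhance):
        docs = stage(docs)
    return index(docs)

idx = run_pipeline([(1, " search engine "), (2, "search engine"), (3, "open source")])
print(sorted(idx["search"]))
```

Document 2 is dropped as a duplicate of document 1 once whitespace is stripped, so only document 1 is indexed under "search".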

Page 6: Search Engine - How to Make it

Search Engine

• Crawling

• Collecting Data

• Input : Data content to Search

• Output : Raw Content Data in its original format
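
As an illustration of a crawler that returns content "in its original format", here is a small file system crawler sketch. The function name is hypothetical, and it assumes files are small enough to read into memory in one go.

```python
import os
import tempfile

def crawl_directory(root):
    """Yield (path, raw_bytes) for every readable file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    # Bytes are returned untouched; decoding is the parser's job.
                    yield path, f.read()
            except OSError:
                # Unreadable file: skip it and keep crawling.
                continue

# Demo on a throwaway directory so the example is self-contained.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "a.txt"), "wb") as f:
        f.write(b"hello")
    crawled = [(os.path.basename(p), raw) for p, raw in crawl_directory(root)]
print(crawled)
```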

Page 7: Search Engine - How to Make it

Search Engine

• Crawling

• Sources: File System, 3rd party apps, Database

• Crawlers: File System Crawler, Crawler API, Database Crawler

• Output: PDF, Text, HTML, Document, Image, ...

Page 8: Search Engine - How to Make it

Search Engine

• Parsing

• Process to extract elements from crawled documents

• Input : Raw Contents

• Output : Textual Structured Documents
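
A possible shape for an HTML parser, using only Python's standard `html.parser` module. The class and function names are hypothetical, and taking the first 200 characters of visible text as the "summary" is an assumption made for this sketch, not something the slides specify.

```python
from html.parser import HTMLParser

class SimpleHTMLDocumentParser(HTMLParser):
    """Collect the <title> text and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())

def parse_html(raw_html):
    """Turn raw HTML into a structured document with title and summary."""
    parser = SimpleHTMLDocumentParser()
    parser.feed(raw_html)
    parser.close()
    return {
        "title": "".join(parser.title_parts).strip(),
        "summary": " ".join(parser.text_parts)[:200],
    }

doc = parse_html("<html><head><title>Hello</title><style>p{}</style></head>"
                 "<body><p>Search engines index documents.</p></body></html>")
print(doc)
```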

Page 9: Search Engine - How to Make it

Search Engine

• Parsing

• Input: PDF, Text, HTML, Document, Image, ...

• Parsers: Text Parser, HTML Parser, PDF Parser

• Output: Documents (title, summary, author, datetime)

Page 10: Search Engine - How to Make it

Search Engine

• Content Duplication Detection

• More data means more duplicated data

• The search engine implements similar-document detection

Page 11: Search Engine - How to Make it

Search Engine

• Document Representation

Model: Term Frequency (Tf)

Example:

Document 1 (d1) = "andi likes to watch movie. His wife likes it too"

Document 2 (d2) = "andi also likes to watch soccer game."

Dictionary = {1:andi, 2:likes, 3:watch, 4:movie, 5:wife, 6:too, 7:soccer}

Document representation in the Tf model:

d1 = {1, 2, 1, 1, 1, 1, 0}

d2 = {1, 1, 1, 0, 0, 0, 1}

Page 12: Search Engine - How to Make it

Search Engine

• Document Similarity

Similarity between documents d1 and d2: S(d1, d2)

S(d1, d2) = |d1 - d2|, the sum of the absolute differences between corresponding term counts

d1 = {1, 2, 1, 1, 1, 1, 0}

d2 = {1, 1, 1, 0, 0, 0, 1}

Example:

S(d1, d2) = |1-1|+|2-1|+|1-1|+|1-0|+|1-0|+|1-0|+|0-1|

S(d1, d2) = 5

With this definition, a smaller value of S means the two documents are more similar.
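
The Tf representation and the measure S can be computed directly from the two example sentences. A minimal sketch: the function names are hypothetical, the dictionary is the one on the slide, and lowercasing plus splitting on letters is an assumed tokenization.

```python
import re
from collections import Counter

# Dictionary from the slide, in slide order (terms 1 to 7).
DICTIONARY = ["andi", "likes", "watch", "movie", "wife", "too", "soccer"]

def tf_vector(text, dictionary=DICTIONARY):
    """Count how often each dictionary term occurs in text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in dictionary]

def distance(v1, v2):
    """S(d1, d2): sum of absolute differences; smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(v1, v2))

d1 = tf_vector("andi likes to watch movie. His wife likes it too")
d2 = tf_vector("andi also likes to watch soccer game.")
print(d1, d2, distance(d1, d2))
```

Words outside the dictionary ("to", "his", "it", "also", "game") are simply ignored, which is how the slide's seven-term vectors come about.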

Page 13: Search Engine - How to Make it

Search Engine

• Algorithm

1. Count Tf for every document.

2. Find the smallest value of S(d, di) over the whole document collection to get the document di most similar to d.

3. If S(d, di) < threshold, compare the creation dates of d and di, then erase the older document.

4. Repeat steps 2 and 3 until no value of S is less than the threshold.
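
One way to read these steps is as a loop that keeps removing the older member of the closest pair until no pair falls under the threshold. The sketch below follows that reading; the function names, the input shape (id mapped to a Tf vector and a creation date), and the all-pairs search are assumptions, and comparing every pair is only practical for small collections.

```python
from datetime import date
from itertools import combinations

def distance(v1, v2):
    # S(d, di): sum of absolute differences between Tf vectors.
    return sum(abs(a - b) for a, b in zip(v1, v2))

def remove_near_duplicates(docs, threshold):
    """docs maps id -> (tf_vector, created); returns the surviving ids."""
    remaining = dict(docs)
    while True:
        # Step 2: find the closest pair among the remaining documents.
        closest = min(
            combinations(remaining, 2),
            key=lambda pair: distance(remaining[pair[0]][0], remaining[pair[1]][0]),
            default=None,
        )
        if closest is None:
            break  # fewer than two documents left
        a, b = closest
        if distance(remaining[a][0], remaining[b][0]) >= threshold:
            break  # Step 4: no pair is below the threshold any more.
        # Step 3: erase the older document (on a tie, the second one goes).
        older = a if remaining[a][1] < remaining[b][1] else b
        del remaining[older]
    return sorted(remaining)

# Hypothetical collection with precomputed Tf vectors (step 1).
docs = {
    "old": ([1, 2, 1, 1], date(2011, 5, 1)),
    "new": ([1, 2, 1, 0], date(2012, 5, 1)),
    "other": ([0, 0, 5, 3], date(2012, 6, 1)),
}
survivors = remove_near_duplicates(docs, threshold=2)
print(survivors)
```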

Page 14: Search Engine - How to Make it

Search Engine

• Document Enhancement

• Tag documents based on a taxonomy
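
The slides do not say how tags are assigned. As one simple possibility, a document can be tagged with every category whose keywords appear in its text. The taxonomy below, the function name, and the field names used for matching are all made up for illustration.

```python
import re

# Hypothetical two-category taxonomy: category -> trigger keywords.
TAXONOMY = {
    "sports": {"soccer", "game", "match"},
    "entertainment": {"movie", "film", "cinema"},
}

def tag_document(doc, taxonomy=TAXONOMY):
    """Return a copy of doc with a sorted 'categories' list added."""
    text = (doc["title"] + " " + doc["summary"]).lower()
    words = set(re.findall(r"[a-z]+", text))
    # A category applies when at least one of its keywords occurs in the text.
    categories = sorted(cat for cat, keywords in taxonomy.items() if words & keywords)
    return {**doc, "categories": categories}

doc = {"title": "Weekend", "summary": "andi also likes to watch soccer game."}
tagged = tag_document(doc)
print(tagged["categories"])
```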

Page 15: Search Engine - How to Make it

Search Engine

• Document Enhancement

• Input: Documents (title, summary, author, datetime)

• Process: Document Enhancing

• Output: Documents (Categorized, Taxonomized)

Page 16: Search Engine - How to Make it

Search Engine

• Indexing

• Indexes all the information that has been gathered for each document

• Makes searching faster

• Allows searching on a specific field
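
A per-field inverted index is one common way to get both properties: lookups by term are fast, and each field can be queried separately. This is a toy sketch, not how Lucene stores its index; the function names, the stop-word list, the field names, and the sample documents (including the author "budi") are assumptions.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "it", "of", "the", "to"}  # illustrative only

def analyze(text):
    """Lowercase, split into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def build_index(docs, fields=("title", "summary", "author")):
    """Return {field: {term: set of doc ids}} for the given documents."""
    index = {field: defaultdict(set) for field in fields}
    for doc_id, doc in docs.items():
        for field in fields:
            for term in analyze(doc.get(field, "")):
                index[field][term].add(doc_id)
    return index

docs = {
    1: {"title": "How to watch a movie", "summary": "andi likes to watch movie",
        "author": "andi"},
    2: {"title": "Soccer game", "summary": "andi also likes to watch soccer game",
        "author": "budi"},
}
index = build_index(docs)
print(sorted(index["summary"]["watch"]), sorted(index["author"]["andi"]))
```

Finding the documents for a term is a single dictionary lookup instead of a scan over every document, and the same term can match in one field without matching in another.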

Page 17: Search Engine - How to Make it

Search Engine

• Indexing

• Input: Documents (Categorized, Taxonomized)

• Indexer: Stop Analyzer, Language Analyzer

• Output: Index

Page 18: Search Engine - How to Make it

Search Engine

• Searching

• Index, Index Searcher

• Clients: Mobile Client, Web Client
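
To illustrate what an index searcher does with a client's query, here is a small sketch that answers AND queries against an inverted index. The index layout (field, then term, then a set of document ids), the function name, and the sample data are assumptions; ranking is left out entirely.

```python
def search(index, query, field="summary"):
    """Return ids of documents whose field contains every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    # One posting set per term; a term missing from the index matches nothing.
    postings = [index.get(field, {}).get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))

# Hypothetical pre-built index: field -> term -> ids of matching documents.
index = {
    "summary": {"andi": {1, 2}, "watch": {1, 2}, "movie": {1}, "soccer": {2}},
    "author": {"andi": {1}},
}
print(search(index, "watch soccer"))
print(search(index, "andi", field="author"))
```

A web client and a mobile client could both call the same function and differ only in how they present the returned ids.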

Page 19: Search Engine - How to Make it

Search Engine

• Document Serving

• The search engine is also responsible for displaying results
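
As a small illustration of displaying a result, the sketch below formats one hit as a title plus a snippet with the query terms marked. The function name, the bracket highlighting, and the 60-character snippet width are arbitrary choices for the example.

```python
import re

def render_result(doc, query, width=60):
    """Format one hit: title, then a snippet with query terms in [brackets]."""
    snippet = doc["summary"][:width]
    # dict.fromkeys removes repeated terms while keeping their order.
    for term in dict.fromkeys(query.lower().split()):
        pattern = rf"\b({re.escape(term)})\b"
        snippet = re.sub(pattern, r"[\1]", snippet, flags=re.IGNORECASE)
    return f"{doc['title']}\n  {snippet}"

doc = {"title": "Soccer game", "summary": "andi also likes to watch soccer game."}
rendered = render_result(doc, "soccer")
print(rendered)
```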

Page 20: Search Engine - How to Make it

Search Engine

• Index, Index Searcher

• Clients: Mobile Client, Web Client

• Document Landing Page

Page 21: Search Engine - How to Make it

Search Engine

• Recommended Open Source Technology

• Search Engine : Lucene, Nutch

• Programming Library : Hadoop, Scala Actor

• Database : MongoDB, PostgreSQL

• Programming Language : Java, Scala, PHP

Page 22: Search Engine - How to Make it

Thank You
