A Full Text Search Engine For BBS Lily
Presenter: Gu Rong    Advisor: Huang Yihua
Email:[email protected]
Contents
Background
Brief Intro to the Principle of Full Text Search Engine
Implementation of FTSE for BBS Lily
Maybe Google & Baidu have done these...
Conclusion
1. Background
1.1 What is a full text search engine?
1.2 Why do we need it?
What is a full text search engine?
In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user. (from Wikipedia)
Why do we need a FTSE for BBS Lily?
Data in BBS Lily:
- Base capacity: around 3 million posts in total;
- Increasing speed: over a thousand new posts every day;
- Post granularity: each post is 1K~4K in size.
2. Brief Intro to the Principle of Full Text Search Engine
What happens after you press enter?
Abstract IR Architecture
[Diagram: online side: the query passes through a representation function to become a query representation; a comparison function matches it against document representations via the index and returns hits. Offline side: document acquisition (e.g., web crawling) feeds documents through a representation function into the index.]
About the Representation Function
Documents -> Bag of Words -> Inverted Index
- Documents to Bag of Words: case folding, tokenization, stopword removal, stemming;
- What gets discarded: syntax, semantics, word knowledge, etc.
A Simple Inverted Index Demo
Document collection:
  Doc 1: one fish, two fish
  Doc 2: red fish, blue hat
  Doc 3: cat in the hat
  Doc 4: green eggs and ham
Resulting inverted index (term -> postings of <doc id, term frequency>):
  blue  -> (2,1)
  cat   -> (3,1)
  egg   -> (4,1)
  fish  -> (1,2) (2,1)
  green -> (4,1)
  ham   -> (4,1)
  hat   -> (2,1) (3,1)
  one   -> (1,1)
  red   -> (2,1)
  two   -> (1,1)
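The toy index above can be reproduced in a few lines of Java. This is an illustrative in-memory sketch, not the project's actual Hadoop code: each document is case-folded and tokenized, and every term maps to a postings list of <doc id, term frequency> sorted by doc id.

```java
import java.util.*;

public class InvertedIndexDemo {
    // Build term -> (docId -> term frequency), postings sorted by doc id.
    static Map<String, SortedMap<Integer, Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedMap<Integer, Integer>> index = new HashMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // Case folding + crude tokenization; a real pipeline would also
            // remove stopwords and apply stemming.
            for (String term : doc.getValue().toLowerCase().split("[^a-z]+")) {
                if (term.isEmpty()) continue;
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }
}
```

Running it on the four demo documents yields, for example, the posting (1,2) for "fish", because "fish" occurs twice in Doc 1.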
Map/Reduce's Role...
Retrieval Problem (not so good for Map/Reduce): 1. must have sub-second response time; 2. for the web, only relatively few results are needed.
Indexing Problem (perfect for Map/Reduce!): 1. scalability; 2. relatively fast; 3. batch operation; 4. updates may not be important; 5. crawling is a challenge in itself.
3. Implementation of FTSE for BBS Lily
3.1 Outline of Work Flow
3.2 Crawl Web Pages & Mine Info
3.3 Indexing Process
3.4 Set up Web Retrieval Interface (Respond to Query String)
3.5 Optimization
3.1 Outline of Work Flow
[Diagram of the work flow:
- Crawler: web page 0..n -> crawl & info mining -> formatted files (/Content, /Vice Info).
- Map/Reduce: formatted files -> inverted index & ranking (<DID,Rank> lists) -> index for indices.
- Web Retrieval: JSP page -> split query string into term 0..n -> search & merge over token 0..n -> target DIDs -> result list (Title, Context, Author, URL, Hot).]
3.2 Crawl Web Pages & Mine Info
3.2.1 Target
3.2.2 Framework of Lily BBS
3.2.3 Strategy of Crawler
3.2.4 Strategy of Miner
Target of Crawler & Miner
A. Crawler: crawl every post from BBS Lily continuously, with fault tolerance.
B. Miner: mine the wanted info from each post the Crawler has fetched, and store it in a designed pattern.
Framework of BBS Lily (1)
[Diagram: BBS Lily is organized as a tree. The front page links to sections (section 0 .. section 12); each section lists board 0..n; each board lists post 0..n, paginated with "Next Page" links.]
Framework of BBS Lily (2): Strategy of Crawler -- DFS
[Diagram: the same tree of sections, boards, and posts, traversed depth-first.]
Tips:
- Traverse the catalog links to get the content;
- Automatically follow the "Next Page" link and repeat the routine job.
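The DFS strategy can be sketched as below. This is a simplified model, not the real crawler: the site is modeled as a map from a catalog URL to its outgoing links, and URLs with no outgoing links are treated as posts; the actual crawler also fetches pages over HTTP, follows "Next Page" links, and handles failures.

```java
import java.util.*;

public class DfsCrawler {
    // Depth-first traversal of catalog links; leaf URLs are treated as posts.
    static List<String> crawl(Map<String, List<String>> links, String root) {
        List<String> posts = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String url = stack.pop();
            if (!seen.add(url)) continue;           // skip already-visited pages
            List<String> out = links.getOrDefault(url, List.of());
            if (out.isEmpty()) {
                posts.add(url);                     // a leaf: collect the post
            } else {
                for (int i = out.size() - 1; i >= 0; i--) {
                    stack.push(out.get(i));         // push in reverse to keep catalog order
                }
            }
        }
        return posts;
    }
}
```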
Strategy of Miner -- Regex
1. Use HtmlParser to get the tags' content;
2. Extract info by regex;
3. Store it in a designed pattern.
Each post will be stored in one line with the pattern below:
URL'\007'hot'\007'author'\007'title'\007'content
See Demo
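A minimal sketch of the mining step, using java.util.regex on raw HTML instead of the project's HtmlParser. The HTML shapes and the placeholder hot/author values are assumptions for illustration; only the '\007'-separated output format comes from the slides.

```java
import java.util.regex.*;

public class PostMiner {
    // Extract title and plain-text content from a post page, then join the
    // fields into one line separated by '\007'. hot and author are placeholders
    // here; the real miner extracts them from the page as well.
    static String mine(String url, String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>", Pattern.DOTALL).matcher(html);
        String title = m.find() ? m.group(1).trim() : "";
        String content = html.replaceAll("<[^>]+>", " ")    // strip tags
                             .replaceAll("\\s+", " ").trim();
        String hot = "0", author = "unknown";
        return String.join("\007", url, hot, author, title, content);
    }
}
```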
3.3 Indexing Process
3.3.1 Target
3.3.2 Filter Source File
3.3.3 Build Inverted Index
3.3.4 Partition Inverted Index File
3.3.5 Second-Level Index (Index for Indices)
Target of Indexing Process
Run a series of Map/Reduce operations to generate inverted indices with rank and position info.
Pipeline: Txt_Filter -> Inverted Index -> Partition Index Table -> Index for Indices
Filter Source File
Although the source file stores posts in a well-designed pattern, we still need to filter it before building the inverted indices.
Reasons:
1. Examine and eliminate noise and duplications, e.g. lines like
   "http://bbs.nju.edu.cn/...'\007' null '\007' null '\007' null '\007' null"
2. It is natural to pre-process the data before we really handle it.
Build Inverted Index
The process of building the inverted index is smart by itself; it will be smarter if we calculate and record some side info properly at the same time.
The side info includes rank, positions, etc.
Details...
Build Inverted Index -- Side Info
1. TF-IDF (Term Frequency-Inverse Document Frequency):
   tf-idf(ti, d) = tf(ti, d) * idf(ti), where idf(ti) = log(|D| / |{d : ti in d}|)
   - |D|: total number of documents in the corpus
   - |{d : ti in d}|: number of documents where the term ti appears
2. Position info does not need any calculation; it can be recorded as an integer pair like (StartIndex, EndIndex).
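Numerically, with the standard tf-idf definitions (a sketch; the project's exact weighting formula may differ):

```java
public class TfIdf {
    // tf-idf(ti, d) = tf(ti, d) * idf(ti),  idf(ti) = log(|D| / |{d : ti in d}|)
    static double tfIdf(int termCountInDoc, int docLength, int totalDocs, int docsWithTerm) {
        double tf = (double) termCountInDoc / docLength;          // normalized term frequency
        double idf = Math.log((double) totalDocs / docsWithTerm); // inverse document frequency
        return tf * idf;
    }
}
```

Note that a term appearing in every document gets idf = log(1) = 0, so it carries no weight.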
Build Inverted Index -- Structure
1. For each post in the filtered source file, its offset in the file can be used as its DID;
2. Each line of the inverted index file stores a term with its info; the details are as below:
   term info
   info          = SingleDIDInfo;SingleDIDInfo;...;SingleDIDInfo
   SingleDIDInfo = DID:rank:positions
   positions     = position%position%...%position
   position      = IsTitle|start|end
E.g. 黑莓 48522292:162.6:1|2|4%0|804|806;42910773:106.26:0|456|458%0|560|562
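A line in this format can be parsed back with plain string splitting. The sketch below follows the grammar above; the record and field names are mine, not the project's.

```java
import java.util.*;

public class IndexLineParser {
    record Position(boolean isTitle, int start, int end) {}
    record Posting(long did, double rank, List<Position> positions) {}

    // Parse one line: "term DID:rank:positions;DID:rank:positions;..."
    static Map.Entry<String, List<Posting>> parse(String line) {
        String[] kv = line.split(" ", 2);
        List<Posting> postings = new ArrayList<>();
        for (String s : kv[1].split(";")) {
            String[] f = s.split(":");                 // DID : rank : positions
            List<Position> pos = new ArrayList<>();
            for (String p : f[2].split("%")) {         // position % position % ...
                String[] g = p.split("\\|");           // IsTitle | start | end
                pos.add(new Position(g[0].equals("1"),
                        Integer.parseInt(g[1]), Integer.parseInt(g[2])));
            }
            postings.add(new Posting(Long.parseLong(f[0]),
                    Double.parseDouble(f[1]), pos));
        }
        return Map.entry(kv[0], postings);
    }
}
```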
Partition Inverted Index File
After the last step, we have the inverted index file. However, the file is big:
Source file size    Inverted index file size
48M                 72.5M
182M                240M
703M                828M
Second-Level Index (Index for Indices)
In the last step, we partitioned the inverted index file into a certain number of parts, for example 16. Each part-file contains some term-info pairs.
So, when a term is given, how can we know which part-file it is in, and which line it is in?
We need an Index for Indices.
P.S. This really works: the second-level index file's size is less than 10% of the source file.
Source file size    Inverted index file size    Second-level index file size
48.1M               72.5M                       2.375M
182M                240M                        5.17M
703M                828M                        10.5M
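A sketch of what the second-level index holds. The structure (term mapped to a part-file id and an offset inside it) is inferred from the description above; the actual on-disk layout is not given in the slides.

```java
import java.util.*;

public class IndexForIndices {
    // term -> [partition id, byte offset inside that part-file]
    private final Map<String, long[]> meta = new HashMap<>();

    // Called while writing the partitioned inverted index files.
    void record(String term, int partition, long offset) {
        meta.put(term, new long[]{partition, offset});
    }

    // Given a query term, answers "which part-file, which offset"; null if unseen.
    long[] locate(String term) {
        return meta.get(term);
    }
}
```

At query time, one lookup here replaces a scan over all 16 part-files, which is why such a small file (under 10% of the source) pays off.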
3.4 Set up Web Retrieval Interface
3.4.1 Target
3.4.2 Sort Pages
Target of Web Retrieval Interface
Make an interface which accepts the user's query and responds with search results:
1. Restrict the query string;
2. Sort search results dynamically;
3. Return results page by page.
Sort Pages
Here is a demo. The query string is segmented into words (Term 0..2); each term's <doc, rank> list is fetched, then merged and re-ranked:
Term 0: Doc1 10, Doc3 90, Doc7 20
Term 1: Doc2 20, Doc7 80, Doc5 15
Term 2: Doc3 05, Doc2 40, Doc6 40
After Merge & Rank Again:
Doc7 100, Doc3 95, Doc6 40, Doc2 40, Doc5 15, Doc1 10
Merge the ranks and rank again~
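The merge-and-rank-again step can be sketched by summing each document's per-term ranks and sorting in descending order. This is one plausible combination rule; the slides do not spell out the exact formula.

```java
import java.util.*;

public class MergeRank {
    // Merge per-term <doc, rank> lists by summing ranks, then sort descending.
    static List<Map.Entry<String, Integer>> merge(List<Map<String, Integer>> perTerm) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> m : perTerm) {
            m.forEach((doc, rank) -> total.merge(doc, rank, Integer::sum));
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>(total.entrySet());
        out.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return out;
    }
}
```

With the demo's Doc3 and Doc7 figures, Doc7 ends up first with 20 + 80 = 100 and Doc3 second with 90 + 5 = 95.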
3.5 Optimization
a)For each term only top 1500 DID are reserved at most.b)Use TreeMap to sort..
Reduce Sort Time Reduce I/O operations
……Cache Strategy
Optimization measures in different areas.
a)Response Page is created dynamically.b)Each time return 10 records.
..........
a) Put some hot Inverted Index file in the memory.
b) Cache replacement --- LRU
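LRU cache replacement comes almost for free from LinkedHashMap in access order. A sketch of such a cache for hot inverted-index part-files (how the project actually implemented its cache is not stated in the slides):

```java
import java.util.*;

public class IndexCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public IndexCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true gives LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }
}
```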
4. Maybe Google & Baidu Have Done These...
1. Search stuff in parallel;
2. An outstanding word segmentation algorithm;
3. A better rank strategy: describe the relationship between a token and a DID precisely;
4. Record each user's query string:
   a) feed it back to word segmentation;
   b) provide a suggestion (remind) function, triggered by the input change event;
...
5. Conclusion
5.1 Summary
5.2 Highlights
5.3 About Map/Reduce...
Summary
Related work has three parts, as below:
- Crawler: a hard-coded crawler and miner, aimed at getting data from BBS Lily.
- Indexing process: runs as a sequence of MapReduce operations.
- Web retrieval: a web interface for users to retrieve info.
Highlights
View of Application: this stuff is cool! It provides a friendly user experience when we want to search something in our BBS Lily.
View of Technics: using Map/Reduce to process data offline provides several benefits:
1. The indexing code is simpler, smaller, and easier to understand.
2. We can keep conceptually unrelated computations separate.
3. The indexing process has become much easier to operate and maintain.
About Map/Reduce
Map/Reduce is not just a programming model; actually, it is also a life model...
Many thanks to…
Teacher Huang;Yang Xiaoliang;Xiao Tao;Liu Yulong;Zhang Lu;NUAA & NJU…