A Full Text Search Engine For BBS Lily
Presenter: Gu Rong    Advisor: Huang Yihua
Email:[email protected]
Contents
Background
Brief Intro to the Principle of Full Text Search Engine
Implementation of FTSE for BBS Lily
Maybe Google & Baidu have done these...
Conclusion
1. Background
1.1 What is a full text search engine?
1.2 Why do we need it?
What is a full text search engine?
In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user. (from Wikipedia)
Why do we need a FTSE for BBS Lily?
Data in BBS Lily:
- Base capacity: around 3 million posts in total;
- Increasing speed: over a thousand new posts every day;
- Post granularity: each post is 1K~4K in size.
2. Brief Intro to the Principle of Full Text Search Engine
What happens after you press enter?
Abstract IR Architecture
[Diagram: online side: the query passes through a representation function to become a query representation; a comparison function matches it against document representations via the index and returns hits. Offline side: document acquisition (e.g., web crawling) feeds documents through a representation function into the index.]
About the Representation Function
Documents -> Bag of Words -> Inverted Index
- Documents to Bag of Words: case folding, tokenization, stopword removal, stemming;
- What gets discarded: syntax, semantics, word knowledge, etc.
A Simple Inverted Index Demo
Document collection:
  Doc 1: one fish, two fish
  Doc 2: red fish, blue hat
  Doc 3: cat in the hat
  Doc 4: green eggs and ham
Resulting inverted index (term -> postings of <doc id, term frequency>):
  blue  -> (2,1)
  cat   -> (3,1)
  egg   -> (4,1)
  fish  -> (1,2) (2,1)
  green -> (4,1)
  ham   -> (4,1)
  hat   -> (2,1) (3,1)
  one   -> (1,1)
  red   -> (2,1)
  two   -> (1,1)
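The toy index above can be reproduced in a few lines of Java. This is an illustrative in-memory sketch, not the project's actual Hadoop code: each document is case-folded and tokenized, and every term maps to a postings list of <doc id, term frequency> sorted by doc id.

```java
import java.util.*;

public class InvertedIndexDemo {
    // Build term -> (docId -> term frequency), postings sorted by doc id.
    static Map<String, SortedMap<Integer, Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedMap<Integer, Integer>> index = new HashMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // Case folding + crude tokenization; a real pipeline would also
            // remove stopwords and apply stemming.
            for (String term : doc.getValue().toLowerCase().split("[^a-z]+")) {
                if (term.isEmpty()) continue;
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }
}
```

Running it on the four demo documents yields, for example, the posting (1,2) for "fish", because "fish" occurs twice in Doc 1.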
Map/Reduce's Role...
Retrieval Problem (not so good for Map/Reduce): 1. must have sub-second response time; 2. for the web, only relatively few results are needed.
Indexing Problem (perfect for Map/Reduce!): 1. scalability; 2. relatively fast; 3. batch operation; 4. updates may not be important; 5. crawling is a challenge in itself.
3. Implementation of FTSE for BBS Lily
3.1 Outline of Work Flow
3.2 Crawl Web Pages & Mine Info
3.3 Indexing Process
3.4 Set up Web Retrieval Interface (Respond to Query String)
3.5 Optimization
3.1 Outline of Work Flow
[Diagram of the work flow:
- Crawler: web page 0..n -> crawl & info mining -> formatted files (/Content, /Vice Info).
- Map/Reduce: formatted files -> inverted index & ranking (<DID,Rank> lists) -> index for indices.
- Web Retrieval: JSP page -> split query string into term 0..n -> search & merge over token 0..n -> target DIDs -> result list (Title, Context, Author, URL, Hot).]
3.2 Crawl Web Pages & Mine Info
3.2.1 Target
3.2.2 Framework of Lily BBS
3.2.3 Strategy of Crawler
3.2.4 Strategy of Miner
Target of Crawler & Miner
A. Crawler: crawl every post from BBS Lily continuously, with fault tolerance.
B. Miner: mine the wanted info from each post the Crawler has fetched, and store it in a designed pattern.
Framework of BBS Lily (1)
[Diagram: BBS Lily is organized as a tree. The front page links to sections (section 0 .. section 12); each section lists board 0..n; each board lists post 0..n, paginated with "Next Page" links.]
Framework of BBS Lily (2): Strategy of Crawler -- DFS
[Diagram: the same tree of sections, boards, and posts, traversed depth-first.]
Tips:
- Traverse the catalog links to get the content;
- Automatically follow the "Next Page" link and repeat the routine job.
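The DFS strategy can be sketched as below. This is a simplified model, not the real crawler: the site is modeled as a map from a catalog URL to its outgoing links, and URLs with no outgoing links are treated as posts; the actual crawler also fetches pages over HTTP, follows "Next Page" links, and handles failures.

```java
import java.util.*;

public class DfsCrawler {
    // Depth-first traversal of catalog links; leaf URLs are treated as posts.
    static List<String> crawl(Map<String, List<String>> links, String root) {
        List<String> posts = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String url = stack.pop();
            if (!seen.add(url)) continue;           // skip already-visited pages
            List<String> out = links.getOrDefault(url, List.of());
            if (out.isEmpty()) {
                posts.add(url);                     // a leaf: collect the post
            } else {
                for (int i = out.size() - 1; i >= 0; i--) {
                    stack.push(out.get(i));         // push in reverse to keep catalog order
                }
            }
        }
        return posts;
    }
}
```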
Strategy of Miner -- Regex
1. Use HtmlParser to get the tags' content;
2. Extract info by regex;
3. Store it in a designed pattern.
Each post will be stored in one line with the pattern below:
URL'\007'hot'\007'author'\007'title'\007'content
See Demo
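A minimal sketch of the mining step, using java.util.regex on raw HTML instead of the project's HtmlParser. The HTML shapes and the placeholder hot/author values are assumptions for illustration; only the '\007'-separated output format comes from the slides.

```java
import java.util.regex.*;

public class PostMiner {
    // Extract title and plain-text content from a post page, then join the
    // fields into one line separated by '\007'. hot and author are placeholders
    // here; the real miner extracts them from the page as well.
    static String mine(String url, String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>", Pattern.DOTALL).matcher(html);
        String title = m.find() ? m.group(1).trim() : "";
        String content = html.replaceAll("<[^>]+>", " ")    // strip tags
                             .replaceAll("\\s+", " ").trim();
        String hot = "0", author = "unknown";
        return String.join("\007", url, hot, author, title, content);
    }
}
```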
3.3 Indexing Process
3.3.1 Target
3.3.2 Filter Source File
3.3.3 Build Inverted Index
3.3.4 Partition Inverted Index File
3.3.5 Second-Level Index (Index for Indices)
Target of Indexing Process
Run a series of Map/Reduce operations to generate inverted indices with rank and position info.
Pipeline: Txt_Filter -> Inverted Index -> Partition Index Table -> Index for Indices
Filter Source File
Although the source file stores posts in a well-designed pattern, we still need to filter it before building the inverted indices.
Reasons:
1. Examine and eliminate noise and duplications, e.g. lines like
   "http://bbs.nju.edu.cn/...'\007' null '\007' null '\007' null '\007' null"
2. It is natural to pre-process the data before we really handle it.
Build Inverted Index
The process of building the inverted index is smart by itself; it will be smarter if we calculate and record some side info properly at the same time.
The side info includes rank, positions, etc.
Details...
Build Inverted Index -- Side Info
1. TF-IDF (Term Frequency-Inverse Document Frequency):
   tf-idf(ti, d) = tf(ti, d) * idf(ti), where idf(ti) = log(|D| / |{d : ti in d}|)
   - |D|: total number of documents in the corpus
   - |{d : ti in d}|: number of documents where the term ti appears
2. Position info does not need any calculation; it can be recorded as an integer pair like (StartIndex, EndIndex).
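Numerically, with the standard tf-idf definitions (a sketch; the project's exact weighting formula may differ):

```java
public class TfIdf {
    // tf-idf(ti, d) = tf(ti, d) * idf(ti),  idf(ti) = log(|D| / |{d : ti in d}|)
    static double tfIdf(int termCountInDoc, int docLength, int totalDocs, int docsWithTerm) {
        double tf = (double) termCountInDoc / docLength;          // normalized term frequency
        double idf = Math.log((double) totalDocs / docsWithTerm); // inverse document frequency
        return tf * idf;
    }
}
```

Note that a term appearing in every document gets idf = log(1) = 0, so it carries no weight.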
Build Inverted Index -- Structure
1. For each post in the filtered source file, its offset in the file can be used as its DID;
2. Each line of the inverted index file stores a term with its info; the details are as below:
   term info
   info          = SingleDIDInfo;SingleDIDInfo;...;SingleDIDInfo
   SingleDIDInfo = DID:rank:positions
   positions     = position%position%...%position
   position      = IsTitle|start|end
E.g. 黑莓 48522292:162.6:1|2|4%0|804|806;42910773:106.26:0|456|458%0|560|562
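A line in this format can be parsed back with plain string splitting. The sketch below follows the grammar above; the record and field names are mine, not the project's.

```java
import java.util.*;

public class IndexLineParser {
    record Position(boolean isTitle, int start, int end) {}
    record Posting(long did, double rank, List<Position> positions) {}

    // Parse one line: "term DID:rank:positions;DID:rank:positions;..."
    static Map.Entry<String, List<Posting>> parse(String line) {
        String[] kv = line.split(" ", 2);
        List<Posting> postings = new ArrayList<>();
        for (String s : kv[1].split(";")) {
            String[] f = s.split(":");                 // DID : rank : positions
            List<Position> pos = new ArrayList<>();
            for (String p : f[2].split("%")) {         // position % position % ...
                String[] g = p.split("\\|");           // IsTitle | start | end
                pos.add(new Position(g[0].equals("1"),
                        Integer.parseInt(g[1]), Integer.parseInt(g[2])));
            }
            postings.add(new Posting(Long.parseLong(f[0]),
                    Double.parseDouble(f[1]), pos));
        }
        return Map.entry(kv[0], postings);
    }
}
```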
Partition Inverted Index File
After the last step, we have the inverted index file. However, the file is big:
Source file size    Inverted index file size
48M                 72.5M
182M                240M
703M                828M
Second-Level Index (Index for Indices)
In the last step, we partitioned the inverted index file into a certain number of parts, for example 16. Each part-file contains some term-info pairs.
So, when a term is given, how can we know which part-file it is in, and which line it is in?
We need an Index for Indices.
P.S. This really works: the second-level index file's size is less than 10% of the source file.
Source file size    Inverted index file size    Second-level index file size
48.1M               72.5M                       2.375M
182M                240M                        5.17M
703M                828M                        10.5M
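A sketch of what the second-level index holds. The structure (term mapped to a part-file id and an offset inside it) is inferred from the description above; the actual on-disk layout is not given in the slides.

```java
import java.util.*;

public class IndexForIndices {
    // term -> [partition id, byte offset inside that part-file]
    private final Map<String, long[]> meta = new HashMap<>();

    // Called while writing the partitioned inverted index files.
    void record(String term, int partition, long offset) {
        meta.put(term, new long[]{partition, offset});
    }

    // Given a query term, answers "which part-file, which offset"; null if unseen.
    long[] locate(String term) {
        return meta.get(term);
    }
}
```

At query time, one lookup here replaces a scan over all 16 part-files, which is why such a small file (under 10% of the source) pays off.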
3.4 Set up Web Retrieval Interface
3.4.1 Target
3.4.2 Sort Pages
Target of Web Retrieval Interface
Make an interface which accepts the user's query and responds with search results:
1. Restrict the query string;
2. Sort search results dynamically;
3. Return results page by page.
Sort Pages
Here is a demo. The query string is segmented into words (Term 0..2); each term's <doc, rank> list is fetched, then merged and re-ranked:
Term 0: Doc1 10, Doc3 90, Doc7 20
Term 1: Doc2 20, Doc7 80, Doc5 15
Term 2: Doc3 05, Doc2 40, Doc6 40
After Merge & Rank Again:
Doc7 100, Doc3 95, Doc6 40, Doc2 40, Doc5 15, Doc1 10
Merge the ranks and rank again~
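The merge-and-rank-again step can be sketched by summing each document's per-term ranks and sorting in descending order. This is one plausible combination rule; the slides do not spell out the exact formula.

```java
import java.util.*;

public class MergeRank {
    // Merge per-term <doc, rank> lists by summing ranks, then sort descending.
    static List<Map.Entry<String, Integer>> merge(List<Map<String, Integer>> perTerm) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> m : perTerm) {
            m.forEach((doc, rank) -> total.merge(doc, rank, Integer::sum));
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>(total.entrySet());
        out.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return out;
    }
}
```

With the demo's Doc3 and Doc7 figures, Doc7 ends up first with 20 + 80 = 100 and Doc3 second with 90 + 5 = 95.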
3.5 Optimization
a)For each term only top 1500 DID are reserved at most.b)Use TreeMap to sort..
Reduce Sort Time Reduce I/O operations
……Cache Strategy
Optimization measures in different areas.
a)Response Page is created dynamically.b)Each time return 10 records.
..........
a) Put some hot Inverted Index file in the memory.
b) Cache replacement --- LRU
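LRU cache replacement comes almost for free from LinkedHashMap in access order. A sketch of such a cache for hot inverted-index part-files (how the project actually implemented its cache is not stated in the slides):

```java
import java.util.*;

public class IndexCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public IndexCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true gives LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }
}
```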
4. Maybe Google & Baidu Have Done These...
1. Search stuff in parallel;
2. An outstanding word segmentation algorithm;
3. A better rank strategy: describe the relationship between a token and a DID precisely;
4. Record each user's query string:
   a) feed it back to word segmentation;
   b) provide a suggestion (remind) function, triggered by the input change event;
...
5. Conclusion
5.1 Summary
5.2 Highlights
5.3 About Map/Reduce...
Summary
Related work has three parts, as below:
- Crawler: a hard-coded crawler and miner, aimed at getting data from BBS Lily.
- Indexing process: runs as a sequence of MapReduce operations.
- Web retrieval: a web interface for users to retrieve info.
Highlights
View of Application: this stuff is cool! It provides a friendly user experience when we want to search something in our BBS Lily.
View of Technics: using Map/Reduce to process data offline provides several benefits:
1. The indexing code is simpler, smaller, and easier to understand.
2. We can keep conceptually unrelated computations separate.
3. The indexing process has become much easier to operate and maintain.
About Map/Reduce
Map/Reduce is not just a programming model; actually, it is also a life model...
Many thanks to…
Teacher Huang;Yang Xiaoliang;Xiao Tao;Liu Yulong;Zhang Lu;NUAA & NJU…