Post on 25-Dec-2015
Enhancing Internet Search Engines to Achieve Concept-based Retrieval
F. Lu, T. Johnsten, V. Raghavan, and D. Traylor
Agenda
• Information on the Internet.
• Boolean Retrieval Model and the Internet.
• Personalized Search.
• Concept-Based Retrieval (RUBRIC / CS3).
• CS3 and Boolean Search Engines.
• Deep Web Sources.
• Current & Future Work.
Information on the Internet
• Large volume.
• Rapid growth rate.
• Wide variations in quality and type.
Boolean Retrieval Model and the Internet
• Most Internet search engines are based on the Boolean Retrieval Model.
• Boolean Retrieval Model is relatively easy to implement.
• Limitations:
– Inability to assign weights to query or document terms.
– Inability to rank retrieved documents.
– Naïve users have difficulty in using
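The first two limitations can be seen in a minimal sketch of Boolean retrieval over a toy inverted index. The documents, term names, and helper functions below are illustrative assumptions, not part of any search engine discussed here. The result of each query is a plain set of document ids: a document either matches or it does not, so there is no weight to attach and no order to rank by.

```python
# Toy corpus (illustrative data only, not from the presentation).
docs = {
    1: "hydrogeology of shallow aquifers",
    2: "colloid transport in groundwater",
    3: "hydrogeology and colloid mobility",
}

def build_index(documents):
    """Build an inverted index mapping each term to the set of ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents containing every term: intersection of posting sets."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(index, *terms):
    """Documents containing at least one term: union of posting sets."""
    return set().union(*(index.get(t, set()) for t in terms))

index = build_index(docs)
and_hits = boolean_and(index, "hydrogeology", "colloid")
or_hits = boolean_or(index, "hydrogeology", "colloid")
print(and_hits, or_hits)
```

Both results are unordered sets, which is why a Boolean engine on its own cannot say that one matching document is more relevant than another.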
Personalized Search
[Diagram. Components: Personalized Engine (Query Processor, Result Processor), Search Engine, User Profile, General Profile. Flow labels: User Query, Query Augmentation, Search Results, Personalized Results.]
Concept-Based Retrieval
• Address shortcomings of Boolean Retrieval Model.
• Search requests are specified in terms of concepts structured as rule-base trees.
Development of Rule-Base Trees (General)
• Top-down refinement strategy.
• Support for AND / OR relationships.
• Support for user-defined weights.
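One way to picture such a tree is as a small data structure built from the root downward. The sketch below is an assumption about how a rule-base tree might be represented, not the CS3 or RUBRIC data model: the `Node` class, the `refine` method, the labels, and the weights are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """A concept or term in a hypothetical rule-base tree."""
    label: str
    op: Optional[str] = None          # "AND", "OR", or None for a leaf term
    weight: float = 1.0               # user-defined weight (illustrative values)
    children: List["Node"] = field(default_factory=list)

    def refine(self, op, *children):
        """Top-down refinement: split this concept into sub-concepts or terms."""
        self.op = op
        self.children.extend(children)
        return self

def leaves(node):
    """Collect the leaf terms of the tree, left to right."""
    if not node.children:
        return [node.label]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found

# Start with a single concept and refine it step by step.
root = Node("hydrogeology-concept")
root.refine(
    "OR",
    Node("hydrogeology", weight=0.9),
    Node("dnapl", weight=0.7),
    Node("colloid-transport", weight=0.6),
)
root.children[2].refine("AND", Node("colloid*"), Node("environmental transport"))
print(leaves(root))
```

The user only names a concept at first and adds detail later, which is the point of a top-down strategy: the tree stays valid at every refinement step.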
Development of Rule-Base Trees (CS3)
• Concept-Set Structuring System (CS3).
• CS3 supports the creation, storage and modification of user-defined concepts.
• Post-processing of results of sub-queries.
• CS3 user-interface.
CS3 User Interface
Evaluation of Rule-Base Trees (RUBRIC)
• Run-time, bottom-up analysis.
• Propagation of weight values (MIN / MAX).
• Disadvantage of run-time analysis.
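A bottom-up evaluation with MIN / MAX propagation can be sketched as follows. This is a simplified reading of the bullets above, not the RUBRIC implementation: the tuple encoding of the tree, the weights, and the choice to multiply a node's combined score by its own weight are assumptions made for the example.

```python
def evaluate(node, doc_terms):
    """Score one document against a rule-base tree, bottom-up.

    Assumed encoding: a leaf is ("term", text, weight); an inner node is
    ("AND" | "OR", weight, [children]).
    """
    kind = node[0]
    if kind == "term":
        _, term, weight = node
        return weight if term in doc_terms else 0.0
    _, weight, children = node
    scores = [evaluate(child, doc_terms) for child in children]
    # AND propagates the weakest child score, OR the strongest.
    combined = min(scores) if kind == "AND" else max(scores)
    return weight * combined

# Hypothetical tree and weights, for illustration only.
tree = ("OR", 1.0, [
    ("term", "hydrogeology", 0.9),
    ("term", "dnapl", 0.7),
    ("AND", 0.8, [("term", "colloid", 1.0), ("term", "transport", 1.0)]),
])

score_a = evaluate(tree, {"colloid", "transport"})
score_b = evaluate(tree, {"colloid"})
score_c = evaluate(tree, {"hydrogeology", "dnapl"})
print(score_a, score_b, score_c)
```

In this sketch the whole tree is walked once per document at query time, so the cost grows with both the size of the rule base and the size of the collection.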
Evaluation of Rule-Base Trees (CS3)
• Static, bottom-up analysis.
• Construct Minimal Term Set (MTS).
• Propagation of terms.
• CS3 user-interface.
MTS-Minimal Term Set
An MTS for a topic is a set of terms such that, if every term in the set appears in a document, the document receives a retrieval status value (RSV) greater than 0. If none of the topic's MTSs is fully contained in the document, its RSV is 0.
A topic can have more than one MTS. A user can choose among those MTSs to perform a search suited to their needs.
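The static analysis that produces these term sets can be sketched for a tree of AND / OR nodes. This is an assumed construction, not the CS3 algorithm: weights are ignored, the tuple encoding and the function name are hypothetical, and the only claim is that each returned set is enough, on its own, to satisfy the tree.

```python
from itertools import product

def minimal_term_sets(node):
    """Return the minimal term sets of an AND/OR tree, propagated bottom-up.

    Assumed encoding: a leaf is ("term", text); an inner node is
    ("AND" | "OR", [children]).
    """
    kind = node[0]
    if kind == "term":
        return [frozenset([node[1]])]
    child_sets = [minimal_term_sets(child) for child in node[1]]
    if kind == "OR":
        # Any one child's term set satisfies an OR node.
        candidates = [s for sets in child_sets for s in sets]
    else:
        # An AND node needs one term set from every child, merged together.
        candidates = [frozenset().union(*combo) for combo in product(*child_sets)]
    unique = set(candidates)
    # Keep only sets that have no proper subset among the candidates.
    minimal = [s for s in unique if not any(other < s for other in unique)]
    return sorted(minimal, key=lambda s: (len(s), sorted(s)))

# Hypothetical tree, for illustration only.
tree = ("OR", [
    ("term", "hydrogeology"),
    ("term", "dnapl"),
    ("AND", [("term", "colloid"), ("term", "transport")]),
])
mts = minimal_term_sets(tree)
print([sorted(s) for s in mts])
```

Because the sets depend only on the tree, they can be computed once when the concept is saved rather than once per document.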
CS3 and Boolean Search Engines
• CS3 is designed to interface with existing Boolean search engines.
• U.S. Department of Energy’s “Information-Bridge” search engine.
• U.S. Department of Transportation’s “National Transportation Library” search engine.
System Architecture
[Diagram. Client (Java applet); CORBA and CGI links; Server (Java) and Server (Java/C++); JDBC to Oracle; DOE InfoBridge… etc.]
Information-Bridge and CS3
• Search request: Boolean vs. concept.
• Output: non-ranked vs. ranked.
• Calculation of RSV:
– Given a document D and a set S of MTS expressions satisfied by D, the RSV of D is equal to the sum of all the weights of S plus the maximum weight in S.
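The RSV rule stated above can be followed literally in a few lines. The sketch assumes each MTS expression is a set of terms with one weight attached, and that an expression is "satisfied" when all of its terms occur in the document; the terms, weights, and document ids are made up for the example.

```python
def rsv(doc_terms, weighted_mts):
    """RSV = sum of the weights of satisfied MTS expressions + the largest of them."""
    satisfied = [weight for terms, weight in weighted_mts if terms <= doc_terms]
    if not satisfied:
        return 0.0
    return sum(satisfied) + max(satisfied)

# Hypothetical MTS expressions and weights, for illustration only.
weighted_mts = [
    (frozenset({"hydrogeology"}), 0.5),
    (frozenset({"dnapl"}), 0.25),
    (frozenset({"colloid", "transport"}), 0.75),
]

# Hypothetical documents reduced to their term sets.
docs = {
    "d1": {"hydrogeology", "dnapl"},
    "d2": {"colloid", "transport", "dnapl"},
    "d3": {"geology"},
}

scores = {doc_id: rsv(terms, weighted_mts) for doc_id, terms in docs.items()}
ranked = sorted(docs, key=lambda doc_id: scores[doc_id], reverse=True)
print(scores, ranked)
```

Sorting by this value is what turns the non-ranked output of a Boolean engine into a ranked list.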
Information-Bridge and CS3 (Example)
• Boolean search request (“Environmental Science Network” Form):
– (“Hydrogeology” OR “Dnapl” OR (“Colloid*” AND “Environmental Transport”)).
• Concept (CS3):
– “Hydrogeology”.
– Rule-Base Tree.
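The Boolean request above has the shape of an OR over groups of terms joined by AND, which is also the shape of a list of term sets. The sketch below shows how such a list could be serialized into a query string for a Boolean engine. Whether CS3 generates its requests this way is not stated in the slides; the function name and the query syntax are assumptions.

```python
def to_boolean_query(term_sets):
    """Serialize a list of term groups as an OR of AND clauses.

    Each inner list is one group of terms that must all be present.
    Empty groups are skipped.
    """
    clauses = []
    for terms in term_sets:
        quoted = [f'"{term}"' for term in terms]
        if not quoted:
            continue
        if len(quoted) == 1:
            clauses.append(quoted[0])
        else:
            clauses.append("(" + " AND ".join(quoted) + ")")
    return "(" + " OR ".join(clauses) + ")"

query = to_boolean_query([
    ["Hydrogeology"],
    ["Dnapl"],
    ["Colloid*", "Environmental Transport"],
])
print(query)
```

With these inputs the function reproduces the request shown in the example, using straight quotes in place of the typographic ones.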
CS3 Hydrogeology Rule Base
CS3 search results
Deep Web Sources
• Also referred to as hidden Web or invisible Web.
• Resides behind search forms in databases e.g. monster.com, louisiana1st.com, PubMed.
• Web pages in deep Web are generated dynamically based on the submitted queries.
• Not indexed by current search engines. Search engines index content on the surface Web.
Deep Web Sources and Concept-based Retrieval
• Deep Web in terms of size and quality:
– Size (Deep Web) = 500 * Size (Surface Web)
– Quality (Deep Web) = 1000 * Quality (Surface Web)
• Queries submitted at deep Web sources are more stable compared to queries submitted to search engines.
• So, naturally concept-based retrieval is more suitable for deep Web sources.
Current and Future Work
• Conduct experiments to evaluate effectiveness (future).
• Investigate alternative methods to compute RSVs [KADR00, KDR01*].
• Learning edge weights through relevance feedback [KR00].
• Thesaurus-based rule-base generation [KLR00].
Relevant URLs
[LJRT99*]
Raghavan Home
Publications since 1991
www.allinonenews.com