search domain basics

Post on 09-Jun-2015

Praveen Manvi

April - 2009

Search Domain Basics

Objectives

Search Goals
Business Models
Structured vs. Unstructured Content
Search Terminologies
Technologies Behind Search

Goal: “Make it like this”

Simple, Mostly accurate & fast

But that’s not always possible

Business Models

Sponsored Search

Content Match

TLR Confidential

It’s all about billboards

Vertical search, or domain-specific search

Structured vs. Unstructured Data

Unstructured: 80%, Structured: 20%

Relational data is structured; everything else is unstructured.

Why not use SQL/RDBMS?

SQL search is limited: pattern matches such as LIKE '%bla bla%' are constrained by the schema and by SQL itself (a limited DSL).

It cannot handle user input that is misspelled but phonetically correct.

It is difficult to implement search requirements such as proximity. For example, if "Java" appears close to "Serialization", the document is probably software content.

It is difficult to scale, manage changes, and implement parallelization (Map-Reduce).
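The point about phonetically correct input can be illustrated with a phonetic key. The following is a minimal sketch of American Soundex, swapped in here as one well-known phonetic algorithm; the slides do not name a specific technique, and the function name `soundex` is chosen for illustration only.

```python
# Letter-to-digit table for American Soundex.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return a 4-character phonetic key, e.g. 'R163' for 'Robert'."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so repeated codes count again
    return (result + "000")[:4]


# Two spellings that sound alike map to the same key.
print(soundex("Serialisation") == soundex("Serialization"))
```

A plain SQL equality or LIKE match treats the two spellings as different strings; comparing phonetic keys instead lets them match.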

Sample Search Requests

Sample collection: Sun JDK classes

How many times does the “synchronized” keyword appear in JDK Java classes outside the java.lang package?

How many static methods are present in JDK classes that have synchronized methods?

How many Java classes in the Collections framework use the synchronized keyword and have more than 200 lines?
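As a rough illustration of the first request, here is a sketch that counts the keyword over an in-memory collection. The input shape (a dict from fully qualified class name to source text) and the function name `count_synchronized` are assumptions made for this example, and the count is purely textual, so occurrences inside comments or string literals are included.

```python
import re

# Whole-word match, so identifiers like "synchronizedList" are not counted.
SYNC = re.compile(r"\bsynchronized\b")


def count_synchronized(sources, excluded_package="java.lang"):
    """Count keyword occurrences in all classes outside one package."""
    total = 0
    for class_name, text in sources.items():
        package = class_name.rpartition(".")[0]
        if package == excluded_package:
            continue  # skip classes that live directly in the excluded package
        total += len(SYNC.findall(text))
    return total


sample = {
    "java.lang.Object": "synchronized (this) { }",
    "java.util.Vector": "public synchronized void add() { synchronized (this) { } }",
    "java.util.HashMap": "public void put() { }",
}
print(count_synchronized(sample))
```

The second and third requests mix text matching with structural facts (static methods, line counts, framework membership), which is what makes them awkward to express as a single keyword query.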

Search Terminologies

Proximity search: A search that lets users specify that the documents returned should have the words near each other.

Concept Search: A search for documents related conceptually to a word, rather than specifically containing the word itself.

Boolean search: A search allowing the inclusion or exclusion of documents containing certain words through the use of operators such as AND, NOT and OR.
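Proximity and Boolean search can both be answered from a positional inverted index. The sketch below is one simple way to do it, assuming whitespace tokenization and a tiny in-memory collection; the function names are illustrative and not taken from any search product.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, a, b):
    """Documents containing both words."""
    return set(index.get(a, {})) & set(index.get(b, {}))


def boolean_not(index, a, b):
    """Documents containing a but not b."""
    return set(index.get(a, {})) - set(index.get(b, {}))


def near(index, a, b, max_distance):
    """Documents where a and b occur within max_distance words of each other."""
    hits = set()
    for doc_id in boolean_and(index, a, b):
        pos_a, pos_b = index[a][doc_id], index[b][doc_id]
        if any(abs(i - j) <= max_distance for i in pos_a for j in pos_b):
            hits.add(doc_id)
    return hits


docs = {
    1: "java object serialization guide",
    2: "java coffee beans and the history of serialization in print",
    3: "coffee beans",
}
index = build_index(docs)
print(boolean_and(index, "java", "serialization"))  # both documents mention both words
print(near(index, "java", "serialization", 2))      # only the software document remains
```

The Boolean AND returns documents 1 and 2, while the proximity query keeps only document 1, where the two words sit two positions apart.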

Contd…

Stemming: The ability for a search to include the "stem" of words. For example, stemming allows a user to enter "running" and also get back results for the stem word "run."

Lemmatization: The process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
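The difference between the two terms can be seen in a toy example. The stemmer below is a deliberately crude suffix stripper (not the Porter algorithm), and the lemma table is a hand-written stand-in for a real dictionary; both are assumptions for illustration.

```python
def naive_stem(word):
    """Strip a few common suffixes; a crude rule-based stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo consonant doubling: "runn" -> "run".
            if suffix != "s" and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


# Hypothetical lemma dictionary: maps inflected forms to a base form.
LEMMAS = {"ran": "run", "running": "run", "runs": "run", "better": "good"}


def lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ("running", "runs", "ran", "better"):
    print(w, naive_stem(w), lemmatize(w))
```

Rule-based stemming handles "running" and "runs" but leaves irregular forms such as "ran" untouched, whereas the dictionary lookup groups them under the same item.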

Contd…

Noise or stop words: Conjunctions, prepositions, articles, and other words such as AND, TO and A that appear often in documents yet alone may contain little meaning.

Thesaurus: A list of synonyms a search engine can use to find matches for particular words if the words themselves don't appear in documents.

Index: Normalized representation of words
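These three terms fit together in a small pipeline: text is normalized, stop words are dropped before indexing, and the thesaurus widens the query at search time. The sketch below assumes a tiny hand-written stop list and synonym table; all names are illustrative.

```python
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in"}
THESAURUS = {"car": {"automobile"}, "automobile": {"car"}}


def analyze(text):
    """Lowercase, trim punctuation, and drop stop words."""
    tokens = (t.strip(".,;:!?\"'()").lower() for t in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]


def build_index(docs):
    """Map each normalized term to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Return documents matching any query term or one of its synonyms."""
    hits = set()
    for term in analyze(query):
        for variant in {term} | THESAURUS.get(term, set()):
            hits |= index.get(variant, set())
    return hits


docs = {1: "The automobile was sold.", 2: "A guide to Java."}
index = build_index(docs)
print(search(index, "the car"))  # matches document 1 through the synonym
```

The query "the car" finds document 1 even though the word "car" never appears in it, and stop words such as "to" never enter the index.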

Contd…

Semantic search: A process used to improve online searching by using data from semantic networks to disambiguate queries and web text in order to generate more relevant results.

Web Search vs. Enterprise Search

Web search: Content is public and generic. It uses keywords and links, with relevancy based on some kind of historic traffic. HTTP crawlers are usually used for content acquisition.

Enterprise search: Also contains private documents that are domain specific. Content should be of the highest quality, not necessarily popular. Information and metadata need to be secure, with role-based access to the content. It has to support security (realms, roles), SLAs, and many other requirements.
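Role-based access is often applied as a filter over the raw search hits. The sketch below assumes a hypothetical access-control table mapping each document to the roles allowed to read it; the names `acl` and `filter_by_role` are invented for this example, and documents without an entry are hidden by default.

```python
def filter_by_role(hits, acl, user_roles):
    """Keep only hits whose allowed roles overlap the user's roles."""
    roles = set(user_roles)
    return [doc_id for doc_id in hits if acl.get(doc_id, set()) & roles]


# Hypothetical access-control list: document -> roles allowed to see it.
acl = {"handbook": {"employee", "hr"}, "salaries": {"hr"}}
hits = ["handbook", "salaries", "unlisted"]

print(filter_by_role(hits, acl, {"employee"}))
print(filter_by_role(hits, acl, {"hr"}))
```

Filtering after retrieval is the simplest option; a real deployment might instead push the role constraint into the query so that hidden documents never influence ranking or counts.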

Search Technologies

RDBMS: to store metadata
Cache service: for fast access
Parsers: to interpret input queries
Internationalization: for handling different languages
Search DSL: catering to a particular domain
Map/Reduce, parallelization, and algorithms
Indexing, file storage systems, multi-threading
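To show how Map/Reduce relates to indexing, here is a single-process sketch of the three phases used to build an inverted index. It only mirrors the shape of the computation; in a real cluster the map and reduce calls would run in parallel on different machines, and the function names here are illustrative.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc):
    """Emit one (word, doc_id) pair per distinct word in a document."""
    doc_id, text = doc
    return [(word, doc_id) for word in set(text.lower().split())]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, doc_ids):
    """Merge all document ids for one word into a sorted posting list."""
    return word, sorted(doc_ids)


def build_index(docs):
    mapped = map(map_phase, docs.items())  # each call is independent
    grouped = shuffle(chain.from_iterable(mapped))
    return dict(reduce_phase(w, ids) for w, ids in grouped.items())


index = build_index({1: "java search", 2: "search engine"})
print(index["search"])
```

Because each `map_phase` call looks at one document and each `reduce_phase` call looks at one word, both phases can be split across workers without coordination.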

Thank You!
