goat search

GOAT SEARCH Revorg GOAT Search Solution (Powered by Lucene)


Post on 23-Feb-2016



TRANSCRIPT

Page 1: Goat search

GOAT SEARCH
Revorg GOAT Search Solution (Powered by Lucene)

Page 2: Goat search

About Me
Grover Fields
Revorg, LLC (Owner)
M.S. Information System (Troy University)
B.S. Industrial Engineering (Florida A&M University)
Stanford Project Management Courses

Page 3: Goat search

About Me
10+ years of development, analysis, and implementation
10+ years of ColdFusion experience
2+ years of Java experience
Commonspot, Strongmail, ClickFix (Developer)
Email: [email protected]
Web site: http://www.groverfields.com

Page 4: Goat search

Agenda
What?
  What can we do with GOAT?
Why?
  Why do we want to use GOAT and not Verity?
How?
  How do we do that?
Conclusion and alternative solutions

Page 5: Goat search

What
What is a Search Engine?
  Builds an index on text
  Answers queries using that index, a la Verity
Already have a database? A search engine offers:
  Scalability
  Relevance ranking
  Tweaking
  Integrates different sources (email, web pages, files, databases)

Page 6: Goat search

What is a search engine? (cont.)
Works on words, not on substrings
  Auto != automatic, automobile
Indexing process:
  Convert document
  Extract text and metadata
  Normalize text
  Write (inverted) index
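The last two indexing steps ("Normalize text" and "Write (inverted) index") can be sketched in plain Java. This is only a toy illustration under stated assumptions: the class TinyInvertedIndex and its methods are hypothetical names, nothing here uses the Lucene API, and Lucene's real analyzers and on-disk index format are far more involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index; this is not how Lucene stores its index.
class TinyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // "Normalize text": lowercase, then split on anything that is not a letter or digit.
    static List<String> normalize(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // "Write (inverted) index": record the document id under every term it contains.
    void add(int docId, String text) {
        for (String term : normalize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Whole-word lookup: the query is normalized the same way as the documents.
    Set<Integer> search(String word) {
        List<String> terms = normalize(word);
        if (terms.isEmpty()) {
            return new TreeSet<>();
        }
        return postings.getOrDefault(terms.get(0), new TreeSet<>());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Auto parts for sale");
        index.add(2, "Automatic transmission repair");
        index.add(3, "Used automobile dealer");
        System.out.println(index.search("auto"));       // prints [1]
        System.out.println(index.search("AUTOMOBILE")); // prints [3]
    }
}
```

Because the index is keyed by whole words, a search for "auto" finds only document 1, which is the "Auto != automatic, automobile" point from the slide.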

Page 7: Goat search

Apache Lucene Overview
Lucene Java 2.4
  A high-performance, full-featured text search engine library written entirely in Java.
  It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
No GUI
http://lucene.apache.org

Page 8: Goat search

Apache Lucene Overview
Java library for indexing and searching
No dependencies
Works with Java 1.4 or later
Input for indexing: Document objects
  Each document: a set of Fields, each with a field name and field content
Stores its index as files on disk or in memory
No document converters
No web crawler

Page 9: Goat search

Lucene Java users
HBCU.info
LinkedIn
IBM OmniFind Yahoo! Edition
Technorati.com
Eclipse
Monster.com
…

Page 10: Goat search

Lucene Java Summary
Java library for indexing and searching
Lightweight / no dependencies
Powerful and fast and tested!
No document conversion
No GUI

Page 11: Goat search

Why?
Cost of Enterprise Search Solution
Need for search speed
Java projects to work on
Things to do

Page 12: Goat search

Verity Limitations
10,000 documents for ColdFusion Developer Edition
125,000 documents for ColdFusion Standard Edition
250,000 documents for ColdFusion Enterprise Edition
What do developers do in a shared hosting environment?
Is it possible for the hosting company to limit the number of documents per Web site?

Page 13: Goat search

T-SQL Limitations?
Search for “Yahoo” on my blog

  SELECT entry.id
  FROM tbl_mango_entry AS entry
  INNER JOIN tbl_mango_post AS post ON entry.id = post.id
  WHERE entry.blog_id = 'default'
    AND (entry.title LIKE '%yahoo%'
         OR entry.content LIKE '%yahoo%'
         OR entry.excerpt LIKE '%yahoo%')
    AND post.posted_on <= getdate()
    AND entry.status = 'published'
  ORDER BY post.posted_on DESC

Multiply that by 10, 100, 500, or 1,000 users per hour?

Page 14: Goat search

T-SQL Limitations?
Full table scan = 1 thing: PERFORMANCE KILLER!!!
No search sorting
RDBMS isn’t designed to do this but allows it
Use the right tools!
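To make the full-table-scan point concrete, here is a rough Java sketch that counts how many rows each strategy has to inspect. It is a toy model, not a benchmark: the class ScanVersusIndex, its method names, and the sample rows are all made up for illustration, and a real RDBMS or Lucene index behaves in far more nuanced ways.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of the two strategies; it counts rows inspected, not elapsed time.
class ScanVersusIndex {
    // Like LIKE '%needle%': every row is inspected, whether or not it matches.
    static int rowsInspectedByScan(List<String> rows, String needle, List<Integer> hits) {
        int inspected = 0;
        for (int id = 0; id < rows.size(); id++) {
            inspected++;
            if (rows.get(id).toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(id);
            }
        }
        return inspected;
    }

    // Built once, ahead of query time: word -> ids of the rows that contain it.
    static Map<String, List<Integer>> buildIndex(List<String> rows) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < rows.size(); id++) {
            for (String word : rows.get(id).toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (word.isEmpty()) {
                    continue;
                }
                List<Integer> ids = index.computeIfAbsent(word, w -> new ArrayList<>());
                // Rows are visited in order, so a word repeated within one row
                // would show up as the last id already stored.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != id) {
                    ids.add(id);
                }
            }
        }
        return index;
    }

    // One hash lookup at query time; only the matching row ids are returned.
    static List<Integer> lookup(Map<String, List<Integer>> index, String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "Yahoo buys a startup", "ColdFusion tips",
            "Lucene indexing basics", "Why I left Yahoo");
        List<Integer> hits = new ArrayList<>();
        System.out.println("scan inspected " + rowsInspectedByScan(rows, "yahoo", hits) + " rows, hits " + hits);
        System.out.println("index lookup returned " + lookup(buildIndex(rows), "yahoo"));
    }
}
```

The scan always inspects every row, so its cost grows with the table; the lookup cost depends on the number of matches, at the price of building and maintaining the index up front.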

Page 15: Goat search

How?
GOAT Search Solution
  Lucene 2.4.0
  ColdFusion MX 8
    MX is fine but GUI needs to be rolled back
  Commons IO 1.4
Simply package .jar files
Simply Web based GUI

Page 16: Goat search

How?
Macromedia JDBC Drivers
  Same drivers that ColdFusion uses
  No additional drivers to install
Supports RDBMS ONLY
  MSSQL
  MySQL
  Oracle
No File system support (Yet)

Page 17: Goat search

Basics?
Indexing extracts both meaning and structure from unstructured information by indexing each document
The index contains a complete list of all the words used in a given document along with metadata about that document
Lucene creates a collection that normalizes both the structured and unstructured data.
Search requests then check these collections rather than scanning the actual documents and database fields.
This provides a faster search of information, regardless of the file type and whether the source is structured or unstructured.

Page 18: Goat search

Basics?
Collection
  A special database created by Lucene that contains metadata that describes the documents
Documents
  A sequence of fields
  Similar to a row in a database table
    Row 1
    Row 2, etc.
Fields
  A named sequence of terms
  Similar to a column in a table
    Primary Key
    Column 1
Terms
  A string
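The collection / document / field / term hierarchy can be mirrored with ordinary Java collections. This is a sketch for intuition only: CollectionSketch and its helpers are hypothetical stand-ins, not Lucene classes, and the whitespace-based term splitting is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-ins for the concepts on this slide; these are not Lucene classes.
class CollectionSketch {
    // A document is a sequence of named fields, much like one row of a table.
    static Map<String, String> document(String... nameValuePairs) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i + 1 < nameValuePairs.length; i += 2) {
            fields.put(nameValuePairs[i], nameValuePairs[i + 1]);
        }
        return fields;
    }

    // A field's content breaks down into terms, and each term is a plain string.
    static List<String> terms(String fieldContent) {
        List<String> terms = new ArrayList<>();
        for (String term : fieldContent.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // The collection holds documents the way a table holds rows.
        List<Map<String, String>> collection = new ArrayList<>();
        collection.add(document("id", "1", "title", "ColdFusion Developer"));
        collection.add(document("id", "2", "title", "Java Developer"));
        for (Map<String, String> doc : collection) {
            System.out.println(doc.get("id") + " -> " + terms(doc.get("title")));
        }
    }
}
```

Here "id" plays the role of the primary key column and "title" the role of an ordinary column; each row of the source table would become one document.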

Page 19: Goat search

Knowledge?
Index
  A special database created by Lucene that contains metadata that describes the documents
Query Syntax
  Similar to Google’s advanced search:
    field:value
    E.G. resume: coldfusion
    http://lucene.apache.org/java/2_4_0/queryparsersyntax.html
Results
  Primary Key list of values
  XML based on the document
  CFX Tag integration
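As a rough illustration of the field:value idea and the "primary key list" result, the sketch below splits a query at the first colon and returns the keys of the matching documents. It is deliberately simplistic and hypothetical: FieldQuerySketch is not part of GOAT or Lucene, it does not implement Lucene's query parser syntax, and it scans the documents linearly instead of consulting an index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; Lucene's real query parser supports far more than field:value.
class FieldQuerySketch {
    // Splits "field:value" at the first colon, ignoring whitespace around each part.
    static String[] parse(String query) {
        int colon = query.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected field:value, got: " + query);
        }
        String field = query.substring(0, colon).trim();
        String value = query.substring(colon + 1).trim().toLowerCase(Locale.ROOT);
        return new String[] { field, value };
    }

    // Returns the primary keys of documents whose named field contains the value as a whole word.
    static List<String> search(List<Map<String, String>> docs, String query, String keyField) {
        String[] parts = parse(query);
        List<String> keys = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            String content = doc.getOrDefault(parts[0], "");
            for (String word : content.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (word.equals(parts[1])) {
                    keys.add(doc.get(keyField));
                    break; // one match per document is enough
                }
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<String, String> first = new HashMap<>();
        first.put("id", "101");
        first.put("resume", "ColdFusion and Java developer");
        Map<String, String> second = new HashMap<>();
        second.put("id", "102");
        second.put("resume", "Oracle DBA");
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(first);
        docs.add(second);
        System.out.println(search(docs, "resume: coldfusion", "id")); // prints [101]
    }
}
```

Returning only primary keys keeps the result small; the caller can then fetch the full rows from the database by key.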

Page 20: Goat search

Alternative Solutions for Search
Commercial vendors:
  FAST, $100k
  Autonomy, $80k
  Google, $50k
Commercial search engines based on Lucene
  IBM OmniFind Yahoo Edition
RDBMS with Integrated Search
  Oracle
  MySQL
  MSSQL
  PERFORMANCE KILLERS

Page 21: Goat search

RoadMap
Road Map
  A set of guidelines, instructions, or explanations: "wrote an ethics code as a road map for the behavior of elected officials."
Overhaul Java programming (still novice)
Integrate with other products
  Aperture
  Nutch
  Solr
File system integration
  .txt, .pdf, .doc, .ppt, etc.
Geospatial based searches
  E.G. All jobs within a 50 mile radius
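The geospatial item is a road-map goal rather than an existing feature, so the following is only a sketch of one common way a "within a 50 mile radius" filter could be computed: the haversine great-circle distance. The class RadiusSearchSketch, its method names, and the mean Earth radius of about 3958.8 miles are assumptions for illustration; nothing here comes from GOAT or Lucene.

```java
// Hypothetical helper for a radius filter; not part of GOAT or Lucene.
class RadiusSearchSketch {
    // Approximate mean Earth radius in miles (an assumption; the Earth is not a perfect sphere).
    static final double EARTH_RADIUS_MILES = 3958.8;

    // Haversine great-circle distance between two latitude/longitude points given in degrees.
    static double distanceMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }

    // True when the second point lies within radiusMiles of the first.
    static boolean withinRadius(double lat1, double lon1, double lat2, double lon2, double radiusMiles) {
        return distanceMiles(lat1, lon1, lat2, lon2) <= radiusMiles;
    }

    public static void main(String[] args) {
        // Half a degree of latitude is roughly 35 miles; a full degree is roughly 69.
        System.out.println(withinRadius(30.0, -84.0, 30.5, -84.0, 50.0));
        System.out.println(withinRadius(30.0, -84.0, 31.0, -84.0, 50.0));
    }
}
```

In practice a search engine would first narrow the candidates (for example with a bounding box) and apply an exact distance check like this one only to the survivors.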

Page 22: Goat search

References
Apache.org
Adobe.com
Ben Forta’s Blog
Slideshare.net
  Multiple authors
Other references