Building a Vertical Search Site (using lots of Apache software, of course)

Upload: vernon-shelton

Post on 17-Jan-2016

Page 1: Building a Vertical Search Site (using lots of Apache software, of course)

Building a Vertical Search Site

(using lots of Apache software, of course)

Page 2:

Just the Facts, Ma’am

• Ken Krugler - CTO/co-founder of Krugle

• We use lots of Apache S/W at krugle.org

– Httpd, Lucene, Nutch, Solr, Xerces, etc.

• I’ll describe our architecture

• And the sometimes painful lessons learned

Page 3:

Three Faces of Krugle

• Free public site - http://www.krugle.org

• Partner sites

– http://sourceforge.krugle.com

– http://developerworks.krugle.com

– http://aws.krugle.com

• Enterprise appliance

Page 4:

Krugle.org free public site

• Search code, projects, & technical web pages

• 150,000 projects

• 2.5 billion lines of code

• 40 million web pages

Page 5:

Krugle.org Architecture (web)

• Web tier runs Apache

• Also mod_perl

– “glue” for Javascript to backend RESTful API

– Partner APIs

• “Dirty” side of system

Page 6:

Krugle.org Architecture (API)

• API server uses Resin

• Webapps provide RESTful API services

• Filer is big disk array

– LightTPD, NFS

• Searchers run Hadoop, Lucene

Page 7:

Krugle.org Architecture (CPI)

• Page crawl uses Nutch

• Code crawl uses bits of Nutch, custom stuff

• Fuzzy parsers created using ANTLR

• Project data in MySQL, pushed to Solr

• Code index is Lucene
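
The "project data in MySQL, pushed to Solr" step can be sketched roughly as follows. This is only an illustration, not Krugle's actual pipeline: the row dicts stand in for whatever a MySQL query would return, and the field names (`id`, `name`, `license`) are assumptions. The only Solr-specific part is the standard XML update message shape (`<add><doc><field name="...">`).

```python
import xml.etree.ElementTree as ET


def rows_to_solr_add(rows):
    """Build a Solr XML <add> message from an iterable of row dicts."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            if value is None:
                continue  # skip NULL columns rather than indexing "None"
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical rows, shaped like what a MySQL cursor might return.
rows = [
    {"id": 1, "name": "nutch", "license": "Apache-2.0"},
    {"id": 2, "name": "demo & test", "license": None},
]

if __name__ == "__main__":
    print(rows_to_solr_add(rows))
```

The resulting string would be POSTed to the Solr update handler and followed by a commit; that network step is left out here so the sketch stays self-contained.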

Page 8:

Krugle partner sites

• IBM developerWorks

• Sourceforge.net

• Amazon Web Services

• Yahoo! Dev Network

• Collabnet

Page 9:

Krugle Architecture (partners)

• Higher level API

• Wraps RESTful API

• Handled in web tier

• Big chunks of Perl

• LightTPD cache
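
One way to picture a higher-level API that wraps a RESTful backend and sits behind a cache is the sketch below. It is written in Python rather than the Perl the slide mentions, and it swaps the LightTPD cache for a simple in-process dictionary with a time-to-live. The class name, the `/search` path, and the parameter names are all hypothetical.

```python
import time


class CachedPartnerAPI:
    """Hypothetical partner-facing wrapper around a lower-level REST call."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch      # callable(path, params) -> response body
        self._ttl = ttl_seconds  # how long a cached response stays valid
        self._clock = clock      # injectable so tests can control time
        self._cache = {}         # (partner, query) -> (timestamp, body)

    def search(self, partner, query):
        """Return the backend response, reusing a fresh cached copy if any."""
        key = (partner, query)
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        body = self._fetch("/search", {"query": query, "partner": partner})
        self._cache[key] = (now, body)
        return body
```

Passing `fetch` in as a callable keeps the wrapper independent of any particular HTTP client, which also makes it easy to exercise with a stub.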

Page 10:

Krugle enterprise server

• Krugle inside firewall

• Talks to major SCMs

• SCM comment search

• Includes public site info

Page 11:

Krugle Architecture (enterprise)

• Collapses web tier, API server, code searchers, filer, and DB server

• Separate admin system (DB, GUI, code crawler, configuration) as Jetty-hosted webapp

Page 12:

RESTful API

• HTTP requests, XML responses

• Works well with Perl middleware

• Some load/memory issues

• Solr integration challenges

• Integration test challenges
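
The "HTTP requests, XML responses" pattern can be sketched as below. The slides do not show the real endpoints or response schema, so the base URL, the query parameters, and the `<results>`/`<hit>` element names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical endpoint; the slides do not give the real API paths.
BASE_URL = "http://api.example.com/search"


def build_search_url(query, start=0, count=10):
    """Build a GET URL for a hypothetical code-search endpoint."""
    params = {"query": query, "start": start, "count": count}
    return BASE_URL + "?" + urlencode(params)


def parse_search_response(xml_text):
    """Turn an XML response body into a list of plain dicts."""
    root = ET.fromstring(xml_text)
    return [
        {
            "project": hit.get("project"),
            "path": hit.findtext("path"),
            "line": int(hit.findtext("line")),
        }
        for hit in root.iter("hit")
    ]


# Made-up response body in the assumed schema.
SAMPLE_RESPONSE = """<results total="2">
  <hit project="nutch"><path>src/Fetcher.java</path><line>42</line></hit>
  <hit project="lucene"><path>src/IndexWriter.java</path><line>7</line></hit>
</results>"""

if __name__ == "__main__":
    print(build_search_url("foo bar"))
    print(parse_search_response(SAMPLE_RESPONSE))
```

Keeping URL construction and response parsing as separate pure functions means both can be tested without a live server, which may ease some of the integration-test pain the slide mentions.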

Page 13:

Key Lessons

• If it isn’t broke, don’t upgrade

– There’s always a newer version

– That includes the build system

• Be prepared to pay for free software

– Motivating project contributors to do things

• Moderation in architectural abstraction

– There’s always a higher and lower option