Building Enterprise Search Engines Using Open Source Technologies

www.anant.us | [email protected] | 202.905.2818 | 1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007

Large Scale Search with Open Source Technologies

Building Search Engines

What do we do?

Streamline, Organize & Unify Business Information

Agenda

• Challenge - Why does this matter?
• Search Engine - 30k Foot View
• Open - Lucene, Cassandra & Spark
• Customizing - Apache Lucene/Solr
• Custom Parser - Written in Scala

Challenge – Why does this matter?

[Diagram: types of business information and where they live]

• Information types: Knowledge, Project Information, Client Service Information, Corporate Guides, Collaborative Documents, Assets & Files, Corporate Resources
• Platform: Appleseed Framework (Portal, Base, Search)
• Sources: G Drive, G Drive Delta, G Sites (KB), Dropbox, Nutshell, Freshbooks, Workflowy, Evernote, OwnCloud, Pocket, Leaves, AIC (WP), Anant (WP)

Search Engine – 30 Thousand Foot View

The search index is only as good as your processed data. If you put everything you find in your index, you are going to spend a lot of time telling people how to search.

Lucene – More than meets the eye


Think of it as a “NoSQL” database that has great indexing, everywhere.
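To make the "great indexing" point concrete, here is a minimal sketch of an inverted index, the data structure Lucene's indexing is built around. It is plain Python, not Lucene's API: the function names (`tokenize`, `build_index`, `search`) and the sample documents are hypothetical, and real analyzers do far more than lowercase and split.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents that contain every term in the query (AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Project information for the client portal",
    2: "Client service guides",
    3: "Corporate guides and project documents",
}
index = build_index(docs)
print(search(index, "client"))          # [1, 2]
print(search(index, "project guides"))  # [3]
```

A lookup touches only the postings for the query terms, which is why search stays fast as the collection grows.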

Cassandra – NoSQL With Structure


Think of it as a “NoSQL” database that has structure. Using CQL, you can INSERT INTO and SELECT FROM, just not JOIN.

Spark – Way Better MapReduce


Think of it as MapReduce, if MapReduce had been created with Scala instead of Java, and with streams. It is also advertised as up to 100 times faster than Hadoop MapReduce for in-memory workloads.
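For anyone new to the model, a word count shows the two phases. This is a minimal single-process sketch in plain Python, not Spark's API; `map_phase`, `reduce_phase`, and the sample lines are hypothetical names chosen for illustration. Spark runs the same kind of steps in parallel across a cluster and can keep intermediate results in memory.

```python
from collections import Counter
from functools import reduce


def map_phase(line):
    """Map: produce partial word counts for a single line."""
    return Counter(line.lower().split())


def reduce_phase(left, right):
    """Reduce: merge two partial counts into one."""
    return left + right


# Hypothetical input; in a real job each line could live on a different node.
lines = ["open source search", "search with open source", "spark and solr"]

totals = reduce(reduce_phase, map(map_phase, lines), Counter())
print(totals["search"], totals["open"], totals["spark"])  # 2 2 1
```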

Configuring - Solr - 1/3

Solr is like an eighteen-wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want.

• Configuration - Schema
  – Data Types
  – Pre-Processing
  – Collection Definitions
  – Managed vs. Unmanaged

• Configuration - ZooKeeper
  – Synchronize Configurations
  – Distribute Shards
  – Manage Replicas
  – Elect Leaders

• Configuration - SolrConfig
  – Handlers
  – Components
  – Indexing Configurations
  – Memory / Cache
  – File System

• Lessons Learned
  – Try to use out of the box
  – Try to configure your way
  – Make sure to upgrade
  – Not everything can be configured

Configuring - Solr - 2/3

• Before Docker
  – Set up ZooKeeper
    • Customize zoo.cfg
    • Set up ZooKeeper servers
  – Set up Solr distro
    • Download Solr
    • Clean up Solr
    • Customize schema.xml
    • Customize solrconfig.xml
    • Set up different Solr servers
  – Start the cloud
    • Custom start scripts

• Today w/ Docker
  – docker run --name zookeeper \
      -p 127.0.0.1:2181:2181 \
      -p 127.0.0.1:2888:2888 \
      -p 127.0.0.1:3888:3888 \
      jplock/zookeeper

  – docker run --link zookeeper:ZK -i \
      -p 127.0.0.1:8983:8983 \
      -t dockerimages/docker-solr \
      /bin/bash -c '\
      cd /opt/solr/example; \
      java -jar \
      -Dbootstrap_confdir=./solr/collection1/conf \
      -Dcollection.configName=myconf \
      -DzkHost=$ZK_PORT_2181_TCP_ADDR:$ZK_PORT_2181_TCP_PORT \
      -DnumShards=2 \
      start.jar';

https://hub.docker.com/r/dockerimages/docker-solr/

https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud

https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production




Configuring - Solr - 3/3

• SolrConfig - Example
• Schema - Example

https://cwiki.apache.org/confluence/display/solr/Configuring+solrconfig.xml

https://wiki.apache.org/solr/SchemaXml




SolrCloud / ZooKeeper

User Interface - Super Advanced

Customizing - Solr - 1/3

Solr is like an eighteen-wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want.

• Customization - Parsing
  – Need Specialized Syntax?
  – Java or Scala Based
  – Open Plugin Structure
  – Several Examples

• Customization - Highlighting
  – Need Special Coloring?
  – Specialized Syntax Aware
  – Open Plugin Structure
  – Several Examples

• Customization - Term Counts
  – Need Specific Information?
  – Create the Logic
  – Register the Component
  – Complicated Examples

• Lessons Learned
  – Major version upgrades = pain
  – Newer classes can be extended better
  – Long term investment

Customizing - Solr - 2/3

• Custom Component in Scala or Java
• Installing the Component

http://wiki.apache.org/solr/SolrPlugins

http://sujitpal.blogspot.com/2011/03/using-lucenes-new-queryparser-framework.html




Customizing - Solr - 3/3

Creating a Custom Parser with Scala

Building a parser in Scala wasn’t my first choice, but creating it in Scala made me see how much better the language is.

• Why a Specialized Syntax?
  – Legacy Syntax
  – Boolean AND Proximity Queries
  – Specialized Fielded Expressions
  – Ranges / Classifications

• Why not ANTLR or JavaCC?
  – Old parser was in parboiled (1)
  – parboiled2 was in Scala
  – No need to learn a separate syntax for creating syntax

• Lessons Learned
  – parboiled2 documentation = bad
  – Understand the syntax
  – Interactive REPL in Scala = good
  – Write tons of unit tests
  – Long term investment

• Customizing Solr with Scala
  – Found a good Virtual Mentor
  – Learned Scala (not for Spark)
  – Started from the ground up
  – Reduced from ~1k to 400 LOC
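To give a feel for what a parser for Boolean and proximity queries involves, here is a minimal sketch of a hand-written recursive-descent parser, one function per grammar rule in the PEG style that parboiled2 also follows. It is written in Python for brevity and is not the parser from this talk: the grammar, the `NEAR/n` operator, and every class and function name are assumptions made up for illustration, not the legacy syntax mentioned above.

```python
import re
from dataclasses import dataclass
from typing import List, Union

# Hypothetical grammar, lowest precedence first:
#   query  <- andExp ("OR" andExp)*
#   andExp <- near ("AND" near)*
#   near   <- atom ("NEAR/" digits atom)?
#   atom   <- word / "(" query ")"


@dataclass
class Term:
    text: str


@dataclass
class Near:
    left: "Node"
    right: "Node"
    distance: int


@dataclass
class BoolOp:
    op: str  # "AND" or "OR"
    operands: List["Node"]


Node = Union[Term, Near, BoolOp]

TOKEN = re.compile(r"\s*(NEAR/\d+|\(|\)|\w+)")


def tokenize(text: str) -> List[str]:
    """Split the query into words, parentheses, and NEAR/n operators."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens: List[str]):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse(self) -> Node:
        node = self.query()
        if self.peek() is not None:
            raise ValueError(f"unexpected token {self.peek()!r}")
        return node

    def query(self) -> Node:
        operands = [self.and_exp()]
        while self.peek() == "OR":
            self.advance()
            operands.append(self.and_exp())
        return operands[0] if len(operands) == 1 else BoolOp("OR", operands)

    def and_exp(self) -> Node:
        operands = [self.near()]
        while self.peek() == "AND":
            self.advance()
            operands.append(self.near())
        return operands[0] if len(operands) == 1 else BoolOp("AND", operands)

    def near(self) -> Node:
        left = self.atom()
        tok = self.peek()
        if tok is not None and tok.startswith("NEAR/"):
            self.advance()
            return Near(left, self.atom(), int(tok[5:]))
        return left

    def atom(self) -> Node:
        tok = self.advance()
        if tok == "(":
            node = self.query()
            if self.advance() != ")":
                raise ValueError("expected ')'")
            return node
        if tok is None or tok in (")", "AND", "OR") or tok.startswith("NEAR/"):
            raise ValueError(f"expected a term, got {tok!r}")
        return Term(tok)


def parse_query(text: str) -> Node:
    return Parser(tokenize(text)).parse()


print(parse_query("solr NEAR/3 lucene AND scala"))
```

Each rule maps to one small method, which is the same property that makes a parser easy to poke at in a REPL and to cover with unit tests, one rule at a time.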

JavaCC vs. parboiled2 (Scala)

• JavaCC - SurroundQuery.jj
• Scala-based parboiled2

Questions & Contact


@anantcorp

facebook.com/anantCorp

linkedin.com/company/anant

[email protected]/in/xingh

Rahul Singh, CEO & Founder

Questions & Contact

• Brown Bag Session or Meetup?
• Modern Enterprise
• Mastering Services in the Service of Others
• Hybrid Agile Project Management
• Building Search Engines
• CICD / DevOps
• Connecting Internet Software


Streamlined Data: Integration / Data Pipelines

Organized Knowledge: Search / Data Warehouses

Unified Interfaces: Portals / Dashboards / Mobile