Building Enterprise Search Engines Using Open Source Technologies
www.anant.us | [email protected] | 202.905.2818 | 1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007
Large Scale Search with Open Source Technologies
Building Search Engines
What do we do?
Streamline, Organize & Unify
Business Information
Agenda
• Challenge - Why does this matter?
• Search Engine - 30k Foot View
• Open - Lucene, Cassandra & Spark
• Customizing - Apache Lucene/Solr
• Custom Parser - Written in Scala
Challenge – Why does this matter?
[Diagram: Knowledge]
Knowledge areas: Project Information, Client Service Information, Corporate Guides, Collaborative Documents, Assets & Files, Corporate Resources
Appleseed Framework (Portal, Base, Search)
Tools shown: G Drive, Delta, Dropbox, Nutshell, Freshbooks, G Sites (KB), Workflowy, Evernote, OwnCloud, Pocket, Leaves, AIC (WP), Anant (WP)
Search Engine – 30 Thousand Foot View
The search index is only as good as your processed data. If you put everything you find in your index, you are going to spend a lot of time telling people how to search.
Lucene – More than meets the eye
Think of it like a “NoSQL” database that has great indexing... everywhere.
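To make the "great indexing" point concrete: Lucene is built around an inverted index, a map from each term to the documents that contain it. The sketch below is a toy version in plain JDK Java. It is not Lucene's API; the class name `TinyIndex` and its methods are made up for illustration, and real analyzers, scoring, and on-disk segments are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted set of document ids containing it.
// Hypothetical names throughout; this is NOT Lucene's API.
class TinyIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Index one document: record its id under every term it contains.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<Integer>()).add(docId);
        }
    }

    // Return the ids of documents containing every query term (boolean AND).
    SortedSet<Integer> search(String query) {
        SortedSet<Integer> result = null;
        for (String term : tokenize(query)) {
            SortedSet<Integer> docs = postings.getOrDefault(term, new TreeSet<Integer>());
            if (result == null) result = new TreeSet<Integer>(docs);
            else result.retainAll(docs);
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Apache Lucene is a search library");
        index.add(2, "Apache Cassandra is a NoSQL database");
        index.add(3, "Solr is a search server built on Lucene");
        System.out.println(index.search("lucene search")); // [1, 3]
        System.out.println(index.search("apache"));        // [1, 2]
    }
}
```

The lookup never scans document text at query time; it only intersects the posting sets, which is why cleaning up what goes into the index matters so much.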
Cassandra – NoSQL With Structure
Think of it like a “NoSQL” database that has structure. Using “CQL” you can INSERT INTO and SELECT FROM tables, you just can't JOIN them.
Spark – Way Better MapReduce
Think of it like MapReduce, if MapReduce had been written in Scala instead of Java and built around streams. It is also advertised as up to 100 times faster than Hadoop MapReduce for in-memory workloads.
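For anyone who has not seen the MapReduce shape before, here is a word count written with plain JDK streams. This is not Spark code and it runs on a single machine; the class name `WordCountSketch` is made up. It only shows the "map each record, then reduce by key" pattern that Spark distributes across a cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Word count as a map step plus a reduce step, using plain JDK streams.
// Hypothetical sketch; it only mirrors the shape of a MapReduce job.
class WordCountSketch {
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
            // "map": break each line into lowercase words
            .flatMap(line -> Arrays.stream(line.toLowerCase(Locale.ROOT).split("\\s+")))
            .filter(word -> !word.isEmpty())
            // "reduce": group identical words and count each group (TreeMap keeps keys sorted)
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("search with Lucene", "search with Solr");
        System.out.println(count(lines)); // {lucene=1, search=2, solr=1, with=2}
    }
}
```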
Configuring - Solr - 1/3
Solr is like an eighteen-wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want.
• Configuration - Schema
  – Data Types
  – Pre-Processing
  – Collection Definitions
  – Managed vs. Unmanaged
• Configuration - ZooKeeper
  – Synchronize Configurations
  – Distribute Shards
  – Manage Replicas
  – Elect Leaders
• Configuration - SolrConfig
  – Handlers
  – Components
  – Indexing Configurations
  – Memory / Cache
  – File System
• Lessons Learned
  – Try to use out of the box
  – Try to configure your way
  – Make sure to upgrade
  – Not everything can be configured
Configuring - Solr - 2/3
• Before Docker
  – Set up ZooKeeper
    • Customize zoo.cfg
    • Set up ZooKeeper servers
  – Set up Solr distro
    • Download Solr
    • Clean up Solr
    • Customize schema.xml
    • Customize solrconfig.xml
    • Set up different Solr servers
  – Start the cloud
    • Custom start scripts
• Today w/ Docker
  – docker run --name zookeeper \
      -p 127.0.0.1:2181:2181 \
      -p 127.0.0.1:2888:2888 \
      -p 127.0.0.1:3888:3888 \
      jplock/zookeeper
  – docker run --link zookeeper:ZK -i \
      -p 127.0.0.1:8983:8983 \
      -t dockerimages/docker-solr \
      /bin/bash -c '\
      cd /opt/solr/example; \
      java -jar \
      -Dbootstrap_confdir=./solr/collection1/conf \
      -Dcollection.configName=myconf \
      -DzkHost=$ZK_PORT_2181_TCP_ADDR:$ZK_PORT_2181_TCP_PORT \
      -DnumShards=2 \
      start.jar';
https://hub.docker.com/r/dockerimages/docker-solr/
https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud
https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production
Configuring - Solr - 3/3
• SolrConfig - Example
• Schema - Example
https://cwiki.apache.org/confluence/display/solr/Configuring+solrconfig.xml
https://wiki.apache.org/solr/SchemaXml
SolrCloud / ZooKeeper
User Interface - Super Advanced
Customizing - Solr - 1/3
Solr is like an eighteen-wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want.
• Customization - Parsing
  – Need Specialized Syntax?
  – Java or Scala Based
  – Open Plugin Structure
  – Several Examples
• Customization - Highlighting
  – Need Special Coloring?
  – Specialized Syntax Aware
  – Open Plugin Structure
  – Several Examples
• Customization - Term Counts
  – Need Specific Information?
  – Create the Logic
  – Register the Component
  – Complicated Examples
• Lessons Learned
  – Major version upgrades = pain
  – Newer classes can be extended better
  – Long term investment
Customizing - Solr - 2/3
• Custom Component in Scala or Java
• Installing the Component
http://wiki.apache.org/solr/SolrPlugins
http://sujitpal.blogspot.com/2011/03/using-lucenes-new-queryparser-framework.html
Customizing - Solr - 3/3
Creating a Custom Parser with Scala
Building a parser in Scala wasn’t my first choice, but creating it in Scala made me see how much better the language is.
• Why a Specialized Syntax?
  – Legacy Syntax
  – Boolean AND Proximity Queries
  – Specialized Fielded Expressions
  – Ranges / Classifications
• Why not ANTLR or JavaCC?
  – Old Parser was in Parboiled(1)
  – Parboiled2 was in Scala
  – No need to learn a separate Syntax for Creating Syntax
• Lessons Learned
  – Parboiled2 Documentation = bad
  – Understand the syntax
  – Interactive REPL in Scala = good
  – Write tons of unit tests
  – Long term investment
• Customizing Solr with Scala
  – Found a good Virtual Mentor
  – Learned Scala (not for Spark)
  – Started from the ground up
  – Reduced from ~1k to 400 LOC
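The slides do not show the legacy grammar, so the following is only a rough idea of what "Boolean AND Proximity Queries" can look like to a parser. It is a hand-written recursive-descent parser in plain JDK Java, not the parboiled2 parser from the talk. The syntax (`term W/3 term`, joined by `AND`), the class name `LegacyQueryParser`, and the `NEAR/n(...)` output format are all assumptions made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Recursive-descent parser for a HYPOTHETICAL legacy syntax:
//   query := prox ("AND" prox)*
//   prox  := TERM ("W/<n>" TERM)?
// The real grammar from the talk is not shown in the slides.
class LegacyQueryParser {
    private final List<String> tokens;
    private int pos = 0;

    private LegacyQueryParser(String input) {
        String trimmed = input.trim();
        this.tokens = trimmed.isEmpty()
            ? Collections.<String>emptyList()
            : Arrays.asList(trimmed.split("\\s+"));
    }

    // Parse the input and render the resulting tree as a prefix expression.
    static String parse(String input) {
        LegacyQueryParser p = new LegacyQueryParser(input);
        String tree = p.query();
        if (p.pos < p.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + p.tokens.get(p.pos));
        }
        return tree;
    }

    // query := prox ("AND" prox)*
    private String query() {
        List<String> clauses = new ArrayList<>();
        clauses.add(prox());
        while (pos < tokens.size() && tokens.get(pos).equals("AND")) {
            pos++; // consume AND
            clauses.add(prox());
        }
        return clauses.size() == 1 ? clauses.get(0) : "AND(" + String.join(", ", clauses) + ")";
    }

    // prox := TERM ("W/<n>" TERM)?
    private String prox() {
        String left = term();
        if (pos < tokens.size() && tokens.get(pos).matches("W/\\d+")) {
            String distance = tokens.get(pos++).substring(2); // digits after "W/"
            String right = term();
            return "NEAR/" + distance + "(" + left + ", " + right + ")";
        }
        return left;
    }

    // TERM is any token that is not an operator.
    private String term() {
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("Expected a term, found end of input");
        }
        String t = tokens.get(pos);
        if (t.equals("AND") || t.matches("W/\\d+")) {
            throw new IllegalArgumentException("Expected a term, found operator: " + t);
        }
        pos++;
        return t;
    }

    public static void main(String[] args) {
        System.out.println(parse("solr W/3 cloud AND lucene")); // AND(NEAR/3(solr, cloud), lucene)
    }
}
```

In the spirit of "write tons of unit tests", each grammar rule here is small enough to test on its own: one case per operator, one per error path, and one for operator precedence.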
JavaCC vs. parboiled2 (Scala)
• JavaCC - SurroundQuery.jj
• Scala-based Parboiled2
Questions & Contact
www.anant.us | [email protected] | 202.905.2818 | 1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007
@anantcorp
facebook.com/anantCorp
linkedin.com/company/anant
[email protected]/in/xingh
Rahul Singh, CEO & Founder
Questions & Contact
• Brown Bag Session or Meetup?
• Modern Enterprise
• Mastering Services in the Service of Others
• Hybrid Agile Project Management
• Building Search Engines
• CICD / DevOps
• Connecting Internet Software
Streamlined Data: Integration / Data Pipelines
Organized Knowledge: Search / Data Warehouses
Unified Interfaces: Portals / Dashboards / Mobile