batch indexing & near real time, keeping things fast
DESCRIPTION
Presented by Marc Sturlese, Architect, Backend engineer, Trovit. In this talk I will explain how we combine Hadoop for batch indexing with Storm, HBase and ZooKeeper to keep our indexes updated in near real time. I will talk about why we didn't simply choose a default SolrCloud setup and its real time feature (mainly to avoid hitting merges while serving queries on the slaves), and about the advantages and complexities of a mixed architecture. Both parts of the infrastructure, and how they are coordinated, will be explained in detail. Finally, I will mention future lines of work and how we plan to use the Lucene real time feature.
TRANSCRIPT
![Page 1: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/1.jpg)
Batch Indexing & Near Real Time, keeping things fast
Marc Sturlese, Software engineer @ Trovit
![Page 2: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/2.jpg)
About me...
• Marc Sturlese – @sturlese
• Software engineer @Trovit. R&D focused
• Responsible for search and scalability
![Page 3: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/3.jpg)
Agenda
• Who we are
• Batch architecture. Hadoop & Hive
• Near real time architecture. Storm & stuff
• Putting it all together
• Alternatives and Future directions
• Questions
![Page 4: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/4.jpg)
Who we are
Trovit, a search engine for classifieds
![Page 5: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/5.jpg)
Who we are
![Page 6: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/6.jpg)
Batch Layer
• Hadoop based
• Documents are crunched by a pipeline of MR jobs
• Hive to save stats of each phase
![Page 7: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/7.jpg)
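Batch Layer: Pipeline overview
[Diagram: incoming data goes through a pipeline of MR stages on the Hadoop cluster (Ad Processor, Diff, Matching, Expiration, Deduplication, Indexing), producing Lucene indexes that are then deployed. Other labels on the diagram: t – 1, External Data, Hive Stats.]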
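The pipeline can be pictured as a chain of stages, each consuming the previous stage's output and recording a count for later inspection. The sketch below is plain Python, not Hadoop code. The stage functions, the `url` and `expired` fields, and the `stats` dictionary (standing in for the Hive stats) are assumptions made for illustration, not the actual Trovit jobs.

```python
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, str]
Stage = Callable[[List[Doc]], List[Doc]]


def drop_expired(docs: List[Doc]) -> List[Doc]:
    """Drop documents flagged as expired (hypothetical 'expired' field)."""
    return [doc for doc in docs if doc.get("expired") != "yes"]


def deduplicate(docs: List[Doc]) -> List[Doc]:
    """Keep the first document seen for each (hypothetical) 'url' value."""
    seen = set()
    unique = []
    for doc in docs:
        if doc["url"] not in seen:
            seen.add(doc["url"])
            unique.append(doc)
    return unique


def run_pipeline(docs: List[Doc],
                 stages: List[Tuple[str, Stage]],
                 stats: Dict[str, int]) -> List[Doc]:
    """Run the stages in order, recording how many docs leave each stage."""
    for name, stage in stages:
        docs = stage(docs)
        stats[name] = len(docs)  # stand-in for writing a row to Hive
    return docs
```

Keeping a count per phase makes it easy to spot a stage that suddenly drops far more documents than usual.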
![Page 8: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/8.jpg)
Batch Layer: The good things!
• Index always built from scratch. Small number of big segments
• Multicast deployment lets us send indexes to all slaves at the same time.
• Backups convenient on HDFS
![Page 9: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/9.jpg)
Batch Layer: That was cool but...
• Not even close to real time
• Crunching documents in batch means waiting until everything is processed. This can take a few hours
• We want to show the user fresher results!
![Page 10: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/10.jpg)
Near real time Layer: Storm and stuff to the rescue
![Page 11: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/11.jpg)
Near real time Layer: Storm properties
• Distributed real time computation system
• Fault tolerance
• Horizontal scalability
• Low latency
• Reliability
![Page 12: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/12.jpg)
Near real time Layer: Storm in action
[Diagram: sources (XML feeds, Kafka partitions) feed a Storm topology made of an XML spout, Kafka spouts, a Doc Manager bolt and an Indexer bolt, which writes to the Solr prod replicas (slaves). Grouping labels on the diagram: shuffle grouping, field grouping.]
![Page 13: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/13.jpg)
Near real time Layer: Storm in action
• Spouts just read and send
• Doc Manager Bolt processes and classifies
• Indexer Bolt adds documents to Solr
• Replicated logic with different implementation
• Careful not to overload Solr slaves...
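The division of work between spout and bolts can be sketched as three small functions chained together. This is plain Python rather than the Storm API. The `vertical` and `country` fields and the index naming scheme are hypothetical, loosely modelled on the "Cars US" and "Jobs BR" labels that appear later in the deck.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Ad = Dict[str, str]


def xml_spout(raw_ads: Iterable[Ad]) -> Iterator[Ad]:
    """Spouts just read and send: no processing happens here."""
    yield from raw_ads


def doc_manager_bolt(ads: Iterable[Ad]) -> Iterator[Tuple[str, Ad]]:
    """Classify each ad into a target index (hypothetical naming scheme)."""
    for ad in ads:
        index_name = f"{ad['vertical']}_{ad['country']}".lower()
        yield index_name, ad


def indexer_bolt(classified: Iterable[Tuple[str, Ad]]) -> Dict[str, List[Ad]]:
    """Collect docs per index; a real bolt would send them to Solr instead."""
    per_index: Dict[str, List[Ad]] = {}
    for index_name, ad in classified:
        per_index.setdefault(index_name, []).append(ad)
    return per_index
```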
![Page 14: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/14.jpg)
Near real time Layer: Storm in action
![Page 15: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/15.jpg)
Near real time Layer: Storm in action. But...
![Page 16: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/16.jpg)
Near real time Layer: Storm in action. But...
• Now Solr has to handle user queries and Storm inserts
• Field grouping on Indexer Bolt for politeness
• Small bulks to reduce insert requests
• Committing on many cores, same host, same time can be painful
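A rough sketch of the two politeness measures: route every document for a given index to the same indexer task, as a fields grouping does, and buffer documents so they reach Solr in small bulks instead of one request each. The hashing used here is an assumption (Storm's own partitioning differs), and `BulkBuffer`, `task_for` and the `send` callback are hypothetical names.

```python
import zlib
from typing import Callable, Dict, List

Doc = Dict[str, str]


def task_for(index_name: str, num_tasks: int) -> int:
    """Fields-grouping style routing: same key, same task, every time.

    crc32 is used because it is stable across processes, unlike hash(str).
    """
    return zlib.crc32(index_name.encode("utf-8")) % num_tasks


class BulkBuffer:
    """Accumulate docs per index and send them in bulks of `bulk_size`."""

    def __init__(self, bulk_size: int,
                 send: Callable[[str, List[Doc]], None]) -> None:
        self.bulk_size = bulk_size
        self.send = send  # hypothetical callback that posts a bulk to Solr
        self.pending: Dict[str, List[Doc]] = {}

    def add(self, index_name: str, doc: Doc) -> None:
        docs = self.pending.setdefault(index_name, [])
        docs.append(doc)
        if len(docs) >= self.bulk_size:
            self.flush(index_name)

    def flush(self, index_name: str) -> None:
        """Send whatever is pending for this index, if anything."""
        docs = self.pending.pop(index_name, [])
        if docs:
            self.send(index_name, docs)
```

Because one task owns each index, that task alone decides when a bulk for that index goes out, which keeps the request rate per core predictable.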
![Page 17: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/17.jpg)
Near real time Layer: Storm in action - Committing
[Diagram: Indexer Bolts (Cars US, Jobs BR) go through a ZooKeeper Locker before reaching Slave 1, Slave 2 ... Slave N. Core labels on the diagram: Real estate UK R1, Cars US R1, Cars US R2, Jobs BR R1, Jobs BR R2, Real estate ES R1.]
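One plausible reading of the locker is that it serialises commits so that two cores on the same host never commit at the same moment. The sketch below illustrates that idea inside a single process, with `threading.Lock` standing in for a ZooKeeper distributed lock. `HostCommitLocker` and `do_commit` are hypothetical names, and the real coordination logic is not described on the slides.

```python
import threading
from collections import defaultdict
from typing import Callable, DefaultDict


class HostCommitLocker:
    """Allow at most one commit at a time per host.

    In-process stand-in: a real deployment would take a distributed
    lock (for example in ZooKeeper) keyed by host name instead.
    """

    def __init__(self) -> None:
        self._guard = threading.Lock()  # protects the lock table itself
        self._locks: DefaultDict[str, threading.Lock] = defaultdict(threading.Lock)

    def commit(self, host: str, core: str,
               do_commit: Callable[[str, str], None]) -> None:
        with self._guard:
            host_lock = self._locks[host]
        with host_lock:  # other cores on this host wait here
            do_commit(host, core)
```

Commits on different hosts still run in parallel, because each host has its own lock.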
![Page 18: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/18.jpg)
Near real time Layer: Storm in action
• Adding documents now is fast
• Keep number of segments small
• Avoid merges on big segments
• Just add new docs (no deletes or updates)
![Page 19: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/19.jpg)
Mixed Architecture: Putting it all together
[Diagram: sources (XML feeds, Kafka partitions) feed the Storm topology, which checks HBase doc info ("Exists?") and bulk-adds to the Solr prod replicas (slaves). The MR Pipeline and ZooKeeper (zk) also appear on the diagram.]
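The "Exists?" check can be sketched as a filter in front of the bulk add: only documents that the doc info store has never seen are passed on, which fits a near real time layer that only adds new docs. A plain dictionary stands in for the HBase table here, and the `id` field and function name are assumptions.

```python
from typing import Dict, Iterable, List

Doc = Dict[str, str]


def select_new_docs(incoming: Iterable[Doc],
                    doc_info: Dict[str, Doc]) -> List[Doc]:
    """Return only unseen docs and remember them in the doc info store.

    `doc_info` is a dict standing in for the HBase doc info table.
    """
    to_add: List[Doc] = []
    for doc in incoming:
        if doc["id"] not in doc_info:  # the "Exists?" lookup
            doc_info[doc["id"]] = doc
            to_add.append(doc)
    return to_add
```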
![Page 20: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/20.jpg)
Mixed Architecture: Swapping indexes
• NRT docs might not be contained in the new batch index (even fresher than the “being built” batch index)
• This can lead to inconsistencies...
![Page 21: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/21.jpg)
Mixed Architecture: Swapping indexes. Time jumps!
![Page 22: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/22.jpg)
Mixed Architecture: Swapping indexes
[Diagram labels: HBase; XML feed t, t+1, t+2; Pipeline t, Pipeline t+1; Slave t, Slave t+1; NRT indexer; Batch indexer.]
![Page 23: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/23.jpg)
Mixed Architecture: Swapping indexes
[Diagram, next build step: same labels as on the previous slide.]
![Page 24: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/24.jpg)
Mixed Architecture: Swapping indexes
[Diagram labels: HBase; XML feed t, t+1, t+2; Pipeline t, Pipeline t+1; Slave t, Slave t+1; NRT indexer; Batch indexer; NRT t+1, NRT t+2.]
![Page 25: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/25.jpg)
Mixed Architecture: Swapping indexes
[Diagram, next build step: same labels as on the previous slide.]
![Page 26: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/26.jpg)
Mixed Architecture: Swapping indexes
• NRT indexed docs must be kept in temporary storage
• Fetch missing docs from the storage and add them before the next deploy
• This avoids time jumps
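A minimal sketch of the catch-up step, assuming the temporary storage can be read as a mapping from document id to document and that the ids contained in the freshly built batch index are known. Both assumptions, and the function name, are illustrative only.

```python
from typing import Dict, List, Set

Doc = Dict[str, str]


def docs_missing_from_batch(nrt_store: Dict[str, Doc],
                            batch_ids: Set[str]) -> List[Doc]:
    """Docs indexed by the NRT layer that the new batch index lacks.

    These must be added to the new index before it is deployed,
    otherwise they would vanish from results after the swap.
    """
    return [doc for doc_id, doc in sorted(nrt_store.items())
            if doc_id not in batch_ids]
```

Anything this returns is fresher than the batch index being built, so it is re-added before the swap to keep results from going back in time.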
![Page 27: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/27.jpg)
Mixed Architecture: Storm and Hadoop
• Near real time inserts, low latency
• Hadoop handles deletes and updates. No rush on those
• No merges on big segments so optimal query response times
• Tolerant to human errors
• Temporary loss of accuracy on the NRT layer
![Page 28: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/28.jpg)
Alternatives: SolrCloud - Why not?
• Good for the vast majority of use cases
• Oriented to incremental inserts/updates/deletes. You pay for real time with segment merges
• We need to deploy full indexes fast (faster than rsync or HTTP replication)
• Full deploys are now easier with aliases
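With aliases, a full deploy can be reduced to building a new collection and then repointing an alias at it. As a hedged illustration, the helper below only builds the URL for the SolrCloud Collections API `CREATEALIAS` action. The base URL, alias and collection names are hypothetical, and nothing on the slides says this is how Trovit would do it.

```python
from urllib.parse import urlencode


def create_alias_url(solr_base_url: str, alias: str, collection: str) -> str:
    """Build a Collections API URL that points `alias` at `collection`."""
    query = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": collection,
    })
    return f"{solr_base_url.rstrip('/')}/admin/collections?{query}"


# Example with made-up names: queries keep using "cars_us" while the
# alias is switched to the newly built collection.
url = create_alias_url("http://localhost:8983/solr", "cars_us", "cars_us_20130501")
```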
![Page 29: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/29.jpg)
Future lines: Lucene real time feature
• Lets you see docs in the index before they are committed
• Good but not a must right now for the use case
• Very easy to integrate on the current architecture
![Page 30: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/30.jpg)
??
![Page 31: Batch indexing & near real time, keeping things fast](https://reader033.vdocuments.site/reader033/viewer/2022060107/554a0ed1b4c90507558b4ad3/html5/thumbnails/31.jpg)
Thanks for your attention!
Marc Sturlese [email protected]
Lucene/Solr Revolution 2013, San Diego, May 1 2013