NetDocuments: Journey from FAST to Solr
DESCRIPTION
Presented by David Hamson & Mou Nandi, NetDocuments - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012 NetDocuments, a SaaS document management company, is migrating its large document repository from Microsoft FAST to Solr. During this presentation, the speakers will discuss the entire process, including major decision points and lessons learned. The migration is a two-phase implementation: the first phase is a shortcut that moves the FAST XML data directly to Solr to make a Solr metadata index available quickly, and the second phase implements the full architecture, including both metadata and full-text processing and search. The presenters will talk about architecting Solr to meet the company's requirements of scaling to billions of work-product documents, low indexing latency, and high availability. NetDocuments uses the search engine both to build the user experience and for document discovery by users. Solr was architected to scale and perform in order to address these two very different needs and also to match all the features and functionality available with FAST. Finally, the presenters will share benchmark results from tests run on various hardware configurations and on different file systems, and also share results from search quality testing, as the capabilities of Solr were tested on a single server with both a single Solr core and multiple Solr cores.
TRANSCRIPT
Journey from FAST to Solr
Presented by: David Hamson, Mou Nandi
Goal of the Session
• NetDocuments
• Why move to Solr from FAST
• Architecting Solr to work as a core module for a cloud document management product: user interface building and document discovery
• Testing and benchmarking Solr to scale and perform for billions of documents with 200 QPS and 200 DPS
• Lessons learned and shortcuts found migrating from FAST to Solr
2/14
Who We Are
A leading cloud content management and collaboration service for small to medium businesses (SMB) and professional services firms
Who We Serve
We service over 1,000 customers across 128 countries worldwide and host over 250 million documents.
Why Migrate to Solr
• Product roadmap does not fit with company roadmap
• Large hardware footprint, expensive to scale
• High indexing latency
• Unpredictable and untraceable document loss
• A black-box search engine, with a dependency on the Microsoft FAST support team
• No control over new features
• Expensive license
• Solr supports massive indexes
• Active, hardworking development community
• Access to what's happening under the hood
• Improved hardware footprint
• Reduced licensing cost
Migration to Solr
[Diagram: existing FAST indexing. Labels: ND Document, FAST Doc Processors, FIXML, FAST Indexer, MDI + FTI; repeated for FAST Instance 1, FAST Instance 2, and more FAST instances.]
• 95% of searches are metadata searches, and the metadata index (MDI) does not need rich text processing
• Flexibility to implement different architectures for the MDI and the full-text index (FTI)
• Even the highest level of FAST logging cannot trace the document loss during heavy feeding traffic
Migration to Solr – Solr Indexing
[Diagram: Solr indexing. Labels: ND Document, ND Pipeline, Aspire, Solr MD XML, Solr FT XML, Solr MD Instance 1 (Solr MDI), Solr FT Instance (Solr FTI); the MD and FT instances are each shown twice.]
The Migration Project
Phase 1 - MDI
• Only create the MDI
• Use FAST data to prototype Solr
• Use the FIXML files to build the Solr index
• Use 100% filter queries
Phase 2 - FTI
• Build a robust feeding pipeline to handle both MD and FT
• Build a text processing pipeline
Phase 3
• Implement new Solr features
Some ft. view of NetDocuments Search Architecture
[Diagram: NetDocuments search architecture. Labels: Web App (shown twice), File System, Web Queue, Query Distributor, Solr MDI, Solr FTI, Administration (monitoring, debugging, stats). NDPipeline components: Dispatcher queue, Dispatcher pool (D1-D5), MD Handler pool (MDH1-MDH5), FT Queue, FT Processor pool (FTP1-FTP5).]
Benchmarking Solr Config Parameters for Indexing
• Created Solr indexes from FIXML files with different RAM buffer, merge factor, and auto-commit configurations
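One way to organize such a run is to enumerate every combination of the three settings up front. The sketch below is only an illustration: the candidate values are invented, not the ones used in the talk, and the `autoCommitMaxTimeMs` key is a made-up label for the auto-commit interval. `ramBufferSizeMB` and `mergeFactor` are the usual solrconfig.xml setting names.

```python
from itertools import product

# Hypothetical candidate values; the talk does not list the ones actually tested.
RAM_BUFFER_MB = [32, 128, 512]
MERGE_FACTOR = [10, 20]
AUTO_COMMIT_MAX_TIME_MS = [5000, 15000]


def config_grid():
    """Return one dict per combination of the three indexing settings."""
    return [
        {
            "ramBufferSizeMB": ram,
            "mergeFactor": merge,
            "autoCommitMaxTimeMs": commit,  # illustrative key name
        }
        for ram, merge, commit in product(
            RAM_BUFFER_MB, MERGE_FACTOR, AUTO_COMMIT_MAX_TIME_MS
        )
    ]


if __name__ == "__main__":
    for cfg in config_grid():
        print(cfg)
```

Each dict would correspond to one index build from the same FIXML input, so that timings are comparable across runs.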
Testing with HDD and SSD
• We did not see any performance difference between HDD (15K rpm) and the ioDrive2 with ND documents
• 15 threads running at a time from the client feeder application
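A feeder that keeps 15 threads busy at a time can be sketched with a thread pool. This is a minimal illustration, not the NetDocuments feeder: `send_document` is a hypothetical stand-in for the real HTTP post to Solr, and the throughput figure it returns is simply documents divided by wall-clock time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

FEEDER_THREADS = 15  # thread count quoted on the slide


def send_document(doc):
    """Hypothetical stand-in for posting one document to Solr."""
    time.sleep(0.001)  # simulate network and indexing latency
    return doc["id"]


def feed(documents, send=send_document, threads=FEEDER_THREADS):
    """Send documents with a fixed-size thread pool; return ids and docs/sec."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map keeps results in the same order as the input documents.
        ids = list(pool.map(send, documents))
    elapsed = time.perf_counter() - started
    rate = len(ids) / elapsed if elapsed > 0 else float("inf")
    return ids, rate


if __name__ == "__main__":
    docs = [{"id": i} for i in range(300)]
    sent, docs_per_sec = feed(docs)
    print(f"sent {len(sent)} documents at about {docs_per_sec:.0f} docs/sec")
```

Running the same feeder against indexes on different storage is what makes the HDD and SSD numbers comparable.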
Testing using different file system
• We did not see a large performance difference between ext3 and XFS on HDD or SSD with ND documents
• We chose ext3 for the FTI, with 15K HDD on RAID 10
• We are using XFS on the ioDrive for the MDI, as suggested by Fusion-io
Benchmarking Solr Indexing and Query Process
                          Search going to 5 shards    Search going to 10 shards
SolrMeter instances       5                           10
Queries per shard         3000/min                    1500/min
Total queries             15000/min                   15000/min
Avg response time         8 ms                        12 ms
CPU                       20%                         32%
RAM                       52 G                        53 G
Cache warmup time         2.5 s                       2.7 s
Cache hit ratio           0.98                        0.98
Cache size                2276                        2276
Evictions                 none                        none
Index update interval     7 sec                       7 sec
Test duration             5 min                       8 min
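The per-minute figures above can be converted to queries per second to compare them with the 200 QPS goal stated at the start of the session. The short calculation below uses only numbers from the benchmark; the function name is illustrative.

```python
def total_qps(shards, queries_per_shard_per_min):
    """Convert a per-shard, per-minute query rate into total queries per second."""
    return shards * queries_per_shard_per_min / 60


five_shards = total_qps(5, 3000)   # 15000 queries/min in total
ten_shards = total_qps(10, 1500)   # 15000 queries/min in total

print(five_shards, ten_shards)  # prints "250.0 250.0"
```

Both runs therefore drive the same total load of 250 queries per second, which is above the 200 QPS target; what differs is how many shards each search fans out to.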
Implemented multi-core indexing and compared its indexing and query performance with that of a single-core index
Benchmark qTime increase as Solr scales and start row increases
qTime does not vary much as the start row increases.
Tuning System queries for Solr
• System searches are metadata searches
• Thousands of real-life queries were extracted from the FAST query log
• Extensive use of filter queries and the filter cache gives excellent response time for complex queries
• Example queries:
FAST Query: ANDNOT(ANDNOT(ANDNOT(AND(AND(ndcabinets:string("cab1", mode="and"),ndcredate:range(2011-09-26T00:00:00,2012-04-13T23:59:59)),FILTER(ndacl:string("acl1 acl2 acl3 ",mode="OR"))),nddeletedcabs:string("cab1", mode="and")),ndexten:string("ndws", mode="and")),ndexten:string("ndflt", mode="and"))
Solr Query: http://solrserver:port/solrSearch/core0/select?shards=solrserver:port/solrSearch/core0,1solrserver:port/solrSearch/core1&start=0&rows=500&fl=ndenvurl,nddocmodnum_s_std,ndtitle_t_idx_std&sort=ndlastmoddate_tdt_idx+desc&q=ndenvurl:*&fq=ndcabinets_smulti_idx:cab1&fq=ndcredate_tdt_idx:[2011-09-26T00:00:00Z TO 2012-04-13T23:59:59Z]&fq={!cache=false cost=100}(ndacl_smulti_idx:acl1 OR ndacl_smulti_idx:acl2 OR ndacl_smulti_idx:acl3)&fq=-nddeletedcabs_smulti_idx:cab1&fq=-ndexten_s_idx:ndws&fq=-ndexten_s_idx:ndflt
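The Solr form of this query can be assembled programmatically, with one `fq` parameter per FAST clause. The sketch below is an assumption-laden illustration rather than the NetDocuments code: the function name, its parameters, and the localhost base URL are hypothetical, while the field names and the `{!cache=false cost=100}` local parameters are copied from the example query.

```python
from urllib.parse import parse_qs, urlencode


def build_solr_query(base_url, cabinet, created_from, created_to, acls,
                     deleted_cabinet, excluded_extensions, start=0, rows=500):
    """Build a select URL in which every restriction is a separate filter query."""
    params = [
        ("start", start),
        ("rows", rows),
        ("fl", "ndenvurl,nddocmodnum_s_std,ndtitle_t_idx_std"),
        ("sort", "ndlastmoddate_tdt_idx desc"),
        ("q", "ndenvurl:*"),  # match everything; the fq clauses do the filtering
        ("fq", f"ndcabinets_smulti_idx:{cabinet}"),
        ("fq", f"ndcredate_tdt_idx:[{created_from} TO {created_to}]"),
    ]
    # The ACL filter varies per user, so the example keeps it out of the
    # filter cache and gives it a higher cost so it runs after cheaper filters.
    acl_clause = " OR ".join(f"ndacl_smulti_idx:{acl}" for acl in acls)
    params.append(("fq", "{!cache=false cost=100}(" + acl_clause + ")"))
    # Each ANDNOT clause in the FAST query becomes a negative filter query.
    params.append(("fq", f"-nddeletedcabs_smulti_idx:{deleted_cabinet}"))
    for extension in excluded_extensions:
        params.append(("fq", f"-ndexten_s_idx:{extension}"))
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_solr_query(
        "http://localhost:8983/solrSearch/core0/select",  # hypothetical host
        cabinet="cab1",
        created_from="2011-09-26T00:00:00Z",
        created_to="2012-04-13T23:59:59Z",
        acls=["acl1", "acl2", "acl3"],
        deleted_cabinet="cab1",
        excluded_extensions=["ndws", "ndflt"],
    )
    print(url)
    print(parse_qs(url.split("?", 1)[1])["fq"])
```

Keeping each restriction in its own `fq` lets Solr cache and reuse the cabinet, date, and extension filters independently across queries, which fits the extensive use of the filter cache described above.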
THANK YOU