Tuning Solr for Logs

DESCRIPTION
Tuning Solr for indexing and searching logs.

TRANSCRIPT
Tuning Solr for Logs
Radu Gheorghe
@radu0gheorghe @sematext
/me does...
Logsene + search consulting = logging consulting
sematext.com/logsene
Tuning. Is it worth it?
|                | baseline | last run |
|----------------|----------|----------|
| # of logs      | 10M      | 310M     |
| EC2 bill/month | $700     | $450     |
What to optimize for?
capacity: how many logs the same hardware can keep while still providing decent performance
What's decent performance? “It depends”
Assumptions:
- indexing: enough to keep up with generated logs*
- search concurrency
- search latency: 2s for debug queries, 5s for charts

*account for spikes!
Enough theory, let's start testing!
Solr instance:
- m3.2xlarge (8 CPU, 30GB RAM, 2x80GB SSD)
- Solr 4.10.1

Feeder instance:
- c3.2xlarge (8 CPU, 15GB RAM, 2x80GB SSD)
- Apache access logs
- a Python script to parse and feed them
Baseline test
- 15GB heap
- debug query: status:404 in the last hour
- charts queries:
  - all-time status counters
  - all-time top IPs
  - user agent word cloud
http://blog.sematext.com/2013/12/19/getting-started-with-logstash/
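As raw Solr requests, those baseline queries would look roughly like this (a sketch: the field names and the facet parameters are assumptions based on the access-log use case; `NOW-1HOUR` is standard Solr date math):

```
# debug query: 404s in the last hour
q=status:404&fq=timestamp:[NOW-1HOUR TO NOW]

# charts queries: all-time facets over the whole index
q=*:*&rows=0&facet=true&facet.field=status
q=*:*&rows=0&facet=true&facet.field=ip
q=*:*&rows=0&facet=true&facet.field=user_agent
```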
Baseline result
[Chart: debug queries, charts queries and indexing EPS vs. number of indexed logs (100K to 10M)]
![Page 10: Tuning Solr for Logs](https://reader034.vdocuments.site/reader034/viewer/2022051413/559446061a28ab0e738b4619/html5/thumbnails/10.jpg)
100K 2.5M 4M 6M 9M 10M0
2000
4000
6000
8000
10000
12000
debugchartsEPS
Baseline result
capacity
![Page 11: Tuning Solr for Logs](https://reader034.vdocuments.site/reader034/viewer/2022051413/559446061a28ab0e738b4619/html5/thumbnails/11.jpg)
100K 2.5M 4M 6M 9M 10M0
2000
4000
6000
8000
10000
12000
debugchartsEPS
Baseline result
capacitybottleneck: facets eat CPU
![Page 12: Tuning Solr for Logs](https://reader034.vdocuments.site/reader034/viewer/2022051413/559446061a28ab0e738b4619/html5/thumbnails/12.jpg)
100K 2.5M 4M 6M 9M 10M0
2000
4000
6000
8000
10000
12000
debugchartsEPS
Baseline result
capacitybottleneck: facets eat CPU
on average,CPU is OK
![Page 13: Tuning Solr for Logs](https://reader034.vdocuments.site/reader034/viewer/2022051413/559446061a28ab0e738b4619/html5/thumbnails/13.jpg)
100K 2.5M 4M 6M 9M 10M0
2000
4000
6000
8000
10000
12000
debugchartsEPS
Baseline result
capacitybottleneck: facets eat CPU
indexing limitedbecause pythonscripts eatsfeeder CPU
on average,CPU is OK
Indexing throughput: is it enough? "It depends"
- how long do you keep your logs?
  - 1M logs/day * 10 days vs. 0.3M logs/day * 30 days: both need ~10M capacity
  - 1M logs/day * 30 days? Needs 3 servers, each getting ~0.3M logs/day
- how big are your spikes? (assumption: 10x regular load)
- baseline run: the 10M index fills up in under half an hour at 7K EPS
- 7K EPS is enough for 10M capacity if you keep logs >5h
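The spike arithmetic can be sketched as a tiny helper (illustrative only, not from the talk; the raw numbers give ~4h, and the slide's ">5h" adds a safety margin):

```python
def min_retention_hours(capacity_docs, max_eps, spike_factor=10):
    """Smallest retention window for which an indexing ceiling is enough.

    With capacity_docs logs kept over R hours, the regular ingest rate
    is capacity_docs / (R * 3600) EPS; a spike multiplies that by
    spike_factor, and the spike must stay below max_eps.
    Solving spike_factor * capacity_docs / (R * 3600) <= max_eps for R:
    """
    return spike_factor * capacity_docs / (max_eps * 3600.0)

# Baseline: 10M capacity, 7K EPS ceiling, 10x spikes
print(round(min_retention_hours(10_000_000, 7_000), 1))  # → 4.0
```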
Rare commits: 10% above baseline
[Chart: debug queries, charts queries and indexing EPS vs. number of indexed logs (1.5M to 11M)]
- auto soft commits every 5 seconds
- auto hard commits every 30 minutes
- 200MB RAM buffer (ramBufferSizeMB=200); maxBufferedDocs=10M
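In solrconfig.xml, these settings would look roughly like this (a sketch with the values from this run; the element names are standard Solr 4.x config, and `openSearcher=false` is an assumption, the usual companion to rare hard commits):

```xml
<!-- sketch of the commit settings used in the "rare commits" run -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoSoftCommit>
    <maxTime>5000</maxTime>       <!-- soft commit every 5 seconds -->
  </autoSoftCommit>
  <autoCommit>
    <maxTime>1800000</maxTime>    <!-- hard commit every 30 minutes -->
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
<indexConfig>
  <ramBufferSizeMB>200</ramBufferSizeMB>
  <maxBufferedDocs>10000000</maxBufferedDocs>
</indexConfig>
```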
Same results with:
- even rarer commits (auto-soft every 30s, 500MB buffer)
- omitNorms + omitTermFreqAndPositions (but indexing was cheaper)
- larger caches
- cache autowarming (manually ran queries, too)
- THP disabled
- mergeFactor 5
- mergeFactor 20
DocValues on IP and status code: 20% above baseline
[Chart: debug queries, charts queries and indexing EPS vs. number of indexed logs (1.5M to 12M)]
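Enabling DocValues is a per-field flag in schema.xml; a sketch under assumed field names for the access-log schema (reindexing is required after changing it):

```xml
<!-- sketch: DocValues on the fields the charts queries facet on -->
<field name="status" type="int"    indexed="true" stored="true" docValues="true"/>
<field name="ip"     type="string" indexed="true" stored="true" docValues="true"/>
```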
Detour: what if user agent was a string field? 3.6x baseline
[Chart: debug queries, charts queries and indexing EPS vs. number of indexed logs (3M to 36M)]
... and if user agent used DocValues? 6.7x baseline
[Chart: debug queries, charts queries and indexing EPS vs. number of indexed logs (8M to 70.5M)]
reducing indexing adds 5% capacity
Time-based collections (1 minute): 2.7x baseline
[Chart: debug queries, charts queries and indexing EPS vs. number of indexed logs (3M to 28M)]
OOM (150 collections)
Time-based collections (10 minutes): 21x baseline
[Chart: debug queries, charts queries and indexing EPS vs. number of indexed logs (10M to 213M)]
still OOM (~100 collections)
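Time-based collections just need a deterministic name per time bucket, so the feeder knows where each log goes and queries can hit only the relevant collections. A sketch of 10-minute bucketing (the naming scheme is hypothetical, not the one from the talk):

```python
from datetime import datetime

def collection_for(ts, bucket_minutes=10, prefix="logs"):
    """Map a log timestamp to its time-based collection name."""
    # floor the timestamp to the start of its bucket
    floored = ts.replace(minute=ts.minute - ts.minute % bucket_minutes,
                         second=0, microsecond=0)
    return floored.strftime(f"{prefix}_%Y%m%d_%H%M")

print(collection_for(datetime(2014, 11, 7, 14, 37, 22)))  # → logs_20141107_1430
```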
10-minute collections: 20GB heap; optimize old collections
[Chart: debug queries, charts queries and indexing EPS vs. number of indexed logs (50M to 340M)]
- 20GB heap: 31x baseline, 5 days projected retention with 10x spikes; no more OOM, just slower queries
- optimizing old collections: 34x baseline, 10 days projected retention (10x spikes)
Software optimizations recap
| Definitely worth it    | Nice to have                               | I wouldn't bother       |
|------------------------|--------------------------------------------|-------------------------|
| time-based collections | noop I/O scheduler                         | merge policy tuning     |
| DocValues              | omit norms, term frequencies and positions | autowarm                |
| rare soft commits      | optimize "old" collections                 | super-rare soft commits |
|                        |                                            | disable THP             |
r3.2xlarge: +30GB RAM, +$0.14/h, 1x160GB SSD
[Chart: debug queries, charts queries and indexing EPS vs. number of indexed logs (20M to 372M)]
- 37x baseline, 9 days projected retention with 10x spikes
- less indexing throughput than m3.2xlarge
c3.2xlarge: -15GB RAM, -$0.14/h
[Chart: debug queries, charts queries and indexing EPS vs. number of indexed logs (20M to 177M)]
- 17x baseline, 5 days projected retention with 10x spikes
Monthly EC2 cost per 1M logs*
- m3.2xlarge: $1.30
- r3.2xlarge: $1.33
- c3.2xlarge: $1.78

TODO (a.k.a. truth always messes with simplicity):
- more/expensive facets => more CPU => c3 looks better
- less/cheap facets => not enough instance storage => EBS (magnetic/SSD/provisioned IOPS)? storage-optimized i2? old-gen instances with magnetic instance storage?
- use different instance types for "hot" and "cold" collections?

*on-demand pricing as of 2014-11-07
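The cost-per-log figure is just monthly instance cost divided by capacity; a sketch (the $0.56/h price below is a placeholder chosen to reproduce the slide's $1.30 figure, not necessarily the actual 2014 rate):

```python
def monthly_cost_per_1m_logs(hourly_price, capacity_logs, hours_per_month=720):
    """EC2 cost per month divided by capacity, in $ per 1M logs."""
    return hourly_price * hours_per_month / (capacity_logs / 1_000_000)

# hypothetical $0.56/h price, ~310M capacity from the last m3.2xlarge run
print(round(monthly_cost_per_1m_logs(0.56, 310_000_000), 2))  # → 1.3
```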
How NOT to build an indexing pipeline
A custom script that:
- reads Apache logs from files
- parses them using regex
- takes 100% CPU and 100% RAM of a c3.2xlarge instance
- maxes out at 7K EPS
Enter Apache Flume*
*Or Logstash. Or rsyslog. Or syslog-ng. Or any other specialized event processing tool
```
# source: reads files dropped into the spool directory
agent.sources = spoolSrc
agent.sources.spoolSrc.type = spooldir
agent.sources.spoolSrc.spoolDir = /var/log
agent.sources.spoolSrc.channels = solrChannel

# channel: buffers events on disk
agent.channels = solrChannel
agent.channels.solrChannel.type = file

# sink: runs the morphline and loads Solr
# (put the Solr and Morphline jars in lib/)
agent.sinks = solrSink
agent.sinks.solrSink.channel = solrChannel
agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.morphlineFile = conf/morphline.conf
agent.sinks.solrSink.morphlineId = 1
```
morphline.conf (think Unix pipes):

```
morphlines : [
  {
    id : 1    # same ID as in the flume.conf sink definition
    commands : [
      # process one line at a time (there's also readMultiLine)
      { readLine { charset : UTF-8 } }

      # parses each property (e.g. IP, status code) into its own field
      { grok {
          dictionaryFiles : [conf/grok-patterns]
          expressions : { message : """%{COMBINEDAPACHELOG}""" }
      } }

      # Solr can generate IDs, too*
      { generateUUID { field : id } }

      { loadSolr {
          solrLocator : {
            collection : collection1
            solrUrl : "http://10.233.54.118:8983/solr/"   # use zkHost for SolrCloud
          }
      } }
    ]
  }
]
```

grok-patterns: https://github.com/cloudera/search/blob/master/samples/solr-nrt/grok-dictionaries/grok-patterns
*http://solr.pl/en/2013/07/08/automatically-generate-document-identifiers-solr-4-x/
Result: 2.4K EPS, feeder machine almost idle
2.4K EPS is typically enough for this:
[Diagram: several application servers, each running its own Flume agent, sending directly to Solr]
scales nicely with the number of servers, but all buffering and processing is done on the application servers
but not for this:
[Diagram: application servers with local Flume agents forwarding to a dedicated tier of Flume agents that does centralized buffering and processing]
or this:
[Diagram: application servers forwarding to one Flume tier that buffers, then a separate tier that processes]
Increase throughput: batch sizes; memory channel

```
agent.sources = spoolSrc
agent.sources.spoolSrc.type = spooldir
agent.sources.spoolSrc.spoolDir = /var/log
agent.sources.spoolSrc.batchSize = 5000
agent.sources.spoolSrc.channels = solrChannel

# memory channel instead of file; make sure you have enough heap
agent.channels = solrChannel
agent.channels.solrChannel.type = memory
agent.channels.solrChannel.capacity = 1000000
agent.channels.solrChannel.transactionCapacity = 5000

agent.sinks = solrSink
agent.sinks.solrSink.channel = solrChannel
agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.morphlineFile = conf/morphline.conf
agent.sinks.solrSink.morphlineId = 1
agent.sinks.solrSink.batchSize = 5000
```

and in morphline.conf:

```
solrLocator : {
  collection : collection1
  solrUrl : "http://10.233.54.118:8983/solr/"
  batchSize : 5000
}
```
Result: 10K EPS, 6% CPU usage (2x baseline)
More throughput? Parallelize
Where depends* on the bottleneck:
- source: more threads (if applicable), or more sources feeding one channel
- channel: a multiplexing channel selector to spread one source over several channels
- sink: more threads (if applicable), or a load-balancing sink processor over several sinks
*last time I use this word, I promise
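The load-balancing sink processor, for example, is a few lines of flume.conf (a sketch; the sink names and the round_robin selector are assumptions):

```
# two Solr sinks draining the same channel, load-balanced
agent.sinks = solrSink1 solrSink2
agent.sinkgroups = lbGroup
agent.sinkgroups.lbGroup.sinks = solrSink1 solrSink2
agent.sinkgroups.lbGroup.processor.type = load_balance
agent.sinkgroups.lbGroup.processor.selector = round_robin
agent.sinkgroups.lbGroup.processor.backoff = true
```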
Result: default Solr install maxed out at 24K EPS
TODO: log in JSON where you can
Then, in morphline.conf, replace the grok command with the much lighter: readJson {}
Easy with Apache logs, maybe not for other apps:

```
LogFormat "{ \
    \"@timestamp\": \"%{%Y-%m-%dT%H:%M:%S%z}t\", \
    \"message\": \"%h %l %u %t \\\"%r\\\" %>s %b\", \
    ... \
    \"method\": \"%m\", \
    \"referer\": \"%{Referer}i\", \
    \"useragent\": \"%{User-agent}i\" \
  }" ls_apache_json
CustomLog /var/log/apache2/logstash_test.ls_json ls_apache_json
```

More details at: http://untergeek.com/2013/09/11/getting-apache-to-output-json-for-logstash-1-2-x/
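Why readJson is so much lighter than grok, in miniature (a hypothetical log line; the stdlib parsers stand in for the morphline commands):

```python
import json
import re

# a hypothetical JSON-formatted access log line, as a LogFormat like the above would emit
line = '{"@timestamp": "2014-11-07T14:30:00+0000", "method": "GET", "status": 404}'

# JSON route: one library call, fields come out named and typed
doc = json.loads(line)

# regex route (what grok does under the hood): a pattern per field,
# matched over the whole line, everything comes out as a string
pattern = re.compile(r'"method":\s*"(?P<method>\w+)",\s*"status":\s*(?P<status>\d+)')
m = pattern.search(line)

print(doc["method"], m.group("status"))  # → GET 404
```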
Conclusions
- Use time-based collections and DocValues
- Rare soft & hard commits are good; pushing them too far is probably not worth it
- Hardware: test and see what works for you; a balanced, SSD-backed machine (like m3) is a good start
- Use specialized event processing tools; Apache Flume is a fine example
- Processing and buffering on the application server side scales better
- Buffer before [heavy] processing
- Mind your batch sizes, buffer types and parallelization
- Log in JSON where you can
Thank you!
Feel free to poke me @radu0gheorghe
Check us out at the booth, sematext.com and @sematext
We're hiring, too!