Building a Log Analysis Pipeline
DESCRIPTION
Quick internal presentation on work we've been doing to deploy an ELK stack for our security analysis needs.
TRANSCRIPT
Building a Log Analysis Pipeline
A BRIEF TOUR
Problem
Limited visibility into the environment
SIEM solutions inadequate for risk management purposes
Requests for extracts difficult or impossible to provide
Unable to connect different data sources
Requirements
Cheap
◦ Budget + Labor 0
◦ Hobby project
Scalable
◦ SIEM data in the TB range
◦ Need to have historical data
◦ Decoupled from logging infrastructure
Performance
◦ Batch processing is okay
◦ …but batches can’t be too slow
◦ Need near real-time exploration options
Confidentiality
◦ This is security data. Let’s not create more problems than solutions.
Resources
SIEM does a good job with log aggregation
◦ Stores raw syslog events
Easy access to raw events on the SIEM
Data is relatively large, but not BIG
A Plan Is Born
“I have a cunning plan!” – S. Baldrick, Blackadder
Early Approaches
METHOD 1 – MONGODB
◦ Python regexp to create JSON
◦ Load to MongoDB
◦ Run Mongo MapReduce
Worked – but slow. Required AWS for sufficient memory to run MapReduce flows
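The first step of Method 1 (regexp to JSON) might have looked roughly like the sketch below. The log line layout, the field names, and the `parse_line` / `to_json_lines` helpers are assumptions made for illustration; the slides do not show the real SIEM format or the original script.

```python
import json
import re

# ASSUMPTION: a made-up firewall syslog layout, e.g.
#   "Mar  3 14:07:55 fw01 BLOCK 10.0.0.5:51515 -> 192.0.2.10:443"
# The real raw events on the SIEM may look quite different.
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<action>PASS|BLOCK) "
    r"(?P<src_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<src_port>\d+) -> "
    r"(?P<dst_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<dst_port>\d+)$"
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Ports are more useful as integers once loaded into a document store.
    record["src_port"] = int(record["src_port"])
    record["dst_port"] = int(record["dst_port"])
    return record


def to_json_lines(lines):
    """Yield one JSON document per parseable line; unparseable lines are skipped."""
    for line in lines:
        record = parse_line(line)
        if record is not None:
            yield json.dumps(record, sort_keys=True)


SAMPLE = [
    "Mar  3 14:07:55 fw01 BLOCK 10.0.0.5:51515 -> 192.0.2.10:443",
    "not a firewall line",
]

for doc in to_json_lines(SAMPLE):
    print(doc)
```

Each printed line is a standalone JSON document, which is the shape a bulk loader for a document store would typically expect.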
METHOD 2 – PURE PYTHON
◦ Python regexp to create CSV
◦ Pull off to Analysis Workspace
◦ Python MapReduce in shell
Worked – but limited and rigid
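A minimal sketch of what a pure-Python map/reduce over a CSV extract could look like. The column names and sample rows are hypothetical, and the original job was not necessarily structured this way.

```python
import csv
import io
from collections import defaultdict
from functools import reduce

# ASSUMPTION: hypothetical extract columns; the slides do not list the real ones.
CSV_TEXT = """src_ip,dst_ip,action
10.0.0.5,192.0.2.10,BLOCK
10.0.0.5,192.0.2.10,BLOCK
10.0.0.7,192.0.2.10,PASS
"""


def map_row(row):
    """Map step: emit ((src, dst, action), 1) for one CSV row."""
    return (row["src_ip"], row["dst_ip"], row["action"]), 1


def reduce_counts(totals, pair):
    """Reduce step: fold one (key, count) pair into the running totals."""
    key, count = pair
    totals[key] += count
    return totals


def count_flows(csv_file):
    """Count rows per (src, dst, action) key in a CSV file object."""
    rows = csv.DictReader(csv_file)
    return dict(reduce(reduce_counts, map(map_row, rows), defaultdict(int)))


counts = count_flows(io.StringIO(CSV_TEXT))
for key, total in sorted(counts.items()):
    print(key, total)
```

The rigidity is visible here: only the columns kept in the extract can be grouped on, so a question about a dropped field means regenerating the CSV and rewriting the job.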
Premature Data Truncation Leads to Poor Results
Lose ability to query context
Additional queries not possible without custom redesign
◦ Blocks vs. Passes
◦ Port information
Querying peer node relations, etc. not practical
Unleash the ELK!
Elasticsearch
◦ Full text search engine based on Apache Lucene
◦ Incredibly fast and flexible query DSL
◦ Built for distributed search (horizontal scale) from the ground up
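To give a feel for the query DSL, here is a sketch that builds a request body as a plain Python dict. The field names (`action`, `dst_port`, `src_ip`) and the helper name are assumptions, and the `bool` / `filter` syntax shown is that of recent Elasticsearch releases, which may differ from the version used for this work.

```python
import json


def blocked_traffic_query(dst_port, since="now-24h", top_n=10):
    """Build a search body: top source IPs for blocked traffic to one port.

    ASSUMPTION: the indexed events carry 'action', 'dst_port', 'src_ip'
    and '@timestamp' fields; none of these names come from the slides.
    """
    return {
        "size": 0,  # only the aggregation is wanted, not the matching hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"action": "BLOCK"}},
                    {"term": {"dst_port": dst_port}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "aggs": {
            "top_sources": {"terms": {"field": "src_ip", "size": top_n}}
        },
    }


body = json.dumps(blocked_traffic_query(443), indent=2)
print(body)
```

The resulting JSON would be sent as the body of a search request; changing the question (a different port, passes instead of blocks) only means changing the dict, not reprocessing the data.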
Logstash
Open Source log intake and processor
Easy-to-use pattern matching
◦ No more opaque regexes!
Terrific metadata enrichment
Scores of plugins
◦ Inputs, outputs, filters, codecs
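The pattern matching idea can be illustrated in Python: Logstash's grok filter lets you write `%{NAME:field}` tokens instead of raw regular expressions. The sketch below imitates that expansion with a tiny, simplified pattern table. It is not Logstash's implementation, and the three pattern definitions are rough stand-ins for the real ones.

```python
import re

# ASSUMPTION: simplified stand-ins for a few grok pattern names.
# Logstash ships far more thorough definitions.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "INT": r"\d+",
    "WORD": r"\w+",
}

# Matches grok-style tokens such as %{IP:src_ip}
GROK_TOKEN = re.compile(r"%\{(\w+):(\w+)\}")


def compile_grok(expression):
    """Expand %{NAME:field} tokens into named regex groups and compile."""
    def expand(match):
        name, field = match.groups()
        return "(?P<{}>{})".format(field, PATTERNS[name])

    return re.compile(GROK_TOKEN.sub(expand, expression))


matcher = compile_grok(
    r"%{WORD:action} %{IP:src_ip}:%{INT:src_port} -> %{IP:dst_ip}:%{INT:dst_port}"
)
fields = matcher.match("BLOCK 10.0.0.5:51515 -> 192.0.2.10:443").groupdict()
print(fields)
```

The readable expression passed to `compile_grok` is the part a person maintains; the opaque regex is generated from it.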
Kibana
◦ Lightweight HTML5 interface to Elasticsearch for logs
◦ Not a full SIEM replacement
◦ Targeting the Splunk market
Infrastructure
On SIEM
◦ Python for creating extracts
◦ Bash for tarring up raw logs
Transport
◦ SCP from SIEM to Windows file share
◦ USB from Windows file share
◦ Sneakernet to analysis workspace
On Analysis Workspace
◦ Vagrant
◦ Chef
Demo
Pieces Involved
Next Steps – Infrastructure
Complete provisioning scripts for Hadoop & AWS
Transfer raw GZ files to encrypted S3 bucket
◦ Allow AWS EMR extract jobs to run
Process via Logstash into Elasticsearch
◦ Elasticsearch for short-term exploration
◦ Archive structured data to S3
Set up Elasticsearch-Hadoop connector
Use AWS EMR to do ad hoc extracts off of structured S3 buckets
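Logstash's Elasticsearch output normally handles indexing on its own, but it can help to see roughly what that step amounts to. The sketch below formats structured events as a newline-delimited bulk request body, using the daily `logstash-YYYY.MM.dd` index naming that Logstash has traditionally defaulted to. The helper names and the sample event are hypothetical, and details of the bulk format vary by Elasticsearch version.

```python
import json
from datetime import datetime


def daily_index(timestamp, prefix="logstash"):
    """Return a per-day index name such as 'logstash-2014.03.03'."""
    return "{}-{}".format(prefix, timestamp.strftime("%Y.%m.%d"))


def bulk_body(events):
    """Format events as a bulk body: one action line, then one source line.

    ASSUMPTION: each event has an '@timestamp' in 'YYYY-MM-DDTHH:MM:SS' form.
    """
    lines = []
    for event in events:
        ts = datetime.strptime(event["@timestamp"], "%Y-%m-%dT%H:%M:%S")
        lines.append(json.dumps({"index": {"_index": daily_index(ts)}}))
        lines.append(json.dumps(event, sort_keys=True))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


EVENTS = [
    {"@timestamp": "2014-03-03T14:07:55", "action": "BLOCK", "src_ip": "10.0.0.5"},
]

print(bulk_body(EVENTS), end="")
```

Daily indices suit the split described above: recent days stay in Elasticsearch for exploration, while older days can be dropped once the structured data is archived.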
Next Steps – Data Products
Full MaxMind integration
◦ Accuracy & detail
Reputation
◦ REN-ISAC integration
Graph exploration
◦ Who else talked to whom
◦ Clustering
Future
◦ Proxy logs
◦ DNS logs
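The graph exploration idea (who else talked to whom, clustering) can be sketched with the standard library alone. Connected components are used here as a deliberately simple stand-in for clustering; the flow pairs and function names are made up for illustration.

```python
from collections import defaultdict, deque

# ASSUMPTION: (source, destination) pairs from a hypothetical extract.
FLOWS = [
    ("10.0.0.5", "192.0.2.10"),
    ("10.0.0.7", "192.0.2.10"),
    ("10.0.0.9", "198.51.100.4"),
]


def build_graph(flows):
    """Build an undirected adjacency map from (src, dst) pairs."""
    graph = defaultdict(set)
    for src, dst in flows:
        graph[src].add(dst)
        graph[dst].add(src)
    return graph


def peers_of_peers(graph, node):
    """Hosts that share at least one peer with `node`."""
    shared = set()
    for peer in graph[node]:
        shared |= graph[peer]
    shared.discard(node)
    return shared


def clusters(graph):
    """Group hosts into connected components using breadth-first search."""
    seen = set()
    result = []
    for start in sorted(graph):
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        result.append(component)
    return result


graph = build_graph(FLOWS)
print(peers_of_peers(graph, "10.0.0.5"))
print(clusters(graph))
```

Given one host of interest, `peers_of_peers` answers "who else talked to the same destinations", which is the kind of peer-relation question the earlier CSV approach could not support.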
Thanks
Google Groups
IRC #logstash, #chef, #vagrant, #elasticsearch
Seattle Search and Machine Learning Meetup
Seattle Chef Meetup
Hortonworks Sandbox
The Phoenix Project
Data-Driven Security
AlienVault
…and more!