Cruising on a Security Data Lake - Secure360
TRANSCRIPT
Cruising on a Security Data Lake
Solving Big Data Challenges in SECOPS
Charles Herring, Co-Founder, CTO
CharlesHerring.com | WitFoo.com
@charlesherring
About Charles
1995-2002: Forward Deployed US Navy Hornet Avionics Tech
2002-2005: US Naval Postgraduate School Network Security Group Division Officer
sk3wl 0f r00t team member
2003-2008: InfoWorld Test Center Contributing Product Reviewer – Network and Information Security
2005-2012: DoD Security, Data & Workflow Consultant
2012-2016: Consulting Security Architect for Lancope then Cisco Systems
2016: Chief Nerd (CTO) & co-founder at WitFoo
Understanding SECOPS Big Data Goals
• Horizontal scale for long-term retention
• Streaming processing (near real time)
• Retrospective analysis of data
• Diverse data inputs at reasonable integration costs
• Normalized data for ad hoc analysis
• Machine learning
• Reasonable RAM and CPU costs (disk is cheap)
• Law Enforcement integrity
• Contextualization & state management
WitFoo Research: SECOPS 7 Unstable Conversations
1. Responders cannot understand/process what their tools are communicating.
2. Security Managers cannot understand what resources their responders need.
3. Security Managers cannot effectively communicate with the broader business.
4. Organizations cannot hold vendors accountable.
5. Organizations cannot safely share information with each other in an operational way.
6. Organizations cannot report cybercrime to Law Enforcement in a way that does not create organizational risk.
7. Law Enforcement cannot effectively prosecute cyber criminals due to insufficient or complex evidence.
Playing Data “Long Game”
[Diagram; labels recovered from the slide]
Data Collection: Infinite Horizontal Scale, Open Access to Data, Linear Scale
Analysis: Signature, Behavioral, Anomaly (Machine Learning)
Business Metrics: FTE Requirements, Personnel Efficiency, Tool Efficiency, Compliance/Readiness
Roles: Analyst, Threat Hunter, Security Architect, Manager, Executive, Board of Directors
WitFoo Precinct: 5 Full Cycles of Failure
Generation | RDBM | Raw Data Cluster | Processing | Buffering
1st (Fife) | MySQL | HDFS | MapReduce | -
2nd (McNulty) | MySQL | Cassandra | Logstash | Kafka
3rd (Kojak) | MySQL | ElasticSearch | Logstash | Kafka
4th (Foley) | MySQL | ElasticSearch | Custom Scala | Kafka
5th (Deckard) | MySQL with NDB Cluster (RDBM and raw data) | Custom Scala | Kafka
6th (Benson) | MySQL/InnoDB with Manual Distro (RDBM and raw data) | Spark + Scala | Kafka
MapReduce and other disk failures
• IOPS = Input/Output Operations per Second
• Cybersecurity analysis requires high IOPS (10k – 1M)
• Disk storage cost is low, but IOPS are expensive
• Memory and indexing are needed
• MapReduce & HDFS fail because of low IOPS throughput
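As a rough illustration of why IOPS rather than capacity drives cost, the sketch below estimates how many storage devices it takes to reach the 10k to 1M IOPS range quoted on the slide. The per-device figures (about 150 random IOPS for a spinning disk, about 50,000 for an SSD) are round-number assumptions for illustration, not measurements from the talk.

```python
import math

# Assumed per-device random IOPS; illustrative round numbers, not from the talk.
ASSUMED_DEVICE_IOPS = {"hdd": 150, "ssd": 50_000}


def devices_needed(target_iops: int, device_iops: int) -> int:
    """Return the minimum number of devices whose combined IOPS meets the target."""
    return math.ceil(target_iops / device_iops)


if __name__ == "__main__":
    for target in (10_000, 1_000_000):  # range quoted on the slide
        for kind, iops in ASSUMED_DEVICE_IOPS.items():
            count = devices_needed(target, iops)
            print(f"{target:>9} IOPS on {kind}: {count} devices")
```

Under these assumptions the high end of the range needs thousands of spinning disks, which is the gap that memory and indexing are meant to close.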
Big Data Pipeline Game Pieces
Pieces: Inputs, Buffer, Parser, Multi-index, Normalized, RDBM/Graph, Pre-process, State/ML
Stages: Input, Store/Index, Analyze
Data Questions
• When to normalize? (the "normalize never" problem)
• To JOIN or not to JOIN?
• Linear data scale?
• Disk to RAM ratio requirements?
• To index/parse or full string search?
• OSS vs commercial support
• Connector interoperability
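The "index/parse or full string search" question can be made concrete with a small sketch. The log lines and the whitespace tokenizer below are hypothetical; the point is only the trade-off: a full scan has no ingest cost but touches every line per query, while an index costs work and memory at ingest and then answers exact-token lookups directly.

```python
from collections import defaultdict

# Hypothetical log lines, for illustration only.
LOGS = [
    "2024-01-01 sshd failed login user=root src=10.0.0.5",
    "2024-01-01 sshd accepted login user=alice src=10.0.0.9",
    "2024-01-02 firewall deny src=10.0.0.5 dst=192.168.1.10",
]


def full_string_search(logs, term):
    """Scan every line for the term: no upfront cost, O(n) work per query."""
    return [i for i, line in enumerate(logs) if term in line]


def build_index(logs):
    """Tokenize once at ingest; map each whitespace-separated token to line numbers."""
    index = defaultdict(set)
    for i, line in enumerate(logs):
        for token in line.split():
            index[token].add(i)
    return index


def index_search(index, token):
    """Look up an exact token with a single dictionary access."""
    return sorted(index.get(token, set()))
```

Note the flexibility cost: `full_string_search(LOGS, "10.0.0")` matches all three lines, while the index only answers for tokens it was built with, so the same partial term finds nothing.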
Distributed Storage
• Broker – handles replication; provides logical access to data
• Linear or non-linear scale
• Memory dependency on scale
• Relational/Graph or NoSQL
• Code must bridge gaps in data storage
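[Table: storage technologies (shown as logos, names not recovered) compared on: Broker? Linear scale? Memory independent? Relational? Only "No" entries survived extraction.]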
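To make the broker role concrete, here is a minimal sketch of a broker that handles replication and gives callers logical access to data without exposing which node holds it. The `Broker` class, its hash-based placement, and the in-memory dictionaries standing in for storage nodes are all hypothetical and do not describe any specific product from the talk.

```python
import hashlib


class Broker:
    """Hypothetical broker: picks replica nodes per key and hides them from callers."""

    def __init__(self, node_count: int, replicas: int = 2):
        if not 1 <= replicas <= node_count:
            raise ValueError("replicas must be between 1 and node_count")
        # Each dict stands in for one storage node.
        self.nodes = [dict() for _ in range(node_count)]
        self.replicas = replicas

    def _owners(self, key: str):
        """Return indexes of the nodes that hold this key (deterministic placement)."""
        start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(self.nodes)
        return [(start + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key: str, value):
        """Write the value to every replica that owns the key."""
        for n in self._owners(key):
            self.nodes[n][key] = value

    def get(self, key: str, failed=()):
        """Read from the first owning node that is not marked as failed."""
        for n in self._owners(key):
            if n not in failed and key in self.nodes[n]:
                return self.nodes[n][key]
        raise KeyError(key)
```

A caller only ever uses `put` and `get`; a read still succeeds when one owning node is passed in `failed`, because the broker falls through to the next replica.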
Data Structure Comparison
Criterion | Multi-index | Normalized | Relational
Creation Difficulty | Low | High | Medium
Query Difficulty | Low | Medium | High
Visualization Difficulty | High | High | Low
State/Entity Management Difficulty | High | High | Low
Baseline Difficulty | High | High | Medium
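The three structures compared above can be sketched with one event stored three ways. The raw event, its vendor field names, the shared schema (`client_ip`, `server_ip`, `action`) and the table layout are all hypothetical, chosen only to show why creation gets harder and entity questions get easier as the data moves toward the relational form.

```python
import sqlite3

# Hypothetical raw event; field names are illustrative only.
raw = {"vendor": "acme-fw", "srcip": "10.0.0.5", "dstip": "192.168.1.10", "act": "deny"}

# Multi-index: store the document as-is and index every field (low creation effort).
multi_index = {}
for field, value in raw.items():
    multi_index.setdefault((field, value), []).append(raw)

# Normalized: map vendor-specific names onto one shared schema.
FIELD_MAP = {"srcip": "client_ip", "dstip": "server_ip", "act": "action"}
normalized = {FIELD_MAP.get(k, k): v for k, v in raw.items()}

# Relational: entities become rows, and events reference them by key.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE host (id INTEGER PRIMARY KEY, ip TEXT UNIQUE);
    CREATE TABLE event (id INTEGER PRIMARY KEY,
                        client INTEGER REFERENCES host(id),
                        server INTEGER REFERENCES host(id),
                        action TEXT);
""")


def host_id(ip):
    """Insert the host if it is new and return its row id."""
    db.execute("INSERT OR IGNORE INTO host (ip) VALUES (?)", (ip,))
    return db.execute("SELECT id FROM host WHERE ip = ?", (ip,)).fetchone()[0]


db.execute(
    "INSERT INTO event (client, server, action) VALUES (?, ?, ?)",
    (host_id(normalized["client_ip"]), host_id(normalized["server_ip"]),
     normalized["action"]),
)

# Entity question answered with a JOIN: which servers did 10.0.0.5 contact?
rows = db.execute("""
    SELECT s.ip, e.action FROM event e
    JOIN host c ON c.id = e.client
    JOIN host s ON s.id = e.server
    WHERE c.ip = ?""", ("10.0.0.5",)).fetchall()
```

The multi-index form needed no schema decisions at all, while the relational form needed a field mapping, a table design and a JOIN before the first question could be asked.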
Evaluated Technologies
Parsing Questions
• Is processing compute sub-linear, linear or greater?
• What is the cost of creating and maintaining parsing?
• Is parsing important?
• Multi-index vs single (normalized) index
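The compute question above can be illustrated with a small parser sketch. The two message formats, the regular expressions and the output field names are hypothetical. When the source of a message is known, dispatching to a single pattern keeps the cost at one match per message, so compute grows linearly with volume; trying every pattern against every message grows with messages times patterns.

```python
import re

# Hypothetical formats and field names, for illustration only.
PARSERS = {
    "sshd": re.compile(r"Failed password for (?P<user>\S+) from (?P<client_ip>\S+)"),
    "fw": re.compile(r"DENY src=(?P<client_ip>\S+) dst=(?P<server_ip>\S+)"),
}


def parse(source: str, message: str):
    """Dispatch on the known source so each message costs one regex match."""
    pattern = PARSERS.get(source)
    match = pattern.search(message) if pattern else None
    if match is None:
        return None  # the caller decides how to handle unparsed messages
    return {"source": source, **match.groupdict()}


def parse_by_trial(message: str):
    """Try every pattern in turn: cost grows with messages x patterns."""
    for source in PARSERS:
        record = parse(source, message)
        if record:
            return record
    return None
```

Each new format adds one entry to `PARSERS`, which is also where the maintenance cost of parsing accumulates.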
Input Parsing Comparison
[Table: the compared technologies were shown as logos, and the per-cell ratings (Best/Good/Fair/Poor/Worst) could not be reliably matched to rows after extraction.]
Criteria: Initial Value, Flexibility, Output Options, Resource Requirement, Commercial Support, Risks and Problems
Risks and problems, one group per technology as extracted:
• Cost challenges; multi-index limitation
• Exponential resource cost in parsing; multi-index limitation
• Expensive devs; no commercial support
• Expensive devs; no commercial support
Big Datastore Technology Experience
[Table: the compared technologies were shown as logos, and the per-cell ratings (Best/Good/Fair/Poor/Worst) could not be reliably matched to rows after extraction.]
Criteria: Language, SDK/ODBC Availability, Linear Scale v. Resource, Complex Query Time, Commercial Support, 1st 85% Satisfaction, Last 15% Satisfaction
Language values listed: NoSQL; CQL (from SQL); NoSQL & SQL; Hadoop
Streaming Pipelines for Ingest
Raw Data -> Normalized -> Relations -> Persist
• Java-based data pipeline language
• Distributed message buffering technology using RAM and disk
Micro-batching Pipeline
Extract Batch Data -> Distributed Batch Processing -> Persist Analysis
• Distributed compute framework
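The micro-batching flow (extract a batch, process it, persist the analysis) can be sketched in a few lines. This is a single-process illustration under stated assumptions: batches are cut by size only, whereas real micro-batch engines typically also cut on a time window, and the per-batch analysis (counting events per `client_ip`) and the list used as a sink are hypothetical stand-ins.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield lists of up to batch_size events from any iterable."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def analyze(batch):
    """Per-batch analysis step: count events per client IP."""
    return Counter(event["client_ip"] for event in batch)


def run(stream, batch_size, sink):
    """Extract batches, process each one, and persist the result to sink."""
    for batch in micro_batches(stream, batch_size):
        sink.append(analyze(batch))
```

Because each batch is analyzed on its own, any calculation that spans batches needs state kept somewhere outside `analyze`.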
Pre-processing – State and Machine Learning
[Table: the compared technologies were shown as logos, and the per-cell ratings (Best/Good/Fair/Poor/Worst) could not be reliably matched to rows after extraction.]
Criteria: Initial Value, Flexibility, Dev Cost, Scale Ease, Challenges
Challenges, one group per technology as extracted:
• Scale/OPS challenges; stateful calc a challenge
• OPS overhead (Mesos or HDFS); great for complex/diverse data sets; stateful calc a challenge
• Dev overhead; requires custom distributed logic; stateful calc a challenge
• Requires extreme RDBM data structure; load balancing (non-stateful) distribution of processing
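One common way to handle the stateful-calculation challenge is to partition events by entity key, so that every event for a given entity reaches the same worker and its running state lives in one place. The sketch below assumes a hypothetical failed-login counter and in-process `Worker` objects; with plain non-stateful load balancing (for example round-robin), the same counts would be scattered across workers.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, workers: int) -> int:
    """Stable hash so every event for one entity reaches the same worker."""
    return zlib.crc32(key.encode()) % workers


class Worker:
    """Keeps running state (failed-login counts) for the entities it owns."""

    def __init__(self):
        self.failed_logins = defaultdict(int)

    def handle(self, event):
        """Update state for the event and return the entity's running count."""
        if event["action"] == "login_failed":
            self.failed_logins[event["client_ip"]] += 1
        return self.failed_logins[event["client_ip"]]


def route(events, workers):
    """Send each event to the worker that owns its client_ip."""
    results = []
    for event in events:
        owner = workers[partition_for(event["client_ip"], len(workers))]
        results.append(owner.handle(event))
    return results
```

The trade-off is that changing the number of workers changes which worker owns each key, so state has to be moved or rebuilt when the cluster is resized.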
Big Data Pipeline Game Pieces
Pieces: Inputs, Buffer, Parser, Multi-index, Normalized, RDBM/Graph, Pre-process, State/ML
Stages: Input, Store/Index, Analyze
ELK Pipeline
Stages: Data, Buffer, Transport, Parse, Multi-index, State/ML, UX
Ratings (slide order): Fair, Poor, Poor, Good, Poor, Good
Best of Breed
Stages: Data, Buffer, Transport, Parse/Normalize, Pre-process, Normalized, State/ML, Relations, UX
Skip multi-index – no lasting value
Ratings (slide order): Best, Best, Best, Best, Best, Best, Best, Best
WitFoo Precinct (Deckard) Pipeline
Stages: Data, Buffer, Transport, Parse/Normalize, Pre-process, Normalized, State/ML, Relations, UX
Ratings (slide order): Best, Best, Best, Best, Good, Best, Fair, Fair
WitFoo Precinct (Benson) Pipeline
Stages: Data, Buffer, Transport, Parse/Normalize, Pre-process, Normalized, State/ML, Relations, UX
Ratings (slide order): Best, Best, Best, Best, Good, Best, Best, Best
Precinct Architecture
[Diagram; labels recovered from the slide]
Nodes: Input, Processing, Data
Clusters: Input Cluster, Data Cluster
Connections: Replication 2202/tcp; API 8080/tls; NDB 3306/tcp
Summary Points
• No "silver bullets"
• Getting to 85% is easy; the last 15% requires planning
• No commercial OTSS covers all needs
• Holistic, "long game" plan bears most fruit
• Industry hype can be unbalanced against the "long game"
• Be wary of a "better than we have" or "good enough" philosophy
• Resource (CPU, RAM) and license costs can grow quickly
• Investing in OSS projects can reduce costs and increase success (Kafka, Spark, Cassandra, etc.)