cruising on a security data lake - secure360… · understanding secops big data goals...

Cruising on a Security Data Lake

Solving Big Data Challenges in SECOPS

Charles Herring
Co-Founder, CTO

CharlesHerring.com
WitFoo.com

@charlesherring

About Charles

1995-2002: Forward Deployed US Navy Hornet Avionics Tech

2002-2005: US Naval Postgraduate School Network Security Group Division Officer

sk3wl 0f r00t team member

2003-2008: InfoWorld Test Center Contributing Product Reviewer – Network and Information Security

2005-2012: DoD Security, Data & Workflow Consultant

2012-2016: Consulting Security Architect for Lancope then Cisco Systems

2016: Chief Nerd (CTO) & co-founder at WitFoo

Understanding SECOPS Big Data Goals
• Horizontal scale for long-term retention
• Streaming processing (near real time)
• Retrospective analysis of data
• Diverse data inputs at reasonable integration costs
• Normalized data for ad hoc analysis
• Machine learning
• Reasonable RAM and CPU costs (disk is cheap)
• Law enforcement integrity
• Contextualization & state management

WitFoo Research: SECOPS 7 Unstable Conversations
1. Responders cannot understand/process what their tools are communicating.
2. Security Managers cannot understand what resources their responders need.
3. Security Managers cannot effectively communicate with the broader business.
4. Organizations cannot hold vendors accountable.
5. Organizations cannot safely share information with each other in an operational way.
6. Organizations cannot report cybercrime to Law Enforcement in a way that does not create organizational risk.
7. Law Enforcement cannot effectively prosecute cyber criminals due to insufficient or complex evidence.

Playing Data “Long Game”

Data Collection
• Infinite Horizontal Scale
• Open Access to Data
• Linear Scale

Analysis
• Signature
• Behavioral
• Anomaly (Machine Learning)

Business Metrics
• FTE Requirements
• Personnel Efficiency
• Tool Efficiency
• Compliance/Readiness

Audiences: Analyst, Threat Hunter, Security Architect, Manager, Executive, Board of Directors

WitFoo Precinct: 5 Full Cycles of Failure

Generation: RDBM / Raw Data Cluster / Processing / Buffering
• 1st (Fife): MySQL / HDFS / MapReduce / -
• 2nd (McNulty): MySQL / Cassandra / Logstash / Kafka
• 3rd (Kojak): MySQL / ElasticSearch / Logstash / Kafka
• 4th (Foley): MySQL / ElasticSearch / Custom Scala / Kafka
• 5th (Deckard): MySQL with NDB Cluster (RDBM and raw data) / Custom Scala / Kafka
• 6th (Benson): MySQL/InnoDB with Manual Distro (RDBM and raw data) / Spark + Scala / Kafka

MapReduce and other disk failures
• IOPS = Input/Output Operations per Second
• Cybersecurity analysis requires high IOPS (10k – 1M)
• Disk storage cost is low, but IOPS is expensive
• Memory and indexing are needed
• MapReduce & HDFS fail because of low IOPS throughput
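The IOPS point can be made concrete with simple arithmetic. The sketch below is illustrative only: the per-device figures (`ASSUMED_HDD_IOPS`, `ASSUMED_SSD_IOPS`) are round-number assumptions rather than measurements from this talk, and `devices_needed` is a hypothetical helper. It estimates how many devices it would take just to reach the 10k and 1M IOPS targets, regardless of storage capacity.

```python
import math

# Assumed, round-number per-device figures for illustration only.
ASSUMED_HDD_IOPS = 150      # hypothetical spinning disk
ASSUMED_SSD_IOPS = 50_000   # hypothetical solid-state device


def devices_needed(required_iops, iops_per_device):
    """Return the minimum device count whose combined IOPS meets the target."""
    return math.ceil(required_iops / iops_per_device)


for target in (10_000, 1_000_000):
    hdd = devices_needed(target, ASSUMED_HDD_IOPS)
    ssd = devices_needed(target, ASSUMED_SSD_IOPS)
    print(f"{target} IOPS: {hdd} assumed HDDs or {ssd} assumed SSDs")
```

Under these assumptions the disk count is driven by IOPS long before capacity matters, which is the sense in which disk is cheap but IOPS is expensive.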

Big Data Pipeline Game Pieces

Pieces: Inputs, Buffer, Parser, Normalized, Pre-process, RDBM/Graph, State/ML, Multi-index

Phases: Input, Store/Index, Analyze
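To show how a few of these pieces fit together, here is a minimal single-process sketch of buffer, parser, normalize and store stages. Everything in it is an assumption made for illustration: the log line format, the field names, and the use of a `deque` and a list as stand-ins for a message buffer and a datastore. It is not the Precinct implementation.

```python
import re
from collections import deque

# Hypothetical input line format, invented for this sketch.
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<src>\S+) -> (?P<dst>\S+) (?P<action>\w+)$"
)


def parse(raw):
    """Parser piece: split a raw line into fields, or return None if it does not match."""
    match = LINE_RE.match(raw)
    return match.groupdict() if match else None


def normalize(fields):
    """Normalized piece: map parsed fields onto one common record layout."""
    return {
        "timestamp": fields["ts"],
        "source": fields["src"],
        "destination": fields["dst"],
        "action": fields["action"].lower(),
    }


def run_pipeline(raw_lines):
    buffer = deque(raw_lines)  # Buffer piece: in-memory stand-in for a message queue
    store = []                 # Store piece: stand-in for a database table
    while buffer:
        fields = parse(buffer.popleft())
        if fields is not None:  # unparsable lines are dropped in this sketch
            store.append(normalize(fields))
    return store


records = run_pipeline([
    "2019-05-14T10:00:00Z 10.0.0.5 -> 203.0.113.9 DENY",
    "not a parsable line",
])
print(records)
```

Keeping each piece as a separate function mirrors the game-piece idea: any one stage can be swapped for a different technology without touching the others.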

Data Questions
• When to normalize? (The "normalize never" problem)
• To JOIN or not to JOIN?
• Linear data scale?
• Disk to RAM ratio requirements?
• To index/parse or full string search?
• OSS vs commercial support
• Connector interoperability
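The "index/parse or full string search" question is a trade-off between work done at ingest and work done at query time. The sketch below contrasts the two on a handful of made-up `key=value` log lines; the log format and the function names are assumptions for illustration only.

```python
from collections import defaultdict

# Made-up sample data in a hypothetical key=value format.
logs = [
    "src=10.0.0.5 dst=203.0.113.9 action=deny",
    "src=10.0.0.7 dst=198.51.100.2 action=allow",
    "src=10.0.0.5 dst=198.51.100.2 action=allow",
]


def full_string_search(lines, needle):
    """No up-front cost, but every query scans every line."""
    return [i for i, line in enumerate(lines) if needle in line]


def build_index(lines, field):
    """Parse once at ingest and index one field; queries become lookups."""
    index = defaultdict(list)
    for i, line in enumerate(lines):
        pairs = dict(token.split("=", 1) for token in line.split())
        if field in pairs:
            index[pairs[field]].append(i)
    return index


src_index = build_index(logs, "src")
scan_hits = full_string_search(logs, "src=10.0.0.5")
index_hits = src_index["10.0.0.5"]
print(scan_hits, index_hits)
```

Both approaches find the same lines here. The index pays parsing and memory cost up front, while the scan pays at every query, which is where the disk-to-RAM ratio question comes in.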

Distributed Storage
• Broker – handles replication; provides logical access to data
• Linear or non-linear scale
• Memory dependency on scale
• Relational/Graph or NoSQL
• Code must bridge gaps in data storage

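[Table: distributed storage technologies compared on Broker?, Linear scale?, Memory independent? and Relational?. The technology names and most cell values were not preserved; the only surviving cells read "No".]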

Data Structure Comparison

Columns: Multi-index / Normalized / Relational
• Creation Difficulty: Low / High / Medium
• Query Difficulty: Low / Medium / High
• Visualization Difficulty: High / High / Low
• State/Entity Management Difficulty: High / High / Low
• Baseline Difficulty: High / High / Medium

Evaluated Technologies

Parsing Questions
• Is processing compute sub-linear, linear or greater?
• What is the cost of creating and maintaining parsing?
• Is parsing important?
• Multi-index vs single (normalized) index

Input Parsing Comparison

[Table: parsing technologies rated Best/Good/Fair/Poor/Worst on Initial Value, Flexibility, Output Options, Resource Requirement and Commercial Support. The technology names and the placement of the individual ratings were not preserved.]

Risks and Problems (one entry per technology, in slide order)
• Cost challenges; multi-index limitation
• Exponential resource cost in parsing; multi-index limitation
• Expensive devs; no commercial support
• Expensive devs; no commercial support

Big Datastore Technology Experience

[Table: datastore technologies rated Best/Good/Fair/Poor/Worst on SDK/ODBC Availability, Linear Scale v. Resource, Complex Query Time, Commercial Support, 1st 85% Satisfaction and Last 15% Satisfaction. The technology names and the placement of the individual ratings were not preserved.]

Language (one entry per technology, in slide order)
• NoSQL
• CQL (from SQL)
• NoSQL & SQL
• Hadoop

Streaming Pipelines for Ingest

Raw Data → Normalized → Relations → Persist
• Java-based data pipeline language
• Distributed message buffering technology using RAM and disk

Micro-batching Pipeline

Extract Batch Data → Distributed Batch Processing → Persist Analysis
• Distributed compute framework
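The micro-batching steps (extract a batch, process it, persist the analysis) can be sketched in a few lines. This is a single-process illustration under stated assumptions: batches are cut by a fixed size rather than by a time window, the event shape is made up, and `micro_batches` and `count_by_action` are hypothetical helpers, not the API of any framework.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Extract step: cut an unbounded iterable into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_by_action(batch):
    """Processing step: a stateless per-batch aggregation."""
    counts = {}
    for event in batch:
        counts[event["action"]] = counts.get(event["action"], 0) + 1
    return counts


# Made-up event stream; every third event is a "deny".
events = ({"action": "deny" if i % 3 == 0 else "allow"} for i in range(7))

# Persist step: a list stands in for the datastore in this sketch.
results = [count_by_action(batch) for batch in micro_batches(events, 3)]
print(results)
```

Each batch is analyzed on its own here. Anything that must carry state across batches, such as a running baseline per entity, needs extra machinery, which is the stateful-calculation challenge this deck raises for pre-processing.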

Pre-processing – State and Machine Learning

[Table: pre-processing technologies rated Best/Good/Fair/Poor/Worst on Initial Value, Flexibility, Dev Cost and Scale Ease. The technology names and the placement of the individual ratings were not preserved.]

Challenges (one entry per technology, in slide order)
• Scale/OPS challenges; stateful calc a challenge
• OPS overhead (Mesos or HDFS); great for complex/diverse data sets; stateful calc a challenge
• Dev overhead; requires custom distributed logic; stateful calc a challenge
• Requires extreme RDBM data structure; load balancing (non-stateful) distribution of processing

Big Data Pipeline Game Pieces

Pieces: Inputs, Buffer, Parser, Normalized, Pre-process, RDBM/Graph, State/ML, Multi-index

Phases: Input, Store/Index, Analyze

ELK Pipeline

Stages: Data Buffer, Transport, Parse, Multi-index, State/ML, UX

Ratings (in slide order): Fair, Poor, Poor, Good, Poor, Good

Best of Breed

Stages: Data Buffer, Transport, Parse/Normalize, Pre-process, Normalized, State/ML, Relations, UX

Skip multi-index – no lasting value

Ratings (in slide order): Best, Best, Best, Best, Best, Best, Best, Best

WitFoo Precinct (Deckard) Pipeline

Ratings (in slide order): Best, Best, Best, Best, Good, Best, Fair, Fair

WitFoo Precinct (Benson) Pipeline

Ratings (in slide order): Best, Best, Best, Best, Good, Best, Best, Best

Precinct Architecture

[Diagram: an Input Cluster and a Data Cluster, with nodes labeled Input, Processing and Data. Labeled connections: Replication (2202/tcp), API (8080/tls), NDB (3306/tcp).]

Summary Points
• No “silver bullets”
• Getting to 85% is easy; the last 15% requires planning
• No commercial OTSS covers all needs
• Holistic, “long game” plan bears most fruit
• Industry hype can be unbalanced against “long game”
• Be wary of “better than we have” or “good enough” philosophy
• Resource (CPU, RAM) and license costs can grow quickly
• Investing in OSS projects can reduce costs and increase success (Kafka, Spark, Cassandra, etc.)

