carolinacon presentation on streaming analytics
CarolinaCon 11
One Step Closer to the Matrix: Machine Learning and Augmented Reality in Streaming Data
Rob Weiss, John Eberhardt
What’s the Story?
• Rob and John have been working together for years
• Rob is a Network Engineer and Hacker
• John is a Data Scientist and Architect
• Two Great Tastes that Taste Great Together
• Different perspectives bring new answers
• Rob and John are interested in how to create a paradigm shift in user interaction with data and network security
• We are also probably slightly insane
The Defender’s Challenge
• The attacker has an inherent advantage – no rules!
• So the defense problem is asymmetric
• Classical methods fail more rapidly as computing power becomes cheaper and more readily available
• The Fortress or “Big Walls” security model is outdated and, frankly, ineffective
• Qualified people are in short supply
• Can we crowdsource network defense?
How We Got Started
• A research project in a galaxy far, far away
• We started modeling zero day attacks
• We combined machine learning and streaming analytics to detect novel patterns statistically
• It worked well enough, but there were limitations
  • Not sensitive enough
  • Not specific enough
  • Proprietary software limited flexibility
  • It still required a pretty sophisticated operator – and those are in short supply
• So . . .
Taking a Different Approach
• Could we do for raw data what GUIs did for computers and revolutionize human interaction with data?
• Complex streaming analytics are not tractable to the human
• The “last mile” requires a user interface that creates flow for the human analyst out of data
• Harness the power of metaphor to explain complex concepts to the human analyst (e.g. Windows)
• Streaming Analytics + Streaming User Experience = “Data Looming”
• Can we really make a prosthetic for the brain?
What? Don’t Flip Out . . .
Data Looming
• Can you point out every individual thread and show me how it is woven? Probably not.
• Can you tell me what it is? I sure hope so!
Data Looming
Watch threads on a loom – to the naked eye, the loom is too complex and moving too quickly for you to pick out the details, but you can quickly see when the overall pattern changes – usually within very few iterations. A simple, intuitive, scalable visualization of streaming analytics allows the human analyst to connect the “last mile” of disconnected events and is at the heart of what we are doing – merging complex streaming analytics with the sparse pattern detection capabilities of the human brain.
Pattern Recognition is For the Birds
A child can learn to recognize this pattern in 15 seconds, but a computer still can’t.
#1 - Eagle #2 - Swan #3 - ????
Getting to The Big Idea
Zero Day Work
William Gibson’s Neuromancer
The Matrix
John Maeda’s Simplicity by Design
Open Source
Network Expertise
Data Science Expertise
Crowdsourcing
Hacktastic Innovation Explosion!!!
How I Did It by Victor Frankenstein
• Accelerate data analysis by extending streaming analytics to broader groups of less skilled human analysts
• Combine the speed, precision and recall of a computer, through an immersive interface, with the inherent sparse pattern recognition capabilities of the human brain
• Streaming Analytics allow for rapid, real time adjudication of data and make the user experience dynamic
• An immersive user experience makes complex analytics data “real” to the human and enables experiential learning
• Combining them in a single environment enables sparse pattern recognition in dynamic systems
How I Did It Continued (Abby Normal)
• Data: Streaming data from sensors, collectors, files, etc.
• Platform: Streaming analytics process and analyze these data, including attribution to the real world
• Visual Language Construct: Integrates streaming data, streaming analytics, and streaming user experience in a pluggable architecture
• Streaming User Experience: Immersive 3-D user experience allows analysts to interact directly with streaming data and analytics
Architecture (Meet the Architect)
[Architecture diagram; component labels: Data Sensor (N+1), Data Collector (N+1), Kafka, Zookeeper, Kafka Queue, Nimbus, Worker Node, Storm, Trident-ML, Analytics Platform, Visual Language Construct, Streaming User Experience, Analytics and Countermeasures, Game Players]
Design Principles
Principle: What it enables
• Open Source Components: Supports integration of streaming analytics and immersive user experience to create a dynamic feedback loop – rapidly adapt the platform from lessons learned from human experience
• Streaming Analytics: Accelerating analytics to keep pace with data collection (facilitating high collection rate)
• Immersive Streaming User Experience: Extending the user interface to allow broader groups of analysts to use sophisticated analytics (addressing the recruiting challenge)
• Pluggable Architecture: “Bring your own” tools and analytics supports crowdsourcing and allows for aggressive exploitation of new analytics and user experience paradigms
Larry Bird: Network Defender of the Future
A basketball player can watch your network. When an attack occurs, our player can quickly identify the pattern shift using the same brain computation the player uses to identify a shift in the offensive strategy of the opposing basketball team. Think about this as a data prosthetic for the human brain.
Enough of Us Talking at You
• Fight fire with fire – crowdsource all comers and create an asymmetric defense
• Align economic incentives, human behaviors, and defense objectives
• Do for data what GUIs did for computers – make it accessible!
• This isn’t about technology . . . it’s about revolutionizing the way humans interact with data to enable a game-changing leap forward
Innovation Is Often Strange
But Wait, There’s More!
Altamira Technologies Corporation 2014
Demo Concept
Concept
• Normal work environment – “normal” patterns give way to aberrations
• The demo focuses on network data, but could easily use any other streaming data
Design
• Analytics cluster traffic based on source and destination port patterns over time using k-means clustering
• Cubes represent nodes on the network; streaming spheres represent packets
• Colors represent the behavior of nodes / packets based upon traffic – Green is a client, Blue is a server, Yellow is “undetermined behavior”
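To make the clustering step concrete, here is a minimal batch k-means sketch in plain Python over (source port, destination port) pairs. It is a stand-in for illustration only, not the Trident-ML streaming implementation used in the demo. The flow values, the function names, and the farthest-point initialization are all assumptions made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two (src_port, dst_port) pairs."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption): start at the first point,
    then repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = farthest_point_init(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]  # keep an empty cluster's centroid in place
            for i, c in enumerate(clusters)
        ]
    return centroids


# Hypothetical flows as (source port, destination port)
flows = [
    (54000, 80), (55000, 443), (56000, 22),    # high source, low destination
    (80, 54000), (443, 55000), (22, 56000),    # low source, high destination
    (5000, 5500), (5100, 5600), (5060, 5400),  # neither pattern
]
centroids = kmeans(flows, k=3)
for src, dst in centroids:
    print(f"source centroid {src:.0f}, dest centroid {dst:.0f}")
```

With well-separated port patterns like these, each centroid settles on one behavior group. A streaming version would update centroids incrementally as packets arrive rather than re-scanning a stored batch.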
                  Green (client)   Blue (server)   Yellow (??)
Source Centroid   54760            1001            5066
Dest Centroid     791              54518           5511
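As a rough illustration of how centroids like these could drive the node colors, the sketch below assigns a flow to the nearest centroid from the table. The pairing of each column into a (source, destination) point, the squared Euclidean distance on raw port numbers, and the function name are assumptions for this example, not a description of the demo's code.

```python
# Centroids copied from the table above as (source port, destination port)
CENTROIDS = {
    "green (client)": (54760, 791),
    "blue (server)": (1001, 54518),
    "yellow (undetermined)": (5066, 5511),
}


def classify(src_port, dst_port):
    """Return the label of the centroid closest to this flow (hypothetical helper)."""
    def sq_dist(label):
        c_src, c_dst = CENTROIDS[label]
        return (src_port - c_src) ** 2 + (dst_port - c_dst) ** 2
    return min(CENTROIDS, key=sq_dist)


print(classify(51514, 443))   # ephemeral source port talking to a low port
print(classify(80, 49152))    # low source port replying to an ephemeral port
print(classify(5060, 5060))   # mid-range ports on both sides
```

The green centroid pairs a high source port with a low destination port, which is what a client opening connections to well-known services looks like; the blue centroid is the mirror image.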
Questions I Can Ask
• Is a given node on the network behaving as expected?
  • Watch the node colors - they should be consistent in a normal network: some yellow nodes, a lot of green (client) nodes, and some blue (server) nodes. What happens over time?
• Does my use of source and destination ports mark me out as a client or server? Does my role appear consistent or change?
  • The node colors indicate what they are – watch the colors of the nodes – machines should have clear and consistent roles
• Is the pattern of nodes that I am interacting with consistent? Am I interacting with different partners?
  • Watch the stream patterns – machines should interact with consistent groups
• Do my behaviors adhere to regular time cycles? Can I apply time cycles to any of the above (e.g., a workday)?
  • Watch the patterns change as cyclical time progresses in our “workday”
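The "does my role appear consistent or change?" question can also be asked programmatically. The sketch below is one hedged way to do it: given per-window role labels for each node, it reports every point where a node's role differs from its previous window. The observation format, node addresses, and function name are hypothetical.

```python
def role_changes(observations):
    """observations: (window, node, role) tuples in time order.
    Returns (window, node, old_role, new_role) for every role flip."""
    last_role = {}
    changes = []
    for window, node, role in observations:
        previous = last_role.get(node)
        if previous is not None and previous != role:
            changes.append((window, node, previous, role))
        last_role[node] = role
    return changes


# Hypothetical labels for two nodes over three time windows
observations = [
    (0, "10.0.0.5", "client"), (0, "10.0.0.9", "server"),
    (1, "10.0.0.5", "client"), (1, "10.0.0.9", "server"),
    (2, "10.0.0.5", "server"), (2, "10.0.0.9", "server"),
]
changes = role_changes(observations)
print(changes)
```

Here the node 10.0.0.5 behaves as a client for two windows and then starts behaving as a server, which is the kind of color change an analyst would be watching for in the visualization.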
DEMO TIME!
About Rob and John
• Rob Weiss is a senior systems engineer at G2 (www.g2-inc.com) with over 24 years of experience in government and commercial markets. He started with Legos and is now a tool builder and problem solver. Currently runs the Altamira Red Team and performs information security research, looking for hard problems to solve. Twitter: @3XPlo1T2
• John Eberhardt is a Data Scientist at 3E Services (www.3eservicesllc.com) with 20 years of quantitative problem solving and a penchant for trying to decipher symbolism in obscure 16th century literature. John has experience in analytical problem solving in healthcare, life sciences, security, financial services, consumer products, and transportation. Twitter: @JohnSEberhardt3
Repositories
• Apache Storm: https://github.com/apache/storm
• Trident-ML: https://github.com/pmerienne/trident-ml
• Rob Weiss: https://github.com/j105rob
Squiggly (probably won’t use this)
• A self organizing system consists of groups A, B, and C interacting
• Hence, the current state of A is {A|B,C}
• They influence each other {B|A,C}, {C|A,B} which means the system is described by f{{A|B,C},{B|A,C},{C|A,B}}
• However these groups are neither unitary nor static, which means at any given time they can have sub-attributes {Ai...An}, {Bi...Bn}, {Ci...Cn} that are unknown
• So now the system is described by f{{Ai | {Bi...Bn}, {Ci...Cn}},{Bi |{Ai...An}, {Ci...Cn}},{Ci |{Ai...An}, {Bi...Bn}}}
• How do you solve this NP-hard problem?