The Perfect Fit: Scalable Graph for Big Data


Upload: inside-analysis

Post on 03-Aug-2015


Grab some coffee and enjoy the pre-show banter before the top of the hour!

The Briefing Room

The Perfect Fit: Scalable Graph for Big Data

Twitter Tag: #briefr The Briefing Room

Welcome

Host: Eric Kavanagh

@eric_kavanagh


Mission

•  Reveal the essential characteristics of enterprise software, good and bad

•  Provide a forum for detailed analysis of today’s innovative technologies

•  Give vendors a chance to explain their product to savvy analysts

•  Allow audience members to pose serious questions... and get answers!


Topics

June: INNOVATORS

July: SQL INNOVATION

August: REAL-TIME DATA


When You’re Hot…

•  Biggest Web engines use graph

•  Very powerful for finding relationships

•  More versatile than other DB formats

•  Great for unwinding complex scenarios


Analyst: Robin Bloor

Robin Bloor is Chief Analyst at The Bloor Group

@robinbloor


SYSTAP

•  SYSTAP builds highly scalable open source solutions for big graphs.

•  Its flagship product is Blazegraph, a platform that supports semantic web and graph database APIs. It features fault-tolerant storage and query capabilities, plus online backup and failover.

•  Blazegraph achieves its scale and high throughput by leveraging GPU acceleration via its Mapgraph technology.


Guest: Brad Bebee

Brad Bebee is the CEO and Managing Partner at SYSTAP, LLC. Brad leads the efforts to use SYSTAP technologies for high performance graph databases and analytics to deliver solutions for multiple business and mission areas. Over the course of his career, he has served as a CTO and CFO, managed operating divisions, and performed advanced technology development for commercial and government customers. He is an active contributor to SYSTAP’s open source software projects. His technology experience ranges from early work in modeling methodologies and knowledge representation, dating back to precursors of DARPA’s DAML program, to more recent work with large scale data analytics using the Hadoop ecosystem, Accumulo, and related technologies. He has extensive experience in architecture and software modeling methodologies, where he has led and collaborated on multiple publications and received recognition for his research.

http://blazegraph.com/

The Perfect Fit: Scalable Graph for Big Data

June 30, 2015 Bloor Group Briefing Room


Big Data Startup Award Winner: 2015 Big Data Innovations Summit

Helping customers achieve their business objectives with graph data is our vision, mission, and the essence of our software solutions.

Today, we serve Fortune 500 companies, startups, governments, and research organizations with technology to power their graphs.


Graph Databases Grew at Over 500% in the Last Two Years

Popularity changes per category – March 2015

[Chart: axis label "Popularity Changes"; labeled series "Graph Databases"]


The Amount of Graph Data is Exploding

Billion+ Edges

SYSTAP™, LLC © 2006-2015 All Rights Reserved


Graph Applications are Everywhere

•  Community Detection / Clustering

•  Recommendation Systems

•  Fault Prediction in Industrial and Internet of Things (IoT)

•  Drug Discovery / Repurposing

•  Precision Medicine / Genomics

•  Fraud Detection

•  Time Series, Compliance

•  Cyber

•  Defense / Security


Graphs are different. You need the right paradigm and hardware to scale

https://datatake.files.wordpress.com/2015/09/latency.png

Graph Cache Thrash: The CPU just waits for graph data from main memory...

[Chart: axis labels "Type of Cache or Memory" and "Access Latency Per Clock Cycle"]


Solutions to the Graph Scaling Problem Using Graph Databases and GPUs

●  Embedded

●  High Availability

●  Scale-out

●  GPU Acceleration

●  100s of Times Faster than CPU main memory-based systems

●  Up to 40X Cheaper

●  10,000X Faster than disk-based technologies


Uncovering influence links in molecular knowledge networks to streamline personalized medicine | Shin, Dmitriy et al. Journal of Biomedical Informatics, Volume 52, 394-405

Finding the Next Cure for Cancer is a Billion+ Edge Graph Challenge


Graph is BIG and changing (Trillion+ Edges)


Graphs Enable People to Find Knowledge

A Bunch of Pages

An Answer


Graphs Enable Enterprises to Manage Metadata

•  Data outlives specific system implementations.

•  Data outlives applications.

•  Achieve Metadata independence using declarative standards to manage metadata and express transformations.

[Diagram labels: Data Sources, Data Providers; Knowledge Graph: Instance Data + Ontology (RDF + OWL); ACLs, Query Catalog, Constraints, Rules, Events, Mappings, Widgets, Views]


Knowledge Base of Biology (KaBOB)

biomedical data & information (17 databases): Entrez Gene, DIP, UniProt, GOA, GAD, HGNC, InterPro, …

biomedical knowledge, Open Biomedical Ontologies (12 ontologies): Gene Ontology, Sequence Ontology, Cell Type Ontology, ChEBI, NCBI Taxonomy, Protein Ontology, …

application data


Powering Their Graphs with Blazegraph™


Information Management / Retrieval

Genomics / Precision Medicine

Defense, Intel, Cyber


The right scaling approach depends on the business need

[Chart: deployment options by Speed (Fast to Fastest) and Data Scale in edges (Millions, Billions, Trillions)]

•  Embedded Single Server (50B), shown as JVM and Journal

•  High Availability (50B)

•  Scale Out (1T+)

•  Single GPU (500+M)

•  Multi-GPU Clusters (100+B)


Blazegraph™ stands out!

•  Wikimedia Evaluation: https://docs.google.com/a/systap.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0


Blazegraph™: Embedded and Single Server

•  High performance, Scalable
   –  50B edges/node
   –  RDF/SPARQL level query language
   –  Efficient Graph Traversal
   –  High 9s solution
•  Property graphs
   –  Blueprints, Gremlin, Rexster
•  REST API (NSS)
•  Extension points
   –  Stored queries for custom application logic on the server.
   –  Custom services & indices
   –  Custom functions
   –  Vertex-centric programs

•  Embedded Server: JVM, Journal
•  Standalone Server: WAR, Journal
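The REST API (NSS) and RDF/SPARQL bullets can be illustrated with a plain SPARQL 1.1 Protocol request. The sketch below is not taken from the slides: the endpoint URL is an assumed local address, the query is a generic placeholder, and `build_sparql_request` is a hypothetical helper. It only builds the HTTP request; sending it requires a running server.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a local server exposing a SPARQL endpoint at this URL.
# Adjust it to match your own deployment.
ENDPOINT = "http://localhost:9999/blazegraph/sparql"

# Placeholder query: fetch any ten triples from the store.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL query as a URL-encoded POST."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


request = build_sparql_request(ENDPOINT, QUERY)

# To execute against a running server:
#   import json
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       results = json.load(response)
```

Building the request separately from sending it keeps the example testable without a server, and the same request shape should work against any endpoint that follows the SPARQL 1.1 Protocol.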


Blazegraph™: High Availability

•  Shared nothing architecture
   –  Same data on each node
   –  Coordinate only at commit
   –  Transparent load balancing
•  Scaling
   –  50 billion triples or quads
   –  Query throughput scales linearly
•  Self healing
   –  Automatic failover
   –  Automatic resync after disconnect
   –  Online single node disaster recovery
•  Online Backup
   –  Online snapshots (full backups)
   –  HA Logs (incremental backups)
•  Point in time recovery (offline)

[Diagram: three HAService nodes in a Quorum (k=3, size=3), labeled leader and follower]
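The quorum diagram (k=3, size=3) can be read with a little arithmetic. The slide does not state the rule Blazegraph uses, so the sketch below assumes a simple strict-majority quorum; `quorum_threshold` and `is_quorum_met` are hypothetical names used only for illustration.

```python
def quorum_threshold(k: int) -> int:
    """Smallest number of services that forms a strict majority of k."""
    return k // 2 + 1


def is_quorum_met(k: int, joined: int) -> bool:
    """True when enough services have joined to form a majority."""
    return joined >= quorum_threshold(k)


k = 3  # replication factor shown on the slide
for joined in range(k + 1):
    print(f"k={k} joined={joined} met={is_quorum_met(k, joined)}")
```

Under this assumption, a three-node cluster keeps its quorum with one node down, which is consistent with the automatic failover and resync bullets.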


Blazegraph™: Scale-out

•  Shard-based horizontal scale-out to support 1 Trillion+ Edge Graphs

•  Fast parallel load

•  Efficient Query Through Coordination Between Data Services

•  Coming soon! Support for HDFS for failover.


How do I use GPUs to scale graphs?

●  Parallel Processing on GPU Clusters for Trillion+ Edge Graphs

●  High-Level API

●  Partitioning and Overlapping Communications

●  HPC and DARPA Pedigree


Blazegraph GPU: Ridiculously Fast for Graphs

Blazegraph™ plug-in for GPU Acceleration with familiar graph APIs

Graph DB


Blazegraph Multi-GPU: Extreme Scale, 40X more Affordable!

Mapgraph HPC with NVIDIA GPUs: $16K / GTEP (K40 - Today), $4K / GTEP (Pascal 2016)

Cray XMT-2: ~$180K / GTEP

Large Hadoop Cluster: ~$18M / GTEP

Future Blazegraph SaaS On-demand

1 GTEP = 1 Billion Traversed Edges Per Second

40X! 10X!
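The cost-per-GTEP figures on this slide can be compared directly. The sketch below simply divides the quoted numbers; the slide does not say which pairs its "40X" and "10X" callouts refer to, so no mapping is assumed. The dictionary keys and the `cost_ratio` helper are illustrative names.

```python
# Cost per GTEP in US dollars, as quoted on the slide (approximate figures).
COST_PER_GTEP = {
    "Mapgraph on K40 (today)": 16_000,
    "Mapgraph on Pascal (2016)": 4_000,
    "Cray XMT-2": 180_000,
    "Large Hadoop cluster": 18_000_000,
}


def cost_ratio(expensive: str, cheap: str) -> float:
    """How many times more one platform costs per GTEP than another."""
    return COST_PER_GTEP[expensive] / COST_PER_GTEP[cheap]


for baseline in ("Cray XMT-2", "Large Hadoop cluster"):
    for gpu in ("Mapgraph on K40 (today)", "Mapgraph on Pascal (2016)"):
        print(f"{baseline} vs {gpu}: {cost_ratio(baseline, gpu):,.2f}x")
```

On these figures, the Cray XMT-2 comes out at roughly 11 times the K40 cost and 45 times the projected Pascal cost per GTEP, while the Hadoop cluster figure is three orders of magnitude higher.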


Perceptions & Questions

Analyst: Robin Bloor

Of Graphs and Networks

Robin Bloor, PhD

Johnny-Come-Lately

Aside from the three-letter agencies, until recently, nobody cared much about graphs…

WHY?

Reasons for Graph Apathy…

1  Unfamiliarity (it’s obscure because it’s obscure)

2  RDBMSs do not store graphs well, and SQL is inadequate for querying graphs

3  No common BI applications; it’s mainly analytics

4  Semantic technology has taken a lifetime to evolve
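One way to read point 2: with edges stored in a relational table, a fixed-depth path query in plain SQL typically needs one self-join per hop (or a recursive query), while a traversal over adjacency lists handles any depth with the same loop. The sketch below shows only the traversal side; the edge data and the `reachable` helper are hypothetical.

```python
from collections import deque

# Hypothetical "knows" edges, stored as an adjacency list.
EDGES = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": ["dave"],
    "dave": [],
}


def reachable(graph, start):
    """Breadth-first search: map each reachable vertex to its hop count."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbor in graph.get(vertex, []):
            if neighbor not in hops:
                hops[neighbor] = hops[vertex] + 1
                queue.append(neighbor)
    return hops


print(reachable(EDGES, "alice"))
```

The loop never needs to know the path length in advance, which is the property that makes variable-depth relationship queries natural in a graph model.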

Reasons to Care

•  Graphs express very different (and important) data relationships

•  Graphs are largely unexplored

•  Graphs are ideal for MDM

•  Graphs express semantic relationships

Semantics: The Type 0 Language

Colorless green ideas sleep furiously

[Diagram: the sentence split into the nodes "Colorless green", "ideas", "sleep", "furiously"]

The Net Net

The ultimate goal is INFERENCING: knowledge discovery (rather than pattern discovery) through graph processing

•  What are the “low hanging fruit” graph applications, in your company’s experience?

•  Does your company find itself competing with Hadoop Giraph? What are the compelling differences?

•  Is Blazegraph a triple-store at the physical level (i.e., a pure RDF implementation) or does it implement a variety of physical structures?

•  At what level of data volume/workload is hardware acceleration a necessity?

•  What is the largest amount of data currently under management with any of your customers?

•  Which companies/technologies do you compete with directly?


Upcoming Topics

www.insideanalysis.com

June: INNOVATORS

July: SQL INNOVATION

August: REAL-TIME DATA


THANK YOU for your ATTENTION!

Some images provided courtesy of Wikimedia Commons