One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group


Post on 14-Jul-2015

Page 1: Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP

One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP

Randy Guck

Principal Engineer

Dell Software Group

Page 2:

What is Doradus?

Storage and query service

Leverages Cassandra NoSQL DB

Pure Java

- Stateless

- Embeddable or standalone

Open source: Apache 2.0 License

[Image: 30 Doradus, the Tarantula Nebula. Source: Hubble Space Telescope]

Page 3:

Why Use Doradus?

Easy to use: no client driver

Spider storage manager

- Good for unstructured data

OLAP storage manager

- Near real time data warehousing

Compared to Cassandra alone:

- Data model, searching, analytics

Compared to Hadoop:

- Fast data loads and queries

- Dense storage: less hardware

[Diagram: applications call Doradus through its REST API; Doradus hosts the OLAP and Spider storage managers and stores its data in Cassandra.]

Page 4:

A Multi-Node Cluster

[Diagram: a three-node cluster (Node 1, Node 2, Node 3). Each node runs Cassandra with its own data. Applications reach Doradus through the REST API. Secondary Doradus instances are optional.]

Page 5:

Why Did We Build Doradus OLAP?

Some tough customer requirements:

- Statistical queries most important

- Need to scan millions of objects/second

- User-customizable “insights” = millions of possible queries

Couldn’t use indexes, pre-computed queries, etc.

Disk physics

- Hundreds of random reads/second

- Thousands of serial reads/second

Needed a radically new approach!
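A rough back-of-the-envelope sketch of why one physical read per object cannot meet the scan requirement above. The concrete figures (300 random reads/second and a target of 2 million objects/second) are illustrative assumptions chosen to match the orders of magnitude on the slide, not measurements.

```python
# Illustrative numbers only: the slide gives orders of magnitude, not exact rates.
RANDOM_READS_PER_SEC = 300           # assumed: "hundreds" of random reads/second
TARGET_OBJECTS_PER_SEC = 2_000_000   # assumed: "millions" of objects/second


def objects_per_read_needed(target_objects_per_sec: int, reads_per_sec: int) -> float:
    """How many objects each physical read must deliver to reach the target rate."""
    return target_objects_per_sec / reads_per_sec


needed = objects_per_read_needed(TARGET_OBJECTS_PER_SEC, RANDOM_READS_PER_SEC)
print(f"Each read must deliver about {needed:,.0f} objects")
```

Under these assumed numbers every read has to return thousands of objects, which is the motivation for packing many values into each stored row.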

Page 6:

Doradus OLAP

Combines ideas from:

- Online Analytical Processing: data arranged in static cubes

- Columnar databases: column-oriented storage and compression

- NoSQL databases: sharding

Features:

- Fast loading: up to 500K objects/second/node

- Dense storage: 1 billion objects in 2 GB!

- Fast cube merging: typically seconds

- No indexes!

Page 7:

Example: Message Tracking Schema

[Schema diagram: tables Message, Participant, Address, Person, and Attachment. Links shown include Message.Participants, Message.Attachments, Participant.Message, Participant.Address, Address.Participants, Address.Person, and the Person links Manager and Employees.]

Page 8:

DQL Object Queries

Builds on Lucene syntax

- Full text queries

Adds link paths

- Directed graph searches

- Quantifiers and filters

- Transitive searches

Other features

- Stateless paging

- Sorting

Examples:

- LastName = Smith AND NOT (FirstName : Jo*) AND BirthDate = [1986 TO 1992]

- ALL(Participants).ANY(Address.WHERE(Email='*.gmail.com')).Person.Department : support

- Employees^(4).Office='San Jose'
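To make the transitive search in the last example concrete, here is a minimal in-memory sketch of one plausible reading of Employees^(4).Office='San Jose': select each person who has an employee, at most four Employees links away, whose Office is San Jose. The data, the function names, and this exact interpretation of the depth limit are assumptions for illustration; this is not how Doradus evaluates the query.

```python
from typing import Dict, List, Set

# Hypothetical data: person -> direct reports, and person -> office.
EMPLOYEES: Dict[str, List[str]] = {
    "ann": ["bob"],
    "bob": ["cat"],
    "cat": ["dan"],
    "dan": ["eve"],
    "eve": ["fay"],
    "fay": [],
}
OFFICE: Dict[str, str] = {
    "ann": "Austin", "bob": "Austin", "cat": "Austin",
    "dan": "Austin", "eve": "Austin", "fay": "San Jose",
}


def reachable(start: str, max_depth: int) -> Set[str]:
    """People reached by following the Employees link 1..max_depth times."""
    found: Set[str] = set()
    frontier = [start]
    for _ in range(max_depth):
        frontier = [e for p in frontier for e in EMPLOYEES.get(p, []) if e not in found]
        found.update(frontier)
    return found


def matches(person: str, office: str, max_depth: int = 4) -> bool:
    """True if any employee within max_depth levels works in the given office."""
    return any(OFFICE[e] == office for e in reachable(person, max_depth))


selected = sorted(p for p in EMPLOYEES if matches(p, "San Jose"))
print(selected)
```

In this toy hierarchy "ann" is not selected because the only San Jose employee is five levels below her, outside the assumed depth limit of four.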

Page 9:

DQL Aggregate Queries

Metric functions

- COUNT, AVERAGE, MIN, MAX, DISTINCT, ...

Multi-level grouping

Grouping functions

- BATCH, BOTTOM, FIRST, LAST, LOWER, SETS, TERMS, TOP, TRUNCATE, UPPER, WHERE, ...

Examples:

- metric=COUNT(*), AVERAGE(Size), MIN(Participants.Address.Person.Birthdate)

- metric=DISTINCT(Attachments.Extension); groups=Tags, Participants.Address.Person.Department; query=Attachments.Size > 100000

- metric=AVERAGE(Size); groups=TOP(10,Participants.Address.Email)
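The last example can be read as "average message size per participant email, keeping only the highest-ranked groups". The sketch below imitates that over a handful of made-up messages. The sample data, the function name, and the choice to count a message once in every group it belongs to are assumptions; the slides do not spell out how Doradus treats multi-valued grouping fields.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical messages: (Size, participant email addresses).
MESSAGES: List[Tuple[int, List[str]]] = [
    (1000, ["a@example.com", "b@example.com"]),
    (3000, ["a@example.com"]),
    (500, ["c@example.com"]),
    (7000, ["b@example.com", "c@example.com"]),
]


def top_average_size(messages: List[Tuple[int, List[str]]],
                     limit: int) -> List[Tuple[str, float]]:
    """Average Size per email, returning the `limit` groups with the largest average."""
    sizes: Dict[str, List[int]] = defaultdict(list)
    for size, emails in messages:
        for email in set(emails):      # a message counts once per distinct group
            sizes[email].append(size)
    averages = {email: sum(v) / len(v) for email, v in sizes.items()}
    # Sort by metric descending, then by group name for a stable order.
    return sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


print(top_average_size(MESSAGES, 2))
```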

Page 10:

OLAP Data Loading

[Diagram: data sources: Events, People, Computers, Domains.]

Page 11:

OLAP Data Loading

[Diagram: objects from the sources (Events, People, Computers, Domains) are grouped into batches: Batch 1, Batch 2, Batch 3, Batch 4, ...]

Page 12:

OLAP Data Loading

[Diagram: sources feed batches (Batch 1, Batch 2, Batch 3, Batch 4, ...). Batches are loaded into date-named shards (2014-03-01, 2014-02-28, 2014-02-27, ...), where a Merge step is applied.]

Page 13:

OLAP Data Loading

[Diagram: the full pipeline, left to right: Sources (Events, People, Computers, Domains), Batches (Batch 1, Batch 2, Batch 3, Batch 4, ...), Shards (2014-03-01, 2014-02-28, 2014-02-27, ...) with a Merge step, and the OLAP Store.]

Page 14:

Storing Batches

OLAP Table (field value arrays):

Field        Values
ID           5amhvv7J2otBu48Z6PE5cA | 7CgvDf5mOU78jNVc58eu | cZpz2q4Jf8Rc2HK9Cg08 | ...
Size         48120 | 5435 | 24220 | ...
SendDate     1280246462000 | 1279354872112 | 1279357261413 | ...
Priority     0 | 0 | 1 | ...
Subject.txt  ballades encash nautch colloquy geared | nettlier outdoors culvert hypothec winder | stolons ungot guiding rupiahs outgone | ...
Subject      1 | 2 | 0 | ...
...

Data is sorted by object ID and stored as columnar, compressed blobs

Compressed rows:

Key                                             Columns
Email/Message/2014-03-01/{Batch GUID}/ID        [compressed data]
Email/Message/2014-03-01/{Batch GUID}/Size      [compressed data]
Email/Message/2014-03-01/{Batch GUID}/SendDate  [compressed data]
...                                             ...
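The layout above can be sketched in a few lines: sort the batch by object ID, build one value array per field, and store each array as a compressed blob under a key of the form application/table/shard/batch/field. JSON plus zlib is only a stand-in here, since the slides do not say which encoding or compression Doradus uses, and the function names are hypothetical.

```python
import json
import uuid
import zlib
from typing import Dict, List


def store_batch(app: str, table: str, shard: str,
                objects: List[Dict[str, object]]) -> Dict[str, bytes]:
    """Turn a batch of objects into one compressed row per field."""
    batch_id = uuid.uuid4().hex                        # stands in for {Batch GUID}
    ordered = sorted(objects, key=lambda o: o["ID"])   # sort by object ID
    fields = sorted({name for obj in ordered for name in obj})
    rows: Dict[str, bytes] = {}
    for field in fields:
        values = [obj.get(field) for obj in ordered]   # one value array per field
        key = f"{app}/{table}/{shard}/{batch_id}/{field}"
        rows[key] = zlib.compress(json.dumps(values).encode("utf-8"))
    return rows


def load_column(blob: bytes) -> list:
    """Decompress one stored row back into its value array."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


batch = [
    {"ID": "cZpz2q4Jf8Rc2HK9Cg08", "Size": 24220, "Priority": 1},
    {"ID": "5amhvv7J2otBu48Z6PE5cA", "Size": 48120, "Priority": 0},
    {"ID": "7CgvDf5mOU78jNVc58eu", "Size": 5435, "Priority": 0},
]
rows = store_batch("Email", "Message", "2014-03-01", batch)
size_key = next(k for k in rows if k.endswith("/Size"))
print(size_key, load_column(rows[size_key]))
```

Because every array shares the same object order, position i in the Size array and position i in the Priority array describe the same object, so no per-object key needs to be stored alongside each value.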

Page 15:

Merging Batches

[Diagram: Batch #1 and Batch #2 for shard 2014-03-01 each contain a Message table (ID, Size, SendDate, ...), a Person table (ID, FirstName, LastName, ...), and an Address table (ID, Person, Messages, ...). Merging writes them to the OLAP Store as one set of rows per table and shard.]

OLAP Store:

Key                                   Columns

Message table data, shard 2014-03-01:
Email/Message/2014-03-01/ID           [compressed data]
Email/Message/2014-03-01/Size         [compressed data]
Email/Message/2014-03-01/SendDate     [compressed data]
...                                   ...

Person table data, shard 2014-03-01:
Email/Person/2014-03-01/ID            [compressed data]
Email/Person/2014-03-01/FirstName     [compressed data]
Email/Person/2014-03-01/LastName      [compressed data]
...                                   ...

Address table data, shard 2014-03-01:
Email/Address/2014-03-01/ID           [compressed data]
Email/Address/2014-03-01/Person       [compressed data]
Email/Address/2014-03-01/Message      [compressed data]
...                                   ...

Data for other shards:
Email/Message/2014-02-28/ID           [compressed data]
Email/Message/2014-02-28/Size         [compressed data]
...
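A minimal sketch of the merge step, assuming each batch is a set of aligned value arrays: the batches for one shard are combined, re-sorted by object ID, and written back as one row per field with no batch identifier in the key. The rule that a later batch overwrites an earlier value for the same ID, the JSON plus zlib encoding, and the function name are all assumptions made for illustration.

```python
import json
import zlib
from typing import Dict, List

Column = List[object]


def merge_batches(app: str, table: str, shard: str,
                  batches: List[Dict[str, Column]]) -> Dict[str, bytes]:
    """Combine per-batch value arrays into one compressed row per field."""
    merged: Dict[str, Dict[str, object]] = {}
    for batch in batches:
        for i, obj_id in enumerate(batch["ID"]):
            record = merged.setdefault(obj_id, {})
            for field, values in batch.items():
                record[field] = values[i]   # assumption: later batches overwrite
    ids = sorted(merged)                    # merged arrays stay sorted by object ID
    fields = sorted({f for rec in merged.values() for f in rec})
    return {
        f"{app}/{table}/{shard}/{field}": zlib.compress(
            json.dumps([merged[i].get(field) for i in ids]).encode("utf-8"))
        for field in fields
    }


batch1 = {"ID": ["a1", "c3"], "Size": [100, 300]}
batch2 = {"ID": ["b2", "c3"], "Size": [200, 333]}
store = merge_batches("Email", "Message", "2014-03-01", [batch1, batch2])
sizes = json.loads(zlib.decompress(store["Email/Message/2014-03-01/Size"]))
print(sorted(store), sizes)
```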

Page 16:

Does Merging Take Long?

Page 17:

OLAP Query Execution

Example query:

- Count messages with Size between 1000-10000 and HasBeenSent=false in shards 2014-03-01 to 2014-03-31

How many rows are read?

- 2 fields x 31 shards = 62 rows

- Typically represents millions of values

Value arrays are scanned in memory

Physical rows are read on “cold” start only

- Multiple caching levels for “warm” and “hot” data
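The row arithmetic for the example query can be checked with a short sketch: one row per queried field per shard, followed by an in-memory scan of the decoded value arrays. The key layout follows the merged-row keys shown earlier in the deck; the function names, the sample arrays, and the inclusive range bounds are assumptions.

```python
from datetime import date, timedelta
from typing import List


def shard_names(first: date, last: date) -> List[str]:
    """One shard per calendar day, named YYYY-MM-DD, inclusive of both ends."""
    days = (last - first).days + 1
    return [(first + timedelta(days=n)).isoformat() for n in range(days)]


def rows_to_read(fields: List[str], shards: List[str]) -> List[str]:
    """Row keys touched by the query: one per field per shard."""
    return [f"Email/Message/{shard}/{field}" for shard in shards for field in fields]


def count_matches(sizes: List[int], sent: List[bool]) -> int:
    """Scan two aligned value arrays in memory; no index is consulted."""
    return sum(1 for size, was_sent in zip(sizes, sent)
               if 1000 <= size <= 10000 and not was_sent)


shards = shard_names(date(2014, 3, 1), date(2014, 3, 31))
keys = rows_to_read(["Size", "HasBeenSent"], shards)
print(len(shards), len(keys))
```

The scan itself only needs the two arrays for each shard, which is why the number of physical rows stays small even when each array holds a very large number of values.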

Page 18:

1 Billion Objects in 2GB?

Example Security Event (CSV format):

Fixed Fields                                   Variable Fields
Computer Name  MAILSERVER18                    1  MAILSERVER18$
Log Name       Security                        2
Time Stamp     Sun, 22 Jan 2013 08:09:50 UTC   3  Workstation
Type           Success Audit                   4  (0x0,0x142999A)
Source         Security                        5  3
Category       Logon/Logoff                    6  Kerberos
Event ID       540                             7  Kerberos
User Domain    NT AUTHORITY
User Name      SYSTEM
User SID       S-1-5-18

MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",Security,"Logon/Logoff", 540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,MAILSERVER18$,,Workstation,"(0x0,0x142999A)",3,Kerberos,Kerberos
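The CSV line above splits into ten fixed fields, a count, and that many variable fields. The sketch below parses it with Python's csv module; the field names follow the fixed-field labels on this slide, while the function name and the return shape are hypothetical.

```python
import csv
from typing import Dict, List, Tuple

FIXED_FIELDS = ["ComputerName", "LogName", "Timestamp", "Type", "Source",
                "Category", "EventID", "UserDomain", "UserName", "UserSID"]

LINE = ('MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",'
        'Security,"Logon/Logoff", 540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,'
        'MAILSERVER18$,,Workstation,"(0x0,0x142999A)",3,Kerberos,Kerberos')


def parse_event(line: str) -> Tuple[Dict[str, str], List[Tuple[int, str]]]:
    """Split one CSV event into fixed fields and numbered variable fields."""
    values = next(csv.reader([line]))
    fixed = {name: value.strip() for name, value in zip(FIXED_FIELDS, values)}
    count = int(values[len(FIXED_FIELDS)])       # number of variable fields
    strings = values[len(FIXED_FIELDS) + 1:]
    if len(strings) != count:
        raise ValueError(f"expected {count} variable fields, got {len(strings)}")
    return fixed, list(enumerate(strings, start=1))


fixed, strings = parse_event(LINE)
print(fixed["EventID"], len(strings))
```

If each variable field becomes its own object, this single event yields eight objects: the event itself plus seven variable-field objects.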

Page 19:

Events Schema

Events (count: 115 million)

Fields:

• ComputerName (text)

• LogName (text)

• Timestamp (timestamp)

• Type (text)

• Source (text)

• Category (text)

• EventID (integer)

• UserDomain (text)

• UserSID (text)

• Params (link)

Insertion Strings (count: 880 million)

Fields:

• Index (integer)

• Value (text)

• Event (link)

[Diagram: the Params link on Events points to Insertion Strings; the Event link on Insertion Strings is its inverse.]

Page 20:

Event Schema Load

Load stats:

Total shards: 860

Total events: 114,572,247

Total insertion strings: 879,529,753

Total objects: 994,102,000

Total load time: 2 hours, 2 minutes, 36 seconds (MacBook Air)

Space usage:

nodetool -h localhost status

Datacenter: datacenter1

=======================

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address Load Owns Host ID Token Rack

UN 127.0.0.1 1.96 GB 100.0% 860887ef-2027-431a-a425-c67a9445d0e6 -9176223118562734495 rack1
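The load statistics and the reported load can be cross-checked with simple arithmetic. The sketch below uses only the numbers on this slide; treating "GB" as 2**30 bytes is an assumption (with 10**9 bytes the per-object figure comes out slightly lower).

```python
EVENTS = 114_572_247
INSERTION_STRINGS = 879_529_753
LOAD_SECONDS = 2 * 3600 + 2 * 60 + 36   # 2 hours, 2 minutes, 36 seconds
LOAD_GB = 1.96                          # "Load" column from nodetool status

total_objects = EVENTS + INSERTION_STRINGS
objects_per_second = total_objects / LOAD_SECONDS
bytes_per_object = LOAD_GB * 2**30 / total_objects   # assumes GB = 2**30 bytes
strings_per_event = INSERTION_STRINGS / EVENTS

print(f"total objects:      {total_objects:,}")
print(f"objects per second: {objects_per_second:,.0f}")
print(f"bytes per object:   {bytes_per_object:.2f}")
print(f"strings per event:  {strings_per_event:.2f}")
```

The totals add up to the 994,102,000 objects listed above, which works out to roughly 135,000 objects loaded per second and about two bytes of stored data per object.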

Page 21:

Demo

1) Count all Events in all shards

- 860 shards => 115M events

2) Find the top 5 hours-of-the-day when certain privileged events fail:

- Event IDs are any of 577, 681, 529

- Event type is ‘Failure Audit’

- Insertion string 8 is (0x0,0x3E7)

- Event occurred in first half of 2005 (181 shards)
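Two parts of the second demo query can be illustrated without the Doradus API: the shard range and the "top 5 hours" grouping. The sketch assumes one shard per calendar day, which is consistent with the 181 shards quoted for the first half of 2005. The timestamps are made up, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date, datetime, timedelta
from typing import List, Tuple


def daily_shards(first: date, last: date) -> List[str]:
    """One shard name per calendar day, inclusive of both ends."""
    return [(first + timedelta(days=n)).isoformat()
            for n in range((last - first).days + 1)]


def top_hours(timestamps: List[datetime], limit: int = 5) -> List[Tuple[int, int]]:
    """Hours of the day ranked by how many matching events fall in them."""
    counts = Counter(ts.hour for ts in timestamps)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


shards = daily_shards(date(2005, 1, 1), date(2005, 6, 30))

# Hypothetical timestamps of events that already passed the demo's filters.
sample = [datetime(2005, 1, 3, 9, 15), datetime(2005, 2, 7, 9, 40),
          datetime(2005, 3, 1, 14, 5), datetime(2005, 5, 20, 23, 59)]

print(len(shards), top_hours(sample, 2))
```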

Page 22:

Doradus OLAP Summary

Advantages:

Simple REST API

All fields are searchable without indexes

Ad-hoc statistical searches

Support for graph-based queries

Near real time data warehousing

Dense storage = less hardware

Horizontally scalable when needed

Page 23:

Doradus OLAP Summary

Good for applications where data:

Is continuous/streaming

Is structured to semi-structured

Can be loaded in batches

Is partitionable, especially by time

Is typically queried in a subset of shards

Emphasizes statistical queries

Page 24:

Thank You!

Where to find Doradus

- Source: github.com/dell-oss/Doradus

- Downloads: search.maven.org

Contact me

- [email protected]

- @randyguck