Dealing with datasets that grow so large that they become awkward to work with using on-hand database management tools.

Handling BigData On the Public Cloud

Based on InterOp 2011 presentation by Liran Zelkha

Co-founder of ScaleBase

Before that, led Aluna – a database and architecture consulting company

Over 15 years of hands-on technology experience

Agenda

– What is Big Data
– Big Data On Public Clouds
– Some solutions

What is Big Data?

Big Data (from Wikipedia)

…are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing. This trend continues because of the benefits of working with larger and larger datasets allowing analysts to "spot business trends, prevent diseases, combat crime." Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data.

Top 3 ways to know you have big data:

Number 3:

... you get a call from the utility company asking you not to run 'that brownout query' again. (@aristippus303 at Datawatch)

Number 2:

... it piles up so high that it disappears into the clouds (@evertlammerts - I assume pun was intended?)

Number 1:

... the SAN undergoes gravitational collapse and you get cited by OSHA for an unlicensed singularity. (@datamartist)

But seriously

It's not a single number

It is a set of parameters

Big Data Parameters

[Diagram: Big Data as the combination of three parameters]

– Volume of Data
– Velocity of Data
– Complexity of Analysis

Source: http://www2.neilmcgovern.com/main.html

Where do we see big data?

Everywhere

– Data Warehouse
– OLTP
– Web 2.0
– SaaS
– Billing
– Fraud detection
– CMS
– …Family history
– …Social networking

Volume of data

How much data do you have?

The more, the merrier
– Better analysis

Used to be measured in 100's of GB, then TB, now PB

But even a 300GB DB can still have Big Data problems

"If you have over 1TB of data – you have a Big Data problem", IDC

Velocity of data

How many users access the data?

How many writes occur on your data?

How many transactions does your database have?

Measured in transactions per second (TPS), counted by the thousands
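As a rough illustration of the TPS measure, the sketch below buckets commit timestamps into one-second windows and reports the average and peak rate. The function name `tps_profile` and the sample timestamps are made up for this example; real numbers would come from your database's own statistics.

```python
from collections import Counter

def tps_profile(commit_epoch_seconds):
    """Return (average TPS, peak TPS) for commit timestamps given in epoch seconds."""
    if not commit_epoch_seconds:
        return 0.0, 0
    # Count how many commits fall into each whole second
    per_second = Counter(int(t) for t in commit_epoch_seconds)
    first, last = min(per_second), max(per_second)
    window = last - first + 1  # number of one-second buckets, inclusive
    average = len(commit_epoch_seconds) / window
    peak = max(per_second.values())
    return average, peak

# Hypothetical sample: three quiet seconds followed by a one-second burst
sample = [100.1, 100.7, 101.2, 102.5] + [103.0 + i / 5000 for i in range(4000)]
average_tps, peak_tps = tps_profile(sample)
print(average_tps, peak_tps)
```

The gap between the average and the peak is usually the interesting part: a system sized for the average can still fall over during the burst.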

Complexity of Analysis

How complex are your queries? An example:

SELECT *
FROM (
  SELECT w.*, ROWNUM rnum
  FROM (
    SELECT DISTINCT w.watcher_id
    FROM watch w
    LEFT OUTER JOIN Profile p ON p.watcher_id = w.watcher_id
    JOIN atom_feed af ON af.resource_id_hash = w.resource_id_hash
    JOIN atom_feed_entry afe ON afe.atom_feed_id = af.atom_feed_id
    WHERE (p.LAST_ENTRY_PROCESSED_DATE IS NULL OR p.LAST_ENTRY_PROCESSED_DATE < afe.create_date)
      AND (p.email_enabled_flag IS NULL OR p.email_enabled_flag != 'F')
      AND af.resource_id = w.resource_id
      AND afe.create_date <= sysdate - ?
    ORDER BY w.watcher_id ASC
  ) w
  WHERE ROWNUM <= ?
)
WHERE rnum > ?;
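The query above pages through its result with the nested ROWNUM pattern, so its three bind parameters are, in order, an age in days, an upper row bound and a lower row bound. The sketch below shows one way to compute them for a given page; the helper names and the 1-based page numbering are assumptions, not part of the original presentation.

```python
def rownum_bounds(page, page_size):
    """Return (upper, lower) so that lower < rnum <= upper selects a 1-based page."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    upper = page * page_size
    lower = upper - page_size
    return upper, lower

def bind_values(min_age_days, page, page_size):
    """Bind parameters in the order the placeholders appear in the query."""
    upper, lower = rownum_bounds(page, page_size)
    # 1st ?: sysdate - ?   2nd ?: ROWNUM <= ?   3rd ?: rnum > ?
    return (min_age_days, upper, lower)

# Hypothetical call: entries at least 7 days old, third page of 50 watchers
params = bind_values(7, 3, 50)
print(params)
```

Note that every page re-runs the full four-table join and sort before discarding rows, which is part of what makes this kind of query expensive on a large dataset.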

Big Data on Public Clouds

Again from Wikipedia

– Public cloud or external cloud describes cloud computing in the traditional mainstream sense, whereby resources are dynamically provisioned on a fine-grained, self-service basis over the Internet, via web applications/web services, from an off-site third-party provider who bills on a fine-grained utility computing basis.

Public Cloud Implications

Pros:
– Elastic
– Unlimited storage
– Unlimited capacity

Cons:
– Performance
– Standard hardware (no appliances...)

Some Solutions

Column Store Database

New databases that internally store the data in columns, and not rows.

– Very good for OLAP
– Excellent for BigData
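To see why a column layout suits OLAP-style aggregation, here is a minimal in-memory sketch in plain Python. The table contents and the `to_columns` helper are invented for illustration and do not reflect how any particular column store lays out data on disk.

```python
# Row layout: each record keeps all of its fields together
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.5},
    {"order_id": 3, "region": "EU", "amount": 42.25},
    {"order_id": 4, "region": "APAC", "amount": 10.0},
]

def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Row layout: the aggregate has to walk every record, touching every field
total_row_layout = sum(record["amount"] for record in rows)

# Column layout: the aggregate reads one contiguous list and nothing else
total_column_layout = sum(columns["amount"])

print(total_row_layout, total_column_layout)
```

Both totals are the same; the difference is how much data has to be read to get there. The trade-off goes the other way for OLTP, where fetching or updating one whole record means touching every column.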

NoSQL Database

Again, from Wikipedia:

– NoSQL is the term used to designate database management systems that differ from classic relational database management systems (RDBMSes) in some way. These data stores may not require fixed table schemas, and usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage, a term that would include classic relational databases as a subset.
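Horizontal scaling usually means spreading keys across machines. The sketch below uses simple hash-modulo partitioning as one common way to do that; the `shard_for` function and the key names are hypothetical, and real products often use more elaborate schemes such as consistent hashing.

```python
import hashlib

def shard_for(key, shard_count):
    """Map a string key to a shard index in [0, shard_count)."""
    # A stable hash, so every client maps the same key to the same shard
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

# Hypothetical keys distributed over four machines
keys = ["user:%d" % i for i in range(1000)]
placement = {}
for key in keys:
    placement.setdefault(shard_for(key, 4), []).append(key)

for shard in sorted(placement):
    print(shard, len(placement[shard]))
```

Because each key lives on exactly one shard, a join across keys would need data from several machines, which is one reason these systems tend to avoid joins.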

NoSQL Database

– Non-relational databases
– Usually store data in memory, replicated across multiple machines
– Great latency

Unstructured Schema

– Since SQL is not used, ERD can be dynamic
– Some solutions store data as objects of any kind
– Some use binary serialization of the object
– Others use Map API (put, get)
– Players include: Cassandra, HiveDB, MemBase, MongoDB
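The Map API, binary serialization and in-memory replication mentioned above can be sketched together in a few lines. This is a toy model under stated assumptions: the `ReplicatedMap` class is made up, the "machines" are just dictionaries in one process, and it is not the API of any of the products listed.

```python
import pickle

class ReplicatedMap:
    """Toy key-value store that keeps a serialized copy of each value on every node."""

    def __init__(self, node_count=3):
        # Each dict stands in for the memory of one machine
        self.nodes = [{} for _ in range(node_count)]

    def put(self, key, value):
        blob = pickle.dumps(value)  # binary serialization of an arbitrary object
        for node in self.nodes:
            node[key] = blob

    def get(self, key, node_index=0):
        blob = self.nodes[node_index].get(key)
        return None if blob is None else pickle.loads(blob)

store = ReplicatedMap(node_count=3)
store.put("user:42", {"name": "Ada", "tags": ["admin"]})

store.nodes[0].clear()  # simulate losing one machine
lost_copy = store.get("user:42", node_index=0)
survivor_copy = store.get("user:42", node_index=1)
print(lost_copy, survivor_copy)
```

No schema is declared anywhere: the value can be any serializable object, which is what makes the data model dynamic.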

NewSQL

Dubbed by 451 analyst Matthew Aslett

"NewSQL" is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as 'ScalableSQL' to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term 'NewSQL' in the new report. And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.

New Databases

– New database engines
– Usually scale very well, can store a lot of data, and are targeted at virtual environments
– Players include
  – NimbusDB
  – VoltDB

New MySQL Storage Engines

New databases that look like MySQL from the outside
– MySQL network protocol
– MySQL SQL flavor

Players include
– Akiban
– ScaleDB

ScaleBase

– ScaleBase offers Database Load Balancers
– Scalability and high availability for your database, totally transparent to your application
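The slides do not describe how a database load balancer works internally. As a generic illustration only, the sketch below routes reads round-robin across replicas and sends everything else to a primary; the class, the host names and the SELECT-based rule are all assumptions and should not be read as ScaleBase's actual design.

```python
import itertools

class DatabaseLoadBalancer:
    """Toy router: reads rotate across replicas, writes go to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary for reads when no replicas are configured
        self._replicas = itertools.cycle(replicas or [primary])

    def route(self, sql):
        words = sql.split()
        first_word = words[0].upper() if words else ""
        if first_word == "SELECT":
            return next(self._replicas)
        return self.primary

# Hypothetical host names
balancer = DatabaseLoadBalancer("db-primary", ["db-replica-1", "db-replica-2"])
statements = [
    "SELECT * FROM watch",
    "UPDATE watch SET resource_id = 1",
    "select count(*) from atom_feed",
    "SELECT 1",
]
routes = [balancer.route(sql) for sql in statements]
print(routes)
```

The application keeps issuing ordinary SQL to a single endpoint and never sees which server answered, which is the sense in which such a layer is transparent.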

Summary

– There are many ways to handle BigData on cloud environments
– Understand your data requirements well – and use the right tool for the job
– No one tool fits them all!
