Dealing with datasets that grow so large that they become awkward to work with using on-hand database management tools.
Handling BigData On the Public Cloud
Based on InterOp 2011 presentation by Liran Zelkha
Co-founder of ScaleBase
Before that, led Aluna, a database and architecture consulting company
Over 15 years of hands-on technology experience
Agenda
What is Big Data
Big Data On Public Clouds
Some solutions
What is Big Data?
Big Data (from Wikipedia)
"…are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing. This trend continues because of the benefits of working with larger and larger datasets allowing analysts to "spot business trends, prevent diseases, combat crime." Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data."
Top 3 ways to know you have big data:
Number 3:
... you get a call from the utility company asking you not to run 'that brownout query' again. (@aristippus303 at Datawatch)
Number 2:
... it piles up so high that it disappears into the clouds (@evertlammerts - I assume pun was intended?)
Number 1:
... the SAN undergoes gravitational collapse and you get cited by OSHA for an unlicensed singularity. (@datamartist)
But seriously
It's not a single number
It is a set of parameters
Big Data Parameters
Volume of Data
Velocity of Data
Complexity of Analysis
Source: http://www2.neilmcgovern.com/main.html
Where do we see big data?
Everywhere
Data Warehouse
OLTP
Web 2.0
SaaS
Billing
Fraud detection
CMS
…Family history
…Social networking
Volume of data
How much data do you have?
The more, the merrier
– Better analysis
Used to be measured in 100s of GB, then TB, now PB
But even a 300GB DB can still have Big Data problems
"If you have over 1TB of data – you have a Big Data problem", IDC
Velocity of data
How many users access the data?
How many writes occur on your data?
How many transactions does your database have?
Measured in TPS (transactions per second), counted by the thousands
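A minimal sketch of how TPS could be measured, assuming you have a list of transaction commit timestamps in seconds (for example from a database log). The function name tps_stats and the sample timestamps are hypothetical, for illustration only.

```python
from collections import Counter


def tps_stats(commit_times):
    """Return (peak TPS, average TPS) for a list of commit timestamps in seconds."""
    if not commit_times:
        return 0, 0.0
    # Bucket every commit into the whole second it falls in.
    per_second = Counter(int(t) for t in commit_times)
    # Number of one-second buckets between the first and last commit, inclusive.
    span = int(max(commit_times)) - int(min(commit_times)) + 1
    return max(per_second.values()), len(commit_times) / span


# Hypothetical sample: six commits spread over four seconds.
peak, average = tps_stats([0.1, 0.4, 0.9, 1.2, 1.3, 3.7])
print(peak, average)
```

Peak TPS is usually the number to size for; the average hides bursts.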
Complexity of Analysis
How complex are your queries? An example:
SELECT *
FROM (
    SELECT w.*, ROWNUM rnum
    FROM (
        SELECT DISTINCT w.watcher_id
        FROM watch w
        LEFT OUTER JOIN Profile p ON p.watcher_id = w.watcher_id
        JOIN atom_feed af ON af.resource_id_hash = w.resource_id_hash
        JOIN atom_feed_entry afe ON afe.atom_feed_id = af.atom_feed_id
        WHERE (p.LAST_ENTRY_PROCESSED_DATE IS NULL
               OR p.LAST_ENTRY_PROCESSED_DATE < afe.create_date)
          AND (p.email_enabled_flag IS NULL OR p.email_enabled_flag != 'F')
          AND af.resource_id = w.resource_id
          AND afe.create_date <= sysdate - ?
        ORDER BY w.watcher_id ASC
    ) w
    WHERE ROWNUM <= ?
)
WHERE rnum > ?;
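The query above pages through results with nested ROWNUM filters: the second placeholder caps the number of rows fetched and the third skips the rows of earlier pages. A small sketch of how an application might compute those two bind values, assuming 1-based page numbers. The helper page_bounds is hypothetical and not part of the original presentation.

```python
def page_bounds(page, page_size):
    """Return (upper, lower) row bounds for 'ROWNUM <= upper' and 'rnum > lower'."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    upper = page * page_size   # last row number included in this page
    lower = upper - page_size  # last row number of the previous page
    return upper, lower


# Page 3 with 50 rows per page covers rows 101..150.
upper, lower = page_bounds(3, 50)
print(upper, lower)
```

Note that the database still has to produce and sort every row up to the upper bound, so deep pages get more expensive as the data grows.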
Big Data on Public Clouds
Again from Wikipedia
– Public cloud or external cloud describes cloud computing in the traditional mainstream sense, whereby resources are dynamically provisioned on a fine-grained, self-service basis over the Internet, via web applications/web services, from an off-site third-party provider who bills on a fine-grained utility computing basis.
Public Cloud Implications
Pros:
– Elastic
– Unlimited storage
– Unlimited capacity
Cons:
– Performance
– Standard hardware (no appliances...)
Some Solutions
Column Store Database
New databases that internally store the data in columns, and not rows.
Very good for OLAP
Excellent for BigData
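A toy sketch of why column storage suits OLAP, in plain Python. It is not how any particular column store is implemented; the table contents and the helper to_columns are made up for illustration. An aggregate over one column only has to read that column's values, instead of walking every full row.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "EU", "amount": 120.0},
    {"id": 2, "region": "US", "amount": 80.0},
    {"id": 3, "region": "EU", "amount": 200.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row store: every whole row is visited to read a single field.
total_from_rows = sum(row["amount"] for row in rows)
# Column store: only the 'amount' column is read.
total_from_columns = sum(columns["amount"])
print(total_from_rows, total_from_columns)
```

Values in one column also share a type, which is why real column stores tend to compress well.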
NoSQL Database
Again, from Wikipedia:
– NoSQL is the term used to designate database management systems that differ from classic relational database management systems (RDBMSes) in some way. These data stores may not require fixed table schemas, and usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage, a term that would include classic relational databases as a subset.
NoSQL Database
Non-relational databases
Usually store data in memory, replicated across multiple machines
Very low latency
Unstructured Schema
Since SQL is not used, the ERD can be dynamic
Some solutions store data as objects of any kind
Some use binary serialization of the object
Others use a Map API (put, get)
Players include: Cassandra, HiveDB, MemBase, MongoDB
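A minimal sketch of the ideas above: a Map API (put, get) over in-memory nodes, with each value serialized to binary and copied to more than one node. Everything here is hypothetical (the class ReplicatedStore, the node and replica counts, the CRC32-based placement) and does not reflect the design of any product named above.

```python
import pickle
import zlib


class ReplicatedStore:
    """Toy in-memory key-value store that keeps each entry on several nodes."""

    def __init__(self, node_count=3, replicas=2):
        if not 1 <= replicas <= node_count:
            raise ValueError("replicas must be between 1 and node_count")
        # Each "machine" is simulated by a plain dict.
        self.nodes = [dict() for _ in range(node_count)]
        self.replicas = replicas

    def _targets(self, key):
        # Deterministic placement: hash the key, then take consecutive nodes.
        first = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        blob = pickle.dumps(value)  # binary serialization of the object
        for index in self._targets(key):
            self.nodes[index][key] = blob

    def get(self, key):
        # Try each replica in turn, so a lost node does not lose the value.
        for index in self._targets(key):
            blob = self.nodes[index].get(key)
            if blob is not None:
                return pickle.loads(blob)
        raise KeyError(key)


store = ReplicatedStore()
store.put("user:1", {"name": "Ada", "tags": ["admin"]})
print(store.get("user:1"))
```

No schema is declared anywhere: the value can be any object that serializes, which is the flexibility (and the risk) of the schema-less approach.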
NewSQL
Dubbed by 451 analyst Matthew Aslett:
"NewSQL" is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as 'ScalableSQL' to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term 'NewSQL' in the new report. And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.
New Databases
New database engines
Usually scale very well, can store a lot of data, and are targeted at virtual environments
Players include:
– NimbusDB
– VoltDB
New MySQL Storage Engines
New databases that look like MySQL from the outside
– MySQL network protocol
– MySQL SQL flavor
Players include:
– Akiban
– ScaleDB
ScaleBase
ScaleBase offers Database Load Balancers
Scalability and high availability for your database, totally transparent to your application
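To make the general idea of a database load balancer concrete, here is a generic sketch of one common approach, read/write splitting: writes go to a primary, reads rotate across replicas. This is an assumption chosen for illustration and is not a description of how ScaleBase works; the class LoadBalancer and the server names are hypothetical.

```python
import itertools


class LoadBalancer:
    """Toy router: writes go to the primary, reads rotate across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary for reads when no replicas are configured.
        self._read_targets = itertools.cycle(replicas or [primary])

    def route(self, sql):
        """Return the name of the server that should run this statement."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._read_targets) if is_read else self.primary


balancer = LoadBalancer("primary", ["replica-1", "replica-2"])
for statement in ["SELECT * FROM watch", "UPDATE watch SET x = 1", "SELECT 1"]:
    print(balancer.route(statement), "<-", statement)
```

The application keeps sending ordinary SQL to a single endpoint; the routing decision happens in the balancer, which is what "transparent" means here.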
Summary
There are many ways to handle BigData on cloud environments
Understand your data requirements well, and use the right tool for the job
No one tool fits them all!