Big Data Management
TRANSCRIPT
Big Data Management
December 2015 Alberto Abelló & Oscar Romero 1

Motivation
“Without data you are just another person with an opinion.”
William Edwards Deming
“It is a capital mistake to theorize before one has data.”
Sherlock Holmes (A Study in Scarlet)
[Figure labels: Descriptive, Predictive, Prescriptive]

Big Data facets
• The Original
• as Technology
• as Data Distinctions
• as Signals
• as Opportunity
• as Metaphor
• as New Term for Old Stuff
Timo Elliott

Big Data definition
• Volume
• Velocity
• Variety
• …
• Variability
• Validity/Veracity
• Value
[Figure: millions of terabytes per year. From IBM “Understanding Big Data”]

Types of Big Data Analyzed in Industry

Big Data sources
Structured
• Created (i.e., business data)
• Provoked (e.g., customer feedback)
• Transacted
• Compiled (e.g., demographics)
• Experimental (e.g., sampling customers)
Unstructured
• Captured (e.g., search words)
• User-generated (e.g., social networks)

Big Data related areas
Volume and Velocity
• Declarative querying
• Query optimization
Variety and Variability
• Data quality
• Data integration
• Web and text mining
• Information retrieval
Validity/Veracity
• Data consistency
• Uncertainty
• Statistical reasoning
• Data linkage (provenance)
Value
• Analytics
• Data mining
• Algorithmics
• Automatic learning
• Simulation
• Privacy
Biologists, Linguists, Chemists, Sociologists, Engineers

VOLUME AND VELOCITY MANAGEMENT

Key features of cloud databases
a) Quick/cheap setup
b) Simple call-level interface or protocol
   • No declarative query language
c) Weaker concurrency model than ACID
d) Flexible schema
   • Ability to dynamically add new attributes
e) Multi-tenancy
f) Ability to horizontally scale
g) Ability to replicate & distribute (fragmentation)
h) Efficient use of distributed indexes and RAM
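
Features (b) and (d) above can be illustrated with a minimal sketch. The class `TinyKeyValueStore` and its `put`/`get`/`delete` methods are hypothetical names chosen for this example, not the API of any real cloud database; the point is only that the interface is a handful of calls and that records may carry different attributes.

```python
from typing import Any, Dict, Optional


class TinyKeyValueStore:
    """Hypothetical in-memory store exposing only put/get/delete calls."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, record: Dict[str, Any]) -> None:
        # No schema is declared: each record carries its own attributes.
        self._records[key] = dict(record)

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        record = self._records.get(key)
        return dict(record) if record is not None else None

    def delete(self, key: str) -> None:
        self._records.pop(key, None)


store = TinyKeyValueStore()
store.put("student:1", {"name": "Oscar", "surname": "Romero"})
# A new attribute is added dynamically, with no schema change beforehand.
store.put("student:2", {"name": "Anna", "surname": "Example", "city": "Lleida"})

# Without a declarative query language, filtering is the application's job.
with_city = sorted(
    key for key in ("student:1", "student:2") if "city" in (store.get(key) or {})
)
print(with_city)
```

The filtering loop at the end is the price of the simple interface: the application, not the store, has to know which keys exist and how to inspect each record.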

Distributed Database
A distributed database (DDB) is a database where data management is distributed over several nodes in a network.
• Each node is a database itself
  • Potential heterogeneity
• Nodes communicate through the network
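
A rough sketch of the definition above, assuming hash-based placement of keys and a fixed number of copies per key. `Node`, `TinyDistributedDB` and the `replicas` parameter are hypothetical names for illustration; real systems also handle failures, rebalancing and network communication, all of which are omitted here.

```python
import hashlib
from typing import Any, Dict, List


class Node:
    """Hypothetical node: a small database of its own (here, a dict)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, Any] = {}


class TinyDistributedDB:
    def __init__(self, nodes: List[Node], replicas: int = 2) -> None:
        if not 1 <= replicas <= len(nodes):
            raise ValueError("replicas must be between 1 and the node count")
        self.nodes = nodes
        self.replicas = replicas

    def _owners(self, key: str) -> List[Node]:
        # A stable hash picks the primary node; copies go to the next nodes.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return [
            self.nodes[(start + i) % len(self.nodes)] for i in range(self.replicas)
        ]

    def put(self, key: str, value: Any) -> None:
        for node in self._owners(key):
            node.data[key] = value

    def get(self, key: str) -> Any:
        # Read from the primary owner only; failure handling is left out.
        return self._owners(key)[0].data.get(key)


nodes = [Node(f"node{i}") for i in range(3)]
ddb = TinyDistributedDB(nodes, replicas=2)
for i in range(100):
    ddb.put(f"student:{i}", {"id": i})
print({node.name: len(node.data) for node in nodes})
```

Each key lives on two of the three nodes, so the data is both fragmented (no node holds everything by design) and replicated.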

Parallel database architectures
D. DeWitt & J. Gray, “Parallel Database Systems: The Future of High Performance Database Processing”, 1992
Figure from D. Abady

Schemaless Databases (I)
CREATE TABLE Student (
  id int,
  name varchar2(50),
  surname varchar2(50),
  enrolment date);
INSERT INTO Student VALUES (1, 'Oscar', 'Romero', '01/01/2012', true, 'Lleida');  -- WRONG
INSERT INTO Student VALUES (1, 'Oscar', 'Romero', NULL);  -- OK
INSERT INTO Student VALUES (1, 'Oscar', 'Romero', '01/01/2012');  -- OK
Without a schema (pseudocode):
Insert into Student (1, {'Oscar', 'Romero', '01/01/2012'});
Consequences
• Gain flexibility
• May reduce the impedance mismatch
  • Coupled with HLLs (e.g., Java)
• Lose semantics (also consistency)
• The data independence principle is lost (!)
  • The ANSI/SPARC architecture is not followed
  • Applications can access and manipulate the database internal structures
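
The contrast on this slide can be tried out with Python's standard `sqlite3` module. This is only a sketch under assumptions: SQLite types differ from the `varchar2`/`date` types on the slide, and the `StudentBlob` table and the helper `try_insert` are hypothetical names introduced here to mimic a schemaless store.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student (id INTEGER, name TEXT, surname TEXT, enrolment TEXT)"
)


def try_insert(values):
    """Return True if the relational insert is accepted, False otherwise."""
    placeholders = ", ".join("?" for _ in values)
    try:
        conn.execute(f"INSERT INTO Student VALUES ({placeholders})", values)
        return True
    except sqlite3.Error:
        # The declared schema rejects rows that do not match it.
        return False


too_many = try_insert((1, "Oscar", "Romero", "01/01/2012", True, "Lleida"))
with_null = try_insert((1, "Oscar", "Romero", None))
exact = try_insert((1, "Oscar", "Romero", "01/01/2012"))

# Schemaless variant: the database only sees an id and an opaque payload.
conn.execute("CREATE TABLE StudentBlob (id INTEGER, payload TEXT)")
conn.execute(
    "INSERT INTO StudentBlob VALUES (?, ?)",
    (1, json.dumps(["Oscar", "Romero", "01/01/2012", True, "Lleida"])),
)
# Only the application knows that position 0 of the payload is the name.
payload = json.loads(conn.execute("SELECT payload FROM StudentBlob").fetchone()[0])
print(too_many, with_null, exact, payload[0])
```

The second table accepts anything, which is the flexibility gained; the meaning of each position now lives in application code, which is the semantics lost.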

Schemaless Databases (II)
NoSQL solution for the impedance mismatch
Several new data models were introduced
• Graph data model
• Document-oriented databases
• Key-value (~ hash tables)
• Streams (~ vectors and matrices)
These new models lack an explicit schema (defined by the user)
• However, an implicit schema remains
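
One way to see that an implicit schema remains is to derive it from the stored documents. The function `infer_implicit_schema` below is a hypothetical helper written for this example; it simply counts which value types occur under each attribute name.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def infer_implicit_schema(docs: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, int]]:
    """Count, per attribute, which value types occur across the documents."""
    schema: Dict[str, Counter] = {}
    for doc in docs:
        for attribute, value in doc.items():
            schema.setdefault(attribute, Counter())[type(value).__name__] += 1
    return {attribute: dict(types) for attribute, types in schema.items()}


students = [
    {"id": 1, "name": "Oscar", "surname": "Romero"},
    {"id": 2, "name": "Oscar", "city": "Lleida"},
    {"id": "3", "name": "Oscar"},  # same attribute, different type
]
implicit = infer_implicit_schema(students)
print(implicit)
```

The result shows optional attributes (`surname`, `city`) and a type conflict on `id`, which are exactly the things an explicit schema would have declared or forbidden up front.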

Impedance Mismatch
[Figure] Petra Selmer, Advances in Data Management 2012

Impedance Mismatch
[Figure] Petra Selmer, Advances in Data Management 2012

Different applications
Not Only SQL (different problems entail different solutions)
• OLTP
  • Object-Relational
  • Distributed databases
  • Parallel databases
• Scientific databases and other Big Data repositories
  • Key-value stores
• Data Warehousing & OLAP
  • MOLAP
  • Column stores
  • Multidimensional features
• Text / documents
  • Document databases
  • XML/JSON databases
• Stream processing
  • Stream processor
• Semantic Web and Open Data
  • Graph databases
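
As an illustration of why Data Warehousing & OLAP is paired with column stores and multidimensional features, here is a small sketch contrasting a row layout with a column layout for the same data. The table contents and the names `rows`, `columns` and `cube` are invented for the example.

```python
from typing import Dict, List

# Row layout: each record is stored whole.
rows: List[dict] = [
    {"product": "beer", "region": "north", "sales": 10},
    {"product": "wine", "region": "north", "sales": 7},
    {"product": "beer", "region": "south", "sales": 5},
]

# Column layout: one list per attribute, aligned by position.
columns: Dict[str, list] = {name: [r[name] for r in rows] for name in rows[0]}

# Summing one measure from rows walks every whole record.
total_from_rows = sum(r["sales"] for r in rows)
# Summing from columns reads only the column the aggregate needs.
total_from_columns = sum(columns["sales"])

# A simple multidimensional view: sales by (product, region).
cube: Dict[tuple, int] = {}
for product, region, sales in zip(
    columns["product"], columns["region"], columns["sales"]
):
    cube[(product, region)] = cube.get((product, region), 0) + sales

print(total_from_rows, total_from_columns, cube)
```

Both layouts give the same answer; the difference in a real system is how much data has to be read to compute it.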

Specific platforms
• Google BigTable
  • Published in 2006
  • Implemented by HBase
  • Also Dynamo and Cassandra
• Google MapReduce
  • Published in 2004
  • Implemented by Hadoop
• MongoDB
  • Published in 2007
• Neo4J/Sparksee
  • Published in 2010/2008
• SAP HANA
  • Published in 2011
  • Prototyped in SanssouciDB

Polyglot Systems
Federate different kinds of storage systems
Martin Fowler, http://martinfowler.com/bliki/PolyglotPersistence.html
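
Federating different kinds of storage can be sketched as a thin facade that routes each kind of data to the store that suits it. Everything here is an assumption made for illustration: `PolyglotShop` is a hypothetical class, a dict stands in for a key-value store, an adjacency map stands in for a graph database, and an in-memory SQLite database stands in for a relational system.

```python
import sqlite3
from typing import Dict, List, Set


class PolyglotShop:
    """Hypothetical facade that federates three kinds of storage."""

    def __init__(self) -> None:
        self.sessions: Dict[str, dict] = {}        # key-value style
        self.friends: Dict[str, Set[str]] = {}     # graph style (adjacency)
        self.orders = sqlite3.connect(":memory:")  # relational style
        self.orders.execute("CREATE TABLE orders (customer TEXT, amount REAL)")

    def save_session(self, session_id: str, state: dict) -> None:
        self.sessions[session_id] = state

    def add_order(self, customer: str, amount: float) -> None:
        self.orders.execute("INSERT INTO orders VALUES (?, ?)", (customer, amount))

    def add_friendship(self, a: str, b: str) -> None:
        self.friends.setdefault(a, set()).add(b)
        self.friends.setdefault(b, set()).add(a)

    def total_spent(self, customer: str) -> float:
        row = self.orders.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer = ?",
            (customer,),
        ).fetchone()
        return row[0]

    def friends_of_friends(self, customer: str) -> List[str]:
        direct = self.friends.get(customer, set())
        reachable: Set[str] = set()
        for friend in direct:
            reachable |= self.friends.get(friend, set())
        return sorted(reachable - direct - {customer})


shop = PolyglotShop()
shop.save_session("s1", {"cart": ["book"]})
shop.add_order("ana", 10.0)
shop.add_order("ana", 5.5)
shop.add_friendship("ana", "bob")
shop.add_friendship("bob", "carl")
print(shop.total_spent("ana"), shop.friends_of_friends("ana"))
```

The application code stays simple, but it now depends on three storage technologies with different interfaces and guarantees.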

Internal Structures
[Figure] Ben Stopford, Progscon & JAX Finance 2015

The Problem is Not SQL
Relational systems are too generic…
• OLTP: stored procedures and simple queries
• OLAP: ad-hoc complex queries
• Documents: large objects
• Streams: time windows with volatile data
• Scientific: uncertainty and heterogeneity
… But the overhead of RDBMS has nothing to do with SQL
• Low-level, record-at-a-time interface is not the solution
  • No standard
  • No ACID
Michael Stonebraker. SQL Databases v. NoSQL Databases. Communications of the ACM, 53(4), 2010
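
The difference between a declarative query and a record-at-a-time interface can be shown with Python's standard `sqlite3` module. This is a small sketch with invented sample rows; it does not measure overhead, it only contrasts who does the work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [(1, "Oscar", "Lleida"), (2, "Anna", "Barcelona"), (3, "Pau", "Lleida")],
)

# Declarative: state what is wanted; the engine chooses how to evaluate it.
declarative = conn.execute(
    "SELECT city, COUNT(*) FROM Student GROUP BY city ORDER BY city"
).fetchall()

# Record-at-a-time: the application fetches every row and aggregates itself.
counts = {}
for _id, _name, city in conn.execute("SELECT id, name, city FROM Student"):
    counts[city] = counts.get(city, 0) + 1
record_at_a_time = sorted(counts.items())

print(declarative, record_at_a_time)
```

Both approaches return the same counts, but in the second one the grouping logic has moved into the application, where the database can no longer optimize it.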

Hadoop ecosystem
[Figure labels: Small Analytics, Big Analytics, Warehousing, Data Lake]
By Víctor Herrero. Big Data Management & Analytics (UPC School)

Brewery or bottled beer?
Do It Yourself
• Expensive
• Ad hoc development
Off the Shelf
• Economies of scale
• Concrete functionalities
Florian Waas analogy

Conclusions
• Big Data is not only volume
• Cloud databases are not simply databases in the cloud
  • Distributed Database principles
• Variety matters

Bibliography
• M. T. Özsu and P. Valduriez. Principles of Distributed Database Systems, 3rd Ed. Springer, 2011
• A. Ghazal et al. BigBench: towards an industry standard benchmark for big data analytics. SIGMOD Conference, 2013
• R. Cattell. Scalable SQL and NoSQL Data Stores. SIGMOD Record 39(4), 2010
• L. Liu, M. T. Özsu (Eds.). Encyclopedia of Database Systems. Springer, 2009