8/18/2019 3.2-HadoopDB
HadoopDB: An Architectural Hybrid of MapReduce and DBMS
Technologies for Analytical Workloads
Azza Abouzeid1, Kamil Bajda-Pawlikowski1,
Daniel Abadi1, Avi Silberschatz1, Alexander Rasin2
1Yale University, 2Brown University
{azza,kbajda,dna,avi}@cs.yale.edu; alexr@cs.brown.edu
Presented by Ying Yang
10/01/2012
Outline
• Introduction, desired properties and background
• HadoopDB Architecture
• Results
• Conclusions
Introduction
• Analytics are important today
• Data amount is exploding
• Previous problem: DBMS on shared-nothing architectures
(a collection of independent, possibly virtual, machines, each with local disk and local
main memory, connected together on a high-speed network.)
Introduction
• Approaches:
– Parallel databases (analytical DBMS systems that deploy on a shared-nothing
architecture)
– Map/Reduce systems
Introduction
Sales Record Example
• Consider a large data set of sales log records, each consisting of sales information including:
(1) a date of sale (2) a price
• We would like to take the log records and generate a report showing the total sales for each year.
• SELECT YEAR(date) AS year, SUM(price)
• FROM sales GROUP BY year
Question:
• How do we generate this report efficiently and cheaply over massive data contained in a shared-nothing cluster of 1000s of machines?
Introduction
Sales Record Example using Hadoop:
Query: Calculate total sales for each year.
We write a MapReduce program:
Map: Takes log records and extracts a key-value pair of year and sale price in dollars.
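The Map step above, together with the shuffle and reduce steps a complete job would need, can be sketched in plain Python. This is an illustrative simulation rather than Hadoop code: the record layout (`date,price`), the function names, and the sample values are all assumptions.

```python
from collections import defaultdict

def map_record(line):
    """Map: extract a (year, price) pair from one 'YYYY-MM-DD,price' record (assumed layout)."""
    date, price = line.split(",")
    return int(date[:4]), float(price)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for year, price in pairs:
        groups[year].append(price)
    return groups

def reduce_year(year, prices):
    """Reduce: sum all sale prices that share a year."""
    return year, sum(prices)

def total_sales_by_year(lines):
    groups = shuffle(map_record(line) for line in lines)
    return dict(reduce_year(year, prices) for year, prices in groups.items())

log = ["2008-03-01,10.0", "2008-07-15,5.5", "2009-01-02,7.0"]
print(total_sales_by_year(log))
```

On a real cluster the map and reduce functions run on many machines and the shuffle moves data over the network; here everything runs in one process to keep the data flow visible.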
Introduction
Parallel Databases:
Parallel databases are like single-node databases except:
Data is partitioned across nodes
Individual relational operations can be executed in parallel
SELECT YEAR(date) AS year, SUM(price) FROM sales GROUP BY year
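A minimal sketch of how such a partitioned aggregate can run in parallel: each node computes partial sums over its own partition, and a coordinator merges them. Threads stand in for nodes, and the round-robin partitioning scheme and all names are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_nodes):
    """Spread rows across nodes round-robin (one of several possible schemes)."""
    return [rows[i::n_nodes] for i in range(n_nodes)]

def local_aggregate(rows):
    """Runs on one node: SUM(price) GROUP BY year over the local partition only."""
    sums = Counter()
    for year, price in rows:
        sums[year] += price
    return sums

def parallel_sum_by_year(rows, n_nodes=3):
    parts = partition(rows, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_aggregate, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values key by key
    return dict(total)

sales = [(2008, 10), (2009, 7), (2008, 5), (2009, 3), (2010, 1)]
print(parallel_sum_by_year(sales))
```

Because SUM is distributive, merging the per-node partial sums gives the same answer as a single-node scan.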
Introduction
• Parallel Databases:
Fault tolerance
– Does not score so well
– Assumption: dozens of nodes in clusters, failures are rare
• MapReduce
Satisfies fault tolerance
Works in heterogeneous environments
Drawback: performance
– No previous modeling
– No performance-enhancing techniques
Introduction
• In summary
Desired properties
• Performance (Parallel DBMS)
• Fault tolerance (MapReduce)
• Heterogeneous environments (MapReduce)
• Flexible query interface (both)
Outline
• Introduction, desired properties and background
• HadoopDB Architecture
• Results
• Conclusions
HadoopDB Architecture
• Main goal: achieve the properties described before
• Connect multiple single-node database systems
– Hadoop responsible for task coordination and network layer
– Queries parallelized across the nodes
• Fault tolerant and works on heterogeneous nodes
• Parallel database performance
– Query processing in the database engine
HadoopDB's Components
Database connector
• Replaces the functionality of HDFS with the
database connector. Gives Mappers the ability
to read the results of a database query
• Responsibilities
– Connect to the database
– Execute the SQL query
– Return the results as key-value pairs
• Achieved goal
– Data sources are similar to data blocks in HDFS
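The three connector responsibilities (connect, execute the SQL query, return key-value pairs) can be sketched as follows. This is not HadoopDB's actual connector: an in-memory SQLite database stands in for a node-local PostgreSQL instance, and the function name and the choice of the first column as key are assumptions.

```python
import sqlite3

def read_as_key_value(conn, sql, key_index=0):
    """Execute sql and yield each row as (key, remaining columns), the shape a Mapper consumes."""
    cursor = conn.execute(sql)
    for row in cursor:
        key = row[key_index]
        value = row[:key_index] + row[key_index + 1:]
        yield key, value

# Stand-in for one node-local database (an assumption; the slides use PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(2008, 10.0), (2008, 5.5), (2009, 7.0)])

pairs = list(read_as_key_value(
    conn, "SELECT year, SUM(price) FROM sales GROUP BY year ORDER BY year"))
print(pairs)
```

From the Mapper's point of view the query result plays the role an HDFS block normally plays: a stream of records to process.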
HadoopDB's Components
Catalog
• Metadata about databases
• Database location, driver class,
credentials
• Datasets in cluster, replica or partitioning
• Catalog stored as XML file in HDFS
• Plan to deploy as a separate
service (similar to NameNode)
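To make the catalog contents concrete, here is a sketch that parses a small XML catalog and looks up where a dataset lives. The XML layout is hypothetical: the element names, attribute names, hosts, and credentials are invented for illustration and are not HadoopDB's real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout: element and attribute names are illustrative only.
CATALOG_XML = """
<catalog>
  <node host="node1">
    <database url="jdbc:postgresql://node1/sales_0"
              driver="org.postgresql.Driver" user="hdb" password="secret">
      <dataset name="sales" partition="0"/>
    </database>
  </node>
  <node host="node2">
    <database url="jdbc:postgresql://node2/sales_1"
              driver="org.postgresql.Driver" user="hdb" password="secret">
      <dataset name="sales" partition="1"/>
    </database>
  </node>
</catalog>
"""

def locate(catalog_xml, dataset):
    """Return (host, url, driver, partition) for every database holding part of dataset."""
    root = ET.fromstring(catalog_xml.strip())
    found = []
    for node in root.findall("node"):
        for db in node.findall("database"):
            for ds in db.findall("dataset"):
                if ds.get("name") == dataset:
                    found.append((node.get("host"), db.get("url"),
                                  db.get("driver"), int(ds.get("partition"))))
    return found

for entry in locate(CATALOG_XML, "sales"):
    print(entry)
```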
HadoopDB's Components
Data loader
• Responsibilities:
– Globally repartitioning data
– Breaking each single node's data into chunks
– Bulk-loading the chunks into the single-node databases
• Two main components:
– Global Hasher
• Map/Reduce job reads from HDFS and repartitions (splits data
across nodes, i.e., relational database instances)
– Local Hasher
• Copies from HDFS to the local file system (subdivides data
within each node)
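The two hashing stages can be sketched as below. This is a single-process illustration under stated assumptions: `crc32` and `adler32` are arbitrary stand-ins for whatever hash functions the real loader uses, and the record layout and function names are invented.

```python
import zlib
from collections import defaultdict

def global_hash(records, key_of, n_nodes):
    """Global Hasher: assign each record to a node by hashing its partitioning key."""
    nodes = defaultdict(list)
    for rec in records:
        digest = zlib.crc32(str(key_of(rec)).encode("utf-8"))
        nodes[digest % n_nodes].append(rec)
    return nodes

def local_hash(node_records, key_of, n_chunks):
    """Local Hasher: subdivide one node's records into smaller chunks."""
    chunks = defaultdict(list)
    for rec in node_records:
        # A second, different hash (an assumption) so chunking is not tied to node choice.
        digest = zlib.adler32(str(key_of(rec)).encode("utf-8"))
        chunks[digest % n_chunks].append(rec)
    return chunks

records = [(2008, 10), (2009, 7), (2008, 5), (2010, 1)]
by_year = lambda rec: rec[0]
nodes = global_hash(records, by_year, n_nodes=2)
layout = {n: dict(local_hash(recs, by_year, n_chunks=3)) for n, recs in nodes.items()}
print(layout)
```

Records with the same key always land on the same node and in the same chunk, which is what lets later queries that group or join on that key run locally.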
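HadoopDB's Components
Data loader
[Figure: raw data files pass through the Global Hasher, which produces per-node blocks (A, B, ...), and then the Local Hasher, which splits each block into chunks (Aa, Ab, ..., Az).]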
HadoopDB's Components
SMS (SQL to MapReduce to SQL) Planner
• SMS planner extends Hive.
• Hive processing steps:
– Abstract Syntax Tree building
– Semantic analyzer connects to catalog
– DAG of relational operators (directed acyclic graph)
–
HadoopDB's Components
SELECT YEAR(date) AS year, SUM(price)
FROM sales GROUP BY year
HadoopDB's Components
SMS Planner extensions
Two phases before execution
– Retrieve data fields to determine
partitioning keys
– Traverse DAG (bottom up). Rule-based SQL generator
HadoopDB's Components
MapReduce job generated by SMS assuming sales is partitioned by YEAR(saleDate). This feature is still unsupported.
HadoopDB's Components
MapReduce job generated by SMS assuming no partitioning of sales
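The difference between the partitioned and unpartitioned cases can be sketched as a single planning rule. This is a simplified illustration, not the SMS planner's code: the function names, the `price` column, and the assumption that the same aggregate SQL is pushed into each database in both cases are all hypothetical.

```python
def plan_aggregate(table, group_expr, agg_col, partition_key=None):
    """Return (SQL pushed into each node's database, whether a Reduce phase is needed)."""
    sql = (f"SELECT {group_expr}, SUM({agg_col}) "
           f"FROM {table} GROUP BY {group_expr}")
    # If a group never spans nodes, the per-node results are already final.
    needs_reduce = partition_key != group_expr
    return sql, needs_reduce

def merge_partials(per_node_rows):
    """Reduce phase for the unpartitioned case: add up per-node partial sums."""
    totals = {}
    for rows in per_node_rows:
        for key, partial in rows:
            totals[key] = totals.get(key, 0) + partial
    return totals

sql, needs_reduce = plan_aggregate("sales", "YEAR(saleDate)", "price",
                                   partition_key="YEAR(saleDate)")
print(sql, needs_reduce)

_, needs_reduce_unpartitioned = plan_aggregate("sales", "YEAR(saleDate)", "price")
print(needs_reduce_unpartitioned)
print(merge_partials([[(2008, 10)], [(2008, 5), (2009, 7)]]))
```

When the table is partitioned on the grouping expression, the job can be map-only; otherwise each database returns partial sums and a Reduce step combines them.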
Outline
• Introduction, desired properties and background
• HadoopDB Architecture
• Results
• Conclusions
Benchmarks
Environment:
• Amazon EC2 "large" instances
• Each instance
– 7.5 GB memory
– 2 virtual cores
– 850 GB storage
– 64-bit Linux Fedora 8
Benchmarks
Benchmarked systems:
• Hadoop
– 256 MB data blocks
– 1024 MB heap size
– 200 MB sort buffer
• HadoopDB
– Similar to Hadoop configuration
– PostgreSQL 8.2.5
– No data compression
Benchmarks
Benchmarked systems:
• Vertica
– New parallel database (column store)
– Used a cloud edition
– All data is compressed
• DBMS-X
– Commercial parallel row store
– Not run on EC2 (no cloud edition available)
Benchmarks
Used data:
• HTTP log files, HTML page rankings
• Sizes (per node):
– 155 million user visits (~20 gigabytes)
– 18 million rankings (~1 gigabyte)
– Stored as plain text in HDFS
Loading data
Grep Task
1) Full table scan, highly selective filter
2) Random data, no room for indexing
3) Hadoop overhead outweighs query processing time in single-node databases
SELECT * FROM Data WHERE field LIKE '%xyz%';
Selection Task
Query:
SELECT pageURL, pageRank
FROM Rankings
WHERE pageRank > 10
All except Hadoop used clustered indices on the pageRank column.
HadoopDB: each chunk is 50 MB; the overhead of scheduling such small chunks leads to bad performance.
Aggregation Task
Smaller query
SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
FROM UserVisits
GROUP BY SUBSTR(sourceIP, 1, 7)
Larger query
SELECT sourceIP, SUM(adRevenue)
FROM UserVisits
GROUP BY sourceIP
Join Task
Query
SELECT sourceIP, COUNT(pageRank), SUM(pageRank), SUM(adRevenue)
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL
AND UV.visitDate BETWEEN '2000-01-15' AND '2000-01-22'
GROUP BY UV.sourceIP
No full table scan due to clustered indexing
Hash partitioning and efficient join algorithm
UDF Aggregation Task
Summary
● HadoopDB approaches the performance of parallel databases in the absence of failures
– PostgreSQL is not a column store
– DBMS-X results are about 15% overly optimistic
– No compression in PostgreSQL data
– Overhead in the interaction between Hadoop and PostgreSQL
● Outperforms Hadoop
Benchmarks
● Fault tolerance and heterogeneous environments
Fault tolerance and heterogeneous environments
● Fault tolerance is important
Discussion
● Vertica is faster
● Vertica reduces the number of nodes needed to achieve the same order of magnitude of performance
● Fault tolerance is important
Conclusion
● Approaches the performance of parallel databases while providing fault tolerance
● PostgreSQL is not a column store
● Hadoop and Hive are relatively new open source projects
● HadoopDB is flexible and extensible
● Thank you!