datacubes in apache hive at apachecon
DESCRIPTION
Introduces OLAP cubes in Apache Hive and unified analytics platform GrillTRANSCRIPT
![Page 1: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/1.jpg)
Data cubes in Apache Hive
Amareshwari SriramadasuJaideep Dhok
![Page 2: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/2.jpg)
Amareshwari
Sriramadasu
•Engineer at Inmobi
•Working in Hadoop and eco systems since 2007
•Apache Hadoop – PMC
•Apache Hive– Committer
![Page 3: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/3.jpg)
JaideepDhok
•Engineer at Inmobi
•Working in distributed systems and Hadoop eco system since 2007
•Apache Hive and Hadoop Contributor
![Page 4: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/4.jpg)
Background
Why Apache Hive
Data cubes in Apache Hive
Queries on data cubes with examples
Grill
Agenda
![Page 5: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/5.jpg)
Background
Why Apache Hive
Data cubes in Hive
Queries on data cubes with examples
Grill
Agenda
![Page 6: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/6.jpg)
Global Mobile technology company enabling- Developers & Publishers to monetize
- Advertisers to engage and acquire users
@ Scale
About InMobi
![Page 7: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/7.jpg)
Digital advertising – Intro
Courtesy: http://www.liesdamnedlies.com/
Owns & Sells Real estate on digital inventory
Has reach to users
Wants to target Users
Brings money
Market place
Consumer
![Page 8: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/8.jpg)
Hadoop @ InMobi – Factual Reporting & Analytics
130 TB Hadoop warehouse
5 TB SQL warehouse
![Page 9: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/9.jpg)
Users• Advertisers and publishers
• Regional officers
• Account managers
• Executive team
• Business analysts
• Product analysts
• Data scientists
• Engineering systems
• Developers
Use cases
![Page 10: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/10.jpg)
Sharing reports/data with customer (account level)
Understand trends through data & exploration (analysis)
Debugging / Postmortem of issues (troubleshooting)
Sizing & Estimation (Ex: inventory, reach)
Summary of Product lines, Geographies, Network (Ex: Rev by Geo)
Sales/Revenue Targets vs Actuals
Tracking campaign performance
Tracking any metrics on REAL- TIME basis
Use cases
![Page 11: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/11.jpg)
Categorize the use cases• Batch queries
• Adhoc queries
• Interactive queries
• Scheduled reports
• Infer insights through ML algorithms
Use cases
![Page 12: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/12.jpg)
Adhoc querying system Dashboard system
Customer facing system
Analytics systems at Inmobi
![Page 13: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/13.jpg)
•Disparate user experience
•Disparate data storage systems causing inability to scale
•Schema management
•Not leveraging community around
Problems
![Page 14: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/14.jpg)
Background
Why Apache Hive
Data cubes in Hive
Queries on data cubes with examples
Grill
Agenda
![Page 15: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/15.jpg)
Associates structure to dataProvides Metastore and catalog service – HcatalogProvides pluggable storage interfaceAccepts SQL like batch queriesHQL is widely adopted language by systems like Shark, ImpalaHas strong apache community
Data warehouse features like facts, dimensionsLogical table associated with multiple physical storagesInteractive queriesScheduling queriesUIW
hat
does
Hiv
e
pro
vid
e
Wh
at is m
issing in
H
ive
Apache Hive to the rescue
![Page 16: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/16.jpg)
Background
Why Apache Hive
Data cubes in Apache Hive
Queries on data cubes with examples
Grill
Agenda
![Page 17: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/17.jpg)
Data Layout – Fact data
Aggrk : measures (mak <= ma(k-1)), dimensions (dak <
da(k-1))
…..
Aggr2 : measures (ma2 <= ma1), dimensions (da2 < da1)
Aggr1 : measures (ma1 <= mr), dimensions (da1 < dr)
Raw data : measures (mr), dimensions(dr)
Other side of pyramid is aggregated at timed dimension
![Page 18: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/18.jpg)
Data Layout – snowflake
Aggrk : measures (mak <= ma(k-1)), dimensions (dak <
da(k-1))
…..
Aggr2 : measures (ma2 <= ma1), dimensions (da2 < da1)
Aggr1 : measures (ma1 <= mr), dimensions (da1 < dr)
Raw data : measures (mr), dimensions(dr)
Dim2_1
Dim3
Dim2
Dim4_1
Dim4
Dim1
![Page 19: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/19.jpg)
Data Model
Cube Storage Fact Table
Physical Fact tables
Dimension Table
Physical Dimension
tables
![Page 20: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/20.jpg)
Data Model - Cube
Dimension• Simple Dimension• Referenced Dimension• Hierarchical Dimension• Expression Dimension• Timed dimension
Measure• Column Measure • Expression Measure
Cube
Measures Dimensions
Note : Some of the concepts are borrowed from http://community.pentaho.com/projects/mondrian/
![Page 21: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/21.jpg)
Data Model – Storage
Storage
Name
End point
PropertiesEx : ProdCluster, StagingCluster, Postgres1, HBase1, HBase2
![Page 22: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/22.jpg)
Data Model – Fact Table
Fact table
Cube
Fact table
Storage
Fact Table
Columns
Cube that it belongsStorages on which it is present and the associated update periods
![Page 23: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/23.jpg)
Data Model – Dimension table
Dimension Table
Columns
Dimension referencesStorages on which it is present and associated snapshot dump period, if any.
Cube
Dimension table
Dimension table
Dimension table
Storage
![Page 24: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/24.jpg)
Data Model – Storage tables and partitions
Storage table
Belongs to fact/dimensionAssociated storage descriptor
Partitioned by columnsNaming convention – storage name followed by fact/dimension namePartition can override its storage descriptor
• Fact storage table
Fact table
• Dimension storage table
Dimension table
![Page 25: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/25.jpg)
Background
Why Apache Hive
Data cubes in Hive
Queries on data cubes with examples
Grill
Agenda
![Page 26: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/26.jpg)
CUBE SELECT [DISTINCT] select_expr, select_expr, ...FROM cube_table_referenceWHERE [where_condition AND] TIME_RANGE_IN(colName , from, to)[GROUP BY col_list][HAVING having_expr][ORDER BY colList][LIMIT number]
cube_table_reference:cube_table_factor
| join_tablejoin_table:
cube_table_reference JOIN cube_table_factor [join_condition]
| cube_table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN cube_table_reference [join_condition]
cube_table_factor:cube_name [alias]
| ( cube_table_reference )join_condition:
ON equality_expression ( AND equality_expression )*equality_expression:
expression = expressioncolOrder: ( ASC | DESC )
colList : colName colOrder? (',' colName colOrder?)*
Queries on Data cubes
![Page 27: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/27.jpg)
Resolve candidate tables and storages
Automatically resolve joins
Resolve aggregates and groupby expressions
Allows SQL over Cube QL
Queries can span multiple storages
Accepts multi time range queries
All Hive QL features
Query features
![Page 28: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/28.jpg)
• SELECT ( citytable . name ), ( citytable . stateid ) FROM c2_citytable citytable LIMIT 100
• SELECT ( citytable . name ), ( citytable . stateid ) FROM c1_citytable citytable WHERE (citytable.dt = 'latest') LIMIT 100
cube select name, stateid from citytable limit 100
Example query
![Page 29: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/29.jpg)
Example query
• SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER JOIN c1_citytable citytable ON ((testcube.cityid)= (citytable.id)) WHERE (( testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03-10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12-00') OR (testcube.dt='2014-03-12 -01') OR (testcube.dt='2014-03-12-02') )AND (citytable.dt = 'latest')
• GROUP BY(citytable.name)
cube select citytable.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03’, '2014-03-12-03’)
![Page 30: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/30.jpg)
Available in Hive• Data warehouse
features like facts, dimensions
• Logical table associated with multiple physical storages
Available in Grill • Pluggable execution engine
for HQL• Query history, caching• Scheduling queries
Where is it available
![Page 31: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/31.jpg)
Background
Why Apache Hive
Data cubes in Hive
Queries on data cubes with examples
Grill
Agenda
![Page 32: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/32.jpg)
Unified analytics platform at Inmobi
• Supports multiple execution engines• Supports multiple storages
Provides analytics on the system
Provides query history
Grill
![Page 33: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/33.jpg)
Grill Architecture
![Page 34: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/34.jpg)
Implements an interface• execute• explain• executeAsynchronously• fetchResults• Specify all storages it can support
Pluggable execution engine
![Page 35: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/35.jpg)
Cube QL query
Rewrite query for available execution engine’s supported storages
Get cost of the rewritten query from each execution engine
Pick up execution engine with least cost and fire the query
Cube query with multiple execution engines
![Page 36: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/36.jpg)
Grill Server
![Page 37: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/37.jpg)
Grill server – code status
![Page 38: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/38.jpg)
• Number of queries - 700 to 900 per day
• Number of dimension tables - 125
• Number of fact tables – 24
• Number cubes – 15
• Size of the data
• Total size – 136 TB
• Dimension data – 400 MB compressed per hour
• Raw data - 1.2 TB per day
• Aggregated facts- 53GB per day
Data ware house statistics
![Page 39: Datacubes in Apache Hive at ApacheCon](https://reader033.vdocuments.site/reader033/viewer/2022061300/54c6fa3b4a79591c038b463e/html5/thumbnails/39.jpg)
Jira reference
• https://issues.apache.org/jira/browse/HIVE-4115
Github source for Hive
• https://github.com/InMobi/hive
Github source for grill
• https://github.com/InMobi/grill
Documentation
• http://inmobi.github.io/grill
References