Download - Better Performance for Big Data
![Page 1: Better Performance for Big Data](https://reader036.vdocuments.site/reader036/viewer/2022062408/56813b7e550346895da49c1d/html5/thumbnails/1.jpg)
Better Performance for Big Data
Shuya Zhang; Shyam Sundar Somasundaram
[10/03/13]
1
[1] Bhasker Allene, Marco Righini, “Better Performance for Big Data” Intel White Paper, 2013 Intel Corporation.
Reference
![Page 2: Better Performance for Big Data](https://reader036.vdocuments.site/reader036/viewer/2022062408/56813b7e550346895da49c1d/html5/thumbnails/2.jpg)
What is Hadoop
Apache Hadoop
an open-source software framework Supports data-intensive distributed applications
Enables the running of
applications on large
clusters of commodity
Hardware Derived from Google's MapReduce and Google File System
(GFS) papers
2
Hadoop is one of the poster children of Big Data,
especially the most beloved one!
![Page 3: Better Performance for Big Data](https://reader036.vdocuments.site/reader036/viewer/2022062408/56813b7e550346895da49c1d/html5/thumbnails/3.jpg)
The Intel Distribution for Hadoop framework
● Oozie is a workflow scheduler
● Hive enables SQL queries on Hadoop
● Hbase is a non-relational, distributed database
![Page 4: Better Performance for Big Data](https://reader036.vdocuments.site/reader036/viewer/2022062408/56813b7e550346895da49c1d/html5/thumbnails/4.jpg)
Oozie Workflow
Step1 Convert EBCDIC to ASCII
Step 2 Scan for New Columns
Step3 Move Columns List to Metadata
Step4 Optimize Data for Hive
Step5 Move Data to Hive Warehouse
Step6 Drop and Create Hive Table Structure
![Page 5: Better Performance for Big Data](https://reader036.vdocuments.site/reader036/viewer/2022062408/56813b7e550346895da49c1d/html5/thumbnails/5.jpg)
Data Flow & Data Optimization
● Benefits:- Data is stored in a normalized
way
- Hive queries quite similar to RDBMS queries
- The learning curve is minimized, with fewer computations and less disk space
- Data is easily consumable for data analysis tools
![Page 6: Better Performance for Big Data](https://reader036.vdocuments.site/reader036/viewer/2022062408/56813b7e550346895da49c1d/html5/thumbnails/6.jpg)
Comparing SQL and Hive Queries
Query 1• RDBMS: select distinct tp_ndg as N010from scontrpf50m0130331• Hive: SELECT DISTINCT N010 FROM O0A011 LIMIT10;
Query 2• RDBMS: SELECT DISTINCT B. COD_UO, b.Tp_conto, a. Tp_ndg as N010 FROM A JOIN SCONTRPF50M0130331 CUBOM0100M0130331 B ON A. NDG = B. NDG WHERE POSIZ_SOFF_INCAGLT = '1 'and a. TP_NDG in ('DIN', 'IOC', 'SPF') and b. dt_accs_rapprt> = '2013-03-01 'and b. dt_accs_rapprt <= '2013-03-15' ORDER BY B. COD_UO, b. Tp_conto, a. TP_NDG• HIVE: SELECT DISTINCT B.DOOR, B.TP_INCOME, A.N010 FROM O0A011 A JOIN CUBOM0100M0130331 B ON A.NDG = B.NDG WHERE N011 ='1' AND A.N010 in ('DIN', 'IOC', 'SPF') AND B.R021 > '130100' AND B.R021 < '130316'; ORDER BY DOOR, TP_INCOME, N010;
Query 3• RDBMS: select b. cod_uo, b. forma_tec,TP_NDG as N010, substring (a. sae_rae, 1, 3) as N003, a. u_segmgest_2004 as N088, a. u_modserv_gest as N089, sum (b. qc_rata_scd) as D505 from scon- trpf50m0130331 to join cubom0100m0130331 b on a. ndg = b. ndg where a. TP_NDG in ('DIN', 'IOC', 'SPF') and b. forma_tec in ('MW500', 'MW100', 'MW200') group by b. cod_uo, b. forma_tec, a. TP_NDG, substring (a. sae_rae, 1, 3), a. u_segmgest_2004, a. u_modserv_gest order by cod_uo, forma_tec, TP_NDG, sub- string (a. sae_rae, 1, 3), a. u_seg- mgest_2004, a. u_modserv_gest
![Page 7: Better Performance for Big Data](https://reader036.vdocuments.site/reader036/viewer/2022062408/56813b7e550346895da49c1d/html5/thumbnails/7.jpg)
What is related with our course Hadoop VS SQLhttp://www.youtube.com/watch?v=3Wmdy80QOvw#t=16
New Trend: Big Data, Cloud Computing
RDBMS(Relational DBMS)
OLAP
NoSQL
Database Management Systems
Tables Cubes Collections
StructuredData
StructuredData
Structured/Unstructured
![Page 8: Better Performance for Big Data](https://reader036.vdocuments.site/reader036/viewer/2022062408/56813b7e550346895da49c1d/html5/thumbnails/8.jpg)
Questions?