- oracle · “the oracle grid engine 6.2 software has dramatically lowered for us the cost of...
TRANSCRIPT
![Page 1: - Oracle · “The Oracle Grid Engine 6.2 software has dramatically lowered for us the cost of data intensive, Hadoop centered,](https://reader034.vdocuments.site/reader034/viewer/2022042222/5ec95fad530c8d0e6a65ad5b/html5/thumbnails/1.jpg)
Scalable Enterprise Data Processing for the Cloud
With Oracle Grid Engine
Daniel Templeton
Product Manager, Oracle Grid Engine
Tom White
Engineer, Cloudera
Oracle OpenWorld Latin America 2010
December 7–9, 2010
Oracle OpenWorld Beijing 2010
December 13–16, 2010
Program Agenda
• The new data landscape
• Data-oriented computing
• Data infrastructure
• Compute infrastructure
• Data-oriented computing revisited
• Additional resources
The Data Landscape
• Structured data
  – Relational data, XML, etc.
  – Well-defined data structure – e.g. Schema, DTD
    • Facilitates automated analysis – e.g. SQL, XSLT
  – Managed life cycle
• Unstructured data
  – Everything that's not structured
  – No predictable or useful structure
    • Somewhat subjective
  – Analysis requires customization and manual intervention
  – No clear life cycle because no clear classification
Unstructured Data
• Documents, logs, records, dumps, etc.
  – Distributed across files across machines across the network
• Growing rapidly
  – 85% of enterprise data
    • Growing at 61.7% compounded annually
• Expensive to store it all
  – How to decide what to keep?
• Potentially massive source of business value
  – Business value locked behind lack of structure
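As a rough illustration of what 61.7% compound annual growth implies, the sketch below computes the projected volume after a number of years and the doubling time. The 100 TB starting volume is a made-up figure for illustration, not a number from the slides.

```python
import math

ANNUAL_GROWTH = 0.617  # growth rate quoted on the slide


def projected_volume(start_tb: float, years: int, rate: float = ANNUAL_GROWTH) -> float:
    """Volume after `years` of compound growth at `rate` per year."""
    return start_tb * (1.0 + rate) ** years


def doubling_time_years(rate: float = ANNUAL_GROWTH) -> float:
    """Years needed for the volume to double at a constant compound rate."""
    return math.log(2.0) / math.log(1.0 + rate)


if __name__ == "__main__":
    start_tb = 100.0  # hypothetical starting volume, for illustration only
    for year in range(6):
        print(f"year {year}: {projected_volume(start_tb, year):8.1f} TB")
    print(f"doubling time: {doubling_time_years():.2f} years")
```

At this rate the volume doubles in a little under a year and a half, which is why the question of what to keep becomes pressing so quickly.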
The Data Landscape
• NYSE is generating 1TB per day
• Facebook is generating 20TB per day
  – Compressed!
• CERN is generating 40TB per day
[Chart: Worldwide Enterprise Disk Storage Consumption Model, Revenue by Segment, 2005–2014 ($B). Series: traditional replicated data, traditional structured data, traditional unstructured data.]
[Chart: Worldwide Enterprise Disk Storage Consumption Model, Capacity Shipment Share by Segment, 2005–2014 (%). Series: traditional replicated, traditional structured data, traditional unstructured.]
Data-Oriented Computing
• Compute is now cheap; moving data is still expensive
  – Big change from a decade ago
  – More CPU cores than can be used effectively
  – More data than can be processed
• Do the work close to the data
  – “What to run” → “What data to process”
• Data no longer assumed to float in a SAN
  – Data locality is a core concept
  – The network is the data
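“Do the work close to the data” can be sketched as a placement decision: given where each data block's replicas live, prefer a host that already holds the block and fall back to any free host otherwise. All names here are hypothetical, and this is not how any particular scheduler is implemented.

```python
from typing import Dict, List, Optional, Set


def pick_host(block: str,
              replicas: Dict[str, Set[str]],
              free_slots: Dict[str, int]) -> Optional[str]:
    """Choose a host for the task that processes `block`.

    Hosts that store a replica of the block are preferred, so the task
    reads from local disk instead of pulling the data over the network.
    """
    local = [h for h in sorted(replicas.get(block, set())) if free_slots.get(h, 0) > 0]
    if local:
        return local[0]
    remote = [h for h in sorted(free_slots) if free_slots[h] > 0]
    return remote[0] if remote else None


def place_tasks(blocks: List[str],
                replicas: Dict[str, Set[str]],
                free_slots: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Assign one task per block, consuming a slot on each chosen host."""
    slots = dict(free_slots)  # work on a copy; the caller's dict is untouched
    placement: Dict[str, Optional[str]] = {}
    for block in blocks:
        host = pick_host(block, replicas, slots)
        if host is not None:
            slots[host] -= 1
        placement[block] = host  # None means no host had a free slot
    return placement


if __name__ == "__main__":
    replicas = {"blk1": {"hostA", "hostB"}, "blk2": {"hostB"}, "blk3": {"hostC"}}
    free_slots = {"hostA": 1, "hostB": 1, "hostC": 0, "hostD": 1}
    print(place_tasks(["blk1", "blk2", "blk3"], replicas, free_slots))
```

In the sample data, the task for `blk3` cannot run where its replica lives because that host has no free slot, so it falls back to a remote host and pays the transfer cost.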
Structured Versus Unstructured
[Diagram, built up over four slides. Labels per slide: Compute, Data; Compute, Database; Compute, Database, Cluster; Compute, Local Disk]
Data-Oriented Computing and the Cloud
• Public clouds rapidly becoming the dominant storage vehicle
• Large data analytics fits well with private or public clouds
  – Mind the transfer!
• Bandwidth and latency issues make hybrid cloud solutions unfavorable
[Chart: Worldwide Enterprise Disk Storage Consumption Model, Capacity Shipment Share by Segment, 2005–2014 (%). Series: traditional replicated, traditional structured data, traditional unstructured, content depots/public cloud.]
Typical Data-Oriented Computing Use Cases
• Large data files
  – Implicitly chunked across network
  – Process massively in parallel
• Fragmented data records
  – Process in place
  – Aggregation implicit in the computation
• Hacking by determined developers
  – Now called “Data Science”
• Streaming data
  – Dump into storage and proceed as above
Basic Data-Oriented Computing Building Blocks
• Data Infrastructure
  – Massively scalable
    • Also in terms of cost
  – Network-centric
  – Data locality
• Compute Infrastructure
  – Highly scalable management of compute resources
  – Support for multi-tenancy
    • Users & applications
  – Support for accounting and billing
    • Fundamental to cloud model
Oracle Coherence
• Highly scalable in-memory data grid
  – Aggregates total memory of nodes into a single cache
    • More nodes = more cache space
  – Coherency maintained through extremely optimized protocol
  – No single point of failure
• Object oriented
  – Every object lives on a particular node
  – Objects replicated for redundancy
• Can be backed by a traditional data store
  – Write ahead, write behind, etc.
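The idea that every object lives on a particular node and is replicated for redundancy can be pictured with a toy partitioned cache. This sketch is not the Oracle Coherence API; the class name, the hash-based placement, and the single-backup policy are assumptions made purely for illustration.

```python
import hashlib
from typing import Any, Dict, List


class ToyDataGrid:
    """Toy in-memory grid: each key has one primary node and one backup node."""

    def __init__(self, node_names: List[str]) -> None:
        if len(node_names) < 2:
            raise ValueError("need at least two nodes to keep a backup copy")
        self.nodes = list(node_names)
        self.primary: Dict[str, Dict[str, Any]] = {n: {} for n in self.nodes}
        self.backup: Dict[str, Dict[str, Any]] = {n: {} for n in self.nodes}

    def _owner_index(self, key: str) -> int:
        # Stable hash so the same key always maps to the same node.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def owners(self, key: str) -> List[str]:
        """Return [primary, backup]; the backup is the next node in the ring."""
        i = self._owner_index(key)
        return [self.nodes[i], self.nodes[(i + 1) % len(self.nodes)]]

    def put(self, key: str, value: Any) -> None:
        primary, backup = self.owners(key)
        self.primary[primary][key] = value
        self.backup[backup][key] = value

    def get(self, key: str, failed: frozenset = frozenset()) -> Any:
        """Read from the primary, or from the backup if the primary is down."""
        primary, backup = self.owners(key)
        if primary not in failed:
            return self.primary[primary][key]
        if backup not in failed:
            return self.backup[backup][key]
        raise KeyError(f"all copies of {key!r} are unavailable")


if __name__ == "__main__":
    grid = ToyDataGrid(["node1", "node2", "node3"])
    grid.put("order:42", {"total": 10})
    primary, _ = grid.owners("order:42")
    # The value survives the loss of its primary node.
    print(grid.get("order:42", failed=frozenset({primary})))
```

Adding a node name to the list adds capacity, which mirrors the "more nodes = more cache space" point; a real data grid would also rebalance existing entries, which this sketch omits.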
Oracle Coherence Embedded Data Grid
[Diagram labels: Master Host; Execution Host (×4); Oracle Grid Engine; Oracle Coherence]
Apache Hadoop HDFS
• Highly scalable on-disk data grid
  – Aggregates assigned disk space of nodes into a single pool
    • More nodes = more storage space
  – Data locations maintained by a master node
• File oriented
  – Every file is broken into data blocks
  – Every block lives on a particular node
  – Blocks replicated for redundancy
• Core component of Hadoop
  – Powerful marriage of compute with data
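The block-oriented layout described above can be sketched in a few lines: a file is cut into fixed-size blocks, each block is assigned to several nodes, and a master keeps the block-to-node map. This is a toy model, not the HDFS API; the tiny block size, the replication factor, and the round-robin placement are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's contents into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Master-side map: block index -> nodes that hold a replica.

    Round-robin placement is an assumption for this sketch; a real
    system would also weigh racks, free space, and load.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def store(data: bytes, nodes: List[str], block_size: int = 4, replication: int = 2
          ) -> Tuple[Dict[int, List[str]], Dict[str, Dict[int, bytes]]]:
    """Return the master's block map and the per-node block storage."""
    blocks = split_into_blocks(data, block_size)
    block_map = place_blocks(len(blocks), nodes, replication)
    disks: Dict[str, Dict[int, bytes]] = {n: {} for n in nodes}
    for index, holders in block_map.items():
        for node in holders:
            disks[node][index] = blocks[index]
    return block_map, disks


def read(block_map: Dict[int, List[str]], disks: Dict[str, Dict[int, bytes]]) -> bytes:
    """Reassemble the file by reading each block from its first listed replica."""
    return b"".join(disks[block_map[i][0]][i] for i in sorted(block_map))


if __name__ == "__main__":
    block_map, disks = store(b"hello hdfs", ["node1", "node2", "node3"])
    print(block_map)
    print(read(block_map, disks))
```

Because the master knows which nodes hold each block, a compute layer can ask it where to run a task, which is the link between storage and computation that the last bullet refers to.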
Oracle Grid Engine
Business-driven Workload Management
• Powerful workload manager
  – Efficiently match workload to available resources
  – Schedule according to business policies
  – Aggregate users and uses onto a set of resource pools
  – Extreme scalability
  – Full accounting
• Flexible resource broker
  – Share resources among services according to SLOs
  – Lease additional capacity from the cloud on demand
  – Set idle/underutilized machines into reduced power mode
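Matching workload to resources under a policy can be pictured as a greedy loop: order the pending jobs by policy priority, then give each job the first host with enough free slots. This is a deliberately simplified sketch, not Grid Engine's scheduling algorithm; the `Job` fields and the priority numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    slots: int      # slots the job needs on a single host
    priority: int   # higher runs first; stands in for a business policy


def schedule(jobs: List[Job], free_slots: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Greedy dispatch: highest priority first, first host that fits."""
    remaining = dict(free_slots)  # copy so the caller's view is unchanged
    assignment: Dict[str, Optional[str]] = {}
    for job in sorted(jobs, key=lambda j: -j.priority):
        host = next((h for h in sorted(remaining) if remaining[h] >= job.slots), None)
        if host is not None:
            remaining[host] -= job.slots
        assignment[job.name] = host  # None means the job stays pending
    return assignment


if __name__ == "__main__":
    pending = [Job("report", 2, 1), Job("risk", 4, 10), Job("etl", 4, 5)]
    print(schedule(pending, {"host1": 4, "host2": 2}))
```

In the sample run the highest-priority job takes the large host, the second job stays pending because nothing fits, and the small low-priority job still runs on the leftover capacity.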
Award-winning Sun Grid Engine
Thousands of Successful Grids
Excellence in Cluster Technology
Redefining the Enterprise Data Center
• Tear down application resource silos
  – Resource sharing according to needs and policies
• Reduce the cost of data center ownership
  – More efficient use of resources
  – Idle or underused machines powered down until needed
• On-demand scale-out to cloud resources
  – Insulates applications from cloud service providers
  – Facilitates private cloud model
• Support for data-oriented compute models
  – Apache Hadoop
  – Oracle Coherence
Common Use Cases
Modeling/Processing
Streaming
Monte Carlo
Validation
Map/Reduce
• Defined in a paper from Google in 2004
  – Apache Hadoop is the best known implementation
• Data processing in two steps
  – Map: process input data across network
  – Reduce: assemble intermediate results into final result
• Example: counting words in a book
  – Map: for each page, emit every word into a giant hash table
  – Reduce: merge all hash tables together and count the number of values for each key
• Massively parallel processing – embarrassingly parallel
  – Inherently data aware
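The word count example above translates almost directly into code. In this sketch each page is mapped to its own hash table of emitted values, and the reduce step merges the tables and counts the values under each key. It runs in a single process for clarity, whereas a real Map/Reduce job would run the map calls on different nodes. The tokenizing rule (lowercase letters and apostrophes) is an assumption.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

WORD = re.compile(r"[a-z']+")  # assumed definition of a "word"


def map_page(page: str) -> Dict[str, List[int]]:
    """Map: emit a 1 for every word on the page, grouped by word."""
    table: Dict[str, List[int]] = defaultdict(list)
    for word in WORD.findall(page.lower()):
        table[word].append(1)
    return table


def reduce_tables(tables: Iterable[Dict[str, List[int]]]) -> Dict[str, int]:
    """Reduce: merge all tables and count the values stored under each key."""
    merged: Dict[str, List[int]] = defaultdict(list)
    for table in tables:
        for word, ones in table.items():
            merged[word].extend(ones)
    return {word: len(ones) for word, ones in merged.items()}


if __name__ == "__main__":
    pages = ["The mind is its own place", "and the truth of the mind"]
    counts = reduce_tables(map_page(p) for p in pages)
    print(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])))
```

Each `map_page` call depends only on its own page, which is what makes the map step embarrassingly parallel.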
Rethinking Unstructured Data With Hadoop
• MapReduce provides unified interface
  – Rich ecosystem of tools for data analysis
    • Hive, Pig, et al. → Cloudera Distribution of Hadoop
  – Almost as accessible as structured data
• HDFS is a low-cost distributed file system
  – Adding capacity means just adding (cheap) nodes
  – Changes the economics of data storage
• Possible to extract the value from unstructured data and feasible to keep large amounts of it around
  – Tremendous opportunity for discovered knowledge
What LinkedIn Is Saying
“Hadoop is a key ingredient in allowing LinkedIn to build many of our most computationally difficult features, allowing us to harness our incredible data about the professional world for our users”
Jay Kreps, Principal Engineer
Word Count Example Revisited
[Diagram: Word Count Algorithm. Stages: store data in HDFS; map phase: count words per data block (multiple MAP tasks); shuffle; reduce phase: aggregate counts (two REDUCE tasks); extract results from HDFS. Sample output: capacity: 14334, intellect: 12377, mind: 9574, money: 5967, truth: 5868, ...]
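The diagram adds a shuffle step between the two phases. A minimal sketch of that flow follows, assuming two reducers and hash partitioning by word (both choices are illustrative, not taken from the slide): each map task counts the words in one data block, the shuffle routes every word to exactly one reducer, and each reducer sums the partial counts it receives.

```python
import zlib
from collections import Counter
from typing import Dict, List


def map_block(block: str) -> Counter:
    """Map phase: count words within a single data block."""
    return Counter(block.lower().split())


def partition(word: str, num_reducers: int) -> int:
    """Pick the reducer for a word; crc32 is stable across runs, unlike hash()."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def shuffle(partials: List[Counter], num_reducers: int) -> List[Dict[str, List[int]]]:
    """Shuffle: group the partial counts by word, one bucket per reducer."""
    buckets: List[Dict[str, List[int]]] = [{} for _ in range(num_reducers)]
    for partial in partials:
        for word, count in partial.items():
            buckets[partition(word, num_reducers)].setdefault(word, []).append(count)
    return buckets


def reduce_bucket(bucket: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce phase: sum the partial counts for each word."""
    return {word: sum(counts) for word, counts in bucket.items()}


def word_count(blocks: List[str], num_reducers: int = 2) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the given data blocks."""
    partials = [map_block(b) for b in blocks]
    result: Dict[str, int] = {}
    for bucket in shuffle(partials, num_reducers):
        result.update(reduce_bucket(bucket))  # reducers own disjoint sets of words
    return result


if __name__ == "__main__":
    print(word_count(["money mind truth", "mind money mind"]))
```

Counting per block before the shuffle keeps the data sent over the network small: each map task ships one number per distinct word rather than one record per occurrence.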
Unstructured Enterprise Data Analytics
• Not all necessarily Hadoop
  – MPI, Java, legacy, or even different Hadoop versions
• Grid Engine unifies the workload across the resources
  – Better efficiency
  – Lower cost of management
  – Cross-domain workflows
• Plus enterprise class features:
  – Demand-driven cloud connectivity and power management
  – Advanced scheduling policies
    • Advance resource reservations
  – Full accounting and reporting suite
“The Oracle Grid Engine 6.2 software has dramatically lowered for us the cost of data intensive, Hadoop centered, computing. Oracle Grid Engine allows us to run Hadoop jobs within exactly the same scheduling and submission environment we use for traditional scalar and parallel loads.”
Gianluigi Zanetti
Director, Biomedical Applications, CRS4
Word Count Example Re-Revisited
[Diagram labels: HDFS; Oracle Grid Engine; workloads: OpenMPI, Java, Map/Reduce: Word Count, Map/Reduce; tasks shown as MAP and MAP/REDUCE]
References For Getting Started
• Oracle Grid Engine OTN Page:
  – http://www.oracle.com/technetwork/oem/grid-engine-166852.html
• Hadoop Project Page:
  – http://hadoop.apache.org/
• Cloudera:
  – http://www.cloudera.com/
  – http://www.cloudera.com/hadoop-tutorial/
• Hadoop World 2010:
  – http://www.cloudera.com/company/press-center/hadoop-world-nyc/