big data
Why Big Data?
Understanding Big Data
Cheap Storage
($100 gets you 3 million times more storage in 30 years)
Inexpensive Computing
1980: 10 MIPS/$; 2005: 10M MIPS/$
Device Explosion
>5.5 billion (70+% of global population)
KEY TRENDS
Social Networks
>2 billion users
Ubiquitous Connection
Web traffic
2010: 130 exabytes (10^18 bytes)
2015: 1.6 zettabytes (10^21 bytes)
Sensor Networks
>10 Billion
Internet of things
Audio / Video
Log Files
Text / Image
Social Sentiment
Data Market Feeds
eGov Feeds
Weather
Wikis / Blogs
Click Stream
Sensors / RFID / Devices
Spatial & GPS Coordinates
WEB 2.0
Mobile
Advertising
Collaboration
eCommerce
Digital Marketing
Search Marketing
Web Logs
Recommendations
ERP / CRM
Sales Pipeline
Payables
Payroll
Inventory
Contacts
Deal Tracking
Volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
Velocity - Variety - Variability
Storage cost per GB: 1980: $190,000; 1990: $9,000; 2000: $15; 2010: $0.07
ERP / CRM, WEB 2.0, Internet of things
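As a rough check on the storage figures above, the cost-per-GB values can be turned into a growth factor. The sketch below is illustrative only: the `costPerGB` table simply restates the chart values, and the function names are made up for this example.

```javascript
// Storage cost per GB in US dollars, as listed on the slide.
const costPerGB = { 1980: 190000, 1990: 9000, 2000: 15, 2010: 0.07 };

// How many times more storage the same budget buys in toYear than in fromYear.
function storageGain(fromYear, toYear) {
  return costPerGB[fromYear] / costPerGB[toYear];
}

// Gigabytes that a fixed budget buys in a given year.
function gigabytesFor(budgetDollars, year) {
  return budgetDollars / costPerGB[year];
}

// About 2.7e+6, in line with the "3 million times" claim.
console.log(storageGain(1980, 2010).toExponential(1));
// Roughly 1429 GB for $100 at the 2010 price.
console.log(gigabytesFor(100, 2010).toFixed(0));
```

The ratio between the 1980 and 2010 prices comes out near 2.7 million, which is where the "3 million times more storage in 30 years" headline comes from.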
What is Big Data?
Big Data, BIG OPPORTUNITY
Big Data is a top priority for institutions
49% of CEOs and CIOs are planning big data projects
Software Growth (billions $)
2012: 1.8
2013: 2.5
2014: 3.4
2015: 4.6
34% compound annual growth rate [2]
Services Growth (billions $)
2012: 2.7
2013: 3.9
2014: 5.1
2015: 6.5
39% compound annual growth rate [2]
1. McKinsey & Company, McKinsey Global Survey Results, Minding Your Digital Business, 2012
2. IDC Market Analysis, Worldwide Big Data Technology and Services 2012–2015 Forecast, 2012
Big Data Scenarios
OPERATIONAL DATA
Traditional E-Commerce Data Flow
NEW USER REGISTRY
NEW PURCHASE
NEW PRODUCT
Excess Data
Logs
ETL Some Data
Data Warehouse
OPERATIONAL DATA
New E-Commerce Big Data Flow
Raw Data “Store it All” Cluster
NEW USER REGISTRY
NEW PURCHASE
NEW PRODUCT
Data Warehouse
Logs
Logs
How much do views for certain products increase when our TV ads run?
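The question above is the kind of ad hoc analysis that becomes possible once raw logs are kept. A minimal sketch, assuming hypothetical page-view records (`productId`, `hour`) and a hypothetical list of hours when TV ads aired; none of these names or numbers come from the original deck.

```javascript
// Hypothetical raw page-view log: one record per product view.
const views = [
  { productId: "tv-42", hour: 18 }, { productId: "tv-42", hour: 19 },
  { productId: "tv-42", hour: 19 }, { productId: "tv-42", hour: 19 },
  { productId: "tv-42", hour: 20 }, { productId: "tv-42", hour: 21 },
  { productId: "tv-42", hour: 21 }, { productId: "tv-42", hour: 21 },
];
// Hypothetical hours in which the TV ad aired.
const adHours = [19, 21];

// Ratio of average views per ad hour to average views per non-ad hour.
// Assumes every ad hour is also listed in allHours.
function viewLift(records, productId, adHours, allHours) {
  const adSet = new Set(adHours);
  let adViews = 0;
  let otherViews = 0;
  for (const r of records) {
    if (r.productId !== productId) continue;
    if (adSet.has(r.hour)) adViews++;
    else otherViews++;
  }
  const otherHourCount = allHours.length - adSet.size;
  return (adViews / adSet.size) / (otherViews / otherHourCount);
}

const lift = viewLift(views, "tv-42", adHours, [18, 19, 20, 21]);
console.log(lift); // 3: views per hour triple while the ad runs (toy data)
```

With only ETL'd summaries in a warehouse, the per-view records needed for this comparison may already have been discarded, which is the argument for the "store it all" cluster.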
Devices: Internet and Internet of things

Internet of things
Invisible devices
Trillions of networked nodes
Low-bandwidth last-mile connection: 100 kbit/s
Mostly addressed by local schemes
Machine-centric, sensing focus
Trillions of computer-enabled devices which are part of the IoT

Internet
Laptops / tablets / smartphones
Billions of networked devices
High-bandwidth access (Cable: 10 Mbps+; Fiber: 50-100 Mbps)
Global addressing
User-centric, communication focus
6+ billion people
1.5 billion use net
US: 4.3 devices per adult
Microsoft Big Data Solution
Power View
Excel with PowerPivot
Embedded BI
Predictive Analytics
Apps / LOB / CRM / ERP
Microsoft PDW
SSAS
SSRS
Devices / Crawlers / Sensors / Bots
Hadoop on Windows Server
HDInsight Service
Microsoft Hadoop Vision
Insights to all users by activating new types of data
Differentiation
Integrate with Microsoft Business Intelligence
Choice of deployment on Windows Server + Windows Azure
Integrate with Windows Components (AD, System Center)
Easy installation and configuration of Hadoop on Windows
Simplified programming with .NET & JavaScript integration
Integrate with SQL Server Data Warehousing
Hadoop Distributed Architecture
FIRST, STORE THE DATA
Server
Server
Server
MapReduce: Move Code to the Data
Files
Server
SECOND, TAKE THE PROCESSING TO THE DATA
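The two steps above (store the data across servers, then send the processing to where the data lives) can be mimicked in a few lines. This is a toy in-memory sketch, not Hadoop itself; the `storeBlocks` and `runOnEachServer` helpers and the three-server layout are assumptions made for illustration.

```javascript
// FIRST: split a file into lines and spread them round-robin over servers.
function storeBlocks(lines, serverCount) {
  const servers = Array.from({ length: serverCount }, () => []);
  lines.forEach((line, i) => servers[i % serverCount].push(line));
  return servers;
}

// SECOND: ship the function to every server; only small results travel back.
function runOnEachServer(servers, fn) {
  return servers.map((block) => fn(block));
}

const file = ["a b", "c", "d e f", "g", "h i"];
const servers = storeBlocks(file, 3);

// The "code" being moved: count the words in the local block.
const countWords = (block) =>
  block.reduce((n, line) => n + line.split(" ").length, 0);

const partials = runOnEachServer(servers, countWords);
const total = partials.reduce((a, b) => a + b, 0);
console.log(partials, total); // [ 3, 3, 3 ] 9
```

The point of the design is that each server returns a single number rather than its whole block, so network traffic stays small even when the stored data is large.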
So How Does It Work?
// MapReduce word count in JavaScript

// map: split each input line into words and emit (word, 1) for each one
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// reduce: sum the counts emitted for a single word
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
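To see what the runtime does with these two functions, they can be exercised locally. The sketch below repeats the map and reduce functions and adds a hypothetical `runLocally` driver that stands in for the cluster: it supplies a `context` with `write`, groups the mapped values by key, and hands each group to reduce through an iterator with `hasNext`/`next`. The driver is an assumption for illustration, not part of HDInsight.

```javascript
// Map and reduce functions for word count, as on the slide.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};
var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical local stand-in for the runtime: map, group by key, reduce.
function runLocally(lines) {
  var grouped = Object.create(null); // no inherited keys such as "constructor"
  var mapContext = {
    write: function (k, v) { (grouped[k] = grouped[k] || []).push(v); }
  };
  lines.forEach(function (line, offset) { map(offset, line, mapContext); });

  var output = Object.create(null);
  var reduceContext = { write: function (k, v) { output[k] = v; } };
  Object.keys(grouped).sort().forEach(function (k) {
    var vals = grouped[k];
    var i = 0;
    var iterator = {
      hasNext: function () { return i < vals.length; },
      next: function () { return vals[i++]; }
    };
    reduce(k, iterator, reduceContext);
  });
  return output;
}

var counts = runLocally(["Big data, big opportunity", "Store the data"]);
console.log(counts.big, counts.data, counts.store); // 2 2 1
```

On a real cluster the grouping step (the shuffle) happens across machines, but the contract seen by map and reduce is the same.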
Diagram labels: Server (x4), RUNTIME, Code
MapReduce – Workflow
Our weather model and resulting data sets should be accessible to universities and other institutions.
Aerospace Development Manager, U.S. Federal Government
It takes more time to hand a project from the seismic guys to me to the engineers in production than it does to figure out the oil field plays.
Geologist, Major oil and gas company
Windows Azure HDInsight Service
HDINSIGHT / HADOOP Eco-System
Distributed Storage (HDFS)
Query (Hive)
Distributed Processing (MapReduce)
Scripting (Pig)
NoSQL Database (HBase)
Metadata (HCatalog)
Data Integration (ODBC / SQOOP / REST)
Relational (SQL Server)
Machine Learning (Mahout)
Graph (Pegasus)
Stats processing (RHadoop)
Event Pipeline (Flume)
Active Directory (Security)
Monitoring & Deployment (System Center)
C#, F#, .NET
JavaScript
Pipeline / workflow (Oozie)
Azure Storage Vault (ASV)
PDW Polybase
Business Intelligence (Excel, Power View, SSAS)
World's Data (Azure Data Marketplace)
Event Driven Processing
Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages
HDFS on Azure: Tale of two File Systems
HDFS API
DFS (1 Data Node per Worker Role) and Compute Cluster: Name Node, Data Node, Data Node
Azure Storage (ASV)
…
Azure Blob Storage: Front end, Partition Layer, Stream Layer
Azure Storage (ASV)
• Default file system for HDInsight Service
• Provides sharable, persistent, highly scalable storage with high availability (Azure Blob Store)
• Azure storage itself does not provide compute
• Fast access from compute nodes to data in the same data center
• Several file systems, addressable via: asv[s]://<container>@<account>.blob.core.windows.net/<path>
• Requires storage key in core-site.xml:
<property>
  <name>fs.azure.account.key.accountname</name>
  <value>enterthekeyvaluehere</value>
</property>
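To make the addressing scheme and the key setting above concrete, here is a small sketch that assembles both strings. The helper names, the sample container and account values, and the `//` after the scheme (the usual Hadoop URI form) are assumptions for illustration; the property name pattern is copied from the slide, with `accountname` treated as a placeholder.

```javascript
// Build an ASV path; secure=true selects the asvs scheme.
function asvUri(container, account, path, secure) {
  const scheme = secure ? "asvs" : "asv";
  const cleanPath = String(path).replace(/^\/+/, ""); // drop leading slashes
  return scheme + "://" + container + "@" + account +
    ".blob.core.windows.net/" + cleanPath;
}

// Render the core-site.xml property that carries the storage key.
function storageKeyProperty(account, key) {
  return [
    "<property>",
    "  <name>fs.azure.account.key." + account + "</name>",
    "  <value>" + key + "</value>",
    "</property>",
  ].join("\n");
}

// Sample values below are hypothetical.
console.log(asvUri("logs", "myaccount", "/2013/01/clicks.txt", false));
// asv://logs@myaccount.blob.core.windows.net/2013/01/clicks.txt
console.log(storageKeyProperty("myaccount", "enterthekeyvaluehere"));
```

Because the container and account are part of the path, one cluster can read from several storage accounts as long as each account's key is present in the configuration.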
Programming HDInsight
Existing Ecosystem: Hive, Pig, Mahout, Cascading, Scalding, Scoobi, Pegasus…
.NET: C#, F# Map/Reduce, LINQ to Hive, .NET management clients
JavaScript: JavaScript Map/Reduce, Browser hosted console, Node.js management clients
DevOps / IT Pros: PowerShell, Cross Platform CLI tools
Authoring Jobs, App Integration
Building Developer Experiences
Core Hadoop
Consistent REST APIs
Breadth of Clients (Java, JS, .NET, etc.)
Authoring frameworks and languages
End User Tooling (IDEs, Analyst tools, Command lines)
Connectivity
Programmability
Security
Loosely coupled
Lightweight
Low cost to extend
Scenario oriented
Innovation flows upward
New compute models
Perf enhancements
Extend breadth & depth
Enable new scenarios
Integrate with current tool chains
© 2011 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.