Big data analytics
Rafal Lukawiecki, Strategic Consultant
Project Botticelli Ltd
[email protected] @rafaldotnet
Objectives
Explain big data analytics
Introduce data mining, Hadoop, and PDW
The information herein is for informational purposes only and represents the opinions and views of Project Botticelli and/or Rafal Lukawiecki. The material presented is not certain and may vary based on several factors. Microsoft makes no warranties, express, implied or statutory, as to the information in this presentation.
Portions © 2014 Project Botticelli Ltd & entire material © 2014 Microsoft Corp unless noted otherwise. Some slides contain quotations from copyrighted materials by other authors, as individually attributed or as already covered by Microsoft Copyright ownerships. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Project Botticelli Ltd as of the date of this presentation. Because Project Botticelli & Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft and Project Botticelli cannot guarantee the accuracy of any information provided after the date of this presentation. Project Botticelli makes no warranties, express, implied or statutory, as to the information in this presentation. E&OE.
Video tutorials: Introduction to BI & Big Data, DAX, MDX, Data Mining, Excel BI
PPTs: projectbotticelli.com/ppt
Offer! 15% Off: 2014SWISS15. Valid in March 2014 only
Register on projectbotticelli.com
Common big data scenarios, by domain
Financial services: Modeling true risk; Threat analysis and fraud detection; Trade surveillance; Credit scoring and analysis
Media & Entertainment: Recommendation engines; Ad targeting; Search quality; Abuse and click fraud detection
Retail: Point of sales transaction analysis; Customer churn analysis; Sentiment analysis
Telecommunications: Customer churn prevention; Network performance optimization; Call Detail Record (CDR) analysis; Network failure prediction
Government: Cyber security (botnets, fraud); Traffic congestion and re-routing; Environmental monitoring; Antisocial monitoring via social media
Healthcare: Genomics research; Cancer research; Health pandemics early detection; Air quality monitoring
PDW principles
Massively Parallel Processing (MPP) for queries
In-memory columnstore
Multiple nodes with dedicated CPU, memory, storage
Incrementally extensible
Scale from terabytes to multi-petabytes
PDW: near real-time insights
Real-time with complex event processing
Low latency
Sub-second processing of large event streams
Continuous insight through historical data mining
Event sources
Event targets
Advanced analytics: descriptive & predictive. Clustering, neural nets, decision trees, time series, naïve Bayes, sequence clustering, linear and logistic regression
Semantic search: conceptual similarities
Geospatial: geometry and geography
Big data: Hadoop, Mahout
Parallel Data Warehouse
HDP
Windows Azure
Apache Hadoop distribution
Developed by Hortonworks & Microsoft
Integrated with Microsoft BI
Microsoft HDInsight
Big data + traditional BI = powerful + easy
[Diagram labels: Big, fast, or complex data; Microsoft HDInsight; PDW + Polybase; Tabular, OLAP, SQL; Interaction, exploration, reporting, visualisation]
Hadoop principles
Practical method for massive parallelisation of analytical data processing
Distributed data
Distributed processing
Analytics engine of Microsoft, Yahoo, Google, Facebook, Netflix, Klout…
Hadoop data
HDFS (Hadoop Distributed File System)
Network rack aware to minimise transfers
Access like normal files
Query Hive, like a data warehouse, using HiveQL
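To make "network rack aware" concrete, here is a minimal Python sketch of the idea behind HDFS's default placement of three block replicas: the first copy stays on the writer's node, and the other two go to two different nodes on one other rack, so only one copy crosses racks on the way out. The topology dictionary, the node names, and the function place_replicas are hypothetical names made up for this illustration; this is not HDFS code, and it ignores node load and free space.

```python
import random

def place_replicas(topology, writer_node, rng=None):
    """Choose three nodes for one block (simplified, assumed policy).

    topology maps a rack name to the list of node names on that rack.
    """
    rng = rng or random.Random()
    # Reverse lookup: which rack does each node live on?
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    first = writer_node                      # replica 1: local to the writer
    local_rack = rack_of[first]
    # Candidate remote racks need at least two nodes for replicas 2 and 3.
    remote_racks = [r for r in topology
                    if r != local_rack and len(topology[r]) >= 2]
    remote_rack = rng.choice(remote_racks)
    # Replicas 2 and 3: two distinct nodes on the same remote rack.
    second, third = rng.sample(topology[remote_rack], 2)
    return [first, second, third]

# Hypothetical two-rack cluster.
topology = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
}
replicas = place_replicas(topology, "n1", random.Random(0))
print(replicas)
```

Keeping two of the three copies on one rack is a trade-off: a whole-rack failure still leaves a copy elsewhere, while writes send data across the slower inter-rack link only once.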
Hadoop MapReduce
Your processing logic is split between map and reduce functions
Map your problem into smaller parts (divide)
Reduce results into higher-level aggregates (conquer)
MapReduce is like divide-and-conquer
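The divide-and-conquer idea can be sketched in a few lines of plain Python. This is an in-memory toy, not Hadoop: the names map_words, reduce_counts, and run_mapreduce are invented for this example, and the "shuffle" step that Hadoop performs across the cluster is simulated here with a dictionary.

```python
from collections import defaultdict

def map_words(line):
    """Map: break one input record into small (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce: combine all values for one key into an aggregate."""
    return word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """Run map, then group by key (shuffle), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

lines = ["big data is big", "data is distributed"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

Because each call to the mapper sees only one record and each call to the reducer sees only one key, both phases can be spread over many machines without the functions knowing about each other.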
Hadoop cluster
Yahoo! Hadoop cluster, about 2007. Source: http://developer.yahoo.com. Picture used with permission.
Hadoop cluster
Buster Cluster, an early research project by Miles Osborne, University of Edinburgh, School of Informatics. Picture used with permission. http://homepages.inf.ed.ac.uk/miles/
Hadoop cluster
Cloud rent-a-Hadoop-cluster, or: “Supercomputer for cents”
Windows Azure HDInsight
Processing logic in HDInsight 1.6, 2.1, 3.0
Write MapReduce jobs in Java, or any Windows language, using stdin-stdout
Pig Latin with User-Defined Functions (UDFs) in Python, JS, C#, Java, and .NET
Low-level, fast, harder
Easy, massively parallel
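The stdin-stdout approach means a job step is just a program that reads lines and writes tab-separated key/value lines. Below is a hedged Python sketch of a word-count mapper and reducer in that style. It assumes the framework sorts the mapper output by key before the reducer sees it, as Hadoop Streaming does; the script layout and the "map"/"reduce" command-line argument are choices made for this example only.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word read from the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key.

    Relies on the input already being sorted by key.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Hypothetical usage: 'python wordcount.py map' or 'python wordcount.py reduce'.
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    step = {"map": mapper, "reduce": reducer}.get(role)
    if step:
        for out in step(sys.stdin):
            print(out)
```

The same pair can be tried locally without a cluster by piping text through the map step, a sort, and then the reduce step.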
Processing logic in HDInsight 3.0
Hadoop 2.2
Middleware between data and processing
YARN Apps: MapReduce and Pig, or: Tez (interactive), HBase (online), streaming, graph, in-memory, search...
Hadoop data science
Mahout 0.9 (not HDInsight 3.0 yet)
Machine learning
Scalable data mining
Collaborative filtering, recommenders, clustering, singular value decomposition, parallel frequent pattern mining, naive Bayes, decision tree
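As a rough illustration of collaborative filtering, the sketch below recommends items that co-occur most often with what a user already has. It is a single-machine toy in plain Python and does not use Mahout or its API; the baskets data and the functions cooccurrence and recommend are hypothetical names invented for this example.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence(baskets):
    """Count how often each ordered pair of items appears for the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts

def recommend(user, baskets, top_n=2):
    """Score unseen items by their co-occurrence with the user's items."""
    counts = cooccurrence(baskets)
    seen = set(baskets[user])
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Hypothetical usage data: which users have used which tools.
baskets = {
    "ann": ["hadoop", "hive", "pig"],
    "bob": ["hadoop", "hive"],
    "cat": ["hadoop", "mahout"],
    "dan": ["hadoop"],
}
suggestions = recommend("dan", baskets)
print(suggestions)
```

Counting pairs per user is itself a map step and summing the counts is a reduce step, which is why this family of algorithms scales out well on a Hadoop cluster.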
Summary
Big data = too complex for traditional methods
HDInsight + PDW for your big data opportunity
projectbotticelli.com
BI video tutorials, PPTs, and articles
15% Off: 2014SWISS15. Valid in March 2014 only
Follow: @rafaldotnet
Email: [email protected]
rafal.net