how to use big data

Digicomp 1

The Microsoft BI Platform in the Cloud

Course instructor: Matthias Gessenay, 20 January 2016 / [email protected]

Copyrights

Some slides are taken from Microsoft's Azure Readiness slide deck (https://github.com/Azure-Readiness/CloudDataCamp/blob/master/Presentation/HDInsight/Hadoop%20in%20Azure.pptx)

Some slides are taken from the MS Ignite session PowerBI Overview (http://www.google.ch/url?sa=t&rct=j&q=&esrc=s&source=web&cd=8&cad=rja&uact=8&ved=0ahUKEwiH3pygp7XKAhVBVRoKHQ9KCJwQFghcMAc&url=http%3A%2F%2Fvideo.ch9.ms%2Fsessions%2Fignite%2F2015%2Fdecks%2FBRK2556_Doyle.pptx&usg=AFQjCNHOr7Kb8pJEFnLKHvAMUho0AOBhjA)

Introduction to Apache Hadoop

Apache Hadoop

Data volume

Hadoop stores files in a distributed file system

Distributed across many servers

Files can be spread across many nodes

Hadoop can store very large amounts of data

Scales from a few nodes to many thousands of nodes

Files can be larger than the capacity of a single node
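
As a rough illustration of the last point, the arithmetic below estimates how many blocks a file occupies and a lower bound on the number of nodes needed to hold it. The 128 MiB block size, the replication factor of 3, and the file and node sizes are assumed values for this example, not figures from the slides.

```python
BLOCK_SIZE = 128 * 1024**2   # assumed block size: 128 MiB
REPLICATION = 3              # assumed number of copies per block


def blocks_needed(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks a file of file_size bytes is split into."""
    return (file_size + block_size - 1) // block_size  # integer ceiling


def min_nodes(file_size: int, node_capacity: int,
              replication: int = REPLICATION) -> int:
    """Lower bound on the nodes needed to hold every copy of the file."""
    raw_bytes = file_size * replication
    by_capacity = (raw_bytes + node_capacity - 1) // node_capacity
    # Copies of one block belong on different nodes, so never fewer
    # nodes than the replication factor.
    return max(replication, by_capacity)


TIB = 1024**4
file_size = 10 * TIB       # hypothetical 10 TiB file
node_capacity = 4 * TIB    # hypothetical 4 TiB of disk per node

print(blocks_needed(file_size))             # 81920
print(min_nodes(file_size, node_capacity))  # 8
```

The file is larger than any single node in this example, yet it fits once its blocks are spread over eight or more nodes.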

Data variety

Hadoop stores files in a non-relational format

Hadoop vs. SQL

SCALE (storage & processing)

Aspect | Relational Database | Hadoop Platform
schema | Required on write | Required on read
speed | Reads are fast | Writes are fast
governance | Standards and structured | Loosely structured
processing | Limited, no data processing | Processing coupled with data
data types | Structured | Multi and unstructured
best fit use | Interactive OLAP Analytics; Complex ACID Transactions; Operational Data Store | Data Discovery; Processing unstructured data; Massive Storage/Processing
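
The schema row of the comparison can be made concrete with a small sketch. It contrasts a store that checks every record when it is written with one that accepts raw lines and applies the schema only when the data is read. The class names, the two-field schema and the sample records are hypothetical and only serve the illustration.

```python
import json

SCHEMA = {"user": str, "amount": float}  # hypothetical two-field schema


def validate(record: dict) -> dict:
    """Reject records that do not match SCHEMA exactly."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    return record


class SchemaOnWriteTable:
    """Relational style: every write is checked, so reads need no parsing."""

    def __init__(self):
        self.rows = []

    def insert(self, record: dict) -> None:
        self.rows.append(validate(record))      # cost is paid on write

    def total(self) -> float:
        return sum(row["amount"] for row in self.rows)


class SchemaOnReadStore:
    """Hadoop style: raw lines are stored as-is and interpreted when read."""

    def __init__(self):
        self.lines = []

    def append(self, raw_line: str) -> None:
        self.lines.append(raw_line)             # no checks, writes are cheap

    def total(self) -> float:
        result = 0.0
        for line in self.lines:
            try:
                record = json.loads(line)
                result += float(record["amount"])   # schema applied here
            except (ValueError, KeyError, TypeError):
                continue                            # skip lines that do not fit
        return result


table = SchemaOnWriteTable()
table.insert({"user": "anna", "amount": 12.5})

store = SchemaOnReadStore()
store.append('{"user": "anna", "amount": 12.5}')
store.append("free text that fits no schema")
store.append('{"user": "ben", "amount": "7.5", "note": "extra field"}')

print(table.total())  # 12.5
print(store.total())  # 20.0
```

The second store accepts loosely structured input without complaint; the price is that every read has to parse the data and decide what to do with lines that do not fit.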

YARN: Next Generation Hadoop (Azure Data Lake is built on YARN)

1st Gen of Hadoop: Single Use System (Batch Apps)

HDFS (redundant, reliable storage)

MapReduce (cluster resource management & data processing)

2nd Gen of Hadoop: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)

Redundant, Reliable Storage (HDFS)

Efficient Cluster Resource Management & Shared Services (YARN)

Flexible Data Processing

Hive, Pig, others…

Batch: MapReduce

Batch & Interactive: Tez

Online Data Processing: HBase, Accumulo

Stream Processing: Storm

others…

Classic Hadoop Apps
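
The change between the two generations is that resource management moves out of MapReduce into a layer of its own that several processing engines share. The toy model below shows that idea only. `ToyResourceManager` and its methods are hypothetical names for this sketch and are not the YARN API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hands out containers from a shared pool; knows nothing about the work."""
    total_containers: int
    allocations: dict = field(default_factory=dict)

    @property
    def free(self) -> int:
        return self.total_containers - sum(self.allocations.values())

    def request(self, app_name: str, wanted: int) -> int:
        """Grant up to `wanted` containers and return how many were granted."""
        granted = min(wanted, self.free)
        if granted:
            self.allocations[app_name] = self.allocations.get(app_name, 0) + granted
        return granted

    def release(self, app_name: str) -> None:
        """Return all containers of a finished application to the pool."""
        self.allocations.pop(app_name, None)


rm = ToyResourceManager(total_containers=10)
print(rm.request("batch-mapreduce", 6))   # 6
print(rm.request("stream-storm", 3))      # 3
print(rm.request("interactive-tez", 4))   # 1, only one container is left
rm.release("batch-mapreduce")
print(rm.free)                            # 6
```

Batch, streaming and interactive applications draw on the same pool here, which is the sense in which the second generation is a multi-use platform.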

Hadoop 2.0: YARN

http://hortonworks.com/blog/introducing-apache-hadoop-yarn/

HDFS: Hadoop Storage

Data nodes

Distributed

Local storage

Fault tolerant (3 copies per block)

Files are split into blocks

Name node

Stores no file data

But knows where each block is located
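
The division of labour between name node and data nodes can be sketched in a few lines. This is a toy model under simplifying assumptions: data nodes are plain dictionaries, replicas are placed round-robin, and `ToyNameNode`, `write_file` and `read_file` are hypothetical names, not the HDFS client API.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the file content into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyNameNode:
    """Holds only metadata: which data nodes keep a copy of which block."""

    def __init__(self, node_names, replication=3):
        if replication > len(node_names):
            raise ValueError("need at least as many nodes as replicas")
        self.node_names = list(node_names)
        self.replication = replication
        self.block_map = {}   # (filename, block index) -> list of node names

    def place(self, filename: str, block_count: int) -> None:
        n = len(self.node_names)
        for index in range(block_count):
            # Round-robin placement, a simplification of real placement policies.
            self.block_map[(filename, index)] = [
                self.node_names[(index + k) % n] for k in range(self.replication)
            ]


def write_file(name_node, data_nodes, filename, data, block_size):
    blocks = split_into_blocks(data, block_size)
    name_node.place(filename, len(blocks))
    for index, block in enumerate(blocks):
        for node in name_node.block_map[(filename, index)]:
            data_nodes[node][(filename, index)] = block   # one copy per node


def read_file(name_node, data_nodes, filename, failed=frozenset()):
    parts, index = [], 0
    while (filename, index) in name_node.block_map:
        alive = [n for n in name_node.block_map[(filename, index)] if n not in failed]
        if not alive:
            raise OSError(f"all copies of block {index} are lost")
        parts.append(data_nodes[alive[0]][(filename, index)])
        index += 1
    return b"".join(parts)


names = ["dn1", "dn2", "dn3", "dn4"]
name_node = ToyNameNode(names, replication=3)
data_nodes = {name: {} for name in names}

write_file(name_node, data_nodes, "log.txt", b"abcdefghij", block_size=4)
print(name_node.block_map[("log.txt", 0)])   # ['dn1', 'dn2', 'dn3']
print(read_file(name_node, data_nodes, "log.txt", failed={"dn1", "dn2"}))  # b'abcdefghij'
```

With three copies per block, the file stays readable after two of the four nodes fail, and the name node never holds any of the file's bytes.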

Hadoop MapReduce

[Diagram: the input is split, and Do work() runs on every split in parallel: Do work(), Do work(), Do work(), …]
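
A minimal word-count sketch shows the pattern: the same `do_work` function runs on every chunk of the input, and the partial results are merged afterwards. Threads stand in for cluster nodes here, which is an assumption of this example; the function names and sample lines are hypothetical as well.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def do_work(chunk: list[str]) -> Counter:
    """Map step: count the words in one chunk of lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial results."""
    return left + right


def word_count(lines: list[str], workers: int = 3) -> Counter:
    # Deal the lines out into one chunk per worker.
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(do_work, chunks))   # do_work() on every chunk
    return reduce(merge, partials, Counter())


lines = [
    "big data on Hadoop",
    "Hadoop stores big files",
    "data data data",
]
print(word_count(lines).most_common(1))  # [('data', 4)]
```

Because each chunk is processed independently, adding workers does not change the result, only how the work is divided.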

Apache Hadoop in Azure

HDInsight: What’s Different?

Not that much …

HDP on Windows

HDP on Linux

Compute and storage are decoupled

Azure Blob Storage

HDInsight Storage Infrastructure

HDInsight Compute Nodes (Large VMs)

Azure Blob Storage

Azure Flat Network Storage

Stream data to compute

Push data back to storage

map sort shuffle reduce

http://dennyglee.com/2013/03/18/why-use-blob-storage-with-hdinsight-on-azure/

http://blogs.msdn.com/b/windowsazure/archive/2012/11/02/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx
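
The flow on this slide (stream data from storage to the compute nodes, run map, sort, shuffle and reduce, push the result back to storage) can be sketched as follows. A plain dictionary stands in for the blob container, and the blob names and the word-count job are hypothetical; no Azure SDK calls are used.

```python
from itertools import groupby
from operator import itemgetter

blob_storage = {   # stand-in for a storage container: blob name -> text
    "input/part-0.txt": "azure blob storage",
    "input/part-1.txt": "blob storage feeds compute",
}


def map_phase(text: str) -> list[tuple[str, int]]:
    """Emit one (word, 1) pair per word."""
    return [(word, 1) for word in text.split()]


def sort_phase(pairs):
    """Order the pairs by key so equal keys sit next to each other."""
    return sorted(pairs, key=itemgetter(0))


def shuffle_phase(sorted_pairs):
    """Group the values of each key so one reducer sees all of them."""
    return {key: [value for _, value in group]
            for key, group in groupby(sorted_pairs, key=itemgetter(0))}


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(storage: dict, input_prefix: str, output_name: str) -> dict:
    pairs = []
    for name, text in storage.items():
        if name.startswith(input_prefix):
            pairs.extend(map_phase(text))          # stream data to compute
    result = reduce_phase(shuffle_phase(sort_phase(pairs)))
    # Push data back to storage as tab-separated lines.
    storage[output_name] = "\n".join(f"{k}\t{v}" for k, v in result.items())
    return result


print(run_job(blob_storage, "input/", "output/counts.tsv"))
# {'azure': 1, 'blob': 2, 'compute': 1, 'feeds': 1, 'storage': 2}
```

The compute side keeps no data of its own in this sketch: input and output both live in the storage dictionary, so the compute part could be discarded after the job without losing anything.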

HDInsight Demo

Microsoft Self-Service BI

Powerful Self-Service BI with Excel 2013

Power Query

Suited for self-service data that fits in Excel

Data-driven shaping: design while you drive

Ideal for sampling data

Partition data in Hadoop/Hive based on user workloads

No governors to prevent users from pulling «too much data»

Does not read compressed or binary files (yet)
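
Since nothing stops a user from pulling too much data, one option is to sample on the Hadoop side before the data reaches Excel. The sketch below uses reservoir sampling to keep a fixed-size uniform random sample of a stream of unknown length. It is written in Python rather than in Power Query, and the row source and sample size are hypothetical.

```python
import random
from typing import Iterable, Optional


def reservoir_sample(rows: Iterable, k: int, seed: Optional[int] = None) -> list:
    """Keep a uniform random sample of at most k rows from a stream."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                sample[j] = row         # replace with probability k / (i + 1)
    return sample


# Hypothetical source: 100,000 row ids standing in for records in a Hive table.
rows = range(100_000)
sample = reservoir_sample(rows, k=1_000, seed=42)
print(len(sample))  # 1000
```

The sample size is fixed up front, so the amount of data handed to the spreadsheet stays bounded no matter how large the source grows.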

Demo - HDInsight

Azure Data Lake

Based on Apache YARN

Practically unlimited data volumes / compute power

Pay per use

Currently still by invitation only

New language: U-SQL

Demo

PowerBI

Cloud Dashboards

On-premises technology available (Datazen)

Connecting data via PowerBI is very easy

Hybrid setups are possible

Demo

Questions?
