1 © Cloudera, Inc. All rights reserved.

Roger Ding Cloudera February 3rd, 2018

Introduction to Big Data (Hadoop) Eco-System: The Modern Data Platform for Innovation and Business Transformation

Agenda

• Hadoop History

• Introduction to Apache Hadoop Eco-System

• Transition from Legacy Data Platform to Hadoop

• Resources, Q&A

Legacy RDBMS Quick Check

• Centralized Storage

• Centralized Computing

• Send data to compute

• Bottleneck

• Network bandwidth

• Slow disk I/O

• Scale-Up

• Add more memory, upgrade CPU, replace server every several years

• High Cost

• High-end Processing and Storage

• Hard to plan

• Time to Data

• Structured Data

• Up-front modeling

• Schema-on-write

• Transforms lose data

• No agility

Google 1999: Indexing the Web

The Original Inspirations for Hadoop

• 2003: Google File System (GFS) paper

• 2004: Google MapReduce paper

The Beginning: Building Hadoop

• 2006: Core Hadoop (HDFS, MapReduce)

Agenda

• Hadoop History

• Introduction to Apache Hadoop Eco-System

• Transition from Legacy Data Platform to Hadoop

• Resources, Q&A

Hadoop Eco-System Primer

• Hadoop consists of 3 core components

• HDFS (Hadoop Distributed File System): Self-healing, distributed storage framework

• MapReduce: Distributed computing framework

• YARN (Yet Another Resource Negotiator): Distributed resource management framework

• Many other projects based around core Hadoop

• Referred to as the “Hadoop Ecosystem” projects

• Spark, Pig, Hive, Impala, HBase, Flume, Sqoop, etc

• A set of machines running Hadoop Software is known as a Hadoop Cluster

• Individual machines are known as ‘nodes’

HDFS: Economically Feasible to Store More Data

Self-healing, high bandwidth clustered storage.

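[Figure: a file of five blocks (1-5) stored in HDFS across five nodes holding blocks {2, 4, 5}, {1, 2, 5}, {1, 3, 4}, {2, 3, 5}, and {1, 3, 4}, so that each block has three replicas.]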

HDFS breaks incoming files into blocks and stores them redundantly across the cluster.

$300-$1,000 per TB Affordable & Attainable
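The block-and-replica idea on this slide can be sketched in a few lines of Python. This is a toy illustration, not how HDFS is implemented: the function names, the 256-byte block size, the five node names, and the round-robin placement are all assumptions made for the example (real HDFS uses far larger blocks and a rack-aware placement policy).

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin, an assumption)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_blocks(placement, failed_node):
    """Return the ids of blocks that still have a replica on a healthy node."""
    return {
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    }


# Hypothetical input: a 1028-byte "file" and a five-node cluster.
data = bytes(range(256)) * 4 + b"tail"
nodes = ["node1", "node2", "node3", "node4", "node5"]

blocks = split_into_blocks(data, block_size=256)
placement = place_replicas(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(f"block {block_id} ({len(blocks[block_id])} bytes) -> {holders}")

# Every block is still readable if one node is lost.
print(readable_blocks(placement, failed_node="node1") == set(placement))
```

With three replicas on distinct nodes, losing any single node still leaves at least two copies of every block, which is what makes re-replication ("self-healing") possible.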

MapReduce: Power to predictably process large data

Distributed computing framework.

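[Figure: a job over blocks 1-5 processed by MapReduce (MR) across five nodes holding blocks {2, 4, 5}, {1, 2, 5}, {1, 3, 4}, {2, 3, 5}, and {1, 3, 4}.]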

Processes large jobs in parallel across many nodes and combines the results.
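A minimal, single-process sketch of the map, shuffle, and reduce steps may help make "processes in parallel and combines the results" concrete. It is not the Hadoop MapReduce API: the function names and the two sample "blocks" are hypothetical, and a real job would run the map and reduce tasks on many nodes rather than in one loop.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    for word in block.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def run_job(blocks):
    """Run map over every block, then shuffle, then reduce (sequentially here)."""
    mapped = [pair for block in blocks for pair in map_phase(block)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for one block stored on some node.
blocks = ["big data big cluster", "data node data"]
counts = run_job(blocks)
print(counts)
```

Because each call to `map_phase` depends only on its own block, the map step is the part that can be spread across nodes; only the shuffle needs to move data between them.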

A Decade of Hadoop – A platform won’t stop growing

[Figure: timeline of the Hadoop stack; each year lists the projects newly added on top of the previous year's stack.]

• 2006: Core Hadoop (HDFS, MapReduce)

• 2007: Solr, Pig

• 2008: HBase, ZooKeeper

• 2009: Hive, Mahout

• 2010: Sqoop, Avro

• 2011: Flume, Bigtop, Oozie, HCatalog, Hue, YARN

• 2012: Spark, Tez, Impala, Kafka, Drill

• 2013: Parquet, Sentry

• 2014: Knox, Flink

• 2016: Kudu, RecordService, Ibis, Falcon

Some Hadoop Eco-System Projects

• Data Storage

• HDFS, HBase, Kudu

• Computing Framework

• MapReduce, Spark, Flink

• Data Ingestion

• Sqoop, Flume, Kafka

• Data Serialization in HDFS

• Avro, Parquet

• Analytics

• Pig, Hive, Impala

• Orchestration

• ZooKeeper

• Workflow, Coordination

• Oozie

• Security (Authorization)

• Sentry

• Search

• Solr

Hadoop Eco-System – Storage Engine

• HDFS (2006): Large files, block storage

• HBase (2008): Key-Value store

• Kudu (2016): Structured data store

Hadoop Eco-System – Computing Framework

• Spark (2012)

• Originated at UC Berkeley AMPLab

• In-memory computing framework

• Processes data in memory, in contrast to MapReduce's two-stage paradigm

• Can perform 10 to 100 times faster than MapReduce for certain applications

• Flexible APIs (Scala, Java, Python) vs. MapReduce (Java)

• Includes 4 components on top of Core Spark: Spark Streaming, GraphX, MLlib, Spark SQL

Hadoop Eco-System – Analytics

• Hive (2010)

• Originated at Facebook

• Compiles SQL queries to MapReduce or Spark jobs

• Data warehouse tool in the Hadoop Eco-System

• Good for ETL, batch, and long-running jobs

• Impala (2013)

• Originated at Cloudera

• MPP (Massively Parallel Processing) SQL engine

• Much faster than Hive queries or Spark SQL; supports high concurrency, but has no fault tolerance

• Good for short-running, BI-style ad-hoc queries

• BI tools like Tableau and MicroStrategy connect to Impala through ODBC/JDBC

Hadoop Data Processing Pattern

• Distributed Storage

• Distributed Computing

• Send compute to data

• Scale-Out

• Add more nodes

• Cost Effective

• Commodity hardware

• Time to Data

• No Up-front modeling

• Schema-on-read

• 100% fidelity of original data

• Data agility
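The schema-on-read bullets above can be illustrated with a small Python sketch. It assumes, purely for the example, that raw events arrive as JSON lines; the sample records, field names, and the `read_with_schema` helper are hypothetical. The raw lines are stored untouched, and each consumer applies its own schema only when reading.

```python
import json

# Raw data is kept exactly as it arrived (100% fidelity), including odd records.
RAW_EVENTS = [
    '{"user": "a", "amount": "12.50", "channel": "web"}',
    '{"user": "b", "amount": "7", "device": "ios"}',
    "not json at all",
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) at read time; raw_lines is never modified."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable for this reader, but the raw line is still stored
        if not isinstance(record, dict):
            continue
        yield {
            field: (convert(record[field]) if field in record else None)
            for field, convert in schema.items()
        }


# Two consumers read the same raw data with different schemas, no re-ingestion needed.
billing_view = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
device_view = list(read_with_schema(RAW_EVENTS, {"user": str, "device": str}))

print(billing_view)
print(device_view)
```

A schema-on-write system would have had to pick one of these layouts before loading, and any field outside it (here `channel` or `device`) would have been dropped by the transform.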

Agenda

• Hadoop History

• Introduction to Apache Hadoop Eco-System

• Transition from Legacy Data Platform to Hadoop

• Resources, Q&A

Data Silos

[Figure: separate silos for Engineering, Marketing, Sales, HR, and Customer Service]

• Slow down your company

• Limit communication and collaboration

• Decrease the quality and credibility of data

Cloudera Enterprise Data Hub: Making Hadoop Fast, Easy, and Secure

A new kind of data platform:

• One place for unlimited data

• Unified, multi-framework data access

Cloudera makes it:

• Fast for business

• Easy to manage

• Secure without compromise

[Figure: Enterprise Data Hub architecture, by layer]

• OPERATIONS: Cloudera Manager, Cloudera Director

• DATA MANAGEMENT: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer

• INTEGRATE: STRUCTURED (Sqoop); STREAMING (Kafka, Flume)

• PROCESS, ANALYZE, SERVE: BATCH (Spark, Hive, Pig, MapReduce); STREAM (Spark); SQL (Impala); SEARCH (Solr); OTHER (Kite)

• UNIFIED SERVICES: RESOURCE MANAGEMENT (YARN); SECURITY (Sentry, RecordService)

• STORE: NoSQL (HBase); OTHER (Object Store); FILESYSTEM (HDFS); RELATIONAL (Kudu)

Data Mgmt. Chain

Data Sources → Data Ingest → Data Storage & Processing → Serving, Analytics & Machine Learning

ENTERPRISE DATA HUB

• Data sources: Connected Things/Data Sources; Structured Data Sources

• Apache Kafka: Stream ingestion

• Apache Flume: Stream ingestion

• Apache Sqoop: Ingestion of data from relational sources

• Apache Hadoop: Storage (HDFS) & deep batch processing

• Apache Kudu: Storage & serving for fast changing data

• Apache HBase: NoSQL data store for real time applications

• Apache Spark: Batch, stream & iterative processing, ML

• Apache Hive: Batch processing, ETL

• Apache Impala: MPP SQL for fast analytics

• Cloudera Search: Real time search

Security, Scalability & Easy Management

Deployment Flexibility: Datacenter, Cloud

The best-in-class organizations use Cloudera

• #1 largest biotech in the world.

• 7 out of the top 10 cancer drugs by 2020 are being made by Cloudera customers.

• #1 largest global genomic repository.

• #1 largest payer in the US will be covering 123 million lives and pay out $950B to providers in 2015.

• This hospital was one of the first four to receive Stage 7 status from HIMSS, the highest possible distinction in electronic medical records implementation, uses Cloudera to host a variety of data, and was awarded by US DHHS a Gold Medal of Honor.

• #1 commercial hospital chain worldwide.

• Over 150 health & life science organizations use enterprise-class Cloudera software.

• #1 most utilized Patient Centered Medical Home program.

• #1 largest health data company, with 500M+ anonymous patient records.

• #1 largest Health IT company in the world, $3B+ in revenue, has 1000’s of nodes of Cloudera.

The new version of the Broad Institute’s industry-standard GATK pipeline is based on Apache Spark; over 20,000 global users may migrate to Spark

Thanks to the contributions of Cloudera engineers, GATK4 now uses Apache Spark for both traditional local multithreading and for parallelization on Spark-capable compute infrastructure and services, such as Google Dataproc.

“It has been a privilege collaborating with the Broad Institute over the last two years to ensure that GATK4 can use the power of Apache Spark to make genomics workflows more scalable than previous approaches,” said Tom White, principal data scientist at Cloudera.

Seattle Children’s Research Institute

• 200+ PIs at Seattle Children’s Research Institute

• 9 research centers, including cancer, brain, birth, and infectious disease

• There was no integrated data platform across the 9 centers

• Evaluated multiple packaged applications, each costing multiple millions of dollars

• Selected Cloudera as the platform and created their own web user interface

Benefit: Today, a single lab at SCRI can evaluate and diagnose a single patient per week after receiving the whole exome and clinical record. After implementation, the lab could diagnose 4-5 patients per week.

Agenda

• Hadoop History

• Introduction to Apache Hadoop Eco-System

• Transition from Legacy Data Platform to Hadoop

• Resources, Q&A

Start Your Big Data Journey

• Download the Cloudera QuickStart Virtual Machine today

• Practice!

• Practice!!

• Practice!!!

Meetups

• AI + Big Data Healthcare Meetup (1,600+ members): https://www.meetup.com/AI-and-Big-Data-Healthcare-Meetup/

• Washington DC Area Apache Spark Interactive (2,700+ members): http://www.meetup.com/Washington-DC-Area-Spark-Interactive/

Thank you!

[email protected]