TRANSCRIPT
1 © Cloudera, Inc. All rights reserved.
Roger Ding Cloudera February 3rd, 2018
Introduction to Big Data (Hadoop) Eco-System: The Modern Data Platform for Innovation and Business Transformation
Agenda
•Hadoop History
•Introduction to Apache Hadoop Eco-System
•Transition from Legacy Data Platform to Hadoop
•Resources, Q & A
Legacy RDBMS Quick Check
• Centralized Storage
• Centralized Computing
• Send data to compute
• Bottleneck
• Network bandwidth
• Slow disk I/O
• Scale-Up
• Add more memory, upgrade CPUs, replace servers every few years
• High Cost
• High-end Processing and Storage
• Hard to plan
• Time to Data
• Structured Data
• Up-front modeling
• Schema-on-write
• Transforms lose data
• No agility
Google 1999: Indexing the Web
The Original Inspirations for Hadoop
2003 2004
2006 Core Hadoop: HDFS, MapReduce
The Beginning: Building Hadoop
Agenda
•Hadoop History
• Introduction to Apache Hadoop Eco-System
•Transition from Legacy Data Platform to Hadoop
•Resources, Q & A
Hadoop Eco-System Primer
• Hadoop consists of 3 core components
• HDFS (Hadoop Distributed File System): Self-healing, Distributed Storage Framework
• MapReduce: Distributed Computing Framework
• YARN (Yet Another Resource Negotiator): Distributed Resource Management Framework
• Many other projects based around core Hadoop
• Referred to as the “Hadoop Ecosystem” projects
• Spark, Pig, Hive, Impala, HBase, Flume, Sqoop, etc.
• A set of machines running Hadoop Software is known as a Hadoop Cluster
• Individual machines are known as ‘nodes’
HDFS: Economically Feasible to Store More Data
Self-healing, high bandwidth clustered storage.
[Diagram: a file split into blocks 1–5, each block replicated on three different nodes in the cluster]
HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
$300–$1,000 per TB: affordable and attainable
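The block-and-replica scheme above can be sketched in a few lines of Python. This is a toy illustration of the placement idea, not HDFS code: real HDFS placement is rack-aware, and the node names and sizes here are made up.

```python
# Toy sketch of HDFS-style block placement: split a file into fixed-size
# blocks and assign each block to `replication` distinct nodes.

def place_blocks(file_size, block_size, nodes, replication=3):
    """Return {block_index: [nodes holding a replica of that block]}."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Round-robin over nodes so the replicas of one block land on
        # `replication` different machines.
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
# A 350 MB file with 128 MB blocks -> 3 blocks, each stored on 3 of the 4 nodes.
layout = place_blocks(350, 128, nodes, replication=3)
for block, replicas in layout.items():
    print(f"block {block}: {replicas}")
```

Losing any single node still leaves at least two replicas of every block, which is the "self-healing" property: HDFS detects under-replicated blocks and re-copies them elsewhere.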
MapReduce: Power to predictably process large data
Distributed computing framework.
[Diagram: the same blocks 1–5, with computation dispatched to the nodes that hold each block]
Processes large jobs in parallel across many nodes and combines the results.
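The map-shuffle-reduce flow can be mimicked in plain Python. This is a single-process sketch of the paradigm (word count, the classic example), not actual Hadoop MapReduce code:

```python
from collections import defaultdict

# Single-process sketch of the MapReduce paradigm: map emits (key, value)
# pairs, the shuffle groups values by key, and reduce combines each group.
# In real MapReduce each phase runs in parallel across many nodes.

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Because map emits independent pairs and reduce only sees one key's group at a time, both phases parallelize naturally, which is what lets Hadoop "send compute to data" instead of moving data to a central server.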
A Decade of Hadoop: A Platform That Keeps Growing
New ecosystem projects by year (each year builds cumulatively on what came before):
• 2006: Core Hadoop (HDFS, MapReduce)
• 2007: Solr, Pig
• 2008: HBase, ZooKeeper
• 2009: Hive, Mahout
• 2010: Sqoop, Avro
• 2011: YARN, Flume, Bigtop, Oozie, HCatalog, Hue
• 2012: Spark, Tez, Impala, Kafka, Drill
• 2013: Parquet, Sentry
• 2014: Knox, Flink
• 2016: Kudu, RecordService, Ibis, Falcon
Some Hadoop Eco-System Projects
• Data Storage
• HDFS, HBase, Kudu
• Computing Framework
• MapReduce, Spark, Flink
• Data Ingestion
• Sqoop, Flume, Kafka
• Data Serialization in HDFS
• Avro, Parquet
• Analytics
• Pig, Hive, Impala
• Orchestration
• ZooKeeper
• Workflow, Coordination
• Oozie
• Security (Authorization)
• Sentry
• Search
• Solr
Hadoop Eco-System – Storage Engine
• HDFS (2006): Large files, block storage
• HBase (2008): Key-Value store
• Kudu (2016): Storage for structured data
Hadoop Eco-System – Computing Framework
• Spark (2012)
• Originated at UC Berkeley AMPLab
• In-memory computing framework
• Processes data in memory, vs. MapReduce's two-stage, disk-based paradigm
• Can perform 10 to 100 times faster than MapReduce for certain applications
• Flexible APIs (Scala, Java, Python) vs. MapReduce (Java)
• Includes four components on top of core Spark: Spark Streaming, GraphX, MLlib, Spark SQL
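The chained, functional API style that makes Spark more flexible than raw MapReduce can be illustrated with a toy class in plain Python. This is not PySpark: real Spark RDDs are lazy, partitioned, and distributed across the cluster, while this sketch evaluates eagerly on one machine.

```python
# Minimal RDD-like wrapper to illustrate Spark's chained transformation
# style (map / filter / reduce / collect). A toy, not the Spark API.

class ToyRDD:
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        return ToyRDD(fn(x) for x in self.data)

    def filter(self, pred):
        return ToyRDD(x for x in self.data if pred(x))

    def reduce(self, fn):
        result = self.data[0]
        for x in self.data[1:]:
            result = fn(result, x)
        return result

    def collect(self):
        return self.data

# Square the even numbers and sum them, in one readable chain.
total = (ToyRDD(range(10))
         .filter(lambda x: x % 2 == 0)
         .map(lambda x: x * x)
         .reduce(lambda a, b: a + b))
print(total)  # 0 + 4 + 16 + 36 + 64 = 120
```

Expressing a job as a pipeline of small transformations, rather than hand-writing map and reduce classes in Java, is a large part of why Spark displaced MapReduce for many workloads.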
Hadoop Eco-System – Analytics
• Hive (2010)
• Originated at Facebook
• Compiles SQL queries into MapReduce or Spark jobs
• The data warehouse tool in the Hadoop Eco-System
• Good for ETL, batch, long-running jobs
• Impala (2013)
• Originated at Cloudera
• MPP (Massively Parallel Processing) SQL engine
• Much faster than Hive or Spark SQL; supports high concurrency; but no fault tolerance
• Good for short-running, BI-style ad-hoc queries
• BI tools such as Tableau and MicroStrategy connect to Impala through ODBC/JDBC
Hadoop Data Processing Pattern
• Distributed Storage
• Distributed Computing
• Send compute to data
• Scale-Out
• Add more nodes
• Cost Effective
• Commodity hardware
• Time to Data
• No Up-front modeling
• Schema-on-read
• 100% fidelity of original data
• Data agility
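The schema-on-read idea above can be shown in a few lines of Python. The data and column names here are hypothetical; the point is that the schema is applied when the data is read, not when it is stored, so the raw bytes keep 100% fidelity and can be re-read differently later.

```python
import csv
import io

# Schema-on-read: raw lines are stored untouched; a schema is applied
# only at query time, so the same bytes can later be re-read with a
# different schema without reloading or transforming anything.

raw = "2018-02-03,web,200\n2018-02-03,api,500\n"  # stored exactly as received

def read_with_schema(raw_text, columns):
    """Apply a column schema while reading the raw data."""
    reader = csv.reader(io.StringIO(raw_text))
    return [dict(zip(columns, row)) for row in reader]

# Today's schema...
events = read_with_schema(raw, ["date", "source", "status"])
print(events[1]["status"])   # '500'

# ...and a different set of column names tomorrow, over the same raw bytes.
events2 = read_with_schema(raw, ["day", "channel", "code"])
print(events2[0]["channel"])  # 'web'
```

Contrast this with schema-on-write, where the transform happens before loading: anything the schema discards is gone, and changing the schema means re-ingesting the data.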
Agenda
•Hadoop History
• Introduction to Apache Hadoop Eco-System
•Transition from Legacy Data Platform to Hadoop
•Resources, Q & A
Data Silos
[Diagram: separate silos for Engineering, Marketing, Sales, HR, Customer Service]
• Slow down your company
• Limit communication and collaboration
• Decrease the quality and credibility of data
Cloudera Enterprise Data Hub: Making Hadoop Fast, Easy, and Secure
A new kind of data platform:
• One place for unlimited data
• Unified, multi-framework data access
Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise
Platform layers:
• OPERATIONS: Cloudera Manager, Cloudera Director
• DATA MANAGEMENT: Cloudera Navigator, Navigator Encrypt and Key Trustee, Navigator Optimizer
• INTEGRATE: Structured (Sqoop); Streaming (Kafka, Flume)
• PROCESS, ANALYZE, SERVE: Batch (Spark, Hive, Pig, MapReduce); Stream (Spark); SQL (Impala); Search (Solr); Other (Kite)
• UNIFIED SERVICES: Resource Management (YARN); Security (Sentry, RecordService)
• STORE: Filesystem (HDFS); Relational (Kudu); NoSQL (HBase); Other (Object Store)
Data Management Chain: Data Sources → Data Ingest → Data Storage & Processing → Serving, Analytics & Machine Learning
ENTERPRISE DATA HUB
• Data Sources: connected things / data sources, structured data sources
• Data Ingest: Apache Kafka (stream ingestion), Apache Flume (stream ingestion), Apache Sqoop (ingestion of data from relational sources)
• Data Storage & Processing: Apache Hadoop HDFS (storage & deep batch processing), Apache Kudu (storage & serving for fast-changing data), Apache Spark (batch, stream & iterative processing, ML), Apache Hive (batch processing, ETL)
• Serving, Analytics & Machine Learning: Apache HBase (NoSQL data store for real-time applications), Apache Impala (MPP SQL for fast analytics), Cloudera Search (real-time search)
• Security, scalability & easy management; deployment flexibility: datacenter or cloud
The best-in-class organizations use Cloudera
• The #1 largest biotech in the world
• 7 of the top 10 cancer drugs by 2020 are being made by Cloudera customers
• The #1 largest global genomic repository
• The #1 largest payer in the US, which will be covering 123 million lives and paying out $950B to providers in 2015
• A hospital that was one of the first four to receive Stage 7 status from HIMSS (the highest possible distinction in electronic medical records implementation), uses Cloudera to host a variety of data, and was awarded a Gold Medal of Honor by the US DHHS
• The #1 commercial hospital chain worldwide
• Over 150 health & life science organizations use enterprise-class Cloudera software
• The #1 most utilized Patient-Centered Medical Home program
• The #1 largest health data company, with 500M+ anonymous patient records
• The #1 largest health IT company in the world, with $3B+ in revenue, running thousands of nodes of Cloudera
The new version of the Broad Institute's industry-standard GATK pipeline is based on Apache Spark, and its 20,000+ global users may migrate to Spark.
Thanks to the contributions of Cloudera engineers, GATK4 now uses Apache Spark both for traditional local multithreading and for parallelization on Spark-capable compute infrastructure and services, such as Google Dataproc.
"It has been a privilege collaborating with the Broad Institute over the last two years to ensure that GATK4 can use the power of Apache Spark to make genomics workflows more scalable than previous approaches," said Tom White, principal data scientist at Cloudera.
Seattle Children’s Research Institute
• 200+ PIs at Seattle Children's Research Institute
• 9 research centers, including cancer, brain, birth, and infectious disease
• There was no integrated data platform across the 9 centers
• Evaluated multiple packaged applications, each costing multiple millions of dollars
• Selected Cloudera as the platform and created their own web user interface
Benefit: Previously, a single lab at SCRI could evaluate and diagnose one patient per week after receiving the whole exome and clinical record. After implementation, the lab can diagnose 4–5 patients per week.
Agenda
•Hadoop History
• Introduction to Apache Hadoop Eco-System
•Transition from Legacy Data Platform to Hadoop
•Resources, Q&A
Start Your Big Data Journey
•Download Cloudera QuickStart Virtual Machine Today
•Practice !
•Practice !!
•Practice !!!
Meetups
AI + Big Data Healthcare Meetup (1,600+ members)
https://www.meetup.com/AI-and-Big-Data-Healthcare-Meetup/
Washington DC Area Apache Spark Interactive (2,700+ members)
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/