www.edureka.co/r-for-analytics
www.edureka.co/hadoop-admin
Hadoop: A Highly Available and Secure Enterprise Data Warehousing Solution
Slide 2
Agenda
At the end of this webinar, we will know about:
What is Big Data
Why enterprises care about Big Data
Why your DWH needs Hadoop
Security in Hadoop
How Hadoop maintains high availability
Data warehousing tools in Hadoop
Slide 3
What is Big Data
Slide 4
Slide 5
What Is Wrong with Our Traditional DWH Solutions?
Slide 6
When Does an RDBMS Make No Sense?
Storing unstructured data like images and video
Processing images and video
Storing and processing other large files
– PDFs, Excel files
Processing large blocks of natural language text
– Blog posts, job ads, product descriptions
Processing semi-structured data
– CSV, JSON, XML, log files
– Sensor data
Slide 7
When Does an RDBMS Make No Sense?
Ad-hoc, exploratory analytics
Integrating data from external sources
Data cleanup tasks
Very advanced analytics (machine learning)
Slide 8
Big Problems with Big Data
It is:
– Unstructured
– Unprocessed
– Un-aggregated
– Un-filtered
– Repetitive
– Low quality
– And generally messy.
Oh, and there is a lot of it.
Slide 9
Technical Challenges
Storage capacity
Storage throughput
Pipeline throughput
Processing power
Parallel processing
System integration
Data analysis
Scalable storage
Massively parallel processing
Ready-to-use tools
Slide 10
Technical Challenges
Too many channels for data
Slide 11
Why Do Enterprises Care about Big Data?
Slide 12
Slide 13
Slide 14
You said RDBMS does not have a solution for Big Data. Then who has?
Slide 15
Hadoop: The Savior
I have the solution for the Big Data problem:
Hadoop
Slide 16
How Hadoop Differs from RDBMS
Hadoop can store all types of data, which gives you the flexibility to analyze all of them.
You can drill down into big data to find even rare insights, which was not possible earlier.
Slide 18
Hadoop Is the New DWH Solution
First load the data, then do whatever you want with it.
This is possible because of cheap storage and the distributed HDFS.
This is ETL:
• Before loading, you must transform the data into a particular format
• This puts a restriction on the type of data that can be stored
This is ELT:
• There is no need to transform the data beforehand
• You can have all kinds of data on board
• Freedom to work with all data
Slide 19
Hadoop Does ELT, Not ETL
Hadoop is the new data warehouse for all kinds of BI requirements.
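To make the ETL versus ELT contrast concrete, here is a minimal Python sketch. It is only an illustration: the sample events, the two-column target schema (user, amount), and all function names are hypothetical and do not come from the webinar or from any Hadoop API.

```python
import json

# Hypothetical raw events, as they might arrive from different channels.
RAW_EVENTS = [
    '{"user": "alice", "amount": 12.5}',
    '{"user": "bob", "amount": "7", "note": "string amount, extra field"}',
    "free-text log line that is not JSON at all",
]


def etl_load(lines):
    """ETL: transform and validate first; only conforming rows are stored."""
    table = []
    for line in lines:
        try:
            record = json.loads(line)
            table.append({"user": str(record["user"]),
                          "amount": float(record["amount"])})
        except (ValueError, KeyError, TypeError):
            # Whatever does not fit the target schema is lost at load time.
            continue
    return table


def elt_load(lines):
    """ELT: store every record untouched; no schema is needed up front."""
    return list(lines)


def elt_transform(raw_store):
    """ELT: apply a schema later, at read time, for one particular analysis."""
    return etl_load(raw_store)
```

With ETL the free-text line is gone after loading. With ELT it is still in the raw store, so a later analysis can apply a different transformation to it.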
Slide 20
Core Features of Hadoop
Slide 21
Hadoop Is Fault Tolerant and Super Consistent
Slide 22
Maintaining High Availability (HA)
In distributed computing, failure is the norm, which means YARN should have an acceptable amount of availability.
[Diagram: the client gets block locations from the NameNode (namespace "NS" and block management) and reads data from the DataNodes.]
NameNode - No horizontal scale
NameNode - No high availability
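The read path on this slide (the client gets block locations from the NameNode, then reads the data from the DataNodes) can be sketched as a toy model in Python. This is not the HDFS client API: the class names, the block ids, and the demo file are hypothetical and only illustrate who holds what.

```python
class DataNode:
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


class NameNode:
    """Holds metadata only: which blocks form a file and where they live."""

    def __init__(self):
        self.namespace = {}        # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def get_block_locations(self, path):
        return [(b, self.block_locations[b]) for b in self.namespace[path]]


def read_file(namenode, datanodes, path):
    """Client read path: ask the NameNode where, then fetch from DataNodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        # Toy choice: always read the first listed replica.
        data += datanodes[holders[0]].blocks[block_id]
    return data


def build_demo_cluster():
    """Hypothetical two-node cluster holding one file split into two blocks."""
    datanodes = {name: DataNode(name) for name in ("dn1", "dn2")}
    namenode = NameNode()
    namenode.namespace["/logs/a.txt"] = ["blk_1", "blk_2"]
    namenode.block_locations = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2"]}
    datanodes["dn1"].blocks["blk_1"] = b"hello "
    datanodes["dn2"].blocks["blk_1"] = b"hello "
    datanodes["dn2"].blocks["blk_2"] = b"world"
    return namenode, datanodes
```

Every read starts with a lookup on the single NameNode, which is why the slide flags it for both scale and availability.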
Slide 23
NameNode – Single Point of Failure
Secondary NameNode:
"Not a hot standby" for the NameNode
Connects to the NameNode every hour*
Housekeeping, backup of NameNode metadata
Saved metadata can be used to rebuild a failed NameNode
[Diagram: the NameNode (single point of failure) sends its metadata to the Secondary NameNode, which says: "You give me metadata every hour, I will make it secure."]
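The Secondary NameNode's housekeeping can be pictured as a periodic checkpoint: the saved namespace image is merged with the edits logged since the last checkpoint. The Python below is a toy model under that assumption. The set-of-paths image, the two edit operations, and the function names are hypothetical simplifications, not HDFS code.

```python
def apply_edit(image, edit):
    """Apply one logged namespace change to an in-memory image (a set of paths)."""
    op, path = edit
    if op == "create":
        image.add(path)
    elif op == "delete":
        image.discard(path)
    else:
        raise ValueError(f"unknown edit operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the image.

    Returns the new image and an empty edit log, which is what the
    periodic housekeeping run would hand back.
    """
    new_image = set(fsimage)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []
```

Edits made after the last checkpoint exist only on the NameNode, so a NameNode rebuilt from the saved copy can be up to one checkpoint interval behind. That is one reason the Secondary NameNode is "not a hot standby".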
Slide 24
Hadoop 2.0 Cluster Architecture - HA
[Diagram: a Client; an HDFS layer (NameNode High Availability) with an Active NameNode, a Standby NameNode, a Secondary Name Node, and shared edit logs; a YARN layer (Next Generation MapReduce) with a Resource Manager; worker nodes, each with a DataNode and a Node Manager hosting Containers and App Masters.]
Shared edit logs:
– Active NameNode: all namespace edits are logged to shared NFS storage; single writer (fencing)
– Standby NameNode: reads the edit logs and applies them to its own namespace
HDFS HIGH AVAILABILITY
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
Demo
Achieving HDFS and YARN High Availability
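The shared edit log idea from the HA architecture slide (one writer, a standby that replays edits, fencing on failover) can be sketched as a toy simulation. The sketch assumes an in-memory list in place of the shared NFS directory. The class and function names are hypothetical, and real fencing is done by configured fencing methods rather than a writer field.

```python
class SharedEditLog:
    """Stands in for the shared NFS directory; only one writer is allowed."""

    def __init__(self):
        self.entries = []
        self.writer = None

    def fence(self, new_writer):
        # Fencing: whoever held write access before loses it.
        self.writer = new_writer

    def append(self, writer, edit):
        if writer != self.writer:
            raise PermissionError(f"{writer} is fenced off from the edit log")
        self.entries.append(edit)


class HaNameNode:
    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.namespace = set()
        self.applied = 0  # how many shared-log entries are already applied

    def create(self, path):
        """Active role: log the edit first, then apply it locally."""
        self.log.append(self.name, ("create", path))
        self.namespace.add(path)
        self.applied = len(self.log.entries)

    def tail_edits(self):
        """Standby role: replay edits not yet applied to the local namespace."""
        for op, path in self.log.entries[self.applied:]:
            if op == "create":
                self.namespace.add(path)
        self.applied = len(self.log.entries)


def failover(log, standby):
    """Fence the old active, let the standby catch up, then it may write."""
    log.fence(standby.name)
    standby.tail_edits()
```

After `failover(log, standby)`, a write attempted by the old active raises `PermissionError`, so the two NameNodes cannot both modify the namespace.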
Slide 26
Hadoop Is Secure
Slide 27
Security
Service-level authorization and web proxy capabilities in YARN.
Access Control Lists (ACLs): the Hadoop Distributed File System (HDFS) implements a permissions model for files and directories that shares much of the POSIX model.
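A POSIX-style permission check extended with ACL entries can be sketched in Python as follows. This is a simplified illustration of the evaluation order (owner, named users, groups, others, with a mask capping named and group entries), not HDFS source code. The `Inode` type and `can_access` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    owner: str
    group: str
    owner_perm: str  # e.g. "rwx"
    group_perm: str  # e.g. "r-x"
    other_perm: str  # e.g. "---"
    named_users: dict = field(default_factory=dict)   # ACL: user name -> perm
    named_groups: dict = field(default_factory=dict)  # ACL: group name -> perm
    mask: str = "rwx"  # upper bound for named entries and the group entry


def _allows(perm, action, mask="rwx"):
    """True if `action` ('r', 'w' or 'x') is in both the entry and the mask."""
    return action in perm and action in mask


def can_access(inode, user, groups, action):
    # 1. The owner is checked against the owner bits only.
    if user == inode.owner:
        return _allows(inode.owner_perm, action)
    # 2. A named-user ACL entry comes next, capped by the mask.
    if user in inode.named_users:
        return _allows(inode.named_users[user], action, inode.mask)
    # 3. Any matching group entry (file group or named group), capped by the mask.
    group_entries = []
    if inode.group in groups:
        group_entries.append(inode.group_perm)
    group_entries += [p for g, p in inode.named_groups.items() if g in groups]
    if group_entries:
        return any(_allows(p, action, inode.mask) for p in group_entries)
    # 4. Everyone else falls through to the "other" bits.
    return _allows(inode.other_perm, action)
```

Without any named entries this reduces to the classic owner/group/other check. The ACL entries add per-user and per-group exceptions on top of it.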
Slide 28
Security – Simple Flow
[Diagram: a Client submits to the Job Tracker; Task Trackers run Tasks that access HDFS.]
Security Risks
Insufficient authentication: users and services are not authenticated
No privacy and no integrity: insecure network transport, no message-level security
Arbitrary code execution: no user verification for MapReduce code execution, so malicious users could submit a job
Slide 29
Managing users, permissions, quotas, etc.
Checking Resource Usage and User Permissions
Demo
Demo on ACL
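The demo itself is not part of this transcript. As a rough idea of what such a session might involve, the Python helper below assembles a few standard HDFS shell commands for quotas and ACLs without running them. The helper name, the directory, the user, and the quota values are hypothetical placeholders. ACL commands also assume that ACL support is enabled on the NameNode.

```python
def quota_and_acl_commands(directory, user, name_quota, space_quota):
    """Build (but do not run) HDFS commands for quotas and ACLs.

    Each command is an argv list that could be passed to subprocess.run
    on a machine with a configured Hadoop client.
    """
    return [
        # Limit the number of files and directories under `directory`.
        ["hdfs", "dfsadmin", "-setQuota", str(name_quota), directory],
        # Limit the raw space consumed under `directory` (e.g. "10g").
        ["hdfs", "dfsadmin", "-setSpaceQuota", space_quota, directory],
        # Show quotas and current usage.
        ["hdfs", "dfs", "-count", "-q", directory],
        # Grant `user` read and execute access through an ACL entry.
        ["hdfs", "dfs", "-setfacl", "-m", f"user:{user}:r-x", directory],
        # Show the resulting ACL.
        ["hdfs", "dfs", "-getfacl", directory],
    ]
```

For example, `quota_and_acl_commands("/user/alice", "bob", 1000, "10g")` returns five commands, starting with `hdfs dfsadmin -setQuota 1000 /user/alice`.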
Slide 31
Hadoop provides a traditional SQL interface as well as a NoSQL interface for data storage
Slide 32
Hive?
Slide 33
Hive Architecture
Slide 34
HBase and Its Architecture?
Hive and HBase Integration
Questions
Slide 36
Slide 37
Your feedback is vital for us, be it a compliment, a suggestion or a complaint. It helps us to make your experience better!
Please spare a few minutes to take the survey after the webinar.
Survey