
Page 1: Foxvalley bigdata

BIG DATA AND THE HADOOP ECOSYSTEM

TOM ROGERS
NORTHWESTERN UNIVERSITY
FEINBERG SCHOOL OF MEDICINE
DEPARTMENT OF ANESTHESIOLOGY

Page 2: Foxvalley bigdata

WHAT IS BIG DATA?

The 3 V’s

Page 3: Foxvalley bigdata

WHAT IS BIG DATA?

Volume: Terabytes, Petabytes, Exabytes

Page 4: Foxvalley bigdata

WHAT IS BIG DATA?

Volume

Velocity: System Logs, Medical Monitors, Machinery Controls

Page 5: Foxvalley bigdata

WHAT IS BIG DATA?

Volume

Velocity

Variety: RDBMS, Social Media, XML, JSON Documents, IoT

Veracity

Variability

Value

Page 6: Foxvalley bigdata

How do we collect, store and process all this data?

Page 7: Foxvalley bigdata

• Open Source Apache Software.

• Distributed processing across clusters of computers.

• Designed to scale to thousands of computers.

• Local computation and storage.

• Expects hardware failure, which is handled at the application layer.

A cute yellow elephant

Page 8: Foxvalley bigdata

HADOOP ECOSYSTEM OVERVIEW

• Distributed storage and processing.

• Runs on commodity server hardware.

• Scales horizontally with seamless failover.

• Hadoop is open source software.

Page 9: Foxvalley bigdata

TRADITIONAL DATA REPOSITORIES

• Highly structured in 3NF or star schema designs.

• Is the enterprise “Single Source of Truth”.

• Optimized for operations reporting requirements.

• Scales vertically.

• Limited interaction with external or unstructured data sources.

• Complex management schemes and protocols.

Page 10: Foxvalley bigdata

TRADITIONAL DATA SOURCES IN HEALTHCARE

• Data for the healthcare EDW originates from the organization’s clinical and administrative functions.

• Sources can be as sophisticated as highly complex on-line systems or as simple as Excel spreadsheets.

• Data goes through complex validation and transformation processes before inclusion in the EDW.

• Staging the data transformation requires separate storage and processing space, but is often done on the same physical hardware as the EDW.

Page 11: Foxvalley bigdata

INTEGRATION OF HADOOP AND TRADITIONAL IT

• Hadoop does not replace traditional storage or processing technologies.

• Hadoop can include data from traditional IT sources to discover new value.

• Compared to traditional IT, setting up and operating a Hadoop platform can be very inexpensive.

• It can, however, be seen as very expensive when added to an existing traditional IT environment.

Page 12: Foxvalley bigdata

EMERGING AND NON-TRADITIONAL DATA

• New knowledge is discovered by applying known experience in the context of unknown or new experience.

• New sources of data are being created in a seemingly unending manner.

• Social media and mobile computing provide sources of new data unavailable in the past.

• Monitors, system logs, and document corpora all provide new ways of capturing and expressing human experience that cannot be captured or analyzed with traditional IT methodologies.

Page 13: Foxvalley bigdata

INTEGRATION OF HADOOP AND NON-TRADITIONAL DATA

• Hadoop is designed to store and process non-traditional data sets.

• Optimized for unstructured, file-based data sources.

• Core applications developed specifically for different storage, processing, analysis and display activities.

• Development of metadata definitions and rules combined with data from disparate data sources can be used for deeper analytic discovery.

Page 14: Foxvalley bigdata

DATA ANALYSIS

• Inspecting, transforming, and modeling data to discover knowledge, make predictions, and suggest conclusions.

• 3rd party data analysis can be integrated into traditional IT environments or big data solutions.

• Traditionally conducted by working on discrete data sets in isolation from the decision making process.

• Data scientists are integrated into core business processes to create solutions for critical business problems using big data platforms.

Page 15: Foxvalley bigdata

COMPLETE HADOOP ECOSYSTEM

• Integration between traditional and non-traditional data is facilitated by the Hadoop ecosystem.

• Data is stored on a fault tolerant distributed file system in the Hadoop cluster.

• Data is processed close to where it is stored to reduce latency and time-consuming transfer processes.

• The Hadoop Master controller or “NameNode” monitors the processes of the Hadoop cluster and automatically executes actions to continue processing when failure is detected.

Page 16: Foxvalley bigdata

HADOOP CORE COMPONENTS

Storage: HDFS, Hive, HBase

Management: ZooKeeper, Avro, Oozie, Whirr

Processing: MapReduce, Spark

Integration: Sqoop, Flume

Programming: Pig, HiveQL, Jaql

Insight: Mahout, Hue, Beeswax

Page 17: Foxvalley bigdata

CORE COMPONENT - STORAGE

• HDFS – A distributed file system designed to run on commodity-grade hardware in the Hadoop computing ecosystem. The file system is highly fault tolerant, provides very high throughput to data, and is suitable for very large data sets. Fault tolerance is enabled by making redundant copies of data blocks and distributing them throughout the Hadoop cluster. A minimal usage sketch follows this list.

• Key Characteristics Include:
  • Streaming data access – Designed for batch processing instead of interactive use.
  • Large data sets – Typically gigabytes to terabytes in size.
  • Simple coherency model – Write once, read many, to enable high-throughput access.
  • Moving computation is cheaper than moving data.
  • Designed to be easily portable.
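For orientation, here is a minimal sketch of writing and reading a file through the HDFS Java FileSystem API; the NameNode address and file path are hypothetical and would normally come from the cluster's core-site.xml.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally picked up from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode.example.org:8020");

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/user/demo/healthy_people_2010/sample.csv");

                // Write a small file; HDFS replicates its blocks across DataNodes.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("indicator,value\nasthma_rate,8.4\n".getBytes(StandardCharsets.UTF_8));
                }

                // Stream the file back; reads are served from a nearby replica.
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.copyBytes(in, System.out, 4096, false);
                }
            }
        }
    }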

• Hive – A data warehouse implementation on Hadoop that facilitates querying and managing large datasets kept in distributed storage.

• Key Features:
  • Tools for ETL.
  • A mechanism for imposing structure on a variety of data formats.
  • Access to files stored in HDFS or HBase.
  • Query execution via MapReduce.
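To illustrate how applications typically issue HiveQL, here is a sketch that queries Hive through the standard HiveServer2 JDBC interface; the connection URL, user, table name (healthy_people_2010), and column names are assumptions for the example, and the Hive JDBC driver jar must be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuerySketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical HiveServer2 endpoint; 10000 is the conventional port.
            String url = "jdbc:hive2://hiveserver.example.org:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "demo", "");
                 Statement stmt = conn.createStatement()) {

                // HiveQL looks like SQL; Hive compiles it into cluster jobs behind the scenes.
                ResultSet rs = stmt.executeQuery(
                    "SELECT state, AVG(asthma_rate) AS avg_rate " +
                    "FROM healthy_people_2010 GROUP BY state ORDER BY avg_rate DESC LIMIT 10");

                while (rs.next()) {
                    System.out.println(rs.getString("state") + "\t" + rs.getDouble("avg_rate"));
                }
            }
        }
    }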

Page 18: Foxvalley bigdata

CORE COMPONENT – STORAGE …

• HBase – A distributed, scalable big data database providing random, real-time read/write access to big data.

• Key Features:
  • Modular scalability.
  • Strictly consistent reads and writes.
  • Automatic sharding of tables (partitioning tables into smaller, more manageable parts).
  • Automatic failover.
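To make the random, real-time read/write model concrete, here is a sketch against the HBase Java client API; the table name (patient_vitals), column family, and row-key layout are hypothetical, and the table is assumed to already exist.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("patient_vitals"))) {

                // Write one cell: row key, column family "v", qualifier "heart_rate".
                Put put = new Put(Bytes.toBytes("patient-0042#2016-01-24T10:15"));
                put.addColumn(Bytes.toBytes("v"), Bytes.toBytes("heart_rate"), Bytes.toBytes("72"));
                table.put(put);

                // Random, real-time read of the same row.
                Get get = new Get(Bytes.toBytes("patient-0042#2016-01-24T10:15"));
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("v"), Bytes.toBytes("heart_rate"));
                System.out.println("heart_rate = " + Bytes.toString(value));
            }
        }
    }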

Page 19: Foxvalley bigdata

CORE COMPONENT - MANAGEMENT

• ZooKeeper – A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.
• Avro – A data serialization system.
• Oozie – A Hadoop workflow scheduler.
• Whirr – A cloud-neutral library for running cloud services.

CORE COMPONENT - PROCESSING

• MapReduce – A framework for processing and generating large data sets with a parallel, distributed algorithm on a Hadoop cluster. A minimal sketch follows this list.

• Key Features:
  • Automatic parallelization and distribution.
  • Fault tolerance.
  • I/O scheduling.
  • Status monitoring.
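A minimal sketch of the programming model, following the canonical word-count pattern: the map function emits (word, 1) pairs, the reduce function sums them, and the framework handles parallelization, distribution, and fault tolerance. Input and output paths are supplied on the command line.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountSketch {

        // Map: split each input line into words and emit (word, 1).
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the counts for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count sketch");
            job.setJarByClass(WordCountSketch.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }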

Page 20: Foxvalley bigdata

CORE COMPONENT - INTEGRATION

• Sqoop – A utility designed to efficiently transfer bulk data between Hadoop and relational databases.
• Flume – A service, based on streaming data flows, for collecting, aggregating, and moving large amounts of system log data.

CORE COMPONENT – PROGRAMMING

• Pig – A high-level language for analyzing very large data sets, designed to efficiently use parallel processing to achieve its results.

• Key Properties:
  • Ease of programming – Complex tasks are explicitly encoded as data flow sequences, making them easy to understand and implement.
  • Significant optimization opportunities – The system optimizes execution automatically.
  • Extensibility – Users can encode their own functions.

• HiveQL – A SQL-like query language for data stored in Hive tables; queries are converted into MapReduce jobs.

• Jaql – A data processing and query language used to process JSON on Hadoop.

Page 21: Foxvalley bigdata

CORE COMPONENT - INSIGHT

• Mahout – A library of callable machine learning algorithms which uses the MapReduce paradigm. A small recommender sketch follows this list.

• Supports four main data use cases:
  • Collaborative filtering – Analyzes behavior and makes recommendations.
  • Clustering – Organizes data into naturally occurring groups.
  • Classification – Learns from the known characteristics of existing categorizations and assigns unclassified items to a category.
  • Frequent item or market basket mining – Analyzes data items in transactions and identifies items which typically occur together.

• Hue – A set of web applications that enable a user to interact with a Hadoop cluster, including browsing and interacting with Hive, Impala, MapReduce jobs, and Oozie workflows.

• Beeswax – An application which allows the user to perform queries on the Hive data warehousing application. You can create Hive tables, load data, run queries, and download results in Excel spreadsheet or CSV format.
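As a small illustration of the collaborative filtering use case, here is a sketch using Mahout's in-memory Taste recommender classes (rather than the distributed MapReduce jobs); the ratings file name, its userID,itemID,preference layout, and the neighborhood size of 10 are assumptions for the example.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class MahoutRecommenderSketch {
        public static void main(String[] args) throws Exception {
            // Ratings file in "userID,itemID,preference" CSV form (hypothetical data set).
            DataModel model = new FileDataModel(new File("ratings.csv"));

            // Compare users by the similarity of their rating patterns.
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 3 recommendations for user 42.
            List<RecommendedItem> recommendations = recommender.recommend(42L, 3);
            for (RecommendedItem item : recommendations) {
                System.out.println(item.getItemID() + " -> " + item.getValue());
            }
        }
    }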

Page 22: Foxvalley bigdata

HADOOP DISTRIBUTIONS

Amazon Web Services Elastic MapReduce
• One of the first commercial Hadoop offerings.
• Has the largest commercial Hadoop market share.
• Includes strong integration with other AWS cloud products.
• Auto scaling and support for NoSQL and BI integration.

Cloudera
• 2nd largest commercial market share.
• Experience with very large deployments.
• Revenue model based on software subscriptions.
• Aggressive innovation to meet customer demands.

Hortonworks
• Strong engineering partnerships with flagship companies.
• Innovation driven through the open source community.
• Is a key contributor to the Hadoop core project.
• Commits corporate resources to jump-start Hadoop community projects.

Page 23: Foxvalley bigdata

HADOOP DISTRIBUTIONS …

International Business Machines
• Vast experience in distributed computing and data management.
• Experience with very large deployments.
• Has advanced analytic tools and global recognition.
• Integration with a vast array of IBM management and productivity software.

MapR Technologies
• Heavy focus on, and early adoption of, enterprise features.
• Supports some legacy file systems such as NFS.
• Adding performance enhancements for HBase, high availability, and disaster recovery.

Pivotal
• Spin-off from EMC and VMware.
• Strong cadre of technical consultants and data scientists.
• Focus on an MPP SQL engine and EDW with very high performance.
• Has an appliance with integrated Hadoop, EDW, and data management in a single rack.

Page 24: Foxvalley bigdata

HADOOP DISTRIBUTIONS …

Teradata
• Specialist and strong background in EDW.
• Has a strong technical partnership with Hortonworks.
• Has very strong integration between Hadoop and Teradata’s management and EDW tools.
• Extensive financial and technical resources allow creation of unique and powerful appliances.

Microsoft Windows Azure HDInsight
• A product designed specifically for the cloud in partnership with Hortonworks.
• The only Hadoop distribution that runs in the Windows environment.
• Allows SQL Server users to also execute queries that include data stored in Hadoop.
• Unique marketing advantage for offering the Hadoop stack to traditional Windows customers.

Page 25: Foxvalley bigdata

RECOMMENDATION

Commitment and Leadership in the Open Source Community

Strong Engineering Partnerships

Innovation driven from the community

Innovative

Secure

Big Data/Health Research

Collaboration

Page 26: Foxvalley bigdata

CLUSTER DIAGRAM

• NameNode is a single master server which manages the file system and file system operations.

• DataNodes are slave servers that manage the data and the storage attached to the nodes they run on.

• NameNode is a single point of failure for the HDFS Cluster.

• A SecondaryNameNode can be configured on a separate server in the cluster which creates checkpoints for the namespace.

• SecondaryNameNode is not a failover NameNode.

Page 27: Foxvalley bigdata

CLUSTER HARDWARE CONFIGURATION AND COST

Factor/Specification                 Option 1                           Option 2
Replication Factor                   3                                  3
Size of Data to Move                 500 TB                             500 TB
Workspace Factor                     1.25                               1.25
Compression                          1 (no compression)                 3
Hadoop Storage Requirement           1875 TB                            625 TB
Storage Per Node                     16 TB                              16 TB
Rack Size                            42U                                42U
Node Unit Cost                       $4,000                             $4,000
Rack Unit Cost                       $1,500                             $1,500
Node Cost (1 NameNode & DataNodes)   (119 nodes * $4,000) = $476,000    (41 nodes * $4,000) = $164,000
Rack Cost                            (3 racks * $1,500) = $4,500        (1 rack * $1,500) = $1,500
Total Cost                           $480,500                           $165,500
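The storage figures above follow a simple rule of thumb (raw data * replication factor * workspace factor / compression ratio), and the node and rack counts follow from dividing by per-node storage. The sketch below just reproduces that arithmetic with the table's figures so the two options can be compared; packing 42 one-unit nodes into a 42U rack is an assumption chosen to match the table's rack counts.

    public class ClusterSizingSketch {

        // Rule of thumb: raw data * replication * workspace overhead / compression ratio.
        static void size(String label, double rawTb, double replication, double workspace,
                         double compression, double tbPerNode, int nodesPerRack,
                         double nodeCost, double rackCost) {
            double storageTb = rawTb * replication * workspace / compression;
            int dataNodes = (int) Math.ceil(storageTb / tbPerNode);
            int nodes = dataNodes + 1;                       // plus one NameNode
            int racks = (int) Math.ceil(nodes / (double) nodesPerRack);
            double total = nodes * nodeCost + racks * rackCost;
            System.out.printf("%s: %.0f TB, %d nodes, %d rack(s), total $%,.0f%n",
                    label, storageTb, nodes, racks, total);
        }

        public static void main(String[] args) {
            // Figures from the table; 42 nodes per 42U rack is an assumption (a real layout
            // would also reserve space for switches).
            size("Option 1 (no compression)", 500, 3, 1.25, 1.0, 16, 42, 4000, 1500);
            size("Option 2 (3x compression)", 500, 3, 1.25, 3.0, 16, 42, 4000, 1500);
        }
    }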

Page 28: Foxvalley bigdata

HADOOP SANDBOX IN ORACLE VIRTUALBOX

Host Specification
• Windows 10
• Intel® Core™ i7-4770 CPU @ 3.40GHz
• 16GB Installed RAM
• 64-bit OS, x64
• 1.65 TB Storage

VM Specification
• Cloudera QuickStart Sandbox
• Red Hat
• Intel® Core™ i7-4770 CPU @ 3.40GHz
• 10GB Allocated RAM
• 32MB Video Memory
• 64-bit OS
• 64GB Storage
• Shared Clipboard: Bidirectional
• Drag’n’Drop: Bidirectional

Page 29: Foxvalley bigdata

CLOUDERA HADOOP DESKTOP & INTERFACE

Opening the Cloudera interface and a view of the CDC “Healthy People 2010” data set that was uploaded to the Red Hat OS.

Page 30: Foxvalley bigdata

HUE FILE BROWSER

• Folder List
• File Contents
• Displayed file content is from the Vulnerable Population and Environmental Health data of the “Healthy People 2010” data set.

Page 31: Foxvalley bigdata

ADDING DATA TO HIVE

• Folder List
• File Contents
• Displayed file content is from the Vulnerable Population and Environmental Health data of the “Healthy People 2010” data set.

Page 32: Foxvalley bigdata

ADDING DATA TO HIVE …

Choosing a delimiter type

Defining columns

Page 33: Foxvalley bigdata

ADDING DATA TO HIVE …

• Hive Table List
• Table properties

Page 34: Foxvalley bigdata

HIVE QUERY EDITOR

Page 35: Foxvalley bigdata

BIBLIOGRAPHY

• "2015/03/18 - Apache Whirr Has Been Retired." Accessed January 24, 2016. https://whirr.apache.org/.
• "Apache Avro 1.7.7 Documentation." Accessed January 24, 2016. https://avro.apache.org/docs/current/.
• "Apache HBase – Apache HBase™ Home." Accessed January 24, 2016. https://hbase.apache.org/.
• "Apache Mahout." Hortonworks. Accessed January 24, 2016. http://hortonworks.com/hadoop/mahout/.
• "Best Practices for Selecting Apache Hadoop Hardware." Hortonworks. September 01, 2011. Accessed January 26, 2016. http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/.
• "CDH3 Documentation: Beeswax." Accessed January 24, 2016. http://www.cloudera.com/documentation/archive/cdh/3-x/3u6/Hue-1.2-User-Guide/hue1.html.
• "CDH4 Documentation: Introducing Hue." Accessed January 24, 2016. http://www.cloudera.com/documentation/archive/cdh/4-x/4-2-0/Hue-2-User-Guide/hue2.html.
• "Data Analysis." Wikipedia. Accessed January 24, 2016. https://en.wikipedia.org/wiki/Data_analysis.
• "Hadoop Is Transforming Healthcare." Hortonworks. Accessed January 24, 2016. http://hortonworks.com/industry/healthcare/.
• "HDFS Architecture Guide." Accessed January 24, 2016. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Introduction.
• "Healthy People." Centers for Disease Control and Prevention. January 22, 2013. Accessed January 30, 2016. http://www.cdc.gov/nchs/healthy_people.htm.
• "Home - Apache Hive." Apache Software Foundation. Accessed January 24, 2016. https://cwiki.apache.org/confluence/display/Hive/Home;jsessionid=A2FE8C570A86815B0B4890A923872351.
• "How the 9 Leading Commercial Hadoop Distributions Stack Up." CIO. Accessed January 24, 2016. http://www.cio.com/article/2368763/big-data/146238-How-the-9-Leading-Commercial-Hadoop-Distributions-Stack-Up.html.
• "How the 9 Leading Commercial Hadoop Distributions Stack Up." CIO. Accessed January 24, 2016. http://www.cio.com/article/2368763/big-data/146238-How-the-9-Leading-Commercial-Hadoop-Distributions-Stack-Up.html#slide1.
• "Map Reduce (MR) Framework." Gerardnico. Accessed January 24, 2016. http://gerardnico.com/wiki/algorithm/map_reduce.
• "Oozie - Apache Oozie Workflow Scheduler for Hadoop." Accessed January 24, 2016. http://oozie.apache.org/.
• "Sizing Your Hadoop Cluster." For Dummies. Accessed January 26, 2016. http://www.dummies.com/how-to/content/sizing-your-hadoop-cluster.html.
• "Sqoop." Accessed January 24, 2016. http://sqoop.apache.org/.
• "Preparing Data for Analytics: Making It Easier and Faster." TDWI. Accessed January 24, 2016. https://tdwi.org/articles/2015/04/14/preparing-data-for-analytics.aspx.
• "UC Irvine Health Does Hadoop. With Hortonworks Data Platform." Hortonworks. Accessed January 26, 2016. http://hortonworks.com/customer/uc-irvine-health/.
• "Welcome to Apache Flume." Accessed January 24, 2016. https://flume.apache.org/.
• "Welcome to Apache Pig!" Accessed January 24, 2016. https://pig.apache.org/.
• "Welcome to Apache ZooKeeper™." Accessed January 24, 2016. https://zookeeper.apache.org/.
• "Jaql." Wikipedia. Accessed January 24, 2016. https://en.wikipedia.org/wiki/Jaql.