biconsulting.hu/letoltes/2015budapestdata/budapestdata...hadoop-ds...
Claus Samuelsen, IBM Analytics, [email protected]
Big Data projects and use cases
IBM Software
2 © 2014 IBM Corporation
Overview of BigInsights
IBM BigInsights Analyst
● Big SQL: industry standard SQL
● BigSheets: spreadsheet-style tool
IBM BigInsights Data Scientist
● Text Analytics
● Machine Learning on Big R
● Big R (R support)
IBM BigInsights Enterprise Management
● POSIX Distributed Filesystem
● Multi-workload, multi-tenant scheduling
IBM Open Platform with Apache Hadoop* (HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, ZooKeeper, OpenJDK, Knox, Slider)
*IBM Open Platform with Apache Hadoop is a 100% open source Apache Hadoop distribution. IBM will include the Open Data Platform common kernel once available.
Free Quick Start (non-production):
● IBM Open Platform
● BigInsights Analyst, Data Scientist features
● Community support
IBM Big SQL – Runs 100% of the queries
Key points
● With Impala and Hive, many queries needed to be rewritten, some significantly
● Owing to various restrictions, some queries could not be rewritten or failed at run time
● Rewriting queries in a benchmark scenario where results are known is one thing; doing this against real databases in production is another
● Other environments require significant effort at scale
● Results for 10TB scale shown here
Hadoop-DS benchmark – Single user performance @ 10TB
Big SQL is 3.6x faster than Impala and 5.4x faster than Hive 0.13 for single query stream using 46 common queries
Based on IBM internal tests comparing BigInsights Big SQL, Cloudera Impala and Hortonworks Hive (current versions available as of 9/01/2014) running on identical hardware. The test workload was based on the latest revision of the TPC-DS benchmark specification at 10TB data size. Successful executions measure the ability to execute queries a) directly from the specification without modification, b) after simple modifications, c) after extensive query rewrites. All minor modifications are either permitted by the TPC-DS benchmark specification or are of a similar nature. All queries were reviewed and attested by a TPC certified auditor. Development effort measured time required by a skilled SQL developer familiar with each system to modify queries so they will execute correctly. Performance test measured scaled query throughput per hour of 4 concurrent users executing a common subset of 46 queries across all 3 systems at 10TB data size. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera. Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.
© 2009 IBM Corporation
Big Data Projects
● Stock Trade Analysis
● Log File Root Cause Analysis
● 360 Degree Customer View
● Gamers Behaviour
● Weather Data Analysis
● Sensitive Data Access
● Tax Fraud Investigation
● Warehouse Augmentation
● Positive side effects of drugs
● CRM analysis
● Ontologies
● Document classification
● Roaming Log Analysis
● Connected Cars
● Historical Archive Research
● DNA sequencing
Warehouse Augmentation
Banking Industry: Fraud Analysis
The customer wanted to implement two different kinds of fraud analysis: transaction fraud and social engineering fraud.
Problem:
● The existing data warehouse does not allow long-running jobs
● Extending the data warehouse is very costly
Solution:
● Moving the data to IBM BigInsights reduces the cost significantly
● No limitations on long-running jobs
● Obtaining the data from the various sources is the most time-consuming part of the process
● Using Big SQL, we can run the same queries in Hadoop as in the traditional warehouse
● With Big SQL, the customer can connect using their standard JDBC/ODBC-based SQL tools
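The point about reusing queries through standard SQL interfaces can be illustrated with a minimal sketch. It assumes nothing about Big SQL itself: Python's built-in sqlite3 module stands in for any SQL endpoint reachable through a standard driver, and the `transactions` table, its columns, and the thresholds are hypothetical. The only idea shown is that the query text stays unchanged while the connection it runs on is swapped.

```python
import sqlite3

# Hypothetical fraud-screening query: accounts with several large
# transactions. Table name, columns, and thresholds are illustrative only.
LARGE_TX_QUERY = """
SELECT account_id, COUNT(*) AS large_tx_count
FROM transactions
WHERE amount > 10000
GROUP BY account_id
HAVING COUNT(*) >= 2
ORDER BY account_id
"""

def run_query(conn, sql):
    """Run the same SQL text on whatever DB-API connection is passed in."""
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()

def make_demo_connection():
    """In-memory SQLite database standing in for a real SQL endpoint."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (account_id TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?)",
        [("A1", 12000.0), ("A1", 15000.0), ("A2", 50.0), ("A3", 20000.0)],
    )
    return conn

warehouse = make_demo_connection()  # stand-in for the traditional warehouse
hadoop = make_demo_connection()     # stand-in for the Hadoop-side SQL engine

# The identical query text runs against both connections.
warehouse_rows = run_query(warehouse, LARGE_TX_QUERY)
hadoop_rows = run_query(hadoop, LARGE_TX_QUERY)
```

In a real deployment the two connections would come from the respective JDBC/ODBC drivers; the query string is the part that is meant to stay the same.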
Document Classification
Insurance Industry: Automatic classification
Problem: Insurance documents are not standardized. They are typically free-form documents written as e-mails, MS Word files, etc. Incoming documents are not classified and are therefore often sent to the wrong department or the wrong person, resulting in unacceptably long processing times.
Solution:
Using BigInsights Text Analytics, new documents can be classified automatically.
The customer had described the characteristics of the different classes the documents had to be sorted into.
Using these descriptions, we implemented the rules in BigInsights within three weeks, to a degree that satisfied the customer.
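The rule-based approach described above can be sketched in a few lines. This is not BigInsights Text Analytics and not the customer's rule set: it is a plain Python stand-in in which each class is described by a handful of keyword patterns, and a document goes to the class with the most matches. The class names and patterns are hypothetical.

```python
import re

# Hypothetical class descriptions: each class is characterised by a few
# keyword patterns. Real rules would come from the customer's descriptions.
RULES = {
    "claims": [r"\bclaim\b", r"\bdamage\b", r"\baccident\b"],
    "billing": [r"\binvoice\b", r"\bpayment\b", r"\bpremium\b"],
    "cancellation": [r"\bcancel\w*\b", r"\bterminate\w*\b"],
}

def classify(text, rules=RULES, default="unclassified"):
    """Return the class whose patterns match most often, or `default` if none match."""
    lowered = text.lower()
    scores = {
        label: sum(len(re.findall(pattern, lowered)) for pattern in patterns)
        for label, patterns in rules.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

example = "Hello, I had an accident last week and want to file a claim for the damage."
label = classify(example)
```

Documents that match no rule fall into the `default` bucket, which in practice would be routed to a person for manual sorting.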
An IBM Proof of Technology
© 2013 IBM Corporation
Distinguishing characteristics
Application Portability & Integration
● Data shared with Hadoop ecosystem
● Comprehensive file format support
● Superior enablement of IBM and third-party software
Performance
● Modern MPP runtime
● Powerful SQL query rewriter
● Cost-based optimizer
● Optimized for concurrent user throughput
● Results not constrained by memory
Federation
● Distributed requests to multiple data sources within a single SQL statement
● Main data sources supported: DB2 LUW, Teradata, Oracle, Netezza, Informix, SQL Server
Enterprise Features
● Advanced security/auditing
● Resource and workload management
● Self-tuning memory management
● Comprehensive monitoring
Rich SQL
● Comprehensive SQL support
● IBM SQL PL compatibility
● Extensive analytic functions
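The federation point, a single SQL statement spanning several data sources, can be illustrated on a very small scale. The sketch below is only an analogy and says nothing about how Big SQL federation is configured: it uses SQLite's ATTACH DATABASE so that one connection sees two separate databases, and the `orders` and `customers` tables are hypothetical.

```python
import sqlite3

# One connection, two separate in-memory databases: "main" and "crm".
# SQLite's ATTACH DATABASE is only a small-scale analogy for federation.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE main.orders (customer_id INTEGER, total REAL)")
conn.execute("CREATE TABLE crm.customers (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 40.0), (1, 60.0), (2, 25.0)],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob")],
)

# A single SQL statement that reads from both data sources.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.total) AS total_spent
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

The caller writes one join and does not move data between the sources by hand; in a federated engine the same idea applies across remote systems such as those listed on the slide.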
Big SQL – Behind the scenes
● Big SQL is derived from an existing IBM shared-nothing RDBMS
  – A very mature MPP architecture
  – Already understands distributed joins and optimization
● Behavior is sufficiently different
  – Certain SQL constructs are disabled
  – Traditional data warehouse partitioning is unavailable
  – New SQL constructs introduced
● On the surface, porting a shared-nothing RDBMS to a shared-nothing cluster (Hadoop) seems easy, but …
[Figure: Traditional Distributed RDBMS Architecture, showing four database partitions]
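For readers unfamiliar with the traditional shared-nothing layout in the figure, the following sketch shows the general idea of hash partitioning and a partition-local join. It illustrates the classic technique only, which the slide notes is not available as such in Big SQL, and is not a description of Big SQL internals. The tables, the CRC32 hash, and the function names are assumptions chosen for the example.

```python
from collections import defaultdict
import zlib

NUM_PARTITIONS = 4  # matches the four database partitions in the figure

def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Map a distribution key to a partition using a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

def distribute(rows, key_index):
    """Split rows across partitions by hashing the column at key_index."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_of(row[key_index])].append(row)
    return partitions

def colocated_join(left_parts, right_parts):
    """Join on the first column, one partition at a time.

    Rows with equal keys hash to the same partition, so no partition
    needs to look at another partition's data.
    """
    joined = []
    for p in range(NUM_PARTITIONS):
        right_by_key = defaultdict(list)
        for r in right_parts.get(p, []):
            right_by_key[r[0]].append(r)
        for l in left_parts.get(p, []):
            for r in right_by_key.get(l[0], []):
                joined.append(l + r[1:])
    return sorted(joined)

customers = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
orders = [(1, 40.0), (3, 15.0), (1, 60.0)]
result = colocated_join(distribute(customers, 0), distribute(orders, 0))
```

On Hadoop the data placement is decided by HDFS rather than by the database, which is one reason a port is less direct than it first appears.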
Architecture Overview
[Figure: Big SQL architecture diagram. Components shown once: Big SQL Master, Big SQL Scheduler, DDL Engine, Database Service, Hive Metastore. Components repeated three times, once per worker node: Big SQL Worker (Native I/O Engine, Java I/O Engine, temp data), HDFS Data Node with its HDFS data, HBase, MR Task Tracker, Other Service.]