Analyzing Big Data - Jeff Scheel

© 2014 IBM Corporation Open '14 Analyzing Big Data Jeff Scheel Chief Engineer Linux on Power June 2, 2014 [email protected]

Upload: kangaroot

Post on 26-Jan-2015

DESCRIPTION

Analyzing Big Data. A presentation given by Jeff Scheel, Chief Engineer for Linux on Power at IBM, at the OPEN14 conference in Belgium.

TRANSCRIPT

Page 2: Analyzing Big Data - Jeff Scheel

Agenda

1. Getting started with Big Data

2. OpenPOWER Foundation

3. The future of Analytics

Page 3: Analyzing Big Data - Jeff Scheel

Getting started with Big Data

Page 4: Analyzing Big Data - Jeff Scheel

Big Data is growing and moving fast from a variety of sources. Are you keeping up?

• 1 trillion connected devices generate 2.5 quintillion bytes of data per day

• 80% of the world’s data today is unstructured

• 1 in 2 business leaders don’t have access to data they need

Page 5: Analyzing Big Data - Jeff Scheel

“Data is the new oil.” (Clive Humby)

In its raw form, oil has little value. Once processed and refined, it helps power the world.

“Big Data has arrived at Seton Health Care Family, fortunately accompanied by an analytics tool that will help deal with the complexity of more than two million patient contacts a year…”

“At the World Economic Forum last month in Davos, Switzerland, Big Data was a marquee topic. A report by the forum, “Big Data, Big Impact,” declared data a new class of economic asset, like currency or gold.”

“Increasingly, businesses are applying analytics to social media such as Facebook and Twitter, as well as to product review websites, to try to “understand where customers are, what makes them tick and what they want”, says Deepak Advani, who heads IBM’s predictive analytics group.”

“Companies are being inundated with data—from information on customer-buying habits to supply-chain efficiency. But many managers struggle to make sense of the numbers.”

Page 6: Analyzing Big Data - Jeff Scheel

The challenge: handling the large Volume, Variety, Velocity, and Veracity of data to find new insights and improve business outcomes

Figure: IBM Big Data Platform

• Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics
• Platform services: Visualization & Discovery, Application Development, Systems Management, Accelerators
• Platform engines: Hadoop System, Stream Computing, Data Warehouse
• Foundation: Information Integration & Governance

MFG - Analyze & correlate log records to improve service and predict failures

Telco - Address customer satisfaction, Predict churn, and match promotions in real time

Healthcare - Detect life-threatening conditions at hospitals in time to intervene

Retail - Multi-channel customer sentiment and experience analysis

Financial Services - Make risk decisions based on real-time transactional data

Law Enforcement - Identify criminals and threats from video, audio feeds

Page 7: Analyzing Big Data - Jeff Scheel

Customers are deploying new infrastructure to leverage all data types

Figure: reference architecture for all data types

• Inputs: Data in Motion, Data at Rest, Data in Many Forms
• Information Ingestion and Operational Information: Stream Processing, Data Integration, Master Data, Streams
• Real-time Analytics: Video/Audio, Network/Sensor, Entity Analytics, Predictive
• Landing Area, Analytics Zone and Archive: Raw Data, Structured Data, Text Analytics, Data Mining, Entity Analytics, Machine Learning
• Exploration, Integrated Warehouse, and Mart Zones: Discovery, Deep Reflection, Operational, Predictive
• Consumers: Decision Management, BI and Predictive Analytics, Navigation and Discovery, Intelligence Analysis
• Information Governance, Security and Business Continuity

Hadoop Infrastructure – currently being deployed on commodity hardware

Page 8: Analyzing Big Data - Jeff Scheel

WATSON

Two new Watson-based products:

• Interactive Care Insights for Oncology

• The WellPoint Interactive Care Guide and Interactive Care Reviewer

IBM and Red Hat innovating in Healthcare with Watson

• Watson's oncology education:

• 600,000 pieces of medical evidence

• 2 million pages of text

• 25,000 training cases

• Watson can review 1.5 million patient records in less time than it takes most office computers to boot up

Page 9: Analyzing Big Data - Jeff Scheel

Big Data implementation patterns

Figure: four patterns, each combining Hadoop and a Warehouse beneath App / BI, Visualization, and Exploration tools:

• Common analysis of structured & unstructured data
• Warehouse and BigInsights partitioning
• Warehouse batch offload
• Separate unstructured & structured analysis

Page 10: Analyzing Big Data - Jeff Scheel

What the experts say

1. Seek project input from Sales, Marketing, and Operations teams

2. Select projects which are well-defined and have quick ROI – less than a year

3. Leverage your experiences from data warehouse and business intelligence projects

4. Avoid starting with “Big Bang”

Source: http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=POL03133USEN

Page 11: Analyzing Big Data - Jeff Scheel

More ideas for starting

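Figure: the "Separate unstructured & structured analysis" pattern. The existing BI stack (Warehouse with App / BI, Visualization, Exploration) stays in place, and a new Hadoop environment with its own App / BI, Visualization, Exploration tools is added beside it.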

Find a small problem to solve, e.g. an internal phone directory, and start “on-the-side”.

Locate relevant data and identify which pieces are “in motion” or “at rest”.

For data at rest, build open source Hadoop on your PowerLinux system or try the InfoSphere BigInsights Basic Edition (no charge).

For data in motion, use the InfoSphere Streams trial download.

Reference the IBM Information Center for details on how to import data into Hadoop and how to write applications using Streams Studio.

Explore Datameer to visualize your Hadoop-based Big Data.

Page 12: Analyzing Big Data - Jeff Scheel

PowerLinux jump start services facilitate starting with Big Data Analytics

5 Day IBM Power Analytics Services Jump Start
Includes:
• 5 days, on-site service offering
• Quick Analytics Assessment Workshop
• Software installation
• Hands-on education in getting started
• Evaluating the analytical approach for your business that will make the biggest impact
• Quick sample application to consume customer data
• Reference Architecture Workshop

Why Jump Start Services for your IBM Power Analytics solution?
• Learn how to optimally leverage IBM Power Systems for Analytics
• Learn the benefits and reasoning of Big Data
• Learn how to gain business value from the data you have

2 Day IBM Power Analytics Services Jump Start
Includes:
• 2 days, on-site Big Data Analytics service offering
• Software installation
• Hands-on education in getting started
• Evaluating the analytical approach for your business that will make the biggest impact

IBM Systems Lab Services & Training - Power Systems
Services for PowerLinux, AIX, and OS
Contact – Linda Hoben, Opportunity Manager, [email protected]

IBM Power Servers are an ideal platform for streaming data and performing analytic computations for a multitude of applications.

Let us help make you successful!

Page 13: Analyzing Big Data - Jeff Scheel

IBM POWER has a strong history in transactional processing workloads

Chart: tpcC (left axis, 0 to 160,000) and $/tpcC (right axis, $0 to $120) by system

System     tpcC       $/tpcC
S70        1,556      $109.00
S7A        2,845      $89.00
S80        5,669      $52.70
S85        9,200      $43.00
p690       12,602     $17.80
p690+      23,871     $8.31
p690++     32,046     $5.42
p5-595     50,164     $5.19
p5-595+    63,021     $2.97
P6 595     95,081     $2.81
P7 780     150,000    $0.69

Page 14: Analyzing Big Data - Jeff Scheel

POWER8 Processor

Caches
• 512 KB SRAM L2 / core
• 96 MB eDRAM shared L3
• Up to 128 MB eDRAM L4 (off-chip)

Cores
• 12 cores (SMT8)
• 8 dispatch, 10 issue, 16 exec pipe
• 2X internal data flows/queues
• Enhanced prefetching
• 64K data cache, 32K instruction cache

Accelerators
• Crypto & memory expansion
• Transactional Memory
• VMM assist
• Data Move / VM Mobility

Energy Management
• On-chip Power Management Micro-controller
• Integrated Per-core VRM
• Critical Path Monitors

Technology
• 22nm SOI, eDRAM, 15 ML, 650mm2

Memory
• Up to 230 GB/s sustained bandwidth

Bus Interfaces
• Durable open memory attach interface
• Integrated PCIe Gen3
• SMP Interconnect
• CAPI (Coherent Accelerator Processor Interface)

ComputerWorld: To make the chip faster, IBM has turned to a more advanced manufacturing process, increased the clock speed and added more cache memory, but perhaps the biggest change heralded by the Power8 cannot be found in the specifications. After years of restricting Power processors to its servers, IBM is throwing open the gates and will be licensing Power8 to third-party chip and component makers.

The Register: the Power8 is so clearly engineered for midrange and enterprise systems for running applications on a giant shared memory space, backed by lots of cores and threads. Power8 does not belong in a smartphone unless you want one the size of a shoebox that weighs 20 pounds. But it most certainly does belong in a badass server, and Power8 is by far one of the most elegant chips that Big Blue has ever created, based on the initial specs.

PCWorld: With Power8, IBM has more than doubled the sustained memory bandwidth from the Power7 and Power7+, to 230 GB/s, as well as I/O speed, to 48 GB/s. Put another way, Watson’s ability to look up and respond to information has more than doubled as well.

Microprocessor Report: Called Power8, the new chip delivers impressive numbers, doubling the performance of its already powerful predecessor, Power7+. Oracle currently leads in server-processor performance, but IBM’s new chip will crush those records. The Power8 specs are mind boggling.

Source: Hotchips presentation

Page 15: Analyzing Big Data - Jeff Scheel

POWER8 delivers 2.5x performance on Big Data / Hadoop

POWER8 reduces the number of servers by 60% based on the best x86 published Terasort result

POWER8 S822L will deliver over 2x the performance of the best published x86 system

… and continues to offer far superior RAS

POWER8 delivers 1.7X over HP on a per-core normalized benchmark.

POWER8 exploits additional cores, more threads, larger caches, memory bandwidth
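The 2.5x, 60%, and 1.7X figures on this slide appear to be related by simple arithmetic. The sketch below shows one way they fit together, under two assumptions that the slide does not state: that server count scales inversely with per-system throughput, and that the per-core figure compares the 24-core S822L with the 16-core system named in the footnote. The function names are illustrative only.

```python
import math


def server_reduction(speedup: float) -> float:
    """Fraction of servers saved if each new server does `speedup` times the work.

    Assumes the workload scales linearly across servers (an assumption, not
    something the slide states).
    """
    return 1.0 - 1.0 / speedup


def servers_needed(baseline_servers: int, speedup: float) -> int:
    """Servers required to match a baseline cluster, rounded up to whole machines."""
    return math.ceil(baseline_servers / speedup)


def per_core_speedup(system_speedup: float, cores: int, baseline_cores: int) -> float:
    """Normalize a system-level speedup by the core counts of the two systems."""
    return system_speedup * baseline_cores / cores


if __name__ == "__main__":
    print(f"server reduction at 2.5x: {server_reduction(2.5):.0%}")
    print(f"servers to replace 100:   {servers_needed(100, 2.5)}")
    print(f"per-core speedup (24 vs 16 cores): {per_core_speedup(2.5, 24, 16):.2f}")
```

Under these assumptions a 2.5x per-system gain corresponds to 1 - 1/2.5 = 60% fewer servers, and 2.5 x 16/24 is roughly 1.7 per core.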

Terasort is a popular benchmark to measure the performance of a Hadoop solution
• Sorts a large dataset (10 TB) in parallel
• Exercises the MapReduce framework and the Hadoop Distributed File System (HDFS)
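The idea behind a parallel sort of this kind can be sketched in a few lines. The example below is a single-process illustration of sample-based range partitioning: sample the keys, pick split points, route each record to a range bucket, sort each bucket independently, and concatenate. It is not Hadoop's TeraSort code, and the function names and parameters are hypothetical.

```python
import bisect
import random


def choose_split_points(records, num_partitions, sample_size=100, seed=0):
    """Pick num_partitions - 1 split keys from a random sample of the input."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    if not sample:
        return []
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def partition(records, split_points):
    """Route each record to the bucket whose key range contains it (the 'map' side)."""
    buckets = [[] for _ in range(len(split_points) + 1)]
    for record in records:
        buckets[bisect.bisect_right(split_points, record)].append(record)
    return buckets


def range_partitioned_sort(records, num_partitions=4):
    """Sort by partitioning into key ranges and sorting each range independently.

    In a cluster each bucket would be sorted by a separate worker; here the
    buckets are sorted one after another to keep the sketch self-contained.
    """
    records = list(records)
    split_points = choose_split_points(records, num_partitions)
    buckets = partition(records, split_points)
    result = []
    for bucket in buckets:  # buckets are already in key-range order
        result.extend(sorted(bucket))
    return result


if __name__ == "__main__":
    data = [random.randrange(1_000_000) for _ in range(10_000)]
    assert range_partitioned_sort(data) == sorted(data)
    print("sorted", len(data), "records across 4 partitions")
```

Because every key in one bucket is no larger than every key in the next, the sorted buckets can simply be concatenated, which is what lets the per-bucket sorts run independently.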

Chart: Relative System Performance (axis 0 to 3), POWER8 vs. Cisco; POWER8 is labeled 2.5x

IBM Analytics Stack: IBM Power System S822L; 24 cores / 192 threads, POWER8; 3.0GHz, 512 GB memory, RHEL 6.5, InfoSphere BigInsights 3.0

Compared to a 16-core HP system

http://www.cisco.com/en/US/solutions/collateral/ns340/ns517/ns224/ns944/le_tera.pdf

Page 16: Analyzing Big Data - Jeff Scheel

New IBM Power Systems based on POWER8

1 & 2 Sockets

Power Systems S812L
• 1-socket, 2U
• Linux Only

Power Systems S822L
• 2-socket, 2U
• Linux Only

Power Systems S822
• 2-socket, 2U
• All Operating Systems

Power Systems S814
• 1-socket, 4U
• All Operating Systems

Power Systems S824
• 2-socket, 4U
• All Operating Systems

Power Systems S824L
• 2-socket, 4U
• Linux Only
• SOD

Page 17: Analyzing Big Data - Jeff Scheel

OpenPOWER Foundation – The emerging ecosystem

Page 18: Analyzing Big Data - Jeff Scheel

© OpenPOWER Foundation 2014

Industry trends

• The number of companies designing & building servers is increasing

– Traditionally there have been few companies designing systems: HP, IBM, SUN, Dell, etc.

– Today there are many more: Google, Microsoft, Facebook, Rackspace, Huawei, Sugon, Inspur, etc.

– A fairly mature ecosystem including the Taiwanese ODMs is a key enabler of this trend

• Numerous disruptive forces are impacting these custom system designs and driving designers to consider new ways of innovating

– Ability to handle rapid growth in Big Data & Analytics based solutions
– Choice and Innovation
– CPU SOC integration drives need for chip development

• These trends create a need for a server targeted “chip-system-software” ecosystem

– IBM has technology and a software stack ready to meet these needs
– IBM recognizes the need to work with partners to create this ecosystem
– IBM recognizes the need for choice and options in processor sourcing

Page 19: Analyzing Big Data - Jeff Scheel

OpenPOWER Foundation Structure

OpenPOWER is an industry foundation based on the POWER architecture, enabling an Open community for development and opportunity for member differentiation and growth

Page 20: Analyzing Big Data - Jeff Scheel

Building collaboration and innovation at all levels

Welcoming new members in all areas of the ecosystem

100+ inquiries and numerous active dialogues underway

Boards/Systems

I/O, Storage, Acceleration

Chip/SOC

System/Software/Services

Page 21: Analyzing Big Data - Jeff Scheel

OpenPOWER Proposed Ecosystem Enablement

Figure: proposed OpenPOWER stack, from software down to hardware

Software
• System Operating Environment Software Stack: Cloud Software, Operating System / KVM, Standard Operating Environment (System Mgmt), xCAT
• A modern development environment is emerging based on tools and services
• Power Open Source Software Stack Components, drawing on Existing Open Source Software Communities and a New OSS Community

Firmware
• OpenPOWER Firmware

Hardware
• OpenPOWER Technology
• “Standard POWER Products” – 2014: POWER8 with CAPP and PCIe, CAPI over PCIe
• “Custom POWER SoC” – Future: Customizable, Framework to Integrate System IP on Chip, Industry IP License Model

Multiple Options to Design with POWER Technology Within OpenPOWER

Page 22: Analyzing Big Data - Jeff Scheel

Non-IBM POWER8 products

http://www.enterprisetech.com/2014/04/28/inside-google-tyan-power8-server-boards/

The Tyan reference (ATX) board, SP010, measures 12” by 9.6”
➢ one single-chip module (SCM)
➢ four DDR3 memory slots
➢ four 6 Gb/sec SATA peripheral connectors
➢ two USB 3.0 ports
➢ two Gigabit Ethernet network interfaces
➢ keyboard and video
➢ intended for developers

The Google reference board
➢ two single-chip modules (SCM)
➢ four modified SATA ports
➢ Google use only

Page 23: Analyzing Big Data - Jeff Scheel

The future of Analytics

Page 24: Analyzing Big Data - Jeff Scheel

The future of Analytics: An open approach

Open Platform for Choice

Page 25: Analyzing Big Data - Jeff Scheel

POWER8 CAPI

Figure: POWER8 (CAPP, Coherence Bus) attached to an FPGA or ASIC (PSL, Custom Hardware Application)

Customizable Hardware Application Accelerator
• Specific system SW, middleware, or user application
• Written to durable interface provided by PSL

PCIe Gen 3
• Transport for encapsulated messages

Processor Service Layer (PSL)
• Present robust, durable interfaces to applications
• Offload complexity / content from CAPP

Virtual Addressing
• Accelerator can work with same memory addresses that the processors use
• Pointers de-referenced same as the host application
• Removes OS & device driver overhead

Hardware Managed Cache Coherence
• Enables the accelerator to participate in “Locks” as a normal thread
• Lowers Latency over IO communication model

Coherent Accelerator Processor Interface (CAPI)

Page 26: Analyzing Big Data - Jeff Scheel

Coherent Accelerator Processor Interface (CAPI) Overview

Figure: POWER8 Processor (CAPP) connected over PCIe to an FPGA that holds the IBM Supplied POWER Service Layer and accelerator Functions 0, 1, 2, ... n (CAPI)

Typical I/O Model Flow:
DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration → Poll / Int Completion → Copy or Unpin Result Data → Ret. From DD Completion

Flow with a Coherent Model:
Shared Mem. Notify Accelerator → Acceleration → Shared Memory Completion
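The difference between the two flows can be illustrated with a toy software model. The sketch below is not CAPI code and calls no real device or driver API: `accelerate`, `io_model`, and `coherent_model` are hypothetical stand-ins. It only shows that the I/O-style path stages data in a separate buffer and copies results back, while the coherent path lets the "accelerator" work on the caller's own memory through a shared view.

```python
def accelerate(buf) -> None:
    """Stand-in for an accelerator function: increments every byte in place."""
    for i in range(len(buf)):
        buf[i] = (buf[i] + 1) % 256


def io_model(data: bytearray) -> list:
    """Toy model of the typical I/O flow: stage a copy, process it, copy back."""
    steps = ["device driver call"]
    device_buffer = bytearray(data)  # separate copy of the source data
    steps.append("copy or pin source data")
    steps.append("MMIO notify accelerator")
    accelerate(device_buffer)
    steps.append("acceleration")
    steps.append("poll / interrupt completion")
    data[:] = device_buffer  # copy the result back to the caller's buffer
    steps.append("copy or unpin result data")
    steps.append("return from device driver")
    return steps


def coherent_model(data: bytearray) -> list:
    """Toy model of the coherent flow: the accelerator sees the caller's memory."""
    steps = ["shared memory notify accelerator"]
    accelerate(memoryview(data))  # no copy: a writable view of the same bytes
    steps.append("acceleration")
    steps.append("shared memory completion")
    return steps


if __name__ == "__main__":
    a = bytearray([1, 2, 255])
    b = bytearray([1, 2, 255])
    print(len(io_model(a)), "steps in the I/O model")
    print(len(coherent_model(b)), "steps in the coherent model")
    assert a == b == bytearray([2, 3, 0])
```

Both paths produce the same result; the coherent path simply skips the staging copies and the driver round trip, which is the overhead the slide says CAPI removes.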

Page 27: Analyzing Big Data - Jeff Scheel

Example: Innovative “In-Memory” NoSQL/KVS Integrated Solution - via POWER8 CAPI-attached Flash

Today’s NoSQL in memory (x86)
Figure: WWW → 10Gb Uplink → Load Balancer → multiple 500GB Cache Nodes, plus a Backup Node
Infrastructure Requirements
- Large Distributed (Scale out)
- Large Memory per node
- Networking Bandwidth Needs
- Load Balancing

Differentiated NoSQL (POWER8 + CAPI Flash)
Figure: WWW → 10Gb Uplink → POWER8 Server → Flash Array w/ up to 40TB
Infrastructure Attributes
- 192 threads in 4U Server drawer
- 40 TB of memory based Flash per 4U Drawer
- Shared Memory & Cache for dynamic tuning
- Elimination of I/O and Network Overhead
- Cluster solution in a box

5X Cost Reduction with equivalent performance

Power CAPI-attached Flash model for NoSQL offers dramatic (24:1) density advantage

Page 28: Analyzing Big Data - Jeff Scheel

Wrap-up

Page 29: Analyzing Big Data - Jeff Scheel

For more information on Big Data / Analytics

● Sales kits

– PartnerWorld

– IBM internal

● Worldwide contacts

– Renato Loffreda-Mancinelli, World Wide Business Analytics and Big Data Solutions on Power - Business Dev. Leader ([email protected])

– Michael Tabron, Solution Offering Manager, Power Analytics ([email protected])

– Gina King, Solution Offering Manager, Big Data Analytics ([email protected])

– Bob Friske, Marketing Manager ([email protected])

Page 30: Analyzing Big Data - Jeff Scheel

Q & A

Summary:

1. Getting started with Big Data is the toughest part. Start simple, small, and on the side.

2. The OpenPOWER Foundation enables new systems and helps support the emerging analytic solutions around NoSQL databases.

3. POWER8 technology like CAPI will enable new solutions from IBM and the OpenPOWER Foundation.

Page 31: Analyzing Big Data - Jeff Scheel

Special notices

This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area.

Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied.

All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions.

IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal without notice.

IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies.

All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary.

IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.

Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.

Revised September 26, 2006

Page 32: Analyzing Big Data - Jeff Scheel

Backup

Page 33: Analyzing Big Data - Jeff Scheel

Where to find more information? http://openpowerfoundation.org/