Visual Mapping of Clickstream Data: Introduction and Demonstration
Cedric Carbone, Ciaran Dynes, Talend


TRANSCRIPT

Page 1: Visual Mapping of Clickstream Data

Visual Mapping of Clickstream Data: Introduction and Demonstration

Cedric Carbone, Ciaran Dynes, Talend

Page 2: Visual Mapping of Clickstream Data


Visual mapping of Clickstream data: introduction and demonstration

Ciaran Dynes, VP Products

Cedric Carbone, CTO

Page 3: Visual Mapping of Clickstream Data


Agenda

• Clickstream live demo

• Moving from hand-coding to code generation

• Performance benchmark

• Optimization of code generation

Page 4: Visual Mapping of Clickstream Data


Hortonworks Clickstream demo

http://hortonworks.com/hadoop-tutorial/how-to-visualize-website-clickstream-data/

Page 5: Visual Mapping of Clickstream Data


Trying to get from this…

Page 6: Visual Mapping of Clickstream Data


Big Data – “pure Hadoop”: visual design in Map/Reduce, optimized before deploying on Hadoop

…to this…

Page 7: Visual Mapping of Clickstream Data


Demo overview

• Demo flow overview:

1. Load raw Omniture web log files to HDFS (a load sketch follows this list)

• Can discuss the ‘schema on read’ principle: it allows any data type to be easily loaded into a ‘data lake’ and then made available for analytical processing

• http://ibmdatamag.com/2013/05/why-is-schema-on-read-so-useful/

2. Define a Map/Reduce process to transform the data

• The same skills as with any graphical ETL tool

• Lookup customer and product data to enrich the results

• Results written back to HDFS

3. Federate the results to a visualisation tool of your choice

• Excel

• Analytics tools such as Tableau, QlikView, etc.

• Google Charts
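Step 1 is plain file movement: the raw logs land in HDFS untouched, and a schema is only applied later, at read time. A minimal sketch of that load in Java using the standard Hadoop FileSystem API (the NameNode URI and paths are hypothetical placeholders, not from the demo):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Loads raw Omniture web logs into HDFS unchanged ("schema on read":
 *  no schema is declared at load time). */
public class LoadWebLogs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode URI for a sandbox cluster (an assumption; adjust to yours).
        FileSystem fs = FileSystem.get(URI.create("hdfs://sandbox:8020"), conf);

        // Copy the raw log file as-is; no parsing or schema applied yet.
        fs.copyFromLocalFile(new Path("/tmp/Omniture.0.tsv"),
                             new Path("/user/demo/weblogs/Omniture.0.tsv"));
        fs.close();
    }
}
```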

Page 8: Visual Mapping of Clickstream Data


Big Data Clickstream Analysis

[Diagram: Talend loads web logs into HDFS; Talend Big Data (integration) runs Map/Reduce and Hive on Hadoop; Talend federates the results to the Clickstream Dashboard for analytics.]

Page 9: Visual Mapping of Clickstream Data


Native Map/Reduce Jobs

• Create classic ETL patterns using native Map/Reduce
- The only data management solution on the market that generates native Map/Reduce code

• No need for expensive big data coding skills

• Zero pre-installation on the Hadoop cluster

• Hadoop is the “engine” for data processing #dataos
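To make “generates native Map/Reduce code” concrete, here is a hand-written sketch of the kind of job such a generator emits: a mapper/reducer pair counting hits per URL in the web logs. The class names and the assumed column layout are illustrative, not Talend’s actual generated code:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageHits {
    /** Extracts the requested URL (a tab-separated column) from each log line. */
    public static class HitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = line.toString().split("\t");
            if (cols.length > 12) {                  // column 12 = URL (assumed layout)
                ctx.write(new Text(cols[12]), ONE);
            }
        }
    }

    /** Sums the hit counts per URL. */
    public static class HitReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text url, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(url, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page hits");
        job.setJarByClass(PageHits.class);
        job.setMapperClass(HitMapper.class);
        job.setCombinerClass(HitReducer.class);   // safe: sum is associative
        job.setReducerClass(HitReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/weblogs"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/hits"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```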

Page 10: Visual Mapping of Clickstream Data


SHOW ME

Page 11: Visual Mapping of Clickstream Data


PERFORMANCE OF CODE GENERATION

Page 12: Visual Mapping of Clickstream Data


MapReduce 2.0, YARN, Storm, Spark

• YARN: ensures predictable performance & QoS for all apps

• Enables apps to run “IN” Hadoop rather than “ON”

• In Labs: Streaming with Apache Storm

• In Labs: mini-Batch and In-Memory with Apache Spark

Applications run natively IN Hadoop

[Diagram: YARN (cluster resource management) on HDFS2 (redundant, reliable storage), hosting BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, Spark), GRAPH (Giraph), NoSQL (MongoDB), EVENTS (Falcon), ONLINE (HBase), and OTHER (Search).]

Source: Hortonworks

Page 13: Visual Mapping of Clickstream Data


Talend: Tap – Transform – Deliver

[Diagram: the same Hadoop/YARN stack as above, with Talend layered across it. TAP (ingestion): Sqoop, Flume, HDFS API, HBase API, Hive, 800+ connectors. TRANSFORM (data refinement): profile, parse, map, CDC, cleanse, standardize, machine learning, match. DELIVER (as an API): ActiveMQ, Karaf, Camel, CXF, Kafka, Storm; plus metadata, security, MDM, iPaaS, governance, and HA.]

Page 14: Visual Mapping of Clickstream Data


• Context: 9-node cluster, replication factor 3
- Dell R210-II, 1× Xeon® E3-1230 v2, 4 cores, 16 GB RAM per node
- Map slots: 2 per node
- Reduce slots: 2 per node

• Total processing capacity:
- 9 × 2 map slots: 18 maps
- 9 × 2 reduce slots: 18 reduces

• Data volume: 1, 10, and 100 GB

Talend Labs Benchmark Environment

Page 15: Visual Mapping of Clickstream Data


• The Apache Pig and Hive communities use TPC-H benchmarks:
- https://issues.apache.org/jira/browse/PIG-2397
- https://issues.apache.org/jira/browse/HIVE-600

• We are currently running the same tests in our labs:
- Hand-coded Pig scripts vs. Talend-generated Pig code
- Hand-coded Pig scripts vs. Talend-generated Map/Reduce code
- Community-produced Hive QL vs. Hive ELT capabilities

• Partial results are already available for Pig: very good results

TPCH Benchmark

Page 16: Visual Mapping of Clickstream Data


Optimizing job configuration?

• By default, Talend follows Hadoop recommendations regarding the number of reducers usable for the job execution.

• The rule is that 99% of the total available reducers can be used
- http://wiki.apache.org/hadoop/HowManyMapsAndReduces
- For the Talend benchmark, the default max reducers are:

• 3 nodes: 5 (3 × 2 = 6; 6 × 99% ≈ 5)

• 6 nodes: 11 (6 × 2 = 12; 12 × 99% ≈ 11)

• 9 nodes: 17 (9 × 2 = 18; 18 × 99% ≈ 17)

- For another customer benchmark, the default max reducers:

• 700 slots × 99% = 693 reducers (assuming half Dell and half HP servers)
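A minimal sketch of that sizing rule applied to a job (the slot counts are the benchmark’s; Job.setNumReduceTasks is the standard Hadoop API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerSizing {
    public static void main(String[] args) throws Exception {
        int nodes = 9;                 // benchmark cluster size
        int reduceSlotsPerNode = 2;    // 2 reduce slots per node

        // Hadoop's rule of thumb: use ~99% of the available reduce slots,
        // so everything still fits in a single reduce wave.
        int reducers = (int) (nodes * reduceSlotsPerNode * 0.99);  // 18 * 0.99 -> 17

        Job job = Job.getInstance(new Configuration(), "sized job");
        job.setNumReduceTasks(reducers);
        System.out.println("Reducers: " + reducers);
    }
}
```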


Page 17: Visual Mapping of Clickstream Data


TPCH Results : Pig Hand Coded vs Pig generated


• 19 tests with results similar to or better than the hand-coded Pig scripts

Page 18: Visual Mapping of Clickstream Data


TPCH Results : Pig Hand Coded vs Pig generated


• 19 tests with results similar to or better than the hand-coded Pig scripts

• The generated code is already optimized; optimizations are applied automatically

[Chart callout: Talend code is faster]

Page 19: Visual Mapping of Clickstream Data


PERFORMANCE IMPROVEMENTS

Page 20: Visual Mapping of Clickstream Data


TPCH Results : Pig Hand Coded vs Pig generated


• 19 tests with results similar to or better than the hand-coded Pig scripts

• 3 tests will benefit from a new COGROUP feature

[Chart callout: requires COGROUP]

Page 21: Visual Mapping of Clickstream Data


Example: How Sort works for Hadoop

Talend has implemented the TeraSort Algorithm for Hadoop

1. A first Map/Reduce job is generated to analyze the data ranges
- Each mapper reads its data and analyzes its bucket’s critical values
- The reducer produces quartile files covering all the data to sort

2. A second Map/Reduce job is started (see the partitioner sketch below)
- Each mapper simply sends the key to sort to the reducers
- A custom partitioner routes each record to the best bucket, based on the quartile file created previously
- Each reducer outputs its bucket of data in sorted order

• Research: tSort : GraySort, MinuteSort
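The heart of the second job is the range partitioner. A simplified sketch of the idea follows; Hadoop’s real TeraSort builds a trie over sampled cut points, while this version does a plain binary search over hard-coded, hypothetical boundaries:

```java
import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/** Routes each key to the reducer whose range (from the sampled
 *  quartile/cut-point file) contains it, so reducer i's output sorts
 *  entirely before reducer i+1's. Cut points are hard-coded here for
 *  illustration; a real job would load them from HDFS. */
public class RangePartitioner extends Partitioner<Text, Text> {
    // numPartitions - 1 boundaries, sorted ascending (hypothetical values).
    private static final String[] CUT_POINTS = {"g", "n", "u"};

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int idx = Arrays.binarySearch(CUT_POINTS, key.toString());
        // binarySearch returns (-insertionPoint - 1) when the key is absent.
        int bucket = idx >= 0 ? idx + 1 : -idx - 1;
        return Math.min(bucket, numPartitions - 1);
    }
}
```

It would be registered on the second job with job.setPartitionerClass(RangePartitioner.class).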


Page 22: Visual Mapping of Clickstream Data


How to Get the Sandbox

• Videos on the Jumpstart
- How to launch: http://youtu.be/J3Ppr9Cs9wA
- Clickstream video: http://youtu.be/OBYYFLmdCXg

• To get the Sandbox: http://www.talend.com/contact

Page 23: Visual Mapping of Clickstream Data


Step-by-Step Directions

• Completely Self-contained Demo VM Sandbox

• Key Scenarios like Clickstream Analysis

Page 24: Visual Mapping of Clickstream Data


Come try the Sandbox

Hortonworks Dev Café & Talend


Page 25: Visual Mapping of Clickstream Data


Talend Platform for Big Data v5.4

[Diagram: the Talend Unified Platform (Studio, Repository, Deployment, Execution, Monitoring) sits on a runtime platform (Java, Hadoop, SQL, etc.) and provides: Data Integration (data access, ETL/ELT, version control, business rules, change data capture, scheduler, parallel processing, high availability); Big Data Quality (Hive data profiling, drill-down to values, DQ portal and monitoring, data stewardship, report design, address validation, custom analysis, M/R parsing and matching); and Big Data (Hadoop 2.0 MapReduce ETL/ELT, HCatalog metadata, Pig, Sqoop, Hive, Hadoop job scheduler, Google BigQuery, NoSQL support, HDFS).]

Page 26: Visual Mapping of Clickstream Data
Page 27: Visual Mapping of Clickstream Data

NonStop HBase – Making HBase Continuously Available for Enterprise Deployment

Dr. Konstantin Boudnik, WANdisco

Page 28: Visual Mapping of Clickstream Data

Non-Stop HBase

Making HBase Continuously Available for Enterprise Deployment
Konstantin Boudnik – Director, Advanced Technologies, WANdisco

Brett Rudenstein – Senior Product Manager, WANdisco

Page 29: Visual Mapping of Clickstream Data

WANdisco: the continuous availability company. WANdisco := Wide Area Network Distributed Computing

We solve availability problems for enterprises. If you can’t afford less than 99.999% uptime, we’ll help

Publicly traded on the London Stock Exchange since mid-2012 (LSE:WAND)

Apache Software Foundation sponsor; actively contributing to Hadoop, SVN, and others

US patented active-active replication technology

Located on three continents

Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability, and availability

Subversion, Git, Hadoop HDFS, HBase at 200+ customer sites

Page 30: Visual Mapping of Clickstream Data

What are we solving?

Page 31: Visual Mapping of Clickstream Data

Traditionally everybody relies on backups

Page 32: Visual Mapping of Clickstream Data

HA is (mostly) a glorified backup

Redundancy of critical elements:
- Standby servers

- Backup network links

- Off-site copies of critical data

- RAID mirroring

Baseline:
- Create and synchronize replicas
- Clients switch over in case of failure
- Extra hardware lying idle, spinning “just in case”

Page 33: Visual Mapping of Clickstream Data

A Typical Architecture (HDFS HA)

Page 34: Visual Mapping of Clickstream Data

Backups can fail

Page 35: Visual Mapping of Clickstream Data

WANdisco Active-Active Architecture


100% uptime with WANdisco’s patented replication technology:
- Zero downtime / zero data loss

- Enables maintenance without downtime

Automatic recovery of failed servers; Automatic rebalancing as workload increases


Page 36: Visual Mapping of Clickstream Data

Multi-threaded server software: multiple threads processing client requests in a loop

[Diagram: a server process with several threads, each looping: get client request (e.g., an HBase put), acquire lock, make the change to state (db), release lock, send the return value to the client.]
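A toy version of that loop in plain Java (not HBase’s actual server code): each worker thread serializes its state change through a single lock, which is the per-server bottleneck the following slides replicate around:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Minimal multi-threaded server loop: threads take requests from a
 *  queue, mutate shared state under a lock, and "reply". */
public class ThreadedServer {
    private final Map<String, String> state = new HashMap<>();   // the "db"
    private final BlockingQueue<String[]> requests = new ArrayBlockingQueue<>(1024);

    void workerLoop() {
        try {
            while (true) {
                String[] req = requests.take();      // get client request, e.g. a put
                synchronized (state) {               // acquire lock
                    state.put(req[0], req[1]);       // make change to state (db)
                }                                    // release lock
                System.out.println("ack " + req[0]); // send return value to client
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadedServer server = new ThreadedServer();
        // Three worker threads run forever; this is only a sketch.
        for (int i = 0; i < 3; i++) new Thread(server::workerLoop).start();
        server.requests.put(new String[] {"row1", "value1"});    // a fake client request
    }
}
```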

Page 37: Visual Mapping of Clickstream Data

Ways to achieve single server redundancy

Page 38: Visual Mapping of Clickstream Data

Using a TCP Connection to send data to three replicated servers (Load Balancer)

[Diagram: a client’s operation stream passes through a load balancer to three replicated server processes (server1, server2, server3).]

Page 39: Visual Mapping of Clickstream Data

HBase WAL replication

State Machine (HRegion contents, HMaster metadata, etc.) is modified first

Modification Log (HBase WAL) is sent to a Highly Available shared storage

Standby server(s) read the edit log and serve as warm standbys, ready to take over should the active server fail
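The underlying pattern is log shipping and replay. A toy write-ahead log in Java showing the generic pattern only (these are not HBase’s actual WAL classes):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Toy write-ahead log: every mutation is appended to a log file
 *  (a stand-in for highly available shared storage), and a standby
 *  can rebuild or warm its state by replaying that log. */
public class ToyWal {
    private final Map<String, String> state = new HashMap<>();
    private final FileWriter log;

    ToyWal(String logPath) throws IOException { log = new FileWriter(logPath, true); }

    void put(String key, String value) throws IOException {
        log.write(key + "\t" + value + "\n");   // durably record the edit...
        log.flush();
        state.put(key, value);                  // ...alongside the state machine change
    }

    /** What a warm standby does: replay (or tail) the log to stay current. */
    static Map<String, String> replay(String logPath) throws IOException {
        Map<String, String> rebuilt = new HashMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader(logPath))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] kv = line.split("\t", 2);
                rebuilt.put(kv[0], kv[1]);
            }
        }
        return rebuilt;
    }
}
```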

Page 40: Visual Mapping of Clickstream Data

HBase WAL replication

[Diagram: a single active server (server1) writes WAL entries to shared storage; a standby server (server2) reads them.]

Page 41: Visual Mapping of Clickstream Data

HBase WAL tailing, WAL Snapshots etc.

Only one active region server is possible

Failover takes time

Failover is error prone

RegionServer failover isn’t seamless for clients

Page 42: Visual Mapping of Clickstream Data

Implementing multiple active masters with Paxos coordination (not about leader election)

Page 43: Visual Mapping of Clickstream Data

Three replicated servers

[Diagram: three replicated servers, each a server process plus a Distributed Coordination Engine (Paxos / DConE); clients connect to any server, and the coordination engines agree on a single order of operations across all three.]
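A toy sketch of the coordinated-replication idea. This is not WANdisco’s DConE API (which is proprietary); the names here are invented for illustration. Each operation is proposed, receives a global sequence number by agreement, and every replica applies operations in that single agreed order:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Stand-in for a Paxos coordination engine: assigns each proposed
 *  operation a global sequence number. A real engine reaches this
 *  agreement across machines; here one atomic counter fakes it. */
class ToyCoordinationEngine {
    private final AtomicLong sequence = new AtomicLong();
    long propose(String op) { return sequence.getAndIncrement(); }
}

/** Each replica applies operations strictly in agreed order, so all
 *  replicas converge on identical state with no single active master. */
class Replica {
    private final Map<Long, String> pending = new HashMap<>();
    private long nextToApply = 0;
    private final StringBuilder state = new StringBuilder();

    synchronized void deliver(long seq, String op) {
        pending.put(seq, op);
        while (pending.containsKey(nextToApply)) {       // apply in order, no gaps
            state.append(pending.remove(nextToApply++)).append(';');
        }
    }
    synchronized String state() { return state.toString(); }
}

public class CoordinationDemo {
    public static void main(String[] args) {
        ToyCoordinationEngine engine = new ToyCoordinationEngine();
        Replica r1 = new Replica(), r2 = new Replica();
        long s1 = engine.propose("put x=1");
        long s2 = engine.propose("put y=2");
        r1.deliver(s2, "put y=2"); r1.deliver(s1, "put x=1");  // out-of-order arrival
        r2.deliver(s1, "put x=1"); r2.deliver(s2, "put y=2");
        System.out.println(r1.state().equals(r2.state()));     // true: identical state
    }
}
```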

Page 44: Visual Mapping of Clickstream Data

HBase Continuous Availability (multiple active masters)

Page 45: Visual Mapping of Clickstream Data

HBase Single Points of Failure

Single HBase Master
- Service interruption after a Master failure

HBase client
- The client session doesn’t fail over after a RegionServer failure

HBase RegionServer downtime
- 30 secs ≤ MTTR ≤ 200 secs

Region major compaction (not a failure, but…)
- (Un)scheduled downtime of a region for compaction

Page 46: Visual Mapping of Clickstream Data

HBase Region Server & Master Replication

Page 47: Visual Mapping of Clickstream Data
Page 48: Visual Mapping of Clickstream Data
Page 49: Visual Mapping of Clickstream Data
Page 50: Visual Mapping of Clickstream Data
Page 51: Visual Mapping of Clickstream Data

NonStopRegionServer:

[Diagram: an HBase client calls two NonStopRegionServers; each wraps an HRegionServer behind a client service (e.g., multi) and a DConE coordination engine.]

1. The client calls HRegionServer multi
2. NonStopRegionServer intercepts the call
3. NonStopRegionServer makes a Paxos proposal using the DConE library
4. The proposal comes back as an agreement on all NonStopRegionServers
5. Each NonStopRegionServer calls super.multi; state changes are recorded on all nodes
6. NonStopRegionServer 1 alone sends the response back to the client

HMaster is similar
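A heavily simplified sketch of the interception pattern that flow describes; the classes and method signatures are hypothetical stand-ins, not HBase’s or WANdisco’s real ones:

```java
/** Hypothetical stand-ins for the real classes on the slide. */
class RegionServerBase {
    void multi(String[] batch) { /* apply the batch of mutations locally */ }
}

interface CoordinationEngine {
    /** Blocks until the proposal is agreed on all replicas, in order. */
    void proposeAndAwait(String[] batch);
}

class NonStopRegionServer extends RegionServerBase {
    private final CoordinationEngine dcone;
    private final boolean handlesThisClient;   // only one replica replies

    NonStopRegionServer(CoordinationEngine dcone, boolean handlesThisClient) {
        this.dcone = dcone;
        this.handlesThisClient = handlesThisClient;
    }

    @Override
    void multi(String[] batch) {
        dcone.proposeAndAwait(batch);   // steps 2-4: intercept, propose, await agreement
        super.multi(batch);             // step 5: every replica records the state change
        if (handlesThisClient) {
            reply("OK");                // step 6: only the server the client called replies
        }
    }

    private void reply(String msg) { System.out.println(msg); }
}
```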

Page 52: Visual Mapping of Clickstream Data

HBase RegionServer replication using WANdisco DConE

Shared-nothing architecture:
- HFiles, WALs, etc. are not shared
- The replica count is tunable
- Snapshots of HFiles do not need to be created
- The messy details of WAL tailing are unnecessary: the WAL might not be needed at all (!)
- Not an eventual-consistency model
- Does not serve up stale data

Page 53: Visual Mapping of Clickstream Data
Page 54: Visual Mapping of Clickstream Data

DEMO

Page 55: Visual Mapping of Clickstream Data

Page 56: Visual Mapping of Clickstream Data

Page 57: Visual Mapping of Clickstream Data

Page 58: Visual Mapping of Clickstream Data

Q & A

Page 59: Visual Mapping of Clickstream Data

Thank you

Konstantin Boudnik
[email protected]
@c0sin

Page 60: Visual Mapping of Clickstream Data