clickstream analysis with spark—understanding visitors in realtime by josef adersberger

32
CLICKSTREAM ANALYSIS WITH SPARK – UNDERSTANDING VISITORS IN REAL-TIME Dr. Josef Adersberger QAware GmbH, Germany

Upload: spark-summit

Post on 16-Apr-2017

2.294 views

Category:

Data & Analytics


0 download

TRANSCRIPT

CLICKSTREAM ANALYSIS WITH SPARK – UNDERSTANDING VISITORS IN REAL-TIME

Dr. Josef AdersbergerQAware GmbH, Germany

THE CHALLENGE

One Kettle to Rule ‘em All

Web Tracking Ad Tracking

ERP CRM

One Kettle to Rule ‘em All

Retention Reach

Monetarization

steer …§ Campaigns§ Offers§ Contents

THE CONCEPTS

by Randy Paulino

The First Sketch

(= real-time)

Tableau

User Journey Analysis

C V VT VT VT C X

C V

V V V V V V V

C V V C V V V

VT VT V V V VT C

V X

Event stream: User journeys:

Web / Ad tracking

KPIs:§ Unique users§ Conversions§ Ad costs / conversion value§ …

V

X

VT

C Click

View

View Time

Conversion

THE ARCHITECTURE

„Larry & Friends“ Architecture

Collector Aggregation SQL DB

Runs not well for morethan 1 TB data in terms ofingestion speed, query timeand optimization efforts

by adweek.com

Nope.Sorry, no Big Data.

„Hadoop & Friends“ Architecture

Collector Batch Processor[Hadoop]Collector Event Data Lake Batch Processor Analytics DB

JSON Stream

Aggregation takes too long

Cumbersomeprogramming model(can be solved withpig, cascading et al.)

Not interactiveenough

Nope.Too sluggish.

κ-Architecture

Collector Stream ProcessorAnalytics DB

Persistence

JSON Stream

Cumbersomeprogramming model

Over-engineered: We only need15min real-time ;-)

Stateful aggregations (unique x, conversions) require a separate DB with high throughput and fast aggregations & lookups.

λ-Architecture

Collector

Event Processor

Event Data Lake Batch Processor

Analytics DBSpeed Layer

Batch Layer

JSON Stream

Cumbersomeprogramming model Complex

architecture

Redundant logic

Feels Over-Engineered…

http://www.brainlazy.com/article/random-nonsense/over-engineered

The Final Architecture**) Maybe called μ-architecture one day ;-)

Functional Architecture

Strange Events

IngestionRaw Event Stream

Collection Events Processing Analytics Warehouse

FactEntries

Atomic Event Frames

Data Lake

Master Data Integration

§ Buffers load peeks§ Ensures message

delivery (fire & forgetfor client)

§ Create user journeys andunique user sets

§ Enrich dimensions§ Aggregate events to KPIs§ Ability to replay for schema

evolution

§ The representation of truth§ Multidimensional data

model§ Interactive queries for

actions in realtime anddata exploration

§ Eternal memory for all events (even strangeones)

§ One schema per eventtype. Time partitioned.

class Analytics Model

´ factªWebFact

´ dimensionªZeit

´ dimensionªKampagne

Jahr

Quartal

Monat

Woche

Tag

Stunde

Minute

Kunde

+ Land: String

Partner

´ dimensionªTracking

Tracking Group

SensorTag

+ Typ: SensorTagType

Platzierung

+ Format: ImageSize+ Kostenmodell: KostenmodellArt

Werbemittel

+ AdGroup: String+ Format: ImageSize+ Grˆ fle: KiloBytes+ LandingPage: URL+ Motif: URL

Kampagne

´ dimensionªClient

KategorieDev ice

+ Bezeichner: String+ Hersteller: String+ Typ: String

Brow ser

+ Typ: String+ Version: int

´ dimensionªAusspielort

LandRegion

Stadt

´ dimensionªKanal

Kanal

´ dimensionªVermarktung

´ enumerationªSensorTagType

ORDER_TAG MASTER_TAG CUSTOM_TAG

Betriebssystem

+ Typ: String+ Version: Version

? Dimension :Unabh‰ng ig esPr‰dikataufMetrikenbeiderAnalyse("kannisoliertdar¸ bernachdenken/isoliertdazuAnalysenfahren")

? H ierarchie:Sub-Pr‰dikataufMetriken.Erzeug tmehralseine(zueinanderdiskunkte)TeilmengenderMetriken.Entsprichtdeng‰ng ig enDrill-Down-PfadenindenReports bzw.denBatch-Ag g reg ate-Up-PfadeninderAg g reg ationslog ik. SemantischeUnterstrukturen:"istTeilvon& kannnichtexistierenohne".

? Asssoz iation :Nichtverwendet.SeparatesStammdatenmodell. ? Attribut:Ermˆ g lichteineweitere(querschneidende)Einschr‰nkung derMetrikmengeerg‰nzendzudenHierarchien.

Domain

Website

Tracking Site

Vermarkter

Auslieferungs-Domain

Referral

´ enumerati...KostenmodellArt

CPC CPM CPO CPA

´ abstractªDimensionValue

+ id: int+ name: String+ sourceId: String

WebsiteFact

+ Bounces: int+ Verweildauer: float+ Visits: int

BasicAdFact

+ Clicks: int+ Sichtbare Views: int+ Validierte Clicks: int+ View (angefragt): int+ View (ausgeliefert): int+ View (gemessen): int

´ dimensionªProdukt

Shop

Produkt

+ Produktkategorie: String

´ dimensionªZeitfenster

Letzte X Tage

´ dimensionªUser

User Segment

´ dimensionªOrder

OrderStatus

+ Status: OrderStatus

´ enumerationªOrderStatus

IN_BEARBEITUNG ERFOLGREICH (AKTIVIERT) ABGELEHNT NICHT_IN_BEARBEITUNG

UniquesFact

+ Unique Clicks: int+ Unique Users: int+ Unique Views: int

AdCostFact

+ CPC: int+ Kosten: float

Conv ersionFact

+ PC: int+ PR: int+ PV: int+ Umsatz PC: float+ Umsatz PR: float+ Umsatz PV: float

AdVisibilityFact

+ Sichtbarkeitsdauer: float

Activ atedOrderFact

+ Orders: int+ Umsatz: float

TrackingFact

+ Orders: int+ Page Impressions: int+ Umsatz: float

X= {7,14,28,30}

§ Fault tolerant message handling§ Event handling: Apply schema, time-partitioning, De-dup, sanity

checks, pre-aggregation, filtering, fraud detection§ Tolerates delayed events§ High throughput, moderate latency (~ 1min)

Series Connection of Streaming and Batching - all based on Spark.

IngestionRaw Event Stream

Collection Event Data Lake Processing Analytics Warehouse

FactEntries

SQL InterfaceAtomic Event

Frames

§ Cool programming model§ Uniform dev&ops

§ Simple solution§ High compression ratio due to

column-oriented storage§ High scan speed

§ Cool programming model§ Uniform dev&ops§ High performance§ Interface to R out-of-the-box§ Useful libs: MLlib, GraphX, NLP, …

§ Good connectivity (JDBC, ODBC, …)

§ Interactive queries§ Uniform ops§ Can easily be replaced

due to Hive Metastore

§ Obvious choice forcloud-scale messaging

§ Way the best throughputand scalability of all evaluated alternatives

LESSONS LEARNED

by: http://hochmeister-alpin.at

Technology Mapping

https://github.com/qaware/big-data-landscape

User Interface

Data Lake

Data Warehouse

Ingestion

Processing

Data Science Interactive Analysis Reporting & Dashboards

Data Sources

Analytics

Micro Analytics Services instead of reportingservers.

Charting Libraries:Dashboards:

Analytics FrontendsAlgorithm Libraries

Structured Data Lake: The eternal memory.

Efficient data serialization formats:§ Integated compression§ Column-oriented storage§ Predicate pushdown

Distributed Filesystem or NoSQL DB

Data Workflows ETL Jobs Massive Parallelization

Pig Open Studio

Data Logicstics Stream Processing

NewSQL: SQL meets NoSQL.Polyglott Persistence

Index Machines: Fast aggregation and search.

In-Memory Databases: Fast access.

Time Series Databases

Atlas

Polyglott AnalyticsData Lake Analytics

Warehouse

SQL lane

Rlane

Timeserieslane

Reporting Data ExplorationData Science

Micro Analytics Services

Microservice

Dashboard

Microservice …

No Retention ParanoiaData Lake

AnalyticsWarehouse

§ Eternal memory§ Close to raw events§ Allows replays and refills

into warehouse

Aggressive forgetting with clearly definedretention policy per aggregation level like:§ 15min:30d§ 1h:4m§ …

Events

Strange Events

Continuous Tuning

IngestionRaw Event Stream

Collection Event Data Lake Processing Analytics Warehouse

FactEntries

SQL InterfaceAtomic Event

Frames

Load Generator Throughput & latency probes

In NumbersOverall dev effort until the first release: 250 person days

Dimensions: 10 KPIs: 26

Integrated 3rd party systems: 7Inbound data volume per day: 80GB

New data in DWH per day: 2GB

Total price of cheapest cluster which is able to handle production load:

THANK [email protected]@qaware.de

Download slides

Bonus Topic: Roadmap

SparkKafka HDFS

IaaS

Simplify Ops with Mesos

Faster aggregation & easier updateswith Spark-on-Solr

http://qaware.blogspot.de/2015/06/solr-with-sparks-or-how-to-submit-spark.html

Bonus Topic: Smart Aggregation

Ingestion Event Data Lake Processing Analytics Warehouse

FactEntries

AnalyticsAtomic Event

Frames

1 2

3

Architecture follows requirementsclass Analytics Model

´ factªWebFact

´ dimensionªZeit

´ dimensionªKampagne

Jahr

Quartal

Monat

Woche

Tag

Stunde

Minute

Kunde

+ Land: String

Partner

´ dimensionªTracking

Tracking Group

SensorTag

+ Typ: SensorTagType

Platzierung

+ Format: ImageSize+ Kostenmodell: KostenmodellArt

Werbemittel

+ AdGroup: String+ Format: ImageSize+ Grˆ fle: KiloBytes+ LandingPage: URL+ Motif: URL

Kampagne

´ dimensionªClient

KategorieDev ice

+ Bezeichner: String+ Hersteller: String+ Typ: String

Brow ser

+ Typ: String+ Version: int

´ dimensionªAusspielort

LandRegion

Stadt

´ dimensionªKanal

Kanal

´ dimensionªVermarktung

´ enumerationªSensorTagType

ORDER_TAG MASTER_TAG CUSTOM_TAG

Betriebssystem

+ Typ: String+ Version: Version

? Dimension :Unabh‰ng ig esPr‰dikataufMetrikenbeiderAnalyse("kannisoliertdar¸ bernachdenken/isoliertdazuAnalysenfahren")

? H ierarchie:Sub-Pr‰dikataufMetriken.Erzeug tmehralseine(zueinanderdiskunkte)TeilmengenderMetriken.Entsprichtdeng‰ng ig enDrill-Down-PfadenindenReports bzw.denBatch-Ag g reg ate-Up-PfadeninderAg g reg ationslog ik. SemantischeUnterstrukturen:"istTeilvon& kannnichtexistierenohne".

? Asssoz iation :Nichtverwendet.SeparatesStammdatenmodell. ? Attribut:Ermˆ g lichteineweitere(querschneidende)Einschr‰nkung derMetrikmengeerg‰nzendzudenHierarchien.

Domain

Website

Tracking Site

Vermarkter

Auslieferungs-Domain

Referral

´ enumerati...KostenmodellArt

CPC CPM CPO CPA

´ abstractªDimensionValue

+ id: int+ name: String+ sourceId: String

WebsiteFact

+ Bounces: int+ Verweildauer: float+ Visits: int

BasicAdFact

+ Clicks: int+ Sichtbare Views: int+ Validierte Clicks: int+ View (angefragt): int+ View (ausgeliefert): int+ View (gemessen): int

´ dimensionªProdukt

Shop

Produkt

+ Produktkategorie: String

´ dimensionªZeitfenster

Letzte X Tage

´ dimensionªUser

User Segment

´ dimensionªOrder

OrderStatus

+ Status: OrderStatus

´ enumerationªOrderStatus

IN_BEARBEITUNG ERFOLGREICH (AKTIVIERT) ABGELEHNT NICHT_IN_BEARBEITUNG

UniquesFact

+ Unique Clicks: int+ Unique Users: int+ Unique Views: int

AdCostFact

+ CPC: int+ Kosten: float

Conv ersionFact

+ PC: int+ PR: int+ PV: int+ Umsatz PC: float+ Umsatz PR: float+ Umsatz PV: float

AdVisibilityFact

+ Sichtbarkeitsdauer: float

Activ atedOrderFact

+ Orders: int+ Umsatz: float

TrackingFact

+ Orders: int+ Page Impressions: int+ Umsatz: float

X= {7,14,28,30}

act Processing

Proc

essi

ng

Web

site

Ord

er A

ctiv

atio

nAd

Cos

tCo

nver

sion

Uniq

ues

+O

verla

pAd

Vis

ibili

tyBa

sic

Ads

Trac

king

Dimensionsv erzeichnis aktualisieren

Page Impressions und Orders

zählenAggregation

TrackingFact

Ad Impressions

zählenAggregation

BasicAdFactCTR und

Sichtbarkeitsrate berechnen

Inferenz der Sichtbarkeitsdauer

Aggregation

AdVisibilityFact

Dimensionsraum aufspannen

Pro Vektor die Ev ents und die dort enthaltenen UserIds

und Interaktionsarten ermitteln

Menge der UserIds pro Interaktionsart erzeugen und deren

Mächtigkeit bestimmen

UniquesFact

User Journeys erstellen (bzw. LV/LC pro User)

Attribution v on Conv ersion auf Basis v orkonfiguriertem

Conv ersion Modell

Conv ersions zählen (PV, PC, PR)

Warenkorbwert ermitteln bei Order-Conv ersion

(PV, PC, PR)Aggregation

Conv ersionFact

Kosten pro Ev ent

ermitteln Aggregation

Nicht zuweisbare Kosten ermitteln

CostFact

Inferenz der Visits

Visits zählen

Bounces zählen

Analyse der Verweildauer

AggregationWebsiteFact

Order Status ermitteln

Anzahl der nicht getrackten Orders ermitteln

Aggregation Activ atedOrderFact

:Analytics Warehouse:Event Data Lake

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

«flow»

Data Processing WorkflowMultidimensional Data Model

Sample resultsGeolocated and gender-specific conversions.

Frequency of visits

Performance of an ad campaign