(BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS
TRANSCRIPT
[Slide 1]
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Roy Ben-Alta, Business Development Manager, AWS
Rick McFarland, VP of Data Services, Hearst
October 2015
BDT306
The Life of a Click
How Hearst Publishing Manages
Clickstream Analytics with AWS
[Slide 2]
What to Expect from the Session
• Common patterns for clickstream analytics
• Tips on using Amazon Kinesis and Amazon EMR for clickstream processing
• Hearst's big data journey in building the Hearst analytics stack for clickstream
• Lessons learned
• Q&A
[Slide 3]
Clickstream Analytics = Business Value

| Verticals/Use Cases | Accelerated Ingest-Transform-Load to final destination | Continual Metrics/KPI Extraction | Actionable Insights |
|---|---|---|---|
| Ad Tech / Marketing Analytics | Advertising data aggregation | Advertising metrics like coverage, yield, conversion, scoring webpages | User activity engagement analytics, optimized bid/buy engines |
| Consumer Online / Gaming | Online customer engagement data aggregation | Consumer/app engagement metrics like page views, CTR | Customer clickstream analytics, recommendation engines |
| Financial Services | Digital assets; improve customer experience on bank website | Financial market data metrics | Fraud monitoring, value-at-risk assessment, auditing of market order data |
| IoT / Sensor Data | Fitness device, vehicle sensor, telemetry data ingestion | Wearable sensor operational metrics and dashboards | Devices/sensor operational intelligence |
[Slide 4]
DataXu Records

Apache access log:
68.198.92 - - [22/Dec/2013:23:08:37 -0400] "GET / HTTP/1.1" 200 6394 www.yahoo.com "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1...)" "-"
192.168.198.92 - - [22/Dec/2013:23:08:38 -0400] "GET /images/logo.gif HTTP/1.1" 200 807 www.yahoo.com "http://www.some.com/" "Mozilla/4.0 (compatible; MSIE 6...)" "-"
192.168.72.177 - - [22/Dec/2013:23:32:14 -0400] "GET

JSON clickstream record:
{"cId":"10049","cdid":"5961","campID":"8","loc":"b","ip_address":"174.56.106.10","icctm_ht_athr":"","icctm_ht_aid":"","icctm_ht_attl":"Family Circus","icctm_ht_dtpub":"2011-04-05","icctm_ht_stnm":"SEATTLE POST-INTELLIGENCER","icctm_ht_cnocl":"http://www.seattlepi.com/comics-and-games/fun/Family_Circus","ts":"1422839422426","url":"http://www.seattlepi.com/comics-and-games/fun/Family_Circus","hash":"d98ace5874334232f6db3e1c0f8be3ab","load":"5.096","ref":"http://www.seattlepi.com/comics-and-games","bu":"HNP","brand":"SEATTLE POST-INTELLIGENCER","ref_type":"SAMESITE","ref_subtype":"SAMESITE","ua":"desktop:chrome"}

Characteristics of a clickstream record:
• Number of fields is not fixed
• Tag names change
• Multiple pages/sites
• Format can be defined as we store the data: Avro, CSV, TSV, JSON
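Because the field set varies per record, downstream parsers cannot assume a fixed schema. A minimal Python sketch of a defensive parser; the extracted field subset and the defaults are illustrative choices, not Hearst's actual code:

```python
import json

def parse_click(raw):
    """Extract a stable subset of fields from a clickstream record.

    Fields vary per record, so every lookup uses .get() with a default
    instead of assuming a fixed schema.
    """
    rec = json.loads(raw)
    return {
        "url": rec.get("url", ""),
        "brand": rec.get("brand", ""),
        "load_secs": float(rec.get("load", 0) or 0),
        "device": rec.get("ua", "unknown"),
    }

raw = ('{"url": "http://www.seattlepi.com/comics-and-games/fun/Family_Circus",'
       ' "brand": "SEATTLE POST-INTELLIGENCER", "load": "5.096",'
       ' "ua": "desktop:chrome"}')
click = parse_click(raw)
```

New tags simply flow through `json.loads` untouched, so adding fields upstream never breaks this parser.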
[Slide 5]
Clickstream Analytics Is the New "Hello World"
Hello World → Word count → Clickstream
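The analogy holds in code: the same grouping-and-counting pattern behind the classic word count gives per-URL page-view counts. A toy Python illustration with invented sample data:

```python
from collections import Counter

# Classic "hello world" of big data: count words.
words = "the quick brown fox the lazy dog the end".split()
word_counts = Counter(words)

# The clickstream version: same pattern, count page views per URL.
clicks = [
    {"url": "/home"}, {"url": "/article/1"}, {"url": "/home"},
    {"url": "/article/2"}, {"url": "/home"},
]
page_views = Counter(c["url"] for c in clicks)
```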
[Slide 6]
Clickstream Analytics – Common Patterns
• Flume → HDFS → Hive: batch, high latency on retrieval
• Flume → HDFS → Hive & Pig → SQL: batch, low latency on retrieval
• Flume/Sqoop → HDFS → Impala, Spark SQL, Presto, other: more options, batch with lower latency on retrieval
![Page 7: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS](https://reader033.vdocuments.site/reader033/viewer/2022042723/58f250b91a28ab8f268b45c7/html5/thumbnails/7.jpg)
users
Amazon
Kinesis
Kinesis-
enabled appAmazon S3 Amazon
EMR
Web
Servers
Amazon S3
Amazon Redshift
[Slide 8]
It's All About the Pace, About the Pace…

Big data:
• Hourly server logs: were your systems misbehaving 1 hr ago?
• Weekly/monthly bill: what you spent this billing cycle
• Daily customer-preferences report from your web site's clickstream: what deal or ad to try next time
• Daily fraud reports: was there fraud yesterday?

Real-time big data:
• Amazon CloudWatch metrics: what went wrong now
• Real-time spending alerts/caps: prevent overspending now
• Real-time analysis: what to offer the current customer now
• Real-time detection: block fraudulent use now
[Slide 9]
Clickstream Storage and Processing with Amazon Kinesis
(Architecture diagram: clients send data through an AWS endpoint into an Amazon Kinesis stream whose shards (Shard 1, Shard 2, … Shard N) are replicated across Availability Zones. Consumers: App 1 aggregates and ingests data to S3 (data lake); App 2 aggregates and ingests data to Amazon Redshift; App 3 performs ETL/ELT and machine learning with EMR; App N drives a live dashboard. DynamoDB tracks consumer application state.)
[Slide 10]
Amazon EMR
• Managed, elastic Hadoop (1.x & 2.x) cluster
• Integrates with Amazon S3, Amazon DynamoDB, and Amazon Redshift
• Installs Storm, Spark, Hive, Pig, Impala, and end-user tools automatically
• Support for Spot Instances
• Integrated HBase NoSQL database

Amazon EMR with Apache Spark: Spark SQL, Spark Streaming, MLlib, GraphX
[Slide 11]
Spot Integration with Amazon EMR

aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
  InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
  InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
  InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
[Slide 12]
Spot Integration with Amazon EMR
10-node cluster running for 14 hours (at $1.00 per node-hour)
Cost = 1.0 * 10 * 14 = $140
[Slide 13]
Resize Nodes with Spot Instances
Add 10 more nodes on Spot
[Slide 14]
Resize Nodes with Spot Instances
20-node cluster running for 7 hours
Cost: on-demand = 1.0 * 10 * 7 = $70
      Spot = 0.5 * 10 * 7 = $35
Total = $105
[Slide 15]
Resize Nodes with Spot Instances
50% less run-time (14 hr → 7 hr)
25% less cost ($140 → $105)
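The arithmetic from the preceding slides can be captured in a small helper. The $1.00 on-demand and $0.50 Spot rates are the slides' illustrative figures, not actual pricing:

```python
def cluster_cost(ondemand_node_hours, spot_node_hours,
                 ondemand_rate=1.00, spot_rate=0.50):
    """Total cost of a mixed on-demand/Spot EMR run.

    Rates are in $/node-hour; the defaults are the illustrative
    figures used on the slides.
    """
    return ondemand_node_hours * ondemand_rate + spot_node_hours * spot_rate

baseline = cluster_cost(10 * 14, 0)       # 10 on-demand nodes, 14 hours
resized = cluster_cost(10 * 7, 10 * 7)    # 10 on-demand + 10 Spot, 7 hours
```

Doubling the node count halves the run-time but costs only 75% of the baseline, because the added half is billed at the Spot rate.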
[Slide 16]
Amazon EMR and Amazon Kinesis for Batch and Interactive Processing
• Streaming log analysis
• Interactive ETL
(Diagram: Amazon Kinesis → Amazon EMR → Amazon S3 → Amazon Redshift → BI, plus a separate Amazon EMR cluster on Spot Instances for data scientists)
[Slide 17]
AWS Elastic Beanstalk: app to push data into Amazon Kinesis
[Slide 18]
Amazon Kinesis Applications – Tips
• Amazon Software License linking: add the ASL dependency to your SBT/Maven project (artifactId = spark-streaming-kinesis-asl_2.10)
• Shards: include head-room for catching up with data in the stream
• Tracking Amazon Kinesis application state (DynamoDB):
  • Kinesis application to DynamoDB table is 1:1
  • Created automatically
  • Make sure the application name doesn't conflict with existing DynamoDB tables
  • Adjust DynamoDB provisioned throughput if necessary (default: 10 reads per second & 10 writes per second)
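The shard head-room tip can be made concrete: size the shard count above the observed throughput so a consumer that falls behind can catch up. A sketch using the per-shard write limits Kinesis documented at the time (1 MB/s and 1,000 records/s per shard); the 1.25 head-room factor is an illustrative choice, not a rule:

```python
import math

# Per-shard Kinesis write limits: 1 MB/s and 1,000 records/s.
SHARD_MB_PER_SEC = 1.0
SHARD_RECORDS_PER_SEC = 1000

def shards_needed(mb_per_sec, records_per_sec, headroom=1.25):
    """Shard count sized with head-room above peak ingest.

    Both the byte and record limits are checked; whichever is
    the tighter constraint determines the shard count.
    """
    by_bytes = mb_per_sec * headroom / SHARD_MB_PER_SEC
    by_records = records_per_sec * headroom / SHARD_RECORDS_PER_SEC
    return max(1, math.ceil(max(by_bytes, by_records)))

n = shards_needed(mb_per_sec=3.0, records_per_sec=2500)
```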
[Slide 19]
Spark on Amazon EMR – Tips
• Spark ships as an Amazon EMR application from version 3.8.0 onward (no need to run bootstrap actions)
• Use Spot Instances for time & cost savings, especially when using Spark
• Run in YARN cluster mode (--master yarn-cluster) for production jobs: the Spark driver runs in the application master (high availability)
• Data serialization: use Kryo if possible to boost performance (spark.serializer=org.apache.spark.serializer.KryoSerializer)
[Slide 20]
The Life of a Click at Hearst
• Hearst’s journey with their big data analytics platform on
AWS
• Demo
• Clickstream analysis patterns
• Lessons learned
[Slide 21]
[Slide 22]
Have you heard of Hearst?
[Slide 23]
Hearst includes over 200 businesses in over 100 countries around the world.
• BUSINESS MEDIA operates more than 20 business-to-business companies, with significant holdings in the automotive, electronic, medical, and finance industries
• MAGAZINES publishes 20 U.S. titles and close to 300 international editions
• BROADCASTING comprises 31 television and two radio stations
• NEWSPAPERS owns 15 daily and 34 weekly newspapers
[Slide 24]
Data Services at Hearst – Our Mission
• Ensure that Hearst leverages its combined data assets
• Unify Hearst's data streams
• Develop a big data analytics platform using AWS services
• Promote enterprise-wide product development
• Example: Buzzing@Hearst, a product initiative led by all of Hearst's editors
[Slide 25]
[Slide 26]
Business Value of Buzzing
• Instant feedback on articles from our audiences
• Incremental re-syndication of popular articles across properties (e.g., trending newspaper articles can be adopted by magazines)
• Inform editors which articles are most relevant to our audiences and which channels our audiences use to read them
• Ultimately, drive incremental value: 25% more page views and 15% more visitors, which lead to incremental revenue
[Slide 27]
Engineering Requirements of Buzzing…
• Throughput goal: transport data from all 250+ Hearst properties worldwide
• Latency goal: click-to-tool in under 5 minutes
• Agility: easily add new data fields into the clickstream
• Unique metrics requirements defined by the Data Science team (e.g., standard deviations, regressions, etc.)
• Data reporting windows ranging from 1 hour to 1 week
• Front end developed "from scratch," so data exposed through the API must support the development team's unique requirements

Most importantly, operation of existing sites cannot be affected!
[Slide 28]
What we had to work with… a "static" clickstream collection process on many Hearst sites
(Diagram: users of Hearst properties → clickstream → corporate data center → Netezza data warehouse, loaded once per day)
• ~30 GB per day containing basic web log data (e.g., referrer, URL, user agent, cookie, etc.)
• Used for ad hoc SQL-based reporting and analytics
…now how do we get there?
[Slide 29]
…and we own Hearst's tag management system
(Diagram: users of Hearst properties → clickstream, fed by JavaScript on web pages)
This not only gave us access to the clickstream but also to the JavaScript code that lives on our websites.
[Slide 30]
Phase 1 – Ingest Clickstream Data Using AWS
(Pipeline: users of Hearst properties → clickstream → Node.JS app/proxy → Amazon Kinesis → Kinesis S3 app (KCL libraries) → "raw JSON" raw data in Amazon S3)
• Use the tag manager to easily deploy JavaScript to all sites
• Implement JavaScript on sites that calls an exposed endpoint and passes in query parameters
• Elastic Beanstalk with Node.JS exposes an HTTP endpoint which asynchronously takes the data and feeds it to Amazon Kinesis
• Kinesis Client Library and Kinesis Connectors persist data to Amazon S3 for durability
[Slide 31]
Node.JS – Push clickstream to Amazon Kinesis

function pushToKinesis(data) {
  var params = {
    Data: data, /* required */
    PartitionKey: guid(),
    StreamName: streamName /* required */
  };
  kinesis.putRecord(params, function(err, data) {
    if (err) {
      console.log(err, err.stack); // an error occurred
    }
  });
}

app.get('/hearstkin.gif', function(req, res){
  async.series([function(callback){
    var queryData = url.parse(req.url, true).query;
    queryData.proxyts = new Date().getTime().toString();
    pushToKinesis(JSON.stringify(queryData));
    callback(null);
  }]);
  res.writeHead(200, {'Content-Type': 'text/plain', 'Access-Control-Allow-Origin': '*'});
  res.end(imageGIF, 'binary');
});

http.createServer(app).listen(app.get('port'), function(){
  console.log('Express server listening on port ' + app.get('port'));
});

Notes:
• Asynchronous calls ensure no user-experience interruption
• Server timestamp creates a unified timestamp (Amazon Kinesis now offers this out of the box!)
• JSON format helps us downstream
• Kinesis partition key: guid() is a good partition key to ensure even distribution across the shards
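Why a fresh guid() spreads load evenly: Kinesis routes a record by taking the MD5 of its partition key as a 128-bit integer and finding the shard whose hash-key range contains it. A sketch that mimics this mapping with a hypothetical shard_for() helper and evenly split ranges:

```python
import hashlib
import uuid
from collections import Counter

def shard_for(partition_key, num_shards):
    """Mimic Kinesis shard assignment: MD5 of the partition key,
    read as a 128-bit integer, located within evenly split
    hash-key ranges."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(h // range_size, num_shards - 1)

# 10,000 random GUIDs land nearly uniformly across 4 shards.
counts = Counter(shard_for(str(uuid.uuid4()), 4) for _ in range(10000))
```

A fixed partition key (e.g., one key per site) would instead funnel all traffic through a single shard, capping throughput at that shard's limit.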
[Slide 32]
Ingest Monitoring – AWS
• Amazon Kinesis monitoring
• AWS Elastic Beanstalk monitoring
• Auto Scaling is triggered when network in exceeds 20 MB, scaling up to 40 instances
[Slide 33]
Phase 1 – Summary
• Use JSON formatting for payloads so more fields can be easily added without impacting downstream processing
• The HTTP call requires minimal code introduced to the actual site implementations
• Flexible to meet rollout and growing demand:
  • Elastic Beanstalk can be scaled
  • The Amazon Kinesis stream can be re-sharded
  • Amazon S3 provides highly durable storage for raw data
• With a reliable, scalable onboarding platform in place, we can now focus on ETL!
[Slide 34]
Phase 2a – Data Processing, First Version (EMR)
(Pipeline: "raw JSON" raw data → ETL on Amazon EMR → clean aggregate data)
• Amazon EMR was chosen initially for processing due to the ease of cluster creation… and Pig because we knew how to code in Pig Latin
• 50+ UDFs were written in Python… also because we knew Python
[Slide 35]
Processing Clickstream Data with Pig
Unfortunately, Pig was not performing well: 15 min latency

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
REGISTER '/home/hadoop/PROD/parser.py' USING jython as pyudf;
REGISTER '/home/hadoop/PROD/refclean.py' USING jython AS refs;
AA0 = load 's3://BUCKET_NAME/rawjson/datadata.tsv.gz' using TextLoader as (line:chararray);
A0 = FILTER AA0 BY pyudf.get_obj(line,'url') MATCHES '.*(a\\d+|g\\d+).*';
A1 = FOREACH A0 GENERATE
    pyudf.urlclean(pyudf.get_obj(line,'url')) as url:chararray,
    pyudf.get_obj(line,'hash') as hash:chararray,
    pyudf.get_obj(line,'icxid') as icxid:chararray,
    pyudf.pubclean(pyudf.get_obj(line,'icctm_ht_dtpub')) as pubdt:chararray,
    pyudf.get_obj(line,'icctm_ht_cnocl') as cnocl:chararray,
    pyudf.get_obj(line,'icctm_ht_athr') as author:chararray,
    pyudf.get_obj(line,'icctm_ht_attl') as title:chararray,
    pyudf.get_obj(line,'icctm_ht_aid') as cms_id:chararray,
    pyudf.num1(pyudf.get_obj(line,'mxdpth')) as mxdpth:double,
    pyudf.num2(pyudf.get_obj(line,'load')) as topsecs:double,
    refs.classy(pyudf.get_obj(line,'url'),1) as bu:chararray,
    pyudf.get_obj(line,'ip_address') as ip:chararray,
    pyudf.get_obj(line,'img') as img:chararray;

Notes:
• Gzip your output
• Regex in Pig!
• Python imports limited to what is allowed by Jython
[Slide 36]
Phase 2b – Data Processing (Spark Streaming)
(Pipeline: users of Hearst properties → clickstream → Node.JS app/proxy → Amazon Kinesis → ETL on EMR → clean aggregate data)
• Welcome Apache Spark: one framework for batch and real time
• Benefits: using the same code for batch and real-time ETL
• Use Spot Instances for cost savings
• Drawbacks: Scala!
[Slide 37]
Using SQL with Scala (Spark SQL)
Since we knew SQL, we wrote Scala with embedded SQL queries.

Configuration:
endpointUrl = kinesis.us-west-2.amazonaws.com
streamName = hearststream
outputLoc.json.streaming = s3://hearstkinesisdata/processedsparkjson
window.length = 300
sliding.interval = 300
outputLimit = 5000000
query1Table = hearst1
query1 = SELECT \
  simplestartq(proxyts, 5) as startq,\
  urlclean(url) as url,\
  hash,\
  icxid,\
  pubclean(icctm_ht_dtpub) as pubdt,\
  classy(url,1) as bu,\
  ip_address as ip,\
  artcheck(classy(url,1),url) as artcheck,\
  ref_type(ref,url) as ref_type,\
  img, \
  wc, \
  contentSource \
  FROM hearst1

Scala driver:
val jsonRDD = sqlContext.jsonRDD(rdd1)
jsonRDD.registerTempTable(query1Table.trim)
val query1Result = sqlContext.sql(query1)//.limit(outputLimit.toInt)
query1Result.registerTempTable(query2Table.trim)
val query2Result = sqlContext.sql(query2)
query2Result.registerTempTable(query3Table.trim)
val query3Result = sqlContext.sql(query3).limit(outputLimit.toInt)
val outPartitionFolder = UDFUtils.output60WithRolling(slidingInterval.toInt)
query3Result.toJSON.saveAsTextFile("%s/%s".format(outputLocJSON, outPartitionFolder), classOf[org.apache.hadoop.io.compress.GzipCodec])
logger.info("New JSON file written to " + outputLoc + "/" + outPartitionFolder)
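The register-then-query pattern above is framework-agnostic. A minimal Python sketch of the same idea, with the standard library's sqlite3 standing in for Spark SQL and invented sample rows (per-URL visits and hits, as in the later Amazon Redshift query):

```python
import sqlite3

# sqlite3 stands in for Spark SQL here: the pattern is the same --
# register tabular data, then express the transform as an embedded
# SQL string rather than hand-written aggregation code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hearst1 (url TEXT, ip TEXT, wc INTEGER)")
conn.executemany("INSERT INTO hearst1 VALUES (?, ?, ?)", [
    ("/article/a1", "1.2.3.4", 900),
    ("/article/a1", "5.6.7.8", 900),
    ("/article/a2", "1.2.3.4", 1200),
])

query1 = """
    SELECT url, COUNT(DISTINCT ip) AS visits, COUNT(*) AS hits
    FROM hearst1
    GROUP BY url
"""
rows = {url: (visits, hits) for url, visits, hits in conn.execute(query1)}
```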
[Slide 38]
Python UDF versus Scala
Don't be intimidated by Scala… if you know Python, the syntax can be similar:
• re.compile('a\d+|g\d+') ↔ Pattern.compile("a\\d+|g\\d+")
• try:/except: ↔ try{}/catch{}

Python:
def artcheck(bu, url):
    try:
        if url and bu:
            cleanurl = url[0:url.find("?")].strip('/')
            tailurl = url[findnth(url, '/', 3)+1:url.find("?")].strip('/')
            revurl = cleanurl[::-1]
            root = revurl[0:revurl.find('/')][::-1]
            if (bu=='HMI' or bu=='HMG') and re.compile('a\d+|g\d+').search(tailurl)!=None: return 'T'
            elif bu=='HTV' and root.isdigit()==True and re.compile('/search/').search(cleanurl)==None: return 'T'
            elif bu=='HNP' and re.compile('blog|fuelfix').search(url)!=None and re.compile(r'\S*[0-9]{4,4}\/[0-9]{2,2}\/[0-9]{2,2}\S*').search(tailurl)!=None: return 'T'
            elif bu=='HNP' and re.compile('businessinsider').search(url)!=None and re.compile(r'\S*[0-9]{4,4}\-[0-9]{2,2}').search(root)!=None: return 'T'
            elif bu=='HNP' and re.compile('blog|fuelfix|businessinsider').search(url)==None and re.compile('.php').search(url)!=None: return 'T'
            else: return 'F'
        else: return 'F'
    except:
        return 'F'

Scala:
def artcheck(bu: String, url: String) = {
  try {
    val cleanurl = UDFUtils.utilurlclean(url.trim).stripSuffix("/")
    val pathClean = UDFUtils.pathURI(cleanurl)
    val lastContext = pathClean.split("/").last
    var resp = "F"
    if (("HMI"==bu || "HMG"==bu) && Pattern.compile("/a\\d+|/g\\d+").matcher(pathClean).find()) resp = "T"
    else if ("HTV"==bu && StringUtils.isNumeric(lastContext) && !cleanurl.contains("/search/")) resp = "T"
    else if ("HNP"==bu && Pattern.compile("blog|fuelfix").matcher(url).find() && Pattern.compile("\\d{4}/\\d{2}/\\d{2}").matcher(pathClean).find()) resp = "T"
    else if ("HNP"==bu && Pattern.compile("businessinsider").matcher(url).find() && Pattern.compile("\\d{4}-\\d{2}").matcher(lastContext).find()) resp = "T"
    else if ("HNP"==bu && !Pattern.compile("blog|fuelfix|businessinsider").matcher(url).find() && Pattern.compile(".php").matcher(url).find()) resp = "T"
    resp
  }
  catch {
    case e: Exception => "F"
  }
}
[Slide 39]
Phase 3a – Data Science!
(Pipeline: Amazon Kinesis → ETL on EMR → clean aggregate data → Data Science on EC2 → API-ready data)
• We decided to perform our data science using SAS on Amazon EC2 initially, because of the ability to perform both data manipulation and complex data science techniques (e.g., regressions) easily
• Great for exploration and initial development
• Performing data science this way took 3-5 minutes to complete
[Slide 40]
SAS Code Example

Use a pipe to read in S3 data and keep it compressed:

data _null_;
  call system("aws s3 cp s3://BUCKET_NAME/file.gz /home/ec2-user/LOGFILES/file.gz");
run;
FILENAME IN pipe "gzip -dc /home/ec2-user/LOGFILES/file.gz" lrecl=32767;
data temp1;
  FORMAT startq DATETIME19.;
  infile IN delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=1;
  input
    startq :YMDDTTM.
    url :$1000.
    pageviews :best32.
    visits :best32.
    author :$100.
    cms_id :$100.
    img :$1000.
    title :$1000.;
run;

Use PROC SQL when possible for easier translation to Amazon Redshift for production later on:

proc sql;
CREATE TABLE metrics AS
SELECT
  url FORMAT=$1000.,
  SUM(pageviews) as pageviews,
  SUM(visits) as visits,
  SUM(fvisits) as fvisits,
  SUM(evisits) as evisits,
  MIN(ttct) as rec,
  COUNT(distinct startq) as frq,
  AVG(visits) as avg_visits_pp,
  SUM(visits1) as visits_soc,
  SUM(visits2) as visits_dir,
  SUM(visits3) as visits_int,
  SUM(visits4) as visits_sea,
  SUM(visits5) as visits_web,
  SUM(visits6) as visits_nws,
  SUM(visits7) as visits_pd,
  SUM(visits8) as visits_soc_fb,
  SUM(visits9) as visits_soc_tw,
  SUM(visits10) as visits_soc_pi,
  SUM(visits11) as visits_soc_re,
  SUM(visits12) as visits_soc_yt,
  SUM(visits13) as visits_soc_su,
  SUM(visits14) as visits_soc_gp,
  SUM(visits15) as visits_soc_li,
  SUM(visits16) as visits_soc_tb,
  SUM(visits17) as visits_soc_ot,
  CASE WHEN (SUM(v1) - SUM(v3)) > 20 THEN (SUM(v1) - SUM(v3)) / 2 ELSE 0 END as trending
FROM temp1
GROUP BY 1;
[Slide 41]
Phase 3b – Split Data Science into Development and Production
(Pipeline: Amazon Kinesis → ETL on EMR → clean aggregate data → Data Science "Production" on Amazon Redshift → API-ready data; Data Science "Development" on EC2 runs statistical models once per day, with models and aggregate data stored in S3)
• Once the data science models were established, we split modeling and production
• Production was moved to Amazon Redshift, which read Amazon S3 data and processed it much faster
• Data science processing time went down to 100 seconds!
• Use S3 to store data science models and apply them using Amazon Redshift
[Slide 42]
Amazon Redshift Code Example
A cool trick to find the most recent value of a character field in one pass through the data:

select
  clean_url as url,
  trim(substring(max(proxyts||domain) from 20 for 1000)) as domain,
  trim(substring(max(proxyts||clean_cnocl) from 20 for 1000)) as cnocl,
  trim(substring(max(proxyts||img) from 20 for 1000)) as img,
  trim(substring(max(proxyts||title) from 20 for 1000)) as title,
  trim(substring(max(proxyts||section) from 20 for 1000)) as section,
  approximate count(distinct ic_fpc) as visits,
  count(1) as hits
from kinesis_hits
where bu='HMG' and (article_id is not null or author is not null or title is not null)
group by 1;
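The trick works because the fixed-width timestamp prefix makes a lexicographic MAX() pick the latest (timestamp, value) pair in one aggregation pass; the substring then strips the prefix back off. The same idea in plain Python, with invented sample rows and an illustrative 14-character timestamp width:

```python
# Each row is (fixed-width timestamp, field value); because the
# timestamp is fixed-width, string order equals time order, so
# max() over the concatenations finds the most recent value.
rows = [
    ("20150101120000", "Old Title"),
    ("20150101120500", "New Title"),
    ("20150101115900", "Oldest Title"),
]
PREFIX_LEN = 14  # width of the timestamp prefix

latest_title = max(ts + title for ts, title in rows)[PREFIX_LEN:]
```

One caveat, same as in SQL: the prefix must be fixed-width, or lexicographic order stops matching time order.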
[Slide 43]
Phase 4a – Elasticsearch Integration
(Pipeline: ETL on EMR → S3 storage → Data Science on Amazon Redshift → S3 storage (models, agg data, API-ready data) → Amazon EMR pushes to → Buzzing API)
Since we already had the Amazon EMR cluster running, we used a handy Pig jar that made it easy to push data to Elasticsearch.
[Slide 44]
Pig Code – Push to ES Example
Use the handy Pig jar to push data to Elasticsearch:

REGISTER /home/hadoop/pig/lib/piggybank.jar;
REGISTER /home/hadoop/PROD/elasticsearch-hadoop-2.0.2.jar;
DEFINE EsStorageDEV org.elasticsearch.hadoop.pig.EsStorage
  ('es.nodes = es-dev.hearst.io',
   'es.port = 9200',
   'es.http.timeout = 5m',
   'es.index.auto.create = true');
SECTIONS = load 's3://hearstkinesisdata/ss.tsv' USING PigStorage('\t') as
  (sectionid:chararray,cnt:long,visits:long,sectionname:chararray);
STORE SECTIONS INTO 'content-sections-sync/content-sections-sync' USING EsStorageDEV;

The "Amazon EMR overhead" required to read small files added 2 min of latency.
[Slide 45]
Phase 4b – Elasticsearch Integration Sped Up
(Pipeline: ETL on EMR → Data Science on Amazon Redshift → S3 storage (models, agg data, API-ready data) → Buzzing API)
Since the Amazon Redshift code was run in a Python wrapper, the solution was to push data directly into Elasticsearch.
[Slide 46]
Script to Push to Elasticsearch Directly

# Convert file into bulk-insert compatible format
$bin/convert_json.php big.json create rowbyrow.json
# Get mapping file
${aws} s3 cp S3://hearst/es_mapping es_mapping
# Create new ES index
$(curl -XPUT http://es.hearst.io/content-web180-sync --data-binary es_mapping -s)
# Perform bulk API call
$(curl -XPOST http://es.hearst.io/content-web180-sync/_bulk --data-binary rowbyrow.json -s) "http://es.hearst.io/content-web180-sync"

Notes:
• Converting one big input JSON file to a row-by-row JSON is a key step for making the data bulk-compatible
• Use a mapping file to manage the formatting in your index… very important for dates and numeric values that look like strings
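The big-JSON-to-row-by-row conversion (done above by a PHP helper) can be sketched in Python: each document becomes an action line plus a source line, which is the newline-delimited format the Elasticsearch bulk API expects. The index and type names here are illustrative:

```python
import json

def to_bulk(big_json, index, doc_type):
    """Convert one big JSON array into Elasticsearch bulk-API input:
    for each document, an action line ({"index": ...}) followed by
    the document itself, one JSON object per line."""
    docs = json.loads(big_json)
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # bulk API requires a trailing newline

bulk = to_bulk('[{"url": "/a"}, {"url": "/b"}]', "content-web180-sync", "doc")
```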
[Slide 47]
Final Data Pipeline
(Pipeline: users of Hearst properties → clickstream → Node.JS app/proxy → Amazon Kinesis → S3 storage → ETL on EMR → Data Science on Amazon Redshift (models, agg data) → API-ready data → Buzzing API)
Latency/throughput along the pipeline: milliseconds at 100 GB/day (ingest), 5 seconds at 1 GB/day, 30 seconds at 5 GB/day, 100 seconds at 1 GB/day (data science)
[Slide 48]
A more "visual" representation of our pipeline!
(Clickstream data → Amazon Kinesis → ETL → Data Science on Amazon Redshift → Results API)
[Slide 49]
Lessons learned ("no duh's"): removing "stoppage" points, speeding up processing, and combining processes all improve latency.

| Version | Transport | Storage | ETL | Storage | Analysis | Storage | Exposure | Latency |
|---|---|---|---|---|---|---|---|---|
| V1 | Amazon Kinesis | S3 | EMR-Pig | S3 | EC2-SAS | S3 | EMR to Elasticsearch | 1 hour |
| Today | Amazon Kinesis | | Spark-Scala | S3 | Amazon Redshift | | Elasticsearch | <5 min |
| Tomorrow | Amazon Kinesis | | PySpark + SparkR | | | | Elasticsearch | <2 min |
[Slide 50]
Data Science Tool Box
(Pipeline as before, with a Data Science Toolbox reading data and models from Amazon Redshift)
• IPython Notebook
  • On Spark and Amazon Redshift
  • Code sharing (and insights)
  • User-friendly development environment for data scientists
  • Auto-convert .ipynb → .py
[Slide 51]
Data Science at Hearst – Notebook
[Slide 52]
Next Steps
• Amazon EMR 4.1.0 with Spark 1.5 has been released, so we can do more with PySpark; look at Apache Zeppelin on Amazon EMR
• Amazon Kinesis just released a new feature to retain data for up to 7 days, so we could do more ETL "in the stream"
• Amazon Kinesis Firehose and Lambda: zero touch (no Amazon EC2 maintenance)
• More complex data science that requires:
  • Amazon Redshift UDFs
  • A Python shell that calls Amazon Redshift but also allows for complex statistical methods (e.g., using R or machine learning)
[Slide 53]
Conclusion
• Clickstreams are the new "data currency" of business
• AWS provides great technology to process data:
  • High speed
  • Lower costs (using Spot…)
  • Very agile
• Do more with less: this can all be done with a team of 2 FTEs!
  • 1 developer (well versed in AWS) + 1 data scientist
[Slide 54]
Call To Action
Click → Ingest → Store → Process → Analyze → Insight, over time
Services: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon RDS (Aurora), AWS Lambda, KCL apps, Amazon EMR, Amazon Redshift
[Slide 55]
Call To Action
Use Amazon Kinesis, EMR, and Amazon Redshift for clickstream.
• Open source connectors: http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html
• AWS Big Data blog: http://blogs.aws.amazon.com/bigdata/
• AWS re:Invent Big Data booth
• AWS Big Data Marketplace and partner ecosystem
• Hearst booth – Hall C1156: learn more about the interesting things we are doing with data!
[Slide 56]
Remember to complete
your evaluations!
[Slide 57]
Thank you!