BigData @ comScore
Michael Brown, CTO, comScore, Inc. March 25th, 2011
comScore is a Global Leader in Measuring the Digital World
NASDAQ SCOR
Clients 1600+ worldwide
Employees 1,000+
Headquarters Reston, VA
Global Coverage 170+ countries under measurement; 43 markets reported
Local Presence 30+ locations in 21 countries
© comScore, Inc. Proprietary.
Broad Client Base and Deep Expertise Across Key Industries
Media Agencies Telecom/Mobile Financial Retail Travel CPG Pharma Technology
The Trusted Source for Digital Intelligence Across Vertical Markets
47 out of the top 50 ONLINE PROPERTIES
45 out of the top 50 ADVERTISING AGENCIES
9 out of the top 10 MAJOR MEDIA COMPANIES
4 out of the top 4 WIRELESS CARRIERS
9 out of the top 10 INVESTMENT BANKS
9 out of the top 10 INTERNET SERVICE PROVIDERS
9 out of the top 10 AUTO INSURERS
9 out of the top 10 PHARMACEUTICAL COMPANIES
9 out of the top 10 CONSUMER FINANCE COMPANIES
9 out of the top 10 CPG COMPANIES
comScore History of Leadership and Innovation
1st:
– To measure the search market
– To measure video streaming
– To provide behavioral ad effectiveness
– To meter mobile user behavior
– To unify census + panel measurement
– To build and project from a 2 million+ longitudinal panel
– To monitor and report e-commerce data
– To deliver a worldwide Internet audience measurement
Global Shaper Company 2010
Average Records Captured per Day (2005-2009)
[Chart: records per day, y-axis from 0 to 1,800,000,000]
Launching the 3rd Generation
• In 2009, in the midst of the recession, comScore decided to build and release its 3rd Generation Product – Unified Digital Measurement (UDM or Hybrid)
• Technology Goals
– Ramp up data collection
– Deploy new methodologies for data processing and analysis
– Be able to scale the environment linearly to support growth
– Have yesterday's data available today
• And one more thing … do it in 4 months or less.
Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration
Global PERSON Measurement
Global MACHINE Measurement
PAGE TAGS
PANEL
Unified Digital Measurement (UDM) Patent-Pending Methodology
Adopted by 88% of Top U.S. Media Properties
How Does the Hybrid Process Work?
Collect Traffic from PCs and devices
Clean Traffic – remove non-human, bots, apply edit rules
Apply comScore URL Dictionary
Total Traffic Filtered Traffic
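The cleaning step (total traffic in, filtered traffic out) can be sketched roughly as below. This is only an illustration under stated assumptions: the `Hit` record, its fields, the bot markers, and the single edit rule are hypothetical and are not comScore's actual rules.

```python
from dataclasses import dataclass

# Hypothetical markers for non-human traffic; a real list would be far longer.
BOT_MARKERS = ("bot", "spider", "crawler")


@dataclass(frozen=True)
class Hit:
    """One collected traffic record (hypothetical shape)."""
    user_agent: str
    url: str


def is_human(hit: Hit) -> bool:
    """Return False when the user agent contains a known bot marker."""
    agent = hit.user_agent.lower()
    return not any(marker in agent for marker in BOT_MARKERS)


def passes_edit_rules(hit: Hit) -> bool:
    """Hypothetical edit rule: keep only hits with an http(s) URL."""
    return hit.url.startswith(("http://", "https://"))


def clean_traffic(total_traffic):
    """Filter total traffic down to records that look human and valid."""
    return [h for h in total_traffic if is_human(h) and passes_edit_rules(h)]


total = [
    Hit("Mozilla/5.0 (Windows NT 6.1)", "http://example.com/"),
    Hit("Googlebot/2.1", "http://example.com/"),
    Hit("Mozilla/5.0 (Macintosh)", "about:blank"),
]
filtered = clean_traffic(total)
print(len(total), "records in,", len(filtered), "records kept")
```

Keeping each rule as its own small predicate makes it easy to add or retire rules without touching the rest of the pipeline.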
URL Dictionary (CFD): Advertising Industry “Currency”
• Intelligent grouping of Properties with 7+ levels of detail
– Property (e.g., Yahoo! Properties, Microsoft Sites)
– Media Title (e.g., Yahoo!, MSN)
– Channel (e.g., Yahoo! Search, MSN Homepages)
– Subchannel (e.g., Yahoo! Image Search, MSNBC)
– Group/Subgroup (e.g., Yahoo! Calendar, Today)
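A hierarchical dictionary of this kind can be pictured as a most-specific-match lookup from URL patterns to a path of levels. The sketch below is an assumption-laden illustration: the level names come from the slide, but the URL patterns, the `DICTIONARY` table, and the matching rule are hypothetical and do not describe the real CFD.

```python
from urllib.parse import urlsplit

# Hypothetical entries: pattern (host plus optional path prefix) mapped to
# levels named on the slide: Property, Media Title, Channel, Subchannel.
DICTIONARY = {
    "yahoo.com": ("Yahoo! Properties", "Yahoo!"),
    "search.yahoo.com": ("Yahoo! Properties", "Yahoo!", "Yahoo! Search"),
    "search.yahoo.com/images": (
        "Yahoo! Properties", "Yahoo!", "Yahoo! Search", "Yahoo! Image Search",
    ),
}


def candidates(url: str):
    """Yield lookup keys from most specific to least specific."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    segments = [s for s in parts.path.split("/") if s]
    labels = host.split(".")
    # Drop leading host labels one at a time: a.b.com, then b.com.
    for i in range(len(labels) - 1):
        suffix = ".".join(labels[i:])
        # For each host suffix, try the longest path prefix first.
        for j in range(len(segments), -1, -1):
            yield "/".join([suffix] + segments[:j])


def classify(url: str):
    """Return the levels of the most specific matching pattern, or None."""
    for key in candidates(url):
        if key in DICTIONARY:
            return DICTIONARY[key]
    return None


print(classify("http://search.yahoo.com/images/view?p=cats"))
print(classify("http://mail.yahoo.com/inbox"))
print(classify("http://example.org/"))
```

Because unmatched URLs fall back to the nearest broader pattern, a page that has no Channel entry still rolls up to its Media Title and Property.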
URL Dictionary (CFD) Coverage Statistics
11MM Unique Domains Average/Month in 2010
• Over 80% of pages viewed came from the top 131K domains in 2010 vs. 392K in 2009
• 2,360K patterns in January 2011, representing 85% of all pages
• 1,254K syndicated entities in January 2010
• 41K patterns added/month in 2010
Worldwide UDM™ Penetration
Percentage of Machines Included in UDM Measurement (July 2010 Penetration Data)
Europe: Austria 80%, Belgium 85%, Switzerland 84%, Germany 84%, Denmark 82%, Spain 90%, Finland 85%, France 91%, Ireland 91%, Italy 80%, Netherlands 88%, Norway 84%, Portugal 86%, Sweden 85%, United Kingdom 90%
Asia Pacific: Australia 91%, Hong Kong 88%, India 84%, Japan 73%, Malaysia 87%, New Zealand 88%, Singapore 91%
North America: Canada 94%, United States 91%
Latin America: Argentina 94%, Brazil 92%, Chile 94%, Colombia 95%, Mexico 93%, Puerto Rico 92%
Middle East & Africa: Israel 93%, South Africa 73%
Worldwide Tags per Day
[Chart: Beacon Records and Panel Records, number of records per day by month, Jul 2009 to Feb 2011; y-axis from 0 to 25,000,000,000]
Monthly Totals
[Chart: Beacon Records and Panel Records, number of records per month, Jul 2009 to Feb 2011; y-axis from 0 to 600,000,000,000]
High Level Data Flow
[Diagram labels: Panel, ETL; Census, ETL; Delivery]
Enterprise Data Warehouse: Sybase IQ 15.2 Multiplex
• EDW currently comprises 20 servers running Windows 2003 R2 x64
– Currently 220 Intel CPUs
– Dedicated EDW technical team of 3 DBAs and 1 Administrator
– Ability to grow compute capacity and storage capacity independently
• EDW data repository housed on both EMC VMAX and CLARiiON
– 4 EDW instances (2 in Virginia and 2 in Illinois)
– One EDW instance is 147TB usable (approx. 200TB of raw data)
– Production EDW Drive Layout:
  416 x 1TB SATA, RAID6, 14+2
  42 x 600GB 15K, RAID1
  8 x 400GB Flash, RAID5, 7+1
• Current Capacity and Performance Metrics
– 1,835,412,793,799 rows loaded
– 140TB in 14,168 tables
– Capable of loading 56 billion rows per hour
Subsystem
• System designed using multiple subsystems
• Components can easily be taken out and replaced as demands change
• In some cases, such as first-stage tag processing, we moved from a single server to a cluster of servers within a few months
• Periodically redesign different subsystems to support increased processing demands
• Many systems are on their third generation of technology
Homegrown Distributed Processing
• 2002 – comScore distributed processing framework
• Reduced core aggregation from 48 hours to 7 hours
• Reduced final product creation from 24 hours to 2 hours
• Open Source Hadoop framework
• Scalability Wall (diagram label, shown twice)
GreenPlum
• GreenPlum MPP
– 80 Node Cluster: 1 Master; 6 ETL; 72 Workers
– Using Dell R510 with 12 x 600GB 15K drives (RAID), 64GB RAM, 24 cores (HT)
– Supports analytic end users with access to record-level data through a SQL interface
– Ability to load over 400 billion rows in 8 hours
– Hourly data loading in place
– Allows analysts to mine the data for business uses
– Used for quick analysis of raw event data and for the ideation and creation of new products
Hadoop
• Hadoop
– Dev: 6x Dell 2950 w/ 6 x 1TB
– Prod: 10x Dell R710 w/ 6 x 600GB
– Prod in 2 weeks: 10x Dell R710 w/ 6 x 600GB & 20x Dell R510 w/ 12 x 2TB
– We are moving large processing jobs that are constrained by our current framework to Hadoop. Some large analytical runs currently go for over 40 hours on 32 servers, and we are re-engineering them to reduce processing time.
– We have found that the Fair Scheduler works well for our job loads
– We use a “homegrown” workflow system (BORG) that manages tasks inside and outside Hadoop.
Sharding
• Sharding divides work across multiple systems using different mechanisms
• Shard data as far upstream as possible
• Breaking data into multiple chunks early in processing makes it possible to scale compute capacity downstream to accommodate large volume increases in data ingest
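One common sharding mechanism is to hash a key and take the result modulo the number of shards. The sketch below assumes a hypothetical `cookie_id` field as the key and a SHA-256 based hash; the slides do not say which keys or mechanisms comScore uses.

```python
import hashlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (same result on every run)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_records(records, num_shards, key_field="cookie_id"):
    """Split records into buckets keyed by shard number."""
    shards = defaultdict(list)
    for record in records:
        shards[shard_for(record[key_field], num_shards)].append(record)
    return shards


# Hypothetical input: 1,000 events spread over 100 cookies.
records = [{"cookie_id": f"cookie{i % 100}", "event": i} for i in range(1000)]
shards = shard_records(records, num_shards=4)
for shard_id in sorted(shards):
    print(shard_id, len(shards[shard_id]))
```

Since every record for a given key lands in the same shard, per-key work such as counting uniques can run on each shard independently, and the shard counts can simply be summed.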
Sorting
• We use DMExpress from Syncsort across hundreds of servers; this allows for efficient data processing
• We sort input data based on a column in advance
• To calculate uniques, check whether the current value has changed from the prior value and, if so, increment a counter
• We now have aggregation systems that can process over 50 GB of data with 357 million rows in less than an hour on a Dell R710 2U server
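The uniques calculation on pre-sorted input can be sketched as a single streaming pass. This is a minimal Python illustration of the idea, not the DMExpress job itself; the function name and sample values are hypothetical.

```python
def count_uniques_sorted(sorted_values):
    """Count distinct values in input that is already sorted on the key.

    Only the previous value is held in memory, so the input can be a
    stream of any length.
    """
    count = 0
    previous = object()  # sentinel that never equals a real value
    for value in sorted_values:
        if value != previous:  # value changed, so this is a new unique
            count += 1
            previous = value
    return count


cookies = ["c3", "c1", "c2", "c1", "c3", "c3"]
print(count_uniques_sorted(sorted(cookies)))
```

The pass needs no hash table or set, which is why sorting in advance keeps memory flat even when the number of distinct values is very large.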
Compression w/Sorting
• Compress log files when processing large volumes of log data
• Several advantages to sorting data first:
– Reduces the size of the data
– Improves application performance
• Examples:
– 1 hour of our data (313 GB raw, 815 million rows)
– Standard compression of time-ordered data is 93GB (30% of original)
– Standard compression on a 2-key sorted set is 56GB (18% of original)
– For one day it saves 800GB
– For one month it saves 25TB
– For 90 days it saves 75TB
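The effect of sorting before compressing can be reproduced on a small synthetic log. The sketch below is an assumption throughout: the two columns, their value ranges, and the use of zlib are illustrative choices, so the sizes it prints will not match the 30% and 18% figures above.

```python
import random
import zlib

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical log with two key columns (site, cookie) in arrival order.
sites = [f"site{n:03d}.example" for n in range(50)]
cookies = [f"cookie{n:05d}" for n in range(500)]
rows = [(random.choice(sites), random.choice(cookies)) for _ in range(50_000)]


def compressed_size(ordered_rows) -> int:
    """Bytes needed for the rows as CSV text after zlib compression."""
    text = "\n".join(f"{site},{cookie}" for site, cookie in ordered_rows)
    return len(zlib.compress(text.encode("ascii"), 6))


arrival_order = compressed_size(rows)
key_sorted = compressed_size(sorted(rows))  # sorted on (site, cookie)

print(f"arrival order: {arrival_order} bytes")
print(f"2-key sorted:  {key_sorted} bytes")
```

Sorting places rows with equal or similar keys next to each other, so the compressor finds long repeats close by instead of scattered across the file.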
Big data makes you think differently
• Question: How many distinct cookies over 3 months?
• Data: 3 monthly tables with distinct cookies, indexed
• Size: 10B records per table
• Platform: Sybase IQ
• Attempt: select count(cookies) over a UNION of the 3 monthly tables
– The UNION operator removes duplicates
• Result: FAIL. Out of temp space. Out of luck.
– Failed after 30 minutes.
• Why? UNION performs a SELECT and then a DISTINCT (sorting 30B rows)
Rethink the problem!
• INNER joins are cheaper
• No sort, they use existing indexes
• Remember set theory? Of course you do!
• Let months be {A, B, C}
|A ∪ B ∪ C| = |A| + |B| + |C| - |A ∩ B| - |A ∩ C| - |C ∩ B| + |A ∩ B ∩ C|
A ∩ B ∩ C = (A ∩ B) ∩ (A ∩ C) ∩ (C ∩ B)
• INNER join on only 2 tables of data at a time
• Each 2-month intersection took 2 hours and was less taxing on memory
• Used intersection of intermediate (indexed!) results… 5 mins
Total query time: 6.5 hours
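The inclusion-exclusion identity can be checked on small sets. In this sketch Python sets stand in for the monthly cookie tables and `&` stands in for the INNER join; the function name and sample data are hypothetical.

```python
def distinct_over_three(a: set, b: set, c: set) -> int:
    """Count |A ∪ B ∪ C| from per-set counts and intersections only."""
    ab = a & b  # pairwise intersections, two sets at a time
    ac = a & c
    bc = b & c
    # Triple intersection built from the (smaller) pairwise results.
    abc = ab & ac & bc
    return len(a) + len(b) + len(c) - len(ab) - len(ac) - len(bc) + len(abc)


jan = {1, 2, 3, 4}
feb = {3, 4, 5}
mar = {4, 5, 6, 7}

print(distinct_over_three(jan, feb, mar))  # via inclusion-exclusion
print(len(jan | feb | mar))                # direct union, for comparison
```

The direct union is the step that ran out of temp space at scale; the inclusion-exclusion route never materializes all three months at once.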
TCO with Large Cluster Systems
• Examine replication factor and disk configuration for systems with replication built into the framework to support redundancy and concurrency
• Example: Hadoop cluster that supports 108TB of base compressed data
Hypothetical Configurations:
– Replication Factor of 3
  R710 (6x drives, JBOD); requires 162 servers
  R510 (12x drives, JBOD); requires 68 servers
– Replication Factor of 2
  R710 (6x drives, RAID 5); requires 129 servers
  R510 (12x drives, RAID 5); requires 54 servers
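The server counts follow from total replicated storage divided by usable capacity per server. The slide does not state drive sizes or usable capacity, so the sketch below only back-calculates the per-server figure implied by each quoted count; the function names are hypothetical and the derived values are not from the slide.

```python
import math


def raw_tb(base_tb: float, replication_factor: int) -> float:
    """Total storage the cluster holds once every block is replicated."""
    return base_tb * replication_factor


def servers_required(base_tb, replication_factor, usable_tb_per_server):
    """Servers needed for the replicated data, rounded up to whole machines."""
    return math.ceil(raw_tb(base_tb, replication_factor) / usable_tb_per_server)


def implied_usable_tb(base_tb, replication_factor, servers):
    """Usable TB per server implied by a quoted server count."""
    return raw_tb(base_tb, replication_factor) / servers


BASE_TB = 108  # base compressed data, from the slide

# (model, replication factor, server count) as quoted on the slide
quoted = [("R710", 3, 162), ("R510", 3, 68), ("R710", 2, 129), ("R510", 2, 54)]
for model, rf, servers in quoted:
    per_server = implied_usable_tb(BASE_TB, rf, servers)
    print(f"{model} RF={rf}: {raw_tb(BASE_TB, rf)} TB total, "
          f"about {per_server:.2f} TB usable per server")
```

Dropping the replication factor from 3 to 2 cuts total storage by a third, while RAID 5 gives back part of that saving as parity overhead on each server.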
Useful Factoids
Colorful, bite-sized graphical representations of the best discoveries we unearth.
Visit www.comscoredatamine.com or follow @datagems for the latest gems.