Yahoo! Data Quality: Embedded DQ Strategy, Statistical Influences, and Challenges in Yahoo!'s Massive Data Environment

ABSTRACT

While data is vital to Yahoo!, quality data is a key to its success. Consistent, accurate, and overall high-quality data are needed to give reliable insights on reach, engagement and monetization. This presentation describes the Data Quality team’s approach to ensuring the highest data quality by applying recognized best practices with a customized approach in a very large, data-intensive organization. Recently, to encourage ownership of data quality throughout the organization, Yahoo! instituted an embedded, company-wide data quality program. In addition, the DQ team began investing in improvements to data forecasting and monitoring to aid the new program. These improvements reduced errors and encouraged buy-in from across the organization. The presentation also focuses on the technical challenges associated with protecting data accuracy from malicious traffic in Yahoo!’s massive data environment.

BIOGRAPHY

Jeff Kibler
Data Quality Lead, Audience Management & Analytics
Yahoo!

Jeff Kibler earned his Bachelor's in Computer Science and his Master of Business Administration from the University of Illinois at Urbana-Champaign. At Motorola, he designed and developed test policies and procedures using Six Sigma methodologies within the cellular phone business. He currently acts as the DQ lead for audience measurement and analytics within the Yahoo! data quality team.

Oleksiy Chayka
Yahoo! / University of Trento, Italy

Oleksiy Chayka received his joint Master's in Computer Science from the University of Trento (Italy) and RWTH Aachen (Germany) in 2007. In his current PhD coursework at the University of Trento, he researches data quality management within the European project OKKAM and the MASTER project. Since March 2010, he has worked as a DQ intern at Yahoo!. Oleksiy also works in system analysis and the application of data mining techniques to data quality assessment. He is a member of the W3C Incubator Group on data provenance and the Database Group of Trento (DBTrento).

The Fourth MIT Information Quality Industry Symposium, July 14-16, 2010


Yahoo! Data Quality
Embedded DQ Strategy, Statistical Influences, and Challenges in Yahoo!’s Massive Data Environment

MIT IQIS ‘10
Jeff Kibler
Oleksiy Chayka
6/29/2010


Table of Contents

1. Quick Overview of Yahoo! Data
2. Embedded Data Quality Model
3. Support through Statistics
4. Guarding from Malicious Traffic

Yahoo! User Data Overview


Yahoo! User Data Volume and Definition

Collects over a dozen terabytes of data per day [1]
• U.S. Library of Congress equivalents every day

Leading Internet Portal and Software Supplier [2]
• Serves 150+ MM US users – 75%+ of US internet users
• Top ranked in 11 sites (Mail, Messenger, FP, Finance, etc.)

User Data & Analytics – Organization Dedicated to Data R&D
• Use of data for business insights
  – Improving the product through “advanced usability testing”
  – Leader in advanced user profiling & behavioral targeting
  – Ads become “content” and user experience is enhanced
• Data is Key to Measuring Company Growth
  – Quarterly reporting of key metrics to Wall Street, the press, etc.
  – Engagement: unique users, time spent and return users
  – Effective metrics program is critical to guiding the business

[1] Baeza-Yates, Ricardo and Raghu Ramakrishnan. Data Challenges at Yahoo!. EDBT 2008. March 2008.
[2] August 2008 US Yahoo! Audience Measurement Report. comScore, June 2009.

Yahoo! User Data Management and Application

[Diagram. Layers: Data Management; Semantic Intelligence; Data Capabilities]

Stages: Instrumentation, Collection, Warehousing, Metrics, Categorization, User Profiling, Segmentation, Intent Inference

• Pages • Ads • Streams • Links • Conversions
• Y! Network • RightMedia • Ebay & NPC • Off-network
• Audience • Advertiser
• Y! usage • Ad performance • Web2.0 • Engagement • User value
• Web pages • UGC • Search terms • Ads
• Demographics • Yahoo! Behavior (Property Usage, Search Terms, Campaign Response) • Off-Yahoo! Behavior
• Purchase intent • Conversion inferences • Demo inference • Geo inference

TARGETING & OPTIMIZATION
• Behavioral, Geo, Demo Targeting
• Audience Targeting
• Search Relevancy
• Cost Per Lead / Acquisition
• Real-time Recommendations
• Content Optimization
• Marketplace Design

INSIGHTS
• Site & Consumer Analytics
• Advertiser Analytics
• Ad Effectiveness
• Advertiser, Marketing & Business Reporting
• Experimentation
• Data Mining


Pipeline Overview

[Diagram labels: It’s You!; One of our Properties; Extract, Transform, Load; Business Insights]

Yahoo!’s Cross-Organizational Ownership
Embedded Data Quality Model

Jeff Kibler


Impacts of Poor Data Quality

$$ Loss

Wrong direction for the business

Valuable Yahoo! Resources:

“I thought we fixed our data??”

How To Think about Data Quality


Yahoo! Data Quality Program
Mission: Consistent, high quality data across all Yahoo! User Data

[Diagram labels: Central Standards, Process, & Tools; Data Sourcing (Audience: Property, Operations; Advertiser: Booking, Serving); Data Processing; Cloud Storage; Aggs & Reporting; Embedded Ownership]

Improvement Progression


Data Quality Key Partnerships

Program Model

Data Advocates: Data Sourcing, Display Advertising, Search Advertising, Audience

Central Team: DQ Architects, DQ Engg Analysts, DQ PM, Infrastructure/Tools

• Product teams own and are accountable for the quality of the data they touch – each org must provide small incremental funding
• Proactive: Drive DQ in roadmap
• Reactive: QB issue triage
• Dashboards and executive DQ review
• Define, drive & enforce standards, process, tools & core capabilities (proactive & reactive) across all products & pipeline stages
• Training, Documentation & Program Management


Organizational Success Factors

Engage key DQ leverage points
• Organic alliances with people who feel DQ Pain
• “Data Quality is Everyone’s Job”
• Clear charter and communication in teams’ language

Evolutionary approach
• Start with inventory, then simple monitoring and add on
• Learn as you go and become recognized experts

Education, communication and teaming
• Why DQ is important and current status against goals
• How to use DQ tools and processes to improve
• What is needed from each team for DQ improvement?

Selling DQ through Data Quality Monitoring
Support through Statistics

Jeff Kibler


Engagement Model for Monitoring

[Flowchart. Lanes: Data Quality Team; Key Contact. Steps: Generate Alerts; Have Contact? (No / Yes); Identify BI or Key Customer; Add as Key Contact; Notify Contact of Alert; Investigate Alerts; Feedback to DQ Team; Consulting; Analytics; Display Commentary; Improve Monitoring]

Evolution of Value Monitoring
Phase 1 – Absolute Differences

[Figure labels: This Week’s Value; Last Week’s Value; Absolute Difference]

(+2) Customers Understand Logic
(+0) Limited to few data points
(-3) High Alpha (False Positive) and Beta (False Negative) Errors

Result: Wary Customers, DQ Team does most of analysis
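
The Phase 1 check can be sketched in a few lines of Python. This is only an illustration of the idea on the slide; the function name, the metric values, and the threshold are assumptions, not Yahoo!'s implementation.

```python
def absolute_difference_alert(this_week: float, last_week: float, threshold: float) -> bool:
    """Flag a metric when the week-over-week change exceeds a fixed threshold."""
    return abs(this_week - last_week) > threshold


# Hypothetical page-view counts compared against a hand-picked threshold.
print(absolute_difference_alert(1_050_000, 1_000_000, threshold=40_000))
print(absolute_difference_alert(1_020_000, 1_000_000, threshold=40_000))
```

A fixed threshold uses only two data points and knows nothing about the metric's normal variability, which is consistent with the high alpha and beta error rates noted above.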


Evolution of Value Monitoring
Phase 2 – z-Value Testing

[Figure: z-value axis from -3 z to +3 z; labels: This Week’s Value; Last Week’s Value; Diff]

(+2) Customers Understand Logic
(+1) Capability for more Customers
(-2) Moderate Alpha and Beta Errors
(-1) Moderate Post-Processing and “Hacks” to avoid Seasonality

Result: Larger scale operation with moderate levels of Alpha and Beta Errors
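
A minimal sketch of z-value testing, assuming the history of a metric is available as a plain list. The names `z_value` and `z_alert`, the sample data, and the limit of 3 standard deviations are illustrative assumptions; the slides do not state the limits actually used.

```python
from statistics import mean, stdev


def z_value(history: list[float], current: float) -> float:
    """Number of sample standard deviations between `current` and the historical mean."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least two points
    if sigma == 0:
        raise ValueError("history has no variation; z-value is undefined")
    return (current - mu) / sigma


def z_alert(history: list[float], current: float, z_limit: float = 3.0) -> bool:
    """Alert when the current value lies more than `z_limit` deviations from the mean."""
    return abs(z_value(history, current)) > z_limit


weekly_values = [100.0, 102.0, 98.0, 101.0, 99.0]  # hypothetical metric history
print(z_alert(weekly_values, 100.5))
print(z_alert(weekly_values, 120.0))
```

Because the mean and standard deviation are computed over the whole history, regular patterns such as weekend dips widen the limits, which may be one reason seasonality needed the post-processing mentioned above.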

Evolution of Value Monitoring
Proposed Phase 3 – Time Series Forecasting

[Figure labels: This Week’s Value; Last Week’s Value; Difference]

(+0) Customers Understand Logic
(+2) Expansion to more volatile datasets, customers
(+1) Low Alpha and Beta Errors
(+0) Low Post-Processing and “Hacks” to avoid Seasonality

Anticipated Results: Expansion of Methodology outside DQ Team – Common Usage
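
The slides do not name the forecasting model. As a stand-in, the sketch below uses a seasonal baseline: the next value is forecast from the same position in earlier seasons (for example, the same weekday in earlier weeks), and an alert fires when the actual value strays too far from that forecast. All names, the toy data, and the limit are assumptions.

```python
from statistics import mean, stdev


def seasonal_forecast(series: list[float], season: int = 7, seasons_back: int = 4):
    """Forecast the next point from the same slot in the previous seasons.

    Returns (forecast, spread), where spread is the sample standard deviation
    of the same-slot observations.
    """
    if seasons_back < 2:
        raise ValueError("need at least two seasons to estimate spread")
    if len(series) < season * seasons_back:
        raise ValueError("not enough history for the requested seasons")
    n = len(series)  # index the next observation will have
    same_slot = [series[n - season * k] for k in range(1, seasons_back + 1)]
    return mean(same_slot), stdev(same_slot)


def forecast_alert(series, actual, season=7, seasons_back=4, limit=3.0) -> bool:
    """Alert when the actual value is more than `limit` spreads from the forecast."""
    forecast, spread = seasonal_forecast(series, season, seasons_back)
    return abs(actual - forecast) > limit * spread


# Four hypothetical weeks of daily values with a weekend dip and slow growth.
week_shape = [100, 101, 99, 100, 102, 60, 61]
history = [v + w for w in range(4) for v in week_shape]

print(forecast_alert(history, 102))  # a normal Monday
print(forecast_alert(history, 62))   # a weekend-sized value on a Monday
```

In this toy data a single z-test over all 28 points would not flag the value 62, because the weekend dips inflate the overall standard deviation; comparing against the same weekday removes that seasonality without extra post-processing.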


Usage of Control Charts
Proportions Charting with critical ratios

Verifies Derived Ratios
Low Alpha Error Rate
Easy to Interpret
Easy to Implement
Very Narrow Scope
Beta Error Unknown
Dynamic Control Limits

Result: High Adoption and Usage Rate due to Low Alpha Error Rate and simplicity
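
A proportions (p) chart can be sketched as follows, using the textbook limits p̄ ± 3·sqrt(p̄(1 − p̄)/n). Because n differs per sample, each sample gets its own limits, which is one reading of “dynamic control limits”. The ratio monitored here (clicks per page view), the data, and the function names are hypothetical.

```python
from math import sqrt


def p_chart_limits(successes: list[int], totals: list[int], sigmas: float = 3.0):
    """Return (p_bar, limits), where limits holds one (lower, upper) pair per sample."""
    p_bar = sum(successes) / sum(totals)
    limits = []
    for n in totals:
        half_width = sigmas * sqrt(p_bar * (1 - p_bar) / n)
        limits.append((max(0.0, p_bar - half_width), min(1.0, p_bar + half_width)))
    return p_bar, limits


def out_of_control(successes: list[int], totals: list[int], sigmas: float = 3.0) -> list[int]:
    """Indices of samples whose proportion falls outside its own control limits."""
    _, limits = p_chart_limits(successes, totals, sigmas)
    return [
        i
        for i, (x, n) in enumerate(zip(successes, totals))
        if not limits[i][0] <= x / n <= limits[i][1]
    ]


clicks = [50, 60, 45, 55, 90]                # hypothetical daily clicks
page_views = [1000, 1200, 900, 1100, 1000]   # hypothetical daily page views
print(out_of_control(clicks, page_views))
```

For simplicity this sketch estimates p̄ from all samples, including the suspect one; in practice the center line is usually taken from a baseline period known to be in control.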

Traffic Protection in Yahoo!
Guarding from Malicious Traffic

Aleksey Chayka


Understanding people behind traffic

Traffic Protection: Motivation


Traffic Protection: Motivation

Main causes of bad quality data:

• Malicious users
• Robots

• Software
• Hardware

Data analysis: Filtering spikes and detectable issues


Traffic Protection: attributes for analysis

Yahoo! ID
User identification (?)
Consistent view on user activities in Yahoo! network
- Many people use Yahoo! without registration (e.g., search)
- Each user may have many yuid
- Each yuid may be used by many users/robots

IP address
User’s PC identification (?)
Geo location
- Public PC
- Dynamic IP address

Browser Cookie
Bind to one PC/user (?)
Reach statistics
Most traffic has browser cookies
- Public PC
- Cookie churn

Abuser identification

No. of clicks
No. of page views
No. of properties visited
No. of yuid used
No. of IP used
Average time spent on page or property, etc.

Machine learning
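
The features listed above could be computed from a raw event log roughly as follows. The `Event` record, its field names, and the choice of the browser cookie as the grouping key are assumptions made for illustration; the slides do not describe the log format or the learning algorithm, so the model itself is not shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    cookie: str             # browser cookie, used here as the grouping key (assumption)
    yuid: str               # Yahoo! ID, empty string when the user is not logged in
    ip: str
    prop: str               # property visited, e.g. "mail"
    kind: str               # "click" or "page_view"
    seconds_on_page: float


def abuser_features(events: list[Event]) -> dict[str, dict[str, float]]:
    """Aggregate per-cookie counts that a classifier could take as input."""
    grouped = defaultdict(list)
    for e in events:
        grouped[e.cookie].append(e)
    features = {}
    for cookie, evs in grouped.items():
        features[cookie] = {
            "clicks": sum(e.kind == "click" for e in evs),
            "page_views": sum(e.kind == "page_view" for e in evs),
            "properties": len({e.prop for e in evs}),
            "yuids": len({e.yuid for e in evs if e.yuid}),
            "ips": len({e.ip for e in evs}),
            "avg_seconds": sum(e.seconds_on_page for e in evs) / len(evs),
        }
    return features


events = [
    Event("c1", "u1", "1.1.1.1", "mail", "page_view", 30.0),
    Event("c1", "u1", "1.1.1.1", "mail", "click", 10.0),
    Event("c2", "u2", "2.2.2.2", "search", "click", 1.0),
    Event("c2", "u3", "2.2.2.3", "search", "click", 1.0),
    Event("c2", "", "2.2.2.4", "finance", "click", 1.0),
]
print(abuser_features(events))
```

In the toy data, cookie c2 shows many clicks, several yuids and IPs, and very short dwell times, the kind of profile these features are meant to expose to a learning algorithm.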


Data analysis: Dimension selection


Data analysis: Verification and Integration of rules

Questions

Jeff Kibler / Oleksiy Chayka

2021 S. First Street
Champaign, IL 61822


Backup

Z-Value Testing

• Identifies Outliers
• Assumes Normality
• Simple to implement
  – Standard Deviations from Average
• Scalable
  – Minimal computational power


Yahoo!’s Usage of Z-Value Testing

• Oracle-Driven
  – Stored Procedure (AIR)
  – Scalar Processing time
• Large-Scale Pipeline Monitoring
  – Adequate for consistent behavior
• Alert Generation

Limitations of Z-Value Testing in Yahoo!


pChart Control Charting

Benefits
1. Quick Adoption of DQ Principles by Stakeholders
2. Simple to Read
3. Fairly easy to Implement
4. Few Beta/Alpha Errors
5. Knowledge Transfer

Drawbacks
1. No Root Cause Analysis
2. No Accuracy Verification

Data analysis: Dimension selection
