Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Consumer Analytics in Real Time: How InfoScout Tracks Purchase Behavior with Mechanical Turk

Jon Brelig, CTO, InfoScout
Sharon Chiarella, Vice President, Amazon Mechanical Turk

November 13, 2013

Upload: amazon-web-services

Post on 14-Jun-2015


DESCRIPTION

Understanding the factors that drive consumer purchase behavior makes brands better marketers. In this session, join the Vice President of Mechanical Turk to explore how retail businesses are marrying human judgment with large-scale data analytics without sacrificing efficiency or scalability. We’ll highlight real-world examples and introduce Jon Brelig, CTO of InfoScout, to explore how his company is leveraging a combination of automated methods and Mechanical Turk to build out a real-world analytics solution relied upon by brands such as P&G, Unilever, and General Mills. By extracting item-level purchase data from more than 40,000 consumer receipt images each day and associating it with specific products, brands, user surveys, and other digital marketing signals, InfoScout is able to rapidly gauge changes in consumer behavior and market share with remarkable granularity.

TRANSCRIPT

Page 1: Consumer Analytics in Real Time: InfoScout and Mechanical Turk (BDT206) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Consumer Analytics in Real Time: How InfoScout Tracks Purchase Behavior with Mechanical Turk

Jon Brelig, CTO, InfoScout
Sharon Chiarella, Vice President, Amazon Mechanical Turk

November 13, 2013

Page 2

Overview

– Receipt workflow
– Quality control
– Analytics

Page 3
Page 4

Wish I knew who that shopper was!

Page 5

Helping brands answer…
• Who’s buying my product?
• Who’s the end consumer?
• Why did they buy?
• When and where?
• How many?
• At what price?
• With what else?

Who’s the shopper? What’s their motive?

Page 6

How do we build a better panel?

Capture receipts through mobile

Page 7

Our mobile apps
• Receipt Hog: Put $ in your pocket!
• Shoparoo: Fundraise for a cause!

Page 8

Architecture

1. Capture receipt
2. Convert to structured data: computer vision + OCR + MTurk
3. Link to master data: scraping + classification models + human training
   – Example: receipt text "GAT G2 LMN LIME" = UPC 052000209648
   – Master data stored in MySQL (diagram label: target.com)
4. Data warehouse & prematerialize: MySQL, Amazon Redshift, Hadoop (Amazon EMR)
   – Tlog stored in Redshift
5. Build cool stuff on top of it: analytics, data firehose, hacks, etc.
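Step 3 maps abbreviated receipt text to a master-data record. The sketch below is only an illustration of that idea: the slide says InfoScout uses scraping, classification models, and human training, whereas this stand-in uses a simple abbreviation match. The in-memory table, both product descriptions, the second UPC, the function names, and the 0.75 threshold are all assumptions; only the receipt text and the first UPC come from the slide.

```python
from typing import Optional

# Hypothetical master-data rows (UPC -> description). Only the first UPC
# appears on the slide; the descriptions and the second row are made up.
MASTER_DATA = {
    "052000209648": "GATORADE G2 LEMON LIME",
    "000000000001": "GENERIC COLA 12 PACK",
}

def is_abbreviation(token: str, word: str) -> bool:
    """True if token's characters appear in order in word, same first character."""
    if not token or not word or token[0] != word[0]:
        return False
    remaining = iter(word)
    return all(ch in remaining for ch in token)

def match_score(receipt_text: str, description: str) -> float:
    """Fraction of receipt tokens that abbreviate some word of the description."""
    tokens = receipt_text.upper().split()
    words = description.upper().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if any(is_abbreviation(t, w) for w in words))
    return hits / len(tokens)

def link_to_master(receipt_text: str, min_score: float = 0.75) -> Optional[str]:
    """Return the best-matching UPC, or None so the line can go to a human."""
    best_upc, best_score = None, 0.0
    for upc, description in MASTER_DATA.items():
        s = match_score(receipt_text, description)
        if s > best_score:
            best_upc, best_score = upc, s
    return best_upc if best_score >= min_score else None
```

Returning None rather than a weak guess leaves room for the "human training" part of the step to handle lines the matcher cannot place.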

Page 9

Digitizing Receipts

Task is to convert image(s) of receipts => structured data

Page 10

Amazon Mechanical Turk

Page 11

Transcribing Receipts

Figure: workflow diagram
Auto Extract (OpenCV, OCR, Regex) -> Summary Extraction (Mechanical Turk) -> Itemized Extraction (Mechanical Turk) -> Score & Audit (Staff / Mechanical Turk) -> Complete
Diagram labels: "Can we skip?"; "User marks or staff rejects HIT"

• Isn’t OCR good enough?
– It is a solved problem… for books
– Low recognition on wrinkled receipts from mobile
• Hybrid of computer + human
– Leverage OCR & computer vision, fill gaps with humans
• Human = MTurk + small audit staff
– We leverage a 6-person team to act as the top audit layer of the system
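The Auto Extract stage and the "Can we skip?" decision can be sketched as follows. This covers only the regex step, applied to text that OCR has already produced. The patterns, field names, and skip rule are assumptions for illustration, not InfoScout's actual rules.

```python
import re

# Assumed patterns for two summary fields; real receipts vary far more.
TOTAL_RE = re.compile(r"\bTOTAL\b\s*\$?\s*(\d+\.\d{2})", re.IGNORECASE)
DATE_RE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{2,4})\b")

def auto_extract(ocr_text: str) -> dict:
    """Pull summary fields out of OCR text; a missing field is None."""
    total = TOTAL_RE.search(ocr_text)
    date = DATE_RE.search(ocr_text)
    return {
        "total": total.group(1) if total else None,
        "date": date.group(1) if date else None,
    }

def needs_human_summary(fields: dict) -> bool:
    """'Can we skip?': send to a Worker only if some field is missing."""
    return any(value is None for value in fields.values())

clean = "STORE 123\n11/13/2013\nSUBTOTAL 9.50\nTAX 0.50\nTOTAL $10.00"
wrinkled = "ST0RE l23\nT0TAL lO.OO"  # typical OCR confusions: 0/O, 1/l

print(auto_extract(clean), needs_human_summary(auto_extract(clean)))
print(auto_extract(wrinkled), needs_human_summary(auto_extract(wrinkled)))
```

The wrinkled example shows why the hybrid approach matters: when OCR confuses characters, the regex finds nothing and the receipt falls through to the human stage.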

Page 12

Summary Transcription

Figure: workflow diagram (Auto Extract -> Summary Extraction -> Itemized Extraction -> Score & Audit -> Complete)

Page 13

Summary Transcription

Figure: Receipts by Month (y-axis: 0 to 1,200,000)

How do we scale quality control with growing volume?

Page 14

Known Answers

• Publish HIT with at least one known answer to audit Worker accuracy
• Additional support provided by Amazon API
• Most effective when there is a concrete, expected answer
– e.g., multiple-choice answers

Figure label: Known Answer
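A minimal sketch of how Known Answers can be used to audit Workers: score each Worker only on the HITs where the expected answer is already known. The data shapes, function names, and 0.9 threshold are assumptions, and the code does not call the Mechanical Turk API.

```python
from collections import defaultdict

def worker_accuracy(submissions, known_answers):
    """Accuracy per Worker, measured only on HITs with a known answer.

    submissions: iterable of (worker_id, hit_id, answer)
    known_answers: dict of hit_id -> expected answer
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker_id, hit_id, answer in submissions:
        if hit_id not in known_answers:
            continue  # nothing to check this answer against
        seen[worker_id] += 1
        if answer == known_answers[hit_id]:
            correct[worker_id] += 1
    return {w: correct[w] / seen[w] for w in seen}

def flag_workers(accuracy, threshold=0.9):
    """Workers whose known-answer accuracy falls below the threshold."""
    return sorted(w for w, acc in accuracy.items() if acc < threshold)

known = {"h1": "Walmart", "h2": "Target"}
subs = [
    ("A", "h1", "Walmart"), ("A", "h2", "Target"),
    ("B", "h1", "Walmart"), ("B", "h2", "Kroger"),
    ("B", "h3", "anything"),  # h3 has no known answer, so it is ignored
]
print(flag_workers(worker_accuracy(subs, known)))
```

Exact string equality is the reason the technique suits concrete answers such as multiple choice; free-form transcription would need a looser comparison.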

Page 15

Known Answers

Figure: Net Cost per Receipt (y-axis: $0 to $0.03; series: InfoScout Review Cost, MTurk Cost; annotations: "Developed more efficient review process", "Transitioned to Known Answers")

Known Answers lowered our net cost per receipt from 2 cents to 1 cent

Page 16

Itemized Extraction

Figure: workflow diagram (Auto Extract -> Summary Extraction -> Itemized Extraction -> Score & Audit -> Complete)

Page 17

Itemized Extraction
• Transcribe every item on receipt
• HITs audited by review team, priority scored by:
– Comparing output to known OCR extraction
– Comparison to master data (e.g., did they “fat finger” a price or UPC?)
– Worker approval history
– Worker tenure (for InfoScout HITs)
– Additional features
• Not a great candidate for Known Answers…
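The priority scoring above could be combined into a single number per submission, with the highest scores audited first. This is a sketch under stated assumptions: the field names, the weights, and the 50-HIT tenure cutoff are invented for illustration, and the slide's "additional features" are not modeled.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    hit_id: str
    ocr_agreement: float         # 0..1, share of fields matching the OCR extraction
    master_data_mismatches: int  # prices or UPCs that disagree with master data
    approval_rate: float         # 0..1, Worker's historical approval rate
    prior_hits: int              # Worker tenure on this requester's HITs

def audit_priority(s: Submission) -> float:
    """Higher score means review sooner. All weights are assumed."""
    score = 0.4 * (1.0 - s.ocr_agreement)
    score += 0.3 * min(s.master_data_mismatches, 3) / 3
    score += 0.2 * (1.0 - s.approval_rate)
    score += 0.1 * (1.0 if s.prior_hits < 50 else 0.0)
    return score

def audit_queue(submissions):
    """Order submissions so the review team sees the riskiest first."""
    return sorted(submissions, key=audit_priority, reverse=True)

trusted = Submission("h1", ocr_agreement=0.95, master_data_mismatches=0,
                     approval_rate=0.99, prior_hits=5000)
risky = Submission("h2", ocr_agreement=0.40, master_data_mismatches=2,
                   approval_rate=0.80, prior_hits=10)
print([s.hit_id for s in audit_queue([trusted, risky])])
```

In practice the weights would more plausibly be learned from past audit outcomes than set by hand.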

How do we scale quality control for itemized extraction?

Page 18

Plurality
• HIT completed by >1 Worker
– InfoScout only sends HITs with low confidence to multiple Workers
• Higher quality, higher cost
– Limit costs by scientifically selecting HITs to send to a second Worker
• Multiple strategies when an answer discrepancy is found
– Ask a third Worker
– Leverage internal auditors

Figure: flow diagram (Publish HIT -> Worker 1 Submits -> Worker 2 Submits -> Match? -> YES -> Accept)
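The plurality flow can be sketched as a small decision function. The slide lists asking a third Worker and using internal auditors as alternative strategies; chaining them in that order, along with the confidence threshold and all names below, is an assumption.

```python
from collections import Counter
from typing import List, Optional, Tuple

def needs_second_worker(confidence: float, threshold: float = 0.8) -> bool:
    """Only low-confidence HITs are sent to more than one Worker."""
    return confidence < threshold

def normalize(answer: str) -> str:
    """Ignore case and spacing differences when comparing answers."""
    return " ".join(answer.upper().split())

def resolve(answers: List[str]) -> Tuple[str, Optional[str]]:
    """Return (decision, accepted_answer) for the answers collected so far."""
    normalized = [normalize(a) for a in answers]
    if len(normalized) < 2:
        return ("ask_another_worker", None)
    winner, votes = Counter(normalized).most_common(1)[0]
    if votes >= 2:
        return ("accept", winner)            # at least two Workers match
    if len(normalized) == 2:
        return ("ask_another_worker", None)  # discrepancy: ask a third Worker
    return ("send_to_auditor", None)         # still no agreement

print(resolve(["12.99", " 12.99 "]))
print(resolve(["12.99", "12.49"]))
print(resolve(["12.99", "12.49", "12.99"]))
```

Gating on needs_second_worker first is what keeps the extra cost limited to the HITs that are most likely to be wrong.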

Page 19

HIT Acceptance Latency

Figure: Minutes to Accept over time (y-axis: 0 to 700 minutes; x-axis: 12/22/12 to 6/22/13; annotation: "Changed Template")

• Measures HIT demand
• Template change decreased demand temporarily, but Workers acclimated

Page 20

Worker Retention

Figure: Total HITs Completed (axis: 0 to 700,000) and % Completed by Retained Workers (axis: 0% to 100%); series: HITs Complete (New Workers), HITs Complete (Retained Workers)

Within two months, 80% of HITs were completed by returning Workers

Page 21

Pareto of Worker Volume

Figure: % of all HITs completed by Worker Percentile (y-axis: 0% to 90%; bins: Top 5%, 6-10%, 10-20%, 21-50%, 51-100%)

Our top 5% (~500) active Workers account for >80% of all HITs completed
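The Pareto figure reduces to one calculation: the share of all HITs completed by the most active fraction of Workers. A minimal sketch, assuming a simple mapping of Worker ID to HIT count; the sample numbers are invented and are not InfoScout's data.

```python
import math

def top_share(hits_per_worker: dict, top_fraction: float = 0.05) -> float:
    """Share of all HITs completed by the top `top_fraction` of Workers."""
    counts = sorted(hits_per_worker.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always include at least one Worker in the top group.
    k = max(1, math.ceil(len(counts) * top_fraction))
    return sum(counts[:k]) / total

# Invented example: 20 Workers, one of whom does most of the work.
hits = {f"w{i}": 1 for i in range(19)}
hits["big"] = 81
print(top_share(hits))
```

Tracking this number over time would show how dependent throughput is on a small group of returning Workers.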

Page 22

Analytics Demo

Page 23

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

BDT206

Page 24

Appendix

Page 25

Quality Control Strategies
• Filter incoming Workers
– Qualifications
• Increase quality during completion
– Template validation
– Template instructions
• Post submission
– Plurality (multiple HITs per task)
– Known Answers
– Workers audit Workers

Multiple strategies can yield high accuracy

Figure: flow diagram (labels: HIT, Approve/Reject?, Enhance)

Page 26

HIT templates
• Clear & concise instructions
– The first time, each Worker sees detailed instructions and can hide them once comfortable
• Keyboard shortcuts
• Maximize validation
– Client-side and/or AJAX validation
• Bonus rewards
– Nice option for rewarding Workers, especially when HITs are variable in length & time