Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

BETTER ALGORITHMS FROM BIGGER DATA Chris Bingham, CTO, Crimson Hexagon April 26th, 2012

DESCRIPTION

Often, analyzing more and more data doesn’t improve your results: you just make the same mistakes at a larger scale. Crimson Hexagon CTO Christopher Bingham discusses several techniques that leverage the quantity of data, increasing accuracy as you scale. Big data can thus lead to better analysis, not just bigger analysis.

TRANSCRIPT

Page 1: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

BETTER ALGORITHMS FROM BIGGER DATA

Chris Bingham, CTO, Crimson Hexagon

April 26th, 2012

Page 2: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

INTRODUCTION

Crimson Hexagon and me

Page 3: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

ABOUT CRIMSON HEXAGON

•Founded 4 years ago; now 40+ employees in Boston

•Help companies make actionable business decisions

•Based on unique analysis of social media and internal data

•Customers include F100, agencies, UN

•Tech stack:
  • Java, with R for algorithms
  • Massive Lucene infrastructure with custom shard management
  • Distributed computing framework for analysis
  • Hadoop increasingly used

Page 4: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

BIG DATA, BETTER DATA, BETTER ALGORITHMS

•World’s largest searchable social media archive

•>200 billion posts in 2012

•Adding 1 billion every 2-3 days

•Twitter, Facebook, blogs, forums, comments, news, etc.

Page 5: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

BIG DATA, BETTER DATA, BETTER ALGORITHMS

•Who’s talking and listening?
  • Demographics
  • Interests
  • Relationships

•Trends and comparisons
  • Compared to yourself, over time
  • Compared to industry, competitors, etc.

•Human input
  • Define specific business question and possible answers
  • Provides focus and context

Page 6: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

BIG DATA, BETTER DATA, BETTER ALGORITHMS

•Based on work by co-founder Gary King at Harvard

•Takes all those billions of posts, plus the human input

•Applies that human judgment at massive scale

•Quantitative answers to specific business questions

•Accurate in any language

Page 7: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

ALGORITHMS AND BIG DATA

The problem of leverage

Page 8: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

MACHINE LEARNING

Let’s consider a typical data-analysis problem using machine learning.

How does having more data help (or hurt) us?

Page 9: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

DEFINE CATEGORIES

[Diagram: categories A, B, C, D]

Some set of user-defined categories (AKA topics, classes, etc.)

Page 10: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

PROVIDE TRAINING

[Diagram: categories A, B, C, D]

Training examples to map features to categories

Page 11: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

LEARN A MODEL

[Diagram: categories A, B, C, D]

Algorithm classifies items into categories based on training data

Page 12: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

CLASSIFY ITEMS

[Diagram: categories A, B, C, D; incoming items w, x, y, z]

Incoming unknown items to be classified

Page 13: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

OBTAIN RESULTS

[Diagram: items w, x, y, z placed into categories A, B, C, D]

Result: Items are classified, hopefully correctly!

Page 14: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

DID IT WORK?

[Diagram: two assignments of items w, x, y, z to categories A, B, C, D, one by the algorithm and one by humans]

Compare the algorithm to human(s) to measure accuracy. Here, “z” was incorrectly classified.
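The define, train, classify, and compare steps from the last few slides can be sketched in a few lines of Python. This is only an illustration, not Crimson Hexagon's algorithm: the word-count scoring, the function names `train`, `classify`, and `accuracy`, and all of the example texts are assumptions made up for this sketch. The data is arranged so that item z is the one the algorithm gets wrong.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count how often each word appears in each category's training text."""
    model = defaultdict(Counter)
    for text, category in examples:
        model[category].update(text.lower().split())
    return model


def classify(model, text):
    """Pick the category whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {cat: sum(counts[w] for w in words) for cat, counts in model.items()}
    # sorted() makes ties resolve alphabetically, so the result is deterministic
    return max(sorted(scores), key=scores.get)


def accuracy(predicted, human):
    """Fraction of items where the algorithm agrees with the human label."""
    return sum(p == h for p, h in zip(predicted, human)) / len(human)


# Hypothetical training examples for categories A-D
training = [
    ("great love amazing", "A"),
    ("terrible hate awful", "B"),
    ("price cost expensive", "C"),
    ("battery charge power", "D"),
]
# Hypothetical incoming items w, x, y, z, each with the label a human would assign
items = {
    "w": ("love it amazing", "A"),
    "x": ("awful i hate it", "B"),
    "y": ("too expensive", "C"),
    "z": ("battery is great great", "D"),  # "great" outvotes "battery"
}

model = train(training)
predicted = [classify(model, text) for text, _ in items.values()]
human = [label for _, label in items.values()]
print(predicted, accuracy(predicted, human))  # ['A', 'B', 'C', 'A'] 0.75
```

Three of the four items agree with the human labels, which gives the 75% / 25% split used on the next slide.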

Page 15: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

ERROR RATE

We were wrong 25% of the time.

What happens when we add more data?

75% correct

25% wrong

Page 16: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

SCALE TO BIG DATA

We just make the same mistakes on a larger scale.

[Two charts, each showing 75% correct and 25% wrong]
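A small simulation makes the point concrete. It assumes, purely for illustration, that every item is misclassified independently with the same probability; the function name `error_rate` and its parameters are hypothetical.

```python
import random


def error_rate(n_items, per_item_error=0.25, seed=0):
    """Simulate classifying n_items, each wrong with the same fixed probability."""
    rng = random.Random(seed)
    wrong = sum(rng.random() < per_item_error for _ in range(n_items))
    return wrong / n_items


# Feeding the same classifier more items does not move the error rate
for n in (10_000, 200_000):
    print(n, round(error_rate(n), 3))
```

Both runs land near 25% wrong: a fixed per-item error rate is unaffected by how many items are processed.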

Page 17: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

CAN MORE DATA HELP?

Can bigger data help us? In some ways.

• It can enable more types of analysis

• It can enable analysis of more categories

• It can provide more raw material for training and validation

What about accuracy?

[Diagram: categories A, B, C, D, E, F]

Page 18: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

HUMAN SCALE

[Diagram: categories A, B, C, D]

More training usually improves accuracy, but for that we need not just more data but more humans.

Humans don’t scale.

Page 19: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

FEEDBACK

[Diagram: items w, x, y, z and categories A, B, C, D]

For some applications, users can implicitly provide feedback through their use, e.g. ad placement or spam detection.

But this isn’t possible in all cases, and you can’t be too wrong to begin with.

Page 20: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

BOOTSTRAPPING

[Diagram: items w, x, y, z and categories A, B, C, D]

We can also feed the classified items back into the training set (no human intervention).

Some incorrect classifications will become part of the training! But that doesn’t necessarily hurt.

Page 21: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

BOOTSTRAPPING RESULT

[Diagram: categories A, B, C, D with items w, x, y, z and additional classified items]

The more data you have, the more you can classify.

The more you classify, the more training data you obtain.

The more training data, the more accurate the results.

And we didn’t have to scale the human involvement.
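The bootstrapping loop described above can be sketched as a simple self-training routine. This is a minimal sketch under stated assumptions, not the method used in the talk: the word-count classifier, the names `train`, `classify`, and `bootstrap`, the example texts, and in particular the `min_margin` confidence rule (only items with a clear winner are fed back) are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count how often each word appears in each category's training text."""
    model = defaultdict(Counter)
    for text, category in examples:
        model[category].update(text.lower().split())
    return model


def classify(model, text, min_margin=1):
    """Return the best category, or None if it does not clearly beat the runner-up.

    Assumes the model contains at least two categories.
    """
    words = text.lower().split()
    scores = {cat: sum(counts[w] for w in words) for cat, counts in model.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    (best, top), (_, second) = ranked[0], ranked[1]
    return best if top - second >= min_margin else None


def bootstrap(seed_examples, unlabeled, rounds=3):
    """Feed the algorithm's own classifications back into the training set."""
    examples = list(seed_examples)
    remaining = list(unlabeled)
    for _ in range(rounds):
        model = train(examples)
        still_unknown = []
        for text in remaining:
            label = classify(model, text)
            if label is None:
                still_unknown.append(text)
            else:
                examples.append((text, label))  # no human involved here
        if len(still_unknown) == len(remaining):
            break  # nothing new was learned this round
        remaining = still_unknown
    return train(examples), remaining


# Hypothetical human-labeled seed data and unlabeled incoming items
seed = [("love great", "A"), ("hate awful", "B")]
unlabeled = ["great fantastic", "fantastic superb", "awful dreadful", "dreadful"]

model, leftover = bootstrap(seed, unlabeled)
print(classify(train(seed), "superb"))  # None: the seed data never saw this word
print(classify(model, "superb"))        # A: learned through bootstrapped items
```

The human-labeled seed set stays at two examples, yet the vocabulary the model can act on grows with each round. Mistakes fed back this way would also be learned, which is why the margin rule is included here.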

Page 22: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

INDIVIDUAL VS. AGGREGATE

[Diagram: items w, x, y, z each assigned to one of the categories A, B, C, D]

So far we’ve considered classification of individual items. This is the conventional machine-learning approach.

Page 23: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

INDIVIDUAL VS. AGGREGATE

What if we want to know the size of each category, rather than which items are in which category?

e.g. epidemiology, polls, market research

[Diagram: items w, x, y, z; category shares 25% A, 25% B, 50% C, 0% D]
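In code, the aggregate question asks for a distribution over categories rather than a label per item. The sketch below uses a hypothetical assignment of items w, x, y, z chosen to reproduce the shares shown on the slide; the function name `category_shares` is likewise an assumption.

```python
from collections import Counter


def category_shares(labels, categories):
    """Return the fraction of items falling into each category."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total for c in categories}


# Hypothetical per-item labels; only the resulting shares matter here
labels = {"w": "A", "x": "B", "y": "C", "z": "C"}
shares = category_shares(list(labels.values()), "ABCD")
print(shares)  # {'A': 0.25, 'B': 0.25, 'C': 0.5, 'D': 0.0}
```

Swapping the labels of y and w would change the individual answers but leave the shares untouched, which is what makes the aggregate question a different problem.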

Page 24: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

INDIVIDUAL VS. AGGREGATE

When items are considered individually, we have only a limited amount of information about each one.

As a result, there will be limited correlation with the training data, and therefore poor accuracy.

[Diagram: each item w, x, y, z matched on its own against categories A?, B?, C?, D?; 75% correct, 25% wrong]

Page 25: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

INDIVIDUAL VS. AGGREGATE

When considered in the aggregate, there’s much more data correlating with the training data for each category.

As a result, we can make more accurate estimates of the category proportions.

[Diagram: W+X+Y+Z = %A, %B, %C, %D; 85% correct, 15% wrong]

Page 26: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

INDIVIDUAL VS. AGGREGATE

Now, increasing the amount of data can actually increase the accuracy, with the same amount of human training data.

[Diagram: S+T+U+V+W+X+Y+Z = %A, %B, %C, %D; 95% correct, 5% wrong]
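One way to see how aggregate estimates can improve with volume while per-item accuracy stays fixed is a small simulation. The slides do not describe the actual estimator, so this sketch swaps in a standard stand-in technique: correcting the observed category share using the classifier's known error rate. Everything here is an assumption for illustration: the two-category simplification, the name `simulate`, the 30% true share, and the idea that the 75% accuracy is known in advance (in practice it would be estimated from the human-labeled training set).

```python
import random


def simulate(n_items, true_share_a=0.30, accuracy=0.75, seed=1):
    """Return (per-item accuracy, raw share of A, corrected share of A)."""
    rng = random.Random(seed)
    correct = 0
    predicted_a = 0
    for _ in range(n_items):
        truth = "A" if rng.random() < true_share_a else "B"
        right = rng.random() < accuracy
        guess = truth if right else ("B" if truth == "A" else "A")
        correct += right
        predicted_a += guess == "A"
    observed = predicted_a / n_items
    # observed = accuracy * share + (1 - accuracy) * (1 - share); solve for share
    corrected = (observed - (1 - accuracy)) / (2 * accuracy - 1)
    return correct / n_items, observed, corrected


for n in (100, 10_000, 200_000):
    acc, observed, corrected = simulate(n)
    print(n, round(acc, 3), round(observed, 3), round(corrected, 3))
```

Per-item accuracy hovers around 75% at every size, and the raw share stays biased toward 40%, but the corrected share closes in on the true 30% as the number of items grows. No additional human labeling is involved in that improvement.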

Page 27: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

CONCLUSION

•Bigger data is important

•Better data is important

•Better algorithms are important

•The sweet spot is when one leverages the other

Bigger data can lead to better algorithms.

Page 28: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data

QUESTIONS?