christopher bingham, crimson hexagon: better algorithms from bigger data

BETTER ALGORITHMS FROM BIGGER DATA

Chris Bingham, CTO, Crimson Hexagon

April 26th, 2012

INTRODUCTION

Crimson Hexagon and me

ABOUT CRIMSON HEXAGON

•Founded 4 years ago; now 40+ employees in Boston

•Help companies make actionable business decisions

•Based on unique analysis of social media and internal data

•Customers include F100, agencies, UN

•Tech stack:
  • Java, with R for algorithms
  • Massive Lucene infrastructure with custom shard management
  • Distributed computing framework for analysis
  • Hadoop increasingly used

BIG DATA, BETTER DATA, BETTER ALGORITHMS

•World’s largest searchable social media archive

•>200 billion posts in 2012

•Adding 1 billion every 2-3 days

•Twitter, Facebook, blogs, forums, comments, news, etc.


•Who’s talking and listening?
  • Demographics
  • Interests
  • Relationships

•Trends and comparisons
  • Compared to yourself, over time
  • Compared to industry, competitors, etc.

•Human input
  • Define specific business question and possible answers
  • Provides focus and context


•Based on work by co-founder Gary King at Harvard

•Takes all those billions of posts, plus the human input

•Leverages that human judgment at massive scale

•Quantitative answers to specific business questions

•Accurate in any language

ALGORITHMS AND BIG DATA

The problem of leverage

MACHINE LEARNING

Let’s consider a typical data-analysis problem using machine learning.

How does having more data help (or hurt) us?

DEFINE CATEGORIES

[Diagram: four categories, A through D]

Some set of user-defined categories (AKA topics, classes, etc.)

PROVIDE TRAINING

[Diagram: categories A through D with training examples]

Training examples to map features to categories

LEARN A MODEL

[Diagram: categories A through D]

Algorithm classifies items into categories based on training data

CLASSIFY ITEMS

[Diagram: categories A through D; incoming items w, x, y, z]

Incoming unknown items to be classified

OBTAIN RESULTS

[Diagram: items w, x, y, z placed into categories A through D]

Result: Items are classified, hopefully correctly!

DID IT WORK?

[Diagram: the algorithm’s placement of items w, x, y, z in categories A through D, side by side with the human placement]

Compare algorithm to human(s) to measure accuracy. Here, “z” was incorrectly classified.

ERROR RATE

We were wrong 25% of the time.

What happens when we add more data?

75% correct

25% wrong
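The workflow on the preceding slides (define categories, provide training, learn a model, classify items, compare to humans) can be sketched in a few lines. This is only a toy illustration: the texts, the word-overlap classifier, and the names `TRAINING`, `ITEMS`, `HUMAN_LABELS`, `train`, `classify`, and `error_rate` are all assumptions made for the example, not anything from the talk.

```python
from collections import Counter

# Hypothetical training examples: (text, human-assigned category).
TRAINING = [
    ("love the new phone", "A"),
    ("battery died again", "B"),
    ("price is too high", "C"),
    ("where can i buy one", "D"),
]

def train(examples):
    """Learn a bag-of-words profile (word counts) for each category."""
    model = {}
    for text, category in examples:
        model.setdefault(category, Counter()).update(text.split())
    return model

def classify(model, text):
    """Pick the category whose profile shares the most words with the text."""
    words = text.split()
    # sorted() makes tie-breaking deterministic (alphabetical by category).
    return max(sorted(model), key=lambda c: sum(model[c][w] for w in words))

def error_rate(predicted, truth):
    """Fraction of items where the algorithm disagrees with the human label."""
    wrong = sum(1 for item in truth if predicted[item] != truth[item])
    return wrong / len(truth)

# Hypothetical incoming items and the labels a human would give them.
ITEMS = {
    "w": "love this phone",
    "x": "my battery died",
    "y": "the price is high",
    "z": "need to buy after my battery died again",
}
HUMAN_LABELS = {"w": "A", "x": "B", "y": "C", "z": "D"}

model = train(TRAINING)
predicted = {name: classify(model, text) for name, text in ITEMS.items()}
print(predicted, error_rate(predicted, HUMAN_LABELS))
```

In this made-up data, item z mentions a dead battery but is really about buying, so the word-overlap rule puts it in B while the human says D: one mistake in four items.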

SCALE TO BIG DATA

We just make the same mistakes on a larger scale.

[Charts: 75% correct, 25% wrong at small scale; 75% correct, 25% wrong at large scale]

CAN MORE DATA HELP?

Can bigger data help us? In some ways.

• It can enable more types of analysis

• It can enable analysis of more categories

• It can provide more raw material for training and validation

What about accuracy?

[Diagram: categories A through F]

HUMAN SCALE

[Diagram: categories A through D; items w, x, z]

More training usually improves accuracy, but that takes not just more data but more humans.

Humans don’t scale.

FEEDBACK

[Diagram: categories A through D; item y]

For some applications, users can implicitly provide feedback through their use, e.g. ad placement or spam detection.

But this isn’t possible in all cases, and you can’t be too wrong to begin with.

BOOTSTRAPPING

[Diagram: categories A through D; classified items w, x, y, z]

We can also feed the classified items back into the training set (no human intervention).

Some incorrect classifications will become part of the training! But that doesn’t necessarily hurt.

BOOTSTRAPPING RESULT

[Diagram: categories A through D; items w, x, y, z plus additional classified items]

The more data you have, the more you can classify.

The more you classify, the more training data you obtain.

The more training data, the more accurate the results.

And we didn’t have to scale the human involvement.
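The bootstrapping loop described on these two slides is commonly called self-training. A minimal sketch follows, assuming a simple word-overlap classifier and a confidence margin for deciding which machine labels to trust. The names `SEED`, `UNLABELED`, `score`, `self_train`, and `min_margin` are hypothetical, and the sketch needs at least two categories in the seed set.

```python
from collections import Counter

def train(examples):
    """Learn a bag-of-words profile (word counts) for each category."""
    model = {}
    for text, category in examples:
        model.setdefault(category, Counter()).update(text.split())
    return model

def score(model, text):
    """Return (best category, margin over the runner-up) by word overlap."""
    words = text.split()
    ranked = sorted(
        ((sum(profile[w] for w in words), category)
         for category, profile in model.items()),
        reverse=True,
    )
    (best, category), (second, _) = ranked[0], ranked[1]
    return category, best - second

def self_train(labeled, unlabeled, min_margin=1, rounds=3):
    """Feed confidently classified items back into the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        still_unlabeled = []
        for text in pool:
            category, margin = score(model, text)
            if margin >= min_margin:
                labeled.append((text, category))  # machine label, may be wrong
            else:
                still_unlabeled.append(text)
        if len(still_unlabeled) == len(pool):
            break  # nothing new was learned this round
        pool = still_unlabeled
    return train(labeled), labeled, pool

# Hypothetical data: two human-labeled seeds, four unlabeled posts.
SEED = [("battery died", "B"), ("price high", "C")]
UNLABELED = [
    "battery drains fast",
    "drains fast overnight",
    "price keeps rising",
    "rising cost",
]

model, labeled, pool = self_train(SEED, UNLABELED)
print(len(labeled), pool)
```

Here "drains fast overnight" shares no words with the seeds, so it cannot be classified in the first round. Once "battery drains fast" has been absorbed into category B, the second round can place it, with no further human labeling.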

INDIVIDUAL VS. AGGREGATE

[Diagram: items w, x, y, z each classified into one of categories A through D]

So far we’ve considered classification of individual items. This is the conventional machine-learning approach.

Resulting category sizes: 25% A, 25% B, 50% C, 0% D
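Category sizes like those on this slide are what you get by classifying each item and then counting. A small sketch, assuming one particular assignment of w, x, y, z that reproduces the slide's percentages (the slide gives only the totals, so the per-item assignment and the name `category_shares` are hypothetical):

```python
from collections import Counter

CATEGORIES = ["A", "B", "C", "D"]

def category_shares(assignments, categories=CATEGORIES):
    """Turn per-item classifications into the share of items per category."""
    counts = Counter(assignments.values())
    total = len(assignments)
    return {c: counts[c] / total for c in categories}

# Assumed per-item results; any assignment with one A, one B and two Cs
# gives the same shares.
assignments = {"w": "A", "x": "B", "y": "C", "z": "C"}
shares = category_shares(assignments)
print(shares)
```

Because the shares are built from individual classifications, every misclassified item shifts them directly: one wrong item out of four moves two categories by 25 percentage points each.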


What if we want to know the size of each category, rather than which items are in which category?

e.g. epidemiology, polls, market research

[Diagram: items w, x, y, z and categories A through D]

When considered individually, there’s a limited amount of information we have about each item.

As a result, there will be limited correlation with the training data, and therefore poor accuracy.

[Diagram: a single item matched against the categories: A? B? C? D?]

[Chart: 75% correct, 25% wrong]

When considered in the aggregate, there’s much more data correlating with the training data for each category.

As a result, we can make more accurate estimates of the category proportions.

[Diagram: W+X+Y+Z combined, yielding %A, %B, %C, %D]

[Chart: 85% correct, 15% wrong]

Now, increasing the amount of data can actually increase the accuracy, with the same amount of human training data.

[Diagram: S+T+U+V+W+X+Y+Z combined, yielding %A, %B, %C, %D]

[Chart: 95% correct, 5% wrong]
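One way to see why aggregate estimates can improve with more unlabeled data is a deliberately simplified case: two categories and one binary feature. The feature's overall rate is a mixture, P(f) = s·P(f|A) + (1 − s)·P(f|B), which can be solved for the share s of category A. The sketch below assumes this simplification. It is not presented as Crimson Hexagon's algorithm, and the rates, counts, and function names are hypothetical.

```python
import math

def estimate_share_a(feature_rate_all, rate_given_a, rate_given_b):
    """Solve P(f) = s*P(f|A) + (1-s)*P(f|B) for s, the share of category A."""
    if rate_given_a == rate_given_b:
        raise ValueError("feature must occur at different rates in A and B")
    share = (feature_rate_all - rate_given_b) / (rate_given_a - rate_given_b)
    return min(1.0, max(0.0, share))  # clamp sampling noise into [0, 1]

def approx_standard_error(feature_rate_all, n_items, rate_given_a, rate_given_b):
    """Sampling error of the estimate, treating the training rates as exact."""
    sampling_sd = math.sqrt(feature_rate_all * (1 - feature_rate_all) / n_items)
    return sampling_sd / abs(rate_given_a - rate_given_b)

# Assumed feature rates learned from a small human-labeled training set.
RATE_A, RATE_B = 0.8, 0.2

# Hypothetical unlabeled stream: 1,000 posts, 620 of which contain the feature.
n_posts, n_with_feature = 1000, 620
observed_rate = n_with_feature / n_posts

share_a = estimate_share_a(observed_rate, RATE_A, RATE_B)
se_small = approx_standard_error(observed_rate, 100, RATE_A, RATE_B)
se_large = approx_standard_error(observed_rate, 10_000, RATE_A, RATE_B)
print(share_a, se_small, se_large)
```

With these assumed numbers, labeling every post that has the feature as A would report 62% A, while the mixture equation recovers 70%. The sampling error of that estimate shrinks with the square root of the number of unlabeled posts, so 100 times more data gives roughly a 10 times tighter estimate from the same training set.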

CONCLUSION

•Bigger data is important

•Better data is important

•Better algorithms are important

•The sweet spot is when one leverages the other

Bigger data can lead to better algorithms.

QUESTIONS?
