oscon miller 2011

Mike Miller _milleratmit July 25, 2011 Bayes on your (Big)Couch

Upload: mlmilleratmit

Post on 16-May-2015

Category:

Technology

DESCRIPTION

The scheduled redis speaker was sick, so I whipped this up in about an hour and filled in on a different subject. It's a bit crude, but you get a big-picture view of how to build a simple AI application using BigCouch. The accompanying video is up at http://www.youtube.com/watch?v=QEBDNxbSRuk

TRANSCRIPT

Page 1: Oscon miller 2011

Mike Miller _milleratmit July 25, 2011

Bayes on your (Big)Couch

Page 2: Oscon miller 2011

Mike Miller, Oscon 2011 2

I want my app to do _this_

Page 3: Oscon miller 2011

CouchDB in a slide

• Schema-free document database management system
  Documents are JSON objects
  Able to store binary attachments

• RESTful API
  http://wiki.apache.org/couchdb/reference

• Views: Custom, persistent representations of your data
  Incremental MapReduce with results persisted to disk
  Fast querying by primary key (views stored in a B-tree)

• Bi-Directional Replication
  Master-slave and multi-master topologies supported
  Optional ‘filters’ to replicate a subset of the data
  Edge devices (mobile phones, sensors, etc.)

Page 4: Oscon miller 2011

BigCouch = Couch+Scaling

• Open Source, Apache License

• Horizontal Scalability
  Easily add storage capacity by adding more servers
  Computing power (views, compaction, etc.) scales with more servers

• No SPOF
  Any node can handle any request
  Individual nodes can come and go

• Transparent to the Application
  All clustering operations take place “behind the curtain”
  Looks (mostly) like a single server instance of CouchDB

Page 5: Oscon miller 2011

...back to making my app smart

Page 6: Oscon miller 2011

Sample Data

[Figure: "Height vs. Weight" scatter plot; x-axis: Weight [lbs], 80 to 220; y-axis: Height [in], 35 to 80; series: Girls, Boys]

Page 7: Oscon miller 2011

Naive Bayes Classifier

[Figure: Gaussian curve ("gaus"); x-axis from -3 to 3, y-axis from 0 to 0.4; slide labels: height, male, mean male height, male height variance]
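The labels on this slide (height, male, mean male height, male height variance) match the terms of the standard Gaussian naive Bayes likelihood. The slide's own formula did not survive in the transcript, so the following is the textbook form, written under that assumption; the symbols $\mu_{\text{male}}$, $\sigma_{\text{male}}^{2}$, $c$ and $x_i$ are introduced here for illustration.

```latex
% Likelihood of one numerical field (height) under one class (male)
p(\text{height} \mid \text{male}) =
  \frac{1}{\sqrt{2\pi\,\sigma_{\text{male}}^{2}}}
  \exp\!\left(-\frac{(\text{height}-\mu_{\text{male}})^{2}}{2\,\sigma_{\text{male}}^{2}}\right)

% Naive Bayes: fields x_1..x_n are treated as independent given the class c
P(c \mid x_1,\dots,x_n) \propto P(c)\prod_{i=1}^{n} p(x_i \mid c)
```

Only a mean and a variance per class and field are needed, which is why the rest of the talk focuses on computing those two numbers with MapReduce.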

Page 8: Oscon miller 2011

Implementation Plan

[Figure: "Height vs. Weight" scatter plot (Girls, Boys), repeated from the Sample Data slide]

Model people as documents in CouchDB

Calculate Means/Variances with MapReduce

Run classifier in CouchDB as a post-MapReduce hook (“_list”)

• Note:
  do not need to specify fields to use in classification
  multi-class implementation
  continuous, incremental training! Results improve as training data trickles in.
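The post-MapReduce hook in this plan is a CouchDB list function, which iterates over view rows with `getRow()` and writes the response with `send()`. A minimal sketch of the row-gathering part, assuming the view is queried with `group=true` so that each row has a `[class, field]` key and a `_stats` value. The name `listFn`, the stubbed helpers, and the row values are hypothetical, added only so the sketch runs under plain node; this is not the code from the talk's repository.

```javascript
// Stubs for the two CouchDB list-function helpers used below, so the sketch
// runs under plain node. The row values are made up for illustration.
const pendingRows = [
  { key: ['boy', 'height'], value: { sum: 130, count: 2, min: 62, max: 68, sumsqr: 8468 } },
  { key: ['girl', 'height'], value: { sum: 120, count: 2, min: 58, max: 62, sumsqr: 7208 } }
];
let responseBody = '';
function getRow() { return pendingRows.length ? pendingRows.shift() : null; }
function send(chunk) { responseBody += chunk; }

// List function: collect the reduced rows into {class: {field: stats}} and
// return that trained state together with the query parameters to classify.
function listFn(head, req) {
  const state = {};
  let row;
  while ((row = getRow()) !== null) {
    const cls = row.key[0];
    const field = row.key[1];
    if (!state[cls]) state[cls] = {};
    state[cls][field] = row.value;
  }
  send(JSON.stringify({ sample: req.query, state: state }));
}
```

In a real design document the function would go on to score `req.query` against `state` instead of echoing both back.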

Page 9: Oscon miller 2011

3 ways to follow along

• couchapp: python tool to push/pull from other couchdb’s
  > sudo easy_install -U couchapp
  > couchapp clone 'http://millertime.cloudant.com/bitb'
  create an account at cloudant.com
  > curl -X PUT 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
  > couchapp push 'http://<username>:<pwd>@<username>.cloudant.com/bitb'

• github
  > git clone git@github.com:mlmiller/bayes.git

• CouchDB replication to your cloudant account
  bonus, brings along the data, too!

Page 10: Oscon miller 2011

The Code

Classifier (Probability Calculator)

view code to calculate means and variances

post-MapReduce hook (“_list” method)

you can ignore everything else

client side test via node.js

Page 11: Oscon miller 2011

Data Model

‘class’ => training Data

Arbitrary number of numerical fields allowed
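To make the data model concrete, here is a minimal sketch of what such documents might look like. The `_id` values, the `class` labels, and the numbers are hypothetical; `height` and `weight` are taken from the sample data plot. Treating a document without `class` as one still to be classified is an assumption, not something the slide states.

```javascript
// Hypothetical training document: 'class' holds the label, and every other
// numerical field is treated as a feature. Values are illustrative only.
const trainingDoc = { _id: 'person-001', class: 'girl', height: 62, weight: 105 };

// Assumption: a document without 'class' is one to classify, not to train on.
const unlabeledDoc = { _id: 'person-002', height: 70, weight: 160 };

// True when the document carries a non-empty 'class' label.
function isTrainingDoc(doc) {
  return typeof doc['class'] === 'string' && doc['class'].length > 0;
}
```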

Page 12: Oscon miller 2011

Training via MapReduce

‘class’ => training Data

Calculate mean/variance for all numerical fields in a document

emit: ([<class>, <field>], <value>)

Reduce: _stats (Erlang builtin)

views/training/map.js
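The screenshot of views/training/map.js is not in the transcript, so the following is only a sketch of a map function with the emit pattern described above: one `([<class>, <field>], <value>)` row per numerical field. The `emit` stub and the `emitted` array are hypothetical stand-ins for CouchDB's view server, added so the sketch runs under node.

```javascript
// Hypothetical stand-in for CouchDB's emit(), so the sketch runs under node.
const emitted = [];
function emit(key, value) { emitted.push([key, value]); }

// Map function: one row per numerical field of each training document.
function map(doc) {
  if (!doc['class']) return; // only documents with a 'class' are training data
  for (const field in doc) {
    // Skip the label itself and CouchDB's internal fields (_id, _rev, ...).
    if (field === 'class' || field.charAt(0) === '_') continue;
    if (typeof doc[field] === 'number' && isFinite(doc[field])) {
      emit([doc['class'], field], doc[field]);
    }
  }
}
```

Because the function loops over whatever numerical fields a document has, no field names need to be configured, which is the "arbitrary number of numerical fields" property of the data model.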

Page 13: Oscon miller 2011

Bayes: Trained State

pre-reduce output

Page 14: Oscon miller 2011

Bayes: Trained State

Count, Min, Max, Mean, Variance

Automatically Updated as new training Data Arrives
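CouchDB's built-in `_stats` reduce returns `sum`, `count`, `min`, `max` and `sumsqr` for each key, so mean and variance have to be derived from those. A minimal sketch of that arithmetic; the function name `meanAndVariance` is hypothetical, and the variance here is the population variance.

```javascript
// Derive mean and (population) variance from one _stats reduce value,
// i.e. an object of the form {sum, count, min, max, sumsqr}.
function meanAndVariance(stats) {
  const mean = stats.sum / stats.count;
  // Var(x) = E[x^2] - (E[x])^2
  const variance = stats.sumsqr / stats.count - mean * mean;
  return {
    count: stats.count,
    min: stats.min,
    max: stats.max,
    mean: mean,
    variance: variance
  };
}
```

Since `sum`, `count` and `sumsqr` are all simple running totals, the reduce can be updated incrementally as new training documents arrive, with no need to revisit old ones.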

Page 15: Oscon miller 2011

Bayes Classifier

Load state from DB

No assumptions on Field Names

Calculate prob. for all possible hypotheses

lib/bayes_classifier.js
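The contents of lib/bayes_classifier.js are not in the transcript. The sketch below shows one way the three annotations could be realized, assuming Gaussian likelihoods, equal priors for all classes, and a trained state of the form `{class: {field: {mean, variance}}}`. The names `gaussian` and `classify` and the state layout are hypothetical.

```javascript
// Gaussian probability density for value x.
function gaussian(x, mean, variance) {
  const diff = x - mean;
  return Math.exp(-(diff * diff) / (2 * variance)) / Math.sqrt(2 * Math.PI * variance);
}

// Score a sample against every class in the trained state and return
// normalized probabilities. Assumes equal priors for all classes.
function classify(state, sample) {
  const scores = {};
  let total = 0;
  for (const cls of Object.keys(state)) {
    let p = 1;
    // No assumptions on field names: use whatever fields the sample has.
    for (const field of Object.keys(sample)) {
      const s = state[cls][field];
      if (!s || !(s.variance > 0)) continue; // field unseen or degenerate in training
      p *= gaussian(sample[field], s.mean, s.variance);
    }
    scores[cls] = p;
    total += p;
  }
  for (const cls of Object.keys(scores)) {
    scores[cls] = total > 0 ? scores[cls] / total : 0;
  }
  return scores;
}
```

With many fields the product of densities can underflow; summing logarithms instead is the usual remedy, left out here to keep the sketch short.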

Page 16: Oscon miller 2011

A brief aside...

• Let's test our classifier
  Select 2000 documents for test
  Randomly choose 1000 documents for training sample
  Remaining documents used for validation

• Simulate continuous training
  Add documents one at a time
  After each document addition, test on all 1000 documents of our validation sample
  Record and plot fraction of validation sample properly classified
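The continuous-training procedure above can be sketched as a short loop. This is a generic outline rather than the script used for the talk: `learningCurve` is a hypothetical name, and the `train` and `predict` callbacks stand in for whatever classifier is being evaluated.

```javascript
// Simulate continuous training: add training documents one at a time and,
// after each addition, record the fraction of the validation sample that
// is classified correctly.
//   train(docs)         -> model built from the documents seen so far
//   predict(model, doc) -> predicted class label for one document
function learningCurve(training, validation, train, predict) {
  const seen = [];
  const curve = [];
  for (const doc of training) {
    seen.push(doc);
    const model = train(seen);
    let correct = 0;
    for (const v of validation) {
      if (predict(model, v) === v['class']) correct += 1;
    }
    curve.push(correct / validation.length);
  }
  return curve;
}
```

In the CouchDB setup there is no explicit `train` step, since the view is updated automatically as each document is added; the callback only mirrors that behavior outside the database.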

Page 17: Oscon miller 2011

A brief aside...

[Figure: fraction of validation sample properly classified vs. number of documents in the training set]

Dramatic improvement with additional training data

Page 18: Oscon miller 2011

... and back to the code

Page 19: Oscon miller 2011

test it yourself

• Client side test via node.js
  > ./test.js height=<some number> weight=<some number>
  Classifier runs server side, configured in line 6 of test.js
  Can point this to your DB
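test.js itself is not reproduced in the transcript. As a rough idea of the client-side half, the sketch below turns `field=value` command-line arguments into a query string that could be appended to the server-side classifier's URL. The names `parseArgs` and `toQuery` are hypothetical, and how the real script builds its request is an assumption.

```javascript
// Parse arguments such as ['height=64', 'weight=120'] into {height: 64, weight: 120}.
// Arguments that are not of the form name=<number> are ignored.
function parseArgs(argv) {
  const sample = {};
  for (const arg of argv) {
    const idx = arg.indexOf('=');
    if (idx < 1) continue;
    const raw = arg.slice(idx + 1);
    if (raw === '') continue;
    const value = Number(raw);
    if (!Number.isNaN(value)) sample[arg.slice(0, idx)] = value;
  }
  return sample;
}

// Encode the sample as a URL query string, e.g. 'height=64&weight=120'.
function toQuery(sample) {
  return Object.keys(sample)
    .map(function (k) { return encodeURIComponent(k) + '=' + encodeURIComponent(sample[k]); })
    .join('&');
}
```

Under node, the arguments would come from `process.argv.slice(2)`.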

Page 20: Oscon miller 2011

Running as CouchApp

http://millertime.cloudant.com/bitb/_design/bayes/index.html

create a database (e.g., ‘bitb’) at cloudant.com
add data
then push your code
> couchapp push 'http://<user>:<pwd>@<user>.cloudant.com/bitb'
HTML & CSS served directly from BigCouch to the browser
Heavy lifting of classification done server side

Page 22: Oscon miller 2011

Wrapping Up: Bayes on BigCouch

• Simple code, powerful results

light requirements on data model
  can be relaxed with more complex view code
Continuous learning is very powerful
  e.g., time-based learning (automatically adapt to changing conditions)
Classification can be performed client- or server-side
  push documents into DB and they are auto-tagged!
More sophisticated classifiers easily implemented
  e.g., Cloudant Search pre-calculates and exposes TF-IDF scores for textual classification, weighted classifiers, etc.
View Engine allows simple deployment of sophisticated domain libraries in mass parallel
  e.g., Lucene, R, SciPy, NumPy, Matlab/Octave, etc.

Page 23: Oscon miller 2011

Give it a spin

Hosting, Management, Support for CouchDB and BigCouch
http://cloudant.com

http://github.com/cloudant/bigcouch