bigdata and algorithms - la algorithmic trading

LA Algorithmic Trading Meetup - Winter 2013 Algorithmic Trading (In Los Angeles)

Upload: tim-shea

Post on 27-May-2015

Category:

Technology


DESCRIPTION

Slides from LA Algorithmic Trading event (http://www.meetup.com/LA-Algorithmic-Trading/events/98963812/) on using BigData and Algorithms in your business. Covers how What'sGood uses algorithms to allow users to make choices about food on the go and the "BigData" infrastructure we've built to support them. Includes topics such as "big" data ingestion, in-stream processing, NLP algorithms, assessing "popularity", assigning relevancy weights in search, adding "dimensionality" to restaurant menus by cleansing public data sets, and mapping loosely correlated datasets into your own.

TRANSCRIPT

Page 1: BigData and Algorithms - LA Algorithmic Trading

LA Algorithmic Trading Meetup - Winter 2013

Algorithmic Trading

(In Los Angeles)

Page 2: BigData and Algorithms - LA Algorithmic Trading

What’s Good?

Page 3: BigData and Algorithms - LA Algorithmic Trading

Welcome

Tim Shea \ [email protected] \ @sheanineseven

Data Scientist

Ad Agency Guy (Razorfish, Universal, TBWA\Chiat\Day)

Founder and CTO of WhatsGood.com

Big interest in convergence of Tech and Finance communities

Page 4: BigData and Algorithms - LA Algorithmic Trading

Elevator Pitch

Digital Menu Platform for picky eaters on-the-go.

Data-centric POV

Search/Sort/Slice/Dice, Answering “What’s Good Here”?

The “Good” in WhatsGood varies by person.

Page 5: BigData and Algorithms - LA Algorithmic Trading

“Dimensionality”

Hundreds of data points *behind* each menu item.

This data is *hidden* by traditional analog menus.

Dimensionality = Personalization.

Page 6: BigData and Algorithms - LA Algorithmic Trading

Thomas Guide

Thomas Guide :: Google Earth

As

Paper Menu :: What’sGood

Page 7: BigData and Algorithms - LA Algorithmic Trading

Hypothesis

Page 8: BigData and Algorithms - LA Algorithmic Trading

We’re Empiricists

Page 9: BigData and Algorithms - LA Algorithmic Trading

Problem

Noise

80/20 - In any scenario where you’re ordering food (e.g. at home, in a restaurant, etc.) 80% of menu info is noise.

Bad in-store. Worse when considering multiple locations.

Paper menus don’t help this situation at all.

Page 10: BigData and Algorithms - LA Algorithmic Trading

Result

Human error.

Leads to:

Frustration - “I’ll just get what I usually get”

Alienation - “I’m going out with my meat-eating friends, I’ll just bring a granola bar”

Accidents - “The waiter didn’t know there was soy sauce in there, and I ended up in the hospital”

Page 11: BigData and Algorithms - LA Algorithmic Trading

Hypothesis

BigData + Machine Learning + The Crowd

Will remove these pain points.

And create something truly valuable for people.

Literally improve the way we discover food, permanently.

Page 12: BigData and Algorithms - LA Algorithmic Trading

What’sGood Algos

Page 13: BigData and Algorithms - LA Algorithmic Trading

ClydeStorm

Page 14: BigData and Algorithms - LA Algorithmic Trading

Components

FoodNet + Vegas8 + Rhombus

Page 15: BigData and Algorithms - LA Algorithmic Trading

ClydeStorm

Menu Ingestion - Every 2 weeks, reconcile 400,000 Restaurants and 50MM Menu Items (Add/Edit/Delete)

NLP Classifiers - Then, for every dish, we run 8 NLP classifiers to determine (V,G,N,L,P,&Pop)

Data Mapping - Orthogonal datasets that “don’t quite fit”

Search - Handles all the modern indexing and retrieval operations consumers are accustomed to.

Page 16: BigData and Algorithms - LA Algorithmic Trading

Vegas8

Based on a simple human Intuition:

“Signal Words” help us make 1 of 3 determinations:

1. Definitely Positive - “Vegan”: All bets are off, obviously vegan.

2. Strongly Negative - “Ribeye Steak”: Pretty damn confident, not vegan.

3. Fuzzy Signal - Not enough info, conflicting info, fuzzy signal.
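A minimal sketch of the three-way determination above. The word lists and the name `classify_vegan` are hypothetical, chosen only for illustration; the slides do not show the real lists or code.

```python
import re

# Hypothetical signal lists; the real Vegas8 lists are not given in the slides.
POSITIVE_SIGNALS = {"vegan"}
NEGATIVE_SIGNALS = {"ribeye", "steak", "bacon", "chicken"}


def classify_vegan(text):
    """Return 'positive', 'negative', or 'fuzzy' for a dish title or description."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    has_positive = bool(tokens & POSITIVE_SIGNALS)
    has_negative = bool(tokens & NEGATIVE_SIGNALS)
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    # Either no signal word at all, or conflicting signal words.
    return "fuzzy"
```

With these lists, "Ribeye Steak" comes back negative, while "House Salad" (no signal) and "Vegan bacon burger" (conflicting signals) both come back fuzzy.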

Page 17: BigData and Algorithms - LA Algorithmic Trading

The Intuition

Page 18: BigData and Algorithms - LA Algorithmic Trading

FoodNet

Based loosely on WordNet - Open Source Princeton project

Lexical Knowledge Graph of word relations (vs a list)

ex. Obviously “MILK” is a signal for “Contains Lactose”

But so are all of its other permutations:

- Synonyms

- Hyper- & Hypo-nyms

- Other languages

- All the foods in the world that commonly use MILK as an ingredient
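One way to picture the "graph vs list" idea: starting from MILK, walk every relation edge and collect everything reachable. This is only a sketch; the graph fragment `FOOD_GRAPH` and the function `expand_signal` are made up for illustration and are not FoodNet's actual data or API.

```python
from collections import deque

# Hypothetical fragment of a lexical graph: each term points to related terms
# (synonyms, hyponyms, translations, dishes that commonly contain it).
FOOD_GRAPH = {
    "milk": ["leche", "cream", "cheese", "latte"],
    "cheese": ["mozzarella", "queso"],
    "cream": ["creme brulee"],
}


def expand_signal(root, graph):
    """Collect every term reachable from `root` with a breadth-first walk."""
    seen = {root}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for related in graph.get(term, []):
            if related not in seen:
                seen.add(related)
                queue.append(related)
    return seen


lactose_signals = expand_signal("milk", FOOD_GRAPH)
```

A flat list would need "mozzarella" added by hand; in the graph it is picked up through the "cheese" node.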

Page 19: BigData and Algorithms - LA Algorithmic Trading

First Attempt

Page 20: BigData and Algorithms - LA Algorithmic Trading

First Version

Read from Menu DB - 50MM Venue, Dish Title & Description

Read from Synonym DB - Slam it into a big RegEx

For Each record - Any matches?

Save Results
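The steps above might have looked roughly like this. It is a guess at the shape of the approach, not the original code: the synonym list is a hypothetical stand-in for the Synonym DB, and the database reads and writes are left out.

```python
import re

# Hypothetical synonym list standing in for the Synonym DB.
SYNONYMS = ["milk", "cream", "butter"]

# "Slam it into a big RegEx": one alternation over every synonym.
BIG_REGEX = re.compile("|".join(re.escape(s) for s in SYNONYMS), re.IGNORECASE)


def contains_lactose_signal(description):
    """True if any synonym appears anywhere in the text, even inside another word."""
    return BIG_REGEX.search(description) is not None
```

Because the pattern is unanchored, "Butternut Squash Soup" matches on "butter", which is the kind of surprise a raw alternation tends to produce.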

Page 21: BigData and Algorithms - LA Algorithmic Trading

Results?

Mediocre

- Took forever to run

- Unexpected results (think: RegEx)

- Tons of edge cases

Page 22: BigData and Algorithms - LA Algorithmic Trading

Algorithms and NLP

Page 23: BigData and Algorithms - LA Algorithmic Trading

Stepping Back

How do we find better tools for the job?

How do we measure any improvements we make?

Is there a more “Algorithmic” approach?

Such as Machine Learning in general, or NLP specifically?

Page 24: BigData and Algorithms - LA Algorithmic Trading

Not #NLP

*Not* Neuro-Linguistic Programming

Page 25: BigData and Algorithms - LA Algorithmic Trading

What is NLP?

Natural Language Processing

Attempt to formalize the ways in which humans understand language, into a computer program.

Slippery - We’re not accustomed to thinking about how we understand each other, we just do it.

Page 26: BigData and Algorithms - LA Algorithmic Trading

Widely Applicable

Sentiment Analysis - What’s the overall mood here?

Text Classification - What is this document I’m reading?

Knowledge Mapping - Which things relate to which?

Info Extraction - What are the major topics discussed?

Page 27: BigData and Algorithms - LA Algorithmic Trading

What’sGood Use Cases

Page 28: BigData and Algorithms - LA Algorithmic Trading

1. Similarity

Page 29: BigData and Algorithms - LA Algorithmic Trading

Are these the same?

A Frame, 12565 W Washington Blvd, Culver City, CA, 90066

A-Frame, 12565 Washington Blvd, Los Angeles, CA, 90066
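A rough sketch of how the two records could be compared, assuming each is already split into (name, street, city, state, ZIP). The 0.85 threshold, the choice to ignore the city, and the function names are assumptions for illustration, not the method used in the talk.

```python
import re
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", value.lower()).split())


def field_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def same_venue(rec_a, rec_b, threshold=0.85):
    """Records are (name, street, city, state, zip); city is ignored on purpose."""
    if rec_a[4] != rec_b[4]:
        return False
    name_score = field_similarity(rec_a[0], rec_b[0])
    street_score = field_similarity(rec_a[1], rec_b[1])
    return min(name_score, street_score) >= threshold


a = ("A Frame", "12565 W Washington Blvd", "Culver City", "CA", "90066")
b = ("A-Frame", "12565 Washington Blvd", "Los Angeles", "CA", "90066")
```

Here `same_venue(a, b)` treats the two as the same place: the names normalize to the same string and the streets differ only by the "W".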

Page 30: BigData and Algorithms - LA Algorithmic Trading

This Problem

Creme Brulee

vs

Crème brûlée

vs

Cr�e Br\u001lee
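For the first two spellings, Unicode normalization is one common fix. The sketch below uses Python's standard `unicodedata` module; the function name `fold_accents` is made up for illustration.

```python
import unicodedata


def fold_accents(text):
    """Decompose accented characters (NFKD), drop combining marks, and casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()
```

Both "Creme Brulee" and "Crème brûlée" fold to the same key. The third, mis-encoded spelling has already lost characters, so it cannot be repaired this way and needs fuzzy matching instead.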

Page 31: BigData and Algorithms - LA Algorithmic Trading

This Problem

Page 32: BigData and Algorithms - LA Algorithmic Trading

Orthogonality

Rhombus - The What’sGood Decoder Ring

Library that attempts to resolve “Matching Problems”

For Example: Public Calorie Database - Can I even use it?

Page 33: BigData and Algorithms - LA Algorithmic Trading

TextGrounder

Disambiguate:

- Georgia vs Georgia

Context:

- Melrose Heights vs West Hollywood vs Los Angeles

Page 34: BigData and Algorithms - LA Algorithmic Trading

2. Sentiment

Page 35: BigData and Algorithms - LA Algorithmic Trading

“Bag of Words”

Used with a Naive Bayes Classifier

Tokenize

Remove Stop Words

Stemming the remaining words

Frequency Distribution - How many times did this occur?
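The four preprocessing steps above, sketched with the standard library only. The stop-word list is a tiny illustrative sample and `crude_stem` is a deliberately rough stand-in for a real stemmer (NLTK ships proper ones); neither comes from the talk. The classifier itself is not shown.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one is much longer.
STOP_WORDS = {"a", "an", "and", "the", "with", "of", "in", "was", "were", "it"}


def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Tokenize, drop stop words, stem, and count what is left."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    stems = [crude_stem(t) for t in kept]
    return Counter(stems)


bag = bag_of_words("Loved the tacos, loving the salsa")
```

In this example "Loved" and "loving" collapse to the same stem, so the resulting counts treat them as one word seen twice. Word order is discarded entirely.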

Page 36: BigData and Algorithms - LA Algorithmic Trading

Edge Cases

Yelp Review - Comme Ca

“You’d expect a place with such a diverse selection of French food, wonderfully accommodating staff, and a world class chef to live up to its amazing reputation, but it just simply did not.”

Page 37: BigData and Algorithms - LA Algorithmic Trading

Other Great Tricks

Part of Speech Tagging

N-Grams

Levenshtein Distance

RevMiner
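Two of the tricks listed above are small enough to sketch directly: Levenshtein (edit) distance and N-grams. These are textbook versions written for illustration, not code from the talk.

```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence, row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Edit distance gives a cheap "how different are these two spellings" score, and bigrams keep pairs such as ("not", "good") together, which a plain bag of words would split apart.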

Page 38: BigData and Algorithms - LA Algorithmic Trading

Humans!!

National Weather Service

Tries to quantify the effect of humans:

- Precipitation forecasts - 25% lift

- Temperature forecasts - 10% lift

Traders

Need human judgement when a model is failing.

Page 39: BigData and Algorithms - LA Algorithmic Trading

3. Relevancy

Page 40: BigData and Algorithms - LA Algorithmic Trading

Popularity Algorithm

“Social Triangulation”

Popularity = (
    A * (# of star ratings)
  + B * (# of dish mentions / total reviews at restaurant)
  + C * (# of photos / avg mentions per restaurant in specific geography)
) * Arbitrary population weight
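The formula above translates almost directly into a function. The parameter names, the default weights of 1.0, and the guard against division by zero are assumptions added for this sketch; the slides give no values for A, B, C or the population weight.

```python
def popularity(star_ratings, dish_mentions, total_reviews,
               photos, avg_mentions_in_area,
               a=1.0, b=1.0, c=1.0, population_weight=1.0):
    """Weighted sum of three signals, scaled by a population weight."""
    # Guard against empty denominators (an assumption, not from the slides).
    mention_share = dish_mentions / total_reviews if total_reviews else 0.0
    photo_ratio = photos / avg_mentions_in_area if avg_mentions_in_area else 0.0
    return (a * star_ratings + b * mention_share + c * photo_ratio) * population_weight
```

Note that the first term is a raw count while the other two are ratios, so in practice A, B and C also have to absorb the difference in scale.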

Page 41: BigData and Algorithms - LA Algorithmic Trading

Search Weights

Which signals are more important:

Number of times your search query matched something?

Your previous searches & behaviors?

Does Proximity to you outweigh other factors?

Does Popularity?
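One simple way to frame these questions is a weighted sum over pre-scaled signals, where changing the weights changes the answer. The signal names and weight values below are hypothetical; the slides pose the weighting as an open question and do not give one.

```python
# Hypothetical weights; tuning them is exactly the open question on the slide.
WEIGHTS = {"text_match": 0.4, "personal": 0.2, "proximity": 0.25, "popularity": 0.15}


def relevance(signals, weights=WEIGHTS):
    """Weighted sum of signals, each expected to be pre-scaled to the range 0..1."""
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)


def rank(results, weights=WEIGHTS):
    """Sort (label, signals) pairs from most to least relevant."""
    return sorted(results, key=lambda item: relevance(item[1], weights), reverse=True)


results = [
    ("far but popular", {"text_match": 0.5, "proximity": 0.1, "popularity": 0.9}),
    ("near", {"text_match": 0.5, "proximity": 0.9, "popularity": 0.2}),
]
```

With these weights the nearby result outranks the popular one; raising the popularity weight enough would flip the order.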

Page 42: BigData and Algorithms - LA Algorithmic Trading

Infrastructure

Page 43: BigData and Algorithms - LA Algorithmic Trading

Stack

Running on Windows

Web/REST Tier in the Cloud

Dedicated RDBMS on Solid State Drives

C# & SQL Server

Python & NLTK

Solr Lucene

Page 44: BigData and Algorithms - LA Algorithmic Trading

Results

“Vegas8 - RegEx 1”

Raw RegEx, RackSpace Cloud, Shared CPU

5 classifiers

~1 record/sec

50MM Records = 50MM Seconds

~13,889 Hours

~578 Days
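The arithmetic on this slide is easy to reproduce. The helper name `total_runtime` is made up for this sketch.

```python
def total_runtime(records, records_per_sec):
    """Return (seconds, hours, days) needed to process `records` at a given rate."""
    seconds = records / records_per_sec
    hours = seconds / 3600
    days = hours / 24
    return seconds, hours, days


seconds, hours, days = total_runtime(50_000_000, 1)
```

At 1 record per second, 50MM records take about 13,889 hours, or roughly 578.7 days.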

Page 45: BigData and Algorithms - LA Algorithmic Trading

Results

| Results | Records/Sec | Total Sec | Total Hours | Total Days |
| --- | --- | --- | --- | --- |
| RegEx 1 | 1 | 50,000,000 | 13,888.89 | 578.70 |
| Tokenize 1 | 25 | 2,000,000 | 555.56 | 23.15 |
| Tokenize 2 (SSD & dedicated CPU) | 110 | 454,545 | 126.26 | 5.26 |
| Tokenize 3 with 50MM caching | 16 | 3,125,000 | 868.06 | 36.17 |
| Tokenize 4 with 10K caching | 225 | 222,222 | 61.73 | 2.57 |
| Token/Stem/Stop | 230 | 217,391 | 60.39 | 2.52 |
| Token/Stem/Stop w/ 4 parallel processes | 874 | 57,208 | 15.89 | 0.66 |
| Levenshtein/Weights/Biz Rules/Hadoop | ?? | ?? | ?? | ?? |
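The results show a small (10K) cache doing far better than a 50MM one. The slides do not say how caching was implemented; as one possible illustration, a bounded memoization cache can be had from Python's standard `functools.lru_cache`. The `tokenize` function here is a hypothetical placeholder.

```python
from functools import lru_cache


# Assumption: caching means memoizing the per-description work in a bounded
# least-recently-used cache. The talk does not describe its actual mechanism.
@lru_cache(maxsize=10_000)
def tokenize(description):
    """Tokenize one description; repeated descriptions are served from the cache."""
    return tuple(description.lower().split())
```

Menu data repeats a lot (many restaurants list the same dish names), so a small cache of recent entries can catch much of the duplicated work without holding every record in memory.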

Page 46: BigData and Algorithms - LA Algorithmic Trading

Improvements?

Serialization - eats ~40-60% of overhead. How do we remove it?

Dedicated Hardware - SSDs & Dedicated CPU

Parallelization - Hadoop? RightScale? Custom Solution?

Indexing - SQL “dumb” storage. Solr for search.

Page 47: BigData and Algorithms - LA Algorithmic Trading

Experts

Panel of Resident Nutritionists

Formalizing things like:

“What is Hangover Food”

“How to get Huge fast”

“How to be a really annoying Yogi”

Page 48: BigData and Algorithms - LA Algorithmic Trading

Final Thoughts

Page 49: BigData and Algorithms - LA Algorithmic Trading

Trading Parallels

Dynamic vs Static Systems

Knowledge/Signal Graph

If you’re monitoring “Apple” you’ll need to monitor:

- Apple, $AAPL, Tim Cook, iPhone, Foxconn

- And assign a signal weight and signal vector for each
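A toy version of the signal graph described above. The weights, the reading of "signal vector" as a direction between -1 and +1, and the function name are all assumptions made for this sketch.

```python
# Hypothetical weights: how strongly news about each term should move the "Apple" signal.
SIGNAL_WEIGHTS = {
    "Apple": 1.0,
    "$AAPL": 1.0,
    "Tim Cook": 0.6,
    "iPhone": 0.8,
    "Foxconn": 0.4,
}


def aggregate_signal(observations, weights=SIGNAL_WEIGHTS):
    """Combine (term, direction) pairs, direction in [-1, 1], into one weighted score."""
    total = 0.0
    for term, direction in observations:
        # Terms outside the graph contribute nothing.
        total += weights.get(term, 0.0) * direction
    return total


score = aggregate_signal([("iPhone", 1.0), ("Foxconn", -0.5), ("Samsung", 1.0)])
```

This mirrors the FoodNet idea: the monitored entity is a root node, and related terms carry their own weight rather than all counting equally.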

Orthogonality

Using loosely correlative systems

Page 50: BigData and Algorithms - LA Algorithmic Trading

Data Science

Burgeoning skill set:

Data

- Programmer

- Sys admin

- Full stack knowledge

Stats

- Probability

- Algorithms

- Empirical methodology

Business

- “Real world” knowledge

- Subjectivity

- Modeling uncertainty

Page 51: BigData and Algorithms - LA Algorithmic Trading

Resources

Data Science Toolkit

NLTK

Nate Silver - The Signal and the Noise

Page 52: BigData and Algorithms - LA Algorithmic Trading

Tim Shea

@sheanineseven

[email protected]