Helping travelers make better hotel choices 500 million times a month* Steffen Wenz, CTO TrustYou


Page 1: PyData Berlin Meetup

Helping travelers make better hotel choices

500 million times a month* Steffen Wenz, CTO TrustYou

Page 2: PyData Berlin Meetup

What does TrustYou do? For every hotel on the planet, provide a summary of traveler reviews.

Page 3: PyData Berlin Meetup

✓ Excellent hotel!

Page 4: PyData Berlin Meetup

✓ Excellent hotel!

✓ Nice building: “Clean, hip & modern, excellent facilities”
✓ Great view: « Vue superbe » (“Superb view”)

Page 5: PyData Berlin Meetup

✓ Excellent hotel!*

✓ Nice building: “Clean, hip & modern, excellent facilities”
✓ Great view: « Vue superbe » (“Superb view”)
✓ Great for partying: “Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs
ℹ You should check out Reichstag, KaDeWe & Gendarmenmarkt.

*) nhow Berlin (Full summary)

Page 6: PyData Berlin Meetup
Page 7: PyData Berlin Meetup
Page 8: PyData Berlin Meetup
Page 9: PyData Berlin Meetup

TrustYou Architecture

[Architecture diagram: Crawling → Semantic Analysis → DB → TrustYou Analytics / API → Kayak, …; the API serves 200 million reqs/month]

Page 10: PyData Berlin Meetup

Crawling

Page 11: PyData Berlin Meetup

Basic crawling setup

[Diagram: Seed URLs (/find?q=Berlin, /find?q=Munich) feed a Frontier of discovered URLs: /meetup/BerlinPyData, /meetup/BerlinCyclists, /find?q=Munich&page=2, /meetup/BerlinPolitics, /meetup/BerlinCyclists, /find?q=Munich&page=3]
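A rough sketch of the same idea in plain Python (not TrustYou's code; the seed URL, the regex-based link extraction and the page limit are toy assumptions):

import re
from collections import deque

import requests  # assumption: the requests library is available

def crawl(seed_urls, max_pages=100):
    # Naive breadth-first crawl: seeds go into a frontier queue,
    # every fetched page is scanned for new links to enqueue.
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # dead or malformed link: skip it
        fetched += 1
        yield url, html
        # toy link extraction via a regex; real crawlers use an HTML parser
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)

# usage: for url, html in crawl(["http://www.meetup.com/find/?userFreeform=Berlin"]): ...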

Page 12: PyData Berlin Meetup

… if only it were so easy

[Diagram: the same Seed URLs and Frontier, now polluted with endless pagination (/find?q=Munich&page=99999999999…), duplicates (/meetup/BerlinCyclists twice) and misspelled or off-site links (facebok.com/meetup)]

Page 13: PyData Berlin Meetup

Scrapy

● Build your own web crawlers
  ○ Extract data via CSS selectors, XPath, regexes …
  ○ Handles queuing, request parallelism, cookies, throttling …
● Comprehensive and well-designed
● Commercial support by http://scrapinghub.com/

Page 14: PyData Berlin Meetup


Intro to Scrapy

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):

    name = "my_spider"

    # start with this URL
    start_urls = ["http://www.meetup.com/find/?allMeetups=true&radius=50&userFreeform=Berlin"]

    # follow these URLs, and call self.parse_meetup to extract data from them
    rules = [
        Rule(LinkExtractor(allow=[
            "^http://www.meetup.com/[^/]+/$",
        ]), callback="parse_meetup"),
    ]

    def parse_meetup(self, response):
        # Extract data about meetup from HTML
        m = MeetupItem()
        yield m
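The extraction itself is left out on the slide; a minimal sketch of what parse_meetup could look like (the selectors, the MeetupItem fields and the extract_first/re helpers assume a reasonably recent Scrapy and meetup.com's markup at the time, so treat them as assumptions):

def parse_meetup(self, response):
    # hypothetical selectors and field names, not the talk's actual code
    m = MeetupItem()          # assumes MeetupItem declares url, name, members
    m["url"] = response.url
    m["name"] = response.css("h1::text").extract_first()
    members = response.css(".members::text").re(r"\d+")  # e.g. ["774"]
    m["members"] = members[0] if members else None
    yield m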

Page 15: PyData Berlin Meetup

Try it out!

$ scrapy crawl city -a city=Berlin -t jsonlines -o - 2>/dev/null
{"url": "http://www.meetup.com/Making-Customers-Happy-Berlin/", "name": "eCommerce - Making Customers Happy - Berlin", "members": "774"}
{"url": "http://www.meetup.com/Berlin-Scrum-Meetup/", "name": "Berlin Scrum Meetup", "members": "368"}
{"url": "http://www.meetup.com/Clojure-Berlin/", "name": "The Clojure Conspiracy (Berlin)", "members": "545"}
{"url": "http://www.meetup.com/appliedJavascript/", "name": "Applied Javascript", "members": "494"}
{"url": "http://www.meetup.com/englishconversationclubberlin/", "name": "English Conversation Club Berlin", "members": "1"}
{"url": "http://www.meetup.com/Berlin-Nights-Out-and-Daylight-Catch-Up/", "name": "Berlin Nights Out and Daylight Catch Up", "members": "1"}
...

Full code on GitHub, dump of all Berlin meetups (Note: Meetup also has an API …)

Page 16: PyData Berlin Meetup

Number of registered meetups

Page 17: PyData Berlin Meetup

Crawling at TrustYou scale

● 2 - 3 million new reviews/week
● Customers want alerts 8 - 24h after review publication!
● Smart crawl frequency & depth, but still high overhead
● Pools of constantly refreshed EC2 proxy IPs
● Direct API connections with many sites

Page 18: PyData Berlin Meetup

Crawling at TrustYou scale

● Custom framework very similar to Scrapy
● Runs on Hadoop cluster (100 nodes)
● … though the problem is not 100% suitable for MapReduce
  ○ Nodes mostly waiting
  ○ Coordination/messaging between nodes required:
    ■ Distributed queue
    ■ Rate limiting (a sketch of one possible approach follows below)
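The slides do not show how that coordination is implemented. As one hedged illustration only, a shared fixed-window rate limiter could be built on Redis; the Redis instance, key names and limits below are assumptions, not TrustYou's actual setup:

import time

import redis  # assumption: a shared Redis instance coordinates all crawler nodes

r = redis.StrictRedis(host="localhost", port=6379)

def acquire_slot(domain, max_per_minute=30):
    # Fixed-window rate limiter: allow at most max_per_minute requests
    # per domain per minute, shared across all crawler processes.
    window = int(time.time() // 60)
    key = "crawl_rate:{}:{}".format(domain, window)
    count = r.incr(key)        # atomic increment; creates the key if missing
    if count == 1:
        r.expire(key, 120)     # let old windows expire automatically
    return count <= max_per_minute

# usage inside a crawl loop (hypothetical): fetch the URL if acquire_slot("meetup.com")
# returns True, otherwise put it back on the queue and retry later.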

Page 19: PyData Berlin Meetup

Textual Data

Page 20: PyData Berlin Meetup

Treating textual data

raw text → sentence splitting → tokenization → stopword filtering → stemming

Page 21: PyData Berlin Meetup

Tokenization

>>> import nltk
>>> raw = "We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers. Please get in touch using info@pydata.berlin."
>>> nltk.sent_tokenize(raw)
['We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers.', 'Please get in touch using info@pydata.berlin.']
>>> nltk.word_tokenize(raw)
['We', 'are', 'always', 'looking', 'for', 'interesting', 'talks', ',', 'locations', 'to', 'host', 'meetups', 'and', 'enthusiastic', 'volunteers.', 'Please', 'get', 'in', 'touch', 'using', 'info', '@', 'pydata.berlin', '.']
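The remaining steps of the pipeline from the previous slide (stopword filtering and stemming) are not shown in the deck; a minimal NLTK sketch, assuming the stopwords corpus has been downloaded via nltk.download("stopwords"):

>>> from nltk.corpus import stopwords
>>> from nltk.stem import PorterStemmer
>>> tokens = nltk.word_tokenize("We are always looking for interesting talks")
>>> stop = set(stopwords.words("english"))
>>> content = [t for t in tokens if t.lower() not in stop]
>>> content
['always', 'looking', 'interesting', 'talks']
>>> [PorterStemmer().stem(t) for t in content]
['alway', 'look', 'interest', 'talk']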

Page 22: PyData Berlin Meetup

Grammars and Parsing

“great rooms”         JJ NN
“great hotel”         JJ NN
“rooms are terrible”  NN VB JJ
“hotel is terrible”   NN VB JJ

>>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
[('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]

Page 23: PyData Berlin Meetup

Grammars and Parsing

>>> grammar = nltk.CFG.fromstring("""
... OPINION -> NN COP JJ
... OPINION -> JJ NN
... NN -> 'hotel' | 'rooms'
... COP -> 'is' | 'are'
... JJ -> 'great' | 'terrible'
... """)
>>> parser = nltk.ChartParser(grammar)
>>> sent = nltk.word_tokenize("great rooms")
>>> for tree in parser.parse(sent):
...     print tree
(OPINION (JJ great) (NN rooms))

Page 24: PyData Berlin Meetup

WordNet

>>> from nltk.corpus import wordnet as wn
>>> wn.morphy('coded', wn.VERB)
'code'
>>> wn.synsets("python")
[Synset('python.n.01'), Synset('python.n.02'), Synset('python.n.03')]
>>> wn.synset('python.n.01').hypernyms()
[Synset('boa.n.02')]

>>> # meh :/

Page 25: PyData Berlin Meetup

Semantic Analysis at TrustYou

● “Nice room”
● “Room wasn’t so great”
● “The air-conditioning was so powerful that we were cold in the room even when it was off.”
● “อาหารรสชาติดี” (“The food tastes good”)
● “خدمة جيدة” (“Good service”)

● 20 languages
● Linguistic system (morphology, taggers, grammars, parsers …)
● Hadoop: Scale out CPU
  ○ ~1B opinions in DB
● Python for ML & NLP libraries

Page 26: PyData Berlin Meetup

Word2Vec

● Map words to vectors
● “Step up” from bag-of-words model
● ‘Cats’ and ‘dogs’ should be similar, because they occur in similar contexts

>>> m["python"]
array([-0.1351, -0.1040, -0.0823, -0.0287, 0.3709,
       -0.0200, -0.0325, 0.0166, 0.3312, -0.0928,
       -0.0967, -0.0199, -0.2498, -0.4445, -0.0445,
       # ...
       -1.0090, -0.2553, 0.2686, -0.4121, 0.3116,
       -0.0639, -0.3688, -0.0273, -0.1266, -0.2606,
       -0.1549, 0.0023, 0.0084, 0.2169, 0.0060],
      dtype=float32)

Page 27: PyData Berlin Meetup

Fun with Word2Vec

>>> # trained from 100k meetup descriptions!
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django', 0.8189617991447449)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["berlin"])[:3]
[(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland', 0.7970746755599976)]
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
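The deck only loads the trained model; as a rough sketch of how such a model could be trained (the corpus file and preprocessing are assumptions, and parameter names differ slightly between gensim versions):

import gensim
import nltk

# assumption: one meetup description per line in a plain-text file
with open("data/meetup_descriptions.txt") as f:
    sentences = [nltk.word_tokenize(line.lower()) for line in f]

model = gensim.models.Word2Vec(sentences, min_count=5, workers=4)
model.save("data/word2vec")   # later loaded with Word2Vec.load(...)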

Page 28: PyData Berlin Meetup

ML @ TrustYou

● gensim doc2vec model to create hotel embedding (a minimal sketch of the idea follows below)
● Used, together with other features, for various classifiers
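The actual pipeline is not shown in the deck; the sketch below only illustrates the idea of tagging each review with its hotel ID so that Doc2Vec learns one vector per hotel (hotel IDs, texts and parameters are made up):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

reviews = [
    ("hotel_1", "great rooms and a great view"),
    ("hotel_1", "clean, hip and modern"),
    ("hotel_2", "hotel is terrible, rooms are terrible"),
]

# tag every review with its hotel ID, so each hotel gets one embedding
docs = [TaggedDocument(words=text.split(), tags=[hotel_id])
        for hotel_id, text in reviews]

model = Doc2Vec(docs, min_count=1)
hotel_vector = model.docvecs["hotel_1"]   # model.dv[...] in newer gensim
# hotel_vector can now be fed into classifiers alongside other features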

Page 29: PyData Berlin Meetup

Workflow Management & Scaling Up

Page 30: PyData Berlin Meetup

Luigi

● Build complex pipelines of batch jobs
  ○ Dependency resolution
  ○ Parallelism
  ○ Resume failed jobs
● Some support for Hadoop
● Pythonic replacement for Oozie
● Can be combined with Pig, Hive

Page 31: PyData Berlin Meetup

Luigi tasks vs. Makefiles

class MyTask(luigi.Task):

    def requires(self):
        return DependentTask()

    def output(self):
        return luigi.LocalTarget("data/my_task_output")

    def run(self):
        with self.output().open("w") as out:
            out.write("foo")

Roughly the Makefile equivalent of:

data/my_task_output: DependentTask
    run ...

Page 32: PyData Berlin Meetup

Example: Wrap crawl in Luigi task

class CrawlTask(luigi.Task):

    city = luigi.Parameter()

    def output(self):
        output_path = os.path.join("data", "{}.jsonl".format(self.city))
        return luigi.LocalTarget(output_path)

    def run(self):
        tmp_output_path = self.output().path + "_tmp"
        subprocess.check_output(["scrapy", "crawl", "city", "-a", "city={}".format(self.city),
                                 "-o", tmp_output_path, "-t", "jsonlines"])
        os.rename(tmp_output_path, self.output().path)
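As a usage sketch (not on the slide; the module name is an assumption), such a task is typically triggered through Luigi's command-line interface, e.g. by adding a luigi.run() entry point to the module:

if __name__ == "__main__":
    luigi.run()

# $ python crawl_tasks.py CrawlTask --city Berlin --local-scheduler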

Page 33: PyData Berlin Meetup

Luigi dependency graphs

Page 34: PyData Berlin Meetup

Hadoop!

● MapReduce: Programming model for distributed computation problems
● Express your algorithm as a sequence of operations:
  a. Map: Do a linear pass over your data, emit (k, v)
  b. (Distributed sort)
  c. Reduce: Linear pass over all (k, v) for the same k
● Python on Hadoop: Hadoop streaming, MRJob, Luigi (Just go learn PySpark instead; see the sketch below)
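Taking the slide's advice, a minimal PySpark word count (the canonical MapReduce example; the input and output paths are assumptions) shows the same map / shuffle / reduce structure:

from pyspark import SparkContext

sc = SparkContext(appName="wordcount")
counts = (sc.textFile("data/reviews.txt")
            .flatMap(lambda line: line.split())      # map phase: split into words
            .map(lambda word: (word, 1))             # emit (k, v) pairs
            .reduceByKey(lambda a, b: a + b))        # reduce phase: sum per key
counts.saveAsTextFile("data/wordcount_output")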

Page 35: PyData Berlin Meetup

Luigi Hadoop integration

class HadoopTask(luigi.hadoop.JobTask):

    def output(self):
        return luigi.HdfsTarget("output_in_hdfs")

    def requires(self):
        return {
            "some_task": SomeTask(),
            "some_other_task": SomeOtherTask()
        }

    def mapper(self, line):
        key, value = line.rstrip().split("\t")
        yield key, value

    def reducer(self, key, values):
        yield key, ", ".join(values)

Page 36: PyData Berlin Meetup

Luigi Hadoop integration

1. Your input data is sitting in the distributed file system (HDFS)
2. Luigi creates a .tar.gz, Hadoop moves your code onto the machines
3. mapper() gets run (distributed)
4. Data gets re-sorted by key
5. reducer() gets run (distributed)
6. Output gets saved in HDFS

Page 37: PyData Berlin Meetup

Beyond MapReduce

● Batch, never real time
● Slow even for batch (lots of disk IO)
● Limited expressiveness (remedies/crutches: MRJob, Pig, Hive)
● Spark: More complete Python support

Page 38: PyData Berlin Meetup

Workflows at TrustYou

Page 39: PyData Berlin Meetup

Workflows at TrustYou