sanoma search use cases

Post on 16-Jul-2015


Sanoma Search Use Cases

Sander Kieft

@skieft

About me

Manager Core Services at Sanoma

Responsible for all common services, including the Big Data platform

Work:

– Centralized services

– Data platform

– Search

Like:

– Work

– Water(sports)

– Whiskey

– Tinkering: Arduino, Raspberry Pi, soldering stuff

24 April 2015

Sanoma, Publishing and Learning company

2 Finnish newspapers

Over 100 magazines

5 TV channels in Finland and The Netherlands

200+ websites

100 mobile applications on various mobile platforms

Sanoma = Donald Duck


Use cases

Full text search

Use cases


Full Text Search

Photo credits: Igal Koshevoy - https://www.flickr.com/photos/igalko/6345215839/

Faceted search

Full text search

Faceted search

Guided search

Use cases


Guided search

Photo credits: http://www.flickr.com/photos/emyanmei/8223998414/


Source: ThesisDefense - Dirk Guijt

Startpagina.nl Search

Content

Source: ThesisDefense - Dirk Guijt

Startpagina.nl Search

Content

Tags

#vakantie

#arke

#vakantie arke

#arke vakantie

#arke stedentrip

#stedentrip arke

#arkefly

Source: ThesisDefense - Dirk Guijt

Startpagina.nl Search

Content

Tags

#vakantie

#arke

#vakantie arke

#arke vakantie

#arke stedentrip

#stedentrip arke

#arkefly

Query: vakantie arke

Source: ThesisDefense - Dirk Guijt

Startpagina.nl Search

Content

Tags

#vakantie

#arke

#vakantie arke

#arke vakantie

#arke stedentrip

#stedentrip arke

#arkefly

?

Query: vakantie arke stedentrip

Thesis Defense - Dirk Guijt

Term mismatch

Query “vakantie arke stedentrip” vs. tag “arke stedentrip”
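The mismatch can be reproduced with a minimal sketch, assuming a query only hits when it equals a tag exactly. The tag list comes from the slides; the exact-match rule and the function names `lookup` and `partial_matches` are assumptions for illustration, not the Startpagina.nl implementation.

```python
# Tags as shown on the slides. Exact matching is an assumed rule.
TAGS = {
    "vakantie", "arke", "vakantie arke", "arke vakantie",
    "arke stedentrip", "stedentrip arke", "arkefly",
}


def lookup(query, tags=TAGS):
    """Return the tag equal to the normalised query, or None if there is none."""
    normalised = " ".join(query.lower().split())
    return normalised if normalised in tags else None


def partial_matches(query, tags=TAGS):
    """Tags whose terms all occur in the query (one possible fallback)."""
    terms = set(query.lower().split())
    return sorted(tag for tag in tags if set(tag.split()) <= terms)


print(lookup("vakantie arke"))                      # an existing tag
print(lookup("vakantie arke stedentrip"))           # no tag: term mismatch
print(partial_matches("vakantie arke stedentrip"))  # shorter tags that still fit
```

The three-term query finds nothing although the tag "arke stedentrip" is clearly related, which is the gap that query suggestions are meant to close.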

Query-Flow-Graph

User ID   Date / Time           Query
User A    02-12-2014 10:30:15   owls
User A    02-12-2014 10:30:23   snow owls
User A    02-12-2014 10:30:46   snow owls food
User A    02-12-2014 10:31:03   owls food
User B    02-12-2014 13:21:34   lemon
User B    02-12-2014 13:22:02   lemon trees
User B    02-12-2014 13:22:12   lemon cove
User B    02-12-2014 16:53:01   owls
User B    02-12-2014 16:53:53   forest owls

Source: ThesisDefense - Dirk Guijt

Model as a Graph

snow owls

snow owls food

owls food

owls

lemon trees

lemon cove

lemon

forest owls

Edge weights shown in the graph: 5, 3, 1, 2, 6, 5
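The query log above can be turned into a graph along these lines. This is a minimal sketch, not the thesis implementation: it assumes the log is sorted by time, that two consecutive queries by the same user belong to one session when they are at most 30 minutes apart (`max_gap` is a hypothetical parameter), and that an edge weight is simply the number of times one query was followed by another. The weights on the slide may be computed differently.

```python
from collections import Counter
from datetime import datetime, timedelta

# The example log from the slide: (user, timestamp, query).
LOG = [
    ("User A", "02-12-2014 10:30:15", "owls"),
    ("User A", "02-12-2014 10:30:23", "snow owls"),
    ("User A", "02-12-2014 10:30:46", "snow owls food"),
    ("User A", "02-12-2014 10:31:03", "owls food"),
    ("User B", "02-12-2014 13:21:34", "lemon"),
    ("User B", "02-12-2014 13:22:02", "lemon trees"),
    ("User B", "02-12-2014 13:22:12", "lemon cove"),
    ("User B", "02-12-2014 16:53:01", "owls"),
    ("User B", "02-12-2014 16:53:53", "forest owls"),
]


def build_query_flow_graph(log, max_gap=timedelta(minutes=30)):
    """Count query-to-query transitions per user within one session."""
    edges = Counter()
    last_seen = {}  # user -> (timestamp, query) of that user's previous query
    for user, stamp, query in log:
        when = datetime.strptime(stamp, "%d-%m-%Y %H:%M:%S")
        previous = last_seen.get(user)
        if previous is not None and when - previous[0] <= max_gap:
            edges[(previous[1], query)] += 1
        last_seen[user] = (when, query)
    return edges


edges = build_query_flow_graph(LOG)
for (source, target), weight in sorted(edges.items()):
    print(f"{source} -> {target}: {weight}")
```

With the assumed 30-minute gap, User B's jump from "lemon cove" to "owls" more than three hours later starts a new session and produces no edge.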

owls → snow owls (S) specialization

snow owls → owls (G) generalization

olws → owls (C) same query / error correction

snow owls → forest owls (P) parallel move / equivalent rephrase
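The four reformulation types can be approximated with simple string heuristics. The sketch below is an assumption for illustration, not the classifier from the thesis: it compares the sets of terms for specialization and generalization, uses `difflib.SequenceMatcher` with a hypothetical threshold of 0.7 for error correction, and treats any remaining pair that shares a term as a parallel move.

```python
from difflib import SequenceMatcher


def reformulation_type(previous, current, typo_threshold=0.7):
    """Label the step from one query to the next as S, G, C, P or None."""
    prev_terms = set(previous.lower().split())
    cur_terms = set(current.lower().split())
    if prev_terms < cur_terms:
        return "S"  # terms were added: specialization
    if cur_terms < prev_terms:
        return "G"  # terms were dropped: generalization
    similarity = SequenceMatcher(None, previous.lower(), current.lower()).ratio()
    if similarity >= typo_threshold:
        return "C"  # same query or a spelling correction
    if prev_terms & cur_terms:
        return "P"  # shares a term but swaps another: parallel move
    return None  # unrelated queries


for pair in [("owls", "snow owls"), ("snow owls", "owls"),
             ("olws", "owls"), ("snow owls", "forest owls")]:
    print(pair, reformulation_type(*pair))
```

The threshold is a tuning knob: set too low, parallel moves such as "snow owls" to "forest owls" would be mistaken for corrections.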

Source: ThesisDefense - Dirk Guijt

Using Query-Log based Collective Intelligence to Generate Query Suggestions for Tagged Content Search (paper), 15th International Conference on Web Engineering (ICWE 2015): http://icwe2015.webengineering.org

Query Reformulation Types

Examples

Full text search

Faceted search

Guided search

Content repository

Use cases


Monolithic vs integrated


Two approaches

Master

Content

*

*

Content

Master

Content

*

*

MyJour

Item Based Framework

….CMS

Architecture Content Platform


Content Platform Core

Search: Solr

Blob storage (S3 & MT)

Article storage: MongoDB

Analyse

CMS

CMS

Editorial reuse-interface

ePub

Digital template system

WoodWing

Content

Portal

Feeds

Noma

Viva

PDF Based Framework

….

HomeDeco

Sources Services Solutions Products

??

??

??

??

eLinea

Blendle

Google Currents

LINDA. nieuws

NU.nl search

NLP

Natural Language Processing

Understanding a language is hard, really hard

Photo credits: Celines Photographer - http://www.flickr.com/photos/celinesphotographer/2295348530/

Ambiguous

Photo credits: M Hatrey - http://www.flickr.com/photos/mhatrey/6968211400/

I made her duck.


Photo credits: Ulteriore Picure - http://www.flickr.com/photos/ulteriorepicure/200767137/

Photo credits: Atomic Seed - http://www.flickr.com/photos/atomic_seed/6824087444/

Photo credits: Ulteriore Picure - http://www.flickr.com/photos/ulteriorepicure/200767137/

Photo credits: Pintoy - http://www.flickr.com/photos/pintoys/6155690814/

Photo credits: KyanosAum - http://www.flickr.com/photos/kyanos_aum/3926971954/

Photo credits: Super Hua - http://www.flickr.com/photos/superhua/286479024/

Photo credits: Glim Eend - http://www.flickr.com/photos/glimeend/5075731300/

Wrong

Photo credits: Learnscope - http://www.flickr.com/photos/learnscope/5536614305/

Recursion

Photo credits: Wikipedia - http://en.wikipedia.org/wiki/File:Droste.jpg

Photo credits: Creative Tools - http://www.flickr.com/photos/creative_tools/4353860378/

Multi lingual

Ambiguity

Creative

Multi lingual

Wrong

..and many more

Understanding language is hard


..but we don’t need to fully understand language to take care of it.

Tagging

Quote detection

Sentiment analysis

Topic detection

Named Entity Recognition

What did we use to enhance our index?


Tagging


Spectators outside the White House received a rare treat this morning

when they witnessed First Lady Michelle Obama on the South Lawn

going for a stroll with the family’s pet rhinoceros, Chauncey. “Owning a

rhino is a lot of work, but all of the Obamas—and especially Michelle—

really love Chauncey,” said White House spokesperson Sam

Davidson of the 3,000-pound eastern black rhinoceros the family

adopted in December after Barack Obama’s reelection promise to

“finally get Sasha and Malia that rhino they’ve been wanting.”

Source: http://www.theonion.com/articles/michelle-obama-seen-outside-walking-family-rhinoce,32851/

TF/IDF

Latent semantic indexing

Tagging


TF/IDF


TF-IDF

Term Frequency-Inverse Document Frequency

Term frequency = how often the search term occurs in the text / how many words are in the entire text

5/24 = 0.21

3/12 = 0.25 (more relevant)
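The term-frequency comparison can be written down in a few lines. The sketch below follows the slide's definition of term frequency and adds the usual inverse document frequency, log(N / number of documents containing the term), which is the textbook form and an assumption about what the pipeline used. The toy texts are made up to reproduce the slide's numbers.

```python
import math


def term_frequency(term, tokens):
    """Occurrences of the term divided by the number of words in the text."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, documents):
    """log(N / documents containing the term); 0.0 if no document contains it."""
    containing = sum(1 for tokens in documents if term in tokens)
    return math.log(len(documents) / containing) if containing else 0.0


def tf_idf(term, tokens, documents):
    """Weight of a term in one text relative to the whole collection."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, documents)


# Hypothetical texts: 5 hits in 24 words versus 3 hits in 12 words.
long_text = ["rhino"] * 5 + ["filler"] * 19
short_text = ["rhino"] * 3 + ["filler"] * 9
other_text = ["owls"] * 4

print(round(term_frequency("rhino", long_text), 2))   # 0.21
print(round(term_frequency("rhino", short_text), 2))  # 0.25
print(tf_idf("rhino", short_text, [long_text, short_text, other_text]))
```

A word that occurs in every document gets an inverse document frequency of zero, which is what pushes common words out of the tag candidates.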

NumPy, for SVD

Latent semantic indexing


?

Many interesting things about text are longer than one word

– bigram: a sequence of two tokens

– collocation: a bigram that seems to be more than the sum of its parts

When is a bigram interesting?

#(vice president)   #(vice)   #(president)

#(first lady)   #(first)   #(lady)

Statistics beyond single words
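One common way to decide whether a bigram is interesting is to compare its count with the counts of its parts, for example with pointwise mutual information (PMI). The slide does not say which measure was used, so PMI is an assumption here, and the sample sentence is made up.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Score each bigram by log2(P(w1 w2) / (P(w1) * P(w2)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_unigrams = len(tokens)
    total_bigrams = len(tokens) - 1
    scores = {}
    for (first, second), count in bigrams.items():
        p_bigram = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        scores[(first, second)] = math.log2(p_bigram / (p_first * p_second))
    return scores


tokens = "the first lady met the vice president and the first lady left".split()
scores = bigram_pmi(tokens)
for bigram, score in sorted(scores.items(), key=lambda item: -item[1])[:3]:
    print(bigram, round(score, 2))
```

PMI favours rare pairs, so in practice a minimum bigram count is usually applied before ranking.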


Quotes


Spectators outside the White House received a rare treat this morning

when they witnessed First Lady Michelle Obama on the South Lawn

going for a stroll with the family’s pet rhinoceros, Chauncey. “Owning a

rhino is a lot of work, but all of the Obamas—and especially Michelle—

really love Chauncey,” said White House spokesperson Sam Davidson

of the 3,000-pound eastern black rhinoceros the family adopted in

December after Barack Obama’s reelection promise to “finally get Sasha

and Malia that rhino they’ve been wanting.”

Source: http://www.theonion.com/articles/michelle-obama-seen-outside-walking-family-rhinoce,32851/


TF/IDF

Average per sentence

Quotes


Classification

Photo credits: biodivlibrary- http://www.flickr.com/photos/biodivlibrary/6989150578/

Distinguish things from other things based on examples

Using supervised learning

Applications:

– Sentiment

– Spam filtering

– Topic detection

– Language detection

Classification


Classification > Sentiment


Bayesian model trained on Kieskeurig.nl review data

Calculates the chance of an article being positive or negative

Classification > Sentiment
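A Bayesian sentiment model of this kind can be sketched as a multinomial naive Bayes classifier with add-one smoothing. The four training sentences below are invented stand-ins for review data, and the class name `NaiveBayes` is hypothetical; the actual model trained on Kieskeurig.nl reviews is not described in more detail on the slide.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over whitespace tokens, add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, text, label):
        tokens = text.lower().split()
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def probabilities(self, text):
        """Return a dict mapping each label to its posterior probability."""
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            log_scores[label] = score
        # Normalise in log space to avoid underflow on long texts.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}


model = NaiveBayes()
model.train("great product works perfectly", "positive")
model.train("excellent sound great value", "positive")
model.train("broke after a week terrible", "negative")
model.train("terrible battery poor value", "negative")
print(model.probabilities("great sound"))
```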


Classification > Topic detection


Food and drinks

Beauty

Relationships and sex

Science

K Nearest Neighbor

Solr related search

Classification > Topic detection
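The K Nearest Neighbor idea can be illustrated without Solr: find the labelled articles most similar to a new text and let them vote on the topic. In the sketch below cosine similarity over word counts stands in for Solr's related search, which is an assumption, and the labelled examples are made up.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_topic(text, labelled, k=3):
    """Return the majority topic among the k most similar labelled texts."""
    query = Counter(text.lower().split())
    ranked = sorted(
        labelled,
        key=lambda item: cosine(query, Counter(item[0].lower().split())),
        reverse=True,
    )
    votes = Counter(topic for _, topic in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training examples using topic names from the slide.
LABELLED = [
    ("recipe for pasta with tomato sauce", "Food and drinks"),
    ("wine and cheese recipe ideas", "Food and drinks"),
    ("easy pasta recipe for dinner", "Food and drinks"),
    ("new telescope finds distant planet", "Science"),
    ("scientists study planet atmosphere", "Science"),
    ("lipstick and mascara tips", "Beauty"),
]

print(knn_topic("quick tomato pasta recipe", LABELLED))
print(knn_topic("planet discovered by telescope", LABELLED))
```

Letting the search engine do the neighbour lookup has the practical advantage that no separate model has to be trained or stored.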


?

Named Entities


Spectators outside the White House received a rare treat this morning

when they witnessed First Lady Michelle Obama on the South Lawn

going for a stroll with the family’s pet rhinoceros, Chauncey. “Owning a

rhino is a lot of work, but all of the Obamas—and especially Michelle—

really love Chauncey,” said White House spokesperson Sam Davidson

of the 3,000-pound eastern black rhinoceros the family adopted in

December after Barack Obama’s reelection promise to “finally get

Sasha and Malia that rhino they’ve been wanting.”

Source: http://www.theonion.com/articles/michelle-obama-seen-outside-walking-family-rhinoce,32851/

Locations

Persons

This can get you the persons, organizations and locations based on the sentence structure

Other types of entities can be trained as well, e.g. events, ingredients

Bayesian model

Features:

– Capitalization “Donald Duck”

– Part of speech tag “Did you meet Donald Duck last week?”

– Gazetteer of common names “Donald”

Named Entities
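The features listed above could be extracted per token roughly as follows. This is a sketch under assumptions: the gazetteer entries and feature names are invented, and the part-of-speech feature is left out because it needs a tagger (for example from nltk). The resulting dictionaries would be the input to a classifier such as the Bayesian model mentioned on the slide.

```python
# Hypothetical gazetteer of common first names.
GAZETTEER = {"donald", "michelle", "barack", "sam"}


def token_features(tokens, index):
    """Features for one token, given the tokens of its sentence."""
    token = tokens[index]
    return {
        "capitalized": token[:1].isupper(),
        "sentence_start": index == 0,
        "in_gazetteer": token.lower() in GAZETTEER,
        "previous_capitalized": index > 0 and tokens[index - 1][:1].isupper(),
    }


sentence = "Did you meet Donald Duck last week ?".split()
for position, word in enumerate(sentence):
    print(word, token_features(sentence, position))
```

Capitalization alone is not enough: "Did" is capitalized only because it starts the sentence, which is why the position is recorded as a separate feature.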


It was revealed in December 2006 that Michael Jackson "is and has been suffering for at least a

decade from Parkinson's Disease."[10] He also suffered from diabetes. Michael Jackson died of a heart

attack in his home the morning of 30 August 2007 at the age of …

Named Entities++


?


{'December': 3,'Parkinson’s Disease': 1,'died': 4,'heart attack': 2}

{'December': 2,'Thriller': 1,'Neverland': 2,'heart attack': 2}

{'December': 3,'Parkinson’s Disease': 1,'Writer': 4,'heart attack': 2}

Storing it in the search index

Use it for faceted search

Use it for boosting

Use it as a poor man's knowledge base/graph

What do we do with this additional info?
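Once the enrichments are stored as fields in the index, faceting and boosting are a matter of query parameters. The sketch below builds a Solr request string with the standard `facet`, `facet.field`, `defType=edismax` and `bq` parameters; the field names `persons`, `locations` and `topic` and the boost factor are hypothetical, since the actual schema is not shown.

```python
from urllib.parse import urlencode


def build_solr_params(text, boost_person=None):
    """Return a query string that facets on enrichment fields and optionally boosts a person."""
    params = [
        ("q", text),
        ("defType", "edismax"),       # query parser that supports bq boosts
        ("facet", "true"),
        ("facet.field", "persons"),   # hypothetical field filled by NER
        ("facet.field", "locations"), # hypothetical field filled by NER
        ("facet.field", "topic"),     # hypothetical field filled by topic detection
    ]
    if boost_person:
        # Boost documents in which the named entity was recognised.
        params.append(("bq", 'persons:"%s"^2' % boost_person))
    return urlencode(params)


print(build_solr_params("rhinoceros", boost_person="Michelle Obama"))
```

The returned string can be appended to the select handler URL of a Solr core.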


Images & Video

Photo credits: Danila- http://www.flickr.com/photos/58372504@N05/6996542804/

Face detection


Not face recognition!

OpenCV

Viola-Jones

Haar cascade classifier

Currently identify:

– Portrait shots

– Faces

– Group shots

Side project:

– Content sensitive cropping

Color detection


Image Histogram

Manual mapping

Improving image search

Photo credits: Camera Wiki- http://www.flickr.com/photos/camerawiki/5610384297/

EUVision, UvA spin off

Acquired by Qualcomm at the end of 2014

Image and video classification


High level architecture


Content Library

NER

Sentiment

Search index

API

Loader

Solr

MongoDB

Image recon.

Analyse Pipeline

Python/Django, with Django Rest framework

MongoDB

Solr

Celery

Libraries:

– nltk

– NumPy & SciPy

– OpenCV

– ssdeep

– EUVision

Technology


Full text search

Faceted search

Guided search

Content repository

Analytics

Use cases


Search for analytics

Keyword search can be combined with advanced forms of ranking the results

Most of the fields go to an index

Facets can be used for analytics

Ranker can be replaced with custom logic

Products:

– Solr

– Elasticsearch

– MarkLogic

Use cases:

– Content Search

– Analytics / Faceted

– Percolation


ELK – Elasticsearch, Logstash & Kibana


Full text search

Faceted search

Guided search

Content repository

Analytics

Adserving

Use cases


Search


Content

Q Σ Result ranking

Search too


Content


Σ Result ranking

User

Search too


Ads

Page

Σ Result ranking

User

Search too


Ads

Page

Σ Result ranking

User

History


Open Projects

OL2R

– Implement

Content Library

– Additional semantic modules

– Impact of semantic enrichments

Content Library

– Improve Ranking

Guided Search

– Evaluations

– User History

Image search

Analytics

Knowledge Base

– Build for Content Library and Guided Search

Multi lingual search

– NL, EN and FI

Probabilistic search

– Product searches


Peter Dutton - https://www.flickr.com/photos/joeshlabotnik/14040532589/

Camera Wiki - http://www.flickr.com/photos/camerawiki/5610384297/

Danila - http://www.flickr.com/photos/58372504@N05/6996542804/

Biodivlibrary - http://www.flickr.com/photos/biodivlibrary/6989150578/

Wikipedia - http://en.wikipedia.org/wiki/File:Droste.jpg

Creative Tools - http://www.flickr.com/photos/creative_tools/4353860378/

Learnscope - http://www.flickr.com/photos/learnscope/5536614305/

Super Hua - http://www.flickr.com/photos/superhua/286479024/

Kyanos Aum - http://www.flickr.com/photos/kyanos_aum/3926971954/

Pintoy - http://www.flickr.com/photos/pintoys/6155690814/

Ulteriore Picure - http://www.flickr.com/photos/ulteriorepicure/200767137/

Atomic Seed - http://www.flickr.com/photos/atomic_seed/6824087444/

M Hatrey - http://www.flickr.com/photos/mhatrey/6968211400/

Celines Photographer - http://www.flickr.com/photos/celinesphotographer/2295348530/

Irene Mei - http://www.flickr.com/photos/emyanmei/8223998414/

Igal Koshevoy - https://www.flickr.com/photos/igalko/6345215839/

Photo credits

Thanks to all photographers!
