searchhub - how to spend your summer keeping it real: presented by grant ingersoll, lucidworks

OCTOBER 11-14, 2016 BOSTON, MA



SearchHub: How to Spend Your Summer Keeping it Real

Grant Ingersoll, CTO, Lucidworks


SearchHub Demo

github.com/lucidworks/searchhub

http://searchhub.lucidworks.com

SearchHub Details

• Basics:

• 37 Apache Projects registered so far plus LW properties, opensource.com, Stack Overflow

• 130 datasources* including email, GitHub, JIRA*, website and wiki

• Fusion 2.4.2

• Signals everywhere

• UI based on Lucidworks View (work in progress)

• ASF Mail archives mirrored at: http://asfmail.lucidworks.io

Goals

• Company:

• “LucidFind” aka SearchHub on Fusion

• Provide backend for LW.com search, including docs and support

• Real, production, living, breathing instance of Fusion that we control

• Fusion best practices demo of major use cases

• CTO Office

• Real data, including clicks

• Platform for machine learning and experimentation

• Demos and talks

Agenda

• Quick Intro to Fusion and SearchHub

• Fusion Configuration, UI, Middle Tier

• Data Acquisition

• Deployment

• Signals and Machine Learning

• Next Steps

Fusion in a Nutshell

• Drive next generation relevance via Content, Collaboration and Context

• Built on best in class Open Source: Apache Solr + Spark

• Simplify application development and reduce ongoing maintenance

• Access data from anywhere to build intelligent, data-driven applications

Fusion

[Architecture diagram. Labels: Security built-in; Apache Solr (shards); Apache ZooKeeper (ZK 1 … ZK N: leader election, load balancing, shared config management); Apache Spark cluster (manager, workers); core services (NLP, recommenders / signals, blob storage, pipelines, scheduling, alerting / messaging, connectors); REST API; Admin UI; Lucidworks View; HDFS (optional). Data sources: logs, file, web, database, cloud, Hadoop.]

Fusion Configuration, UI and Middle Tier

• UI

• Derivative of Lucidworks View (https://lucidworks.com/products/view/)

• Deep integration of the Snowplow JavaScript Tracker (https://github.com/snowplow/snowplow/wiki/javascript-tracker)

• Python Flask middle tier ($SEARCHHUB_HOME/python)

• Data sources (project_config)

• Pipelines (fusion_config)

• Schedules (fusion_config)

Data Acquisition

• Sources:

• ASF Mail archives mirrored at: http://asfmail.lucidworks.io

• Stack Overflow (SO)

• GitHub

• Processing

• Pipelines, including custom stage for parsing mail

• Main Challenges:

• “fail2ban” by the ASF

• Focused crawling of SO — JSoup FTW! (try.jsoup.org)

• Mail Threads
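The slides do not say how SearchHub reconstructs mail threads, so the following is only a minimal sketch of one common approach: link each message to its parent through the standard `In-Reply-To` header and group messages by the root of that chain. The function names `thread_root` and `group_threads` and the sample messages are hypothetical, not taken from the SearchHub code.

```python
from collections import defaultdict
from email import message_from_string


def thread_root(msg_id, parent_of):
    """Follow In-Reply-To links upward until no known parent remains."""
    seen = set()  # guards against reply cycles in malformed archives
    while msg_id in parent_of and msg_id not in seen:
        seen.add(msg_id)
        msg_id = parent_of[msg_id]
    return msg_id


def group_threads(raw_messages):
    """Map each thread's root Message-ID to the Message-IDs in that thread."""
    parent_of = {}
    ids = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = (msg["Message-ID"] or "").strip()
        if not msg_id:
            continue  # cannot thread a message without an id
        ids.append(msg_id)
        parent = (msg["In-Reply-To"] or "").strip()
        if parent:
            parent_of[msg_id] = parent
    threads = defaultdict(list)
    for msg_id in ids:
        threads[thread_root(msg_id, parent_of)].append(msg_id)
    return dict(threads)


# Hypothetical sample archive: one three-message thread and one singleton.
raw = [
    "Message-ID: <1@example.org>\nSubject: Faceting question\n\nHow do I facet?",
    "Message-ID: <2@example.org>\nIn-Reply-To: <1@example.org>\n"
    "Subject: Re: Faceting question\n\nTry facet.field",
    "Message-ID: <3@example.org>\nIn-Reply-To: <2@example.org>\n"
    "Subject: Re: Faceting question\n\nThanks",
    "Message-ID: <4@example.org>\nSubject: Unrelated\n\nHello",
]
threads = group_threads(raw)
```

If a parent message is missing from the archive, its replies are still grouped together under the missing parent's id. Real archives also need fallbacks this sketch omits, such as the `References` header and subject matching.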

Deployment

• Client and Middle Tier run in a Docker container using Apache HTTPd and mod_wsgi

• Hosted on AWS (m4.2xlarge instances)

• Fusion backend is out-of-the-box 2.4.2 with extra memory for Connectors and Solr

• README has the gory details: https://github.com/lucidworks/searchhub/blob/master/README.md
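For readers unfamiliar with mod_wsgi: by default Apache loads a Python script file and serves the WSGI callable named `application` that it finds there. The sketch below is a standard-library stand-in that shows only the shape of such an entry point. The real SearchHub middle tier is a Flask app (a Flask app object is itself a WSGI callable), and the JSON body here is purely illustrative.

```python
import json


def application(environ, start_response):
    """WSGI entry point; `application` is the name mod_wsgi looks for by default."""
    # Illustrative response only: echo the requested path back as JSON.
    body = json.dumps(
        {"status": "ok", "path": environ.get("PATH_INFO", "/")}
    ).encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

With Flask, the script file typically just imports the app object under the name `application`. The linked README covers the actual container setup.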

Signals

• UI is fully instrumented, using the Snowplow JavaScript Tracker, for most user interactions. See SnowplowService.js

• Captures, amongst other things:

• User Id, Session Id, Unique Query Id, IP address, Location, Timing data

• Actions tracked:

• Page View

• Page Ping (heartbeat) every 30 seconds

• Search with query, displayed doc list and displayed facet list

• Clicks with query, doc id, position, score and query UUID

• Typeahead Clicks with characters typed and suggestions offered
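To make the click signal concrete, here is a minimal sketch of such a record as a Python dataclass. The field names and sample values are assumptions chosen to mirror the bullets above (query, doc id, position, score, query UUID). They are not the actual Snowplow or Fusion signal schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ClickSignal:
    """One click on a search result (illustrative field names)."""
    user_id: str
    session_id: str
    query: str
    query_id: str   # unique id shared by a search event and its clicks
    doc_id: str
    position: int   # rank of the clicked result in the displayed list
    score: float    # relevance score the engine gave the clicked doc
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    type: str = "click"


def new_query_id() -> str:
    """Generate the per-query UUID that ties clicks back to a search."""
    return str(uuid.uuid4())


# Hypothetical values, for illustration only.
qid = new_query_id()
signal = ClickSignal("u-123", "s-456", "solr faceting", qid, "doc-42", 2, 7.31)
payload = json.dumps(asdict(signal), sort_keys=True)
```

Carrying the same query id on both the search event and its clicks lets later jobs join clicks to the exact result list the user saw.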

Machine Learning

• Fusion makes it easy to “round-trip” ML data/models between Spark and Solr

• Examples of:

• Recommenders

• Spark Lucene tokenization

• k-Means

• Word2Vec

• Topic Detection (LDA)

• Random Forests Classifier

• Many examples in SparkShellHelpers.scala

Experiment Management and Bandits

Get Started

• Goal: Experimentation, not hard coded rules*

• Goal: Drive down the cost of experimentation

• “A/B testing on steroids”

• Exploration vs. Exploitation

• Fusion 3.0 (beta):

• Record and calculate relevance metrics from w/in Fusion (gold standard, TREC, other)

• Easily calculate MRR, NDCG, Precision, Recall and report over time

• Support for Bandits: Epsilon-Greedy, SoftMax, UCB1
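The metrics and bandit strategies named above are standard. The following is a minimal textbook sketch in plain Python, not Fusion's implementation. It covers reciprocal rank (MRR is its mean over queries), precision, recall and NDCG at k, plus epsilon-greedy and UCB1 arm selection. SoftMax is omitted for brevity. The class name `Bandit` and the variant names are hypothetical.

```python
import math
import random


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant docs that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for d in ranked_ids[:k] if d in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, gains, k):
    """gains maps doc id -> graded relevance; missing ids count as 0."""
    dcg = sum(
        gains.get(d, 0) / math.log2(i + 2)
        for i, d in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


class Bandit:
    """Tracks a running mean reward per arm (e.g. per pipeline variant)."""

    def __init__(self, arms):
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n

    def select_epsilon_greedy(self, epsilon, rng=random):
        # Explore a random arm with probability epsilon, else exploit the best.
        if rng.random() < epsilon:
            return rng.choice(list(self.counts))
        return max(self.values, key=self.values.get)

    def select_ucb1(self):
        # Try every arm once, then pick the highest upper confidence bound.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        total = sum(self.counts.values())
        return max(
            self.counts,
            key=lambda a: self.values[a]
            + math.sqrt(2 * math.log(total) / self.counts[a]),
        )


# Hypothetical usage: the reward is the reciprocal rank of the clicked doc.
bandit = Bandit(["pipeline_a", "pipeline_b"])
arm = bandit.select_ucb1()
bandit.update(arm, reciprocal_rank(["d1", "d7", "d3"], {"d7"}))
```

Using a relevance metric as the reward is one way to connect the two bullets: each served query becomes a small experiment, and traffic shifts toward the variant that earns better scores.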


Demo

Still Hungry?

• “Combining Content and Collaboration in Recommenders” by Jake Mannix: Friday at 1:10 pm http://sched.co/7amt

• https://github.com/lucidworks/searchhub

• http://searchhub.lucidworks.com

• Email: [email protected]

• Twitter: @gsingers

• Web: http://lucidworks.com