Crawling, Indexing, and Searching Software Project Data with Droids, Tika, Solr & friends


Upload: oleksiy-kovyrin

Post on 10-Apr-2018


Copyright 2010 Sematext Int'l. All rights reserved.

ProjectHub

Crawling, Indexing, and Searching Software Project Data

with Droids, Tika, Solr & friends

Otis Gospodnetić [email protected] @otisg

Sematext Int'l www.sematext.com @sematext

What I Will Cover

Who I am

What

Why

Where

Architecture

Info Gathering & Indexing

Search & Extra Search Dog Food

Performance & Analytics

Ops & Stats

About Otis Gospodnetić

Lucene/Solr/Nutch/Mahout/... committer

Lucene in Action 1 & 2 co-author

Lucene Consulting since 2005

Sematext International since 2007

About Sematext

Search (Lucene, Solr, Elastic Search...)

Web Crawling (Nutch)

Machine Learning (Mahout)

Big Data (Hadoop, HBase, Voldemort...)

What

Search everything about a Software Project

Lucene & Hadoop

 - All sub-projects

 - All content

Mailing list archives

JIRA issues

Web site & Wiki pages

Source code (local syntax highlighting), trunk

Javadoc, trunk

Why

We need it

Other Hadoop, Lucene, Solr... users need it

Our own playground

Live product demos

Yummy dog food

Where

search-lucene.com

search-hadoop.com

Other suggestions / needs?

In your Enterprise?

Architecture

Tool Matrix

Data Source    Fetch                      Parse

JIRA           URLConnection (feed)       Digester (feed), DOM (item)

ML             FileInputStream (fs),      Digester (feed), MIME4J (mbox)
               URLConnection (feed),
               Droid (works, unused)

Web site       Droids                     Tika via Droids

Wiki           Droids                     Tika via Droids

Source code    svn co                     QDox

Javadoc        svn co                     QDox

Information Gathering

Multiple independent JVM processes (cron)

Different polling frequencies

Different data sources / formats:

 - RSS (JIRA, Mailing Lists)

 - Mbox (Mailing Lists)

 - HTTP/HTML (Web site, Wiki)

 - Subversion (source code, Javadoc)

Nutch is a beast. Droids is light & simple.

ML thread detection is tricky

Finding deleted docs (Wiki, Web, Javadoc...)

Thread Detection

Email clients are kaput

SMTP headers are unreliable

Heuristics are needed

 - Try headers

 - Fall back to subjects (get subject skeleton, calculate hash)

 - Factor in time (4 weeks)

 - Use index for thread info retrieval

Q: Are there any libraries for this?
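
The heuristics above could be sketched roughly as follows. This is a minimal Python illustration, not the ProjectHub code: the message dict layout, the prefix regex, the MD5 choice, and the two in-memory dicts (standing in for the index lookups the slide mentions) are all assumptions.

```python
import hashlib
import re

FOUR_WEEKS_SECONDS = 28 * 24 * 3600  # the "4 weeks" window from the slide

# Assumed definition of "skeleton": strip Re:/Fwd:/AW: prefixes and [list-tag] markers.
_PREFIX = re.compile(r"^\s*((re|fwd?|aw)\s*:|\[[^\]]*\])\s*", re.IGNORECASE)


def subject_skeleton(subject):
    """Drop reply/forward prefixes and list tags, collapse whitespace, lowercase."""
    s = subject
    while True:
        stripped = _PREFIX.sub("", s, count=1)
        if stripped == s:
            break
        s = stripped
    return re.sub(r"\s+", " ", s).strip().lower()


def subject_hash(subject):
    """Hash of the subject skeleton, used as the fallback thread key."""
    return hashlib.md5(subject_skeleton(subject).encode("utf-8")).hexdigest()


def assign_thread(msg, by_message_id, by_subject_hash):
    """Return a thread id for msg and record it in both lookup tables.

    msg is a hypothetical dict with keys: message_id, in_reply_to,
    references (list), subject, timestamp (seconds).
    """
    parents = [msg.get("in_reply_to")] + list(reversed(msg.get("references") or []))
    for parent in parents:
        # 1. Try headers: a known parent decides the thread.
        if parent and parent in by_message_id:
            thread_id = by_message_id[parent]
            break
    else:
        # 2. Fall back to the subject hash, 3. but only within the time window.
        known = by_subject_hash.get(subject_hash(msg["subject"]))
        if known and abs(msg["timestamp"] - known[1]) <= FOUR_WEEKS_SECONDS:
            thread_id = known[0]
        else:
            thread_id = msg["message_id"]  # start a new thread
    by_message_id[msg["message_id"]] = thread_id
    by_subject_hash[subject_hash(msg["subject"])] = (thread_id, msg["timestamp"])
    return thread_id
```

Messages are expected in roughly chronological order; a reply that arrives before its parent would start its own thread in this sketch.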

Indexing

Use StreamingUpdateSolrServer

AutoCommit use-case

Solr index abuse: track seen/unseen

&qsrc=indexer

&warmUp=true

Separate processes - easier reindexing (esp. with frequent project infra changes)

Treating quoted portions of ML messages
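
The slide does not say how seen/unseen tracking works. One plausible reading, tied to the "finding deleted docs" problem from the Information Gathering slide, is a mark-and-sweep pass: stamp every document found in the current crawl, then drop whatever was not stamped. The sketch below is only that guess. A plain dict stands in for the Solr index, and the `last_seen` field and `run_id` are hypothetical names.

```python
def sweep_unseen(index, crawled_ids, run_id):
    """Mark crawled docs as seen in this run, delete the rest, return deleted ids.

    index: dict mapping doc id -> dict of fields (stand-in for the real index).
    crawled_ids: ids of documents found by the current crawl.
    run_id: any value unique to this crawl run (hypothetical marker).
    """
    # Mark: every document the crawler saw gets the current run id.
    for doc_id in crawled_ids:
        index.setdefault(doc_id, {})["last_seen"] = run_id

    # Sweep: anything not stamped in this run is treated as deleted at the source.
    stale = sorted(d for d, fields in index.items()
                   if fields.get("last_seen") != run_id)
    for doc_id in stale:
        del index[doc_id]
    return stale
```

A sweep like this should only run after a crawl that completed successfully, otherwise an aborted run would look like a mass deletion.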

Search

Facets (multi-select)

 - Project

 - Data source/type

 - Author (based on names only)
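
Multi-select faceting in Solr is commonly done by tagging each filter query and excluding that tag when counting the same facet. The slides do not show the actual request, so the sketch below is an assumption about how it might look; the field names (`project`, `type`) and the helper name are hypothetical, while the `{!tag=...}` and `{!ex=...}` local params are standard Solr syntax.

```python
from urllib.parse import urlencode


def multiselect_params(query, selections, facet_fields):
    """Build Solr request params for multi-select facets.

    selections: dict mapping facet field -> list of selected values.
    facet_fields: fields to facet on.
    """
    params = [("q", query), ("facet", "true")]
    for field, values in selections.items():
        if values:
            # Tag the filter so the matching facet can ignore it.
            ored = " OR ".join('"%s"' % v for v in values)
            params.append(("fq", "{!tag=%s}%s:(%s)" % (field, field, ored)))
    for field in facet_fields:
        if selections.get(field):
            # Exclude this field's own filter so unselected values keep their counts.
            params.append(("facet.field", "{!ex=%s}%s" % (field, field)))
        else:
            params.append(("facet.field", field))
    return params


# Example: query string for a search with two projects selected.
query_string = urlencode(multiselect_params(
    "faceting", {"project": ["Lucene", "Solr"]}, ["project", "type"]))
```

Values are quoted but not escaped here; real input would need escaping of embedded quotes.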

Boosting more recent documents vs. pure relevance vs. newest/oldest first

Give equivalent of 0.5 year to docs w/ empty updateDate field (e.g. javadocs)

recip(map(ms(NOW,updateDate),6.32e11,3.16e12,1.58e10),3.16e-11,4,1)^4
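
To see what the expression above computes, the two Solr functions can be mirrored in a few lines of Python. Solr defines `map(x,min,max,target)` as `target` when `min <= x <= max` (otherwise `x`) and `recip(x,m,a,b)` as `a/(m*x+b)`. The constants suggest 3.16e10 ms per year, so 6.32e11 to 3.16e12 ms is roughly 20 to 100 years and 1.58e10 ms is half a year. Presumably a document with an empty updateDate evaluates to an age measured from the epoch, which falls in that range. The trailing `^4` is read here as a boost weight on the function, which is an assumption and is left out of the calculation.

```python
MS_PER_YEAR = 3.16e10  # approximation implied by the constants on the slide


def solr_map(x, lo, hi, target):
    """Mirror of Solr map(x,min,max,target): target if lo <= x <= hi, else x."""
    return target if lo <= x <= hi else x


def solr_recip(x, m, a, b):
    """Mirror of Solr recip(x,m,a,b) = a / (m*x + b)."""
    return a / (m * x + b)


def recency_value(age_ms):
    """Value of recip(map(age,6.32e11,3.16e12,1.58e10),3.16e-11,4,1) for a given age."""
    mapped = solr_map(age_ms, 6.32e11, 3.16e12, 1.58e10)
    return solr_recip(mapped, 3.16e-11, 4, 1)


# A brand-new doc scores 4, a one-year-old doc about 2, and the value keeps
# falling with age; ages in the 20-100 year range are treated as half a year.
fresh = recency_value(0)
one_year = recency_value(MS_PER_YEAR)
no_date = recency_value(1.26e12)  # about 40 years, e.g. an epoch-based age
```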

Search cont'd

Query Spellchecker

Sematext components:

 - ReSearcher & Relaxer

 - AutoComplete

 - Key Phrase Extractor (2 approaches)

Threaded vs. flat view

In-document search term highlighting

Short URLs

Search cont'd

Dog food #1: Auto-Complete

Source: nightly refreshed subject and titles

Approach: go directly to selection

sematext.com/products/autocomplete/

Dog food #2: ReSearcher & Relaxer

Avoid "sorry, no/poor matches"

Multiple algos trigger re-searching

Different forms of relaxing

sematext.com/products/dym-researcher/

Dog food #3: Key Phrases

Help narrow search results, like facets

2 types:

 - Stored in index vs. calculated from top N hits

sematext.com/products/key-phrase-extractor/

Basic Search Analytics

Top queries, top terms...

Daily, weekly, monthly

MRR

http://en.wikipedia.org/wiki/Mean_reciprocal_rank
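
Mean reciprocal rank is the average, over queries, of 1/rank of the first relevant result. A minimal sketch follows; treating the first clicked result as "relevant" and scoring queries with no relevant result as 0 are assumptions, since the slides do not say how MRR was measured.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank over queries.

    first_relevant_ranks: one entry per query, the 1-based rank of the first
    relevant (e.g. clicked) result, or None if nothing relevant was returned.
    """
    if not first_relevant_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank)
    return total / len(first_relevant_ranks)


# Three queries whose first relevant hits were at ranks 3, 2 and 1.
example = mean_reciprocal_rank([3, 2, 1])  # (1/3 + 1/2 + 1) / 3
```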

Very Basic Search Analytics

Real Search Analytics

Performance & Monitoring: RPM

Availability: Site24x7.com

Operations

Small EC2 instance: 1.7 GB RAM

EBS for data - got burnt once

Local disk for index

Solr 1.4.1 multi-core

Performance monitoring via RPM

Availability & performance via site24x7.com

Statistics

search-hadoop.com

 - 110K+ documents

 - ~700 MB optimized

search-lucene.com

 - 170K+ documents

 - ~900 MB optimized

Future

Field collapsing (threads)

Bot detection (load) DONE

Solr duplicate detection (release notes)

Relevance tuning (MRR)

Open sourcing?

World-wide!

Search & Data Analytics

Machine Learning & NLP

Big Data

 [email protected]

WE ARE HIRING

Questions

Contact

sematext.com

blog.sematext.com

@sematext

@otisg

[email protected]
