Wikipedia Cloud Search Webinar
DESCRIPTION
View this webinar, presented by Search Technologies' Chief Architect Paul Nelson, on cloud search and a Wikipedia use case. The webinar was given in conjunction with Amazon CloudSearch. Search Technologies provides implementation and consulting services for Amazon CloudSearch. For further information, see http://www.searchtechnologies.com/amazon-cloudsearch-services.html and http://www.searchtechnologies.com/
TRANSCRIPT
1. Searching Wikipedia with Amazon CloudSearch
2. Agenda
• Project Background
• High-level Architecture
• Summary & Observations
3. Project Background
• Amazon contracted with Search Technologies to help with beta-testing, prior to the launch of Amazon CloudSearch
• Decision to use Wikipedia as a convenient data set for testing purposes
4. High-level Architecture
5. Indexing
• Wikipedia provides content in a series of large XML files
• Amazon CloudSearch ingests XML in a specified form
• Various content processing tasks to perform
  • Splitting into individual documents
  • Date normalization
  • Metadata extraction & mapping
  • Cleanup, etc.
• We used Aspire for these tasks
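The content processing tasks listed on this slide (splitting, date normalization, field mapping, cleanup) can be illustrated with a minimal Python sketch. This is not Aspire and not the project's actual pipeline. The sample XML, the output field names (`id`, `title`, `last_modified`, `body`) and the epoch-seconds date format are all assumptions chosen for illustration, and a real Wikipedia dump also carries an XML namespace that this sketch leaves out.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Tiny stand-in for a Wikipedia dump file; the structure is simplified (assumption).
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Amazon CloudSearch</title>
    <id>1</id>
    <revision>
      <timestamp>2012-04-12T08:15:30Z</timestamp>
      <text>Amazon CloudSearch is a managed search service.</text>
    </revision>
  </page>
  <page>
    <title>Wikipedia</title>
    <id>2</id>
    <revision>
      <timestamp>2012-03-01T00:00:00Z</timestamp>
      <text>  Wikipedia is a free encyclopedia.  </text>
    </revision>
  </page>
</mediawiki>"""


def normalize_date(raw):
    """Date normalization: ISO 8601 'Z' timestamp -> UTC epoch seconds (assumed target format)."""
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def split_pages(stream):
    """Splitting: yield one flat dict per <page>, streaming so the file is never fully loaded."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "page":
            continue
        yield {
            # Metadata extraction and mapping onto hypothetical index field names.
            "id": elem.findtext("id"),
            "title": elem.findtext("title"),
            "last_modified": normalize_date(elem.findtext("revision/timestamp")),
            # Cleanup: here just trimming surrounding whitespace.
            "body": (elem.findtext("revision/text") or "").strip(),
        }
        elem.clear()  # release the subtree that was just processed


docs = list(split_pages(io.BytesIO(SAMPLE_DUMP.encode("utf-8"))))
for doc in docs:
    print(doc["id"], doc["title"], doc["last_modified"])
```

Streaming with `iterparse` matters for this kind of input because the dump files are large; each page is emitted and then discarded rather than held in memory.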
6. Aspire in Brief
• Based on Apache Felix / OSGi
• Thread-safe, multi-threaded, distributable
• Any number of pipelines, conditional branching
• Plug-in components individually testable & upgradable
• In use with FAST ESP, FS4SP, Solr, Amazon CloudSearch, GSA
• Tested with Elasticsearch and SP 2013
7. XML Input
8. Indexing
• Streaming Wikipedia dump files directly into CloudSearch
• 500 docs/second achieved without much effort
• Using 4 x XL instances of CloudSearch
• 1 x XL EC2 instance for Aspire
9. Searching
• Amazon CloudSearch provides a RESTful/XML interface for search purposes
• For the Wikipedia project, we needed a UI
• Chose to use Twigkit
• Wrote a Java API for CloudSearch
• The Java API is freely downloadable (with source) at http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html
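To give a feel for what querying a RESTful search interface involves, here is a minimal Python sketch that assembles a search request URL. It is not the Java API mentioned above. The endpoint host, the API version path and the parameter names (`q`, `size`, `start`, `return-fields`) are assumptions for illustration and should be checked against the CloudSearch documentation for the API version in use.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint and API path; substitute the values of a real search domain.
SEARCH_ENDPOINT = "http://search-example-domain.us-east-1.cloudsearch.amazonaws.com"
API_PATH = "/2011-02-01/search"


def build_search_url(query, size=10, start=0, return_fields=("title",)):
    """Assemble a GET URL for a search request (parameter names are assumptions)."""
    params = {
        "q": query,
        "size": size,
        "start": start,
        "return-fields": ",".join(return_fields),
    }
    # urlencode takes care of escaping spaces and reserved characters.
    return SEARCH_ENDPOINT + API_PATH + "?" + urlencode(params)


url = build_search_url("cloud search", size=25)
print(url)

# Round-trip check: the query string decodes back to the values passed in.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"], decoded["size"])
```

A client library such as the Java API would typically wrap this URL construction, issue the HTTP request and parse the response into result objects for the UI layer.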
10. Searching
• Supports navigators and relevancy customization
• E.g. a “PageRank” style link analysis was performed
• Limits set high: e.g. retrieve 500,000 results in a single list, delivered in just a few seconds
• Very useful for analysis applications
• So, what does it look like?
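As background for the “PageRank” style link analysis mentioned on this slide, the following is a minimal Python sketch of the textbook power-iteration algorithm. The slides do not describe the analysis actually used in the project, so this is only the generic technique. The link graph, damping factor and iteration count are hypothetical values chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a dict of page -> list of outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each page divides its current rank equally among its outgoing links.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical link graph: every other page links to "Search_engine".
toy_links = {
    "Search_engine": ["Wikipedia"],
    "Wikipedia": ["Search_engine"],
    "Amazon": ["Search_engine"],
    "Cloud_computing": ["Search_engine", "Amazon"],
}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

One plausible use of such a score, stated here as an assumption, is to store it as a numeric field on each document so that it can contribute to relevancy ranking at query time.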
11. wikipedia.searchtechnologies.com
12. wikipedia.searchtechnologies.com
13. Summary & Observations
• A capable and scalable “raw” engine
• XML in, RESTful/XML out
• Easy to set up – much the same as an EC2 instance
• Elastic scalability
14. Summary & Observations
• Cost effective
• From $75 per month, including management / maintenance
• Extremely convenient
• Switch on / off at leisure
• Promotes experimentation & agility