Searching for Search Solutions
Harvard IT Summit
June 23, 2011
Randy Stern | HUL/OIS
David Heitmeyer | HUIT
Searching the Web
Searching a Site
Searching a Collection
Searching Geospatially
Search at Harvard – Web
Search at Harvard – Collections
• People
• Courses
• Grants
• Libraries
• ...and many other things
Search at Harvard – Libraries
Search at Harvard – Federated
Search Models
• “To oversimplify, there's the Google model and the faceted navigation model.” – Morville & Callender, Search Patterns
• Keyword (“Google”)
– Keyword search against an index
• Advanced Search
– Searching or selecting specific fields
• Faceted Search (“Guided Navigation”)
– Integrated search and browse
– Keyword search
– Browse by category metadata
– “No dead ends”
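The faceted model can be sketched in a few lines of Python. The records and field names below are invented for illustration and do not come from any Harvard system. The sketch combines a keyword match with facet filters, and it only offers facet values that occur in the current hits, which is one way to read "no dead ends".

```python
from collections import Counter

# Hypothetical records; titles and field names are illustrative only.
RECORDS = [
    {"title": "Intro to Search", "school": "FAS", "term": "Fall"},
    {"title": "Search Engines", "school": "SEAS", "term": "Spring"},
    {"title": "Art History", "school": "FAS", "term": "Fall"},
]


def search(records, keyword="", filters=None):
    """Return records whose title contains keyword and that match every facet filter."""
    filters = filters or {}
    keyword = keyword.lower()
    return [
        r for r in records
        if keyword in r["title"].lower()
        and all(r.get(field) == value for field, value in filters.items())
    ]


def facet_counts(hits, fields):
    """Count facet values over the current hits; values with no hits never appear."""
    return {field: Counter(r[field] for r in hits) for field in fields}


hits = search(RECORDS, keyword="search")
counts = facet_counts(hits, ["school", "term"])
narrowed = search(RECORDS, keyword="search", filters={"school": "FAS"})
```

Because the facet values are computed from the hits themselves, any value the user clicks is guaranteed to return at least one record.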
Advanced Search
Faceted Search
Search Technologies – Summary
Technology | Products | Examples at Harvard
Web search | Google, Yahoo, Bing | everywhere
Site search | Google Search Appliance, Nutch, Sphinx, Elasticsearch | www.harvard.edu
Relational database | Oracle, MySQL, PostgreSQL | PeopleSoft, Aleph, DRS, HOLLIS Classic
XML database | Tamino, eXist | VIA, OASIS, Virtual Collections
Spatially enabled | ArcSDE, PostGIS | Harvard Geospatial Library, WorldMap
Archived web search | NutchWAX/Lucene | Library Web Archiving Service
Full text and faceted search | Apache Solr/Lucene, Endeca, Autonomy, MS FAST | Library Full Text Search Service, HOLLIS, iSites, Course Catalog
Federated search | Ex Libris MetaLib | Library Cross Search
Apache Lucene
• Open source from Apache
• High-performance, full-featured text search engine library written entirely in Java
• Text-based inverted index
• Documents of name/value pairs
• Stemming and tokenizers for various applications and languages
• Query syntax – and/or/not/near
• Highlighter
• **FAST**
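To make the inverted index idea concrete, here is a minimal pure-Python sketch. It is not Lucene's API: the tokenizer is a naive lowercase split with no stemming, the sample documents are invented, and Python set operations stand in for the and/or/not query operators.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Naive tokenizer: lowercase alphanumeric runs. Lucene's analyzers do far more."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for value in fields.values():
            for term in tokenize(value):
                index[term].add(doc_id)
    return index


# Hypothetical documents made of name/value pairs.
DOCS = {
    1: {"title": "Mona Lisa", "genre": "painting"},
    2: {"title": "Bronze Horse", "genre": "sculpture"},
    3: {"title": "The Last Supper", "genre": "painting"},
}

index = build_index(DOCS)
all_ids = set(DOCS)

painting_and_lisa = index["painting"] & index["lisa"]   # AND
horse_or_supper = index["horse"] | index["supper"]      # OR
not_painting = all_ids - index["painting"]              # NOT
```

Each query touches only the posting sets for its terms rather than scanning every document, which is the main reason this structure is fast.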
Apache Solr
• “Solr is the popular, blazing fast open source enterprise search platform from Apache”
• A REST Web Service on top of Lucene for indexing and querying
– XML and JSON output
• Caching for faster response
• Faceting
• Web management interface
• XML schema configuration files
• “did you mean?” and “more like this” support
• Scalable server model
• Very active development community
http://lucene.apache.org/solr/
Apache Solr/Lucene Ecology
[Diagram labels: Library catalogs; Enterprise databases; Web archives (Nutch, NutchWAX); Text; Fielded data; Lucene; Solr; "Highly scalable with Hadoop cluster"]
Solr Indexing
• Indexing: HTTP POST to http://mysolrserver/solr/update
<add>
  <doc>
    <field name="id">13579</field>
    <field name="title">Mona Lisa</field>
    <field name="creator">Leonardo DaVinci</field>
    <field name="year">1519</field>
    <field name="genre">painting</field>
  </doc>
</add>
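As a rough sketch of how a client might produce that message, the Python below builds the add XML from a dictionary and prepares the HTTP POST using only the standard library. The helper names are hypothetical, the server URL is the placeholder from the slide, and the request is prepared but never sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialize a dict of field name/value pairs as a Solr <add><doc> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def build_update_request(solr_url, xml_body):
    """Prepare (but do not send) the HTTP POST to the update handler."""
    return urllib.request.Request(
        solr_url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


xml_body = build_add_xml({"id": 13579, "title": "Mona Lisa", "year": 1519})
request = build_update_request("http://mysolrserver/solr/update", xml_body)
# urllib.request.urlopen(request) would send it. A commit is normally
# needed before the new document becomes visible to searches.
```

Building the XML with ElementTree rather than string concatenation means characters such as & and < in field values are escaped automatically.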
Solr Searching
http://mysolrserver/solr/select?q=Davinci&start=0&rows=2&fl=title,genre
<response>
  <result numFound="43" start="0">
    <doc>
      <str name="title">Mona Lisa</str>
      <str name="genre">painting</str>
    </doc>
    <doc>
      <str name="title">Bronze Horse</str>
      <str name="genre">sculpture</str>
    </doc>
  </result>
</response>
Solr Searching
http://mysolrserver/solr/select?q=Davinci&start=0&rows=2&fl=title,genre&wt=json
{"response": {
  "numFound": 43,
  "start": 0,
  "docs": [
    {"title": "Mona Lisa", "genre": "painting"},
    {"title": "Bronze Horse", "genre": "sculpture"}
  ]
}}
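A client for the JSON form of the query could look roughly like this. The function names are hypothetical, and the sample body is the response from the slide rather than output from a live server, so nothing here touches the network.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, query, start=0, rows=10, fields=None):
    """Build a Solr select URL that asks for JSON output."""
    params = {"q": query, "start": start, "rows": rows, "wt": "json"}
    if fields:
        params["fl"] = ",".join(fields)
    return base_url + "/select?" + urlencode(params)


def titles_from_response(body):
    """Return the total hit count and the titles on the current page."""
    data = json.loads(body)["response"]
    return data["numFound"], [doc["title"] for doc in data["docs"]]


url = build_select_url(
    "http://mysolrserver/solr", "Davinci", rows=2, fields=["title", "genre"]
)

# Sample body copied from the slide above.
SAMPLE = (
    '{"response": {"numFound": 43, "start": 0, "docs": ['
    '{"title": "Mona Lisa", "genre": "painting"}, '
    '{"title": "Bronze Horse", "genre": "sculpture"}]}}'
)
total, titles = titles_from_response(SAMPLE)
```

Note that urlencode percent-encodes the comma in the fl parameter (title%2Cgenre); the server decodes it back to the same value as the hand-written URL.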
Use of Solr Exploding
• Whitehouse.gov, FCC.gov, Comcast / xfinity, AT&T Interactive, AOL (Yellow Pages, Music, NFL Sports, Recipes), Sears, Ticketmaster, Digg, Netflix, Zappos.com, and many more
• Open source library catalogs
– Blacklight (Ruby), VuFind (PHP)
• Open source digital repositories
– Fedora, DSpace
• Support available from Lucid Imagination (Solr creators)
Source: http://wiki.apache.org/solr/PublicServers
Harvard University Course Catalog
coursecatalog.harvard.edu
Solr & Course Catalog
• 9,000+ courses from 13 schools/programs
• 15 MB index size
– fields are indexed and stored
• Search + Faceted Navigation
– School, calendar period, term, department, day, time, cross-registration status, credit level
• Updated daily
– REST interface: HTTP POST of XML files
• XSLT/XPath 2 processing of XML data from Solr
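Search plus faceted navigation maps onto Solr's standard facet, facet.field, and fq request parameters. The sketch below is an assumption about how such a request could be built, not the Course Catalog's actual code; the field names school and term and the sample counts are invented.

```python
import json
from urllib.parse import urlencode


def build_facet_url(base_url, query, facet_fields, filters=None):
    """Build a Solr select URL with facet counts and optional fq filters."""
    params = [("q", query), ("wt", "json"), ("facet", "true")]
    params += [("facet.field", name) for name in facet_fields]
    params += [
        ("fq", f'{field}:"{value}"') for field, value in (filters or {}).items()
    ]
    return base_url + "/select?" + urlencode(params)


def parse_facet_field(flat):
    """Solr's default JSON lists facet values as [value, count, value, count, ...]."""
    return dict(zip(flat[0::2], flat[1::2]))


url = build_facet_url(
    "http://mysolrserver/solr", "history", ["school", "term"], {"school": "FAS"}
)

# Hypothetical fragment of a response; the counts are made up.
SAMPLE = '{"facet_counts": {"facet_fields": {"school": ["FAS", 120, "HLS", 45]}}}'
school_counts = parse_facet_field(
    json.loads(SAMPLE)["facet_counts"]["facet_fields"]["school"]
)
```

Each facet the user selects becomes another fq filter, while q keeps the original search terms.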
Course Catalog – Searching and Facets
[Screenshot callouts: Search Terms; Facets: School, Semester, Department, Credit Level, Day of Week, Cross Registration, Term within School, Time of Day Offered]
Course Catalog
• Access to data for other applications
• OpenSearch browser plugins
iSites
• 5,500 course websites each year
• 20,000 websites
• 16,000 students
• 8 student portals
• 33,000 users on a peak day
Search within iSites
Solr & iSites
• 4.5 million items
– File, topic, forum, image, page, html, sign-up event, video, audio, site, link, wiki, announcement, podcast
– Crawlers use database and file system
• MS Office, PDF, Audio (metadata), OpenDocument, RTF, Text, HTML, XML
• 35 GB index size
• Updated hourly
– Master and slave
• Search Tool - Permissions
Search – New Ways of Navigating
Harvard Library Full Text Search Service
Full Text Search Service
• Uses Lucene directly
• Full text index of OCR page text for digitized books and other page-turned objects
• Relevance ranked searching
• Hits in context
• ~81,000 objects so far, 7.2 million pages
• Index size 8.5GB
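"Hits in context" means showing each match with some of the surrounding page text. The service's own implementation is not described on the slide; the sketch below is a simple stand-in using a regular expression, with an invented sample page.

```python
import re


def hits_in_context(text, term, width=30):
    """Return snippets of up to `width` characters on each side of every match."""
    snippets = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end])
    return snippets


# Hypothetical OCR page text.
PAGE = "The whale surfaced at dawn. By noon the Whale was gone."
snippets = hits_in_context(PAGE, "whale", width=10)
```

Producing snippets like these requires the page text itself at query time, not just the index terms, so the text has to be stored alongside the index.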
Harvard Library Web Archiving Service
Web Archiving Service
• Lucene plus NutchWAX full text index of harvested web pages and harvested resources
• Indexing HTML, PDFs, Word docs, PPTs, etc. and collection metadata
• Currently a “small” web archive
– 265 web sites
– 13M web pages
– 100M web resources, 1TB of archived web data
• Index size 170GB and growing
– 80-90% of index size is full text required for “hit in context” search results
• Search results return in 3-5 seconds on an ordinary dual-core Linux box
DRS 2 Web Administrator
Facets to come!!
• Solr for digital object management searching
– Digital preservation objects have many fields that may be important for collection management or preservation planning
– Faceted browse – by user tags, content type, owners, etc.
– Full text searching for descriptions and process info
• Easy to configure, update, and use (HTTP and simple URLs)
• Indexing metadata plus full text embedded in object descriptors, rather than the content of files themselves
• Scoped at release:
– 152 fields
– 30 million records, index size of 60GB
– master/slave configuration
Email Archiving Service
• Why Solr for email object management?
– relevance ranking
– facets
– full text searching of both email body and header fields
• Indexing email header fields, rights and collection metadata, plus full text from emails
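One way to picture the indexing step is to flatten each message into name/value pairs before it is sent to Solr. The sketch below uses Python's standard email module; the output field names, the collection value, and the sample message are all assumptions made for illustration.

```python
from email import message_from_string

# Hypothetical message; addresses use the reserved example.org domain.
RAW = """From: alice@example.org
To: bob@example.org
Subject: Budget meeting
Date: Thu, 23 Jun 2011 10:00:00 -0400

Please review the attached budget before Friday.
"""


def email_to_doc(raw, collection):
    """Flatten one single-part message into name/value pairs ready for indexing."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "date": msg["Date"],
        "collection": collection,  # hypothetical collection metadata field
        "body": msg.get_payload().strip(),
    }


doc = email_to_doc(RAW, collection="example-collection")
```

Keeping headers as separate fields is what allows faceting on sender or date, while the body field supports full text search. Multipart messages and attachments would need extra handling that this sketch omits.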
Searching for Search Solutions
• Integrating multiple forms of data (text, images, audio, maps, etc.) into single searchable indexes
• Aggregating Indexes
– Google, Google Books, Google Scholar
– Licensed cloud services for articles, books, media, everything
– Library Cloud
– DPLA
• Semantic Web
– Linked Data, RDF, HTML 5’s Microdata, Microformats
• Mobile (Localized)
• Specialized search vs. general search – there’s an app for that