APACHE SOLR CMS INTEGRATION
Ingo Renner
Software Engineer
we build smart.
ID INFIELD DESIGN
MAY.01.2013
LUCENE/SOLR REVOLUTION
TYPO3 CMS and Solr. How we did it.
ABOUT ID
What we do and who we do it for
• Strategy Planning
• Design
• UX
• Development & Integration
WHO IS THIS GUY?
• Committer TYPO3 CMS
• Committer and PMC member Apache Tika
• Release Manager TYPO3 CMS 4.2
• New San Franciscan
• Snowboarding, mountain biking
• Software Engineer, Architect at Infield Design
- Caution - TYPO3 Evangelist
TYPO3 CMS
• Free and Open Source Enterprise CMS
• Estimated 500,000+ installations worldwide
• 6,000+ public extensions
• 6,000,000+ downloads
• Content Management Framework
• Multi-Site, Multi-Language, Versioning, Workflows, ...
• Stable, Secure, Scalable
TYPO3 COMMUNITY
• Community driven development
• Conferences in North America, Europe, Asia
• Barcamps, Developer Days, Snowboard Tour
• Four-time Google Summer of Code participant
• Backed by TYPO3 Association
• Several other projects under the TYPO3 brand
SOLR & CMS INTEGRATION
Integration Challenges & Solutions
PAGE RENDERING
• Different template engines
• (too) flexible page rendering engine
• Identify relevant content on websites
• Exclude navigation and common page elements
• Content generated by plugins
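One way to separate relevant content from navigation and other shared page elements is to wrap the relevant region in comment markers inside the template and keep only what sits between them. The sketch below is in Python rather than PHP and assumes the marker pair `<!--TYPO3SEARCH_begin-->` / `<!--TYPO3SEARCH_end-->`; the function name and the sample page are made up for illustration.

```python
import re

# Assumed marker names; adjust to whatever the templates actually emit.
BEGIN = "<!--TYPO3SEARCH_begin-->"
END = "<!--TYPO3SEARCH_end-->"


def extract_relevant(html: str) -> str:
    """Return the plain text found between begin/end markers."""
    pattern = re.escape(BEGIN) + r"(.*?)" + re.escape(END)
    sections = re.findall(pattern, html, flags=re.DOTALL)
    text = " ".join(sections)
    text = re.sub(r"<[^>]+>", " ", text)      # drop remaining tags (naive)
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


page = (
    "<nav>Home | About</nav>"
    "<!--TYPO3SEARCH_begin--><h1>Solr</h1><p>Search for TYPO3.</p>"
    "<!--TYPO3SEARCH_end-->"
    "<footer>Imprint</footer>"
)
print(extract_relevant(page))
```

Anything outside the markers, such as the navigation and footer here, never reaches the index. A page without markers yields an empty string in this sketch, which a real indexer would have to treat as a policy decision.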
Integration Challenges & Solutions
INDEX QUEUE
• Index Queue to track and index content
• Record Monitor to update Index Queue
• Crawl pages, index unstructured content marked relevant
• Exclude pages with plugin-generated content
• Index structured plugin data directly from DB
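The interplay of Record Monitor and Index Queue can be sketched roughly as follows. This is a minimal in-memory illustration in Python, not the extension's actual code: the class and method names, the table names, and the use of integer timestamps are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class IndexQueue:
    """Tracks which records still need to be (re)indexed."""
    # Maps (table, uid) -> timestamp of the latest change.
    pending: dict = field(default_factory=dict)

    def mark(self, table: str, uid: int, changed: int) -> None:
        # Marking an item again only refreshes its timestamp.
        self.pending[(table, uid)] = changed

    def next_batch(self, limit: int) -> list:
        # Oldest changes are indexed first.
        return sorted(self.pending, key=self.pending.get)[:limit]

    def done(self, table: str, uid: int) -> None:
        self.pending.pop((table, uid), None)


class RecordMonitor:
    """Hypothetical hook the CMS calls whenever a record is saved."""

    def __init__(self, queue: IndexQueue, monitored_tables: set):
        self.queue = queue
        self.monitored_tables = monitored_tables

    def record_saved(self, table: str, uid: int, timestamp: int) -> None:
        if table in self.monitored_tables:
            self.queue.mark(table, uid, timestamp)


queue = IndexQueue()
monitor = RecordMonitor(queue, {"pages", "news"})  # table names are examples
monitor.record_saved("pages", 1, 100)
monitor.record_saved("news", 7, 90)
monitor.record_saved("pages", 1, 120)    # same page edited again
monitor.record_saved("be_users", 3, 95)  # not monitored, ignored
batch = queue.next_batch(10)
print(batch)
```

A scheduled indexing task would take a batch, send each item to Solr, and call `done()` afterwards. Keeping the queue separate from the indexer means editors never wait for Solr when saving a record.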
Integration Challenges & Solutions
ACCESS RIGHTS
• Intranet, Extranet, ...
• Not everybody may see everything
• Flexible user groups and permissions
• Permissions extended to sub-pages
Integration Challenges & Solutions
SOLR ACCESS FILTER PLUGIN
• Custom Solr access filter plugin
• Query Parser and Filter
• User group IDs stored in documents
• Current user’s groups submitted with query
• Plugin matches document groups with user’s groups
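The matching step can be illustrated with a simplified sketch. The real plugin runs inside Solr as a Java query parser and filter; the Python below only shows the set logic, under the assumption that a document without any groups is public and that one shared group is enough to grant access. The field name `groups` and the sample documents are invented.

```python
def is_visible(doc_groups: frozenset, user_groups: frozenset) -> bool:
    """Decide whether a user may see a document."""
    if not doc_groups:
        # Assumption: no groups stored means the document is public.
        return True
    # Assumption: a single shared group is sufficient.
    return bool(doc_groups & user_groups)


def filter_results(docs: list, user_groups: frozenset) -> list:
    """Keep only the documents the user is allowed to see."""
    return [d for d in docs if is_visible(frozenset(d["groups"]), user_groups)]


docs = [
    {"id": "page/1", "groups": []},      # public
    {"id": "page/2", "groups": [3]},     # visible to group 3
    {"id": "page/3", "groups": [5, 8]},  # visible to groups 5 or 8
]
visible = filter_results(docs, frozenset({3}))
print([d["id"] for d in visible])
```

Doing this check inside Solr rather than in the CMS after the fact keeps result counts, paging, and facets consistent with what the user is actually allowed to see.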
Integration Challenges & Solutions
FILE INDEXING
• Finding file links in page content
• Core file links vs. plugin file links
• Track files for indexing
• Reading file content
• Separate tools for different file formats
Integration Challenges & Solutions
FILE INDEXING
• File Detectors & File Index Queue
• File system abstraction layer
• Apache Tika
• Knows 1,200+ file formats, reads about half of them
• Content & metadata extraction
• Language detection
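A file detector for links in rendered page content might look like the sketch below. It uses Python's standard `html.parser` and is only an illustration: the class name, the list of indexable extensions, and the sample paths are assumptions, and the detected files would then go into the File Index Queue and on to Apache Tika for extraction.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed set of extensions worth handing to the extractor.
INDEXABLE = {".pdf", ".doc", ".docx", ".odt"}


class FileLinkDetector(HTMLParser):
    """Collects links in page content that point to indexable files."""

    def __init__(self):
        super().__init__()
        self.files = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Look at the path only, so query strings do not hide the extension.
        ext = posixpath.splitext(urlparse(href).path)[1].lower()
        if ext in INDEXABLE and href not in self.files:
            self.files.append(href)


def detect_files(html: str) -> list:
    detector = FileLinkDetector()
    detector.feed(html)
    return detector.files


content = (
    '<p><a href="/fileadmin/report.PDF">Report</a> '
    '<a href="/about/">About</a> '
    '<a href="/fileadmin/specs.docx?v=2">Specs</a></p>'
)
print(detect_files(content))
```

Links generated by plugins may not appear as plain anchors in the content, which is why separate detectors per link source are useful.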
Integration Challenges & Solutions
THE REST
• PHP people vs. Java technology
• Talking to Solr
• Learning from mistakes
Integration Challenges & Solutions
THE REST
• Fully automated bash install script
• SolrPhpClient
• Separate your languages
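Talking to Solr from a non-Java application comes down to HTTP requests against Solr's request handlers, which is the job a client library such as SolrPhpClient takes over. The sketch below builds a select URL in Python instead of PHP; the base URL, core name, and the `site` field are assumptions, while `q`, `fq`, `rows`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url: str, keywords: str, site: str, rows: int = 10) -> str:
    """Build the URL for a keyword search restricted to one site."""
    params = [
        ("q", keywords),
        ("fq", f"site:{site}"),  # hypothetical field holding the site domain
        ("rows", rows),
        ("wt", "json"),          # ask Solr for a JSON response
    ]
    return f"{base_url}/select?{urlencode(params)}"


url = build_select_url(
    "http://localhost:8983/solr/core_en",  # assumed host and core name
    "typo3 solr",
    "example.org",
)
print(url)
```

The CMS side only needs to know this HTTP interface. The Java parts (schema, plugins, configuration) stay in the Solr installation, which keeps the two languages cleanly separated.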
EXT:solr - Apache Solr for TYPO3
FEATURES
• Faceted Search
• File Indexing
• Multi-Language & Multi-Site Support
• Did you mean, More Like This
• Search Word Highlighting
• Auto Complete
• Access Rights Support
• Many More ...
QUESTIONS?
THANKS.
T3CON North America
San Francisco, May 30-31
20% off regular ticket price, use: LUCENETYPO3
INFIELD DESIGN is hiring!
CONFERENCE PARTY
The Tipsy Crow: 770 5th Ave
Starts after Stump The Chump
Your conference badge gets you in the door
TOMORROW
Breakfast starts at 7:30
Keynotes start at 8:30
CONTACT@[email protected], [email protected]