multi-language content discovery through entity driven search: presented by alessandro benedetti,...
Post on 13-Jul-2015
Multi-language Content Discovery Through Entity Driven Search
Alessandro Benedetti
Search Consultant and R&D Software Engineer
Zaizi
http://uk.linkedin.com/in/alexbenedetti
Who I am
Alessandro Benedetti
Apache ManifoldCF committer
Search Consultant
R&D Software Engineer
Master in Computer Science
Information Retrieval Background
Semantic, NLP, Machine Learning Technologies Enthusiast
Beach Volleyball Player & Snowboarder
ZAIZI
Experienced at building and delivering a wide range of enterprise solutions across the whole information life cycle
Alfresco & Ephesoft certified Platinum Partner
Red Hat Enterprise Linux Ready Partner
R&D department specialising in Open Source Search Solutions
Alfresco Partner of the Year 2012 and 2013
Agenda
Context
Problem
Solution
Demo
What's upcoming
Zaizi R&D Department
Giving sense to the content
Enriching it semantically
Adding value to ECM/CMS
More structured content, easy to manage, link and search
Improving search
Across different domains, data sources, User Experience
Machine Learning applied research
Content Organization – Recommendation Systems
Enterprise Search Problems
Challenge: Search within Big and Heterogeneous Repositories
Heterogeneous data sources
Filesystems, DB, ECM/CMS, Email, …
Unstructured content in different formats
PDF, plain text, Word …
Documents not linked to each other
Federated Search
across data sources
preserving permissions
centralized endpoint
Sensefy
Semantic Enterprise Search Engine
Federated Search
Evolved User Experience
Based on cutting-edge Open Source Frameworks
Architecture
Entity Driven Search
Moving from keywords to Entities
More understandable to Humans
Process the unstructured text at indexing time
Enrich it
Build specific indexes
Use entities and concepts in searches
• Trying to foresee the concepts the user wants to express
What is an Entity in our domain?
Real world concepts
Linked Data resources
RDF (XML) structured data
• Unique identifier + properties
Stored in a Knowledge Base (Freebase, DBpedia, Custom Dataset)
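The "unique identifier + properties" shape of an entity can be sketched as a small data structure. This is only an illustration: the `Entity` and `KnowledgeBase` names, the `example.org` URI and the property keys are assumptions made for the example, not the schema used by Sensefy or by any of the knowledge bases listed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    """A knowledge-base resource: one unique identifier plus its properties."""
    uri: str                                   # unique identifier (hypothetical URI scheme)
    label: str                                 # human-readable name
    types: List[str] = field(default_factory=list)
    properties: Dict[str, List[str]] = field(default_factory=dict)


class KnowledgeBase:
    """In-memory stand-in for a knowledge base such as a custom dataset."""

    def __init__(self) -> None:
        self._by_uri: Dict[str, Entity] = {}

    def add(self, entity: Entity) -> None:
        self._by_uri[entity.uri] = entity

    def dereference(self, uri: str) -> Optional[Entity]:
        """Resolve an identifier to the stored entity, or None if unknown."""
        return self._by_uri.get(uri)


kb = KnowledgeBase()
kb.add(Entity(
    uri="http://example.org/resource/Cristiano_Ronaldo",
    label="Cristiano Ronaldo",
    types=["Football Player"],
    properties={"nationality": ["Portuguese"]},
))
```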
Redlink
Semantic Cloud platform
Providing Software as a Service
Text analysis and Entity Linking using Knowledge Bases
Linked Data Publishing
Enterprise Data Linking
Open-Source based components
Indexing - NLP & Semantic Enrichment
Apache ManifoldCF custom processors/output connectors
From unstructured to structured
NLP Analysis
POS Tagging
Named Entities Recognition
Entity Linking using Knowledge Bases
Disambiguation
Indexing in specific Solr Collections
• Primary Index (documents)
• Entity Index
• Entity Types
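A rough sketch of the enrichment step, assuming a toy dictionary lookup in place of the real NLP analysis (POS tagging, named entity recognition, disambiguation). The gazetteer contents, the field names and the `build_indexes` function are all hypothetical. The point is only to show one pass over the raw documents feeding three separate indexes.

```python
from collections import Counter

# Toy gazetteer standing in for a knowledge base:
# surface form -> (entity id, entity type). All values are made up.
GAZETTEER = {
    "cristiano ronaldo": ("ent:cristiano_ronaldo", "Football Player"),
    "real madrid": ("ent:real_madrid", "Football Team"),
}


def link_entities(text):
    """Return the (entity id, type) pairs whose surface form occurs in the text."""
    lowered = text.lower()
    return [link for surface, link in GAZETTEER.items() if surface in lowered]


def enrich(doc_id, text):
    """Turn unstructured text into a structured document for the primary index."""
    links = link_entities(text)
    return {
        "id": doc_id,
        "text": text,
        "entity_ids": [entity_id for entity_id, _ in links],
        "entity_types": sorted({entity_type for _, entity_type in links}),
    }


def build_indexes(raw_docs):
    """Build the primary index plus occurrence counts for entities and types."""
    primary = [enrich(doc_id, text) for doc_id, text in raw_docs.items()]
    entity_index = Counter(e for doc in primary for e in doc["entity_ids"])
    type_index = Counter(t for doc in primary for t in doc["entity_types"])
    return primary, entity_index, type_index
```

The occurrence counts are kept because the entity and entity type schemas shown later both list an `occurrences` field.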
Search - Smart Autocomplete
Multi Phase suggestions
Closer to natural language query formulation
Named Entities
Entity Types
Document Titles
Smart Autocomplete – Named Entities
Infix Suggestion (ron → Cristiano Ronaldo)
Fuzzy suggestion (cristinao → Cristiano Ronaldo)
Brief description of the suggested entity
Specific Solr index for the entities
• Schema (label, notable_type, occurrences...)
• Edge-Ngram token filtered label field
• Fuzzy queries with variable distance / classic queries to the label suggestion field
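The two suggestion modes can be approximated in plain Python. This is a sketch of the idea only, not of the Solr analysis chain: edge n-grams are generated per token so that typing the start of any word in a label matches it, and a Levenshtein distance with a length-dependent threshold stands in for the fuzzy query. The thresholds and function names are assumptions.

```python
def edge_ngrams(token, min_len=2):
    """All prefixes of a token, from min_len characters up to the full token."""
    return {token[:i] for i in range(min_len, len(token) + 1)}


def build_index(labels):
    """Map every edge n-gram of every token to the labels that contain it."""
    index = {}
    for label in labels:
        for token in label.lower().split():
            for gram in edge_ngrams(token):
                index.setdefault(gram, set()).add(label)
    return index


def edit_distance(a, b):
    """Classic Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def suggest(typed, labels, index):
    """Exact edge n-gram lookup first, fuzzy fallback if nothing matches."""
    typed = typed.lower()
    hits = index.get(typed)
    if hits:
        return sorted(hits)
    # Variable distance: tolerate more typos on longer input (assumed thresholds).
    max_dist = 1 if len(typed) <= 4 else 2
    return sorted(
        label for label in labels
        if any(edit_distance(typed, token[:len(typed)]) <= max_dist
               for token in label.lower().split())
    )


LABELS = ["Cristiano Ronaldo", "Christian Bale"]
INDEX = build_index(LABELS)
```

With these two labels, `suggest("ron", LABELS, INDEX)` goes through the n-gram lookup, while `suggest("cristinao", LABELS, INDEX)` falls through to the fuzzy branch.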
Smart Autocomplete – Entity Types
Infix Suggestion (play → Football Player)
Fuzzy suggestion (foobtall → Football Team)
Multi Language (calcia → Calciatore [it] (Football Player [en]))
Multi phase suggestion through properties (ital → football player nationality italian)
Specific Solr collection for the entity types
• SolrDocument is an entity type (type, occurrences, attributes, type hierarchy...)
• EdgeNgram token filtered type
• Multi-language suggestion highlight
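A minimal sketch of the multi-language case, assuming each entity type document carries one label per language. The `ENTITY_TYPES` records and the `suggest_types` function are hypothetical; a prefix test on each token replaces the EdgeNgram filter.

```python
# Hypothetical entity type documents with one label per language.
ENTITY_TYPES = [
    {"id": "football_player",
     "labels": {"en": "Football Player", "it": "Calciatore"}},
    {"id": "football_team",
     "labels": {"en": "Football Team", "it": "Squadra di calcio"}},
]


def suggest_types(typed, display_lang="en"):
    """Match the typed prefix against labels in every language.

    Returns (matched label, its language, label in the display language)
    so the UI can show e.g. Calciatore [it] (Football Player [en]).
    """
    typed = typed.lower()
    hits = []
    for entity_type in ENTITY_TYPES:
        for lang, label in entity_type["labels"].items():
            if any(token.startswith(typed) for token in label.lower().split()):
                hits.append((label, lang, entity_type["labels"][display_lang]))
    return hits
```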
Smart Autocomplete – configuration
Knowledge base for entity linking and dereference: DBpedia, Freebase, Custom Dataset
Properties: For each entity type of interest, LDPath will be used to identify the property in the graph
Hierarchy: All the sub-instances of a type will automatically inherit their parent properties to ease the configuration
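Property inheritance along the type hierarchy could look like the following sketch. The type names, the property names and the `path:` strings are placeholders invented for the example. They are not real LDPath expressions and not the Sensefy configuration format.

```python
# Hypothetical hierarchy: child type -> parent type (None marks the root).
TYPE_PARENT = {
    "Football Player": "Athlete",
    "Athlete": "Person",
    "Person": None,
}

# Properties configured per type; values are placeholders, not real LDPath.
TYPE_PROPERTIES = {
    "Person": {"nationality": "path:nationality"},
    "Athlete": {"sport": "path:sport"},
    "Football Player": {"team": "path:team"},
}


def resolve_properties(entity_type):
    """Collect the properties of a type together with those of its ancestors."""
    chain = []
    current = entity_type
    while current is not None:
        chain.append(current)
        current = TYPE_PARENT.get(current)
    merged = {}
    # Apply the root first so a more specific type can override its parents.
    for type_name in reversed(chain):
        merged.update(TYPE_PROPERTIES.get(type_name, {}))
    return merged
```

Only `Person` has to declare `nationality` here; `resolve_properties("Football Player")` picks it up through the chain, which is the configuration saving the slide describes.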
Semantic Search
Search by Named Entity
Ex. Give me all the documents related to Christian Bale
Search by Entity Type
Ex. Give me all the documents about football players
Search by Entity Type + properties
Ex. Give me all the documents about football players whose nationality is British
Query time Join: Entity / Entity Type collections → Primary Index
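The join can be illustrated with two in-memory collections: first select the entities that satisfy the criteria, then keep the primary documents that reference any of them. The records and field names below are made up for the example, and the slides do not state which Solr join mechanism is used, so this only shows the logic.

```python
# Hypothetical entity collection.
ENTITIES = [
    {"id": "e1", "label": "David Beckham",
     "type": "Football Player", "nationality": "British"},
    {"id": "e2", "label": "Cristiano Ronaldo",
     "type": "Football Player", "nationality": "Portuguese"},
    {"id": "e3", "label": "Christian Bale",
     "type": "Actor", "nationality": "British"},
]

# Hypothetical primary index: each document lists the entities it mentions.
DOCUMENTS = [
    {"id": "d1", "entity_ids": ["e1"]},
    {"id": "d2", "entity_ids": ["e2", "e3"]},
    {"id": "d3", "entity_ids": []},
]


def join_search(**criteria):
    """Return ids of documents mentioning an entity that matches all criteria."""
    matching = {
        entity["id"] for entity in ENTITIES
        if all(entity.get(key) == value for key, value in criteria.items())
    }
    return [doc["id"] for doc in DOCUMENTS
            if matching.intersection(doc["entity_ids"])]
```

The three search styles map to `join_search(label="Christian Bale")`, `join_search(type="Football Player")` and `join_search(type="Football Player", nationality="British")`.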
Semantic Facets
Dynamically calculated semantic facets based on types and entities from documents
Improve the navigation of results
Allow refined search through semantic information
Configurable custom layer on top of Solr faceting component
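Conceptually, a semantic facet is a count of entity types and entities over the current result set. A small sketch, assuming result documents that expose hypothetical `entity_types` and `entities` fields (the real layer sits on top of the Solr faceting component):

```python
from collections import Counter


def semantic_facets(results, fields=("entity_types", "entities"), limit=10):
    """Count the values of each semantic field across the result set."""
    return {
        name: Counter(
            value for doc in results for value in doc.get(name, [])
        ).most_common(limit)
        for name in fields
    }


results = [
    {"entity_types": ["Football Player"],
     "entities": ["Cristiano Ronaldo"]},
    {"entity_types": ["Football Player", "Football Team"],
     "entities": ["Cristiano Ronaldo", "Real Madrid"]},
]
facets = semantic_facets(results)
```

Selecting a facet value then becomes an additional filter on the same field, which is how the refined search would narrow the results.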
Semantic More Like This
Search for similar documents based on Entities and Entity Types
Similarity function based on document meaning
Multi Language / Not based on text tokens but concepts
Solr More Like This on custom fields
Entity Frequency / Inverse Document Frequency
Entity Type Frequency / Inverse Document Frequency
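The weighting can be sketched as TF-IDF computed over entity identifiers instead of text tokens, compared with cosine similarity. This is an illustration of the idea under assumed formulas, not Solr's More Like This scoring; the corpus and identifiers are invented. Because the vectors contain concepts rather than words, an English and an Italian document about the same entities come out as similar.

```python
import math
from collections import Counter


def inverse_document_frequency(corpus):
    """Weight each entity by how rare it is across the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(e for entities in corpus.values() for e in set(entities))
    return {e: math.log(n_docs / df) + 1.0 for e, df in doc_freq.items()}


def entity_vector(entities, idf):
    """Entity frequency in the document times inverse document frequency."""
    freq = Counter(entities)
    return {e: count * idf.get(e, 0.0) for e, count in freq.items()}


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def more_like_this(doc_id, corpus):
    """Rank the other documents by entity-based similarity to doc_id."""
    idf = inverse_document_frequency(corpus)
    target = entity_vector(corpus[doc_id], idf)
    scores = {
        other: cosine(target, entity_vector(entities, idf))
        for other, entities in corpus.items() if other != doc_id
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical corpus: document id -> entity ids extracted at indexing time.
corpus = {
    "en_doc": ["ent:ronaldo", "ent:real_madrid", "ent:ronaldo"],
    "it_doc": ["ent:ronaldo", "ent:real_madrid"],
    "film_doc": ["ent:bale"],
}
```

The same functions apply unchanged to entity types by passing type identifiers instead of entity identifiers.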
Live Demo
What's upcoming
Machine Learning components:
– Classification
– Topic annotation
– Clustering
Secured Entity Search
Image and Media searches
Advanced Geo-search
Personalized/collaborative search
Recommendations
Q&A
Advanced configurable Admin Dashboard
Any Questions?
Alessandro Benedetti
Search Consultant and R&D Software Engineer
Zaizi
Email: abenedetti@zaizi.com
Twitter: @Zaizi