Email Processing and Recommendation

Michal Laclavík, Ladislav Hluchý, Martin Šeleng

(Email research, information extraction, information retrieval, contextual recommendation)

Abstract

In this presentation we give an overview of our research on text processing and recommendation. We focus on information and knowledge hidden in email communication in an organizational or enterprise context.

We exploit simple information extraction techniques based on patterns and gazetteers to deliver a semantic or semi-formal understanding of text (email) content and context. Context is used for recommendation. We have developed proof-of-concept prototypes of email-based recommendation and search based on key-value pairs (named entities) extracted from text (emails) and on hierarchical trees built from recognized entities. In addition, we exploit social networks hidden in email archives.

Vienna, 14th October 2010, IRF-TUWIEN Doctoral Seminar

Primary Research Team & Capabilities

Dept. of Parallel and Distributed Computing

Research and Development Areas:
– Large-scale HPCN and Grid applications
– Intelligent and Knowledge oriented Technologies

Experience from European IST projects:
– 3 projects in FP5: ANFAS, CrosGRID, Pellucid
– 6 projects in FP6: EGEE II, K-Wf Grid, DEGREE (coordinator), EGEE, int.eu.grid, MEDIGRID
– 4 projects in FP7: Commius, Admire, EGEE III, Secricom

Several National Projects (SPVV, VEGA, APVT)

IKT Group Focus:
– Information Processing
– Semantic Web
– Knowledge oriented Technologies
– Parallel and Distributed Information Processing

Solutions:
– Ontea: Pattern-based Semantic Annotation
– ACoMA: KM tool in Email
– EMBET: Recommendation System

Director & leader of PDC: Dr. Dipl. Ing. Ladislav Hluchý

URL: http://ikt.ui.sav.sk

Ontea: Pattern-based information extraction and semantic annotation

Text processing

Ontea: Information Extraction (Features)

• Regex patterns
• Visual Annotation Tool
• Integration with external tools
– GATE, Stemmers, Hadoop …
• Gazetteers
• IE System configuration
• Automatic loading of extractors
• Patterns
• Multilingual tests
– Spanish, Slovak, English, Italian

Information Extraction Model

Address and product patterns

[Figure labels: Extraction; Processing; 3 words macro; ZIP macro; Street number macro; Street name macro; City name macro; Country macro; Address patterns]
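One possible reading of the macro labels in this diagram is that an address pattern is assembled from smaller, reusable regular-expression fragments. The sketch below illustrates that idea only; the macro names, the regular expressions and the address layout (street, number, ZIP, city) are assumptions made for the example and are not taken from Ontea.

```python
import re

# Hypothetical macros: named regex fragments that several patterns can reuse.
MACROS = {
    # One to three capitalised words (an assumed reading of the "3 words macro").
    "STREET_NAME": r"(?P<Street>[A-Z][a-z]+(?: [A-Z][a-z]+){0,2})",
    "STREET_NO": r"(?P<StreetNo>\d{1,4}[a-zA-Z]?)",
    "ZIP": r"(?P<ZIP>\d{3} ?\d{2})",
    "CITY": r"(?P<Settlement>[A-Z][a-z]+(?: [A-Z][a-z]+){0,2})",
}

# An address pattern written in terms of macros and expanded before compilation.
ADDRESS_TEMPLATE = "{STREET_NAME} {STREET_NO}, {ZIP} {CITY}"
ADDRESS_RE = re.compile(ADDRESS_TEMPLATE.format(**MACROS))


def extract_address(text):
    """Return the first address found as key-value pairs, or None."""
    match = ADDRESS_RE.search(text)
    if match is None:
        return None
    return {"address:" + key: value for key, value in match.groupdict().items()}


address = extract_address("Please ship to Main Street 9, 845 07 Bratislava today.")
```

Keeping the fragments in one table means a change to, say, the ZIP format is made once and picked up by every pattern that uses it.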

Segmentation

• Sentences
• Paragraphs
• Objects (Address, Product, ...)

Information Extraction: Gazetteers configuration

Gazetteer
• Can extract information that cannot be properly extracted by regular expression patterns (such as given names, product names, etc.)
• The gazetteer extraction approach is combined with regular-expression-based extraction. For example, personal full names can be extracted with higher precision.
• The gazetteer is easy to update, because it is configured by simple text files.

Gazetteer lists
• Simple text files with keywords

Gazetteer configuration
• Simple text file with <list file>:<IE result type>

[Figure: information extractor rules]
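A minimal sketch of the gazetteer idea, assuming one keyword per line in each list file and one `<list file>:<IE result type>` entry per line in the configuration. The file names, keywords and result types are hypothetical, and the files are simulated with in-memory strings so that the example is self-contained. The last function shows one way a gazetteer could be combined with a regular expression to find full names.

```python
import re

# Hypothetical stand-ins for the gazetteer list files (one keyword per line).
LIST_FILES = {
    "given_names.txt": "Michal\nLadislav\nMartin\n",
    "products.txt": "Ontea\nAcoma\n",
}

# Hypothetical configuration in the <list file>:<IE result type> form.
CONFIG = "given_names.txt:person:GivenName\nproducts.txt:product:Name\n"


def load_gazetteer(config, files):
    """Map every keyword to the IE result type of its list file."""
    keyword_types = {}
    for line in config.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, because the result type contains one too.
        list_file, result_type = line.split(":", 1)
        for keyword in files[list_file].splitlines():
            if keyword.strip():
                keyword_types[keyword.strip()] = result_type
    return keyword_types


def annotate(text, keyword_types):
    """Return (result type, keyword) pairs in order of appearance."""
    hits = []
    for keyword, result_type in keyword_types.items():
        for m in re.finditer(r"\b" + re.escape(keyword) + r"\b", text):
            hits.append((m.start(), result_type, keyword))
    return [(result_type, keyword) for _, result_type, keyword in sorted(hits)]


def full_names(text, keyword_types):
    """Gazetteer given name followed by a capitalised word (assumed heuristic)."""
    given = [k for k, t in keyword_types.items() if t == "person:GivenName"]
    alternation = "|".join(re.escape(k) for k in given)
    return re.findall(r"\b(?:" + alternation + r") [A-Z]\w+", text)


GAZETTEER = load_gazetteer(CONFIG, LIST_FILES)
TEXT = "Martin Seleng presented Acoma; Michal Laclavik presented Ontea."
annotations = annotate(TEXT, GAZETTEER)
names = full_names(TEXT, GAZETTEER)
```

Anchoring the name pattern on a known given name is what gives the precision gain: a bare "two capitalised words" regex would also match phrases such as product or street names.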

Information Extraction: Rules configuration

IE System configuration
– IE dynamically loads and runs its components (XMLRegexExtractor, Gazetteer, RuleTransformer) according to the settings in the IE rules file
– IE components are executed consecutively and operate on a set of information extraction results

[Figure labels: Information extractor rules file; IE result set; Modified IE result set; IE component; Regex based IE component; Gazetteer IE component; Result set transformer IE component]
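The consecutive execution of IE components over a shared result set can be sketched as a small pipeline. The component names XMLRegexExtractor, Gazetteer and RuleTransformer come from the slide; everything else here (the function bodies, the registry, and the rules being a plain list of names rather than a rules file) is an assumption for illustration.

```python
import re


def regex_extractor(text, results):
    """Regex-based component: add email addresses to the result set."""
    for m in re.finditer(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text):
        results.append({"type": "contact:Email", "value": m.group()})
    return results


def gazetteer(text, results):
    """Gazetteer component: add known product names (hypothetical list)."""
    for name in ("Ontea", "Acoma"):
        if re.search(r"\b" + name + r"\b", text):
            results.append({"type": "product:Name", "value": name})
    return results


def rule_transformer(text, results):
    """Result set transformer: here it only drops duplicate results."""
    seen, unique = set(), []
    for result in results:
        key = (result["type"], result["value"])
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


# Registry used to load components by the names given in the rules.
COMPONENTS = {
    "XMLRegexExtractor": regex_extractor,
    "Gazetteer": gazetteer,
    "RuleTransformer": rule_transformer,
}

# Stand-in for the IE rules file: the components to run, in order.
RULES = ["XMLRegexExtractor", "Gazetteer", "RuleTransformer"]


def run_pipeline(text, rules):
    """Run each configured component on the output of the previous one."""
    results = []
    for name in rules:
        results = COMPONENTS[name](text, results)
    return results


pipeline_results = run_pipeline(
    "Ask info@example.org or info@example.org about Acoma.", RULES
)
```

Because every component has the same signature, reordering or adding a component only changes the rules list, not the pipeline code.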

Semantic Annotation

The concept
• InformationExtractor (IE) produces a set of extraction results
• SemanticAnnotator (SA) consumes the IE result set and builds trees convertible to ontology instances or to objects according to an XML schema, e.g. Core Components
• SA first builds an intermediate tree of IE results on which it operates
• Upon its creation, the tree is not compliant with the Core Components specification and needs to be transformed
• Therefore we have tree transformers, which transform the IE result tree into trees

Semantic Annotation

• Tree transformers
– Input is a tree of IE results and output is the modified tree of IE results
– Tree transformers are executed consecutively and operate on a tree of information extraction results
– Tree transformers, which delete, create, rename, move, switch and order nodes, are configured in the SA rules file

[Figure: tree transformer]
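The tree transformers can be pictured as small functions that each take a tree of IE results and return the modified tree, applied one after another. The sketch below covers only rename, delete and order; the node layout, the function names and the rule list standing in for the SA rules file are all assumptions, not the actual SemanticAnnotator format.

```python
def make_node(name, value=None, children=None):
    """Build a tree node as a plain dict (hypothetical representation)."""
    return {"name": name, "value": value, "children": children or []}


def rename(tree, old, new):
    """Rename every node called `old` to `new`."""
    if tree["name"] == old:
        tree["name"] = new
    for child in tree["children"]:
        rename(child, old, new)
    return tree


def delete(tree, name):
    """Remove every descendant node called `name`, together with its subtree."""
    tree["children"] = [
        delete(child, name) for child in tree["children"] if child["name"] != name
    ]
    return tree


def order(tree, names):
    """Sort the root's direct children by their position in `names`."""
    rank = {name: index for index, name in enumerate(names)}
    tree["children"].sort(key=lambda child: rank.get(child["name"], len(names)))
    return tree


# Stand-in for the SA rules file: transformers and their arguments, in order.
SA_RULES = [
    (rename, ("zip", "address:ZIP")),
    (delete, ("noise",)),
    (order, (["address:Street", "address:ZIP"],)),
]


def transform(tree, rules):
    """Apply each transformer to the output of the previous one."""
    for transformer, args in rules:
        tree = transformer(tree, *args)
    return tree


ie_tree = make_node("Address", children=[
    make_node("zip", "845 07"),
    make_node("noise", "n/a"),
    make_node("address:Street", "Main Street"),
])
result_tree = transform(ie_tree, SA_RULES)
child_names = [child["name"] for child in result_tree["children"]]
```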

Social Networks

Social network reconstruction:
• Probabilistic inference using spreading activation
• Relies on the output of the information extractor (IE) in the form of complex objects

Preliminary results on a set of 50 Spanish emails (phone/name):
• Precision 60% (due to lower recall in IE)
• Precision 85% (achievable with better IE)
• Self-healing (with new incoming emails)
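To illustrate how spreading activation can relate a name to a phone number, the sketch below runs a very simple variant on a toy graph of emails and the objects extracted from them. It is not the authors' inference procedure: the graph, the decay factor, the number of steps and the equal split of energy among neighbours are all assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical graph: each email is linked to the objects extracted from it.
GRAPH = {
    "person:Ana": ["email1", "email2"],
    "email1": ["person:Ana", "phone:123"],
    "email2": ["person:Ana", "phone:123", "phone:999"],
    "phone:123": ["email1", "email2"],
    "phone:999": ["email2"],
}


def spread_activation(graph, start, decay=0.5, steps=2):
    """Spread energy from `start`; each step passes decayed energy to neighbours."""
    activation = defaultdict(float)
    activation[start] = 1.0
    frontier = {start: 1.0}
    for _ in range(steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            # The decayed energy is divided equally among the neighbours.
            share = energy * decay / len(neighbours)
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)


activation = spread_activation(GRAPH, "person:Ana")
# Phones ranked by how strongly they are activated from the person node.
phones = sorted(
    (node for node in activation if node.startswith("phone:")),
    key=activation.get,
    reverse=True,
)
```

In this toy graph the phone that co-occurs with the person in both emails ends up with the higher activation, which is the intuition behind using the ranking to pair names with phone numbers.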

Social Networks

Results as XML or HTML (via XSL Transformations)

Future:
• DataSource for the Search for Partner module
• Improve the recall of the Information Extractor
• Exploit a multi-pass algorithm and named entity recognition: things learned in the first pass will be used in the next, e.g. possible names with initials, etc.
• Build an enhanced statistical reasoning procedure on top of the present Social Network Extractor/Correlator

Email Research

Acoma

Acoma Architecture

• Connected to email protocols on desktop or server
• No need to change working practices
– Emails are received and sent as before
• Received email is processed by Acoma and enriched with useful information
• Extensible with OSGi modules

[Figure labels: Server; Desktop; Mail Client; Browser; Mail Server; POP3; IMAP; SMTP; Acoma; Connector to Email Infrastructure; Information Processing and Extraction; System Connectors; Hint Recommendation; Module 1; Module 2; Module n; Modified]

System Connectors

• Connection of Acoma to existing systems
– Document Archives
– Internet or Intranet Systems
– Databases
• Access or import of data
• Key-value pair transformation

[Figure labels: Meta-Connector; Web Connector; SpreadSheet Connector; Database Connector; Internet; Key-value; Transformed Key-value]
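As a rough illustration of key-value pair transformation, the sketch below shows a hypothetical spreadsheet connector that takes one extracted key-value pair and returns the related pairs found in the same spreadsheet row. The function name, the CSV layout and the sample data are assumptions; the slides do not describe the connector interface.

```python
import csv
import io

# Hypothetical spreadsheet exported as CSV; the header row holds the keys.
SHEET = "org:Name,org:RegNo\nExample Ltd,12345678\nSample Inc,87654321\n"


def spreadsheet_connector(pair, sheet_text):
    """Transform one key-value pair into the other pairs of the matching row."""
    key, value = pair
    for row in csv.DictReader(io.StringIO(sheet_text)):
        if row.get(key) == value:
            return [(k, v) for k, v in row.items() if k != key]
    # No row matched, so there is nothing to add.
    return []


transformed = spreadsheet_connector(("org:Name", "Example Ltd"), SHEET)
```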

Acoma Architecture: Message Post-Processing

• Useful hints with links are included in the enriched email
• Links lead to internal or external systems (Internet, Intranet)

Business Objects in Emails

• A study of 6 organizations shows:
– Objects can be identified by patterns and gazetteers
– It is possible to define a set of common objects
• Objects identified:
– Organization: org:Name, org:RegNo, org:TaxNo
– Person: person:Name, person:Function
– Contact: contact:Phone, contact:Email, contact:Webpage
– Address: address:ZIP, address:Street, address:Settlement
– Product: product:Name, product:Module, product:Component, product:BOID
– Document: doc:Invoice, doc:Order, doc:Contract, doc:ChangeRequest
– Inventory: inventory:ResID, inventory:ResType
– Other business object: ID: BOID
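A short sketch of how some of these objects could be turned into key-value pairs with patterns. The keys contact:Email, contact:Phone and contact:Webpage come from the list above; the regular expressions are deliberately simple assumptions and would need tuning for real email data.

```python
import re

# Hypothetical patterns, one per key; real patterns would be more thorough.
PATTERNS = {
    "contact:Email": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "contact:Phone": r"\+\d{1,3}(?: \d{1,4}){2,4}",
    "contact:Webpage": r"https?://[^\s,;]+",
}


def extract_pairs(text):
    """Return (key, value) pairs for every pattern match, grouped by key."""
    pairs = []
    for key, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            pairs.append((key, m.group()))
    return pairs


pairs = extract_pairs(
    "Call +421 2 5555 0100 or write to info@example.org, see http://ikt.ui.sav.sk"
)
```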

Social Networks and Graph Data

• Relations among objects
• Support for search

Email Search Prototype

• Use of the social network from email
• Includes extracted objects
• Full text of extracted objects
• Related objects discovered and ordered by spreading activation on the social network graph
• Faceted search, navigation

Context-based Recommendation, Knowledge Sharing

EMBET, Acoma

EMBET: proactive information and knowledge provision

Objective: Recommend and provide the user with information or knowledge in context

• Collaboration among users
• Knowledge sharing
• Active knowledge provision
• Reuse of knowledge: notes and other resources

http://ups.savba.sk/kwfgrid/uaa/

EMBET: Achievements

• Software with the following functionality
– User Problem description
– Displaying Knowledge
– Adding Knowledge
– Knowledge Reuse
– Permanent Notes Storage
– Voting on Notes
• EMBET architecture: Core, GUI
• Context detection
• Context Matching to display information & knowledge
• Plain text analysis using Advanced Semantic Annotation Algorithms – OnTeA
• Theory of different context matching algorithms

Acoma: Hint Recommendation

Information Retrieval and Information Extraction

Lectures

IR Lectures

• Introduction to Information Retrieval
• Text Operations, Text Analysis, stemming
• Crawling, link processing
• IR Models, Indexing techniques
• IR Software libraries and systems
• Ranking by Graph Algorithms (PageRank, HITS, …) and Searching
• Information Extraction
• Regular Expressions
• Large Scale Data Processing on MapReduce Architecture
• Multimedia Information Retrieval
• Evaluation Techniques, Precision, Recall
• Google
• Semantics and IR, Semantic Web Standards

Lecture Conditions

• Every student gets a project focused on
– Crawling
– Indexing
– Ranking
– Information Extraction
– Large Scale Information Processing
• Students have to consult on the project 3 times during the semester
• Availability of data from day one
• Lectures are available at:
– http://vi.ikt.ui.sav.sk/Témy_prednášok

[Figure labels, translated from Slovak: Link processing; Indexer; Ranking; Search engine; Document base; Links; Document index; Downloader (crawler); Text operations; Query; User; List of documents; Internet]
