Text Mining
Applications and technologies
Sergei Ananyan
Megaputer Intelligence, Inc.
www.megaputer.com
© 2001 Megaputer Intelligence, Inc.
Outline
• Definitions and application fields
• Text mining functionality
• Case study
• Technology
• Future developments
Text Mining
Text Mining is a process of extracting new, valid, and actionable knowledge dispersed throughout text documents and utilizing this knowledge to better organize information for future reference.
Tasks addressed by TM
• Search and retrieval
• Semantic analysis
• Clustering
• Categorization
• Feature extraction
• Ontology building
• Dynamic focusing
DM and TM comparison
Property | Data Mining | Text Mining
Object of investigation | Numerical and categorical data | Texts
Object structure | Relational databases | Free form texts
Goal | Predict outcomes of future situations | Retrieve relevant information, distill the meaning, categorize and target-deliver
Methods | Machine learning: SKAT, DT, NN, GA, MBR, MBA | Indexing, special neural network processing, linguistics, ontologies
Current market size | 100,000 analysts at large and midsize companies | 100,000,000 corporate workers and individual users
Maturity | Broad implementation since 1994 | Broad implementation starting 2000
TM tasks in detail
Information search and retrieval
• Index-based: Excite, Alta Vista
• Ontology-based: Yahoo, Lycos; Megaputer (ontology building)
• Boolean search + stemming: HotBot, dt-Search
• Semantics and linguistics enhanced: Megaputer
• Dynamic focusing: Megaputer
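As an illustration of the "Boolean search + stemming" approach named on this slide, here is a minimal sketch. It is not how any of the listed products work: the suffix list, the `stem` helper, and the sample documents are assumptions made up for the example, and a production system would use a proper stemmer.

```python
import re
from collections import defaultdict

SUFFIXES = ("ing", "ed", "es", "s")  # toy suffix list (assumption)


def stem(word):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text, split it into words, and stem each word."""
    return [stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Boolean AND: documents containing every stemmed query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents.
docs = {
    1: "Mining texts for actionable knowledge",
    2: "Search engines index documents",
    3: "Indexed documents support text mining",
}
index = build_index(docs)
print(sorted(search_and(index, "text mining")))
```

Because both the documents and the query are stemmed, "texts" and "text" or "indexed" and "indexing" land on the same index entry, which is the point of combining stemming with Boolean search.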
TM tasks in detail (continued)
Semantic analysis
• Neural network and customized dictionaries: Megaputer, Microsystems
• Linguistics: Megaputer
• Bayesian inference: Autonomy
Clustering and categorization
• Megaputer
Feature extraction
• SRA, Megaputer, IBM
Possible applications
• Search engines
• Enterprise portals
• Knowledge management systems
• e-Business systems
• Vertical applications: e-mail categorization and routing, call center notes categorization, CRM systems
Typical setups
Venture capitalist
• Search and retrieval
• Estimation of relevance
• Summarization and navigation
Investment or insurance company
• Categorization of incoming messages
• Target-sharing information with employees
• Structured fragments extraction (numbers)
• Feature extraction (who owns whom)
Typical setups (continued)
Government agency
• Intelligent information retrieval
• Chain of events tracing
• Supplement documents by their summaries for more efficient reference
e-Business
• Match resource description to a user query
• Learn visitor interests by analyzing the content browsed
• Match interests to available resources
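The e-Business setup above (learn visitor interests from browsed content, then match them to resources) can be sketched with plain bag-of-words vectors and cosine similarity. This is one common way to do such matching, not a description of Megaputer's method; the function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word occurrences in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def visitor_profile(pages_browsed):
    """Accumulate the words of every page the visitor has read."""
    profile = Counter()
    for page in pages_browsed:
        profile.update(bag_of_words(page))
    return profile


def rank_resources(profile, resources):
    """Order resource names by similarity of their description to the profile."""
    scored = [(cosine(profile, bag_of_words(desc)), name)
              for name, desc in resources.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical browsing history and resource descriptions.
browsed = [
    "stock market rally lifts tech shares",
    "investors watch market volatility",
]
resources = {
    "finance newsletter": "daily stock market news for investors",
    "travel deals": "cheap flights and hotel offers",
}
ranking = rank_resources(visitor_profile(browsed), resources)
print(ranking)
```

The profile is built only from page content, so no questionnaire is needed; richer systems would weight terms and remove auxiliary words before comparing.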
Text and the Web
• 99% of analytical information on the Web exists in the form of texts
• The Web is the place where users routinely encounter new texts
• 99% of e-Businesses today do not leverage the competitive advantage provided by their content-rich websites because they do not utilize text mining to the extent they should
Example: nytimes.com
• Extremely rich content
• Large audience: 10+ million e-mails
• Generates revenue from advertisers
• Uses an anonymous survey for login
• Does a very good job tracking individual pages accessed
• For any page, can furnish a demographic profile of its visitors
But it does not utilize text mining. It cannot see a customer-centered view.
Example: nytimes.com (continued)
• Could significantly increase the value of each visitor to advertisers by doing individualized marketing
• Rich content and high visitor loyalty are ideal for learning visitors’ interests through text mining
• This silent surveying is done unobtrusively
• Privacy is preserved
Potential result: increased revenue
Megaputer text mining
TextAnalyst*
• Tech: combination of n-grams and Neural Networks
• Scope: Analyst’s desktop solution
Textractor
• Tech: Morphological analysis, Semantic analysis (WordNet and its extensions), Statistical and Fuzzy Logic analysis
• Scope: Enterprise solution
* Microsystems Ltd., a Megaputer business partner. Megaputer has exclusive distribution rights for TextAnalyst.
TextAnalyst: Overview
TextAnalyst
TextAnalyst is a tool for semantic analysis, navigation, and search of unstructured texts.
TextAnalyst is available as
• Standalone application
• SDK of COM components for easy integration
TextAnalyst functionality
• Distilling the meaning (Semantic Network)
• Navigation
• Summarization
• Topic explication
• Clustering
• Dynamic focusing
• Categorization (TextAnalyst COM)
TextAnalyst
Customer base: 300+ installations
Sample customers
• Ask Jeeves (USA)
• Pfizer (USA)
• IMS Health (USA)
• TRW (USA)
• The Gallup Organization (USA)
• McKinsey & Company (USA)
• Centers for Disease Control (USA)
• Liberty Mutual (USA)
• Best Buy (USA)
• Logicon (USA)
• France Telecom (France)
• Net Shepherd (Canada)
• Skila.com (USA)
• Dept of Environmental Protection (Australia)
• US Navy (USA)
• KPN Research (Netherlands)
• Dow Chemical (USA)
• Talkie.com (USA)
• Clontech (USA)
• NICE Systems (Israel)
TextAnalyst: Underlying Technology
Text image
Semantic Network - a list of the most important concepts (words and word combinations) and relations between them
[Figure: example semantic network. Concepts with semantic weights: nuclear (100), temperature (95), nuclear reactions (98), heat (99), cell (98), papers (86), Temperature fusion (100), Peterson (96). Relation weights shown on the links: 37, 78, 63, 59, 70, 52, 46, 29, 28.]
Semantic network creation
Text is a string of characters: letters, spaces, punctuation marks
Steps for building Semantic Network
• Break text into words and sentences
• Push through an n-character window
• Feed patterns to a Recurrent Hierarchical Neural Network and record frequencies
• Identify relations between concepts (joint occurrence in a sentence)
• Carry out preliminary semantic network renormalization (Hopfield-like Neural Network) - assign semantic weights
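The first steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions: plain counting stands in for the Recurrent Hierarchical Neural Network, the renormalization step is omitted, and all function names and the sample text are made up for the example.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Break text into sentences at ., ! and ? marks."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def char_windows(text, n):
    """Slide an n-character window over the text, one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def window_frequencies(text, n):
    """Record how often each n-character pattern occurs."""
    return Counter(char_windows(text.lower(), n))


def cooccurrence(sentences, concepts):
    """Count, for each pair of concepts, the sentences in which both occur."""
    pairs = Counter()
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        present = sorted(c for c in concepts if c in words)
        pairs.update(combinations(present, 2))
    return pairs


# Hypothetical sample text.
text = ("Nuclear reactions release heat. Heat raises the temperature. "
        "The cell stores heat.")
sentences = split_sentences(text)
freqs = window_frequencies(text, 4)
pairs = cooccurrence(sentences, {"nuclear", "heat", "temperature", "cell"})
print(freqs["heat"], dict(pairs))
```

Frequent character patterns are candidates for concepts, and the pair counts give the raw relations that a later renormalization step would turn into semantic weights.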
General & Text-specific tasks
• Parse and reorganize input into sequences of words joined by concatenation and separation signs
• Recognize and remove auxiliary words and inflectional morphemes
• Recognize, count and store stem morphemes
• Identify words sharing stem morphemes
Hierarchical Recurrent NN
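To make the stem-related tasks concrete, the sketch below removes auxiliary words, strips inflectional endings, counts the resulting stems, and groups words that share a stem. It uses hand-written lists instead of a trained network, so it is an analogy rather than the TextAnalyst technique; the word list, suffix list, and sample sentence are assumptions.

```python
from collections import Counter, defaultdict

AUXILIARY_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # assumed list
SUFFIXES = ("ing", "ed", "es", "s")  # assumed inflectional endings


def stem(word):
    """Strip one inflectional ending, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_table(words):
    """Return (stem counts, words grouped by stem), skipping auxiliary words."""
    counts = Counter()
    members = defaultdict(set)
    for word in words:
        word = word.lower()
        if word in AUXILIARY_WORDS:
            continue
        s = stem(word)
        counts[s] += 1
        members[s].add(word)
    return counts, members


# Hypothetical sample input.
words = "The reactor reacted and the reacting cell reacts".split()
counts, members = stem_table(words)
print(counts.most_common(1), sorted(members["react"]))
```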
General & Text-specific tasks (continued)
• Identify relationships. Text: joint occurrence in sentences
• Preliminary SN renormalization: optimization task similar to Hopfield network
• Association of concepts in SN with sentences and context in original text
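The last task, associating concepts with the sentences where they occur, is what makes navigation from the network back to the source text possible. A minimal sketch, assuming a simple sentence splitter and hypothetical helper names and sample text:

```python
import re
from collections import defaultdict


def index_concepts(text, concepts):
    """Record, for each concept, the numbers of the sentences mentioning it."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    locations = defaultdict(list)
    for number, sentence in enumerate(sentences):
        lowered = sentence.lower()
        for concept in concepts:
            pattern = r"\b" + re.escape(concept.lower()) + r"\b"
            if re.search(pattern, lowered):
                locations[concept].append(number)
    return sentences, locations


def context(sentences, locations, concept, radius=1):
    """Return each mention of a concept with `radius` neighboring sentences."""
    passages = []
    for i in locations.get(concept, []):
        lo, hi = max(0, i - radius), min(len(sentences), i + radius + 1)
        passages.append(" ".join(sentences[lo:hi]))
    return passages


# Hypothetical sample text.
text = "Fusion needs heat. The cell was cold. Peterson measured the temperature."
sentences, locations = index_concepts(text, ["heat", "cell", "temperature"])
print(context(sentences, locations, "cell", radius=0))
```

With `radius=1` the same call also returns the sentences before and after each mention, which is one simple way to show a concept in context.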
Case study
IRLP provides R&D assistance and information services to Indiana’s small businesses and governmental units
IRLP searches SBIR and the Commerce Business Daily to identify research funding opportunities for its clients.
“TextAnalyst was able to find the necessary matches even for those clients where existing search program was incompatible.”
-- Cindy Moore, Marketing Coordinator, IRLP
Customer quotes
Eleanor McLellan, Data Manager / Analyst, Centers for Disease Control, Atlanta, GA
"TextAnalyst is able to efficiently handle numerous and often large (90+ pages apiece) text files without any problem. Furthermore, the program is extremely user-friendly."
TextAnalyst supports medical research at Centers for Disease Control
Nikolai Kalnin, Ph.D., Team Leader, Bioinformatics Group, CLONTECH Laboratories, Inc., Palo Alto, CA
"TextAnalyst has been selected as the only text analysis tool capable of establishing relations between terms. It is reasonably priced, easy to install and operate."
TextAnalyst helps process texts at Clontech
Kalyan Gupta, Ph.D., Director, Research, CaseBank Technologies Inc., Brampton, Ontario
"TextAnalyst is used at CaseBank to identify and assess the contents of electronic repositories of troubleshooting and maintenance information. It saves case preparation time and allows CaseBank to be more responsive to its customer's knowledge retrieval needs."
TextAnalyst saves time and resources for CaseBank
Future developments
• Text categorization (now implemented in TextAnalyst COM)
• Thesaurus-based text retrieval
• Integration with Web technologies
TextAnalyst evaluation
We invite you to download a FREE evaluation copy of TextAnalyst from www.megaputer.com and enjoy using it hands-on, following the provided step-by-step lessons or exploring your own data.
Textractor™
Technology and Applications
Textractor capabilities
• Key senses extraction
• Hierarchical clustering
• Categorization
• Summarization
• Intelligent search
• Feature extraction
Textractor applications
General
• Automated email categorization and routing (categories can be provided by the user or determined by the system)
• Knowledge extraction from call center notes (example: occupational hazard determination)
• Knowledge-based executive reporting system (one-glance knowledge visualization)
• Flexible searching for support documentation (semantic relations between terms: synonyms, hyponyms, meronyms)
• Competitive intelligence
Insurance
• Clustering of claims and ontology building (hierarchical organization of textual data)
• Automated feature extraction and claim tagging
Textractor analysis steps
• Morphological analysis
• Syntactic analysis
• Semantic analysis - WordNet filtering (synonymy, antonymy, hyper/hyponymy and holo/meronymy)
• Statistical analysis (frequency of terms against background frequencies)
• Context analysis (polysemy resolving and term collocations)
• Semantic Network comparison
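The statistical analysis step compares how often a term occurs in a document with how often it occurs in background text. The slides do not give the formula, so the sketch below assumes one simple choice: a ratio of add-one smoothed relative frequencies. The function names and both sample texts are hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())


def key_terms(document, background, top=3, smoothing=1.0):
    """Rank document terms by smoothed document/background frequency ratio."""
    doc = Counter(tokens(document))
    bg = Counter(tokens(background))
    doc_total = sum(doc.values())
    bg_total = sum(bg.values())
    vocab = len(set(doc) | set(bg))
    scores = {}
    for term, count in doc.items():
        p_doc = (count + smoothing) / (doc_total + smoothing * vocab)
        p_bg = (bg[term] + smoothing) / (bg_total + smoothing * vocab)
        scores[term] = p_doc / p_bg
    # Highest ratio first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top]


# Hypothetical claim text and background text.
document = ("the claim reports a collision and the collision "
            "damaged the windshield")
background = ("the cat sat on the mat and the dog saw a bird "
              "and the bird saw the cat")
print(key_terms(document, background))
```

Words such as "the" are frequent in both texts and score below 1, while "collision" is frequent only in the document and rises to the top.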
WordNet
WordNet is a comprehensive semantically organized lexical database for English
www.cogsci.princeton.edu/~wn
Textractor provides an ability to expand and edit WordNet for a specific application field.
Semantic term relationships
• Synonyms: Accident – Collision – Wreck
• Hyper/Hyponyms: Bird (hypernym) : Eagle, Hawk, Pigeon (hyponyms)
• Holo/Meronyms: Car (holonym) : Motor, Windshield, Tire (meronyms)
• Antonyms: Cold <> Hot, Deep <> Shallow
• Polysemy: Commercial Bank, River Bank
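One use of these relationships is query expansion: a search for a term also matches its synonyms or hyponyms. The sketch below uses a tiny hand-built lexicon filled with the examples from this slide; it does not call WordNet itself, and the `LEXICON` structure and function names are assumptions for illustration.

```python
# Toy lexicon built from the slide's examples (not real WordNet data access).
LEXICON = {
    "accident": {"synonyms": {"collision", "wreck"}},
    "bird": {"hyponyms": {"eagle", "hawk", "pigeon"}},
    "car": {"meronyms": {"motor", "windshield", "tire"}},
    "cold": {"antonyms": {"hot"}},
}


def expand(term, relations=("synonyms", "hyponyms")):
    """Return the term together with its related terms for the given relations."""
    entry = LEXICON.get(term.lower(), {})
    expanded = {term.lower()}
    for relation in relations:
        expanded |= entry.get(relation, set())
    return expanded


def matches(query_term, document, relations=("synonyms", "hyponyms")):
    """True if the document contains the query term or any expanded term."""
    words = set(document.lower().split())
    return bool(expand(query_term, relations) & words)


print(sorted(expand("accident")))
print(matches("bird", "a hawk circled overhead"))
```

Which relations to follow is a design choice: expanding "car" through meronyms would also retrieve documents that mention only a windshield, which may or may not be wanted.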
Textractor architecture
[Architecture diagram; only the component labels survive, the connections between them are not recoverable]
• Data sources
• Filters and DW interfaces
• Text Mining Engines: Core TM engines, Application-oriented TM engines
• Morphological Analysis
• Syntactic Analysis
• Semantic Analysis
• Link Parser
• WordNet
• Field-specific WordNet Extensions
• WordNet Extension Editor
• Stored Indices
Textractor text mining engines
Core TM engines
• Text indexer
• Formal search query creator
• Key senses extractor
• Feature extractor
Application-oriented TM engines
• Text Categorizer
• Text Clusterizer
• Database enrichment and mining
• Intelligent Searcher (synonyms, hyper/hyponyms, term proximity, frequencies)
• Document tagging
Any Questions?
Call Megaputer at (812) 330-0110
or write
120 W Seventh Street, Suite 310, Bloomington, IN 47404 USA
Appendix A: TextAnalyst technology details
Two aspects of text
• Sequence of characters characterized by patterns that represent information recognized by humans
• Structured sequence of lexical units organized together according to morphological and syntactic rules (morphemes, auxiliary lexical units, syntactic members, sentences, etc.)
Semantics of text
• Humans rely on multimodal associations for creating semantic models
• Standalone text - semantics is formal, but still useful
• Meaning of a concept - collection of relations of this concept to other concepts in the text (constructive definition)
Lexical vs. Grammatical
• Lexical meaning of a word - determined by stem morpheme (word combinations - chains of morphemes)
• Grammatical meaning - determined by morphemes (prefixes, endings, etc.) and auxiliary semantic units (articles, prepositions, etc.)
• Grammatical chains - word sequences with extracted stem morphemes - frames for contents
Semantic structure of texts
• Single text - semantic analysis can be performed, but is not sufficient: need a knowledge base against which the text can be analyzed
• Analysis of a large number of texts from diverse fields => Grammatical structure of the language
• Analysis of a large number of texts from the field of interest => Knowledge Base
Grammatical + Lexical = Semantic
• Grammatical dictionaries of morphemes and auxiliary words of a language: threshold transformation applied to a NN trained on a large corpus of texts from diverse fields
• Trained “grammatical NN” - filter. “Lexical” NN is connected to its output.
• Combining elements from both NN - obtain a list of concepts for Semantic Network (after relational renormalization)
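The idea of a grammatical filter followed by a lexical stage can be illustrated without neural networks. In the sketch below, a document-frequency threshold over texts from diverse fields stands in for the threshold transformation on the trained "grammatical NN", and simple counting stands in for the "lexical" NN. This is a swapped-in simplification, and the corpora, threshold value, and function names are assumptions.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())


def grammatical_dictionary(general_corpus, threshold=0.75):
    """Words that occur in at least `threshold` of the diverse-field texts."""
    doc_freq = Counter()
    for text in general_corpus:
        doc_freq.update(set(tokens(text)))
    cutoff = threshold * len(general_corpus)
    return {word for word, n in doc_freq.items() if n >= cutoff}


def lexical_concepts(domain_texts, grammatical):
    """Count domain words that pass the grammatical filter."""
    counts = Counter()
    for text in domain_texts:
        counts.update(t for t in tokens(text) if t not in grammatical)
    return counts


# Hypothetical corpora: texts from diverse fields, then the field of interest.
general = [
    "the price of oil rose",
    "the patient of the clinic recovered",
    "the rules of the game changed",
    "a bird sat in the tree",
]
domain = ["the fusion of nuclei releases heat", "the heat of the cell"]

grammatical = grammatical_dictionary(general)
concepts = lexical_concepts(domain, grammatical)
print(sorted(grammatical), concepts.most_common(1))
```

Words that appear everywhere regardless of field end up in the grammatical dictionary and are filtered out; what remains in the domain texts is the candidate concept list.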