

Page 1: Applications and technologies Sergei Ananyan Megaputer Intelligence, Inc.  Text Mining © 2001 Megaputer intelligence, Inc

Applications and technologies

Sergei Ananyan
Megaputer Intelligence, Inc.

www.megaputer.com

Text Mining

© 2001 Megaputer Intelligence, Inc.

Page 2: Outline

• Definitions and application fields
• Text mining functionality
• Case study
• Technology
• Future developments

Page 3: Text Mining

Text Mining is a process of extracting new, valid, and actionable knowledge dispersed throughout text documents and utilizing this knowledge to better organize information for future reference.

Page 4: Tasks addressed by TM

• Search and retrieval
• Semantic analysis
• Clustering
• Categorization
• Feature extraction
• Ontology building
• Dynamic focusing

Page 5: DM and TM comparison

Aspect | Data Mining | Text Mining
Object of investigation | Numerical and categorical data | Texts
Object structure | Relational databases | Free-form texts
Goal | Predict outcomes of future situations | Retrieve relevant information, distill the meaning, categorize and target-deliver
Methods | Machine learning: SKAT, DT, NN, GA, MBR, MBA | Indexing, special neural network processing, linguistics, ontologies
Current market size | 100,000 analysts at large and midsize companies | 100,000,000 corporate workers and individual users
Maturity | Broad implementation since 1994 | Broad implementation starting 2000

Page 6: TM tasks in detail

Information search and retrieval
• Index-based: Excite, Alta Vista
• Ontology-based: Yahoo, Lycos; Megaputer (ontology building)
• Boolean search + stemming: HotBot, dt-Search
• Semantics and linguistics enhanced: Megaputer
• Dynamic focusing: Megaputer

Page 7: TM tasks in detail (continued)

Semantic analysis
• Neural network and customized dictionaries: Megaputer, Microsystems
• Linguistics: Megaputer
• Bayesian inference: Autonomy

Clustering and categorization
• Megaputer

Feature extraction
• SRA, Megaputer, IBM

Page 8: Possible applications

• Search engines
• Enterprise portals
• Knowledge management systems
• e-Business systems
• Vertical applications:
  • e-mail categorization and routing
  • Call center notes categorization
  • CRM systems

Page 9: Typical setups

Venture capitalist
• Search and retrieval
• Estimation of relevance
• Summarization and navigation

Investment or insurance company
• Categorization of incoming messages
• Target-sharing information with employees
• Structured fragments extraction (numbers)
• Feature extraction (who owns whom)

Page 10: Typical setups (continued)

Government agency
• Intelligent information retrieval
• Chain of events tracing
• Supplement documents by their summaries for more efficient reference

e-Business
• Match resource description to a user query
• Learn visitor interests by analyzing the content browsed
• Match interests to available resources

Page 11: Text and the Web

• 99% of analytical information on the Web exists in the form of texts
• The Web is the place where users routinely encounter new texts
• 99% of e-Businesses today do not leverage the competitive advantage provided by their content-rich websites because they do not utilize text mining to the extent they should

Page 12: Example: nytimes.com

• Extremely rich content
• Large audience: 10+ mln e-mails
• Generates revenue from advertisers
• Uses an anonymous survey for login
• Does a very good job tracking individual pages accessed
• For any page can furnish demographic profile of its visitors

But does not utilize text mining. Cannot see customer-centered view.

Page 13: Example: nytimes.com (continued)

• Could significantly increase the value of each visitor to advertisers by doing individualized marketing
• Rich content and high visitor loyalty are ideal for learning visitors’ interests through text mining
• This silent surveying is done unobtrusively
• Privacy is preserved

Potential result: increased revenue

Page 14: Megaputer text mining

TextAnalyst*
• Tech: combination of n-grams and neural networks
• Scope: analyst’s desktop solution

Textractor
• Tech: morphological analysis, semantic analysis (WordNet and its extensions), statistical and fuzzy logic analysis
• Scope: enterprise solution

* Microsystems Ltd., a Megaputer business partner. Megaputer has exclusive distribution rights for TextAnalyst.

Page 15: TextAnalyst* Overview

* Microsystems Ltd., a Megaputer business partner. Megaputer has exclusive distribution rights for TextAnalyst.

Page 16: TextAnalyst

TextAnalyst is a tool for semantic analysis, navigation, and search of unstructured texts.

TextAnalyst is available as
• Standalone application
• SDK of COM components for easy integration

Page 17: TextAnalyst functionality

• Distilling the meaning (Semantic Network)
• Navigation
• Summarization
• Topic explication
• Clustering
• Dynamic focusing
• Categorization (TextAnalyst COM)

Page 18: TextAnalyst

Customer base: 300+ installations

Sample customers
• Ask Jeeves (USA)
• Pfizer (USA)
• IMS Health (USA)
• TRW (USA)
• The Gallup Organization (USA)
• McKinsey & Company (USA)
• Centers for Disease Control (USA)
• Liberty Mutual (USA)
• Best Buy (USA)
• Logicon (USA)
• France Telecom (France)
• Net Shepherd (Canada)
• Skila.com (USA)
• Dept of Environmental Protection (Australia)
• US Navy (USA)
• KPN Research (Netherlands)
• Dow Chemical (USA)
• Talkie.com (USA)
• Clontech (USA)
• NICE Systems (Israel)

Page 19: TextAnalyst Underlying Technology

Page 20: Text image

Semantic Network: a list of the most important concepts (words and word combinations) and relations between them.

[Figure: example semantic network. Concept nodes with semantic weights: nuclear (100), temperature (95), nuclear reactions (98), heat (99), cell (98), papers (86), Temperature fusion (100), Peterson (96). Relation weights shown on the links: 37, 78, 63, 59, 70, 52, 46, 29, 28. The link endpoints are not recoverable from the transcript.]
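A semantic network of this kind can be held in a small data structure: weighted concepts plus weighted relations between pairs of concepts. The sketch below is only an illustration. The class name `SemanticNetwork`, its methods, the choice of directed links, and in particular the pairing of relation weights with concepts are assumptions, since the slide does not show which concepts each link joins.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticNetwork:
    # concept -> semantic weight (0-100), as on the slide
    concepts: dict = field(default_factory=dict)
    # (source, target) -> relation weight; directed links are an assumption
    relations: dict = field(default_factory=dict)

    def add_concept(self, name, weight):
        self.concepts[name] = weight

    def relate(self, source, target, weight):
        if source not in self.concepts or target not in self.concepts:
            raise KeyError("both concepts must be added before relating them")
        self.relations[(source, target)] = weight

    def related(self, name, min_weight=0):
        """Concepts linked from `name`, strongest relation first."""
        hits = [(t, w) for (s, t), w in self.relations.items()
                if s == name and w >= min_weight]
        return sorted(hits, key=lambda pair: pair[1], reverse=True)


net = SemanticNetwork()
for name, weight in [("nuclear", 100), ("heat", 99),
                     ("temperature", 95), ("papers", 86)]:
    net.add_concept(name, weight)

# Hypothetical links: the slide does not say which concepts each weight joins.
net.relate("nuclear", "heat", 78)
net.relate("nuclear", "temperature", 63)
net.relate("nuclear", "papers", 28)

strong = net.related("nuclear", min_weight=50)
```

Filtering by `min_weight` mirrors the idea of keeping only the most important relations when navigating from one concept to its neighbors.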

Page 21: Semantic network creation

Text is a string of characters: letters, spaces, punctuation marks.

Steps for building the Semantic Network
• Break text into words and sentences
• Push through an n-character window
• Feed patterns to a Recurrent Hierarchical Neural Network and record frequencies
• Identify relations between concepts (joint occurrence in a sentence)
• Carry out preliminary semantic network renormalization (Hopfield-like Neural Network): assign semantic weights
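Two of the steps above, the n-character window and the sentence-level joint occurrence, can be sketched with plain counters. This is a simplified stand-in: a `Counter` replaces the Recurrent Hierarchical Neural Network as the frequency store, and the function names and the sample text are made up for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Split on sentence-ending punctuation and drop empty pieces."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def char_windows(text, n):
    """Every n-character substring of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def cooccurrences(sentences):
    """Count unordered word pairs that appear in the same sentence."""
    pairs = Counter()
    for sentence in sentences:
        words = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
        pairs.update(combinations(words, 2))
    return pairs


text = "Heat drives fusion. Fusion releases heat."
window_counts = Counter(char_windows(text.lower(), 3))
pair_counts = cooccurrences(split_sentences(text))
```

Here `window_counts` records how often each 3-character pattern occurs, and `pair_counts` gives a raw relation strength for each pair of words that share a sentence.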

Page 22: General & Text-specific tasks

• Parse and reorganize input into sequences of words joined by concatenation and separation signs
• Recognize and remove auxiliary words and inflectional morphemes
• Recognize, count and store stem morphemes
• Identify words sharing stem morphemes
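The tasks above can be illustrated with a few lines of code. This is a rough sketch under strong assumptions: the auxiliary-word list and the ending list are tiny hand-written samples, and stripping a suffix stands in for real morpheme recognition, which the slides attribute to a trained neural network.

```python
from collections import Counter, defaultdict

# Illustrative samples only, not a real grammatical dictionary.
AUXILIARY = {"the", "a", "an", "of", "in", "and", "to", "is", "are"}
ENDINGS = ("ings", "ing", "ed", "es", "s")  # checked in this order


def stem(word):
    """Strip the first matching ending, keeping at least three characters."""
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[: -len(ending)]
    return word


def analyze(words):
    """Return (stem frequencies, stem -> set of surface words)."""
    counts, families = Counter(), defaultdict(set)
    for word in (w.lower() for w in words):
        if word in AUXILIARY:
            continue  # auxiliary words are removed before counting
        root = stem(word)
        counts[root] += 1
        families[root].add(word)
    return counts, families


words = "the reactor heats the cell and the heated cells are heating".split()
counts, families = analyze(words)
```

`counts` holds the stored stem frequencies, while `families` groups the words that share a stem.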

Page 23: Hierarchical Recurrent NN

Page 24: Hierarchical Recurrent NN

Page 25: General & Text-specific tasks

• Identify relationships
  • Text: joint occurrence in sentences
• Preliminary SN renormalization: optimization task similar to Hopfield network
• Association of concepts in SN with sentences and context in original text
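The slides do not spell out the renormalization procedure. As a loose illustration of the idea that a concept's weight should depend on its relations as well as its raw frequency, the sketch below uses a damped weight-propagation loop. It is not a Hopfield network and is not claimed to be TextAnalyst's actual procedure. The function name, the `rounds` and `damping` parameters, and the sample data are all assumptions.

```python
def renormalize(freq, links, rounds=20, damping=0.5):
    """Relax concept weights so that strongly connected concepts gain weight.

    freq:  concept -> raw frequency
    links: (a, b) -> co-occurrence count, treated as symmetric
    Returns weights scaled so that the strongest concept gets 100.
    """
    total = sum(freq.values())
    base = {c: f / total for c, f in freq.items()}
    weights = dict(base)
    for _ in range(rounds):
        updated = {}
        for c in weights:
            # Support from neighbors: link strength times neighbor weight.
            support = sum(n * weights[b if a == c else a]
                          for (a, b), n in links.items() if c in (a, b))
            updated[c] = (1 - damping) * base[c] + damping * support
        norm = sum(updated.values())
        weights = {c: w / norm for c, w in updated.items()}
    top = max(weights.values())
    return {c: round(100 * w / top) for c, w in weights.items()}


freq = {"fusion": 5, "heat": 5, "papers": 5}
links = {("fusion", "heat"): 4, ("fusion", "papers"): 1}
result = renormalize(freq, links)
```

Although all three concepts have the same raw frequency, "heat" ends up with a higher weight than "papers" because its link to "fusion" is stronger.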

Page 26

Page 27: Case study

• IRLP provides R&D assistance and information services to Indiana’s small businesses and governmental units
• IRLP searches SBIR and the Commerce Business Daily to identify research funding opportunities for its clients.

“TextAnalyst was able to find the necessary matches even for those clients where existing search program was incompatible.”
-- Cindy Moore, Marketing Coordinator, IRLP

Page 28: Customer quotes

Page 29: TextAnalyst supports medical research at Centers for Disease Control

"TextAnalyst is able to efficiently handle numerous and often large (90+ pages apiece) text files without any problem. Furthermore, the program is extremely user-friendly."

Eleanor McLellan
Data Manager / Analyst
Centers for Disease Control
Atlanta, GA

Page 30: TextAnalyst helps process texts at Clontech

"TextAnalyst has been selected as the only text analysis tool capable of establishing relations between terms. It is reasonably priced, easy to install and operate."

Nikolai Kalnin, Ph.D.
Team Leader
Bioinformatics Group
CLONTECH Laboratories, Inc.
Palo Alto, CA

Page 31: TextAnalyst saves time and resources for CaseBank

"TextAnalyst is used at CaseBank to identify and assess the contents of electronic repositories of troubleshooting and maintenance information. It saves case preparation time and allows CaseBank to be more responsive to its customer's knowledge retrieval needs."

Kalyan Gupta, Ph.D.
Director, Research
CaseBank Technologies Inc.
Brampton, Ontario

Page 32: Future developments

• Text categorization (now implemented in TextAnalyst COM)
• Thesaurus-based text retrieval
• Integration with Web technologies

Page 33: TextAnalyst evaluation

We invite you to download a FREE evaluation copy of TextAnalyst from

www.megaputer.com

and enjoy using it hands-on following the provided step-by-step lessons, or exploring your own data.

Page 34: Textractor™ Technology and Applications

Page 35: Textractor capabilities

• Key senses extraction
• Hierarchical clustering
• Categorization
• Summarization
• Intelligent search
• Feature extraction

Page 36: Textractor applications

General
• Automated email categorization and routing (categories can be provided by the user or determined by the system)
• Knowledge extraction from call center notes (example: occupational hazard determination)
• Knowledge-based executive reporting system (one-glance knowledge visualization)
• Flexible searching for support documentation (semantic relations between terms: synonyms, hyponyms, meronyms)
• Competitive intelligence

Insurance
• Clustering of claims and ontology building (hierarchical organization of textual data)
• Automated feature extraction and claim tagging

Page 37: Textractor analysis steps

• Morphological analysis
• Syntactic analysis
• Semantic analysis: WordNet filtering (synonymy, antonymy, hyper/hyponymy and holo/meronymy)
• Statistical analysis (frequency of terms against background frequencies)
• Context analysis (polysemy resolving and term collocations)
• Semantic Network comparison
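The statistical step, comparing term frequencies in a document against background frequencies, can be sketched as a simple ratio. This is one plausible reading of the slide, not Textractor's documented formula. The function names, the `floor` smoothing value, and the toy texts are assumptions.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Share of the token stream taken by each term."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def salience(doc_tokens, background_tokens, floor=1e-6):
    """Ratio of a term's document frequency to its background frequency.

    `floor` avoids division by zero for terms unseen in the background.
    """
    doc = relative_frequencies(doc_tokens)
    background = relative_frequencies(background_tokens)
    return {term: f / max(background.get(term, 0.0), floor)
            for term, f in doc.items()}


background = "the claim was filed the form was sent the office was closed".split()
document = "the claim cites whiplash and the claim cites fracture".split()
scores = salience(document, background)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Terms that are common everywhere, such as "the", score below 1, while terms that are frequent in the document but rare in the background rise to the top of `ranked`.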

Page 38: WordNet

• WordNet is a comprehensive semantically organized lexical database for English (www.cogsci.princeton.edu/~wn)
• Textractor provides an ability to expand and edit WordNet for a specific application field.

Page 39: Semantic term relationships

• Synonyms: Accident – Collision – Wreck
• Hyper/Hyponyms: Bird (hypernym) : Eagle, Hawk, Pigeon (hyponyms)
• Holo/Meronyms: Car (holonym) : Motor, Windshield, Tire (meronyms)
• Antonyms: Cold <> Hot, Deep <> Shallow
• Polysemy: Commercial Bank, River Bank
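These relationships are what allow a search to match documents that never mention the query term itself. The sketch below expands a query term through a small hand-built lexicon that uses the slide's own examples. The `LEXICON` dictionary is a stand-in for WordNet, and the function names are hypothetical.

```python
# Hand-built stand-in for WordNet, using the examples from the slide.
LEXICON = {
    "accident": {"synonyms": {"collision", "wreck"}},
    "bird": {"hyponyms": {"eagle", "hawk", "pigeon"}},
    "car": {"meronyms": {"motor", "windshield", "tire"}},
}


def expand(term, relations=("synonyms", "hyponyms")):
    """Return the term plus every word reachable through the given relations."""
    entry = LEXICON.get(term.lower(), {})
    expanded = {term.lower()}
    for relation in relations:
        expanded |= entry.get(relation, set())
    return expanded


def matches(query_term, text, relations=("synonyms", "hyponyms")):
    """True if the text contains the term or any of its expansions."""
    words = set(text.lower().split())
    return bool(expand(query_term, relations) & words)
```

For example, `matches("accident", "Two-car collision reported")` succeeds through the synonym relation, while `matches("car", "cracked windshield")` succeeds only when `relations` includes `"meronyms"`.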

Page 40: Textractor architecture

[Figure: architecture diagram. Components shown: Data sources; Filters and DW interfaces; Morphological Analysis; Syntactic Analysis; Link Parser; Semantic Analysis; WordNet; Field-specific WordNet Extensions; WordNet Extension Editor; Stored Indices; Text Mining Engines (Core TM engines, Application-oriented TM engines). The connections between components are not recoverable from the transcript.]

Page 41: Textractor text mining engines

Core TM engines
• Text indexer
• Formal search query creator
• Key senses extractor
• Feature extractor

Application-oriented TM engines
• Text Categorizer
• Text Clusterizer
• Database enrichment and mining
• Intelligent Searcher (synonyms, hyper/hyponyms, term proximity, frequencies)
• Document tagging

Page 42: Any Questions?

Call Megaputer at (812) 330-0110

or write

120 W Seventh Street, Suite 310
Bloomington, IN 47404 USA

[email protected]

Page 43: Appendix A: TextAnalyst technology details

Page 44: Two aspects of text

• Sequence of characters, characterized by patterns that represent information recognized by humans
• Structured sequence of lexical units, organized together according to morphological and syntactic rules (morphemes, auxiliary lexical units, syntactic members, sentences, etc.)

Page 45: Semantics of text

• Humans rely on multimodal associations for creating semantic models
• Standalone text: semantics is formal, but still useful
• Meaning of a concept: collection of relations of this concept to other concepts in the text (constructive definition)

Page 46: Lexical vs. Grammatical

• Lexical meaning of a word: determined by stem morpheme (word combinations: chains of morphemes)
• Grammatical meaning: determined by morphemes (prefixes, endings, etc.) and auxiliary semantic units (articles, prepositions, etc.)
• Grammatical chains: word sequences with extracted stem morphemes; frames for contents

Page 47: Semantic structure of texts

• Single text: semantic analysis can be performed, but is not sufficient: need a knowledge base against which the text can be analyzed
• Analysis of a large number of texts from diverse fields => Grammatical structure of the language
• Analysis of a large number of texts from the field of interest => Knowledge Base

Page 48: Grammatical + Lexical = Semantic

• Grammatical dictionaries of morphemes and auxiliary words of a language: threshold transformation applied to a NN trained on a large corpus of texts from diverse fields
• Trained “grammatical NN” acts as a filter. “Lexical” NN is connected to its output.
• Combining elements from both NN: obtain a list of concepts for the Semantic Network (after relational renormalization)
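The filter arrangement can be imitated without neural networks. In the sketch below, plain frequency counts stand in for both networks: words that occur in at least a threshold share of texts from diverse fields form the grammatical dictionary, and whatever survives that filter in a domain text is counted as a candidate concept. The function names, the `threshold` parameter, and the sample texts are assumptions, and whole words are used instead of morphemes for brevity.

```python
from collections import Counter


def grammatical_dictionary(diverse_texts, threshold):
    """Words present in at least `threshold` (a fraction) of the diverse texts."""
    doc_freq = Counter()
    for text in diverse_texts:
        doc_freq.update(set(text.lower().split()))
    cutoff = threshold * len(diverse_texts)
    return {word for word, n in doc_freq.items() if n >= cutoff}


def lexical_candidates(domain_text, grammatical):
    """Count the words that survive the grammatical filter."""
    return Counter(w for w in domain_text.lower().split()
                   if w not in grammatical)


diverse = [
    "the price of the stock fell",
    "the cell of the reactor cooled",
    "the verdict of the court stood",
]
grammatical = grammatical_dictionary(diverse, threshold=1.0)
concepts = lexical_candidates("the heat of the cell and the heat of fusion",
                              grammatical)
```

With only three background texts, the dictionary catches "the" and "of" but misses "and"; a larger and more diverse corpus is what makes the threshold meaningful.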