irmac presentation for website

Copyright 2003-4, SPSS Inc. 1

An Introduction to Text Mining

Tim Daciuk, SPSS, Inc., Services Manager, Canada

Agenda

Introductions

An Overview of Document Warehousing

Understanding Unstructured Text

Concept Extraction

Text Mining

Data Mining

Demonstration

Tim Daciuk

Background: social research, survey research

SPSS: 25 years working with the product, 12 years working with the company, 5 years working with text analysis

Prior history: consulting, education

Predictive Analytics: Defined

“Predictive analysis helps connect data to effective action by drawing reliable conclusions about current conditions and future events.”

- Gareth Herschel, Research Director, Gartner Group

SPSS At A Glance

Leadership: market leader in predictive analytics; focus on online and offline customer data acquisition and analysis

Stability: founded in 1968; 30+ year heritage in analytic technologies

Proven track record: 250,000+ customers worldwide; NASDAQ: SPSS

Analytics standard: 80% of Fortune 500 are SPSS customers; 80%-plus market share in the survey and market research sector; ranked #1 data mining solution by KDnuggets

Some of Our Brands

Unstructured Data Management

Text mining is a subset of Unstructured Data Management (UDM).

UDM can be broken down into:

Content and Document Management

Search and Retrieval

XML database and tools

Categorization, Classification, and Visualization

80% of Data is Unstructured

Database notes: call center transcripts, other CRM

Email

Open-ended survey responses

Web pages

NewsGroups

Documents themselves

Competitive information

Applications for Text Analysis

Surveys

‘Reading’ email

Call centre data

Comment data

Abstracts

Document management

Corporate history

Thematic understanding of website

Data Warehouse vs. Document Warehouse

Data warehouse: who, what, when, where, how much; internally focused; operational information; rarely includes external information

Document warehouse: why; may not be internally focused; may contain a range of information; often integrates external information

Document Warehouse Features

There is no single document structure or document type

Documents are drawn from multiple sources

Essential features of documents are automatically extracted and explicitly stored in the document warehouse

Document warehouses are designed to integrate semantically related documents

Building the Document Warehouse

Identify Sources

Retrieve Document

Text Analysis

Pre-process Document

Compile Metadata
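The five steps can be sketched as a small pipeline. This is only an illustration, not how any SPSS product works: the function names, the in-memory SOURCES dictionary, and the metadata fields are all hypothetical, and the sketch assumes pre-processing runs before text analysis.

```python
import re
from collections import Counter

# Hypothetical in-memory "sources"; a real warehouse would pull from
# file shares, mail stores, or the web.
SOURCES = {
    "memo-1": "<p>Customers   praise the NEW billing portal.</p>",
    "memo-2": "<p>Billing errors drive call centre complaints.</p>",
}

def identify_sources():
    """Step 1: list the documents worth collecting."""
    return sorted(SOURCES)

def retrieve_document(source_id):
    """Step 2: fetch the raw document."""
    return SOURCES[source_id]

def preprocess_document(raw):
    """Step 3 (assumed order): strip markup, collapse whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", text).strip().lower()

def analyze_text(text):
    """Step 4: a stand-in for text analysis, counting words longer than 3 letters."""
    words = re.findall(r"[a-z]+", text)
    return Counter(w for w in words if len(w) > 3)

def compile_metadata(source_id, text, term_counts):
    """Step 5: store the extracted features explicitly alongside the document."""
    return {
        "source": source_id,
        "length": len(text),
        "terms": sorted(term_counts),
    }

def build_warehouse():
    warehouse = []
    for source_id in identify_sources():
        text = preprocess_document(retrieve_document(source_id))
        warehouse.append(compile_metadata(source_id, text, analyze_text(text)))
    return warehouse

if __name__ == "__main__":
    for entry in build_warehouse():
        print(entry)
```

Keeping each step as its own function mirrors the slide: any one stage (for example, the word counter standing in for text analysis) can be swapped out without touching the others.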

Predict, Impact, Deploy

[Diagram: Customer Data (Attitudes, Actions, Attributes); Data Collection (Text, Surveys, Web Channel, Operational Systems); Business User Outcomes (Attract, Grow, Retain, Fraud); Text Business UI and Expert UI; Concepts, Concept Maps, Clustering, Categorization, Trending, Information Extraction, Prediction, NLP]

The Building Blocks of Language

Morphology

Syntax

Semantics

Phonology

Pragmatics

Morphology

Understanding words: stems, affixes (prefix, suffix), inflectional elements

Reduces complexity of analysis

Reduces complexity of representation

Supports text mining

[Diagram: Noun = Prefix + Noun Stem + Suffix: in- + dispute + -able]
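A rough sketch of how stripping affixes reduces the number of distinct word forms, using the slide's in- + dispute + -able example. The affix lists and the split_affixes helper are hypothetical and deliberately crude; a real morphological analyzer uses a lexicon and would restore the dropped final "e" of "dispute".

```python
# Hypothetical, very small affix inventories for illustration only.
PREFIXES = ("in", "un", "re")
SUFFIXES = ("able", "ing", "ed", "s")

def split_affixes(word, min_stem=3):
    """Split a word into (prefix, stem, suffix) by naive string matching."""
    word = word.lower()
    # Take the first listed prefix that leaves a stem of at least min_stem letters.
    prefix = next(
        (p for p in PREFIXES
         if word.startswith(p) and len(word) - len(p) >= min_stem),
        "",
    )
    rest = word[len(prefix):]
    # Same idea for the suffix, applied to what remains.
    suffix = next(
        (s for s in SUFFIXES
         if rest.endswith(s) and len(rest) - len(s) >= min_stem),
        "",
    )
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

if __name__ == "__main__":
    for w in ("indisputable", "disputed", "disputing"):
        print(w, split_affixes(w))
```

All three sample words reduce to the same stem string, which is the sense in which morphology reduces the complexity of representation.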

Syntax

The Bank of Canada will curb inflation with higher interest rates

[Parse tree: Sentence = Noun phrase (The Bank of Canada) + Verb phrase; Verb phrase = Aux (will) + Verb (curb) + Noun (inflation) + Prepositional phrase; Prepositional phrase = with + Noun phrase; Noun phrase = Adjective (higher) + Noun (interest rates)]

Semantics

The meaning of it all

Approaches to meaning: semantic networks, deductive logic, rule-based systems

Useful for classification

Problems with NLP

Limitations of Natural Language Processing: correctly identifying the role of noun phrases; representing abstract concepts; classifying synonyms; representing the number of concepts

Problems with NLP

Limitations of technology: language-specific designs are required; classification speed; classifying hybrid words and sentences

Underlying Technology is Based on Linguistics

The Linguistic Approach:

Does not treat a document as a bag of words

Removes ambiguity by extracting structured concepts

Concepts are the DNA of text.

Text is unstructured, ambiguous, and language dependent.
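A minimal sketch of why treating a document as a bag of words loses information: two sentences with opposite meanings produce identical word counts. The bag_of_words helper and the sample sentences are made up for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count words, ignoring order entirely."""
    return Counter(text.lower().split())

# Hypothetical sentences: same words, opposite meaning.
first = "the bank sued the customer"
second = "the customer sued the bank"

same_bag = bag_of_words(first) == bag_of_words(second)
same_order = first.split() == second.split()

if __name__ == "__main__":
    print("identical bags:", same_bag)
    print("identical word order:", same_order)
```

A linguistic approach keeps the structure that distinguishes the two sentences (who did what to whom), which a pure word count cannot.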

From Text to Concepts

Linguistic Terminology Extractor: linguistics (morphology, syntax, semantics) combined with statistics

Accurate

• Compound words

• Proper nouns

• Figures

• Named entities

• Domain specifics

Scalable

• Speed: 1GB/hour

• Multiple formats: PDF, MS Office, text…

• Multiple languages: English, French, German, Spanish, Italian, Dutch, Japanese

Customizable

• SPSS dictionaries

• User dictionaries

• Extraction rules

• Extraction patterns

Discovery-Oriented

• Known terms

• Unknown terms

• New terms

Examples:

• Inserm; merck & co…

• tnp-470; glut-4…

• factor receptor; Inhibitory effect

• D. John Paganoni, ..

• Positive/Negative opinion…

• London, Paris…

• Names, Orgs…

• MeSH, genes...

• Predicates

• Synonyms, stop words..

• Trends

From Concepts to Predictive Analytics Components

Linguistic Terminology Extractor

LexiQuest Mine: discover concepts, relationships and trends

LexiQuest Categorize: understand documents and assign them to pre-defined categories

Text Mining for Clementine: add text fields to data mining for better prediction
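One way to picture "adding text fields to data mining" is to turn extracted concepts into indicator columns on a structured record. The sketch below is an assumption about the general idea, not the Clementine implementation: add_concept_fields, the field naming, and the plain substring matching are all hypothetical.

```python
def add_concept_fields(record, text, concepts):
    """Return a copy of a structured record with one 0/1 field per concept.

    Matching is a naive case-insensitive substring test; a real system
    would use the concepts produced by its linguistic extractor.
    """
    lowered = text.lower()
    enriched = dict(record)
    for concept in concepts:
        field = "concept_" + concept.replace(" ", "_")
        enriched[field] = int(concept in lowered)
    return enriched

# Hypothetical customer record and call-centre note.
row = add_concept_fields(
    {"customer_id": 7, "tenure_months": 12},
    "Billing error again, very long Wait Time",
    ["billing error", "wait time", "refund"],
)

if __name__ == "__main__":
    print(row)
```

The enriched row can then be fed to any modeling technique alongside the original numeric and categorical fields.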

Concept Extraction Engine

The extractor turns unstructured text into concepts:

[Diagram: LexiQuest Extractor Engine (Linguistic Processor); Visualization; Probabilities; LexiQuest Mine; Clementine; LexiQuest Categorize]

Part-of-Speech Tagging

a: adjective b: adverb c: preposition

d: determiner n: noun v: verb

o: coordination p: participle s: stop word

How is a Concept Extracted?

Step 1: Part-of-Speech Tagging

Using/V a/P tool/N like/A LexiQuest/N Mine/N is/V a/P great/A idea/N for/P any/A organization/N that/P is/V interested/V in/P maintaining/V information/N on/P competitive/N intelligence/N

How is a Concept Extracted?

Step 2: Matching to Known Patterns

This:

V P N A N N V P A N P A N P V V P V N P N N

Looks Most Like:

N C D N N

(32 Known patterns for English)

How is the Concept Extracted?

The extractor looks at this sentence: Using a tool like LexiQuest Mine is a great idea for any organization that is interested in maintaining information on competitive intelligence.

And extracts the concept: Competitive Intelligence

Concepts are: noun based; can be longer than one word
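The two steps (tag each word, then match tag sequences against known patterns) can be sketched as follows. This is not the LexiQuest algorithm: the tiny lexicon, the four patterns, and the choice to tag unknown words as stop words are assumptions made for illustration, and "competitive" is tagged here as an adjective rather than as on the slide.

```python
# Hypothetical hand-made lexicon using lowercase tag letters:
# a = adjective, c = preposition, d = determiner, n = noun, v = verb.
LEXICON = {
    "maintaining": "v", "information": "n", "on": "c",
    "competitive": "a", "intelligence": "n",
    "is": "v", "a": "d", "great": "a", "idea": "n",
}

# Hypothetical noun-based patterns, longest first (the slide cites 32 for English).
PATTERNS = [("a", "n", "n"), ("a", "n"), ("n", "n"), ("n",)]

def tag(tokens):
    """Step 1: look up a part-of-speech tag for each token ('s' if unknown)."""
    return [LEXICON.get(t.lower(), "s") for t in tokens]

def extract_concepts(sentence):
    """Step 2: scan left to right, taking the longest pattern that matches."""
    tokens = sentence.strip(".").split()
    tags = tag(tokens)
    concepts, i = [], 0
    while i < len(tokens):
        for pattern in PATTERNS:
            n = len(pattern)
            if tuple(tags[i:i + n]) == pattern:
                concepts.append(" ".join(tokens[i:i + n]).lower())
                i += n
                break
        else:
            i += 1  # no pattern starts here; move on one token
    return concepts

if __name__ == "__main__":
    print(extract_concepts("maintaining information on competitive intelligence."))
```

Because every pattern ends in a noun and may span several tokens, the extracted concepts are noun based and can be longer than one word, as the slide states.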

Example: Categorization

The Issue of Language

NLP requires separate language understanding

Clementine text mining: French, English, English/French, German, Spanish, Dutch, Japanese, Italian, MeSH (Medical Subject Headings)

http://www.nlm.nih.gov/mesh/meshhome.html

Data Mining Defined

“The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.”

- The Gartner Group

Why data mining?

Data mining software generally employs modeling algorithms designed to handle non-linearities and unusual patterns in data, as opposed to classical linear models (e.g., linear regression), which are less capable of doing so

A related issue is ‘noise’ in the data: where, for example, two seemingly similar sets of inputs yield different outputs

A Data Mining Methodology

Use the Cross-Industry Standard Process for Data Mining (CRISP-DM)

Based on real-world lessons: focus on business issues; user-centric and interactive; full process; results are used

Data Mining is not…

Keep in mind that data mining is not: “blind” application of analysis/modeling algorithms; brute-force crunching of bulk data; black box technology; magic

Back to the Process

Text Mining

Understanding

Business Understanding: determine objective; assess situation; determine data mining goals; produce project plan

Data Understanding: collect initial data; describe data; explore data; verify data quality

Data Preparation

Data: data set; data set description; select data; clean data; construct data set / integrate data; format data

Text: concept extraction; concept combination; concept assessment

Modeling

Select modeling technique: universe of techniques; appropriate techniques (data, text); requirements; constraints; selected tools

Generate test design

Run model(s)

Assess model(s)

Evaluation

Results = Models + Findings

Evaluate results

Review process

Determine next steps

Deployment

Plan deployment

Plan monitoring and maintenance

Final report

Project review

Data Mining Approaches

Unsupervised methods: group patients by drugs and demographic information and try to find unusual patients

Supervised methods: attempt to predict amount due and find sets of cases where the amount due is very different from the predicted amount
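The supervised approach (predict the amount due, then look for cases far from the prediction) can be sketched with a one-variable least-squares fit. Everything here is an assumption for illustration: the single predictor, the made-up claims data, and the rule of flagging residuals larger than twice the root-mean-square residual.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def unusual_cases(xs, ys, threshold=2.0):
    """Return indices whose actual value is far from the predicted value."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # Root-mean-square residual as a simple measure of typical error.
    spread = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]

# Hypothetical claims: units of service billed vs. amount due.
units = [1, 2, 3, 4, 5, 6, 7, 8]
amount_due = [10, 20, 30, 40, 50, 60, 70, 200]

if __name__ == "__main__":
    print("cases to review:", unusual_cases(units, amount_due))
```

The flagged cases are candidates for review rather than confirmed problems; in practice the model would use many predictors and a more robust error measure.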

What Does Data Mining Do?

Data mining uses existing data to:

Predict: category membership or numeric value (e.g., credit risk)

Group: cluster (group) things together based on their characteristics (e.g., different types of TV viewers)

Associate: find events that occur together, or in a sequence (e.g., beer and diapers)

Find outliers: identify cases that don’t follow expected behavior (e.g., fraudulent behaviour)
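The "associate" task can be illustrated by counting how often pairs of items occur together. The sketch below computes support (share of baskets containing both items) and confidence (share of baskets with the first item that also contain the second) for item pairs; the baskets, the pair_rules helper, and the 0.5 support cut-off are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return (item_a, item_b, support, confidence of a -> b) for frequent pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules.append((a, b, support, count / item_counts[a]))
    return sorted(rules)

# Hypothetical shopping baskets.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "diapers"],
    ["beer", "chips"],
]

if __name__ == "__main__":
    for a, b, support, confidence in pair_rules(baskets):
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

Counting pairs directly is fine for a handful of items; larger catalogues need an algorithm that prunes infrequent combinations before counting.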

Benefits of Document Warehousing

Richer operational business intelligence

Knowing your customers

Macroenvironmental monitoring

Technology assessment

Conclusions

Text mining is: more than word counts; linguistically based; concept extraction

Data mining is: advanced analytics applied to datasets; a family of techniques; supervised or unsupervised

Conclusions

Text and data mining: add dimensionality to the data; allow for automation of the text analysis event; create a 360-degree view

Applications: websites; surveys; email; call centre; documentation


?

So How Do I Get Started?

Document Warehousing and Text Mining, Dan Sullivan, Wiley, 2001

Survey of Text Mining: Clustering, Classification, and Retrieval, Michael W. Berry (ed.), Springer, 2003

Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization, P. Jackson and I. Moulinier, John Benjamins, 2002

SPSS Canada

Tim Daciuk, Services Manager, Canada, 416-410-7921, 800-543-6607 ext. 5156, [email protected]

Hugh Rooney, SPSS Sales Canada, 416-410-7921, 905-886-4322, [email protected]

www.spss.com