Text Mining by Examples, by Hadi Mohammadzadeh

Page 1: Text mining by examples, By Hadi Mohammadzadeh

Hadi Mohammadzadeh Text Mining by Examples Pages

By: Hadi Mohammadzadeh
Institute of Applied Information Processing
University of Ulm – 27 Jan. 2010

Seminar on Text Mining by Examples

Page 2: Text mining by examples, By Hadi Mohammadzadeh

Outline

1. New Terminologies
2. WordNet - A Large Lexical Database of English
3. Reuters-21578 … as a Text Collection
4. CMU Text Learning Group Data Archives
5. Text Mine Software - Web-based algorithms
6. Text Mine Software - Command-based algorithms
7. Useful Web Sites

Page 3: Text mining by examples, By Hadi Mohammadzadeh

Part One

New Terminologies

Word and Meaning Relationships

Page 4: Text mining by examples, By Hadi Mohammadzadeh

Understanding Text – Hyponym and Hypernym

• In linguistics, a hyponym is a word or phrase whose semantic range is included within that of another word, its hypernym. For example, scarlet and crimson are both hyponyms of red (their hypernym), which is, in turn, a hyponym of colour.
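
The hyponym/hypernym relation can be sketched as a small lookup table. The table below is a hand-written toy built only from the slide's example words; the names HYPERNYM_OF, hypernym_chain and hyponyms are illustrative assumptions, not part of WordNet or any library.

```python
# Toy hypernym table taken from the slide's example; this is not WordNet data.
HYPERNYM_OF = {
    "scarlet": "red",
    "crimson": "red",
    "red": "colour",
}


def hypernym_chain(word):
    """Return the increasingly general terms above `word`, nearest first."""
    chain = []
    while word in HYPERNYM_OF:
        word = HYPERNYM_OF[word]
        chain.append(word)
    return chain


def hyponyms(word):
    """Return the direct hyponyms of `word`, sorted alphabetically."""
    return sorted(w for w, parent in HYPERNYM_OF.items() if parent == word)


print(hypernym_chain("scarlet"))  # ['red', 'colour']
print(hyponyms("red"))            # ['crimson', 'scarlet']
```

Following the chain upward shows that the relation is transitive: scarlet is a kind of red, and therefore also a kind of colour.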

Page 5: Text mining by examples, By Hadi Mohammadzadeh

Understanding Text – Meronym

• Meronymy is a semantic relation used in linguistics. A meronym denotes a constituent part of, or a member of, something. That is,
– X is a meronym of Y if Xs are parts of Y(s), or
– X is a meronym of Y if Xs are members of Y(s).

• For example, 'finger' is a meronym of 'hand' because a finger is part of a hand. Similarly 'wheel' is a meronym of 'automobile'.

Page 6: Text mining by examples, By Hadi Mohammadzadeh

Understanding Text – Holonym

• Holonymy defines the relationship between a term denoting the whole and a term denoting a part of the whole. That is,
– 'X' is a holonym of 'Y' if Ys are parts of Xs, or
– 'X' is a holonym of 'Y' if Ys are members of Xs.

• For example, 'tree' is a holonym of 'bark', of 'trunk' and of 'limb'.
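
Meronymy and holonymy are two directions of the same part-whole relation, which a short sketch makes concrete. The pairs below come from the slides' own examples; the table name PART_OF and both helper functions are hypothetical and do not reflect WordNet's actual storage format.

```python
# Toy part-whole pairs (part, whole) from the slides' examples; not WordNet data.
PART_OF = [
    ("finger", "hand"),
    ("wheel", "automobile"),
    ("bark", "tree"),
    ("trunk", "tree"),
    ("limb", "tree"),
]


def meronyms(whole):
    """Parts (or members) of `whole`, sorted alphabetically."""
    return sorted(part for part, w in PART_OF if w == whole)


def holonyms(part):
    """Wholes that `part` belongs to, sorted alphabetically."""
    return sorted(w for p, w in PART_OF if p == part)


print(meronyms("tree"))    # ['bark', 'limb', 'trunk']
print(holonyms("finger"))  # ['hand']
```

Storing each pair once and reading it in either direction keeps the two relations consistent by construction.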

Page 7: Text mining by examples, By Hadi Mohammadzadeh

Part Two

WordNet

A Large Lexical Database of English

Page 8: Text mining by examples, By Hadi Mohammadzadeh

WordNet

• WordNet® is a large lexical database of English, developed under the direction of George A. Miller.

• Development of WordNet began in 1985, and it is widely used in tools that manage text.

• WordNet is more than just a dictionary and thesaurus; it includes all kinds of relationships between words. WordNet version 2.0 contains roughly 150,000 content words.

Page 9: Text mining by examples, By Hadi Mohammadzadeh

WordNet cont.

• Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

• WordNet is also freely and publicly available for download.

• WordNet's structure makes it a useful tool for computational linguistics and natural language processing.

Page 10: Text mining by examples, By Hadi Mohammadzadeh

Understanding Text – Polysemy

Number of Senses in WordNet

• A word can have more than one meaning, and the intended meaning is not always obvious from the sentence.

• In WordNet a word has an average of 1.4 senses.

Average Number of Senses by Word Class

Word Class    Average Number of Senses
Verb          2.1
Adjective     1.45
Adverb        1.25
Noun          1.24

Page 11: Text mining by examples, By Hadi Mohammadzadeh

Understanding Text – Polysemy

Number of Senses in WordNet

Words with the Highest Number of Senses from WordNet

Word Number of Senses

Break 74

Cut 73

Run 57

Play 52

Make 51

Page 12: Text mining by examples, By Hadi Mohammadzadeh

Understanding Text – Polysemy

Number of POS in WordNet

• Some words also have more than one part of speech (POS). For example, 'still' has five different parts of speech.

Word     Number of POS
Out      5
Round    5
Still    5
Down     5
Over     4

Page 13: Text mining by examples, By Hadi Mohammadzadeh

Word Classifications in WordNet

• Words can be classified into word classes or POS.

• We refer to nouns, verbs, adjectives, and adverbs as content words.

• Conjunctions, determiners, pronouns, and prepositions are called function words.

Frequencies of Word Classes from WordNet

Type           Number
Noun           114,400 (75%)
Adjective      21,438 (14%)
Verb           11,341 (7.4%)
Adverb         4,662 (3%)
Preposition    133 (0.08%)
Pronoun        118 (0.07%)
Conjunction    89 (0.05%)
Determiner     14 (0.009%)
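
The percentages in this table can be re-derived from the counts. The sketch below uses only the numbers on the slide; the variable names and the split into content and function classes follow the slide's definitions, and nothing here reads from WordNet itself.

```python
# Counts copied from the slide's table.
COUNTS = {
    "Noun": 114400,
    "Adjective": 21438,
    "Verb": 11341,
    "Adverb": 4662,
    "Preposition": 133,
    "Pronoun": 118,
    "Conjunction": 89,
    "Determiner": 14,
}
# Nouns, verbs, adjectives and adverbs are the content words.
CONTENT_CLASSES = {"Noun", "Adjective", "Verb", "Adverb"}

total = sum(COUNTS.values())
content_total = sum(n for c, n in COUNTS.items() if c in CONTENT_CLASSES)


def share(word_class):
    """Percentage of all listed words that belong to `word_class`."""
    return 100.0 * COUNTS[word_class] / total


for word_class, count in COUNTS.items():
    print(f"{word_class:<12} {count:>8,} {share(word_class):8.3f}%")
print("content words:", content_total)  # 151841
```

The content-word total of 151,841 agrees with the "roughly 150,000 content words" quoted for WordNet 2.0.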

Page 14: Text mining by examples, By Hadi Mohammadzadeh

WordNet Website and Developed Program

• WordNet Website

• WordNet Developed Program

Page 15: Text mining by examples, By Hadi Mohammadzadeh

Part Three

Reuters-21578 as a Text Collection

Page 16: Text mining by examples, By Hadi Mohammadzadeh

Reuters-21578 History

• The documents in the Reuters-21578 collection appeared on the Reuters newswire in 1987.

• Reuters-21578 is a test collection for evaluating automatic text categorization techniques. It is a classic benchmark for text categorization algorithms.

• The Reuters-21578 collection is distributed in 22 files. Each of the first 21 files contains 1,000 documents, while the last contains 578 documents.
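
The file layout explains the collection's name, as a quick calculation shows. The sketch uses only the figures stated on the slide; the constant and function names are illustrative, and no actual collection files are read.

```python
FILES = 22                 # number of distribution files
DOCS_PER_FULL_FILE = 1000  # documents in each of the first 21 files
DOCS_IN_LAST_FILE = 578    # documents in the final file


def docs_in_file(index):
    """Number of documents in file `index` (0-based), per the slide."""
    if not 0 <= index < FILES:
        raise IndexError("file index out of range")
    return DOCS_IN_LAST_FILE if index == FILES - 1 else DOCS_PER_FULL_FILE


total_docs = sum(docs_in_file(i) for i in range(FILES))
print(total_docs)  # 21578
```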

Page 17: Text mining by examples, By Hadi Mohammadzadeh

Reuters-21578

• Distribution 1.0, 26 September 1997, by David D. Lewis, AT&T Labs - Research

• The data was originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system.

Page 18: Text mining by examples, By Hadi Mohammadzadeh

Part Four

CMU Text Learning Group Data Archives as a Text Collection

Page 19: Text mining by examples, By Hadi Mohammadzadeh

CMU Text Learning Group Data Archives

• This data set is a collection of 20,000 messages, collected from 20 different netnews newsgroups. One thousand messages from each of the twenty newsgroups were chosen at random and partitioned by newsgroup name.

• Link

• Sample Message

• Experiment Results

• Prof. Cho, Sam Houston State University

Page 20: Text mining by examples, By Hadi Mohammadzadeh

CMU Text Learning Group Data Archives

1. alt.atheism
2. talk.politics.guns
3. talk.politics.mideast
4. talk.politics.misc
5. talk.religion.misc
6. soc.religion.christian
7. comp.sys.ibm.pc.hardware
8. comp.graphics
9. comp.os.ms-windows.misc
10. comp.sys.mac.hardware
11. comp.windows.x
12. rec.autos
13. rec.motorcycles
14. rec.sport.baseball
15. rec.sport.hockey
16. sci.crypt
17. sci.electronics
18. sci.space
19. sci.med
20. misc.forsale

Page 21: Text mining by examples, By Hadi Mohammadzadeh

Part Five

Text Mine Software

Web-based algorithms

Page 22: Text mining by examples, By Hadi Mohammadzadeh

Text Mine Application

• The three scripts in the first row handle:
1. The creation of text statistics
   • Number of word types
   • Letter frequencies
   • Word frequencies
2. Entity extraction
3. Finding the POS tags for words
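
The three text statistics named in the first item can be sketched in a few lines. This is a minimal illustration, assuming a simple lowercase letters-and-apostrophes definition of a word; it is not the Text Mine script itself, and the function name text_statistics is made up for this example.

```python
import re
from collections import Counter


def text_statistics(text):
    """Compute word types, word frequencies and letter frequencies."""
    lowered = text.lower()
    # Assumption: a word is a run of letters and apostrophes.
    words = re.findall(r"[a-z']+", lowered)
    letters = [ch for ch in lowered if ch.isalpha()]
    return {
        "word_types": len(set(words)),    # number of distinct words
        "word_freq": Counter(words),      # occurrences of each word
        "letter_freq": Counter(letters),  # occurrences of each letter
    }


stats = text_statistics("The cat sat on the mat. The mat was flat.")
print(stats["word_types"])        # 7
print(stats["word_freq"]["the"])  # 3
print(stats["letter_freq"]["t"])  # 8
```

The sample sentence has ten word tokens but only seven word types, which is the distinction the first statistic captures.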

Page 23: Text mining by examples, By Hadi Mohammadzadeh

Text Mine Application

• As input, use a text file such as the Help file, or type text into the text box.

Page 24: Text mining by examples, By Hadi Mohammadzadeh

Part Six

Text Mine Software

Command-based algorithms

Page 25: Text mining by examples, By Hadi Mohammadzadeh

Zeroth Program: Tokens

• Name of program: tokens.pl
• Input: sample.
• Output: running the program generates a text file named tokens.txt
• Aim: generating tokens
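
Token generation can be illustrated with a small regular-expression tokenizer. This is a sketch in Python under simple assumptions (words may contain one internal apostrophe, every other non-space symbol is its own token); it does not reproduce the rules of tokens.pl, which the slides do not describe.

```python
import re

# A word, optionally followed by an apostrophe part (didn't, Smith's),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Split `text` into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Dr. Smith's dog didn't bark."))
# ['Dr', '.', "Smith's", 'dog', "didn't", 'bark', '.']
```

Note that the period after "Dr" becomes a separate token here; a fuller tokenizer would consult an abbreviation list before splitting.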

Page 26: Text mining by examples, By Hadi Mohammadzadeh

First Program: Part of Speech Tagger

• Name of program: pos-test.pl
• Input: embedded in the Perl file
• Output: running the program generates a text file named pos_test_results.txt
• Aim: part of speech tagging
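
The slides do not say how pos-test.pl assigns tags, so the following is only a baseline sketch of the idea: look each token up in a lexicon and take its first listed tag. The lexicon, the tag names and the default tag are all assumptions made for illustration.

```python
# Hypothetical toy lexicon: word -> possible parts of speech,
# with the tag assumed most likely listed first.
LEXICON = {
    "the": ["determiner"],
    "dog": ["noun"],
    "barks": ["verb", "noun"],
    "loudly": ["adverb"],
}


def tag(tokens, default="noun"):
    """Tag each token with its first listed POS, or `default` if unknown."""
    return [(t, LEXICON.get(t.lower(), [default])[0]) for t in tokens]


print(tag(["The", "dog", "barks", "loudly"]))
```

A lookup like this ignores context, which is exactly why words with several parts of speech need a real tagger.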

Page 27: Text mining by examples, By Hadi Mohammadzadeh

Second Program: Entity Extraction

• To generate named entities with associated types, we need dictionaries for categories such as
– Person, place, organization, number, currency, dimension, time, technical term, or miscellaneous.
– For example, co_abbrev.dat contains a list of about 900 abbreviations, and the co_places table lists about 3,000 of the world's larger cities.
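
Dictionary-driven entity extraction can be sketched as a lookup of known names plus a pattern for numbers. The miniature dictionaries below are hypothetical stand-ins for tables such as co_places; the function name and entity-type labels are assumptions, and the real program's matching rules are not described in the slides.

```python
import re

# Hypothetical miniature dictionaries; the real tables are far larger.
GAZETTEER = {
    "place": {"ulm", "paris", "new york"},
    "organization": {"reuters", "at&t labs"},
}
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def extract_entities(text):
    """Return (entity, type) pairs in order of appearance."""
    found = []
    lowered = text.lower()
    for entity_type, names in GAZETTEER.items():
        for name in names:
            # Match the name only when it is not part of a longer word.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            for m in re.finditer(pattern, lowered):
                found.append((m.start(), text[m.start():m.end()], entity_type))
    for m in NUMBER.finditer(text):
        found.append((m.start(), m.group(), "number"))
    return [(entity, etype) for _, entity, etype in sorted(found)]


print(extract_entities("Reuters opened 2 offices in New York and Ulm."))
```

Multi-word names such as "New York" are matched as a unit, which is one reason dictionaries are needed in addition to token-level rules.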

Page 28: Text mining by examples, By Hadi Mohammadzadeh

Second Program: Entity Extraction

• Name of program: test-ent.pl
• Input: embedded in the Perl file
• Output: running the program generates a text file named test_ent_results.txt
• Aim: entity extraction

Page 29: Text mining by examples, By Hadi Mohammadzadeh

Third Program: Disambiguate words with multiple

• Name of program: sense.pl
• Input: embedded in the Perl file
• Output: running the program generates a text file named sense.txt

Page 30: Text mining by examples, By Hadi Mohammadzadeh

Fourth Program: Random Text Generator

• Name of program: tgen.pl
• Input: embedded in the Perl file
• Output: running the program generates a text file named tgen.txt
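
The slides do not state which method tgen.pl uses. One common way to generate random text is a bigram (first-order Markov) model, sketched below purely as an assumption; the function names and the sample sentence are invented for this example.

```python
import random
from collections import defaultdict


def build_bigrams(text):
    """Map each word to the list of words that follow it in `text`."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers


def generate(followers, start, length, seed=0):
    """Walk the bigram table from `start`, choosing each next word at random."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    out = [start]
    while len(out) < length and followers.get(out[-1]):
        out.append(rng.choice(followers[out[-1]]))
    return " ".join(out)


followers = build_bigrams("the cat sat on the mat and the cat ran")
print(generate(followers, "the", 6, seed=1))
```

Every adjacent word pair in the output also occurs in the source text, so the result looks locally plausible even when it is globally meaningless.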

Page 31: Text mining by examples, By Hadi Mohammadzadeh

Fifth Program: Splitting of Text into Sentences

• Name of program: tsplit.pl
• Input: embedded in the Perl file
• Output: running the program generates a text file named tsplit.txt
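
Sentence splitting is harder than breaking on every period, because abbreviations also end in one. The sketch below shows the basic idea with a tiny, made-up abbreviation set; it is an assumption about the general approach, not a description of tsplit.pl.

```python
# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "inc.", "ltd.", "jan."}


def split_sentences(text):
    """Split on ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends = token[-1] in ".!?" and token.lower() not in ABBREVIATIONS
        if ends:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Prof. Cho ran the test. Did it work? Yes!"))
# ['Prof. Cho ran the test.', 'Did it work?', 'Yes!']
```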

Page 32: Text mining by examples, By Hadi Mohammadzadeh

Sixth Program: Clustering

• Name of program: cluster.pl

• Input data: a collection of 55 Reuters documents from three topics
– Cocoa, 15 documents
– Sugar, 22 documents
– Coffee, 18 documents
The input file is included in cluster.pl.

• Input parameters: a similarity threshold, a linking parameter, and an indexing parameter.

• Output: a list of clusters and a similarity matrix (Cluster.txt).

• Method: the program is based on a genetic algorithm.
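
To show how a similarity threshold and a similarity matrix fit together, here is a sketch that substitutes a much simpler technique for the genetic algorithm: single-link clustering over cosine similarity of word counts. It does not model the linking or indexing parameters, and the four sample documents are invented, not taken from Reuters.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(docs, threshold):
    """Group documents linked by similarity >= threshold (single link)."""
    vectors = [Counter(d.lower().split()) for d in docs]
    n = len(docs)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    label = list(range(n))  # each document starts in its own cluster

    def find(i):
        while label[i] != i:
            i = label[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                label[find(j)] = find(i)  # merge the two clusters
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values()), sim


docs = [
    "cocoa prices rose on cocoa exports",
    "cocoa exports fell",
    "coffee quota talks resumed",
    "coffee quota talks failed",
]
clusters, sim = cluster(docs, threshold=0.3)
print(clusters)  # [[0, 1], [2, 3]]
```

Raising the threshold splits clusters apart and lowering it merges them, which is the role the similarity threshold plays as an input parameter.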

Page 33: Text mining by examples, By Hadi Mohammadzadeh

Part Seven

Useful Web Sites

Page 34: Text mining by examples, By Hadi Mohammadzadeh

Talk to Ditto

• http://www.convo.co.uk/x02/?

Page 38: Text mining by examples, By Hadi Mohammadzadeh

How does it work?

• Bayesian Classification is used to teach Ditto the donkey the basics of the English language

• When Ditto receives a message, he evaluates it for niceness or nastiness, then responds emotionally on a scale of –100 to +100

• Ditto was trained using 5525 examples
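
The idea of scoring a message for niceness with Bayesian classification can be sketched with a tiny naive Bayes model. The four training examples are invented, and the mapping from the posterior probability onto the -100 to +100 scale is an assumption; the slides do not describe Ditto's actual features or scoring formula.

```python
import math
from collections import Counter

# Hypothetical training examples; not Ditto's actual training data.
TRAINING = [
    ("you are lovely and kind", "nice"),
    ("what a lovely day", "nice"),
    ("you are stupid and ugly", "nasty"),
    ("what a stupid idea", "nasty"),
]


def train(examples):
    """Count words per class and examples per class."""
    word_counts = {"nice": Counter(), "nasty": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["nice"]) | set(word_counts["nasty"])
    return word_counts, class_counts, vocab


def log_posterior(text, label, model):
    """Unnormalised log P(label | text) with add-one smoothing."""
    word_counts, class_counts, vocab = model
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.lower().split():
        if word in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
    return score


def mood(text, model):
    """Map P(nice | text) from [0, 1] onto the -100..+100 scale (assumed)."""
    nice = log_posterior(text, "nice", model)
    nasty = log_posterior(text, "nasty", model)
    p_nice = 1.0 / (1.0 + math.exp(nasty - nice))
    return round(200 * p_nice - 100)


model = train(TRAINING)
print(mood("you are lovely", model))  # 50
print(mood("you are stupid", model))  # -50
print(mood("hello there", model))     # 0
```

A message made only of unseen words scores 0 because the two classes have equal priors in this toy training set.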

Page 39: Text mining by examples, By Hadi Mohammadzadeh

Dragon Toolkit

• Dragon Toolkit

Page 40: Text mining by examples, By Hadi Mohammadzadeh

Disp

• http://www.ltg.ed.ac.uk/disp/resources/
