A bit about data mining and information retrieval: motivation & summary

Gegevensbanken (Databases) 2012

Bettina Berendt
http://people.cs.kuleuven.be/~bettina.berendt/

Where are we?

Lesson #   Who   What
1          ED    Intro, ER
2          ED    EER, (E)ER to relational schema
2          ED    Relational model
3          KV    Relational algebra & relational calculus
4, 5       KV    SQL
6          KV    Connecting programs to databases
7          KV    Functional dependencies & normalisation
8          KV    PHP
10         BB    Database security
11         BB    Storage and file organisation
12         BB    External hashing
13         BB    Index structures
14         BB    Query processing
15-17      BB    Transaction processing and concurrency control
18         BB    Data mining and Information Retrieval
9          ED    XML (and more about the Web as a DB), NoSQL

New topics / outlook

To whom would a bank lend money?

Database queries:

• Who has ever failed to repay a loan?
    SELECT DISTINCT Fname, Lname
    FROM Clients, Loans
    WHERE clientID = loantakerID
    AND paid = 'NO'

Data Warehousing / Online Analytical Processing (OLAP):

• In which districts did more than 20% of the clients fail to repay a loan last year?

Data Mining:

• Which people can be expected not to repay a loan? (= district, job, age, gender, ...)

Another application area

• The Web
• You use Web data mining every day

Indexing and ranking

Behaviour analysis for recommender systems

Text mining for recommender systems

Or also

Who buys printer XYZ?

• My customer (etc.): database lookup
• "I don't know the answer, but the following 2398445 pages are relevant to your query": search engine / information retrieval / document retrieval
• This user (because of their profile, their postings, their friends and their characteristics, …): data mining
• Someone who has just sold / thrown away their old printer: logic

Different inference methods; different types of answers
Describing / known data versus predicting

What follows is also …

… an outlook on several courses in the Master's programme, e.g.
• Advanced Databases
• Text-based Information Retrieval
• Current Trends in Databases
• Data Mining

Also interesting / related (logic!), but not today's topic:

• Modelling of complex systems

Agenda

Methods (1): Classifier learning on relations

Methods (2): Itemset mining

From relations to texts

Methods (3): Classifier learning on texts

(A bit of) the KD process: preprocessing

What do search engines do? What can WE do?

How do data become powerful? Mining & combination

Knowledge discovery (and data mining)

"The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."

Data mining

Data mining techniques

• Exploratory data analysis with interactive, often visual methods
• Descriptive modelling (density estimation, cluster analysis and segmentation, dependency modelling)
• Predictive modelling (classification and regression)
    • The goal is to build a model that predicts the value of one variable from the known values of the other variables.
    • In classification, the predicted value is a category;
    • in regression, this value is quantitative.
• Discovering (local) patterns and rules
    • Typical examples are frequent patterns such as sets, sequences, subgraphs
    • and rules that can be derived from them (e.g. association rules)

Particularly interesting on the basis of combined data ...

... and ...

... and ...

Data

• relational data
• texts
• graphs
• semi-structured data (e.g. Web clickstreams)
• images
• …

Input data ... Q: when does this person play tennis?

Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No

Terminology (using a popular data example)

Rows:
• Instances (think of them as objects)
• Here: days, described by:

Columns:
• Features
• Here: Outlook, Temp, …

In this case, there is a feature with a special role:
• The class
• Here: Play (does X play tennis on this day?)

This is "relational DB mining". We will later see other types of data and the mining applied to them.

The goal: a decision tree for classification / prediction

In which weather will someone play (tennis etc.)?

Constructing decision trees

Strategy: top down
    Recursive divide-and-conquer fashion

First: select attribute for root node
    Create branch for each possible attribute value

Then: split instances into subsets
    One for each branch extending from the node

Finally: repeat recursively for each branch, using only instances that reach the branch

Stop if all instances have the same class

Which attribute to select?

Criterion for attribute selection

Which is the best attribute?
    Want to get the smallest tree
    Heuristic: choose the attribute that produces the "purest" nodes

Popular impurity criterion: information gain
    Information gain increases with the average purity of the subsets

Strategy: choose attribute that gives greatest information gain

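Computing information

Measure information in bits
    Given a probability distribution, the info required to predict an event is the distribution's entropy
    Entropy gives the information required in bits (can involve fractions of bits!)

Formula for computing the entropy:

\[ \mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n \]

(logarithms to base 2, so that the result is in bits)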

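Example: attribute Outlook

\[ \mathrm{info}([4,0]) = \mathrm{entropy}(1, 0) = -1 \log 1 - 0 \log 0 = 0 \text{ bits} \]

\[ \mathrm{info}([2,3]) = \mathrm{entropy}(2/5, 3/5) = -\tfrac{2}{5} \log \tfrac{2}{5} - \tfrac{3}{5} \log \tfrac{3}{5} = 0.971 \text{ bits} \]

\[ \mathrm{info}([2,3],[4,0],[3,2]) = \tfrac{5}{14} \times 0.971 + \tfrac{4}{14} \times 0 + \tfrac{5}{14} \times 0.971 = 0.693 \text{ bits} \]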
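
The Outlook example can be checked with a few lines of Python. This is a minimal sketch: the function names `entropy`, `info` and `expected_info` are chosen here for illustration and do not come from the slides, and 0 log 0 is treated as 0, as in the example.

```python
from math import log2

def entropy(*probs):
    """Entropy in bits of a probability distribution; 0 * log(0) is taken as 0."""
    return sum(-p * log2(p) for p in probs if p > 0)

def info(counts):
    """Information in bits of a list of class counts, e.g. [2, 3]."""
    total = sum(counts)
    return entropy(*(c / total for c in counts))

def expected_info(subsets):
    """Average information of the subsets of a split, weighted by subset size."""
    n = sum(sum(s) for s in subsets)
    return sum(sum(s) / n * info(s) for s in subsets)

# [yes, no] counts for Outlook = Sunny, Overcast, Rainy
outlook = [[2, 3], [4, 0], [3, 2]]

print(round(info([4, 0]), 3))            # 0.0
print(round(info([2, 3]), 3))            # 0.971
print(round(expected_info(outlook), 3))  # 0.694 (the slide truncates this to 0.693)
```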

Computing information gain

Information gain: information before splitting – information after splitting

\[ \mathrm{gain}(\mathit{Outlook}) = \mathrm{info}([9,5]) - \mathrm{info}([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 \text{ bits} \]

Information gain for attributes from weather data:

gain(Outlook)     = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity)    = 0.152 bits
gain(Windy)       = 0.048 bits
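
The four gain values can be recomputed directly from the 14 weather instances. The sketch below is illustrative only; the helper names `info` and `gain` and the tuple layout of `DATA` are assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import log2

FEATURES = ["Outlook", "Temp", "Humidity", "Windy"]
# The weather data: (Outlook, Temp, Humidity, Windy, Play)
DATA = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rainy", "Mild", "High", True, "No"),
]

def info(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    """Information before splitting on column `col` minus information after."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])  # collect class labels per attribute value
    after = sum(len(g) / len(rows) * info(g) for g in groups.values())
    return info([row[-1] for row in rows]) - after

gains = {name: gain(DATA, i) for i, name in enumerate(FEATURES)}
for name, g in gains.items():
    print(f"gain({name}) = {g:.3f} bits")
```

Outlook has the largest gain, so it would be chosen as the root attribute.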

Continuing to split

Within the Outlook = Sunny branch:

gain(Temperature) = 0.571 bits
gain(Humidity)    = 0.971 bits
gain(Windy)       = 0.020 bits

Final decision tree

Note: not all leaves need to be pure; sometimes identical instances have different classes

Splitting stops when data can’t be split any further
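
The whole top-down procedure (select the attribute with the greatest information gain, split, recurse, stop on pure nodes or when no attribute is left) can be sketched as follows. This is a simplified ID3-style illustration, not the exact algorithm of any particular tool; the nested-dict tree representation and the function names are assumptions of this sketch. An impure leaf is labelled with its majority class.

```python
from collections import Counter, defaultdict
from math import log2

FEATURES = ["Outlook", "Temp", "Humidity", "Windy"]
# The weather data: (Outlook, Temp, Humidity, Windy, Play)
DATA = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rainy", "Mild", "High", True, "No"),
]

def info(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def split(rows, col):
    """Group the rows by their value in column `col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row)
    return groups

def gain(rows, col):
    """Information gain of splitting `rows` on column `col`."""
    after = sum(len(g) / len(rows) * info([r[-1] for r in g])
                for g in split(rows, col).values())
    return info([r[-1] for r in rows]) - after

def build_tree(rows, cols):
    """Return a class label (leaf) or {feature name: {value: subtree}}."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not cols:
        # Pure node, or nothing left to split on: use the majority class.
        return Counter(labels).most_common(1)[0][0]
    best = max(cols, key=lambda c: gain(rows, c))
    rest = [c for c in cols if c != best]
    return {FEATURES[best]: {value: build_tree(subset, rest)
                             for value, subset in split(rows, best).items()}}

tree = build_tree(DATA, list(range(len(FEATURES))))
print(tree)
```

On the weather data this yields Outlook at the root, Humidity under Sunny, Windy under Rainy, and a pure "Yes" leaf under Overcast.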

Q: Entropy: does this have anything to do with the thermodynamic concept (a measure of the disorder of something, a quantity that can only increase, whatever happens), or is it entirely unrelated?

A: Yes and no …

• Recommended source:
    • Stanford Encyclopedia of Philosophy
    • http://plato.stanford.edu/entries/information-entropy/
• Somewhat shorter (but I cannot judge the content):
    • http://en.wikipedia.org/wiki/Entropy_in_thermodynamics_and_information_theory

Data

• "Market basket data": attributes with Boolean domains
• In a table, each row is a basket (also: transaction)

Transaction ID   Attributes (basket items)
1                Spaghetti, tomato sauce
2                Spaghetti, bread
3                Spaghetti, tomato sauce, bread
4                Bread, butter
5                Bread, tomato sauce

As a relational table

Transaction   Spaghetti   Tomato sauce   Bread   Butter
1             1           1              0       0
2             1           0              1       0
3             1           1              1       0
4             0           0              1       1
5             0           1              1       0

Solution approach: The apriori principle and the pruning of the search tree (1)

[Figure: search tree (lattice) of all candidate itemsets]
1-itemsets: {spaghetti}, {tomato sauce}, {bread}, {butter}
2-itemsets: {spaghetti, tomato sauce}, {spaghetti, bread}, {spaghetti, butter}, {tomato sauce, bread}, {tomato sauce, butter}, {bread, butter}
3-itemsets: {spaghetti, tomato sauce, bread}, {spaghetti, tomato sauce, butter}, {spaghetti, bread, butter}, {tomato sauce, bread, butter}
4-itemset: {spaghetti, tomato sauce, bread, butter}

Generating large (i.e. frequent) k-itemsets with Apriori

• Min. support = 40%
• Step 1: candidate 1-itemsets
    • Spaghetti: support = 3 (60%)
    • Tomato sauce: support = 3 (60%)
    • Bread: support = 4 (80%)
    • Butter: support = 1 (20%)
• Step 2: large 1-itemsets
    • Spaghetti
    • Tomato sauce
    • Bread
  candidate 2-itemsets
    • {Spaghetti, tomato sauce}: support = 2 (40%)
    • {Spaghetti, bread}: support = 2 (40%)
    • {Tomato sauce, bread}: support = 2 (40%)
• Step 3: large 2-itemsets
    • {Spaghetti, tomato sauce}
    • {Spaghetti, bread}
    • {Tomato sauce, bread}
  candidate 3-itemsets
    • {Spaghetti, tomato sauce, bread}: support = 1 (20%)
• Step 4: large 3-itemsets
    • { }
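
The level-wise steps can be sketched in Python. The apriori principle says that every subset of a large itemset must itself be large, so a candidate k-itemset is only worth counting if all of its (k-1)-subsets were large. The code below is a minimal illustration under that principle, not the optimised algorithm from the Agrawal and Srikant paper; the function names and the use of an absolute `min_count` (2 of 5 transactions = 40%) are assumptions of this sketch.

```python
from itertools import combinations

TRANSACTIONS = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions)

def apriori(transactions, min_count):
    """Return {large itemset: support count}, built level by level."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    large = {}
    k = 1
    while candidates:
        counts = {c: support_count(c, transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        large.update(level)
        k += 1
        # Join step: combine large (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are large.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return large

large = apriori(TRANSACTIONS, min_count=2)
for itemset, n in large.items():
    print(sorted(itemset), n)
```

Butter is dropped after step 1, so no candidate containing butter is ever counted, and the only candidate 3-itemset fails the support threshold.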


From itemsets to association rules

• Schema: If subset then large k-itemset, with support s and confidence c
    • s = (support of large k-itemset) / # tuples
    • c = (support of large k-itemset) / (support of subset)
• Example: If {spaghetti} then {spaghetti, tomato sauce}
    • Support: s = 2 / 5 (40%)
    • Confidence: c = 2 / 3 (67%)
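
The two formulas translate directly into code. A small sketch on the five example baskets follows; the function names `support_count` and `rule` are chosen for this illustration.

```python
TRANSACTIONS = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in TRANSACTIONS)

def rule(subset, itemset):
    """Support and confidence of the rule 'if subset then itemset'."""
    s = support_count(itemset) / len(TRANSACTIONS)
    c = support_count(itemset) / support_count(subset)
    return s, c

s, c = rule({"spaghetti"}, {"spaghetti", "tomato sauce"})
print(f"support = {s:.0%}, confidence = {c:.0%}")  # support = 40%, confidence = 67%
```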

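It can be done better … (one possibility)
Q: the FP-tree

Header table:

Item           Support
Bread          4
Spaghetti      3
Tomato sauce   3

[Figure: FP-tree; each header-table entry links to the nodes of its item. Nodes are labelled item:count, with Br = bread, S = spaghetti, T = tomato sauce]
NULL → Br:4 → S:2 → T:1
NULL → Br:4 → T:1
NULL → S:1 → T:1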

Texts as relations

Document   star   Britney   Spears   Big   Dipper   class
1          1      1         1        0     0        Celebrity
2          1      0         1        1     1        Astronomy
…
5          1      0         0        1     1        Astronomy

IF star AND Britney THEN Celebrity
IF star AND Dipper THEN Astronomy

Texts as itemsets ("sets of words")

Document   star   Britney   Spears   Big   Dipper
1          1      1         1        0     0
2          1      0         1        1     1
3          1      1         1        0     0
4          0      1         1        0     0
5          1      0         0        1     1

IF star AND Britney THEN Spears
IF star AND Dipper THEN Big

Texts as bags of words

Document   star   Britney   Spears   Big   Dipper
1          1      3         1        0     0
2          1      0         1        1     1
3          1      1         1        0     0
4          0      1         1        0     0
5          2      0         0        1     1

DB structures behind this: What is an index, and what is it for? (3) – finding (here: fully inverted files)
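
A fully inverted file maps each term to the documents that contain it. The sketch below builds such an index in memory for five toy documents whose word counts match the bag-of-words table; the document texts, the function names and the AND-only search are assumptions made for this illustration, not how a production search engine stores or queries its index.

```python
from collections import defaultdict

# Toy documents with the same word counts as the bag-of-words table.
DOCS = {
    1: "star Britney Britney Britney Spears",
    2: "star Spears Big Dipper",
    3: "star Britney Spears",
    4: "Britney Spears",
    5: "star star Big Dipper",
}

def build_inverted_index(docs):
    """Map each lower-cased term to {doc id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search_and(index, *terms):
    """Ids of the documents that contain all of the given terms."""
    postings = [set(index.get(t.lower(), {})) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

index = build_inverted_index(DOCS)
print(search_and(index, "star", "Britney"))  # [1, 3]
print(search_and(index, "Dipper"))           # [2, 5]
```

A query only touches the posting lists of its own terms, which is why lookup stays fast even when the collection is large.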

Texts as bags of words

Britney is highly characteristic of doc 1.
Star is not characteristic (it occurs in almost every doc!).
Term frequency / inverse document frequency: TF.IDF weights for words

Which documents are probably most relevant to a search for
• Britney
• star?

Similarity query – doc!
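
One common way to turn these counts into weights is sketched below. The slides do not fix a formula, so the variant used here (weight = term frequency × ln(N / document frequency)) is an assumption, as are the function names; ranking by the weight of a single query term is a simplification of a full query-document similarity.

```python
from math import log

TERMS = ["star", "Britney", "Spears", "Big", "Dipper"]
# Term counts per document, taken from the bag-of-words table.
COUNTS = {
    1: [1, 3, 1, 0, 0],
    2: [1, 0, 1, 1, 1],
    3: [1, 1, 1, 0, 0],
    4: [0, 1, 1, 0, 0],
    5: [2, 0, 0, 1, 1],
}

def tfidf(counts):
    """Weight each count by ln(N / df), where df = number of docs with the term."""
    n_docs = len(counts)
    df = [sum(1 for row in counts.values() if row[j] > 0) for j in range(len(TERMS))]
    return {doc: [tf * log(n_docs / df[j]) for j, tf in enumerate(row)]
            for doc, row in counts.items()}

weights = tfidf(COUNTS)

def rank(term):
    """Documents containing `term`, highest TF.IDF weight first."""
    j = TERMS.index(term)
    hits = [doc for doc in weights if weights[doc][j] > 0]
    return sorted(hits, key=lambda doc: (-weights[doc][j], doc))

print(rank("Britney"))  # [1, 3, 4]
print(rank("star"))     # [5, 1, 2, 3]
```

In doc 1 the weight of Britney is much larger than that of star, because star occurs in four of the five documents and therefore has a small inverse document frequency.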

Q: Is the idea here that you convert a web page into some kind of vector that contains the most important information? How does that work, and what is in such a vector?
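
In the simplest case the vector has one position per vocabulary term and holds the number of times that term occurs in the page. A minimal sketch follows; the example sentence, the fixed vocabulary and the function name `bag_of_words` are made up for illustration.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs in the text (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vocabulary = ["star", "britney", "spears", "big", "dipper"]
page = "Britney Spears is a star. Britney sings, Britney dances."
print(bag_of_words(page, vocabulary))  # [1, 3, 1, 0, 0]
```

The result is the same row as document 1 in the bag-of-words table; word order and all words outside the vocabulary are discarded.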

What makes people happy?

Happy in blogs

Well kids, I had an awesome birthday thanks to you. =D Just wanted to so thank you for coming and thanks for the gifts and junk. =) I have many pictures and I will post them later. hearts

current mood:

Home alone for too many hours, all week long ... screaming child, headache, tears that just won’t let themselves loose.... and now I’ve lost my wedding band. I hate this.

current mood:

What are the characteristic words for these two moods?

Data, data preparation and learning

• LiveJournal.com – optional mood annotation
• 10,000 blogs:
    • 5,000 happy entries / 5,000 sad entries
    • average size: 175 words / entry
    • post-processing – remove SGML tags, tokenization, part-of-speech tagging
• quality of automatic "mood separation"
    • naïve Bayes text classifier
    • five-fold cross validation
    • Accuracy: 79.13% (>> 50% baseline)
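
To give an idea of what a naïve Bayes text classifier does, here is a minimal multinomial variant with add-one (Laplace) smoothing. It is a sketch only: the four training "entries" are invented toy data built from words on these slides, not the LiveJournal corpus, and the details (smoothing, ignoring unseen words) are assumptions rather than the setup used in the study.

```python
from collections import Counter
from math import log

def train(docs):
    """docs: list of (tokens, label). Returns doc counts, word counts, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_docs}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def classify(tokens, model):
    """Return the label with the highest log-probability for the tokens."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = log(class_docs[label] / n_docs)  # class prior
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # add-one smoothing avoids zero probabilities
                score += log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy training data (NOT the real corpus).
TOY = [
    ("awesome birthday thanks gifts".split(), "happy"),
    ("yay shopping lovely concert".split(), "happy"),
    ("headache tears lost hate".split(), "sad"),
    ("lonely cried upset tears".split(), "sad"),
]
model = train(TOY)
print(classify("thanks for the awesome concert".split(), model))  # happy
print(classify("tears and headache".split(), model))              # sad
```

With five-fold cross validation, the data would be cut into five parts, each part serving once as test set while the classifier is trained on the other four.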

Result:

happiness factors derived from a corpus

Word       Happiness factor
yay        86.67
shopping   79.56
awesome    79.71
birthday   78.37
lovely     77.39
concert    74.85
cool       73.72
cute       73.20
lunch      73.02
books      73.02

Word       Happiness factor
goodbye    18.81
hurt       17.39
tears      14.35
cried      11.39
upset      11.12
sad        11.11
cry        10.56
died       10.07
lonely      9.50
crying      5.50

But: the texts are not just there …

Preprocessing (1)

Data cleaning
Goal: get clean ASCII text

Remove HTML markup*, pictures, advertisements, ...

Automate this: wrapper induction

* Note: HTML markup may carry information too (e.g., <b> or <h1> marks something important), which can be extracted! (Depends on the application)

Preprocessing (2)

Further text preprocessing
Goal: get processable lexical / syntactical units
• Tokenize (find word boundaries)
• Lemmatize / stem
    e.g. buyers, buyer → buyer / buyer, buying, ... → buy
• Remove stopwords
• Find Named Entities (people, places, companies, ...); filtering
• Resolve polysemy and homonymy: word sense disambiguation; "synonym unification"
• Part-of-speech tagging; filtering of nouns, verbs, adjectives, ...
• ...

Most steps are optional and application-dependent!
Many steps are language-dependent; coverage of non-English varies
Free and/or open-source tools or Web APIs exist for most steps
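
Three of these steps (tokenizing, stopword removal, stemming) can be sketched in a few lines. Everything here is deliberately crude and only illustrative: the stopword list is a tiny made-up sample and `crude_stem` is a hypothetical suffix stripper, far simpler than real stemmers such as Porter's.

```python
import re

# Tiny illustrative stopword list; real lists contain hundreds of words.
STOPWORDS = {"the", "a", "an", "and", "is", "are", "of", "to", "in"}

def tokenize(text):
    """Find word boundaries: lower-case the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    """Strip a few English suffixes, keeping at least three letters."""
    for suffix in ("ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stopwords, then stem what is left."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOPWORDS]

print(preprocess("The buyers are buying a printer"))  # ['buy', 'buy', 'print']
```

Stemming maps "buyers" and "buying" to the same unit, which is what makes them comparable in the later text representation.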

Preprocessing (3)

Creation of text representation
Goal: a representation that the modelling algorithm can work on

Most common forms: a text as
• a set or (more usually) bag of words / vector-space representation: term-document matrix with weights reflecting occurrence, importance, ...
• a sequence of words
• a tree (parse trees)

An important part of preprocessing: Named-entity recognition (1)

Q: If you enter several words in Google, are they combined with AND and OR, or is there more to it?

Outlook: Natural language queries

Q: About the Internet in general: can it be regarded as one big unordered chaos of websites, or is it rather a set of separate databases (for example, all web pages from Belgium, or all web pages of an Internet provider such as Telenet) that together form the Internet (and thus allow a large, general database to distribute its tasks)?

What is this? Can we do something with it?

Linked Open Data (DBPedia and Freebase indicated in red circles)

Outlook

XML (etc.), NoSQL

Sources

• Methods (classifier learning)
    • Slides from the "WEKA book":
    • Ian H. Witten, Eibe Frank, Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques (Third Edition). 2011.
    • http://www.cs.waikato.ac.nz/ml/weka/book.html
• Methods (itemset mining)
    • Agrawal R, Imielinski T, Swami AN. "Mining Association Rules between Sets of Items in Large Databases." SIGMOD. June 1993, 22(2):207-16. http://rakesh.agrawal-family.com/papers/sigmod93assoc.pdf
    • Agrawal R, Srikant R. "Fast Algorithms for Mining Association Rules." VLDB. Sep 12-15 1994, Chile, 487-99. http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf
• Methods (Happy in blogs)
    • Mihalcea, R. & Liu, H. (2006). A corpus-based approach to finding happiness. In Proceedings of the AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs.
    • and Rada Mihalcea's presentation at CAAW 2006
