Introduction to Weka

Upload: betunxinh

Post on 07-Jul-2015


INTRODUCTION TO THE WEKA SOFTWARE

Presenter: Thu Trang

REPORT OUTLINE

I. Introduction to the Weka software

II. Application problem

DEVELOPMENT HISTORY

WEKA stands for Waikato Environment for Knowledge Analysis. It is data mining software from a research project at the University of Waikato, New Zealand. Goal: build a modern tool for developing machine learning techniques and applying them to real-world data mining problems.

DEVELOPMENT HISTORY

1993: The University of Waikato, New Zealand, launches the project and builds the first version of Weka.
1997: The decision is made to rebuild Weka from scratch in Java, including implementations of modeling algorithms.
2005: Weka receives the SIGKDD Data Mining and Knowledge Discovery Service Award.
Ranking on Sourceforge.net as of 25-06-2007: 241 (907,318 downloads).

SOFTWARE STRUCTURE

WEKA is written in Java and consists of more than 600 classes organized into 10 packages. Main functions of the software:
- Data exploration: data preprocessing, classification, clustering, and association rule mining.
- Model experimentation: provides tools for validating and evaluating learned models.
- Visualization: displays data using many different kinds of charts.

DATA PREPROCESSING

Displays information about the data under consideration:
- Dataset: name, number of instances, number of attributes.
- Attributes: name, data type, attribute values, percentages...
- Charts illustrating this information.

Provides common data filters, for example:
- ReplaceMissingValues: replaces missing values.
- Normalize: scales data to the interval [0, 1].
- Discretize: discretizes data.
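To make the first two filters concrete, here is a minimal plain-Python sketch of what they do conceptually. It is not Weka's code: the function names are hypothetical, filling with the column mean is an assumption that suits numeric attributes, and the income values are made up for illustration.

```python
from statistics import mean

def replace_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def normalize(values):
    """Min-max scale numeric values into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up numeric column with one missing entry.
incomes = [20.0, None, 40.0, 60.0]
filled = replace_missing(incomes)
scaled = normalize(filled)
print(filled)
print(scaled)
```

Filling before scaling matters here: the minimum and maximum used by the scaling step should be computed on a column that no longer contains gaps.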

ASSOCIATION RULE MINING

Provides association rule mining algorithms:
- Apriori
- PredictiveApriori: an improvement on the Apriori algorithm.

CLASSIFICATION

Provides a large number of classification algorithms, grouped by theoretical basis or by function:
- Bayes: Bayesian networks, Naive Bayes...
- Functions: SVM, regression methods, linear models.
- Trees: ID3, C4.5 (J48).
- Rule-based classification methods.
- Bagging, AdaBoost.

CLUSTERING

Provides common clustering algorithms, for example:
- DBSCAN
- EM (Expectation Maximization)
- K-Means

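ARFF FILE STRUCTURE

    % This is a relation about test
    @relation test
    @attribute kichthuoc {vua, nho, lon}
    @attribute mau {xanhduong, do, xanhlacay}
    @attribute dang {hop, non, cau, tru}
    @attribute quyetdinh {yes, no}
    @data
    vua,xanhduong,hop,yes
    nho,do,non,no
    nho,do,cau,yes
    lon,do,non,no
    lon,xanhlacay,tru,yes
    lon,do,tru,no
    lon,xanhlacay,cau,yes

Annotations on the slide: the line starting with % is a comment; the word after @relation is the relation name; each @attribute line gives an attribute name followed by its data type; each line after @data is one instance.

In this example, kichthuoc (size) takes the values vua, nho, lon (medium, small, large); mau (color) takes xanhduong, do, xanhlacay (blue, red, green); dang (shape) takes hop, non, cau, tru (box, cone, sphere, cylinder); quyetdinh is the decision.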

ARFF FILE STRUCTURE

The data types supported in ARFF are:
- numeric: numbers, covering both real and integer.
- nominal: a value from a listed set.
- string: text strings.
- date: date and time (day, month, year, hour, minute, second).

ARFF FILE STRUCTURE

- Comment lines begin with the % character.
- Missing data is represented by the ? character.
- Strings that contain spaces must be enclosed in single quotes.
- Values in the data section must strictly follow the declarations in the header.
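The rules above can be sketched as a small reader written in plain Python. This is an illustrative sketch only, not Weka's ARFF loader: parse_arff is a hypothetical name, the sample is a shortened version of the test relation, and quoted strings that contain spaces or commas are deliberately not handled.

```python
def parse_arff(text):
    """Parse a small ARFF document; returns (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # blank line or comment
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            if len(values) != len(attributes):
                raise ValueError("row does not match header: " + line)
            # '?' marks a missing value
            rows.append([None if v == "?" else v for v in values])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

SAMPLE = """% This is a relation about test
@relation test
@attribute kichthuoc {vua, nho, lon}
@attribute quyetdinh {yes, no}
@data
vua,yes
?,no
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, attributes, rows)
```

The length check is the simplest form of the last rule: a data row with more or fewer values than the header declares is rejected instead of being silently accepted.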

Data preprocessing
- Data can be imported from a file in ARFF or CSV format.
- Data can also be read from a URL, or from a database through JDBC.
- WEKA's data preprocessing tools are called filters:
  - Discretization
  - Normalization
  - Re-sampling
  - Attribute selection
  - Transforming and combining attributes

Classify
- Choose a classifier.
- Choose the test options:
  - Use training set. The learned classifier is evaluated on the training set itself.
  - Supplied test set. A separate dataset (different from the training set) is used for evaluation.
  - Cross-validation. The dataset is divided evenly into k subsets (folds) of approximately equal size, and the learned classifier is evaluated by cross-validation.
  - Percentage split. Specifies the proportion in which the dataset is split for evaluation.

Sample data
Data on bank customers: 12 attributes and 600 customer records.

Filtering attributes
Remove the Id attribute, which is not used in the model.
Filter > Choose > filters > unsupervised > attribute > Remove
- Click the text box to the right of the Choose button and type 1 (the index of the id attribute in the data file).
- Make sure the InvertSelection option is set to false.
- Click Apply to create a new dataset with 11 attributes, the Id attribute having been removed.
- Once the Id attribute is removed, all values of the Id field in the records are removed as well.
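The effect of these steps can be illustrated with a short plain-Python sketch. It only mimics the behavior described above and is not Weka's Remove filter: remove_attributes and its invert_selection parameter are hypothetical names, and the two records are made-up values.

```python
def remove_attributes(header, rows, indices, invert_selection=False):
    """Drop the given 1-based columns, or keep only them when inverted."""
    selected = {i - 1 for i in indices}
    keep = [c for c in range(len(header))
            if (c in selected) == invert_selection]
    new_header = [header[c] for c in keep]
    new_rows = [[row[c] for c in keep] for row in rows]
    return new_header, new_rows

# Made-up records with an id column in position 1.
header = ["id", "age", "income"]
rows = [["A1", 48, 17500.0], ["A2", 40, 30000.0]]

new_header, new_rows = remove_attributes(header, rows, [1])
print(new_header)
print(new_rows)
```

With invert_selection set to True the same call would keep only the id column, which is why the option must stay false in the walkthrough.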

Discretizing attributes
In data mining, some techniques, such as ID3 and association rule mining, can only be applied to categorical data, so attributes with continuous (numeric) values must be discretized first. This problem has 3 numeric attributes: age, income, children.
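One simple way to discretize a numeric attribute is equal-width binning, sketched below in plain Python. The choice of equal-width bins, the bin count of 3, the half-open interval convention, and the age values are all assumptions made for illustration; the slides do not say which settings were used.

```python
def equal_width_bins(values, bins):
    """Map each value to a bin index in 0..bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    labels = []
    for v in values:
        if width == 0:  # constant column: everything lands in bin 0
            labels.append(0)
            continue
        # The maximum value is placed in the last bin, not in a new one.
        labels.append(min(int((v - lo) / width), bins - 1))
    return labels

ages = [18, 25, 39, 46, 60]  # made-up ages
age_bins = equal_width_bins(ages, 3)
print(age_bins)
```

After this step each age is a category label rather than a number, which is the form ID3 and association rule mining require.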

Classifying data with the C4.5 algorithm
When C4.5 is used to classify the data, a discrete attribute is split on its distinct values (as in ID3), whereas for a numeric attribute a split threshold must be found and the data partitioned into subsets according to that threshold.
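The threshold search for a numeric attribute can be sketched as follows. This is a simplified illustration, not the C4.5 or J48 implementation: it scores candidates by plain information gain (C4.5 itself uses the gain ratio), takes midpoints between consecutive distinct values as candidates, and runs on made-up ages and labels.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) maximizing the gain of the split v <= t."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (t, gain)
    return best

ages = [22, 25, 30, 41, 52, 60]                 # made-up values
decision = ["no", "no", "no", "yes", "yes", "yes"]
threshold, gain = best_threshold(ages, decision)
print(threshold, gain)
```

In this toy data the classes separate cleanly between 30 and 41, so the midpoint between them yields the highest gain.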

Classifying data with the C4.5 algorithm
The age, number of children, and income data have been discretized.

Classifying data with the ID3 algorithm
The ID3 algorithm classifies only with discrete attributes; it does not handle continuous attributes.

Cross-validation
The dataset is divided into k subsets (folds) of approximately equal size. Whatever value the Folds parameter takes, the total does not change: the classification results still cover all 600 records.
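Why the total stays at 600 can be seen from a small plain-Python sketch of the fold construction. It is a simplified illustration under stated assumptions: k_fold_indices is a hypothetical helper, and shuffling and stratification are omitted for brevity.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most 1."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

for k in (5, 7, 10):
    folds = k_fold_indices(600, k)
    sizes = [len(f) for f in folds]
    # Each fold serves as the test set exactly once,
    # so the number of tested records is the sum of the fold sizes.
    print(k, sizes, sum(sizes))
```

Each record belongs to exactly one fold and is therefore tested exactly once, so the number of evaluated records equals the dataset size for any k.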

Percentage split
The results obtained differ depending on the percentage used to split the dataset.
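A minimal plain-Python sketch of a percentage split is given below, to show how the chosen percentage changes what the classifier is trained and tested on. The helper name, the fixed seed, and the three percentages are assumptions for illustration, not settings taken from the slides.

```python
import random

def percentage_split(records, train_percent, seed=1):
    """Shuffle a copy of the records and cut it into train and test parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]

records = list(range(600))  # stand-ins for the 600 customer records
for pct in (50, 66, 80):
    train, test = percentage_split(records, pct)
    print(pct, len(train), len(test))
```

Unlike cross-validation, only the held-out part is evaluated, so both the size of the test set and the reported accuracy depend on the percentage chosen.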