Clustering Web Documents

Upload: hieu-ho

Post on 10-Jul-2015


Web Data Mining Using Clustering Techniques

http://www.ebook.edu.vn

TABLE OF CONTENTS

Chapter 1. OVERVIEW OF DATA MINING
1.1. Data mining and knowledge discovery
1.2. Clustering techniques in data mining
1.3. Web mining
1.4. Text data processing applied to Web data mining

Chapter 2. SOME DATA CLUSTERING TECHNIQUES
2.1. Partitioning clustering
2.2. Hierarchical clustering
2.3. Density-based clustering
2.4. Grid-based clustering
2.5. Model-based data clustering
2.6. Fuzzy data clustering

Chapter 3. WEB DATA MINING
3.1. Web content mining
3.2. Web usage mining
3.3. Web structure mining
3.4. Applying data clustering algorithms to searching and clustering Web documents

REFERENCES


CHAPTER 1. OVERVIEW OF DATA MINING

1.1. Data mining and knowledge discovery

1.1.1. Data mining

Data mining is a recent research field that aims to automatically extract new, useful, hidden information and knowledge from large databases for companies and other organizations, and thereby to improve their production, business and competitiveness. Research results and successful applications in knowledge discovery in databases (KDD) show that data mining is a steadily developing field with many benefits and prospects, and that it has clear advantages over traditional tools for searching and analyzing data. In general terms, data mining is the process of searching for and discovering new, useful, hidden knowledge in large databases.

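1.1.2. The knowledge discovery process

The knowledge discovery process can be divided into 5 steps as follows [10]: selection, preprocessing, transformation, data mining, and evaluation and presentation of the patterns.

Raw data -> (selection) -> selected data -> (preprocessing) -> preprocessed data -> (transformation) -> transformed data -> (data mining) -> patterns -> (evaluation, presentation) -> knowledge

Figure 1.1. The knowledge discovery process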

1.1.3. Data mining and related fields

Data mining is related to statistics, machine learning, databases, algorithms, parallel computing, knowledge acquisition from expert systems, and data abstraction. What characterizes a knowledge discovery system is that it draws on methods, algorithms and techniques from these different fields in order to mine data.

1.1.4. Techniques applied in data mining

KDD is an interdisciplinary field that includes data organization, machine learning, artificial intelligence and other sciences. From the machine learning point of view, data mining techniques comprise supervised learning, unsupervised learning and semi-supervised learning. Classified by the kind of problem to be solved, data mining comprises the following techniques: classification and prediction, association rules, time-series analysis, clustering, and concept description and summarization.

1.1.5. Main functions of data mining

The two main goals of data mining are description and prediction. In KDD, description receives more attention than prediction, in contrast to machine learning and pattern recognition applications, where prediction is usually the main goal. The main tasks of KDD are: class and concept description, association analysis, classification and prediction, clustering, outlier analysis, and evolution analysis.

1.1.6. Applications of data mining

Data mining is a field of wide interest and wide application. Typical applications include data analysis and decision support, medical treatment, text mining, Web mining, bioinformatics, finance and the stock market, insurance, and so on. Database management systems and software packages such as SQL Server, Oracle and Office 2007 now integrate data mining modules.

1.2. Clustering techniques in data mining

1.2.1. Overview of clustering

Data clustering is a data mining technique that searches for and discovers natural, hidden, important clusters and patterns in a large data set, and so provides useful information and knowledge for decision making. Data clustering remains an open and difficult problem, because many of the basic issues mentioned above still have to be solved completely and in a way that suits many different kinds of data. This is especially true of mixed data, which keeps growing in database management systems. It is one of the major challenges for data mining in the coming decades, and for Web data mining in particular.

1.2.2. Applications of data clustering

Data clustering is applied in many fields, both commercial and scientific. Typical application areas are [10][19]: commerce, biology, spatial data analysis, urban planning, earth science, geography, Web mining, and so on.

1.2.3. Requirements for data clustering techniques

Most research on and development of data clustering algorithms aims to satisfy the following basic requirements [10][19]: scalability; ability to handle different data types; discovery of clusters of arbitrary shape; minimal domain knowledge needed to set the input parameters; low sensitivity to the order of the input data; ability to handle very noisy data; low sensitivity to the input parameters; ability to handle high-dimensional data; and results that are easy to understand, easy to implement and practical.

1.2.4. Data types and similarity measures

1.2.4.1. Classification of data types by domain size

Attributes can be divided into two kinds: continuous attributes and discrete attributes.

1.2.4.2. Classification of data types by measurement scale

Commonly used data types are nominal attributes, ordinal attributes, interval attributes and ratio attributes. The units of measurement affect the clustering results; to overcome this, the data must be normalized.

1.2.4.3. Similarity and dissimilarity measures

Once the characteristics of the data have been determined, a suitable way of measuring the "similarity" between objects is needed. Such measures are functions of pairs of data objects that compute either their similarity or their dissimilarity. The larger the value of the similarity function, the more alike the two objects are, and vice versa; the dissimilarity function is inversely related to the similarity function. Some measures used for different data types are [10][17][27]:

+ Interval attributes: the dissimilarity of two data objects x, y is given by metrics such as the Minkowski distance, the Euclidean distance, the Manhattan distance and the maximum distance.

+ Binary attributes: use the following parameter table, where α is the number of attributes equal to 1 in both x and y, β the number equal to 1 in x and 0 in y, γ the number equal to 0 in x and 1 in y, δ the number equal to 0 in both, and τ = α + β + γ + δ:

            y: 1      y: 0
   x: 1     α         β         α + β
   x: 0     γ         δ         γ + δ
            α + γ     β + δ     τ

Table 1.1. Parameter table for binary attributes

Commonly used measures for binary attributes:
- Simple matching coefficient: $d(x, y) = \frac{\alpha + \delta}{\tau}$.
- Jaccard coefficient: $d(x, y) = \frac{\alpha}{\alpha + \beta + \gamma}$.

+ Nominal attributes: the dissimilarity between two objects x and y is defined as $d(x, y) = \frac{p - m}{p}$, where m is the number of attributes on which x and y match and p is the total number of attributes.

+ Ordinal attributes: the dissimilarity between data objects with ordinal attributes is computed as follows. The $M_i$ states of attribute i are ordered as $[1 \ldots M_i]$, so each attribute value can be replaced by its rank $r_i$, with $r_i \in \{1, \ldots, M_i\}$. Different ordinal attributes have different value ranges, so they are all mapped onto the range [0, 1] by applying the following transformation to each attribute:

$$z_i^{(j)} = \frac{r_i^{(j)} - 1}{M_i - 1}$$

The dissimilarity formula for interval attributes is then applied to the values $z_i^{(j)}$; the result is the dissimilarity for ordinal attributes.

+ Ratio attributes: there are several ways to compute the similarity between ratio attributes. One option is to apply a logarithmic transformation to each attribute $x_i$.

Depending on the data at hand, different similarity models are used. Choosing a similarity measure that is appropriate, accurate and objective is very important, and it contributes to building data clustering algorithms that perform well in terms of both quality and computational cost.
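The measures listed in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the counts α, β, γ, δ follow the usual contingency-table convention for two binary vectors, and the Jaccard coefficient is defined as 0 when both vectors are all zeros (an assumption, since the text does not cover that case).

```python
def minkowski(x, y, q):
    """Minkowski distance of order q (q=1: Manhattan, q=2: Euclidean)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def maximum_distance(x, y):
    """Maximum (Chebyshev) distance: the largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def binary_counts(x, y):
    """Return (alpha, beta, gamma, delta) for two 0/1 vectors."""
    alpha = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    beta = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    gamma = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    delta = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return alpha, beta, gamma, delta


def simple_matching(x, y):
    """Simple matching coefficient (alpha + delta) / tau."""
    alpha, beta, gamma, delta = binary_counts(x, y)
    return (alpha + delta) / (alpha + beta + gamma + delta)


def jaccard(x, y):
    """Jaccard coefficient alpha / (alpha + beta + gamma)."""
    alpha, beta, gamma, _ = binary_counts(x, y)
    denom = alpha + beta + gamma
    return alpha / denom if denom else 0.0  # assumption: 0 for all-zero vectors


def nominal_dissimilarity(x, y):
    """(p - m) / p, with m matching attributes out of p."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p


if __name__ == "__main__":
    print(minkowski((0, 0), (3, 4), 2))
    print(jaccard([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

For interval attributes, calling `minkowski` with `q=1` or `q=2` gives the Manhattan and Euclidean distances mentioned above.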

1.3. Web mining

1.3.1. Benefits of Web mining

With the rapid growth of information on the WWW, Web data mining is becoming increasingly important within data mining. People hope to obtain useful knowledge by searching, analyzing, aggregating and mining the Web. Such knowledge can help us build effective Web sites that serve people better, especially in e-commerce. Discovering and analyzing useful information on the WWW with data mining techniques has become an important direction in knowledge discovery. How can the information a user needs be found? How can high-quality Web pages be obtained? These problems can be solved more effectively by studying data mining techniques applied to the Web environment.

1.3.2. Web mining

There are many different definitions of Web mining, but they can be generalized as follows [5][30]: Web mining is the use of data mining techniques to automate the discovery and extraction of useful information from Web documents, services and structure. The field has attracted the attention of many researchers. The Web mining process can be divided into the following subtasks: resource finding, data selection and preprocessing, generalization, and analysis.

1.3.3. Types of Web data

The types of Web data can be summarized by the following diagram:

Web data
- Content data: free text, HTML file, XML file, dynamic content, multimedia
- Structure data: static link, dynamic link
- Usage data
- User profile data

Figure 1.2. Classification of Web data

1.4. Text data processing applied to Web data mining

1.4.1. Text data

Text databases can be divided into 2 main kinds [14][20]:
+ Unstructured: ordinary text documents of the kind we read every day in books, newspapers, on the Internet, and so on.
+ Semi-structured: texts that are organized in a loose structure but still express the main content of the text, such as HTML documents and e-mail.

1.4.2. Some issues in text data processing

Some issues related to representing text with the vector space model:
- The vector space is a set of words.
- A word is a string of characters, excluding spaces, line breaks and punctuation; upper case and lower case are not distinguished.
- Stemming: in many languages, many words share the same stem or are variants derived from a stem. Using the stem reduces the number of words.

In addition, to improve processing quality, some studies have proposed algorithmic improvements that take the context of words into account by using phrases or grammar rather than individual words only [31].

1.4.2.1. Stop word removal

In natural language, many words serve only to express sentence structure and do not convey content, for example prepositions and conjunctions. Such words occur frequently in texts but have nothing to do with the topic or content of the text. We can therefore remove them to reduce the dimensionality of the document vectors. These words are called stop words.

1.4.2.2. Zipf's law

To reduce the dimensionality of the document vectors further, we rely on the following observation: many words occur only a few times in a text, and if our goal is to determine similarities and differences across the whole collection of texts, words with a very low frequency have very little influence. Zipf's law is stated as $r_t \cdot f_t \approx K$ (with K a constant), where $f_t$ is the frequency of word t and $r_t$ is its rank when words are sorted by decreasing frequency. The law can be rewritten as $r_t \approx K / f_t$. In general, for a word that occurs only once in the collection we have $r_{max} = K$. For the distinct words that occur b times in the collection, the same relation gives $r_b \approx K / b$, and dividing the two expressions gives $r_b / r_{max} \approx 1 / b$. Zipf's law thus shows a notable distribution of the distinct words in a collection, which is dominated by the words that occur least often.
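The preprocessing ideas in this section (case-insensitive tokenization, stop word removal, and dropping words whose frequency is too low or too high) can be sketched as follows. This is a minimal sketch under stated assumptions: the stop word list, the function names and the thresholds `min_count` and `max_doc_ratio` are all hypothetical choices for illustration, not values given in the text.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list for the example.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents, min_count=2, max_doc_ratio=0.9):
    """Keep words that are neither stop words, too rare, nor too common.

    min_count:     minimum total number of occurrences in the collection
    max_doc_ratio: maximum fraction of documents a word may appear in
    """
    tokenized = [[w for w in tokenize(d) if w not in STOP_WORDS]
                 for d in documents]
    term_counts = Counter(w for doc in tokenized for w in doc)
    doc_counts = Counter(w for doc in tokenized for w in set(doc))
    n = len(documents)
    return sorted(
        w for w, count in term_counts.items()
        if count >= min_count and doc_counts[w] / n <= max_doc_ratio
    )


if __name__ == "__main__":
    docs = [
        "Web mining is the mining of Web data",
        "Clustering of Web documents",
        "Data clustering and data mining",
    ]
    print(build_vocabulary(docs))
```

Sorting the surviving words gives each one a stable index, which is convenient when documents are later turned into vectors.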

1.4.3. Models for representing text data

The best representation uses the distinct words extracted from the original document, and this choice of representation has a relatively small effect on the result.

1.4.3.1. The Boolean model

This model represents a document as a vector whose function f takes only two discrete values, true and false. The function f for term $t_i$ returns true if and only if $t_i$ occurs in the document.

1.4.3.2. Frequency models

1.4.3.2.1. Term frequency model

In the term frequency (TF) model, the value of a word is computed from the number of times it occurs in the document. Let $tf_{ij}$ be the number of occurrences of term $t_i$ in document $d_j$; then $w_{ij}$ can be computed with one of several formulas, such as [31]:

$$w_{ij} = 1 + \log(tf_{ij})$$

The more often term $t_i$ occurs in document $d_j$, the more $d_j$ depends on $t_i$; in other words, $t_i$ carries more information in $d_j$.

1.4.3.2.2. Inverse document frequency model

In the inverse document frequency (IDF) model, the weight of a word is computed as follows [31], where n is the number of documents and $h_i$ is the number of documents that contain $t_i$:

$$w_{ij} = \begin{cases} \log\left(\frac{n}{h_i}\right) = \log(n) - \log(h_i) & \text{if } t_i \in d_j \\ 0 & \text{otherwise } (t_i \notin d_j) \end{cases}$$

The fewer documents $t_i$ occurs in, the more important it is; so if $t_i$ occurs in $d_j$, its weight is larger.

1.4.3.2.3. Combined TF-IDF model

In the TF-IDF model [31], each document $d_j$ is represented by a feature vector $(t_1, t_2, \ldots, t_n)$, where $t_i$ is a word or phrase in $d_j$. The order of the $t_i$ is based on the weight of each word. Further parameters can be added to optimize the grouping process. The TF-IDF weight is:

$$w_{ij} = \begin{cases} tf_{ij} \times idf_{ij} = [1 + \log(tf_{ij})] \cdot \log\left(\frac{n}{h_i}\right) & \text{if } t_i \in d_j \\ 0 & \text{otherwise } (t_i \notin d_j) \end{cases}$$

Usually a dictionary is built by removing very common words and words with a low frequency. The weight $w_{ij}$ is computed from the frequency of term $t_i$ in document $d_j$ and the rarity of $t_i$ in the whole database. Hierarchical clustering and partitioning clustering (k-means) are the 2 clustering techniques most often used for document clustering with the TF-IDF model.
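The TF-IDF weighting formula above can be worked through on a small example. The sketch below assumes natural logarithms (the text does not state the base) and uses a hypothetical function name and toy documents; each document is already tokenized into a list of words.

```python
import math


def tf_idf_matrix(docs_tokens, vocabulary):
    """Return one weight row per document, one column per vocabulary term.

    w = (1 + log(tf)) * log(n / h) if the term occurs in the document,
    otherwise 0, where n is the number of documents and h the number of
    documents containing the term. Natural log is an assumption.
    """
    n = len(docs_tokens)
    h = {t: sum(1 for doc in docs_tokens if t in doc) for t in vocabulary}
    weights = []
    for doc in docs_tokens:
        row = []
        for t in vocabulary:
            tf = doc.count(t)
            if tf > 0:
                row.append((1 + math.log(tf)) * math.log(n / h[t]))
            else:
                row.append(0.0)
        weights.append(row)
    return weights


if __name__ == "__main__":
    docs = [
        ["web", "mining", "web"],
        ["data", "mining"],
        ["web", "data", "clustering"],
    ]
    vocab = ["clustering", "data", "mining", "web"]
    for row in tf_idf_matrix(docs, vocab):
        print([round(w, 3) for w in row])
```

Note that a term occurring in every document gets weight log(n/n) = 0, which matches the idea that very common words carry little information.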

CHAPTER 2. SOME DATA CLUSTERING TECHNIQUES

Data clustering techniques can be classified into a few basic types according to their approach, as follows [10][19]:

2.1. Partitioning clustering

The main idea of this technique is to partition a given data set of n elements into k groups such that each data element belongs to exactly one group and each group contains at least one data element. Partitioning algorithms that reach a local optimum typically use a greedy strategy to search for a solution. The following are some classic algorithms that have been widely reused and extended.

2.1.1. The k-means algorithm

The goal of k-means is to produce k clusters $\{C_1, C_2, \ldots, C_k\}$ from an initial data set of n objects in d-dimensional space, $X_i = (x_{i1}, x_{i2}, \ldots, x_{id})$ for $i = 1, \ldots, n$, such that the criterion function

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} D^2(x - m_i)$$

is minimized, where $m_i$ is the centroid of cluster $C_i$ and D is the distance between two objects. The k-means algorithm consists of the following basic steps:

INPUT: A database of n objects and the number of clusters k.
OUTPUT: Clusters $C_i$ (i = 1, ..., k) such that the criterion function E is minimized.
Step 1: Initialization. Choose k objects $m_j$ (j = 1, ..., k) from the data set as the initial centroids of the k clusters (the choice may be random or based on experience).
Step 2: Distance computation. For each object $X_i$ ($1 \le i \le n$), compute its distance to each centroid $m_j$, j = 1, ..., k, and find the nearest centroid for each object.
Step 3: Centroid update. For each j = 1, ..., k, update the cluster centroid $m_j$ as the mean of the data object vectors in the cluster.
Step 4: Stopping condition. Repeat steps 2 and 3 until the cluster centroids no longer change.

Figure 2.1. The k-means algorithm

The computational complexity is $O((n \cdot k \cdot d) \cdot r)$, where d is the number of dimensions and r is the number of iterations. k-means is very sensitive to noise and outliers in the data. Clustering quality depends heavily on the input parameters: the number of clusters k and the k initial centroids. Many algorithms build on the idea of k-means to handle large data sets in data mining, such as k-medoid, PAM, CLARA and CLARANS.
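The four steps of Figure 2.1 can be sketched in plain Python as below. This is an illustrative sketch rather than a reference implementation: the function names, the `max_iter` safeguard and the fixed random seed are assumptions added for the example, and an empty cluster simply keeps its previous centroid.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, clusters) following steps 1-4 of Figure 2.1."""
    rng = random.Random(seed)
    # Step 1: pick k objects at random as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
            (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
    centers, groups = k_means(data, k=2)
    print(centers)
    print(groups)
```

Because the result depends on the initial centroids, running the sketch with several seeds and keeping the partition with the smallest E is a common way to reduce that sensitivity.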

2.1.2. The PAM algorithm

The PAM algorithm extends k-means so that noisy data and outliers are handled effectively. Instead of centroids, PAM uses medoid objects to represent the clusters; a medoid is the object located most centrally within its cluster. To determine the medoids, PAM starts by selecting k arbitrary medoid objects. After each step, PAM tries to swap a medoid $O_m$ with a non-medoid object $O_p$, provided that the swap improves the quality of the clustering; the process ends when the clustering quality no longer changes. To decide whether to swap $O_m$ and $O_p$, PAM uses the swap cost $C_{jmp}$, with the following notation:
- $O_m$: the current medoid that is to be replaced;
- $O_p$: the new medoid that replaces $O_m$;
- $O_j$: a data object (not a medoid) that may move to another cluster;
- $O_{m,2}$: the current medoid other than $O_m$ that is closest to $O_j$.

PAM computes the swap cost $C_{jmp}$ for all objects $O_j$; these values are the basis for deciding whether to swap $O_m$ and $O_p$. $C_{jmp}$ is computed in one of 4 ways, depending on the case:

- Case 1: $O_j$ currently belongs to the cluster represented by $O_m$, and $O_j$ is more similar to $O_{m,2}$ than to $O_p$ ($d(O_j, O_p) \ge d(O_j, O_{m,2})$). Here $O_{m,2}$ is the second most similar medoid to $O_j$. If $O_m$ is replaced by the new medoid $O_p$, $O_j$ moves to the cluster represented by $O_{m,2}$. $C_{jmp} = d(O_j, O_{m,2}) - d(O_j, O_m)$, which is non-negative.
- Case 2: $O_j$ currently belongs to the cluster represented by $O_m$, but $O_j$ is less similar to $O_{m,2}$ than to $O_p$. If $O_m$ is replaced by $O_p$, $O_j$ moves to the cluster represented by $O_p$. $C_{jmp} = d(O_j, O_p) - d(O_j, O_m)$, which may be negative or positive.
- Case 3: $O_j$ currently belongs not to the cluster represented by $O_m$ but to the cluster represented by $O_{m,2}$, and $O_j$ is more similar to $O_{m,2}$ than to $O_p$. If $O_m$ is replaced by $O_p$, $O_j$ stays in the cluster represented by $O_{m,2}$. Therefore $C_{jmp} = 0$.
- Case 4: $O_j$ currently belongs to the cluster represented by $O_{m,2}$, but $O_j$ is less similar to $O_{m,2}$ than to $O_p$. If $O_m$ is replaced by $O_p$, $O_j$ moves from the cluster of $O_{m,2}$ to the cluster of $O_p$. $C_{jmp} = d(O_j, O_p) - d(O_j, O_{m,2})$, which is always negative.
- Combining the four cases, the total cost of swapping $O_m$ for $O_p$ is $TC_{mp} = \sum_j C_{jmp}$.

The main steps of the PAM algorithm are:

INPUT: A data set of n elements, the number of clusters k.
OUTPUT: k clusters such that the quality of the partition is the best.
Step 1: Choose k arbitrary medoid objects.
Step 2: Compute $TC_{mp}$ for all pairs of objects $O_m$, $O_p$, where $O_m$ is a medoid and $O_p$ is not a medoid.
Step 3: For each pair $O_m$ and $O_p$, compute minOm, minOp and $TC_{mp}$. If $TC_{mp}$ is negative, replace $O_m$ by $O_p$ and go back to step 2. If $TC_{mp}$ is positive, go to step 4.
Step 4: For each non-medoid object, determine the medoid most similar to it and assign the corresponding cluster label.

Figure 2.2. The PAM algorithm

The computational complexity of PAM is $O(I \cdot k(n-k)^2)$, where I is the number of iterations. PAM is therefore inefficient in running time when k and n are large.
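The four cases for the swap cost can be checked numerically. The sketch below computes the total swap cost by applying the case analysis to each object and compares it with the directly computed change in total cost. It rests on two assumptions that are not in the text: the sum runs over every object (including the two being swapped, which makes the total equal the exact change in cost), and the function names and 1-D toy data are hypothetical. It requires at least two medoids.

```python
def total_cost(points, medoids, dist):
    """Sum of distances from every point to its nearest medoid (by index)."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def swap_cost(points, medoids, m, p, dist):
    """Total cost TC_mp of replacing medoid index m by non-medoid index p."""
    others = [o for o in medoids if o != m]
    total = 0.0
    for j in range(len(points)):
        d_m = dist(points[j], points[m])
        d_p = dist(points[j], points[p])
        # Distance to O_m,2: the nearest current medoid other than O_m.
        d_m2 = min(dist(points[j], points[o]) for o in others)
        if d_m <= d_m2:            # O_j currently belongs to O_m's cluster
            if d_p >= d_m2:
                c = d_m2 - d_m     # case 1: O_j moves to O_m,2
            else:
                c = d_p - d_m      # case 2: O_j moves to O_p
        else:                      # O_j currently belongs to O_m,2's cluster
            if d_p >= d_m2:
                c = 0.0            # case 3: O_j stays where it is
            else:
                c = d_p - d_m2     # case 4: O_j moves to O_p
        total += c
    return total


if __name__ == "__main__":
    pts = [1, 2, 3, 10, 11, 12, 20]
    d = lambda a, b: abs(a - b)
    # Medoids are indices 0 and 3 (values 1 and 10); try swapping 1 -> 2.
    print(swap_cost(pts, [0, 3], m=0, p=1, dist=d))
```

A negative result means the swap lowers the total cost, which is the condition under which step 3 of Figure 2.2 performs the replacement.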

2.1.3. The CLARA algorithm

CLARA was proposed by Kaufman and Rousseeuw in 1990 to overcome the weakness of PAM when k and n are large. CLARA draws a sample from the data set of n elements, applies PAM to the sample, and finds the medoids of the sample. It has been observed that if the sample is drawn at random, its medoids approximate the medoids of the whole data set. To obtain a better approximation, CLARA draws several samples, clusters each of them, and keeps the best clustering found. Experimental results show that 5 samples of size 40 + 2k give good results. The CLARA algorithm is as follows:

INPUT: A database of n objects, the number of clusters k.
OUTPUT: k clusters.
1. For i = 1 to 5 do
Begin
2. Draw a random sample of 40 + 2k data objects from the data set and apply PAM to the sample to find the medoids that represent the clusters.
3. For each object $O_j$ in the original data set, determine the most similar of the k medoids.
4. Compute the average dissimilarity of the partition obtained in the previous step. If this value is smaller than the current minimum, use it as the new minimum; the set of k medoids found in this step is then the best so far.
End;

Figure 2.3. The CLARA algorithm

The computational complexity of the algorithm is $O(k(40+k)^2 + k(n-k))$.
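The sampling loop of Figure 2.3 can be sketched as follows. The inner `pam` function here is a simplified first-improvement swap search, used as a stand-in for the full PAM procedure; it, the function names, the seed and the toy data are assumptions for illustration only.

```python
import random


def total_cost(points, medoids, dist):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(sample, k, dist):
    """Simplified PAM: accept any swap that lowers the cost, until none does."""
    medoids = sample[:k]
    best = total_cost(sample, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in sample:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(sample, trial, dist)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids


def clara(points, k, dist, n_samples=5, seed=0):
    """Run PAM on n_samples random samples of size 40 + 2k; keep the best."""
    rng = random.Random(seed)
    size = min(len(points), 40 + 2 * k)
    best_medoids, best_avg = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, size)
        medoids = pam(sample, k, dist)
        # Evaluate the sample's medoids on the whole data set.
        avg = total_cost(points, medoids, dist) / len(points)
        if avg < best_avg:
            best_medoids, best_avg = medoids, avg
    return best_medoids, best_avg


if __name__ == "__main__":
    data = list(range(0, 30)) + list(range(100, 130))
    print(clara(data, k=2, dist=lambda a, b: abs(a - b)))
```

The key design point is that medoids are searched for only within each sample, but every candidate set is scored against the full data set before being compared with the best so far.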

2.1.4. The CLARANS algorithm

The basic idea of CLARANS is not to examine every possible replacement of a medoid by another object. It replaces a medoid immediately if the replacement improves the clustering quality, without looking for the optimal replacement. The number of neighbors examined is limited by the user parameter Maxneighbor. The parameter Numlocal lets the user set the number of local optima to search for. Not all neighbors are examined, only Maxneighbor of them. The CLARANS algorithm can be described as follows [10][19]:

INPUT: A data set of n objects, the number of clusters k, O, dist, numlocal, maxneighbor;
OUTPUT: k clusters;
For i = 1 to numlocal do
Begin
  Randomly initialize k medoids, forming the current partition S;
  j = 1;
  while j < maxneighbor do
  Begin
    Randomly choose a neighbor R of S.
    Compute the dissimilarity and the distance between the 2 neighbors S and R.
    If R has a lower cost, replace S by R and set j = 1; otherwise j++;
  End;
  Check whether the distance of partition S is smaller than the smallest distance so far. If it is, update the smallest distance with this value; partition S is then the best partition so far.
End.

Figure 2.4. The CLARANS algorithm

The computational complexity of CLARANS is $O(kn^2)$. CLARANS has the advantage that its search space is not restricted as it is in CLARA, and in the same amount of time the quality of the clusters it finds is higher than with CLARA.

2.2. Hierarchical clustering

Hierarchical clustering arranges a given data set into a tree-shaped structure, and this hierarchy is built recursively. The cluster tree can be built by two general methods: top-down and bottom-up. Typical hierarchical clustering algorithms include CURE, BIRCH, Chameleon, AGNES and DIANA.

2.2.1. The BIRCH algorithm

The idea of the algorithm is that the data objects of the clusters need not all be kept in memory; only summary statistics are stored. For each cluster, BIRCH stores only a triple (n, LS, SS), where n is the number of objects in the cluster, LS is the sum of the attribute values of the objects in the cluster, and SS is the sum of the squares of the attribute values of the objects in the cluster. These triples are called clustering features, CF = (n, LS, SS), and they are stored in a tree called the CF tree. A CF tree is characterized by two parameters: the branching factor (B) and the threshold (T). The BIRCH algorithm proceeds in the following phases:

INPUT: A database of n objects, the threshold T.
OUTPUT: k clusters.
Step 1: Scan all objects in the database and build an initial CF tree. Each object is inserted into the nearest leaf node, forming a subcluster. If the diameter of this subcluster is larger than T, the leaf node is split. When an object is inserted into a leaf node, all nodes on the path to the root of the tree are updated with the necessary information.
Step 2: If the current CF tree does not fit in main memory, build a smaller CF tree by adjusting the parameter T (increasing T merges some subclusters into one cluster, which makes the CF tree smaller). This step does not require rereading the data from the beginning, yet still yields a smaller tree.
Step 3: Perform the clustering. The leaf nodes of the CF tree hold the summary statistics of the subclusters. In this step BIRCH uses these statistics to apply a clustering technique, for example k-means, and produces an initial clustering.
Step 4: Redistribute the data objects using the centroids of the clusters found in step 3. This is an optional step that scans the data set again and relabels the data objects to the nearest centroids. It assigns labels to the initial data and removes outliers.

Figure 2.5. The BIRCH algorithm

The CF tree structure makes BIRCH fast and applicable to large data sets. BIRCH is especially effective on data sets that grow over time. Its complexity is O(n). Its drawbacks are that the quality of the clusters it discovers is not very good and that it is not suited to high-dimensional data.
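The reason BIRCH can work from the triple (n, LS, SS) alone is that the triples are additive and that cluster statistics such as the centroid and the diameter can be derived from them. The sketch below illustrates this; it treats SS as a single scalar (the sum of squared norms), which is one common convention and an assumption here, and the class and method names are hypothetical. It does not build the CF tree itself.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    """Clustering feature (n, LS, SS) of a subcluster."""
    n: int        # number of objects
    ls: tuple     # per-dimension linear sum of the objects
    ss: float     # sum of squared norms of the objects (scalar convention)

    @classmethod
    def from_point(cls, x):
        return cls(1, tuple(x), float(sum(v * v for v in x)))

    def merge(self, other):
        """CFs are additive: merging two subclusters adds their triples."""
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(v / self.n for v in self.ls)

    def diameter(self):
        """Root of the average squared pairwise distance, from the CF only."""
        if self.n < 2:
            return 0.0
        ls_norm_sq = sum(v * v for v in self.ls)
        value = (2 * self.n * self.ss - 2 * ls_norm_sq) / (self.n * (self.n - 1))
        return math.sqrt(max(value, 0.0))


if __name__ == "__main__":
    cf = CF.from_point((0.0, 0.0)).merge(CF.from_point((2.0, 0.0)))
    print(cf, cf.centroid(), cf.diameter())
```

In step 1 of Figure 2.5, a test such as `cf.merge(CF.from_point(x)).diameter() > T` is the kind of check that would decide whether a leaf subcluster can absorb a new object.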

2.2.2. The CURE algorithm

CURE is an algorithm that uses the bottom-up strategy. Instead of using centroids or medoid objects to represent a cluster, CURE uses several objects to describe each cluster. These representative objects are initially chosen so that they are scattered over different positions, and they are then moved by shrinking them according to a given factor. At each step of the algorithm, the two clusters with the closest pair of representative objects are merged into one cluster. CURE is able to handle outliers. To handle large databases, CURE uses random sampling and partitioning. The basic steps of CURE are:

Step 1. Draw a random sample from the original data set.
Step 2. Partition the sample into groups of equal size: the idea is to partition the sample into p equal groups, each of size n'/p (where n' is the size of the sample).
Step 3. Cluster the points of each group: each group is clustered until it has been divided into n'/(pq) clusters (with q > 1).
Step 4. Eliminate outliers: first, clusters are formed until the number of clusters has dropped to a fraction of the initial number of clusters. Then, in case outliers were sampled during the initial sampling phase, the algorithm automatically removes the small groups.
Step 5. Cluster the partial clusters: the representative objects of the clusters are moved toward the cluster center, that is, they are replaced by objects closer to the center.
Step 6. Label the data with the corresponding cluster labels.

Figure 2.6. The CURE algorithm

The complexity of the algorithm is $O(n^2 \log n)$. CURE is reliable at discovering clusters of arbitrary shape and works well on two-dimensional data sets. However, it is very sensitive to parameters such as the number of representative objects and the shrink factor of the representatives. In general, BIRCH is better than CURE in complexity but worse in clustering quality.

2.3. Density-based clustering

This method groups objects according to a given density function. Density is defined as the number of objects in the neighborhood of a data object, within a given threshold. In this approach, once a cluster has been identified it continues to grow by adding new data objects, as long as the number of neighbors of these objects is larger than a predefined threshold. Density-based clustering uses the density of objects to determine the clusters and can discover clusters of arbitrary shape. Clusters can be seen as regions of high density separated by regions of low or no density. The notion of density used here is the number of neighboring objects. Typical density-based clustering algorithms include [2][3][13][20]: DBSCAN, OPTICS and DENCLUE.

2.3.1. The DBSCAN algorithm

The algorithm looks for objects whose number of neighbors is larger than a minimum threshold. A cluster is defined as the set of all objects that are density-connected to their neighbors. The steps of DBSCAN are:

Step 1: Choose an arbitrary object p.
Step 2: Retrieve all objects that are density-reachable from p with respect to Eps and MinPts.
Step 3: If p is a core point, create a cluster with respect to Eps and MinPts.
Step 4: If p is a border point, no point is density-reachable from p, and DBSCAN visits the next point of the data set.
Step 5: The process continues until all objects have been processed.

Figure 2.7. The DBSCAN algorithm

If global values of Eps and MinPts are used, DBSCAN may merge two clusters into one when the two clusters have similar density and lie close to each other. With a spatial index, each neighborhood query takes O(log n) on average, which gives an overall complexity of O(n log n).
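The steps of Figure 2.7 can be sketched as below. This sketch uses a brute-force neighborhood query, so it runs in O(n²) rather than the O(n log n) obtained with a spatial index; counting a point as its own neighbor, the label -1 for noise, and all function names are assumptions made for the example.

```python
def region_query(points, i, eps, dist):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]


def dbscan(points, eps, min_pts, dist):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps, dist)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # not a core point; may become a border point later
            continue
        labels[i] = cluster            # i is a core point: start a new cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps, dist)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
        cluster += 1
    return labels


if __name__ == "__main__":
    data = [1.0, 1.2, 1.4, 5.0, 5.1, 5.3, 9.0]
    print(dbscan(data, eps=0.5, min_pts=2, dist=lambda a, b: abs(a - b)))
```

The isolated value 9.0 has no neighbor within Eps other than itself, so it is reported as noise instead of being forced into a cluster.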

2.3.2. The OPTICS algorithm

OPTICS extends DBSCAN by reducing the number of input parameters. It computes an ordering of the objects, in increasing order, that supports automatic and interactive cluster analysis rather than producing one explicit clustering of the data set. This ordering describes the density-based clustering structure of the data and contains information equivalent to density-based clusterings over a range of input parameter settings. OPTICS considers the minimum radius in order to determine suitable neighbors. DBSCAN and OPTICS are similar in structure and have the same complexity: O(n log n).

2.3.3. The DENCLUE algorithm

DENCLUE is a clustering algorithm based on a set of density distribution functions. Its main ideas are as follows [19]:
- The influence of an object on its neighborhood is given by an influence function.
- The overall density of the data space is modeled analytically as the sum of the influence functions of all objects.
- Clusters are determined by high-density objects, where high density corresponds to the local maxima of the overall density function.

DENCLUE depends heavily on the noise threshold and on the density parameter. Its computational complexity is O(n log n). Density-based algorithms do not sample the data set as partitioning algorithms do, because sampling may increase the complexity: the density of the objects in a sample differs from the density of the whole data set.

2.4. Grid-based clustering

This method clusters data using a grid data structure and is applied mainly to spatial data. The grid-based approach does not move the objects within the cells; instead it builds several hierarchical levels of groups of objects within a cell. Clusters are not based on a distance measure but are determined by a predefined parameter. The advantage of grid-based clustering is its fast processing time, which is independent of the number of data objects in the original data set and depends instead on the number of cells in each dimension of the grid space. Typical grid-based clustering algorithms include [13][20]: STING, WaveCluster and CLIQUE.

2.4.1. The STING algorithm

STING was proposed by Wang, Yang and Muntz in 1997. It decomposes the spatial data set into a finite number of cells using a hierarchical rectangular structure. The grid contains cells at several levels, which form a hierarchy as follows: each cell at a higher level is partitioned into cells at the next lower level. The statistical parameters for the attributes of the data objects in a cell are computed and stored from the statistical parameters of the cells at the lower level. These parameters include the count, the mean, the maximum (max), the minimum (min), the standard deviation (s), and so on. Data objects are inserted into the grid one by one, and the statistical parameters above are computed directly from these data objects. Spatial queries are processed by examining the relevant cells at each level of the hierarchy. A spatial query is defined as the retrieval of information about spatial data and their relationships.

STING is highly scalable, but because it uses a multiresolution approach, it depends heavily on the lowest level of the grid. Multiresolution is the ability to decompose a data set into different levels of detail. When the cells of the grid are merged to form clusters, the child-level nodes are not merged appropriately, and the clusters discovered have horizontal and vertical boundaries that follow the cell boundaries. STING's grid data structure allows parallel processing, and the complexity of computing the statistical values for each cell is O(n). After the hierarchical data structure has been built, the processing time of a query is O(g), where g is the total number of cells at the lowest level (g [...]

[...] pw2 if pw1 is closer to the centroid than pw2.
END

Figure 3.7. Algorithm for weighting clusters and pages

With this approach, the following problems are addressed:
+ The search results are divided into clusters by topic, and users pick the topic they need according to their specific requirements.
+ Searching for pages and weighting them relies mainly on the content of the pages rather than on the links between pages.
+ The problem of synonymous words and phrases in the user's query is handled.
+ Clustering methods from data mining can be combined with existing search methods.

Several data clustering algorithms are currently used in text clustering, such as partitioning algorithms (k-means, PAM, CLARA) and hierarchical algorithms (BIRCH, STC). In practice, when Web documents are clustered by content, a document may belong to several topic groups. To handle this, a clustering algorithm based on the fuzzy approach can be used.

3.4.2. The document search and clustering process

Basically, clustering of search results proceeds through the following steps:
- Search for Web pages from Web sites that match the query.
- Extract descriptive information from the pages and store it together with the corresponding URLs.
- Use data clustering techniques to cluster the Web pages automatically, so that the pages within a cluster are more similar in content to each other than to pages outside the cluster.

Web data -> search and extract data -> preprocessing -> data representation -> apply clustering algorithm -> present results

Figure 3.8. The steps of clustering Web search results

3.4.2.1. Searching for data on the Web

The main task of this stage is to search with a set of keywords and return a set consisting of the full text of the documents, their titles, summary descriptions, URLs and so on for the matching pages. To speed up processing, we search for these documents and store them in a data repository that is then used for searching (as search engines such as Yahoo and Google do).

3.4.2.2. Data preprocessing

This is the process of cleaning the data and converting the documents into a suitable representation. The stage comprises the following tasks: text normalization, stop word removal, merging words with the same stem, and digitizing and representing the text.

3.4.2.2.1. Text normalization

This stage converts the raw text into a form that makes later processing easier, simpler, more convenient and more accurate than working directly on the raw text, while having little effect on the result.

3.4.2.2.2. Stop word removal

Texts contain words that carry little information for processing. Words with a low frequency, and words with a high frequency that are not important for processing, are all removed. Some recent studies show that removing stop words can reduce the total number of words in a text by about 20-30%. Many words occur with high frequency but are not useful for clustering, and words whose frequency is too high are also removed. For simplicity in practical applications, one can maintain a list of stop words and use Zipf's law to remove words whose frequency is too low or too high.

3.4.2.2.3. Merging words with the same stem

In most languages, many words share a common stem and have similar meanings. To reduce the dimensionality of the text representation, words with the same stem are therefore merged into one word. According to some studies [5], this reduces the dimensionality of the text representation by about 40-50%. For example, in English the words user, users, used and using have the same stem and are reduced to use; the words engineering, engineered and engineer have the same stem and are reduced to engineer.

3.4.2.3. Building the dictionary

Building the dictionary is a very important task in text vectorization. The dictionary consists of the distinct words and phrases in the whole data set. It is a sorted table of words together with their indices in the dictionary.

Some papers [31] propose that, to improve clustering quality, phrases should also be handled in their different contexts. According to Zemir [19][31], a dictionary of 500 entries is appropriate.

3.4.2.4. Tokenization, digitization and document representation

Tokenization is an extremely important task in text representation. Tokenizing and vectorizing a document means finding its words and replacing each by the index of that word in the dictionary. One of the mathematical models TF, IDF, TF-IDF and so on can be used here to represent the text. We use a two-dimensional weight array W of size n x m, where n is the number of documents and m is the number of terms in the dictionary (the number of dimensions). Row j is the vector representing document j in the database, and column i corresponds to term i in the dictionary. $W_{ij}$ is the weight of term i in document j. This stage counts the frequency of term $t_i$ in document $d_j$ and the number of documents $h_i$ that contain $t_i$. The weight matrix W is then built with the following TF-IDF weighting formula:

$$W_{ij} = \begin{cases} tf_{ij} \times idf_{ij} = [1 + \log(tf_{ij})] \cdot \log\left(\frac{n}{h_i}\right) & \text{if } t_i \in d_j \\ 0 & \text{otherwise } (t_i \notin d_j) \end{cases}$$

3.4.2.5. Document clustering

After the data have been searched for, extracted, preprocessed and represented, we use a clustering technique to cluster the documents.

INPUT: A set of n documents and k clusters.
OUTPUT: Clusters $C_i$ (i = 1, ..., k) such that the criterion function is minimized.
BEGIN
Step 1. Randomly initialize k vectors as the centroids of the k clusters.
Step 2. For each document $d_j$, determine its similarity to the centroid of each cluster using one of the common similarity measures (such as Dice, Jaccard, Cosine, Overlap, Euclidean, Manhattan). Find the most similar centroid for each document and put the document into that cluster.
Step 3. Update the centroids. For each cluster, recompute the centroid as the mean of the document vectors in that cluster.
Step 4. Repeat steps 2 and 3 until the centroids no longer change.
END.

Figure 3.9. The k-means algorithm for clustering Web documents by content

- The complexity of the k-means algorithm is $O((n \cdot k \cdot d) \cdot r)$, where n is the number of data objects, k is the number of clusters, d is the number of dimensions, and r is the number of iterations. When the documents have been clustered, the result returned is the clusters and their corresponding centroids.
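Steps 2 and 3 of Figure 3.9 can be illustrated with cosine similarity, one of the measures listed in step 2. The sketch below shows a single assignment and update pass over weighted document vectors; the function names and the toy vectors are hypothetical, and a zero vector is given similarity 0 by assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors (0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_documents(doc_vectors, centroids):
    """Step 2: index of the most similar centroid for each document."""
    return [
        max(range(len(centroids)),
            key=lambda c: cosine_similarity(doc, centroids[c]))
        for doc in doc_vectors
    ]


def update_centroids(doc_vectors, assignment, centroids):
    """Step 3: mean of the document vectors assigned to each cluster."""
    updated = []
    for c, old in enumerate(centroids):
        members = [d for d, a in zip(doc_vectors, assignment) if a == c]
        if members:
            updated.append([sum(col) / len(members) for col in zip(*members)])
        else:
            updated.append(list(old))  # an empty cluster keeps its centroid
    return updated


if __name__ == "__main__":
    docs = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 3.0, 3.0]]
    centers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    labels = assign_documents(docs, centers)
    print(labels, update_centroids(docs, labels, centers))
```

Because cosine similarity ignores vector length, a long and a short document on the same topic end up in the same cluster, which is one reason it is often preferred for text.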

3.4.6. Experimental results

+ The experimental data are Web pages taken from two main sources:
- Pages collected automatically from Websites on the Internet. The search is run automatically through Yahoo; the program follows each URL to fetch the full text of the document and stores it for later searches (the data comprise more than 4000 articles on the topics data mining, web mining, Cluster algorithm, Sport).
- Selective search. This part was collected manually, mainly from the Web sites http://www.baobongda.com.vn/, http://bongda.com.vn/, http://vietnamnet.vn, http://www.24h.com, and comprises more than 250 articles on football.
- Dictionary construction. After counting the frequency of each word in the document set, we apply Zipf's law to remove words whose frequency is too high and words whose frequency is too low, which yields a dictionary of 500 words.

                         Average time (seconds)
Documents   Clusters   Preprocessing and text representation   Document clustering
50          10         0.206                                    0.957
50          15         0.206                                    1.156
100         10         0.353                                    2.518
100         15         0.353                                    3.709
150         10         0.515                                    4.553
150         15         0.515                                    5.834
250         10         0.824                                    9.756
250         15         0.824                                    13.375

Table 3.2. Measured running time of the clustering algorithm
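The dictionary construction step described above (drop words that are too frequent or too rare, keep 500) can be sketched as follows. The text does not give the cutoffs, so `min_count` and `max_count` are hypothetical parameters, and the function name and whitespace tokenizer are assumptions as well.

```python
from collections import Counter

def build_dictionary(docs, min_count, max_count, size=500):
    """Keep words whose total frequency lies in [min_count, max_count]."""
    # Collection-wide frequency of every word (whitespace tokenizer is an assumption).
    counts = Counter(word for doc in docs for word in doc.lower().split())
    # Remove words that are too rare or too frequent; cutoffs are hypothetical.
    kept = [(w, c) for w, c in counts.items() if min_count <= c <= max_count]
    # Most frequent surviving words first; ties broken alphabetically.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in kept[:size]]

docs = ["the web the mining", "the web clustering", "the data"]
dictionary = build_dictionary(docs, min_count=2, max_count=3)
```

In this toy input "the" occurs four times and is dropped as too frequent, the words occurring once are dropped as too rare, and only "web" remains.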

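The running time of the algorithm depends on the size of the data and the number of clusters requested. In Table 3.2, clustering time grows from 0.957 s (50 documents, 10 clusters) to 13.375 s (250 documents, 15 clusters), while preprocessing time depends only on the number of documents. The k-means algorithm also depends on the k initial centroids: if they are chosen well, both the quality of the result and the running time improve considerably. The program interface and some representative code fragments are presented in the appendix.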

REFERENCES

Vietnamese references
[1] Cao Chính Nghĩa, Some issues in data clustering, Master's thesis, University of Engineering and Technology, Vietnam National University, Hanoi, 2006.
[2] Hoàng Hải Xanh, On data clustering techniques in data mining, Master's thesis, Vietnam National University, Hanoi, 2005.
[3] Hoàng Thị Mai, Data mining by data clustering methods, Master's thesis, Hanoi National University of Education, 2006.

English references
[4] Athena Vakali, Web data clustering: Current research status & trends, Aristotle University, Greece, 2004.
[5] Bing Liu, Web mining, Springer, 2007.
[6] Brij M. Masand, Myra Spiliopoulou, Jaideep Srivastava, Osmar R. Zaiane, Web Mining for Usage Patterns & Profiles, ACM, 2002.
[7] Filippo Geraci, Marco Pellegrini, Paolo Pisati, and Fabrizio Sebastiani, A scalable algorithm for high-quality clustering of Web Snippets, Italy, ACM, 2006.
[8] Giordano Adami, Paolo Avesani, Diego Sona, Clustering Documents in a Web Directory, ACM, 2003.
[9] Hiroyuki Kawano, Applications of Web mining: from Web search engine to P2P filtering, IEEE, 2004.
[10] Ho Tu Bao, Knowledge Discovery and Data Mining, 2000.
[11] Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, Jinwen Ma, Learning to Cluster Web Search Results, ACM, 2004.
[12] Jitian Xiao, Yanchun Zhang, Xiaohua Jia, Tianzhu Li, Measuring Similarity of Interests for Clustering Web-Users, IEEE, 2001.
[13] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, University of Illinois at Urbana-Champaign, 1999.
[14] Khoo Khyou Bun, Topic Trend Detection and Mining in World Wide Web, PhD thesis, Japan, 2004.
[15] Liu Jian-guo, Huang Zheng-hong, Wu Wei-ping, Web Mining for Electronic Business Application, IEEE, 2003.
[16] Lizhen Liu, Junjie Chen, Hantao Song, The research of Web Mining, IEEE, 2002.
[17] Maria Rigou, Spiros Sirmakessis, and Giannis Tzimas, A Method for Personalized Clustering in Data Intensive Web Applications, 2006.
[18] Miguel Gomes da Costa Júnior, Zhiguo Gong, Web Structure Mining: An Introduction, IEEE, 2005.
[19] Oren Zamir and Oren Etzioni, Web Document Clustering: A Feasibility Demonstration, University of Washington, USA, ACM, 1998.
[20] Pawan Lingras, Rough Set Clustering for Web mining, IEEE, 2002.
[21] Periklis Andritsos, Data Clustering Techniques, University of Toronto, 2002.
[22] R. Cooley, B. Mobasher, and J. Srivastava, Web mining: Information and Pattern Discovery on the World Wide Web, University of Minnesota, USA, 1998.
[23] Raghu Krishnapuram, Anupam Joshi, and Liyu Yi, A Fuzzy Relative of the K-Medoids Algorithm with Application to Web Document and Snippet Clustering, 2001.
[24] Raghu Krishnapuram, Anupam Joshi, Olfa Nasraoui, and Liyu Yi, Low-Complexity Fuzzy Relational Clustering Algorithms for Web Mining, IEEE, 2001.
[25] Raymond Kosala and Hendrik Blockeel, Web Mining Research: A Survey, ACM, 2000.
[26] Rui Wu, Wansheng Tang, Ruiqing Zhao, An Efficient Algorithm for Fuzzy Web-Mining, IEEE, 2004.
[27] T. A. Runkler, J. C. Bezdek, Web mining with relational clustering, Elsevier, 2002.
[28] Tsau Young Lin, I-Jen Chiang, A simplicial complex, a hypergraph, structure in the latent semantic space of document clustering, Elsevier, 2005.
[29] Wang Jicheng, Huang Yuan, Wu Gangshan, and Zhang Fuyan, Web Mining: Knowledge Discovery on the Web, IEEE, 1999.
[30] Wang Bin, Liu Zhijing, Web Mining Research, IEEE, 2003.
[31] Wenyi Ni, A Survey of Web Document Clustering, Southern Methodist University, 2004.
[32] Yitong Wang, Masaru Kitsuregawa, Evaluating Contents-Link Coupled Web Page Clustering for Web Search Results, ACM, 2002.
[33] Zifeng Cui, Baowen Xu, Weifeng Zhang, Junling Xu, Web Documents Clustering with Interest Links, IEEE, 2005.