k44 do thi dieu ngoc thesis

Upload: luong-hong-giang

Post on 08-Apr-2018


  • 8/6/2019 K44 Do Thi Dieu Ngoc Thesis

    1/54

    Chapter 1. OVERVIEW OF WEB DATA MINING

    1.1. INTRODUCTION TO DATA MINING AND KDD

    1.1.1. Why is data mining needed?

    Over roughly the past decade, the amount of information stored on electronic devices (hard disks, CD-ROMs, magnetic tape, etc.) has grown continuously. This accumulation of data is happening at an explosive rate. It is estimated that the amount of information worldwide doubles about every two years, and the number and size of databases grow rapidly along with it. Figuratively speaking, we are drowning in data yet starving for knowledge. The question is whether anything useful can be extracted from these mountains of seemingly discarded data.

    "Necessity is the mother of invention": Data Mining emerged as an effective answer to the question just posed []. There are many definitions of Data Mining, which are discussed in a later section. For now it can be understood as a knowledge technology that helps extract useful information from the data stores accumulated over the course of a company's or organization's operation.

    1.1.2. What is data mining?

    Data mining is defined as the process of filtering or discovering knowledge from a large amount of data. A commonly used analogy is extracting gold from rock and sand: data mining is likened to "panning sand for gold" within a large given collection of data. The term refers to finding a small, valuable subset within a large volume of raw data. Several terms currently in use have a similar meaning, such as knowledge mining, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.

    Definition: Data mining is a set of techniques used to automatically explore and find the mutual relationships among data in a huge, complex data set, and at the same time to find the patterns hidden in that data set.

    Data mining is one step of the KDD (Knowledge Discovery in Databases) process, which consists of the following seven steps, in order:


    1. Data cleaning (data cleaning & preprocessing): remove noise and unnecessary data.

    2. Data integration: merge the data into data stores (data warehouses & data marts) after it has been cleaned and preprocessed.

    3. Data selection: select data from the data stores and then convert it into a form suitable for knowledge discovery. This step also covers the handling of noisy data, incomplete data, and so on.

    4. Data transformation: the data is converted into forms suitable for processing.

    5. Data mining: one of the most important steps, in which intelligent methods are used to extract data patterns.

    6. Pattern evaluation (knowledge evaluation): assess the results found, using some measure.

    7. Knowledge presentation: use representation and visualization techniques to present the results to the user.

    Figure 1 - The steps in Data Mining & KDD


    1.1.3. Main functions of data mining

    Data mining is divided into the following main directions:

    - Concept description: focuses on describing, aggregating and summarizing concepts. Example: text summarization.

    - Association rules: a fairly simple form of rule for representing knowledge. Example: of the 60% of male supermarket shoppers who buy beer, as many as 80% also buy dried beef. Association rules are widely applied in business, medicine, bioinformatics, finance and the stock market, and so on.

    - Classification & prediction: assign an object to one of a set of classes known in advance. Example: classifying geographic regions by weather data. This approach usually uses machine learning techniques such as decision trees, artificial neural networks, and so on. Classification is also called supervised learning.

    - Clustering: group objects into clusters, where neither the number nor the names of the clusters are known in advance. Clustering is also called unsupervised learning.

    - Sequential/temporal pattern mining: similar to association rule mining, but with ordering and time taken into account. This approach is widely applied in finance and the stock market because of its high predictive value.
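    The beer and dried beef example can be made concrete with the two measures usually attached to an association rule: support (how often the items occur) and confidence (how often the rule holds when its left side occurs). The sketch below is only an illustration. The names `support` and `confidence` and the ten baskets are assumptions made up for this example, not data from the thesis.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    wanted = frozenset(items)
    if not transactions:
        return 0.0
    hits = sum(1 for basket in transactions if wanted <= basket)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Estimated probability of `consequent` given `antecedent`."""
    antecedent = frozenset(antecedent)
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, antecedent | frozenset(consequent)) / base


# Hypothetical baskets, invented for illustration only.
baskets = [
    frozenset({"beer", "dried beef"}),
    frozenset({"beer", "dried beef", "bread"}),
    frozenset({"beer", "dried beef"}),
    frozenset({"beer", "dried beef", "milk"}),
    frozenset({"beer", "chips"}),
    frozenset({"bread", "milk"}),
    frozenset({"milk"}),
    frozenset({"bread"}),
    frozenset({"chips", "milk"}),
    frozenset({"bread", "chips"}),
]

# 5 of the 10 baskets contain beer; 4 of those 5 also contain dried beef.
beer_support = support(baskets, ["beer"])
rule_confidence = confidence(baskets, ["beer"], ["dried beef"])
```

    With these baskets the rule "beer implies dried beef" has support 0.5 for its left side and confidence 0.8. Real association rule miners such as Apriori search for all rules above chosen thresholds rather than testing a single rule.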

    1.1.4. Applications of data mining

    Although data mining is a new approach, it has attracted a great deal of interest from researchers and developers thanks to its practical applications. Some typical applications are:

    - Data analysis & decision support
    - Medical treatment
    - Text mining & Web mining
    - Bioinformatics
    - Finance & stock market
    - Insurance
    - Pattern recognition
    - etc.

    1.2. HYPERTEXT AND FULLTEXT DATABASES

    1.2.1. FullText databases

    FullText data is a form of unstructured data in which the information consists only of text documents. Each document contains information about some topic, expressed through the content of all the words that make up the document. The meaning of each word in a document is not fixed: it depends on the context and differs from one context to another. The words in a document are connected to one another according to some language.

    Among the kinds of data in use today, text is one of the most common. It is present everywhere and we encounter it constantly. Text processing problems were therefore posed long ago and remain open issues in text data mining. Notable problems include text retrieval, text classification, text clustering, and text routing.

    A full-text database is an unstructured database whose data consists of documents and the attributes of those documents. It is usually organized as a combination of two components: an ordinary structured database (holding the characteristics of the documents) and the documents themselves.

    The content of the documents is stored indirectly in the database, in the sense that the system only manages the addresses where the content is stored.

    [Figure: a Full-Text database consists of a structured database holding the characteristics of the documents, and the documents themselves]

    Text databases can be divided into two kinds:

    Unstructured: the ordinary texts we read every day, expressed in natural human language and without any format structure. Examples: collections of books, journals and articles managed in a digital library network.

    Semi-structured: texts organized with a loose structure, such as records with text markup symbols, that still convey the main content of the text. Examples: HTML, email, and so on.

    The division into two kinds is not entirely clear-cut. Software systems often have to combine both into a single system, as in search engines or in text retrieval, one of the areas attracting the most interest today. For example, search systems such as Yahoo, Altavista and Google all organize data into groups and directories, and each group may contain many subgroups. Altavista additionally integrates an automatic translation program that can translate into many languages with fairly good results.

    1.2.2. HyperText databases

    According to the Oxford dictionary (Oxford English Dictionary Additions Series), hypertext is defined as follows: text that does not have to be read in a single continuous sequence and that can be read in different orders. In particular, text and graphics are interlinked in such a way that the reader need not read continuously. For example, when reading a book, the reader does not have to read every page in turn from beginning to end, but can jump ahead to later passages to consult the topics of interest.

    HyperText thus consists of non-sequential writing: it branches and lets readers choose how to read according to their wishes. In the ordinary sense, HyperText is a set of written pages connected to one another by links, which lets the reader read them in different ways. We are already familiar with pages in HTML format: the pages contain links that point to different parts of the same page or to other pages, and the reader reads the text by following those links.

    HyperText is also a special kind of text, so it can include continuous writing as well (the most common form of writing). Because HyperText is not constrained by continuity, new forms of presentation can be created, and the document can therefore reflect the intended content better. Readers can also choose a suitable way of reading, such as going deeper into a topic that interests them. The idea of creating a set of texts together with pointers to other texts, so as to link a set of related texts, is a genuinely good and very useful way to organize information. For writers, it removes the worry about the order of presentation: they can organize the subject into small parts and then use links to show the relationships among those parts.

    For readers, this approach lets them take shortcuts across the information network and decide which pieces of information are relevant to the topic they want to keep exploring. Compared with linear reading, that is, reading in sequence, HyperText provides an interface for accessing information content far more effectively. From the standpoint of machine learning algorithms, HyperText gives us the opportunity to look beyond the boundaries of a single document when classifying it, that is, to take the documents linked to it into account. Not all linked documents are useful for classification, of course, especially since hyperlinks can point to many different kinds of documents. But there is certainly still untapped potential, requiring further research, in using the documents linked to a page to improve the accuracy of classifying that page.

    Two HyperText concepts need attention:

    Hypertext Document: a single text document within a hypertext system. If the hypertext system is pictured as a graph, the documents correspond to its nodes.

    Hypertext Link: a reference connecting one HyperText document to another. Hyperlinks play the role of the edges in that graph.

    HyperText is a widespread type of data today, and one for which the demand for search and classification is very high. It is the prevalent data on the Internet. A HyperText database contains semi-structured text because of the added tags: structural tags (title, introduction, body) and text emphasis tags (bold, italic, and so on). Thanks to these tags we have an additional criterion, compared with full-text documents, for searching and classifying them. Based on the predefined tags, keywords can be given different priorities depending on where they appear. For example, when searching for documents related to "people", we use "people" as the search keyword, and documents in which the keyword "people" appears in the title are closer to the search request.
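    The idea of giving a keyword a higher priority when it appears inside certain tags can be sketched with the standard library HTML parser. This is a minimal sketch under stated assumptions: the weights in `TAG_WEIGHTS` and the names `WeightedTermCounter` and `keyword_score` are hypothetical and chosen only for illustration. The thesis does not specify concrete weights.

```python
from html.parser import HTMLParser

# Hypothetical weights (an assumption): words in a title count three times
# as much as plain body text, emphasized words one and a half times.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.5, "b": 1.5, "i": 1.5}
DEFAULT_WEIGHT = 1.0


class WeightedTermCounter(HTMLParser):
    """Accumulates a score per word, weighted by the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.scores = {}

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.open_tags:
            while self.open_tags and self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # The highest weight among the currently open tags applies.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        for word in data.lower().split():
            word = word.strip(".,;:!?\"'()")
            if word:
                self.scores[word] = self.scores.get(word, 0.0) + weight


def keyword_score(html: str, keyword: str) -> float:
    """Tag-weighted score of `keyword` in one HTML document."""
    counter = WeightedTermCounter()
    counter.feed(html)
    counter.close()
    return counter.scores.get(keyword.lower(), 0.0)


page_a = "<html><head><title>People of Hanoi</title></head><body><p>A city guide.</p></body></html>"
page_b = "<html><head><title>City guide</title></head><body><p>Many people live here.</p></body></html>"

score_a = keyword_score(page_a, "people")  # keyword in the title
score_b = keyword_score(page_b, "people")  # keyword only in the body
```

    Here `page_a` scores higher than `page_b` for "people" because the keyword sits in the title, which matches the prioritization described in the text.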


    Comparing the characteristics of Fulltext data and Web page data

    Although a Web page is a special form of FullText data, the two kinds of data differ in many ways. The observations below show the differences between Web data and FullText data. These differences in characteristics are the main reason that mining the two kinds of data (classification, search, and so on) differs.

    [Figure: a diagram illustrating Hypertext Documents as nodes and Hypertext Links as the links between them]

    The comparison below between the characteristics of Fulltext data and Web page data is presented in [2].

    No. | Web page | Ordinary text (Fulltext)
    1 | Semi-structured text. The content has a title and tags that emphasize the meaning of words or phrases. | Usually unstructured text. The content offers no criterion on which to base an assessment.
    2 | The content of Web pages is usually described briefly and concisely, with hyperlinks pointing the reader to other places with related content. | The content of ordinary texts is usually very detailed and complete.
    3 | Web pages contain hyperlinks that connect pages with related content to one another. | Ordinary text pages cannot link to the content of other pages.

    1.3. TEXT MINING AND WEB MINING

    As mentioned above, text mining and Web mining are among the important applications of data mining. This section examines these problems in more depth.

    1.3.1. Problems in text data mining

    1. Text retrieval

    a. Content

    Text retrieval is the process of finding texts according to the user's request. Requests are expressed as queries, and the simplest form of query is a set of keywords. A text retrieval system can be pictured as sorting texts into two classes: one class of texts that satisfy the query and are returned, and one class of texts that do not satisfy it and are not shown. Real systems today do not behave this way. Instead they return lists of texts ordered by their importance with respect to the query. Typical examples are search engines such as Google and Altavista.

    b. Process

    The retrieval process is divided into four main stages:

    Indexing: texts in raw form must be converted into some representation for processing. This stage is also called text representation. The representation must be structured and easy to process.

    Query formulation: users must describe their information needs in the form of a query. Queries must be expressed in a form commonly accepted by search systems, such as entering the keywords to look for. There are also methods for formulating queries in natural language or by example, but these forms require more complex processing techniques. The great majority of current retrieval systems use keyword queries.

    Comparison: the system must compare the user's query clearly and completely against the texts stored in the database. It then makes a decision that classifies the texts by how closely they relate to the query and determines their order. The system displays either the whole text or only part of it.

    Feedback: the results returned initially often do not satisfy the user's request, so a feedback stage is needed in which the user can modify the request or enter a new one. In addition, the user can interact with the system regarding the texts that satisfy the request, and the system can update those texts accordingly. This stage is called relevance feedback.

    Current search tools concentrate mainly on the first three stages. Most of them still lack a feedback stage or any handling of the interaction between user and machine. Feedback is now being widely researched, and in the area of human-machine interface interaction a research direction known as interface agents has emerged.
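    The first three stages (indexing, query formulation as keywords, comparison with ranking) can be illustrated with a very small sketch. It assumes a bag-of-words representation and a score equal to the number of query-term occurrences. Both are simplifying assumptions for illustration, and the function names are hypothetical. Relevance feedback is not modeled.

```python
from collections import Counter
from typing import Dict, List, Tuple


def index_document(text: str) -> Counter:
    """Indexing stage: turn raw text into a structured word-count form."""
    return Counter(text.lower().split())


def search(indexed_docs: Dict[str, Counter], query: str) -> List[Tuple[str, int]]:
    """Comparison stage: rank documents by occurrences of the query keywords."""
    terms = query.lower().split()
    ranked = []
    for doc_id, counts in indexed_docs.items():
        score = sum(counts[t] for t in terms)
        if score > 0:  # documents that match nothing are not shown
            ranked.append((doc_id, score))
    # Highest score first; ties broken by document id for a stable order.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Hypothetical three-document collection.
raw_docs = {
    "d1": "web mining mines web data",
    "d2": "text mining",
    "d3": "cooking recipes",
}
indexed = {doc_id: index_document(text) for doc_id, text in raw_docs.items()}
results = search(indexed, "web mining")
```

    The result is a ranked list rather than a yes/no split, which mirrors the remark that real systems order texts by importance instead of only separating matching from non-matching ones.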

    2. Text classification (Text Categorization)

    a. Content

    Text classification is regarded as the process of assigning texts to one or more predefined classes. Texts can be classified manually, that is, by reading each text and assigning it to a class. For a large number of texts this takes a great deal of time and effort and is therefore not feasible, so automatic classification methods are needed. Automatic classification uses machine learning methods from artificial intelligence (decision trees, Bayes, k nearest neighbors).
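    As an illustration of one of the methods named above, k nearest neighbors, the sketch below classifies a text by majority vote among the k most similar hand-labeled texts. It assumes a bag-of-words representation with cosine similarity. The training sentences, labels and function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(training: List[Tuple[str, str]], text: str, k: int = 3) -> str:
    """Label `text` by majority vote among its k most similar training texts."""
    query = bag_of_words(text)
    scored = sorted(
        ((cosine(query, bag_of_words(doc)), label) for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical hand-labeled training texts.
training_set = [
    ("football match goal team", "sports"),
    ("tennis player wins match", "sports"),
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins the cup final", "sports"),
]

label_1 = knn_classify(training_set, "the team scored a late goal in the match")
label_2 = knn_classify(training_set, "shares fall as the bank raises rates")
```

    The hand-labeled `training_set` plays the role of the manually classified texts that an automatic classifier learns from.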

    One of the most important applications of text classification is in text retrieval. Starting from a classified data set, texts are indexed under their corresponding classes. Users can then specify the topic or class of text they want to search for through their queries.

    Another application of text classification is in text understanding. Classification can be used to filter texts, or parts of texts, that contain the data being sought, without losing the complexity of natural language.

    In text classification, membership in a class can be assigned a true/false value (the text either belongs to the class or does not) or computed as a degree of membership (the text belongs to the class to some degree). When there are many classes, true/false classification amounts to deciding whether a text belongs to one particular class.

    b. Process

    The text classification process follows these steps:

    Indexing: indexing texts for classification is the same as indexing for text retrieval. Here indexing speed plays an important role, because some new texts may need to be processed in real time.

    Determining the classification: as in text retrieval, text classification requires a way of expressing how a text is determined to belong to a given class, based on its representation. In a text classification system this component is called the classifier (categorizer). It plays the same role as the query in a retrieval system. But whereas queries are transient, a classifier is used in a stable, long-term way throughout the classification process.

    Comparison: in most classifiers, each text must be assigned a true/false value for some class. The biggest difference from the comparison stage in text retrieval is that each text is compared only once against a number of classes, and choosing the appropriate decision also depends on the relationships among the classes.

    Feedback (or adaptation): feedback plays two roles in a text classification system. First, classification requires a large number of texts that have already been classified by hand. These texts are used as training samples to help build the classifier. Second, in text classification it is not easy to change the requirements as in the feedback stage of text retrieval. Instead, users can inform the system maintainers that they want certain classes deleted, added or changed.

    3. Some other problems

    Besides the two problems above, there are also the following:

    - Text summarization
    - Text clustering
    - Term clustering
    - Term classification
    - Latent term indexing
    - Text routing

    In all the text processing problems listed above, text representation plays a very large role, especially in retrieval, classification, clustering and routing.

    1.3.2. Web data mining

    a. Need

    The rapid growth of the Internet and of intranets has produced an enormous volume of hypertext data (Web data). As the content and number of Web pages on the Internet change and grow by the day and by the hour, finding information becomes ever harder for users. The need to search for information in unstructured databases has developed largely alongside the Internet. Through the Internet, people have become familiar with Web pages and the countless pieces of information they carry. In recent years the Internet has become one of the main channels for science, economic information, commerce and advertising. One reason for this growth is the low cost of publishing a Web page on the Internet. Compared with other services, such as trading or advertising in a newspaper or magazine, a Web page costs far less and reaches millions of users around the world with faster updates. The Web can be likened to an encyclopedia. The information on Web pages is diverse in both content and form. The Internet can be described as a virtual society: it contains information on every aspect of economic and social life, presented as text, images, sound, and so on.

    With such diversity and volume of information, however, the problem of information overload has arisen. People cannot find the addresses of the Web pages containing the information they need on their own. This calls for a utility that manages the content of Web pages and makes it possible to find the addresses of pages whose content matches the searcher's request. Such utilities manage data as unstructured objects. Some familiar examples are Yahoo, Google and Altavista.

    On the other hand, suppose we have Web pages on topics such as computing, sports, socio-economics and construction. Based on the content of the documents that customers view or download, classification tells us which content on our site customers concentrate on, and we can then add more documents on the topics that interest them, and vice versa. On the customer side, the analysis also tells us which topics a given customer focuses on, so that we can offer that customer additional support. Given these practical needs, Web page classification and search remain interesting problems that call for further research.

    b. Difficulties

    The World Wide Web serves as a huge, widely distributed information center that provides information in every field: science, society, commerce, culture, and so on. The Web is a rich resource for data mining. The following observations show that the Web poses major challenges for data mining technology.

    1. The Web seems too large to be organized into a data warehouse for data mining

    Traditional databases are not very large and are usually stored in one place. The Web, by contrast, is very large, reaching terabytes in size, and changes continuously. It is also distributed across a great many computers all over the world. Some studies of the size of the Web give the following figures: there are currently more than one billion Web pages available to users on the Internet. Assuming an average page size of 5-10 KB, the total size is on the order of 10 terabytes. The growth rate of Web pages is truly impressive: the number of pages has doubled in the last two years and will continue to grow over the next two. Many organizations and communities put most of their public information on the Web. Building a data warehouse to store, replicate or integrate the data on the Web is therefore nearly impossible.
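    The size estimate is simple arithmetic, reproduced below as a quick check. The sketch assumes exactly one billion pages (the text says "more than one billion", so this is a lower bound) and decimal units (1 KB = 1,000 bytes). Both choices are assumptions made for the illustration.

```python
PAGES = 1_000_000_000          # lower bound: "more than one billion" pages
BYTES_PER_KB = 1_000           # decimal kilobyte (assumption)
BYTES_PER_TB = 1_000_000_000_000


def total_terabytes(pages: int, avg_page_kb: int) -> float:
    """Total collection size in terabytes for a given average page size."""
    return pages * avg_page_kb * BYTES_PER_KB / BYTES_PER_TB


low_estimate = total_terabytes(PAGES, 5)    # average page of 5 KB
high_estimate = total_terabytes(PAGES, 10)  # average page of 10 KB
```

    With these assumptions one billion pages occupy between 5 and 10 terabytes, so a page count above one billion at the upper page size gives the roughly 10 terabytes quoted in the text.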

    2. The complexity of Web pages is far greater than that of traditional text documents

    Data in traditional databases is usually homogeneous (in language, format, and so on), whereas Web data is entirely heterogeneous. Web data includes many different languages (both the languages used to express content and programming languages), many different formats (text, HTML, PDF, images, sound, and so on), and many different kinds of vocabulary (email addresses, links, ZIP codes, telephone numbers).

    In other words, Web pages lack a uniform structure. The Web is regarded as a vast digital library, yet the enormous number of documents in this library is not arranged according to any particular standard: not by category, title, author, page count or content. This is a very great challenge for finding the information one needs in such a library.

    3. The Web is a highly dynamic information source

    The Web changes not only in size. The information within the pages themselves is also updated continuously. According to one study of more than 500,000 Web pages over more than 4 months, 23% of the pages changed daily, and within about 10 days 50% of the pages in the domain studied had disappeared, meaning that their URLs no longer existed. News sites, stock markets, advertising companies and Web service centers update their pages regularly. In addition, linkage information and access records are also updated.

    4. The Web serves a large and diverse community of users

    The Internet currently connects about 50 million workstations, and the user community is still expanding rapidly. Each user has different knowledge, interests and preferences. Most users, however, do not have a good understanding of the structure of the information network or are not aware of how to search. They easily get "lost" while "groping" in the "darkness" of the network, or grow bored with searching because they receive only pieces of information that are of little use.

    5. Only a very small part of the information on the Web is truly useful

    According to statistics, 99% of the information on the Web is useless to 99% of Web users. Meanwhile the parts of the Web that are of no interest still get mixed into the results returned by a search. How, then, should the Web be mined to obtain the highest-quality pages by the user's own criteria?

    These points show how searching a traditional database differs from searching the Internet. The challenges above have spurred research into mining and using the resources of the Internet.

    c. Advantages

    Alongside these challenges, Web pages offer some advantages for Web mining.

    1. The Web consists not only of pages but also of hyperlinks pointing from one page to another. When an author creates a hyperlink from his page to a page A, this means that A is useful for the topic under discussion. The more hyperlinks from other pages point to page A, the more important page A is shown to be. The large volume of link information therefore provides a wealth of information about the relevance, quality and structure of Web content, and is thus a major resource for Web mining.

    2. A Web server usually records a Weblog entry for every access to a Web page. The entry includes the URL, the IP address and a timestamp. Weblog data provides rich information about dynamic Web pages. From information such as the URL and the IP address, a multidimensional view can be constructed on top of the Weblog database. Multidimensional OLAP analysis can then yield the top N users, the top N most accessed Web pages, the time periods with the most accesses, and Web access trends.
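    The kinds of summaries mentioned above (top N users, top N pages, busiest periods) can be sketched with simple counting over log entries. This is not an OLAP engine. It is a minimal illustration that assumes each entry carries a URL, an IP address and an ISO 8601 timestamp. The `LogEntry` type, the field format and the sample entries are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple, Tuple


class LogEntry(NamedTuple):
    url: str
    ip: str
    timestamp: str  # assumed format: "YYYY-MM-DDTHH:MM:SS"


def top_n(values: Iterable[str], n: int) -> List[Tuple[str, int]]:
    """The n most frequent values with their counts."""
    return Counter(values).most_common(n)


def summarize(entries: Iterable[LogEntry], n: int = 1) -> Dict[str, List[Tuple[str, int]]]:
    entries = list(entries)
    return {
        "top_users": top_n((e.ip for e in entries), n),
        "top_pages": top_n((e.url for e in entries), n),
        # Characters 11-12 of the timestamp are the hour of day.
        "busiest_hours": top_n((e.timestamp[11:13] for e in entries), n),
    }


# Hypothetical log entries.
log = [
    LogEntry("/index.html", "10.0.0.1", "2004-05-01T09:15:00"),
    LogEntry("/news.html", "10.0.0.2", "2004-05-01T09:40:00"),
    LogEntry("/index.html", "10.0.0.1", "2004-05-01T10:05:00"),
    LogEntry("/index.html", "10.0.0.3", "2004-05-01T09:55:00"),
]

summary = summarize(log, n=1)
```

    Each key of `summary` corresponds to one dimension of the log (user, page, time), which is the intuition behind building a multidimensional view over Weblog data.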

    d. The areas of Web mining

    Following the analysis above of the characteristics and content of HyperText documents, Web data mining focuses on the components present in a Web page. These are:

    1. Web content mining

    Web content mining has two parts:

    a. Web Page Content

    This uses only the words in the text, without considering the links between texts. This is exactly text data mining.

    b. Search Result

    Mining the search results. In a search engine, once the Web pages satisfying the user's request have been found, an equally important task remains: ordering the results by how close they are to the content being sought. This too is Web content mining.

    2. Web Structure Mining

    Mining based on the hyperlinks between related texts.

    3. Web Usage Mining

    a. General Access Pattern Tracking:

    Analyzing Web logs to discover users' access patterns on the Web site.

    b. Customized Usage Tracking:

    Analyzing users' access patterns at each point in time in order to learn the access tendencies of each group of users at different times.

    [Figure: The areas of Web mining. Web Mining divides into Web Structure, Web Content (Web Page Content, Search Result) and Web Usage (General Access Pattern, Customized Usage)]


    Chapter 2. SEARCH ENGINES

    2.1. NEED

    As mentioned in the previous chapter, the Internet is like a virtual society: it contains information on every aspect of economic and social life, presented as text, images, sound, and so on. The information on Web pages is diverse in both content and form. With such diversity and volume of information, however, the problem of information overload has arisen. For any given user only a very small part of the information is useful. Some people, for example, care only about sports and culture pages and rarely about economics. People cannot find the addresses of the Web pages containing the information they need on their own. This calls for a utility that manages the content of Web pages and makes it possible to find the addresses of pages whose content matches the searcher's request. Some familiar examples are Yahoo, Google and Altavista.

    A search engine is a system built to accept users' search requests (usually a set of keywords), analyze them, search an existing database, and return the resulting Web pages to the user. Concretely, the user submits a query, in its simplest form a list of keywords, and the search engine returns a list of Web pages that are relevant to or contain those keywords. In more complex cases the query is a whole text, a passage of text, or a summary of a text.

    2.2. STRUCTURE AND OPERATING MECHANISM

    2.2.1. Overview of current search systems

    We take the Google search system as a concrete example.

    This section gives an overview of how the Google search system works. Later sections discuss the main applications (crawling, indexing, searching) and the data structures not yet covered here.

    Most of Google is implemented in C and C++ and runs well on Solaris or Linux. In Google, Web crawling (downloading Web pages) is carried out by several distributed Web crawlers. A URL server sends lists of URLs to be fetched to the crawlers. The fetched Web pages are sent to the store server, which compresses them and stores them in the Repository. Every Web page has an associated ID number called its DocID. The indexing function is performed by the Indexer and the Sorter. The Indexer performs the following functions: it reads from the Repository, decompresses the documents and parses them. Each document is converted into a set of word occurrences called Hits. Hits record the word, its position in the document, an approximation of the font size, and capitalization. The Indexer distributes these Hits into a set of "Barrels". The Indexer performs another important function: it parses all the hyperlinks on every page and stores the important information about them in a file. This file contains enough information to determine where each link points from and to, together with the text of the link.

    In summary, the Crawler is responsible for downloading Web pages and storing them in the repository.

    The Indexer reads from the repository, decompresses and parses the documents, encodes them into Hits, and sorts the Hits into "Barrels". It also parses all the hyperlinks and stores them in a file.

    2.2.2. Structure of search systems

    Current search engines are usually organized into three modules:

    Indexing module: crawls Web pages on the Internet, analyzes them and stores them in the database.

    Searching module: accesses the databases to return a list of documents satisfying a user request (in the form of a query that is a set of keywords).

    User interface module: takes the results from the searching module.

    The following sections go into the details of each module and its tasks.

    Figure 2.3: Architecture model of the Google search engine


    a. Indexing module

    The indexing module performs the following tasks:

    1. Parse the text and index all the keywords in the text (number of occurrences, positions of occurrence).

    2. Build the link graph between hypertext documents (forward links and backward links).

    3. Compute the PageRank importance of all documents based on the hypertext link structure (GoogleTM).

    Each task is examined in detail below.

    a.1. Crawling the Web along hyperlinks (Web Crawler)

    Crawler(s): most search engines rely on programs called crawlers, which supply the data (the Web pages) that the search engine operates on. Crawlers are small programs belonging to the search engine that browse the Web. Their work resembles that of a person surfing the Web and following links to reach different pages. The crawlers are given a set of initial URLs. They parse the links contained in those pages and pass the information to the crawler control component. This component decides which links are to be visited next and sends the result back to the crawlers (in some search engines the crawlers perform this function themselves). The crawlers also pass the pages they have retrieved to the page repository, and continue visiting other Web pages on the Internet until the sources are exhausted.

    The crawler module thus retrieves pages from the Web and downloads them. The pages are then indexed by the indexing module and pushed into the database. This process is repeated until the crawler is told to stop.

    The crawler control decides which Web pages are visited next.
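    The crawl loop described above can be sketched as follows. To keep the example runnable without network access, the Web is simulated by a dictionary from URL to outgoing links. The function name `crawl`, the first-in-first-out visiting order used as the crawler control policy, and the `max_pages` budget are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seed_urls: Iterable[str],
          fetch_links: Callable[[str], List[str]],
          max_pages: int) -> List[str]:
    """Visit pages starting from the seeds until the frontier is empty
    or the page budget is used up. Returns the URLs stored, in visit order."""
    seeds = list(seed_urls)
    frontier = deque(seeds)        # URLs waiting to be visited
    seen = set(seeds)              # URLs already queued or visited
    repository = []                # stands in for the page repository
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()   # control policy: first in, first out
        links = fetch_links(url)   # "download" the page and extract its links
        repository.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository


# Hypothetical miniature Web: page -> pages it links to.
toy_web = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

visited_all = crawl(["a"], lambda url: toy_web.get(url, []), max_pages=10)
visited_two = crawl(["a"], lambda url: toy_web.get(url, []), max_pages=2)
```

    A production crawler would replace the queue discipline with an importance-based ordering and would need politeness delays, error handling and URL normalization, none of which are modeled here.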

    A standard search engine has to consider two main issues in the crawler module:

    - The number of Web pages is very large, so the crawler cannot download all of them and selects only the "important" pages. Which pages count as important, and how is importance computed?

    - Because the content of Web pages changes continuously, after downloading them the crawler must regularly revisit the downloaded pages to pick up the changes. Moreover, pages change at different rates, so the crawler must consider carefully which pages need revisiting and which can be skipped.

    Problem 1: Importance

    Given a Web page P, its importance can be computed in the following ways:

    1. Given a query Q, the importance of P is defined as the "textual similarity" between P and Q. Q and P are each represented by an n-dimensional vector v = (w1, w2, ..., wn), where wi represents the i-th word of the vocabulary. Concretely, wi is the number of occurrences of the i-th word. The similarity between P and Q is the cosine of the angle between the two vectors. The importance obtained by this method is denoted IS(P).

    2. A page that many other pages link to is more important, so one way to compute the importance of page P is to count the links pointing to P. The importance obtained by this method is denoted IB(P).

    3. Importance can be computed from the URL itself. A Web page whose address ends in ".com" or contains the word "home" is considered more important. The importance obtained by this method is denoted IL(P).

    4. Another method is to count the number of times users access the page within some period of time. The importance obtained by this method is denoted IU(P).

    The final importance of page P is a combination of the importance values computed in the ways above, in some proportion:

    IC(P) = k1.IS(P) + k2.IB(P) + k3.IL(P) + k4.IU(P) (where k1, k2, k3, k4 and the query Q are given)
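    The four measures and their weighted combination can be sketched as below. The formulas for IS(P) and IB(P) follow the description directly. The scoring inside `location_importance` (one point for a ".com" host, one point for "home" in the URL) is an assumption, since the text only says such pages are "more important". All function names and the coefficient values are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def similarity_importance(page_text: str, query: str) -> float:
    """IS(P): cosine between the word-count vectors of page P and query Q."""
    p = Counter(page_text.lower().split())
    q = Counter(query.lower().split())
    dot = sum(p[word] * count for word, count in q.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def backlink_importance(url: str, links: List[Tuple[str, str]]) -> int:
    """IB(P): number of (source, target) links whose target is P."""
    return sum(1 for _, target in links if target == url)


def location_importance(url: str) -> float:
    """IL(P): score from the URL alone (scoring scheme is an assumption)."""
    host = url.split("//")[-1].split("/")[0]
    score = 0.0
    if host.endswith(".com"):
        score += 1.0
    if "home" in url.lower():
        score += 1.0
    return score


def combined_importance(is_p: float, ib_p: float, il_p: float, iu_p: float,
                        k: Sequence[float] = (1.0, 1.0, 1.0, 1.0)) -> float:
    """IC(P) = k1*IS(P) + k2*IB(P) + k3*IL(P) + k4*IU(P)."""
    k1, k2, k3, k4 = k
    return k1 * is_p + k2 * ib_p + k3 * il_p + k4 * iu_p


# Hypothetical values for one page.
is_value = similarity_importance("web mining web", "web")
ib_value = backlink_importance("p", [("a", "p"), ("b", "p"), ("p", "a")])
il_value = location_importance("http://example.com/home.html")
ic_value = combined_importance(0.5, 2, 1.0, 10, k=(1.0, 0.1, 0.5, 0.01))
```

    Because the four measures live on very different scales (a cosine lies between 0 and 1, while link and access counts are unbounded), the coefficients k1 to k4 also act as scale factors. In practice they would have to be tuned.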

    Problem 2: Refreshing the downloaded pages

    There are two strategies for refreshing the downloaded pages:

    1. Periodic refresh of all pages: the crawler revisits all pages with the same frequency f, regardless of how often they change. In other words, the pages are treated equally no matter how they change. Regular updating here means that after, say, 10,000 pages have been downloaded, PageRank and the index of the words in each URL are recomputed.

    2. Proportional refresh: the more a page changes, the more frequently it is refreshed. Example: pages e1, e2, ..., en change k1, k2, ..., kn times respectively.
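    The two refresh strategies can be compared by how they divide a fixed budget of revisits among the pages. The sketch below assumes a daily visit budget and known change rates k1, ..., kn. The function names and the numbers are hypothetical.

```python
from typing import Dict, List


def uniform_schedule(pages: List[str], visits_per_day: float) -> Dict[str, float]:
    """Strategy 1: every page gets the same revisit frequency f."""
    if not pages:
        return {}
    f = visits_per_day / len(pages)
    return {page: f for page in pages}


def proportional_schedule(change_rates: Dict[str, float],
                          visits_per_day: float) -> Dict[str, float]:
    """Strategy 2: revisit frequency proportional to each page's change rate."""
    total = sum(change_rates.values())
    if total == 0:
        # No page ever changes: fall back to equal treatment.
        return uniform_schedule(list(change_rates), visits_per_day)
    return {page: visits_per_day * rate / total
            for page, rate in change_rates.items()}


# Hypothetical change rates: e2 changes three times as often as e1.
rates = {"e1": 1, "e2": 3}
uniform = uniform_schedule(list(rates), visits_per_day=8)
proportional = proportional_schedule(rates, visits_per_day=8)
```

    With a budget of 8 visits per day, the uniform strategy visits each page 4 times, while the proportional strategy visits e1 twice and e2 six times.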

    a.2. Indexing

    The Indexer module reads every word of every Web page stored in the page repository and records the URLs of the pages that contain each word. The result is a very large index table, through which the system can return the URLs of all matching pages on request.

    The two modules Indexer and Collection Analysis in Figure 1 build the various indexes over the downloaded Web pages. The Indexer module builds two basic indexes: the text (content) index and the structure (link) index. Using these two indexes and the pages in the repository, the Collection Analysis module builds further useful indexes. Below we briefly describe several kinds of index, focusing on their structure and use.

    Link index

    To build the link index, the crawled portion of the Web is modeled as a graph of nodes and edges, in which the nodes are Web pages and the edges are the links between pages. The index is built from the nodes and edges of this graph.

    Figure 1.2. Graph illustrating the nodes (hypertext documents) and edges (links) in a collection of hypertext documents

    The structural information most commonly used by search algorithms in retrieval systems is neighbourhood information taken from linked pages, and the link graph gives efficient access to that neighbourhood information. Small graphs of hundreds or even thousands of nodes can be represented with any data structure, but doing the same for a graph with millions of nodes is a major challenge.
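For a small graph, the link index can be held as two adjacency maps, one for outgoing and one for incoming links. The sketch below is only an illustration; the function name and the sample links are hypothetical, and a Web-scale index would need a far more compact structure.

```python
from collections import defaultdict

def build_link_index(links):
    """Build forward and backward adjacency sets from (source, target) pairs."""
    out_links, in_links = defaultdict(set), defaultdict(set)
    for source, target in links:
        out_links[source].add(target)   # pages that source points to
        in_links[target].add(source)    # pages that point to target
    return out_links, in_links

links = [("A", "B"), ("A", "C"), ("B", "C")]   # hypothetical sample links
out_links, in_links = build_link_index(links)
neighbours_of_c = sorted(in_links["C"])        # pages linking to C
```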

    Text Index

    Although link-based techniques are used to improve the quality and relevance of search results, term-based retrieval (searching for the pages that contain the keywords) remains the primary way to identify the Web pages relevant to a query. Index support for term-based queries can be implemented with any traditional access method over the full text of the documents. Search engines use an inverted index to represent documents; the inverted index is the traditional choice of index structure for Web pages.

    For example, take the following three documents:

    Document 1: computer science
    Document 2: computer is about live
    Document 3: to live or not to live

    The index file is created as follows:

    - Collect all the words that occur in the three documents
    - Store them in alphabetical order (a, b, c, ...)
    - Store the information about each document (document ID, URL, title, short description, ...)

    The result is an inverted index file, a list of the following entries:

    Term       Doc ID   Position   Address   Title   Description
    about      2        3          ...       ...     ...
    computer   1        1          ...       ...     ...
    computer   2        1          ...       ...     ...
    is         2        2          ...       ...     ...
    live       2        4          ...       ...     ...
    live       3        2          ...       ...     ...
    live       3        6          ...       ...     ...
    not        3        4          ...       ...     ...
    or         3        3          ...       ...     ...
    science    1        2          ...       ...     ...
    to         3        1          ...       ...     ...
    to         3        5          ...       ...     ...

    A search algorithm usually also uses extra information about how a term occurs in the page, for example whether the term is emphasized by an HTML tag or appears in a heading (inside heading tags). To hold this information, an additional field called the payload is added; it encodes the extra information about the occurrences of the term in the document. This information is later used by the ranking algorithm.
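The construction steps above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and lower-casing; the payload, address, title and description fields are left out, and the function name is hypothetical.

```python
from collections import defaultdict

documents = {
    1: "computer science",
    2: "computer is about live",
    3: "to live or not to live",
}

def build_inverted_index(docs):
    """Map each lower-cased term to its list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for position, term in enumerate(docs[doc_id].lower().split(), start=1):
            index[term].append((doc_id, position))
    return dict(sorted(index.items()))   # terms in alphabetical order

index = build_inverted_index(documents)
```

Looking up `index["live"]` returns the three postings for that term, one in document 2 and two in document 3, with positions counted from 1 as in the table.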

    Inverted index

    The inverted index is stored as a database file of records. Building a database that holds the inverted index for a collection as large as the set of Web pages on the Internet requires a highly flexible distributed architecture. In the Web environment there are two basic strategies for partitioning the inverted index across a set of nodes so that it can be stored in a distributed fashion.

    The first is the local inverted file (IFL). In the IFL organization, each node stores the inverted lists for a disjoint subset of the Web pages held in the page repository. When a search request arrives, the search query component broadcasts it to all nodes, and each node returns its own list of the pages that contain the search terms.

    The second is the global inverted file (GFL). In the GFL organization, the inverted index is partitioned by term, so each query server stores the inverted lists for a subset of the terms in the collection. For example, in a system with two query servers A and B, A stores the inverted lists for all words beginning with the letters a to o, and B stores those for the remaining words, from p to z. When the search query component wants the pages that contain the word "people", it therefore only has to ask server B.
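The term-based routing of the global inverted file can be sketched as below, assuming the a-o / p-z split from the example. The function names are hypothetical, and sending terms that do not start with a letter to server B is an arbitrary choice made only for this sketch.

```python
def route_term(term):
    """Return the query server that owns a term under the a-o / p-z split."""
    first = term[0].lower()
    # Anything outside a-o (including non-letters) goes to B in this sketch.
    return "A" if "a" <= first <= "o" else "B"

def servers_for_query(query):
    """Only servers owning at least one query term need to be contacted."""
    return sorted({route_term(term) for term in query.split()})

server = route_term("people")   # "people" starts with p, so it belongs to B
```

With a local inverted file, by contrast, every node would have to be asked for every query, since any node may hold pages containing the term.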

    Main data structures

    The Indexer module takes the pages that the crawler has downloaded into the repository, indexes them, and stores the result in a database. The database is created during indexing. The following is the main database structure found in most search engines:

    a. A keyword file made up of records, each with at least two fields: keyword ID and keyword (figure a). The keywords are established during indexing: read the text file, extract the keywords, and check whether each keyword is already in the keyword file. If it is not, create a new record for it in the keyword file, which also assigns it an ID. If it is, look up its ID. The ID obtained is then used to create the next record of the occurrence file.

    b. A document file made up of records, one per document managed by the system, with at least the fields: document ID, document name (URL), and the local address where the document file is stored (the cache of the Web pages) (figure b).

    c. An occurrence file made up of records with three fields: document ID, keyword ID, and the position at which the keyword occurs in the document (figure c). This is the inverted index file.

    Database organization: a hash structure keyed on the vocabulary words.

    Challenges

    - Building an inverted index involves parsing the pages into postings, sorting the postings by term and position, and finally writing the sorted postings out as a collection of inverted lists. Index build time is not very critical; with a Web-scale collection, however, some index files become hard to manage, require large resources (such as memory), and often take a long time to complete. A comparison with traditional retrieval systems shows that, in the system under study, the repository holds 40 million Web pages; although this represents only 4% of the indexable Web, it is already larger than the standard retrieval benchmark (the TREC-7 collection) of 100 GB.

    - Because the content of Web pages changes quickly, the index must be rebuilt to keep it fresh. Part of the crawler's work is to refresh the downloaded pages, and the index files are rebuilt in parallel with that work.

    - Finally, the storage format of the inverted index must be designed carefully. A compressed index improves query performance because more of the index can be held in memory. The price to pay is the time spent on decompression.

    a.3. Computing PageRank

    Search engines have two important features that help them return highly accurate results. First, they use the link structure of the Web to compute an importance value for each Web page (PageRank). Second, they use links to rank the results (Ranking). The link graph between Web pages is what allows PageRank to be computed quickly.

    PageRank is defined as follows:

    Suppose page A has pages T1, T2, ..., Tn pointing to it. The parameter d is a damping factor with a value between 0 and 1; it is usually set to d = 0.85. C(A) is the number of links going out of page A. The PageRank of A is then:

    PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

    The PageRank of a page represents a probability distribution over the pages of a given collection, so the PageRank values of all pages in the collection sum to 1. Strictly, this holds when the constant term is (1 - d)/N for a collection of N pages; with the formula as written the values sum to N.

    [Figure 2.2: pages V1, V2, ..., Vm point to page U; the edges carry the rank shares RV1/NV1, ..., RVm/NVm.]

    The computation is iterated until it converges.

    With d = 0.85, about 20 iterations are needed for a collection of a few million pages. Computing PageRank for 26 million Web pages on a medium-sized workstation takes a few hours.
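The iteration can be sketched directly from the formula. This is a minimal illustration, assuming every page has at least one outgoing link and using a fixed number of iterations instead of a convergence test; the three-page graph is hypothetical.

```python
def pagerank(out_links, d=0.85, iterations=20):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pages = set(out_links) | {t for targets in out_links.values() for t in targets}
    pr = {page: 1.0 for page in pages}            # initial value for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = sum(pr[t] / len(out_links[t])
                           for t in pages
                           if page in out_links.get(t, ()))
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # hypothetical link graph
ranks = pagerank(graph)
```

In this sketch page C, which is linked from both A and B, ends up with a higher rank than B, which receives only half of A's rank. With the formula in this form the ranks of the three pages sum to 3, the number of pages.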

    2.3. WEAKNESSES OF SEARCH ENGINES

    1. They are automatic search systems: the user plays no role in the search process, and there is no user feedback mechanism for updating the search parameters in order to improve later searches.

    2. All keywords are treated as equally important, so keywords cannot be given different weights. In large search engines such as Google or Yahoo, entering "System Information" returns all the Web pages related to the two words "System" and "Information". If a user wants to search for "Computer Story" with "Computer" mattering more than "Story" (for example, "Computer" with weight 0.8 and "Story" with weight 0.2), a search system that supports this has to be built.

    3. They do not address the nature of text processing, namely synonymy and polysemy. Many documents are relevant to the content sought but do not contain the keywords entered, only synonyms of them, and these documents are missed by the search.

    Because most engines search by keyword on top of an index of the Web pages (index-based search engines), hundreds of documents may contain the keywords entered. The search engine then returns a large number of documents, many of which have little or nothing to do with the content sought.

    2.4. A NEW SEARCH PROBLEM

    Every day billions of people access the Internet, and about as many run searches on various search engines. If the information from each of these searches were collected, it would form a huge source of information, and knowing how to use it would make many useful tasks possible. Search in ordinary search engines merely answers the customer's information need and does not yet exploit the information that the customer provides with every search. Below we propose a problem that would extend the functionality of search engines, together with a direction for solving it.

    Problem:

    Based on the documents that customers view or download, analysis tells us which kinds of content customers concentrate on within our set of Web pages, so that we can add more of the documents customers care about, and vice versa. On the customer side, analysis also tells us which topics a customer tends to focus on, so that we can offer the customer additional support.

    Approach:

    Build a database of documents with a ClassificationID field that records which domain each document belongs to, based on the results of prior analysis (by classification).

    Build a database of customers. Before a customer accesses the database, ask them to register an account with the following information: name, age, address, and so on; we also add two important fields, occupation and education level (with the accuracy of the information being c%). Registering an account is optional for the customer. Afterwards, each time the customer accesses the database, we record the documents they access in the customer information table. Then, from the information about the documents accessed and the information about the customers, a decision tree algorithm is used to generate rules stating which domain customers with a given occupation and education level are interested in, with confidence at the threshold c.

    2.5. CONCLUSION

    Chapter 3. THE CLASSIFICATION PROBLEM

    3.1. PROBLEM STATEMENT

    People naturally tend to divide things into parts and classes. In the same way, a classification algorithm is simply a mapping from an existing database to some set of values, based on one attribute or a set of attributes of the data.

    Researchers consistently define text classification as the assignment of predefined topics to text documents on the basis of their content. Text classification supports information retrieval, information extraction, text filtering, and the automatic routing of documents to predefined topics. Text is classified with supervised machine learning. The data set is split into a training set and a test set: the model is first built from the examples of the training set, and its accuracy is then checked on the test set.

    The figure below shows a framework for text classification with three main stages. The first stage is text representation, that is, converting the text data into some structured form and collecting the given examples into a training set. The second stage applies machine learning techniques to the training examples just represented, so the representation produced by the first stage is the input of the second. The third stage adds extra knowledge supplied by the user in order to increase accuracy in the text representation or in the learning process.

    Many machine learning methods can be applied in the second stage: Bayesian network models, decision trees, k-nearest neighbours, neural networks, SVM, and others.

    [Figure: input data -> classification algorithm -> Class 1, Class 2, ..., Class n]

    3.2. TEXT REPRESENTATION METHODS

    3.2.1. Text representation methods in full-text databases

    There are three typical full-text database models: the logical model, the syntactic model and the vector model.

    a. Syntactic analysis model

    a.1. Storage rules:

    - Every document is parsed syntactically, which yields detailed information about the topic of the document.

    - The topics of each document are then indexed. Indexing topics works like indexing documents, except that only the words appearing in the topic are indexed.

    - Documents are managed through these topics so that they can be retrieved on request; search queries are answered against these topics.

    a.2. Search rules:

    Search queries are answered against the indexed topics, so the topics must be indexed first (over all the words they contain). A query can itself be parsed syntactically to obtain a topic, and the search is then carried out on that topic.

    The main processing component of a database system built on this model is therefore the system that parses documents and recognizes their content.

    a.3. Advantages and disadvantages

    Advantages

    Once the topics are available, searching with this method is quite effective and simple, because the search is fast and accurate.

    For languages with a simple grammar, the analysis above can reach a high, acceptable level of accuracy.

    Disadvantages

    The quality of a system built this way depends entirely on the quality of the syntactic analysis and content recognition system. In practice, building such a system is very complex and depends on the characteristics of each language, and most such systems have not yet reached high accuracy.

    b. Logical model

    In this model the meaningful words of a document are indexed, and the content of the document is managed through these index entries.

    b.1. Storage rules

    - Every document is indexed as follows:

    Collect the meaningful words in the documents, that is, the words that carry the main information about the stored documents.

    Index the incoming documents against this list of keywords. For each keyword in the list, store the position at which it occurs in each document and the name of the document that contains it.

    For example, take two documents with IDs VB1 and VB2:

    "Cộng hòa xã hội chủ nghĩa Việt Nam" (VB1)

    "Việt Nam dân chủ cộng hòa" (VB2)

    They are then represented as follows:

    Term     Document ID (position)
    Cộng     VB1(1), VB2(5)
    Hòa      VB1(2), VB2(6)
    Xã       VB1(3)
    hội      VB1(4)
    chủ      VB1(5), VB2(4)
    nghĩa    VB1(6)
    Việt     VB1(7), VB2(1)
    Nam      VB1(8), VB2(2)
    Dân      VB2(3)

    b.2. Search rules:

    A search query is expressed in logical form, that is, as a set of operations (AND, OR, ...) applied to words or phrases. The search uses the index table that was built, and the result is the set of documents that satisfy all the conditions.

    b.3. Advantages and disadvantages

    Advantages

    - Search is fast and simple. Suppose we search for the word "computer". The system scans the index table to locate the corresponding entry, if "computer" exists in the system. This lookup is fast and simple when the index table has been sorted alphabetically beforehand: with binary search over the sorted table it costs O(log2 n), where n is the number of words in the index table. The entry found then tells us which documents contain the word. A search involving k words therefore needs on the order of k.log2 n operations.

    - Queries are flexible. Special characters can be used in a query without affecting the complexity of the search. For example, searching for "ta%" returns the documents that contain the words "ta", "tao", "tay", ..., that is, the words beginning with "ta". The character % is called a wildcard character.

    In addition, the logical operators let the words sought be combined into flexible queries. For example, in the query [tôi, ta, tao] the brackets [] denote a search for any one of the words in the group. This is simply a flexible way of writing the OR operation of Boolean algebra, instead of writing: find the documents that contain the word "tôi" or the word "ta" or the word "tao".

    Disadvantages:

    - The searcher must have expertise in the domain being searched. Because the query is expressed in logical form, the result is also logical (Boolean): a document is returned only when it satisfies every condition entered. To find documents by content, the searcher must therefore know precisely what the documents contain.

    - Indexing the documents is time-consuming and complex.

    - The index tables take up storage space.

    - The documents found are not ranked by relevance.

    - The index tables are inflexible. When the vocabulary changes (words are added, deleted, ...), the index must change with it.
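The AND, OR and wildcard queries of the logical model can be sketched over a small index. The index below is built by hand from the syllables of the two example phrases (lower-cased, positions omitted); the function names are hypothetical, and the prefix search scans every term for simplicity instead of using a sorted table.

```python
index = {
    "cộng": {"VB1", "VB2"}, "hòa": {"VB1", "VB2"}, "xã": {"VB1"},
    "hội": {"VB1"}, "chủ": {"VB1", "VB2"}, "nghĩa": {"VB1"},
    "việt": {"VB1", "VB2"}, "nam": {"VB1", "VB2"}, "dân": {"VB2"},
}

def search_and(index, *terms):
    """Documents that contain every term."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Documents that contain at least one term (the [t1, t2, ...] form)."""
    return set().union(*(index.get(term, set()) for term in terms))

def search_prefix(index, prefix):
    """Documents containing any term that starts with prefix (the 'prefix%' form)."""
    return set().union(*(docs for term, docs in index.items()
                         if term.startswith(prefix)))

both = search_and(index, "dân", "chủ")   # only VB2 contains both syllables
```

The results are plain sets with no ordering, which reflects the disadvantage noted above: the documents found are not ranked.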

    c. Vector space model

    c.1. Storage rules

    One of the typical ways to represent text in general is the vector space. In this representation each document is represented by a vector. Each component of the vector corresponds to a distinct term of the original document collection (corpus) and is assigned a value by a function f of the density of the term in the document.

    We can represent documents with the terms as components and with f giving the number of times each term occurs; this is called the bag-of-words representation.

    For example, document vb1 is represented by a vector V(v1, v2, ..., vn), where vi is the number of occurrences of the i-th keyword (ti) in vb1.

    Consider the following two documents:

    Document 1: Computer is not only computer
    Document 2: Computer is life

    Term       Vector for document 1   Vector for document 2
    Computer   2                       1
    Is         1                       1
    Life       0                       1
    Not        1                       0
    Only       1                       0

    There are many criteria for choosing the function f, and so many different weight values can be produced. Some of these criteria follow.
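As an illustration of the bag-of-words counts for the two example documents, here is a minimal sketch. It assumes whitespace tokenization, lower-casing and a fixed vocabulary order; the function name is hypothetical.

```python
from collections import Counter

vocabulary = ["computer", "is", "life", "not", "only"]

def bag_of_words(text, vocabulary):
    """Return the term-count vector of a text over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

v1 = bag_of_words("Computer is not only computer", vocabulary)
v2 = bag_of_words("Computer is life", vocabulary)
```

The first vector has the value 2 for "computer" because the word occurs twice in the first document; word order is lost, which is what "bag" refers to.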

    Boolean model

    Suppose a database contains m documents D = {d1, d2, ..., dm}, each represented as a vector over n terms T = {t1, t2, ..., tn}. Let W = (wij) be the weight matrix, where wij is the value of term ti in document dj.

    The Boolean model is the simplest model and is defined as follows:

    $$ w_{ij} = \begin{cases} 0 & \text{if } t_i \text{ does not occur in } d_j \\ 1 & \text{otherwise} \end{cases} $$

    For example, take the following two documents:

    Document 1: Computer is not only computer
    Document 2: Computer is life

    Term       Vector for document 1   Vector for document 2
    Computer   1                       1
    Is         1                       1
    Life       0                       1
    Not        1                       0
    Only       1                       0

    Frequency model

    The frequency model sets the entries of the matrix W = (wij) to positive numbers based on the frequency of the terms within a document or on the frequency of documents within the database. There are three common methods.

    Term frequency method (TF)

    The value of a term is computed from the number of times the term occurs in the document. Let tfij be the number of occurrences of term ti in document dj; then wij is given by one of:

    $$ w_{ij} = tf_{ij} \quad \text{or} \quad w_{ij} = 1 + \log(tf_{ij}) \quad \text{or} \quad w_{ij} = \sqrt{tf_{ij}} $$

    Inverse document frequency method (IDF)

    The value of a term is given by the following formula, where dfi is the number of documents in which term ti occurs:

    $$ w_{ij} = \log\frac{m}{df_i} = \log(m) - \log(df_i) $$

    TF.IDF method

    This method combines the TF and IDF methods. The weight matrix is computed as follows:

    $$ w_{ij} = \begin{cases} [1 + \log(tf_{ij})] \, \log\frac{m}{df_i} & \text{if } tf_{ij} \ge 1 \\ 0 & \text{if } tf_{ij} = 0 \end{cases} $$

    c.2. Search rules

    A query is mapped to a vector Q(q1, q2, ..., qn) in which the vocabulary words have different coefficients: the more meaningful a word is for the content sought, the larger its coefficient.

    qi = 0 when the word is not in the list of words sought.

    qi is nonzero when the word is in the list of words sought, and the larger qi is, the more relevant the word is to the content of the document. The system therefore gives priority to documents that contain search words with high coefficients.

    For example, if the word "Machine" is more important than the word "Computer" in the content sought, the vector Q can be given qk = 2 and qh = 1, corresponding to tk = "Machine" and th = "Computer".

    Given a vocabulary, we can thus determine the vector of each document, and each query entered also has a corresponding vector whose coefficients are determined in advance. Search and management are carried out on these vectors.

    Describing the content of documents and queries by vectors gives the following method for searching and storing full-text documents:

    1. Each document is encoded as a vector.
    2. The documents are classified according to these vectors.
    3. Each query entered is also encoded as a vector.

    The search is carried out by multiplying the query vector with the vector of each document in turn.

    The result returned is every document related to the search query.
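The TF.IDF weighting and the query-document product can be sketched together. This is a minimal illustration, assuming natural logarithms, whitespace tokenization and the two example documents; the function names and the query weights are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocabulary):
    """w_ij = (1 + log tf_ij) * log(m / df_i) if tf_ij >= 1, else 0."""
    m = len(docs)
    counts = [Counter(doc.lower().split()) for doc in docs]
    df = {t: sum(1 for c in counts if c[t] > 0) for t in vocabulary}
    return [[(1 + math.log(c[t])) * math.log(m / df[t]) if c[t] >= 1 else 0.0
             for t in vocabulary]
            for c in counts]

def rank(query_vector, doc_vectors):
    """Score each document by its dot product with the query, best first."""
    scores = [sum(q * w for q, w in zip(query_vector, vec)) for vec in doc_vectors]
    order = sorted(range(len(doc_vectors)), key=lambda j: scores[j], reverse=True)
    return order, scores

docs = ["Computer is not only computer", "Computer is life"]
vocabulary = ["computer", "is", "life", "not", "only"]
vectors = tfidf_vectors(docs, vocabulary)
# Query in which "life" has coefficient 2 and "not" has coefficient 1.
order, scores = rank([0, 0, 2, 1, 0], vectors)
```

Words that occur in every document, such as "computer" here, get weight 0 because log(m / df) is 0, so they do not influence the ranking. The second document is ranked first because it contains the more heavily weighted query word.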

    c.3. Advantages and disadvantages

    Advantages

    - The documents returned can be ranked by their relevance to the content requested, because the product yields for each document a score that measures its relevance to the request.

    - Formulating search queries is easy and does not require the searcher to have a high level of expertise in the subject.

    - Storage and search are simpler than in the logical method. The searcher can choose how many of the most accurate documents are returned.

    Disadvantages

    - Search is rather slow when the vocabulary is large, because the computation runs over the vectors of all documents.

    - Representing the vectors with natural-number coefficients increases the accuracy of the search but greatly reduces the speed of computation, because the vector products must be computed over natural or real numbers; storing the vectors is also costly and complex.

    - The system is inflexible in how it stores keywords. A very small change to the vocabulary table forces either a re-vectorization of all stored documents or the omission of the newly added meaningful words from the documents already encoded. Given the advantages of the method, this small error can be ignored, because nowadays the set of meaningful words is encoded fairly completely before the documents are encoded. The vector method therefore continues to attract interest and use.

    - A further disadvantage is that the dimension of each vector is very large, because it equals the number of distinct words in the document collection. For example, the number of words can range from 10^3 to 10^5 in a small collection and is larger still in a large collection, especially on the Web.

    Remedy: several methods exist for reducing the dimension of the vectors. A simple and effective one is to remove stop words. Stop words are words that express sentence structure and not the content of the document, such as conjunctions and prepositions. They occur very often in a document but are unrelated to its topic and content, so removing them reduces the dimension of the vectors without affecting the effectiveness of the search.

    Some examples of stop words:

    Vietnamese   English
    Và           a
    Hoặc         the
    Cũng         do
                 about
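Stop-word removal can be sketched as a simple filter applied before the vectors are built. The stop-word set below contains only the English examples listed in the table; a real list would be much longer, and the function name is hypothetical.

```python
STOP_WORDS = {"a", "the", "do", "about"}   # English examples from the table

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words so that they do not become vector dimensions."""
    return [word for word in text.lower().split() if word not in stop_words]

tokens = remove_stop_words("The thesis is about the Web")
```

Only the remaining tokens contribute dimensions to the document vector, which is how the filter reduces the vector size.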

    3.2.2. Text representation methods in hypertext databases

    Chapter 1 described the difficulties of searching Web data and the differences between the structure of a traditional document and that of a hypertext document. Because of these difficulties, the way data is represented in search engines is very important. How should Web pages be represented so that a huge number of them can be stored while the search engine can still search quickly and return accurate results to the user?

    a. Representing hypertext documents in search engines (inverted index)

    The Indexer module takes the pages that the crawler has downloaded into the repository, indexes them, and stores the result in a database. The database is created during indexing. The following is the main database structure found in most search engines:

    - A keyword file made up of records, each with at least two fields: keyword ID and keyword. The keywords are established during indexing.

    - A document file made up of records, one per document managed by the system, with at least the fields: document ID, document name (URL), and the local address where the document file is stored (the cache of the Web pages).

    - An occurrence file made up of records with three fields: document ID, keyword ID, and the position at which the keyword occurs in the document.

    Advantages: the positions at which words occur are represented (it is known in which kinds of tag a keyword appears, and whether it appears in the title or in the body), so the information about the importance of the keywords is stored.

    Disadvantages: the frequency of the keywords is not represented, so Web pages cannot be searched by content.

    b. Representing hypertext documents in the vector model

    In his doctoral dissertation, Seán Slattery [May 2002_CMU-CS-02-142] proposed four ways of representing hypertext documents in the vector model.

    Method 1

    Ignore all link information between neighbouring documents and represent only the content of the document itself. This is the bag-of-words representation.

    If the content of the neighbouring documents is known to be completely independent of the class, this representation is a good choice. In practice the neighbouring documents provide a good deal of information that is useful for classification, so this representation is not effective.

    Method 2

    The simplest way to use the content of the neighbouring documents is to combine the content of the document with the content of all its neighbours into a single super_document. The components of the vector are then the frequencies of the keywords in the super_document.

    The limitation of this representation is that it blurs the distinction between the document and its neighbours and so introduces noise into the classification. It works well only when the documents pointed to share the topic of the document being classified.

    Method 3

    In this representation the vector is split into two parts: the first part represents the keywords of the document being classified, and the second part represents the keywords that occur in all of its neighbouring documents.

    This overcomes the drawback of the previous method, since the target document is no longer diluted by its neighbours. If the neighbouring documents are useful for classification, their content is still easy to reach. The drawback of this representation is the large dimension of the vector.

    Method 4

    This representation works as follows:

    - Determine the number of neighbouring pages over the whole hypertext collection under consideration; let d be that number.

    - Structure the vector in d + 1 parts:

    The first part represents the document being classified directly.

    Parts 2 to d + 1 represent the neighbouring documents, one part per neighbour.

    The resulting vector is clearly very large and, moreover, does not follow a single rule: there are many ways to order the parts from the second one onwards. This variety in the representation makes it hard to select the data samples from which to build the model.
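The four representations can be compared on a toy page with one neighbour. This is only a sketch under assumptions: pages are plain strings, the vocabulary is fixed, and in the fourth method missing neighbours are padded with zero blocks, which is a choice made here and not something stated in the text. All names are hypothetical.

```python
from collections import Counter

def count_vector(text, vocabulary):
    """Term counts of a text over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def method1(page, neighbours, vocabulary):
    """Own words only; neighbours are ignored."""
    return count_vector(page, vocabulary)

def method2(page, neighbours, vocabulary):
    """One super-document made of the page and all its neighbours."""
    return count_vector(" ".join([page, *neighbours]), vocabulary)

def method3(page, neighbours, vocabulary):
    """Own-word counts followed by counts pooled over all neighbours."""
    return (count_vector(page, vocabulary)
            + count_vector(" ".join(neighbours), vocabulary))

def method4(page, neighbours, vocabulary, d):
    """d + 1 blocks: the page, then one block per neighbour (zero-padded)."""
    blocks = [count_vector(page, vocabulary)]
    blocks += [count_vector(n, vocabulary) for n in neighbours[:d]]
    blocks += [[0] * len(vocabulary)] * (d - len(neighbours[:d]))
    return [x for block in blocks for x in block]

vocabulary = ["list", "vector", "paul"]
page, neighbours = "list vector", ["paul list"]
vectors = {
    1: method1(page, neighbours, vocabulary),
    2: method2(page, neighbours, vocabulary),
    3: method3(page, neighbours, vocabulary),
    4: method4(page, neighbours, vocabulary, d=2),
}
```

The vector lengths grow from n (methods 1 and 2) to 2n (method 3) and (d + 1)n (method 4), which makes the dimension problem mentioned above concrete.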

    From the representations above we draw the following observations about representing hypertext documents in the vector model.

    Advantages:

    - The potential information in the hyperlinks is exploited.

    - The frequency of the words is represented, so documents can be searched by similarity of content.

    Disadvantages:

    - The positions at which words occur are not represented, so important information about a keyword is lost; for example, a keyword that appears in the title or inside bold tags is more important than one that appears elsewhere.

    - The dimension of the vectors is very large.

    c. Representing hypertext documents in the relational model

    The relational model is a natural representation for hypertext documents. It is easy to construct a binary relation (the link between documents) whose first argument is the name of the document that contains the hyperlink and whose second argument is the name of the document pointed to.

    a) What is a relation?

    To understand the advantages of relational learning, we first compare it with propositional algorithms, which work with independent examples or entities. All that a propositional learner needs to know about the training examples is the description of each example itself. Likewise, when it classifies an example, a propositional learner considers only the information about that example and ignores its relationships with other examples.

    A relational representation includes the propositional representation (such as the vector model, bag of words, or set of words) together with information about the relationships between examples. For instance, if our training examples are "people", a propositional representation only describes information such as the name, age, job and salary of each person, whereas a relational representation holds all of that plus further information, such as the employer-employee relationship or the marriage relationship.

    A relational representation therefore gives us the opportunity to search the whole rich space of relationships. If we believe that related examples can be a useful source of information for classifying some examples, the relational representation is appropriate; if the related examples provide no additional useful information, the relational representation cannot do better than the propositional representation.

    Relational representation for hypertext

    The relations:

    Link_to(page, page): this relation represents the hyperlinks, that is, the link structure between the pages of the whole Web collection. The fact that page 15 contains a hyperlink to page 37 is written: link_to(page15, page37).

    Has_word(page): provides information about the content of each Web page. Only the words we are interested in (those later chosen as keywords) are represented. For example, has_computer(A) means that page A contains the word "computer".

    Negation can also be expressed: not(link_to(page15, page37)) means that page15 does not link to page37, and not(has_computer(A)) means that page A does not contain the word "computer".

    Example: take the following two Web pages A and B:

    [Figure: page A contains the words List, Vector, Common; page B contains the words Jame, Paul, Link]

    Suppose A is a student home page within the set of Web pages of a university.

    Page A is then represented as follows:

    A :- has_common(A), has_list(A), has_vector(A), link_to(B,A), has_jame(B), has_link(B), has_paul(B), not(has_home(A))

    In natural language this rule reads: a page that contains the keywords "list", "vector" and "common" but not the keyword "home", and that is linked to by a page containing the words "jame", "paul" and "link", is a student home page.
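The relational facts and the example rule can be sketched with plain sets of tuples. This is only an illustration: the data structures and the function name are hypothetical, and the rule follows the natural-language reading (keywords "list", "vector", "common"; linking page with "jame", "paul", "link").

```python
# Facts for the two example pages, stored as sets of tuples.
has_word = {("A", "list"), ("A", "vector"), ("A", "common"),
            ("B", "jame"), ("B", "paul"), ("B", "link")}
link_to = {("B", "A")}   # link_to(B, A): page B links to page A

def is_student_home_page(page):
    """Check the body of the example rule against the facts for one page."""
    own_words_ok = (all((page, w) in has_word for w in ("list", "vector", "common"))
                    and (page, "home") not in has_word)
    linked_from_ok = any(
        target == page
        and all((source, w) in has_word for w in ("jame", "paul", "link"))
        for source, target in link_to
    )
    return own_words_ok and linked_from_ok

result_a = is_student_home_page("A")
result_b = is_student_home_page("B")
```

The second condition inspects the words of another page (the one that links to the candidate), which is exactly the information a propositional representation of page A alone would not contain.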

    3.3. MACHINE LEARNING METHODS

    3.3.1. The Bayes classification algorithm

    The Bayes classifier is one of the most typical classification algorithms in data and knowledge mining. Its main idea is to compute the posterior probability that event c belongs to class x under the classification, based on the prior probability that event c belongs to class x under condition T.

    Let V be the set of all vocabulary words.

    Suppose there are N document classes: C1, C2, ..., Cn.

    Each class Ci has a probability p(Ci) and a threshold CtgTshi.

    Let p(C | Doc) be the probability that document Doc belongs to class C.

    Given a class C and a document Doc, if the computed probability p(C | Doc) is greater than or equal to the threshold of C, then Doc is assigned to class C.

    The document Doc is represented as a vector whose size is the number of keywords in the document. Each component holds one word of the document and the frequency of that word in the document. The algorithm works on the vocabulary V, the vector representing Doc and the documents already in the classes; it computes p(C | Doc) and decides which class Doc belongs to.

    The probability p(C | Doc) is computed by the following formula:

    where:

    p(c | x, ) = p(c | x,T) p(T |x)T in

    In which:

    |V|: the number of words in the set V

    Fj: the j-th keyword of the vocabulary

    TF(Fj | Doc): the frequency of the word Fj in document Doc (including synonyms)

    TF(Fj | C): the frequency of the word Fj in class C (the number of times Fj occurs in all the documents of class C)

    P(Fj | C): the conditional probability that the word Fj occurs in a document of class C

    The formula for P(Fj | C) uses the Laplace probability estimate. The 1 in the numerator of this formula avoids a zero value when the word Fj does not occur in class C and its frequency in C is therefore 0.

    To reduce the complexity and the time of the computation, note that a given document Doc does not contain every word of the vocabulary V. Hence TF(Fj | Doc) = 0 whenever Fj belongs to V but not to Doc, and in that case (P(Fj | C))^TF(Fj, Doc) = 1. Formula (1) can therefore be rewritten as follows:

    where:

    Classification therefore does not rely on the whole vocabulary but only on the keywords that occur in the document Doc.
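The computation can be sketched as follows. The formulas themselves did not survive in this copy, so the sketch assumes the standard multinomial form that matches the definitions above: p(C | Doc) proportional to p(C) times the product of P(Fj | C)^TF(Fj, Doc), with the Laplace estimate P(Fj | C) = (1 + TF(Fj | C)) / (|V| + total term count of C). The class priors are estimated from document counts; all names and the training data are hypothetical.

```python
import math
from collections import Counter

def train(class_docs):
    """class_docs: {class_name: [documents]} -> priors, term counts, vocabulary."""
    total_docs = sum(len(docs) for docs in class_docs.values())
    priors = {c: len(docs) / total_docs for c, docs in class_docs.items()}
    counts = {c: Counter(w for doc in docs for w in doc.lower().split())
              for c, docs in class_docs.items()}
    vocabulary = {w for cnt in counts.values() for w in cnt}
    return priors, counts, vocabulary

def posterior(doc, priors, counts, vocabulary):
    """p(C | Doc) for every class, using only the words that occur in Doc."""
    tf_doc = Counter(w for w in doc.lower().split() if w in vocabulary)
    log_scores = {}
    for c in priors:
        denom = len(vocabulary) + sum(counts[c].values())
        log_scores[c] = math.log(priors[c]) + sum(
            tf * math.log((1 + counts[c][w]) / denom)   # Laplace estimate
            for w, tf in tf_doc.items())
    top = max(log_scores.values())                      # for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

training = {"sport": ["football match goal", "match referee"],
            "tech": ["computer network", "computer software match"]}
model = train(training)
probs = posterior("computer match", *model)
```

The loop over `tf_doc` visits only the words present in the document, which is the simplification described above: words with frequency 0 contribute a factor of 1 and can be skipped. A document would then be assigned to a class when its probability reaches that class's threshold.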

    3.3.2. The k-nearest-neighbour algorithm

    This algorithm does not rely on the vocabulary. It still uses the threshold CtgTsh, however, and follows the steps described above: it selects k documents at random and computes the probability p(C | Doc) from the similarity between the document Doc and the k selected documents. The probability p(C | Doc) is computed by the following formula:

    In which:

    n: the number of classes

    k: the number of documents selected for comparison

    P(Ci | Dj): takes the value 0 or 1 and indicates whether document Dj belongs to class Ci. This value is used because a document may belong to more than one class.

    Sm(Doc, Dj): measures the similarity between the document Doc and the selected document Dj, computed as the cosine of the angle between the two vectors that represent Doc and Dj.

    The documents are represented exactly as in the first algorithm, the Bayes classifier, that is, by the keywords Fi and their corresponding frequencies Xi.

    In formula (4):

    Xi is the frequency of the i-th keyword (based on the synonyms that occur in document Doc)

    Yi is the frequency of the i-th word (based on the synonyms that occur in document Dj)
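A possible reading of this scoring can be sketched as below. The formula itself did not survive in this copy, so the sketch assumes that p(C | Doc) is the similarity-weighted vote for C among the k selected documents, normalized over all classes; that normalization, the function names and the sample vectors are assumptions.

```python
import math

def cosine(x, y):
    """Sm(Doc, Dj): cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def knn_posterior(doc_vec, selected, classes):
    """selected: list of (vector, set_of_classes) for the k chosen documents."""
    raw = {c: sum(cosine(doc_vec, vec) for vec, labels in selected if c in labels)
           for c in classes}                     # P(C | Dj) is 1 when c in labels
    total = sum(raw.values())
    return {c: (raw[c] / total if total else 0.0) for c in classes}

selected = [([2, 0, 1], {"tech"}),
            ([0, 3, 0], {"sport"}),
            ([1, 1, 0], {"tech", "sport"})]      # a document may have two classes
probs = knn_posterior([1, 0, 0], selected, ["tech", "sport"])
```

The third selected document belongs to both classes, so its similarity counts towards both, which is why membership is a 0/1 indicator per class.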


    3.3.3. Classification based on decision trees

    Decision tree learning is a widely used method for inductive learning from a large sample. It approximates a discrete-valued target function. A decision tree can also be converted into an equivalent knowledge representation in the form of If-then rules. Among decision tree learning algorithms, ID3 and C4.5 are the two best known. The ID3 algorithm is as follows.

    ID3 (Examples, Target_attribute, Attributes)

    1. Create a root node Root for the decision tree.

    2. If all Examples are positive, return the single-node tree Root with label +.

    3. If all Examples are negative, return the single-node tree Root with label -.

    4. If Attributes is empty, return the single-node tree Root labeled with the most common value of Target_attribute in Examples.

    5. Otherwise Begin

    5.1. A


    The best attribute is the one with the highest information gain.

    Machine learning with decision trees is very effective because it can handle a large number of attributes and, moreover, a system of learned rules can be extracted from the resulting tree.
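The attribute selection criterion can be sketched as follows. It uses the standard entropy-based definition of information gain; the attribute names and the four toy examples are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(examples, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([e[target] for e in examples])
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder


def best_attribute(examples, attributes, target):
    """Pick the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(examples, a, target))


# Hypothetical examples: "has_goal" separates the classes, "long" does not.
examples = [
    {"has_goal": "y", "long": "y", "cls": "+"},
    {"has_goal": "y", "long": "n", "cls": "+"},
    {"has_goal": "n", "long": "y", "cls": "-"},
    {"has_goal": "n", "long": "n", "cls": "-"},
]
chosen = best_attribute(examples, ["has_goal", "long"], "cls")
```

ID3 would place the chosen attribute at the current node and recurse on each subset of examples.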

    3.3.4. The relational learning algorithm FOIL

    a. Horn clauses

    A Horn clause is a clause with at most one positive literal. It has the form:

    H \/ (-L1) \/ (-L2) \/ ... \/ (-Ln)

    Here H, L1, L2, ..., Ln are positive literals, while -L1, -L2, ..., -Ln are negative literals.

    Written as a rule:

    (L1 ^ L2 ^ ... ^ Ln) => H. This form is called a First_Order rule.

    L1, L2, ..., Ln are called the preconditions. H is called the conclusion.

    Examples of First_Order rules:

    If Parents(x,y) then Ancestor(x,y)

    If (Parents(x,z) ^ Ancestor(z,y)) then Ancestor(x,y).

    Here Parents and Ancestor are called predicates.

    b. The FOIL algorithm

    FOIL was proposed and developed by Quinlan (Quinlan, 1990). FOIL learns from data sets that contain only two classes, positive examples and negative examples, and it learns a description of the positive class. Its input consists of preconditions and conclusions. Its output is a set of rules generated from those preconditions and conclusions. At each step FOIL adds one literal to the preconditions of the rule being learned. The algorithm uses the function Foil_Gain to choose a literal from the set of candidate literals.

    FOIL is a non-incremental learner that uses hill climbing with an information-theoretic metric to build rules that cover the data. FOIL has two main stages:

    1. Separate stage: begins a new rule.

    2. Conquer stage: combines literals to build the body of the clause.

    The separate stage starts a new rule, while the conquer stage builds a conjunction of literals that forms the body of the rule. Each rule describes some subset of the positive examples and no negative example. Note that FOIL has two operators: start a new rule with an empty body, and add a literal to complete the current rule. FOIL stops adding literals when the rule no longer covers any negative example, and it keeps starting new rules until every positive example is covered by some rule.

    The positive examples covered by a clause are removed from the training set, and the process continues learning further clauses from the remaining examples. It terminates when no positive examples are left.

    The top-level design of FOIL is as follows:

    1. Let POS be the set of positive examples.

    2. Let NEG be the set of negative examples.

    3. Set NewClauseBody to empty.

    4. While POS is not empty do:

        Separate: (start a new rule)

        5. Remove from POS all examples that satisfy NewClauseBody.

        6. Reset NEG to the original set of negative examples.

        7. Reset NewClauseBody to empty.

        While NEG is not empty do:

            Conquer: (build the clause body)

            8. Choose a literal L.

            9. Conjoin L to NewClauseBody.

            10. Remove from NEG the negative examples that do not satisfy L.

    FOIL uses hill climbing to add to a rule the literal with the highest information gain. For each variabilization of a predicate P, FOIL measures the information gained. To select the literal with the highest gain, it needs to know how many of the current positive and negative tuples are satisfied by the variabilizations of each extensionally defined predicate.


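    The information gain used by FOIL is computed as:

    Gain(Literal) = T++ * (log2(P1/(P1+N1)) - log2(P0/(P0+N0)))

    P0 and N0 are the numbers of positive and negative examples before the literal L is added to the clause.

    P1 and N1 are the numbers of positive and negative examples after the literal L is added to the clause.

    T++ is the number of positive examples covered both before and after the literal is added (that is, the number of positive examples covered by both rule R and rule R', where R' is R after adding the literal L).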
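The gain formula above translates directly into code. This is a small sketch; the function name and the convention of returning 0 when the extended rule covers no positive example (P1 = 0, where the logarithm is undefined) are assumptions made for illustration.

```python
import math


def foil_gain(p0, n0, p1, n1, t):
    """Gain = T++ * (log2(P1 / (P1 + N1)) - log2(P0 / (P0 + N0))).

    p0, n0: positive / negative examples covered before adding the literal.
    p1, n1: positive / negative examples covered after adding the literal.
    t:      positive examples covered both before and after (T++).
    """
    if p1 == 0:
        return 0.0  # assumed convention: a literal covering no positives is useless
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))


# Hypothetical counts: 4+/4- before, 2+/0- after, 2 positives still covered.
gain = foil_gain(4, 4, 2, 0, 2)
```

In this toy case the literal removes every negative example while keeping two positives, so the gain is 2 * (0 - (-1)) = 2.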

    The following example illustrates the FOIL algorithm.

    We want to learn the relation Granddaughter(x,y) from the relations (predicates) Granddaughter, Father, Male, Female and the constants Victor, Sharon, Bob, Tom.

    Example set: assertions over the predicates Granddaughter, Father, Male, Female and the constants Victor, Sharon, Bob, Tom, in which the positive examples are Granddaughter(Victor, Sharon), Father(Sharon, Bob), Father(Tom, Bob), Female(Sharon), Father(Bob, Victor). All remaining examples are negative (for instance -Granddaughter(Tom, Bob), -Father(Victor, Victor), ...).

    To choose literals for the rule, FOIL considers the different bindings of the variables x, y, z, t to the constants above. At the initial step the rule is just:

    - Step 1:

    Initial rule: Granddaughter(x,y)

    The binding {x/Victor, y/Sharon} gives a positive example, because Granddaughter(Victor, Sharon) holds in the training data.

    The remaining 15 bindings correspond to negative examples, because no matching assertion is found in the training set.

    - At each subsequent step, the rule is formed from the set of bindings that yield positive and negative examples. Whenever a literal is added to the rule, the sets of positive and negative examples change. For instance, if the next literal added to the rule is Father(y,z), then instead of the binding {x/Victor, y/Sharon} above, the new binding {x/Victor, y/Sharon, z/Bob} corresponds to a positive example. At each step the numbers of negative and positive examples are counted in order to obtain the information gain Foil_Gain(L,R).
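The count of 16 bindings for the initial rule can be checked with a short enumeration. The sketch takes the four constants and the single Granddaughter(Victor, Sharon) fact from the example set; the variable names are hypothetical.

```python
from itertools import product

constants = ["Victor", "Sharon", "Bob", "Tom"]
# The only positive Granddaughter assertion listed in the example set.
positive_facts = {("Granddaughter", "Victor", "Sharon")}

# Every way of binding x and y to a constant: 4 * 4 = 16 bindings.
bindings = [{"x": x, "y": y} for x, y in product(constants, repeat=2)]

# A binding is positive when the instantiated head is among the facts.
positive = [b for b in bindings
            if ("Granddaughter", b["x"], b["y"]) in positive_facts]
negative = [b for b in bindings if b not in positive]
```

Adding a literal that introduces a new variable, such as Father(y,z), multiplies the candidate bindings by the number of constants again, and the counts fed into Foil_Gain are recomputed over the bindings that still satisfy the rule body.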


    Chapter 4. EXPERIMENTAL SYSTEM

    4.1. RELATED RESEARCH

    The experimental system is built by combining the strengths of the solutions proposed in earlier research on text search and text classification. The content and results of that research are summarized below.

    1. [Seán Slattery (May 2002, CMU-CS-02-142)] PhD thesis "HyperText Classification"

    In this PhD thesis the author compares machine learning algorithms applied to Web page classification, each with its corresponding representation:

    1. Naive Bayes, with each document represented as a bag of words

    2. k-nearest neighbors, with a frequency model (TF-IDF) for representing Web pages

    3. The FOIL algorithm, with each document represented as a set of words (ignoring the links in each document)

    4. The FOIL algorithm, with the set-of-words representation plus the link information in the documents

    The author implemented and tested these approaches and reported results, using recall and precision as evaluation criteria.

    Approach 4 performed best, giving clearly higher recall and precision.

    The author then built a new HyperText classifier using the FOIL_PILES algorithm, with documents represented in a relational model.
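For reference, the two evaluation criteria mentioned above can be computed per class as in the sketch below. It uses the standard definitions (precision = correct assignments / all assignments, recall = correct assignments / all true members); the function name and the document ids are hypothetical.

```python
def precision_recall(predicted, actual):
    """predicted: ids the classifier put in the class; actual: ids truly in it."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical outcome: 4 pages assigned to the class, 3 pages truly belong to it.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 5})
```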

    2. [Đoàn Sơn] Master's thesis "Methods using fuzzy logic and their application in FullText data mining"

    In this thesis the author classifies documents using a fuzzy-logic representation of text together with a decision tree learning algorithm.

    This approach has several advantages: using fuzzy concepts reduces the dimensionality of the attributes, which in turn reduces the computation time for learning the decision tree.


    This representation also has limitations, however: considerable human effort may be needed to construct the topics, the concepts and the relationships between them.

    3. [Bùi Quang Minh] The Vietseek search engine. Research report for a special scientific project of Vietnam National University, Hanoi, code QG 02-02.

    In the Vietseek search engine, documents are organized into a database. Vietseek builds all three kinds of index (TextIndex, StructureIndex and UtilityIndex). The Vietseek database is divided into two parts:

    Part 1: data about Web documents, Domain and Word, stored in tables of a MySQL database.

    Part 2: index data, stored separately with its own structure. Because this part requires high speed, it is not stored in MySQL but in 300 separate binary files.

    Vietseek searches by the phrase entered and returns the documents that contain the key phrases; it does not yet perform classification.

    4. [Phạm Thị Thanh Nam] Master's thesis "Some solutions to the search problem in HyperText databases"

    From the index database already built by Vietseek, the author constructs vectors that represent Web pages; the components of a vector are the frequencies of the keywords in the document under consideration.

    The thesis proposes several algorithms:

    - List the Web pages that are "closest in meaning" to the input Web page or search phrase, according to the criterion of "content closeness". Content closeness is obtained by comparing the representation vectors with each other.

    - Compute the importance of a Web page from its links to other Web pages and from the frequency of the search keywords in the page.

    - Combine content closeness and page importance into a single criterion called the combined value. Results are displayed in order of combined value.

    Remarks

    Although the first work [Seán Slattery] gives a fairly broad overview of classification methods and analyzes a number of experimental results, none of the four works above really addresses the design and implementation of refined solutions to the problems of synonyms and multiple languages in classification systems for Web databases. Surveying solutions to these problems and implementing them experimentally is a worthwhile research task.


    Several typical algorithms exist for the classification problem in text databases. Implementing such algorithms experimentally and evaluating their performance on a real Web database (around ten thousand pages) can be regarded as a necessary first step in building and developing Vietnamese search engines.

    4.2. A PROPOSED DATABASE ORGANIZATION AND THE ALGORITHMS APPLIED

    From the HyperText representation methods that have been and are being used and studied, the following general observations can be made. The representation used in search engines has the advantage of exploiting important information about where keywords occur, so that the pages found can be ranked by how close they are to the content of the search keywords; it does not, however, appear to take into account how often the keywords occur in the document. Searching by content is therefore hard to achieve.

    The vector-model representation of Seán Slattery [2002], on the other hand, ignores the positions at which keywords occur, which is very important information for text classification. Moreover, with representation 2 the original document to be classified is diluted among the documents related to it, and classification loses accuracy, especially when the related documents are not on the same topic. With representations 3 and 4, the vectors have very high dimensionality and many repeated components (namely the words that occur again and again across the set of related documents).

    Given the strengths and weaknesses of the methods above, I propose a representation of my own. The main idea is still based on the vector model, while the construction of the keyword file also takes synonyms into account.

    4.2.1. Problem statement

    A set of HyperText document classes is given; each class contains documents (in *.html format) of the same category. Build a system with the following function:

    Read a new document and have the system assign it to an appropriate class.

    4.2.2. Document representation

    A frequency-based vector model is used that takes into account the importance of the positions at which keywords occur, together with the links between pages.

    The vector for a Web page A is built as follows:

    - For each Web page A, list the Web pages that link to A and the pages that A points to.

    - Count the number of times each keyword occurs in A and in the pages related to A. Let count[i] be the number of occurrences of the i-th keyword in the vector representing page A.

    If word i occurs inside the body tag, increase count[i] by only 1.

    If word i occurs inside the title tag, increase count[i] by 3.

    After page A has been counted, multiply count[i] by 3 (the weight of the document being represented), then continue counting in the linked pages, applying the same position weights as in document A; the weight of the related documents is 1.
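The counting rule above can be sketched as follows. The weights (3 for a title occurrence, 1 for a body occurrence, 3 for the page itself, 1 for linked pages) come from the text; the representation of a page as a dict with "title" and "body" token lists, and all names, are assumptions made for illustration.

```python
from collections import Counter

TITLE_WEIGHT, BODY_WEIGHT = 3, 1     # position weights from the text
OWN_WEIGHT, NEIGHBOUR_WEIGHT = 3, 1  # document weights from the text


def position_counts(page, keywords):
    """Count keyword occurrences in one page, weighting title words by 3."""
    counts = Counter()
    for word in page["title"]:
        if word in keywords:
            counts[word] += TITLE_WEIGHT
    for word in page["body"]:
        if word in keywords:
            counts[word] += BODY_WEIGHT
    return counts


def page_vector(page, neighbours, keywords):
    """Combine the page's own counts (weight 3) with its linked pages (weight 1)."""
    counts = Counter({w: n * OWN_WEIGHT
                      for w, n in position_counts(page, keywords).items()})
    for other in neighbours:
        for w, n in position_counts(other, keywords).items():
            counts[w] += n * NEIGHBOUR_WEIGHT
    return dict(counts)


# Hypothetical page A with one linked page.
page_a = {"title": ["football"], "body": ["football", "goal"]}
linked = [{"title": [], "body": ["goal", "music"]}]
vector = page_vector(page_a, linked, {"football", "goal"})
```

Here "football" scores (3 + 1) * 3 = 12, while "goal" scores 1 * 3 from page A plus 1 from the linked page.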

    In summary, this representation combines several kinds of information: it uses the incoming and outgoing links of a HyperText document, it takes neighboring documents into account while giving the original document its own weight, and it records both how many times each keyword occurs in the document and where it occurs.

    4.2.3. Database design

    The HyperText documents are encoded into 3 tables in an Access database.

    1. Table 1: the keyword table (KeyWords)

    Field Name    Data Type      Description
    KeyWordID     Auto Number    Keyword ID
    KeyWord       Text           The keyword
    Synonymous    Memo           Synonyms of the keyword

    Keyword (KeyWord): the content is a word in English, so it must satisfy the following conditions: a word is a single unit consisting of a string of characters a-z, A-Z. Words in a sentence are separated by spaces or by any other characters (full stop, comma, colon, ...) outside a-z, A-Z.

    Synonyms (Synonymous): a memo field of the form (word1, word2, ..., wordn). Synonyms therefore share the same ID (KeyWordID) as the keyword.
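The word-splitting rule and the synonym field can be sketched with a regular expression. Words are maximal runs of a-z, A-Z, as stated above; lowercasing the words and the helper names are assumptions not found in the thesis.

```python
import re


def tokenize(text):
    """Split text into words: maximal runs of the characters a-z, A-Z."""
    return re.findall(r"[A-Za-z]+", text)


def keyword_index(rows):
    """Map every keyword and each of its synonyms to the shared KeyWordID.

    rows: iterable of (KeyWordID, KeyWord, Synonymous) tuples, where
    Synonymous is a memo string such as "(word1, word2, ..., wordn)".
    """
    index = {}
    for keyword_id, keyword, synonyms_memo in rows:
        index[keyword.lower()] = keyword_id
        for synonym in tokenize(synonyms_memo):
            index[synonym.lower()] = keyword_id
    return index


words = tokenize("Hello, world: foo-bar.")
index = keyword_index([(1, "football", "(soccer, footy)")])
```

Because synonyms resolve to the same ID, an occurrence of a synonym can be counted as an occurrence of its keyword.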

    2. Table 2: the document table (Documents)

    Field Name    Data Type      Description
    DocID         Auto Number    Document ID
    DocName       Text           Document name
    CacheAdd      Text           Cache address
    Vector        Memo           Vector representing the document


    Vector: a Memo field; each vector has the form:

    (ID of keyword 1, number of occurrences in the title, total number of occurrences in the document); (ID of keyword 2, number of occurrences in the title, total number of occurrences in the document); ...

    The number of components of the vector equals the number of keywords that occur in the Web page being represented, not the total number of keywords in the KeyWords table, so the dimensionality of the vector is greatly reduced. Each component of the vector records how many times, and where, a keyword occurs in the document.

    Example: the vector (1,1,4);(2,1,4);(4,2,7) means that keyword 1 occurs 4 times, 1 of them in the title; keyword 2 occurs 4 times, 1 of them in the title; keyword 4 occurs 7 times, 2 of them in the title.

    DocID    Cache Address              Vector
    1        C:\data\sport\s1.htm       (1,1,4); (3,1,4); (4,2,7); ...
    2        C:\data\sport\s2.htm       (1,2,7); (2,1,4); (3,2,8); ...
    3        C:\data\culture\ct3.htm    (1,2,6); (5,1,4); (7,2,7); ...
    4        C:\data\culture\c4.htm     (2,1,4); (3,1,4); (4,2,7); ...
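A Vector memo string in the format described above can be read and written as in this sketch. The triple layout (keyword ID, title occurrences, total occurrences) is taken from the text; the function names are hypothetical.

```python
import re


def parse_vector(memo):
    """Parse "(id,title,total);(id,title,total);..." into {id: (title, total)}."""
    triples = re.findall(r"\((\d+)\s*,\s*(\d+)\s*,\s*(\d+)\)", memo)
    return {int(k): (int(title), int(total)) for k, title, total in triples}


def format_vector(vector):
    """Serialize {id: (title, total)} back into the memo format, sorted by ID."""
    return ";".join(f"({k},{title},{total})"
                    for k, (title, total) in sorted(vector.items()))


vector = parse_vector("(1,1,4);(2,1,4);(4,2,7)")
text = format_vector(vector)
```

Keeping only the keywords that occur gives a sparse vector, which is why keyword 3 is simply absent from the parsed result.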

    3. Table 3: the links between documents (LINKS)

    Field Name    Field Type    Description
    DocID1        Number        ID of the document that contains the link
    DocID2        Number        ID of the document that is linked to

    DocID1 is the ID of a document that links to the document whose ID is in DocID2.
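The three tables can be sketched as a schema. The thesis uses Access; the sketch below substitutes an in-memory SQLite database purely for illustration, maps Auto Number to an autoincrementing integer key and Memo to TEXT, and inserts made-up rows. Field and table names follow the tables above; the DocName values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for the Access database
conn.executescript("""
CREATE TABLE KeyWords (
    KeyWordID  INTEGER PRIMARY KEY AUTOINCREMENT,
    KeyWord    TEXT NOT NULL,
    Synonymous TEXT
);
CREATE TABLE Documents (
    DocID    INTEGER PRIMARY KEY AUTOINCREMENT,
    DocName  TEXT,
    CacheAdd TEXT,
    Vector   TEXT
);
CREATE TABLE Links (
    DocID1 INTEGER REFERENCES Documents(DocID),
    DocID2 INTEGER REFERENCES Documents(DocID)
);
""")

# Two hypothetical documents; DocID is assigned automatically (1, then 2).
for name, address in [("s1.htm", r"C:\data\sport\s1.htm"),
                      ("s2.htm", r"C:\data\sport\s2.htm")]:
    conn.execute("INSERT INTO Documents (DocName, CacheAdd) VALUES (?, ?)",
                 (name, address))
conn.execute("INSERT INTO Links (DocID1, DocID2) VALUES (1, 2)")

# Documents that document 1 links to.
outgoing = [row[0] for row in
            conn.execute("SELECT DocID2 FROM Links WHERE DocID1 = ?", (1,))]
```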

    4. Table 4: class probabilities

    Field Name     Field Type         Description
    ClassName      Text               Class name
    Probability    Number (0..100)    Probability of the class

    4.2.4. Program module design


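    1. Module that parses Web pages to build the KEYWORDS table

    Algorithm:

    Input: the documents used to generate keywords

    While (not all documents have been read) do
        1. Read the next document
        2. While (the document has not been fully read) do
            2.1. Read the next word
            2.2. Insert it into the database
        End
    End.

    Output: the keyword file

    The Synonymous field is filled in by hand for each keyword.

    The module also provides functions for entering additional keywords by hand and for deleting unnecessary keywords.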

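    2. Module that obtains the cache address (CacheAddress) of each training document and generates the document ID (DocID), filling the first two fields of the DOCUMENTS table. The Vector field is created later by Module 4.

    Algorithm:

    Input: the documents used for training

    While (not all documents have been read) do
        1.1. Read the cache address of the document and insert it into the database
        1.2. Read the document name and insert it into the database
    End

    The document ID is incremented automatically.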

    3. Module that builds the LINKS table. To build the LINKS table, the DOCUMENTS table must first be read to obtain the ID (DocID) of each document.

    Algorithm:

    1. Read the folder on the hard disk that contains the documents
    2. Set the variable TenTM = [path of the folder]
    3. While (not all documents have been analyzed) do
        3.1. Take the next document in the folder together with its cache address (CacheAdd).
        3.2. Look up the DocID of this document in the DOCUMENTS table by its CacheAdd; this gives DocID1
            3.2.1. Parse the document to extract the hyperlink tags, i.e. phrases of the form href=[name of the document pointed to]; suppose there are N such tags.
            3.2.2. For i = 1 to N do
                3.2.2.1. Concatenate TenTM and [name of the document pointed to] to obtain a cache address, then look it up in DOCUMENTS to get the DocID; this gives DocID2
                3.2.2.2. Insert the two DocIDs obtained above into the fields DocID1 and DocID2 of the LINKS table
            End.
    End
    4. Return the LINKS table in the database
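Steps 3.2.1 to 3.2.2.2 can be sketched with the standard library HTML parser. The sketch collects the href values of anchor tags, joins each with the folder path, and keeps only targets that are known documents. The class and function names, the dict standing in for the DOCUMENTS lookup, and the sample paths are assumptions for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, folder, doc_ids, doc_id1):
    """Return the (DocID1, DocID2) pairs to insert into LINKS.

    folder:  path of the document folder, ending with a separator.
    doc_ids: {cache address: DocID}, standing in for the DOCUMENTS lookup.
    """
    parser = HrefCollector()
    parser.feed(html)
    pairs = []
    for target in parser.hrefs:
        doc_id2 = doc_ids.get(folder + target)  # folder path + document name
        if doc_id2 is not None:                 # skip links outside the collection
            pairs.append((doc_id1, doc_id2))
    return pairs


doc_ids = {"C:\\data\\sport\\s1.htm": 1, "C:\\data\\sport\\s2.htm": 2}
page = '<html><body><a href="s2.htm">next</a> <a href="other.htm">x</a></body></html>'
links = extract_links(page, "C:\\data\\sport\\", doc_ids, 1)
```

Links to documents that are not in the DOCUMENTS table are skipped here; the thesis does not say how such links should be handled.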

    4. Module that builds the vector for each document and stores it in the Vector field of the DOCUMENTS table.

    Algorithm:

    1. Read DocID and CacheAdd from the DOCUMENTS table in the database
    2. While (not all records have been read)
        2.1. Use CacheAdd to read the document from the hard disk
        2.2. Set DocID_curence = DocID
        2.3. Set total_occurence = 0; header_occurence = 0; vector = "";
        2.4. Take each keyword in the KEYWORDS table in turn for comparison
        2.4.1. While (not all keywords have been processed)
            2.4.1.1. Parse the document to obtain its words one at a time: word
            2.4.1.2. If word is not yet in the KEYWORDS table, add it
            2.4.1.3. While (the document has not been fully read)
                - If ((word = keyword) or (word = a synonym)) and (word is inside the title tag) then total_occurence += 3 and header_occurence += 1;
                - If ((word = keyword) or (word = a synonym)) and (word is not inside the title tag) then total_occurence++;
            End.
            2.4.1.4. total_occurence *= 3; header_occurence *= 3;
            2.4.1.5. Read all documents that the current document links to (outgoing). Repeat the analysis steps as for the current document, increasing the two variables total_occurence and header_occurence
            2.4.1.6. Read all documents that link to the current document (incoming). Repeat the analysis steps as for the current document, increasing the two variables total_occurence and header_occurence
        End.
        2.5. If (total_occurence != 0) then vector += KeyWordID + "," + header_occurence + "," + total_occurence + ";";
        2.6. Update DOCUMENTS set Vector = vector where DocID = DocID_curence.
    3. End.

    5. Module that performs classification.

    Input: the set of documents to be classified.

    While (not all documents have been read) do
        Read the next document to be classified
        1. Convert the document into a vector, as in the module that builds the Vector field of the DOCUMENTS table
        2. Combine it with the vectors of the documents in the database and apply one of the machine learning algorithms to classify it.
    End

    4.2.5. Analysis of the system functions

    a. Main functions of the system

    b. Detailed functions

    - Database creation

    - Classification and search

    4.2.6. Evaluation of the experimental system

    a. Some example results from the experimental system

    The system runs and has produced some initial results:

    - The database described above has been built:

    + Documents are parsed to extract keywords

    + The links between hypertext documents within a hypertext collection are represented

    + Documents are encoded as vectors and stored in the database

    - A given hypertext document can be classified


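    - A hypertext document whose content is close to an input document can be searched for

    b. Limitations of the system

    Because of time constraints, the system still has several limitations:

    - The keywords are still incomplete and have not been filtered

    - Only one document can be classified at a time (this will be improved if time permits)

    - Accuracy is not yet high because accurate training data is not available.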