17409 - Data Mining Lecture Notes (Bai giang Khai pha du lieu)

Upload: tuyen-long-hoang

Post on 06-Apr-2018


  • 8/3/2019 17409 - Bai Giang Khai Pha Du Lieu

    1/78

    VIETNAM MARITIME UNIVERSITY

    FACULTY OF INFORMATION TECHNOLOGY
    DEPARTMENT OF INFORMATION SYSTEMS

    -----***-----

    LECTURE NOTES

    DATA MINING

    COURSE NAME: DATA MINING
    COURSE CODE: 17409
    LEVEL OF TRAINING: FULL-TIME UNDERGRADUATE
    FOR STUDENTS MAJORING IN: INFORMATION TECHNOLOGY

    HAI PHONG - 2011

    TABLE OF CONTENTS

    Chapter 1. Overview of data warehouses
    1.1. Strategies for processing and exploiting information
    1.2. Definition of a data warehouse
    1.3. Purpose of a data warehouse
    1.4. Characteristics of data in a data warehouse
    1.5. Distinguishing data warehouses from operational databases

    Chapter 2. Overview of data mining
    2.1. What is data mining?
    2.2. Classification of data mining systems
    2.3. Main tasks
    2.4. Integrating a data mining system with a database or data warehouse
    2.5. Data mining methods
    2.6. Advantages of data mining over basic methods
    2.7. Choosing a method
    2.8. Challenges in the application and research of data mining techniques

    Chapter 3. Data preprocessing
    3.1. Purpose
    3.2. Data cleaning
    3.3. Data integration and transformation

    Chapter 4. Mining based on frequent patterns and association rules
    4.1. Basic concepts
    4.2. Association rules
    4.3. Statement of the association rule discovery problem
    4.4. Discovering association rules in binary information systems
    4.5. Mining association rules in fuzzy information systems

    Chapter 5. Classification and prediction
    5.1. Basic concepts
    5.2. Classification based on decision trees

    Course name: Data Mining
    Course type: 2
    Department in charge of teaching: Information Systems
    Faculty in charge: Information Technology
    Course code: 17409
    Total credits: 2

    Total periods   Theory   Practice/Seminar   Self-study   Major assignment   Course project
    45              30       15                 0            no                 no

    Courses to be taken beforehand: Databases; Advanced Databases; Database Management Systems.
    Prerequisite courses: none required.
    Parallel courses: none required.

    Course objectives:
    To provide basic knowledge of large data warehouses and of data mining techniques.

    Main content:
    An overview of data warehouses and data mining; methods for organizing the storage of large data, and data mining techniques; data analysis using clustering; applications of data mining techniques.

    Detailed content:

    CHAPTER / SECTION                                                    PERIODS
                                                                         Total  Theory  Practice  Exercises  Tests
    Chapter 1. Overview of data warehouses                               6      4       2
      1.1. Strategies for processing and exploiting information
      1.2. Definition of a data warehouse
      1.3. Purpose of a data warehouse
      1.4. Characteristics of data in a data warehouse
      1.5. Distinguishing data warehouses from operational databases
    Chapter 2. Overview of data mining                                   9      6       3
      2.1. What is data mining?
      2.2. Classification of data mining systems
      2.3. Main tasks
      2.4. Integrating a data mining system with a database or data warehouse
      2.5. Data mining methods
      2.6. Advantages of data mining over basic methods
      2.7. Choosing a method
      2.8. Challenges in the application and research of data mining techniques
    Chapter 3. Data preprocessing                                        9      6       3
      3.1. Purpose
      3.2. Data cleaning
      3.3. Data integration and transformation
    Chapter 4. Mining based on frequent patterns and association rules   12     8       4
      4.1. Basic concepts
      4.2. Association rules
      4.3. Statement of the association rule discovery problem
      4.4. Discovering association rules in binary information systems
      4.5. Mining association rules in fuzzy information systems
    Chapter 5. Classification and prediction                             9      6       3
      5.1. Basic concepts
      5.2. Classification based on decision trees

    Student responsibilities:
    Attend the theory and practice sessions, complete the assigned exercises, and sit the mid-course tests and the final examination as required by the regulations.

    Learning materials:
    1. J. Han, M. Kamber, Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, 2006.
    2. P. N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison-Wesley, 2006.
    3. Paulraj Ponniah, Data Warehousing Fundamentals, John Wiley.

    Form and criteria of student assessment:
    - Form of examination: essay or multiple choice.
    - Assessment criteria: based on the student's participation in the theory and practice sessions, the results of the assigned exercises, and the results of the mid-course tests and the final examination.

    Grading scale: letter grades A, B, C, D, F.
    Course grade: Z = 0.3X + 0.7Y.

    These lecture notes are the official, unified teaching material of the Department of Information Systems, Faculty of Information Technology, and are used for teaching students.

    Date of approval: / /

    Head of Department

    Chapter 1. Overview of data warehouses

    1.1. Strategies for processing and exploiting information

    The development of information technology and its application in many areas of life, the economy and society over many years has meant that the volume of data collected and stored by organizations keeps growing. Organizations keep this data because they believe it hides some value. According to statistics, however, only a small part of it (about 5% to 10%) is ever analyzed. For the rest, they do not know what they should or could do with it, yet they go on collecting it at great cost, fearing that something important might otherwise be missed and needed later. The question is how to organize and exploit such huge and varied volumes of data.

    On the user side, the difficulties usually encountered are:

    - The required data cannot be found
      - Data is scattered over many systems with different interfaces and tools, so a lot of time is lost moving from one system to another.
      - Several information sources may meet the requirement, but they differ from one another and it is hard to tell which one is correct.
    - The required data cannot be retrieved
      - Expert help is frequently needed, so work piles up.
      - Some kinds of information cannot be retrieved without extending the capabilities of the existing system.
    - The data found cannot be understood
      - Data descriptions are poor and often far removed from familiar business terms.
    - The data found cannot be used
      - Results often fall short in terms of the nature of the data and the search time.
      - Data has to be converted by hand into the user's working environment.

    Problems on the information system side:

    - Developing the various application programs is not simple
      - One function appears in many programs, but organizing and using it is very difficult because of technical limitations.
      - Converting data from different operational formats into a form that suits the user is very difficult.
    - Maintaining these programs raises many problems
      - A change in one application affects the other related applications.

      - The interdependencies between programs are usually unclear or cannot be determined.
      - The complexity of the conversion work, and of the whole maintenance process, makes the source code of the programs extremely complicated.
    - The volume of stored data grows very quickly
      - Data overlap across the information environments cannot be controlled, so the volume of data grows quickly.
    - Data administration is complex
      - The lack of standard, unified data definitions leads to loss of control over the information environment.
      - One data element exists in many different sources.

    The solution to all of the problems above is to build a data warehouse (Data Warehouse) and to develop a new technical direction: knowledge discovery and data mining (KDD - Knowledge Discovery and Data Mining).

    First, we recall some basic concepts related to data, databases and data warehouses.

    1.2. Definition of a data warehouse

    We usually regard data as a sequence of bits, or of numbers and symbols, or as "objects" that carry some meaning when sent to a program in a given format. We use bits to measure information, and regard information as data from which redundancy has been filtered out and which has been reduced to the minimum needed to characterize the data. We can regard knowledge as integrated information, comprising facts and the relationships among them. These relationships may be understood, discovered, or learned. In other words, knowledge can be regarded as data with a high degree of abstraction and organization.

    According to John Ladley, data warehouse technology (DWT - Data Warehouse Technology) is a set of methods, techniques and tools that can be combined and support one another to provide information to users by integrating many data sources and many different environments.

    A data warehouse (Data Warehouse) is a collection of integrated, subject-oriented databases designed to support decision making, in which every unit of data relates to a specific period of time.

    A data warehouse is usually very large, commonly gigabytes and sometimes terabytes.

    A data warehouse is built to make access convenient from many sources and many kinds of data, so that it can both incorporate applications of modern technology and inherit from existing legacy systems. Data generated by day-to-day activities is collected and processed to serve the specific business work of an organization. It is therefore usually called operational data, and the processing of this data is called online transaction processing (OLTP - On Line Transaction Processing).

    The flow of data in an organization (an agency, enterprise, company, etc.) can be summarized as in Figure 1.1.

    Personal data lies outside the scope of the data warehouse management system. It contains information extracted from the operational data systems, from the data warehouse, and from the local data warehouses of related subjects, by aggregation, summarization, or some other processing.

    1.3. Purpose of a data warehouse

    The main goal of a data warehouse is to meet the following basic criteria:

    - It must be able to satisfy every information requirement of its users.
    - It helps the organization's staff do their work well and efficiently, for example by making sound, fast decisions, selling more goods, raising productivity and earning higher profit.
    - It helps the organization identify, manage and run projects and business operations effectively and accurately.
    - It integrates data and metadata from many different sources.

    To meet these requirements, the DW must:

    - Improve data quality by cleaning and refining the data along given subject lines.
    - Summarize and link data.
    - Synchronize the data sources with the DW.
    - Delimit and unify the operational database management systems as standard tools serving the DW.

    Figure 1.1. Data flow in an organization (diagram elements: legacy (existing) systems, operational data, data warehouse, local data warehouses, metadata, personal data warehouses)

    - Manage metadata.
    - Provide information that is integrated, summarized or linked, and organized by subject.

    The results of exploiting the data warehouse are used in decision support systems (Decision Support System - DSS) and in operational information systems, or to support ad hoc queries.

    The basic goal of every organization is profit, which can be described as follows (see Figure 1.2):

    To carry out an effective business strategy, managers set the direction for trading goods. Pricing the goods and selling them generates revenue. To have goods to trade, however, costs must be incurred. Revenue minus cost gives the organization's profit.

    1.4. Characteristics of data in a data warehouse

    A data warehouse is a collection of data with the following characteristics:

    - Integration
    - Subject orientation
    - Stability
    - Summarized data

    1.4.1. Integration

    Data in a data warehouse is organized so as to conform to common naming conventions, consistent units of measure, encoding structures, physical data structures, and so on. A data warehouse is a view of information at the level of the whole business, unifying all the different views into a single view organized by subject. For example, a traditional online transaction processing (OLTP) system is built around one business area. A sales system and a marketing system may share one form of customer information. Financial matters, however, need a different view of the customer, one that includes different data from both finance and marketing.

    Figure 1.2. Relationships between the views in the system (diagram elements: profit; revenue and cost; fixed cost and variable cost; sales and pricing; production costs)

    Integration means that the data gathered in the warehouse is collected from many sources and merged into a single unified whole.

    1.4.2. Subject orientation

    Data in a data warehouse is organized by subject so that the organization can easily identify the information it needs for each of its activities. For example, an older financial management system may organize its data by function: lending, credit management, budget management, and so on. In a financial data warehouse, by contrast, data is organized by subject around entities such as customers, products and enterprises. The difference between these two approaches leads to a difference in the content of the data stored in the system.

    * A data warehouse does not store detailed data. It only needs to store summarized data that mainly serves analysis in support of decision making.

    * Databases in operational applications, on the other hand, must process detailed data that directly serves the processing requirements of the current application area. Operational application systems (Operational Application System - OAS) therefore need to store detailed data. The relationships among the data in such systems are also different: they must be accurate, up to date, and so on.

    * Data must be tied to time and is historical. A data warehouse holds a large volume of historical data, stored as a series of snapshots. Each record reflects the values of the data at a given moment and represents the view of one subject during one period. This makes it possible to reconstruct history and to compare different periods fairly accurately. The time element acts as part of the key, guaranteeing the uniqueness of each item and giving the data its time characteristic. For example, a business management system needs to store the unit price of each item by date (the date being the time element). Each item, in a given unit of measure and at a given point in time, may have a different unit price (the recent fluctuation of fuel prices is a typical illustration).
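    The idea that time is part of the key can be sketched in Python as follows. This is only an illustrative sketch: the item name, unit, dates and prices are hypothetical, and `price_on` is a helper invented for this example.

```python
from datetime import date

# Hypothetical snapshot table. The key is (item, unit, snapshot date),
# so the same item can carry a different unit price at different times.
price_snapshots = {
    ("fuel", "litre", date(2011, 1, 1)): 100,
    ("fuel", "litre", date(2011, 2, 1)): 120,
    ("fuel", "litre", date(2011, 3, 1)): 150,
}

def price_on(item, unit, day):
    """Return the unit price in force on `day`: the latest snapshot not after it."""
    candidates = [
        (snap_day, price)
        for (snap_item, snap_unit, snap_day), price in price_snapshots.items()
        if snap_item == item and snap_unit == unit and snap_day <= day
    ]
    if not candidates:
        return None  # no snapshot had been taken yet on that date
    return max(candidates)[1]  # snapshot dates are unique, so max picks the latest
```

    Because old snapshots are never overwritten, a query for any past date still returns the price that applied then, which is what allows periods to be compared.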

    Data in an OAS must be accurate at the moment of access, whereas data in a DW only needs to be valid over some period, typically 5 to 10 years or longer. Data in an operational database usually becomes historical data after a certain time and is then moved into the data warehouse. This is the valid data about the subjects that need to be stored.

    Comparing an operational database with data snapshots, we see:

    Operational database                      Data snapshot
    Short time horizon (30-60 days)           Long time horizon (5-10 years)
    May or may not have a time element        Always has a time element
    Data can be updated                       Once a snapshot is taken, it cannot be updated

    Table 1.1. Time characteristics of data

    1.4.3. Stable data (nonvolatility)

    Data in a DW is read-only: it can only be inspected and cannot be changed by end users (terminal users). Only two basic operations are allowed: loading data into the warehouse and accessing the areas of the DW. The data is therefore nonvolatile.

    Information is loaded into the DW once the data in the operational system is considered too old. Nonvolatility means that data is kept in the warehouse for the long term. Even as new data is loaded, the old data in the warehouse is neither deleted nor changed. This makes it possible to provide information covering a long period and enough data for the business models used in analysis and forecasting, and from these to reach sound decisions that are consistent with the natural laws of evolution.

    1.4.4. Summarized data

    Purely operational data is not stored in the DW. Summarized data is integrated over several stages according to the subjects described above.

    1.5. Distinguishing data warehouses from operational databases

    Based on the characteristics of the DW, we can distinguish it from traditional operational database management systems:

    - A data warehouse must be subject-oriented and is built according to the intentions of its end users, whereas operational database systems serve general-purpose use.
    - Ordinary database systems do not manage large volumes of information, only small and medium ones. A DW must manage a large volume of information stored on many different storage and processing media. This is also a distinguishing feature of the DW.
    - A DW can join different versions of database structures. It summarizes information and presents it in forms that users find easy to understand.
    - A DW integrates and links information from many sources, on many kinds of storage and processing media, in order to serve online operational processing applications.
    - A DW can store information summarized by business subject so as to produce information that effectively serves the users' analysis.

    - A DW usually contains historical data linking many past years of operational information, stored efficiently and easy to adjust. Data in an operational database is usually new and current only for a short period.
    - Data in an operational database is filtered and summarized before being moved to the DW environment. Much of the data is never moved to the DW. Only the data needed for management or decision support is transferred.

    In general terms, a DW distributes data to many recipients (clients) and handles information in many forms, such as databases, data queries (SQL query) and reports.

    EXERCISES:

    THEORY:

    1. What is a data warehouse?
    2. Give examples of systems or fields in which the conditions exist for building large data warehouses.
    3. Can a data table with 50,000 records be called a large data warehouse? Justify your answer.
    4. Give an example of a data source stored in tabular structure, in semi-structured form, and in unstructured form.
    5. Distinguish a data warehouse from an operational database.

    PRACTICE:

    1. Install the Microsoft Visual Studio 2005 application suite.
    2. Install and explore the Data analysis service.
    3. Examine and explore the NorthWind database.

    Chapter 2: Overview of data mining

    2.1. Data mining

    The term data mining describes the process of discovering knowledge in databases. This process extracts hidden knowledge from data to help with forecasting in business, production, and so on. Data mining reduces the cost in time compared with earlier traditional methods (for example, statistical methods).

    Below are some descriptive definitions of data mining given by various authors.

    Ferruzza's definition: "Data mining is the set of methods used in the knowledge discovery process to identify previously unknown relationships and patterns within the data."

    Parsaye's definition: "Data mining is a decision support process in which we search for unknown and unexpected patterns of information in large databases."

    Fayyad's definition: "Knowledge discovery is the non-trivial process of identifying patterns in data that are valid, novel, useful, potentially valuable and understandable."

    2.2. Applications of data mining

    Knowledge discovery and data mining draw on many disciplines: statistics, artificial intelligence, databases, algorithms, parallel and high-speed computing, knowledge acquisition for expert systems, data visualization, and others. Knowledge discovery and data mining are especially close to statistics, using statistical methods to model data and to discover patterns and rules. Data warehousing (Data Warehousing) and online analytical processing tools (OLAP - On Line Analytical Processing) are also very closely related to knowledge discovery and data mining.

    Data mining has many practical applications, for example:

    - Insurance, finance and the stock market: analyzing financial conditions and forecasting share prices on the stock market; portfolios of capital and prices, interest rates, credit card data, fraud detection, and so on.
    - Statistics, data analysis and decision support. An example is the following table:

    Year  World population    Year  World population    Year  World population
          (millions)                (millions)                (millions)
    1950  2555                1970  3708                1990  5275
    1951  2593                1971  3785                1991  5359
    1952  2635                1972  3862                1992  5443
    1953  2680                1973  3938                1993  5524
    1954  2728                1974  4014                1994  5604
    1955  2779                1975  4087                1995  5685
    1956  2832                1976  4159                1996  5764
    1957  2888                1977  4231                1997  5844
    1958  2945                1978  4303                1998  5923
    1959  2997                1979  4378                1999  6001
    1960  3039                1980  4454                2000  6078
    1961  3080                1981  4530                2001  6153
    1962  3136                1982  4610                2002  6228
    1963  3206                1983  4690
    1964  3277                1984  4769
    1965  3346                1985  4850
    1966  3416                1986  4932
    1967  3486                1987  5017
    1968  3558                1988  5102
    1969  3632                1989  5188

    Source: U.S. Bureau of the Census, International Data Base. Updated 10/10/2002.

    Table 2.1. World population at midyear

    - Medical treatment and health care: diagnostic information stored in hospital management systems; analysis of the relationships among symptoms, diagnoses and treatments (diet, medication, and so on).
    - Manufacturing and processing: procedures, processing methods and fault handling.
    - Text mining and Web mining: classifying texts and Web pages, summarizing texts, and so on.
    - Science: astronomical observation, gene data, biological data; searching and comparing genomes and genetic information; links between genes and certain hereditary diseases.
    - Telecommunication networks: analysis of telephone calls and of systems monitoring faults, incidents, quality of service, and so on.

    2.3. Steps of the data mining process

    The knowledge discovery process usually follows these steps:

    Figure 2.1. The knowledge discovery process

    Step one: Formulate, identify and define the problem. This means studying the application domain in order to formulate the problem and determine the tasks to be completed. This step determines whether useful knowledge can be extracted and allows data mining methods to be chosen that suit the purpose of the application and the nature of the data.

    Step two: Collect and preprocess the data. This is the collection and rough processing of data, also called data preprocessing. It removes noise (data cleaning), handles missing data (data enrichment), and transforms and reduces the data where necessary. This step usually takes the most time in the whole knowledge discovery process, because data drawn from many different, heterogeneous sources can cause confusion. After this step the data is consistent, complete, reduced and discretized.
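    Two of the preprocessing operations named in step two can be sketched in Python as follows. This is a minimal sketch under stated assumptions: replacing missing values with the mean and equal-width binning are just two common choices, and the `ages` values are hypothetical.

```python
def fill_missing(values):
    """Replace None with the mean of the known values (one simple way to handle missing data)."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def discretize(values, n_bins):
    """Map each value to an equal-width bin index in the range 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [23, None, 35, 47, None, 59]  # hypothetical raw attribute with gaps
clean = fill_missing(ages)           # complete data
bins = discretize(clean, 3)          # discretized data
```

    Chapter 3 treats cleaning and transformation in detail. The sketch only shows how a raw attribute becomes complete and discretized.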

    Step three: Mine the data and extract knowledge. This is data mining proper, in other words extracting the patterns and/or models hidden in the data. This stage is very important. It involves deciding the function, task and purpose of the mining and which mining method to use. Data mining problems typically include descriptive problems, which give the most general properties of the data, and predictive problems, which include discovering inferences based on the existing data. The mining method is chosen to suit the problem that has been identified.

    Step four: Use the discovered knowledge. This means understanding the knowledge found, in particular clarifying the descriptions and predictions. The steps above may be repeated several times, and the results may be averaged over all the runs. The results of the knowledge discovery process can be applied in many fields. Since the results may be predictions or descriptions, they can be fed into decision support systems in order to automate this process.

    In summary: KDD is a process of extracting knowledge from a data warehouse, in which data mining is the most important stage.

    2.4. Main tasks in data mining

    Data mining is a process of discovering patterns of information, in which a mining algorithm searches for interesting patterns of a given form, such as rules, classifications, regressions, decision trees, and so on.

    2.4.1. Classification

    Classification determines a function that maps a data sample to one of a number of previously known classes. The goal of a classification algorithm is to find a relationship between the predictive attributes and the class attribute, so that this relationship can be used to predict the class of new items. The discovered knowledge is expressed as rules of the form: "If the predictive attributes of an item satisfy the conditions in the antecedent, then the item belongs to the class named in the conclusion."

    Example: an item representing information about an employee has the predictive attributes name, age, sex, level of education, and so on, and the class attribute is the employee's level of leadership ability.
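    The rule form above can be sketched in Python using the employee example. The specific rules, thresholds and class labels below are hypothetical, chosen only to show how an antecedent over predictive attributes leads to a class in the conclusion.

```python
# Each rule pairs an antecedent (attribute -> condition) with the class
# named in its conclusion. The rules themselves are invented for illustration.
rules = [
    ({"age": lambda a: a >= 35, "education": lambda e: e == "master"}, "high"),
    ({"age": lambda a: a >= 25}, "medium"),
]

def classify(record, default="low"):
    """Return the class of the first rule whose antecedent the record satisfies."""
    for antecedent, label in rules:
        if all(test(record[attr]) for attr, test in antecedent.items()):
            return label
    return default  # no rule fired
```

    For instance, `classify({"age": 40, "education": "master"})` satisfies the first antecedent. A classification algorithm would learn such rules from labelled data, whereas here they are written by hand.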

    2.4.2. Regression

    Regression learns a function that maps a data sample to a real-valued prediction variable. The task is similar to classification. The main difference is that the attribute to be predicted is continuous rather than discrete. Numeric values are usually predicted with classical statistical methods such as linear regression, but modelling methods such as decision trees are also used.

    Regression has many applications, for example: predicting the amount of biomass present in a forest from microwave measurements made by remote sensors; estimating the probability that a patient will die from the results of checking symptoms; forecasting user demand for a product.
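    As a small worked example of linear regression, the sketch below fits a straight line by ordinary least squares to six rows of Table 2.1 (one per decade) and uses it to predict a later year. Choosing these six rows and a straight-line model are assumptions made for illustration.

```python
# (year, world population in millions), taken from Table 2.1 at ten-year steps.
data = [(1950, 2555), (1960, 3039), (1970, 3708),
        (1980, 4454), (1990, 5275), (2000, 6078)]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(data)

def predict(year):
    """Predicted population (millions) for a given year."""
    return slope * year + intercept
```

    Comparing `predict(2002)` with the table's value for 2002 gives a feel for the error of a straight-line model on data whose growth is not exactly linear.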

    2.4.3. Clustering

    Clustering is a general descriptive task that finds the sets, groups or categories that describe the data. The groups may be disjoint, hierarchical or overlapping, meaning that a data item may belong to more than one group. Data mining applications with a clustering task include finding sets of customers who respond similarly in a marketing database, and identifying spectra from infrared measurements. Closely related to clustering is the task of density estimation: estimating from the data the multivariate probability density function of the fields in the database.
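    One common clustering technique is k-means, which the text does not name and which is used here only as an illustration. Note that it produces disjoint groups, not the overlapping or hierarchical ones also mentioned above. The spending values and starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign each value to its nearest center,
    then move each center to the mean of its cluster."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center when a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

spend = [12, 15, 14, 80, 85, 90, 13, 88]  # hypothetical customer spending
centers, clusters = kmeans_1d(spend, centers=[0.0, 100.0])
```

    The two resulting groups correspond to customers who behave alike, in the spirit of the marketing example.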

    2.4.4. Summarization

    Summarization covers methods for finding a compact description of a subset of the data [1, 2, 5]. Summarization techniques are often applied in exploratory data analysis and in automatic report generation. The main task is to produce characteristic descriptions of a class. Such a description is a kind of summary of the properties shared by all or most of the items of the class. Characteristic descriptions are expressed as rules of the form: "If an item belongs to the class named in the antecedent, then it has all the attributes listed in the conclusion." Note that rules of this form differ from classification rules: a characteristic rule for a class is produced only from items already known to belong to that class.

    2.4.5. Dependency modeling

    Dependency modeling searches for a model that describes the dependencies among variables or attributes at two levels. The structural level of the model (usually in the form of a graph) specifies which variables depend locally on which others. The quantitative level of the model specifies the strength of the dependencies. These dependencies are often expressed as "if - then" rules (if the antecedent is true, then the conclusion is true). In principle, both the antecedent and the conclusion may be logical combinations of attribute values. In practice, the antecedent is usually a group of attribute values and the conclusion a single attribute. Moreover, the system can discover classification rules, in which every rule must have the same user-specified attribute in its conclusion.

    Dependencies can also be represented as a Bayesian belief network. This is a directed acyclic graph whose nodes represent attributes and whose edge weights represent the dependencies between those nodes.

    2.4.6. Change and deviation detection

    This task focuses on discovering the most significant changes in the data relative to previously measured or normative values, detecting notable deviations between the content of an actual data subset and the expected content. Two commonly used deviation models are deviation over time and deviation between groups. Deviation over time is a significant change in the data over time. Deviation between groups is the difference between the data in two subsets, including the case where one subset is contained in the other, that is, determining whether the data in a subgroup of objects differs significantly from the whole set of objects. In this way, data errors and departures from usual values can be detected.

    Because these tasks require very different amounts and kinds of information, they often influence the design and choice of the data mining method. For example, the decision tree method (presented below) produces a description that discriminates samples between classes but does not give the properties and characteristics of a class.
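    Deviation between a subgroup and the whole set can be sketched in Python as follows. The test used here (distance between means, measured in population standard deviations, with a threshold of 2) is an assumption chosen for illustration, not a method prescribed by the text, and the data is hypothetical.

```python
from statistics import mean, pstdev

def group_deviates(group, population, threshold=2.0):
    """Flag the subgroup when its mean lies more than `threshold` population
    standard deviations away from the population mean."""
    spread = pstdev(population)
    if spread == 0:
        return False  # all values identical, nothing can deviate
    return abs(mean(group) - mean(population)) / spread > threshold

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]  # hypothetical measurements
```

    With this data, the subgroup `[50]` is flagged while `[10, 11, 9]` is not, which is how a data-entry error or an unusual value would surface.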

    2.5. Data mining methods

    Data mining is a field in which people continually look for ways to achieve their purposes in using information. The data mining process is a process of pattern discovery, in which a mining method searches for interesting patterns of a given form. Methods include: using query tools, building decision trees, distance-based methods (K-nearest neighbors), averages, and association rule discovery. Over many years of research these methods have been adapted and integrated into hybrid systems for statistical data mining. With the very large data held in data warehouses, however, they also face challenges of efficiency and scale.

    2.5.1. Components of a data mining algorithm

    A data mining algorithm has three main components: model representation, model evaluation and the search method.

    Model representation: the model is expressed in some language L used to describe the patterns that can be discovered. If the model description is clear, machine learning will produce patterns that model the data accurately. If the model is too large, however, the predictive ability of the learner is limited: the search becomes more complex and the model harder to understand. Conversely, if the representation is too limited, no set of samples can produce an accurate model of the data. For example, a decision tree representation that splits each node on a single field divides the input space with hyperplanes parallel to the attribute axes. Such a decision tree method cannot discover from data a formula of the form "X = Y", however large the training set. It is therefore important that the data analyst fully understands the representational assumptions. It is equally important that the algorithm designer states which representational assumptions are made by which algorithm. The greater the representational power of the model, the greater the danger of overfitting, which reduces the ability to predict unseen data. In addition, the search becomes more complex and the model harder to interpret.

    The initial model is specified by relating the output (dependent) variable to the independent variables on which it depends. The parameters on which the problem needs to focus must then be found. Model search produces a model that fits, with parameters determined from the data (in other cases both the model and the parameters are changed to fit the data). In some cases the data set is divided into a training set and a test set. The training set is used to fit the model's parameters to the data. The model is then evaluated by feeding it the test data, and the parameters are adjusted if necessary. The chosen model may be a statistical method such as SASS, a machine learning algorithm (for example decision trees and other supervised learning methods), neural networks, case-based reasoning, or classification techniques.
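    The training/test procedure described above can be sketched in Python. To keep the sketch short, the "model" has a single parameter, the mean of the training targets. This choice, the 30% test fraction and the target values are all assumptions made for illustration.

```python
import random

def split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and cut it into (training set, test set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_mean(train):
    """Fit the single model parameter: the mean of the training targets."""
    return sum(train) / len(train)

def mean_abs_error(parameter, test):
    """Evaluate the fitted parameter on data it has not seen."""
    return sum(abs(y - parameter) for y in test) / len(test)

targets = [10, 12, 11, 13, 9, 10, 12, 11, 10, 12]  # hypothetical target values
train, test = split(targets)
parameter = fit_mean(train)
error = mean_abs_error(parameter, test)
```

    The parameter is fitted only on `train`, and the error on `test` estimates how well the model predicts unseen data.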

    Model evaluation: this is the assessment of detailed, standard models during the knowledge discovery process, estimating whether a model predicts accurately and whether it is logically sound. The estimate should be cross-validated, and the characterization should cover predictive accuracy, novelty, usefulness and understandability of the fitted models. Both logical and standard statistical methods can be used for model evaluation.

    Search method: this has two components, parameter search and model search. In parameter search, the algorithm looks for the parameters that optimize the model evaluation criteria, given the observed data and a fixed model description. For some fairly simple problems no search is needed, because the optimal parameter estimates can be obtained in a simpler way. For more general models no such way exists, and "greedy" iterative algorithms are commonly used, for example gradient descent in the backpropagation algorithm for neural networks. Model search takes place as a loop over the parameter search method: the model description is varied to form a family of models, and for each model description the parameter search method is applied to assess the quality of the model. Model search methods usually use heuristic search techniques, because the size of the space of possible models often prohibits exhaustive search and simple (closed form) solutions are not easily obtained.
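    Parameter search by gradient descent can be sketched in Python for the simplest case, a straight line fitted by mean squared error. The data points, learning rate and step count are hypothetical. For this particular model a closed-form solution also exists, so the iteration is shown only to illustrate the search.

```python
def gradient_descent(points, learning_rate=0.05, steps=5000):
    """Search for w and b that minimize the mean squared error of y = w*x + b."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

points = [(0, 1), (1, 3), (2, 5), (3, 7)]  # hypothetical data lying on y = 2x + 1
w, b = gradient_descent(points)
```

    Model search would wrap a loop around this function, for example trying several model forms and running the parameter search for each one.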

    2.5.2. Deduction and induction

    A database is a store of information, but more important information can also be inferred from that store. There are two main techniques for doing so: deduction and induction.

    Deduction: this extracts information that is a logical consequence of the information in the database. An example is the join operator applied to relational tables: the first table contains information about employees and departments, and the second contains information about departments and their managers. From these, the relationship between employees and managers can be derived. Deduction relies on exact facts to derive new knowledge from existing information. The patterns extracted in this way are usually deduction rules.

    Induction: this infers information generated from the database. It searches, forms patterns and produces knowledge by itself, instead of starting from previously known knowledge. The information it yields is high-level information or knowledge about the objects in the database. The method involves searching for patterns in the database. In data mining, induction is used in decision trees and rule generation.
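    The join example in the deduction paragraph can be sketched in Python. The employee, department and manager names are hypothetical.

```python
# First table: (employee, department). Second table: (department, manager).
works_in = [("An", "Sales"), ("Binh", "IT"), ("Chi", "Sales")]
managed_by = [("Sales", "Dung"), ("IT", "Giang")]

def join(left, right):
    """Join the two tables on department, yielding derived (employee, manager) pairs."""
    manager_of = dict(right)
    return [(emp, manager_of[dept]) for emp, dept in left if dept in manager_of]

reports_to = join(works_in, managed_by)
```

    Every pair in `reports_to` is a logical consequence of facts already stored, which is what distinguishes deduction from induction.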

    2.5.3. The K-nearest neighbors method

    Representing the records of a data set as points in a multidimensional space is very useful for data analysis. With this representation a neighborhood can be defined, in which records that lie close together in the space are considered neighbors of one another. This notion is used under the name K-nearest neighbors, where K is the number of neighbors used. The method is very effective yet simple. The idea of the K-nearest neighbors learning algorithm is "do as your nearest neighbors did".

    Example: to predict the behavior of a given individual, its K best neighbors are examined, and the average of the behavior of those nearest neighbors gives the prediction for that individual.
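    The averaging prediction described in the example can be sketched in Python. Euclidean distance is assumed as the measure of closeness, and the training records are hypothetical.

```python
import math

def knn_predict(train, query, k):
    """train: list of (feature_vector, numeric_value).
    Predict by averaging the values of the k records nearest to `query`."""
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], query))
    nearest = by_distance[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Hypothetical records: two features and one observed value each.
train = [((1.0, 1.0), 10.0), ((1.2, 0.9), 12.0),
         ((5.0, 5.0), 50.0), ((5.1, 4.8), 52.0)]
```

    Each call sorts every training record by distance to the query, so predicting for all records of a data set compares every pair of records.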

    K-nearest neighbors is a simple search method. It has some drawbacks, however, that limit its range of application. One is that its computational complexity is quadratic in the number of records in the data set.

    The main problem concerns the attributes of a record. A record with many independent attributes corresponds to a point in a search space of high dimension. In high-dimensional spaces, almost any two points are at about the same distance from each other. The K-nearest neighbors technique then gives no useful information, since all pairs of points are neighbors. Finally, K-nearest neighbors offers no theory for understanding the structure of the data. This limitation can be overcome with the decision tree technique.

    2.5.4. Decision trees and rules

    With decision-tree classification, the outcome of model building is a decision tree. The tree is then used to classify unseen data objects or to assess the accuracy of the model. These correspond to the two phases of classification: building the tree and using it.

    Building a decision tree starts from a single node representing all data samples. The samples are then split recursively based on the choice of attributes. If all samples at a node belong to the same class, the node becomes a leaf; otherwise an attribute selection measure is used to choose the next attribute on which to split the samples. For each value of the chosen attribute a branch is created, and the samples are distributed among the branches. The process repeats until the tree is complete, that is, every node has been expanded into labeled leaves.

    The recursion stops when one of the following conditions holds:

    - All samples at the node belong to the same class.
    - No attribute is left to split on.
    - The branch contains no samples.
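The recursive construction and its stopping conditions can be sketched as below. This is a minimal ID3-style sketch that assumes information gain as the attribute selection measure; the lecture does not fix a particular measure at this point, and the toy weather data and all function names are invented. Empty branches are handled at classification time by falling back to the majority label stored in each node.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on `attr`."""
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attributes):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: all samples share one class, or no attribute is left.
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in set(row[best] for row in rows):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*pairs)
        branches[value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return {"attribute": best, "branches": branches, "default": majority}


def classify(tree, row):
    while isinstance(tree, dict):
        # A value with no branch (no training samples) gets the majority label.
        tree = tree["branches"].get(row[tree["attribute"]], tree["default"])
    return tree


# Toy training data, invented for illustration.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```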

    Most decision-tree induction algorithms share one limitation: they use a lot of memory. Memory use grows in proportion to the size of the training data. A tree-induction program that supports external storage avoids this, but at the cost of execution speed. Pruning the decision tree therefore becomes important: unstable leaf nodes are pruned away.

    Pre-pruning means stopping tree growth when a further split of the data would not be meaningful.

    2.5.5. Association rule discovery

    This method aims to discover association rules between data items in the database. The output of the mining algorithm is the set of association rules found. A simple example: an association between two items A and B means that the occurrence of A in a record entails the occurrence of B in the same record, written A => B.


    Let R = {A1, ..., Ap} be a schema of attributes with domain {0, 1}, and let r be a relation over R. An association rule over r has the form X => B, where X ⊆ R and B ∈ R \ X. Intuitively, the rule says: if a record of r has value 1 in every attribute of X, then attribute B also has value 1 in that record. For example, take a database of supermarket sales in which rows correspond to sales days and columns to items. A value of 1 in cell (20/10, bread) records that bread was sold on that day; under the rule bread => butter, it also entails a value of 1 in cell (20/10, butter).

    For W ⊆ R, let s(W, r) be the frequency of W in r, computed as the fraction of rows of r that have value 1 in every column of W. The frequency of the rule X => B in r is defined as s(X ∪ {B}, r), also called the support of the rule; the confidence of the rule is s(X ∪ {B}, r) / s(X, r). Here X may contain several attributes, and B is not fixed in advance, so no rules have to be committed to before the search begins. This also means that the search space grows exponentially with the number of input attributes, so care is needed when designing the data for association rule discovery.
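The two measures can be computed directly from their definitions. The sketch below assumes the relation is held as a list of dictionaries with 0/1 values; the sales data and function names are hypothetical.

```python
def support(itemset, rows):
    """s(W, r): fraction of rows that have value 1 in every column of W."""
    hits = sum(1 for row in rows if all(row[item] == 1 for item in itemset))
    return hits / len(rows)


def confidence(lhs, rhs, rows):
    """Confidence of lhs => rhs: s(lhs U {rhs}, r) / s(lhs, r)."""
    return support(set(lhs) | {rhs}, rows) / support(lhs, rows)


# Hypothetical daily sales table: 1 means the item was sold that day.
sales = [
    {"bread": 1, "butter": 1, "milk": 0},
    {"bread": 1, "butter": 1, "milk": 1},
    {"bread": 1, "butter": 0, "milk": 1},
    {"bread": 0, "butter": 0, "milk": 1},
]
rule_support = support({"bread", "butter"}, sales)        # 2 of 4 rows
rule_confidence = confidence({"bread"}, "butter", sales)  # 2 of the 3 bread rows
print(rule_support, rule_confidence)
```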

    The task of association rule discovery is to find all rules X => B whose frequency is no smaller than a given frequency threshold and whose confidence is no smaller than a given confidence threshold. A single database can yield thousands or even hundreds of thousands of association rules.

    A subset X ⊆ R is called frequent in r if s(X, r) is at least the given frequency threshold. Once all frequent sets in r are known, finding the rules is easy. An association rule algorithm therefore first finds all frequent sets and then builds the rules step by step, combining attribute sets according to their frequency.
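The two-stage procedure (frequent sets first, rules second) can be sketched as a level-wise search. This is an Apriori-style sketch chosen for illustration; the lecture does not name a specific algorithm here, and the transactions, thresholds, and function names are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: grow candidates only from frequent smaller sets."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {}
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        # Join step: unite pairs of frequent sets that differ in one item.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
        # Prune step: every subset one item smaller must itself be frequent.
        level = [
            c for c in candidates
            if all(c - {x} in frequent for x in c) and support(c) >= min_support
        ]
    return frequent


def rules(frequent, min_confidence):
    """Rules X => B with a single attribute B on the right-hand side."""
    found = []
    for itemset, supp in frequent.items():
        for b in itemset:
            lhs = itemset - {b}
            if lhs and supp / frequent[lhs] >= min_confidence:
                found.append((set(lhs), b, supp, supp / frequent[lhs]))
    return found


# Hypothetical transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
strong_rules = rules(frequent, min_confidence=0.9)
print(strong_rules)
```

The rule stage only needs the supports stored in `frequent`, because every subset of a frequent set is itself frequent and was recorded at an earlier level.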

    Association rules are a simple formalism. They are well suited to producing results from binary data. A basic limitation of the method is that the relation must be sparse, in the sense that no frequent set contains more than about 15 attributes. An association rule algorithm produces at least as many rules as there are frequent sets, and if a frequent set has size K then there are at least 2^K frequent sets, since every subset of a frequent set is also frequent. Information about the frequent sets is used to estimate the confidence of the association rules.

    2.6. Advantages of data mining over basic methods

    As the analysis above shows, data mining methods are nothing new: they rest entirely on well-known basic methods. How, then, does data mining differ from those methods, and why does it have a clear advantage over them? The following analysis answers these questions.

    2.6.1. Machine learning


    Although efforts have been made to adapt machine learning methods to data mining, differences in design and in the characteristics of databases make pure machine learning methods unsuitable for this purpose, even though most data mining methods to date are still built on machine learning foundations. The following analysis shows why.

    In database management, a database is a logically integrated collection of data stored in one or more files and organized so that relevant information can be stored, modified, and retrieved efficiently. In a relational database, for example, data are organized into files or tables of fixed-length records. Each record is an ordered list of values, one value per field. Information about field names and field values is kept in a separate file called the data dictionary. A database management system manages the procedures for retrieving, storing, and processing the data in the database.

    In machine learning, the term database mostly refers to a set of samples (instances or examples) stored in a file. The samples are usually fixed-length feature vectors. Information about the feature names and their value ranges is sometimes also stored, as in a data dictionary. A learning algorithm takes the data set and its accompanying information as input and produces output that expresses what has been learned (for example, a concept).

    Comparing ordinary databases with databases in the machine learning sense, it is clear that machine learning can be applied to databases: instead of learning from a set of samples, it learns from the set of records of the database.

    However, knowledge discovery in databases aggravates problems that are already typical in machine learning, beyond what machine learning can handle. In practice, databases are usually dynamic, incomplete, noisy, and much larger than typical machine learning data sets. These factors make most machine learning algorithms ineffective in most cases. Data mining therefore has to devote a great deal of effort to overcoming these difficulties and complexities of databases.

    2.6.2. Expert systems

    Expert systems try to capture the knowledge relevant to a particular problem. Knowledge-acquisition techniques help elicit this knowledge from human experts. Each such technique is a way of deriving rules from examples of solutions that the experts provide for the problem. This approach differs from data mining in that the experts' examples are usually of much higher quality than the data in a database, and they usually cover only the important cases. Moreover, the experts confirm the validity and usefulness of the patterns discovered. As with database management tools, these methods require human involvement in knowledge discovery.

    2.6.3. Scientific discovery


    Data mining differs greatly from scientific discovery in that mining a database is less purposeful and less controlled. Scientific data come from experiments designed to eliminate the effects of some parameters and to emphasize the variation of one or a few target parameters. Typical commercial databases, by contrast, record a surplus of information about their projects to serve organizational purposes. This redundancy (which may also be called confusion) can be visible, or it can be hidden in the relationships among the data. Furthermore, scientists can rerun experiments and may find that the original design was inadequate. Database managers, on the other hand, can rarely afford the luxury of redesigning data fields and collecting the data again.

    2.6.4. Statistical methods

    An obvious question is how data mining differs from statistics. For many years people have used statistical methods very effectively to achieve their goals.

    Although statistical methods provide a solid theoretical foundation for data analysis problems, a purely statistical approach is not enough. First, standard statistical methods are ill suited to the structured data types found in many databases. Second, statistics is entirely data driven; it does not use existing domain knowledge. Third, the results of statistical analysis can be overwhelming and hard to interpret. Finally, statistical methods need the user's guidance to decide how and where to analyze the data.

    The basic difference between data mining and statistics is that data mining is a tool for end users rather than for statisticians. Data mining automates the statistical process effectively, which lightens the end user's work and gives a tool that is easier to use. With data mining, the laborious prediction and testing of the past can be moved to the computer and carried out automatically.

    2.7. Choosing a method

    Automated data mining algorithms are still at an early stage of development. There is not yet any standard for deciding which method to use and in which cases it is effective.

    Most data mining techniques are new to the business world. There are also many techniques, and each can be used for many different problems. So right after the question "what is data mining?" comes the question "which technique should be used?". The answer is of course not simple. Each method has its strengths and weaknesses, but most of the weaknesses can be overcome. The question then becomes how to apply a technique in a simple, easy-to-use way, so that its inherent complexity is not felt.


    Comparing techniques requires a large set of rules and sound experimental methods. These rules are often not followed when the newest techniques are evaluated, so claimed improvements in accuracy do not always hold up.

    Many companies offer products that combine several data mining techniques in the hope that more techniques will do better. In practice, multiple techniques often just add complications and make it harder to compare methods and products. According to many assessments, once the techniques are understood and their similarities studied, many that at first looked different turn out to be essentially the same. This assessment is only indicative, however, because data mining is still a young field with much potential that has yet to be exploited.

    2.8. Challenges in applying and researching data mining techniques

    This section lists some difficulties in researching and applying data mining techniques. This does not mean they are insurmountable; the point is that mining data is not simple, and these problems have to be examined and addressed. Some of the difficulties are as follows.

    2.8.1. Database issues

    The main input of a knowledge discovery system is the raw data in the database, and this is where the problems of data mining originate: real-world data are usually dynamic, incomplete, large, and noisy. In other cases it is not known whether the database contains the information needed for mining at all, or how to deal with the surplus of irrelevant information.

    Large data: databases with hundreds of fields and tables, millions of records, and sizes in the gigabytes are now commonplace, and terabyte-sized databases have begun to appear. Current remedies include setting a threshold for the database, sampling, approximation methods, and parallel processing (Agrawal et al., Holsheimer et al.).

    High dimensionality: not only is the number of records large, the number of fields in the database is large as well, which makes the problem itself larger. A high-dimensional data set enlarges the search space for model inference. It also raises the chance that a mining algorithm finds spurious patterns. The remedy is to reduce the effective dimensionality of the problem and to use prior knowledge to identify irrelevant variables.

    Dynamic data: a basic characteristic of most databases is that their content changes continuously. Data can change over time, and mining is affected by the moment at which the data are observed. In a database of patient conditions, for example, some values are constant, some change continuously over time (such as weight and height), and some change with the situation so that only the most recent observation matters (such as pulse rate). Rapidly changing data can invalidate previously discovered patterns. In addition, the variables in a given application database can themselves be modified, deleted, or added over time. This problem is addressed with incremental methods that update the patterns, and by treating change as an opportunity, using it to look for patterns of change.

    Irrelevant fields: another important characteristic is the irrelevance of data, meaning that a data item is irrelevant to the current focus of mining. A related aspect is the applicability of an attribute to only a subset of the database. For example, a Nostro account number field does not apply to individual customers.

    Missing values: the presence or absence of values for relevant attributes can affect data mining. In an interactive system, a missing important value may lead to a request for the value or to a test that determines it. Alternatively, the absence of data can be treated as a condition in itself: the missing attribute is given a special value meaning "unknown".

    Missing fields: an incomplete view of the database can make valid data look erroneous. The view of the database must expose all the attributes that a mining algorithm could use to solve the problem. Suppose the available attributes are supposed to distinguish the situations of interest. If they fail to do so, the data appear to contain errors. For a system that learns to diagnose malaria from a patient database, records of patients with identical symptoms but different diagnoses look like errors in the data. The same problem is common in business databases. Important attributes may be missing if the data were not prepared with data mining in mind.

    Noise and uncertainty: for relevant attributes, the severity of an error depends on the data type of the permitted values. Attribute values may be real numbers, integers, strings, or members of a set of nominal values. Nominal values may be partially or totally ordered, and may even have a semantic structure.

    Another source of uncertainty is the inherent precision that the data are required to have, in other words the noise in the measurements. Based on prior measurement and analysis, a statistical model describing the randomness is built and used to define the expected value and the tolerance of the data. Such statistical models are often applied ad hoc, to determine attributes subjectively, obtain statistics, and judge the acceptability of attribute values (or combinations of values). For numeric data in particular, the accuracy of the data can be a factor in mining. When measuring body temperature, for example, a deviation of 0.1 degrees is usually tolerated. An analysis of trends in body temperature, however, requires higher precision. For a mining system to relate such trends to a diagnosis, the noise level of the input data has to be taken into account.

    Complex relationships between fields: hierarchically structured attributes or values, relationships among attributes, and sophisticated means of expressing knowledge about the content of the database all require algorithms that can use this information effectively. Initially, data mining techniques were developed only for records with simple attribute values. Today, techniques are being developed to extract relationships among such variables.

    2.8.2. Other issues

    Overfitting: when an algorithm searches for the best parameters of a model using a finite data set, it may overfit the data, that is, search beyond what is needed so that the model fits only those data and cannot cope with unseen data. The model then performs very poorly on test data. Remedies include cross-validation, applying a guiding principle such as regularization, and other statistical measures.

    Assessing statistical significance: a problem related to overfitting arises when a system searches over many models. For example, if a system tests N models at the 0.001 significance level, then on purely random data an average of N/1000 models will be accepted as significant. To handle this, the test statistic can be adjusted as a function of the search, for example with the Bonferroni adjustment for independent tests.
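The arithmetic behind this example is short enough to spell out. The sketch assumes independent tests and uses a hypothetical count of 5000 models; the function names are illustrative.

```python
def expected_false_positives(num_models, alpha):
    """Expected number of models accepted by chance on purely random data."""
    return num_models * alpha


def bonferroni_threshold(num_models, alpha):
    """Per-test significance level that keeps the overall level at most alpha."""
    return alpha / num_models


n_models = 5000  # hypothetical number of models examined
chance_hits = expected_false_positives(n_models, 0.001)  # N / 1000
per_test_alpha = bonferroni_threshold(n_models, 0.001)
print(chance_hits, per_test_alpha)
```

With the adjusted threshold, each individual model has to pass a much stricter test before it is reported as significant.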

    Understandability of patterns: in many applications it is important that what is discovered be as understandable to people as possible. Solutions include graphical presentation, rule structures built with directed graphs (Gaines), natural language representation (Matheus et al.), and other techniques for representing knowledge and data.

    Interaction with the user and prior knowledge: many data mining tools and methods are not truly interactive and cannot easily incorporate prior knowledge. Using domain knowledge is very important in data mining. Several remedies have been proposed, such as using a deductive database to discover knowledge that is then used to guide the mining search, or using prior probability distributions over the data as a way of encoding existing knowledge.

    Exercises:

    1. What is data mining?
    2. What are the main tasks of the data mining process?
    3. Describe the basic differences between data mining and methods such as machine learning and statistics.
    4. What are the steps of the data mining process?
    5. Give an example of a real-world application of data mining.


    Chapter 3: Data preprocessing

    3.1. Purpose

    Data mining techniques all operate on large databases and data sources. These are the result of continuously recording information about human activity, natural processes, and so on. Such stored data are naturally in raw form, not yet ready for discovering the information hidden in them. They therefore need to be cleaned and transformed into suitable forms before any analysis is carried out.

    Before useful information can be extracted, or mining methods such as classification and prediction applied, the raw source data must go through several transformation steps. These steps can be carried out in many ways depending on the need, for example reducing the data size, selecting the data that really matter, limiting the range of real-time data, or modifying and adjusting the data to best fit the requirements. Of course, one should not expect too much from a computer finding useful knowledge without human help, nor expect that a data source transformed for one problem will also suit another mining problem.

    For example, an electronics company asks for an analysis of sales data from its branches. The analyst must carefully inspect the sales database of the whole company, as well as the warehouse data, to identify and select the attributes or dimensions to include in the analysis, such as product category, product, price, and selling branch. It is unavoidable, however, that day-to-day transactions contain some errors made by sales staff when recording them. The errors are varied, from information not being recorded at all to information recorded in a way that deviates from the usual rules and standards. The analysis can therefore hardly proceed if the source data are left in their original state: incomplete (missing attribute values, or certain attributes containing only aggregate data), noisy (containing errors, or values outside the expected range), and inconsistent (for example, discrepancies in the branch codes used for categorization).

    The situations in this example are entirely realistic. Data may simply not have been considered important at the time of collection, or relevant data may not have been recorded because of a misunderstanding or an equipment malfunction. Recorded data may also have been deleted after some earlier review, and the history of changes to transactions may have been dropped, with only the aggregate information at the time of review kept. Hence the need for data cleaning: filling in missing values, smoothing noisy data, and removing meaningless or contradictory values.

    Preparing data for data mining usually involves:

    - Data cleaning;
    - Data integration;
    - Data transformation;
    - Data reduction.

    3.2. Data cleaning

    3.2.1. Missing values

    Consider a data warehouse for sales and customer management. It may contain one or more values that are hard to collect, such as customer income. How can this information be obtained? Consider the following methods.

    - Ignore the tuple: this is usually done when the class label is missing. The method is not always effective, unless the attributes involved are not really important.

    - Fill in the missing values manually: this is usually time-consuming and may not be feasible for a large data source with many missing values.

    - Fill in a conventional constant: replace all missing values of an attribute with the same conventional constant, such as the label "Unknown" or "∞". This may, however, mislead the mining program in some cases and produce unreasonable conclusions.

    - Fill in the attribute mean: for example, if the average income per person in an area is known to be 800,000, this value can be substituted for the missing income of customers in that area.

    - Fill in values from tuples of the same category: for example, if customer A falls in the same credit-risk group as another customer B whose average income is known, that value can be used to fill in the average income of customer A.

    - Fill in the most probable value: this can be determined by regression, by inference tools based on Bayesian theory, or by decision trees.
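Three of the strategies above (a conventional constant, the attribute mean, and the mean within the same category) can be sketched as follows. The customer records, attribute names, and function names are hypothetical; missing values are represented as `None`.

```python
from collections import defaultdict
from statistics import mean

UNKNOWN = "Unknown"  # conventional constant standing in for a missing value


def fill_with_constant(rows, attr, constant=UNKNOWN):
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_with_mean(rows, attr):
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]


def fill_with_group_mean(rows, attr, group_attr):
    """Use the mean of the tuples in the same group (e.g. same credit-risk class).

    Assumes every group has at least one known value for `attr`.
    """
    known = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            known[r[group_attr]].append(r[attr])
    return [
        {**r, attr: mean(known[r[group_attr]]) if r[attr] is None else r[attr]}
        for r in rows
    ]


# Hypothetical customers; customer A has no recorded income.
customers = [
    {"name": "A", "risk": "low", "income": None},
    {"name": "B", "risk": "low", "income": 900_000},
    {"name": "C", "risk": "high", "income": 500_000},
    {"name": "D", "risk": "low", "income": 1_000_000},
]
by_mean = fill_with_mean(customers, "income")
by_group = fill_with_group_mean(customers, "income", "risk")
by_constant = fill_with_constant(customers, "income")
print(by_mean[0]["income"], by_group[0]["income"], by_constant[0]["income"])
```

Each function returns new rows and leaves the input untouched, so the strategies can be compared side by side on the same data.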

    3.2.2. Noisy data

    Noise is a random error or variance in a measured variable, or the result of uncontrolled recording mistakes. Given an attribute such as price, how can it be smoothed to remove the noise? Consider the following smoothing techniques.

    Sorted prices of the items: 4, 8, 15, 21, 21, 24, 25, 28, 34

    Partition into bins:
    Bin 1: 4, 8, 15
    Bin 2: 21, 21, 24
    Bin 3: 25, 28, 34

    Smoothing by bin means:
    Bin 1: 9, 9, 9
    Bin 2: 22, 22, 22
    Bin 3: 29, 29, 29

    Smoothing by bin boundaries:
    Bin 1: 4, 4, 15
    Bin 2: 21, 21, 24
    Bin 3: 25, 25, 34

    Table 3.1. Example of smoothing by binning

    a. Binning: a data value is smoothed by consulting the values around it. For example, the prices are first sorted and then partitioned into bins of equal size 3 (each bin contains 3 values).

    - Smoothing by bin means: each value in a bin is replaced by the mean of the values in that bin.

    - Smoothing by bin boundaries: the minimum and maximum values of the bin are taken as its boundaries. Every other value in the bin is replaced by whichever of the two boundaries is closer to it.

    For example, bin 1 contains the values 4, 8, 15, whose mean is 9. Smoothing by bin means therefore replaces each value with 9. With smoothing by bin boundaries, the value 8 is closer to 4 than to 15, so it is replaced by 4.
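The two smoothing variants can be checked against Table 3.1 with a few lines of Python. The function names are illustrative; ties in boundary smoothing are resolved toward the lower boundary, which is an assumption the lecture does not address.

```python
def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    # Means are rounded to whole numbers, as in Table 3.1.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```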

    b. Regression: the most common method is linear regression, which finds the best relationship between two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression extends this to more than two attributes, fitting the data to a multidimensional surface.
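For the two-attribute case, smoothing by regression amounts to fitting a least-squares line and replacing each observed value by the value on the line. The sketch below uses invented data and illustrative function names.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def smooth_by_regression(xs, ys):
    """Replace each y by the value predicted from x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical data: quantity ordered (x) and noisy total price (y).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(smooth_by_regression(xs, ys))
```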

    Figure 3.1. Clustering customer data based on address information

    c. Clustering: similar values are organized into groups, or "clusters". Values that fall outside these clusters are treated as candidates for smoothing.

    3.3. Data integration and transformation

    3.3.1. Data integration

    In many analysis problems, the data to be analyzed do not come from a single uniform source. To make analysis possible, the data must be integrated and combined into one coherent data store. The sources can take many forms: common databases, flat files, data cubes, and so on. The question is how to integrate them while keeping the information from the different sources equivalent.


    For example, how can the analyst or the computer be sure that the attribute id of a customer in database A and the attribute cust in a flat file refer to the same thing?

    Integration always needs information describing the properties of each attribute (metadata), such as its name, meaning, data type, range of permitted values, and the rules for handling blank, zero, or null values. The metadata are used to help transform the data, so this step is also related to data cleaning.

    Data redundancy: this is another important issue. An attribute such as annual revenue may be redundant if it can be derived from another attribute or set of attributes.

    Some redundancies can be detected by correlation analysis. Given two attributes, correlation analysis measures how strongly one attribute depends on the other, based on the available data. For numeric attributes, the correlation between attributes A and B can be evaluated by computing the correlation coefficient:

    $$r_{A,B} = \frac{\sum_{i=1}^{N}(a_i - \bar{A})(b_i - \bar{B})}{N\sigma_A\sigma_B} = \frac{\sum_{i=1}^{N}(a_i b_i) - N\bar{A}\bar{B}}{N\sigma_A\sigma_B}$$

    where:

    - $N$ is the number of tuples
    - $a_i$ and $b_i$ are the values of attributes A and B in tuple $i$
    - $\bar{A}$ and $\bar{B}$ are the mean values of A and B
    - $\sigma_A$ and $\sigma_B$ are the standard deviations of A and B
    - $\sum(a_i b_i)$ is the sum of the AB cross-products (for each tuple, the value of A is multiplied by the value of B in that tuple)
    - Note that $-1 \le r_{A,B} \le +1$

    If $r_{A,B}$ is greater than 0, A and B are positively correlated: as the value of A increases, so does the value of B. The higher the value, the stronger the relationship. Consequently, a high value indicates that one of the two attributes A or B can be removed as redundant.

    If $r_{A,B}$ equals 0, A and B are uncorrelated: there is no linear relationship between them.

    If $r_{A,B}$ is less than 0, A and B are negatively correlated: as one attribute increases, the other decreases.

    Note that a correlation between A and B does not imply causality, that is, it does not mean that a change in A or B is caused by the other attribute. Consider, for example, the correlation between the number of hospitals and the number of car accidents in a region. The two attributes have no direct causal relationship; both are causally linked to a third attribute, the population.
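The correlation coefficient for two numeric attributes can be computed directly, using the population standard deviation (division by N). The data and the function name `pearson` are illustrative assumptions.

```python
import math


def pearson(a, b):
    """(sum(a_i * b_i) - N * mean_A * mean_B) / (N * sigma_A * sigma_B)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sigma_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum(x * y for x, y in zip(a, b))
    return (cross - n * mean_a * mean_b) / (n * sigma_a * sigma_b)


# Hypothetical attributes: yearly revenue is exactly 12 times monthly revenue,
# so it is redundant; the discount column moves in the opposite direction.
monthly = [10, 20, 30, 40]
yearly = [120, 240, 360, 480]
discount = [4, 3, 2, 1]
print(pearson(monthly, yearly))
print(pearson(monthly, discount))
```

A coefficient close to +1 or -1, as for these made-up columns, suggests that one attribute of the pair carries little extra information.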

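    For discrete data, a correlation between two attributes A and B can be discovered with the $\chi^2$ (chi-square) test. Suppose A has $c$ distinct values $a_1, a_2, \ldots, a_c$, and B has $r$ distinct values $b_1, b_2, \ldots, b_r$. A table describing the relationship between A and B is built as follows:

    - The $c$ values of A form the columns.
    - The $r$ values of B form the rows.
    - $(A_i, B_j)$ denotes the cases in which attribute A takes value $a_i$ and attribute B takes value $b_j$.

    The $\chi^2$ value is computed as

    $$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

    where:

    - $o_{ij}$ is the observed frequency of the cases $(A_i, B_j)$
    - $e_{ij}$ is the expected frequency of the cases $(A_i, B_j)$

    $$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}$$

    Here $N$ is the total number of tuples, $count(A = a_i)$ is the number of tuples with value $a_i$ for attribute A, and $count(B = b_j)$ is the number of tuples with value $b_j$ for attribute B.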

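    Example: correlation analysis of attributes using the $\chi^2$ test.

    Suppose a group of 1500 people is surveyed. Their gender is recorded, and they are then asked whether their preferred type of reading is fiction or non-fiction. There are thus two attributes, "gender" and "reading preference". The observed counts are given in the following table, with expected frequencies in parentheses.

                     Male        Female        Total
    Fiction          250 (90)    200 (360)     450
    Non-fiction      50 (210)    1000 (840)    1050
    Total            300         1200          1500

    The expected frequencies follow from the formula for $e_{ij}$; for example, for (male, fiction) it is $300 \times 450 / 1500 = 90$. We then obtain

    $$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 284.44 + 121.90 + 71.11 + 30.48 = 507.93$$

    Note that in each row the sum of the expected frequencies (shown in parentheses) equals the sum of the observed frequencies of that row, and likewise the sum of the expected frequencies in each column equals the sum of the observed frequencies of that column.

    For this table the number of degrees of freedom is (r-1)(c-1) = (2-1)(2-1) = 1. With 1 degree of freedom, the $\chi^2$ value needed to reject the independence hypothesis at the 0.001 significance level is 10.828. The computed value of 507.93 is far above this, so the hypothesis that reading preference is independent of gender is rejected: the two attributes are fairly strongly correlated in the surveyed group.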
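The statistic in this example can be reproduced with a short function that derives the expected frequencies from the row and column totals. The function name `chi_square` is illustrative; the counts are those of the survey table (250, 200, 50, 1000).

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency e_ij
            stat += (o - e) ** 2 / e
    return stat


# Rows: the two reading preferences; columns: male, female.
observed = [[250, 200], [50, 1000]]
stat = chi_square(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(stat, 2), degrees_of_freedom)
```

Comparing `stat` with the critical value 10.828 quoted in the text gives the same conclusion as the hand calculation.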

    3.3.2. Data transformation

    In this step the data are transformed into forms suitable for mining. Common methods include:

    - Smoothing: removes noise from the data, using methods such as binning, regression, and clustering.

    - Aggregation: summary or aggregation operations are applied to the data. For example, daily sales figures can be aggregated into monthly and yearly totals. This step is typically used to build a data cube for analysis.

    - Generalization: low-level or raw data are replaced by higher-level concepts through a concept hierarchy. For example, a categorical attribute such as "street" can be generalized to higher-level concepts such as "city" or "country". Similarly, numeric values such as age can be mapped to higher-level concepts such as "young", "middle-aged", and "senior".

    - Normalization: the attribute data are scaled to fall within a small range, such as -1.0 to 1.0 or 0.0 to 1.0.

    - Attribute construction: new attributes are added to the data source to help the mining process.

    This section looks at normalization in more detail. An attribute is normalized by scaling its values so that they fall within a specified range, such as 0.0 to 1.0. Normalization is particularly useful for classification algorithms involving neural networks, and for distance-based algorithms such as nearest-neighbor classification and clustering. We consider three methods: min-max, z-score, and decimal scaling.

    a. Min-max

    Min-max normalization performs a linear transformation of the original data. Suppose $min_A$ and $max_A$ are the minimum and maximum values of attribute A. Min-max normalization maps a value $v$ of A to $v'$ in the range $[new\_min_A, new\_max_A]$ by computing

    $$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$$

    Example: suppose the minimum and maximum values of the attribute "average income" are 500,000 and 4,500,000, and we want to map the value 2,500,000 to the range [0.0, 1.0] using min-max normalization. The new value is

    $$\frac{2{,}500{,}000 - 500{,}000}{4{,}500{,}000 - 500{,}000}(1.0 - 0.0) + 0.0 = 0.5$$

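    b. z-score

    With this method the values of attribute A are normalized based on the mean and standard deviation of A. A value $v$ of A is mapped to $v'$ as follows:

    $$v' = \frac{v - \bar{A}}{\sigma_A}$$

    Continuing the example above: suppose the standard deviation and mean of average income are 1,000,000 and 500,000 respectively. With z-score normalization the value 2,500,000 is mapped to

    $$\frac{2{,}500{,}000 - 500{,}000}{1{,}000{,}000} = 2.0$$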

    c. Decimal scaling

    This method moves the decimal point of the values of attribute A. The number of places moved depends on the maximum absolute value of A. A value $v$ is mapped to $v'$ by computing

    $$v' = \frac{v}{10^j}$$

    where $j$ is the smallest integer such that $\max(|v'|) < 1$.

    Example: suppose the recorded values of attribute A range from -986 to 917. The maximum absolute value is 986. To normalize by decimal scaling, each value is divided by 1,000 (j = 3). Thus -986 becomes -0.986 and 917 becomes 0.917.
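The three normalization methods can be written as small functions. The function names are illustrative, and the z-score call uses hypothetical figures (mean 54,000, standard deviation 16,000) rather than numbers from the lecture.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linear rescaling of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Distance of v from the mean, measured in standard deviations."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |v'| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


print(min_max(2_500_000, 500_000, 4_500_000))
print(z_score(73_600, mean_a=54_000, std_a=16_000))  # hypothetical figures
print(decimal_scaling([-986, 917]))
```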

    3.3.3. Data reduction

    Data mining is usually carried out on huge and complex data stores. Applying mining techniques to them takes a great deal of time and computing resources, so the data need to be reduced before mining. Some data reduction strategies are:

    - Data cube aggregation: aggregation operations are applied to the data to build data cubes.

    - Attribute subset selection: irrelevant, weakly relevant, or redundant attributes or dimensions are removed.

    - Dimensionality reduction: encoding mechanisms are used to reduce the size of the data.

    - Numerosity reduction: the data are replaced by smaller alternative representations that describe the same thing.

    - Discretization and concept hierarchy generation: attribute values are replaced by ranges or higher-level concepts. Discretization is a form of numerosity reduction that is very useful for automatically generating concept hierarchies. It allows mining at multiple levels of abstraction.

    a. Data cube aggregation

    Consider the sales data of an organization, reported per quarter for the years 2008 to 2010. Suppose the mining task is interested in yearly sales rather than quarterly figures. The data should then be aggregated into total sales per year instead of per quarter.

    Figure 3.2. Sales data


    Figure 3.3. Aggregated data

    A concept hierarchy may exist for each attribute, allowing the data to be analyzed at multiple levels of abstraction. For example, a hierarchy for branches allows branches to be grouped into regions based on their address. Data cubes provide fast access to precomputed, summarized data, which makes them well suited to mining.

    The cube created at the lowest level of abstraction is called the base cuboid. The base cuboid corresponds to an individual entity of interest, such as salesperson or customer, and provides useful information for the analysis. The cube at the highest level of abstraction is the apex cuboid; for the sales data in Figure 3.3, it would give the total sales for all three years, all product types, and all branches. Data cubes created for different levels of abstraction are referred to as cuboids, so a data cube is also called a lattice of cuboids.
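The quarter-to-year aggregation described in this subsection can be sketched as a simple roll-up. The sales figures below are invented for illustration and are not the values shown in the lecture's figures.

```python
from collections import defaultdict

# Hypothetical quarterly sales: (year, quarter, amount).
quarterly_sales = [
    (2008, "Q1", 224), (2008, "Q2", 408), (2008, "Q3", 350), (2008, "Q4", 586),
    (2009, "Q1", 300), (2009, "Q2", 416), (2009, "Q3", 380), (2009, "Q4", 610),
]


def roll_up_to_year(rows):
    """Sum the quarterly amounts of each year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)


yearly_sales = roll_up_to_year(quarterly_sales)
print(yearly_sales)
```

The result has one row per year instead of four, which is the reduction in data volume that cube aggregation is meant to achieve.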

    b. Attribute subset selection

    The data to be analyzed may contain hundreds of attributes, many of which are irrelevant to the analysis or redundant. For example, if the task is to classify customers by whether or not they are likely to buy a new music CD, the customer's telephone number is irrelevant, unlike attributes such as age or music taste. Choosing which attributes matter is nevertheless difficult and time-consuming, especially when the behavior of the data is not well understood. Keeping irrelevant attributes or leaving out relevant ones can confuse the mining algorithm and distort its results.

    Attribute subset selection reduces the data size by removing irrelevant or redundant attributes (dimensions). The goal is to find the smallest set of attributes such that the mining results are as close as possible to those obtained using all attributes.


    So how do we find a good subset of the original attributes? Note that N attributes have $2^N$ possible subsets. Generating and examining all of them is very costly in effort and resources, especially as N and the number of data classes grow. Other methods are therefore needed. One of them is greedy search, which moves through the attribute space and always makes the choice that looks best at the time.

    Forward selection           Backward elimination        Decision tree induction

    Initial attribute set:      Initial attribute set:      Initial attribute set:
    {A1, A2, A3, A4, A5, A6}    {A1, A2, A3, A4, A5, A6}    {A1, A2, A3, A4, A5, A6}

    Initial reduced set:        => {A1, A3, A4, A5, A6}     => Result: {A1, A4, A6}
    {}                          => {A1, A4, A5, A6}
    => {A1}                     => Result: {A1, A4, A6}
    => {A1, A4}
    => Result: {A1, A4, A6}

    Table 3.2. Examples of reduction techniques

    Whether an attribute is "good" or "bad" is usually determined by statistical significance tests, which assume that the attribute under consideration is independent of the others, or by attribute evaluation measures such as the information measure used when building decision trees for classification.

    Commonly used selection techniques are:

    1. Forward selection: start from an empty attribute set. At each step, determine the best of the remaining attributes and add it to the set. Repeat until no further attribute can be added.

    2. Backward elimination: start from the full attribute set. At each step, remove the worst remaining attribute.

    3. Combined forward selection and backward elimination: at each step, add the best attribute to the set and at the same time remove the worst attribute from the set under consideration.

    4. Decision tree induction: a tree is built from the original data. All attributes that do not appear in the tree are considered irrelevant. The attributes that appear in the tree form the reduced attribute set.
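    As a rough sketch of technique 1 (forward selection), the greedy loop below adds the attribute that most improves a score and stops when nothing improves it. The scoring function toy_score and its weights are hypothetical stand-ins for a real evaluation measure (for example a statistical test or information gain); they were chosen only so that the run retraces the first column of Table 3.2.

```python
def forward_selection(attributes, score):
    """Greedy forward selection: grow the set one best attribute at a time."""
    selected = []
    best_score = score(selected)
    remaining = list(attributes)
    while remaining:
        # Pick the attribute whose addition gives the highest score.
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # no remaining attribute improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Hypothetical usefulness weights; attributes not listed count as harmful.
USEFUL = {"A1": 3.0, "A4": 2.0, "A6": 1.0}


def toy_score(subset):
    """Toy evaluation measure: reward useful attributes, penalize the rest."""
    return sum(USEFUL.get(a, -0.5) for a in subset)


reduced = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
print(reduced)
```

    Backward elimination follows the same pattern in the other direction: start from the full list and repeatedly drop the attribute whose removal helps the score most.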


    Exercises:

    1. If an attribute in the Student-Grades data source takes the values A, B, C, D, F, what is the expected data type of that attribute during preprocessing?

    2. Given the one-dimensional array X = {5.0, 23.0, 17.6, 7.23, 1.11}, normalize it using:
       a. Decimal scaling on the interval [-1, 1].
       b. Min-max normalization on the interval [0, 1].
       c. Min-max normalization on the interval [-1, 1].
       d. Standard deviation normalization.
       e. Compare the results of the normalizations above and comment on the advantages and disadvantages of each method.

    3. Smooth the following set by rounding:
       Y = {1.17, 2.59, 3.38, 4.23, 2.67, 1.73, 2.53, 3.28, 3.44}
       Then represent the resulting set with precision:
       a. 0.1
       b. 1.

    4. Given a sample set with missing values ("?" marks a missing value):
       o X1 = {0, 1, 1, 2}
       o X2 = {2, 1, ?, 1}
       o X3 = {1, ?, ?, 0}
       o X4 = {?, 2, 1, ?}
       If the domain of every attribute is {0, 1, 2}, determine the missing values, given that each of them can be any value of the domain.

    Explain what is gained and what is lost when the dimensionality of a large data warehouse is reduced during data preprocessing.


    Chapter 4: Association rules

    4.1. Basic concepts

    Since it was introduced in 1993, the problem of mining association rules has attracted a great deal of attention from researchers. Today, mining such rules is still one of the most popular pattern-mining methods in knowledge discovery and data mining (KDD: Knowledge Discovery and Data Mining).

    The main purpose of data mining is to extract knowledge that can be used for forecasting and to support business operations and scientific research.

    In business, for example retail sales in a supermarket, managers like to have statistical information such as "90% of women who own a red motorbike and wear a Swiss watch use Chanel perfume" or "70% of customers who are manual workers buy a 21-inch TV when they buy a TV". Such information is very useful for guiding business decisions. The question is whether rules like these can be found with data mining tools. The answer is yes, and that is exactly the task of association rule mining.

    Suppose we have a database D. An association rule describes the extent to which the occurrence of a set of attributes S in the records of D implies the occurrence of another set of attributes U in the same records. Each association rule is characterized by a pair of ratios, its support and its confidence. The support is expressed as the percentage of records in D that contain both S and U.

    The association rule discovery problem is stated as follows:

    Given a support ratio $\theta$ and a confidence $\beta$, enumerate all rules in D whose support and confidence are greater than $\theta$ and $\beta$, respectively.

    Example: let D be a sales database, with $\theta = 40\%$ and $\beta = 90\%$. Discovering association rules then means listing (counting) all rules stating that the occurrence of some items implies the occurrence of some other items, considering only the rules whose support is greater than 40% and whose confidence is greater than 90%.

    As another example, imagine a company that sells goods over the Internet. Customers are asked to fill in order forms, so the company has a database of customer requests. Suppose the company is interested in the relationship "age, gender, occupation => product". Many questions can then be asked that correspond to this rule. For example: in which age group do female customers who are manual workers most often order a particular item, say an ao dai (above some threshold)?

    4.2. Association rules

    4.2.1. Theory of association rules

    Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of m items. A transaction T is defined as a subset of the items in I ($T \subseteq I$). As with sets in general, a transaction contains no duplicate items. This set property can be relaxed, and the algorithms below assume that the items in a transaction, and in every other itemset, are sorted in lexicographic order.

    Let D be a database of n transactions, each labeled with a unique transaction identifier (TID). A transaction $T \in D$ is said to support an itemset $X \subseteq I$ if it contains all items of X, that is, $X \subseteq T$. In some cases the notation T(X) is used for the set of transactions that support X. The support of X, written support(X) (or sup(X), s(X)), is the percentage of transactions in D that support X:

    $$\mathrm{sup}(X) = \frac{|\{T \in D : X \subseteq T\}|}{|D|} \qquad (2.1.1)$$

    The minimum support minsup is a value specified by the user. If an itemset X has $\mathrm{sup}(X) \ge \mathit{minsup}$, X is called a frequent itemset (or large itemset). Frequent itemsets are the sets of interest in the algorithms; itemsets that are not frequent are not of interest. In what follows, the phrases "X has minimum support" and "X does not have minimum support" mean that X does or does not satisfy $\mathrm{support}(X) \ge \mathit{minsup}$.

    An itemset X is called a k-itemset if its cardinality is k, that is, $|X| = k$.

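    4.2.2. Definition of an association rule

    An association rule has the form R: X => Y, where X and Y are itemsets, $X, Y \subseteq I$ and $X \cap Y = \emptyset$. X is called the antecedent and Y the consequent of the rule.

    A rule X => Y has a confidence c (confidence, conf). The confidence c is defined as the probability that a transaction T that supports X also supports Y. It is computed as:

    $$\mathrm{conf}(X \Rightarrow Y) = p(Y \subseteq T \mid X \subseteq T) = \frac{p(Y \subseteq T \wedge X \subseteq T)}{p(X \subseteq T)} = \frac{\mathrm{sup}(X \cup Y)}{\mathrm{sup}(X)} \qquad (2.1.2)$$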

    However, not every association rule that can be generated is meaningful in practice. Rules must satisfy given support and confidence thresholds. Given a set of transactions D, the association rule discovery problem is to generate all association rules whose confidence conf is greater than the minimum confidence mincon and whose support sup is greater than the minimum support minsup, both specified by the user. Association rule mining is split into two subproblems:

    Problem 1: Find all itemsets whose support is greater than the user-specified minimum support. Itemsets that satisfy the minimum support are called frequent itemsets.

    Problem 2: Use the frequent itemsets to generate the desired rules. The general idea is that if ABCD and AB are frequent itemsets, we can decide whether the rule AB => CD is kept by computing its confidence:

    $$\mathrm{conf} = \frac{\mathrm{sup}(ABCD)}{\mathrm{sup}(AB)} \qquad (2.1.3)$$

    If $\mathrm{conf} \ge \mathit{mincon}$, the rule is kept (it automatically satisfies the minimum support because ABCD is frequent).
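    Problem 2 can be sketched as follows, assuming the supports of a frequent itemset and of all its non-empty subsets are already known. The function name rules_from_itemset and the support counts (out of 10 hypothetical transactions) are assumptions made for this illustration.

```python
from itertools import combinations


def rules_from_itemset(itemset, support, mincon):
    """Return all rules X => Y with X u Y = itemset and conf >= mincon.

    `support` maps frozensets to support counts (or ratios) and must
    contain the itemset and every non-empty proper subset of it.
    """
    items = frozenset(itemset)
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(sorted(items), size):
            x = frozenset(antecedent)
            y = items - x
            conf = support[items] / support[x]  # sup(X u Y) / sup(X)
            if conf >= mincon:
                rules.append((x, y, conf))
    return rules


# Hypothetical support counts over 10 transactions.
counts = {
    frozenset("A"): 6, frozenset("B"): 7, frozenset("C"): 5,
    frozenset("AB"): 5, frozenset("AC"): 4, frozenset("BC"): 4,
    frozenset("ABC"): 4,
}

rules = rules_from_itemset("ABC", counts, mincon=0.75)
for x, y, conf in rules:
    print(sorted(x), "=>", sorted(y), round(conf, 2))
```

    Because the whole itemset is frequent, every rule produced this way already meets the minimum support, so only the confidence needs to be tested.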

    4.2.3. Properties of frequent itemsets

    Property 1. Support of subsets: if $A \subseteq B$, where A and B are itemsets, then $\mathrm{sup}(A) \ge \mathrm{sup}(B)$, because every transaction of D that supports B also supports A.

    Property 2. If an itemset A does not have minimum support on D, that is, support(A) < minsup [...]

    [...] => international call = yes AND call to service 108 = yes, with support 20% and confidence 80%.

    Association rules with quantitative and categorical attributes (quantitative and categorical association rules): attributes in real databases come in many types (binary, quantitative, categorical, ...). To discover association rules over such attributes, researchers have proposed discretization methods that convert these rules to the binary form so that existing algorithms can be applied. An example of this kind of rule: "call method = Automatic AND call time IN [23:00:39 .. 23:00:59] AND call duration IN [200 .. 300] => inter-provincial call = yes", with support 23.53% and confidence 80%.

    Association rules based on rough sets (mining association rules based on rough sets): the search for association rules is based on rough set theory.

    Multi-level association rules: this approach also looks for rules of the form "buy a PC => buy an operating system AND buy office software", instead of only very specific rules such as "buy an IBM PC => buy the Microsoft Windows operating system AND buy Microsoft Office". The first form is a generalization of the second, and the generalization can be done at several levels.

    Fuzzy association rules: because of the limitations of discretizing quantitative attributes, researchers have proposed fuzzy association rules to overcome them and to express rules in a form that is more natural and closer to the user. An example: "private subscriber = yes AND call duration is long AND local charge = yes => invalid charge = yes", with support 4% and confidence 85%. In this rule, the condition "call duration is long" on the left-hand side is a fuzzified attribute.

    Association rules with weighted items: in practice, the attributes in a database do not always play the same role. Some attributes are emphasized and are more important than others. For example, when studying monthly revenue, call duration and charge zone are far more important than call method. During the rule search, call duration and charge zone are therefore given larger weights than call method. This is an interesting research direction, and several researchers have proposed solutions. With weighted attributes we can mine "rare" rules, that is, rules with low support that are nevertheless of special significance.

    Parallel mining of association rules: besides sequential mining, researchers have also studied parallel algorithms for association rule discovery. Parallel and distributed processing are needed because data sizes keep growing, which places demands on the processing speed and memory capacity of the system. Many parallel algorithms have been proposed, and they may be independent of the hardware.

    In addition to research on variants of association rules, researchers have also focused on algorithms that speed up the search for frequent itemsets in databases.


    There are also other research directions in association rule mining, such as online association rule mining and mining that connects online to multidimensional data warehouses through OLAP (Online Analytical Processing) technology, MOLAP (Multidimensional OLAP), ROLAP (Relational OLAP), ADO (ActiveX Data Objects) for OLAP, and so on.

    4.3. Statement of the association rule discovery problem

    Consider transactions recording the items that customers buy in a supermarket. The set of items is I = {Bread, Butter, Milk, Eggs}, and there are 4 purchase transactions (|T| = 4), where T = {1, 2, 3, 4} denotes the transaction identifiers (TID).

    TID   Items in the transaction
    1     Bread, Butter, Eggs
    2     Butter, Milk, Eggs
    3     Butter
    4     Bread, Butter

    From the transaction table we obtain:

    No.   Itemset                       Support
    1     (no items)                    0 %
    2     Bread                         50 %
    3     Butter                        100 %
    4     Milk                          25 %
    5     Eggs                          50 %
    6     Bread, Butter                 50 %
    7     Bread, Milk                   0 %
    8     Bread, Eggs                   25 %
    9     Butter, Milk                  25 %
    10    Butter, Eggs                  50 %
    11    Milk, Eggs                    25 %
    12    Bread, Butter, Milk           0 %
    13    Bread, Butter, Eggs           25 %
    14    Bread, Milk, Eggs             0 %
    15    Butter, Milk, Eggs            25 %
    16    Bread, Butter, Milk, Eggs     0 %


    With minimum support minsup = 50 %, the frequent itemsets are:

    No.   Frequent itemset   Support
    1     Bread              50 %
    2     Butter             100 %
    3     Eggs               50 %
    4     Bread, Butter      50 %
    5     Butter, Eggs       50 %

    With minimum confidence mincon = 60 %, we obtain the following rules:

    No.   Rule               Confidence   Satisfies 60%?
    1     Bread => Butter    100 %        Yes
    2     Butter => Bread    50 %         No
    3     Butter => Eggs     50 %         No
    4     Eggs => Butter     100 %        Yes

    Table 2.1. Illustration of the association rule discovery problem
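    The supports and confidences in this example can be checked with a short brute-force script. It enumerates every itemset, which is only practical for a handful of items; the function names support and frequent_itemsets are chosen for this sketch.

```python
from itertools import combinations

# The four supermarket transactions from the example above.
transactions = {
    1: {"Bread", "Butter", "Eggs"},
    2: {"Butter", "Milk", "Eggs"},
    3: {"Butter"},
    4: {"Bread", "Butter"},
}


def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    wanted = set(itemset)
    return sum(1 for t in db.values() if wanted <= t) / len(db)


def frequent_itemsets(db, minsup):
    """Brute force: test every non-empty itemset against minsup."""
    items = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(combo, db)
            if s >= minsup:
                result[frozenset(combo)] = s
    return result


frequent = frequent_itemsets(transactions, minsup=0.5)
for itemset, s in frequent.items():
    print(sorted(itemset), f"{s:.0%}")

# Confidence of Bread => Butter and Butter => Bread.
conf_bread_butter = support({"Bread", "Butter"}, transactions) / support({"Bread"}, transactions)
conf_butter_bread = support({"Bread", "Butter"}, transactions) / support({"Butter"}, transactions)
print(conf_bread_butter, conf_butter_bread)
```

    Real algorithms such as Apriori avoid this exhaustive enumeration by generating only candidates whose subsets are already known to be frequent.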

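    4.4. Discovering association rules on a binary information system

    4.4.1. Formal definitions on a binary information system

    Definition 2.2.3.1: Binary information system

    Given the following sets:

    $O = \{o_1, o_2, \ldots, o_n\}$ is a finite set of n objects,

    $D = \{d_1, d_2, \ldots, d_m\}$ is a finite set of m descriptors,

    $B = \{0, 1\}$.

    The binary information system is defined as $S_B = (O, D, B, \chi)$, where $\chi$ is a mapping $\chi: O \times D \to B$ with $\chi(o, d) = 1$ if object o has descriptor d and $\chi(o, d) = 0$ otherwise.

    Definition 2.2.3.2: Binary information mappings

    Let $S_B = (O, D, B, \chi)$ be a binary information system. Let P(O) be the set of subsets of O and P(D) the set of subsets of D. The binary information mappings $\rho_B$ and $\lambda_B$ are defined as follows:

    $\rho_B: P(D) \to P(O)$, where for $S \subseteq D$, $\rho_B(S) = \{o \in O \mid \forall d \in S, \chi(o, d) = 1\}$

    $\lambda_B: P(O) \to P(D)$, where for $X \subseteq O$, $\lambda_B(X) = \{d \in D \mid \forall o \in X, \chi(o, d) = 1\}$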

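    Definition 2.2.3.3: Binary frequent descriptor sets

    Let $S_B = (O, D, B, \chi)$ be a binary information system and $\theta \in (0, 1)$ a threshold. A set $S \subseteq D$ is a binary frequent descriptor set with threshold $\theta$ if:

    $$\mathrm{card}(\rho_B(S)) \ge \theta \cdot \mathrm{card}(O)$$

    Let $L_B$ be the set of all binary frequent descriptor sets discovered from $S_B$. It has the following property: for all $S \in L_B$, if $T \subseteq S$ then $T \in L_B$.

    $L_{B,h}$ denotes the subset of $L_B$ such that if $X \in L_{B,h}$ then card(X) = h (where h is a positive integer).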

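    Definition 2.2.3.4: Binary frequent association rules and the confidence factor

    Let $S_B = (O, D, B, \chi)$ be a binary information system and $\beta \in (0, 1)$ a threshold. Let L be an element of $L_B$, and let X and Y be two subsets of L such that:

    $L = X \cup Y$, $X \ne \{\}$, $Y \ne \{\}$ and $X \cap Y = \{\}$

    The binary association rule between descriptor set X and descriptor set Y is defined as the information mapping $X \to Y$. The confidence factor of this rule is:

    $$CF_B(X \to Y) = \frac{\mathrm{card}(\rho_B(X) \cap \rho_B(Y))}{\mathrm{card}(\rho_B(X))} \qquad (2.2.3.1)$$

    $R_{B,\beta}$ denotes the set of all binary frequent association rules discovered from $S_B$, where $CF_B(r) \ge \beta$ for all $r \in R_{B,\beta}$.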

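    Definition 2.2.3.5: Binary descriptor vectors and operations

    Let $S_B = (O, D, B, \chi)$ be a binary information system, where $O = \{o_1, o_2, \ldots, o_n\}$ is a finite set of n objects and $D = \{d_1, d_2, \ldots, d_m\}$ is a finite set of m descriptors.

    Binary descriptor vector

    A binary descriptor vector $v_B(X) = (X_1, X_2, \ldots, X_n)$, where $X \subseteq D$, is a vector with n components, each component $X_j$ taking a value in B. Let $VS_B$ be the set of all binary descriptor vectors of $S_B$. If card(X) = 1, then X is a descriptor of $S_B$ and $X_j = \chi(o_j, X)$.

    Definition 2.2.3.6: Product of binary descriptor vectors

    Let $X_1, X_2 \subseteq D$, and let $v_B(X_1) = (X_{11}, X_{12}, \ldots, X_{1n})$ and $v_B(X_2) = (X_{21}, X_{22}, \ldots, X_{2n})$ be elements of $VS_B$. The product of the binary descriptor vectors $v_B(X_1)$ and $v_B(X_2)$ is written $v_B(X_3) = v_B(X_1) \otimes_B v_B(X_2)$, where:

    $v_B(X_3) = (X_{31}, X_{32}, \ldots, X_{3n})$ with $X_{3j} = \min(X_{1j}, X_{2j})$, $j = 1, \ldots, n$, and $X_3 = X_1 \cup X_2 \subseteq D$.

    From the vector $v_B(X_3)$ we know all objects that have every descriptor in $X_1$ and $X_2$. We use $v_B(X_1)$ to represent $\rho_B(X_1)$, $v_B(X_2)$ to represent $\rho_B(X_2)$, and $v_B(X_3)$ to represent $\rho_B(X_3)$.

    Definition 2.2.3.7: Support of a binary descriptor vector

    Let $X_1 \subseteq D$. The support of $v_B(X_1)$, written $\mathrm{sup}_B(v_B(X_1))$, is defined as:

    $$\mathrm{sup}_B(v_B(X_1)) = \{o \in O \mid \forall d \in X_1, \chi(o, d) = 1\} \qquad (1)$$

    It is easy to see that $\mathrm{card}(\mathrm{sup}_B(v_B(X_1))) = \mathrm{card}(\rho_B(X_1))$.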


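    Computing $\mathrm{card}(\rho_B(S))$

    Let $S = \{s_1, s_2, \ldots, s_k\}$ be a subset of D, where each $s_j$ is a descriptor of $S_B$, $j = 1, \ldots, k$. Each $s_j$ corresponds to a binary descriptor vector $v_B(\{s_j\})$. The cardinality of $\rho_B(S)$ is computed as:

    $$\mathrm{card}(\rho_B(S)) = \mathrm{card}(\mathrm{sup}_B(v_B(\{s_1\}) \otimes_B v_B(\{s_2\}) \otimes_B \cdots \otimes_B v_B(\{s_k\}))) \qquad (2)$$

    $VS_{B,h}$ denotes the subset of $VS_B$ that contains only the vectors $v_B(X)$ with $X \subseteq D$ and card(X) = h (h a given positive integer).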
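    The definitions above translate almost directly into code. In the sketch below the binary information system is a small hypothetical 3 x 3 matrix (rows are objects, columns are descriptors), and the helper names descriptor_vector, vector_product and support_count are assumptions for this illustration.

```python
# Hypothetical binary information system: rows are objects o1..o3,
# columns are descriptors d1..d3; chi[i][j] = 1 if object i has descriptor j.
chi = [
    [1, 1, 0],  # o1
    [1, 0, 1],  # o2
    [1, 1, 1],  # o3
]


def descriptor_vector(matrix, j):
    """Binary descriptor vector of the single descriptor in column j."""
    return tuple(row[j] for row in matrix)


def vector_product(v1, v2):
    """Component-wise minimum, i.e. the product of two descriptor vectors."""
    return tuple(min(a, b) for a, b in zip(v1, v2))


def support_count(v):
    """Number of objects whose component is 1, i.e. card(sup_B(v))."""
    return sum(v)


v_d2 = descriptor_vector(chi, 1)
v_d3 = descriptor_vector(chi, 2)
v_d2_d3 = vector_product(v_d2, v_d3)  # vector of the set {d2, d3}
print(v_d2, v_d3, v_d2_d3, support_count(v_d2_d3))
```

    Since the components are 0 or 1, the minimum is the same as a logical AND, which is why an implementation can pack each vector into machine words and work on bits in memory.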

    4.4.2. Algorithm for discovering descriptor sets and binary association rules

    The algorithm is developed from the Apriori-Tid algorithm. It discovers the binary frequent descriptor sets and the binary association rules from a binary information system. It works on bits in memory rather than on the database on disk, which speeds up rule discovery. The inputs are a database and two thresholds: the minimum support minsup and the minimum confidence mincon of the association rules. The algorithm has two phases:

    Phase 1: Discover the frequent descriptor sets for the given threshold minsup.

    Phase 2: Build the association rules for the given threshold mincon.

    Let the binary information matrix $S_B = (O, D, B, \chi)$ and the thresholds $\theta, \beta \in (0, 1)$ be given, where $\theta$ is minsup and $\beta$ is mincon.

    The details of the algorithm are as follows:

    Phase 1: Discover the binary frequent descriptor sets

    1. TraLoi = ∅ ;
    2. Generate L_{B,1} from S_B using procedure 1.a below ;
    3. for (k = 2; L_{B,k-1} ≠ {}; k++)
    4. { Generate L_{B,k} from L_{B,k-1} using procedure 2.a below ;
    5.   TraLoi = TraLoi ∪ L_{B,k-1} ;
    6. }
    7. Return TraLoi ;

    // = = = = = = = =

    1.a. Generate L_{B,1}

    1. L_{B,1} = ∅ ;
    2. for (i = 1; i ≤ m; i++)
    3.   if (card(sup_B(v_B({d_i}))) ≥ θ * card(O))
    4.   { SaveLargeSet({d_i}, L_{B,1}) ;
    5.     SaveDescriptorVector(v_B({d_i}), VS_{B,1}) ;
    6.   }
    7. TraLoi = L_{B,1} ;
    8. Return TraLoi ;

    // Here m = card(D) is the cardinality of the set D.

    2.a. Generate L_{B,k}

    Based on the property that for all S ∈ L_B, T ⊆ S implies T ∈ L_B, we generate L_{B,k} from L_{B,k-1}. To do so, build a matrix whose rows and columns are the elements of L_{B,k-1}.

    1. L_{B,k} = ∅ ;
    2. for (each X, Y ∈ L_{B,k-1} with X ≠ Y)
    3. { T = X ∪ Y ;
    4.   if (card(sup_B(v_B(T))) ≥ θ * card(O) && card(T) == k)
    5.   { SaveLargeSet(T, L_{B,k}) ;
    6.     SaveDescriptorVector(v_B(T), VS_{B,k}) ;
    7.   }
    8. }
    9. TraLoi = L_{B,k} ;
    10. Return TraLoi ;

    Where:

    SaveLargeSet(T, L_{B,k}) is a function that stores a binary frequent descriptor set T in L_{B,k}.

    SaveDescriptorVector(v_B(T), VS_{B,k}) is a function that stores a binary frequent descriptor vector v_B(T) in VS_{B,k}.

    Using (1) and (2), sup_B(v_B(T)) can be computed very quickly at step k of the loop above from the elements of VS_{B,k-1}.

    Phase 2: Discover the binary frequent rules

    1. R_{B,β} = ∅ ; // initialize the rule set as empty
    2. for (each L ∈ L_B)
    3. { for (each X, Y ⊂ L with X ∩ Y = {})
    4.   { if (CF_B(X => Y) ≥ β)
    5.       SaveRule(X => Y, R_{B,β}) ; // store rule X => Y in R_{B,β}
    6.     if (CF_B(Y => X) ≥ β)
    7.       SaveRule(Y => X, R_{B,β}) ; // store rule Y => X in R_{B,β}
    8.   }
    9. }
    10. TraLoi = R_{B,β} ;
    11. Return R_{B,β} ; // end
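    The two phases above can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the original implementation: vectors are plain tuples rather than packed bits, rules are generated only for antecedent and consequent that together make up the whole frequent set (as in Definition 2.2.3.4), and the names frequent_descriptor_sets and association_rules are assumptions. The sample matrix is the 4 x 5 purchase table used in the worked example of Section 4.4.3.

```python
from itertools import combinations


def frequent_descriptor_sets(chi, descriptors, theta):
    """Phase 1: level-wise search; returns {descriptor set: descriptor vector}."""
    threshold = theta * len(chi)
    # Level 1: one vector per single descriptor (a column of chi).
    level = {}
    for j, d in enumerate(descriptors):
        v = tuple(row[j] for row in chi)
        if sum(v) >= threshold:
            level[frozenset([d])] = v
    all_frequent = {}
    k = 2
    while level:
        all_frequent.update(level)
        next_level = {}
        for (x, vx), (y, vy) in combinations(level.items(), 2):
            t = x | y
            if len(t) != k or t in next_level:
                continue
            vt = tuple(min(a, b) for a, b in zip(vx, vy))  # vector product
            if sum(vt) >= threshold:
                next_level[t] = vt
        level = next_level
        k += 1
    return all_frequent


def association_rules(frequent, beta):
    """Phase 2: rules X => Y with X u Y frequent and confidence >= beta."""
    rules = []
    for itemset, v in frequent.items():
        for size in range(1, len(itemset)):
            for xs in combinations(sorted(itemset), size):
                x = frozenset(xs)
                cf = sum(v) / sum(frequent[x])  # card(rho(X u Y)) / card(rho(X))
                if cf >= beta:
                    rules.append((x, itemset - x, cf))
    return rules


# Purchase matrix from Section 4.4.3: rows o1..o4, columns d1..d5.
chi = [
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
]
descriptors = ["d1", "d2", "d3", "d4", "d5"]

frequent = frequent_descriptor_sets(chi, descriptors, theta=0.5)
rules = association_rules(frequent, beta=0.5)
print(len(frequent), "frequent sets,", len(rules), "rules")
```

    The lookup frequent[x] in phase 2 is safe because every subset of a frequent descriptor set is itself frequent and was therefore stored during phase 1.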


    4.4.3. Worked example

    Let the set of items be D = {d1, d2, d3, d4, d5} and the set of purchase transactions be O = {o1, o2, o3, o4}. The transactions are given in the table below (an item that was bought is marked 1, otherwise 0).

          d1   d2   d3   d4   d5
    o1    1    0    1    1    0
    o2    0    1    1    0    1
    o3    1    1    1    0    1
    o4    0    1    0    0    1

    Table 2.2. Binary information system for the purchase transactions

    Let $\theta = 0.5$ and $\beta = 0.5$. Find the frequent sets and derive the association rules.

    The algorithm proceeds as follows.

    The purchase transactions correspond to a binary information system $S_B = (O, D, B, \chi)$, where:

    O = {o1, o2, o3, o4} and D = {d1, d2, d3, d4, d5}

    so that card(O) = 4 and card(D) = 5.

    From $S_B$ we obtain the following binary descriptor vectors:

    v_B({d1}) = (1, 0, 1, 0)
    v_B({d2}) = (0, 1, 1, 1)
    v_B({d3}) = (1, 1, 1, 0)
    v_B({d4}) = (1, 0, 0, 0)
    v_B({d5}) = (0, 1, 1, 1)

    Iteration 0: build L_{B,1}:

    Compute card(sup_B(v_B({d_i}))), the number of objects present in the binary descriptor vector v_B({d_i}), for i = 1, ..., 5:

    card(sup_B(v_B({d1}))) = 2
    card(sup_B(v_B({d2}))) = 3
    card(sup_B(v_B({d3}))) = 3
    card(sup_B(v_B({d4}))) = 1
    card(sup_B(v_B({d5}))) = 3

    With minsup = 0.5 and mincon = 0.5, the frequent descriptor sets are:

    L_{B,1} = {{d1}, {d2}, {d3}, {d5}}

    Iteration 1: build L_{B,2} from L_{B,1}:

    From L_{B,1} we build a matrix that represents the mapping $f: L_{B,1} \times L_{B,1} \to R$ as follows:

    For $(X, Y) \in L_{B,1} \times L_{B,1}$ with $X \ne Y$, let $T = X \cup Y$.


    The value of f(X, Y) is card(sup_B(v_B(T))). With card(T) = 2 and minsup = 0.5, T is frequent if:

    card(sup_B(v_B(T))) = card(sup_B(v_B(X) ⊗_B v_B(Y))) ≥ minsup * card(O) = 2

    (in this problem X, Y ∈ {{d1}, {d2}, {d3}, {d5}})

    Note: when computing sup_B(v_B(T)) we use the vectors v_B(X) (and v_B(Y)) stored in VS_{B,1} to speed up the computation.

    The values of f(X, Y) are given in the table below (the values f(X, X) are not of interest):

            {d1}   {d2}   {d3}   {d5}
    {d1}    2      1      2      1
    {d2}    1      3      2      3
    {d3}    2      2      3      2
    {d5}    1      3      2      3

    (Off-diagonal entries of at least 2 satisfy the threshold. Each pair is taken only once because, as sets, {X, Y} = {Y, X}.)

    Finally, we obtain L_{B,2} = {{d1, d3}, {d2, d3}, {d2, d5}, {d3, d5}}.

    VS_{B,2} contains the following descriptor vectors:

    v_B({d1, d3}) = v_B({d1}) ⊗_B v_B({d3}) = (1, 0, 1, 0)
    v_B({d2, d3}) = v_B({d2}) ⊗_B v_B({d3}) = (0, 1, 1, 0)
    v_B({d2, d5}) = v_B({d2}) ⊗_B v_B({d5}) = (0, 1, 1, 1)
    v_B({d3, d5}) = v_B({d3}) ⊗_B v_B({d5}) = (0, 1, 1, 0)

    Iteration 2: build L_{B,3} from L_{B,2}:

    From L_{B,2} we build a matrix that represents the mapping $f: L_{B,2} \times L_{B,2} \to R$.

    For $(X, Y) \in L_{B,2} \times L_{B,2}$ with $X \ne Y$, let $T = X \cup Y$.

    The value of f(X, Y) is card(sup_B(v_B(T))). With card(T) = 3 and minsup = 0.5, T is frequent if:

    card(sup_B(v_B(T))) = card(sup_B(v_B(X) ⊗_B v_B(Y))) ≥ minsup * card(O)

                {d1, d3}   {d2, d3}   {d2, d5}   {d3, d5}
    {d1, d3}    2          1          1          1
    {d2, d3}    1          2          2          2
    {d2, d5}    1          2          3          2
    {d3, d5}    1          2          2          2

    Note: when computing sup_B(v_B(T)) we use the vectors v_B(X) (and v_B(Y)) stored in VS_{B,2} to speed up the computation.


    Finally, we obtain L_{B,3} = {{d2, d3, d5}}, and VS_{B,3} contains a single descriptor vector v_B({d2, d3, d5}) = (0, 1, 1, 0).

    Iteration 3: build L_{B,4} from L_{B,3}:

    From L_{B,3} we build a matrix that represents the mapping $f: L_{B,3} \times L_{B,3} \to R$:

                    {d2, d3, d5}
    {d2, d3, d5}    2

    We get L_{B,4} = {}, so the procedure stops. Finally:

    L_B = L_{B,1} ∪ L_{B,2} ∪ L_{B,3}, that is,

    L_B = {{d1}, {d2}, {d3}, {d5}, {d1, d3}, {d2, d3}, {d2, d5}, {d3, d5}, {d2, d3, d5}}

    The association rules that can be derived are therefore (14 rules):

    No.   Rule          No.   Rule
    1     d1 => d3      8     d5 => d3
    2     d3 => d1      9     d2 => d3, d5
    3     d2 => d3      10    d3 => d2, d5
    4     d3 => d2      11    d5 => d2, d3
    5     d2 => d5      12    d2, d3 => d5
    6     d5 => d2      13    d2, d5 => d3
    7     d3 => d5      14    d3, d5 => d2

    Table 2.3. Association rules from the binary information system of purchase transactions

    4.5. Mining association rules on a fuzzy information system

    4.5.1. Definitions of fuzzy sets

    In classical set theory, a membership function is completely equivalent to the definition of a set A: given A we can determine its membership function $\mu_A(x)$, and conversely from $\mu_A(x)$ we can fully determine A. Moreover, $\mu_A(x)$ takes only the values 0 and 1.

    Representing membership in this way is not suitable for sets that are described vaguely, such as the set B of positive real numbers much smaller than 6, which can be written as:

    $B = \{x \in R \mid x \ll 6\}$ [...]

    On the other hand, if we cannot assert that x = 3.5 belongs to B, we also cannot assert that x = 3.5 does not belong to B. So to what degree (what percentage) does x = 3.5 belong to B? If there is an answer, then the membership function $\mu_B(x)$ at x = 3.5 must take a value in the interval [0, 1], that is, $0 \le \mu_B(x) \le 1$. Likewise, to what degree does x = 2.5 belong to C?

    The concept that extends to the cases above was first introduced by L. Zadeh in 1965. A fuzzy set A on the universe X is defined as follows:

    A =