dwhanddataminingpresentationpresentationfortjinstitute_2 - copy.ppt

Upload: sachidanandan-ananthasayanam

Post on 06-Jul-2018


TRANSCRIPT

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    1/158

    D Venkata Subramanin
    August 2nd 2011

    For TJ Institute of Technology
    DWH/Data Mining for CSE IV A & B sections

    Data Warehousing & Data Mining

    2/158

    Topics To Be Covered

    + Recap and Overview of DWH - UNIT I
    + Business Analysis & tools - UNIT II
    + Details about OLAP - UNIT II
    + Data Mining & Algorithms - UNIT III TO V
    + Quick introduction of the topic and algorithms

    3/158

    Decision Support Systems

    + Created to facilitate the decision-making process
    + So much information that it is difficult to extract it all from a traditional database
    + Need for a more comprehensive data storage facility

    Data Warehouse

    4/158

    Decision Support Systems

    + Extract information from data to use as the basis for decision making
    + Used at all levels of the organization
    + Tailored to specific business areas
    + Interactive
    + Ad hoc queries to retrieve and display information
    + Combines historical operation data with business activities

    5/158

    4 Components of DSS

    • Data Store - The DSS Database
     – Business Data
     – Business Model Data
     – Internal and External Data
    • Data Extraction and Filtering
     – Extract and validate data from the operational database and the external data sources

    6/158

    4 Components of DSS

    • End-User Query Tool
     – Create queries that access either the operational or the DSS database
    • End-User Presentation Tools
     – Organize and present the data

    7/158

    What is a Data Warehouse?

    A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.
    - WH Inmon

    WH Inmon - Regarded As Father Of Data Warehousing

    Data stored for historical period. Data is populated in the data warehouse on a daily/weekly basis depending upon the requirement.

    Can I see the credit report from Accounts, Sales from marketing and the open order report from order entry for this customer?

    Data from multiple sources is integrated for a subject.

    Identical queries will give the same results at different times. Supports analysis requiring historical data.

    8/158

    Data Growth

    In 2 years (2003 to 2005), the size of the largest database TRIPLED

    9/158

    Data Growth Rate

    • Twice as much information was created in 2002 as in 1999 (~30% growth rate)
    • Other growth rate estimates even higher
    • Very little data will ever be looked at by a human

    Knowledge Discovery is NEEDED to make sense and use of data.
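The growth figures on these two slides can be checked with a little compounding arithmetic. The sketch below is illustrative only; the helper name `annual_growth_rate` is an assumption and not part of the slides. It treats "doubled between 1999 and 2002" and "tripled between 2003 and 2005" as compound growth and derives the implied yearly rate, which for the doubling case comes out somewhat below the slide's rounded ~30%.

```python
def annual_growth_rate(factor, years):
    """Compound annual growth rate implied by growing `factor`-fold over `years` years."""
    return factor ** (1.0 / years) - 1.0


# "Twice as much information was created in 2002 as in 1999"
doubling_1999_2002 = annual_growth_rate(2, 2002 - 1999)

# "In 2 years (2003 to 2005), the size of the largest database tripled"
tripling_2003_2005 = annual_growth_rate(3, 2005 - 2003)

print(f"Doubling over 3 years: {doubling_1999_2002:.1%} per year")
print(f"Tripling over 2 years: {tripling_2003_2005:.1%} per year")
```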

    10/158

    April 2 201 Data Mining: Concepts and Techniques 10

    Data Warehouse: Subject-Oriented

    • Organized around major subjects, such as customer, product, sales.
    • Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
    • Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

    11/158

    Data Warehouse: Integrated

    • Constructed by integrating multiple, heterogeneous data sources
     – relational databases, flat files, on-line transaction records
    • Data cleaning and data integration techniques are applied.
     – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
        • E.g., hotel price: currency, tax, breakfast covered, etc.
     – When data is moved to the warehouse, it is converted.

    12/158

    Data Warehouse: Time Variant

    • The time horizon for the data warehouse is significantly longer than that of operational systems.
     – Operational database: current value data.
     – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
    • Every key structure in the data warehouse
     – Contains an element of time, explicitly or implicitly
     – But the key of operational data may or may not contain "time element".
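As a rough illustration of the "element of time in the key" point, the sketch below contrasts an operational store keyed by customer only with a warehouse-style store keyed by customer and date. The names (`operational_balance`, `warehouse_balance`, `record_balance`) and the sample values are assumptions made for the example.

```python
from datetime import date

# Operational store: the key has no time element, so each update overwrites the value.
operational_balance = {}

# Warehouse store: the key includes the snapshot date, so history is retained.
warehouse_balance = {}


def record_balance(customer_id, as_of, amount):
    """Record a balance in both stores to show how the key design differs."""
    operational_balance[customer_id] = amount
    warehouse_balance[(customer_id, as_of)] = amount


record_balance("C001", date(2011, 6, 30), 1200.0)
record_balance("C001", date(2011, 7, 31), 950.0)

# Only the current value survives in the operational store.
current = operational_balance["C001"]

# The warehouse can still answer "what was the balance at each month end?"
history = sorted(
    (key[1], amount) for key, amount in warehouse_balance.items() if key[0] == "C001"
)
print(current, history)
```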

    13/158

    Data Warehouse: Non-Volatile

    • A physically separate store of data transformed from the operational environment.
    • Operational update of data does not occur in the data warehouse environment.
     – Does not require transaction processing, recovery, and concurrency control mechanisms
     – Requires only two operations in data accessing:
        • initial loading of data and access of data.

    14/158

    Data Warehouse vs. Heterogeneous DBMS

    • Traditional heterogeneous DB integration:
     – Build wrappers/mediators on top of heterogeneous databases
     – Query-driven approach
        • When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
        • Complex information filtering, compete for resources
    • Data warehouse: update-driven, high performance
     – Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

    15/158

    Data Warehouse vs. Operational DBMS

    • OLTP (on-line transaction processing)
     – Major task of traditional relational DBMS
     – Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
    • OLAP (on-line analytical processing)
     – Major task of data warehouse system
     – Data analysis and decision making
    • Distinct features (OLTP vs. OLAP):
     – User and system orientation: customer vs. market
     – Data contents: current, detailed vs. historical, consolidated
     – Database design: ER + application vs. star + subject
     – View: current, local vs. evolutionary, integrated
     – Access patterns: update vs. read-only but complex queries

    16/158

    OLTP vs. OLAP

                          OLTP                                  OLAP
    users                 clerk, IT professional                knowledge worker
    function              day to day operations                 decision support
    DB design             application-oriented                  subject-oriented
    data                  current, up-to-date, detailed,        historical, summarized, multidimensional,
                          flat relational, isolated             integrated, consolidated
    usage                 repetitive                            ad-hoc
    access                read/write,                           lots of scans
                          index/hash on prim. key
    unit of work          short, simple transaction             complex query
    # records accessed    tens                                  millions
    # users               thousands                             hundreds
    DB size               100MB-GB                              100GB-TB
    metric                transaction throughput                query throughput, response
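The difference in access pattern and unit of work can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the `sales` table, its columns and its rows are assumptions, and a single in-memory database stands in for what would normally be two separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "South", 2010, 100.0),
        (2, "South", 2011, 150.0),
        (3, "North", 2010, 80.0),
        (4, "North", 2011, 120.0),
    ],
)

# OLTP style: a short read/write transaction that touches one record via its primary key.
with conn:
    conn.execute("UPDATE sales SET amount = amount + 10 WHERE order_id = ?", (2,))
oltp_row = conn.execute("SELECT amount FROM sales WHERE order_id = ?", (2,)).fetchone()

# OLAP style: a read-only query that scans many records and returns a summary.
olap_rows = conn.execute(
    "SELECT region, year, SUM(amount) FROM sales "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()

print(oltp_row)
print(olap_rows)
```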

    17/158

    Why Separate Data Warehouse?

    • High performance for both systems
     – DBMS: tuned for OLTP: access methods, indexing, concurrency control, recovery
     – Warehouse: tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
    • Different functions and different data:
     – missing data: decision support requires historical data which operational DBs do not typically maintain
     – data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
     – data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

    18/158

    Subject-Oriented

    + Data is arranged and optimized to provide answers to questions from diverse functional areas

    Data is organized and summarized by topic
    Sales / Marketing / Finance / Distribution / Etc.

    19/158

    Integrated

    + The data warehouse is a centralized, consolidated database that integrates data derived from the entire organization

    Multiple Sources
    Diverse Sources
    Diverse Formats

    20/158

    Time-Variant

    + The Data Warehouse represents the flow of data through time
    + Can contain projected data from statistical models
    + Data is periodically uploaded, then time-dependent data is recomputed

    21/158

    Nonvolatile

    + Once data is entered it is NEVER removed
    + Represents the company's entire history
      Near term history is continually added to it
      Always growing
      Must support terabyte databases and multiprocessors
    + Read-Only database for data analysis and query processing

    22/158

    Subject-Oriented - Characteristics of a Data Warehouse

    Operational: Quotes, Leads, Orders, Prospects
    Data Warehouse: Customers, Products, Regions, Time

    Focus is on Subject Areas rather than Applications

    23/158

    Integrated - Characteristics of a Data Warehouse

    Appl A - m,f
    Appl B - 1,0
    Appl C - male,female
    Warehouse: m,f

    Appl A - balance dec fixed (13,2)
    Appl B - balance pic 9(9)V99
    Appl C - balance pic S9(7)V99 comp-3
    Warehouse: balance dec fixed (13,2)

    Appl A - bal-on-hand
    Appl B - current-balance
    Appl C - cash-on-hand
    Warehouse: Current balance

    Appl A - date (julian)
    Appl B - date (yymmdd)
    Appl C - date (absolute)
    Warehouse: date (julian)

    Integrated View Is The Essence Of A Data Warehouse
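The encoding example on this slide can be sketched as a small conversion step. This is a minimal illustration, not the slide author's implementation: the mapping tables and the function name `to_warehouse_gender` are hypothetical, and treating 1 as male and 0 as female for Appl B is an assumption taken from the order in which the slide lists the codes.

```python
# Hypothetical per-application encodings of the same attribute.
# Assumption: for Appl B, 1 means male and 0 means female.
GENDER_MAPS = {
    "A": {"m": "m", "f": "f"},
    "B": {"1": "m", "0": "f"},
    "C": {"male": "m", "female": "f"},
}


def to_warehouse_gender(app, value):
    """Convert an application-specific gender code to the warehouse convention (m/f)."""
    return GENDER_MAPS[app][str(value).strip().lower()]


# Records arriving from the three applications, each in its own encoding.
records = [("A", "f"), ("B", 1), ("C", "Male")]
unified = [to_warehouse_gender(app, value) for app, value in records]
print(unified)
```

The same pattern would apply to the balance formats, field names and date representations: one agreed warehouse convention, with a conversion applied as data is moved in.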

    24/158

    Non-volatile - Characteristics of a Data Warehouse

    Operational: insert, change, replace, delete
    Data Warehouse: load, read-only access

    Integrated View Is The Essence Of A Data Warehouse

    25/158

    Time Variant - Characteristics of a Data Warehouse

    Operational: Current Value data
    + time horizon: 60-90 days
    + key may not have element of time

    Data Warehouse: Snapshot data
    + time horizon: 5-10 years
    + key has an element of time
    + data warehouse stores historical data

    Data Warehouse Typically Spans Across Time

    26/158

    Alternate Definitions

    A collection of integrated, subject oriented databases designed to support the DSS function, where each unit of data is relevant to some moment of time
    - Imhoff

    27/158

    Alternate Definitions

    Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let the user get at the data for decision support
    - Babcock

    28/158

    12 Rules of a Data Warehouse

    + Data Warehouse and Operational Environments are Separated
    + Data is integrated
    + Contains historical data over a long period of time
    + Data is snapshot data captured at a given point in time
    + Data is subject-oriented


    29/158

    30/158

    12 Rules of Data Warehouse

    + Environment is characterized by Read-only transactions to very large data sets
    + System that traces data sources, transformations and storage
    + Metadata is a critical component
      Source, transformation, integration, storage, relationships, history, etc
    + Contains a chargeback mechanism for resource usage that enforces optimal use of data by end users

    31/158

    Need for Data Warehousing

    + Better business intelligence for end-users
    + Reduction in time to locate, access, and analyze information
    + Consolidation of disparate information sources
    + Strategic advantage over competitors
    + Faster time-to-market for products and services
    + Replacement of older, less-responsive decision support systems
    + Reduction in demand on IS to generate reports

    32/158

    ICS 541 - 02 Data Mining Concepts 32

    Data Warehouse Architecture

    33/158

    Multi-Tiered Architecture

    [Figure: multi-tiered data warehouse architecture]
    Data Sources: Operational DBs, other sources
    Extract, Transform, Load, Refresh (Monitor & Integrator, Metadata)
    Data Storage: Data Warehouse, Data Marts
    OLAP Engine: OLAP Server
    Front-End Tools: Analysis, Query, Reports, Data mining

    34/158

    Typical Data Warehouse Architecture

    [Figure: typical data warehouse architecture]
    Operational Systems/Data
    Data Preparation: Select, Extract, Transform, Integrate, Maintain
    Middleware/API
    Data Warehouse, Metadata, Data Marts
    Front end: EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining

    Multi-tiered Data Warehouse without ODS

    35/158

    Typical Data Warehouse Architecture

    [Figure: typical data warehouse architecture with an operational data store]
    Operational Systems/Data
    Data Preparation: Select, Extract, Transform, Integrate, Maintain
    ODS, Metadata
    Data Preparation: Select, Extract, Transform, Load
    Data Warehouse, Metadata, Data Marts

    Multi-tiered Data Warehouse with ODS

    36/158

    ETL - Extract, Transform and Load

    As the name suggests, the ETL process covers the following phases:

    • Extraction of data from data sources.
    • Transforming the extracted data to meet business requirements.
    • Loading the data into the target warehouse/database.
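The three phases can be sketched end to end with the Python standard library. This is a minimal sketch under stated assumptions, not a description of any ETL tool: the inline CSV text, the `orders` table, the column names and the "skip rows with no amount" rule are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source; in practice this would be a file or a source database.
SOURCE_CSV = """order_id,customer,amount,currency
1, alice ,100.50,USD
2,BOB,200.00,usd
3,carol,,USD
"""


def extract(text):
    """Extraction: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: clean values, change data types, drop rows that fail a rule."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # illustrative business rule: skip rows with no amount
        cleaned.append(
            (
                int(row["order_id"]),
                row["customer"].strip().title(),
                float(row["amount"]),
                row["currency"].upper(),
            )
        )
    return cleaned


def load(rows, conn):
    """Loading: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```

Keeping each phase as its own function mirrors the slide's separation of concerns: the source format can change without touching the load step, and the rules can grow without touching extraction.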

    37/158

    ETL - Extract, Transform and Load

    • Data Extract
      Get data from source
    • Data Transformation
      Data Cleansing - Data Quality Assurance
      Data Scrubbing - Removing errors and inconsistencies
      Processing Calculations
      Applying Business Rules
      Changing Data Types
      Making the Data More Readable
      Replacing Codes with Actual Values
      Summarizing the Data
    • Data Load
      Load data into Warehouse

    38/158

    Extraction

    + The first part of the ETL process.
    + Data under consideration is extracted from the different data sources.
    + The source data may use a different data organization/format.
    + Some of the common data sources are
      Databases
      Flat files

    39/158

    Transform

    + It involves applying a series of rules to the data extracted from the source to derive the data to load into the target.
    + Depending on the requirement of the target, the transformation rules may be simple or complex.
    + Transformation may involve
      Selecting only certain columns to load
      Filtering
      Sorting
      Combining data from multiple sources
      Generating Surrogate keys, etc.
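The listed transformation steps might look like this on a handful of in-memory rows. The source rows, field names and the function `transform_customers` are assumptions for illustration; the surrogate key here is simply a running integer assigned by the warehouse, independent of the source codes.

```python
from itertools import count

# Hypothetical rows already extracted from two different sources.
crm_rows = [
    {"cust_code": "C-17", "name": "Asha", "city": "Chennai", "active": True},
    {"cust_code": "C-03", "name": "Ravi", "city": "Madurai", "active": False},
]
billing_rows = [
    {"cust_code": "B-88", "name": "Meena", "city": "Salem", "active": True},
]

# Warehouse-assigned surrogate keys: 1, 2, 3, ...
surrogate_keys = count(start=1)


def transform_customers(*sources):
    """Combine, filter, sort, select columns and assign surrogate keys."""
    combined = [row for source in sources for row in source]  # combine sources
    active = [row for row in combined if row["active"]]       # filter
    active.sort(key=lambda row: row["name"])                  # sort
    return [
        {
            "customer_key": next(surrogate_keys),             # surrogate key
            "name": row["name"],                              # selected columns only
            "city": row["city"],
        }
        for row in active
    ]


dim_customer = transform_customers(crm_rows, billing_rows)
print(dim_customer)
```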

    40/158

    Load

    + The last step of the ETL process
    + The load phase loads the transformed data to the end target.
    + Depending on the requirement, the load phase may be
      Full Load
      Incremental Load
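The two load modes can be contrasted with a small sketch. It assumes, purely for illustration, that each source row carries an `updated_at` value that can be compared with the time of the previous load; the function names and the dictionary standing in for the target table are hypothetical.

```python
def full_load(target, source_rows):
    """Full load: replace the entire target with the current source contents."""
    target.clear()
    target.update({row["id"]: row for row in source_rows})


def incremental_load(target, source_rows, last_loaded_at):
    """Incremental load: apply only rows changed since the previous load."""
    changed = [row for row in source_rows if row["updated_at"] > last_loaded_at]
    for row in changed:
        target[row["id"]] = row
    return len(changed)


source = [
    {"id": 1, "value": "a", "updated_at": 1},
    {"id": 2, "value": "b", "updated_at": 2},
]
warehouse = {}
full_load(warehouse, source)

# Later: one new row arrives and one existing row changes in the source.
source.append({"id": 3, "value": "c", "updated_at": 5})
source[0] = {"id": 1, "value": "a2", "updated_at": 6}
applied = incremental_load(warehouse, source, last_loaded_at=2)
print(applied, sorted(warehouse))
```

A full load is simpler but reprocesses everything; an incremental load moves less data but depends on being able to tell which rows changed.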

    41/158

    Importance of ETL

    + Data of an organization is spread across multiple geographies and domains.
    + Data is organized in different formats in different sources.
    + Consolidation of the data to make it more meaningful.
    + Applying business rules enriches the value the data provides.
    + Identifying the inconsistencies and providing a unified view.
    + Improving the data quality.

    42/158

    Sample ETL Tools

    + Teradata Warehouse Builder from Teradata
    + DataStage from IBM
    + SAS System from SAS Institute
    + Power Mart/Power Center from Informatica
    + Sagent Solution from Sagent Software
    + Hummingbird Genio Suite from Hummingbird Communications
    + Abinitio
    + Oracle Data Warehouse Builder and ODI from Oracle Corporation


    43/158

    44/158

    Data Access and Analysis - Terminologies

    • Reporting
     – A category of data access solution in which the information is presented in the form of reports
     – Reporting tools are also referred to as Query and reporting tools
    • OLAP (On-Line Analytic Processing)
     – Defined as "Fast Analysis of Multidimensional Information" by the OLAP council
     – Used interchangeably with 'BI'
     – OLAP tools are synonymous with Multidimensional tools or applications
    • DSS tools that use multidimensional data analysis techniques
     – Support for a DSS data store
     – Data extraction and integration filter
     – Specialized presentation interface

    45/158

    Data Access and Analysis - Terminologies

    • Data Mining
     – A process that uses a variety of statistical and artificial intelligence frameworks to discover patterns and relationships in data
     – Used to make valid predictions in data analysis problems where the exact sequence and nature of queries/questions to be written/asked against the data to make the prediction is not known, and the number of variables involved in the analysis is too large to be intuitively handled by structured querying or OLAP tools
    • Web Access
     – A category of data access solutions in which information is viewed through a web browser

    46/158

    Importance of Data Access

    • Businesses today face challenges like
     – Large volume of data
     – User demands of flexible and timely access to information
     – Extracting value from key business data
    • Data Access is the 'last mile' that enables decision makers to
     – Reach the database infrastructure
    • Prompt, reliable data access
     – Lowers operating costs
     – Reduces error
     – Increases productivity.

    47/158

    Traditional Decision Making techniques

    • Spreadsheets and SQL are traditionally used as tools for analysis and decision making
    • Limitations of Traditional techniques
     – It is very difficult to define the aggregation levels, views in spreadsheets
     – SQL does not have a natural way of providing flexible view reorganizations that will transpose the data
     – Common analytic functions such as cumulative average and total are not supported in SQL
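For reference, the cumulative total and cumulative average mentioned in the last bullet are simple running calculations. The sketch below computes them in Python with `itertools.accumulate`; the monthly figures are made-up sample values.

```python
from itertools import accumulate

monthly_sales = [120.0, 80.0, 100.0, 140.0]  # hypothetical monthly figures

# Running (cumulative) total after each month.
cumulative_total = list(accumulate(monthly_sales))

# Running (cumulative) average: total so far divided by number of months so far.
cumulative_average = [
    total / months for months, total in enumerate(cumulative_total, start=1)
]

print(cumulative_total)
print(cumulative_average)
```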

    48/158

    Design Considerations

     – Platform - SMP, MPP, NT, Unix
     – Target Database - RDBMS, MDDB
     – Partitioning
     – Data Preparation - Data Quality Audit, Cleansing, Extraction, Transformation
     – Modeling - Facts & Dimensions
     – Information Directory - Metadata Management
     – Warehouse Administration
     – End User Tools
     – Granularity - Detail and Summarization

    49/158

    Data Warehouse Hardware

    Hardware Considerations

    • Parallelism: SMP or MPP
    • Disk Storage

    50/158

    Hardware Considerations

    • Parallelism
     – Most deployments of VLDB Data Warehouses are on SMP or MPP

    51/158

    Hardware Considerations

    • Three options for Hardware
     – Symmetric Multiprocessing (SMP)
        • Shared Memory Architecture
     – Massively Parallel Processing (MPP)
        • Shared Nothing Architecture
        • Each node has its own memory and I/O
     – Non Uniform Memory Access (NUMA)
        • Cluster of SMP machines
        • Classified as large SMP machines

    52/158

    SMP vs MPP machines

    SMP                                            MPP
    For mission critical OLTP or medium DSS        Complex analytical large scale DSS
    Scale to 10-12 CPUs (now 30)                   Scale to more than 100 CPUs
    Growth is slow and steady                      Growth is rapid and unpredictable
    Database size 200 GB                           Database size 00 GB
    Aim is automation or basic decision support    Primary aim is strategic advantage

    53/158

    Server Scalability

    54/158

    OLTP Vs Warehouse

    Operational System                                         Data Warehouse
    Transaction Processing                                     Query Processing
    Time Sensitive                                             History Oriented
    Operator View                                              Managerial View
    Organized by transactions (Order, Input, Inventory)        Organized by subject (Customer, Product)
    Relatively smaller database                                Large database size
    Many concurrent users                                      Relatively few concurrent users
    Volatile Data                                              Non Volatile Data
    Stores all data                                            Stores relevant data
    Not Flexible                                               Flexible

    55/158

    Capacity Planning

    [Figure: Processing Power plotted against Time of day]

    Processing Load Peaks During the Beginning and End of Day

    56/158

    Examples Of Some Applications

    • Financial Reporting and Consolidation
    • Target Marketing
    • Market Segmentation
    • Budgeting
    • Credit Rating Agencies
    • Churn Analysis
    • Profitability Management
    • Event tracking

    Manufacturers, Customers, Retailers

    57/158

    Data Marts

    + Small Data Stores
    + More manageable data sets
    + Targeted to meet the needs of small groups within the organization
    + Small, Single-Subject data warehouse subset that provides decision support to a small group of people

    58/158

    Data Marts

    + Enterprise-wide data warehousing projects have a very large cycle time
    + Getting consensus between multiple parties may also be difficult
    + Departments may not be satisfied with the priority accorded to them
    + Sometimes individual departmental needs may be strong enough to warrant a local implementation
    + Application/database distribution is also an important factor

    59/158

    Data Marts

    • Subject or Application Oriented Business View of Warehouse
     – Quick Solution to a specific Business Problem
     – Finance, Manufacturing, Sales etc.
     – Smaller amount of data used for Analytic Processing

    A Logical Subset of The Complete Data Warehouse
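One way to picture a data mart as a logical subset is a view over the warehouse restricted to a single subject. The sketch below uses Python's built-in `sqlite3`; the table `warehouse_facts`, the view `sales_mart` and the rows are assumptions for illustration, and a real mart may equally be a separate physical store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_facts (dept TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [
        ("Sales", 2010, 500.0),
        ("Sales", 2011, 650.0),
        ("Finance", 2011, 300.0),
        ("Manufacturing", 2011, 900.0),
    ],
)

# The mart is modelled as a view: a logical subset limited to one subject area.
conn.execute(
    "CREATE VIEW sales_mart AS "
    "SELECT year, amount FROM warehouse_facts WHERE dept = 'Sales'"
)

mart_rows = conn.execute("SELECT year, amount FROM sales_mart ORDER BY year").fetchall()
print(mart_rows)
```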

    60/158

    Data Warehouses or Data Marts

    For companies interested in changing their corporate cultures or integrating separate departments, an enterprise-wide approach makes sense.

    Companies that want a quick solution to a specific business problem are better served by a standalone data mart.

    Some companies opt to build a warehouse incrementally, data mart by data mart.

    A Logical Subset of The Complete Data Warehouse

    61/158

    Data Warehouse and Data Mart

                        Data Warehouse               Data Marts
    Scope               Application Neutral          Specific Application Requirement
                        Centralized, Shared          LOB, department
                        Cross LOB/enterprise         Business Process Oriented
    Data Perspective    Historical Detailed data     Detailed (some history)
                        Some summary                 Summarized
    Subjects            Multiple subject areas       Single Partial subject
                                                     Multiple partial subjects

    62/158

    Data Warehouse and Data Mart

                             Data Warehouse                    Data Marts
    Data Sources             Many                              Few
                             Operational/External Data         Operational, external data
    Implement Time Frame     1-23 months for first stage       5-26 months
                             Multiple stage implementation
    Characteristics          Flexible, extensible              Restrictive, non extensible
                             Durable/Strategic                 Short life/tactical
                             Data orientation                  Project Orientation

    63/158

    Warehouse or Mart First?

    Data Warehouse First                                 Data Mart first
    Expensive                                            Relatively cheap
    Large development cycle                              Delivered in 7 8 months
    Change management is difficult                       Easy to manage change
    Difficult to obtain continuous corporate support     Can lead to independent and incompatible marts
    Technical challenges in building large databases     Cleansing, transformation, modeling techniques may be incompatible

    64/158

    OLTP Systems Vs Data Warehouse

    Remember: between OLTP and Data Warehouse systems
    users are different,
    data content is different,
    data structures are different,
    hardware is different

    Understanding The Differences Is The Key

    65/158

    Operational Data Store - Definition

    [Figure labels: Operational, ODS, DSS, Data Warehouse]

     

    66/158

    Operational Data Store

    + The ODS applies only to the world of operational systems.
    + The ODS contains current valued and near current valued data.
    + The ODS contains almost exclusively all detail data
    + The ODS requires a full function update, record oriented environment.

    67/158

    Operational Data Store

    • Functions of an ODS
    + Converts Data
    + Decides Which Data of Multiple Sources Is the Best
    + Summarizes Data
    + Decodes/encodes Data
    + Alters the Key Structures
    + Alters the Physical Structures
    + Reformats Data
    + Internally Represents Data
    + Recalculates Data.

    68/158

    Different kinds of Information Needs

    • Current: Is this medicine available in stock?
    • Recent: What are the tests this patient has completed so far?
    • Historical: Has the incidence of Tuberculosis increased in last 5 years in Southern region?

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    69/158

    -T3 Vs -DS Vs D!"&haracteristic  O(TP  ODS  Data

    Warehouse 

    -udience  OperatingPersonnel 

    -nalysts  Managers andanalysts 

    Data access  #ndividualrecords"transaction

    driven 

    #ndividual records"transaction oranalysis driven 

    Set o) records"analysis driven 

    Data content  &urrent" real.time 

    &urrent and near.current 

    Historical 

    Data Structure  Detailed  Detailed and lightlysummarized 

    Detailed andSummarized 

    Data organization  +unctional  Su%ect.oriented  Su%ect.oriented 

    Type o) Data  Homogeneous  Homogeneous  Vast Supply o) veryheterogeneous data 

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    70/158

    OLTP Vs ODS Vs DWH

    Characteristic | OLTP | ODS | Data Warehouse
    Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
    Data update | Field by field | Field by field | Controlled batch
    Database size | Moderate | Moderate | Large to very large
    Development methodology | Requirements driven, structured | Data driven, somewhat evolutionary | Data driven, evolutionary
    Philosophy | Support day-to-day operation | Support day-to-day decisions & operational activities | Support managing the enterprise

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    71/158

     

    END OF UNIT I RECAP

    BEGINNING OF UNIT II

    TOPIC: BUSINESS ANALYSIS

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    72/158

    Principles of Data Warehouse / Business Analysis

    • The principal purpose of data warehousing is to provide information to business users for strategic decision making.
    • The decision-making process is the business analysis of the information stored in a data warehouse.
    • The business analysis is enabled by
         – A number of applications
         – A number of tools
         – A number of techniques
      to provide various business-focused views to business domain experts.

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    73/158

    TOOL CATEGORIES

    • 5 Main Categories of decision support tools
         – Reporting
         – Managed Query
         – Executive Information Systems
         – Online Analytical Processing (OLAP)
         – Data Mining

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    74/158

    • Category 1 - Reporting Tools
         – Production reporting tools
            • Used by companies to generate regular reports or to support high-volume batch jobs, such as calculating and printing pay checks or summarizing revenues by month.
            • Written in COBOL, in high-level languages such as .NET or Java, or with custom tools; these are expensive and are developed or customized to the needs of an organization.
         – Desktop report writers
            • Designed for end users; business users run them on their desktops to design, develop and generate reports daily or on demand.
            • Example: Crystal Reports

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    75/158

    Category 2 - Managed Query Tools

         – These tools shield end users from the complexities of SQL and database structures by inserting a meta-layer between users and the database.
         – The meta-layer is the software that provides subject-oriented views of a database and supports point-and-click creation of SQL: users drag and drop to form the complex SQL that searches for or produces information. These tools follow a three-tiered architecture to improve scalability.
    • COGNOS & BUSINESS OBJECTS
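The meta-layer idea can be sketched in a few lines. This is only an illustration under stated assumptions: the business terms, the table and column names (`orders`, `customers`, `cust_id`, `amt`, and so on) and the `build_query` helper are all hypothetical and are not taken from Cognos, Business Objects or any other product.

```python
# Hypothetical meta-layer: business terms mapped to physical SQL expressions.
META_LAYER = {
    "Customer Name": "c.cust_nm",
    "Region": "c.rgn_cd",
    "Revenue": "SUM(o.amt)",
}
# Hypothetical physical tables hidden from the end user.
FROM_CLAUSE = "orders o JOIN customers c ON c.cust_id = o.cust_id"


def build_query(selected_terms):
    """Translate the business terms a user picked into a SQL string."""
    columns = [META_LAYER[term] for term in selected_terms]
    # Non-aggregated columns must appear in GROUP BY when an aggregate is used.
    plain = [col for col in columns if not col.startswith("SUM(")]
    sql = f"SELECT {', '.join(columns)} FROM {FROM_CLAUSE}"
    if plain and len(plain) < len(columns):
        sql += f" GROUP BY {', '.join(plain)}"
    return sql


sql = build_query(["Region", "Revenue"])
```

The user only ever sees "Region" and "Revenue"; the join and the grouping are supplied by the layer.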

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    76/158

    Category 3 - Executive Information Systems

    • These tools predate report writers and managed query tools.
    • They were first deployed on mainframes.
    • They provide customized graphical decision support applications that give managers and executives a high-level view of the business and access to external sources such as custom and online feeds.
    • Examples: Pilot Software, Platinum Technology Forest and Trees, SAS

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    77/158

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    78/158

    Need for the tools and applications for business analysis

    • Simple tabular form reporting
    • Ad-hoc user-specified queries
    • Pre-defined repeatable queries & complex queries with multi-table joins, multi-level sub-queries & sophisticated search criteria
    • Ranking, Multivariable Analysis
    • Time Series Analysis
    • Data Visualization - graphing, charting & pivoting
    • Complex Textual Search
    • Statistical Analysis
    • Artificial Intelligence techniques for testing hypotheses
    • Information Mapping
    • Interactive Drill-Down Reporting and Analysis (Mining)

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    79/158

    QUERY AND REPORTING TOOLS

    Must help with the following three distinct types of reporting:
    1. Creation and viewing of STANDARD REPORTS
    2. Definition and creation of AD-HOC REPORTS
    3. Data Exploration

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    80/158

    Check Google, Wikipedia or any web site to know more about some of the tools:

    1). Cognos Impromptu
    2). PowerBuilder
    3). Forte
    4). Information Builders – Cactus & Focus
    5). Microsoft SQL Server – I'll provide the notes

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    81/158

     

    UNIT II

    TOPIC: OLAP & MULTI-DIMENSIONAL MODELS

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    82/158

    OLAP: Need or Drivers for OLAP

    • Need for more intensive decision support
    • Multi-dimensional nature of the problems
    • Retrieval of very large data sets (100's of GB's or TB's) and summarizing them on the fly
    • The result set may look like a multi-dimensional spreadsheet, hence the term multi-dimensional. (A traditional RDBMS supports a two-dimensional relational model through SQL.)
    • Solving modern business problems such as market analysis and financial forecasting requires
        • Query-centric, array-oriented, multi-dimensional database schemas

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    83/158

    Nature of OLAP Analysis

    • Aggregation -- (total sales, percent-to-total)
    • Comparison -- Budget vs. Expenses
    • Ranking -- Top 10, quartile analysis
    • Access to detailed and aggregate data
    • Complex criteria specification
    • Visualization
    • Need interactive response to aggregate queries

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    84/158

    Multidimensional Data

    [Figure: data cube with three dimensions: Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Month, and Region (W, S, N)]

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    85/158

    Conceptual Model for OLAP

    • Numeric measures to be analyzed
         – e.g. Sales (Rs), sales (volume), budget, revenue, inventory
    • Dimensions
         – other attributes of data, define the space
         – e.g. store, product, date-of-sale
         – hierarchies on dimensions
            • e.g. branch -> city -> state

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    86/158

    Operations

    • Rollup: summarize data
         – e.g. given sales data, summarize sales for last year by product category and region
    • Drill down: get more details
         – e.g. given summarized sales as above, find breakup of sales by city within each region, or within the Andhra region
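Roll-up and drill-down can be sketched as two aggregations of the same detail rows at different levels of the region -> city hierarchy. The records and the `aggregate` helper below are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical sales records: (category, region, city, amount).
SALES = [
    ("Soft drinks", "Andhra", "Hyderabad", 120),
    ("Soft drinks", "Andhra", "Vijayawada", 80),
    ("Soft drinks", "Tamil Nadu", "Chennai", 150),
    ("Snacks", "Andhra", "Hyderabad", 60),
    ("Snacks", "Tamil Nadu", "Chennai", 90),
]


def aggregate(rows, key_fn):
    """Sum the amount of each row under the group key returned by key_fn."""
    totals = defaultdict(int)
    for row in rows:
        totals[key_fn(row)] += row[3]
    return dict(totals)


# Roll-up: summarize sales by (category, region), hiding the city level.
rollup = aggregate(SALES, lambda r: (r[0], r[1]))

# Drill-down: break the Andhra totals back out by city.
andhra = [r for r in SALES if r[1] == "Andhra"]
drilldown = aggregate(andhra, lambda r: (r[0], r[2]))
```

Both views come from the same facts; only the grouping level along the hierarchy changes.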

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    87/158

    More Cube Operations

    • Slice and dice: select and project
         – e.g. Sales of soft-drinks in Andhra over the last quarter
    • Pivot: change the view of data

          |  Q1 |  Q2 | Total
    L     |  22 |  33 |  55
    S     |  15 |  44 |  59
    Total |  37 |  77 | 114

          |   L |   S | Total
    Red   |  14 |   7 |  21
    Blue  |  41 |  52 |  93
    Total |  55 |  59 | 114
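A minimal sketch of slice, dice and pivot on a tiny cube held in a dictionary. The cube cells, the product and region names, and the helper functions are hypothetical; a real OLAP engine would run these operations against stored aggregates.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, region, quarter) -> sales.
CUBE = {
    ("Cola", "Andhra", "Q1"): 22,
    ("Cola", "Andhra", "Q2"): 33,
    ("Juice", "Andhra", "Q1"): 15,
    ("Juice", "Andhra", "Q2"): 44,
    ("Cola", "Kerala", "Q1"): 10,
    ("Juice", "Kerala", "Q2"): 5,
}


def slice_cube(cube, region):
    """Slice: fix one dimension (region) and keep the other two."""
    return {(p, q): v for (p, r, q), v in cube.items() if r == region}


def dice(cube, products, quarters):
    """Dice: keep only cells whose product and quarter are in the given sets."""
    return {k: v for k, v in cube.items()
            if k[0] in products and k[2] in quarters}


def pivot(cells):
    """Pivot: swap the row and column dimensions of a two-dimensional view."""
    return {(col, row): v for (row, col), v in cells.items()}


def crosstab(cells):
    """Arrange (row, column) -> value cells as nested dicts with totals."""
    table = defaultdict(lambda: defaultdict(int))
    for (row, col), value in cells.items():
        table[row][col] += value
        table[row]["Total"] += value
        table["Total"][col] += value
        table["Total"]["Total"] += value
    return {row: dict(cols) for row, cols in table.items()}


andhra = slice_cube(CUBE, "Andhra")
by_product = crosstab(andhra)         # rows = product, columns = quarter
by_quarter = crosstab(pivot(andhra))  # rows = quarter, columns = product
```

Pivoting does not change any value or total; it only changes which dimension is laid out as rows and which as columns.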

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    88/158

    More OLAP Operations

    • Hypothesis driven search: e.g. factors affecting defaulters
         – view defaulting rate on age, aggregated over other dimensions
         – for particular age segment, detail along profession
    • Need interactive response to aggregate queries
         – => precompute various aggregates

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    89/158

    MOLAP vs ROLAP

    • MOLAP: Multidimensional array OLAP
    • ROLAP: Relational OLAP

    Type  | Size | Colour | Amount
    Shirt | S    | Blue   | 10
    Shirt | L    | Blue   | 25
    Shirt | ALL  | Blue   | 35
    Shirt | S    | Red    | 3
    Shirt | L    | Red    | 7
    Shirt | ALL  | Red    | 10
    Shirt | ALL  | ALL    | 45
    ...   | ...  | ...    | ...
    ALL   | ALL  | ALL    | 1290

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    90/158

    SQL Extensions

    • Cube operator
         – group by on all subsets of a set of attributes (month, city)
         – redundant scan and sorting of data can be avoided
    • Various other non-standard SQL extensions by vendors
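What the cube operator computes can be sketched in plain Python: one pass over the detail rows that updates an aggregate for every subset of the grouping attributes, with "ALL" standing in for attributes that are aggregated away. The detail rows and the `cube` function are illustrative assumptions, not a vendor implementation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical detail rows: (type, size, colour, amount).
ROWS = [
    ("Shirt", "S", "Blue", 10),
    ("Shirt", "L", "Blue", 25),
    ("Shirt", "S", "Red", 3),
    ("Shirt", "L", "Red", 7),
]
DIMS = 3  # number of grouping attributes that precede the amount


def cube(rows, dims=DIMS):
    """Aggregate over every subset of the grouping attributes in one scan.

    Attributes that are not part of a grouping are reported as "ALL".
    """
    subsets = [set(c) for k in range(dims + 1)
               for c in combinations(range(dims), k)]
    totals = defaultdict(int)
    for row in rows:  # single pass over the detail data
        for keep in subsets:
            key = tuple(row[i] if i in keep else "ALL" for i in range(dims))
            totals[key] += row[dims]
    return dict(totals)


result = cube(ROWS)
```

Running separate GROUP BY queries for each subset would scan and sort the data once per grouping; the single loop above is the redundancy the slide says can be avoided.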

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    91/158

    Strengths of OLAP

    • It is a powerful visualization tool
    • It provides fast, interactive response times
    • It is good for analyzing time series
    • It can be useful to find some clusters and outliers
    • Many vendors offer OLAP tools

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    92/158

    Brief History

    • Express and System W DSS
    • Online Analytical Processing - coined by EF Codd in 1994 - white paper by Arbor Software
    • Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
    • MOLAP: Multidimensional OLAP (Hyperion (Arbor Essbase), Oracle Express)
    • ROLAP: Relational OLAP (Informix Metacube, Microstrategy DSS Agent)

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    93/158

    OLAP and Executive Information Systems

    • Andyne Computing -- Pablo
    • Arbor Software -- Essbase
    • Cognos -- PowerPlay
    • Comshare -- Commander OLAP
    • Holistic Systems -- Holos
    • Information Advantage -- AXSYS, WebOLAP
    • Informix -- Metacube
    • Microstrategies -- DSS/Agent
    • Oracle -- Express
    • Pilot -- LightShip
    • Planning Sciences -- Gentium
    • Platinum Technology -- ProdeaBeacon, Forest & Trees
    • SAS Institute -- SAS/EIS, OLAP++
    • Speedware -- Media

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    94/158

    Microsoft OLAP strategy

    • Plato: OLAP server: powerful, integrating various operational sources
    • OLE-DB for OLAP: emerging industry standard based on MDX --> extension of SQL for OLAP
    • Pivot-table services: integrate with Office 2000
         – Every desktop will have OLAP capability.
    • Client side caching and calculations
    • Partitioned and virtual cube
    • Hybrid relational and multidimensional storage

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    95/158

    Multidimensional Data Analysis Techniques

    • Advanced Data Presentation Functions
         – 3-D graphics, Pivot Tables, Crosstabs, etc.
         – Compatible with Spreadsheets & Statistical packages
         – Advanced data aggregations, consolidation and classification across time dimensions
         – Advanced computational functions
         – Advanced data modeling functions

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    96/158

    Advanced Database Support

    • Advanced Data Access Features
         – Access to many kinds of DBMS's, flat files, and internal and external data sources
         – Access to aggregated data warehouse data
         – Advanced data navigation (drill-downs and roll-ups)
         – Ability to map end-user requests to the appropriate data source
         – Support for Very Large Databases

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    97/158

    Easy-to-Use End-User Interface

    + Graphical User Interfaces
    + Much more useful if access is kept simple

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    98/158

    Client/Server Architecture

    + Framework for the new systems to be designed, developed and implemented
    + Divide the OLAP system into several components that define its architecture
         – Same computer
         – Distributed among several computers

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    99/158

    OLAP Architecture

    • 3 Main Modules
         – GUI
         – Analytical Processing Logic
         – Data-processing Logic

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    100/158

    OLAP 3 Tier DSS

    • Data Warehouse (Database Layer): Store atomic data in industry standard Data Warehouse.
    • OLAP Engine (Application Logic Layer): Generate SQL execution plans in the OLAP engine to obtain OLAP functionality.
    • Decision Support Client (Presentation Layer): Obtain multi-dimensional reports from the DSS Client.

    OLAP Client/Server Architecture

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    101/158

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    102/158

    Relational OLAP

    + Relational Online Analytical Processing
         – OLAP functionality using relational database and familiar query tools to store and analyze multidimensional data
    + Multidimensional data schema support
    + Data access language & query performance for multidimensional data
    + Support for Very Large Databases

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    103/158

    Data Modeling for Data Warehouse

    + How to structure the data in your data warehouse?
    + Process that produces abstract data models for one or more database components of the data warehouse
    + Modeling for Warehouse is different from that for Operational database
         – Dimensional Modeling, Star Schema Modeling or Fact/Dimension Modeling

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    104/158

    Modeling Techniques

    • Entity-Relationship Modeling
         – Traditional modeling technique
         – Technique of choice for OLTP
         – Suited for corporate data warehouse
    • Dimensional Modeling
         – Analyzing business measures in the specific business context
         – Helps visualize very abstract business questions
         – End users can easily understand and navigate the data structure

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    105/158

    Entity-Relationship Modeling - Basic Concepts

    + The ER modeling technique is a discipline used to illuminate the microscopic relationships among data elements.
    + The highest art form of ER modeling is to remove all redundancy in the data.

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    106/158

    An Order Processing Model

    [Figure: ER diagram linking Order Header, Order Details, Customer Table, Item Table, Salesrep Table, City, Sales District, Sales Region, Sales Country, Product Brand and Product Category through foreign keys (FK)]

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    107/158

    Entity-Relationship Modeling - Basic Concepts

    • Entity
         – Object that can be observed and classified by its properties and characteristics
         – Business definition with a clear boundary
         – Characterized by a noun
         – Example
            • Product
            • Employee

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    108/158

    Entity-Relationship Modeling - Basic Concepts

    • Relationship
         – Relationship between entities - structural interaction and association
         – described by a verb
         – Cardinality
            • 1-1
            • 1-M
            • M-M
         – Example: Books belong to Printed Media

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    109/158

    Entity-Relationship Modeling - Basic Concepts

    • Attributes
         – Characteristics and properties of entities
         – Example
            • Book Id, Description, book category are attributes of entity "Book"
         – Attribute name should be unique and self-explanatory
         – Primary Key, Foreign Key, Constraints are defined on Attributes

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    110/158

    Entity-Relationship Modeling – Why Not?

    + Use of the ER modeling technique defeats the basic allure of data warehousing, namely intuitive and high-performance retrieval of data.

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    111/158

    Dimensional Modeling - Basic Concepts

    + Represents the data in a standard, intuitive framework that allows for high-performance access.
    + Schema designed to process large, complex, ad hoc and data-intensive queries.
    + No concern for concurrency, locking and insert/update/delete performance
    + Every dimensional model is composed of one table with a multipart key, called the fact table, and a set of smaller tables called dimension tables.
    + This characteristic "star-like" structure is often called a star join.

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    112/158

    Star Schema Representation

    + Fact and Dimensions are represented by physical tables in the data warehouse database
    + Fact tables are related to each dimension table in a Many to One relationship (Primary/Foreign Key Relationships)
    + Fact Table is related to many dimension tables
         – The primary key of the fact table is a composite primary key from the dimension tables
    + Each fact table is designed to answer a specific DSS question

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    113/158

    Star Schema

    + The fact table is always the largest table in the star schema
    + Each dimension record is related to thousands of fact records
    + Star Schema facilitates data retrieval functions
    + DBMS first searches the Dimension Tables before the larger fact table

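  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    114/158

    Star Schema

    [Figure: star schema. The central fact table holds the keys CITY, PRODUCT, PERIOD and CUSTOMER and the measures SALES AMOUNT and UNITS. Dimensions: City (REGION, STATE, DISTRICT, CITY), Product (PRODUCT, BRAND, COLOR, CATEGORY, SIZE), Period (DAY, MONTH, YEAR, QUARTER), Customer (CUSTOMER, CATEGORY, CONTACT, ADDRESS)]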

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    115/158

    Star Schema for Sales

    [Figure: a central Fact Table surrounded by Dimension Tables]
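A star schema for sales can be sketched with Python's built-in sqlite3 module. The table names, column names and sample rows are hypothetical and much smaller than a real warehouse; the customer dimension is left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_city    (city_key INTEGER PRIMARY KEY, city TEXT, state TEXT, region TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product TEXT, brand TEXT, category TEXT);
CREATE TABLE dim_period  (period_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
-- Fact table: composite primary key made of the dimension keys, plus measures.
CREATE TABLE fact_sales (
    city_key     INTEGER REFERENCES dim_city(city_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    period_key   INTEGER REFERENCES dim_period(period_key),
    sales_amount REAL,
    units        INTEGER,
    PRIMARY KEY (city_key, product_key, period_key)
);
INSERT INTO dim_city VALUES (1, 'Hyderabad', 'Andhra Pradesh', 'South'), (2, 'Pune', 'Maharashtra', 'West');
INSERT INTO dim_product VALUES (1, 'Cola', 'FizzCo', 'Soft drinks'), (2, 'Soap', 'CleanCo', 'Toiletries');
INSERT INTO dim_period VALUES (1, '2011-04-01', 'Apr', 'Q2', 2011), (2, '2011-07-01', 'Jul', 'Q3', 2011);
INSERT INTO fact_sales VALUES (1, 1, 1, 100.0, 10), (1, 2, 1, 40.0, 4), (2, 1, 2, 70.0, 7), (1, 1, 2, 60.0, 6);
""")

# Star join: group on dimension attributes, aggregate the fact measures.
STAR_QUERY = """
SELECT c.region, p.category, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_city c    ON c.city_key = f.city_key
JOIN dim_product p ON p.product_key = f.product_key
GROUP BY c.region, p.category
ORDER BY c.region, p.category
"""
rows = conn.execute(STAR_QUERY).fetchall()
```

Each dimension is one join away from the fact table, which is what gives the schema its star shape.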

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    116/158

    Dimensional Modeling - Basic Concepts

    • Fact Tables
         – The most useful facts in a fact table are numeric and additive
         – Typically represents a business transaction or event that can be used in analyzing business process
         – By nature, fact tables are sparse
         – Usually very large - billions of records

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    117/158

    Dimensional Modeling - Basic Concepts

    • Dimension Tables
         – Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table.
         – Dimension tables most often contain descriptive textual information
         – Determine contextual background for facts
         – Examples
            • Time
            • Location/Region
            • Customers

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    118/158

    Dimensional Modeling - Basic Concepts

    • Measures
         – A numeric attribute of a fact
         – Represents performance or behavior of the business relative to the dimensions
         – The actual numbers are called variables
         – Occupy very little space compared to Fact Tables
         – Examples
            • Quantity supplied
            • Transaction amount
            • Sales volume

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    119/158

    Fact Table & Dimension Tables

    + Fact Tables
         – Numerical measurements of business are stored in Fact Tables.
    + Dimensional Tables
         – Dimensions are attributes about facts.

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    120/158

    Conformed Dimensions

    + Dimension that means the same thing with every possible fact table that it can be joined with
    + Conformed dimensions are most essential
         – For the Bus Architecture
         – For the integrated function of the Data Warehouse
    + Some common dimensions are Customer, Product, Location, Time

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    121/158

    Surrogate Keys

    + All tables (facts and dimensions) should use Data Warehouse generated surrogate keys rather than production keys
         – Production keys sometimes get reused
         – In case of mergers/acquisitions, surrogate keys protect you from different key formats
         – Production systems may change to generalize key definitions
         – Using surrogate keys will be faster
         – Surrogate keys handle Slowly Changing Dimensions well
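A minimal sketch of surrogate key assignment during loading, assuming a hypothetical `SurrogateKeyGenerator` class and made-up source system names. In practice the mapping would live in a warehouse table rather than in memory.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hand out warehouse-owned integer keys for production keys."""

    def __init__(self):
        self._next = count(1)
        self._by_natural = {}  # (source_system, production_key) -> surrogate

    def key_for(self, source_system, production_key):
        """Return the existing surrogate key, or assign the next free one."""
        natural = (source_system, production_key)
        if natural not in self._by_natural:
            self._by_natural[natural] = next(self._next)
        return self._by_natural[natural]


keys = SurrogateKeyGenerator()
a = keys.key_for("billing", "CUST-001")
b = keys.key_for("crm", "001")           # a different key format, e.g. after a merger
c = keys.key_for("billing", "CUST-001")  # same production key resolves to the same surrogate
```

Because the warehouse owns the integers, a change of key format in a source system only touches the mapping, not the fact tables.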

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    122/158

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    123/158

    Factless Fact Tables

    • For Event Tracking, e.g. attendance

    [Figure: fact table holding only Date_Key, Student_Key, Course_Key, Teacher_Key and Facility_Key, joined to the Date, Student, Course, Teacher and Facility dimensions]
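A factless fact table for attendance can be sketched with sqlite3. The table and column names and the sample rows are hypothetical, and the teacher and facility dimensions are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_course  (course_key INTEGER PRIMARY KEY, title TEXT);
-- Factless fact table: only keys, one row per attendance event.
CREATE TABLE fact_attendance (
    date_key INTEGER, student_key INTEGER, course_key INTEGER,
    PRIMARY KEY (date_key, student_key, course_key)
);
INSERT INTO dim_student VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO dim_course VALUES (1, 'Data Warehousing'), (2, 'Data Mining');
INSERT INTO fact_attendance VALUES
    (20110801, 1, 1), (20110801, 2, 1), (20110802, 1, 1), (20110802, 1, 2);
""")

# With no measure column, analysis is done by counting rows.
attendance = conn.execute("""
    SELECT c.title, COUNT(*) AS sessions_attended
    FROM fact_attendance f
    JOIN dim_course c ON c.course_key = f.course_key
    GROUP BY c.title
    ORDER BY c.title
""").fetchall()
```

The existence of a row is the fact: a student attended a course on a date.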

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    124/158

    Coverage Tables

    • Problem: how to find out which Products on promotion did not sell?

    [Figure: sales fact table with Date_Key, Product_Key, Store_Key, Promotion_Key and Dollars Sold, joined to the Date, Store, Product and Promotion dimensions]

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    125/158

    Coverage Tables

    • Solution - Coverage Tables

    [Figure: Sales Promotion Coverage Table with Date_Key, Product_Key, Store_Key and Promotion_Key, joined to the Date, Store, Product and Promotion dimensions]
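The coverage-table idea reduces to a set difference: the sales fact table records only what sold, so a second table is needed to record everything that was on promotion. The keys and rows below are hypothetical.

```python
# Hypothetical rows keyed by (date_key, product_key, store_key, promotion_key).
coverage = {  # every product that was on promotion
    (20110801, "cola", "S1", "summer"),
    (20110801, "soap", "S1", "summer"),
    (20110801, "juice", "S2", "summer"),
}
sales = {  # promoted products that actually sold
    (20110801, "cola", "S1", "summer"),
    (20110801, "juice", "S2", "summer"),
}

# Promoted-but-unsold combinations are in coverage and absent from sales.
unsold = coverage - sales
unsold_products = sorted({product for _, product, _, _ in unsold})
```

In SQL the same question would typically be asked with an outer join or a NOT EXISTS subquery between the coverage table and the sales fact table.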

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    126/158

    Snowflake Schema

    + Dimension tables are normalized by decomposing at the attribute level
    + Each dimension has one key for each level of the dimension's hierarchy
    + Good performance when queries involve aggregation
    + Complicated maintenance and metadata; explosion in the number of tables
    + Makes user representation more complex and intricate

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    127/158

    Snowflake Schema - Example

    [Figure: a Fact Table linked to several Dim Tables]
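A snowflaked product dimension can be sketched with sqlite3, using one table and one key per hierarchy level. All table names, column names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked product dimension: one table and key per hierarchy level.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_brand (brand_key INTEGER PRIMARY KEY, brand TEXT,
                        category_key INTEGER REFERENCES dim_category(category_key));
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product TEXT,
                          brand_key INTEGER REFERENCES dim_brand(brand_key));
CREATE TABLE fact_sales (product_key INTEGER REFERENCES dim_product(product_key),
                         sales_amount REAL);
INSERT INTO dim_category VALUES (1, 'Soft drinks'), (2, 'Toiletries');
INSERT INTO dim_brand VALUES (1, 'FizzCo', 1), (2, 'CleanCo', 2);
INSERT INTO dim_product VALUES (1, 'Cola', 1), (2, 'Juice', 1), (3, 'Soap', 2);
INSERT INTO fact_sales VALUES (1, 100.0), (2, 50.0), (3, 40.0), (1, 25.0);
""")

# Reaching the category level takes one join per hierarchy level.
by_category = conn.execute("""
    SELECT cat.category, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p    ON p.product_key = f.product_key
    JOIN dim_brand b      ON b.brand_key = p.brand_key
    JOIN dim_category cat ON cat.category_key = b.category_key
    GROUP BY cat.category
    ORDER BY cat.category
""").fetchall()
```

Compared with a star schema, the category name is stored once per category instead of once per product, at the cost of extra tables and joins.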

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    128/158

     

    UNIT III to

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    129/158

    + Data mining is the automated detection of new, valuable and non-trivial information in large volumes of data.
    + It predicts future trends and finds behavior that the experts may miss because it lies outside their expectations
         – Data mining lets you be proactive
         – Prospective rather than retrospective
    + Data mining leads to simplification and automation of the overall statistical process of deriving information from huge volumes of data.

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    130/158

    Examples of Data Modeling Tools

    • ERWIN
         – Supports Data Warehouse design as a modeling technique
    • Powersoft WarehouseArchitect
         – Module of Power Designer specifically for DW Modeling
    • Oracle Designer
         – Can be extended for Warehouse modeling
    • Others like Infomodeler, Silverrun are also used

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    131/158

    Data Mining Introduction

    • DM - what it can do
         – Exploit patterns & relationships in data to produce models
         – Two uses for models
            • Predictive
            • Descriptive
    • DM - what it can't do
         – Automatically find relationships
            • without user intervention
            • when no relationships exist

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    132/158

    Data Mining Introduction

    • Data Mining and Data Warehousing
         – Data preparation for DM may be part of the Data Warehousing
         – Data Warehouse not a requirement for Data Mining
    • DM and OLAP
         – OLAP = Classic descriptive model
         – Requires significant user input
         – Example: Beer and diaper sales
            • An OLAP tool shows reports giving sales of different items
            • A data mining tool analyses the data and predicts how many times beer and diapers are sold together
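The beer-and-diapers example can be sketched as pair counting over market baskets, followed by the usual support and confidence ratios. The baskets and the `pair_counts` helper are hypothetical; this is a simple co-occurrence count, not a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(BASKETS)
together = counts[("beer", "diapers")]            # baskets containing both items
support = together / len(BASKETS)                 # share of all baskets
beer_baskets = sum("beer" in b for b in BASKETS)
confidence = together / beer_baskets              # share of beer baskets that also hold diapers
```

An OLAP report would show the sales of beer and of diapers separately; the pair count is what surfaces the relationship between them.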

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    133/158

    Data Mining

    + Proactive
    + Automatically searches
         – Anomalies
         – Possible Relationships
         – Identify Problems before the end-user
    + Data Mining tools analyze the data, uncover problems or opportunities hidden in data relationships, form computer models based on their findings, and then use the models to predict business behavior – with minimal end-user intervention

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    134/158

    Data Mining

    + A methodology designed to perform knowledge-discovery expeditions over the database data with minimal end-user intervention
    + 3 Stages of Data
         – Data
         – Information
         – Knowledge

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    135/158

    Extraction of Knowledge from Data

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    136/158

    4 Phases of Data Mining

    • Data Preparation
         – Identify the main data sets to be used by the data mining operation (usually the data warehouse)
    • Data Analysis and Classification
         – Study the data to identify common data characteristics or patterns
            • Data groupings, classifications, clusters, sequences
            • Data dependencies, links, or relationships
            • Data patterns, trends, deviation

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    137/158

    4 Phases of Data Mining

    • Knowledge Acquisition
         – Uses the results of the Data Analysis and Classification phase
         – Data mining tool selects the appropriate modeling or knowledge-acquisition algorithms
            • Neural Networks
            • Decision Trees
            • Rules Induction
            • Genetic algorithms
            • Memory-Based Reasoning
    • Prognosis
         – Predict Future Behavior
         – Forecast Business Outcomes
            • 65% of customers who did not use a particular credit card in the last 6 months are 88% likely to cancel the account.

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    138/158

    Data Mining

    + Still a New Technique
    + May find many Un-meaningful Relationships
    + Good at finding Practical Relationships
         – Define Customer Buying Patterns
         – Improve Product Development and Acceptance
         – Etc.
    + Potential of becoming the next frontier in database development

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    139/158

    Why Data Mining

    • Credit ratings/targeted marketing:
         – Given a database of 100000 names, which persons are the least likely to default on their credit cards?
         – Identify likely responders to sales promotions
    • Fraud detection
         – Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    140/158

    • Process of semi-automatically analyzing large databases to find patterns that are:
         – valid: hold on new data with some certainty
         – novel: non-obvious to the system
         – useful: should be possible to act on the item
         – understandable: humans should be able to interpret the pattern
    • Also known as Knowledge Discovery in Databases (KDD)

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    141/158

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    142/158

    Applications (continued)

    • Medicine: disease outcome, effectiveness of treatments
         – analyze patient disease history: find relationship between diseases
    • Molecular/Pharmaceutical: identify new drugs
    • Scientific data analysis:
         – identify new galaxies by searching for sub clusters
    • Web site/store design and promotion:
         – find affinity of visitor to pages and modify layout

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    143/158

    Knowledge Discovery Definition

    Knowledge Discovery in Data is the non-trivial process of identifying
         – valid
         – novel
         – potentially useful
         – and ultimately understandable
    patterns in data.

    from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press 1996

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    144/158

    Related Fields

    [Figure: Data Mining and Knowledge Discovery shown in relation to Statistics, Machine Learning, Databases and Visualization]

  • 8/18/2019 DWHandDATAMiningPresentationPresentationForTJInstitute_2 - Copy.ppt

    145/158

    Statistics, Machine Learning and Data Mining

    • Statistics:
         – more theory-based
         – more focused on testing hypotheses
    • Machine learning
         – more heuristic
         – focused on improving performance of a learning agent
         – also looks at real-time learning and robotics – areas not part of data mining
    • Data Mining and Knowledge Discovery
         – integrates theory and heuristics
         – focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
    • Distinctions are fuzzy

    Hno.ledge Disco*ery 3rocessLo. according to %,IS3;D$


    Monitoring: continuous monitoring and improvement is an addition to CRISP

    Historical Note: Many Names of Data Mining


    • Data Fishing, Data Dredging: 1960-
     – used by statisticians (as bad name)
    • Data Mining: 1990-
     – used in DB community, business
    • Knowledge Discovery in Databases (1989-)
     – used by AI, Machine Learning community
    • also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...

    Currently: Data Mining and Knowledge Discovery are used interchangeably

    Some Definitions


    • Instance (also Item or Record)
     – an example, described by a number of attributes
     – e.g. a day can be described by temperature, humidity and cloud status
    • Attribute or Field
     – measuring aspects of the Instance, e.g. temperature
    • Class (Label)
     – grouping of instances, e.g. days good for playing
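
    The three terms can be illustrated with a small sketch. The field names, units, weather values, and the `play` label below are assumptions made up for illustration; they do not come from the slides.

```python
from dataclasses import dataclass, fields


@dataclass
class Day:
    # Attributes (fields): measured aspects of the instance.
    temperature: float  # degrees Celsius (assumed unit)
    humidity: float     # percent
    cloud_status: str   # e.g. "sunny", "overcast", "rainy"
    # Class (label): the group this instance belongs to.
    play: bool          # True = a day good for playing


# Each Day object is one instance (item, record).
days = [
    Day(24.0, 60.0, "sunny", True),
    Day(18.0, 90.0, "rainy", False),
    Day(21.0, 70.0, "overcast", True),
]

# Everything except the label counts as a descriptive attribute.
attribute_names = [f.name for f in fields(Day) if f.name != "play"]
good_days = [d for d in days if d.play]

print(attribute_names)
print(len(good_days), "of", len(days), "instances are labelled good for playing")
```

    Keeping the class label as a separate field from the descriptive attributes mirrors how most mining tools expect their input: a table of records with one designated target column.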

    Major Data Mining Tasks


    • Classification: predicting an item class (e.g. with a Decision Tree)
    • Clustering: finding clusters in data
    • Associations: e.g. A & B & C occur frequently
    • Visualization: to facilitate human discovery
    • Summarization: describing a group
    • Deviation Detection: finding changes
    • Estimation: predicting a continuous value
    • Link Analysis: finding relationships
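
    As an illustration of the Associations task, the sketch below counts how often each three-item combination occurs together. The transactions and the 0.5 support threshold are hypothetical, and this brute-force count is only a minimal sketch, not a specific algorithm named in the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each is a set of items seen together.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"B", "C"},
    {"A", "B", "C"},
]


def itemset_support(transactions, size):
    """Return the fraction of transactions that contain each itemset of the given size."""
    counts = Counter()
    for t in transactions:
        # Sorting gives every itemset one canonical tuple form.
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items()}


support = itemset_support(transactions, 3)
min_support = 0.5  # assumed threshold for "occur frequently"
frequent = {s: v for s, v in support.items() if v >= min_support}
print(frequent)
```

    Enumerating every combination grows quickly with transaction size, which is why practical association miners prune candidates instead of counting all of them.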

    Data Growth


    In 2 years (2003 to 2005), the size of the largest database TRIPLED

    Data Growth Rate


    • Twice as much information was created in 2002 as in 1999 (30% growth rate)
    • Other growth rate estimates even higher
    • Very little data will ever be looked at by a human

    Knowledge Discovery is NEEDED to make sense and use of data.
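
    The link between "twice as much in three years" and a yearly growth rate can be checked with a short calculation. This sketch assumes constant compound growth, which the slide does not state; the function name is made up for illustration.

```python
def annual_growth_rate(factor, years):
    """Constant yearly rate that multiplies a quantity by `factor` over `years` years."""
    return factor ** (1.0 / years) - 1.0


# Information created doubled between 1999 and 2002.
rate = annual_growth_rate(2.0, 2002 - 1999)
print(f"{rate:.1%}")
```

    Under that assumption, doubling over three years works out to roughly 26% per year, in the same range as the growth rate quoted on the slide.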


    See http://www.crisp-dm.org/ for more information on CRISP-DM.
