The Data Warehouse ETL Toolkit - Chapter 03

Upload: abacus83

Post on 01-Jun-2018

  • 8/9/2019 The Data Warehouse ETL Toolkit - Chapter 03

    1/55

    VSV Training

    Chapter 3: Extracting

    Prepared by: Thang Nguyen, Dung Phan

    Date: 09/02/2008

    3.0 Overview

    The ETL process needs to effectively integrate systems that have different:

    Database management systems

    Operating systems

    Hardware

    Communications protocols

    3.1 The Logical Data Map

    3.1 Designing Logical Before Physical

    1. Have a plan.

    2. Identify data source candidates.

    3. Analyze source systems with a data-profiling tool.

    4. Receive walk-through of data lineage and business rules.

    5. Receive walk-through of data warehouse data model.

    6. Validate calculations and formulas.

    3.2 Inside the Logical Data Map

    Before descending into the details of the various sources you will encounter, we need to explore the actual design of the logical data mapping document.

    3.2.1 Components of the Logical Data Map

    Target table name.

    Target column name.

    Table type.

    SCD (slowly changing dimension) type.

    Source database.

    Source table name.

    Source column name.

    Transformation.
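The components listed above can be read as the columns of one row in the mapping document. The following is a minimal sketch of that idea, assuming a simple in-memory representation; the class name, field names, and sample values are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogicalDataMapRow:
    """One row of a logical data map (illustrative field names)."""
    target_table: str
    target_column: str
    table_type: str            # e.g. "dimension" or "fact"
    scd_type: Optional[int]    # slowly changing dimension type; None when not applicable
    source_database: str
    source_table: str
    source_column: str
    transformation: str        # free-text or pseudo-SQL description


# Hypothetical example row: a customer name mapped into a dimension table.
row = LogicalDataMapRow(
    target_table="customer_dim",
    target_column="customer_name",
    table_type="dimension",
    scd_type=2,
    source_database="crm_prod",
    source_table="customers",
    source_column="cust_nm",
    transformation="trim and convert to title case",
)
```

Keeping the map as structured rows rather than free text makes it possible to sort or filter it by target table or by source system.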

    3.2.2 Components of the Logical Data Map (cont'd)

    3.2.2 Using Tools for the Logical Data Map

    Some ETL and data-modeling tools directly capture logical data mapping information. There is a natural tendency to want to indicate the data mapping directly in these tools.

    3.3 Building the Logical Data Map

    The analysis of the source system is usually broken into two major phases:

    The data discovery phase

    The anomaly detection phase

    3.3.1 Data Discovery Phase

    1. Collecting and Documenting Source Systems

    2. Keeping Track of the Source Systems

    3. Determining the System-of-Record

    4. Analyzing the Source System: Using Findings from Data Profiling

    3.3.1.1 Collecting and Documenting Source Systems

    The source systems are usually established in various pieces of documentation, including interview notes, reports, and the data modeler's logical data mapping.

    3.3.1.2 Keeping Track of the Source Systems

    Subject area.

    Interface name.

    Business name.

    Priority.

    Department/Business use.

    Business owner.

    Technical owner.

    DBMS.

    Production server/OS.

    Daily users.

    DB size.

    DB complexity.

    Transactions per day.

    Comments.

    3.3.1.4 Analyzing the Source System: Using Findings from Data Profiling

    3.3.2 Data Content Analysis

    NULL values.

    Dates in non-date fields.
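Both checks named above can be sketched as a small column profiler. This is an illustrative sketch only: the function names, the list of date formats, and the sample values are assumptions, and a real profiling tool would cover far more cases.

```python
from datetime import datetime

# Assumed date layouts to probe for; extend as the source system requires.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def looks_like_date(value):
    """Return True when a text value parses as a date in one of DATE_FORMATS."""
    if not isinstance(value, str):
        return False
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value.strip(), fmt)
            return True
        except ValueError:
            continue
    return False


def profile_column(values):
    """Count NULL-like entries and date-like strings in a free-text column."""
    nulls = sum(
        1 for v in values
        if v is None or (isinstance(v, str) and v.strip() == "")
    )
    date_like = sum(1 for v in values if looks_like_date(v))
    return {"rows": len(values), "nulls": nulls, "date_like": date_like}


# Hypothetical contents of a comment field that should not hold dates.
sample = ["shipped late", None, "2008-02-09", "", "09/02/2008", "ok"]
report = profile_column(sample)
```

Here empty strings are counted together with `None`, which is a design choice; some sources treat the two differently.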

    3.3.3 Collecting Business Rules in the ETL Process

    You might think at this stage in the process that all of the business rules must have been collected.

    3.4 Integrating Heterogeneous Data Sources

    1. Identify the source systems.

    2. Understand the source systems (data profiling).

    3. Create record matching logic.

    4. Establish survivorship rules.

    5. Establish non-key attribute business rules.

    6. Load conformed dimension.
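Steps 3 to 5 above can be illustrated with a small sketch. It assumes two hypothetical sources (`crm` and `billing`), a match key built from a normalized name and postal code, and a survivorship rule that prefers the higher-priority source and falls back to the next one when an attribute is empty. None of these choices come from the slides.

```python
# Assumed source ranking for survivorship: lower number wins.
SOURCE_PRIORITY = {"crm": 0, "billing": 1}


def match_key(record):
    """Record matching logic: normalized name plus postal code."""
    return (record["name"].strip().lower(), record["postal_code"].strip())


def conform(records):
    """Group matching records and build one surviving row per group."""
    groups = {}
    for rec in records:
        groups.setdefault(match_key(rec), []).append(rec)

    conformed = []
    for matches in groups.values():
        ranked = sorted(matches, key=lambda r: SOURCE_PRIORITY[r["source"]])
        survivor = {}
        for attr in ("name", "postal_code", "phone"):
            # Take the first non-empty value in priority order.
            survivor[attr] = next((r[attr] for r in ranked if r.get(attr)), None)
        conformed.append(survivor)
    return conformed


records = [
    {"source": "billing", "name": "ACME Corp ", "postal_code": "10001", "phone": "555-0100"},
    {"source": "crm", "name": "Acme Corp", "postal_code": "10001", "phone": ""},
]
dimension_rows = conform(records)
```

In this sample the name survives from `crm` and the phone number from `billing`, because the `crm` phone is empty.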

    3.4.2 Connecting to Diverse Sources through ODBC

    ODBC manager.

    ODBC driver.

    3.5 Mainframe Sources

    COBOL copybooks

    EBCDIC character sets

    Numeric data

    Redefined fields

    3.5.2 EBCDIC Character Set

    Both the legacy mainframe systems and the UNIX- and Windows-based systems, where most data warehouses reside, are stored as bits and bytes.

    3.5.3 Converting EBCDIC to ASCII

    You might think that since both systems use bits and bytes, data from your mainframe system is readily usable on your UNIX or Windows system.
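A short sketch shows why the same bytes are not readily usable across platforms. It assumes the mainframe data uses EBCDIC code page 037, which Python exposes as the `cp037` codec; other EBCDIC code pages exist, so the right codec depends on the source system.

```python
# Assumption: the source uses EBCDIC code page 037 (US/Canada).
EBCDIC_CODEC = "cp037"

# The word HELLO as it would be stored on the mainframe.
ebcdic_bytes = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])

# Correct translation: decode with the EBCDIC codec.
text = ebcdic_bytes.decode(EBCDIC_CODEC)

# Reading the same bytes as if they were already an ASCII-compatible
# encoding yields unrelated characters.
misread = ebcdic_bytes.decode("latin-1")

# Round trip back to EBCDIC bytes.
round_trip = text.encode(EBCDIC_CODEC)
```

This character-level translation only applies to text fields; binary and packed numeric fields must not be passed through a character codec.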

    3.5.4 Transferring Data between Platforms

    Luckily, translating data from EBCDIC to ASCII is quite simple. In fact it's virtually automatic, assuming you use File Transfer Protocol (FTP) to transfer the data from the mainframe to your data warehouse platform.

    3.5.6 Using PICtures

    3.5.7 Unpacking Packed Decimals

    Reformat data

    Transfer data

    Use robust ETL tools

    Use a utility program
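As an illustration of what unpacking involves, here is a minimal sketch of decoding a COBOL COMP-3 (packed decimal) field by hand. It assumes the common layout of two decimal digits per byte with the sign in the final nibble, and it accepts only the sign nibbles 0xC, 0xF (positive) and 0xD (negative). The function name and the `scale` parameter are hypothetical, and this is not a substitute for an ETL tool or utility program.

```python
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal field; `scale` is the number of implied decimal places."""
    if not raw:
        raise ValueError("empty packed-decimal field")

    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)      # high nibble
        digits.append(byte & 0x0F)    # low nibble
    digits.append(raw[-1] >> 4)       # last byte: one digit ...
    sign_nibble = raw[-1] & 0x0F      # ... followed by the sign

    if any(d > 9 for d in digits):
        raise ValueError("invalid digit nibble in packed-decimal field")

    value = int("".join(str(d) for d in digits))
    if sign_nibble == 0x0D:
        value = -value
    elif sign_nibble not in (0x0C, 0x0F):
        raise ValueError("unsupported sign nibble")

    return Decimal(value).scaleb(-scale)


# 0x12 0x34 0x5C holds the digits 12345 with a positive sign.
amount = unpack_comp3(bytes([0x12, 0x34, 0x5C]), scale=2)
negative = unpack_comp3(bytes([0x12, 0x3D]))
```

The number of implied decimal places is not stored in the data; it comes from the field's PICture clause in the copybook.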

    3.5.8 Working with Redefined Fields

    3.5.9 Multiple OCCURS

    3.5.11 Handling Mainframe Variable Record Lengths (cont'd)

    Convert all data

    Transfer the

    3.6 Flat Files

    Delivery of source data.

    Working/staging tables.

    Preparation for bulk load.

    Not all flat

    3.6.1 Processing Fixed Length Flat Files
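A fixed-length record is typically read by slicing each line at known character positions. The sketch below assumes a hypothetical three-field layout; in practice the positions come from the file's layout documentation.

```python
# Hypothetical layout: (field name, start offset, end offset), zero-based, end exclusive.
LAYOUT = [
    ("customer_id", 0, 6),
    ("name", 6, 26),
    ("balance", 26, 35),
]


def parse_fixed(line):
    """Slice one fixed-length record into named fields, trimming padding."""
    return {name: line[start:end].strip() for name, start, end in LAYOUT}


# Build a sample 35-character record that follows the layout above.
line = "000042" + "Jane Doe".ljust(20) + "000012345"
record = parse_fixed(line)
```

All values stay as text here; converting numeric fields is left to a later transformation step.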

    3.6.2 Processing Delimited Flat Files

    Flat

    3.7 XML Sources

    3.7.1 Character Sets

    Character sets are groups of unique symbols used for displaying and printing computer output

    3.7.2 XML Meta Data (cont'd)

    Schema

    Elements that appear in an XML document

    Attributes that appear in an XML document

    The number and order of child elements

    Data types of elements and attributes

    Default and

    3.8 Web Log Sources

    3.8.1 W3C Common and Extended Formats

    Regardless of the OS, Web logs have a common set of columns that usually include the following:

    Date.

    Time.

    c-ip.

    Service Name.

    s-ip.

    cs-method.

    cs-uri-stem.

    3.8.1 W3C Common and Extended Formats (cont'd)

    cs-uri-query.

    sc-status.

    sc-bytes.

    cs(User-Agent).

    cs(Cookie).

    cs(Referrer).

    Web server documentation or the W3C Web site

    Server Name.

    cs-username.

    Server Port.

    Bytes Received.

    Time Taken.

    Protocol Version.
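In the W3C extended format, a `#Fields:` directive in the log file names the columns that follow, so a reader can map each line to the field names listed above. The sketch below assumes whitespace-separated values; the function name and the sample log lines are made up for illustration.

```python
def parse_w3c_log(lines):
    """Turn W3C extended log lines into dicts keyed by the #Fields names."""
    fields = []
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The directive lists the column names for the entries that follow.
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#"):
            continue  # other directives such as #Version or #Date
        else:
            rows.append(dict(zip(fields, line.split())))
    return rows


# Hypothetical log excerpt.
log_lines = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem cs-uri-query sc-status",
    "2008-02-09 10:15:00 10.0.0.5 GET /product/product.asp p=27717&c=163 200",
]
entries = parse_w3c_log(log_lines)
```

Reading the column names from the directive, rather than hard-coding them, keeps the parser working when the Web server is configured to log a different set of fields.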

    3.8.2 Name Value Pairs in Web Logs

    Take a look at the query string and notice the following segments:

    product

    product.asp

    ? The question mark indicates that parameters were sent to the program

    3.8.2 Name Value Pairs in Web Logs (cont'd)

    The parameters are captured in the Web log in name/value pairs. In this example, you can see three parameters, each separated by an ampersand (&).

    p indicates the product number

    c indicates the product category number

    s indicates the search string entered by the user to
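The name/value pairs described above can be split with Python's standard `urllib.parse` module. The URI and its parameter values below are hypothetical, chosen only to match the `p`, `c`, and `s` parameters named in the text.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical request as it might be reconstructed from cs-uri-stem and cs-uri-query.
uri = "/product/product.asp?p=27717&c=163&s=dress+shirt"

parts = urlsplit(uri)
program = parts.path             # everything before the question mark
params = parse_qs(parts.query)   # name/value pairs split on the ampersand

product_number = params["p"][0]
category_number = params["c"][0]
search_string = params["s"][0]   # "+" is decoded to a space
```

`parse_qs` returns a list per name because a parameter may legally appear more than once in a query string.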

    3.9 ERP System Sources

    3.10.1 Detecting Changes

    Using Audit Columns.

    Database Log Scraping or Sniffing

    Timed Extracts

    Process of Elimination

    Initial and Incremental Loads
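Of the techniques listed, audit columns are the simplest to sketch. The example below assumes a hypothetical `orders` table with a `last_modified` audit column and uses an in-memory SQLite database purely for illustration; the incremental extract selects only rows changed since the previous run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2008-02-01 09:00:00"),
        (2, 25.5, "2008-02-08 14:30:00"),
        (3, 7.25, "2008-02-09 08:15:00"),
    ],
)


def extract_changed(connection, since):
    """Return rows whose audit column is later than the last successful extract."""
    cursor = connection.execute(
        "SELECT order_id, amount, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY last_modified",
        (since,),
    )
    return cursor.fetchall()


# Timestamp of the previous extract; an ETL job would persist this between runs.
last_run = "2008-02-05 00:00:00"
changed = extract_changed(conn, last_run)
```

This approach only works if every process that updates the source table reliably maintains the audit column.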

    3.10.2 Extraction Tips

    Constrain on indexed columns.

    Retrieve the data you need.

    Use DISTINCT sparingly.

    Use SET operators sparingly.

    Use HINT as necessary.

    Avoid NOT.

    Avoid functions in your where clause.
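Three of these tips can be shown in one sketch: constrain on an indexed column, retrieve only the columns you need, and avoid wrapping the column in a function in the where clause. The table, index, and data are hypothetical, and SQLite is used only so the example runs on its own. Both queries return the same rows, but the second leaves the indexed column bare, so the database is free to use the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)"
)
conn.execute("CREATE INDEX idx_sales_date ON sales (sale_date)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "2007-12-31", 5.0),
        (2, "2008-01-15", 12.0),
        (3, "2008-06-30", 8.5),
        (4, "2009-01-01", 3.0),
    ],
)

# Function applied to the column: the index on sale_date cannot be used directly.
with_function = conn.execute(
    "SELECT sale_id, amount FROM sales "
    "WHERE substr(sale_date, 1, 4) = '2008' ORDER BY sale_id"
).fetchall()

# Equivalent range predicate on the bare indexed column.
with_range = conn.execute(
    "SELECT sale_id, amount FROM sales "
    "WHERE sale_date >= '2008-01-01' AND sale_date < '2009-01-01' ORDER BY sale_id"
).fetchall()
```

Whether the optimizer actually chooses the index depends on the database and its statistics, so the query plan should be checked on the real source system.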

    3.10.3 Detecting Deleted or Overwritten Fact Records at the Source

    Negotiate with the source system owners, if possible, explicit notification

    Summary

    The Logical Data Map

    The Challenge of Extracting from Disparate Platforms

    Extracting Changed Data