The Data Warehouse ETL Toolkit - Chapter 03
TRANSCRIPT
-
8/9/2019 The Data Warehouse ETL Toolkit - Chapter 03
1/55
VSV Training
Chapter 3: Extracting
Prepared by: Thang Nguyen, Dung Phan
Date: 09/02/2008
-
3.0 Overview
The ETL process needs to effectively integrate systems that have different:
Database management systems
Operating systems
Hardware
Communications protocols
-
3.1 The Logical Data Map
-
3.1 Designing Logical Before Physical
1. Have a plan.
2. Identify data source candidates.
3. Analyze source systems with a data-profiling tool.
4. Receive walk-through of data lineage and business rules.
5. Receive walk-through of data warehouse data model.
6. Validate calculations and formulas.
-
3.2 Inside the Logical Data Map
Before descending into the details of the various sources you will encounter, we need to explore the actual design of the logical data mapping document.
-
3.2.1 Components of the Logical Data Map
Target table name.
Target column name.
Table type.
SCD (slowly changing dimension) type.
Source database.
Source table name.
Source column name.
Transformation.
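The components above can be pictured as one record per target column. The following is a minimal sketch, assuming a Python representation; the class name, field names, and the example values are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogicalDataMapRow:
    """One row of a logical data map (field names are illustrative)."""
    target_table: str
    target_column: str
    table_type: str            # e.g. "dimension" or "fact"
    scd_type: Optional[int]    # slowly changing dimension type; None for facts
    source_database: str
    source_table: str
    source_column: str
    transformation: str        # plain-language or SQL description


# Hypothetical example row; none of these names come from the slides.
row = LogicalDataMapRow(
    target_table="dim_customer",
    target_column="customer_name",
    table_type="dimension",
    scd_type=2,
    source_database="crm_prod",
    source_table="customers",
    source_column="cust_nm",
    transformation="TRIM and convert to title case",
)
print(f"{row.source_table}.{row.source_column} -> {row.target_table}.{row.target_column}")
```

Keeping the map as structured rows rather than free text makes it easy to export to a spreadsheet or to check that every target column has a source.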
-
3.2.2 Components of the Logical Data Map (con't)
-
3.2.2 Using Tools for the Logical Data Map
Some ETL and data-modeling tools directly capture logical data mapping information.
There is a natural tendency to want to indicate the data mapping directly in these tools.
-
3.3 Building the Logical Data Map
The analysis of the source system is usually broken into two major phases:
The data discovery phase
The anomaly detection phase
-
3.3.1 Data Discovery Phase
1. Collecting and Documenting Source Systems
2. Keeping Track of the Source Systems
3. Determining the System-of-Record
4. Analyzing the Source System: Using Findings from Data Profiling
-
3.3.1.1 Collecting and Documenting Source Systems
The source systems are usually established in various pieces of documentation, including interview notes, reports, and the data modeler's logical data mapping.
-
3.3.1.2 Keeping Track of the Source Systems
-
3.3.1.2 Keeping Track of the Source Systems (con't)
Subject area.
Interface name.
Business name.
Priority.
Department/Business use.
Business owner.
Technical owner.
DBMS.
Production server/OS.
-
3.3.1.2 Keeping Track of the Source Systems (con't)
Daily users.
DB size.
DB complexity.
Transactions per day.
Comments.
-
3.3.1.4 Analyzing the Source System: Using Findings from Data Profiling
-
3.3.1.4 Analyzing the Source System: Using Findings from Data Profiling
-
3.3.2 Data Content Analysis
NULL values.
Dates in non-date fields.
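Both checks can be run as a quick profile over a sampled column. The sketch below is one possible approach, not a prescribed method: the function name, the sample values, the single date format, and the choice to count empty strings as NULLs are all assumptions made for illustration.

```python
from datetime import datetime


def profile_column(values, date_format="%Y-%m-%d"):
    """Count NULLs and date-like strings in a column that is not a date type."""
    null_count = 0
    date_like = 0
    for value in values:
        # Assumption: empty or whitespace-only strings count as NULL.
        if value is None or str(value).strip() == "":
            null_count += 1
            continue
        try:
            datetime.strptime(str(value).strip(), date_format)
            date_like += 1
        except ValueError:
            pass  # not a valid date in the assumed format
    return {"rows": len(values), "nulls": null_count, "date_like": date_like}


# Hypothetical free-text column sampled from a source table.
comments = ["shipped", None, "2008-02-09", "", "call back 2008-03-01", "2008-02-30"]
print(profile_column(comments))
```

Only whole-value matches are counted, so a date embedded in a longer comment and an impossible date such as 2008-02-30 are both ignored here; a real profile would likely try several formats.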
-
3.3.3 Collecting Business Rules in the ETL Process
You might think at this stage in the process that all of the business rules must have been collected.
-
3.4 Integrating Heterogeneous Data Sources
1. Identify the source systems.
2. Understand the source systems (data profiling).
3. Create record matching logic.
4. Establish survivorship rules.
5. Establish non-key attribute business rules.
6. Load conformed dimension.
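Steps 3 and 4 above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the matching rule (same email address), the source names, and the priority order are all hypothetical, and real matching logic is usually fuzzier than an exact key.

```python
def match_key(record):
    """Hypothetical matching rule: same email, ignoring case and whitespace."""
    return record["email"].strip().lower()


# Hypothetical survivorship rule: the lower number wins when sources disagree.
SOURCE_PRIORITY = {"crm": 1, "billing": 2}


def conform(records):
    """Group records by match key, then pick the surviving attribute values."""
    groups = {}
    for rec in records:
        groups.setdefault(match_key(rec), []).append(rec)

    conformed = []
    for key, group in groups.items():
        group.sort(key=lambda r: SOURCE_PRIORITY[r["source"]])
        survivor = {"email": key}
        for attr in ("name", "phone"):
            # Take the first non-empty value in source-priority order.
            survivor[attr] = next((r[attr] for r in group if r.get(attr)), None)
        conformed.append(survivor)
    return conformed


records = [
    {"source": "billing", "email": "Ann@Example.com ", "name": "A. Lee", "phone": "555-0100"},
    {"source": "crm", "email": "ann@example.com", "name": "Ann Lee", "phone": ""},
    {"source": "crm", "email": "bob@example.com", "name": "Bob Roy", "phone": "555-0199"},
]
for row in conform(records):
    print(row)
```

In this sample the name survives from the higher-priority source, while the phone number falls back to the lower-priority source because the preferred one is empty.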
-
3.4.2 Connecting to Diverse Sources through ODBC
-
3.4.2 Connecting to Diverse Sources through ODBC (con't)
ODBC manager.
ODBC driver.
-
3.5 Mainframe Sources
COBOL copybooks
EBCDIC character sets
Numeric data
Redefined fields
-
3.5.2 EBCDIC Character Set
Both the legacy mainframe systems and the UNIX- and Windows-based systems, where most data warehouses reside, are stored as bits and bytes.
-
3.5.3 Converting EBCDIC to ASCII
You might think that since both systems use bits and bytes, data from your mainframe system is readily usable on your UNIX or Windows system.
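A small example shows why the bytes are not readily usable as they arrive: the same byte values map to different characters in EBCDIC and in ASCII-based encodings. This sketch assumes the mainframe uses code page 037 (US/Canada EBCDIC), available in Python as the `cp037` codec; the actual code page depends on the source system.

```python
# The same five bytes, as they might arrive from a mainframe extract.
raw = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])

# Read as Latin-1 (an ASCII superset), the bytes are accented letters, not a word.
as_latin1 = raw.decode("latin-1")

# Decoded with an EBCDIC code page, they spell a word.
# cp037 is an assumption; use the code page your source system documents.
as_text = raw.decode("cp037")

print(repr(as_latin1))
print(as_text)  # HELLO

# Once decoded, the text can be written out as ASCII bytes.
ascii_bytes = as_text.encode("ascii")
```

Character translation like this is only safe for text fields; binary and packed numeric fields must not be run through a character codec.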
-
3.5.4 Transferring Data between Platforms
Luckily, translating data from EBCDIC to ASCII is quite simple. In fact it's virtually automatic, assuming you use File Transfer Protocol (FTP) to transfer the data from the mainframe to your data warehouse platform.
-
3.5.6 Using PICtures
-
3.5.7 Unpacking Packed Decimals
Reformat data
Transfer data
Use robust ETL tools
Use a utility program
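For the utility-program option, the decoding itself is short. The sketch below assumes the common packed-decimal (COMP-3) layout of two digits per byte with the sign in the final nibble; the function name and the `scale` parameter are hypothetical, and sign handling is simplified to treat only 0xD as negative.

```python
from decimal import Decimal


def unpack_comp3(data: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal (COMP-3) field into a Decimal.

    Each byte holds two 4-bit digits, except the last byte, whose low
    nibble is the sign (0xD = negative; 0xC or 0xF = positive/unsigned).
    `scale` is the number of implied decimal places from the PICture.
    """
    if not data:
        raise ValueError("empty packed-decimal field")

    digits = []
    for byte in data[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(data[-1] >> 4)
    sign_nibble = data[-1] & 0x0F

    if any(d > 9 for d in digits):
        raise ValueError("not a valid packed-decimal field")

    value = int("".join(map(str, digits)))
    if sign_nibble == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)


# PIC S9(5)V99 COMP-3 occupies 4 bytes: 0x12 0x34 0x56 0x7C is +12345.67
print(unpack_comp3(bytes([0x12, 0x34, 0x56, 0x7C]), scale=2))
print(unpack_comp3(bytes([0x00, 0x01, 0x5D])))
```

Because packed fields are binary, they have to be extracted before any EBCDIC-to-ASCII character translation is applied to the record.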
-
3.5.8 Working with Redefined Fields
-
3.5.9 Multiple OCCURS
-
3.5.11 Handling Mainframe Variable Record Lengths (con't)
Convert all data
Transfer the
-
3.6 Flat Files
Delivery of source data.
Working/staging tables.
Preparation for bulk load.
Not all flat
-
3.6.1 Processing Fixed Length Flat Files
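Fixed-length processing comes down to slicing each record at known offsets. The following is a minimal sketch assuming a hypothetical three-field layout; the field names, offsets, and sample record are invented for illustration, and in practice the layout would come from the file's record definition.

```python
# Hypothetical record layout: (field name, start offset, length).
LAYOUT = [
    ("customer_id", 0, 6),
    ("name", 6, 12),
    ("balance", 18, 9),
]


def parse_fixed(line: str) -> dict:
    """Slice one fixed-length record into named, whitespace-trimmed fields."""
    record_length = LAYOUT[-1][1] + LAYOUT[-1][2]
    if len(line) < record_length:
        raise ValueError(
            f"short record: expected {record_length} characters, got {len(line)}"
        )
    return {name: line[start:start + length].strip() for name, start, length in LAYOUT}


# Build a sample record so each field is padded to its declared width.
line = "000042" + "Ann Lee".ljust(12) + "000123.45"
print(parse_fixed(line))
```

Checking the record length before slicing catches truncated lines early, which is a common problem with files transferred between platforms.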
-
3.6.2 Processing Delimited Flat Files
Flat
-
3.7 XML Sources
-
3.7.1 Character Sets
Character sets are groups of unique symbols used for displaying and printing computer output
-
3.7.2 XML Meta Data (con't)
XML Schema
Elements that appear in an XML document
Attributes that appear in an XML document
The number and order of child elements
Data types of elements and attributes
Default and
-
3.8 Web Log Sources
-
3.8.1 W3C Common and Extended Formats
Regardless of the OS, Web logs have a common set of columns that usually include the following:
Date.
Time.
c-ip.
Service Name.
s-ip.
cs-method.
cs-uri-stem.
-
3.8.1 W3C Common and Extended Formats (con't)
cs-uri-query.
sc-status.
sc-bytes.
cs(User-Agent).
cs(Cookie).
cs(Referrer).
Web server documentation or the W3C Web site
Server Name.
cs-username.
Server Port.
Bytes Received.
Time Taken.
Protocol Version.
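Because the extended format names its own columns in a `#Fields:` directive, a parser can be driven by the log file itself. The sketch below assumes space-separated values and a single `#Fields:` line; the sample log lines, addresses, and URI are made up for illustration.

```python
def parse_w3c_log(lines):
    """Yield one dict per log entry, keyed by the names in the #Fields directive."""
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            yield dict(zip(fields, line.split(" ")))


# Hypothetical sample; the addresses and URI are invented.
sample = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem cs-uri-query sc-status sc-bytes",
    "2008-02-09 10:15:32 192.0.2.10 GET /product/product.asp p=101&c=7 200 5120",
]
for entry in parse_w3c_log(sample):
    print(entry["date"], entry["cs-uri-stem"], entry["sc-status"])
```

Reading the column names from the directive means the same code keeps working when a server is configured to log a different subset of fields.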
-
3.8.2 Name Value Pairs in Web Logs
Take a look at the query string and notice the following segments:
product
product.asp
? The question mark indicates that parameters were sent to the program.
-
3.8.2 Name Value Pairs in Web Logs (con't)
The parameters are captured in the Web log in name/value pairs. In this example, you can see three parameters, each separated by an ampersand (&).
p indicates the product number
c indicates the product category number
s indicates the search string entered by the user to
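Splitting the pairs apart is a one-liner with Python's standard library. In this sketch the parameter names (p, c, s) follow the slide, but the query string and its values are hypothetical.

```python
from urllib.parse import parse_qsl

# Hypothetical query string; only the parameter names come from the slide.
query = "p=101&c=7&s=dress+shirt"

# parse_qsl splits on "&" and "=", and decodes "+" and %XX escapes.
params = dict(parse_qsl(query))
print(params)
```

Converting to a dict keeps only the last value when a name repeats; keep the list of pairs from `parse_qsl` if repeated parameters matter.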
-
3.9 ERP System Sources
-
3.10.1 Detecting Changes
Using Audit Columns
Database Log Scraping or Sniffing
Timed Extracts
Process of Elimination
Initial and Incremental Loads
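The audit-column technique can be sketched with an in-memory SQLite database. The table, the `last_modified` column name, and the timestamps below are hypothetical; the approach also assumes the source system reliably updates the audit column on every change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2008-02-07 09:00:00"),
        (2, 25.5, "2008-02-08 14:30:00"),
        (3, 40.0, "2008-02-09 08:15:00"),
    ],
)


def extract_changed(conn, last_extract_time):
    """Return rows whose audit column is newer than the previous extract."""
    cursor = conn.execute(
        "SELECT order_id, amount, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_extract_time,),
    )
    return cursor.fetchall()


changed = extract_changed(conn, "2008-02-08 00:00:00")
print(changed)
```

An incremental load would store the latest `last_modified` value it saw and pass it in as `last_extract_time` on the next run.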
-
3.10.2 Extraction Tips
Constrain on indexed columns.
Retrieve the data you need.
Use DISTINCT sparingly.
Use SET operators sparingly.
Use HINT as necessary.
Avoid NOT.
Avoid functions in your where clause.
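The last tip can be checked by asking the database for its query plan. This sketch uses SQLite purely as a convenient stand-in; the table and index names are hypothetical, and other database engines report plans differently and may optimize differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date)")


def plan(sql):
    """Return the query-plan detail strings SQLite reports for a statement."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# A function wrapped around the indexed column typically prevents an index lookup.
with_function = plan(
    "SELECT order_id, amount FROM orders WHERE substr(order_date, 1, 4) = '2008'"
)

# The same filter written as a range on the bare column can use the index.
with_range = plan(
    "SELECT order_id, amount FROM orders "
    "WHERE order_date >= '2008-01-01' AND order_date < '2009-01-01'"
)

print(with_function)
print(with_range)
```

Comparing the two plans before running an extract against a large source table is a cheap way to confirm that a constraint is actually hitting an index.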
-
3.10.3 Detecting Deleted or Overwritten Fact Records at the Source
Negotiate with the source system owners, if possible, explicit notification
-
Summary
The Logical Data Map
The Challenge of Extracting from Disparate Platforms
Extracting Changed Data