Introduction to Data Warehousing
Source: twiki.di.uniroma1.it/pub/bi/webhome/2.datawarehousesetl.pdf
The business demand for data, information and analytics
• Enterprises today are driven by data or, more precisely, by the INFORMATION that can be extracted from data.
• Whether BIG DATA or plain old data, it requires a lot of work before it becomes something useful.
• Raw data is incomplete, inconsistent, unformatted and riddled with errors: it is unpalatable to business people who need to make decisions.
• Raw data needs integration, cleaning, design modeling, architecting and other work before it can be transformed into useful information.
• The next lessons address how to integrate, clean and manage data before it can be transformed into INFORMATION.
Raw data needs integration, cleaning, ...
ORDERS
REPAIR TRANSCRIPTS
USERS' OPINIONS
And you want to see it all in a nice way
What is a Data Warehouse?
• A Data Warehouse is a collection of data (= a database) concerning an organisation, used in support of management decisions.
• It is designed for query and analysis rather than for transaction processing (as in traditional OLTP, online transaction processing, systems).
• It usually contains historical data derived from transaction data, but it can include data from other sources.
Why do organisations need a DW?
• An organisation may have many operational databases (used for daily operations).
• The different databases are usually not synchronised: they are not linked, and there may be discrepancies between them.
• Management requires an integrated, company-wide view of all data.
• A Data Warehouse separates informational data, which can be used for management decisions, from daily operational data.
• Data can be summarised as required for management (irrelevant details are omitted).
Reports are very important: they must be designed carefully
Anticipated growth of the use of Data Warehousing in the USA (2016)
Use case: a Regional Health Care (RHC) group
• An RHC organisation may have its data spread across many separate operational databases:
– A health care group consists of many campuses (formally independent hospitals).
– Each campus has its own database for equipment and minor assets.
– Major assets data is stored in a separate central database.
– Each campus keeps its own patients database.
– Each campus employs its own administrative and general staff (cleaners, gardeners, etc.), hence each campus has a separate payroll database.
– Doctors and consultants work across the campuses, so there is a separate database for them.
– Other data, such as timetables, work rosters and petty cash expenses, is stored in (e.g.) Microsoft Outlook files, spreadsheets and small, local PC databases such as Microsoft Access.
• A large, geographically dispersed organisation may have hundreds of such 'small' databases.
✓ A Data Warehouse collects (copies) all of this data into a single (virtual) location, combines it and puts it into a format suitable for analysis and querying. The information provided by the data warehouse is used to predict trends and to help in high-level decision making. The Data Warehouse is separate from the many operational databases in the organisation and should not be used, e.g., to look up who is on duty next Thursday evening: that information comes from the operational databases.
SO YOU WOULD LIKE TO OBTAIN THINGS LIKE THIS...
BUT, FIRST OFF, YOU NEED TO IDENTIFY, COLLECT, CLEAN AND INTEGRATE DATA IN A DATABASE
... DO YOU KNOW: WHAT IS A DATABASE? WHAT IS AN OPERATIONAL DATABASE?
• A database is a digital collection of data that is organized so that its contents can easily be accessed, managed, and updated.
• TERMINOLOGY: access to these data is usually provided by a "database management system" (DBMS), a piece of software that allows users to interact with one or more databases and provides access to all of the data they contain.
• In DBs, data are organized in Tables.
DBs for the non-techies (1)
Table
• TERMINOLOGY: “A table is the primary unit of physical storage for data in a database.” 1
• It is also a “logical” structure: a way of organizing data.
• Usually a database contains more than one table.
1) Stephens, R.K. and Plew, R.R., 2001. Database Design. SAMS, Indianapolis, IN.
Table (example)
A Database with Multiple Connected Tables
[Figure: six connected tables: Publishers, Books, Customers, Authors, Inventory, Orders]
Table Customers: the NAME of the table
Field (Column)
[Figure: a field highlighted in the Customers table]
Fields are identified by a label or field name (e.g. Name, Company, ...). Fields are also called ATTRIBUTES or KEYS (we will use the terms interchangeably).
Record (Row)
[Figure: a record highlighted in the Customers table]
A record is a row of the table in which the fields (attributes, keys) have VALUES, e.g., Name = Bugs Bunny.
Data Types in tables
• Alphanumeric (Text)
• Numeric (Number, Currency, etc.)
• Date/Time
• Boolean (attributes with only two values, e.g.: Yes/No, true/false, 0/1, ...)
ID         Name-of-product   Order date   availability
37000876   IPhone7 pink      10/09/2017   Y
These are different data types
Primary Key
[Figure: the primary key field highlighted in the Customers table]
A primary key is a unique identifier of the records in a table. No two records can have the same value for the primary key. Primary key values may be generated manually or automatically.
Primary Key
[Figure: the primary key fields highlighted in the Roles (Performances) table]
A primary key can consist of more than one field. What matters is that it is UNIQUE! E.g., actors might have the same name, but the tuple “actor, movie” is (hopefully) unambiguous.
Foreign Key
[Figure: a relationship between the parent table Directors (primary key field) and the child table Movies (foreign key field)]
TO CONNECT TABLES: a foreign key is defined in a second table, but it refers to the primary key or a unique key in the first table.
It is a way of connecting information referring to the same item.
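A minimal sketch of primary and foreign keys, using Python's built-in sqlite3 module. The column names (director_id, movie_id, title) and the sample rows are assumptions made for illustration; only the table names Directors and Movies come from the slide.

```python
import sqlite3

def build_movie_db():
    """Create an in-memory database with a parent table and a child table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked to
    conn.execute("""
        CREATE TABLE Directors (
            director_id INTEGER PRIMARY KEY,   -- unique identifier of each record
            name        TEXT NOT NULL
        )""")
    conn.execute("""
        CREATE TABLE Movies (
            movie_id    INTEGER PRIMARY KEY,
            title       TEXT NOT NULL,
            director_id INTEGER NOT NULL REFERENCES Directors(director_id)  -- foreign key
        )""")
    conn.execute("INSERT INTO Directors VALUES (1, 'Director A')")
    conn.execute("INSERT INTO Movies VALUES (10, 'Movie X', 1)")
    return conn

def try_insert(conn, table, row):
    """Return True if the row is accepted, False if a key constraint rejects it."""
    placeholders = ", ".join("?" for _ in row)
    try:
        conn.execute(f"INSERT INTO {table} VALUES ({placeholders})", row)
        return True
    except sqlite3.IntegrityError:
        return False

conn = build_movie_db()
duplicate_key_ok = try_insert(conn, "Directors", (1, "Director B"))  # primary key 1 already exists
dangling_fk_ok = try_insert(conn, "Movies", (11, "Movie Y", 99))     # director 99 does not exist
valid_ok = try_insert(conn, "Movies", (12, "Movie Z", 1))            # refers to an existing director
```

The first two inserts are rejected by the database itself, which is the practical meaning of "unique" and "refers to" in the definitions above.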
Another example with multiple tables (primary keys are underlined in the original figure)
Hotel Reservation Database
Hotels: Hotel_id, Country_code, Hotel_name, Hotel_address, Hotel_city, Hotel_zipcode
Countries: Country_code, Country_currency, Country_name
Hotel rooms: Room_number, Hotel_id, Room_type, Room_floor
Room types: Room_type_code, Room_standard_rate, Room_description, Smoking_YN
Room Bookings: Booking_id, Room_type_code, Hotel_id, Checkin_date, Number_of_days, Room_count
Guest Bookings: Booking_id, Guest_number
Guests: Guest_number, Guest_firstname, Guest_lastname, Guest_address, Guest_city, Guest_zipcode, Guest_email
Hotel Amenities Lookup: Characteristic_id, Characteristic_description
Hotel Amenities: Characteristic_id, Hotel_id
Relations between records in tables are determined by the primary/foreign keys
“Common” keys are used to answer queries
• How many hotels in Country X?
• How many rooms in Hotel Y?
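The two questions above can be sketched as queries that follow the shared keys. This is an illustration only: it uses a reduced version of the hotel schema, the composite primary key on HotelRooms is an assumption, and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Countries (Country_code TEXT PRIMARY KEY, Country_name TEXT);
    CREATE TABLE Hotels (
        Hotel_id     INTEGER PRIMARY KEY,
        Country_code TEXT REFERENCES Countries(Country_code),
        Hotel_name   TEXT
    );
    CREATE TABLE HotelRooms (
        Room_number INTEGER,
        Hotel_id    INTEGER REFERENCES Hotels(Hotel_id),
        PRIMARY KEY (Room_number, Hotel_id)   -- assumed composite key
    );
    -- Invented sample rows, for illustration only
    INSERT INTO Countries VALUES ('IT', 'Italy'), ('US', 'United States');
    INSERT INTO Hotels VALUES (1, 'IT', 'Hotel A'), (2, 'IT', 'Hotel B'), (3, 'US', 'Hotel C');
    INSERT INTO HotelRooms VALUES (101, 1), (102, 1), (103, 1), (101, 2);
""")

def hotels_in_country(country_name):
    """Count hotels by following Country_code from Hotels to Countries."""
    row = conn.execute(
        """SELECT COUNT(*) FROM Hotels
           JOIN Countries ON Hotels.Country_code = Countries.Country_code
           WHERE Countries.Country_name = ?""", (country_name,)).fetchone()
    return row[0]

def rooms_in_hotel(hotel_name):
    """Count rooms by following Hotel_id from HotelRooms to Hotels."""
    row = conn.execute(
        """SELECT COUNT(*) FROM HotelRooms
           JOIN Hotels ON HotelRooms.Hotel_id = Hotels.Hotel_id
           WHERE Hotels.Hotel_name = ?""", (hotel_name,)).fetchone()
    return row[0]
```

Each query needs two tables because the name the user asks about (country name, hotel name) and the records being counted live in different tables.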
Tables describe entities
• TERMINOLOGY: “An entity is a business object that represents a group, or category of data.” 1
• Example: hotel, hotel_room, guest, ...
1) Stephens, R.K. and Plew, R.R., 2001. Database Design, pp. 21. SAMS, Indianapolis, IN.
Instance (Record, Tuple)
• TERMINOLOGY: “A single, specific occurrence of an entity is an instance. Other terms for an instance are record and tuple.” 1
• Hotel: Plaza
• Instances are “valued” entities!
1) Stephens, R.K. and Plew, R.R., 2001. Database Design, pp. 210. SAMS, Indianapolis, IN.
04100899 Plaza 5th Avenue, 61 NY 00765
Generic entity description
This is an instance of the entity type Hotel
Attributes (fields, primary/foreign keys)
• TERMINOLOGY: “An attribute (or field) is a sub-group of information within an entity.” 1
• Country_code is an attribute of the entity type Hotel.
• As we said, an attribute can be a primary key or a foreign key. In the example, Hotel_id is primary, Country_code is foreign.
1) Stephens, R.K. and Plew, R.R., 2001. Database Design, pp. 21. SAMS, Indianapolis, IN.
Relationship
• TERMINOLOGY: A relationship is a link that relates two entities that share one or more attributes (keys, fields).
• Example: Guest_booking and Room_booking have the same attribute Booking_id (since one would like to know which guest reserved a given room, or which room has been reserved for a given guest).
Though often implicit, relationships have a semantics and a direction, e.g., Guest -(has booked)→ Room, Room -(has been booked by)→ Guest
Indexes
• TERMINOLOGY: Indexes are data structures used for fast look-up in tables.
• E.g., say that you want to know how many Guests have the “Name” attribute = SMITH, without sequentially searching the whole database.
• An index is a pointer to the locations (record IDs) of the DB where the required attribute has the required value. An index is a bit like an address.
• Clearly, since you have many fields (attributes), you cannot keep your database sorted in alphabetic order (on WHICH field?). So a separate index is kept for each field that needs fast look-up.
[Figure: an index on Name that points to the record IDs of a table with fields ID, NAME, date-of-birth, AGE, ...]
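A minimal sketch of the idea of an index, written as a plain Python dictionary rather than a real DBMS structure (actual databases typically use B-trees or hash tables). The guest records are invented.

```python
from collections import defaultdict

# Hypothetical Guests table: each record is (ID, NAME, AGE)
guests = [
    (1, "SMITH", 41),
    (2, "ROSSI", 35),
    (3, "SMITH", 29),
    (4, "JONES", 52),
]

def build_index(records, field_position):
    """Map each value of one field to the list of record IDs that hold it."""
    index = defaultdict(list)
    for record in records:
        index[record[field_position]].append(record[0])
    return index

name_index = build_index(guests, 1)   # one index per field we want to search on
age_index = build_index(guests, 2)

def count_by_name(name):
    """Answer 'how many guests are called X?' without scanning the table."""
    return len(name_index.get(name, []))
```

The table itself stays in whatever order the records were inserted; each index gives a separate, direct route to the matching record IDs.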
Operations
• What are the main operations in a DB?
• DELETE, UPDATE, INSERT (self-explanatory operations)
• The SELECT operator is used to select those records with given values of one or more attributes (e.g. SELECT * FROM SALES_DATA WHERE PART_NAME = 'iPhone6' AND YEAR = 2016)
• The JOIN operator is used to merge values from different tables:
Join these 2 tables to learn that Mr. Rafferty works in the Sales dept.
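A small runnable sketch of SELECT and JOIN with sqlite3. The figure with the two tables is not in this transcript, so the table layout (Employee, Department, DepartmentID) and every row except Rafferty in Sales are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Department (DepartmentID INTEGER PRIMARY KEY, DepartmentName TEXT);
    CREATE TABLE Employee (
        LastName     TEXT,
        DepartmentID INTEGER REFERENCES Department(DepartmentID)
    );
    INSERT INTO Department VALUES (31, 'Sales'), (33, 'Engineering');
    INSERT INTO Employee VALUES ('Rafferty', 31), ('Jones', 33), ('Smith', 33);
""")

# SELECT: keep only the records whose attributes have the requested values
engineers = [row[0] for row in conn.execute(
    "SELECT LastName FROM Employee WHERE DepartmentID = ? ORDER BY LastName", (33,))]

def department_of(last_name):
    """JOIN: merge the two tables on the key they share, then read the department name."""
    row = conn.execute(
        """SELECT Department.DepartmentName
           FROM Employee
           JOIN Department ON Employee.DepartmentID = Department.DepartmentID
           WHERE Employee.LastName = ?""", (last_name,)).fetchone()
    return row[0] if row else None
```

Neither table alone says that Rafferty works in Sales: Employee only stores the department number, and Department only stores the name behind each number.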
Why not have one single table, so that you don't need to merge?
• Tables may nicely separate different views of the data (e.g. sales persons, managers, repair personnel, ...)
• Different tables might be generated in different departments and locations
• Primary keys and foreign keys allow the information to be merged when needed
Summary so far
• Data concerning a business are collected in tables.
• Tables have attributes (fields, keys) that describe entities. Each table describes an entity type (e.g., hotel).
• Instances of an entity type (e.g. hotel Majestic in Roma is an instance of the entity type hotel) are called RECORDS, and have values that specify the various attributes (e.g. ADDRESS = via Vittorio Veneto 50).
• The relevant data of a business are organized in many tables, offering different and detailed views of the business (e.g. reservation, restaurant and services, billing, customer care, ...).
• Tables are linked together via their attributes (primary and foreign keys). Links are called relationships and usually have a (hidden) semantics.
• Operations (select, join, delete, ...) and indexes are used to QUERY the database and retrieve RELEVANT BUSINESS FACTS (e.g., how many rooms were reserved in January 2018?).
• Performing operations on databases usually requires a query or programming language (e.g. SQL), but with self-service business analytics you can retrieve facts with very simple interactions (we will see this in the Labs).
In-class exercise
• A TV company wishes to develop a database to store data about the TV series that the company produces. The database includes information about actors who play in the series, and directors who direct the episodes of the series.
• Actors and directors are employed by the company. TV series are divided into episodes. Each episode may be transmitted on several occasions (timestamps). An actor is hired to participate in a series, but may participate in many series. Each episode of a series is directed by one of the directors, but different episodes may be directed by different directors.
• Develop a database scheme of this system (= set of related tables with attributes):
1) Identify entity types.
2) Create a table for each entity type.
3) Choose attributes of the entity types.
4) Determine which of the attributes can be used as primary keys.
5) Draw connections between tables that are related through primary/foreign keys.
Querying the TV series database
• According to your schema, which tables should be used to answer these types of questions:
– Which actors play in the series X?
– In which series does the actor Y participate?
– Which actors participate in more than one series?
– How many times has the first episode of the series X been transmitted? At what times?
– How many directors are employed by the company?
– Which director has directed the greatest number of episodes?
TV company database scheme
The symbol 1….0* means that each instance of a given type (e.g., a TV series) is related to 0 or more instances of another entity type (e.g., episodes). This clearly shows why you need separate tables. You could not add an “episode” attribute to the TV series table, since the number of episodes varies from one TV series to another. Therefore, we create an Episode table, and link TV series with their respective episodes through primary/foreign keys.
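One possible way to express the "one series, zero or more episodes" link in code, as a sketch only. The scheme figure is not in this transcript, so the table and column names below are assumptions and cover only part of the exercise (no actors or transmissions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TVSeries (series_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE Director (director_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Episode (
        episode_id  INTEGER PRIMARY KEY,
        series_id   INTEGER NOT NULL REFERENCES TVSeries(series_id),
        director_id INTEGER NOT NULL REFERENCES Director(director_id),
        title       TEXT
    );
    -- Invented rows: Series X has three episodes, Series Y has none yet
    INSERT INTO TVSeries VALUES (1, 'Series X'), (2, 'Series Y');
    INSERT INTO Director VALUES (1, 'Director A'), (2, 'Director B');
    INSERT INTO Episode VALUES
        (1, 1, 1, 'Pilot'), (2, 1, 2, 'Second'), (3, 1, 2, 'Third');
""")

def episodes_per_series():
    """LEFT JOIN keeps series with zero episodes, matching the '0 or more' multiplicity."""
    rows = conn.execute("""
        SELECT TVSeries.title, COUNT(Episode.episode_id)
        FROM TVSeries LEFT JOIN Episode ON Episode.series_id = TVSeries.series_id
        GROUP BY TVSeries.series_id
        ORDER BY TVSeries.series_id""").fetchall()
    return dict(rows)

def busiest_director():
    """Which director has directed the greatest number of episodes?"""
    row = conn.execute("""
        SELECT Director.name, COUNT(*) AS n
        FROM Episode JOIN Director ON Episode.director_id = Director.director_id
        GROUP BY Director.director_id
        ORDER BY n DESC LIMIT 1""").fetchone()
    return row[0]
```

Because each episode is its own record, a series can have any number of episodes without changing the structure of the TVSeries table.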
A more complex scheme
OLTP and OLAP databases
• We now introduce and compare two types of DB systems:
– OLTP (on-line transaction processing)
– OLAP (on-line analytical processing, or Data Warehouses)
OLTP vs. OLAP (DW)
• Traditional On-Line Transaction Processing systems (OLTP, introduced in the first lesson: Excel-like tables) are operational systems tailored for processing transactional databases.
• A transactional database supports business process flows (sales, supply chain, etc.) and is typically an online, real-time system.
• Compared with OLTP, DWs (also named OLAP, On-Line Analytical Processing) are much more powerful for analysis.
OLTP vs. OLAP (DW) - 2
Source of data
• OLTP: Operational data; OLTP systems are the original source of the data, and each system manages a specific transactional database.
• OLAP: OLAP data comes from the various OLTP databases plus external sources, and is aggregated (the aggregated structure is also called an OLAP cube).
OLTP vs. OLAP (DW) - 3
Purpose of data
• OLTP: To control and run fundamental day-to-day business tasks (e.g., handle guest reservations, room cleaning, payments, ...)
• OLAP: To help with planning, problem solving, and decision support
OLTP vs. OLAP (DW) - 4
What the data represent
– OLTP: Reveals a snapshot of ongoing business processes
– OLAP: Multi-dimensional views of various kinds of business activities
OLTP vs. OLAP (DW) - 5
Queries
• OLTP: Relatively standardized and simple queries, returning relatively few records (= answers)
• OLAP: Often complex queries involving aggregation of large amounts of data and INFERENCE
How many iPhones were sold in this quarter?
How many iPhones were sold this month in Florence compared to the previous 6 months, and how many can we expect to sell in the next 6 months?
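To make the aggregation part of the second question concrete, here is a small sketch in plain Python. The sales records are invented, and the comparison is a simple average of the previous months; the forecasting part of the question is not covered.

```python
from collections import defaultdict

# Invented sales records: (month, city, product, units sold)
sales = [
    ("2016-01", "Florence", "iPhone", 100),
    ("2016-02", "Florence", "iPhone", 110),
    ("2016-03", "Florence", "iPhone", 90),
    ("2016-04", "Florence", "iPhone", 100),
    ("2016-05", "Florence", "iPhone", 120),
    ("2016-06", "Florence", "iPhone", 80),
    ("2016-07", "Florence", "iPhone", 70),
    ("2016-07", "Florence", "iPhone", 60),
    ("2016-07", "Rome", "iPhone", 500),
]

def units_by_month(records, city, product):
    """Aggregate individual sales into one total per month for a city and product."""
    totals = defaultdict(int)
    for month, rec_city, rec_product, units in records:
        if rec_city == city and rec_product == product:
            totals[month] += units
    return dict(sorted(totals.items()))

def compare_with_previous(monthly, month, window=6):
    """Return (units in `month`, average of up to `window` preceding months or None)."""
    months = sorted(monthly)
    position = months.index(month)
    previous = months[max(0, position - window):position]
    if not previous:
        return monthly[month], None
    baseline = sum(monthly[m] for m in previous) / len(previous)
    return monthly[month], baseline

florence = units_by_month(sales, "Florence", "iPhone")
this_month, six_month_average = compare_with_previous(florence, "2016-07")
```

Even this simplified version has to read several months of history, which is why such queries are run on the warehouse rather than on the operational system.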
OLAP vs. OLTP (DW): more issues
• Processing Speed
– OLTP: Typically very fast
– OLAP: Depends on the amount of data involved; typically needs Big Data solutions
• Space Requirements
– OLTP: Can be relatively small if historical data is archived
– OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP (since more dimensions are available or can be defined)
• Backup and Recovery
– OLTP: Back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability
– OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method
Characteristics of DWs
• Data warehouses can be:
– Subject oriented: Finance, Marketing, Inventory
– Integrated: web logs, legacy data, sales, ...
– Non-volatile: data (even old data) remain in the database
– Time variant: the grain can be real-time, day, month, quarter, ...
What kind of queries?
• Users of the data warehouse perform data analyses that require them to "slice and dice" their data.
• DW users will sometimes need highly aggregated data, and other times they will need to drill down to details.
• Often temporal analyses are required. More sophisticated analyses include trend analyses and data mining, which use existing data for predictive and prescriptive analytics.
• The data warehouse acts as the underlying engine used by business intelligence environments that serve reports, dashboards and other interfaces to end users.
What kind of queries in an OLTP?
Which customers are based in Roma?
How many spare parts of Product 222 are available?
Who has been our best client in 2016?
How many delays did we experience in the spare parts supply?
What has been the total revenue in 2015?
What kind of queries in an OLAP/DW?
Summary: OLAP vs. OLTP (DW)
In a nutshell...
Summary so far
• A Data Warehouse is a collection of data concerning the organisation, used in support of management decisions.
• It is a kind of database: a data structure organized in tables.
• A Data Warehouse allows analytical processing of data (OLAP) for decision support, in contrast to operational databases, which support real-time transaction processing (OLTP).
Architecture of a DW
Design aspects of a DW
1. Select: which data and what for
2. Transform: the so-called ETL (extraction, cleaning, transformation and loading of data)
3. Store and process data: data marts, metadata, aggregations
Step 1: Which data and what for?
Which data? Data Sources and Types
• The primary sources are legacy, operational systems
– Mostly structured and numerical data at the present time: sales, vendors, transactions, ...
• External data may be included, often purchased from third-party sources
– Technology exists for storing unstructured data (images, text, sensors) and is becoming more important over time
– External data (social network data, user profiles) are also becoming more and more important
Structured vs. Unstructured data
What external data and what for? (2)
• Social data (social networks, blogs): to mine user opinions, trending topics, market forecasts
• Sensor data (signals from devices, e.g. vending machines, packages, wearable devices, sensor networks, ...): to detect anomalies, learn trends, ...
• Clickstream data (click logs of websites): for traffic and e-commerce analysis
• Environmental data (geolocations, meteorological data): to produce recommendations, supply chain and market forecasts, ...
• Images, videos, signals (medical imaging, landscapes, portraits): to detect anomalies, security, fraud detection, ...
• Audio (speech, sound): to mine opinions, fraud detection, environmental analysis
Example 1: applications of image understanding (people recognition)
People recognition: business applications
• Visitor traffic per hour, day, season; store occupancy vs. opening hours
• Schedule staffing
• Shoplifting, sweethearting
• Customer demographics
• Security
Sweethearting
• is a term used in the retail loss prevention industry to mean intentional margin loss through employee theft at the cash register. Sweethearting is the most common type of employee theft.
Shoplifting
• (also known in slang as boosting and five-finger discount) is a popular term used for the unnoticed theft of goods from an open retail establishment.
Example 2: anomaly detection
Can be applied to any signal (output of sensors, medical data, etc.) to learn “normal behaviour” and detect/predict anomalies in real time. Remember the Magpie example of the cold chain.
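As a very simple illustration of "learning normal behaviour", the sketch below summarises past readings by their mean and standard deviation and flags readings that are far from the mean. This is one basic technique chosen for illustration, not the method used in the course examples; the temperatures and the threshold of 3 are assumptions.

```python
from statistics import mean, stdev

def learn_normal(readings):
    """Summarise 'normal behaviour' as the mean and standard deviation of past readings."""
    return mean(readings), stdev(readings)

def is_anomaly(value, normal, threshold=3.0):
    """Flag a reading that lies more than `threshold` standard deviations from the mean."""
    centre, spread = normal
    if spread == 0:
        return value != centre
    return abs(value - centre) / spread > threshold

# Invented cold-chain temperatures in degrees Celsius
history = [4.0, 4.2, 3.9, 4.1, 4.0, 3.8, 4.1, 4.0]
normal = learn_normal(history)
```

A reading of 9.5 would be flagged, while 4.1 would not; real systems usually also account for trends and seasonality in the signal.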
Example 3: Text
Challenges with unstructured data (images, signals, text)
• Need complex processing to be useful
– Text processing, natural language understanding
– Image processing, image understanding
– Signal processing
• A number of techniques/methods are available (Artificial Intelligence, Machine Learning)
• E.g. see Cognitive Apps in Watson (later in this course)
• We will also see something when talking about Social Analytics
Step 2: ETL: extraction, cleaning, transformation and loading of data
• It is important to understand that a data warehouse has the purpose of integrating different sources of data, not of COLLECTING new data.
• So, new data are added, deleted, and updated in the ORIGINAL sources (e.g. an OLTP).
• The data warehouse must extract new data as they are generated, detect and handle changes in old data, and integrate data from the different sources.
What is ETL
• Extraction-transformation-loading (ETL) tools are pieces of software responsible for
– the extraction of data from several sources,
– its cleansing, customization, reformatting, integration, and
– its insertion into a data warehouse.
• Building the ETL process is potentially one of the biggest tasks in creating a warehouse; it is complex and time consuming, and it consumes most of a data warehouse project's implementation effort, costs, and resources.
ETL Functional Elements
• ETL systems have a common purpose: they move data from one database to another.
• Generally, ETL systems move data from OLTP systems to a data warehouse, but they can also be used to move data from one data warehouse to another, or from an external source (social, click logs, ...) to the warehouse.
• An ETL system consists of four distinct functional elements:
– Extraction
– Transformation
– Loading
– Adding Metadata
1. Extraction
• The first step in any ETL scenario is data extraction.
• The ETL extraction step is responsible for extracting data from the source systems.
• Each data source has its distinct set of characteristics that need to be managed in order to effectively extract data for the ETL process.
• The process needs to integrate systems that have different platforms, such as different database management systems, different operating systems, and different communications protocols.
Issues: Extraction frequency
• There are several ways to perform the extract:
– Update notification: if the source system is able to send a notification that a record has been changed and to describe the change (e.g. a new shipment has been completed, an order has been filed, ...), this is the easiest way to get the data.
– Incremental extract: there are no notifications, so the extraction process starts at given time intervals; the source system should be able to identify which records have been modified and provide an extract of those records. During further ETL steps, the system needs to identify the changes and propagate them down.
– Full extract: some systems are not able to identify which data has been changed at all, so a full extract is the only way to get the data out of the system. The full extract requires keeping a copy of the last extract in the same format in order to compare the two and identify changes. A full extract handles deletions as well.
– Extract from unstructured resources: if the data are not structured (not a database), the system extracts either in real time or incrementally, but new data are simply added to old data (e.g. new tweets discussing a given product).
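The comparison step of a full extract can be sketched as follows, assuming each extract is held as a dictionary from primary key to record. The function name and the sample records are hypothetical.

```python
def diff_extracts(previous, current):
    """Compare two full extracts, each a dict mapping primary key -> record."""
    inserted = {key: rec for key, rec in current.items() if key not in previous}
    deleted = {key: rec for key, rec in previous.items() if key not in current}
    updated = {key: rec for key, rec in current.items()
               if key in previous and previous[key] != rec}
    return inserted, updated, deleted

# Invented customer extracts: key -> (name, city)
yesterday = {1: ("John", "Rome"), 2: ("Mary", "Milan"), 3: ("Paul", "Turin")}
today = {1: ("John", "Rome"), 2: ("Mary", "Naples"), 4: ("Anna", "Bari")}

inserted, updated, deleted = diff_extracts(yesterday, today)
```

Records that are present in the previous copy but missing from the current one are exactly the deletions, which is why the full extract can detect them while a date-based incremental extract cannot.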
Example of extraction method
• Suppose the “source” DB has 2 tables, Customers and Sales. As of March 23rd, 2012, the latest added records are:
Note that the tables store data for 2 consecutive days (22 and 23). On the 22nd, we have 2 customers and 3 sales; on the 23rd, 3 customers and 5 sales. Suppose I want to update the warehouse every night.
Example of extraction method (2)
• FULL LOAD METHOD FOR LOADING THE DATA WAREHOUSE
• If we opt for the full load method, we will read the 2 source tables (Customers and Sales) in full every day.
• On 22 Mar 2012: we will read 2 records from Customers and 3 records from Sales and load all of them into the target.
• On 23 Mar 2012: we will read 3 records from Customers (including the 2 older records) and 5 records from Sales (including the 3 old records) and will load or update them in the target data warehouse.
• As you can guess, this method of loading unnecessarily reads old records that we have already processed. Hence we need to implement a smarter way of loading.
• However, if “old” data are frequently modified or deleted, this method can be easier than checking for possible changes.
Example of extraction method (2)
• INCREMENTAL LOAD METHOD FOR LOADING THE DATA WAREHOUSE
• With incremental loading, we will only read those records that have not already been read and loaded into our target system (the data warehouse).
• That is, on 22 March we will read 2 records from Customers and 3 records from Sales; on 23 March, however, we will read 1 record from Customers and 2 records from Sales.
• But how do we ensure that we "only" read those records that have not "already" been read? How do we know which records have already been read and which have not?
Example of extraction method (3)
• We can make use of the "entry date" field in the Customers table and the "sales date" field in the Sales table to keep track of this.
• After each load we will "store" the date up to which loading has been performed in a data warehouse table, and the next day we extract only those records that have a date greater than our stored date. Let's create a new table to store this date. We will call this table "Batch".
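• Once we have done this, all we have to do to perform incremental or delta loading is to select, from each source table, the records whose date is greater than the date stored in Batch.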
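The incremental load with a Batch table can be sketched with two in-memory SQLite databases standing in for the source and the warehouse. Column names such as entry_date, sales_date and loaded_until, and all row values, are assumptions; only the table names and the record counts per day follow the example above.

```python
import sqlite3

source = sqlite3.connect(":memory:")      # stands in for the OLTP source
warehouse = sqlite3.connect(":memory:")   # stands in for the data warehouse

SCHEMA = """
    CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT, entry_date TEXT);
    CREATE TABLE Sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER,
                        amount REAL, sales_date TEXT);
"""
source.executescript(SCHEMA)
warehouse.executescript(SCHEMA + """
    CREATE TABLE Batch (loaded_until TEXT);      -- date up to which loading has been done
    INSERT INTO Batch VALUES ('0000-00-00');     -- nothing loaded yet
""")

# (table name, its date column); dates are ISO strings, so text order equals date order
TABLES = [("Customers", "entry_date"), ("Sales", "sales_date")]

def incremental_load(run_date):
    """Copy only rows dated after the stored batch date; return rows read per table."""
    (loaded_until,) = warehouse.execute("SELECT loaded_until FROM Batch").fetchone()
    counts = {}
    for table, date_column in TABLES:
        rows = source.execute(
            f"SELECT * FROM {table} WHERE {date_column} > ? AND {date_column} <= ?",
            (loaded_until, run_date)).fetchall()
        if rows:
            placeholders = ", ".join("?" for _ in rows[0])
            warehouse.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        counts[table] = len(rows)
    warehouse.execute("UPDATE Batch SET loaded_until = ?", (run_date,))
    warehouse.commit()
    return counts

# 22 March: 2 customers and 3 sales exist in the source
source.executescript("""
    INSERT INTO Customers VALUES (1, 'Customer A', '2012-03-22'), (2, 'Customer B', '2012-03-22');
    INSERT INTO Sales VALUES (1, 1, 10.0, '2012-03-22'), (2, 1, 20.0, '2012-03-22'),
                             (3, 2, 5.0, '2012-03-22');
""")
first_night = incremental_load("2012-03-22")

# 23 March: 1 new customer and 2 new sales are added
source.executescript("""
    INSERT INTO Customers VALUES (3, 'Customer C', '2012-03-23');
    INSERT INTO Sales VALUES (4, 3, 7.5, '2012-03-23'), (5, 2, 12.0, '2012-03-23');
""")
second_night = incremental_load("2012-03-23")
```

Note that this date-based approach only picks up new records; modified or deleted old records would go unnoticed, which is the trade-off against the full load discussed earlier.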
Take-away message
• You are not responsible for the extraction process; IT people will be.
• Your responsibility is to help decide, keeping in mind the objectives of the analysis and the timing constraints, which data should be extracted and (roughly) how often.
• E.g., if the objective is to predict credit card fraud, real-time updating is needed. If the objective is to analyze and compare points of sale, weekly or monthly extraction can be enough.
What about unstructured data?
• Need software to download data streams (e.g. Twitter API)
• Usually some metadata is available in streams (e.g. date and IDs) to “concatenate” streams
2. Transformation
• The second step in any ETL scenario is data transformation.
• Objective: clean and conform the incoming data to obtain data that is correct, complete, consistent, and unambiguous.
• This process includes data cleaning, transformation, and integration. It defines the granularity of fact tables, the dimension tables, data structures, etc.
• All transformation rules and the resulting schemas must be described in the metadata repository.
• We will see this later, but your responsibility (as business experts in a BI project) is to make sure that a description of the transformations performed on the data, comprehensible to business people, is maintained!
Example of transformation
• As a small example, assume you have data coming from two different source systems which you want to merge in the data warehouse: there might be some differences between the two.
• For example, one source may denote the gender as Male and Female while the other may denote it as M and F.
Comparing these two tables, there is another mismatch in the way the same information is encoded. Which one?
Types of transformations
• Now, if you are storing the gender in the target as M and F, you may need to "transform" Male and Female to M and F (or vice versa). You may write a simple CASE statement (a RULE), or you may just write code which translates Male --> M and Female --> F. This type of transformation is a MODIFICATION (you modify the values of a Field/Attribute).
• If you want to encode the Name attribute in two attributes, FirstName and FamilyName, then you must split the values in each record of Table 1 and record the data separately in the Target Table. Again, you do this by writing some code and documenting it with a RULE. This is a CONFORMATION (you are making two fields compatible).
• In the same way, if you have a Revenue field in a table maintained in Italy and another Revenue field from Germany, and you need a Total Revenue in your target warehouse, you will write a function which calculates the sum and stores it in another column. This is an ADDITION (you are adding a new field).
• All these modifications, additions and conformations are part of the Transform stage. These transformations must be encoded in RULES readable by non-ICT users.
• IMPORTANT: the SYNTAX and SEMANTICS of the data you combine and store are a CRITICAL FACTOR. Syntactic and semantic mismatches are a major source of problems when aggregating data!
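The three types of transformation can be sketched as small Python functions. The function names and the rule of splitting a name at the first space are assumptions made for illustration; real names often need more careful handling, and revenues are assumed to be in the same currency.

```python
# RULE: every source encoding of gender maps to the target codes M and F
GENDER_RULE = {"Male": "M", "Female": "F", "M": "M", "F": "F"}

def modify_gender(value):
    """MODIFICATION: change the values of a field to the target encoding."""
    return GENDER_RULE[value.strip()]

def conform_name(full_name):
    """CONFORMATION: split Name into FirstName and FamilyName at the first space."""
    first_name, _, family_name = full_name.strip().partition(" ")
    return first_name, family_name

def add_total_revenue(revenue_italy, revenue_germany):
    """ADDITION: derive a new Total Revenue field from the two source fields."""
    return revenue_italy + revenue_germany

def transform(record):
    """Apply the modification and conformation rules to one source record."""
    first_name, family_name = conform_name(record["Name"])
    return {
        "FirstName": first_name,
        "FamilyName": family_name,
        "Gender": modify_gender(record["Gender"]),
    }
```

For example, transform({"Name": "Bugs Bunny", "Gender": "Male"}) yields a record with FirstName "Bugs", FamilyName "Bunny" and Gender "M". Keeping the gender rule in a single table makes it easy to show to non-ICT users.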
Example: aligning attribute names (“reconciling” data)
Other Transformation examples
• More examples of transformations:
– Selecting only certain columns to load
– Translating coded values (e.g. F → female)
– Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
– Sorting or ordering the data based on a list of columns to improve search performance
– Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
– Aggregating (for example, rollup: summarizing multiple rows of data, such as total sales for each store, for each region, etc.)
– Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
– Splitting a column into multiple columns (e.g., Name → FirstName, FamilyName)
– Disaggregating repeating columns
– Looking up and validating the relevant data from tables or referential files
– Applying any form of data validation; failed validation may result in a full rejection of the data, partial rejection, or no rejection at all, and thus none, some, or all of the data is handed over to the next step, depending on the rule design and exception handling
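A few of these transformations (deduplicating, validating with partial rejection, deriving sale_amount, and rolling up per store) can be combined in one short sketch. The field names other than qty, unit_price and sale_amount, the validation rule, and the sample rows are assumptions.

```python
from collections import defaultdict

def validate(row):
    """Hypothetical validation rule: quantities must be positive, prices non-negative."""
    return row["qty"] > 0 and row["unit_price"] >= 0

def transform_sales(rows):
    """Deduplicate, validate, derive sale_amount, and roll up totals per store."""
    seen, rejected = set(), []
    totals = defaultdict(float)
    for row in rows:
        if row["sale_id"] in seen:        # deduplicating
            continue
        seen.add(row["sale_id"])
        if not validate(row):             # partial rejection: only bad rows are dropped
            rejected.append(row["sale_id"])
            continue
        sale_amount = row["qty"] * row["unit_price"]   # deriving a calculated value
        totals[row["store"]] += sale_amount            # aggregating (rollup per store)
    return dict(totals), rejected

# Invented rows: sale 2 appears twice, sale 3 has an invalid quantity
rows = [
    {"sale_id": 1, "store": "Rome", "qty": 2, "unit_price": 10.0},
    {"sale_id": 2, "store": "Rome", "qty": 1, "unit_price": 5.0},
    {"sale_id": 2, "store": "Rome", "qty": 1, "unit_price": 5.0},
    {"sale_id": 3, "store": "Milan", "qty": -1, "unit_price": 8.0},
    {"sale_id": 4, "store": "Milan", "qty": 3, "unit_price": 4.0},
]
store_totals, rejected_ids = transform_sales(rows)
```

Returning the rejected IDs alongside the totals is one way to keep the exception handling visible to whoever reviews the load.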
Transforming unstructured resources
• Way more complex! First, we need to transform from unstructured to structured
• Example: sentiment analysis in Twitter
Here, the challenge is to analyze the texts and, first, identify those of interest (e.g. those talking about your company or a given product) and then assign to each text a positive, negative or 0 (neutral) score.
Transforming unstructured resources (2)
• What do you get from this transformation (let's ignore HOW for now)?
date        positive   negative   neutral
1/04/2016   500        237        1715
2/04/2016   451        277        2015
3/04/2016   816        300        3016
Table: Starbucks Twitter Sentiment
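Assuming the hard part (scoring each tweet) has already been done, turning the scored tweets into a table like the one above is a simple count per date. The sketch below uses invented (date, score) pairs with scores 1, -1 and 0.

```python
from collections import Counter, defaultdict

LABELS = {1: "positive", -1: "negative", 0: "neutral"}

def sentiment_table(scored_tweets):
    """Turn (date, score) pairs into one row of counts per date."""
    counts = defaultdict(Counter)
    for date, score in scored_tweets:
        counts[date][LABELS[score]] += 1
    return {date: {label: day_counts[label] for label in LABELS.values()}
            for date, day_counts in sorted(counts.items())}

# Invented scored tweets, for illustration only
scored = [
    ("2016-04-01", 1), ("2016-04-01", -1), ("2016-04-01", 0), ("2016-04-01", 0),
    ("2016-04-02", 1),
]
table = sentiment_table(scored)
```

The result is structured data (one record per date with three numeric fields) that can be loaded into the warehouse like any other table.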
Transformation: aggregation
• We already mentioned an example of aggregation (summing revenue data from different DBs maintained in different departments)
• Aggregation may be far more complex
• E.g. we may want to aggregate sentiment data with sales data to discover what went wrong (or which winning move users appreciated most)
Example (Social Engagement Index)
http://www.brandamplitude.com/blog/innovation/item/announcing-breakthrough-in-measuring-the-impact-of-social-media-on-sales
3. Loading
• The third step is loading.
• The ETL loading element is responsible for loading transformed data into the data warehouse database.
• Data warehouses are usually updated periodically rather than continuously (as is the case for OLTP), and large numbers of records are often loaded into multiple tables in a single data load step.
• The data warehouse is often taken offline during update operations so that data can be loaded faster.
DW ETL Tools
• Some of the well-known ETL tools
• The best-known commercial tools are Ab Initio, IBM InfoSphere DataStage, Informatica, Oracle Data Integrator and SAP Data Integrator.
Case Study (Self-assessment)
• Download the paper at http://bmjopen.bmj.com/content/bmjopen/6/8/e010962.full.pdf describing the use case of the Dutch Red Cross data warehouse (also on the course website)
• Answer the following:
– What type of data have been integrated, from which sources?
– Can you draw the schema of all needed tables?
• What are the objects? What are the attributes? What are the relationships? What is the “semantics” of the relationships?
– Can you list some of the TRANSFORM operations that were needed to harmonize data during the ETL process?
– Which additional challenges are posed to the warehouse by the specific application domain?
– Can you list the main categories of data which have been integrated?
– Can you list and summarize the main data analytic tasks supported by the warehouse?