Introduction to Data Warehousing

Post on 29-May-2020


Page 1: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Introduction to Data Warehousing

The business demand for data, information and analytics

•  Enterprises today are driven by data or, to be more precise, by the INFORMATION that can be extracted from data.

•  Whether BIG DATA or plain old data, it requires a lot of work before it is actually something useful.

•  Raw data is incomplete, inconsistent, unformatted and riddled with errors: it is unpalatable to business people who need to make decisions.

•  Raw data needs integration, cleaning, design modeling, architecting and more before it can be transformed into useful information.

•  The next lessons will address how to integrate, clean and manage data before they can be transformed into INFORMATION.

Raw data needs integration, cleaning, ...

ORDERS

REPAIR TRANSCRIPTS

USERS' OPINIONS

And you want to see it all in a nice way

What is a Data Warehouse?

•  A Data Warehouse is a collection of data (= database) concerning an organisation, used in support of management decisions.

•  It is designed for query and analysis rather than for transaction processing (such as traditional OLTP, online transaction processing, systems).

•  It usually contains historical data derived from transaction data, but can include data from other sources.

Why do organizations need a DW?

•  An organisation may have many operational databases (used for daily operations).

•  The different databases are (usually) not synchronised: they are not linked, and there might be discrepancies between them.

•  Management requires an integrated, company-wide view of all data.

•  A Data Warehouse separates informational data, which can be used for management decisions, from daily operational data.

•  Data can be summarised as required for management (irrelevant details are omitted).

Reports are very important: they must be designed carefully

Anticipated growth of the use of Data Warehousing in USA (2016)

Use case: a Regional Health Care (RHC) Group

•  An RHC organisation may have its data spread across many separate operational databases:
–  A health care group consists of many campuses (formally independent hospitals).
–  Each campus has its own database for equipment and minor assets.
–  Major assets data is stored in a separate central database.
–  Each campus keeps its own patients database.
–  Each campus employs its own administrative and general staff (cleaners, gardeners, etc.), hence each campus has a separate payroll database.
–  Doctors and consultants work across the campuses, so there is a separate database for them.
–  Other data, such as timetables, work rosters, petty cash expenses, etc. are stored in (e.g.) Microsoft Outlook files, spreadsheets and small, local PC databases such as Microsoft Access.

•  A large, geographically separate organisation may have hundreds of such 'small' databases.

✓  A Data Warehouse collects (copies) all of this data into a single (virtual) location, combines it and puts it into a format suitable for analysing and querying. The information provided by the data warehouse is used to predict trends and help in high-level decision making. The Data Warehouse is separate from the many operational databases in the organisation and should not be used (e.g.) to look up who is on duty next Thursday evening: that information comes from the operational databases.

SO YOU WOULD LIKE TO OBTAIN THINGS LIKE THIS ...

BUT, FIRST OFF, YOU NEED TO IDENTIFY, COLLECT, CLEAN AND INTEGRATE DATA IN A DATABASE

... DO YOU KNOW: WHAT A DATABASE IS? WHAT IS AN OPERATIONAL DATABASE?

DBs for the non-techies (1)

•  A database is a digital collection of data that is organized so that its contents can easily be accessed, managed, and updated.

•  TERMINOLOGY: access to these data is usually provided by a "database management system" (DBMS), a piece of software that allows users to interact with one or more databases and provides access to all of the data they contain.

•  In DBs, data are organized in tables.

Table

•  TERMINOLOGY: "A table is the primary unit of physical storage for data in a database." (1)

•  It is also a "logical" structure: a way of organizing data.

•  Usually a database contains more than one table.

1) Stephens, R.K. and Plew, R.R., 2001. Database Design. SAMS, Indianapolis, IN.

Table (example)

A Database with Connected Multiple Tables

Tables shown in the figure: Publishers, Books, Customers, Authors, Inventory, Orders [1]

Table Customers: "Customers" is the NAME of the table

Field (Column)

Figure: a field (column) of the Customers table.

Fields are identified by a label or field name (e.g. Name, Company, ...). Fields are also called ATTRIBUTES or KEYS (we will use the terms interchangeably).

Record (Row)

Figure: a record (row) of the Customers table.

A record is a row of the table in which the fields (attributes, keys) have VALUES, e.g., Name = Bugs Bunny.

Data Types in tables

•  Alphanumeric (Text)
•  Numeric (Number, Currency, etc.)
•  Date/Time
•  Boolean (attributes with only two values, e.g. Yes/No, true/false, 0/1, ...)

ID         Name-of-product   Order date   Availability
37000876   IPhone7 pink      10/09/2017   Y

These are different data types.

Primary Key

Figure: the Customers table with its primary key field highlighted.

A primary key is a unique identifier of the records in a table. No two records can have the same primary key value. Primary key values may be generated manually or automatically.
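The uniqueness rule can be tried out in a few lines. This is only a minimal sketch, assuming Python's built-in sqlite3 module; the column names (customer_id, name, company) and the sample values are hypothetical and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers ("
    " customer_id INTEGER PRIMARY KEY,"  # hypothetical key column
    " name TEXT,"
    " company TEXT)"
)
# Key value chosen manually.
conn.execute("INSERT INTO Customers VALUES (1, 'Bugs Bunny', 'Acme')")

# A second record with the same primary key value is rejected.
try:
    conn.execute("INSERT INTO Customers VALUES (1, 'Daffy Duck', 'Acme')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Leaving the key out lets SQLite generate the next value automatically.
cur = conn.execute(
    "INSERT INTO Customers (name, company) VALUES ('Daffy Duck', 'Acme')"
)
generated_id = cur.lastrowid
```

Here the first key is assigned by hand and the second by the database, which mirrors the "manually or automatically" remark.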

Primary Key

Figure: the Roles (Performances) table with its primary key fields highlighted.

A primary key can consist of more than one field. What matters is that it is UNIQUE! E.g., actors might have the same name, but the tuple "actor, movie" is (hopefully) unambiguous.
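A composite key can be sketched the same way. The example below assumes Python's sqlite3 module and a hypothetical Roles table; the column names and rows are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Roles ("
    " actor TEXT, movie TEXT, role_name TEXT,"
    " PRIMARY KEY (actor, movie))"  # the pair must be unique, not each field
)
conn.execute("INSERT INTO Roles VALUES ('Actor A', 'Movie X', 'Hero')")
# Same actor in a different movie: allowed.
conn.execute("INSERT INTO Roles VALUES ('Actor A', 'Movie Y', 'Villain')")

# Same (actor, movie) pair again: rejected.
try:
    conn.execute("INSERT INTO Roles VALUES ('Actor A', 'Movie X', 'Extra')")
    pair_rejected = False
except sqlite3.IntegrityError:
    pair_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Roles").fetchone()[0]
```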

Foreign Key

Figure: Directors (parent table, with the primary key field) linked by a relationship to Movies (child table, with the foreign key field).

TO CONNECT TABLES: a foreign key is defined in a second table, but it refers to the primary key or a unique key in the first table.

It is a way of connecting information referring to the same item.
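As a rough illustration of the parent/child link, the sketch below uses Python's sqlite3 module. The Directors and Movies table names follow the slide, while the columns and rows are hypothetical. Note that SQLite checks foreign keys only after the PRAGMA shown here is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.execute(
    "CREATE TABLE Directors (director_id INTEGER PRIMARY KEY, name TEXT)"
)
conn.execute(
    "CREATE TABLE Movies ("
    " movie_id INTEGER PRIMARY KEY, title TEXT,"
    " director_id INTEGER REFERENCES Directors(director_id))"  # foreign key
)
conn.execute("INSERT INTO Directors VALUES (1, 'Director A')")
conn.execute("INSERT INTO Movies VALUES (10, 'Movie X', 1)")  # parent exists

# A child record pointing at a non-existent parent is rejected.
try:
    conn.execute("INSERT INTO Movies VALUES (11, 'Movie Y', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```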

Another example with multiple tables (primary keys are underlined in the original slide)

Hotel Reservation Database

Hotels: Hotel_id, Country_code, Hotel_name, Hotel_address, Hotel_city, Hotel_zipcode

Countries: Country_code, Country_currency, Country_name

Hotelrooms: Room_number, Hotel_id, Room_type, Room_floor

Roomtypes: Room_type_code, Room_standard_rate, Room_description, Smoking_YN

RoomBookings: Booking_id, Room_type_code, Hotel_id, Checkin_date, Number_of_days, Room_count

GuestBookings: Booking_id, Guest_number

Guests: Guest_number, Guest_firstname, Guest_lastname, Guest_address, Guest_city, Guest_zipcode, Guest_email

HotelAmenitiesLookup: Characteristic_id, Characteristic_description

HotelAmenities: Characteristic_id, Hotel_id

Relations between records in tables are determined by the primary/foreign keys.

"Common" keys are used to answer queries

Using the tables of the Hotel Reservation Database listed on the previous slide:

•  How many hotels in Country X?

•  How many rooms in Hotel Y?
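The two questions above can be answered by following the common keys. The sketch below assumes Python's sqlite3 module and keeps only a few columns of the Hotels, Countries and Hotelrooms tables; the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Countries (Country_code TEXT PRIMARY KEY, Country_name TEXT);
CREATE TABLE Hotels (Hotel_id INTEGER PRIMARY KEY, Country_code TEXT, Hotel_name TEXT);
CREATE TABLE Hotelrooms (Room_number INTEGER, Hotel_id INTEGER,
                         PRIMARY KEY (Room_number, Hotel_id));
INSERT INTO Countries VALUES ('IT', 'Italy'), ('US', 'United States');
INSERT INTO Hotels VALUES (1, 'IT', 'Majestic'), (2, 'US', 'Plaza'), (3, 'IT', 'Example Inn');
INSERT INTO Hotelrooms VALUES (101, 1), (102, 1), (101, 2);
""")

# How many hotels in Country X? Hotels and Countries share Country_code.
hotels_in_italy = conn.execute(
    "SELECT COUNT(*) FROM Hotels h"
    " JOIN Countries c ON h.Country_code = c.Country_code"
    " WHERE c.Country_name = ?", ("Italy",)
).fetchone()[0]

# How many rooms in Hotel Y? Hotelrooms and Hotels share Hotel_id.
rooms_in_majestic = conn.execute(
    "SELECT COUNT(*) FROM Hotelrooms r"
    " JOIN Hotels h ON r.Hotel_id = h.Hotel_id"
    " WHERE h.Hotel_name = ?", ("Majestic",)
).fetchone()[0]
```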

Tables describe entities

•  TERMINOLOGY: "An entity is a business object that represents a group, or category of data." (1)

•  Example: hotel, hotel_room, guest, ...

1) Stephens, R.K. and Plew, R.R., 2001. Database Design, pp. 21. SAMS, Indianapolis, IN.

Instance (Record, Tuple)

•  TERMINOLOGY: "A single, specific occurrence of an entity is an instance. Other terms for an instance are record and tuple." (1)

•  Hotel: Plaza

•  Instances are "valued" entities!

Generic entity description: the entity type Hotel.

04100899 | Plaza | 5th Avenue, 61 | NY | 00765

This is an instance of the entity type Hotel.

1) Stephens, R.K. and Plew, R.R., 2001. Database Design, pp. 210. SAMS, Indianapolis, IN.

Attributes (fields, primary/foreign keys)

•  TERMINOLOGY: "An attribute (or field) is a sub-group of information within an entity." (1)

•  Country_code is an attribute of the entity type Hotel.

•  As we said, an attribute can be a primary key or a foreign key. In the example, Hotel_id is primary, Country_code is foreign.

1) Stephens, R.K. and Plew, R.R., 2001. Database Design, pp. 21. SAMS, Indianapolis, IN.

Relationship

•  TERMINOLOGY: a relationship is a link that relates two entities that share one or more attributes (keys, fields).

•  Example: Guest_booking and Room_booking have the same attribute Booking_id (since one would like to know which guest reserved a given room, or which room has been reserved for a given guest).

Though often implicit, relationships have a semantics and a direction, e.g.:
Guest –(has booked)→ Room
Room –(has been booked by)→ Guest

Indexes

•  TERMINOLOGY: indexes are data structures used for fast look-up in tables.

•  E.g., say that you want to know how many Guests have the "Name" attribute = SMITH, without sequentially searching the whole database.

•  An index is a pointer to the locations (record IDs) of the DB where the required attribute has the required value. An index is a bit like an address.

•  Clearly, since you have many fields (attributes), you cannot organize your database in alphabetic order (on WHICH field?). So a separate index can be built for each field that needs fast look-up.

Figure: an index (Name → ID) pointing into the main table (ID, NAME, date-of-birth, AGE, ...).
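To make the "pointer to record IDs" idea concrete, here is a toy sketch in plain Python. It only illustrates the principle (real DBMSs typically use structures such as B-trees), and the guest records and the build_index helper are hypothetical.

```python
from collections import defaultdict

# A tiny "table" of guest records (invented data).
guests = [
    {"id": 1, "name": "SMITH", "date_of_birth": "1980-02-01"},
    {"id": 2, "name": "ROSSI", "date_of_birth": "1975-07-15"},
    {"id": 3, "name": "SMITH", "date_of_birth": "1991-11-30"},
]

def build_index(records, field):
    """Map each value of `field` to the list of record IDs that hold it."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record["id"])
    return index

# One index per field we want to search quickly.
name_index = build_index(guests, "name")

# Look-up no longer scans every record: it jumps straight to the IDs.
smith_ids = name_index["SMITH"]
smith_count = len(smith_ids)
```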

Operations

•  What are the main operations in a DB?

•  DELETE, UPDATE, INSERT (self-explanatory operations)

•  The SELECT operator is used to select those records with given values of one or more attributes (e.g. SELECT * FROM SALES_DATA WHERE PART_NAME = 'iPhone6' AND YEAR = 2016).

•  The JOIN operator is used to merge values from different tables.

Figure: join these 2 tables to learn that Mr. Rafferty works in the sales dept.
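The two tables of the figure are not reproduced in this transcript, so the sketch below assumes a plausible shape for them: an Employee table and a Department table sharing a DepartmentID key. Only "Rafferty" and "sales" come from the slide; everything else is hypothetical. It uses Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Department (DepartmentID INTEGER PRIMARY KEY, DepartmentName TEXT);
CREATE TABLE Employee (LastName TEXT, DepartmentID INTEGER);
INSERT INTO Department VALUES (31, 'Sales'), (33, 'Engineering');
INSERT INTO Employee VALUES ('Rafferty', 31), ('Jones', 33);
""")

# SELECT: keep only the records whose attributes have the requested values.
engineers = conn.execute(
    "SELECT LastName FROM Employee WHERE DepartmentID = ?", (33,)
).fetchall()

# JOIN: merge the two tables on the shared DepartmentID key.
rafferty_dept = conn.execute(
    "SELECT d.DepartmentName FROM Employee e"
    " JOIN Department d ON e.DepartmentID = d.DepartmentID"
    " WHERE e.LastName = ?", ("Rafferty",)
).fetchone()[0]
```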

Why not have one single table, so you don't need to merge?

•  Tables may nicely separate different views of the data (e.g. sales persons, managers, repair personnel, ...).

•  Different tables might be generated in different departments and locations.

•  Primary keys and foreign keys allow the information to be merged when needed.

Summary so far

•  Data concerning a business are collected in tables.

•  Tables have attributes (fields, keys) that describe entities. Each table describes an entity type (e.g., hotel).

•  Instances of an entity type (e.g. hotel Majestic in Roma is an instance of the entity type hotel) are called RECORDS, and have values that specify the various attributes (e.g. ADDRESS = via Vittorio Veneto 50).

•  The relevant data of a business are organized in many tables, offering different and detailed views of the business (e.g. reservation, restaurant and services, billing, customer care, ...).

•  Tables are linked together via their attributes (primary and foreign keys). Links are called relationships and usually have a (hidden) semantics.

•  Operations (select, join, delete, ...) and indexes are used to QUERY the database and retrieve RELEVANT BUSINESS FACTS (e.g., how many rooms have been reserved in January 2018?).

•  Performing operations on databases usually requires a query language (e.g. SQL), but with self-service business analytics you can retrieve facts with very simple interactions (we will see this in the Labs).

In-class exercise

•  A TV company wishes to develop a database to store data about the TV series that the company produces. The database includes information about actors who play in the series, and directors who direct the episodes of the series.

•  Actors and directors are employed by the company. TV series are divided into episodes. Each episode may be transmitted on several occasions (timestamps). An actor is hired to participate in a series, but may participate in many series. Each episode of a series is directed by one of the directors, but different episodes may be directed by different directors.

•  Develop a database scheme for this system (= a set of related tables with attributes):
1) Identify the entity types.
2) Create a table for each entity type.
3) Choose the attributes of the entity types.
4) Determine which of the attributes can be used as primary keys.
5) Draw connections between tables that are related through primary/foreign keys.

Querying the TV series database

•  According to your schema, which tables should be used to answer these types of questions:
–  Which actors play in the series X?
–  In which series does the actor Y participate?
–  Which actors participate in more than one series?
–  How many times has the first episode of the series X been transmitted? At what times?
–  How many directors are employed by the company?
–  Which director has directed the greatest number of episodes?

TV company database scheme

The symbol 1 … 0* means that each instance of a given entity type (e.g., a TV series) is related to 0 or more instances of another entity type (e.g., episodes). This clearly shows why you need separate tables: you could not add an "episode" attribute to the TV series table, since the number of episodes varies from one TV series to another. Therefore, we create an Episode table, and link TV series with their respective episodes through primary/foreign keys.
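The one-to-many link can be sketched with two tables. This is only an illustration under stated assumptions: the slide's actual diagram is not reproduced in this transcript, so the table and column names (TVSeries, Episode, series_id, episode_number) and the rows are hypothetical. It uses Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TVSeries (series_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Episode (
    episode_id INTEGER PRIMARY KEY,
    series_id INTEGER REFERENCES TVSeries(series_id),  -- foreign key to the series
    episode_number INTEGER
);
INSERT INTO TVSeries VALUES (1, 'Series X'), (2, 'Series Y');
INSERT INTO Episode VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3);
""")

# Each series is related to 0 or more episodes; LEFT JOIN keeps the
# series that have no episode yet.
episodes_per_series = dict(conn.execute(
    "SELECT s.title, COUNT(e.episode_id) FROM TVSeries s"
    " LEFT JOIN Episode e ON e.series_id = s.series_id"
    " GROUP BY s.series_id ORDER BY s.series_id"
).fetchall())
```

Because episodes live in their own table, a series can have any number of them without changing the TVSeries table.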

A more complex scheme

OLTP and OLAP databases

•  We now introduce and compare two types of DB systems:
–  OLTP (on-line transaction processing)
–  OLAP (on-line analytical processing, or Data Warehouses)

OLTP vs. OLAP (DW)

•  Traditional On-Line Transaction Processing systems (OLTP, introduced in the first lesson: Excel-like tables) are operational systems tailored for processing transactional databases.

•  A transactional database supports business process flows (sales, supply chain, etc.) and is typically an online, real-time system.

•  Compared with OLTP systems, DWs (also called OLAP systems, On-Line Analytical Processing) offer much more powerful analysis capabilities.

OLTP vs. OLAP (DW) - 2

Source of data

•  OLTP: operational data; OLTPs are the original source of the data, and each system manages a specific transactional database.

•  OLAP: OLAP data come from the various OLTP databases plus external sources, and are aggregated (the aggregated structure is also called an OLAP cube).

OLTP vs. OLAP (DW) - 3

Purpose of data

•  OLTP: to control and run fundamental day-to-day business tasks (e.g., handle guest reservations, room cleaning, payments, ...)

•  OLAP: to help with planning, problem solving, and decision support

OLTP vs. OLAP (DW) - 4

What the data represent

–  OLTP: reveals a snapshot of ongoing business processes

–  OLAP: multi-dimensional views of various kinds of business activities

OLTP vs. OLAP (DW) - 5

Queries

•  OLTP: relatively standardized and simple queries, returning relatively few records (= answers)
   Example: How many i-Phones sold in this quarter?

•  OLAP: often complex queries involving aggregation of many data and INFERENCE
   Example: How many i-Phones sold this month in Florence compared to the previous 6 months, and how many can we expect to sell in the next 6 months?
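The contrast between the two query styles can be hinted at with a toy example. The sketch assumes Python's sqlite3 module and a hypothetical Sales table with invented figures. The "forecast" is a deliberately naive average standing in for the inference step, not a real predictive model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales ("
    " sale_id INTEGER PRIMARY KEY, city TEXT, month TEXT, units INTEGER)"
)
conn.executemany(
    "INSERT INTO Sales (city, month, units) VALUES (?, ?, ?)",
    [
        ("Florence", "2017-01", 10), ("Florence", "2017-02", 14),
        ("Florence", "2017-03", 12), ("Rome", "2017-03", 30),
    ],
)

# OLTP-style: a simple look-up returning few records.
one_sale = conn.execute(
    "SELECT city, units FROM Sales WHERE sale_id = ?", (1,)
).fetchone()

# OLAP-style: aggregate many records along dimensions (city, month).
monthly = conn.execute(
    "SELECT month, SUM(units) FROM Sales WHERE city = ?"
    " GROUP BY month ORDER BY month", ("Florence",)
).fetchall()

# Naive stand-in for "inference": next month = mean of the past months.
forecast = sum(units for _, units in monthly) / len(monthly)
```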

OLTP vs. OLAP (DW): more issues

•  Processing speed
–  OLTP: typically very fast.
–  OLAP: depends on the amount of data involved; typically needs Big Data solutions.

•  Space requirements
–  OLTP: can be relatively small if historical data is archived.
–  OLAP: larger, due to the existence of aggregation structures and history data; requires more indexes than OLTP (since more dimensions are available or can be defined).

•  Backup and recovery
–  OLTP: back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
–  OLAP: instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.

Characteristics of DWs

•  Data warehouses are:
–  Subject oriented: Finance, Marketing, Inventory, ...
–  Integrated: web logs, legacy data, sales, ...
–  Non-volatile: data (even old data) remain in the database
–  Time variant: the grain can be real-time, day, month, quarter, ...

What kind of queries?

•  Users of the data warehouse perform data analyses that require them to "slice and dice" their data.

•  DW users will sometimes need highly aggregated data, and at other times they will need to drill down to details.

•  Often temporal analyses are required. More sophisticated analyses include trend analyses and data mining, which use existing data for predictive and prescriptive analytics.

•  The data warehouse acts as the underlying engine used by business intelligence environments that serve reports, dashboards and other interfaces to end users.
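A rough idea of aggregating, drilling down and slicing, sketched in plain Python. The fact rows, the dimension names (year, quarter, region) and the roll_up helper are hypothetical, and a real warehouse would do this inside its query engine.

```python
from collections import defaultdict

# Each fact: (year, quarter, region, revenue). Invented figures.
facts = [
    (2016, "Q1", "North", 100), (2016, "Q1", "South", 80),
    (2016, "Q2", "North", 120), (2016, "Q2", "South", 90),
]

def roll_up(rows, key):
    """Sum revenue over all rows that share the same key."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Highly aggregated view: one number per year.
by_year = roll_up(facts, lambda r: r[0])

# Drill down: same data at a finer grain (year, quarter).
by_quarter = roll_up(facts, lambda r: (r[0], r[1]))

# Slice: fix one dimension (region = North), then aggregate the rest.
north_slice = [r for r in facts if r[2] == "North"]
north_by_quarter = roll_up(north_slice, lambda r: r[1])
```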

What kind of queries in an OLTP?

•  Which customers are based in Roma?

•  How many spare parts of Product 222 are available?

•  Who has been our best client in 2016?

•  How many delays have we experienced in spare parts supply?

•  What has been the total revenue in 2015?

What kind of queries in an OLAP/DW?

Summary: OLTP vs. OLAP (DW)

In a nutshell ...

Summary so far

•  A Data Warehouse is a collection of data concerning the organisation, used in support of management decisions.

•  It is a kind of database: a data structure organized in tables.

•  A Data Warehouse allows analytical processing of data (OLAP) for decision support, in contrast to operational databases, which support real-time transaction processing (OLTP).

Architecture of a DW

Design aspects of a DW

1.  Select: which data and what for
2.  Transform: the so-called ETL (extract, clean, transform and load data)
3.  Store and process data: data marts, metadata, aggregations

Step 1: Which data and what for?

Which data? Data Sources and Types

•  Primary sources are legacy, operational systems.
–  Mostly structured and numerical data at the present time: sales, vendors, transactions, ...

•  External data may be included, often purchased from third-party sources.
–  Technology exists for storing unstructured data (images, text, sensors) and is becoming more important over time.
–  External data (social network data, user profiles) are also becoming more and more important.

Structured vs. Unstructured data

What external data and what for? (2)

•  Social data (social networks, blogs): to mine user opinions, trending topics, market forecasts

•  Sensor data (signals from devices, e.g. vending machines, packages, wearable devices, sensor networks, ...): to detect anomalies, learn trends, ...

•  Clickstream data (click logs of web sites): for traffic and e-commerce analysis

•  Environmental data (geolocations, meteorological data): to produce recommendations, supply chain, market forecasts, ...

•  Images, videos, signals (medical imaging, landscapes, portraits): to detect anomalies, security, fraud detection, ...

•  Audio (speech, sound): to mine opinions, fraud detection, environmental analysis

Example 1: applications of image understanding (people recognition)

People recognition

Business applications:
•  Visitor traffic per hour, day, season; store occupancy vs. opening hours
•  Schedule staffing
•  Shoplifting, sweethearting
•  Customer demographics
•  Security

Sweethearting
•  is a term used in the retail loss prevention industry to mean intentional margin loss through employee theft at the cash register. Sweethearting is the most common type of employee theft.

Shoplifting
•  (also known in slang as boosting and five-finger discount) is a popular term used for the unnoticed theft of goods from an open retail establishment.

Example 2: anomaly detection

Can be applied to any signal (output of sensors, medical data, etc.) to learn "normal behaviour" and detect/predict anomalies in real time. Remember the Magpie cold-chain example.

Example 3: Text

Challenges with unstructured data (images, signals, text)

•  Need complex processing to be useful:
–  Text processing, natural language understanding
–  Image processing, image understanding
–  Signal processing

•  A number of techniques/methods are available (Artificial Intelligence, Machine Learning).

•  E.g. see Cognitive Apps in Watson (later in this course).

•  We will also see something when talking about Social Analytics.

Design aspects of a DW

1.  Select: which data and what for
2.  Transform: the so-called ETL (Extract, clean, Transform and Load data)
3.  Store and process data: data marts, metadata, aggregations

Step 2: ETL (extract, clean, transform and load data)

•  It is important to understand that a data warehouse has the purpose of integrating different sources of data, not of COLLECTING new data.

•  So, new data are added, deleted, and updated in the ORIGINAL sources (e.g. an OLTP).

•  The data warehouse must extract new data as they are generated, detect and handle changes in old data, and integrate data from the different sources.

What is ETL

•  Extraction–transformation–loading (ETL) tools are pieces of software responsible for:
–  the extraction of data from several sources,
–  its cleansing, customization, reformatting, integration, and
–  its insertion into a data warehouse.

•  Building the ETL process is potentially one of the biggest tasks in creating a warehouse; it is complex and time consuming, and it absorbs most of a data warehouse project's implementation effort, cost, and resources.

ETL Functional Elements

•  ETL systems have a common purpose: they move data from one database to another.

•  Generally, ETL systems move data from OLTP systems to a data warehouse, but they can also be used to move data from one data warehouse to another, or from an external source (social networks, click logs, ...) to the warehouse.

•  An ETL system consists of four distinct functional elements:
–  Extraction
–  Transformation
–  Loading
–  Adding Metadata
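The four functional elements can be pictured as four small functions chained together. This is a toy sketch in plain Python, not how any particular ETL tool works; the function names, the record layout and the source name "orders_oltp" are all hypothetical.

```python
from datetime import datetime, timezone

def extract(source_rows):
    """Extraction: copy the raw records out of the source system."""
    return list(source_rows)

def transform(rows):
    """Transformation: clean and reformat; drop records with no usable name."""
    cleaned = []
    for row in rows:
        name = row["name"].strip().title()
        if not name:
            continue
        cleaned.append({"name": name, "amount": float(row["amount"])})
    return cleaned

def load(rows, warehouse):
    """Loading: append the cleaned records to the warehouse table."""
    warehouse.extend(rows)
    return len(rows)

def add_metadata(source_name, loaded_count):
    """Metadata: record where the data came from and when it was loaded."""
    return {
        "source": source_name,
        "rows_loaded": loaded_count,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
    }

source = [{"name": " bugs bunny ", "amount": "12.50"}, {"name": "  ", "amount": "3"}]
warehouse = []
loaded = load(transform(extract(source)), warehouse)
run_info = add_metadata("orders_oltp", loaded)
```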

Page 67: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

1. Extraction

•  The first step in any ETL scenario is data extraction.
•  The ETL extraction step is responsible for extracting data from the source systems.

•  Each data source has its distinct set of characteristics that need to be managed in order to effectively extract data for the ETL process.

•  The process needs to integrate systems that have different platforms, such as different database management systems, different operating systems, and different communications protocols.

Page 68: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Issues: Extraction frequency

•  There are several ways to perform the extract:

–  Update notification - if the source system is able to provide a notification that a record has been changed and describe the change (e.g. a new shipment has been completed, an order has been filed...), this is the easiest way to get the data.

–  Incremental extract - there are no notifications, so the extraction process starts at given time intervals; the source system should be able to identify which records have been modified and provide an extract of such records. During further ETL steps, the system needs to identify changes and propagate them down.

–  Full extract - some systems are not able to identify which data has been changed at all, so a full extract is the only way one can get the data out of the system. The full extract requires keeping a copy of the last extract in the same format in order to compare the two and identify changes. Full extract handles deletions as well.

–  Extract from unstructured resources - if data are not structured (not a database), the system extracts either in real time or incrementally, but new data are simply added to old data (e.g. new tweets discussing a given product).
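
To make the full extract case concrete, here is a small sketch of how today's full extract can be compared with the stored copy of the previous one to find new, changed and deleted records. It assumes every record carries a unique key (called `id` here); the function name `diff_extracts` and the sample records are invented for illustration.

```python
def diff_extracts(previous, current, key="id"):
    """Compare two full extracts and return (inserted, updated, deleted) records."""
    prev = {row[key]: row for row in previous}
    curr = {row[key]: row for row in current}
    inserted = [curr[k] for k in curr if k not in prev]
    deleted = [prev[k] for k in prev if k not in curr]
    updated = [curr[k] for k in curr if k in prev and curr[k] != prev[k]]
    return inserted, updated, deleted

# Copy of the last extract (kept in the same format as the new one).
yesterday = [
    {"id": 1, "name": "Anna", "city": "Rome"},
    {"id": 2, "name": "Bruno", "city": "Milan"},
]
# Today's full extract: id 1 changed city, id 3 is new, id 2 has disappeared.
today = [
    {"id": 1, "name": "Anna", "city": "Turin"},
    {"id": 3, "name": "Carla", "city": "Naples"},
]

inserted, updated, deleted = diff_extracts(yesterday, today)
print("inserted:", inserted)
print("updated:", updated)
print("deleted:", deleted)
```

Note that the deletion of record 2 is only visible because the whole source was read; an incremental extract of "modified records" would not report it.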

Page 69: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Example of extraction method

•  Suppose the “source” DB has 2 tables, Customers and Sales.

As of March 23rd, 2012, the latest added records are:

Note that the tables store data for 2 consecutive days (22 and 23). On the 22nd, the tables hold 2 customers and 3 sales; by the 23rd, they hold 3 customers and 5 sales in total. Suppose I want to update the warehouse every night.

Page 70: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Example of extraction method (2)

•  FULL LOAD METHOD FOR LOADING DATA WAREHOUSE
•  If we opt for the full load method, we read the 2 source tables (Customers and Sales) in full every day.
•  On 22 Mar 2012: we read 2 records from Customers and 3 records from Sales and load all of them into the target.
•  On 23 Mar 2012: we read 3 records from Customers (including the 2 older records) and 5 records from Sales (including the 3 old records) and load or update them in the target data warehouse.

•  As you can guess, this method of loading unnecessarily reads old records that we have already processed. Hence we need a smarter way of loading.

•  However, if “old” data are frequently modified or deleted, this method can be easier than checking for possible changes.

Page 71: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Example of extraction method (2)

•  INCREMENTAL LOAD METHOD FOR LOADING DATA WAREHOUSE
•  With incremental loading, we read only those records that have not already been read and loaded into our target system (data warehouse).

•  That is, on 22 March we read 2 records from Customers and 3 records from Sales; on 23 March, however, we read only 1 record from Customers and 2 records from Sales.

•  But how do we ensure that we read "only" those records that have not "already" been read? How do we know which records have already been read and which have not?

Page 72: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Example of extraction method (3)

•  We can make use of the "entry date" field in the customer table and the "sales date" field in the sales table to keep track of this.
•  After each load we "store" the date up to which loading has been performed in a data warehouse table, and the next day we extract only those records that have a date greater than our stored date. Let's create a new table to store this date. We will call this table "Batch".

•  Once we have done this, all we have to do to perform incremental or delta loading is to extract only the records whose date is later than the date stored in "Batch".
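
The idea of a stored date can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table layouts, the column names (`entry_date`, `loaded_until`), the customer names and the helper `incremental_extract` are invented, and dates are stored as ISO strings so that text comparison matches date order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER, name TEXT, entry_date TEXT);
    CREATE TABLE batch (table_name TEXT PRIMARY KEY, loaded_until TEXT);
    -- Initial stored date: earlier than any real record.
    INSERT INTO batch VALUES ('customer', '0001-01-01');
""")

def incremental_extract(conn):
    """Return only customers entered after the date stored in the batch table."""
    (loaded_until,) = conn.execute(
        "SELECT loaded_until FROM batch WHERE table_name = 'customer'"
    ).fetchone()
    rows = conn.execute(
        "SELECT customer_id, name, entry_date FROM customer "
        "WHERE entry_date > ? ORDER BY entry_date",
        (loaded_until,),
    ).fetchall()
    if rows:
        # Remember the latest date we have loaded, for the next run.
        conn.execute(
            "UPDATE batch SET loaded_until = ? WHERE table_name = 'customer'",
            (rows[-1][2],),
        )
    return rows

# Night of 22 March: two customers exist in the source.
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Customer A", "2012-03-22"), (2, "Customer B", "2012-03-22")],
)
first_night = incremental_extract(conn)

# Night of 23 March: one more customer has been added.
conn.execute("INSERT INTO customer VALUES (3, 'Customer C', '2012-03-23')")
second_night = incremental_extract(conn)

print(len(first_night), len(second_night))
```

One caveat of this simple scheme: with a date-only field, a record entered later on the same day as the last extraction would be skipped, so in practice a full timestamp (or running the extract after the day has closed) is a safer choice.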

Page 73: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Take away message

•  You are not responsible for the extraction process; IT people will be.

•  Your responsibility is to help decide, keeping in mind the objectives of the analysis and the timing constraints, which data should be extracted and (roughly) how frequently.

•  E.g., if the objective is to predict credit card fraud, real-time updating is needed. If the objective is to analyze and compare points of sale, weekly or monthly extraction can be enough.

Page 74: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

What about unstructured data?

•  You need software to download data streams (e.g. the Twitter API)
•  Usually some metadata is available in streams (e.g. date and IDs) to “concatenate” streams
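
A small sketch of what "concatenating" streams with the help of IDs and dates might look like. It assumes each downloaded item is a dictionary with an `id` and a `date` field; the field names, the function `concatenate_batches` and the sample items are invented, and no real Twitter API call is shown.

```python
def concatenate_batches(stored, new_batch):
    """Append new items to the stored ones, skipping IDs already seen, ordered by date."""
    seen = {item["id"] for item in stored}
    merged = list(stored)
    for item in new_batch:
        if item["id"] not in seen:
            seen.add(item["id"])
            merged.append(item)
    merged.sort(key=lambda item: item["date"])
    return merged

# Items already downloaded in a previous run.
stored = [
    {"id": 101, "date": "2016-04-01T09:00", "text": "first post"},
    {"id": 102, "date": "2016-04-01T09:05", "text": "second post"},
]
# A new download that overlaps with the previous one (id 102 appears again).
new_batch = [
    {"id": 102, "date": "2016-04-01T09:05", "text": "second post"},
    {"id": 103, "date": "2016-04-01T09:10", "text": "third post"},
]

merged = concatenate_batches(stored, new_batch)
print([item["id"] for item in merged])
```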

Page 75: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

2. Transformation

•  The second step in any ETL scenario is data transformation.
•  Objective: clean and conform the incoming data to obtain data that is correct, complete, consistent, and unambiguous.

•  This process includes data cleaning, transformation, and integration. It defines the granularity of fact tables, the dimension tables, data structures, etc.

•  All transformation rules and the resulting schemas must be described in the metadata repository.

•  We will see this later, but your responsibility (as business experts in a BI project) is to make sure that a description of the transformations performed on the data, comprehensible to business people, is maintained!

Page 76: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Example of transformation

•  As a small example, assume you have data coming from two different source systems which you want to merge in the data warehouse: there might be some differences between the two.

•  For example, one source may denote gender as Male and Female while the other may denote it as F and M.

Comparing these two tables, there is another mismatch in the way the same information is encoded. Which one?

Page 77: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Types of transformations

•  Now, if you are storing the gender in the target as M and F, you may need to "transform" Male and Female to M and F (or vice versa). You may write a simple CASE statement (a RULE), or you may just write code which translates Male --> M and Female --> F. This type of transformation is a MODIFICATION (you modify the values of a field/attribute).

•  If you want to encode the Name attribute in two attributes, FirstName and FamilyName, then you must split the values in each record of Table 1 and record the data separately in the target table. Again, you do this by writing some code and documenting it with a RULE. This is a CONFORMATION (you are making two fields compatible).

•  In the same way, if you have a Revenue field in a table maintained in Italy and another Revenue field from Germany, and you need a TotalRevenue in your target warehouse, you will write a function which calculates the sum and stores it in another column. This is an ADDITION (you are adding a new field).

•  All these modifications, additions, and conformations are part of the Transform stage. These transformations must be encoded in RULES readable by non-ICT users.

•  IMPORTANT: the SYNTAX and SEMANTICS of the data you combine and store are a CRITICAL FACTOR. Syntactic and semantic mismatches are a major source of problems when aggregating data!
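
The three kinds of transformation can be illustrated with a few lines of Python. This is a sketch, not a prescribed implementation: the function names are invented, the gender rule is written as a lookup table rather than a SQL CASE statement, and splitting a name at the first space is a deliberate simplification that real names will often break.

```python
# RULE (MODIFICATION): every source encoding of gender maps to the target codes M / F.
GENDER_RULE = {"Male": "M", "Female": "F", "M": "M", "F": "F"}

def modify_gender(value):
    """MODIFICATION: change the values of a field to the target encoding."""
    return GENDER_RULE[value.strip().capitalize()]

def conform_name(full_name):
    """CONFORMATION: split Name into FirstName and FamilyName at the first space."""
    first, _, family = full_name.strip().partition(" ")
    return {"FirstName": first, "FamilyName": family}

def add_total_revenue(revenue_italy, revenue_germany):
    """ADDITION: compute a new TotalRevenue field from the two source fields."""
    return revenue_italy + revenue_germany

print(modify_gender("Female"))
print(conform_name("Mario Rossi"))
print(add_total_revenue(1200.0, 800.0))
```

Each function corresponds to one RULE that should also be written down in plain language (for instance "Female and F both become F"), so that non-ICT users can check it.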

Page 78: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Example: aligning attribute names (“reconciling” data)

Page 79: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Other Transformation examples

•  More examples of transformations:
–  Selecting only certain columns to load
–  Translating coded values (e.g. F → female)
–  Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
–  Sorting or ordering the data based on a list of columns to improve search performance
–  Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
–  Aggregating (for example, rollup: summarizing multiple rows of data, such as total sales for each store, for each region, etc.)
–  Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
–  Splitting a column into multiple columns (e.g., Name → FirstName, FamilyName)
–  Disaggregating repeating columns
–  Looking up and validating the relevant data from tables or referential files
–  Applying any form of data validation; failed validation may result in a full rejection of the data, partial rejection, or no rejection at all, and thus none, some, or all of the data is handed over to the next step depending on the rule design and exception handling
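
Three of the items above (validation with partial rejection, deriving `sale_amount`, and rollup by store and by region) can be sketched together. The sample rows, the validation rule and the function names are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical incoming rows; the last one has an impossible quantity.
rows = [
    {"store": "S1", "region": "North", "qty": 2, "unit_price": 5.0},
    {"store": "S1", "region": "North", "qty": 1, "unit_price": 10.0},
    {"store": "S2", "region": "South", "qty": 3, "unit_price": 4.0},
    {"store": "S3", "region": "South", "qty": -1, "unit_price": 4.0},
]

def validate(row):
    """Validation rule (assumed): quantity must be positive, price non-negative."""
    return row["qty"] > 0 and row["unit_price"] >= 0

# Partial rejection: valid rows go on, invalid rows are set aside for inspection.
rejected = [r for r in rows if not validate(r)]

# Derived value: sale_amount = qty * unit_price, computed for accepted rows only.
accepted = [
    {**r, "sale_amount": r["qty"] * r["unit_price"]} for r in rows if validate(r)
]

def rollup(rows, level):
    """Aggregation: total sale_amount for each value of the given column."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[level]] += r["sale_amount"]
    return dict(totals)

sales_by_store = rollup(accepted, "store")
sales_by_region = rollup(accepted, "region")

print(sales_by_store)
print(sales_by_region)
print(len(rejected), "row(s) rejected")
```

Whether a single bad row should cause partial rejection, as here, or rejection of the whole batch is a rule design decision, not a technical one.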

Page 80: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Transforming unstructured resources

•  Way more complex! First, we need to transform from unstructured to structured

•  Example: sentiment analysis on Twitter

Here, the challenge is to analyze texts and, first, identify those of interest (e.g. talking about your company or a given product) and then assign to each text a positive, negative or 0 (neutral) score.

Page 81: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Transforming unstructured resources (2)

•  What do you get from this transformation (let’s ignore HOW for now)?

date        positive  negative  neutral
1/04/2016   500       237       1715
2/04/2016   451       277       2015
3/04/2016   816       300       3016

Table: Starbucks Twitter Sentiment
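
A table of this shape is simply a count of scored texts per day. The sketch below assumes the hard part (scoring each text as 1, -1 or 0) has already been done by some other component; the scored items are invented and are not the Starbucks figures.

```python
from collections import Counter, defaultdict

# Hypothetical output of a sentiment scorer: (date, score) with 1 / -1 / 0.
scored_tweets = [
    ("2016-04-01", 1), ("2016-04-01", 0), ("2016-04-01", -1), ("2016-04-01", 0),
    ("2016-04-02", 1), ("2016-04-02", 1),
]

LABELS = {1: "positive", -1: "negative", 0: "neutral"}

def daily_sentiment(scored):
    """Count positive, negative and neutral texts for each day."""
    counts = defaultdict(Counter)
    for date, score in scored:
        counts[date][LABELS[score]] += 1
    return [
        {
            "date": date,
            "positive": counts[date]["positive"],
            "negative": counts[date]["negative"],
            "neutral": counts[date]["neutral"],
        }
        for date in sorted(counts)
    ]

table = daily_sentiment(scored_tweets)
for row in table:
    print(row)
```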

Page 82: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Transformation: aggregation

•  We already mentioned an example of aggregation (summing revenue data from different DBs maintained in different departments)

•  Aggregation may be far more complex
•  E.g. we may want to aggregate sentiment data with sales to discover what went wrong (or which winning move users appreciated most)
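
As a very simple illustration of putting sentiment and sales side by side, the sketch below joins the two by date. Everything here is an assumption made for the example: the figures are invented, and the "net sentiment" ratio (positive minus negative, divided by their sum) is just one possible summary, not a measure taken from these slides.

```python
# Hypothetical daily figures from two different sources.
sentiment_by_day = {
    "2016-04-01": {"positive": 60, "negative": 40},
    "2016-04-02": {"positive": 30, "negative": 70},
}
sales_by_day = {
    "2016-04-01": 1000.0,
    "2016-04-02": 700.0,
    "2016-04-03": 900.0,  # no sentiment data for this day
}

def combine(sentiment_by_day, sales_by_day):
    """Join sentiment and sales on the date; keep only days present in both."""
    combined = []
    for day in sorted(sentiment_by_day.keys() & sales_by_day.keys()):
        s = sentiment_by_day[day]
        total = s["positive"] + s["negative"]
        net = (s["positive"] - s["negative"]) / total if total else 0.0
        combined.append(
            {"date": day, "net_sentiment": round(net, 2), "sales": sales_by_day[day]}
        )
    return combined

combined = combine(sentiment_by_day, sales_by_day)
for row in combined:
    print(row)
```

Even in this toy case a design decision appears: days that exist in only one source are dropped, which is a rule that must be documented.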

Page 83: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Example (Social Engagement Index)

http://www.brandamplitude.com/blog/innovation/item/announcing-breakthrough-in-measuring-the-impact-of-social-media-on-sales

Page 84: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

3. Loading

•  The third step is loading.
•  The ETL loading element is responsible for loading transformed data into the data warehouse database.

•  Data warehouses are usually updated periodically rather than continuously (as is the case for OLTP), and large numbers of records are often loaded into multiple tables in a single data load step.

•  The data warehouse is often taken offline during update operations so that data can be loaded faster
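
A sketch of a single load step that writes to several tables at once, using Python's built-in `sqlite3` module as a stand-in for the warehouse database. The table names (`dim_customer`, `fact_sales`), the sample records and the function `load_batch` are invented; the point is only that the whole batch is loaded as one unit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")

def load_batch(conn, customers, sales):
    """Load one batch into both tables; either everything is stored or nothing is."""
    with conn:  # commits if the block succeeds, rolls back if an exception is raised
        conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", customers)
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", sales)

# One periodic load: many records, two tables, a single step.
load_batch(
    conn,
    customers=[(1, "Customer A"), (2, "Customer B")],
    sales=[(10, 1, 25.0), (11, 2, 40.0), (12, 1, 15.0)],
)

n_customers = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
n_sales = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(n_customers, n_sales)
```

Loading the batch as one transaction means a failure half way through does not leave the warehouse with customers but no sales, or vice versa.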

Page 85: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

DW ETL Tools

•  Some of the Well Known ETL Tools
•  The best-known commercial tools are Ab Initio, IBM InfoSphere DataStage, Informatica, Oracle Data Integrator and SAP Data Integrator.

Page 86: Introduction to Data Warehousingtwiki.di.uniroma1.it/pub/BI/WebHome/2.DataWarehousesETL.pdf · Introduction to Data Warehousing The Business demand for data, information and analytics

Case Study (Self-assessment)

•  Download the paper at http://bmjopen.bmj.com/content/bmjopen/6/8/e010962.full.pdf describing the use case of the Dutch Red Cross data warehouse (also on the course website)

•  Answer the following:
–  What types of data have been integrated, and from which sources?
–  Can you draw the schema of all needed tables?
•  What are the objects? What are the attributes? What are the relationships? What is the “semantics” of the relationships?
–  Can you list some of the TRANSFORM operations that were needed to harmonize data during the ETL process?
–  Which additional challenges are posed to the warehouse by the specific application domain?
–  Can you list the main categories of data which have been integrated?
–  Can you list and summarize the main data analytic tasks supported by the warehouse?