CS639: Data Management for Data Science
Midterm Review 1: Relational Databases and Relational Algebra
Theodoros Rekatsinas
Today's Lecture
1. Review Relational Databases and Relational Algebra
2. Next Lecture: Review MapReduce and NoSQL systems
Data science workflow
Section 2
https://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext
Uncertainty and Randomness
• Data represents the traces of real-world processes.
  • The collected traces correspond to a sample of those processes.
• There is randomness and uncertainty in the data collection process.
• The process that generates the data is stochastic (random).
  • Example: Let's toss a coin! What will the outcome be? Heads or tails? There are many factors that make a coin toss a stochastic process.
• The sampling process introduces uncertainty.
  • Example: Errors in sensor position due to error in GPS, errors due to the angles of laser travel, etc.
Models
• Data represents the traces of real-world processes.
• Part of the data science process: We need to model the real world.
• A model is a function fθ(x)
  • x: input variables (can be a vector)
  • θ: model parameters
Modeling Uncertainty and Randomness
• Data represents the traces of real-world processes.
• There is randomness and uncertainty in the data collection process.
• A model is a function fθ(x)
  • x: input variables (can be a vector)
  • θ: model parameters
• Models should rely on probability theory to capture uncertainty and randomness!
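
To make the idea of a probabilistic model fθ(x) concrete, here is a minimal sketch for the coin toss mentioned earlier. It assumes a Bernoulli model in which θ is the probability of heads; the names `coin_model` and `estimate_theta`, the seed, and the sample size are illustrative choices, not part of the lecture.

```python
import random

def coin_model(theta, x):
    """f_theta(x): probability of outcome x (1 = heads, 0 = tails) when P(heads) = theta."""
    return theta if x == 1 else 1.0 - theta

def estimate_theta(sample):
    """Estimate theta as the fraction of heads observed in the sample."""
    return sum(sample) / len(sample)

rng = random.Random(0)   # fixed seed so the run is repeatable
true_theta = 0.5         # assumed bias of the illustrative coin
# The collected data is only a sample of the underlying stochastic process.
sample = [1 if rng.random() < true_theta else 0 for _ in range(1000)]

theta_hat = estimate_theta(sample)
print("estimated theta:", theta_hat)
print("P(heads) + P(tails):", coin_model(theta_hat, 1) + coin_model(theta_hat, 0))
```

The estimate differs from the true θ because the sample is finite, which is the sampling uncertainty described above.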
The Relational Model: Schemata
• Relational Schema:
Students(sid: string, name: string, gpa: float)
Relation name: Students
Attributes: sid, name, gpa
String, float, int, etc. are the domains of the attributes
The Relational Model: Data
Student
sid   name    gpa
001   Bob     3.2
002   Joe     2.8
003   Mary    3.8
004   Alice   3.5
An attribute (or column) is a typed data entry present in each tuple in the relation
The number of attributes is the arity of the relation
A tuple or row (or record) is a single entry in the table having the attributes specified by the schema
The number of tuples is the cardinality of the relation
A relational instance is a set of tuples all conforming to the same schema
In practice DBMSs relax the set requirement, and use multisets.
To Reiterate
• A relational schema describes the data that is contained in a relational instance
Let R(f1: Dom1, …, fm: Domm) be a relational schema; then an instance of R is a subset of Dom1 × Dom2 × … × Domm
In this way, a relational schema R is a total function from attribute names to types
One More Time
A relation R of arity t is a function: R : Dom1 × … × Domt → {0,1}
I.e. returns whether or not a tuple of matching types is a member of it
Then, the schema is simply the signature of the function
Note here that order matters, attribute name doesn't… We'll (mostly) work with the other model (last slide) in which attribute name matters, order doesn't!
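
The two views of a relation (a subset of the cross product of the domains, or a function into {0,1}) can be put side by side in a short sketch. The tiny domains `SID` and `NAME` and the function name `students_fn` are made up for illustration so that the cross product is small enough to enumerate.

```python
from itertools import product

# Hypothetical tiny domains so that Dom1 x Dom2 can be enumerated.
SID = {"001", "002"}
NAME = {"Bob", "Joe"}

# View 1: an instance is a subset of Dom1 x Dom2, i.e. a set of tuples.
students = {("001", "Bob"), ("002", "Joe")}

# View 2: the same relation as a function Dom1 x Dom2 -> {0, 1}.
def students_fn(sid, name):
    return 1 if (sid, name) in students else 0

all_tuples = set(product(SID, NAME))   # Dom1 x Dom2
assert students <= all_tuples          # the instance is a subset of the cross product

# Collecting the tuples mapped to 1 recovers the set-based instance.
recovered = {t for t in all_tuples if students_fn(*t) == 1}
print(recovered == students)
```

Note that the tuples here are positional, so this sketch follows the "order matters, attribute name doesn't" view.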
A relational database
• A relational database schema is a set of relational schemata, one for each relation
• A relational database instance is a set of relational instances, one for each relation
Two conventions:
1. We call relational database instances simply databases
2. We assume all instances are valid, i.e., satisfy the domain constraints
RDBMS Architecture
How does a SQL engine work?
SQL Query → Relational Algebra (RA) Plan → Optimized RA Plan → Execution
• SQL Query: Declarative query (from user)
• Relational Algebra (RA) Plan: Translate to relational algebra expression
• Optimized RA Plan: Find logically equivalent, but more efficient, RA expression
• Execution: Execute each operator of the optimized plan!
Relational Algebra (RA)
• Five basic operators:
  1. Selection: σ
  2. Projection: Π
  3. Cartesian Product: ×
  4. Union: ∪
  5. Difference: −
• Derived or auxiliary operators:
  • Intersection, complement
  • Joins (natural, equi-join, theta join, semi-join)
  • Renaming: ρ
  • Division
Note that RA Operators are Compositional!
Students(sid, sname, gpa)
SELECT DISTINCT sname, gpa
FROM Students
WHERE gpa > 3.5;
How do we represent this query in RA?
Π_{sname,gpa}(σ_{gpa>3.5}(Students))
σ_{gpa>3.5}(Π_{sname,gpa}(Students))
Are these logically equivalent?
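
One way to explore the question is to run both compositions on a small instance. The sketch below reuses the four sample Student rows from earlier (with the name column called sname, as in this schema); the helpers `select` and `project` are illustrative stand-ins for σ and Π with set semantics, not a real library API.

```python
# The sample Students instance, one dict per tuple.
students = [
    {"sid": "001", "sname": "Bob", "gpa": 3.2},
    {"sid": "002", "sname": "Joe", "gpa": 2.8},
    {"sid": "003", "sname": "Mary", "gpa": 3.8},
    {"sid": "004", "sname": "Alice", "gpa": 3.5},
]

def select(rows, pred):
    """sigma: keep the rows that satisfy the predicate."""
    return [r for r in rows if pred(r)]

def project(rows, attrs):
    """Pi: keep only attrs, removing duplicates (set semantics, like DISTINCT)."""
    seen = {tuple(r[a] for a in attrs) for r in rows}
    return [dict(zip(attrs, t)) for t in sorted(seen)]

# Pi_{sname,gpa}(sigma_{gpa>3.5}(Students))
plan_a = project(select(students, lambda r: r["gpa"] > 3.5), ["sname", "gpa"])
# sigma_{gpa>3.5}(Pi_{sname,gpa}(Students))
plan_b = select(project(students, ["sname", "gpa"]), lambda r: r["gpa"] > 3.5)

print(plan_a)
print(plan_a == plan_b)
```

On this instance the two plans agree. The selection only looks at gpa, which the projection keeps, so swapping the two operators does not change which tuples survive.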
Natural Join (⋈)
• Notation: R1 ⋈ R2
• Joins R1 and R2 on equality of all shared attributes
  • If R1 has attribute set A, and R2 has attribute set B, and they share attributes A ∩ B = C, can also be written: R1 ⋈_C R2
• Our first example of a derived RA operator:
  • Meaning: R1 ⋈ R2 = Π_{A∪B}(σ_{C=D}(ρ_{C→D}(R1) × R2))
  • Where:
    • The rename ρ_{C→D} renames the shared attributes in one of the relations
    • The selection σ_{C=D} checks equality of the shared attributes
    • The projection Π_{A∪B} eliminates the duplicate common attributes
Students(sid, name, gpa)
People(ssn, name, address)
SQL:
SELECT DISTINCT sid, S.name, gpa, ssn, address
FROM Students S, People P
WHERE S.name = P.name;
RA: Students ⋈ People
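
The derived definition of the natural join (rename, cross product, select, project) can be traced step by step on Students ⋈ People. This is a sketch under assumptions: the People rows are invented sample data, the helper functions are illustrative, and the shared attribute name is renamed to a hypothetical `name_s` on the Students side.

```python
from itertools import product

students = [{"sid": "001", "name": "Bob", "gpa": 3.2},
            {"sid": "003", "name": "Mary", "gpa": 3.8}]
# Invented sample rows for People(ssn, name, address).
people = [{"ssn": "111", "name": "Bob", "address": "1 Main St"},
          {"ssn": "222", "name": "Ann", "address": "2 Oak Ave"}]

def rename(rows, old, new):
    """rho_{old -> new}: rename one attribute."""
    return [{(new if k == old else k): v for k, v in r.items()} for r in rows]

def cross(r1, r2):
    """Cartesian product; assumes the attribute names are already disjoint."""
    return [{**a, **b} for a, b in product(r1, r2)]

def select(rows, pred):
    return [r for r in rows if pred(r)]

def project(rows, attrs):
    seen = {tuple(r[a] for a in attrs) for r in rows}
    return [dict(zip(attrs, t)) for t in sorted(seen)]

# R1 ⋈ R2 = Pi_{A∪B}(sigma_{C=D}(rho_{C→D}(R1) × R2)), with C = name and D = name_s
renamed = rename(students, "name", "name_s")
joined = project(
    select(cross(renamed, people), lambda r: r["name_s"] == r["name"]),
    ["sid", "name", "gpa", "ssn", "address"],
)
print(joined)
```

Only Bob appears in both relations, so the join returns a single tuple carrying the attributes of both.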
Example: Converting SQL Query -> RA
Students(sid, sname, gpa)
People(ssn, sname, address)
SELECT DISTINCT gpa, address
FROM Students S, People P
WHERE gpa > 3.5 AND S.sname = P.sname;
Π_{gpa,address}(σ_{gpa>3.5}(S ⋈ P))
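RA Expressions Can Get Complex!
[Figure: RA expression tree over the leaves Person, Purchase, Person, Product, with selections σ_{name=fred} and σ_{name=gizmo}, projections Π_{pid} and Π_{ssn}, joins on seller-ssn=ssn, pid=pid, and buyer-ssn=ssn, and a final projection Π_{name} at the root.]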
RA has Limitations!
• Cannot compute "transitive closure"
• Find all direct and indirect relatives of Fred
  • Cannot express in RA!!!
  • Need to write C program, use a graph engine, or modern SQL…
Name1   Name2   Relationship
Fred    Mary    Father
Mary    Joe     Cousin
Mary    Bill    Spouse
Nancy   Lou     Sister
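
As an illustration of the "modern SQL" route, the sketch below computes Fred's direct and indirect relatives with a recursive common table expression, run through Python's built-in sqlite3 module. The table name `Relatives`, its column names, and the choice to follow edges only from Name1 to Name2 are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Relatives (name1 TEXT, name2 TEXT, relationship TEXT)")
conn.executemany(
    "INSERT INTO Relatives VALUES (?, ?, ?)",
    [("Fred", "Mary", "Father"), ("Mary", "Joe", "Cousin"),
     ("Mary", "Bill", "Spouse"), ("Nancy", "Lou", "Sister")],
)

# reach starts with Fred's direct relatives and keeps adding relatives of
# anyone already reached until nothing new appears (a fixpoint).
query = """
WITH RECURSIVE reach(name) AS (
    SELECT name2 FROM Relatives WHERE name1 = 'Fred'
    UNION
    SELECT r.name2 FROM Relatives r JOIN reach ON r.name1 = reach.name
)
SELECT name FROM reach ORDER BY name;
"""
relatives_of_fred = [row[0] for row in conn.execute(query)]
print(relatives_of_fred)
```

The number of join steps needed depends on the data, which is why a fixed RA expression cannot express this query.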
SQL Time!
• Find all the distinct names of all companies that are based in Japan.
• Find the distinct names of all companies that are based in Japan and that sold a product to an AI based in Cupertino.
• Find the distinct names of all companies that have sold at least six distinct products.
• Find the distinct names of all companies that have not sold even a single product.
• Find the distinct names of all companies such that every product they have ever sold costs at least 10 thousand dollars. Companies that have not sold any products should not be counted, as they are losers.
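
The schema for these exercises is not part of this transcript, so the sketch below assumes a hypothetical one, Company(cname, country) and Product(pname, price, seller), with invented rows, and shows one possible way to phrase four of the questions. The Cupertino question is left out because it needs buyer information that this assumed schema does not model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, chosen only to exercise the queries.
conn.executescript("""
CREATE TABLE Company (cname TEXT PRIMARY KEY, country TEXT);
CREATE TABLE Product (pname TEXT, price REAL, seller TEXT);
""")
conn.executemany("INSERT INTO Company VALUES (?, ?)",
                 [("Sony", "Japan"), ("Canon", "Japan"), ("Idle", "Japan"), ("Acme", "USA")])
products = [(f"p{i}", 20000, "Sony") for i in range(6)]
products += [("lens", 500, "Canon"), ("anvil", 15000, "Acme")]
conn.executemany("INSERT INTO Product VALUES (?, ?, ?)", products)

def names(sql):
    return [row[0] for row in conn.execute(sql)]

# Companies based in Japan.
in_japan = names("SELECT DISTINCT cname FROM Company WHERE country = 'Japan' ORDER BY cname")

# Companies that have sold at least six distinct products.
six_plus = names("""SELECT seller FROM Product
                    GROUP BY seller HAVING COUNT(DISTINCT pname) >= 6 ORDER BY seller""")

# Companies that have not sold even a single product.
no_sales = names("""SELECT cname FROM Company c
                    WHERE NOT EXISTS (SELECT 1 FROM Product p WHERE p.seller = c.cname)
                    ORDER BY cname""")

# Companies whose every product costs at least 10000; grouping over Product
# automatically leaves out companies with no products.
all_pricey = names("""SELECT seller FROM Product
                      GROUP BY seller HAVING MIN(price) >= 10000 ORDER BY seller""")

print(in_japan, six_plus, no_sales, all_pricey)
```

Using MIN(price) in HAVING is one way to express "every product"; a NOT EXISTS subquery over cheaper products would be an alternative.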
Logical vs. Physical Optimization
• Logical optimization (we will only see this one):
  • Find equivalent plans that are more efficient
  • Intuition: Minimize # of tuples at each step by changing the order of RA operators
• Physical optimization:
  • Find algorithm with lowest IO cost to execute our plan
  • Intuition: Calculate based on physical parameters (buffer size, etc.) and estimates of data size (histograms)
SQL Query → Relational Algebra (RA) Plan → Optimized RA Plan → Execution
Recall: Logical Equivalence of RA Plans
• Given relations R(A,B) and S(B,C):
• Here, projection & selection commute:
  • σ_{A=5}(Π_A(R)) = Π_A(σ_{A=5}(R))
• What about here?
  • σ_{A=5}(Π_B(R)) ?= Π_B(σ_{A=5}(R))
Translating to RA
R(A,B) S(B,C) T(C,D)
SELECT R.A, T.D
FROM R, S, T
WHERE R.B = S.B
AND S.C = T.C
AND R.A < 10;
Π_{A,D}(σ_{A<10}(T ⋈ (R ⋈ S)))
Plan tree, bottom to top: R(A,B) ⋈ S(B,C), then ⋈ T(C,D), then σ_{A<10}, then Π_{A,D}
Logical Optimization
• Heuristically, we want selections and projections to occur as early as possible in the plan
  • Terminology: "pushing down selections" and "pushing down projections."
• Intuition: We will have fewer tuples in a plan.
  • Could fail if the selection condition is very expensive (say runs some image processing algorithm).
  • Projection could be a waste of effort, but more rarely.
Optimizing RA Plan
Push down selection on A so it occurs earlier
Starting plan: Π_{A,D}(σ_{A<10}(T ⋈ (R ⋈ S)))
Optimizing RA Plan, with the selection on A pushed down:
Π_{A,D}(T ⋈ (σ_{A<10}(R) ⋈ S))
Optimizing RA Plan
Push down projection so it occurs earlier
Current plan: Π_{A,D}(T ⋈ (σ_{A<10}(R) ⋈ S))
Optimizing RA Plan, with the projection pushed down:
Π_{A,D}(T ⋈ Π_{A,C}(σ_{A<10}(R) ⋈ S))
We eliminate B earlier!
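
The effect of pushing the selection and projection down can be checked on a small instance. The sketch below builds made-up contents for R(A,B), S(B,C), and T(C,D), evaluates the unoptimized and the optimized plan with illustrative helper functions, and compares both the answers and the size of the intermediate R ⋈ S result.

```python
# Made-up instances for R(A,B), S(B,C), T(C,D).
R = [{"A": a, "B": a % 3} for a in range(20)]
S = [{"B": b, "C": b + 10} for b in range(3)]
T = [{"C": c, "D": c * 2} for c in (10, 11, 12)]

def natural_join(r1, r2):
    """Join on equality of all attributes the two relations share."""
    out = []
    for a in r1:
        for b in r2:
            if all(a[k] == b[k] for k in set(a) & set(b)):
                out.append({**a, **b})
    return out

def select(rows, pred):
    return [r for r in rows if pred(r)]

def project(rows, attrs):
    seen = {tuple(r[a] for a in attrs) for r in rows}
    return [dict(zip(attrs, t)) for t in sorted(seen)]

# Unoptimized: Pi_{A,D}(sigma_{A<10}(T ⋈ (R ⋈ S)))
rs = natural_join(R, S)
naive = project(select(natural_join(T, rs), lambda r: r["A"] < 10), ["A", "D"])

# Optimized: Pi_{A,D}(T ⋈ Pi_{A,C}(sigma_{A<10}(R) ⋈ S))
rs_small = project(natural_join(select(R, lambda r: r["A"] < 10), S), ["A", "C"])
optimized = project(natural_join(T, rs_small), ["A", "D"])

print(naive == optimized, len(rs), len(rs_small))
```

Both plans return the same tuples, but the optimized plan feeds half as many (and narrower) tuples into the join with T on this data.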
In general, when is an attribute not needed…?
Please go over the examples here:
• https://courses.cs.washington.edu/courses/cse544/99sp/homeworks/sample/sample.html
• Only the first 4 questions!
Transaction Properties: ACID
• Atomic
  • State shows either all the effects of txn, or none of them
• Consistent
  • Txn moves from a state where integrity holds, to another where integrity holds
• Isolated
  • Effect of txns is the same as txns running one after another (i.e., looks like batch mode)
• Durable
  • Once a txn has committed, its effects remain in the database
ACID continues to be a source of great debate!
ACID: Atomicity
• TXN's activities are atomic: all or nothing
  • Intuitively: in the real world, a transaction is something that would either occur completely or not at all
• Two possible outcomes for a TXN
  • It commits: all the changes are made
  • It aborts: no changes are made
Transactions
• A key concept is the transaction (TXN): an atomic sequence of db actions (reads/writes)
Atomicity: An action either completes entirely or not at all
Transfer $3k from a10 to a20:
1. Debit $3k from a10
2. Credit $3k to a20
Before the transfer:
Acct   Balance
a10    20,000
a20    15,000
After the transfer:
Acct   Balance
a10    17,000
a20    18,000
Written naively, in which states is atomicity preserved?
• Crash before 1,
• After 1 but before 2,
• After 2.
DB Always preserves atomicity!
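
The transfer above can be sketched with Python's built-in sqlite3 module, whose connection object commits a transaction on success and rolls it back if an exception escapes the `with` block. The table name `Account` and the `crash_after_debit` flag are illustrative, and the raised exception only simulates a failure between the two steps; it is not a real process crash.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (acct TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO Account VALUES (?, ?)", [("a10", 20000), ("a20", 15000)])
conn.commit()

def transfer(conn, src, dst, amount, crash_after_debit=False):
    try:
        with conn:  # commits if the block finishes, rolls back if it raises
            conn.execute("UPDATE Account SET balance = balance - ? WHERE acct = ?",
                         (amount, src))                      # step 1: debit
            if crash_after_debit:
                raise RuntimeError("simulated failure between step 1 and step 2")
            conn.execute("UPDATE Account SET balance = balance + ? WHERE acct = ?",
                         (amount, dst))                      # step 2: credit
    except RuntimeError:
        pass  # the debit from step 1 has been rolled back

def balances(conn):
    return dict(conn.execute("SELECT acct, balance FROM Account ORDER BY acct"))

transfer(conn, "a10", "a20", 3000, crash_after_debit=True)
after_crash = balances(conn)     # neither step is visible
transfer(conn, "a10", "a20", 3000)
after_ok = balances(conn)        # both steps are visible
print(after_crash, after_ok)
```

The failed attempt leaves the balances untouched, while the successful one applies the debit and the credit together.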
ACID: Consistency
• The tables must always satisfy user-specified integrity constraints
• Examples:
  • Account number is unique
  • Stock amount can't be negative
  • Sum of debits and of credits is 0
• How consistency is achieved:
  • Programmer makes sure a txn takes a consistent state to a consistent state
  • System makes sure that the txn is atomic
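
Two of the example constraints can be declared directly in a table definition so that the system rejects any statement that would violate them. The sketch below uses sqlite3 with a hypothetical `Stock` table; a PRIMARY KEY stands in for "account number is unique" and a CHECK clause for "stock amount can't be negative".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Stock (
    acct   TEXT PRIMARY KEY,              -- account number is unique
    amount INTEGER CHECK (amount >= 0)    -- stock amount can't be negative
)""")
conn.execute("INSERT INTO Stock VALUES ('a10', 5)")
conn.commit()

rejected = []
for stmt in ("INSERT INTO Stock VALUES ('a10', 7)",                      # duplicate account
             "UPDATE Stock SET amount = amount - 9 WHERE acct = 'a10'"):  # would go negative
    try:
        with conn:
            conn.execute(stmt)
    except sqlite3.IntegrityError:
        rejected.append(stmt)   # the violating statement had no effect

final = conn.execute("SELECT acct, amount FROM Stock").fetchall()
print(len(rejected), final)
```

Both statements are refused and the table stays in its last consistent state.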
ACID: Isolation
• A transaction executes concurrently with other transactions
• Isolation: the effect is as if each transaction executes in isolation of the others.
  • E.g. Should not be able to observe changes from other transactions during the run
Challenge: Scheduling Concurrent Transactions
• The DBMS ensures that the execution of {T1, …, Tn} is equivalent to some serial execution
• One way to accomplish this: Locking
  • Before reading or writing, transaction requires a lock from DBMS, holds until the end
• Key Idea: If Ti wants to write to an item x and Tj wants to read x, then Ti, Tj conflict. Solution via locking:
  • only one winner gets the lock
  • loser is blocked (waits) until winner finishes
A set of TXNs is isolated if their effect is as if all were executed serially
What if Ti and Tj need X and Y, and Ti asks for X before Tj, and Tj asks for Y before Ti? -> Deadlock! One is aborted…
All concurrency issues handled by the DBMS…
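
The deadlock scenario above can be modeled as a cycle in a "wait-for" graph: Ti waits for the holder of Y, and Tj waits for the holder of X. The sketch below is a simplified illustration, not how any particular DBMS implements it; the function name `find_deadlock` and the assumption that each transaction waits for at most one item are choices made for this example.

```python
def find_deadlock(holds, wants):
    """Return the transactions on a wait-for cycle, or None if there is none.

    holds: item -> transaction currently holding its lock
    wants: transaction -> item it is blocked on (at most one per transaction)
    """
    # Wait-for edge: a blocked transaction points at the holder of the item it wants.
    waits_for = {t: holds[item] for t, item in wants.items()
                 if item in holds and holds[item] != t}
    for start in waits_for:
        path, cur = [], start
        while cur in waits_for and cur not in path:
            path.append(cur)
            cur = waits_for[cur]
        if cur in path:                      # walked back onto the path: a cycle
            return path[path.index(cur):]
    return None

# Ti got X first, Tj got Y first; now each asks for the other's item.
holds = {"X": "Ti", "Y": "Tj"}
wants = {"Ti": "Y", "Tj": "X"}
cycle = find_deadlock(holds, wants)
print(cycle)

# Plain blocking without a cycle is not a deadlock: Tj just waits for Ti.
no_cycle = find_deadlock({"X": "Ti"}, {"Tj": "X"})
print(no_cycle)
```

Once a cycle is found, aborting any one transaction on it releases its locks and lets the other proceed.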
ACID: Durability
• The effect of a TXN must continue to exist ("persist") after the TXN
  • And after the whole program has terminated
  • And even if there are power failures, crashes, etc.
  • And etc…
• Means: Write data to disk
Ensuring Atomicity & Durability
• DBMS ensures atomicity even if a TXN crashes!
• One way to accomplish this: Write-ahead logging (WAL)
• Key Idea: Keep a log of all the writes done.
  • After a crash, the partially executed TXNs are undone using the log
Write-ahead Logging (WAL): Before any action is finalized, a corresponding log entry is forced to disk
We assume that the log is on "stable" storage
All atomicity issues also handled by the DBMS…
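
The key idea of logging writes and undoing partially executed TXNs after a crash can be sketched in a few lines. This is a toy, undo-only model under stated assumptions: the class `TinyDB` is hypothetical, a Python dict stands in for data on disk, and a Python list stands in for the log on stable storage. Real write-ahead logging also has to deal with redo, log flushing, and checkpoints, which are out of scope here.

```python
class TinyDB:
    def __init__(self, data):
        self.data = dict(data)   # stands in for the data on disk
        self.log = []            # stands in for the log on "stable" storage

    def write(self, txn, key, value):
        # Write-ahead: log the old value before the data itself is changed.
        self.log.append(("UPDATE", txn, key, self.data[key]))
        self.data[key] = value

    def commit(self, txn):
        self.log.append(("COMMIT", txn))

    def recover(self):
        committed = {rec[1] for rec in self.log if rec[0] == "COMMIT"}
        # Undo, newest first, every update made by a TXN that never committed.
        for rec in reversed(self.log):
            if rec[0] == "UPDATE" and rec[1] not in committed:
                _, _, key, old_value = rec
                self.data[key] = old_value

db = TinyDB({"a10": 20000, "a20": 15000})
db.write("T0", "a20", 15500)   # T0 runs to completion
db.commit("T0")
db.write("T1", "a10", 17000)   # T1 debits a10, then the system "crashes"
db.recover()
after_recovery = dict(db.data)
print(after_recovery)
```

After recovery, the committed write of T0 survives and the partial debit of T1 is gone.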
Challenges for ACID properties
• In spite of failures: Power failures, but not media failures
• Users may abort the program: need to "rollback the changes"
  • Need to log what happened
• Many users executing concurrently
  • Can be solved via locking (we'll see this next lecture!)
And all this with… Performance!!