Ibis: A Provenance Manager for Multi-Layer Systems
Christopher Olston & Anish Das Sarma, Yahoo! Research
Motivation: Many Sub-Systems
• scalable file system, e.g. GFS
• distributed sorting & hashing, e.g. Map-Reduce
• dataflow programming framework, e.g. Pig
• workflow manager, e.g. Oozie
• low-latency processor
• serving
• ingestion
[Diagram labels: datum X; datum Y; metadata queries; "provenance of X?"]
Ibis Project
• Benefits:
  – Provide uniform view to users
  – Factor out metadata management code
  – Decouple metadata lifetime from data/subsystem lifetime
• Challenges:
  – Overhead of shipping metadata
  – Disparate data/processing granularities
[Diagram labels: data processing sub-systems; metadata manager; users; metadata; Ibis; integrated metadata; metadata queries; answers; THIS PAPER]
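Example Granularity Lattices
[Two lattice diagrams; the edges between granularities are not recoverable from the transcript]
data granularities: Table, Column group, Row, Column, Cell, Version, Web page
process granularities: Workflow, MR program, Pig script, Pig job, Pig logical operation, Pig physical operation, MR job, MR job phase, MR task, Task attempt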
Challenges
• Inference: Given relationships expressed at one granularity, answer queries about other granularities (the semantics are tricky here!)
• Efficiency: Implement inference without resorting to materializing everything in terms of the finest granularity (e.g. cells)
Talk Outline
• Informal overview
  – Example data provenance graph
  – Query language overview + examples
• Touch on formal model (details in paper)
Example Workflow
[Diagram labels: IMDb Extract; Y! Extract; Merge; Extracted Y!; Extracted IMDb; Movie DB; IMDB web page; Yahoo! Movies web page; extract pig script]

title      year  lead actor
Avatar     2009  V1: Worthington, V2: Saldana
Inception  2010  DiCaprio

title      year  lead actor
Avatar     2009  Saldana
Inception  2010  DiCaprio

title      year  lead actor
Avatar     2009  Worthington
Inception  2010  DiCaprio
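Provenance Graph
[Graph diagram; edges are not recoverable from the transcript]
Data vertices: Yahoo! Movies web page; IMDB web page; map output 1; map output 2; Yahoo extracted table; IMDB extracted table; combined table
Process vertices: merge pig script; pig job 1; pig job 2; map task 1, attempt 1; map task 2, attempt 1; reduce task 1, attempt 1
Attribute annotations: version=3, wrapper=yahoo; version=2, wrapper=imdb; license=yahoo, auth. score=5; license=imdb, auth. score=4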
Meaning of Provenance Relationships
• (P, D1, D2): Process P consumed PART OF datum D1 and emitted ALL OF datum D2
• "part→all" semantics are a natural default
• Upshot: if D1 and D2 are tables, cannot infer that a given row in D1 influenced D2
• In the query language, can still ask "part→part" questions: d2 ∈ D2 such that D1 influenced d2?
Query Language: "IQL"
• SQL-style language for querying the provenance graph
• Special constructs:
  – Under (containment): Is row R under table T?
  – Influence: Does data D1 influence data D2?
  – Feed: Does data D feed process P?
  – Emit: Does process P emit data D?
IQL Examples
• Find data items that influenced the combined extracted table:
    select d.id
    from AnyData d, Table t
    where d influences t and t.id = (combined extracted table);
• Find data tables that are "contaminated" by version 3 of the extraction script (found to have a bug):
    select t.id
    from PigScript p, PigJob j, AnyData d1, AnyData d2, Table t
    where p.id = (extract pig script) and j under p and j.version = 3
      and j emits d1 and d1 influences d2 and d2 under t;
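To make the first query concrete, here is a minimal Python sketch of how an "influences" query could be rewritten over SQLite. It is not the Ibis engine: the `feeds` / `emits` tables, the job names, and the recursive-CTE rewriting are all assumptions made for illustration, and the sketch ignores containment ("under") and granularities entirely, treating influence as plain reachability over feed/emit edges.

```python
import sqlite3

# Hypothetical schema: one row per "data feeds process" and
# "process emits data" edge. All identifiers are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feeds (data_id TEXT, process_id TEXT);
    CREATE TABLE emits (process_id TEXT, data_id TEXT);
""")
conn.executemany("INSERT INTO feeds VALUES (?, ?)", [
    ("yahoo_movies_web_page", "yahoo_extract_job"),
    ("imdb_web_page", "imdb_extract_job"),
    ("yahoo_extracted_table", "merge_job"),
    ("imdb_extracted_table", "merge_job"),
])
conn.executemany("INSERT INTO emits VALUES (?, ?)", [
    ("yahoo_extract_job", "yahoo_extracted_table"),
    ("imdb_extract_job", "imdb_extracted_table"),
    ("merge_job", "combined_table"),
])

# Walk backwards from the target: a datum is an influencer if it feeds a
# process that emits the target or emits another influencer.
INFLUENCERS_SQL = """
WITH RECURSIVE influencers(id) AS (
    SELECT f.data_id
    FROM feeds f JOIN emits e ON f.process_id = e.process_id
    WHERE e.data_id = :target
  UNION
    SELECT f.data_id
    FROM feeds f
    JOIN emits e ON f.process_id = e.process_id
    JOIN influencers i ON e.data_id = i.id
)
SELECT id FROM influencers ORDER BY id
"""

def influencers(target):
    """Return ids of all data items that transitively feed into target."""
    rows = conn.execute(INFLUENCERS_SQL, {"target": target}).fetchall()
    return [row[0] for row in rows]

print(influencers("combined_table"))
# ['imdb_extracted_table', 'imdb_web_page',
#  'yahoo_extracted_table', 'yahoo_movies_web_page']
```

Unlike IQL's influence relation, this sketch does not report the target itself or any element contained in an influencer; handling those cases requires the containment inference described later in the talk.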
Implementation Status
• We have a working storage/query engine based on rewriting over SQL/RDBMS (SQLite)
• We're currently working on automatic provenance capture (from Pig, Hadoop, etc.)
Talk Outline
• Informal overview
  – Example data provenance graph
  – Query language overview + examples
• Touch on formal model (details in paper)
  – Open-world semantics
  – Transitive inference of containment & influence
Open-World Semantics
• Metadata of Ibis encodes set F of facts
• Open-world:
  – Correctness: All facts in F are correct
  – Incomplete: Maybe other facts unknown to Ibis
• Extension, ext(F), of facts that can be derived from F
• True world has set of facts F'
• We have F ⊆ ext(F) ⊆ F'
Open-World Semantics: One Implication
• Suppose F contains:
  – Process p emitted row r1
• Currently r1 is the only row in table T
• "Process p emitted table T" is a fact that may be in F' (true world) but cannot be inferred in ext(F)
Inferring "X is under Y"
• Defined in terms of "granularization":
  1. Resolve X and Y into finest-grain elements (e.g. cells)
  2. Perform set containment check
• Implemented via a shortcut that avoids enumerating sub-elements
• Proof that implementation & definition are equivalent
Inferring "X is under Y"
Basic element b defined by granularity g, direct parents P (and an identifier).
Granularization of b to finest granularity gmin defined by:
  G(b) = { b' = (gmin, P') | b contains b' }
Containment obtained by recursive application of parent relation.
Complex element E defined by set of granularities {g1, …, gn}, and corresponding basic elements {b1, …, bn}.
Granularization of complex element E consisting of b1, …, bn is:
  G(E) = ⋂i G(bi)
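The definitions above can be sketched in Python as follows. This is an illustrative reading, not code from Ibis: the `Basic` class, the `"cell"` name for gmin, and the toy table are assumptions, and a complex element is taken to denote the intersection of its basic elements (for example, a row together with a column identifies a single cell), which is the reading consistent with the basic-element containment rule of Check-2. The sketch enumerates only cells known to the system, so under open-world semantics the result is a lower bound on the true granularization.

```python
from dataclasses import dataclass

FINEST = "cell"  # assumed name of the finest granularity (gmin)

@dataclass(frozen=True)
class Basic:
    """A basic element: identifier, granularity, and direct parents."""
    ident: str
    granularity: str
    parents: frozenset = frozenset()

def contains(b, other):
    """b contains other if they are equal or b contains a parent of other
    (recursive application of the parent relation)."""
    return b == other or any(contains(b, p) for p in other.parents)

def granularize(b, known):
    """G(b): the known finest-grain elements contained in b."""
    return {c for c in known if c.granularity == FINEST and contains(b, c)}

def granularize_complex(element, known):
    """G(E): intersection of G(b_i) over the basic elements of E."""
    sets = [granularize(b, known) for b in element]
    return set.intersection(*sets) if sets else set()

# Toy table T with two rows and two columns; each cell has a row parent
# and a column parent.
table = Basic("T", "table")
r1 = Basic("r1", "row", frozenset({table}))
r2 = Basic("r2", "row", frozenset({table}))
col_title = Basic("title", "column", frozenset({table}))
col_year = Basic("year", "column", frozenset({table}))
cells = {
    Basic(f"{r.ident}.{c.ident}", FINEST, frozenset({r, c}))
    for r in (r1, r2)
    for c in (col_title, col_year)
}

print(sorted(c.ident for c in granularize(r1, cells)))
# ['r1.title', 'r1.year']
print(sorted(c.ident for c in granularize_complex({r1, col_year}, cells)))
# ['r1.year']
```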
Inferring "X is under Y"
Under Check-1: Sets of complex elements E1, E2. E1 is under E2 iff there does not exist a true world with
  ⋃{ G(e1) : e1 ∈ E1 } ⊈ ⋃{ G(e2) : e2 ∈ E2 }
Efficient Under Check-2: Sets of complex elements E1, E2. E1 is under E2 iff for all e1 ∈ E1, there exists e2 ∈ E2 such that e1 is under e2.
Given complex elements e1 and e2 with basic element sets B(e1) and B(e2), e1 is under e2 iff for all b2 ∈ B(e2), there exists b1 ∈ B(e1) such that b2 contains b1.
Theorem: Check-1 is equivalent to Check-2.
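A minimal Python sketch of Check-2 follows. It is one possible reading rather than the Ibis implementation: the `Basic` class and the toy elements are assumptions, and complex elements are modelled as frozensets of basic elements. The check only walks parent links, so it never enumerates cells.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Basic:
    """A basic element: identifier, granularity, and direct parents."""
    ident: str
    granularity: str
    parents: frozenset = frozenset()

def contains(b, other):
    """b contains other if they are equal or b contains a parent of other."""
    return b == other or any(contains(b, p) for p in other.parents)

def complex_under(e1, e2):
    """e1 is under e2 iff every basic element of e2 contains some basic
    element of e1."""
    return all(any(contains(b2, b1) for b1 in e1) for b2 in e2)

def under(E1, E2):
    """Check-2: every complex element in E1 is under some element of E2."""
    return all(any(complex_under(e1, e2) for e2 in E2) for e1 in E1)

# Toy elements: table T, row r1 of T, column "year" of T.
table = Basic("T", "table")
r1 = Basic("r1", "row", frozenset({table}))
col_year = Basic("year", "column", frozenset({table}))

cell = frozenset({r1, col_year})   # the cell at row r1, column year
row = frozenset({r1})
whole_table = frozenset({table})

print(under([cell], [row]))          # True: the cell lies within row r1
print(under([row], [cell]))          # False: the row is not within one cell
print(under([row], [whole_table]))   # True
print(under([whole_table], [row]))   # False
```

The last result matches the open-world reasoning earlier in the talk: even if r1 is currently the only known row of T, the table cannot be inferred to be under r1, because the true world may contain further rows.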
Inferring "X influences Y"
Given two data vertices d1 and d2:
(1) d1 influences(0) d2 iff d2 is under d1;
(2) d1 influences(1) d2 iff one of the following holds:
    (A) d1 influences(0) d2
    (B) there exists a provenance relationship (d1', p, d2') such that d1 influences(0) d1' and d2' influences(0) d2
(3) For any integer k > 1, d1 influences(k) d2 iff there exists d* such that d1 influences(1) d* and d* influences(k-1) d2
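The three rules can be transcribed almost directly into Python. The sketch below is illustrative only: the vertex names, the `PARENT` map, and the restriction to basic elements with a single parent are simplifying assumptions, and candidates for d* are drawn from the known data vertices.

```python
# Hypothetical toy graph: each row has one parent table.
PARENT = {
    "yahoo_row_avatar": "yahoo_table",
    "combined_row_avatar": "combined_table",
}
# Provenance relationships written (d1', p, d2') as in rule (2B):
# process p consumed part of d1' and emitted all of d2'.
RELATIONSHIPS = [
    ("yahoo_page", "extract_job", "yahoo_table"),
    ("yahoo_table", "merge_job", "combined_table"),
]
DATA = {
    "yahoo_page", "yahoo_table", "yahoo_row_avatar",
    "combined_table", "combined_row_avatar",
}

def under(x, y):
    """x is under y if y is x itself or one of x's ancestors."""
    while x is not None:
        if x == y:
            return True
        x = PARENT.get(x)
    return False

def influences0(d1, d2):
    """Rule (1): d1 influences(0) d2 iff d2 is under d1."""
    return under(d2, d1)

def influences1(d1, d2):
    """Rule (2): containment, or a single provenance relationship."""
    return influences0(d1, d2) or any(
        influences0(d1, src) and influences0(dst, d2)
        for src, _process, dst in RELATIONSHIPS
    )

def influences(d1, d2, k):
    """Rule (3): chain one influences(1) step with influences(k-1)."""
    if k == 0:
        return influences0(d1, d2)
    if k == 1:
        return influences1(d1, d2)
    return any(
        influences1(d1, d) and influences(d, d2, k - 1) for d in DATA
    )

print(influences("yahoo_page", "combined_row_avatar", 2))   # True
print(influences("yahoo_page", "combined_row_avatar", 1))   # False
print(influences("yahoo_row_avatar", "combined_table", 2))  # False
```

The last result reflects the part-to-all semantics: merge_job consumed part of yahoo_table, so no particular row of that table can be inferred to have influenced the combined table.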
Related Work
• Multi-layer system provenance:
  – Harvard PASSv2
• Nested collections in scientific workflow provenance:
  – Kepler's COMAD nested collections
  – ZOOM user views
  – Open provenance model
• Annotations on arbitrary sub-regions of relations:
  – [Eltabakh et al.]
  – [Srivastava et al.]
Summary
• Many semi-independent data mgmt. layers + provenance query needs → integrated provenance
• Diverse data & process granularities → careful semantics
• Our contributions:
  – Formal multi-granularity provenance semantics
  – Query language
  – Working prototype (see paper; work in progress)