Databases with Uncertainty and Lineage


Upload: tyler

Post on 09-Jan-2016


DESCRIPTION

Databases With Uncertainty And Lineage. Written by: Omar Benjelloun, Anish Das Sarma, Alon Halevy, Martin Theobald, Jennifer Widom. Presented by: Alex Gorodetsky. Outline: Lineage? Uncertainty? Combined together? The logic behind the given solution.

TRANSCRIPT


Databases With Uncertainty And Lineage
Written by: Omar Benjelloun, Anish Das Sarma, Alon Halevy, Martin Theobald, Jennifer Widom
Presented by: Alex Gorodetsky

Outline
- Lineage? Uncertainty? Combined together?
- The logic behind the given solution
- The solution - The Trio system
- Conclusions and future work

Databases with lineage
- In databases with lineage (LDBs) we focus on where-lineage: the lineage of a tuple identifies the data from which it was derived.
- Some tuples in an LDB are derived from other LDB tuples, e.g., as a result of queries.
- The lineage of derived tuples consists of references to other tuples in the LDB, via their unique identifiers.
- Base tuples in some cases are derived from entities outside the LDB, such as an external data set or a sensor feed.

Databases with lineage - Example: crime-solver database
- In this example, the lineage of the tuples in the result is obvious.
- Some operations have less obvious lineage functions, such as negation, duplicate-elimination, and aggregation.
- The operations we consider in this paper: simple, well-behaved lineage functions.

Consider LDB relations Drives and Saw, representing driver information and crime-vehicle sightings respectively. The ID column denotes the tuple identifiers. Empty lineage is omitted.
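Because lineage refers to other tuples by identifier, tracing a derived tuple back to its base data amounts to a simple graph walk. The following is a minimal sketch, not taken from the paper: the identifiers, the `Suspects` relation, and the `trace_to_base` helper are all hypothetical, chosen only to show lineage being followed through two layers of derivation.

```python
# Hypothetical LDB: every tuple has a unique identifier, and lineage maps a
# derived tuple's identifier to the identifiers it was derived from.
tuples = {
    11: ("Saw", ("Amy", "Mazda")),
    21: ("Drives", ("Jimmy", "Mazda")),
    31: ("Accuses", ("Amy", "Jimmy")),   # derived by joining Saw and Drives
    41: ("Suspects", ("Jimmy",)),        # hypothetical second-level result
}
lineage = {
    31: {11, 21},
    41: {31},
}  # base tuples have empty lineage, which is omitted


def trace_to_base(tid):
    """Return the identifiers of the base tuples that tid ultimately derives from."""
    sources = lineage.get(tid, set())
    if not sources:              # empty lineage: tid is itself a base tuple
        return {tid}
    base = set()
    for src in sources:
        base |= trace_to_base(src)
    return base


for base_id in sorted(trace_to_base(41)):
    print(base_id, tuples[base_id])
```

The recursion assumes lineage is acyclic; a cyclic lineage map would need a visited set.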

As discussed in the introduction, lineage is particularly important in data integration settings. If base relations are derived from different data sources, external lineage may be used to convey that information, e.g., by encoding the URI of the source as part of external symbol identifiers. Since LDBs keep track of lineage, data integration applications can easily query the lineage to examine how tuples potentially derived through multiple layers of complex queries relate to the original sources.

Uncertain databases
- An uncertain database represents a set of possible instances, each of which is one possible state of the database.
- Definition: An x-tuple is a multiset of tuples, called alternatives.
- Definition: A maybe x-tuple is an x-tuple annotated with a ?.
- Definition: An x-relation is a multiset of x-tuples.

An x-relation R represents the set of possible instances P that can be constructed as follows: choose exactly one alternative from each x-tuple in R that is not a maybe x-tuple, and choose zero or one alternative from each x-tuple in R that is a maybe x-tuple.

Orthogonality is a system design property which guarantees that modifying the technical effect produced by a component of a system neither creates nor propagates side effects to other components of the system.

Uncertain databases
- X-relations are a specific formalism for uncertain databases.
- X-relations provide a good balance of simplicity and expressiveness, and are orthogonal to the capabilities brought in by lineage.

Uncertain databases - Example

Here, Amy may have seen a Mazda, a Toyota, or no car at all, and the relation has three possible instances.

X-relations are not a complete formalism. For example, the join result Accuses cannot be represented as an x-relation: x-tuples are independent, so they cannot express the fact that if Amy accuses Jimmy (due to the Mazda), then she must accuse Billy as well.

We will soon see that although x-relations alone are incomplete, adding lineage makes them complete.
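The three possible instances of Amy's sighting can be enumerated mechanically from the x-relation definition. This is a minimal sketch: the `(alternatives, is_maybe)` encoding and the `possible_instances` helper are assumptions made for illustration, and the Saw contents are reconstructed from the sentence above rather than copied from the slide's table.

```python
from itertools import product

# An x-tuple is modelled as (alternatives, is_maybe).
saw = [
    ([("Amy", "Mazda"), ("Amy", "Toyota")], True),   # (Amy,Mazda) || (Amy,Toyota) ?
]


def possible_instances(x_relation):
    """Enumerate the possible instances of an x-relation as lists of tuples."""
    per_x_tuple = []
    for alternatives, is_maybe in x_relation:
        choices = [[alt] for alt in alternatives]   # exactly one alternative...
        if is_maybe:
            choices.append([])                      # ...or none for a maybe x-tuple
        per_x_tuple.append(choices)
    # x-tuples are independent, so instances are the cross product of the choices
    return [sum(combo, []) for combo in product(*per_x_tuple)]


instances = possible_instances(saw)
for instance in instances:
    print(instance)
```

Adding a certain x-tuple, such as a single-alternative sighting by another witness, leaves the number of instances unchanged because it contributes exactly one choice.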


Data Integration Systems
- Systems that offer a uniform interface to a multitude of data sources.
- We will focus on semantic integration.

Data Integration - Any Problem?
- Subjectivity effect - No standard.
- As data integration applications strive to offer a single objective and coherent integrated view of data sources, uncertainty is bound to appear.
- Three main kinds of uncertainty in data integration applications:
  - Automatic extraction of data from unstructured data.
  - Mappings between the schemas.
  - Mappings between data objects in different sources.

Uncertainty, Lineage and Data Integration
- Consider a setting in which uncertainty-generating data integration operations are performed.
- Lineage keeps track of the origins of data:
  - Manages uncertainty
  - Explains uncertainty
  - Potentially corrects the uncertainty
- So maybe we need to consider combining uncertainty with lineage in databases?

Combining lineage and uncertainty

Combining lineage and uncertainty - Example
We combine the uncertain Saw x-relation with the earlier Drives relation to create a new version of Accuses that has both uncertainty and lineage.


Combining lineage and uncertainty - Example
Consider the choices for x-tuple 21 of Saw: do we satisfy all the conditions?

Why did we add them? To satisfy condition 3.

Combining lineage and uncertainty - Example
Can we add (42,1)? No. Why? Condition 2 would be violated.

Combining lineage and uncertainty - Example
All in all we have 3 possible LDBs.

$R_k$ includes exactly the alternatives of x-tuples in $R$ such that $s(i, j) \in S_k$. $\lambda_k$ is the restriction of $\lambda$ to $S_k$.

Note: not all combinations of the maybe x-tuples in Accuses are included in the possible instances.
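The slide that listed the numbered conditions is not part of this transcript, so the sketch below rests on an assumed reading of them, inferred from how the examples use "condition 2" and "condition 3": (1) at most one alternative is chosen per x-tuple, (2) the lineage of every chosen alternative is also chosen, and (3) an x-tuple may be left out only if it is a maybe x-tuple and none of its alternatives has non-empty lineage that is fully chosen. The ULDB contents, `is_valid`, and `possible_instances` are hypothetical and only loosely modelled on the Saw, Drives, and Accuses example.

```python
from itertools import product

# Hypothetical ULDB: x-tuple id -> (number of alternatives, is_maybe).
x_tuples = {
    21: (2, True),    # Saw: (Amy,Mazda) || (Amy,Toyota) ?
    31: (1, False),   # Drives: (Jimmy,Mazda)
    32: (1, False),   # Drives: (Jimmy,Toyota)
    33: (1, False),   # Drives: (Billy,Mazda)
    41: (1, True),    # Accuses: (Amy,Jimmy) ?
    42: (1, True),    # Accuses: (Amy,Jimmy) ?
    43: (1, True),    # Accuses: (Amy,Billy) ?
}
# Lineage of each derived alternative (i, j); base alternatives are omitted.
lineage = {
    (41, 1): {(21, 1), (31, 1)},
    (42, 1): {(21, 2), (32, 1)},
    (43, 1): {(21, 1), (33, 1)},
}


def is_valid(selected):
    """Check the assumed conditions 2 and 3 for a set of chosen alternatives."""
    # Condition 2 (assumed): the lineage of every chosen alternative is chosen.
    for alt in selected:
        if not lineage.get(alt, set()) <= selected:
            return False
    # Condition 3 (assumed): an x-tuple may be absent only if it is a maybe
    # x-tuple and no alternative has non-empty lineage that is fully chosen.
    chosen_ids = {i for i, _ in selected}
    for i, (n_alts, is_maybe) in x_tuples.items():
        if i in chosen_ids:
            continue
        if not is_maybe:
            return False
        for j in range(1, n_alts + 1):
            lin = lineage.get((i, j), set())
            if lin and lin <= selected:
                return False
    return True


def possible_instances():
    """Enumerate every valid set S_k of chosen alternatives."""
    ids = sorted(x_tuples)
    # Condition 1 (assumed): at most one alternative per x-tuple (0 means none).
    options = [range(0, x_tuples[i][0] + 1) for i in ids]
    result = []
    for combo in product(*options):
        selected = {(i, j) for i, j in zip(ids, combo) if j != 0}
        if is_valid(selected):
            result.append(selected)
    return result


valid_sets = possible_instances()
for s_k in valid_sets:
    print(sorted(s_k))
```

In this hypothetical data, choosing alternative (21,1) forces (41,1) and (43,1) into the instance and excludes (42,1), which mirrors the questions asked in the examples above.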

Combining lineage and uncertainty - Completeness

Combining lineage and uncertainty - Well-behaved lineage

Combining lineage and uncertainty - DL-monotonic queries
- Intuitively, any operation that can produce its results in a tuple-by-tuple fashion is DL-monotonic.
- Aggregation, duplicate-elimination, and some set operators are not DL-monotonic.
- From now on we assume all queries Q to be DL-monotonic.
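A join is one operation that produces its results tuple by tuple. The sketch below is not Algorithm 1 itself (its steps are not reproduced in this transcript); it only illustrates the tuple-by-tuple idea, using the Saw and Drives alternatives from the query-evaluation example in these slides. The `join_with_lineage` helper and the dictionary encoding are assumptions, and grouping the result alternatives into x-tuples is deliberately left out.

```python
# Alternatives keyed by (x-tuple id, alternative index), as listed in the
# Saw and Drives tables of the query-evaluation example.
saw = {
    (21, 1): ("Amy", "Mazda"),
    (21, 2): ("Amy", "Toyota"),
    (22, 1): ("Betty", "Honda"),
}
drives = {
    (31, 1): ("Jimmy", "Mazda"),
    (32, 1): ("Jimmy", "Toyota"),
    (33, 1): ("Billy", "Mazda"),
    (34, 1): ("Billy", "Honda"),
}


def join_with_lineage(saw, drives):
    """Join Saw and Drives on car, one pair of alternatives at a time.

    Returns (result_tuple, lineage) pairs, where lineage is the set of
    input alternatives the result tuple was derived from.
    """
    results = []
    for saw_id, (witness, seen_car) in saw.items():
        for drives_id, (person, driven_car) in drives.items():
            if seen_car == driven_car:
                results.append(((witness, person), frozenset({saw_id, drives_id})))
    return results


accuses = join_with_lineage(saw, drives)
for row, lin in accuses:
    print(row, sorted(lin))
```

Each result row depends only on one pair of input alternatives, which is why its lineage can be recorded at the moment the row is produced.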

Combining lineage and uncertainty - Query evaluation (Algorithm 1)

Input x-relations:

Saw (witness, car)              ID
(Amy,Mazda) || (Amy,Toyota)     21
(Betty,Honda)                   22

Drives (person, car)    ID
(Jimmy,Mazda)           31
(Jimmy,Toyota)          32
(Billy,Mazda)           33
(Billy,Honda)           34

The same relations, one alternative per row:

Saw (witness, car)    ID
(Amy,Mazda)           (21,1)
(Amy,Toyota)          (21,2)
(Betty,Honda)         (22,1)

Drives (person, car)    ID
(Jimmy,Mazda)           (31,1)
(Jimmy,Toyota)          (32,1)
(Billy,Mazda)           (33,1)
(Billy,Honda)           (34,1)

Result of applying Q, one alternative per row:

Accuses (witness, person)    ID
(Amy,Jimmy)                  (41,1)
(Amy,Jimmy)                  (41,2)
(Amy,Billy)                  (41,3)
(Betty,Billy)                (42,1)

Result x-relation:

Accuses (witness, person)                    ID
(Amy,Jimmy) || (Amy,Jimmy) || (Amy,Billy)    41
(Betty,Billy)                                42

Lineage of the result alternatives:

ID        Lineage
(41,1)    {(21,1),(31,1)}
(41,2)    {(21,2),(32,1)}
(41,3)    {(21,1),(33,1)}
(42,1)    {(22,1),(34,1)}

Combining lineage and uncertainty - Applying a query to a ULDB
Theorem: Given a ULDB D and a query Q:
- Algorithm 1 returns Q(D).
- If D is a well-behaved ULDB, then so is Q(D).

Combining lineage and uncertainty - ULDB minimality
- ULDBs do not have a unique representation.
- We can have two different x-relations that have exactly the same set of possible instances.
- Two notions of ULDB minimality:
  - Data minimality
  - Lineage minimality

Combining lineage and uncertainty - Data minimality
D-minimality: A ULDB D is D-minimal if it does not include any extraneous alternatives or ?s.

An alternative (i, j) of an x-tuple t_i in a ULDB D is said to be extraneous if removing it from the x-relation does not change the possible instances of D. A ? on an x-tuple in D is said to be extraneous if removing it does not change the possible instances of D.

Combining lineage and uncertainty - Lazy algorithm for D-minimization
- Extraneous ?: (44,1)
- Extraneous alternative (search recursively): (Carol,Acura,Lexus)

Combining lineage and uncertainty - Lineage minimality

Combining lineage and uncertainty - Extraction
- Extraction is important in the context of data integration.
- A flexible way to bring into the ULDB just the data that is needed from multiple external sources.
- The extracted x-relations preserve their information, while discarding irrelevant data and lineage.


The Trio system
- The Trio system: a relational DBMS that supports uncertainty and lineage.
- Based on the ULDB data model, and accepts queries in the TriQL language.
- TriQL: an extension of SQL with uncertainty- and lineage-specific features.

The Trio system
Architecture diagram labels: Standard relational DBMS; Standard SQL.

The Python layer presents a simple Trio API that extends the standard Python DB-API 2.0 for database access (Python's analog of JDBC). The Trio API accepts TriQL queries in addition to regular SQL, and query results may be x-tuples as well as regular tuples. The API also exposes lineage tracing, along with the other ULDB-specific operations.
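For readers unfamiliar with the Python DB-API 2.0, the connect, cursor, execute, fetch cycle that the Trio API is said to extend looks roughly like this. The sketch uses Python's built-in `sqlite3` module as a stand-in for a standard relational DBMS; it is not the Trio API, whose module and function names do not appear in these slides, and the table contents are only illustrative.

```python
import sqlite3

# sqlite3 implements the Python DB-API 2.0. Nothing here is Trio-specific.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE Drives (person TEXT, car TEXT)")
cur.executemany(
    "INSERT INTO Drives (person, car) VALUES (?, ?)",
    [("Jimmy", "Mazda"), ("Jimmy", "Toyota"), ("Billy", "Mazda"), ("Billy", "Honda")],
)

# Regular SQL through a cursor. According to the slides, the Trio API accepts
# TriQL at this point as well, and may hand back x-tuples instead of rows.
cur.execute("SELECT person FROM Drives WHERE car = ? ORDER BY person", ("Mazda",))
mazda_drivers = [row[0] for row in cur.fetchall()]
print(mazda_drivers)

conn.close()
```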

Trio API and translator (Python)
Using the Trio API, the authors constructed:
- A generic command-line interactive client similar to that provided by most DBMSs.
- A fully-featured graphical user interface called TrioExplorer.

Command-line client
TrioExplorer (GUI client)

The Trio system - Encoding ULDB data
New attributes:
- aid: a unique alternative identifier.
- xid: identifies the x-tuple that this alternative belongs to.
- conf: stores the confidence of the alternative.
- num: a nonnegative integer that tracks whether the alternative's x-tuple has a ?.

The Trio system - Encoding ULDB data
Lineage table lin_T with attributes aid and src_aid.

The Trio system - Encoding ULDB data - Example
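As a rough illustration of this encoding, the sketch below stores the Accuses alternatives from the query-evaluation example in an ordinary SQL table with the extra attributes, plus a lineage table. Everything beyond the attribute names is an assumption: the table names `Accuses_enc` and `lin_Accuses`, the aid numbering (x-tuple id followed by alternative index), and the helper functions are hypothetical, and `conf` and `num` are left NULL because their value conventions are not spelled out in the surviving slide text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Data columns plus the four extra attributes named on the slide.
cur.execute("""
    CREATE TABLE Accuses_enc (
        aid     INTEGER PRIMARY KEY,  -- unique alternative identifier
        xid     INTEGER,              -- x-tuple this alternative belongs to
        conf    REAL,                 -- confidence (left NULL in this sketch)
        num     INTEGER,              -- '?' tracker (left NULL in this sketch)
        witness TEXT,
        person  TEXT
    )
""")
# Lineage table: one row per (alternative, source alternative) pair.
cur.execute("CREATE TABLE lin_Accuses (aid INTEGER, src_aid INTEGER)")

cur.executemany(
    "INSERT INTO Accuses_enc (aid, xid, witness, person) VALUES (?, ?, ?, ?)",
    [(411, 41, "Amy", "Jimmy"), (412, 41, "Amy", "Jimmy"),
     (413, 41, "Amy", "Billy"), (421, 42, "Betty", "Billy")],
)
cur.executemany(
    "INSERT INTO lin_Accuses (aid, src_aid) VALUES (?, ?)",
    [(411, 211), (411, 311), (412, 212), (412, 321),
     (413, 211), (413, 331), (421, 221), (421, 341)],
)


def x_tuples(cur):
    """Rebuild x-tuples by grouping alternatives on xid."""
    cur.execute("SELECT xid, witness, person FROM Accuses_enc ORDER BY xid, aid")
    grouped = {}
    for xid, witness, person in cur.fetchall():
        grouped.setdefault(xid, []).append((witness, person))
    return grouped


def lineage_of(cur, aid):
    """Return the source alternative identifiers recorded for one alternative."""
    cur.execute(
        "SELECT src_aid FROM lin_Accuses WHERE aid = ? ORDER BY src_aid", (aid,)
    )
    return [row[0] for row in cur.fetchall()]


print(x_tuples(cur))
print(lineage_of(cur, 413))
```

Keeping one row per alternative means x-tuples never appear as nested values; they are recovered by grouping on xid, and lineage is recovered by a lookup or join on aid.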

The Trio system - TriQL to SQL query
- Trio sends the translated SQL query to the underlying DBMS and opens a cursor on the result.
- Trio stored procedures:
  - Tfetch: a cursor call to the Trio API for the original TriQL query.
  - Sfetch: a cursor call to the underlying DBMS for the translated SQL query.

The Trio system - TriQL to SQL query
- Enable Tfetch to collect all SQL result tuples for a single pair.
- To propagate the ? annotations, multiply the num values of the underlying base tuples.
- These values, together with the table names, comprise the lineage for the alternatives in the result x-tuple.

Accuses (witness, person)    ID
(Amy,Jimmy)                  (41,1)
(Amy,Jimmy)                  (41,2)
(Amy,Billy)                  (41,3)
(Betty,Billy)                (42,1)

The Trio system - Trio queries
Definition: TriQL (Trio's query language for ULDBs):
- Extension of SQL.
- TriQL queries return:
  - Uncertain relations
  - Lineage that connects query result data to the queried data.

The Trio system - Built-in predicates and functions

The Trio system - More possibilities
- TrioExplorer visualizes the lineage.
- The Trio system demo
- Coexistence checks
- Extraneous data removal

Conclusions and future work

- The authors are not aware of any previously proposed formal data representation that integrates both lineage and uncertainty.
- We have seen that ULDBs are a good way to solve the semantic problems of data integration.
- A boost in performance may be achieved over computing query operators and confidences in tandem.
- ULDBs are not expressive enough to fully represent complex operations.
- Extend ULDBs with richer primitives to support data integration, at a cost in complexity.
- Keep data integration outside the ULDB box and transfer the lineage and uncertainty primitives to data integration systems.
- Extending techniques and results: duplicate-elimination, aggregation, and negation.

Questions?

Thank you!