
ARGUS: Rete + DBMS = Efficient Persistent Profile Matching on Large-Volume Data Streams

Chun Jin

Language Technologies Institute
School of Computer Science
Carnegie Mellon
[email protected]

Stream Processing Model

Stream processing is becoming increasingly demanding and prevalent.

[Figure: stream processing model. Labels: Data Streams, Storage, Output.]


Stream Databases

Stream Database Applications: Network Traffic Analysis and Router Configuration; Dynamic Internet Services; Sensor Data Analysis; Anomaly Detection

Stream Database Projects: STREAM, TelegraphCQ, Aurora; NiagaraCQ, OpenCQ, WebCQ; Gigascope, Tribeca; Tapestry, Alert, Tukwila, etc.; ARGUS


Stream Anomaly Monitoring Systems (SAMS)

SAMS monitors structured data streams for anomalies or potential hazards.

Query matches may be high-urgency alerts, so prompt detection is desirable.

Satisfaction of a SAMS query is often rare (very-high-selectivity).


SAMS Dataflow

[Figure: SAMS dataflow. Data Streams (FedWire Money Transfers, Patient Records) feed the Stream Anomaly Monitoring System, which is backed by Storage; the Analyst submits Queries and receives Alerts.]


Challenges to SAMS

Persistent queries may number in thousands or tens of thousands.

Daily stream volumes may exceed millions of records.

Prompt detections are desirable.

Very-high-selectivity Query Property.


Proposed ARGUS Approach

Basic Framework: incremental evaluation schemes (Adapted Rete algorithm) upon a traditional DBMS platform

Rete (Forgy 1982): incremental evaluation based on materialized intermediate results.

Exploiting Very-High-Selectivity Query Property: Transitivity Inference; Conditional Materialization

Optimizing Join Order

Computation Sharing

Related to Other Applications: Stream Databases; Modern DBMS Query Optimization


Query Example 4

Suppose that for every big transaction of type code 1000, the analyst wants to check whether the money stayed in the bank or left within ten days. An additional sign of possible fraud is that the transactions involve at least one intermediate bank. The query generates an alarm whenever the receiver of a large transaction (over $1,000,000) transfers at least half of the money further within ten days of this transaction using an intermediate bank.


SQL Query for Example 4

SELECT *  -- the select list did not survive in this copy of the slide; * is a placeholder
FROM transaction r1, transaction r2, transaction r3
WHERE r2.type_code = 1000 AND
      r3.type_code = 1000 AND
      r1.type_code = 1000 AND
      r1.amount > 1000000 AND
      r1.rbank_aba = r2.sbank_aba AND
      r1.benef_account = r2.orig_account AND
      r2.amount > 0.5 * r1.amount AND
      r1.tran_date <= r2.tran_date AND
      r2.tran_date <= r1.tran_date + 10 AND
      r2.rbank_aba = r3.sbank_aba AND
      r2.benef_account = r3.orig_account AND
      r2.amount = r3.amount AND
      r2.tran_date <= r3.tran_date AND
      r3.tran_date <= r2.tran_date + 10;
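
The Example 4 query can be tried on a toy table. The following is a minimal sketch, not the ARGUS setup: it assumes SQLite (via Python's sqlite3 module) rather than the DBMS used in the talk, a hypothetical tran_id key column for the select list, tran_date stored as an integer day number so that "+ 10" means ten days, and four hand-made rows. The table name is double-quoted because TRANSACTION is an SQLite keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE "transaction" (
        tran_id       INTEGER PRIMARY KEY,  -- hypothetical key, not in the slides
        type_code     INTEGER,
        amount        REAL,
        sbank_aba     TEXT,
        rbank_aba     TEXT,
        orig_account  TEXT,
        benef_account TEXT,
        tran_date     INTEGER               -- assumption: day number
    )
""")

# Hand-made rows: 1 -> 2 -> 3 form a chain through intermediate banks; 4 is noise.
rows = [
    (1, 1000, 2000000, "A", "B", "a1", "b1", 1),
    (2, 1000, 1200000, "B", "C", "b1", "c1", 5),
    (3, 1000, 1200000, "C", "D", "c1", "d1", 9),
    (4, 1000, 500, "X", "Y", "x1", "y1", 2),
]
conn.executemany(
    'INSERT INTO "transaction" VALUES (?, ?, ?, ?, ?, ?, ?, ?)', rows
)

query = """
SELECT r1.tran_id, r2.tran_id, r3.tran_id
FROM "transaction" r1, "transaction" r2, "transaction" r3
WHERE r2.type_code = 1000 AND
      r3.type_code = 1000 AND
      r1.type_code = 1000 AND
      r1.amount > 1000000 AND
      r1.rbank_aba = r2.sbank_aba AND
      r1.benef_account = r2.orig_account AND
      r2.amount > 0.5 * r1.amount AND
      r1.tran_date <= r2.tran_date AND
      r2.tran_date <= r1.tran_date + 10 AND
      r2.rbank_aba = r3.sbank_aba AND
      r2.benef_account = r3.orig_account AND
      r2.amount = r3.amount AND
      r2.tran_date <= r3.tran_date AND
      r3.tran_date <= r2.tran_date + 10
"""

matches = conn.execute(query).fetchall()
print(matches)  # the only chain in the toy data is transactions 1, 2, 3
```

Evaluated this way, the three-way self-join is recomputed over the whole table each time the query runs, which is the cost that incremental evaluation is meant to avoid.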


ARGUS System Architecture

[Figure: ARGUS system architecture. Components: Analyst, Query, Rete Network Generator, Rete Networks, Query Table, Scheduler, Do_queries, Stream Anomaly Monitoring, Data Streams, Data Tables, Intermediate Tables, Identified Threats.]


ReteGenerator Architecture

[Figure: ReteGenerator architecture. Components: SQL Queries, System Catalog, ReteGenerator, Transitivity Inference, Sharing Module, Join Order Optimizer, Conditional Materialization.]

Common Computation Identification

Predicate Indexing

Extended Predicate Set Operations

Choose what and how to share

Recording and Manipulating Network Topology

Estimating Sharing Costs


Adapted Rete Algorithm (Selection)

n and m are old data sets. Δn and Δm are the new, much smaller incremental data sets.

Selection σ:

σ(n + Δn) = σ(n) + σ(Δn)


Adapted Rete Algorithm (Join)

Join ⋈:

(n + Δn) ⋈ (m + Δm) = n ⋈ m + Δn ⋈ m + n ⋈ Δm + Δn ⋈ Δm

Old results: n ⋈ m. New incremental results: Δn ⋈ m + n ⋈ Δm + Δn ⋈ Δm.

When Δn and Δm are very small compared to n and m, the time complexity of the incremental join is O(n+m).
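
The selection and join decompositions above can be checked on small in-memory lists. The following is a minimal sketch under stated assumptions: the helper names (select, join, incremental_join) and the integer data are hypothetical, and the nested-loop join is chosen for brevity, so it illustrates the identities only and not the O(n+m) cost quoted above.

```python
def select(rows, pred):
    """Selection: keep the rows that satisfy pred."""
    return [r for r in rows if pred(r)]


def join(left, right, pred):
    """Nested-loop join: all pairs (l, r) that satisfy pred."""
    return [(l, r) for l in left for r in right if pred(l, r)]


def incremental_join(n, dn, m, dm, pred):
    """New results only: dn JOIN m + n JOIN dm + dn JOIN dm.

    The old result n JOIN m is assumed to be materialized already,
    so it is not recomputed here.
    """
    return join(dn, m, pred) + join(n, dm, pred) + join(dn, dm, pred)


# Hypothetical data: old sets n, m and small increments dn, dm.
n, dn = [1, 2, 3, 4], [5, 6]
m, dm = [2, 4, 6], [3, 5]


def is_even(x):
    return x % 2 == 0


def equal(l, r):
    return l == r


# Selection distributes over the increment.
full_selection = select(n + dn, is_even)
split_selection = select(n, is_even) + select(dn, is_even)

# Join: old materialized result plus the three incremental terms.
old_result = join(n, m, equal)
new_result = incremental_join(n, dn, m, dm, equal)
full_result = join(n + dn, m + dm, equal)

print(full_selection == split_selection)
print(sorted(full_result) == sorted(old_result + new_result))
```

Only new_result has to be computed when a batch arrives; old_result plays the role of the materialized intermediate result that Rete keeps between batches.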


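Incremental Evaluation in Rete: Example 4

[Figure: Rete network for Example 4 over DataTable (aliases r1, r2, r3), with three selection nodes feeding two join nodes.]

Selection on r1: type_code = 1000, amount > 1000000

Selection on r2: type_code = 1000

Selection on r3: type_code = 1000

Join of r1 and r2:
r1.rbank_aba = r2.sbank_aba
r1.benef_account = r2.orig_account
r2.amount > r1.amount * 0.5
r1.tran_date <= r2.tran_date
r2.tran_date <= r1.tran_date + 10

Join with r3:
r2.rbank_aba = r3.sbank_aba
r2.benef_account = r3.orig_account
r2.amount = r3.amount
r2.tran_date <= r3.tran_date
r3.tran_date <= r2.tran_date + 10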


Complex Queries

A persistent query may contain multiple SQL statements, and a single SQL statement may contain unions of multiple SQL terms.

Each SQL term is mapped to a sub-Rete network.

These sub-Rete networks are then connected to form the statement-level sub-networks.

The statement-level sub-networks are further connected, based on the view references, to form the final query-level Rete network.


Transitivity Inference

Exploring transitivity properties of comparison operators to derive hidden, highly selective selection predicates.

Highly selective selection predicates can significantly improve performance, as they may produce very small intermediate results. Subsequent joins can then be performed very quickly on the materialized intermediate results.

Ono/Lohman VLDB90, Pirahesh/Leung/Hasan ICDE97


Transitivity Inference Example

Given: r1.amount > 1000000 and r2.amount > r1.amount * 0.5 and r3.amount = r2.amount

r1.amount > 1000000 is highly selective on r1.

We can infer highly selective predicates: r2.amount > 500000 and r3.amount > 500000
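
The inference in this example can be sketched as a small bound-propagation loop. This is an illustration only, not the ARGUS inference module: the function name infer_lower_bounds and the tuple encoding of predicates are hypothetical, and only two rules are handled (a > k * b with k > 0, and a = b).

```python
def infer_lower_bounds(constant_bounds, scaled_gt, equalities, max_passes=10):
    """Derive strict lower bounds (col > c) by transitivity.

    constant_bounds: {col: c}, meaning col > c
    scaled_gt:       [(a, k, b)], meaning a > k * b, with k > 0
    equalities:      [(a, b)], meaning a = b
    max_passes caps the loop so contradictory inputs cannot run forever.
    """
    bounds = dict(constant_bounds)
    for _ in range(max_passes):
        changed = False
        # a > k * b and b > c (k > 0) imply a > k * c.
        for a, k, b in scaled_gt:
            if b in bounds:
                candidate = k * bounds[b]
                if a not in bounds or candidate > bounds[a]:
                    bounds[a] = candidate
                    changed = True
        # a = b and b > c imply a > c, in both directions.
        for a, b in equalities:
            for x, y in ((a, b), (b, a)):
                if y in bounds and (x not in bounds or bounds[y] > bounds[x]):
                    bounds[x] = bounds[y]
                    changed = True
        if not changed:
            break
    return bounds


inferred = infer_lower_bounds(
    constant_bounds={"r1.amount": 1000000},
    scaled_gt=[("r2.amount", 0.5, "r1.amount")],
    equalities=[("r3.amount", "r2.amount")],
)
for column, bound in inferred.items():
    print(f"{column} > {bound}")
```

The derived bounds on r2.amount and r3.amount are the extra selection predicates that can then be applied before the joins.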


Conditional Materialization

[Figure: two Rete network fragments over r1 and r2, contrasting Unconditional Materialization with Conditional Materialization.]

Conditional Materialization: choose whether or not to materialize based on cost estimates.


Preliminary Evaluation: Queries and Data

7 queries on a synthesized FedWire money transfer database with 320006 records.

Two Data Conditions:

Data1: Old: first 300000 records. New: remaining 20006 records. ALERT

Data2: Old: first 300000 records. New: next 20000 records. NOT alert


Preliminary Results

[Figure: Rete with Transitivity Inference. Execution time (s), axis 0 to 50, for queries Q1 to Q7. Series: Rete Data1, SQL Data1, Rete Data2, SQL Data2.]


Transitivity Inference

[Figure: two panels, Q2 and Q4, showing execution time (s) on Data1 and Data2, with axes running 0 to 25 and 0 to 50. Series: Rete TI, Rete Non-TI, SQL Non-TI, SQL TI.]


Conditional Materialization

Q4 assumes Transitivity Inference not applicable

[Figure: execution time (s), axis 0 to 50, on Data1 and Data2. Series: Conditional, Rete, SQL.]


ARGUS Summary

Adapted Rete algorithm upon a traditional DBMS platform

Exploit the very-high-selectivity query property for optimization: Transitivity Inference; Conditional Materialization

Current and Future Work: Optimizing Join Order; Computation Sharing


Thank you!

Questions and Comments?
