
Complex Event Processing with RTI Data Distribution Service

A Whitepaper by Real-Time Innovations

Supreet Oberoi, Vice President, Engineering

Introduction

RTI has recently integrated its real-time middleware, RTI Data Distribution Service, with the CEP engine from Coral8, the market-leading Complex Event Processing (CEP) vendor. With this integration, we can now address many outstanding challenges facing real-time application developers.

In this paper, we will survey challenges facing application developers that have to process real-time data to identify events. An event is a notable thing that happens inside or outside your business. An event, business or system, may signify a problem or impending problem, an opportunity, a threshold, or a deviation. We explain the fundamentals of Complex Event Processing, and introduce the semantics of the CEP query language, called CCL. Finally, we will discuss some of the challenges in integrating real-time data with the enterprise, and how to use CEP to address these challenges.

Challenges in Building Real-Time Distributed Applications

RTI Data Distribution Service is becoming the de facto middleware standard for building high-throughput, real-time distributed systems. By providing a real-time middleware for key embedded and enterprise systems, RTI enables system architects to build a heterogeneous system-of-systems for real-time distributed applications. With RTI Data Distribution Service, distributed-application developers can reduce the cost and risk of sending and receiving data in a real-time system. However, key challenges remain for developers who integrate real-time distributed systems with the enterprise:

1. Need to process data before use: Traditionally, real-time distributed systems use RTI Data Distribution Service in embedded systems for sending sensor data. In many cases, applications need to process this streaming data before it becomes useful for the enterprise. The enterprise system may only be interested when an “interesting event” occurs, but not in periodic sensor reads. Sometimes, applications need to remove duplicate and erroneous readings before making the data available to the enterprise. For example, in RFID monitoring applications, the system needs to cleanse duplicate reads before processing the data. Also, in such examples, the data structures used for sending sensor data are different from the data structures that the application uses; applications need to perform transformations, both syntactic (changed names or types) and semantic (a new meaning), before they act on the data by persisting it, generating business events, or creating real-time enterprise dashboards.

2. Need to correlate data: Consider the example of an RFID application. The RFID reader sends data that identifies a tag seen by a particular RFID reader. This event is not interesting for the business-application user, who is interested in learning the SKU ID and the price of the checked item. To make the event relevant to the user, the application requires a lookup of the RFID tag in a relational database that maps the tag to the product ID. Then, the application has to correlate the product ID to the price from the product catalog. Even though the sensor event may have come from RTI Data Distribution Service, not all the data required to infer a meaningful business/enterprise event resides in one place. Integrating and correlating data from multiple sources is a very significant need.

3. Need a time window for running queries: Not all streaming real-time data remains relevant beyond a given time window. A real-time distributed application, such as one monitoring patterns of network intrusion, cannot persist all the network data; it only needs to check for patterns in a given window of data activity. For example, in click-stream analysis, applications need to monitor HTTP traffic for certain patterns in real time. The data has no value after a certain, short interval.

4. Need to process data at high-throughput rates: Traditionally, for complex queries (for example, queries that require a join), the data and the intermediate result set have to be persisted before executing the query. In some existing and emerging use cases, storing the data before executing the query is not practical. For example, in the case of measuring network performance, financial tickers, click-streams, or data feeds from sensor networks, storing the entire data set before running the query is both unacceptably slow and a waste of disk resources. In the same set of use cases, the query has to run continuously over the current data set without the need to persist previously viewed data.

5. Need to support short development cycles for generating new pattern-identification code: In many examples where applications monitor data in real time, users demand the ability to deploy new patterns for detecting events within a very short time span, from a few hours to a few days. The traditional approach of coding event detection within the application code using C, C++, or Java does not provide this flexibility. What is required is a clean separation of pattern-detection code from application code, and a high-level language (like SQL) to establish filters and joins. In addition, users and application developers demand an interface where they can use pre-built filter, correlation, and aggregation functions to identify patterns.
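The first two challenges above can be illustrated with a small Python sketch that drops duplicate RFID reads and then enriches the remaining reads with a SKU and a price. This is not RTI or Coral8 code: the record layouts, the two-second duplicate interval, and the dictionaries standing in for the relational database and the product catalog are all assumptions made for this example.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Tuple


class TagRead(NamedTuple):
    """Hypothetical raw sensor record published by an RFID reader."""
    timestamp: float  # seconds
    reader_id: str
    tag_id: str


class CheckoutEvent(NamedTuple):
    """Hypothetical business event derived from a cleansed tag read."""
    timestamp: float
    sku: str
    price: float


def dedupe(reads: Iterable[TagRead], interval: float = 2.0) -> Iterator[TagRead]:
    """Drop a read if the same reader passed through the same tag
    less than `interval` seconds earlier (interval is an assumed value)."""
    last_passed: Dict[Tuple[str, str], float] = {}
    for read in reads:
        key = (read.reader_id, read.tag_id)
        previous = last_passed.get(key)
        if previous is None or read.timestamp - previous >= interval:
            last_passed[key] = read.timestamp
            yield read


def enrich(
    reads: Iterable[TagRead],
    tag_to_sku: Dict[str, str],
    sku_to_price: Dict[str, float],
) -> Iterator[CheckoutEvent]:
    """Correlate each read with the tag-to-SKU mapping and the price catalog.
    The two dictionaries stand in for external data sources."""
    for read in reads:
        sku = tag_to_sku.get(read.tag_id)
        if sku is None or sku not in sku_to_price:
            continue  # unknown tag or unpriced SKU: no business event
        yield CheckoutEvent(read.timestamp, sku, sku_to_price[sku])
```

Chaining the two generators, as in `enrich(dedupe(reads), tag_to_sku, sku_to_price)`, processes each read as it arrives without storing the full stream. Hand-written code of this kind is what the fifth challenge argues should be replaced by a high-level query language.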

Introducing Complex Event Processing (CEP)

CEP engines manage event-driven information systems by employing techniques such as detecting complex patterns, and building correlations and relationships, such as causality and timing between many events.

RTI has implemented integration with a leading CEP engine, Coral8. With this integration, application developers can query, filter, and transform data from multiple topics and data sources to identify patterns for event detection in real time.

Figure 1: CEP Engine Provides Correlation between Multiple Sources to Infer Events (diagram labels: DDS, CEP Engine, Dashboards, Applications, Alerts)


The RTI Complex Event Processing Engine supports a high-level query language called CCL (Continuous Computation Language), based on SQL. We will cover the semantics and show some examples of CCL in the following sections.

The CCL query engine, or the Event Processing Engine, can process data from a wide variety of sources, called streams (RTI Data Distribution Service, Enterprise Service Bus, Java Message Service (JMS), XML, and CSV files), and apply transformations and filtering through the SQL-like CCL language. The Event Processing Engine generates events based on matching patterns from the input streams, and can publish these events to a wide variety of middleware and persistent stores. The RTI Event Processing Engine also exposes interfaces for adding your own custom input and output adapters.

Figure 2: RTI Complex Event Processing Architecture (diagram labels: Input Adapters, Output Adapters)

Figure 3: CEP Conceptually Views Input Streams as Database Tables (diagram label: StockTrade Stream)

As seen in Figure 3, the input-data streams can be viewed as database tables, with an associated schema. Each time a data stream receives new data, it appears in the form of a row, with a value for each of its columns represented in a schema.

However, there are some significant differences as well. Typically, a row that passes through a data stream in a query is processed only at the precise moment in time when it becomes available in the stream. Once an incoming row has been processed, it is normally no longer available to the query.
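The difference from a database table can be sketched in Python with a generator: the query sees each row once, at the moment it arrives, and the row is not available afterwards. This is only an emulation of the idea, not the engine's implementation; the column names and the volume threshold are assumptions made for this example.

```python
from typing import Any, Dict, Iterable, Iterator

Row = Dict[str, Any]  # one stream row: column name -> value


def continuous_filter(stream: Iterable[Row], min_volume: int) -> Iterator[Row]:
    """Emit Symbol and Price for each arriving row whose Volume is at
    least `min_volume`. Each row is examined exactly once and is not
    retained after it has been processed."""
    for row in stream:
        if row["Volume"] >= min_volume:
            yield {"Symbol": row["Symbol"], "Price": row["Price"]}


# A one-shot iterator stands in for a live input stream.
trades = iter([
    {"Symbol": "AAA", "Price": 10.0, "Volume": 500},
    {"Symbol": "BBB", "Price": 20.0, "Volume": 1500},
])
large_trades = list(continuous_filter(trades, min_volume=1000))
remaining = list(trades)  # empty: processed rows are no longer available
```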

Continuous Computation Language (CCL)

At the heart of the CEP Engine is an innovative programming language, the Continuous Computation Language (CCL), which offers developers a natural way to build CEP applications. Based on industry-standard SQL, CCL increases programmer productivity by leveraging existing programming skills.

CCL leverages the strengths and data semantics of SQL for performing data selection, computation, grouping, aggregation, and joins. However, CCL has to extend the SQL semantics to perform continuous queries:

1. Windows and Aggregation: While the one-row-at-a-time data stream model is useful for picking out individual rows, it does not accommodate complex analysis of data involving multiple rows in a stream. In practice, it is often necessary for queries to deal with more than just the current row in the stream. A window is a collection of rows retained according to one of several criteria:

a. Time interval (“5 seconds of data”)

b. Row counts

c. Groups ( “Keep the last row for each stock symbol”)

d. Ranking (“Keep the last 10 largest values”)

    INSERT INTO StreamVWAP
    SELECT Symbol, SUM(Price*Volume)/SUM(Volume)
    FROM StreamTrades KEEP 5 MINUTES
    GROUP BY Symbol
    OUTPUT EVERY 1 MINUTE

In this example, the event-processing engine continuously computes the five-minute moving volume-weighted average price for each stock symbol by querying the price and volume data posted to the StreamTrades input stream in the last five minutes, and publishes the result once a minute. Note that StreamTrades could be a DDS stream, a JMS stream, or even a relational data stream; this does not matter to the CCL query, which views each stream as a table. As in SQL, CCL uses the GROUP BY construct to aggregate data by symbol.
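To make the window semantics concrete, the following Python sketch emulates what the query computes: rows older than the window are discarded, and the volume-weighted average is recomputed per symbol over whatever remains. The class and method names are hypothetical, and the sketch assumes that rows arrive in timestamp order; it says nothing about how the CCL engine is actually implemented.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class VwapWindow:
    """Time-based sliding window over trade rows (timestamps in seconds)."""

    def __init__(self, keep_seconds: float = 300.0) -> None:
        self.keep_seconds = keep_seconds
        # Each row is (timestamp, symbol, price, volume), oldest first.
        self._rows: Deque[Tuple[float, str, float, int]] = deque()

    def _expire(self, now: float) -> None:
        # Drop rows that have fallen out of the window.
        while self._rows and self._rows[0][0] <= now - self.keep_seconds:
            self._rows.popleft()

    def insert(self, timestamp: float, symbol: str, price: float, volume: int) -> None:
        self._rows.append((timestamp, symbol, price, volume))
        self._expire(timestamp)

    def vwap(self, now: float) -> Dict[str, float]:
        """Return SUM(price*volume)/SUM(volume) per symbol for the current window."""
        self._expire(now)
        totals: Dict[str, Tuple[float, int]] = {}
        for _, symbol, price, volume in self._rows:
            notional, shares = totals.get(symbol, (0.0, 0))
            totals[symbol] = (notional + price * volume, shares + volume)
        return {s: n / v for s, (n, v) in totals.items() if v > 0}
```

Calling `vwap(now)` once a minute from a timer would correspond to the OUTPUT EVERY 1 MINUTE clause; between calls, the window only accumulates and expires rows.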

2. Correlation: CCL implements correlation between data streams by using the JOIN semantics from SQL.

    INSERT INTO CombinedStockOption
    SELECT InStock.Symbol, InOption.OptionSymbol, InStock.Price, InOption.Price
    FROM InOption, InStock KEEP 10 SECONDS
    WHERE InStock.Symbol=InOption.StockSymbol

In this case, the developer takes two data streams, InStock and InOption, and creates a new stream with the combined information by correlating, or joining, the samples collected in the last ten seconds based on the stock symbol. Again, note that the InStock and InOption streams may use different sources and technology platforms; for example, one may be a DDS stream and the other a JMS stream or a relational database. The CCL engine normalizes all the data samples to a table form before executing the queries.
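A windowed join of this kind can be emulated in Python as follows. The sketch assumes, following the prose above, that both streams retain their rows for ten seconds, and that rows arrive in timestamp order. The class and method names are hypothetical and do not correspond to any RTI or Coral8 API.

```python
from collections import deque
from typing import Deque, List, Tuple

# Output row: (stock symbol, option symbol, stock price, option price)
Joined = Tuple[str, str, float, float]


class WindowedJoin:
    """Join stock and option rows that share a stock symbol and are both
    still inside the time window (timestamps in seconds)."""

    def __init__(self, keep_seconds: float = 10.0) -> None:
        self.keep_seconds = keep_seconds
        self._stocks: Deque[Tuple[float, str, float]] = deque()
        self._options: Deque[Tuple[float, str, str, float]] = deque()

    def _expire(self, now: float) -> None:
        for window in (self._stocks, self._options):
            while window and window[0][0] <= now - self.keep_seconds:
                window.popleft()

    def on_stock(self, ts: float, symbol: str, price: float) -> List[Joined]:
        """Store a stock row and return its matches among retained options."""
        self._expire(ts)
        self._stocks.append((ts, symbol, price))
        return [
            (symbol, opt_symbol, price, opt_price)
            for _, opt_symbol, stock_symbol, opt_price in self._options
            if stock_symbol == symbol
        ]

    def on_option(self, ts: float, option_symbol: str, stock_symbol: str,
                  price: float) -> List[Joined]:
        """Store an option row and return its matches among retained stocks."""
        self._expire(ts)
        self._options.append((ts, option_symbol, stock_symbol, price))
        return [
            (stock_symbol, option_symbol, stock_price, price)
            for _, symbol, stock_price in self._stocks
            if symbol == stock_symbol
        ]
```

Because each arriving row is matched only against the rows currently held in memory for the other stream, no intermediate result set needs to be written to disk.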


3. Event Pattern Matching: CCL provides detailed semantics for identifying patterns of events.

    INSERT INTO ProcessAlerts
    SELECT a.id
    FROM StreamA a, StreamB b, StreamC c, StreamD d
    MATCHING [10 SECONDS: a, b || c, !d]
    ON a.id = b.id = c.id = d.id

In this example, we look for cases where, for a particular id, an event in stream a is followed by an event in stream b or c, and no event occurs in stream d. To implement this, streams a, b, c, and d are correlated based on the id. Then, based on the matching data set, the CEP engine looks for a pattern in which the following sequence of events is detected within a 10-second window (the window is triggered by the arrival of the first event):

a) Event a was detected

b) Either event b, or event c, or both events were detected

c) In the time window, event d was not detected.
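The three conditions above can be checked with a short Python sketch. It follows the prose description only: each event in stream a opens a ten-second window for its id, and the id matches if an event from b or c, and no event from d, falls inside that window. For simplicity it evaluates a finite list of events in one pass rather than incrementally, and it ignores events that carry exactly the same timestamp as the opening event. The `Event` type and function name are hypothetical, and the sketch does not claim to reproduce the exact semantics of the CCL MATCHING clause.

```python
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    timestamp: float  # seconds
    stream: str       # "a", "b", "c", or "d"
    id: str


def find_matches(events: Iterable[Event], window: float = 10.0) -> List[str]:
    """Return the ids for which an 'a' event is followed, within `window`
    seconds, by a 'b' or 'c' event and by no 'd' event."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    matches: List[str] = []
    for start in ordered:
        if start.stream != "a":
            continue  # only an 'a' event opens a window
        in_window = [
            e for e in ordered
            if e.id == start.id
            and start.timestamp < e.timestamp <= start.timestamp + window
        ]
        saw_b_or_c = any(e.stream in ("b", "c") for e in in_window)
        saw_d = any(e.stream == "d" for e in in_window)
        if saw_b_or_c and not saw_d:
            matches.append(start.id)
    return matches
```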

Unlike relational databases, the CEP query engine does not use persistence to store its transient joined sets, which makes it possible to perform the correlation in real time on large volumes of data.

We can extend the CCL semantics to include custom scalar functions. In addition, RTI provides a Java-based development environment for developing, debugging, monitoring, and managing event-driven applications.

Integrating with the SOA and XML

Most CEP engines view an event as a collection of typed name/value pairs. Such events typically cannot be hierarchical or set based. In other words, these events cannot contain non-scalar values. A CEP engine that supports just these basic events is capable of solving a large number of simple problems, but there is an increasing class of problems that require support for complex, hierarchical, set-based events. The reason for this is simple: XML. XML events are already everywhere, and the number of XML events keeps growing. Service-Oriented Architecture (SOA) is all about XML. The CEP engine must be able to provide the following functionality:

• Consume XML events.

• Analyze XML events: It may be necessary to compute aggregations (Sum, Avg, Min, Max) over one or more XML events, or join/correlate different XML events.

• Produce XML events: Even if all input events are simple and non-XML, one may want to collect these simple events, and produce a complex XML event.

RTI Complex Event Processing engine treats a single XML document as a whole, rather than as a collection of bits and pieces stored across many tables. The advantages of native XML processing are two-fold:

• Ease of Programming: When the developer can keep the XML document in one place, and use the functions specifically designed for processing XML, writing queries is much easier.

• Performance: The functions and operators designed specifically for XML processing can be optimized much more effectively than generic SQL operators can. Moreover, these functions can take advantage of the growing number of hardware-based XML accelerators, further improving the performance of XML processing.
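The three capabilities listed earlier in this section (consume, analyze, and produce XML events) can be sketched with Python's standard xml.etree.ElementTree module. This only illustrates treating an XML document as a single hierarchical event; it does not show the engine's own XML functions, which this paper does not name. The element and attribute names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def order_total(order_xml: str) -> float:
    """Consume one XML event and aggregate over its nested items:
    the sum of price * qty for every <item> element."""
    root = ET.fromstring(order_xml)
    return sum(
        float(item.get("price")) * int(item.get("qty"))
        for item in root.iter("item")
    )


def build_summary(trades: Iterable[Tuple[str, float]]) -> str:
    """Collect simple (symbol, price) events into one complex XML event."""
    root = ET.Element("summary")
    for symbol, price in trades:
        trade = ET.SubElement(root, "trade", symbol=symbol)
        trade.text = f"{price:.2f}"
    return ET.tostring(root, encoding="unicode")
```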


Conclusion: RTI Simplifies Complex Event Processing

By integrating the CEP engine from Coral8, RTI continues to extend the technology platform for building real-time distributed applications:

• Developers can build distributed applications that synthesize and analyze high-throughput data from multiple sources. This capability is relevant for a wide array of use cases including command and control, intelligence and surveillance, intrusion detection, algorithmic trading and network monitoring.

• Developers can extend RTI Data Distribution Service by using CEP to supplement features like MultiTopics and content-filtered topics.

• Developers can add time/sequence-based filtering and correlation, data aggregation/reduction, and transformation to detect patterns in their streaming real-time data.

• Developers can provide high-performance alerts and notifications, and integrate with other middleware standards.

©2007 Real-Time Innovations, Inc.

Real-Time Innovations and RTI are registered trademarks of Real-Time Innovations, Inc. All other names mentioned are trademarks, registered trademarks, or service marks of their respective companies or organizations.