Post on 31-Mar-2015


The Stream Star Schema

Stephen A. Broeker

Conclusion

The Stream Star Schema processes data streams in real time, at rates up to gigabits per second.

Stream Star insertion time is O(1).

Large Fast Dynamic Data Streams

Phone calls, road traffic, network traffic, website traffic, power supplies, credit card transactions, sensor arrays, financial markets.

These streams are data rich but poor in real-time analysis.

What are the consequences? Patterns are hard to see, and therefore problems are difficult to detect.

Challenge of Network Monitoring

Network monitoring at high speed is difficult:

On a 1 Gbps NIC a bit arrives every nanosecond, and minimum-size Ethernet frames can arrive every 672 ns.

Per-packet processing must use SRAM.

The traditional solution, sampling, is inherently inaccurate because it discards data.
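As a rough check on the timing budget, the following sketch computes how often frames can arrive on a 1 Gbps link. It assumes standard Ethernet framing overhead (8 bytes of preamble and start delimiter plus a 12-byte inter-frame gap), which the slides do not state. The function and constant names are illustrative only.

```python
# Assumption: standard Ethernet on-wire overhead per frame (not stated in the slides).
LINK_RATE_BPS = 1_000_000_000   # 1 Gbps
PREAMBLE_BYTES = 8              # preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12       # mandatory idle time between frames


def wire_bits(frame_bytes: int) -> int:
    """Bits a frame occupies on the wire, including framing overhead."""
    return (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8


def frame_time_ns(frame_bytes: int, rate_bps: int = LINK_RATE_BPS) -> float:
    """Time between back-to-back frames of the given size, in nanoseconds."""
    return wire_bits(frame_bytes) * 1_000_000_000 / rate_bps


def max_frames_per_second(frame_bytes: int, rate_bps: int = LINK_RATE_BPS) -> float:
    """Upper bound on the frame arrival rate for frames of the given size."""
    return rate_bps / wire_bits(frame_bytes)


if __name__ == "__main__":
    for size in (64, 512, 1518):
        print(size, frame_time_ns(size), int(max_frames_per_second(size)))
```

Even in the worst case of minimum-size frames, the per-packet budget is well under a microsecond, which is the pressure behind the SRAM requirement.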

Vision

Achieve real-time OLAP for massive data streams.

Achieve cybernetic control for systems that depend on rapid data analysis.

Detection

Forensics

Data Rates versus Data Storage

Data RATES are measured in bits per second (lowercase ‘b’).

Data STORAGE is measured in Bytes (uppercase ‘B’).

So, Gigabits (Gb) ≠ Gigabytes (GB).

Data Storage based on Data Rate

An Ethernet Network Interface Card transfers data at 1 Gbps.

Data accumulates at 450 GB per hour.

That’s 10.5 TB per day, 73.8 TB per week!
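The arithmetic behind these storage figures can be reproduced with a short calculation. This sketch assumes the link runs at full rate continuously and that the slide's figures use binary units (1 Gb taken as 2^30 bits, 1 GB as 2^30 bytes, 1 TB as 2^40 bytes); that interpretation is inferred from the numbers, not stated in the slides.

```python
# Assumption: binary units, inferred because they reproduce the slide's 10.5 and 73.8 TB.
GIB = 2**30
TIB = 2**40


def bytes_accumulated(rate_bits_per_s: int, seconds: int) -> int:
    """Bytes stored if every bit on the link is kept for the given duration."""
    return rate_bits_per_s * seconds // 8


rate = 2**30                                  # 1 Gbps, binary gigabit
per_hour = bytes_accumulated(rate, 3600)
per_day = per_hour * 24
per_week = per_day * 7

if __name__ == "__main__":
    print(per_hour / GIB, "GB per hour")
    print(round(per_day / TIB, 1), "TB per day")
    print(round(per_week / TIB, 1), "TB per week")
```

With a decimal gigabit (10^9 bits per second) and decimal units, the hourly figure is also 450 GB, while the daily figure becomes 10.8 TB.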

Picturing Orders of Magnitude

What if BYTES were pennies?

[Image] Used with permission: © Copyright 2001 Alan Taylor – The MegaPenny Project – KOKOGIAK MEDIA

$10^{6} \approx 2^{20}$, $10^{9} \approx 2^{30}$, $10^{12} \approx 2^{40}$, $10^{15} \approx 2^{50}$, $10^{18} \approx 2^{60}$

At 1 Gbps, about 0.3 PB accumulate per month.

From Streaming Data to Database

The network stream is segmented into flows, which are inserted into a database.

Observed database input rate for a 1 Gb Ethernet NIC: 700,000 flows per hour.

Existing databases can’t keep up!
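One way to picture the segmentation step is to group packets that share the same endpoints. The slides do not define a flow, so this sketch assumes the common 5-tuple definition (source address, destination address, source port, destination port, protocol). The `Packet` type and `segment_into_flows` function are hypothetical names for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical packet record; a real capture would carry many more fields."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    payload: bytes


FlowKey = Tuple[str, str, int, int, str]


def segment_into_flows(packets: Iterable[Packet]) -> Dict[FlowKey, List[bytes]]:
    """Group packet payloads by 5-tuple (assumed flow definition), preserving arrival order."""
    flows: Dict[FlowKey, List[bytes]] = defaultdict(list)
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        flows[key].append(p.payload)
    return dict(flows)
```

Each resulting flow would then become one record to insert into the database.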

Consider 2 Database Schemas

Disk Star Schema

STREAM Star Schema

Disk Star Schema: From Fact Table to Dimension Tables

So where’s the star?

Fact table columns: Content, Destination IP, Sender, Recipient, Subject.

Dimension tables: Content Table, Destination IP Table, Sender Table, Recipient Table, Subject Table.

Here’s the star. That’s all there is to the “star” concept.

Value of the Disk Star Schema: Conserve Disk Space

Dimensions: Each Dimension gets a key, resulting in a Dimension Table (1NF: No Repeating Groups).

Substitute Keys for Facts, thus deriving a Fact Table.
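The key-substitution idea can be sketched in a few lines. This is a minimal in-memory illustration, not the author's implementation: the class names are hypothetical, the five attribute names are taken from the star diagram, and a Python dict stands in for the on-disk index a real database would search.

```python
DIMENSIONS = ("content", "destination_ip", "sender", "recipient", "subject")


class DimensionTable:
    """Stores each distinct value once and hands out an integer key for it."""

    def __init__(self):
        self._key_by_value = {}   # stand-in for an on-disk index
        self.values = []          # key -> value

    def key_for(self, value: str) -> int:
        key = self._key_by_value.get(value)
        if key is None:                      # first time this value is seen
            key = len(self.values)
            self.values.append(value)
            self._key_by_value[value] = key
        return key


class DiskStarSchema:
    """Fact table of keys plus one dimension table per attribute."""

    def __init__(self):
        self.dimensions = {name: DimensionTable() for name in DIMENSIONS}
        self.facts = []

    def insert(self, record: dict) -> tuple:
        # Every insert must consult every dimension table before the fact row is written.
        row = tuple(self.dimensions[name].key_for(record[name]) for name in DIMENSIONS)
        self.facts.append(row)
        return row
```

Repeated strings are stored once, which conserves space, but every insert has to look up every dimension table first.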

Disk Star Schema = Slow data insertion time.

Relational databases are normalized to conserve space. Speed is sacrificed.

Insertion becomes a slow bottleneck, so real-time analysis is compromised.

Disk Star Schema


Bottleneck

Dimension table insertion time depends on the table size: it is $O(\log n)$, where $n$ is the number of records in the table.

Disk Star Schema insertion time is the sum of all dimension table insert times, $O\left(\sum_{1 \le i \le l} \log n_i\right)$, where $l$ is the number of attributes in the database and $n_i$ is the number of values for attribute $i$.

Can’t fill dimension tables fast enough!
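To get a feel for how this cost grows, the sum of logarithms can be evaluated directly. The sketch below is only a cost model under the slide's own formula; it counts index comparisons up to a constant factor and says nothing about actual disk latency. The function name and the example table sizes are made up for illustration.

```python
import math


def disk_star_insert_cost(dimension_sizes):
    """Sum of log2(n_i) over all dimension tables, i.e. the slide's cost formula."""
    return sum(math.log2(n) for n in dimension_sizes if n > 0)


# Hypothetical sizes: five dimension tables, early in a capture and much later.
small = disk_star_insert_cost([1_000] * 5)
large = disk_star_insert_cost([1_000_000] * 5)

if __name__ == "__main__":
    print(small, large)
```

The cost per insert keeps rising as the dimension tables fill, which is the opposite of what a continuous stream needs.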

Short Pause to Review Numbers

1,000,000,000 bit per second Ethernet NIC (1 Gbps)

700,000 observed flows per hour

450 GB per hour, 10.5 TB a day

All we can get is a snapshot analysis!


Stream Star Schema


Disk Star Schema

Nearly 1:1 correspondence between string attributes and dimension tables.

Two kinds of tables: fact, dimension.

All string dimensions have dimension tables.

Minimize disk space.

Dimension tables can be large.

Long insert time = $O\left(\sum_{1 \le i \le l} \log n_i\right)$.

No string duplication.

Stream Star Schema

Many:1 correspondence between string attributes and dimension tables.

Three kinds of tables: fact, dimension, string.

Few dimension tables.

Dimension tables are small.

Minimizes insertion time.

Insert time is constant.

Allow string duplication.
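The slides list the properties of the Stream Star Schema without giving its layout, so the following is only one possible reading, not the author's design. It assumes string attributes are appended to a single string table without checking for existing copies, and that only a few low-cardinality attributes keep small dimension tables. All class, attribute, and parameter names are hypothetical.

```python
class StreamStarSketch:
    """Illustrative three-table layout: fact, dimension, string (assumed structure)."""

    def __init__(self, string_attrs, dimension_attrs):
        self.string_attrs = tuple(string_attrs)
        self.dimension_attrs = tuple(dimension_attrs)
        self.string_table = []                                   # append-only, duplicates allowed
        self.dimensions = {a: {} for a in self.dimension_attrs}  # few, small lookup tables
        self.facts = []

    def insert(self, record: dict) -> tuple:
        row = []
        for attr in self.string_attrs:
            # No search for an existing copy: append and remember the position.
            self.string_table.append(record[attr])
            row.append(len(self.string_table) - 1)
        for attr in self.dimension_attrs:
            table = self.dimensions[attr]
            # Hash lookup in a small table; a new value gets the next key.
            row.append(table.setdefault(record[attr], len(table)))
        self.facts.append(tuple(row))
        return self.facts[-1]
```

In this reading, the work per insert depends only on the number of attributes and not on how many rows are already stored, at the price of storing repeated strings more than once.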

Side x Side Comparison

Disk Star Schema: slow, old.

Stream Star Schema: fast, new.

Test Results

The magnified area is different because I measured the insert time for (1, 10, 100) as opposed to (1000, 2000, 3000) streams.

The magnified area is different because of how MySQL works. I can only present a hypothesis since I don’t have the MySQL source code. But I suspect that MySQL is optimized for fewer than 100 streams for this problem.

Conclusion

The Stream Star Schema processes data streams in real time, at rates up to gigabits per second.

Stream Star insertion time is O(1).

Hope

Detection

Forensics

RFID


There’s data flow.

And then there’s DATA FLOW!

The Disk Star Schema handles 3 million flows per hour, about this much.

The Stream Star Schema handles 113 million flows per hour!

Nearly 40x faster!

For The Future

Implement the Stream Star Schema in the Cloud.

Use multiple Stream Star Schema computer nodes to handle an infinite stream. Storage could be handled similarly to S3.

The Stream Star Schema fully supports the analysis of high-speed data streams, thus enabling security applications and forensic processing.
