Post on 31-Mar-2015


The Stream Star Schema

Stephen A. Broeker


Conclusion

The Stream Star Schema processes data streams in real time, at rates up to gigabits per second.

Stream Star insertion time is O(1).


Large Fast Dynamic Data Streams

Phone calls, road traffic, network traffic, website traffic, power supplies, credit card transactions, sensor arrays, financial markets.

Data rich. But poor in real-time analysis.


What are the consequences?


Patterns are hard to see.

Therefore, problems are difficult to detect.


Challenge of Network Monitoring

Network monitoring at high speed is difficult:

On a 1 Gbps NIC a bit arrives every nanosecond, so minimum-size packets can arrive roughly every 672 ns (about 1.49 million packets per second).

Per-packet processing must use SRAM.

The traditional solution, sampling, is inherently inaccurate because it discards data.
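To put the per-packet time budget in numbers, here is a rough Python sketch. It assumes standard Ethernet framing (a 64-byte minimum frame plus 20 bytes of preamble and inter-frame gap on the wire); the function names are illustrative and do not come from the talk.

```python
def packet_interval_ns(link_bps: int, frame_bytes: int, overhead_bytes: int = 20) -> float:
    """Time between back-to-back frames on a saturated link, in nanoseconds.

    overhead_bytes is an assumption: 8 bytes of preamble plus a 12-byte
    inter-frame gap, as on standard Ethernet.
    """
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return bits_on_wire * 1e9 / link_bps


def packets_per_second(link_bps: int, frame_bytes: int, overhead_bytes: int = 20) -> float:
    """Maximum frame rate on a saturated link."""
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_on_wire


GIGABIT = 1_000_000_000  # 1 Gbps, in bits per second

if __name__ == "__main__":
    for size in (64, 1518):  # minimum and maximum standard frame sizes
        gap = packet_interval_ns(GIGABIT, size)
        rate = packets_per_second(GIGABIT, size)
        print(f"{size:>5}-byte frames: one every {gap:,.0f} ns, {rate:,.0f} per second")
```

Whatever work is done per packet has to finish inside that interval, which is why memory latency matters so much here.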


Vision

Achieve real-time OLAP for massive data streams.

Achieve cybernetic control for systems that depend on rapid data analysis.


Detection


Forensics


Data Rates versus Data Storage

Data RATES are measured in bits per second (lowercase ‘b’).

Data STORAGE is measured in Bytes (uppercase ‘B’).

So, Gigabits (Gb) ≠ Gigabytes (GB).


Data Storage based on Data Rate

An Ethernet Network Interface Card transfers data at 1 Gbps.

At that rate, data accumulates at 450 GB per hour.

That’s about 10.5 TB per day, 73.8 TB per week (counting 1 TB as 1024 GB)!
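The conversion from a data rate in bits per second to accumulated storage in bytes can be checked with a few lines of Python. This is a sketch that assumes the link is saturated for the whole period; the function name is made up for illustration.

```python
def bytes_accumulated(rate_bps: int, seconds: int) -> int:
    """Bytes stored when a stream of rate_bps bits per second runs for `seconds`."""
    return rate_bps * seconds // 8  # 8 bits per byte


GIGABIT = 1_000_000_000  # 1 Gbps
HOUR = 3600
DAY = 24 * HOUR
WEEK = 7 * DAY

if __name__ == "__main__":
    per_hour = bytes_accumulated(GIGABIT, HOUR)
    per_day = bytes_accumulated(GIGABIT, DAY)
    per_week = bytes_accumulated(GIGABIT, WEEK)

    print(f"per hour: {per_hour / 1e9:.0f} GB")
    # Decimal terabytes (1 TB = 1000 GB) versus the mixed convention
    # (1 TB = 1024 GB) give slightly different daily and weekly figures.
    print(f"per day:  {per_day / 1e12:.1f} TB decimal, {per_day / 1e9 / 1024:.1f} TB at 1024 GB per TB")
    print(f"per week: {per_week / 1e12:.1f} TB decimal, {per_week / 1e9 / 1024:.1f} TB at 1024 GB per TB")
```

The gap between the two conventions is only a few percent, but it is worth stating which one a figure uses.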


Picturing Orders of Magnitude

What if BYTES were pennies?

$10^{6} \approx 2^{20}$, $10^{9} \approx 2^{30}$, $10^{12} \approx 2^{40}$, $10^{15} \approx 2^{50}$

Used with permission: © Copyright 2001 Alan Taylor – The Mega Penney Project - KOKOGIAK MEDIA


At 1 Gbps, roughly 0.3 PB (324 TB, or about 2.6 petabits) accumulates per 30-day month.


$10^{18} \approx 2^{60}$


From Streaming Data to Database

The network stream is segmented into flows, which are inserted into a database.

Observed database input rate for a 1 Gbps Ethernet NIC: 700,000 flows per hour.

Existing databases can’t keep up!
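The talk does not say how a flow is delimited. A common convention, assumed here, is to group packets by their 5-tuple (source address, destination address, source port, destination port, protocol). The sketch below shows that grouping in Python; the `Packet` type and `segment_into_flows` function are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    size: int  # payload size in bytes


def segment_into_flows(packets):
    """Group packets by 5-tuple; each group is one flow (an assumed definition)."""
    flows = defaultdict(list)
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        flows[key].append(p)
    return dict(flows)


if __name__ == "__main__":
    stream = [
        Packet("10.0.0.1", "10.0.0.9", 40000, 25, "tcp", 512),
        Packet("10.0.0.2", "10.0.0.9", 40001, 25, "tcp", 128),
        Packet("10.0.0.1", "10.0.0.9", 40000, 25, "tcp", 256),
    ]
    for key, members in segment_into_flows(stream).items():
        print(key, len(members), "packets,", sum(p.size for p in members), "bytes")
```

Each resulting flow becomes one record to insert, so the database sees flows per hour rather than packets per hour.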


Consider 2 Database Schemas

Disk Star Schema

STREAM Star Schema


Disk Star Schema

From Fact Table to Dimension Tables

So where’s the star?

Fact Table columns: Content, Destination IP, Sender, Recipient, Subject

Dimension Tables: Content Table, Destination IP Table, Sender Table, Recipient Table, Subject Table

Here’s the star.

That’s all there is to the “star” concept.


Value of the Disk Star Schema

Conserve Disk Space

Dimensions

Each distinct Dimension value gets a key.

Resulting in a Dimension Table

1NF: No Repeating Groups

Substitute Keys for Facts

Thus deriving a Fact Table.
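The key-substitution steps above can be sketched in a few lines of Python. This is an in-memory illustration only: a Python dict stands in for each dimension table's on-disk index, the column names follow the Content, Destination IP, Sender, Recipient, and Subject attributes from the diagram, and the class and function names are hypothetical.

```python
COLUMNS = ("sender", "recipient", "subject", "content", "destination_ip")


class DimensionTable:
    """Maps each distinct attribute value to a small integer key."""

    def __init__(self):
        self._key_of = {}  # value -> key (stand-in for an on-disk index)
        self.values = []   # key -> value

    def key_for(self, value: str) -> int:
        """Return the existing key for value, or assign the next free key."""
        key = self._key_of.get(value)
        if key is None:
            key = len(self.values)
            self._key_of[value] = key
            self.values.append(value)
        return key


def insert_record(record, dimensions, fact_table):
    """Replace every attribute value with its dimension key and store the row."""
    row = tuple(dimensions[c].key_for(record[c]) for c in COLUMNS)
    fact_table.append(row)
    return row


if __name__ == "__main__":
    dimensions = {c: DimensionTable() for c in COLUMNS}
    fact_table = []
    insert_record({"sender": "alice", "recipient": "bob", "subject": "hi",
                   "content": "hello", "destination_ip": "10.0.0.9"},
                  dimensions, fact_table)
    insert_record({"sender": "alice", "recipient": "carol", "subject": "re: hi",
                   "content": "hello again", "destination_ip": "10.0.0.7"},
                  dimensions, fact_table)
    print(fact_table)                    # rows of keys, not strings
    print(dimensions["sender"].values)   # "alice" is stored once
```

Each string is stored once, which is where the disk savings come from; the price is one dimension lookup per attribute on every insert.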


Disk Star Schema = Slow data insertion time.

Relational databases are normalized to conserve space. Speed is sacrificed.

So real-time analysis is compromised.

Slow. Bottleneck.


Disk Star Schema


Bottleneck

Dimension table insertion time depends on the table size: it is $O(\log n)$, where $n$ is the number of records in the table.

Disk Star Schema insertion time is the sum of all dimension table insert times, $O\left(\sum_{1 \le i \le l} \log n_i\right)$, where $l$ is the number of attributes in the database and $n_i$ is the number of values for attribute $i$.

Can’t fill dimension tables fast enough!
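To get a feel for how the sum of logarithms grows, the following Python sketch evaluates it for some made-up dimension table sizes. It counts abstract index steps using a base-2 logarithm, which is an assumption (the base only changes a constant factor); the sizes are illustrative, not measurements from the talk.

```python
import math


def disk_star_insert_cost(dimension_sizes):
    """Sum of log2(n_i) over all dimension tables.

    A unitless proxy for index steps per inserted record, not a time in seconds.
    """
    return sum(math.log2(n) for n in dimension_sizes if n > 0)


if __name__ == "__main__":
    # Hypothetical table sizes: five dimension tables growing together.
    for n in (1_000, 1_000_000, 1_000_000_000):
        cost = disk_star_insert_cost([n] * 5)
        print(f"5 tables of {n:>13,} rows: about {cost:.0f} index steps per insert")
```

The cost per insert keeps rising as the tables fill, so a stream that never stops eventually outruns the inserts.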


Short Pause to Review Numbers

1,000,000,000 bits per second Ethernet NIC (1 Gbps)

700,000 observed flows per hour

450 GB per hour, 10.5 TB a day

All we can get is a snapshot analysis!


Consider 2 Database Schemas

Disk Star Schema

STREAM Star Schema


Stream Star Schema


Disk Star Schema

Nearly 1:1 Correspondence between string attributes and Dimension tables.


Disk Star Schema

Two kinds of tables: fact, dimension.
All string dimensions have dimension tables.
Minimize disk space.
Dimension tables can be large.
Long insert time = $O\left(\sum_{1 \le i \le l} \log n_i\right)$.
No string duplication.


Stream Star Schema

Many:1 Correspondence between string attributes and Dimension tables.


Stream Star Schema

Three kinds of tables: fact, dimension, string.
Few dimension tables.
Dimension tables are small.
Minimizes insertion time.
Insert time is constant.
Allow string duplication.
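The transcript does not show how the string table is organized, so the following Python sketch is one possible reading of the properties listed above, not the author's implementation. It assumes strings are appended to the string table without any duplicate lookup, which is what makes the insert constant-time and what permits duplication. The small dimension tables are left out, and every name here is hypothetical.

```python
class StreamStarStore:
    """Hypothetical append-only store: no per-insert search, duplicates allowed."""

    def __init__(self, string_columns):
        self.string_columns = tuple(string_columns)
        self.string_table = []  # append-only; the same string may appear many times
        self.fact_table = []    # each row holds positions into string_table

    def insert(self, record):
        """Append every string attribute and store its position.

        Each list append is amortized O(1), and the number of appends per
        record is fixed by the column count, so the cost does not depend on
        how many records are already stored.
        """
        refs = []
        for name in self.string_columns:
            self.string_table.append(record[name])
            refs.append(len(self.string_table) - 1)
        row = tuple(refs)
        self.fact_table.append(row)
        return row


if __name__ == "__main__":
    store = StreamStarStore(["sender", "subject"])
    store.insert({"sender": "alice", "subject": "hi"})
    store.insert({"sender": "alice", "subject": "hi"})
    print(store.fact_table)    # two rows pointing at separate copies
    print(store.string_table)  # "alice" and "hi" each stored twice
```

The trade is the reverse of the Disk Star Schema: more storage is spent on repeated strings in exchange for an insert path with no lookups.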


Side x Side Comparison

Disk Star Schema: Old, Slow

Stream Star Schema: New, Fast


Test Results


The magnified area is different because I measured the insert time for (1, 10, 100) as opposed to (1000, 2000, 3000) streams.

The magnified area is different because of how MySQL works. I can only present a hypothesis since I don’t have the MySQL source code. But I suspect that MySQL is optimized for fewer than 100 streams for this problem.


Conclusion

The Stream Star Schema processes data streams in real time, at rates up to gigabits per second.

Stream Star insertion time is O(1).


Hope

Detection

Forensics

RFID


There’s data flow


And then there’s DATA FLOW!


Disk Star Schema handles 3 million flows per hour, about this much.

The Stream Star Schema handles 113 million flows per hour!

Nearly 40x Faster!
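The speedup follows directly from the two throughput figures. The short Python check below also converts each figure into an average time budget per inserted flow; the function names are illustrative.

```python
def speedup(new_per_hour: float, old_per_hour: float) -> float:
    """Ratio of the two throughputs."""
    return new_per_hour / old_per_hour


def microseconds_per_flow(flows_per_hour: float) -> float:
    """Average time available per inserted flow at the given throughput."""
    return 3600 * 1_000_000 / flows_per_hour


DISK_STAR = 3_000_000      # flows per hour, from the slide
STREAM_STAR = 113_000_000  # flows per hour, from the slide

if __name__ == "__main__":
    print(f"speedup: {speedup(STREAM_STAR, DISK_STAR):.1f}x")
    print(f"Disk Star:   {microseconds_per_flow(DISK_STAR):,.0f} microseconds per flow")
    print(f"Stream Star: {microseconds_per_flow(STREAM_STAR):,.1f} microseconds per flow")
```

The ratio comes out a little under 38, which the slide rounds to "nearly 40x".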


For The Future

Implement the Stream Star Schema in the Cloud.

Use multiple Stream Star Schema compute nodes to handle an unbounded stream. Storage could be handled similarly to Amazon S3.
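How the stream would be divided among nodes is not specified in the talk. One simple possibility, shown here purely as an assumption, is to hash each flow's key so that all records of a flow land on the same node. The function name and the flow-key format are hypothetical; `hashlib.sha256` is used because Python's built-in `hash()` for strings varies between runs.

```python
import hashlib


def node_for_flow(flow_key: str, node_count: int) -> int:
    """Pick a node index in [0, node_count) from a stable hash of the flow key."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    digest = hashlib.sha256(flow_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


if __name__ == "__main__":
    keys = [f"10.0.0.{i}:40000->10.0.0.9:25/tcp" for i in range(1, 9)]
    for key in keys:
        print(key, "-> node", node_for_flow(key, 4))
```

Plain modulo hashing reshuffles most flows when the node count changes; a production design would likely want consistent hashing or a similar scheme, which is outside the scope of this sketch.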


The Stream Star Schema fully supports the analysis of high-speed data streams, thus enabling security applications and forensic processing.