analyzing semi-structured data at volume in the cloud

20
Analyzing Semi-Structured Data at Volume in the Cloud Kevin Bair Solution Architect [email protected]

Upload: robert-dempsey

Post on 13-Apr-2017

565 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: Analyzing Semi-Structured Data At Volume In The Cloud

Analyzing Semi-Structured Data at Volume in the Cloud

Kevin Bair Solution Architect

[email protected]

Page 2: Analyzing Semi-Structured Data At Volume In The Cloud

Topics this presentation will cover

1.  Structured vs. Semi-Structured

2.  ETL / Data Pipeline Architecture 3.  Analytics on Semi-Structured Data

Clickstream Demo

4.  Analyzing Structured with Semi-Structured Data

Twitter feed Demo

5.  Time permitting…..Cloud Big Data / Data Warehousing

2

Page 3: Analyzing Semi-Structured Data At Volume In The Cloud

3

Surge in cloud spending and supporting technology (IDC)

Of workloads will be processed In cloud data centers (Cisco)

Data in the cloud today is expected to grow in the next two years. (Gigaom)

Today’s data: big, complex, moving to cloud

Page 4: Analyzing Semi-Structured Data At Volume In The Cloud

Structured data and Semi-Structured data

•  Transactional data •  Relational •  Fixed schema •  OLTP / OLAP

•  Machine-generated •  Non-relational •  Varying schema •  Most common in cloud

environments

Page 5: Analyzing Semi-Structured Data At Volume In The Cloud

What does Semi Structured mean?

• Data that may be of any type • Data that is variable in length (arrays) • Structure that can rapidly and unpredictably

change • Usually Self Describing

• Examples •  XML •  AVRO •  JSON

Page 6: Analyzing Semi-Structured Data At Volume In The Cloud

XML Example <?xml version="1.0" encoding="UTF-8"?> <breakfast_menu>

<food> <name>Belgian Waffles</name> <price>$5.95</price> <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description> <calories>650</calories> </food> <food> <name>Strawberry Belgian Waffles</name> <price>$7.95</price> <description>Light Belgian waffles covered with strawberries and whipped cream</description> <calories>900</calories> </food> <food> <name>Berry-Berry Belgian Waffles</name> <price>$8.95</price> <description>Light Belgian waffles covered with an assortment of fresh berries and

whipped cream</description> <calories>900</calories> </food>

</breakfast_menu>

Page 7: Analyzing Semi-Structured Data At Volume In The Cloud

JSON Example {          "custkey":  "450002",          "useragent":  {                  "devicetype":  "pc",                  "experience":  "browser",                  "platform":  "windows"          },          "pagetype":  "home",          "productline":  "television",          "customerprofile":  {                  "age":  20,                  "gender":  "male",                  "customerinterests":  [                          "movies",                          "fashion",                          "music"                  ]          }  }  

Page 8: Analyzing Semi-Structured Data At Volume In The Cloud

Avro Example

Schema

Data

}

} JSON

Binary

Page 9: Analyzing Semi-Structured Data At Volume In The Cloud

Why is this so hard for a traditional Relational DBMS?

• Pre-defined Schema

• Store in Character Large Object (CLOB) data type

• Constantly Changing

•  Inefficient to Query

Page 10: Analyzing Semi-Structured Data At Volume In The Cloud

Data Warehousing

•  Complex: manage hardware, data

distribution, indexes, …

•  Limited elasticity: forklift upgrades, data redistribution, downtime

•  Costly: overprovisioning, significant

care & feeding

Hadoop

•  Complex: specialized skills, new tools

•  Limited elasticity: data

redistribution, resource contention •  Not a data warehouse: batch-

oriented, limited optimization,

incomplete security

Current architectures can’t keep up

10

Page 11: Analyzing Semi-Structured Data At Volume In The Cloud

Source

Website Logs

Operational Systems

External Providers

Stream Data

Stage

S3 • 10TB

Data Lake

Hadoop • 30 TB

Stage

S3 • 5 TB • Summary

EDW

MPP • 10 TB Disk

Data Pipeline / Data Lake Architecture – “ETL”

Page 12: Analyzing Semi-Structured Data At Volume In The Cloud

ETL vs ELT for Big Data • Think more strategically about file formats, size,

storage methods, standards • Processing Power – Tools vs “Services” • Pipeline – Where should the analysis occur? • Platform

•  Unlimited Processing Power •  Contention for resources •  Support SQL for both Schema-on-write and Schema-

on-read with full “indexing” for Structured / Semi-Structured

•  Compress, Clone metadata, don’t replicate…

Page 13: Analyzing Semi-Structured Data At Volume In The Cloud

Source Website Logs

Operational Systems

External Providers

Stream Data

Stage S3 • 10TB

EDW Snowflake • 2 TB Disk

Data Pipeline / Snowflake Architecture – “ELT”

Page 14: Analyzing Semi-Structured Data At Volume In The Cloud

Demo Scenarios • Clickstream Analysis (load JSON, multi-table insert)

•  Which Product Category is most clicked on? •  Which Product line does the customer self identify as

having the most interest in?

• Twitter Feed (Join Structured and Semi-Structured) •  From our twitter campaign, is there a correlation

between twitter volume and sales?

Page 15: Analyzing Semi-Structured Data At Volume In The Cloud

Clickstream Example {          "custkey":  "450002",          "useragent":  {                  "devicetype":  "pc",                  "experience":  "browser",                  "platform":  "windows"          },          "pagetype":  "home",          "productline":  "none",          "customerprofile":  {                  "age":  20,                  "gender":  "male",                  "customerinterests":  [                          "movies",                          "fashion",                          "music"                  ]          }  }  

Page 16: Analyzing Semi-Structured Data At Volume In The Cloud

Relational Processing of Semi-Structured Data

16

1. Variant data type compresses storage of semi-structured data

2. Data is analyzed during load to discern repetitive attributes within the hierarchy

3. Repetitive attributes are columnar compressed and statistics are collected for relational query optimization

4. SQL extensions enable relational queries against both semi-structured and structured data

Page 17: Analyzing Semi-Structured Data At Volume In The Cloud

FLATTEN() in Snowflake SQL (Removing one level of nesting)

SELECT S.fullrow:fullName, t.value:name, t.value:age FROM json_data_table as S, TABLE(FLATTEN(S.fullrow,'children')) t WHERE s.fullrow:fullName = 'Mike Jones’ AND t.value:age::integer > 6 ;

FLATTEN() Converts a repeated field into a set of rows:

Page 18: Analyzing Semi-Structured Data At Volume In The Cloud

What makes Snowflake unique for handling Semi-Structured Data?

•  Compression •  Encryption / Role Based Authentication •  Shredding •  History/Results •  Clone •  Time Travel •  Flatten •  Regexp •  No Contention •  No Tuning •  Infinitely scalable •  SQL based with extremely high performance

Page 19: Analyzing Semi-Structured Data At Volume In The Cloud

z

Map-Reduce Jobs

One Platform for all Business Data

Data Sink Structured Storage

HDFS

Relational Databases

Snowflake ü One System

ü One Common Skillset

ü Faster/Less Costly Data Conversion

ü For both Structured and Semi-Structured Business Data

Apple 101.12 250 FIH-2316

Pear 56.22 202 IHO-6912

Orange 98.21 600 WHQ-6090

Structured data { "firstName": "John",

"lastName": "Smith", "height_cm": 167.64, "address": { "streetAddress": "21 2nd Street", "city": "New York", "state": "NY", "postalCode": "10021-3100" },

Semi-structured data

Other Systems ü Multiple Systems

ü Specialized Skillset

ü Slower/More Costly Data Conversion

Page 20: Analyzing Semi-Structured Data At Volume In The Cloud

THANK YOU!