Lambda at Weather Scale Robbie Strickland

Upload: spark-summit

Post on 21-Apr-2017

TRANSCRIPT

Page 1: Lambda at Weather Scale by Robbie Strickland

Lambda at Weather Scale

Robbie Strickland

Page 2: Lambda at Weather Scale by Robbie Strickland

Who Am I?

Robbie Strickland
Director of Engineering, [email protected]
@rs_atl

An IBM Business

Page 3: Lambda at Weather Scale by Robbie Strickland

Who Am I?

• Contributor to C* community since 2010

• DataStax MVP 2014/15

• Author, Cassandra High Availability

• Founder, ATL Cassandra User Group

Page 8: Lambda at Weather Scale by Robbie Strickland

About TWC

~30 billion API requests per day

~120 million active mobile users

#3 most active mobile user base

~360 PB of traffic daily

Most weather data comes from us

Page 10: Lambda at Weather Scale by Robbie Strickland

Use Case

Billions of events per day (~1.3M per sec)

• Web/mobile beacons
• Logs
• Weather conditions + forecasts
• etc.

Keep data forever

Page 13: Lambda at Weather Scale by Robbie Strickland

Use Case

Efficient batch + streaming analysis

Self-serve data science

BI / visualization tool support

Page 14: Lambda at Weather Scale by Robbie Strickland

Architecture

Page 15: Lambda at Weather Scale by Robbie Strickland

Attempt[0] Architecture

[Architecture diagram; labels as extracted: Streaming Sources, Events, 3rd Party, RESTful Enqueue service, Kafka, Batch Sources, Other DBs, S3, ETL, Custom Ingestion Pipeline, Stream Processing, Storage and Processing, Data Access, SQL, Streaming, Consumers, Operational Analytics, Business Analytics, Executive Dashboards, Data Discovery, Data Science, 3rd Party System Integration]

Page 16: Lambda at Weather Scale by Robbie Strickland

Attempt[0] Data Model

CREATE TABLE events (
    timebucket bigint,
    timestamp bigint,
    eventtype varchar,
    eventid varchar,
    platform varchar,
    userid varchar,
    version int,
    appid varchar,
    useragent varchar,
    eventdata varchar,
    tags set<varchar>,
    devicedata map<varchar, varchar>,
    PRIMARY KEY ((timebucket, eventtype), timestamp, eventid)
) WITH CACHING = 'none'
  AND COMPACTION = { 'class' : 'DateTieredCompactionStrategy' };

• Event payload == schema-less JSON
• Partitioned by time bucket + type
• Time-series data good fit for DTCS
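The partition key in this table pairs a time bucket with the event type. The sketch below shows one way a writer might derive that key. It assumes hourly buckets and epoch-millisecond timestamps, neither of which is stated on the slide, and `BUCKET_MS`, `time_bucket`, and `partition_key` are hypothetical names.

```python
BUCKET_MS = 60 * 60 * 1000  # assumed bucket width: one hour, in milliseconds


def time_bucket(timestamp_ms, bucket_ms=BUCKET_MS):
    """Round an epoch-millisecond timestamp down to the start of its bucket."""
    return timestamp_ms - (timestamp_ms % bucket_ms)


def partition_key(timestamp_ms, event_type):
    """Build the (timebucket, eventtype) pair used as the composite partition key."""
    return (time_bucket(timestamp_ms), event_type)
```

Events of the same type that fall in the same bucket share a partition, so the bucket width controls how large any single partition can grow.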

Page 27: Lambda at Weather Scale by Robbie Strickland

Attempt[0] tl;dr

• C* everywhere
• Streaming data via custom ingest process
• Kafka backed by RESTful service
• Batch data via Informatica
• Spark SQL through ODBC
• Schema-less event payload
• Date-tiered compaction

Page 33: Lambda at Weather Scale by Robbie Strickland

Attempt[0] Lessons

• Batch loading large data sets into C* is silly
• … and expensive
• … and using Informatica to do it is SLOW
• Kafka + REST services == unnecessary
• No viable open source C* Hive driver
• DTCS is broken (see CASSANDRA-9666)

Page 37: Lambda at Weather Scale by Robbie Strickland

Attempt[0] Lessons

Schema-less == bad:

• Must parse JSON to extract key data
• Expensive to analyze by event type
• Cannot tune by event type

Page 38: Lambda at Weather Scale by Robbie Strickland

Attempt[1] Architecture

[Architecture diagram; labels as extracted: Streaming Sources, Events, 3rd Party, Amazon SQS, Batch Sources, Other DBs, S3, ETL, Custom Ingestion Pipeline, Stream Processing, Data Lake, Long Term Raw Storage, Short Term Storage and Big Data Processing, Data Access, SQL, Streaming, Consumers, Operational Analytics, Business Analytics, Executive Dashboards, Data Discovery, Data Science, 3rd Party System Integration]

Page 44: Lambda at Weather Scale by Robbie Strickland

Attempt[1] Data Model

• Each event type gets its own table
• Tables individually tuned based on workload
• Schema applied at ingestion:
    • We’re reading everything anyway
    • Makes subsequent analysis much easier
    • Allows us to filter junk early
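Applying a schema at ingestion could look roughly like the sketch below: each raw JSON payload is parsed once, checked against a per-type schema, and either turned into a typed row or dropped. The event types, field names, and the `SCHEMAS` / `apply_schema` names are hypothetical and only illustrate the idea; the talk does not show the actual pipeline code.

```python
import json

# Hypothetical per-event-type schemas: required field name -> expected Python type.
SCHEMAS = {
    "pageview": {"userid": str, "url": str, "timestamp": int},
    "forecast": {"locationid": str, "temp_c": float, "timestamp": int},
}


def apply_schema(event_type, raw_payload):
    """Parse a JSON payload and return a typed row, or None if it is junk."""
    schema = SCHEMAS.get(event_type)
    if schema is None:
        return None  # unknown event type: filter it out early
    try:
        payload = json.loads(raw_payload)
    except ValueError:
        return None  # malformed JSON
    if not isinstance(payload, dict):
        return None  # valid JSON, but not an object
    row = {}
    for field, expected in schema.items():
        value = payload.get(field)
        if not isinstance(value, expected):
            return None  # missing or wrongly typed field
        row[field] = value
    return row  # fields not named in the schema are dropped
```

Because every payload is already being read at ingest time, validating it here costs little extra and means downstream jobs never see malformed events.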

Page 48: Lambda at Weather Scale by Robbie Strickland

Attempt[1] tl;dr

Use C* for streaming data

• Rolling time window (TTL depends on type)
• Real-time access to events
• Data locality makes Spark jobs faster
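One way to get a rolling window whose length depends on the event type is to attach a TTL to each write. The sketch below builds an `INSERT ... USING TTL` statement per type. The retention values and the `RETENTION_DAYS`, `ttl_seconds`, and `insert_statement` names are assumptions for illustration; the slides do not say how the TTL is actually applied.

```python
# Hypothetical retention windows per event type, in days.
RETENTION_DAYS = {"pageview": 30, "log": 7}
DEFAULT_RETENTION_DAYS = 14  # assumed fallback for unlisted types


def ttl_seconds(event_type):
    """Return the retention window for an event type, in seconds."""
    return RETENTION_DAYS.get(event_type, DEFAULT_RETENTION_DAYS) * 86400


def insert_statement(table, columns, event_type):
    """Build a parameterized CQL INSERT that expires rows after the type's TTL."""
    cols = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    return "INSERT INTO {} ({}) VALUES ({}) USING TTL {}".format(
        table, cols, marks, ttl_seconds(event_type)
    )
```

Since each event type has its own table, a table-level default TTL would be an alternative to setting it per statement.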

Page 56: Lambda at Weather Scale by Robbie Strickland

Attempt[1] tl;dr

Everything else in S3

• Batch data loads (mostly logs)
• Daily C* backups
• Stored as Parquet
• Cheap, scalable long-term storage
• Easy access from Spark
• Easy to share internally & externally
• Open source Hive support

Page 63: Lambda at Weather Scale by Robbie Strickland

Attempt[1] tl;dr

Kafka replaced by SQS:

• Scalable & reliable
• Already fronted by a RESTful interface
• Nearly free to operate (nothing to manage)
• Robust security model
• One queue per event type/platform
• Built-in monitoring

Page 69: Lambda at Weather Scale by Robbie Strickland

Attempt[1] tl;dr

DTCS replaced by Time-Window Compaction

• Developed by Jeff Jirsa at CrowdStrike
• Groups similar timestamps/expirations together
• Simply deletes expired sstables
• Improved stability & throughput
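The idea behind time-window compaction can be illustrated with a toy model: sstables are bucketed into fixed time windows, and a file whose contents have all expired can be dropped whole instead of being rewritten. This is a simplified sketch, not Cassandra's implementation; the `SSTable` record, its fields, and both function names are hypothetical, and bucketing by the newest timestamp in each file is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    name: str
    max_timestamp: int   # newest write in the file, epoch seconds
    max_expiration: int  # when the last cell in the file expires, epoch seconds


def group_by_window(sstables, window_seconds):
    """Bucket sstables by the time window that contains their newest write."""
    windows = defaultdict(list)
    for table in sstables:
        start = table.max_timestamp - (table.max_timestamp % window_seconds)
        windows[start].append(table)
    return dict(windows)


def fully_expired(sstables, now):
    """Return the sstables whose every cell has expired and can be deleted outright."""
    return [table for table in sstables if table.max_expiration <= now]
```

Because data written around the same time also expires around the same time, whole files age out together, which is what makes the cheap "just delete the file" path possible.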

Page 71: Lambda at Weather Scale by Robbie Strickland

Fine Print

Use C* >= 2.1.8

• CASSANDRA-9637 - fixes Spark input split computation
• CASSANDRA-9549 - fixes memory leak
• CASSANDRA-9436 - exposes rpc/broadcast addresses for Spark/cloud environments

Version incompatibilities abound (check sbt file for Spark-Cassandra connector)

Page 74: Lambda at Weather Scale by Robbie Strickland

Fine Print

Two main Spark clusters:

• Co-located with C* for heavy analysis
    • Predictable load
    • Efficient C* access
• Self-serve in same DC but not co-located
    • Unpredictable load
    • Favors mining S3 data
    • Isolated from production jobs

Page 75: Lambda at Weather Scale by Robbie Strickland

Data Modeling

Page 81: Lambda at Weather Scale by Robbie Strickland

Partitioning

• Opposite strategy from “normal” C* modeling
• Model for good parallelism
• … not for single-partition queries
• Avoid shuffling for most cases
• Shuffles occur when NOT grouping by partition key
• Partition for your most common grouping
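The shuffle point can be shown with a small pure-Python simulation (no Spark involved): when records are already partitioned by the grouping field, each partition can be aggregated on its own and nothing has to move. The function names, the CRC-based placement, and the record fields are all hypothetical stand-ins for what the partitioner does.

```python
import zlib
from collections import Counter


def partition_of(value, num_partitions):
    """Deterministically map a value to a partition index (stand-in for a real partitioner)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_partitions


def partition_by(records, field, num_partitions):
    """Distribute records across partitions according to the value of `field`."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[partition_of(record[field], num_partitions)].append(record)
    return partitions


def records_to_move(partitions, field):
    """Count records that would have to change partition in order to group by `field`."""
    total = len(partitions)
    return sum(
        1
        for index, part in enumerate(partitions)
        for record in part
        if partition_of(record[field], total) != index
    )


def count_per_partition(partitions, field):
    """Count values of `field` inside each partition independently, with no data movement."""
    return [Counter(record[field] for record in part) for part in partitions]
```

Grouping by the partitioning field moves zero records, so the per-partition counts are already final. Grouping by any other field generally means some records must cross partitions first, which is the shuffle the slide warns about.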

Page 84: Lambda at Weather Scale by Robbie Strickland

Secondary Indexes

• Useful for C*-level filtering
• Reduces Spark workload and RAM footprint
• Low cardinality is still the rule

Page 85: Lambda at Weather Scale by Robbie Strickland

Secondary Indexes (Client Access)

Page 86: Lambda at Weather Scale by Robbie Strickland

Secondary Indexes (with Spark)

Page 89: Lambda at Weather Scale by Robbie Strickland

Full-text Indexes

• Enabled via Stratio-Lucene custom index (https://github.com/Stratio/cassandra-lucene-index)
• Great for C*-side filters
• Same access pattern as secondary indexes

Page 90: Lambda at Weather Scale by Robbie Strickland

Full-text Indexes

CREATE CUSTOM INDEX email_index ON emails (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            id      : {type : "integer"},
            user    : {type : "string"},
            subject : {type : "text", analyzer : "english"},
            body    : {type : "text", analyzer : "english"},
            time    : {type : "date", pattern : "yyyy-MM-dd hh:mm:ss"}
        }
    }'
};

Page 91: Lambda at Weather Scale by Robbie Strickland

Full-text Indexes

SELECT * FROM emails WHERE lucene = '{
    filter : {type:"range", field:"time", lower:"2015-05-26 20:29:59"},
    query  : {type:"phrase", field:"subject", values:["test"]}
}';

SELECT * FROM emails WHERE lucene = '{
    filter : {type:"range", field:"time", lower:"2015-05-26 18:29:59"},
    query  : {type:"fuzzy", field:"subject", value:"thingy", max_edits:1}
}';
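When these queries are generated from application code, building the search document as a data structure and serializing it avoids hand-assembled strings. The sketch below is one possible helper. `lucene_predicate` and `select_emails` are hypothetical names, and it assumes the index accepts strictly quoted JSON in addition to the relaxed form shown on the slide.

```python
import json


def lucene_predicate(filter_clause=None, query_clause=None):
    """Serialize a search document and wrap it as a CQL string literal."""
    search = {}
    if filter_clause is not None:
        search["filter"] = filter_clause
    if query_clause is not None:
        search["query"] = query_clause
    body = json.dumps(search)
    # CQL string literals escape a single quote by doubling it.
    return "'" + body.replace("'", "''") + "'"


def select_emails(filter_clause=None, query_clause=None):
    """Build a SELECT against the emails table using the lucene pseudo-column."""
    return "SELECT * FROM emails WHERE lucene = {};".format(
        lucene_predicate(filter_clause, query_clause)
    )
```

For example, passing `{"type": "range", "field": "time", "lower": "2015-05-26 20:29:59"}` as the filter and `{"type": "phrase", "field": "subject", "values": ["test"]}` as the query yields a statement equivalent to the first query on the slide.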

Page 92: Lambda at Weather Scale by Robbie Strickland

Caution: WIDE ROWS

Page 95: Lambda at Weather Scale by Robbie Strickland

Wide Rows

• It only takes one to ruin your day
• Monitor cfstats for max partition bytes
• Use toppartitions to find hot keys
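Monitoring for wide rows can be automated by scanning captured `nodetool cfstats` output for each table's largest partition. The sketch below assumes the output contains `Table:` and `Compacted partition maximum bytes:` lines, as in 2.1-era releases; labels may differ in other versions. The 100 MB threshold and the function names are hypothetical.

```python
import re

TABLE_RE = re.compile(r"^\s*Table:\s*(\S+)")
MAX_RE = re.compile(r"^\s*Compacted partition maximum bytes:\s*(\d+)")


def max_partition_bytes(cfstats_text):
    """Map each table name to its reported maximum compacted partition size."""
    sizes = {}
    table = None
    for line in cfstats_text.splitlines():
        match = TABLE_RE.match(line)
        if match:
            table = match.group(1)
            continue
        match = MAX_RE.match(line)
        if match and table is not None:
            sizes[table] = int(match.group(1))
    return sizes


def wide_tables(cfstats_text, threshold_bytes=100 * 1024 * 1024):
    """Return the tables whose largest partition exceeds the (assumed) threshold."""
    sizes = max_partition_bytes(cfstats_text)
    return sorted(name for name, size in sizes.items() if size > threshold_bytes)
```

A table flagged here is a candidate for a closer look with toppartitions to find the offending keys.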

Page 99: Lambda at Weather Scale by Robbie Strickland

Avoid Nulls

• Nulls are deletes
• Deletes create tombstones
• Don’t write nulls!
• Beware of nulls in prepared statements
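A simple way to keep nulls out of writes is to drop absent columns before the statement is built, so nothing null is ever bound. The helper below is a minimal sketch; `insert_without_nulls` is a hypothetical name, and it returns the CQL text and parameters rather than calling any particular driver.

```python
def insert_without_nulls(table, row):
    """Build an INSERT that names only the columns with non-null values."""
    present = {column: value for column, value in row.items() if value is not None}
    if not present:
        raise ValueError("row has no non-null columns")
    columns = list(present)
    cql = "INSERT INTO {} ({}) VALUES ({})".format(
        table,
        ", ".join(columns),
        ", ".join("?" for _ in columns),
    )
    return cql, [present[column] for column in columns]
```

The trade-off is one statement shape per combination of populated columns, instead of a single prepared statement that binds nulls for the missing ones.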

Page 100: Lambda at Weather Scale by Robbie Strickland

Data Exploration

Page 101: Lambda at Weather Scale by Robbie Strickland

Data Warehouse Paradigm - Old

Ingest Model Transform Design

Visualize

Page 102: Lambda at Weather Scale by Robbie Strickland

Data Warehouse Paradigm - New

Ingest Explore Analyze Deploy

Visualize

Page 106: Lambda at Weather Scale by Robbie Strickland

Visualization

• Critical to understanding your data
• Reduced time to visualization
• … from >1 month to minutes (!!)
• Waterfall to agile

Page 110: Lambda at Weather Scale by Robbie Strickland

Zeppelin

• Open source Spark notebook
• Interpreters for Scala, Python, Spark SQL, CQL, Hive, Shell, & more
• Data visualizations
• Scheduled jobs

Page 111: Lambda at Weather Scale by Robbie Strickland

Zeppelin

Page 112: Lambda at Weather Scale by Robbie Strickland

Zeppelin

Page 113: Lambda at Weather Scale by Robbie Strickland

Zeppelin

Page 114: Lambda at Weather Scale by Robbie Strickland

Future Work

Page 117: Lambda at Weather Scale by Robbie Strickland

FiloDB

• Low latency time-series aggregations using Spark + Cassandra/in-memory storage
• Space efficient – similar to Parquet
• SQL queries using ODBC/JDBC

Page 120: Lambda at Weather Scale by Robbie Strickland

Direct to Parquet

• Stream to Parquet directly
• Eliminate interim storage
• Currently in R&D

Page 121: Lambda at Weather Scale by Robbie Strickland

We’re Hiring!

Robbie Strickland
[email protected]