olap with spark and cassandra
![Page 1: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/1.jpg)
OLAP WITH SPARK AND CASSANDRA
EVAN CHAN
JULY 2014
![Page 2: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/2.jpg)
WHO AM I?
- Principal Engineer, Socrata, Inc.
- @evanfchan
- http://github.com/velvia
- Creator of Spark Job Server
![Page 3: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/3.jpg)
WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE PEOPLE.
data.edmonton.ca finances.worldbank.org data.cityofchicago.org data.seattle.gov data.oregon.gov data.wa.gov www.metrochicagodata.org data.cityofboston.gov info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov data.nola.gov data.illinois.gov data.colorado.gov data.austintexas.gov data.undp.org www.opendatanyc.com data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it data.montgomerycountymd.gov data.cityofnewyork.us data.acgov.org data.baltimorecity.gov data.energystar.gov data.somervillema.gov data.maryland.gov data.taxpayer.net bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
![Page 4: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/4.jpg)
WE ARE SWIMMING IN DATA!
![Page 5: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/5.jpg)
BIG DATA AT OOYALA
- 2.5 billion analytics pings a day = almost a trillion events a year.
- Roll up tables - 30 million rows per day
![Page 6: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/6.jpg)
BIG DATA AT SOCRATA
- Hundreds of datasets, each one up to 30 million rows
- Customer demand for billion row datasets
![Page 7: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/7.jpg)
HOW CAN WE ALLOW CUSTOMERS TO QUERY A YEAR'S WORTH OF DATA?
- Flexible - complex queries included
  - Sometimes you can't denormalize your data enough
- Fast - interactive speeds
![Page 8: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/8.jpg)
RDBMS? POSTGRES?
- Start hitting latency limits at ~10 million rows
- No robust and inexpensive solution for querying across shards
- No robust way to scale horizontally
- Complex and expensive to improve performance (e.g. rollup tables)
![Page 9: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/9.jpg)
OLAP CUBES?
- Materialize summary for every possible combination
- Too complicated and brittle
- Takes forever to compute
- Explodes storage and memory
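A rough sketch of why "every possible combination" explodes. This is plain counting, not taken from the talk: with n dimensions there are 2^n possible GROUP BY column subsets, and the cell count is bounded by the product of each dimension's cardinality plus one rolled-up "ALL" value. The object and method names are hypothetical.

```scala
object CubeSize {
  // Number of distinct GROUP BY column subsets for a cube over n dimensions.
  def groupingSets(dimensions: Int): BigInt = BigInt(2).pow(dimensions)

  // Upper bound on materialized cells: each dimension contributes its
  // cardinality plus one extra "ALL" (rolled-up) value.
  def maxCells(cardinalities: Seq[Long]): BigInt =
    cardinalities.map(c => BigInt(c) + 1).product
}
```

For example, `CubeSize.groupingSets(10)` is 1024 summaries to maintain, before counting the cells inside each one.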
![Page 10: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/10.jpg)
When in doubt, use brute force
- Ken Thompson
![Page 11: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/11.jpg)
![Page 12: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/12.jpg)
CASSANDRA
- Horizontally scalable
- Very flexible data modelling (lists, sets, custom data types)
- Easy to operate
- No fear of number of rows or documents
- Best of breed storage technology, huge community
- BUT: Simple queries only
![Page 13: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/13.jpg)
APACHE SPARK
- Horizontally scalable, in-memory queries
- Functional Scala transforms - map, filter, groupBy, sort, etc.
- SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy!
- THE hottest big data platform, huge community, leaving Hadoop in the dust
- Developers love it
![Page 14: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/14.jpg)
SPARK PROVIDES THE MISSING FAST, DEEP ANALYTICS PIECE OF CASSANDRA!
![Page 15: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/15.jpg)
INTEGRATING SPARK AND CASSANDRA
Scala solutions:
- Datastax integration: https://github.com/datastax/cassandra-driver-spark (CQL-based)
- Calliope
![Page 16: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/16.jpg)
A bit more work:
- Use traditional Cassandra client with RDDs
- Use an existing InputFormat, like CqlPagedInputFormat
![Page 17: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/17.jpg)
EXAMPLE CUSTOM INTEGRATION USING ASTYANAX

    val cassRDD = sc.parallelize(rowkeys)
      .flatMap { rowkey => columnFamily.get(rowkey).execute().asScala }
![Page 18: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/18.jpg)
A SPARK AND CASSANDRA OLAP ARCHITECTURE
![Page 19: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/19.jpg)
SEPARATE STORAGE AND QUERY LAYERS
- Combine best of breed storage and query platforms
- Take full advantage of evolution of each
- Storage handles replication for availability
- Query can replicate data for scaling read concurrency - independent!
![Page 20: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/20.jpg)
SCALE NODES, NOT DEVELOPER TIME!!
![Page 21: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/21.jpg)
KEEPING IT SIMPLE
- Maximize row scan speed
- Columnar representation for efficiency
- Compressed bitmap indexes for fast algebra
- Functional transforms for easy memoization, testing, concurrency, composition
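A minimal sketch of the "bitmap indexes for fast algebra" idea, assuming one bitmap per distinct column value. It uses the standard library's uncompressed `BitSet` as a stand-in for a compressed bitmap; the talk does not say which implementation was used. The `district` column and all names here are hypothetical.

```scala
import scala.collection.immutable.BitSet

object BitmapIndexSketch {
  // Build one bitmap per distinct value: bit i is set when row i holds that value.
  def buildIndex(column: IndexedSeq[String]): Map[String, BitSet] =
    column.zipWithIndex
      .groupBy { case (value, _) => value }
      .map { case (value, hits) => value -> BitSet(hits.map(_._2): _*) }

  // Two hypothetical columns over the same four rows.
  val crimeType = Vector("Burglary", "Theft", "Theft", "Burglary")
  val district  = Vector("North", "North", "South", "South")

  val byType     = buildIndex(crimeType)
  val byDistrict = buildIndex(district)

  // WHERE type = 'Theft' AND district = 'North' becomes a bitmap intersection.
  val theftInNorth: BitSet = byType("Theft") & byDistrict("North")

  // WHERE type = 'Burglary' OR district = 'South' becomes a bitmap union.
  val burglaryOrSouth: BitSet = byType("Burglary") | byDistrict("South")
}
```

The resulting bitmaps hold matching row positions, so a filter can be resolved before any column values are read.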
![Page 22: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/22.jpg)
SPARK AS CASSANDRA'S CACHE
![Page 23: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/23.jpg)
EVEN BETTER: TACHYON OFF-HEAP CACHING
![Page 24: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/24.jpg)
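INITIAL ATTEMPTS

    // Each row is an untyped Seq[Any]: (crime type, address, count)
    val rows = Seq(
      Seq("Burglary", "19xx Hurston", 10),
      Seq("Theft", "55xx Floatilla Ave", 5)
    )

    // Key each row by crime type, group, then sum the count column per group.
    // The count has to be cast back from Any to Int.
    sc.parallelize(rows)
      .map { values => (values(0), values) }
      .groupByKey
      .mapValues(_.map(_(2).asInstanceOf[Int]).sum)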
![Page 25: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/25.jpg)
- No existing generic query engine for Spark when we started (Shark was in infancy, had no indexes, etc.), so we built our own
- For every row, need to extract out needed columns
- Ability to select arbitrary columns means using Seq[Any], no type safety
- Boxing makes integer aggregation very expensive and memory inefficient
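The cost described above can be seen without Spark. The sketch below is plain Scala, not the talk's engine, and its names are hypothetical. It sums the same count column twice: once from `Seq[Any]` rows, where every value is a boxed object that must be cast back, and once from a primitive `Array[Int]`.

```scala
object BoxingSketch {
  // Row-oriented: each cell is an Any, so the Int counts are stored boxed.
  val rows: Seq[Seq[Any]] = Seq(
    Seq("Burglary", "19xx Hurston", 10),
    Seq("Theft", "55xx Floatilla Ave", 5)
  )

  // Pull one column out of every row and cast it back to Int.
  // A wrong index or type fails only at runtime.
  def sumRows(data: Seq[Seq[Any]], column: Int): Int =
    data.map(row => row(column).asInstanceOf[Int]).sum

  // Column-oriented: the same counts held as a primitive Array[Int].
  val counts: Array[Int] = Array(10, 5)

  // A tight loop over primitives, with no casts and no per-value objects.
  def sumColumn(values: Array[Int]): Int = {
    var total = 0
    var i = 0
    while (i < values.length) {
      total += values(i)
      i += 1
    }
    total
  }
}
```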
![Page 26: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/26.jpg)
COLUMNAR STORAGE AND QUERYING
![Page 27: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/27.jpg)
The traditional row-based data storage approach is dead
- Michael Stonebraker
![Page 28: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/28.jpg)
TRADITIONAL ROW-BASED STORAGE
Same layout in memory and on disk:

| Name | Age |
|------|-----|
| Barak | 46 |
| Hillary | 66 |

Each row is stored contiguously. All columns in row 2 come after row 1.
![Page 29: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/29.jpg)
COLUMNAR STORAGE (MEMORY)
Name column

| 0 | 1 |
|---|---|
| 0 | 1 |

Dictionary: {0: "Barak", 1: "Hillary"}

Age column

| 0 | 1 |
|---|---|
| 46 | 66 |
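The Name column above is dictionary-encoded: each string is replaced by a small integer code, and the dictionary maps codes back to strings. A minimal sketch of that encoding, with codes assigned in order of first appearance (an assumption; the slide does not say how codes are assigned). `DictColumn` and `encode` are hypothetical names.

```scala
object DictionaryEncoding {
  // A string column stored as integer codes plus a code -> string dictionary.
  final case class DictColumn(codes: Array[Int], dictionary: Vector[String]) {
    def decode(row: Int): String = dictionary(codes(row))
  }

  def encode(values: Seq[String]): DictColumn = {
    // distinct keeps first-occurrence order, so codes are 0, 1, 2, ... as values appear.
    val dictionary = values.distinct.toVector
    val codeOf: Map[String, Int] = dictionary.zipWithIndex.toMap
    DictColumn(values.map(codeOf).toArray, dictionary)
  }
}
```

Encoding `Seq("Barak", "Hillary")` gives codes 0 and 1 with the dictionary shown on the slide. Repeated names would reuse an existing code instead of storing the string again.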
![Page 30: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/30.jpg)
COLUMNAR STORAGE (CASSANDRA)
Review: each physical row in Cassandra (e.g. a "partition key") stores its columns together on disk.

Schema CF

| Rowkey | Type |
|--------|------|
| Name | StringDict |
| Age | Int |

Data CF

| Rowkey | 0 | 1 |
|--------|---|---|
| Name | 0 | 1 |
| Age | 46 | 66 |
![Page 31: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/31.jpg)
ADVANTAGES OF COLUMNAR STORAGE
- Compression
  - Dictionary compression - HUGE savings for low-cardinality string columns
  - RLE
- Reduce I/O
  - Only columns needed for query are loaded from disk
- Can keep strong types in memory, avoid boxing
- Batch multiple rows in one cell for efficiency
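RLE (run-length encoding) stores each run of repeated values once, along with its length. That pays off when a column is sorted or has long stretches of the same value. A minimal sketch over an integer column, such as dictionary codes; the function names are hypothetical and this is not the talk's storage format.

```scala
object RunLengthEncoding {
  // Each run is (value, number of consecutive repetitions).
  def encode(values: Seq[Int]): Vector[(Int, Int)] =
    values.foldLeft(Vector.empty[(Int, Int)]) { (runs, v) =>
      runs.lastOption match {
        // Same value as the current run: extend it by one.
        case Some((last, n)) if last == v => runs.updated(runs.length - 1, (last, n + 1))
        // Different value (or first element): start a new run.
        case _                            => runs :+ ((v, 1))
      }
    }

  // Expand the runs back into the original sequence.
  def decode(runs: Seq[(Int, Int)]): Vector[Int] =
    runs.flatMap { case (v, n) => Vector.fill(n)(v) }.toVector
}
```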
![Page 32: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/32.jpg)
ADVANTAGES OF COLUMNAR QUERYING
- Cache locality for aggregating column of data
- Take advantage of CPU/GPU vector instructions for ints / doubles
- Avoid row-ifying until last possible moment
- Easy to derive computed columns
- Use vector data / linear math libraries
![Page 33: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/33.jpg)
COLUMNAR QUERY ENGINE VS ROW-BASED IN SCALA
- Custom RDD of column-oriented blocks of data
- Uses ~10x less heap
- 10-100x faster for group by's on a single node
- Scan speed in excess of 150M rows/sec/core for integer aggregations
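A minimal sketch of a columnar group-by, assuming the grouping column is dictionary-encoded so that each group key is a small integer. It is plain single-threaded Scala rather than the custom RDD described above, the sample values are hypothetical, and it makes no claim to reproduce the quoted speeds.

```scala
object ColumnarGroupBy {
  // Sum `values` per group, where groupCodes(i) is the dictionary code of row i.
  // The accumulator is a primitive array indexed directly by code.
  def sumByCode(groupCodes: Array[Int], values: Array[Int], dictionarySize: Int): Array[Long] = {
    require(groupCodes.length == values.length, "columns must have the same row count")
    val totals = new Array[Long](dictionarySize)
    var i = 0
    while (i < values.length) {
      totals(groupCodes(i)) += values(i)
      i += 1
    }
    totals
  }

  // Hypothetical block of four rows: a dictionary-coded type column and a count column.
  val dictionary = Vector("Burglary", "Theft")
  val typeCodes  = Array(0, 1, 1, 0)
  val counts     = Array(10, 5, 7, 3)

  // Labels are attached only at the very end, once per group rather than once per row.
  val result: Map[String, Long] =
    dictionary.zip(sumByCode(typeCodes, counts, dictionary.size)).toMap
}
```

The inner loop touches two primitive arrays and one small accumulator array, which is the cache-friendly access pattern the columnar approach relies on.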
![Page 34: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/34.jpg)
SO, GREAT, OLAP WITH CASSANDRA AND SPARK. NOW WHAT?
![Page 35: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/35.jpg)
![Page 36: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/36.jpg)
DATASTAX: CASSANDRA SPARK INTEGRATION
- Datastax Enterprise now comes with HA Spark
  - HA master, that is.
- cassandra-driver-spark
![Page 37: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/37.jpg)
SPARK SQL
- Appeared with Spark 1.0
- In-memory columnar store
- Can read from Parquet now; Cassandra integration coming
- Querying is not column-based (yet)
- No indexes
- Write custom functions in Scala .... take that Hive UDFs!!
- Integrates well with MLBase, Scala/Java/Python
![Page 38: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/38.jpg)
WORK STILL NEEDED
- Indexes
- Columnar querying for fast aggregation
- Efficient reading from columnar storage formats
![Page 39: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/39.jpg)
GETTING TO A BILLION ROWS / SEC
- Benchmarked at 20 million rows/sec per core: GROUP BY on two columns, aggregating two more columns.
- 50 cores needed for parallel localized grouping throughput of 1 billion rows/sec
- ~5-10 additional cores budget for distributed exchange and grouping of locally aggregated groups, depending on result size and network topology

Above is a custom solution, NOT Spark SQL.

Look for integration with Spark SQL for a proper solution
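The core count on this slide follows from simple division, assuming scan throughput scales linearly with cores (an idealization). A small sketch of the arithmetic using the slide's figures; the names are hypothetical.

```scala
object CoreBudget {
  val rowsPerSecPerCore = 20000000L   // benchmarked per-core rate from the slide
  val targetRowsPerSec  = 1000000000L // one billion rows per second

  // Ceiling division: cores needed for the local scan-and-group phase.
  val scanCores: Long = (targetRowsPerSec + rowsPerSecPerCore - 1) / rowsPerSecPerCore

  // Add the slide's 5-10 core budget for exchanging and merging partial groups.
  val totalCoresLow: Long  = scanCores + 5
  val totalCoresHigh: Long = scanCores + 10
}
```

That gives 50 scan cores and a total budget of roughly 55 to 60 cores.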
![Page 40: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/40.jpg)
LESSONS
- Extremely fast distributed querying for these use cases
  - Data doesn't change much (and only bulk changes)
  - Analytical queries for subset of columns
  - Focused on numerical aggregations
  - Small numbers of group bys, limited network interchange of data
- Spark a bit rough around edges, but evolving fast
- Concurrent queries is a frontier with Spark. Use additional Spark contexts.
![Page 41: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/41.jpg)
THANK YOU!
![Page 42: Olap with Spark and Cassandra](https://reader034.vdocuments.site/reader034/viewer/2022051323/54b700e94a7959943a8b45b9/html5/thumbnails/42.jpg)
SOME COLUMNAR ALTERNATIVES
- MonetDB and Infobright - true columnar stores (storage + querying)
- cstore_fdw for Postgres - columnar storage only
- VoltDB - in-memory distributed columnar database (but need to recompile for DDL changes)
- Google BigQuery - columnar cloud database, Dremel based
- Amazon Redshift