Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data Analysis
DESCRIPTION
Presenter: Ben Vanberg, Senior Software Engineer at FullContact

Here at FullContact we have lots and lots of contact data. In particular, we have more than a billion profiles over which we would like to perform ad hoc data analysis. Much of this data resides in Cassandra, and we have many analytics MapReduce jobs that require us to iterate across terabytes of Cassandra data. To solve this problem we've implemented our own splittable input format, which allows us to quickly process large SSTables for downstream analytics.
TRANSCRIPT
#CassandraSummit 2014
A Journey
● Solving a problem for a specific use case
● Implementation
● Example Code
Person API
Goal: Analytics on Cassandra Data
● How many profile types?
● How many profiles have social data, and what type? (Facebook, Twitter, etc.)
● How many total social profiles of each type?
● Whatever!
Key Factors
● Netflix Priam for Backups (Snapshots, Compressed)
● Size-Tiered Compaction (SSTables 200 GB+)
● Compression enabled (SnappyCompressor)
● AWS
Where we started
Limiting Factors
● 3-10 days total processing time
● $2700+ in AWS resources
● Ad hoc analytics (not really!)
● Engineering time!
Moving Forward
● Querying Cassandra directly didn’t scale for MapReduce.
● Cassandra SSTables. Could we consume them directly?
● SSTables would need to be directly available (HDFS).
● SSTables would need to be available as MapReduce input.
● Did something already exist to do this?
Netflix Aegisthus
● We already use Netflix Priam for Cassandra backups
● Aegisthus works great for the Netflix use case (C* 1.0, no compression).
● At the time there was an experimental C* 1.2 branch.
● Aegisthus splits only when compression is not enabled.
● Single thread processing 200 GB+ SSTables.
KassandraMRHelper
● Support for C* 1.2!
● We got the job done with KassandraMRHelper
● Copies SSTables to the local file system in order to leverage existing C* I/O libraries.
● InputFormat not splittable.
● Single thread processing 200 GB+ SSTables.
● 60+ hours to process
Existing Solutions
Implementing a Splittable InputFormat
● We needed splittable SSTables to make this work.
● With compression enabled this is more difficult.
● Cassandra I/O code makes the compression seamless but doesn’t support HDFS.
● Need a way to define the splits.
Our Approach
● Leverage the SSTable metadata.
● Adapt Cassandra I/O libraries to HDFS.
● Leverage the SSTable Index to define splits. IndexIndex!
● Implement an InputFormat which leverages the IndexIndex to define splits.
● Similar to Hadoop LZO implementation.
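The core of the IndexIndex idea can be sketched in a few lines: take the data-file offset of every row (recovered from the SSTable Index file) and reduce that list to a handful of split boundaries, each roughly a target size, so every split starts exactly on a row boundary. This is a minimal, self-contained illustration; the method name and target-size parameter are ours, not the hadoop-sstable API.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    /**
     * Given the data-file offset of every row (as read from the SSTable
     * Index file) and a target split size in bytes, return the offsets at
     * which new splits begin. Every split starts on a row boundary, so a
     * mapper can seek straight to its first row.
     */
    static List<Long> computeSplits(long[] rowOffsets, long targetSplitSize) {
        List<Long> splitStarts = new ArrayList<>();
        long currentStart = 0;
        splitStarts.add(currentStart);
        for (long offset : rowOffsets) {
            // Close the current split once it has reached the target size.
            if (offset - currentStart >= targetSplitSize) {
                splitStarts.add(offset);
                currentStart = offset;
            }
        }
        return splitStarts;
    }

    public static void main(String[] args) {
        // Rows at these data-file offsets; aim for ~100-byte splits.
        long[] offsets = {0, 40, 90, 130, 210, 260};
        System.out.println(computeSplits(offsets, 100)); // [0, 130, 260]
    }
}
```

An InputFormat that consumes these boundaries can then hand each (start, end) pair to its own mapper, which is exactly the trick the Hadoop LZO index uses for block-compressed text files.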
Cassandra SSTables
Data file: This file contains the actual SSTable data, a binary format of key/value row data.
Index file: This file contains an index into the Data file for each row key.
CompressionInfo file: This file contains an index into the Data file for each compressed block. This file is available when compression has been enabled for a Cassandra column family.
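To make the compressed case concrete: because Cassandra compresses fixed-size uncompressed chunks (64 KB by default), the CompressionInfo file boils down to one compressed-file offset per chunk, and finding the chunk that holds any uncompressed position is a division plus an array lookup. A runnable sketch of that lookup (the method name is ours, not Cassandra's):

```java
public class ChunkLookup {
    /**
     * With fixed-size uncompressed chunks, the compressed-file offset of
     * the chunk holding a given uncompressed position is a single lookup:
     * divide by the chunk length, then index into the chunk-offset list
     * read from the CompressionInfo file.
     */
    static long chunkOffsetFor(long uncompressedPosition,
                               int chunkLength,
                               long[] chunkOffsets) {
        int chunkIndex = (int) (uncompressedPosition / chunkLength);
        return chunkOffsets[chunkIndex];
    }

    public static void main(String[] args) {
        // Three 64 KB uncompressed chunks, compressed to varying sizes.
        long[] chunkOffsets = {0, 21_000, 45_500};
        // Uncompressed position 70 000 falls in the second chunk.
        System.out.println(chunkOffsetFor(70_000, 65_536, chunkOffsets)); // 21000
    }
}
```

This is why the Index file's row offsets (which are positions in the uncompressed data) remain usable even when compression is on: each one can be translated to a seekable position in the compressed file.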
Cassandra I/O for HDFS
● Cassandra’s I/O allows for random access of the SSTable.
● Porting this code to HDFS allowed us to read the SSTable in the same fashion directly within MapReduce.
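The port itself amounts to swapping Cassandra's RandomAccessFile-backed reader for one backed by a seekable HDFS stream. Here is a toy version of the pattern, with the HDFS stream stood in by an in-memory buffer so the example runs anywhere; the interface and class names are illustrative, not from the project.

```java
import java.io.IOException;

public class SeekableDemo {
    /** The minimal contract a random-access SSTable reader needs. */
    interface SeekableInput {
        void seek(long position) throws IOException;
        int read(byte[] buffer, int offset, int length) throws IOException;
    }

    /**
     * In a real port this would wrap Hadoop's FSDataInputStream, which
     * already exposes seek() and read(); a byte array stands in here.
     */
    static class InMemorySeekable implements SeekableInput {
        private final byte[] data;
        private int position;

        InMemorySeekable(byte[] data) { this.data = data; }

        @Override
        public void seek(long position) { this.position = (int) position; }

        @Override
        public int read(byte[] buffer, int offset, int length) {
            int n = Math.min(length, data.length - position);
            if (n <= 0) return -1;
            System.arraycopy(data, position, buffer, offset, n);
            position += n;
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        SeekableInput in = new InMemorySeekable("hello sstable".getBytes());
        in.seek(6); // jump straight to a split boundary, as a mapper would
        byte[] out = new byte[7];
        in.read(out, 0, 7);
        System.out.println(new String(out)); // sstable
    }
}
```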
The IndexIndex
Original Solution
Final Solution
Results
Approach                                  Time        Cost
Reading via live queries to Cassandra     3-10 days   $2700+
Unsplittable SSTable input format         60 hours    $350+
Splittable SSTable input format           10 hours    $165+
Example
Mapper
// Row keys in this example are composite: (UTF8, UTF8).
private static final AbstractType<?> keyType =
    CompositeType.getInstance(Lists.<AbstractType<?>>newArrayList(
        UTF8Type.instance, UTF8Type.instance));

@Override
protected void map(ByteBuffer key, SSTableIdentityIterator value, Context context)
        throws IOException, InterruptedException {
    final ByteBuffer newBuffer = key.slice();
    final Text mapKey = new Text(keyType.getString(newBuffer));
    final Text mapValue = jsonColumnParser.getJson(value, context);
    if (mapValue == null) {
        return; // Skip rows that could not be rendered as JSON.
    }
    context.write(mapKey, mapValue);
}
Reducer
@Override
protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // Make things super simple and output the first value only.
    // In reality we'd want to figure out which was the most correct
    // value of the ones we have, based on our C* cluster configuration.
    context.write(key, new Text(values.iterator().next().toString()));
}
Job Configuration
job.setMapperClass(SimpleExampleMapper.class);
job.setReducerClass(SimpleExampleReducer.class);
...
job.setInputFormatClass(SSTableRowInputFormat.class);
...
SSTableInputFormat.addInputPaths(job, inputPaths);
...
FileOutputFormat.setOutputPath(job, new Path(outputPath));
Running the indexer
hadoop jar hadoop-sstable-0.1.2.jar \
  com.fullcontact.sstable.index.SSTableIndexIndexer [SSTABLE_ROOT]
Running the job
hadoop jar hadoop-sstable-0.1.2.jar com.fullcontact.sstable.example.SimpleExample \
  -D hadoop.sstable.cql="CREATE TABLE ..." \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.job.reuse.jvm.num.tasks=1 \
  -D io.sort.mb=1000 \
  -D io.sort.factor=100 \
  -D mapred.reduce.tasks=512 \
  -D hadoop.sstable.split.mb=1024 \
  -D mapred.child.java.opts="-Xmx2G -XX:MaxPermSize=256m" \
  [SSTABLE_ROOT] [OUTPUT_PATH]
Example Summary
1. Write SSTable reader MapReduce jobs
2. Run the SSTable Indexer
3. Run SSTable reader MapReduce jobs
Goal Accomplished
● 96% decrease in processing times!
● 94% decrease in resource costs!
● Reduced Engineering time!
Open Source Project
Open Source @ https://github.com/fullcontact/hadoop-sstable
Roadmap:
● Cassandra 2.1 support
● Scalding support