CFS: Cassandra Backed Storage for Hadoop
Nick Bailey @Nickmbailey
![Page 1: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/1.jpg)
CFS
Cassandra-backed storage for Hadoop
Nick Bailey
@Nickmbailey
![Page 2: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/2.jpg)
©2012 DataStax
Motivation
![Page 3: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/3.jpg)
Help me Cassandra, you’re my only hope
![Page 4: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/4.jpg)
Cassandra
• Distributed architecture
• No SPOF
• Scalable
• Real time data
• No ad-hoc query support
![Page 5: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/5.jpg)
Cassandra, why can’t you...
![Page 6: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/6.jpg)
...do the things Hadoop was built for.
![Page 7: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/7.jpg)
Cassandra + Hadoop = <3
![Page 8: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/8.jpg)
The Solution
• InputFormat/OutputFormat
• Unfortunately, still need a DFS
• Run tasktrackers/datanodes locally
  • Data Locality FTW!
• Run namenode/jobtracker somewhere
• Since Cassandra 0.6 (the dark ages)
![Page 9: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/9.jpg)
Ok, but what about these parts that suck...
![Page 10: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/10.jpg)
Do not want...
• Multiple Hadoop stacks?
• SPOF?
• 3 JVMs?
![Page 11: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/11.jpg)
CFS
![Page 12: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/12.jpg)
Cassandra Data model in 1 minute
![Page 13: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/13.jpg)
Column Families
• Column Family ~= Table
• Row Key + columns
• Columns are sparse
![Page 14: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/14.jpg)
Static - Users Column Family

| Row Key | Columns |
|---|---|
| nickmbailey | password: *, name: Nick |
| zznate | password: *, name: Nate, phone: 512-7777 |
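One rough way to picture the static column family above, sketched in Python rather than in Cassandra's actual storage format: each row key maps to its own set of columns, and a row stores only the columns it has. The `users` variable and the `get_column` helper are names invented for this illustration.

```python
# Illustrative model only: a column family as {row key: {column name: value}}.
users = {
    "nickmbailey": {"password": "*", "name": "Nick"},
    "zznate": {"password": "*", "name": "Nate", "phone": "512-7777"},
}


def get_column(column_family, row_key, column, default=None):
    """Return one column of one row, or `default` if the row lacks it."""
    return column_family.get(row_key, {}).get(column, default)


# zznate has a phone column; nickmbailey simply does not store one (sparse row).
print(get_column(users, "zznate", "phone"))
print(get_column(users, "nickmbailey", "phone"))
```

The missing `phone` column costs nothing for `nickmbailey`, which is the point of "columns are sparse".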
![Page 15: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/15.jpg)
Secondary Indexes
SELECT * FROM Users WHERE name = 'Nick';
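A query on a column value, like the one on this slide, needs a way to go from a value back to row keys. A minimal sketch of that idea in Python, assuming an in-memory model and not Cassandra's real index implementation; `build_index` and `select_where` are hypothetical helper names.

```python
from collections import defaultdict

# Same illustrative Users data as on the earlier slide.
users = {
    "nickmbailey": {"password": "*", "name": "Nick"},
    "zznate": {"password": "*", "name": "Nate", "phone": "512-7777"},
}


def build_index(column_family, column):
    """Map each value of `column` to the set of row keys that hold it."""
    index = defaultdict(set)
    for row_key, columns in column_family.items():
        if column in columns:
            index[columns[column]].add(row_key)
    return index


def select_where(column_family, index, value):
    """Return the full rows whose indexed column equals `value`."""
    return {key: column_family[key] for key in sorted(index.get(value, ()))}


name_index = build_index(users, "name")
print(select_where(users, name_index, "Nick"))
```

Without the index, answering the same question would mean scanning every row.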
![Page 16: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/16.jpg)
Dynamic - Friends

| Row Key | Columns |
|---|---|
| nickmbailey | zznate:, thobbs: |
| zznate | jbeiber:, thobbs:, steve_watt: |
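In the dynamic case the column names themselves carry the data and the values are empty. A small Python sketch of that pattern, again as an in-memory illustration only; `add_friend` and `friends_of` are made-up helper names.

```python
# Illustrative model only: column names are the friends, values stay empty.
friends = {
    "nickmbailey": {"zznate": "", "thobbs": ""},
    "zznate": {"jbeiber": "", "thobbs": "", "steve_watt": ""},
}


def add_friend(column_family, user, friend):
    """Adding a friend is just writing one more column to the user's row."""
    column_family.setdefault(user, {})[friend] = ""


def friends_of(column_family, user):
    """Return the column names of the row, sorted by name."""
    return sorted(column_family.get(user, {}))


add_friend(friends, "nickmbailey", "steve_watt")
print(friends_of(friends, "nickmbailey"))
```

No schema change is needed to add a column, so rows can grow to different widths.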
![Page 17: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/17.jpg)
So what about CFS...
![Page 18: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/18.jpg)
Simple...
![Page 19: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/19.jpg)
![Page 20: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/20.jpg)
CF: inode
• Essentially, namenode replacement
• File metadata
![Page 21: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/21.jpg)
![Page 22: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/22.jpg)
CF: inode
• Row Key = UUID
• Allows for file renames
• Secondary indexes for file browsing
• Columns:

| Column | Value |
|---|---|
| filename | /home/nick/data.txt |
| parent_path | /home/nick/ |
| attributes | nick:nick:777 |
| TimeUUID1 | <block metadata> |
| TimeUUID2 | <block metadata> |
| TimeUUID3 | <block metadata> |
| ... | |
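To make the inode layout concrete, here is a minimal in-memory sketch in Python. It is not the CFS code: the function names (`create_file`, `add_block`, `rename`, `list_dir`) are hypothetical, and the scan in `list_dir` only stands in for a secondary-index lookup on `parent_path`. It shows why a UUID row key makes renames cheap: the path lives in a column, so the row key and the block columns never change.

```python
import uuid

inode = {}  # row key (random UUID) -> {column name: value}


def parent_of(path):
    """Return the directory part of a path, with a trailing slash."""
    return path.rsplit("/", 1)[0] + "/"


def create_file(path, attributes):
    file_id = uuid.uuid4()
    inode[file_id] = {
        "filename": path,
        "parent_path": parent_of(path),
        "attributes": attributes,
    }
    return file_id


def add_block(file_id, block_metadata):
    """Record one block under a time-based UUID column name."""
    column_name = uuid.uuid1()
    inode[file_id][column_name] = block_metadata
    return column_name


def rename(file_id, new_path):
    """Only two columns change; the row key and block columns are untouched."""
    inode[file_id]["filename"] = new_path
    inode[file_id]["parent_path"] = parent_of(new_path)


def list_dir(parent_path):
    """Stand-in for a secondary-index query on parent_path."""
    return sorted(
        row["filename"] for row in inode.values() if row["parent_path"] == parent_path
    )


file_id = create_file("/home/nick/data.txt", "nick:nick:777")
block_column = add_block(file_id, "<block metadata>")
rename(file_id, "/home/nick/renamed.txt")
print(list_dir("/home/nick/"))
```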
![Page 23: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/23.jpg)
![Page 24: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/24.jpg)
CF: sblocks
• Essentially, datanode replacement
• Stores actual contents of files
• Each row is an hdfs block
• Row Key = Block ID

| Column | Value |
|---|---|
| TimeUUID1 | <compressed file data> |
| TimeUUID2 | <compressed file data> |
| TimeUUID3 | <compressed file data> |
| ... | |
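A sketch of the sblocks layout in Python, under stated assumptions: the slides do not say which compression codec is used, so `zlib` is a stand-in, and the dict relies on insertion order where Cassandra would order columns by their TimeUUID. `write_block` and `read_block` are illustrative names, not CFS functions.

```python
import uuid
import zlib

sblocks = {}  # row key (block id) -> {TimeUUID column name: compressed sub-block}


def write_block(block_id, sub_blocks):
    """Store each sub-block as one compressed column in the block's row."""
    row = sblocks.setdefault(block_id, {})
    for chunk in sub_blocks:
        row[uuid.uuid1()] = zlib.compress(chunk)


def read_block(block_id):
    """Decompress the columns in stored order and join them back together."""
    return b"".join(zlib.decompress(value) for value in sblocks[block_id].values())


block_id = uuid.uuid4()
write_block(block_id, [b"hello ", b"world"])
print(read_block(block_id))
```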
![Page 25: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/25.jpg)
![Page 26: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/26.jpg)
Writes
• Write file metadata
• Split into blocks
  • Still controlled by ‘dfs.block.size’
  • also ‘cfs.local.subblock.size’
• Read in a block
  • split into sub blocks
• Update inode, sblocks
• rinse, repeat
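The write steps above can be sketched roughly as follows. This is a simplified Python model, not the CFS implementation: `block_size` and `subblock_size` play the roles of `dfs.block.size` and `cfs.local.subblock.size`, compression is left out, and the `blocks` column is a shortcut for the per-block TimeUUID columns shown on the inode slide.

```python
import uuid


def split(data, size):
    """Cut `data` into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def write_file(inode, sblocks, path, data, block_size, subblock_size):
    file_id = uuid.uuid4()
    inode[file_id] = {"filename": path, "blocks": []}    # write file metadata
    for block in split(data, block_size):                # split into blocks
        block_id = uuid.uuid1()
        sblocks[block_id] = split(block, subblock_size)  # split into sub blocks
        inode[file_id]["blocks"].append(block_id)        # update inode
    return file_id


inode, sblocks = {}, {}
# Tiny sizes so the splitting is visible; real settings are far larger.
file_id = write_file(inode, sblocks, "/home/nick/data.txt", b"abcdefghij",
                     block_size=4, subblock_size=2)
for block_id in inode[file_id]["blocks"]:
    print(sblocks[block_id])
```

With these toy sizes the ten-byte file becomes three blocks, the last one shorter than the others.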
![Page 27: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/27.jpg)
![Page 28: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/28.jpg)
Reads
• Check for file in inode
• Determine appropriate blocks
• Request blocks via thrift
• If data is local...
  • ...get location on local filesystem
• If data is remote...
  • ...get actual file content via thrift
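A toy Python model of the read path, assuming in-memory dicts in place of the cluster. Everything named here is hypothetical: `request_block` stands in for the Thrift call, the node names and `/var/lib/cfs/...` paths are invented, and the inode is keyed by path only to keep the example short. The point is the branch: a local block comes back as a filesystem location to read directly, a remote block comes back as content.

```python
# Hypothetical in-memory stand-ins for the cluster state.
inode = {"/home/nick/data.txt": ["block-1", "block-2"]}  # path -> ordered block ids
blocks = {
    "block-1": {"node": "node-a", "location": "/var/lib/cfs/block-1", "content": b"hello "},
    "block-2": {"node": "node-b", "location": "/var/lib/cfs/block-2", "content": b"world"},
}
local_disk = {"/var/lib/cfs/block-1": b"hello "}  # what node-a can read directly


def request_block(block_id, client_node):
    """Stand-in for the Thrift request: reply with a location or the content."""
    block = blocks[block_id]
    if block["node"] == client_node:
        return {"location": block["location"]}
    return {"content": block["content"]}


def read_file(path, client_node):
    if path not in inode:                      # check for file in inode
        raise FileNotFoundError(path)
    parts = []
    for block_id in inode[path]:               # determine appropriate blocks
        reply = request_block(block_id, client_node)
        if "location" in reply:                # local: read from local filesystem
            parts.append(local_disk[reply["location"]])
        else:                                  # remote: content came over the wire
            parts.append(reply["content"])
    return b"".join(parts)


print(read_file("/home/nick/data.txt", "node-a"))
```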
![Page 29: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/29.jpg)
What Else?
• Current Implementation: 1.0.4
• <property>
    <name>fs.cfs.impl</name>
    <value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value>
  </property>
• Supports HDFS append()
• Immutability makes things easy
• See the first incarnation
  • https://github.com/riptano/brisk
![Page 31: CFS: Cassandra Backed Storage for Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052621/558b27ddd8b42a1d268b45be/html5/thumbnails/31.jpg)
Questions?