CASSANDRA: A DECENTRALIZED STRUCTURED STORAGE SYSTEM

Avinash Lakshman, Prashant Malik - Facebook

Presented by Dhruv Patel

CS 886 | Spring 2016



Overview

Background

Data Model

Architecture

Implementation

Facebook Inbox Search

Conclusion

Discussion


History

Developed by Facebook

Designed to fulfill the storage needs of Facebook

Inbox Search

Billions of writes/day

High write throughput

Scale with number of users

Deployed as the backend storage system for multiple services within Facebook


Motivation

High Scalability

Read & Write throughput increases linearly with number of nodes

High Availability

Treats failures as the norm rather than the exception

Fault Tolerance


Data Model

Table: Distributed multi-dimensional map, indexed by a key

Value: Object which is highly structured

Row Key: String – no size restriction

Normally 16-36 bytes long

Operations are atomic on each row per replica


Data Model (contd.)

Column Families (CF) – columns grouped together

Number of CFs not limited per table

Types of Column Families:

Simple CF

Super CF – CF within CF

Column Sort Order – application specific

Time

Facebook Inbox Search – results displayed in time sorted order

Name


Data Model (contd.)

Each Column has

Name

Value

Timestamp

Column Access

Simple column

column_family:column

Super column

column_family:super_column:column


Data Model (contd.)

Key: 6576e768-8r73-78df

Column Name:   User_Id   User_Name   Date

Column Value:  786780    John        2010-04-17T18:10:11

Key: 2456e124-6y78-12ef

Column Name:   User_Id   User_Name   Date

Column Value:  218745    Bob         2010-02-12T14:12:16


Data Model (contd.)

[Figure: a Simple Column Family (row keys mapping to columns and their values) shown beside a Super Column Family (row keys mapping to super columns, each grouping columns and their values)]

Figure adapted from [3]


API

Methods:

insert (table, key, rowMutation)

get (table, key, columnName)

delete (table, key, columnName)

columnName can refer to:

Column within column family

Column family

Super column family

Column within a super column
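
A minimal in-memory sketch of how the three methods above could behave. The nested-dictionary layout, the colon-separated `column_name` strings, and the use of a plain mapping as the row mutation are assumptions made for illustration, not the actual Cassandra implementation; only simple (non-super) columns are handled.

```python
import time


def insert(table, key, row_mutation):
    """Apply a row mutation: a mapping of 'column_family:column' to a value."""
    row = table.setdefault(key, {})
    for path, value in row_mutation.items():
        family, column = path.split(":")
        # Each column stores a value together with a timestamp.
        row.setdefault(family, {})[column] = (value, time.time())


def get(table, key, column_name):
    """column_name is 'column_family' or 'column_family:column'."""
    family, _, column = column_name.partition(":")
    columns = table.get(key, {}).get(family, {})
    if not column:
        # Whole column family requested: return every column value in it.
        return {name: cell[0] for name, cell in columns.items()}
    cell = columns.get(column)
    return cell[0] if cell else None


def delete(table, key, column_name):
    """Remove one column, or a whole column family when no column is given."""
    family, _, column = column_name.partition(":")
    row = table.get(key, {})
    if not column:
        row.pop(family, None)
    else:
        row.get(family, {}).pop(column, None)
```

Here `table` is just an empty dictionary created by the caller, for example `users = {}` followed by `insert(users, "6576e768-8r73-78df", {"profile:User_Name": "John"})`; the `profile` column family name is hypothetical.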


System Architecture

Partitioning – high scalability

Replication – high availability and durability

Cluster membership – how nodes are added/deleted

Bootstrapping – how nodes start for the first time

Scaling the Cluster

Local Persistence

Implementation Details


Partitioning

Scale incrementally

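Nodes: logically structured in a ring topology

Each node is assigned a random value that represents its position on the ring

Each data item is assigned to a node by hashing the data item's key to a position on the ring

Walk the ring clockwise to find the first node with a position larger than the item's position

This node is the coordinator of the key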
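
The lookup described above can be sketched as follows. The `Ring` class, the use of MD5, and the 2**32 ring size are illustrative assumptions; the slides do not say which hash function or ring size Cassandra uses.

```python
import bisect
import hashlib


def ring_position(value, ring_size=2**32):
    """Hash a string onto the ring (MD5 is only an illustrative choice)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % ring_size


class Ring:
    def __init__(self, tokens):
        # tokens: mapping of node name -> position on the ring
        self.entries = sorted((pos, node) for node, pos in tokens.items())
        self.positions = [pos for pos, _ in self.entries]

    def coordinator(self, key_position):
        # First node whose position is strictly larger than the key's
        # position; the modulo wraps around past the last node.
        index = bisect.bisect_right(self.positions, key_position)
        return self.entries[index % len(self.entries)][1]
```

For example, with `Ring({"A": 10, "B": 30, "C": 50, "D": 70, "E": 90})`, a key hashed to position 35 is coordinated by C, and a key at position 95 wraps around to A.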


Partitioning (contd.)

[Figure: ring of nodes A, B, C, D, E with the hashed keys h(key1) and h(key2) placed on it]


Partitioning (contd.)

Consistent Hashing

Advantages

• Departure or arrival of a node only affects its immediate neighbours

Challenges

• Non-uniform data and load distribution

• Unaware of node performance heterogeneity

Solutions

• Approach 1: assign each node to multiple positions on the ring

• Approach 2: analyze load information and move lightly loaded nodes on the ring to alleviate heavily loaded nodes

Cassandra uses the 2nd approach


Replication

Data items are replicated at N nodes

[Figure: ring of nodes A, B, C, D, E showing the item at h(key1) replicated with N=3]


Replication (contd.)

Cassandra Replication Policies

Rack Unaware

• Replicate data on the N-1 nodes that follow the coordinator node on the ring

Rack Aware

• Zookeeper is used to elect a leader node

• Leader tells nodes the ranges they are replicas for

Datacenter Aware

• Leader is chosen at datacenter level instead of rack level

Metadata about the ranges a node is responsible for

Cached locally at each node and inside Zookeeper

Fault tolerant: a node that crashes and comes back knows its ranges
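
A small sketch of the Rack Unaware policy: the coordinator plus the N-1 nodes that follow it clockwise hold the replicas. The function name and the list-based ring are assumptions for illustration; the rack-aware and datacenter-aware policies need leader election and are not modelled here.

```python
def rack_unaware_replicas(ring_nodes, coordinator, n):
    """ring_nodes: node names in clockwise ring order.

    Returns the coordinator followed by its n-1 successors on the ring.
    """
    start = ring_nodes.index(coordinator)
    count = min(n, len(ring_nodes))  # cannot have more replicas than nodes
    return [ring_nodes[(start + i) % len(ring_nodes)] for i in range(count)]
```

With the five-node ring A to E and N=3, a key coordinated by B is stored on B, C and D, while a key coordinated by E wraps around to E, A and B.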


Cluster Membership

Scuttlebutt

Gossip based protocol

Inspired by real-life rumor spreading

Periodic, Pairwise & Inter-node communication

Failure Detection

Determine which nodes are up and which are down

Avoid attempts to communicate with unreachable nodes
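
The periodic, pairwise exchange of the gossip protocol can be illustrated with a toy simulation. This is a simplified push-pull exchange with version numbers, not the full Scuttlebutt reconciliation; the function names, the state layout, and the stopping rule are all assumptions.

```python
import random


def merge(local, remote):
    """Keep, for every node, the entry with the highest version number."""
    for node, (version, value) in remote.items():
        if node not in local or local[node][0] < version:
            local[node] = (version, value)


def gossip_round(states, rng):
    """Each node picks one random peer and the pair exchange state both ways."""
    names = sorted(states)
    for name in names:
        peer = rng.choice([other for other in names if other != name])
        merge(states[peer], states[name])
        merge(states[name], states[peer])


def run_until_converged(states, rng, max_rounds=1000):
    """Return the number of rounds until every node holds the same view."""
    for round_number in range(1, max_rounds + 1):
        gossip_round(states, rng)
        views = list(states.values())
        if all(view == views[0] for view in views):
            return round_number
    return None
```

Each node starts out knowing only its own entry, for example `{"A": {"A": (1, "token-A")}, ...}`, and after a few rounds every node holds an entry for every other node.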


Cluster Membership (contd.)

Failure Detection

Φ Accrual Failure Detection

Emits a value Φ instead of a binary value (up/down)

Φ = suspicion level for each monitored node

The scale of Φ adjusts dynamically to network and load conditions

Each node maintains a sliding window of inter-arrival times of gossip messages from other nodes; Φ is calculated from these times

If node is faulty: suspicion level increases with time

If node is correct: Φ stays bounded and returns to a low value (generally 0) when a gossip message arrives; the application sets the threshold at which a node is suspected
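
A hedged sketch of how Φ could be computed from the sliding window. It assumes inter-arrival times follow an exponential distribution, so that P(gap > t) = exp(-t / mean) and Φ = -log10(P); the class name, window size, and method names are hypothetical.

```python
import math
from collections import deque


class PhiAccrualDetector:
    def __init__(self, window_size=1000):
        # Sliding window of inter-arrival times of gossip messages.
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record the arrival time of a gossip message from the monitored node."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: grows the longer the node stays silent."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # -log10(exp(-elapsed / mean)) simplifies to elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))
```

With heartbeats one second apart, Φ is 0 right after a message arrives and keeps rising while the node is silent; the application would compare it against a threshold of its own choosing.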


Bootstrapping

Node starts for the first time

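Reads its configuration file, which lists a few contact points (seeds) in the cluster

Seeds can also come from a configuration service such as Zookeeper

Chooses a random token for its position on the ring

Persists the mapping locally and in Zookeeper

Token information is gossiped around the cluster

Nodes can also be added or removed manually by an administrator via the command line tool or the Cassandra web interface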


Scaling

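Joining node is assigned a token such that it can alleviate a heavily loaded node

This splits the range the heavily loaded node was responsible for

Bootstrap algorithm is initiated from the command line tool or the web interface

Overloaded node streams the data to the new node using kernel-kernel copy techniques

Transfer rate of about 40 MB/sec from a single node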
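
One way to picture the token choice: place the joining node in the middle of the range owned by the most heavily loaded node. Picking the midpoint, the function name, and the load metric are assumptions; the slides only say that the range is split.

```python
def token_for_new_node(tokens, load, ring_size=2**32):
    """tokens: node -> ring position; load: node -> load metric.

    A node owns the range (predecessor position, own position], so a new
    token at the midpoint takes over the first half of that range.
    """
    ordered = sorted(tokens, key=lambda node: tokens[node])
    heavy = max(ordered, key=lambda node: load[node])
    # Index -1 wraps to the last node, which is the predecessor of the first.
    start = tokens[ordered[ordered.index(heavy) - 1]]
    end = tokens[heavy]
    span = (end - start) % ring_size or ring_size
    return (start + span // 2) % ring_size
```

On a ring of size 100 with tokens A=10, B=30, C=90 and C the most loaded, the new node would get token 60 and take over the range (30, 60].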


Local Persistence

Relies on the local file system for data persistence

Write Operation

Write to commit log in local disk of the node

Durability and recoverability

Update in-memory data structure

Only after successful write in commit log

When the in-memory data structure crosses a threshold (based on data size and number of objects), it dumps itself to disk

Merge process runs in the background to merge such files into one file
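
The write path above can be sketched with Python lists standing in for files on disk. The `ToyStore` class, the entry-count threshold, and the method names are illustrative assumptions, not Cassandra's actual storage engine.

```python
class ToyStore:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stand-in for the on-disk commit log
        self.memtable = {}     # in-memory data structure
        self.disk_files = []   # flushed files, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # 1. Append to the commit log first, for durability and recoverability.
        self.commit_log.append((key, value))
        # 2. Only then update the in-memory data structure.
        self.memtable[key] = value
        # 3. Dump to disk once the threshold is reached.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        self.disk_files.append(sorted(self.memtable.items()))
        self.memtable = {}

    def merge_files(self):
        """Background merge: collapse all files into one, newest value wins."""
        merged = {}
        for data_file in self.disk_files:  # oldest to newest
            merged.update(data_file)
        self.disk_files = [sorted(merged.items())]
```

A real system would bound the memtable by data size rather than entry count; the count is used here only to keep the sketch short.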


Local Persistence (contd.)

Write Implementation

Figure adapted from [6]


Local Persistence (contd.)

Compaction: merges files that are close to each other in size

Figure adapted from [6]


Local Persistence (contd.)

Read Operation

Looks up the in-memory data structure first

If not found, looks into the files on disk, newest to oldest

Bloom Filter

Summarizes the keys in one file

Avoids looking up files which do not contain the key

Stored in each data file and also kept in memory

Column Index

Jump to the right chunk on disk for column retrieval

Generated at every 256K chunk boundary
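
A rough sketch of the read path with a Bloom filter per data file. The filter size, the number of hash functions, the use of SHA-256, and the dictionary standing in for a file are all assumptions; the column index is not modelled.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions from one key by salting the hash.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[position] for position in self._positions(key))


def read(key, memtable, data_files):
    """data_files: list of (bloom_filter, contents) pairs, newest first."""
    if key in memtable:
        return memtable[key]
    for bloom, contents in data_files:
        if not bloom.might_contain(key):
            continue  # skip the disk lookup for this file
        if key in contents:
            return contents[key]
    return None
```

Because a Bloom filter never reports a stored key as absent, skipping a file on a negative answer cannot lose data; a false positive only costs one unnecessary lookup.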


Implementation Details

Cassandra process on a single machine

Partitioning module

Cluster membership and failure detection module

Storage engine module

Architecture is based on SEDA

Staged Event Driven Architecture

Operations transit from one stage to the next

Each stage can be handled by different thread pool

Gives high performance

Figure adapted from [8]
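
The staged, event-driven idea can be illustrated with queues and per-stage thread pools. The `Stage` class, the worker counts, and the two example stages are hypothetical and only show how an operation moves from one stage to the next.

```python
import queue
import threading


class Stage:
    def __init__(self, name, handler, workers=2, next_stage=None):
        self.name = name
        self.handler = handler
        self.next_stage = next_stage
        self.events = queue.Queue()  # each stage has its own event queue
        # Each stage is served by its own pool of worker threads.
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, event):
        self.events.put(event)

    def _run(self):
        while True:
            event = self.events.get()
            try:
                result = self.handler(event)
                if self.next_stage is not None:
                    # Hand the result to the next stage in the pipeline.
                    self.next_stage.submit(result)
            finally:
                self.events.task_done()

    def wait(self):
        """Block until every event submitted to this stage has been handled."""
        self.events.join()
```

A two-stage pipeline could be wired as `store = Stage("store", results.append)` followed by `parse = Stage("parse", str.upper, next_stage=store)`, where `results` is a list owned by the caller; waiting on `parse` and then on `store` drains the pipeline.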


Implementation Details (contd.)

Commit Log Maintenance

The commit log is rolled over once it exceeds a size threshold (128 MB)

Each commit log contains a header

Bit vector

Shows whether each column family has been successfully persisted to disk

This header is checked before purging the commit log

Makes sure all the data is persisted to disk
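
The header check can be sketched with integers used as bit vectors. The `CommitLogHeader` class and the extra `written` vector, which tracks which column families have entries in this log, are assumptions added to make the example self-contained.

```python
class CommitLogHeader:
    def __init__(self, column_families):
        # One bit per column family.
        self.bit_of = {name: bit for bit, name in enumerate(column_families)}
        self.written = 0    # assumption: families with entries in this log
        self.persisted = 0  # families whose in-memory data was dumped to disk

    def record_write(self, column_family):
        mask = 1 << self.bit_of[column_family]
        self.written |= mask
        self.persisted &= ~mask  # new data is not on disk yet

    def record_flush(self, column_family):
        self.persisted |= 1 << self.bit_of[column_family]

    def safe_to_purge(self):
        # Purge only if every family written to this log has been persisted.
        return self.written & ~self.persisted == 0
```

A log with writes to a hypothetical `inbox` column family is not purgeable until `record_flush("inbox")` has been called.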


Facebook Inbox Search

Two kinds of search

Term Search

Key: user_id

Super Column: words that make up the message

Interaction Search

Key: user_id

Super Column: recipients' ids

For each of these super columns, individual message identifiers are the columns

Term Search – not available currently!!!
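
The term search layout (user id as key, words as super columns, message identifiers as columns) maps naturally onto nested dictionaries. The function names and the whitespace tokenizer are assumptions; interaction search would follow the same shape with recipient ids in place of words.

```python
def index_message(index, user_id, message_id, text):
    """Add one message to a user's term index.

    Layout: user_id (key) -> word (super column) -> message_id (column).
    """
    for word in set(text.lower().split()):
        # The stored value is a placeholder in this sketch.
        index.setdefault(user_id, {}).setdefault(word, {})[message_id] = True


def term_search(index, user_id, word):
    """Return the ids of the user's messages that contain the word."""
    return sorted(index.get(user_id, {}).get(word.lower(), {}))
```

Because everything for one user lives under a single row key, a search only touches that user's row.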


Facebook Inbox Search (contd.)

Faster search technique

User clicks on search bar

Asynchronous message is sent to Cassandra cluster

Prime the buffer cache with that user’s index

Search results are likely to be in memory when the query is executed

50+TB data on 150 node cluster


Conclusion

High write throughput

No single points of failure

Linear scalability

Durability

Clever integration of Bigtable and Dynamo


Questions and Discussion

Security

Need to use trusted environments

Development of Security Layer externally

A comparison with other such systems would have produced more persuasive results

Dense reading

No figures


References

[1] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2): 35-40, 2010.

[2] T. Rabl, S. Gómez-Villamor, M. Sadoghi, V. Muntés-Mulero, H.-A. Jacobsen, and S. Mankovskii, “Solving Big Data Challenges for Enterprise Application Performance Management,” Proc. VLDB Endow., vol. 5, no. 12, pp. 1724–1735, Aug. 2012.

[3] E. Hewitt, Cassandra: The Definitive Guide, 1st edition. Sebastopol, CA: O’Reilly Media, 2010.

[4] http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/6.1-Cassandra.ppt

[5] N. Hayashibara, X. Defago, R. Yared, and T. Katayama, “The ϕ Accrual Failure Detector,” in Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, Washington, DC, USA, 2004, pp. 66–78.

[6] http://www.odbms.org/wp-content/uploads/2013/11/cassandra.pdf

[7] http://vanets.vuse.vanderbilt.edu/dokuwiki/lib/exe/fetch.php?media=teaching:cassandra_presentation_final.pptx

[8] http://www.eecs.harvard.edu/~mdw/talks/seda-sosp01-talk.pdf

[9] http://www.datastax.com/documentation/articles/cassandra/cassandrathenandnow.html

[10] https://cs.uwaterloo.ca/~tozsu/courses/CS848/W15/presentations/Cassandra.pdf

[11] http://www.slideshare.net/VaradMeru/cassandra-a-decentralized-structured-storage-system

[12] http://www.powershow.com/view4/56cc1c-NTgwM/Cassandra_-_A_Decentralized_Structured_Storage_System_powerpoint_ppt_presentation

[13] https://prezi.com/lzzsbrotjjcn/cassandra-a-decentralized-structured-storage-system/

[14] https://stephenholiday.com/notes/cassandra/

[15] https://huu.la/blog/review-of-cassandra--a-decentralized-structured-storage-system

[16] http://daizuozhuo.github.io/cassandra-review/

[17] https://xduan7.com/2016/02/09/paper-review-cassandra-a-decentralized-structured-storage-system/

[18] http://www.ibm.com/developerworks/library/os-apache-cassandra/

[19] http://www.computerweekly.com/tip/Securing-NoSQL-applications-Best-practises-for-big-data-security
