Posted on 23-Dec-2015
Survey: Cloud Storage Systems
Presented By:
Nitya Shankaran
Ritika Sharma
Overview
- Motivation behind the project: need to know how your data can be stored efficiently in the cloud, i.e., choosing the right kind of storage
- What has been done so far
Types of Data Storage
- Object storage
- Relational Database storage
- Distributed File Systems, etc.
Object Storage
- Uses data objects instead of files to store and retrieve data
- Maintains an index of Object ID (OID) numbers
- Ideal for storing large files
Amazon S3
“Provides a simple web interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web”.
As of July 2011, S3 stored over 449 billion user objects and handled 900 million user requests a day.
Amazon claims that S3 provides infinite storage capacity, infinite data durability, 99.99% availability, and good data access performance.
How Amazon S3 stores data
- Makes use of buckets
- Objects are identified by a unique key, which is assigned by the user
- S3 stores objects of up to 5 TB in size, each accompanied by 2 KB of metadata (Content-Type, date last modified, etc.)
ACLs
- Read for objects and buckets
- Write for buckets only
- Read and write for objects
Buckets and objects are created, listed, and retrieved using a REST-style HTTP interface or a SOAP interface.
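As a rough illustration of the REST-style interface, the sketch below builds the virtual-hosted-style URL for fetching an object. The bucket and key names are made up, and a real request would also need an AWS authentication signature:

```python
# Build the REST-style URL for fetching an S3 object (sketch only;
# a real request must also carry an AWS authentication signature).
def s3_object_url(bucket: str, key: str) -> str:
    # Virtual-hosted-style addressing: the bucket becomes part of the hostname.
    return f"https://{bucket}.s3.amazonaws.com/{key}"

url = s3_object_url("my-example-bucket", "photos/cat.jpg")
print(url)  # https://my-example-bucket.s3.amazonaws.com/photos/cat.jpg
```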
Evaluation of S3
Experimental Setup
Features and Findings:
- Data Durability
- Replica Replacement
- Data Reliability
- Availability
- Versioning of objects
- Data Access performance
- Security
- Easy to use
OpenStack Swift
Used for creating redundant, scalable object storage using clusters of standardized servers to store petabytes of accessible data.
Provides greater scalability, redundancy and permanence due to no central point of control.
Objects are written to multiple hardware devices in the data center, with the OpenStack software responsible for ensuring data replication and integrity across the cluster.
Storage clusters can scale horizontally by adding new nodes. Should a node fail, OpenStack works to replicate its content from other active nodes.
Used mainly for virtual machine images, photo storage, email storage and backup archiving.
Swift has a RESTful API.
Architecture of Swift
- Proxy server processes API requests and routes requests to storage nodes
- Auth server authenticates and authorizes requests
- Ring represents mapping between the names of entities stored on disk and their physical location
- Replicator provides redundancy for accounts, containers, objects
- Updater processes failed or queued updates
- Auditor verifies integrity of objects, containers, and accounts
- Account Server handles listing of containers, stores as SQLite DB
- Container Server handles listing of objects, stores as SQLite DB
- Object Server stores, retrieves, and deletes objects stored on local devices
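The ring's name-to-location mapping can be pictured as hashing object names onto a fixed set of partitions, which are in turn assigned to storage devices. This is a toy illustration, not Swift's actual ring code; the device names, partition count, and hash folding are assumptions:

```python
import hashlib

# Toy ring: map object name -> partition -> device (illustrative only;
# Swift's real ring also handles zones, device weights, and replica counts).
PARTITION_COUNT = 8
DEVICES = ["dev0", "dev1", "dev2", "dev3"]

def partition_for(name: str) -> int:
    # Hash the object name and fold the digest into a partition number.
    digest = hashlib.md5(name.encode()).hexdigest()
    return int(digest, 16) % PARTITION_COUNT

def device_for(name: str) -> str:
    # Partitions are statically assigned to devices round-robin here.
    return DEVICES[partition_for(name) % len(DEVICES)]

# The mapping is deterministic: the same name always lands on the same device.
assert device_for("photos/cat.jpg") == device_for("photos/cat.jpg")
```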
Evaluation of Swift
- Data Durability
- Replica Replacement
- Data Reliability
- Availability
- Data scalability
- Security
- Objects must be < 5GB
- Not a Filesystem
- No User Quotas
- No Directory Hierarchies
- No writing to a byte offset in a file
- No ACLs
Swift is mainly used for:
- Storing media libraries (photos, music, videos, etc.)
- Archiving video surveillance files
- Archiving phone call audio recordings
- Archiving compressed log files
- Archiving backups (< 5 GB each object)
- Storing and loading of OS images, etc.
- Storing file populations that grow continuously on a practically infinite basis
- Storing small files (< 50 KB); OpenStack Object Storage is great at this
- Storing billions of files
- Storing petabytes (millions of gigabytes) of data
Relational Database Storage (RDS)
Aims to move much of the operational burden of provisioning, configuration, scaling, performance tuning, backup, privacy, and access control from the database users to the service operator, offering lower overall costs to users.
Advantages:
- Hardware costs are lower
- Operational costs are lower
Disadvantages:
- Inability to scale well
- Labor intensive (managing relational databases)
- Error prone
- Increased complexity, since each database package comes with its own configuration options, tools, performance sensitivities, and bugs
Microsoft SQL Azure
Cloud-based relational database service built on SQL Server
Uses T-SQL as the query language and Tabular Data Stream (TDS) as the protocol
Unlike S3, SQL Azure does not provide a REST-based API to access the service over HTTP; instead, it is accessed via Tabular Data Stream (TDS).
Allows relational queries to be made against stored data
Enables querying data, search, data analysis and data synchronization.
Network Topology – Part 1
Client Layer (at customer premises or on the Windows Azure Platform)
HTTP/REST
Network Topology – Part 2
Evaluation Of SQL Azure
- Replica Replacement
- Data Reliability
- Availability
- Data Access performance
- Security
- Scalability:
You can store any amount of data, from kilobytes to terabytes, in SQL Azure. However, individual databases are limited to 10 GB in size.
Sharding
Data sharding is a technique used by many applications to improve performance, scalability and cost by partitioning the data.
For example, consider applications that store and process sales data using date or time predicates. These applications can benefit from processing a subset of the data instead of the entire data set.
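The date-predicate example can be sketched as follows. The sales data and the month-based shard key are made up for illustration and do not reflect SQL Azure's actual sharding machinery:

```python
from datetime import date

# Toy sharding sketch: partition sales rows by month so that a
# date-bounded query only scans one shard instead of the whole data set.
sales = [
    (date(2011, 1, 5), 100),
    (date(2011, 1, 20), 250),
    (date(2011, 2, 3), 75),
]

def shard_key(d: date) -> str:
    # One shard per calendar month, e.g. "2011-01".
    return f"{d.year}-{d.month:02d}"

shards: dict[str, list] = {}
for row in sales:
    shards.setdefault(shard_key(row[0]), []).append(row)

# A query for January totals only touches the "2011-01" shard.
january_total = sum(amount for _, amount in shards["2011-01"])
print(january_total)  # 350
```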
Comparison: Amazon S3 vs. OpenStack Swift vs. SQL Azure

Type of Storage:
- Amazon S3: Object storage
- OpenStack Swift: Object storage
- SQL Azure: RDS storage

Data Replication:
- Amazon S3: Stores multiple redundant copies; 95.89% availability rate
- OpenStack Swift: Consists of a Replicator; ensures integrity of data
- SQL Azure: Ensures data availability by replication (SQL Azure fabric) and provides load balancing

Data Scalability:
- Amazon S3: Scalable
- OpenStack Swift: High scalability
- SQL Azure: Individual databases limited to 10 GB

Security:
- Amazon S3: Clients authenticated by using a public/private key scheme
- OpenStack Swift: ----
- SQL Azure: Provides a set of security principles

Usage:
- Amazon S3: Suitable for large data objects and parallel applications
- OpenStack Swift: Media libraries, OS images, backups and log files
- SQL Azure: Querying the data and performing analysis
Distributed File system
Google File System
Architecture
- Single Master, multiple chunkservers
- HeartBeat messages
- Chunk size = 64 MB
- Metadata:
  - File & chunk namespace
  - Mapping from files to chunks
  - Locations of each chunk's replicas
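The master's metadata can be pictured as two mappings: files to chunk handles, and chunk handles to replica locations. The paths, handles, and server names below are hypothetical:

```python
# Toy sketch of the GFS master's metadata tables (illustrative only).
file_to_chunks = {
    "/data/log1": ["chunk-001", "chunk-002"],
}
chunk_locations = {
    "chunk-001": ["chunkserver-A", "chunkserver-B", "chunkserver-C"],
    "chunk-002": ["chunkserver-B", "chunkserver-C", "chunkserver-D"],
}

def replicas_for(path: str, chunk_index: int) -> list[str]:
    # A read first asks the master which chunk holds the data,
    # then contacts one of the listed chunkservers directly.
    handle = file_to_chunks[path][chunk_index]
    return chunk_locations[handle]

print(replicas_for("/data/log1", 0))  # ['chunkserver-A', 'chunkserver-B', 'chunkserver-C']
```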
GFS Micro-benchmarks
GFS Cluster:
- 1 master and 2 master replicas
- 16 chunkservers, 16 clients
- Dual 1.4 GHz P3 processors
- 100 Mbps full-duplex Ethernet connection to a switch
- 19 servers are connected to switch S1 and 16 clients are connected to switch S2
- S1 and S2 are connected with a 1 Gbps link
Reads
- Peak limit: 125 MB/s for the 1 Gbps link; 12.5 MB/s per client for the 100 Mbps link
- When 1 client is reading: read rate = 10 MB/s = 80% of 12.5 MB/s
- When 16 clients are reading: read rate = 94 MB/s = 6 MB/s per client = 75% of the peak limit
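The read numbers above are internally consistent; a quick arithmetic check:

```python
# Sanity-check the GFS read-benchmark figures from the slide.
per_client_limit = 12.5   # MB/s: 100 Mbps link / 8 bits per byte
aggregate_limit = 125.0   # MB/s: 1 Gbps link / 8 bits per byte

one_client = 10.0 / per_client_limit        # fraction of per-client limit
sixteen_clients = 94.0 / aggregate_limit    # fraction of aggregate limit
per_client_at_16 = 94.0 / 16                # MB/s for each of 16 clients

print(one_client)                   # 0.8 -> 80%
print(round(sixteen_clients, 2))    # 0.75 -> 75%
print(round(per_client_at_16, 1))   # 5.9 -> roughly 6 MB/s per client
```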
GFS Micro-benchmarks (cont'd)
Writes
- Input connection limit = 12.5 MB/s; write limit = 67 MB/s (each byte is written to 3 of the 16 chunkservers)
- When 1 client is writing: 6.3 MB/s (delays in propagating data between servers)
- When 16 clients are writing: 35 MB/s, i.e., 2.2 MB/s per client
Record Appends
- Performance = network bandwidth of the chunkserver holding the last chunk of the file, independent of the number of clients
- When 1 client is appending: limit = 6 MB/s
- When 16 clients are appending: 4.8 MB/s per client
Features
- Data integrity: checksums
- Replica placement: data reliability, availability
- Fast recovery: chunk and master replication
- Rebalancing replicas: better disk space utilization, load balancing
Garbage Collection
- Does not immediately reclaim the available storage
- The file is renamed to a hidden name
- Removed after 3 days
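The lazy-reclamation scheme above can be sketched as follows; the day-based clock and in-memory file table are simplifications for illustration:

```python
# Toy sketch of GFS-style lazy garbage collection: deletion only hides
# the file; storage is reclaimed after a grace period (3 days on the slide).
GRACE_PERIOD_DAYS = 3

files = {"/data/log1": {"hidden": False, "deleted_on_day": None}}

def delete(path: str, today: int) -> None:
    # "Delete" just renames to a hidden state; data is still recoverable.
    files[path] = {"hidden": True, "deleted_on_day": today}

def collect_garbage(today: int) -> None:
    # Reclaim only files that have been hidden past the grace period.
    for path in list(files):
        meta = files[path]
        if meta["hidden"] and today - meta["deleted_on_day"] >= GRACE_PERIOD_DAYS:
            del files[path]

delete("/data/log1", today=0)
collect_garbage(today=2)
print("/data/log1" in files)  # True: still within the grace period
collect_garbage(today=3)
print("/data/log1" in files)  # False: storage reclaimed
```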
Stale Replica Detection
- Each chunk has a version number, which is incremented on every update
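Version-number-based staleness can be sketched as follows; the chunk handle and server names are hypothetical:

```python
# Toy sketch of stale-replica detection: the master bumps the chunk's
# version on each mutation; a replica that missed the update keeps the
# old version number and is therefore detectably stale.
master_version = {"chunk-001": 1}
replica_versions = {"chunkserver-A": 1, "chunkserver-B": 1, "chunkserver-C": 1}

# A mutation happens while chunkserver-C is down: its replica goes stale.
master_version["chunk-001"] += 1
replica_versions["chunkserver-A"] += 1
replica_versions["chunkserver-B"] += 1

stale = [s for s, v in replica_versions.items() if v < master_version["chunk-001"]]
print(stale)  # ['chunkserver-C']
```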
Hadoop Distributed File System
Architecture
NameNode (metadata):
- Hierarchy of files & directories, with attributes
- Block size = 128 MB
- Primary and Secondary NameNode
DataNode (application data):
- Each block replica is stored as 2 files: the data itself and the block's metadata (checksum)
Handshake between NameNode and DataNode:
- Verify namespace ID and s/w version
- Communication via TCP; heartbeat messages
HDFS Client:
- Interface between user application and HDFS
- Reading a file, writing a file
- Single-writer, multiple-reader model
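The single-writer, multiple-reader model can be sketched with a simple lease table; this is an illustration, not HDFS's actual lease manager:

```python
# Toy sketch of HDFS's single-writer, multiple-reader model: at most one
# client holds the write lease on a file; reads are never restricted.
write_leases: dict[str, str] = {}

def open_for_write(path: str, client: str) -> bool:
    if path in write_leases:
        return False  # another writer already holds the lease
    write_leases[path] = client
    return True

def open_for_read(path: str, client: str) -> bool:
    return True  # any number of concurrent readers is allowed

print(open_for_write("/logs/a", "client-1"))  # True
print(open_for_write("/logs/a", "client-2"))  # False: single writer
print(open_for_read("/logs/a", "client-3"))   # True: readers unrestricted
```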
HDFS Benchmarks
HDFS clusters at Yahoo!:
- 3500 nodes
- 2 quad-core Xeon processors @ 2.5 GHz
- Linux, 16 GB RAM, 1 Gbps Ethernet
DFSIO benchmark:
- Read: 66 MB/s per node
- Write: 40 MB/s per node
- Busy cluster read: 1.02 MB/s per node
- Busy cluster write: 1.09 MB/s per node
NNThroughput benchmark:
- Starts the NameNode app and multiple client threads on the same node
Operation: Throughput (op/s)
- Open file for read: 126,100
- Create file: 5,600
- Rename file: 8,300
- Delete file: 207,000
- DataNode heartbeat: 300,000
- Block report (blocks/s): 639,700
Features
Good placement policy:
- 1 replica per DataNode, < 2 replicas per rack
- Improves data reliability, availability, and network bandwidth utilization
Replication management:
- Priority queue
Load balancing:
- No built-in strategy; the Balancer, an application program, balances disk usage so that each node's utilization stays within a configurable threshold (in the range (0, 1)) of the cluster utilization
Data integrity:
- Block scanner on each node verifies checksums
Inter-cluster data copy
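The placement policy as stated on the slide can be checked mechanically. The rack/node pairs below are hypothetical, and note that the policy is taken verbatim from the slide (HDFS's actual default policy details differ across versions):

```python
# Toy validator for the placement policy stated on the slide:
# at most 1 replica per DataNode, and fewer than 2 replicas per rack.
def placement_ok(replicas: list) -> bool:
    """replicas: list of (rack, datanode) pairs for one block."""
    nodes = [node for _, node in replicas]
    racks = [rack for rack, _ in replicas]
    if len(nodes) != len(set(nodes)):
        return False  # some DataNode holds more than one replica
    return all(racks.count(r) < 2 for r in set(racks))

print(placement_ok([("rack1", "dn1"), ("rack2", "dn2"), ("rack3", "dn3")]))  # True
print(placement_ok([("rack1", "dn1"), ("rack1", "dn2")]))  # False: 2 in one rack
```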
GFS v/s HDFS

Feature: Google File System | Hadoop Distributed File System
- Read rate: 6 MB/s per client | 1.02 MB/s per node
- Write rate: 2.2 MB/s per client | 1.09 MB/s per node
- Record append: 4.8 MB/s per client | -
- Availability: High | Single point of failure
- Data integrity: Checksum | Checksum
- Replica placement: Yes | Yes
- Rebalancing: Yes | Yes
- Garbage collection: Yes | No
- Stale replica detection: Yes | No
- Inter-cluster data copy: No | Yes
Thank You!