Bigtable presentation (final)
DESCRIPTION
A presentation at Rice for COMP 520, Distributed Systems.

TRANSCRIPT
A Distributed Storage System for Structured Data
Bigtable
Presenters: Yunming Zhang, Conglong Li
Saturday, September 21, 13
References
SOCC 2010 keynote slides, Jeff Dean, Google
Introduction to Distributed Computing, Winter 2008, University of Washington
Motivation
Lots of (semi-)structured data at Google:
URLs: contents, crawl metadata, links
Per-user data: preference settings, search results
Scale is large: billions of URLs, hundreds of millions of users
Existing commercial databases don't meet the requirements
Goals
Store and manage all the state reliably and efficiently
Allow asynchronous processes to update different pieces of data continuously
Very high read/write rates
Efficient scans over all or interesting subsets of data
Often want to examine data changes over time
BigTable vs. GFS
GFS provides raw data storage. We need:
More sophisticated storage: a key-value mapping
Flexible enough to be useful: stores semi-structured data
Reliable, scalable, etc.
BigTable
Bigtable is a distributed storage system for managing large-scale structured data
Wide applicability
Scalability
High performance
High availability
Overview
Data Model
API
Implementation Structures
Optimizations
Performance Evaluation
Applications
Conclusions
Data Model
Sparse
Sorted
Multidimensional
Cell
Contains multiple versions of the data
A cell is located by its row key, column key, and timestamp
Treats data as uninterpreted arrays of bytes, which allows clients to serialize various forms of structured and semi-structured data
Supports automatic garbage collection per column family for management of versioned data
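The data model described above can be sketched as a sparse map from (row key, column key, timestamp) to an uninterpreted byte string, with per-cell version garbage collection. This is an illustrative Python sketch, not Bigtable's real API; all names are made up.

```python
class Table:
    """Toy model of Bigtable's logical data model: a sparse map
    from (row, column) to timestamped, uninterpreted byte values."""

    def __init__(self):
        self.cells = {}  # (row, column) -> {timestamp: bytes}

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self.cells.get((row, column), {})
        if not versions:
            return None
        if timestamp is None:
            # Default: return the most recent version.
            return versions[max(versions)]
        return versions.get(timestamp)

    def gc(self, row, column, keep_last_n):
        # Per-column-family garbage collection policy:
        # keep only the newest N versions of a cell.
        versions = self.cells.get((row, column), {})
        for ts in sorted(versions)[:-keep_last_n]:
            del versions[ts]

t = Table()
t.put("com.cnn.www", "contents:", 1, b"<html>v1")
t.put("com.cnn.www", "contents:", 2, b"<html>v2")
```

After `t.gc("com.cnn.www", "contents:", 1)`, only the newest version survives, while a default `t.get` still returns `b"<html>v2"`.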
Goals
Store and manage all the state reliably and efficiently
Allow asynchronous processes to update different pieces of data continuously
Very high read/write rates
Efficient scans over all or interesting subsets of data
Often want to examine data changes over time
Row
Row key is an arbitrary string
Access to column data in a row is atomic
Row creation is implicit upon storing data
Rows are ordered lexicographically
Rows close together lexicographically usually reside on one or a small number of machines
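The locality property above is why clients pick row keys deliberately. A sketch, using the reversed-domain trick from the Bigtable paper: because rows sort lexicographically, reversing the hostname keeps all pages of one site adjacent, and thus likely in the same tablet.

```python
# Illustrative URLs; the row_key helper is a hypothetical name.
urls = ["maps.google.com/index.html",
        "www.cnn.com/world",
        "www.cnn.com/sports",
        "www.google.com/"]

def row_key(url):
    # "www.cnn.com/world" -> "com.cnn.www/world"
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + "/" + path

keys = sorted(row_key(u) for u in urls)
# Both com.cnn.www pages end up adjacent in the sorted key order.
```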
Columns
Columns are grouped into column families: family:optional_qualifier
Column family:
Has associated type information
Data within a family is usually of the same type
Overview
Data Model
API
Implementation Structures
Optimizations
Performance Evaluation
Applications
Conclusions
API
Metadata operations: create/delete tables and column families, change metadata, modify access control lists
Writes (atomic): Set(), DeleteCells(), DeleteRow()
Reads: a Scanner can read arbitrary cells in a Bigtable
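A sketch of the write path only (the real client library is C++; these Python names are illustrative, not Bigtable's actual API): mutations are collected per row and applied atomically, matching the single-row atomicity guarantee above.

```python
class RowMutation:
    """Collects writes/deletes against one row; applied as a unit."""
    def __init__(self, row):
        self.row = row
        self.ops = []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete_cells(self, column):
        self.ops.append(("del", column))

def apply_mutation(table, mutation):
    # All ops touch a single row, so applying them together
    # models the single-row atomic commit.
    row = table.setdefault(mutation.row, {})
    for op in mutation.ops:
        if op[0] == "set":
            row[op[1]] = op[2]
        else:
            row.pop(op[1], None)

table = {}
m = RowMutation("com.cnn.www")
m.set("anchor:www.abc.com", b"CNN")
apply_mutation(table, m)
```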
Overview
Data Model
API
Implementation Structures
Optimizations
Performance Evaluation
Applications
Conclusions
Tablets
Large tables are broken into tablets at row boundaries
A tablet holds a contiguous range of rows
Clients can often choose row keys for locality
Aim for ~100-200 MB of data per tablet
Each serving machine is responsible for ~100 tablets
Fast recovery: 100 machines each pick up 1 tablet from a failed machine
Fine-grained load balancing: migrate tablets away from overloaded machines
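The splitting rule above can be sketched as partitioning a sorted row list into contiguous ranges under a size budget. A toy sketch, using a 100-byte budget instead of the ~100-200 MB the slides cite:

```python
def split_into_tablets(rows, max_bytes=100):
    """rows: iterable of (row_key, size_in_bytes).
    Returns contiguous groups of row keys, each under max_bytes."""
    tablets, current, size = [], [], 0
    for key, nbytes in sorted(rows):
        if current and size + nbytes > max_bytes:
            tablets.append(current)      # close this tablet at a row boundary
            current, size = [], 0
        current.append(key)
        size += nbytes
    if current:
        tablets.append(current)
    return tablets

# Ten 30-byte rows -> tablets of 3 rows each (90 B) plus a remainder.
rows = [("row%02d" % i, 30) for i in range(10)]
tablets = split_into_tablets(rows)
```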
Tablets and Splitting
System Structure
Master:
Metadata operations
Load balancing
Keeps track of live tablet servers
Master failure
Tablet server:
Accepts reads and writes to data
(Figure: system structure — clients send reads/writes directly to tablet servers; metadata operations go through the master)
Locating Tablets
3-level hierarchical lookup scheme for tablets
A location is the IP and port of a server, stored in METADATA tables
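The 3-level lookup can be sketched as: a Chubby file names the root tablet, the root tablet indexes METADATA tablets, and a METADATA tablet maps a (table, row) to the address of the serving tablet server. All server names and addresses below are invented for illustration.

```python
# Level 1: Chubby file -> location of the root tablet (not modeled
# further here; the root tablet's contents are inlined below).
root_tablet = {                      # Level 2: root tablet
    "meta1": "meta-server-a:9000",
    "meta2": "meta-server-b:9000",
}
metadata_tablets = {                 # Level 3: METADATA tablets
    "meta-server-a:9000": {("users", "alice"): "ts-17:9000"},
    "meta-server-b:9000": {("users", "zoe"): "ts-42:9000"},
}

def locate(table, row):
    """Cold-cache lookup: walk root -> METADATA -> user tablet.
    Real clients cache and prefetch these locations."""
    for meta_addr in root_tablet.values():
        loc = metadata_tablets[meta_addr].get((table, row))
        if loc:
            return loc
    return None
```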
Tablet Representation and Serving
Append-only tablet log
SSTable on GFS: a sorted map of string to string, so all the data for a row is contiguous
Memtable write buffer: when a read comes in, the SSTable data and the uncommitted values must be merged
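The merged read described above can be sketched as: check the in-memory buffer first, then fall back through the SSTables from newest to oldest. A minimal sketch with dicts standing in for sorted files:

```python
# Oldest ... newest; each SSTable is an immutable sorted map.
sstables = [
    {"row1": b"old", "row2": b"x"},
    {"row1": b"newer"},
]
# In-memory write buffer of recent, not-yet-compacted mutations.
memtable = {"row3": b"buffered"}

def read(row):
    """A read merges the memtable with the SSTables, newest first."""
    if row in memtable:
        return memtable[row]
    for sst in reversed(sstables):   # newer SSTables shadow older ones
        if row in sst:
            return sst[row]
    return None
```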
Compaction
Tablet state is represented as a set of immutable compacted SSTable files, plus a tail of the log
Minor compaction: when the in-memory buffer fills up, freeze it and create a new SSTable
Major compaction: periodically compact all SSTables for a tablet into a new base SSTable on GFS
Storage is reclaimed from deletions at this point
Produces new tables
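The two compaction kinds above can be sketched as: minor compaction freezes the memtable into a new immutable SSTable; major compaction merges every SSTable into one base SSTable, which is the point where deletion markers are finally dropped. An illustrative sketch:

```python
TOMBSTONE = object()   # stand-in for a deletion marker

def minor_compact(memtable, sstables):
    # Freeze the write buffer as a new immutable SSTable.
    sstables.append(dict(memtable))
    memtable.clear()

def major_compact(sstables):
    base = {}
    for sst in sstables:             # oldest to newest; newer wins
        base.update(sst)
    # Storage for deleted cells is reclaimed only here.
    base = {k: v for k, v in base.items() if v is not TOMBSTONE}
    sstables[:] = [base]

sstables = [{"a": b"1", "b": b"2"}]
memtable = {"b": TOMBSTONE, "c": b"3"}
minor_compact(memtable, sstables)
major_compact(sstables)
```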
Overview
Data Model
API
Implementation Structures
Optimizations
Performance Evaluation
Applications
Conclusions
Goals
A reliable system for storing and managing all the state
Allow asynchronous processes to update different pieces of data continuously
Very high read/write rates
Efficient scans over all or interesting subsets of data
Often want to examine data changes over time
Locality Groups
Clients can group multiple column families together into a locality group
A separate SSTable is generated for each locality group
Enables more efficient reads
Can be declared to be in-memory
Compression
Many opportunities for compression: similar values in columns and cells
Within each SSTable for a locality group, encode compressed blocks
Keep blocks small for random access
Exploit the fact that many values are very similar
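The block-compression idea above can be sketched with zlib (a stand-in; Bigtable uses its own custom compression schemes): compress in small blocks so one value can be read without decompressing the whole SSTable, and let the similarity between neighboring values do the work.

```python
import zlib

# 100 similar values, as crawled page contents would be.
values = [("page version %d" % i).encode() for i in range(100)]
BLOCK = 10   # values per compressed block; small for random access

blocks = [zlib.compress(b"\n".join(values[i:i + BLOCK]))
          for i in range(0, len(values), BLOCK)]

def read_value(n):
    # Decompress only the small block containing entry n.
    return zlib.decompress(blocks[n // BLOCK]).split(b"\n")[n % BLOCK]

# Repeated "page version " prefixes compress well even per block.
ratio = sum(len(b) for b in blocks) / sum(len(v) for v in values)
```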
Goals
A reliable system for storing and managing all the state
Allow asynchronous processes to update different pieces of data continuously
Very high read/write rates
Efficient scans over all or interesting subsets of data
Often want to examine data changes over time
Commit log and recovery
Single commit log file per tablet server: reduces the number of concurrent file writes to GFS
Tablet recovery: starting from a redo point in the log, replay the same set of operations since the last persistent state
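Recovery from a redo point can be sketched as: scan the server's shared log, keep only entries for the recovering tablet with sequence numbers past the redo point, and replay them into a fresh memtable. An illustrative sketch:

```python
# One commit log per tablet server, interleaving several tablets.
log = [  # (sequence number, tablet, row, value)
    (1, "t1", "a", b"1"),
    (2, "t2", "x", b"9"),
    (3, "t1", "b", b"2"),
    (4, "t1", "a", b"3"),
]

def recover(tablet, redo_point):
    """Replay mutations after the redo point (the sequence number
    already covered by the tablet's last persisted SSTable)."""
    memtable = {}
    for seq, t, row, value in log:
        if t == tablet and seq > redo_point:
            memtable[row] = value    # later entries overwrite earlier
    return memtable
```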
Overview
Data Model
API
Implementation Structures
Optimizations
Performance Evaluation
Applications
Conclusions
Performance evaluation
Test environment:
Based on a GFS cell with 1876 machines
400 GB IDE hard drives in each machine
Two-level tree-shaped switched network
Performance tests:
Random read/write
Sequential read/write
Single tablet-server performance
Random reads are the slowest: a 64 KB SSTable block is transferred over GFS to read 1000 bytes
Random and sequential writes perform better: writes are appended to a single commit log on the server, with group commit
Performance Scaling
Performance didn't scale linearly
Load imbalance in multi-server configurations
Larger data transfer overhead
Overview
Data Model
API
Implementation Structures
Optimizations
Performance Evaluation
Applications
Conclusions
Google Analytics
A service that analyzes traffic patterns at web sites
Raw click table:
A row for each end-user session
Row key is (website name, time)
Summary table:
Extracts recent session data using MapReduce jobs
Google Earth
Uses one table for preprocessing and one for serving
Different latency requirements (disk vs. memory)
Each row in the imagery table represents a single geographic segment
A column family stores the data sources
One column for each raw image
Very sparse
Personalized Search
Row key is a unique user id
A column family for each type of user action
Replicated across Bigtable clusters to increase availability and reduce latency
Conclusions
Bigtable provides highly scalable, high-performance, highly available, and flexible storage for structured data
It provides a low-level read/write interface for other frameworks to build on top of
It has enabled Google to handle large-scale data efficiently