Bigtable presentation (final)
DESCRIPTION
A presentation at Rice for COMP 520, Distributed Systems.

TRANSCRIPT
A Distributed Storage System for Structured Data
Bigtable
Presenters: Yunming Zhang, Conglong Li
Saturday, September 21, 13
References
SOCC 2010 keynote slides, Jeff Dean, Google
Introduction to Distributed Computing, Winter 2008, University of Washington
Motivation
Lots of (semi-)structured data at Google:
URLs: contents, crawl metadata, links
Per-user data: preference settings, search results
Scale is large: billions of URLs, hundreds of millions of users
Existing commercial databases don't meet the requirements
Goals
Store and manage all the state reliably and efficiently
Allow asynchronous processes to update different pieces of data continuously
Very high read/write rates
Efficient scans over all or interesting subsets of data
Often want to examine data changes over time
BigTable vs. GFS
GFS provides raw data storage. We need:
More sophisticated storage: a key-value mapping
Flexible enough to be useful: stores semi-structured data
Reliable, scalable, etc.
BigTable
Bigtable is a distributed storage system for managing large-scale structured data
Wide applicability
Scalability
High performance
High availability
Overview
Data Model
API
Implementation Structures
Optimizations
Performance Evaluation
Applications
Conclusions
Data Model
Sparse
Sorted
Multidimensional
Cell
Contains multiple versions of the data
A cell is located by its row key, column key, and timestamp
Treats data as uninterpreted arrays of bytes, which allows clients to serialize various forms of structured and semi-structured data
Supports automatic garbage collection per column family for management of versioned data
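The data model described above can be sketched as a sparse map from (row key, column key, timestamp) to an uninterpreted byte string, with per-cell version garbage collection. This is an illustrative Python sketch, not Bigtable's real API; all names are made up.

```python
class Table:
    """Toy model of Bigtable's logical data model: a sparse map
    from (row, column) to timestamped, uninterpreted byte values."""

    def __init__(self):
        self.cells = {}  # (row, column) -> {timestamp: bytes}

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self.cells.get((row, column), {})
        if not versions:
            return None
        if timestamp is None:
            # Default: return the most recent version.
            return versions[max(versions)]
        return versions.get(timestamp)

    def gc(self, row, column, keep_last_n):
        # Per-column-family garbage collection policy:
        # keep only the newest N versions of a cell.
        versions = self.cells.get((row, column), {})
        for ts in sorted(versions)[:-keep_last_n]:
            del versions[ts]

t = Table()
t.put("com.cnn.www", "contents:", 1, b"<html>v1")
t.put("com.cnn.www", "contents:", 2, b"<html>v2")
```

After `t.gc("com.cnn.www", "contents:", 1)`, only the newest version survives, while a default `t.get` still returns `b"<html>v2"`.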
Goals
Store and manage all the state reliably and efficiently
Allow asynchronous processes to update different pieces of data continuously
Very high read/write rates
Efficient scans over all or interesting subsets of data
Often want to examine data changes over time
Row
Row key is an arbitrary string
Access to column data in a row is atomic
Row creation is implicit upon storing data
Rows are ordered lexicographically
Rows close together lexicographically usually reside on one or a small number of machines
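The locality property above is why clients pick row keys deliberately. A sketch, using the reversed-domain trick from the Bigtable paper: because rows sort lexicographically, reversing the hostname keeps all pages of one site adjacent, and thus likely in the same tablet.

```python
# Illustrative URLs; the row_key helper is a hypothetical name.
urls = ["maps.google.com/index.html",
        "www.cnn.com/world",
        "www.cnn.com/sports",
        "www.google.com/"]

def row_key(url):
    # "www.cnn.com/world" -> "com.cnn.www/world"
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + "/" + path

keys = sorted(row_key(u) for u in urls)
# Both com.cnn.www pages end up adjacent in the sorted key order.
```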
Columns
Columns are grouped into column families: family:optional_qualifier
Column family:
Has associated type information
Data within a family is usually of the same type
Overview
Data Model
API
Implementation Structures
Optimizations
Performance Evaluation
Applications
Conclusions
API
Metadata operations: create/delete tables and column families, change metadata, modify access control lists
Writes (atomic): Set(), DeleteCells(), DeleteRow()
Reads: a Scanner can read arbitrary cells in a Bigtable
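A sketch of the write path only (the real client library is C++; these Python names are illustrative, not Bigtable's actual API): mutations are collected per row and applied atomically, matching the single-row atomicity guarantee above.

```python
class RowMutation:
    """Collects writes/deletes against one row; applied as a unit."""
    def __init__(self, row):
        self.row = row
        self.ops = []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete_cells(self, column):
        self.ops.append(("del", column))

def apply_mutation(table, mutation):
    # All ops touch a single row, so applying them together
    # models the single-row atomic commit.
    row = table.setdefault(mutation.row, {})
    for op in mutation.ops:
        if op[0] == "set":
            row[op[1]] = op[2]
        else:
            row.pop(op[1], None)

table = {}
m = RowMutation("com.cnn.www")
m.set("anchor:www.abc.com", b"CNN")
apply_mutation(table, m)
```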
Overview
Data Model
API
Implementation Structures
Optimizations
Performance Evaluation
Applications
Conclusions
Tablets
Large tables are broken into tablets at row boundaries
A tablet holds a contiguous range of rows
Clients can often choose row keys for locality
Aim for ~100-200 MB of data per tablet
Each serving machine is responsible for ~100 tablets
Fast recovery: 100 machines each pick up 1 tablet from a failed machine
Fine-grained load balancing: migrate tablets away from overloaded machines
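The splitting rule above can be sketched as partitioning a sorted row list into contiguous ranges under a size budget. A toy sketch, using a 100-byte budget instead of the ~100-200 MB the slides cite:

```python
def split_into_tablets(rows, max_bytes=100):
    """rows: iterable of (row_key, size_in_bytes).
    Returns contiguous groups of row keys, each under max_bytes."""
    tablets, current, size = [], [], 0
    for key, nbytes in sorted(rows):
        if current and size + nbytes > max_bytes:
            tablets.append(current)      # close this tablet at a row boundary
            current, size = [], 0
        current.append(key)
        size += nbytes
    if current:
        tablets.append(current)
    return tablets

# Ten 30-byte rows -> tablets of 3 rows each (90 B) plus a remainder.
rows = [("row%02d" % i, 30) for i in range(10)]
tablets = split_into_tablets(rows)
```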
Tablets and Splitting
System Structure
Master:
Metadata operations
Load balancing
Keeps track of live tablet servers
Master failure
Tablet server:
Accepts reads and writes to data
(Figure: system structure — clients send reads/writes directly to tablet servers; metadata operations go through the master)
Locating Tablets
3-level hierarchical lookup scheme for tablets
A location is the IP and port of a server, stored in METADATA tables
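The 3-level lookup can be sketched as: a Chubby file names the root tablet, the root tablet indexes METADATA tablets, and a METADATA tablet maps a (table, row) to the address of the serving tablet server. All server names and addresses below are invented for illustration.

```python
# Level 1: Chubby file -> location of the root tablet (not modeled
# further here; the root tablet's contents are inlined below).
root_tablet = {                      # Level 2: root tablet
    "meta1": "meta-server-a:9000",
    "meta2": "meta-server-b:9000",
}
metadata_tablets = {                 # Level 3: METADATA tablets
    "meta-server-a:9000": {("users", "alice"): "ts-17:9000"},
    "meta-server-b:9000": {("users", "zoe"): "ts-42:9000"},
}

def locate(table, row):
    """Cold-cache lookup: walk root -> METADATA -> user tablet.
    Real clients cache and prefetch these locations."""
    for meta_addr in root_tablet.values():
        loc = metadata_tablets[meta_addr].get((table, row))
        if loc:
            return loc
    return None
```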
Tablet Representation and Serving
Append-only tablet log
SSTable on GFS: a sorted map of string to string, so all the data for a row is contiguous
Memtable write buffer: when a read comes in, the SSTable data and the uncommitted values must be merged
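The merged read described above can be sketched as: check the in-memory buffer first, then fall back through the SSTables from newest to oldest. A minimal sketch with dicts standing in for sorted files:

```python
# Oldest ... newest; each SSTable is an immutable sorted map.
sstables = [
    {"row1": b"old", "row2": b"x"},
    {"row1": b"newer"},
]
# In-memory write buffer of recent, not-yet-compacted mutations.
memtable = {"row3": b"buffered"}

def read(row):
    """A read merges the memtable with the SSTables, newest first."""
    if row in memtable:
        return memtable[row]
    for sst in reversed(sstables):   # newer SSTables shadow older ones
        if row in sst:
            return sst[row]
    return None
```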
Compaction
Tablet state is represented as a set of immutable compacted SSTable files, plus a tail of the log
Minor compaction: when the in-memory buffer fills up, freeze it and create a new SSTable
Major compaction: periodically compact all SSTables for a tablet into a new base SSTable on GFS
Storage is reclaimed from deletions at this point
Produces new tables
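The two compaction kinds above can be sketched as: minor compaction freezes the memtable into a new immutable SSTable; major compaction merges every SSTable into one base SSTable, which is the point where deletion markers are finally dropped. An illustrative sketch:

```python
TOMBSTONE = object()   # stand-in for a deletion marker

def minor_compact(memtable, sstables):
    # Freeze the write buffer as a new immutable SSTable.
    sstables.append(dict(memtable))
    memtable.clear()

def major_compact(sstables):
    base = {}
    for sst in sstables:             # oldest to newest; newer wins
        base.update(sst)
    # Storage for deleted cells is reclaimed only here.
    base = {k: v for k, v in base.items() if v is not TOMBSTONE}
    sstables[:] = [base]

sstables = [{"a": b"1", "b": b"2"}]
memtable = {"b": TOMBSTONE, "c": b"3"}
minor_compact(memtable, sstables)
major_compact(sstables)
```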
Overview
Data Model
API
Implementation Structures
Optimizations
Performance Evaluation
Applications
Conclusions
Goals
A reliable system for storing and managing all the state
Allow asynchronous processes to update different pieces of data continuously
Very high read/write rates
Efficient scans over all or interesting subsets of data
Often want to examine data changes over time
Locality Groups
Clients can group multiple column families together into a locality group
A separate SSTable is generated for each locality group
Enables more efficient reads
Can be declared to be in-memory
Compression
Many opportunities for compression: similar values in columns and cells
Within each SSTable for a locality group, encode compressed blocks
Keep blocks small for random access
Exploit the fact that many values are very similar
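The block-compression idea above can be sketched with zlib (a stand-in; Bigtable uses its own custom compression schemes): compress in small blocks so one value can be read without decompressing the whole SSTable, and let the similarity between neighboring values do the work.

```python
import zlib

# 100 similar values, as crawled page contents would be.
values = [("page version %d" % i).encode() for i in range(100)]
BLOCK = 10   # values per compressed block; small for random access

blocks = [zlib.compress(b"\n".join(values[i:i + BLOCK]))
          for i in range(0, len(values), BLOCK)]

def read_value(n):
    # Decompress only the small block containing entry n.
    return zlib.decompress(blocks[n // BLOCK]).split(b"\n")[n % BLOCK]

# Repeated "page version " prefixes compress well even per block.
ratio = sum(len(b) for b in blocks) / sum(len(v) for v in values)
```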
Goals
A reliable system for storing and managing all the state
Allow asynchronous processes to update different pieces of data continuously
Very high read/write rates
Efficient scans over all or interesting subsets of data
Often want to examine data changes over time
Commit log and recovery
Single commit log file per tablet server: reduces the number of concurrent file writes to GFS
Tablet recovery: starting from a redo point in the log, replay the same set of operations since the last persistent state
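Recovery from a redo point can be sketched as: scan the server's shared log, keep only entries for the recovering tablet with sequence numbers past the redo point, and replay them into a fresh memtable. An illustrative sketch:

```python
# One commit log per tablet server, interleaving several tablets.
log = [  # (sequence number, tablet, row, value)
    (1, "t1", "a", b"1"),
    (2, "t2", "x", b"9"),
    (3, "t1", "b", b"2"),
    (4, "t1", "a", b"3"),
]

def recover(tablet, redo_point):
    """Replay mutations after the redo point (the sequence number
    already covered by the tablet's last persisted SSTable)."""
    memtable = {}
    for seq, t, row, value in log:
        if t == tablet and seq > redo_point:
            memtable[row] = value    # later entries overwrite earlier
    return memtable
```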
Overview
Data Model
API
Implementation Structures
Optimizations
Performance Evaluation
Applications
Conclusions
Performance evaluation
Test environment:
Based on a GFS cell with 1876 machines
400 GB IDE hard drives in each machine
Two-level tree-shaped switched network
Performance tests:
Random read/write
Sequential read/write
Single tablet-server performance
Random reads are the slowest: a 64 KB SSTable block is transferred over GFS to read 1000 bytes
Random and sequential writes perform better: writes are appended to a single commit log on the server, with group commit
Performance Scaling
Performance didn't scale linearly
Load imbalance in multi-server configurations
Larger data transfer overhead
Overview
Data Model
API
Implementation Structures
Optimizations
Performance Evaluation
Applications
Conclusions
Google Analytics
A service that analyzes traffic patterns at web sites
Raw click table:
A row for each end-user session
Row key is (website name, time)
Summary table:
Extracts recent session data using MapReduce jobs
Google Earth
Uses one table for preprocessing and one for serving
Different latency requirements (disk vs. memory)
Each row in the imagery table represents a single geographic segment
A column family stores the data sources
One column for each raw image
Very sparse
Personalized Search
Row key is a unique user id
A column family for each type of user action
Replicated across Bigtable clusters to increase availability and reduce latency
Conclusions
Bigtable provides highly scalable, high-performance, highly available, and flexible storage for structured data
It provides a low-level read/write interface for other frameworks to build on top of
It has enabled Google to handle large-scale data efficiently