The Anatomy of the Google Architecture (Final v1.1)

GOOGLE TALK Ed Austin 12-09-09

Upload: hasan-veldstra

Post on 08-Sep-2014


DESCRIPTION

A comprehensive overview of Google's architecture, starting from the search page and going all the way to its internal networks. By Ed Austin; talk given at Edinburgh Techmeetup in December 2009. http://techmeetup.co.uk

TRANSCRIPT



Pre Presentation: The Google Philosophy (according to Ed)

• Jedis build their own lightsabres (the MS Eat your own Dog Food)

• Parallelize Everything

• Distribute Everything (to atomic level if possible)

• Compress Everything (CPU cheaper than bandwidth)

• Secure Everything (you can never be too paranoid)

• Cache (almost) Everything

• Redundantize Everything (in triplicate usually)

• Latency is VERY evil


The Anatomy of the Google Architecture
"The unofficial Version"

V1.0 November 2009

Ed Austin
{ed, edik}@i-dot.com


Section I – The Basic Glue

[Figure: the Google architecture stack. Layers labelled: Exterior Network; DC; RACK; SERVER HARDWARE; RHEL 2.6.X PAE; INTERIOR NETWORK IPv6; GFS / GFS II; BigTable, Mapreduce, Chubby Lock; GWQ; Python, Java, C++, Sawzall, other; GOOGLE APPS (SEARCH, INDEX, CRAWL, GMAIL...); GOOGLE APP ENGINE (Python, Java, C++)]

1. Exterior Network (Perimeter Architecture)

2. Data Centre

3. Rack Characteristics

4. Core Server Hardware

5. Operating System Implementation

6. Interior Network Architecture


THE PERIMETER

How does your data enter the Google empire?


Perimeter Network Security (as known)

• Load-balanced DNS splits traffic (by country; .com has multiple DNS entries, others one) and sends it to the firewall (FW)

• Firewall filters traffic (HTTP/S, SMTP, POP etc.)

• NetScaler load balancers take the request from the FW, block DoS attacks such as ping floods, block non-IPv4/IPv6 traffic and ports other than 80/443, and multiplex HTTP (limited caching capability)

• User request is forwarded to Squid (reverse proxy), probably with a HUGE cache (petabytes?)

• If not in the cache, the request is forwarded to GWS (custom C++ web server) – no longer a customised Apache?

• GWS sends the request to the appropriate internal (Cell) servers

• Request is processed

• Exterior HTTPS via Thawte certs

• Dedicated crawler architecture, separate from other infrastructure


[Figure: possible search traffic path, based upon known technologies employed; edge routing and instances not shown. Components labelled: Client Browser (80/443); DNS, load balanced (.COM = 3, UK only one); Firewall (80/443); Firewall (DMZ perimeter); NetScaler (HTTP multiplexing); Squid (reverse proxy); GWS (web server farm); Cell / Interior Network (GFS II etc)]

[ed@d800 ~]$ dig google.com
.....
;; ANSWER SECTION:
google.com. 223 IN A 74.125.45.100
google.com. 223 IN A 74.125.53.100
google.com. 223 IN A 74.125.67.100

[ed@d800 ~]$
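The three A records in the dig output suggest plain DNS round-robin across perimeter entry points. A minimal sketch of that idea, assuming simple rotation with no weighting or geographic routing (how Google's resolvers actually choose is not known); the helper name `round_robin` is hypothetical.

```python
from collections import Counter
from itertools import cycle

# A records copied from the dig output on the slide.
A_RECORDS = ["74.125.45.100", "74.125.53.100", "74.125.67.100"]


def round_robin(addresses):
    """Return an endless iterator that hands out addresses in rotation."""
    return cycle(addresses)


# Nine successive lookups are spread evenly over the three addresses.
rotation = round_robin(A_RECORDS)
hits = Counter(next(rotation) for _ in range(9))
for address, count in hits.items():
    print(address, count)
```

With equal rotation each address receives a third of the lookups, which is the "splits traffic" step before the firewall.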


PERIMETER NETWORK CACHING

- Uses Squid reverse proxy

- Perimeter cache hit rates 30-60% = Huge!

- Dependent on search complexity/user preferences/traffic type

- All image thumbnails cached, much multimedia cached

- Expensive common queries cached (common words, e.g. 'Obama', 'edinburgh') as they require significant back-end processing

- On cache flush/update: big latency spike and capacity drop

- Index servers need to do significant work to rebuild the cache
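To make the hit-rate figure concrete, here is a small sketch of a reverse-proxy cache in front of an expensive back end. It assumes a least-recently-used eviction policy and a stand-in back end; `ReverseProxyCache` and `fake_backend` are hypothetical names, not Squid's or Google's implementation.

```python
from collections import OrderedDict


class ReverseProxyCache:
    """Tiny LRU cache standing in for the perimeter reverse proxy."""

    def __init__(self, capacity, backend):
        self.capacity = capacity
        self.backend = backend      # callable: query -> result
        self.store = OrderedDict()  # query -> cached result, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, query):
        if query in self.store:
            self.hits += 1
            self.store.move_to_end(query)   # mark as most recently used
            return self.store[query]
        self.misses += 1
        result = self.backend(query)        # expensive back-end work
        self.store[query] = result
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used
        return result

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


backend_calls = []


def fake_backend(query):
    """Stand-in for the index servers; records every call it receives."""
    backend_calls.append(query)
    return "results for " + query


cache = ReverseProxyCache(capacity=2, backend=fake_backend)
for q in ["obama", "edinburgh", "obama", "obama", "edinburgh"]:
    cache.get(q)
print(cache.hit_rate(), len(backend_calls))
```

Every hit is a request the back end never sees, which is why a flush (all misses for a while) shows up as a latency spike and capacity drop.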


THE DATA CENTRE

Where do they store all that Data?


Worldwide Data Centres

Where is Google Located?

Last estimate: 36 data centres, 300+ GFS II clusters and upwards of 800K machines.

US (#1) – Europe (#2) – Asia (#3) – South America/Russia (#4)
Australia – on hold

Future: Taiwan, Malaysia, Lithuania, and Blythewood, South Carolina.


The Modular Data Centre

Standard Google Modular DC (Cell) holds 1160 Servers / 250KW Power Consumption in 30 racks (40U).

This is the “Atomic“ Data Centre Building Block of Google.

A Data Centre would consist of 100s of Modular Cells. DC architecture is then the aggregation of smaller Cell-level infrastructures, each in its own container – some pure GFS, others BT (Bigtable), others Map (Mapreduce), some mixed etc.

MDCs can also be deployed autonomously at the Perimeter (stand alone).
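A quick back-of-envelope check of the modular cell figures, using only the numbers on the slide (1160 servers, 250KW, 30 racks of 40U). It assumes the 250KW covers the whole cell, so the per-server figure includes any shared overhead.

```python
# Figures taken from the slide.
SERVERS_PER_CELL = 1160
RACKS_PER_CELL = 30
POWER_KW_PER_CELL = 250
RACK_UNITS = 40

servers_per_rack = SERVERS_PER_CELL / RACKS_PER_CELL
watts_per_server = POWER_KW_PER_CELL * 1000 / SERVERS_PER_CELL
rack_units_total = RACKS_PER_CELL * RACK_UNITS

print(f"servers per rack: {servers_per_rack:.1f}")
print(f"watts per server: {watts_per_server:.0f}")
print(f"rack units in a cell: {rack_units_total}")
```

That is just under 39 servers per 40U rack and a little over 200W per server, with 1200 rack units in total against 1160 servers.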


THE RACK

How is a server stored in the Data Centre?


Google Rack (GOOG rack)

• Why interesting?
– The rack implementation!
– EVERYTHING custom!

• Mini server size
– Old servers are custom 1U
– New servers are 2U... again a custom design
– They seem 1/3 the width of a normal 2U server

• 40U/80U custom racks (50% each side)
– Design
– Huge heating and power issues
– Optimized motherboards
– Work closely with HW MB developers
– Have their own HW builds, specified to component level
– Servers expected to be expendable: build redundancy on top of failure

• Motherboard directly mounted into rack
– Servers have no casing, just bare boards
– Assists with heat dispersal issues


THE HARDWARE

Millions of exactly what?


Server Hardware

• 2U low-cost (but not slow) commodity servers
– 2009: currently 2-way, dual core/16GB/1-2TB, more or less standard
– Both Intel/AMD chipsets – 1 NIC – 2 USB
– Looks like they RAID1/mirror the disks for better I/O read performance
– SATA 7.2K/10K/15K drives? 8 x 2GB DDR3 ECC

• Standard HW build (several HW build versions at any one time)
– Currently at the 7th-generation build (1st generation, 2005, was probably dual core/SMP)
– Each server has a 12V battery backup and can run autonomously without external power (lasts 20-30s?)
– Work closely with chip manufacturers to improve design/reduce power: custom Intel chips that can withstand higher heat than generic versions

YEAR        Average Server Specification
1999/2000   PII/PIII, 128MB+
2003/2004   Celeron 533, PIII 1.4 SMP, 2-4GB DRAM, Dual XEON 2.0/1-4GB/40-160GB IDE - SATA disks via Silicon Image SATA 3114/SATA 3124
2006        Dual Opteron/Working Set DRAM (4GB+)/2x400GB IDE (RAID0?)
2009        2-Way/Dual Core/16GB/1-2TB SATA


THE OPERATING SYSTEM

The Core Software on each of those servers


OPERATING SYSTEM

- 100% Red Hat Linux based since 1998 inception

- RHEL (why not CentOS?)
- 2.6.X kernel
- PAE
- Custom glibc.. rpc... ipvs...
- Custom FS (GFS II)
- Custom Kerberos
- Custom NFS
- Custom CUPS
- Custom gPXE bootloader
- Custom EVERYTHING.....

Kernel/Subsystem Modifications
tcmalloc – replaces glibc 2.3 malloc – much faster! Works very well with threads...
rpc – the rpc layer extensively modified to increase performance and reduce latency (52%/40%)

Significantly modified kernel and subsystems – all IPv6 enabled

Use Python as the primary scripting language
Deploy Ubuntu internally (likely for the desktop) – also the Chrome OS base
Easily the world's largest installed Linux base


THE INTERIOR NETWORK

How does your data travel around the Google empire?


INTERIOR NETWORK

ROUTING PROTOCOL

Internal network is IPv6 (exterior machines can be reached using IPv6)

Heavily Modified Version of OSPF as the IRP

Intra-rack network is 100baseT
Inter-rack network is 1000baseT
Inter-DC network pipes unknown but very fast

Technology:

Juniper, Cisco, Foundry, HP, routers and switches

Software:

ipvs (ip virtual server)


THE MAJOR GLUE

The three foundation blocks of Google's Secret Sauce


Section II – Google's Major Glue

1. Google File System Architecture – GFS II

2. Google Database - Bigtable

3. Google Computation - Mapreduce

4. Google Scheduling - GWQ


GOOGLE FILE SYSTEM

Manages the underlying Data on behalf of the upper layers and ultimately the applications


FILE SYSTEM I – GFS v1


The GFS II cell is Google's fundamental building block – everything can be layered on top of this

Consists of (highly distributed, Linux-based) Master Servers and Chunk Servers

Chunk Servers serve the data in 64MB chunks to the client directly, via Master arbitration

DATA REDUNDANCY/FAULT TOLERANCE?
Triplicate copies of chunks are kept, often in other clusters / DCs
Chunks can be pulled from outside the DC! Expensive.... so they try not to!
However apps built on top of GFS/BT do this on an ad-hoc basis (e.g. Gmail)

On chunk loss the Master handles the recovery by sourcing a chunk copy

Data is compressed using BMDiff/Zippy

Chunk Server fault tolerance is achieved by a heartbeat to the Master (I am alive..)

Master failure was problematic for Google (failover finally down from 2 minutes to 10 seconds)
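The read path described above (Master arbitrates, Chunk Servers serve the bytes) can be sketched as follows. This is a toy model under stated assumptions: the `Master` class, the server names and the file path are hypothetical, and only the 64MB chunk size and the three replicas per chunk come from the slide.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks, as on the slide
MB = 1024 * 1024


def chunk_ranges(offset, length, chunk_size=CHUNK_SIZE):
    """Split a byte range into (chunk_index, offset_in_chunk, nbytes) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        index, within = divmod(offset, chunk_size)
        nbytes = min(chunk_size - within, end - offset)
        pieces.append((index, within, nbytes))
        offset += nbytes
    return pieces


class Master:
    """Hypothetical master: holds metadata only (chunk -> replica servers)."""

    def __init__(self):
        self.locations = {}

    def add_chunk(self, path, index, servers):
        self.locations[(path, index)] = list(servers)

    def lookup(self, path, index):
        return self.locations[(path, index)]


master = Master()
master.add_chunk("/gfs/demo/file", 0, ["cs1", "cs2", "cs3"])  # three replicas
master.add_chunk("/gfs/demo/file", 1, ["cs2", "cs4", "cs5"])

# Read 10MB starting 60MB into the file: the range spans chunks 0 and 1.
# The client asks the master where each chunk lives, then would fetch the
# bytes directly from a chunk server (here: the first replica listed).
plan = [
    (index, master.lookup("/gfs/demo/file", index)[0], nbytes)
    for index, _, nbytes in chunk_ranges(60 * MB, 10 * MB)
]
print(plan)
```

The point of the split is that the master only answers small metadata lookups; the bulk data never passes through it.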


FILE SYSTEM II – GFS II


GFS II “Colossus“ Version 2 improves in many ways (is a complete rewrite)

Elegant Master Failover (no more 2s delays...)

Chunk Size is now 1MB – likely to improve latency for serving data other than Indexing – for example GMail – this was the rationale behind the change

Master can store more Chunk Metadata (therefore more chunks addressable up to 100 million) = also more Chunk Servers

However, according to a Google engineer they have only ever lost one 64MB chunk (in GFS I) during its entire production deployment (2004 – 2008?), so it is assumed to be extremely reliable


GOOGLE DATABASE

Accesses the underlying Data on behalf of the upper layers and ultimately the applications


Bigtable I - Introduction

What is it?

Google's database implementation since 2004

Used internally for all large-scale applications (Search, Indexing, GMail etc)

Similar to a sharded database implementation

GOALS

Huge scalability to many PBs (web database currently around 40 billion URLs)

Tight latency

Highly efficient scans over textual data

Fault tolerant

Load balancable

Eliminate Google's dependency on an external provider


Bigtable II

How is Data Referenced?

Distributed Multi-Dimensional Sparse Map

Simple addressing model using a triple:

(row, column, {timestamp}) -> cell contents

ROWS

- Row keys (arbitrary length, usually 10-100 bytes, max 64KB)

- Rows stored lexicographically - example row key: a URL

COLUMNS

- example columns (contents:, PR, anchor1: ..)

TIMESTAMP (OPTIONAL?)

- timestamp (various API function arguments, e.g. "ALL", "LATEST")
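The addressing model, (row, column, timestamp) mapping to cell contents, can be mimicked with an ordinary dictionary. This is a single-process sketch, not Bigtable's API: `SparseTable`, `row_key` and the sample values are hypothetical, and "latest" is taken to mean the highest timestamp.

```python
class SparseTable:
    """Toy sparse map: (row, column) -> versions sorted newest first."""

    def __init__(self):
        self.cells = {}

    def put(self, row, column, value, timestamp):
        versions = self.cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(reverse=True)  # newest version first

    def get(self, row, column, timestamp=None):
        """Return the latest value, or the latest at or before `timestamp`."""
        for ts, value in self.cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan_rows(self, prefix=""):
        """Row keys in lexicographic order, optionally filtered by prefix."""
        return sorted({row for row, _ in self.cells if row.startswith(prefix)})


def row_key(hostname):
    """Reverse the hostname so pages of one domain sort next to each other."""
    return ".".join(reversed(hostname.split(".")))


table = SparseTable()
table.put(row_key("news.bbc.co.uk"), "contents:", "<html>v1</html>", timestamp=1)
table.put(row_key("news.bbc.co.uk"), "contents:", "<html>v2</html>", timestamp=2)
table.put(row_key("www.bbc.co.uk"), "language:", "ENG", timestamp=1)

print(table.get("uk.co.bbc.news", "contents:"))               # latest version
print(table.get("uk.co.bbc.news", "contents:", timestamp=1))  # older version
print(table.scan_rows("uk.co.bbc"))
```

Cells that were never written simply have no entry, which is what makes the map sparse; reversing the hostname keeps one site's rows contiguous in a scan.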


Bigtable III – Table Structure

Studying the table: the contents: column shows three versions of the contents of a page (current, cached and ?). Presumably all other columns are timestamped, so they could be used in a comparative way (such as anchor increase/decrease) OTF in SERPs. The algorithm must use a combination of the timestamp difference between n page versions (n = 3, the rest garbage collected) and crawled anchors. What else does the table hold? Possibly PR (or OTF) and other search-related weightings.

Google keeps much more info for ranking purposes than it did in 1999

Webtable hinted at 100 columns+!

How do page units affect the URL reversal of the URL bigtable?
Do a table's tablets cross a cluster's namespace (yes if unified, else no?)

Example of the URL bigtable
[Figure: rows keyed by reversed URL (10-100 bytes, <=64KB), e.g. au.aaa ... uk.co.bbc.news ... za.zzzzz (a label html.test also appears); columns language: (value ENG), contents: (value <HTML>) and anchorx:; rows grouped into tablet 1 ... tablet n of 100-200MB each, with server/cell boundaries between tablets.]

C++ Bigtable mutate of some anchors:

    // Open the table
    Table *T = OpenOrDie("/bigtable/web/bigtable");
    // Write a new anchor and delete an old anchor
    RowMutation r1(T, "uk.co.bbc.news");
    r1.Set("anchor:www.abc.org", "CNN");
    r1.Delete("anchor:www.def.com");
    Operation op;
    Apply(&op, &r1);  // atomic mutation of the row's columns

Other primitives such as DeleteCells(), DeleteRow(), Scanner (read arbitrary cells in a row)


Bigtable IV

How are tables broken down in storage?

For example Webtable is billions of pages!

-Large Tables broken (split) into tablets at row boundaries

-Tablets discontiguous (assists in fault-tolerance) – spread over DC but try to keep one copy in same rack

-Tablet Size is approximately 100-200MB of compressed Data

-Load Balanced – migrate tablets from heavily loaded machines to lightly loaded ones

- Heavily used tablets probably stay in working set (cached)
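Splitting a sorted table into tablets at row boundaries can be sketched as below. The size threshold and row sizes are made-up units for illustration (the slide gives 100-200MB of compressed data per tablet); `split_into_tablets` and `locate` are hypothetical helper names.

```python
from bisect import bisect_right


def split_into_tablets(rows, max_tablet_size):
    """rows: iterable of (row_key, size). Returns lists of row keys.

    Rows are sorted by key and a new tablet is started whenever adding the
    next row would exceed the size limit, so a row is never cut in half.
    """
    tablets, current, current_size = [], [], 0
    for key, size in sorted(rows):
        if current and current_size + size > max_tablet_size:
            tablets.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        tablets.append(current)
    return tablets


def locate(start_keys, key):
    """Index of the tablet whose key range contains `key`."""
    return max(0, bisect_right(start_keys, key) - 1)


rows = [("uk.co.bbc.news", 60), ("au.aaa", 50), ("za.zzzzz", 70), ("com.cnn.www", 80)]
tablets = split_into_tablets(rows, max_tablet_size=150)
start_keys = [tablet[0] for tablet in tablets]

print(tablets)
print(locate(start_keys, "uk.co.bbc.sport"))
```

Because each tablet covers a contiguous key range, finding the tablet for a row is a binary search over the tablets' first keys, and moving a whole tablet to another machine is enough to rebalance load.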


GOOGLE MAPREDUCE

Computes the underlying Data on behalf of the applications


Mapreduce I

Map Reduction can be seen as a way to exploit massive parallelism by breaking a task down into constituent parts and executing on multiple processors

The major functions are MAP & REDUCE (with a number of intermediary steps)

MAP: break the task down into parallel steps
REDUCE: combine the results into the final output

Shown is a 2-pipeline Map Reduction (there are 24 Map Reductions in the indexing pipeline)
Mappers & Reducers usually run on separate processors (even with 90% of the reducers lost, the job still completed!)
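The MAP and REDUCE steps can be illustrated with the classic word count. This is a single-process sketch of the programming model only: it does not distribute work or handle failures, and the function names and sample documents are illustrative.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """MAP: emit an intermediate (word, 1) pair for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Intermediate step: group all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """REDUCE: combine the values for one key into the final output."""
    return word, sum(counts)


def map_reduce(documents):
    intermediate = []
    # Each map call depends on one document only, so in a real system these
    # calls can run on different machines in parallel.
    for doc_id, text in documents.items():
        intermediate.extend(map_phase(doc_id, text))
    return dict(reduce_phase(w, c) for w, c in shuffle(intermediate).items())


documents = {
    "d1": "google file system",
    "d2": "google bigtable",
    "d3": "file locking",
}
word_counts = map_reduce(documents)
print(word_counts)
```

Since a reducer only needs the grouped values for its own keys, a lost mapper or reducer can simply be rerun elsewhere without restarting the whole job.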


Mapreduce II

LANGUAGE BINDINGS

C++, Java, Python, Sawzall

DEPLOYED

Implemented 2004 – before this MySQL?

STATISTICS

-In September 2009 Google ran 3,467,000 MR Jobs with an average 475 sec completion time averaging 488 machines per MR and utilising 25.5K Machine years

-Technique extensively used by Yahoo with Hadoop (similar architecture to Google) and Facebook (since ‘06 multiple Hadoop clusters, one being 2500CPU/1PB with HBase).
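The September 2009 figures on this slide can be sanity-checked by multiplying them out. The only assumption is that the averages can be multiplied directly (jobs x average duration x average machines), which ignores any correlation between job length and job size.

```python
# Figures taken from the slide.
JOBS = 3_467_000
AVG_SECONDS_PER_JOB = 475
AVG_MACHINES_PER_JOB = 488
SECONDS_PER_YEAR = 365 * 24 * 3600

machine_seconds = JOBS * AVG_SECONDS_PER_JOB * AVG_MACHINES_PER_JOB
machine_years = machine_seconds / SECONDS_PER_YEAR

print(f"{machine_years:,.0f} machine years")  # close to the quoted 25.5K
```

The product lands within a fraction of a percent of the 25.5K machine years quoted, so the three figures are at least consistent with each other.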


Chubby Lock

Google's distributed file locking service for Bigtable

- Provides mutex support for data access (atomic access to column data)

- Used to synchronize access to shared resources

- Consists of a Master and Slaves (designated by election)

- Failover consists of a Slave replacing the functionality of a Master

- Also serves as an ultra-fast high-availability file server for small files (100s of bytes)

- Provides an ACL for tablet authentication (row and column data)
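The two roles listed above, mutex support and a small-file store, are enough to let a set of servers agree on a master. The sketch below is an in-process stand-in under stated assumptions: `LockService`, the path `/ls/cell/master` and the server names are hypothetical, and the service's own replication and election are left out entirely.

```python
class LockService:
    """In-process stand-in for a lock service: advisory locks plus small files."""

    def __init__(self):
        self.owners = {}  # path -> client currently holding the lock
        self.files = {}   # path -> small payload

    def acquire(self, path, client):
        """Grant the lock if it is free or already held by this client."""
        if self.owners.get(path) in (None, client):
            self.owners[path] = client
            return True
        return False

    def release(self, path, client):
        if self.owners.get(path) == client:
            del self.owners[path]
            return True
        return False

    def write(self, path, client, data):
        """Only the lock holder may write the file at `path`."""
        if self.owners.get(path) != client:
            raise PermissionError("lock not held by " + client)
        self.files[path] = data

    def read(self, path):
        return self.files.get(path)


service = LockService()
MASTER_PATH = "/ls/cell/master"  # hypothetical lock file name

# Two candidate servers race for the lock; the winner advertises itself
# by writing its address into the small file behind the lock.
results = {}
for server in ["tablet-server-a", "tablet-server-b"]:
    results[server] = service.acquire(MASTER_PATH, server)
    if results[server]:
        service.write(MASTER_PATH, server, server + ":9000")

print(results, service.read(MASTER_PATH))
```

Everyone else learns who the master is by reading the same small file, which is why a lock service doubling as a tiny file server is convenient.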


GOOGLE WORKQUEUE

Provides Resource Management for the Computational Jobs


GWQ – Google Workqueue

Batch Submission/Scheduler System

-Software to submit Mapreduce Jobs to a Cell/Cluster

- Arbitrates (process priorities), schedules, allocates resources, handles process failover, reports status, collects results

- Workqueue is often overlaid on a GFS cluster, since GFS cluster jobs are not computationally bound; it also seems to co-locate tasks near their data, so just disk I/O rather than network I/O (on the Chunk Server?)

- Workqueue can manage many tens of thousands of machines

Launched via API or command line (Sawzall example shown)

    saw --program code.szl \
        --workqueue testing \
        --input_files /gfs/cluster1/2005-02-0[1-7]/submits.* \
        --destination /gfs/cluster2/$USER/output@100
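The idea of co-locating tasks with their data can be sketched as a greedy placement: prefer a machine that already holds the task's chunk, and fall back to any free machine otherwise. This is an illustrative toy, not the Workqueue's algorithm; `schedule`, the machine names and the slot counts are all hypothetical.

```python
def schedule(tasks, chunk_locations, free_slots):
    """Place each task, preferring machines that hold its input chunk.

    tasks:           list of (task_id, chunk_id)
    chunk_locations: chunk_id -> machines holding a replica of that chunk
    free_slots:      machine -> number of tasks it can still accept
    Returns task_id -> (machine, is_local).
    """
    slots = dict(free_slots)
    placement = {}
    for task_id, chunk_id in tasks:
        local = [m for m in chunk_locations.get(chunk_id, []) if slots.get(m, 0) > 0]
        if local:
            machine, is_local = local[0], True   # disk I/O only
        else:
            candidates = [m for m, n in slots.items() if n > 0]
            if not candidates:
                raise RuntimeError("no free slots left")
            # Fall back to the least loaded machine; data crosses the network.
            machine, is_local = max(candidates, key=lambda m: slots[m]), False
        slots[machine] -= 1
        placement[task_id] = (machine, is_local)
    return placement


placement = schedule(
    tasks=[("t1", "c1"), ("t2", "c2"), ("t3", "c3")],
    chunk_locations={"c1": ["m1"], "c2": ["m1"], "c3": ["m2"]},
    free_slots={"m1": 1, "m2": 1, "m3": 2},
)
print(placement)
```

In this run two of the three tasks read from local disk; only the task whose data lives on an already full machine has to pull its input over the network.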


Section III – Some more Glue


1. Languages employed

2. Development Environment

3. Google App Engine

4. Network Security

5. Future Google Architecture Advances

6. Odds n Sods

7. DIY Google


DEVELOPMENT LANGUAGES

- Initially Python, Java, C++ (the usual suspects)

- Sawzall (since 2006)

- equivalent to Hadoop's Pig Latin
- written in C++
- interpreted bytecode output JIT'd

An internal procedural language employed to solve map reduction problems. The few published Google papers employ Sawzall in the algorithm examples. Sawzall code runs in the Map phase; aggregators run in the Reduce phase (collecting from each Sawzall Map instance) to produce the final output.

- Transparent parallelization: no specialist distributed systems knowledge required (good for the developer)

- Simple datatypes: 64-bit signed int, float, string, byte and a few unique ones such as time

- Much string regexp support
- Compound types: arrays, tuples

- Type-safe (with declarations), similar to Pascal (probably an LL(1) language?)
- Syntax similar to Algol, C (no pointers though!)
- No processing of exceptions (no exception handlers)
- Shorter than corresponding C++ code by a factor of 10

Early versions could not write into Bigtable. Now implemented?

Output sometimes pipelined into MySQL for further analysis


GOOGLE APP ENGINE
Using "Application Platform technology stack"

Allows a developer to leverage components of Google Technology (but not necessarily primary Infrastructure i.e. The usual business resources)

-Supports Python, Java

- Bigtable support (via GQL)

-Uses GFS as underlying FS – usual Fault-tolerance/Load-balancing

-Task Queue similar to GWQ?

-Code exposed to Google

- No support for subprocess spawning – more importantly, none of the Google Mapreduce library is made available
- Isolates computational aspects to single servers, but the I/O is probably the Google standard implementation underneath
- Therefore computationally intensive tasks are more problematic = keeping your resource usage under control


Security

Rack Board Level (possible scenario)

gPXE on the board goes through DHCP/tftp sequence to pull over an encrypted image (this is not expensive as is done once per boot and boots are not usual)

Image is pulled from a Secure Image Distribution Server (and held encrypted on these)

Once at the board end the image is OTF decrypted and booted as normal RHEL

02/09: a Google engineer didn't dispute this and seemed to concur, adding that in-core encryption might be a possibility (R/T decryption might not be that expensive). This possibly means cryptography is used throughout the lifetime of the image, including components outside the working set but also sensitive parts of the in-core OS (OTF decrypted)

Enterprise

Kerberos is used throughout the enterprise

They have an automated issuance system for SSL certificates, used by internal (secure) infrastructure to validate https/TLS and generic SSL connections.

Complete internal network encryption unlikely due to the latency introduced?

Likely that one of the reasons failover between DCs is problematic is the latency introduced by the expense of wide-area encryption (essential)


Google Future Architecture

- 99%ile latency for all data <50ms is a key speed metric

-Single global namespace

-“Spanning multiple data centers is still an unsolved problem. Most websites are in one and at most two data centers. How to fully distribute a website across a set of data centers.”

-Spanner

Dynamic Load Balancing of upwards of 10M Servers between Data Centers

- “automatic, dynamic world-wide placement of data & computation to minimize latency or cost.”

Allegedly used to reduce heat issues at DCs by moving the load when heat becomes a problem at the new chillerless DCs (e.g. the Belgium DC) – not using chillers introduces significant savings.

- Translation Servers (automatic translation of documents)

- GDrive Servers


Odds n Sods

borg – Google technology/architecture (is a cluster..)
Borg: a hybrid protocol for scalable application-level multicast in peer-to-peer networks (WAN multimedia streaming)

data cube – Google technology

Have a "global loadbalancer" – assume it load balances across a unified namespace – probably worldwide

Gmail designers implemented application-level failover to move your session to an alternate DC in a fashion seamless to the end user.
Probably all Google Apps will be able to migrate to an alternate DC cell (the application, and its GFS data if need be)

MySQL is used for back-end sys admin stuff (high availability master-slave implementations) and post-Bigtable processing

Remote employee access is via VPN

Sys admins maintain 5 and 30 minute SLAs – so on the ball

Has its own internal archive.org equiv.


BUILD YOUR OWN GOOGLE

The Basic Open Source Tools


The Google Stack (vs Yahoo'ish/Open Source)

Conceptual Overview: Google vs. Open Source

Google Architecture                            Open Source (Yahoo'ish) Architecture
Exterior Network                               Exterior Network
DC                                             DC
RACK                                           RACK
SERVER HARDWARE                                SERVER HARDWARE
RHEL 2.6.X PAE                                 CentOS 2.6.X PAE
INTERIOR NETWORK IPv6                          INTERIOR NETWORK IPv6
GFS / GFS II                                   HDFS (hadoop)
Mapreduce                                      Hadoop Framework Mapreduce
BigTable                                       Hbase (Bigtable equiv.)
Chubby Lock                                    -
GWQ                                            Job Tracker
Python, Java, C++, Sawzall, other              Pig Latin, Python, PHP, Java .... anything
GOOGLE APPS (SEARCH, INDEX, CRAWL, GMAIL...)   CLIENT APPLICATION
APP ENGINE (Python, Java, C++; BigTable; Task Queue)   -

Google's Secret Sauce                          Hadoop Open Source

(Other tools such as crawlers, indexers readily available)


END

(Thankyou)


DIY GOOGLE

What you require:
Preferably 2 machines + 100BT
CentOS/RHEL
(squid)
Apache
Hadoop (HDFS, Mapreduce, Pig, HBase)
HDFS
bmdiff/zippy compression library
Google glibc/tcmalloc – perftools

Supporting stuff – JRE etc

Browser with search box

pig mr call to scan a few files
print results


DIY GOOGLE

Install Hadoop and Pig on cluster
Install eclipse and dependencies
Install PigPen for eclipse and configure to cluster (NFS)


TEMPLATE

- IPv6 enablement started 2008 (2009 finished?)
- IRP OSPF

Google authored RFC points towards OSPF


DEVELOPMENT ENVIRONMENT bits&bobs

A rare shot of some concrete google internal stuff (this of a GFS Master Server code execution found as a perftools profiling example)

Agile Methodologies Used (development iterations, teamwork, collaboration, and process adaptability throughout the life-cycle of the project)

“Libraries are the predominant way of building programs”

An infrastructure handles versioning of applications so they can be released without fear of breaking things = roll out with minimal QA

- Internal code uses replacement libraries
- Google, as you'd expect, rewrites everything!
- Hungarian notation?
- Work in small teams of 3-5 people – likely few scutters know "the big picture"


• Internal Linux development and deployment
• Served as technical lead of team responsible for customizing and deploying Linux to internal systems and workstations.
• Fixed bugs and added enterprise features to several Linux components, including NFS, Kerberos, CUPS. All relevant patches were pushed to upstream maintainers, and most are in current released distributions.
• Developed and maintained systems to automate installation, updates, and upgrades of Linux systems.
• Developed IPv6 support for Linux load-balancing (ipvs).
• Managed several interns and contractors.

• Load balancing user accounts within a datacenter, and coordinating with the global load balancer, which uses linear programming to optimally allocate users. In particular, this avoids "shared fate" risks and reduces latency and costs incurred due to excessive transatlantic data traffic. Learned Sketchup so as to document the four dimensional data structures effectively
• The testing, evaluation, deployment, operations, and maintenance of Netscaler load balancers.

• Automated Apache configuration reloader
• gPXE open-source network booting software
• GWS – custom C++ webserver = not apache?

In a Google 02/09 talk the example cluster was 30 racks (I believe this refers to Google). With 40U racks, 40U x 30 racks = 1200U = approximately an MDC, so can assume each MDC is a cluster/cell at the architectural level

A Google engineer stated that a DC is a collection of Modular Units (MDCs?) – the picture shown (not reproduced here) suggested this.


Some Pre Presentation Information

• 1 Million GB = 1000TB = 1 PB (x 1000 = 1 EXABYTE)
– Internet Archive is around 3PB (2009)

– CLEAN UP BEFORE – all the poorly sourced stuff

– Add lock service to bt to all slides
– Google rack server on rack page
– SSTable

– Google PROFITS US $16M A DAY


Pre Presentation Disclaimer

• Put together in a week from knowing zero about Google

• I am not associated with Google

• Numbers are approximate but certainly are ball-park – Google often delivers contradictory figures and uses many terms for some items - cell/cluster – scheduler/workqueue (obfuscation?)

• Google's philosophy/paranoia of telling as little as possible (pausing presenters and sideways answers) makes it hard to fill in some (significant) gaps – inferences are sometimes drawn (in red)

• Google seem to design absolutely EVERYTHING themselves – from HW MB build, Racks, Switches(?), Software... so it's hard to find sources of information beyond broad concepts


Bigtable VI

Latest (or at least since 2006..)

- Increased scalability (across namespace/datacenters)
- i.e. tablets spread over DCs for a table, but expensive (both computationally and financially!)

- Service clusters (?)

- Multiple Bigtable clusters replicated throughout DC

Current Status

- Many hundreds, maybe thousands, of Bigtable cells. Late 2009 stated 500 Bigtable clusters

- At minimum scaled to many thousands of machines per cell in production

- Cells manage 3-figure TB of data (0.X PB)
