scaleable servers jim gray [email protected]
TRANSCRIPT
Scaleable Servers
Jim Gray
[email protected]
http://www.research.Microsoft.com/~Gray
Thesis: Scaleable Servers
• Scaleable servers
  • Commodity hardware allows new applications
  • New applications need huge servers
  • Clients and servers are built of the same “stuff”: commodity software and commodity hardware
• Servers should be able to
  • Scale up (grow a node by adding CPUs, disks, networks)
  • Scale out (grow by adding nodes)
  • Scale down (can start small)
• Key software technologies: Objects, Transactions, Clusters, Parallelism
1987: 256 tps Benchmark
• 14 M$ computer (Tandem)
• A dozen people
• False floor, 2 rooms of machines
• Simulate 25,600 clients
• A 32 node processor array
• A 40 GB disk array (80 drives)
• Staff: OS expert, network expert, DB expert, performance expert, hardware experts, admin expert, auditor, manager
1988: DB2 + CICS Mainframe, 65 tps
• IBM 4391
• Simulated network of 800 clients
• 2 M$ computer
• Staff of 6 to do benchmark
• 2 x 3725 network controllers
• 16 GB disk farm (4 x 8 x 0.5 GB)
• Refrigerator-sized CPU
1997: 10 Years Later, 1 Person and 1 Box = 1250 tps
• 1 breadbox ~ 5x 1987 machine room
• 23 GB is hand-held
• One person does all the work: hardware expert, OS expert, net expert, DB expert, app expert
• Cost/tps is 1,000x less
• 25 micro dollars per transaction
• 4 x 200 MHz CPU, 1/2 GB DRAM, 12 x 4 GB disk
• 3 x 7 x 4 GB disk arrays
What Happened?
• Moore’s law: things get 4x better every 3 years (applies to computers, storage, and networks)
• New economics: commodity
    class            price/mips ($/mips)    software (k$/year)
    mainframe        10,000                 100
    minicomputer     100                    10
    microcomputer    10                     1
• GUI: human-computer tradeoff; optimize for people, not computers
[Figure: price versus time for mainframe, mini, and micro]
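The 4x-every-3-years rule compounds to roughly 100x over the ten years between the 1987 and 1997 benchmarks. A minimal sketch of that arithmetic, assuming the rate holds uniformly over the whole period; the function name `improvement` is illustrative and not from the talk:

```python
def improvement(years: float, factor: float = 4.0, period: float = 3.0) -> float:
    """Compound improvement after `years`, given `factor`x growth every `period` years."""
    return factor ** (years / period)

# Ten years separate the 1987 and 1997 benchmark slides.
ten_year_gain = improvement(10)
print(f"Improvement over 10 years: about {ten_year_gain:.0f}x")
```

The exponent form makes it easy to try other horizons, for example `improvement(3)` for a single period.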
Billions Of Clients Need Millions Of Servers
• All clients networked to servers
  • May be nomadic or on-demand
• Fast clients want faster servers
• Servers provide
  • Shared data
  • Control
  • Coordination
  • Communication
[Figure: mobile clients and fixed clients connect to servers and a super server]
Thesis: Many Little Beat Few Big
• Smoking, hairy golf ball
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?
[Figure: price points $1 million, $100 K, $10 K; machine classes mainframe, mini, micro, nano; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"]
• 1 M SPECmarks, 1 TFLOP
• 10^6 clocks to bulk RAM
• Event-horizon on chip
• VM reincarnated
• Multiprogram cache, on-chip SMP
[Figure: pico processor (1 MM 3) and its storage hierarchy; access times shown: 10 pico-second ram, 10 nano-second ram, 10 microsecond ram, 10 millisecond disc, 10 second tape archive; capacities shown: 1 MB, 100 MB, 10 GB, 1 TB, 100 TB]
Future Super Server: 4T Machine
• Array of 1,000 4B machines
  • 1 bps processors
  • 1 BB DRAM
  • 10 BB disks
  • 1 Bbps comm lines
  • 1 TB tape robot
• A few megabucks
• Challenge: manageability, programmability, security, availability, scaleability, affordability
  • As easy as a single system
• Future servers are CLUSTERS of processors, discs
• Distributed database techniques make clusters work
[Figure: Cyber Brick, a 4B machine: CPU, 5 GB RAM, 50 GB disc]
The Hardware Is In Place… And Then a Miracle Occurs
• SNAP: scaleable network and platforms
• Commodity-distributed OS built on:
  • Commodity platforms
  • Commodity network interconnect
• Enables parallel applications
Scaleable Servers: BOTH SMP And Cluster
• Grow up with SMP; 4xP6 is now standard
• Grow out with cluster
• Cluster has inexpensive parts
[Figure: personal system, departmental server, SMP super server, cluster of PCs]
SMPs Have Advantages
• Single system image: easier to manage, easier to program threads in shared memory, disk, Net
• 4x SMP is commodity
• Software capable of 16x
• Problems:
  • >4 not commodity
  • Scale-down problem (starter systems expensive)
  • There is a BIGGEST one
[Figure: personal system, departmental server, SMP super server]
TPC-C Web-Based Benchmarks
• Client is a Web browser (9,200 of them!)
• Submits Order, Invoice, Query to server via Web page interface
• Web server translates to DB
• SQL does DB work
• Net:
  • easy to implement
  • performance is GREAT!
[Figure: browsers connect over HTTP to IIS (= Web), which connects over ODBC to SQL]
TPC-C Shows How Far SMPs Have Come
• Performance is amazing:
  • 2,000 users is the min!
  • 30,000 users on a 4x12 alpha cluster (Oracle)
• Peak performance: 30,390 tpmC @ $305/tpmC (Oracle/DEC)
• Best price/perf: 7,693 tpmC @ $43/tpmC (MS SQL/Dell)
• Graphs show UNIX high price and diseconomy of scaleup
[Figure: tpmC & Price Performance (only "best" data shown for each vendor); x-axis tpmC, 0 to 20,000; y-axis $/tpmC, 0 to 400; series: DB2, Informix, MS SQL Server, Oracle, Sybase]
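Since $/tpmC is the priced system cost divided by throughput, multiplying the two quoted figures recovers the approximate price of each benchmarked system. A small sketch using only the two results quoted on the slide; the dictionary keys and the helper name `system_price` are illustrative:

```python
# (tpmC, dollars per tpmC) as quoted on the slide
results = {
    "Oracle/DEC (peak performance)": (30_390, 305),
    "MS SQL/Dell (best price/perf)": (7_693, 43),
}

def system_price(tpmc: int, dollars_per_tpmc: int) -> int:
    """Approximate total priced system cost implied by a TPC-C result."""
    return tpmc * dollars_per_tpmc

prices = {name: system_price(*r) for name, r in results.items()}
for name, price in prices.items():
    print(f"{name}: about ${price:,}")
```

Dividing the two throughputs and the two prices side by side shows the diseconomy of scaleup the slide refers to: about 4x the throughput for about 28x the price.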
TPC-C SMP Performance
• SMPs do offer speedup, but 4x P6 is better than some 18x MIPSco
[Figure: tpmC vs CPUs; x-axis CPUs, 0 to 20; y-axis tpmC, 0 to 20,000]
[Figure: SUN Scaleability; x-axis CPUs, 0 to 20; y-axis tpmC, 0 to 20,000; series: SUN Scaleability, SQL Server]
What Happens To Prices?
• No expensive UNIX front end (20$/tpmC)
• No expensive TP monitor software (10$/tpmC)
• => 65$/tpmC
[Figure: TPC Price/tpmC, bars broken down by processor, disk, software, and net on a 0 to 100 scale, for: Informix on SNI, Oracle on DEC Unix, Oracle on Compaq/NT, Sybase on Compaq/NT, Microsoft on Compaq with Visigenics, Microsoft on HP with Visigenics, Microsoft on Intergraph with IIS, Microsoft on Compaq with IIS]
Building the Largest NT Node
• Build a 1 TB SQL Server database
  • Show off NT and SQL Server scaleability
  • Stress test the product
• Demo it on the Internet
  • WWW accessible by anyone
• So data must be
  • 1 TB
  • Unencumbered
  • Interesting to everyone everywhere
  • AND not offensive to anyone anywhere
What’s a TeraByte?
• 1 Terabyte:
    1,000,000,000 business letters    150 miles of book shelf
    100,000,000 book pages            15 miles of book shelf
    50,000,000 FAX images             7 miles of book shelf
    10,000,000 TV pictures (mpeg)     10 days of video
    4,000 LandSat images              16 earth images (100m)
    100,000,000 web pages             10 copies of the web HTML
• Library of Congress (in ASCII) is 25 TB
• 1980: $200 million of disc (10,000 discs); $5 million of tape silo (10,000 tapes)
• 1997: 200 k$ of magnetic disc (48 discs); 30 k$ nearline tape (20 tapes)
• Terror Byte!
The Plan
• DEC Alpha + 324 StorageWorks drives (1.4 TB)
• 30K BTU, 8 KW, 1.5 metric tons
• SQL 7.0
• USGS data (1 meter)
• Russian Space data (2 meter), SPIN-2
[Figure: DEC 4100 with 4 x 400 MHz Alpha processors and 4 GB DRAM; Microsoft BackOffice]
Image Data Sources
• SPIN-2: 500 GB, worldwide, LoB app, new data coming
• DOQ: 300 GB, source: USGS & UCSB; UCSB missing some DOQs
DOQ Coverage of the US
• 1 meter images of many places
• Problems:
  • most of data not yet published
  • interesting places missing (LA, Portland, SD, Anchorage,…)
• Loaded published 130 GB
• CRDA for unpublished 3 TB
SPIN-2 Coverage
• The rest of the world
  • The US Government can’t help, but....
  • The Russian Space Agency is eager to cooperate.
• 2 meter geo-rectified imagery of anywhere
• More data coming: Earth has ~500 Tera square meters
  • => ~30 Tera Bytes of land at 2x2 meter
  • => we need 3% of the land (Urban World = the red stuff)
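The sizing on this slide can be checked with a few lines of arithmetic. This is a sketch only: the surface area, the 30 TB land figure, and the 3% urban share come from the slide, while the variable names are illustrative and nothing here assumes a particular image encoding.

```python
EARTH_SURFACE_M2 = 500e12    # ~500 Tera square meters (from the slide)
PIXEL_SIDE_M = 2             # 2 meter imagery
LAND_TB_AT_2M = 30           # ~30 TB of land at 2x2 meter (from the slide)
URBAN_FRACTION = 0.03        # "we need 3% of the land"

# Number of 2x2 meter pixels needed to tile the whole surface, oceans included.
pixels_whole_earth = EARTH_SURFACE_M2 / (PIXEL_SIDE_M * PIXEL_SIDE_M)

# Storage needed for just the urban share of the land imagery.
urban_tb = LAND_TB_AT_2M * URBAN_FRACTION

print(f"Whole-earth pixels at 2 m: {pixels_whole_earth:.3g}")
print(f"Urban imagery: about {urban_tb:.1f} TB")
```

The urban share comes out just under one terabyte, which fits the 1 TB database target stated earlier in the talk.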
Demo Interface
Grow UP and OUT
• 1 billion transactions per day
• 1 Terabyte DB
• Cluster:
  • a collection of nodes
  • as easy to program and manage as a single node
[Figure: personal system, departmental server, SMP super server]
Clusters Have Advantages
• Clients and servers made from the same stuff
• Inexpensive: built with commodity components
• Fault tolerance: spare modules mask failures
• Modular growth: grow by adding small modules
• Unlimited growth: no biggest one
Billion Transactions per Day Project
• Built a 45-node Windows NT cluster (with help from Intel & Compaq), > 900 disks
• All off-the-shelf parts
• Using SQL Server & DTC distributed transactions
• DebitCredit transaction
• Each node has 1/20th of the DB
• Each node does 1/20th of the work
• 15% of the transactions are “distributed”
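The partitioning described above can be illustrated with a toy simulation. It is a sketch under stated assumptions, not the project's actual workload generator: accounts are range-partitioned over 20 nodes, a transaction counts as distributed when the account it touches lives on a different node from the one where it originates, and `ACCOUNTS_PER_NODE`, `home_node`, `make_transaction`, and `is_distributed` are hypothetical names and values.

```python
import random

NODES = 20                   # database nodes, each holding 1/20 of the data
ACCOUNTS_PER_NODE = 1_000    # hypothetical partition size, for illustration only
REMOTE_FRACTION = 0.15       # share of transactions that touch a second node

def home_node(account_id: int) -> int:
    """Range partitioning: each node owns one contiguous block of account ids."""
    return account_id // ACCOUNTS_PER_NODE

def make_transaction(rng: random.Random) -> tuple[int, int]:
    """Return (originating node, account id) for one DebitCredit-style transaction."""
    origin = rng.randrange(NODES)
    if rng.random() < REMOTE_FRACTION:
        # Choose an account owned by some other node.
        owner = (origin + rng.randrange(1, NODES)) % NODES
    else:
        owner = origin
    account = owner * ACCOUNTS_PER_NODE + rng.randrange(ACCOUNTS_PER_NODE)
    return origin, account

def is_distributed(origin: int, account: int) -> bool:
    """A transaction needs a distributed commit when the account is not local."""
    return home_node(account) != origin

rng = random.Random(42)
txns = [make_transaction(rng) for _ in range(100_000)]
distributed_share = sum(is_distributed(o, a) for o, a in txns) / len(txns)

per_node = [0] * NODES
for origin, _ in txns:
    per_node[origin] += 1

print(f"distributed: {distributed_share:.1%}")
print(f"busiest node handles {max(per_node) / len(txns):.1%} of the work")
```

With uniform origins each node ends up with close to 1/20 of the transactions, which is the even split the slide describes.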
How Much Is 1 Billion Transactions Per Day?
[Figure: millions of transactions per day, log scale from 0.1 to 1,000; bars: 1 Btpd, Visa, AT&T, BofA, NYSE]
• 1 Btpd = 11,574 tps (transactions per second) ~ 700,000 tpm (transactions/minute)
• AT&T: 185 million calls (peak day worldwide)
• Visa: ~20 M tpd
  • 400 M customers
  • 250,000 ATMs worldwide
  • 7 billion transactions / year (card+cheque) in 1994
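The rate conversion is plain division by the number of seconds in a day. A short sketch of the arithmetic, with the AT&T and Visa daily volumes from the slide included for scale; the variable names are illustrative:

```python
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400

tpd = 1_000_000_000                 # one billion transactions per day
tps = tpd / SECONDS_PER_DAY         # transactions per second
tpm = tps * 60                      # transactions per minute

# Daily volumes quoted on the slide, for comparison.
att_calls_per_day = 185_000_000
visa_tpd = 20_000_000

print(f"{tps:,.0f} tps, {tpm:,.0f} tpm")
print(f"{tpd / att_calls_per_day:.1f}x AT&T peak day, {tpd / visa_tpd:.0f}x Visa")
```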
Type                                   Nodes                      CPUs     DRAM          Ctlrs    Disks                             RAID space
Workflow MTS                           20 Compaq Proliant 2500    20 x 2   20 x 128 MB   20 x 1   20 x 1                            20 x 2 GB
SQL Server                             20 Compaq Proliant 5000    20 x 4   20 x 512 MB   20 x 4   20 x (36 x 4.2 GB + 7 x 9.1 GB)   20 x 130 GB
Distributed Transaction Coordinator    5 Compaq Proliant 5000     5 x 4    5 x 256 MB    5 x 1    5 x 3                             5 x 8 GB
TOTAL                                  45                         140      13 GB         105      895                               3 TB
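The TOTAL row can be reproduced from the per-tier figures. In this sketch the tuple layout and names are illustrative, and DRAM is assumed to be in megabytes per node, which is the reading that matches the 13 GB total:

```python
# (tier, nodes, CPUs/node, DRAM MB/node, controllers/node, disks/node, RAID GB/node)
tiers = [
    ("Workflow MTS",                        20, 2, 128, 1, 1,      2),
    ("SQL Server",                          20, 4, 512, 4, 36 + 7, 130),
    ("Distributed Transaction Coordinator",  5, 4, 256, 1, 3,      8),
]

def total(column: int) -> int:
    """Sum one per-node column across all tiers, weighted by node count."""
    return sum(row[1] * row[column] for row in tiers)

nodes = sum(row[1] for row in tiers)
cpus, dram_mb, ctlrs, disks, raid_gb = (total(c) for c in range(2, 7))

print(f"{nodes} nodes, {cpus} CPUs, {dram_mb / 1024:.2f} GB DRAM, "
      f"{ctlrs} controllers, {disks} disks, {raid_gb / 1000:.2f} TB RAID")
```

The node, CPU, controller, and disk counts match the table exactly; DRAM and RAID space come out slightly off the rounded 13 GB and 3 TB figures.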
Billion Transactions Per Day Hardware
• 45 nodes (Compaq Proliant)
• Clustered with 100 Mbps switched Ethernet
• 140 CPUs, 13 GB, 3 TB
• 1.2 B tpd
  • 1 B tpd ran for 24 hrs.
  • Sized for 30 days
• Linear growth
• 5 micro-dollars per transaction
• Out-of-the-box software
• Off-the-shelf hardware
• AMAZING!
Other Stunts
• 100 M Web hits/day on one server (= 1,300 hits/sec, Web Mark HTML server)
• Email server (Exchange)
  • 50 GB database (up from 16 GB, limit now 16 TB)
  • 50 k POP3 users (1.5 M msg/day)
• 64-bit addressing SQL Server
• SAP failover
• Theme: conventional stuff is easy
Parallelism: The OTHER Aspect of Clusters
• Clusters of machines allow two kinds of parallelism
  • Many little jobs: online transaction processing (TPC-A, B, C…)
  • A few big jobs: data search and analysis (TPC-D, DSS, OLAP)
• Both give automatic parallelism
Kinds of Parallel Execution
• Pipeline
• Partition: outputs split N ways, inputs merge M ways
[Figure: boxes labeled "Any Sequential Program" arranged as a pipeline and as parallel partitions]
Data Rivers: Split + Merge Streams
• Producers add records to the river; consumers consume records from the river
• Purely sequential programming
• River does flow control and buffering, and does partition and merge of data records
• River = Split/Merge in Gamma = Exchange operator in Volcano
[Figure: N producers feed the river, M consumers drain it; N x M data streams]
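A river can be sketched with bounded queues: producers call `put`, the river routes each record to one of M partitions, and each consumer iterates over its own partition as an ordinary sequential loop. This is an illustrative toy using Python threads, not the Gamma or Volcano implementation; the class `River` and its method names are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of a partition's stream

class River:
    """N producers put records; each of M consumers reads one partition."""

    def __init__(self, n_producers: int, m_consumers: int, key, capacity: int = 64):
        self._queues = [queue.Queue(maxsize=capacity) for _ in range(m_consumers)]
        self._key = key
        self._open_producers = n_producers
        self._lock = threading.Lock()

    def put(self, record) -> None:
        # Split: route by key. A full queue blocks the producer (flow control).
        target = hash(self._key(record)) % len(self._queues)
        self._queues[target].put(record)

    def close_producer(self) -> None:
        # When the last producer finishes, tell every consumer the stream ended.
        with self._lock:
            self._open_producers -= 1
            last = self._open_producers == 0
        if last:
            for q in self._queues:
                q.put(_DONE)

    def records(self, consumer_id: int):
        # Merge: one consumer sees records from all producers, interleaved.
        q = self._queues[consumer_id]
        while True:
            item = q.get()
            if item is _DONE:
                return
            yield item

def producer(river: River, start: int, stop: int) -> None:
    for i in range(start, stop):
        river.put(i)
    river.close_producer()

def consumer(river: River, cid: int, out: dict) -> None:
    out[cid] = list(river.records(cid))

river = River(n_producers=3, m_consumers=2, key=lambda r: r)
results: dict = {}
threads = [threading.Thread(target=producer, args=(river, s, s + 100)) for s in (0, 100, 200)]
threads += [threading.Thread(target=consumer, args=(river, c, results)) for c in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print({cid: len(recs) for cid, recs in results.items()})
```

Neither the producer nor the consumer function knows how many peers exist; that knowledge lives only in the river, which is the point of the abstraction.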
Partitioned Execution
• Spreads computation and IO among processors
• Partitioned data gives NATURAL parallelism
[Figure: a table partitioned A...E, F...J, K...N, O...S, T...Z; one Count per partition feeds a final Count]
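The partitioned count in the figure can be sketched as follows: rows are assigned to one of five key ranges, each partition is counted independently, and a final step adds the partial counts. The helper names are illustrative, and the thread pool is used only to show the structure; CPython threads do not speed up CPU-bound counting.

```python
from concurrent.futures import ThreadPoolExecutor

# Key ranges from the slide: A...E, F...J, K...N, O...S, T...Z
RANGES = [("A", "E"), ("F", "J"), ("K", "N"), ("O", "S"), ("T", "Z")]

def partition_of(name: str) -> int:
    """Return the index of the key range that holds this name."""
    first = name[0].upper()
    for i, (lo, hi) in enumerate(RANGES):
        if lo <= first <= hi:
            return i
    raise ValueError(f"no partition for {name!r}")

def partitioned_count(names: list[str]) -> tuple[list[int], int]:
    """Count each partition independently, then combine the partial counts."""
    parts: list[list[str]] = [[] for _ in RANGES]
    for name in names:
        parts[partition_of(name)].append(name)
    with ThreadPoolExecutor(max_workers=len(RANGES)) as pool:
        partial = list(pool.map(len, parts))   # one Count per partition
    return partial, sum(partial)               # final Count over the partials

names = ["Adams", "Baker", "Gray", "Knuth", "Stonebraker", "Turing", "Zipf"]
partial, total = partitioned_count(names)
print(partial, total)
```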
N x M Way Parallelism
• N inputs, M outputs, no bottlenecks
• Partitioned data
• Partitioned and pipelined data flows
[Figure: partitions A...E, F...J, K...N, O...S, T...Z each feed a Sort and a Join; the results flow into three Merge operators]
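The sort-then-merge half of this picture can be sketched in a few lines: each partition is sorted on its own, and a merge step streams the sorted runs into one ordered output. The function name `parallel_sort` and the round-robin partitioning are assumptions made for illustration; the join stage is left out to keep the sketch short.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_sort(records: list[int], n_partitions: int = 5) -> list[int]:
    """Sort each partition independently, then merge the sorted runs."""
    # Split: round-robin the input into n_partitions pieces.
    parts = [records[i::n_partitions] for i in range(n_partitions)]
    # Sort: one independent sort per partition.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        runs = list(pool.map(sorted, parts))
    # Merge: heapq.merge lazily combines already-sorted inputs.
    return list(heapq.merge(*runs))

print(parallel_sort([5, 3, 9, 1, 7, 2, 8, 0, 6, 4]))
```

Because `heapq.merge` is lazy, a downstream consumer could start reading the merged output before it is fully materialized, which is the pipelining the slide mentions.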
Clusters (Plumbing)
• Single system image
  • naming
  • protection/security
  • management/load balance
• Fault tolerance
  • Wolfpack
• Hot pluggable hardware & software
Windows NT Clusters
• Key goals:
  • Easy: to install, manage, program
  • Reliable: better than a single node
  • Scaleable: added parts add power
• Microsoft & 60 vendors defining NT clusters
  • Almost all big hardware and software vendors involved
• No special hardware needed, but it may help
• Enables
  • Commodity fault-tolerance
  • Commodity parallelism (data mining, virtual reality…)
  • Also great for workgroups!
• Initial: two-node failover
  • Beta testing since December 96
  • SAP, Microsoft, Oracle giving demos
  • File, print, Internet, mail, DB, other services
  • Easy to manage
  • Each node can be 4x (or more) SMP
• Next (NT5) “Wolfpack” is modest size cluster
  • About 16 nodes (so 64 to 128 CPUs)
  • No hard limit, algorithms designed to go further
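Two-node failover can be pictured as a table of which node currently runs each service, plus a rule that moves a failed node's services to the survivor. The sketch below is a toy model written for illustration; it is not the Wolfpack API, and the class and service names are hypothetical.

```python
class TwoNodeCluster:
    """Toy model of two-node failover: services move to the surviving node."""

    def __init__(self, placement: dict[str, str], nodes: tuple[str, str]):
        self.nodes = nodes
        self.alive = {n: True for n in nodes}
        self.owner = dict(placement)   # service name -> node currently running it

    def fail(self, node: str) -> list[str]:
        """Mark `node` as failed and restart its services on the other node."""
        self.alive[node] = False
        survivor = next(n for n in self.nodes if n != node)
        if not self.alive[survivor]:
            raise RuntimeError("both nodes are down; no failover target")
        moved = [svc for svc, n in self.owner.items() if n == node]
        for svc in moved:
            self.owner[svc] = survivor
        return moved

cluster = TwoNodeCluster(
    {"file": "node1", "print": "node1", "mail": "node2", "db": "node2"},
    nodes=("node1", "node2"),
)
moved = cluster.fail("node2")
print(moved, cluster.owner)
```

In normal operation both nodes do useful work; after a failure the survivor carries everything, so each node needs headroom for the other's load.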
So, What’s New?
• When slices cost 50k$, you buy 10 or 20.
• When slices cost 5k$, you buy 100 or 200.
• Manageability, programmability, usability become key issues (total cost of ownership).
• PCs are MUCH easier to use and program
• MPP vicious cycle: new MPP & new OS, new app, no customers!
• CP/Commodity virtuous cycle: standards allow progress and investment protection (standard platform, apps, customers)
Objects Meet Databases: The Basis for Universal Data Servers, Access, & Integration
• Object-oriented (COM oriented) programming interface to data
• Breaks DBMS into components
• Anything can be a data source
• Optimization/navigation “on top of” other data sources
• A way to componentize a DBMS
• Makes an RDBMS an O-R DBMS (assumes optimizer understands objects)
[Figure: DBMS engine over data sources: database, spreadsheet, photos, mail, map, document]
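The idea that anything can be a data source, with query logic layered on top, can be sketched with a small interface. This is an illustration in Python, not the COM interface the slide refers to; `DataSource`, `SpreadsheetSource`, `MailSource`, and `select` are hypothetical names.

```python
from typing import Callable, Iterator, Protocol

Row = dict

class DataSource(Protocol):
    """Anything that can hand back rows qualifies as a data source."""
    def rows(self) -> Iterator[Row]: ...

class SpreadsheetSource:
    """Exposes a grid of cells (first line is the header) as rows."""
    def __init__(self, cells: list[list]):
        self._cells = cells

    def rows(self) -> Iterator[Row]:
        header, *body = self._cells
        for line in body:
            yield dict(zip(header, line))

class MailSource:
    """Exposes (sender, subject) pairs as rows."""
    def __init__(self, messages: list[tuple[str, str]]):
        self._messages = messages

    def rows(self) -> Iterator[Row]:
        for sender, subject in self._messages:
            yield {"sender": sender, "subject": subject}

def select(source: DataSource, predicate: Callable[[Row], bool],
           columns: list[str]) -> Iterator[Row]:
    """Generic filter and projection that runs on top of any data source."""
    for row in source.rows():
        if predicate(row):
            yield {c: row[c] for c in columns}

sheet = SpreadsheetSource([["city", "size_gb"], ["Seattle", 12], ["Boston", 7]])
mail = MailSource([("gray", "terabyte demo"), ("ops", "disk failed")])

big = list(select(sheet, lambda r: r["size_gb"] > 10, ["city"]))
alerts = list(select(mail, lambda r: "failed" in r["subject"], ["sender"]))
print(big, alerts)
```

The same `select` works over both sources because it depends only on the `rows` method, which mirrors the slide's point about running optimization and navigation on top of other data sources.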