
CyberBricks: The future of Database And Storage Engines

Jim Gray

http://research.Microsoft.com/~Gray


Outline

• What storage things are coming from Microsoft?

• TerraServer: a 1 TB DB on the Web

• Storage Metrics: Kaps, Maps, Gaps, Scans

• The future of storage: ActiveDisks


New Storage Software From Microsoft

• SQL Server 7.0:

»Simplicity: Auto-most-things

»Scalability on Win95 to Enterprise

»Data warehousing: built-in OLAP, VLDB

• NT 5:

»Better volume management (from Veritas)

»HSM architecture

»Intellimirror

»Active directory for transparency


Thin Client Support: TSO comes to NT

[Diagram labels: “Hydra” server; dedicated Windows terminal; existing desktop PC; MS-DOS, UNIX, Mac clients; Net PC]

• Lower Per-Client cost

• Huge centralized data stores.


Windows NT 5.0

Intelli-Mirror™

• Files and settings mirrored on client and server

• Great for mobile users

• Facilitates roaming

• Easy to replace PCs

• Optimizes network performance

• Means HUGE data stores


Outline

• What storage things are coming from Microsoft?

• TerraServer: a 1 TB DB on the Web

• Storage Metrics: Kaps, Maps, Gaps, Scans

• The future of storage: ActiveDisks


Microsoft TerraServer: Scaleup to Big Databases

• Build a 1 TB SQL Server database
• Data must be
» 1 TB
» Unencumbered
» Interesting to everyone everywhere
» And not offensive to anyone anywhere
• Loaded
» 1.5 M place names from Encarta World Atlas
» 3 M sq km from USGS (1 meter resolution)
» 1 M sq km from Russian Space Agency (2 m)
• On the web (world’s largest atlas)
• Sell images with commerce server.


Microsoft TerraServer Background

• Earth is 500 tera square meters (tm²)
» USA is 10 tm²
• 100 tm² of land in 70ºN to 70ºS
• We have pictures of 6% of it
» 3 tm² from USGS
» 2 tm² from Russian Space Agency
• Compress 5:1 (JPEG) to 1.5 TB.
• Slice into 10 KB chunks
• Store chunks in DB
• Navigate with
» Encarta™ Atlas (globe, gazetteer)
» StreetsPlus™ in the USA
• Image sizes
» 40x60 km² jump image
» 20x30 km² browse image
» 10x15 km² thumbnail
» 1.8x1.2 km² tile
• Someday
» multi-spectral image
» of everywhere
» once a day / hour


USGS Digital Ortho Quads (DOQ)

• US Geological Survey
• 4 terabytes
• Most data not yet published
• Based on a CRADA
» Microsoft TerraServer makes data available.

[Image: USGS “DOQ”: 1x1 meter, 4 TB, continental US, new data coming]


Russian Space Agency (SovInformSputnik) SPIN-2 (Aerial Images is Worldwide Distributor)

• 1.5 Meter Geo Rectified imagery of (almost) anywhere

• Almost equal-area projection

• De-classified satellite photos (from 200 KM),

• More data coming (1 m)

• Selling imagery on Internet.

• Putting 2 tm2 onto Microsoft TerraServer.

SPIN-2


http://www.TerraServer.Microsoft.com/

Demo

[Logos: SPIN-2, Microsoft BackOffice]


Demo

• navigate by coverage map to White House

• Download image

• buy imagery from USGS

• navigate by name to Venice

• buy SPIN2 image & Kodak photo

• Pop out to Expedia street map of Venice

• Mention that DB will double in next 18 months (2x USGS, 2X SPIN2)


The Microsoft TerraServer Hardware

• Compaq AlphaServer 8400
• 8 x 400 MHz Alpha CPUs
• 10 GB DRAM
• 324 9.2 GB StorageWorks disks
» 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (~14 TB)
• Windows NT 4 EE, SQL Server 7.0


TerraServer Web Site Software

[Diagram labels]
• Web client: browser, HTML, Java viewer
• The Internet
• TerraServer web site: Internet Information Server 4.0, Active Server Pages, MTS, image server, TerraServer stored procedures, SQL Server 7, TerraServer DB
• Automap server: Microsoft Automap ActiveX Server, Internet Information Server 4.0
• Image provider site(s): image delivery application, Microsoft Site Server EE, Internet Information Server 4.0, SQL Server 7


System Management & Maintenance

• Backup and Recovery
» STK 9710 tape robot
» Legato NetWorker™
» SQL Server 7 Backup & Restore
» Clocked at 80 MBps (peak) (~200 GB/hr)
• SQL Server Enterprise Mgr
» DBA maintenance
» SQL Performance Monitor


Microsoft TerraServer File Group Layout

• Convert 324 disks to 28 RAID5 sets plus 28 spare drives
• Make 4 WinNT volumes (RAID 50), 595 GB per volume
• Build 30 20 GB files on each volume
• DB is File Group of 120 files

[Diagram: pairs of HSZ70 controllers (A and B) serving the four volumes E:, F:, G:, H:]


Image Delivery and Load

• Incremental load of 4 more TB in next 18 months

[Diagram labels: DLT tape (“tar”, NTBackup); \Drop’N’ and \Images directories; ImgCutter; LoadMgr and LoadMgr DB (DoJob, Wait 4 Load); AlphaServer 4100s with ESA (60 4.3 GB drives); 100 Mbit Ethernet switch; AlphaServer 8400 with Enterprise Storage Array (3 x 108 9.1 GB drives); STK DLT tape library]

• LoadMgr steps
» 10: ImgCutter
» 20: Partition
» 30: ThumbImg
» 40: BrowseImg
» 45: JumpImg
» 50: TileImg
» 55: Meta Data
» 60: Tile Meta
» 70: Img Meta
» 80: Update Place
» ...


Some Tera-Byte Databases

[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]

• The Web: 1 TB of HTML
• TerraServer: 1 TB of images
• Several other 1 TB (file) servers
• Hotmail: 7 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of planet each week)
» 15 PB by 2007
• Federal Clearing house: images of checks
» 15 PB by 2006 (7 year history)
• Nuclear Stockpile Stewardship Program
» 10 Exabytes (???!!)


[Figure: scale from Kilo through Mega, Giga, Tera, Peta, Exa, and Zetta to Yotta, annotated with examples: a letter, a novel, a movie, Library of Congress (text), LoC (image), LoC (sound + cinema), all photos, all disks, all tapes, all information!]


Michael Lesk’s Points www.lesk.com/mlesk/ksg97/ksg.html

• Soon everything can be recorded and kept

• Most data will never be seen by humans

• Precious resource: human attention

» Auto-summarization and auto-search will be key enabling technologies.


Outline

• What storage things are coming from Microsoft?

• TerraServer: a 1 TB DB on the Web

• Storage Metrics: Kaps, Maps, Gaps, Scans

• The future of storage: ActiveDisks


Storage Latency: How Far Away is the Data?

[Figure: relative latency of each storage level, with a travel analogy]
• Registers: 1 (my head, 1 min)
• On-chip cache: 2 (this room)
• On-board cache: 10 (this campus, 10 min)
• Memory: 100 (Sacramento, 1.5 hr)
• Disk: 10^6 (Pluto, 2 years)
• Tape / optical robot: 10^9 (Andromeda, 2,000 years)


MetaMessage: Technology Ratios Are Important

• If everything gets faster & cheaper at the same rate THEN nothing really changes.
• Things getting MUCH BETTER:
» communication speed & cost 1,000x
» processor speed & cost 100x
» storage size & cost 100x
• Things staying about the same
» speed of light (more or less constant)
» people (10x more expensive)
» storage speed (only 10x better)


Today’s Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs

[Figure, two panels plotted against access time (seconds, 10^-9 to 10^3). “Size vs Speed”: typical system size (bytes, 10^3 to 10^15). “Price vs Speed”: $/MB (10^-4 to 10^4). Both panels show cache, main, secondary (disc), online tape, nearline tape, and offline tape.]


Storage Ratios Changed in Last 20 Years

• Media price: 4,000x, bandwidth: 10x, accesses/s: 10x

• DRAM : DISK $/MB: 100:1 -> 25:1

• TAPE : DISK $/GB: 100:1 -> 5:1

[Figure, three panels covering 1980 to 2000. “Disk Performance vs Time”: seeks per second and bandwidth (MB/s) on a 1 to 100 axis, capacity (GB) on a 0.1 to 10 axis. “Disk accesses/second vs Time”: accesses per second, 1 to 100. “Storage Price vs Time”: megabytes per kilo-dollar (MB/k$), 0.1 to 10,000.]


Disk Access Time

• Access time = SeekTime + RotateTime + ReadTime
» SeekTime: 6 ms (5%/y)
» RotateTime: 3 ms (5%/y)
» ReadTime: 1 ms (25%/y)
• Other useful facts:
» Power rises more than size³ (so small is indeed beautiful)
» Small devices are more rugged
» Small devices can use plastics (forces are much smaller), e.g. bugs fall without breaking anything
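The access-time sum above can be worked through in a few lines. This is a minimal sketch, not part of the talk: it assumes the "%/y" figures mean each component shrinks by that fraction every year, compounded, and the function names are made up for illustration.

```python
def access_time_ms(seek_ms=6.0, rotate_ms=3.0, read_ms=1.0):
    """Access time = seek + rotate + read, using the slide's figures as defaults."""
    return seek_ms + rotate_ms + read_ms


def projected_access_time_ms(years, seek_rate=0.05, rotate_rate=0.05, read_rate=0.25):
    """Project access time `years` ahead.

    Assumption (not stated on the slide): each component shrinks by its
    annual rate, compounded once per year.
    """
    seek = 6.0 * (1.0 - seek_rate) ** years
    rotate = 3.0 * (1.0 - rotate_rate) ** years
    read = 1.0 * (1.0 - read_rate) ** years
    return access_time_ms(seek, rotate, read)


if __name__ == "__main__":
    now = access_time_ms()
    print(f"today: {now:.1f} ms, about {1000.0 / now:.0f} accesses/second")
    for years in (5, 10):
        print(f"in {years} years: {projected_access_time_ms(years):.2f} ms")
```

Under this reading the mechanical terms (seek and rotate) dominate more and more over time, because the read term improves five times faster.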


Standard Storage Metrics

• Capacity:
» RAM: MB and $/MB: today at 100 MB and 1 $/MB
» Disk: GB and $/GB: today at 10 GB and 50 $/GB
» Tape: TB and $/TB: today at 0.1 TB and 10 $/GB (nearline)
• Access time (latency)
» RAM: 100 ns
» Disk: 10 ms
» Tape: 30 second pick, 30 second position
• Transfer rate
» RAM: 1 GB/s
» Disk: 5 MB/s (arrays can go to 1 GB/s)
» Tape: 3 MB/s (not clear that striping works)


New Storage Metrics: Kaps, Maps, Gaps, SCANs

• Kaps: how many kilobyte objects served per second
» the file server, transaction processing metric
• Maps: how many megabyte objects served per second
» the Mosaic metric
• Gaps: how many gigabyte objects served per hour
» the video & EOSDIS metric
• SCANS: how many scans of all the data per day
» the data mining and utility metric
• And: $/Kaps, $/Maps, $/Gaps, $/SCAN
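As a rough illustration of these four metrics, the sketch below derives them for a single device from its capacity, access latency, and transfer rate. The service-time model (latency plus size divided by bandwidth, one request at a time, no queuing) and the `Device` class are assumptions made for this example; the disk figures are the "today" values quoted in the standard storage metrics (10 GB, 10 ms, 5 MB/s).

```python
from dataclasses import dataclass

KB, MB, GB = 10**3, 10**6, 10**9


@dataclass
class Device:
    """Hypothetical single-device model: one request at a time, no queuing."""
    capacity_bytes: float
    latency_s: float
    bandwidth_bytes_per_s: float

    def objects_per_second(self, object_bytes):
        # Service time = access latency + transfer time for the object.
        return 1.0 / (self.latency_s + object_bytes / self.bandwidth_bytes_per_s)

    def kaps(self):
        return self.objects_per_second(KB)           # kilobyte objects per second

    def maps(self):
        return self.objects_per_second(MB)           # megabyte objects per second

    def gaps(self):
        return 3600.0 * self.objects_per_second(GB)  # gigabyte objects per hour

    def scans_per_day(self):
        # One scan reads the whole device sequentially.
        return 86400.0 / (self.capacity_bytes / self.bandwidth_bytes_per_s)


if __name__ == "__main__":
    disk = Device(capacity_bytes=10 * GB, latency_s=10e-3, bandwidth_bytes_per_s=5 * MB)
    print(f"Kaps {disk.kaps():.0f}, Maps {disk.maps():.1f}, "
          f"Gaps {disk.gaps():.0f}, SCANS/day {disk.scans_per_day():.1f}")
```

In this model latency dominates Kaps while bandwidth dominates Maps, Gaps, and SCANS, which is why the larger-object metrics favor many devices transferring in parallel.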


How To Get Lots of Maps, Gaps, SCANS

• Parallelism: use many little devices in parallel
» 1 terabyte at 10 MB/s: 1.2 days to scan
» 1 terabyte, 1,000 x parallel: 100 seconds/scan
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.
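The scan times quoted on this slide follow from simple division. The sketch below reproduces them, assuming decimal units (1 TB = 10^12 bytes) and perfect striping with no skew or coordination overhead; `scan_seconds` is a name chosen for illustration.

```python
TB = 10**12
MB = 10**6


def scan_seconds(data_bytes, bytes_per_second_per_device, devices=1):
    """Time to read all the data once, assuming it is spread evenly
    over `devices` that all transfer at full speed in parallel."""
    return data_bytes / (bytes_per_second_per_device * devices)


if __name__ == "__main__":
    serial = scan_seconds(1 * TB, 10 * MB)
    parallel = scan_seconds(1 * TB, 10 * MB, devices=1000)
    print(f"1 device: {serial:,.0f} s = {serial / 86400:.2f} days")
    print(f"1,000 devices: {parallel:,.0f} s")
```

A single device needs 100,000 seconds (about 1.2 days); a thousand devices bring that down to 100 seconds, provided the data really is divided evenly.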


Tape & Optical: Beware of the Media Myth

Optical is cheap: 200 $/platter 2 GB/platter => 100$/GB (5x cheaper than disc)

Tape is cheap: 100 $/tape 40 GB/tape => 2.5 $/GB (100x cheaper than disc).


Tape & Optical Reality: Media is 10% of System Cost

• Tape needs a robot (10 k$ ... 3 M$) and 10 ... 1000 tapes (at 40 GB each) => 20 $/GB ... 200 $/GB (1x…10x cheaper than disc)
• Optical needs a robot (50 k$) and 100 platters = 200 GB (TODAY) => 250 $/GB (more expensive than disc)
• Robots have poor access times
• Not good for Library of Congress (25 TB)
• Data motel: data checks in but it never checks out!


The Access Time Myth

• The myth: seek or pick time dominates
• The reality:
» (1) Queuing dominates
» (2) Transfer dominates BLOBs
» (3) Disk seeks often short
• Implication: many cheap servers better than one fast expensive server
» shorter queues
» parallel transfer
» lower cost/access and cost/byte
• This is obvious for disk & tape arrays

[Figure: access timelines with segments labeled Wait, Seek, Rotate, Transfer]


My Solution to Tertiary Storage: Tape Farms, Not Mainframe Silos

• Many independent tape robots (like a disc farm)
• Scan in 12 hours.
• One robot: 10 K$, 10 tapes, 400 GB, 6 MB/s, 25 $/GB, 30 Maps, 15 Gaps, 2 Scans
• 100 robots: 1 M$, 40 TB, 25 $/GB, 3K Maps, 1.5K Gaps, 2 Scans


The Metrics: Disk and Tape Farms Win

[Chart (log scale, 0.01 to 1,000,000): GB/K$, Kaps, Maps, and SCANS/Day for a 1000 x disc farm, an STK tape robot (6,000 tapes, 8 readers), and a 100x DLT tape farm]

Data Motel: data checks in, but it never checks out


Cost Per Access (3-year)

[Chart (log scale): Kaps/$, Maps/$, Gaps/$, and SCANS/k$ for a 1000 x disc farm, an STK tape robot (6,000 tapes, 16 readers), and a 100x DLT tape farm. Data labels in extraction order: 100,000; 120; 2; 500K; 540,000; 67,000; 68; 77; 4.3; 1.5; 0.2; 23; 100]


Storage Ratios Impact on Software

• Gone from 512 B pages to 8192 B pages (will go to 64 KB pages in 2006)

• Treat disks as tape:

»Increased use of sequential access

»Use disks for backup copies

• Use tape for

»VERY COLD data or

»Offsite Archive

»Data interchange


Summary

• Storage accesses are the bottleneck
• Accesses are getting larger (Maps, Gaps, SCANS)
• Capacity and cost are improving, BUT
• Latencies and bandwidth are not improving much, SO
• Use parallel access (disk and tape farms)
• Use sequential access (scans)


The Memory Hierarchy

• Measuring & Modeling Sequential IO
• Where is the bottleneck?
• How does it scale with
» SMP, RAID, new interconnects
• Goals:
» balanced bottlenecks
» low overhead
» scale many processors (10s)
» scale many disks (100s)

[Diagram labels: app address space, file cache, memory, mem bus, PCI, adapter, SCSI, controller]


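PAP (Peak Advertised Performance) vs RAP (Real Application Performance)

• Goal: RAP = PAP / 2 (the half-power point)

[Figure: data path from disk through SCSI, PCI, and the system bus to file system buffers and application data, with PAP and measured RAP at each stage]
• System bus: 422 MBps PAP, 7.2 MB/s RAP
• PCI: 133 MBps PAP, 7.2 MB/s RAP
• SCSI: 40 MBps PAP, 7.2 MB/s RAP
• Disk: 10-15 MBps PAP, 7.2 MB/s RAP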


The Best Case: Temp File, NO IO

• Temp file: read / write the file system cache
• Program uses small (in cpu cache) buffer.
• So, write/read time is bus move time (3x better than copy)
• Paradox: fastest way to move data is to write then read it.
• This hardware is limited to 150 MBps per processor

[Chart: Temp File Read/Write, in MBps: temp read 148, temp write 136, memcopy() 54]


Bottleneck Analysis

• Drawn to linear scale
» Theoretical bus bandwidth: 422 MBps = 66 MHz x 64 bits
» Memory read/write: ~150 MBps
» MemCopy: ~50 MBps
» Disk R/W: ~9 MBps


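PAP vs RAP

• Reads are easy, writes are hard
• Async write can match WCE.

[Figure: data path from disks through SCSI and PCI to the file system and application data, with PAP and measured RAP at each stage]
• System bus: 422 MBps PAP, 142 MBps RAP
• PCI: 133 MBps PAP, 72 MBps RAP
• SCSI: 40 MBps PAP, 31 MBps RAP
• Disks: 10-15 MBps PAP, 9 MBps RAP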


Bottleneck Analysis

• NTFS Read/Write: 9 disks, 2 SCSI buses, 1 PCI
» ~65 MBps unbuffered read
» ~43 MBps unbuffered write
» ~40 MBps buffered read
» ~35 MBps buffered write

[Figure labels: Memory Read/Write ~150 MBps; PCI ~70 MBps; Adapter ~30 MBps (x2); 70 MBps]


Peak Throughput on Intel/NT

• NTFS Read/Write: 24 disks, 4 SCSI, 2 PCI (64 bit)
» ~190 MBps unbuffered read
» ~95 MBps unbuffered write
• So: 0.8 TB/hr read, 0.4 TB/hr write on a 25 k$ server.

[Figure labels: Memory Read/Write ~150 MBps; PCI ~70 MBps (x2); Adapter ~30 MBps (x4); 190 MBps]


Penny Sort Ground Rules
http://research.microsoft.com/barc/SortBenchmark

• How much can you sort for a penny?
» Hardware and software cost
» Depreciated over 3 years
» 1 M$ system gets about 1 second
» 1 K$ system gets about 1,000 seconds
» Time (seconds) = 946,080 / SystemPrice ($)
• Input and output are disk resident
• Input is
» 100-byte records (random data)
» key is first 10 bytes.
• Must create output file and fill with sorted version of input file.
• Daytona (product) and Indy (special) categories
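The time budget follows from the depreciation rule: three years contain 94,608,000 seconds, so one cent buys 946,080 / price seconds of a system costing `price` dollars. This reproduces the quoted figures of about 1 second on a 1 M$ system and about 1,000 seconds on a 1 K$ system. The sketch below is an illustrative calculation only; the function name and the 365-day year are assumptions.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000; assumes 365-day years


def penny_budget_seconds(system_price_dollars, years=3, budget_cents=1):
    """Seconds of machine time that `budget_cents` buys when the system
    price is depreciated evenly over `years`."""
    return budget_cents * years * SECONDS_PER_YEAR / (100.0 * system_price_dollars)


if __name__ == "__main__":
    for price in (1_000_000, 1_000, 1_107):
        print(f"{price:>9,} $ system: {penny_budget_seconds(price):8.1f} s per penny")
```

For the 1107 $ machine reported in the PennySort results the budget comes to roughly 855 seconds, so the reported 820 second elapsed time fits within a penny.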


PennySort

• Hardware
» 266 MHz Intel PPro
» 64 MB SDRAM (10 ns)
» Dual Fujitsu DMA 3.2 GB EIDE
• Software
» NT workstation 4.3
» NT 5 sort
• Performance
» sort 15 M 100-byte records (~1.5 GB)
» Disk to disk
» elapsed time 820 sec
» cpu time = 404 sec

[Pie chart: PennySort Machine (1107 $)]
• cpu 32%
• Disk 25%
• board 13%
• Memory 8%
• Other 22%: Network, Video, floppy 9%; Cabinet + Assembly 7%; Software 6%


Cluster Sort Conceptual Model

• Multiple data sources
• Multiple data destinations
• Multiple nodes
• Disks -> Sockets -> Disk -> Disk

[Figure: three source disks, each holding mixed records (AAABBBCCC), are partitioned by key across nodes A, B, and C, which write the sorted runs AAAAAAAAA, BBBBBBBBB, and CCCCCCCCC]
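The conceptual model can be sketched in a single process: every source scans its records, routes each one to a destination chosen by key range, and every destination sorts what it receives. This is only an illustration of the partition-then-sort idea; the splitter keys, the in-memory lists standing in for disks and sockets, and the function name are assumptions, not the implementation behind the talk.

```python
from bisect import bisect_right


def partition_sort(sources, splitters, key_bytes=10):
    """Range-partition records from many sources, then sort each partition.

    sources   : list of record lists (one list per source node's disk)
    splitters : sorted keys; destination i receives keys in
                [splitters[i-1], splitters[i]), giving len(splitters) + 1 nodes
    """
    destinations = [[] for _ in range(len(splitters) + 1)]
    for records in sources:                         # each source scans its local data
        for record in records:
            key = record[:key_bytes]                # key is the record's first bytes
            node = bisect_right(splitters, key)     # pick the destination by key range
            destinations[node].append(record)       # stands in for a socket send
    # Each destination sorts locally; concatenating the runs gives a full sort.
    return [sorted(d, key=lambda r: r[:key_bytes]) for d in destinations]


if __name__ == "__main__":
    sources = [list("AAABBBCCC") for _ in range(3)]  # mirrors the figure's inputs
    runs = partition_sort(sources, splitters=["B", "C"])
    for name, run in zip("ABC", runs):
        print(name, "".join(run))
```

Because the partitions cover disjoint, ordered key ranges, no merge step across nodes is needed; good splitters matter, since a skewed choice would overload one destination.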


Outline

• What storage things are coming from Microsoft?

• TerraServer: a 1 TB DB on the Web

• Storage Metrics: Kaps, Maps, Gaps, Scans

•The future of storage: ActiveDisks


Crazy Disk Ideas

• Disk farm on a card: surface mount disks
• Disk (magnetic store) on a chip: (micro machines in silicon)
• NT and BackOffice in the disk controller (a processor with 100 MB DRAM)

[Figure label: ASIC]


Remember Your Roots


Year 2002 Disks

• Big disk (10 $/GB)
» 3”
» 100 GB
» 150 kaps (k accesses per second)
» 20 MBps sequential
• Small disk (20 $/GB)
» 3”
» 4 GB
» 100 kaps
» 10 MBps sequential
• Both running Windows NT™ 7.0? (see below for why)


The Disk Farm On a Card

• The 1 TB disc card: an array of discs (14")
• Can be used as
» 100 discs
» 1 striped disc
» 10 fault-tolerant discs
» ....etc
• LOTS of accesses/second and bandwidth
• Life is cheap, it’s the accessories that cost ya.
• Processors are cheap, it’s the peripherals that cost ya (a 10 k$ disc card).


Put Everything in Future (Disk) Controllers (it’s not “if”, it’s “when?”)

Acknowledgements:

Dave Patterson explained this to me a year ago

Kim Keeton

Erik Riedel

Catharine Van Ingen

Helped me sharpen these arguments


Technology Drivers: Disks

• Disks on track
• 100x in 10 years -> 2 TB 3.5” drive
• Shrink to 1” is 200 GB
• Disk replaces tape?
• Disk is super computer!

[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]


Data Gravity: Processing Moves to Transducers (moves to data sources & sinks)

• Move Processing to data sources

• Move to where the power (and sheet metal) is

• Processor in

»Modem

»Display

»Microphones (speech recognition) & cameras (vision)

»Storage: Data storage and analysis


It’s Already True of Printers: Peripheral = CyberBrick

• You buy a printer
• You get
» several network interfaces
» a Postscript engine: cpu, memory, software, a spooler (soon)
» and… a print engine.


Functionally Specialized Cards

• Storage
• Network
• Display

[Figure: each card has an ASIC, a P mips processor, and M MB of DRAM]

• Today: P = 50 mips, M = 2 MB
• In a few years: P = 200 mips, M = 64 MB


Basic Argument for x-Disks

• Future disk controller is a super-computer.
» 1 bips processor
» 128 MB DRAM
» 100 GB disk plus one arm
• Connects to SAN via high-level protocols
» RPC, HTTP, DCOM, Kerberos, Directory Services,….
» Commands are RPCs
» Management, security,….
» Services file/web/db/… requests
» Managed by general-purpose OS with good dev environment
• Apps in disk save data movement
» need programming environment in controller
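The data-movement argument can be illustrated with a toy comparison: ship every record to the host and filter there, or run the filter next to the data and ship only the matches. Everything in the sketch below (the record layout, the predicate, the function name) is hypothetical and only counts bytes; it does not model any real disk or SAN protocol.

```python
def bytes_over_san(records, predicate, filter_at_disk):
    """Bytes that cross the interconnect for one selective scan.

    filter_at_disk=False: a block-level device ships every record to the host.
    filter_at_disk=True : the controller runs the predicate and ships matches only.
    """
    if filter_at_disk:
        return sum(len(r) for r in records if predicate(r))
    return sum(len(r) for r in records)


if __name__ == "__main__":
    # 1,000 hypothetical 100-byte records; the first byte acts as a category code.
    records = [bytes([i % 100]) * 100 for i in range(1000)]

    def wanted(record):
        return record[0] == 0          # a 1% selective predicate

    host_side = bytes_over_san(records, wanted, filter_at_disk=False)
    disk_side = bytes_over_san(records, wanted, filter_at_disk=True)
    print(f"filter at host: {host_side:,} bytes moved")
    print(f"filter at disk: {disk_side:,} bytes moved")
```

The saving is proportional to the selectivity of the request, which is why the benefit depends on having a programming environment in the controller to express such requests.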


The Slippery Slope

• If you add function to server

•Then you add more function to server

•Function gravitates to data.

[Figure: a slope running from Nothing = Sector Server, through Something = Fixed App Server, to Everything = App Server]


Why Not a Sector Server? (let’s get physical!)

• Good idea, that’s what we have today.

• But

»cache added for performance

»Sector remap added for fault tolerance

»error reporting and diagnostics added

»SCSI commands (reserve, ...) are growing

»Sharing problematic (space mgmt, security,…)

• Slipping down the slope to a 2-D block server


Why Not a 1-D Block Server? Put A LITTLE on the Disk Server

• Tried and true design
» HSC - VAX cluster
» EMC
» IBM Sysplex (3980?)
• But look inside
» Has a cache
» Has space management
» Has error reporting & management
» Has RAID 0, 1, 2, 3, 4, 5, 10, 50,…
» Has locking
» Has remote replication
» Has an OS
» Security is problematic
» Low-level interface moves too many bytes


Why Not a 2-D Block Server? Put A LITTLE on the Disk Server

• Tried and true design
» Cedar -> NFS
» file server, cache, space,..
» Open file is many fewer msgs
• Grows to have
» Directories + naming
» Authentication + access control
» RAID 0, 1, 2, 3, 4, 5, 10, 50,…
» Locking
» Backup/restore/admin
» Cooperative caching with client
• File servers are a BIG hit: NetWare™
» SNAP! is my favorite today


Why Not a File Server? Put a Little on the Disk Server

• Tried and true design
» Auspex, NetApp, ...
» NetWare
• Yes, but look at NetWare
» File interface gives you app invocation interface
» Became an app server: Mail, DB, Web,….
» NetWare had a primitive OS: hard to program, so optimized wrong thing


Why Not Everything?

Allow Everything on Disk Server (thin clients)

• Tried and true design

»Mainframes, Minis, ...

»Web servers,…

»Encapsulates data

»Minimizes data moves

»Scaleable

• It is where everyone ends up.

• All the arguments against are short-term.


Disk = Node

• has magnetic storage (100 GB?)
• has processor & DRAM
• has SAN attachment
• has execution environment

[Diagram, software stack from top to bottom: Applications; Services, DBMS; File System, RPC, ...; OS Kernel; SAN driver, Disk driver]


Technology Drivers: System on a Chip

• Integrate processing with memory on chip
» chip is 75% memory now
» 1 MB cache >> 1960 supercomputers
» 256 Mb memory chip is 32 MB!
» IRAM, CRAM, PIM,… projects abound
• Integrate networking with processing on chip
» system bus is a kind of network
» ATM, FiberChannel, Ethernet,.. logic on chip.
» Direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip.


Technology Drivers: What if Networking Was as Cheap As Disk IO?

• TCP/IP

»Unix/NT 100% cpu @ 40MBps

• Disk

»Unix/NT 8% cpu @ 40MBps

Why the difference?

• Host bus adapter does SCSI packetizing, checksum,… flow control, DMA
• Host does TCP/IP packetizing, checksum,… flow control, small buffers


Technology Drivers: The Promise of SAN/VIA: 10x in 2 years
http://www.ViArch.org/

• Today:
» wires are 10 MBps (100 Mbps Ethernet)
» ~20 MBps tcp/ip saturates 2 cpus
» round-trip latency is ~300 us
• In the lab
» Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet,…
» Fast user-level communication: tcp/ip ~100 MBps at 10% of each processor; round-trip latency is 15 us


SAN: Standard Interconnect

• Gbps Ethernet: 110 MBps
• PCI: 70 MBps
• UW SCSI: 40 MBps
• FW SCSI: 20 MBps
• SCSI: 5 MBps

• LAN faster than memory bus?
• 1 GBps links in lab.
• 100 $ port cost soon
• Port is computer

[Figure labels: RIP FDDI, RIP ATM, RIP SCI, RIP SCSI, RIP FC, RIP ?]


Technology Drivers: Plug & Play Software

• RPC is standardizing: (DCOM, IIOP, HTTP)
» Gives huge TOOL LEVERAGE
» Solves the hard problems for you: naming, security, directory service, operations,...
• Commoditized programming environments
» FreeBSD, Linux, Solaris,… + tools
» NetWare + tools
» WinCE, WinNT,… + tools
» JavaOS + tools
• Apps gravitate to data.
• General purpose OS on controller runs apps.


Basic Argument for x-Disks

• Future disk controller is a super-computer.
» 1 bips processor
» 128 MB DRAM
» 100 GB disk plus one arm
• Connects to SAN via high-level protocols
» RPC, HTTP, DCOM, Kerberos, Directory Services,….
» Commands are RPCs
» Management, security,….
» Services file/web/db/… requests
» Managed by general-purpose OS with good dev environment
• Move apps to disk to save data movement
» need programming environment in controller


Outline

• What storage things are coming from Microsoft?

• TerraServer: a 1 TB DB on the Web

• Storage Metrics: Kaps, Maps, Gaps, Scans

• The future of storage: ActiveDisks

• Papers and Talks at

http://research.Microsoft.com/~Gray