![Page 1: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/1.jpg)
1
The Personal Petabyte, The Enterprise Exabyte
Jim Gray, Microsoft Research
Presented at IIST Asilomar, 10 December 2003
http://research.microsoft.com/~gray/talks
![Page 2: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/2.jpg)
2
Outline
• History
• Changing Ratios
• Who Needs a Petabyte?
Thesis: in 20 years, a Personal Petabyte will be affordable. Most personal bytes will be video. Enterprise Exabytes will be sensor data.
![Page 3: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/3.jpg)
3
An Early Disk
• Phaistos Disk:
– 1700 BC
– Minoan (Cretan, Greek)
• No one can read it
![Page 4: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/4.jpg)
4
Early Magnetic Disk 1956
• IBM 305 RAMAC
• 4 MB
• 50x24” disks
• 1200 rpm
• 100 ms access
• 35k$/y rent
• Included computer & accounting software (tubes not transistors)
![Page 5: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/5.jpg)
5
10 years later (1966 Illiac)
• 1.6 meters
• 30 MB
![Page 6: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/6.jpg)
6
Or 1970: IBM 2314 at 29 MB
![Page 7: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/7.jpg)
7
History: 1980 Winchester
• Seagate 5 ¼” 5 MB
• Fujitsu Eagle 10” 450 MB
![Page 8: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/8.jpg)
8
The MAD Future: Terror Bytes
• In the beginning there was the Paramagnetic Limit: 10 Gbpsi
• Limit keeps growing (now ~200 Gbpsi)
• Mark H. Kryder, Seagate, “Future Magnetic Recording Technologies”, FAST 2001 @ Monterey (PDF), apologizes: “Only 100x density improvement, then we are out of ideas”
• That’s a 20 TB desktop, 4 TB laptop!
[Figure: Bit Density, Density vs Time in b/µm² and Gb/in², log scale, 1990-2008; series: CD, DVD, ODD; annotations: Wavelength Limit, Superparamagnetic Limit; beyond the limit (?): NEMS, Fluorescent? Holographic, DNA?]
![Page 9: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/9.jpg)
9
Outline
• History
• Changing Ratios
– Disk to RAM
– DASD is Dead
– Disk space is free
– Disk Archive-Interchange
– Network faster than disk
– Capacity, Access
– TCO == people cost
– Smart disks happened
– The entry cost barrier
• Who Needs a Petabyte?
[Figure: Disk Performance vs Time, 1980-2000, log scale; series: seeks per second, bandwidth (MB/s), capacity (GB)]
![Page 10: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/10.jpg)
10
Storage Ratios Changed
• 10x better access time
• 10x more bandwidth
• 100x more capacity
• Data 25x cooler (1 Kaps/20 MB vs 1 Kaps/500 MB)
• 4,000x lower media price
• 20x to 100x lower disk price
• Scan takes 10x longer (3 min vs 45 min)
[Figure: Disk Performance vs Time, 1980-2000, log scale; series: seeks per second, bandwidth (MB/s), capacity (GB)]
[Figure: Disk accesses/second vs Time, 1980-2000; y-axis: accesses per second, 1 to 100]
[Figure: Storage Price vs Time, megabytes per kilo-dollar (MB/k$), 1980-2000, log scale from 0.1 to 10,000]
• RAM/disk media price ratio changed
– 1970-1990: 100:1
– 1990-1995: 10:1
– 1995-1997: 50:1
– today: ~200:1 (1$/GB disk, 200$/GB DRAM)
![Page 11: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/11.jpg)
11
Price_Ram_TB(t+10) = Price_Disk_TB(t)
Disk Data Can Move to RAM in 10 years
• Disk ~100x cheaper than RAM per byte
• Both get 100x bigger in 10 years.
• Move data to main memory
• Seems: RAM/Disk bandwidth ~100:1
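The rule Price_Ram_TB(t+10) = Price_Disk_TB(t) follows from the first two bullets: if RAM gets 100x cheaper per decade and disk is 100x cheaper than RAM today, RAM reaches today's disk price in 10 years. A minimal sketch, assuming a constant exponential improvement rate; the function name and the 200:1 example are illustrative, not from the talk:

```python
import math


def years_until_ram_matches_disk(price_ratio: float,
                                 improvement_per_decade: float = 100.0) -> float:
    """Years until RAM costs per byte what disk costs today.

    price_ratio: today's RAM price divided by today's disk price (per byte).
    improvement_per_decade: assumed price improvement factor every 10 years.
    """
    return 10.0 * math.log(price_ratio) / math.log(improvement_per_decade)


# The slide's case: a 100:1 price ratio and 100x improvement per decade.
years_at_100_to_1 = years_until_ram_matches_disk(100.0)

# With a wider 200:1 ratio the same data takes somewhat longer to fit in RAM.
years_at_200_to_1 = years_until_ram_matches_disk(200.0)

print(f"100:1 -> {years_at_100_to_1:.1f} years")
print(f"200:1 -> {years_at_200_to_1:.1f} years")
```

Under this model the lag depends only on the price ratio and the improvement rate, not on the absolute prices.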
[Figure: Storage Price vs Time, megabytes per kilo-dollar (MB/k$), 1980-2000, log scale from 0.1 to 10,000; annotations: 100:1 price gap, 10 years]
![Page 12: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/12.jpg)
12
[Figure: Kaps over time, 1970-2000; series: Kaps/$ (log scale, 1.E+0 to 1.E+6) and Kaps/disk (10 to 1000)]
DASD (Direct Access Storage Device) is Dead
• Accesses got cheaper
– Better disks
– Cheaper disks!
• Disk access/bandwidth: the scarce resource
• 2003: 100-minute scan (300 GB at 50 MB/s); 1990: 5-minute scan
• Sequential bandwidth is 50x faster than random: a random scan takes 3 days
• Ratio will get 10x worse in 10 years: 100x more capacity, 10x more bandwidth
• Invent ways to trade capacity for bandwidth: use the capacity without using bandwidth
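The scan times on this slide can be reproduced from the 300 GB and 50 MB/s figures. A small sketch, assuming decimal units (1 GB = 10^9 bytes) and taking the 50x sequential-versus-random factor directly from the slide; the variable names are illustrative:

```python
GB = 10**9
MB = 10**6


def sequential_scan_seconds(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to read the whole disk once at full sequential bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s


seq_seconds = sequential_scan_seconds(300 * GB, 50 * MB)
seq_minutes = seq_seconds / 60

# Random access is taken to be 50x slower than sequential, as stated on the slide.
random_days = seq_seconds * 50 / 86_400

print(f"sequential scan: {seq_minutes:.0f} minutes")
print(f"random scan:     {random_days:.1f} days")
```

Because capacity grows faster than bandwidth, the same calculation gives a longer scan each disk generation.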
![Page 13: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/13.jpg)
13
Disk Space is “free”; Bandwidth & Accesses/sec are not
• 1k$/TB going to 100$/TB
• 20 TB disks on the (distant) horizon
• 100x density
• Waste capacity intelligently
– Version everything
– Never delete anything
– Keep many copies
  • Snapshots
  • Mirrors (triple and geoplex)
  • Cooperative caching (Farsite and OceanStore)
  • Disk Archive
![Page 14: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/14.jpg)
14
Disk as Archive-Interchange
• Tape is archive / interchange / low cost
• Disk now competitive in all 3 categories
• What format? FAT? CDFS? ...
• What tools?
• Need the software to do disk-based backup/restore
• Commonly snapshot (multi-version FS)
• Radical: peer-to-peer file archiving
– Many researchers looking at this: OceanStore, Farsite, others…
![Page 15: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/15.jpg)
15
Disk vs Network: Now the Network is Faster (!)
• Old days:
– 10 MBps disk, low CPU cost (0.1 ins/b)
– 1 MBps net, huge CPU cost (10 ins/b)
• New days:
– 50 MBps disk, low CPU cost
– 100 MBps net, low CPU cost (TOE, RDMA)
• Consequence:
– You can remote disks
– Allows consolidation
– Aggregate (bisection) bandwidth still a problem
[Figure: Disk vs Net Bandwidth, MB/s on a log scale from 0.1 to 1000, 1970-2010; series: disk, net]
![Page 16: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/16.jpg)
16
Storage TCO == people time
1980 rules-of-thumb:
– 1 systems programmer per MIPS
– 1 data admin per 10 GB
800 sys programmers + 4 data admins for your laptop
Sometimes it must seem like that but…
Today: one data admin per 1 TB ... 300 TB, depending on process and data value.
• Automate everything
• Use redundancy to mask (and repair) problems
• Save people, spend hardware
![Page 17: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/17.jpg)
18
Smart Disks Happened
Disk appliances are here:
Cameras
Games
PVRs
FileServers
Challenge:
entry price
![Page 18: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/18.jpg)
19
The Entry Cost Barrier: Connect the Dots
• Consumer electronics want low entry cost
• 1970: 20,000$
• 1980: 2,000$
• 2000: 200$
• 2010 20$
• If magnetics can’t do this, another technology will.
• Think: copiers, hydraulic shovels,…
[Figure: ln(price) vs time; annotations: Wanted, Today]
![Page 19: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/19.jpg)
20
Outline
• History
• Changing Ratios
• Who Needs a Petabyte?
– Petabyte for 1k$ in 15-20 years
– Affordable but useless
– How much information is there?
– The Memex vision
– MyLifeBits
– The other 20% (enterprise storage)
Yotta
Zetta
Exa
Peta
Tera
Giga
Mega
Kilo
We are here
![Page 20: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/20.jpg)
21
A Bleak Future: The ½ Platter Society?
• Conclusion from Information Storage Industry Consortium HDD Applications Roadmap Workshop:
– “Most users need only 20GB”
– We are heading to a ½ platter industry.
• 80% of units and capacity is personal disks (not enterprise servers).
• The end of disk capacity demand.
• A zero billion dollar industry?
![Page 21: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/21.jpg)
22
Try to fill a terabyte in a year
| Item | Items/TB | Items/day |
|---|---|---|
| 300 KB JPEG | 3 M | 9,800 |
| 1 MB Doc | 1 M | 2,900 |
| 1 hour 256 kb/s MP3 audio | 9 K | 26 |
| 1 hour 1.5 Mb/s MPEG video | 290 | 0.8 |
Petabyte volume has to be some form of video.
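The "items per terabyte" arithmetic can be sketched as follows. This assumes decimal units (1 TB = 10^12 bytes), a 365-day year, and constant-bitrate media; the helper names are illustrative. The slide's entries are rounded and may rest on different assumptions, so the computed values are order-of-magnitude estimates rather than a reproduction of the table:

```python
TB = 10**12
DAYS_PER_YEAR = 365


def bytes_from_bitrate(bits_per_second: float, seconds: float) -> float:
    """Size in bytes of a constant-bitrate recording."""
    return bits_per_second * seconds / 8


def items_per_tb(item_bytes: float) -> float:
    """How many items of this size fit in one terabyte."""
    return TB / item_bytes


def items_per_day(item_bytes: float) -> float:
    """Items needed per day to fill one terabyte in a year."""
    return items_per_tb(item_bytes) / DAYS_PER_YEAR


item_sizes = {
    "300 KB JPEG": 300e3,
    "1 MB document": 1e6,
    "1 hour of 256 kb/s MP3": bytes_from_bitrate(256e3, 3600),
    "1 hour of 1.5 Mb/s MPEG": bytes_from_bitrate(1.5e6, 3600),
}

for name, size in item_sizes.items():
    print(f"{name:26s} {items_per_tb(size):12,.0f} per TB "
          f"{items_per_day(size):10,.1f} per day")
```

Whatever the exact bitrates, only the video rows come anywhere near a plausible daily volume for one person, which is the point of the slide.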
![Page 22: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/22.jpg)
23
Growth Comes From NEW Apps
• The 10M$ computer of 1980 costs 1k$ today
• If we were still doing the same things, IT would be a 0 B$/y industry
• NEW things absorb the new capacity
• 2010 Portable?
– 100 Gips processor
– 1 GB RAM
– 1 TB disk
– 1 Gbps network
– Many form factors
![Page 23: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/23.jpg)
24
The Terror Bytes are Here
Yotta
Zetta
Exa
Peta
Tera
Giga
Mega
Kilo
We are here
• 1 TB costs 1k$ to buy
• 1 TB costs 300k$/y to own
• Management & curation are expensive
• (I manage about 15 TB in my spare time; no, I am not paid 4.5M$/y to manage it)
– Searching 1TB takes minutes or hours or days or..
• I am Petrified by Peta Bytes
• But… people can “afford” them so, we have lots to do – Automate!
![Page 24: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/24.jpg)
25
How much information is there?
• Soon everything can be recorded and indexed
• Most bytes will never be seen by humans.
• Data summarization, trend detection, and anomaly detection are key technologies
See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
[Figure: scale of storage units (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta), annotated with: A Book, A Photo, Movie, All books (words), All Books MultiMedia, Everything Recorded!]
Small prefixes: 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
![Page 25: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/25.jpg)
26
Memex
As We May Think, Vannevar Bush, 1945
“A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility”
“yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely”
![Page 26: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/26.jpg)
27
Why Put Everything in Cyberspace?
Low rent: min $/byte
Shrinks time: now or later
Shrinks space: here or there
Automate processing: knowbots
Point-to-Point OR Broadcast
Immediate OR Time Delayed
Locate, Process, Analyze, Summarize
![Page 27: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/27.jpg)
28
How Will We Find Anything?
• Need Queries, Indexing, Pivoting, Scalability, Backup, Replication, Online update, Set-oriented access
• If you don’t use a DBMS, you will implement one!
• Simple logical structure:
– Blob and link is all that is inherent
– Additional properties (facets == extra tables) and methods on those tables (encapsulation)
• More than a file system
• Unifies data and meta-data
[Diagram label: SQL++ DBMS]
![Page 28: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/28.jpg)
29
MyLifeBits: The guinea pig
• Gordon Bell is digitizing his life
• Has now scanned virtually all:
– Books written (and read when possible)
– Personal documents (correspondence, memos, email, bills, legal, …)
– Photos
– Posters, paintings, photos of things (artifacts, … medals, plaques)
– Home movies and videos
– CD collection
– And, of course, all PC files
• Now recording: phone, radio, TV (movies), web pages… conversations
• Paperless throughout 2002. 12” scanned, 12’ discarded.
• Only 30 GB!!! Excluding digital videos
• Video is 2+ TB and growing fast
![Page 29: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/29.jpg)
30
Capture and encoding
![Page 30: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/30.jpg)
31
I mean everything
![Page 31: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/31.jpg)
32
gbell wag: 67 yr, 25 Kday life, a Personal Petabyte
[Figure: Lifetime Storage in TB, log scale from 0.001 to 1000, by category: Msgs, web pages, Tifs, Books, jpegs, 1 KBps sound, music, Videos; total 1 PB]
![Page 32: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/32.jpg)
33
80% of data is personal / individual. But what about the other 20%?
• Business
– Wal-Mart online: 1 PB and growing…
– Paradox: most “transaction” systems < 1 PB
– Have to go to image/data monitoring for big data
• Government
– Government is the biggest business
• Science
– LOTS of data
![Page 33: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/33.jpg)
34
Information Avalanche
• Both
– better observational instruments and
– better simulations
are producing a data avalanche
• Examples
– Turbulence: 100 TB simulation, then mine the information
– BaBar: grows 1 TB/day (2/3 simulation information, 1/3 observational information)
– CERN: LHC will generate 1 GB/s, 10 PB/y
– VLBA (NRAO) generates 1 GB/s today
– NCBI: “only ½ TB” but doubling each year, very rich dataset
– Pixar: 100 TB/movie
Image courtesy of C. Meneveau & A. Szalay @ JHU
![Page 34: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/34.jpg)
35
Q: Where will the Data Come From?
A: Sensor Applications
• Earth Observation
– 15 PB by 2007
• Medical Images & Information + Health Monitoring
– Potential 1 GB/patient/y → 1 EB/y
• Video Monitoring
– ~1E8 video cameras @ 1E5 MBps → 10 TB/s → 100 EB/y → filtered???
• Airplane Engines
– 1 GB sensor data/flight
– 100,000 engine hours/day
– 30 PB/y
• Smart Dust: ?? EB/y
http://robotics.eecs.berkeley.edu/~pister/SmartDust/
http://www-bsac.eecs.berkeley.edu/~shollar/macro_motes/macromotes.html
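The sensor estimates are back-of-envelope products of a source count and a per-source rate. A rough sketch of that arithmetic; the assumptions are mine and labeled in the comments: the per-camera rate is read as 1E5 bytes/s (the reading under which the 10 TB/s total follows), one flight is treated as roughly one engine hour, and 1E9 patients is the population implied by 1 EB/y:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
PB = 1e15
EB = 1e18

# Video monitoring. Assumption: 1E5 bytes/s per camera.
cameras = 1e8
bytes_per_camera_per_s = 1e5
video_bytes_per_s = cameras * bytes_per_camera_per_s
video_eb_per_year = video_bytes_per_s * SECONDS_PER_YEAR / EB  # unfiltered

# Airplane engines. Assumption: about 1 GB of sensor data per engine hour.
engine_hours_per_day = 1e5
bytes_per_engine_hour = 1e9
engine_pb_per_year = engine_hours_per_day * bytes_per_engine_hour * 365 / PB

# Medical. Assumption: 1E9 patients at 1 GB per patient per year.
medical_eb_per_year = 1e9 * 1e9 / EB

print(f"video:   {video_bytes_per_s / 1e12:.0f} TB/s, "
      f"{video_eb_per_year:.0f} EB/y unfiltered")
print(f"engines: {engine_pb_per_year:.1f} PB/y")
print(f"medical: {medical_eb_per_year:.0f} EB/y")
```

The unfiltered video total lands in the hundreds of exabytes per year, the same order of magnitude as the slide's figure, which is why the slide asks whether the stream would be filtered.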
![Page 35: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/35.jpg)
40
DataGrid Computing
• Store exabytes twice (for redundancy)
• Access them from anywhere
• Implies huge archive/data centers
• Supercomputer centers become super data centers
• Examples: Google, Yahoo!, Hotmail, BaBar, CERN, Fermilab, SDSC, …
![Page 36: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/36.jpg)
41
Outline
• History
• Changing Ratios
• Who Needs a Petabyte?
Thesis: in 20 years, a Personal Petabyte will be affordable. Most personal bytes will be video. Enterprise Exabytes will be sensor data.
![Page 37: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/37.jpg)
42
Bonus Slides
![Page 38: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/38.jpg)
43
TerraServer V4
• 8 web front ends
• 4x8cpu+4GB DB
• 18 TB triplicate disks
Classic SAN (tape not shown)
• ~2M$
• Works GREAT!
• 2000…2004
• Now replaced by…
[Diagram labels: WEB x8, SQL x4, SAN]
![Page 39: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/39.jpg)
44
TerraServer V5
• Storage Bricks
– “White-box commodity servers”
– 4 TB raw / 2 TB RAID1 SATA storage
– Dual Hyper-threaded Xeon 2.4 GHz, 4 GB RAM
• Partitioned Databases (PACS – partitioned array)
– 3 Storage Bricks = 1 TerraServer data
– Data partitioned across 20 databases
– More data & partitions coming
• Low Cost Availability
– 4 copies of the data
  • RAID1 SATA mirroring
  • 2 redundant “Bunches”
– Spare brick to repair failed brick (2N+1 design)
– Web application is “bunch aware”
  • Load balances between redundant databases
  • Fails over to surviving database on failure
• ~100K$ capital expense
[Diagram label: KVM / IP]
![Page 40: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/40.jpg)
45
How Do You Move A Terabyte?
| Context | Speed (Mbps) | Rent ($/month) | $/Mbps | $/TB sent | Time/TB |
|---|---|---|---|---|---|
| Home phone | 0.04 | 40 | 1,000 | 3,086 | 6 years |
| Home DSL | 0.6 | 70 | 117 | 360 | 5 months |
| T1 | 1.5 | 1,200 | 800 | 2,469 | 2 months |
| T3 | 43 | 28,000 | 651 | 2,010 | 2 days |
| OC3 | 155 | 49,000 | 316 | 976 | 14 hours |
| 100 Mbps | 100 | | | | 1 day |
| Gbps | 1000 | | | | 2.2 hours |
| OC 192 | 9600 | 1,920,000 | 200 | 617 | 14 minutes |

Source: TeraScale Sneakernet, Microsoft Research, Jim Gray et al.
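The Time/TB column is the link speed applied to 8 x 10^12 bits. A short sketch that recomputes it, assuming decimal units and a fully utilized link with no protocol overhead; the function name is illustrative:

```python
def seconds_per_terabyte(speed_mbps: float) -> float:
    """Seconds to send 1 TB (8e12 bits) over a link running at speed_mbps."""
    return 8e12 / (speed_mbps * 1e6)


links_mbps = {
    "Home phone": 0.04,
    "Home DSL": 0.6,
    "T1": 1.5,
    "T3": 43,
    "OC3": 155,
    "100 Mbps": 100,
    "Gbps": 1000,
    "OC 192": 9600,
}

for name, mbps in links_mbps.items():
    hours = seconds_per_terabyte(mbps) / 3600
    print(f"{name:10s} {mbps:8g} Mbps  {hours:12,.1f} hours  ({hours / 24:,.1f} days)")
```

Real transfers are slower than these ideal times, which only strengthens the case the table makes.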
![Page 41: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/41.jpg)
46
Key Observations for Personal Stores, and for Larger Stores
• Schematized storage can help organization and search.
• Schematized XML data sets are a universal way to exchange data: answers and new data.
• If data are objects, then we need a standard representation for classes & methods.
![Page 42: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/42.jpg)
47
Longhorn - For Knowledge Workers
• Simple (Self-*): auto install/manage/tune/repair
• Schema: data carries semantics
• Search: find things fast (driven by schema)
• Sync: “desktop state” anywhere
• Security (Palladium): trustworthy
– privacy
– trustworthy (virus, spam, …)
– DRM (protect IP)
• Shell: task-based UI (aka activity-based UI)
• Office-Longhorn
– Intelligent documents
– XML and Schemas
![Page 43: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/43.jpg)
48
How Do We Represent It To The Outside World?
Schematized Storage
• File metaphor too primitive: just a blob
• Table metaphor too primitive: just records
• Need metadata describing data context
– Format
– Provenance (author/publisher/citations/…)
– Rights
– History
– Related documents
• In a standard format
• XML and XML schema
• DataSet is a great example of this
• World is now defining standard schemas
The example below has two parts: the schema, then the data or diffgram.
<?xml version="1.0" encoding="utf-8" ?>
<DataSet xmlns="http://WWT.sdss.org/">
  <xs:schema id="radec" xmlns="" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
    <xs:element name="radec" msdata:IsDataSet="true">
      <xs:element name="Table">
        <xs:element name="ra" type="xs:double" minOccurs="0" />
        <xs:element name="dec" type="xs:double" minOccurs="0" /> …
  <diffgr:diffgram xmlns:msdata="urn:schemas-microsoft-com:xml-msdata" xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
    <radec xmlns="">
      <Table diffgr:id="Table1" msdata:rowOrder="0">
        <ra>184.028935351008</ra>
        <dec>-1.12590950121524</dec>
      </Table>
      …
      <Table diffgr:id="Table10" msdata:rowOrder="9">
        <ra>184.025719033547</ra>
        <dec>-1.21795827920186</dec>
      </Table>
    </radec>
  </diffgr:diffgram>
</DataSet>
![Page 44: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/44.jpg)
49
There Is A Problem
• GREAT!!!!
– XML documents are portable objects
– XML documents are complex objects
– WSDL defines the methods on objects (the class)
• But will all the implementations match?
– Think of UNIX or SQL or C or…
• This is a work in progress.
Niklaus Wirth: Algorithms + Data Structures = Programs
![Page 45: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/45.jpg)
50
Disk Storage Cheaper Than Paper
• File cabinet:
– Cabinet (4 drawer): 250$
– Paper (24,000 sheets): 250$
– Space (2x3 @ 10€/ft2): 180$
– Total: 700$
– 0.03 $/sheet: 3 pennies per page
• Disk:
– Disk (250 GB): 250$
– ASCII: 100 M pages, 2e-6 $/sheet (10,000x cheaper): micro-dollar per page
– Image: 1 M photos, 3e-4 $/photo (100x cheaper): milli-dollar per photo
• Store everything on disk
Note: Disk is 100x to 1000x cheaper than RAM
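The per-page comparison is simple division over the slide's totals. A minimal sketch using the rounded 700$ cabinet total and the stated page and photo counts; the variable names are illustrative:

```python
# Paper: the slide's rounded total for cabinet, paper, and floor space.
cabinet_total_dollars = 700
sheets = 24_000
paper_dollars_per_sheet = cabinet_total_dollars / sheets

# Disk: one 250 GB drive at 250$.
disk_dollars = 250
ascii_pages = 100e6   # pages of ASCII text that fit on the drive (slide's figure)
photos = 1e6          # photos that fit on the drive (slide's figure)

disk_dollars_per_page = disk_dollars / ascii_pages
disk_dollars_per_photo = disk_dollars / photos

print(f"paper: {paper_dollars_per_sheet:.3f} $/sheet")
print(f"disk:  {disk_dollars_per_page:.1e} $/page, "
      f"{paper_dollars_per_sheet / disk_dollars_per_page:,.0f}x cheaper")
print(f"disk:  {disk_dollars_per_photo:.1e} $/photo, "
      f"{paper_dollars_per_sheet / disk_dollars_per_photo:,.0f}x cheaper")
```

The ratios come out near the slide's rounded 10,000x and 100x.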
![Page 46: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/46.jpg)
51
Data Analysis
• Looking for
– Needles in haystacks: the Higgs particle
– Haystacks: dark matter, dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
– Correlation functions are N², likelihood techniques N³
• As data and computers grow at the same rate, we can only keep up with N log N
• A way out?
– Discard the notion of optimal (data is fuzzy, answers are approximate)
– Don’t assume infinite computational resources or memory
• Requires a combination of statistics & computer science
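The scaling argument can be made concrete: if the data and the available compute both grow by the same factor, an algorithm keeps up only if its cost grows about linearly. A small sketch under that assumption; the function names and the example sizes are illustrative:

```python
import math


def slowdown(cost, n: int, growth: int) -> float:
    """Factor by which wall-clock time grows when both the data size and the
    available compute grow by `growth`."""
    return cost(n * growth) / (growth * cost(n))


n = 10**6      # illustrative data size today
growth = 10    # data and computers both grow 10x

n_log_n = slowdown(lambda x: x * math.log(x), n, growth)
n_squared = slowdown(lambda x: x**2, n, growth)
n_cubed = slowdown(lambda x: x**3, n, growth)

print(f"N log N: {n_log_n:.2f}x slower")
print(f"N^2:     {n_squared:.0f}x slower")
print(f"N^3:     {n_cubed:.0f}x slower")
```

The N log N case slows only slightly, while the N² and N³ cases fall behind by one and two orders of magnitude per 10x of growth.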
![Page 47: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/47.jpg)
52
Analysis and Databases
• Much statistical analysis deals with
– Creating uniform samples
– Data filtering
– Assembling relevant subsets
– Estimating completeness
– Censoring bad data
– Counting and building histograms
– Generating Monte-Carlo subsets
– Likelihood calculations
– Hypothesis testing
• Traditionally these are performed on files
• Most of these tasks are much better done inside a DB
• Bring Mohamed to the mountain, not the mountain to him
![Page 48: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/48.jpg)
53
Data Access is hitting a wall
FTP and GREP are not adequate
• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years
• Oh!, and 1 PB ~5,000 disks
• At some point you need indices to limit search, plus parallel data search and analysis
• This is where databases can help
• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (= 1 $/GB)
• … 2 days and 1K$
• … 3 years and 1M$
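A rough model of why a single sequential scan stops working at petabyte scale, and why parallel search helps. The 10 MB/s single-stream rate is an assumption chosen to land near the slide's round numbers, and the 50 MB/s per-disk rate is borrowed from the earlier disk-bandwidth slide; neither is a measured value:

```python
MB = 1e6
TB = 1e12
PB = 1e15
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def scan_seconds(size_bytes: float, bytes_per_second: float) -> float:
    """Time for one sequential pass over the data at the given rate."""
    return size_bytes / bytes_per_second


single_stream = 10 * MB  # assumed effective GREP/FTP rate

tb_days = scan_seconds(TB, single_stream) / SECONDS_PER_DAY
pb_years = scan_seconds(PB, single_stream) / SECONDS_PER_YEAR

# The same petabyte spread over ~5,000 disks, each scanned in parallel.
disks = 5_000
per_disk_rate = 50 * MB
parallel_pb_seconds = scan_seconds(PB, disks * per_disk_rate)

print(f"1 TB, one stream:   {tb_days:.1f} days")
print(f"1 PB, one stream:   {pb_years:.1f} years")
print(f"1 PB, 5,000 disks:  {parallel_pb_seconds / 3600:.1f} hours")
```

Parallelism turns years into hours, and an index avoids most of the scan altogether.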
![Page 49: The Personal Petabyte The Enterprise Exabyte](https://reader035.vdocuments.site/reader035/viewer/2022081511/56813ac7550346895da2ddac/html5/thumbnails/49.jpg)
54
Smart Data (active databases)
• If there is too much data to move around, take the analysis to the data!
• Do all data manipulations at the database
– Build custom procedures and functions in the database
• Automatic parallelism
• Easy to build in custom functionality
– Databases & procedures being unified
– Example: temporal and spatial indexing, pixel processing, …
• Easy to reorganize the data
– Multiple views, each optimal for certain types of analyses
– Building hierarchical summaries is trivial
• Scalable to Petabyte datasets